Programming Praxis


Data Laundry, Again

July 13, 2018 9:00 AM

My favorite tool for data laundry is awk:

$ cat infile
ABCDE This is some text.
This is more text. ABCDE, ABCDE.
ABCDE And this is [ABCDE] still more text.
$ awk '{while ($0 ~ /ABCDE/) sub(/ABCDE/, "X"++count); print}' infile
X1 This is some text.
This is more text. X2, X3.
X4 And this is [X5] still more text.

The print statement is not part of the loop; it runs once per input line, after all the substitutions on that line have been performed, and prints the line whether or not it was modified. The count variable is automatically initialized to zero, then pre-incremented on each substitution, so the first replacement is X1.
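For readers who don't use awk, the same per-line loop might be sketched in Python roughly as follows. This is an illustration rather than a translation of the program above: the function name `launder` and its parameters are made up for the example, and it matches a literal string where awk matches a regular expression.

```python
def launder(lines, target="ABCDE", prefix="X"):
    """Replace each occurrence of target with prefix plus a running count.

    Hypothetical helper for illustration; it matches target literally.
    """
    count = 0  # awk initializes count implicitly; Python needs it spelled out
    cleaned = []
    for line in lines:
        # Mirror awk's while/sub pair: replace one occurrence per pass
        while target in line:
            count += 1
            line = line.replace(target, f"{prefix}{count}", 1)
        # Like awk's print, this runs once per line, after the loop ends
        cleaned.append(line)
    return cleaned


text = """ABCDE This is some text.
This is more text. ABCDE, ABCDE.
ABCDE And this is [ABCDE] still more text."""

for line in launder(text.splitlines()):
    print(line)
```

One caveat applies to both versions: the loop only terminates because the replacement text does not itself contain the search text. If it did, every pass would find a fresh match and the loop would never end.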

You can run the program at https://ideone.com/W82V3S.

Posted by programmingpraxis

Categories: Exercises


6 Responses to “Data Laundry, Again”

  1. Using Perl’s regular expressions….

    print 'ABCDE This is some text.
    This is more text. ABCDE, ABCDE.
    ABCDE And this is [ABCDE] still more text.
    ' =~ s{ABCDE}{'X'.++$t}reg;
    

    g – every occurrence, e – evaluate replacement string, r – return string after replacements…

    By James Curtis-Smith on July 13, 2018 at 9:07 AM

  2. Here’s a solution in Python.

    def idx_replace(string, sub, prefix):
      return "".join("{}{}{}".format(prefix, idx, s) if idx > 0 else s
                     for idx, s in enumerate(string.split(sub)))
    
    string = """ABCDE This is some text.
    This is more text. ABCDE, ABCDE.
    ABCDE And this is [ABCDE] still more text."""
    
    print(idx_replace(string, "ABCDE", "X"))
    

    Output:

    X1 This is some text.
    This is more text. X2, X3.
    X4 And this is [X5] still more text.
    

    By Daniel on July 13, 2018 at 9:08 PM

  3. It sounded like you need to do this a lot, so here is a Python function that takes a search pattern and a sequence number format and returns a function that can be used to clean up a text string. That way, you can make many cleanup functions, each with its own independent sequence numbers.

    def cleaner_factory(pattern, seq_no_format='X{}'):
        sn = it.count(1)
        return ft.partial(re.sub, pattern, lambda _:seq_no_format.format(next(sn)))
    

    It can be used on the entire text:

    cleaner = cleaner_factory('ABCDE')
    print(cleaner(text))
    
    X1 This is some text.
    This is more text. X2, X3.
    X4 And this is [X5] still more text.
    

    It can also be applied to the text in chunks (e.g. line by line). The sequence number format can also be more elaborate.

    cleaner = cleaner_factory('ABCDE', '<<{:03d}>>')
    
    for line in text.splitlines():
        print(cleaner(line))
    
    =>
    <<001>> This is some text.
    This is more text. <<002>>, <<003>>.
    <<004>> And this is [<<005>>] still more text.
    

    By Mike on July 13, 2018 at 11:46 PM

  4. I forgot to include the imports:

    import re
    import itertools as it
    import functools as ft
    

    By Mike on July 13, 2018 at 11:48 PM

  5. Kotlin at https://pastebin.com/rrAZM3gA

    By Scott on July 15, 2018 at 9:03 PM

  6. […] time at work, so it’s an exercise worth examining. We looked at data laundry in two previous exercises. Today’s exercise in data laundry comes to us from Stack […]

    By Data Laundry, Revisited | Programming Praxis on March 22, 2019 at 4:01 AM
