Data Laundry, Again

July 13, 2018

My favorite tool for data laundry is awk:

$ cat infile
ABCDE This is some text.
This is more text. ABCDE, ABCDE.
ABCDE And this is [ABCDE] still more text.
$ awk '{while ($0 ~ /ABCDE/) sub(/ABCDE/, "X"++count); print}' infile
X1 This is some text.
This is more text. X2, X3.
X4 And this is [X5] still more text.

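The loop is needed because sub replaces only the leftmost match on the line, so it is called repeatedly until no ABCDE remains. The print statement is not part of the loop; it prints each line once, possibly modified, after all the substitutions on that line have been performed. The count variable is automatically initialized to zero, then pre-incremented each time it is needed, so the first replacement is X1.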

You can run the program at https://ideone.com/W82V3S.
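For readers who don't use awk, here is a rough Python sketch of the same loop. It is only an illustration of the technique described above, not the program at the link; the function name number_occurrences and its parameters are made up for this example.

```python
import re

def number_occurrences(lines, pattern="ABCDE", prefix="X"):
    """Replace each match of pattern with prefix plus a running count."""
    count = 0                      # awk initializes count to zero implicitly
    result = []
    for line in lines:
        # Like the awk while loop: keep going as long as the line still matches.
        while re.search(pattern, line):
            count += 1             # pre-increment, so numbering starts at 1
            # count=1 replaces only the leftmost match, as awk's sub does.
            line = re.sub(pattern, lambda _m: prefix + str(count), line, count=1)
        result.append(line)        # corresponds to awk's print, outside the loop
    return result

infile = [
    "ABCDE This is some text.",
    "This is more text. ABCDE, ABCDE.",
    "ABCDE And this is [ABCDE] still more text.",
]
for line in number_occurrences(infile):
    print(line)
```

As with the awk version, the loop would never terminate if the replacement text itself matched the pattern, so the prefix has to be chosen with that in mind.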

5 Responses to “Data Laundry, Again”

  1. Using Perl’s regular expressions….

    print 'ABCDE This is some text.
    This is more text. ABCDE, ABCDE.
    ABCDE And this is [ABCDE] still more text.
    ' =~ s{ABCDE}{'X'.++$t}reg;
    

    g – every occurrence, e – evaluate the replacement as an expression, r – return the string after replacements instead of modifying it in place…

  2. Daniel said

    Here’s a solution in Python.

    def idx_replace(string, sub, prefix):
      return "".join("{}{}{}".format(prefix, idx, s) if idx > 0 else s
                     for idx, s in enumerate(string.split(sub)))
    
    string = """ABCDE This is some text.
    This is more text. ABCDE, ABCDE.
    ABCDE And this is [ABCDE] still more text."""
    
    print(idx_replace(string, "ABCDE", "X"))
    

    Output:

    X1 This is some text.
    This is more text. X2, X3.
    X4 And this is [X5] still more text.
    
  3. Mike said

    It sounded like you need to do this a lot, so here is a python function that takes a search pattern and a sequence number format and returns a function that can be used to clean up a text string. That way, you can make many cleanup functions with independent sequence numbers.

    def cleaner_factory(pattern, seq_no_format='X{}'):
        sn = it.count(1)
        return ft.partial(re.sub, pattern, lambda _:seq_no_format.format(next(sn)))
    

    It can be used on the entire text:

    cleaner = cleaner_factory('ABCDE')
    print(cleaner(text))
    
    X1 This is some text.
    This is more text. X2, X3.
    X4 And this is [X5] still more text.
    

    It can also be applied to the text in chunks (e.g. line by line). The sequence number format can also be more elaborate.

    cleaner = cleaner_factory('ABCDE', '<<{:03d}>>')
    
    for line in text.splitlines():
        print(cleaner(line))
    
    =>
    <<001>> This is some text.
    This is more text. <<002>>, <<003>>.
    <<004>> And this is [<<005>>] still more text.
    
  4. Mike said

    I forgot to include the imports

    import re
    import itertools as it
    import functools as ft
    
