Data Laundry, Again

July 13, 2018

Data laundry is the act of cleaning data, as when it arrives in one format and must be translated to another, or when external data must be checked for validity. We looked at data laundry in a previous exercise. We return to it today because I have been doing data laundry all week, handling data from a new vendor. Today’s task is similar to one I did this week: convert the input shown below to the output that follows it, replacing each occurrence of the string ABCDE with a prefix followed by an incrementing number:

ABCDE This is some text.
This is more text. ABCDE, ABCDE.
ABCDE And this is [ABCDE] still more text.

X1 This is some text.
This is more text. X2, X3.
X4 And this is [X5] still more text.

Your task is to write a program to perform the data laundry shown above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
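
As a starting point, here is a minimal sketch in Python of one way to approach the task without regular expressions. It is not the suggested solution; the function name `launder` and its parameters are made up for illustration.

```python
def launder(text, target="ABCDE", prefix="X"):
    """Replace each occurrence of target with prefix plus a running count."""
    pieces = []
    count = 0
    pos = 0
    while True:
        hit = text.find(target, pos)
        if hit == -1:
            # No more occurrences: keep the rest of the text and stop.
            pieces.append(text[pos:])
            break
        count += 1
        pieces.append(text[pos:hit])        # text before the occurrence
        pieces.append(f"{prefix}{count}")   # the numbered replacement
        pos = hit + len(target)
    return "".join(pieces)


sample = (
    "ABCDE This is some text.\n"
    "This is more text. ABCDE, ABCDE.\n"
    "ABCDE And this is [ABCDE] still more text."
)
print(launder(sample))
```

The counter runs across line boundaries, so the numbering continues from one line to the next as in the sample output.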


6 Responses to “Data Laundry, Again”

  1. Using Perl’s regular expressions….

    print 'ABCDE This is some text.
    This is more text. ABCDE, ABCDE.
    ABCDE And this is [ABCDE] still more text.
    ' =~ s{ABCDE}{'X'.++$t}reg;
    

    g – replace every occurrence, e – evaluate the replacement as an expression, r – return the string after replacements.
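
For readers who do not use Perl, roughly the same idea can be sketched in Python. `re.sub` replaces every match by default (like `g`), accepts a function as the replacement (like `e`), and returns a new string rather than modifying its argument (like `r`). The names `counter` and `number` and the sample text are illustrative only.

```python
import itertools
import re

counter = itertools.count(1)  # plays the role of Perl's ++$t


def number(match):
    # Called once per match, left to right; the match object itself is unused.
    return "X" + str(next(counter))


text = "ABCDE, ABCDE and [ABCDE]"
result = re.sub("ABCDE", number, text)
print(result)  # X1, X2 and [X3]
```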

  2. Daniel said

    Here’s a solution in Python.

    def idx_replace(string, sub, prefix):
      return "".join("{}{}{}".format(prefix, idx, s) if idx > 0 else s
                     for idx, s in enumerate(string.split(sub)))
    
    string = """ABCDE This is some text.
    This is more text. ABCDE, ABCDE.
    ABCDE And this is [ABCDE] still more text."""
    
    print(idx_replace(string, "ABCDE", "X"))
    

    Output:

    X1 This is some text.
    This is more text. X2, X3.
    X4 And this is [X5] still more text.
    
  3. Mike said

    It sounded like you need to do this a lot, so here is a Python function that takes a search pattern and a sequence-number format and returns a function that cleans up a text string. That way, you can make many cleanup functions, each with its own independent sequence numbers.

    def cleaner_factory(pattern, seq_no_format='X{}'):
        sn = it.count(1)
        return ft.partial(re.sub, pattern, lambda _:seq_no_format.format(next(sn)))
    

    It can be used on the entire text:

    cleaner = cleaner_factory('ABCDE')
    print(cleaner(text))
    
    X1 This is some text.
    This is more text. X2, X3.
    X4 And this is [X5] still more text.
    

    It can also be applied to the text in chunks (e.g. line by line). The sequence number format can also be more elaborate.

    cleaner = cleaner_factory('ABCDE', '<<{:03d}>>')
    
    for line in text.splitlines():
        print(cleaner(line))
    
    =>
    <<001>> This is some text.
    This is more text. <<002>>, <<003>>.
    <<004>> And this is [<<005>>] still more text.
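
To make the point about independent sequence numbers concrete, here is a small self-contained sketch. It repeats the factory and its imports so that it runs on its own; the sample strings and the `Y{}` format are made up for illustration.

```python
import functools as ft
import itertools as it
import re


def cleaner_factory(pattern, seq_no_format='X{}'):
    sn = it.count(1)  # one private counter per cleaner
    return ft.partial(re.sub, pattern, lambda _: seq_no_format.format(next(sn)))


first = cleaner_factory('ABCDE')
second = cleaner_factory('ABCDE', 'Y{}')

a = first('ABCDE and ABCDE')   # uses first's counter: 1, 2
b = second('ABCDE')            # second's counter starts at 1 on its own
c = first('ABCDE again')       # first's counter continues at 3
print(a, b, c, sep='\n')
```

Each cleaner keeps its own counter, and that counter carries over between calls. Because the pattern is handed to `re.sub`, a literal search string that contains regex metacharacters would need `re.escape` first.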
    
  4. Mike said

    I forgot to include the imports:

    import re
    import itertools as it
    import functools as ft
    
  5. […] time at work, so it’s an exercise worth examining. We looked at data laundry in two previous exercises. Today’s exercise in data laundry comes to us from Stack […]
