Programming Praxis

Data Laundry, Again

July 13, 2018 9:00 AM

Data laundry is the act of cleaning data, as when it arrives in one format and must be translated to another, or when external data must be checked for validity. We looked at data laundry in a previous exercise. We return to it today because I have been doing data laundry all week, handling data from a new vendor. Today’s task is similar to one I have been doing this week: convert the input shown first below to the output shown after it, replacing every occurrence of the string ABCDE with a prefixed, incrementally numbered string:

ABCDE This is some text.
This is more text. ABCDE, ABCDE.
ABCDE And this is [ABCDE] still more text.

X1 This is some text.
This is more text. X2, X3.
X4 And this is [X5] still more text.

Your task is to write a program to perform the data laundry shown above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Posted by programmingpraxis

Categories: Exercises

6 Responses to “Data Laundry, Again”

Using Perl’s regular expressions….
```
print 'ABCDE This is some text.
This is more text. ABCDE, ABCDE.
ABCDE And this is [ABCDE] still more text.
' =~ s{ABCDE}{'X'.++$t}reg;
```
g – every occurrence, e – evaluate the replacement as an expression, r – return the modified string instead of changing the original…
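
For readers who do not use Perl, the three flags map roughly onto Python's `re.sub`: it replaces every match by default, accepts a function as the replacement, and returns a new string. The following is only an illustrative sketch; the function name `number_matches` and its parameters are hypothetical and do not appear in the solution above.

```python
import itertools
import re


def number_matches(text, target="ABCDE", prefix="X"):
    """Replace each occurrence of target with prefix + a running number."""
    counter = itertools.count(1)  # yields 1, 2, 3, ...
    # re.escape makes the target match literally, even if it contained
    # regex metacharacters; the lambda is called once per match.
    return re.sub(re.escape(target),
                  lambda _match: f"{prefix}{next(counter)}",
                  text)


sample = "ABCDE one. ABCDE, [ABCDE]"
print(number_matches(sample))  # X1 one. X2, [X3]
print(sample)                  # the input string is left unchanged
```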

By James Curtis-Smith on July 13, 2018 at 9:07 AM

Here’s a solution in Python.

```
def idx_replace(string, sub, prefix):
  return "".join("{}{}{}".format(prefix, idx, s) if idx > 0 else s
                 for idx, s in enumerate(string.split(sub)))

string = """ABCDE This is some text.
This is more text. ABCDE, ABCDE.
ABCDE And this is [ABCDE] still more text."""

print(idx_replace(string, "ABCDE", "X"))
```

Output:

X1 This is some text.
This is more text. X2, X3.
X4 And this is [X5] still more text.

By Daniel on July 13, 2018 at 9:08 PM

It sounded like you need to do this a lot, so here is a python function that takes a search pattern and a sequence number format and returns a function that can be used to clean up a text string. That way, you can make many cleanup functions with independent sequence numbers.

```
def cleaner_factory(pattern, seq_no_format='X{}'):
    sn = it.count(1)
    return ft.partial(re.sub, pattern, lambda _:seq_no_format.format(next(sn)))
```

It can be used on the entire text:

```
cleaner = cleaner_factory('ABCDE')
print(cleaner(text))
```

X1 This is some text.
This is more text. X2, X3.
X4 And this is [X5] still more text.

It can also be applied to the text in chunks (e.g. line by line). The sequence number format can also be more elaborate.

```
cleaner = cleaner_factory('ABCDE', '<<{:03d}>>')

for line in text.splitlines():
    print(cleaner(line))
```

=>
<<001>> This is some text.
This is more text. <<002>>, <<003>>.
<<004>> And this is [<<005>>] still more text.
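
The claim that each cleaner keeps its own sequence number can be checked with a small self-contained sketch. It repeats `cleaner_factory` from this comment together with the three imports it relies on; the names `clean_a` and `clean_b` and the `Y{}` format are hypothetical, chosen only for illustration.

```python
import functools as ft
import itertools as it
import re


def cleaner_factory(pattern, seq_no_format='X{}'):
    sn = it.count(1)  # a private counter, captured by the lambda below
    return ft.partial(re.sub, pattern,
                      lambda _: seq_no_format.format(next(sn)))


clean_a = cleaner_factory('ABCDE')
clean_b = cleaner_factory('ABCDE', 'Y{}')

first = clean_a('ABCDE and ABCDE')  # uses 1 and 2 from clean_a's counter
second = clean_b('ABCDE')           # clean_b starts from 1 on its own
third = clean_a('ABCDE')            # clean_a continues with 3

print(first)   # X1 and X2
print(second)  # Y1
print(third)   # X3
```

Because the counter lives inside the closure, interleaving calls to different cleaners does not disturb either sequence, which is what makes the line-by-line usage above keep counting across lines.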

By Mike on July 13, 2018 at 11:46 PM

I forgot to include the imports
```
import re
import itertools as it
import functools as ft
```
By Mike on July 13, 2018 at 11:48 PM

Kotlin at https://pastebin.com/rrAZM3gA

By Scott on July 15, 2018 at 9:03 PM
[…] time at work, so it’s an exercise worth examining. We looked at data laundry in two previous exercises. Today’s exercise in data laundry comes to us from Stack […]

By Data Laundry, Revisited | Programming Praxis on March 22, 2019 at 4:01 AM
