Data Laundry, Again

July 13, 2018

Data laundry is the act of cleaning data, as when it arrives in one format and must be translated to another, or when external data must be checked for validity. We looked at data laundry in a previous exercise. We return to it today because I have been doing data laundry all week, handling data from a new vendor. Today’s task is similar to one I have been doing this week: convert the input shown below to the output that follows it, replacing each occurrence of the string ABCDE with a prefix followed by an incrementing number:

ABCDE This is some text.
This is more text. ABCDE, ABCDE.
ABCDE And this is [ABCDE] still more text.

X1 This is some text.
This is more text. X2, X3.
X4 And this is [X5] still more text.

Your task is to write a program to perform the data laundry shown above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Posted by programmingpraxis

Filed in Exercises

6 Responses to “Data Laundry, Again”

James Curtis-Smith said
July 13, 2018 at 9:07 AM
Using Perl’s regular expressions….
```
print 'ABCDE This is some text.
This is more text. ABCDE, ABCDE.
ABCDE And this is [ABCDE] still more text.
' =~ s{ABCDE}{'X'.++$t}reg;
```
g – every occurrence, e – evaluate the replacement as code, r – return the string after replacements…
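
For readers who do not use Perl, a rough Python analogue of those three flags might look like the sketch below. It assumes only the standard `re` and `itertools` modules; the helper name `number_matches` and its parameters are made up for illustration and are not part of the comment above.

```python
import itertools
import re


def number_matches(text, pattern="ABCDE", prefix="X"):
    """Replace each match of pattern with prefix plus a running number."""
    counter = itertools.count(1)  # plays the role of ++$t in the Perl version

    def replacement(match):
        # Like Perl's /e: the replacement is computed per match,
        # rather than being a fixed string.
        return prefix + str(next(counter))

    # re.sub replaces every match by default (like /g) and returns a new
    # string, leaving the input untouched (like /r).
    return re.sub(pattern, replacement, text)


sample = (
    "ABCDE This is some text.\n"
    "This is more text. ABCDE, ABCDE.\n"
    "ABCDE And this is [ABCDE] still more text."
)
print(number_matches(sample))
```

The counter lives inside the function, so each call starts numbering again from 1.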

Daniel said

July 13, 2018 at 9:08 PM

Here’s a solution in Python.

```
def idx_replace(string, sub, prefix):
  # Split on sub: every piece after the first follows one occurrence of sub,
  # so the piece's index is that occurrence's sequence number.
  return "".join("{}{}{}".format(prefix, idx, s) if idx > 0 else s
                 for idx, s in enumerate(string.split(sub)))

string = """ABCDE This is some text.
This is more text. ABCDE, ABCDE.
ABCDE And this is [ABCDE] still more text."""

print(idx_replace(string, "ABCDE", "X"))
```

Output:

X1 This is some text.
This is more text. X2, X3.
X4 And this is [X5] still more text.

Mike said

July 13, 2018 at 11:46 PM

It sounded like you need to do this a lot, so here is a python function that takes a search pattern and a sequence number format and returns a function that can be used to clean up a text string. That way, you can make many cleanup functions with independent sequence numbers.

```
def cleaner_factory(pattern, seq_no_format='X{}'):
    sn = it.count(1)
    return ft.partial(re.sub, pattern, lambda _:seq_no_format.format(next(sn)))
```

It can be used on the entire text:

```
cleaner = cleaner_factory('ABCDE')
print(cleaner(text))
```

X1 This is some text.
This is more text. X2, X3.
X4 And this is [X5] still more text.

It can also be applied to the text in chunks (e.g. line by line). The sequence number format can also be more elaborate.

```
cleaner = cleaner_factory('ABCDE', '<<{:03d}>>')

for line in text.splitlines():
    print(cleaner(line))
```

=>
<<001>> This is some text.
This is more text. <<002>>, <<003>>.
<<004>> And this is [<<005>>] still more text.

Mike said
July 13, 2018 at 11:48 PM
I forgot to include the imports
```
import re
import itertools as it
import functools as ft
```
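
Putting the two comments above together, a self-contained sketch of the factory might look like the following. The imports and the body of `cleaner_factory` follow those comments; the short sample `text` and the two cleaners `first` and `second` are illustrative assumptions, added only to show that each cleaner keeps its own counter.

```python
import functools as ft
import itertools as it
import re


def cleaner_factory(pattern, seq_no_format='X{}'):
    # Each call creates its own counter, so cleaners do not share numbering.
    sn = it.count(1)
    return ft.partial(re.sub, pattern, lambda _: seq_no_format.format(next(sn)))


text = "ABCDE one ABCDE two"  # hypothetical sample, not the exercise input

first = cleaner_factory('ABCDE')
second = cleaner_factory('ABCDE', '<<{:03d}>>')

a = first(text)            # numbering starts at 1 for this cleaner
b = second(text)           # a separate counter, also starting at 1
c = first("ABCDE three")   # the first cleaner continues where it left off

print(a)
print(b)
print(c)
```

Because `pattern` is handed to `re.sub` as a regular expression, a literal search string containing characters such as `[` or `.` would need to go through `re.escape` first.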

Scott said
July 15, 2018 at 9:03 PM
Kotlin at https://pastebin.com/rrAZM3gA

Data Laundry, Revisited | Programming Praxis said
March 22, 2019 at 4:01 AM
[…] time at work, so it’s an exercise worth examining. We looked at data laundry in two previous exercises. Today’s exercise in data laundry comes to us from Stack […]


Programming Praxis