Data Laundry, Again

July 13, 2018

My favorite tool for data laundry is awk:

$ cat infile
ABCDE This is some text.
This is more text. ABCDE, ABCDE.
ABCDE And this is [ABCDE] still more text.
$ awk '{while ($0 ~ /ABCDE/) sub(/ABCDE/, "X"++count); print}' infile
X1 This is some text.
This is more text. X2, X3.
X4 And this is [X5] still more text.

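The while loop is needed because sub replaces only the leftmost match on a line, so it repeats until no ABCDE remains. The print command is not part of the loop; it prints the possibly modified line after all the substitutions have been performed. The count variable is automatically initialized to zero, then pre-incremented each time it is needed.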
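
For readers without awk at hand, the same loop can be sketched in Python. This is only an illustration of the technique, not part of the original solution; the function name launder_lines and its parameters are made up for the example, and the input is hard-coded instead of read from infile.

```python
def launder_lines(lines, target="ABCDE", prefix="X"):
    """Number each occurrence of target across all lines, like the awk one-liner."""
    count = 0  # awk treats an unset variable as zero; Python needs it spelled out
    result = []
    for line in lines:
        # Like awk's sub(), replace only the leftmost match on each pass.
        while target in line:
            count += 1
            line = line.replace(target, prefix + str(count), 1)
        result.append(line)  # the "print" step, after the loop has finished
    return result


infile = [
    "ABCDE This is some text.",
    "This is more text. ABCDE, ABCDE.",
    "ABCDE And this is [ABCDE] still more text.",
]
for line in launder_lines(infile):
    print(line)
```

As with the awk version, the loop assumes the replacement text never contains the target itself; otherwise it would not terminate.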

You can run the program at https://ideone.com/W82V3S.

Posted by programmingpraxis

Filed in Exercises


6 Responses to “Data Laundry, Again”

James Curtis-Smith said
July 13, 2018 at 9:07 AM
Using Perl’s regular expressions….
```
print 'ABCDE This is some text.
This is more text. ABCDE, ABCDE.
ABCDE And this is [ABCDE] still more text.
' =~ s{ABCDE}{'X'.++$t}reg;
```
Flags: g = every occurrence, e = evaluate the replacement as an expression, r = return the string after replacements.

Daniel said

July 13, 2018 at 9:08 PM

Here’s a solution in Python.

def idx_replace(string, sub, prefix):
  return "".join("{}{}{}".format(prefix, idx, s) if idx > 0 else s
                 for idx, s in enumerate(string.split(sub)))

string = """ABCDE This is some text.
This is more text. ABCDE, ABCDE.
ABCDE And this is [ABCDE] still more text."""

print(idx_replace(string, "ABCDE", "X"))

Output:

X1 This is some text.
This is more text. X2, X3.
X4 And this is [X5] still more text.

Mike said

July 13, 2018 at 11:46 PM

It sounded like you need to do this a lot, so here is a Python function that takes a search pattern and a sequence number format and returns a function that can be used to clean up a text string. That way, you can make many cleanup functions, each with its own independent sequence numbers.

def cleaner_factory(pattern, seq_no_format='X{}'):
    sn = it.count(1)
    return ft.partial(re.sub, pattern, lambda _:seq_no_format.format(next(sn)))

It can be used on the entire text:

cleaner = cleaner_factory('ABCDE')
print(cleaner(text))

X1 This is some text.
This is more text. X2, X3.
X4 And this is [X5] still more text.
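
A minimal sketch of the "independent sequence numbers" point, assuming the factory shown above (repeated here, with its imports, so the snippet runs on its own). The sample sentence, the [KEY] marker, and the names clean_ids and clean_keys are invented for illustration. Because the pattern is passed to re.sub as a regular expression, literal text is wrapped in re.escape.

```python
import functools as ft
import itertools as it
import re


def cleaner_factory(pattern, seq_no_format='X{}'):
    # Same factory as above; each call gets its own counter starting at 1.
    sn = it.count(1)
    return ft.partial(re.sub, pattern, lambda _: seq_no_format.format(next(sn)))


# re.escape keeps metacharacters such as [ and ] from being read as regex syntax.
clean_ids = cleaner_factory(re.escape('ABCDE'), 'ID{}')
clean_keys = cleaner_factory(re.escape('[KEY]'), 'K{}')

first = clean_ids('ABCDE owns [KEY]; ABCDE shares [KEY].')
second = clean_keys(first)
third = clean_ids('ABCDE again')

print(first)   # ID1 owns [KEY]; ID2 shares [KEY].
print(second)  # ID1 owns K1; ID2 shares K2.
print(third)   # ID3 again
```

Each cleaner keeps counting across calls, and the two counters never interfere with each other.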

It can also be applied to the text in chunks (e.g. line by line). The sequence number format can also be more elaborate.

cleaner = cleaner_factory('ABCDE', '<<{:03d}>>')

for line in text.splitlines():
    print(cleaner(line))

=>
<<001>> This is some text.
This is more text. <<002>>, <<003>>.
<<004>> And this is [<<005>>] still more text.

Mike said
July 13, 2018 at 11:48 PM
I forgot to include the imports
```
import re
import itertools as it
import functools as ft
```

Scott said
July 15, 2018 at 9:03 PM
Kotlin at https://pastebin.com/rrAZM3gA

Data Laundry, Revisited | Programming Praxis said
March 22, 2019 at 4:01 AM
[…] time at work, so it’s an exercise worth examining. We looked at data laundry in two previous exercises. Today’s exercise in data laundry comes to us from Stack […]

