Data Laundry, Again
July 13, 2018
My favorite tool for data laundry is awk:
    $ cat infile
    ABCDE This is some text.
    This is more text. ABCDE, ABCDE.
    ABCDE And this is [ABCDE] still more text.
    $ awk '{while ($0 ~ /ABCDE/) sub(/ABCDE/, "X"++count); print}' infile
    X1 This is some text.
    This is more text. X2, X3.
    X4 And this is [X5] still more text.
The print command is not part of the loop; it prints a potentially modified line after all the substitutions have been performed. The count variable is automatically initialized to zero, then pre-incremented each time it is needed.
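For readers who do not use awk, the same loop can be sketched in Python. This is only an illustration: the function name launder and its pattern and prefix parameters are hypothetical and not part of the original solution. As in the awk version, the loop would never terminate if the replacement text itself matched the pattern.

```python
import re

def launder(lines, pattern=r"ABCDE", prefix="X"):
    """Replace each match of `pattern` with prefix + a running count.

    Hypothetical helper mirroring the awk one-liner; the counter is shared
    across all lines, just like awk's global `count` variable.
    """
    count = 0
    out = []
    for line in lines:
        # Like awk's while loop: replace the first remaining match each pass.
        while re.search(pattern, line):
            count += 1
            line = re.sub(pattern, f"{prefix}{count}", line, count=1)
        # Like awk's print: runs once per line, after the loop has finished.
        out.append(line)
    return out

sample = [
    "ABCDE This is some text.",
    "This is more text. ABCDE, ABCDE.",
    "ABCDE And this is [ABCDE] still more text.",
]
for line in launder(sample):
    print(line)
# X1 This is some text.
# This is more text. X2, X3.
# X4 And this is [X5] still more text.
```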
You can run the program at https://ideone.com/W82V3S.
Using Perl’s regular expressions….
The flags: g replaces every occurrence, e evaluates the replacement string as code, r returns the string after replacements…
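The roles of those three flags can be illustrated with a rough Python analogue. This is a sketch, not a translation of the Perl code: the names counter and numbered are made up for the example. Python's re.sub replaces every match by default (like g), accepts a function as the replacement (roughly like e), and returns a new string without touching the original (like r).

```python
import re
from itertools import count

counter = count(1)  # yields 1, 2, 3, ...

def numbered(match):
    # Called once per match, much as /e evaluates the replacement each time.
    return f"X{next(counter)}"

text = "ABCDE, ABCDE. [ABCDE]"
cleaned = re.sub(r"ABCDE", numbered, text)  # every match is replaced, like /g

print(cleaned)  # X1, X2. [X3]
print(text)     # ABCDE, ABCDE. [ABCDE]  (unchanged, like /r)
```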
Here’s a solution in Python.
Output:
It sounded like you need to do this a lot, so here is a Python function that takes a search pattern and a sequence number format and returns a function that can be used to clean up a text string. That way, you can make many cleanup functions, each with its own independent sequence numbers.
It can be used on the entire text:
It can also be applied to the text in chunks (e.g. line by line). The sequence number format can also be more elaborate.
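A minimal sketch of a function factory along these lines follows. It is an assumption about how such a helper might look, not the commenter's code: the names make_cleaner, fmt and start are hypothetical. Each returned cleaner carries its own counter, which keeps counting across calls, so it works equally well on a whole text or on successive chunks.

```python
import re
from itertools import count

def make_cleaner(pattern, fmt="X{}", start=1):
    """Return a function that replaces each match of `pattern` with a numbered tag.

    Hypothetical helper: `fmt` is a str.format template that receives the
    sequence number. Every cleaner has its own independent counter.
    """
    regex = re.compile(pattern)
    numbers = count(start)

    def clean(text):
        # The lambda runs once per match and pulls the next sequence number.
        return regex.sub(lambda m: fmt.format(next(numbers)), text)

    return clean

# Applied to an entire text at once.
clean_all = make_cleaner(r"ABCDE")
print(clean_all("ABCDE one. ABCDE two."))  # X1 one. X2 two.

# Applied line by line, with a more elaborate sequence number format.
clean_ids = make_cleaner(r"ABCDE", fmt="<ID-{:03d}>")
for line in ["ABCDE first", "second ABCDE, ABCDE"]:
    print(clean_ids(line))
# <ID-001> first
# second <ID-002>, <ID-003>
```

Because the counter lives in the closure rather than in a global variable, two cleaners built from the same pattern never interfere with each other.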
I forgot to include the imports
Kotlin at https://pastebin.com/rrAZM3gA