Data Laundry, Again
July 13, 2018
My favorite tool for data laundry is awk:
$ cat infile
ABCDE This is some text.
This is more text. ABCDE, ABCDE.
ABCDE And this is [ABCDE] still more text.
$ awk '{while ($0 ~ /ABCDE/) sub(/ABCDE/, "X"++count); print}' infile
X1 This is some text.
This is more text. X2, X3.
X4 And this is [X5] still more text.
The print command is not part of the loop; it prints each line, modified or not, after all the substitutions on that line have been performed. The count variable is automatically initialized to zero, then pre-incremented each time it is needed, so the numbering carries over from one line to the next.
You can run the program at https://ideone.com/W82V3S.
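For readers who don't speak awk, here is a rough Python sketch of the same loop. It is only an illustration of the idea, not a translation anyone above posted; the names number_markers, marker and prefix are made up for this example.

```python
def number_markers(lines, marker="ABCDE", prefix="X"):
    """Yield each line with every marker replaced by prefix + a running count.

    Hypothetical helper mirroring the awk one-liner: the counter is shared
    across lines, and each pass of the inner loop rewrites only the first
    remaining occurrence. Like the awk version, it assumes the replacement
    text never contains the marker itself (otherwise the loop would not end).
    """
    count = 0
    for line in lines:
        while marker in line:
            count += 1
            line = line.replace(marker, f"{prefix}{count}", 1)
        yield line  # emitted after the loop, modified or not


sample = [
    "ABCDE This is some text.",
    "This is more text. ABCDE, ABCDE.",
    "ABCDE And this is [ABCDE] still more text.",
]

for cleaned in number_markers(sample):
    print(cleaned)
```

Writing it as a generator means the counter lives for exactly one pass over the input, much as awk's count lives for one run of the program.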
Using Perl’s regular expressions:
print 'ABCDE This is some text.
This is more text. ABCDE, ABCDE.
ABCDE And this is [ABCDE] still more text.
' =~ s{ABCDE}{'X'.++$t}reg;
The flags: g replaces every occurrence, e evaluates the replacement as an expression, and r returns the string after replacements instead of modifying the original.
Here’s a solution in Python.
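def idx_replace(string, sub, prefix):
    # Split on the marker; every piece after the first is preceded by one
    # occurrence, so prepend the prefix and the piece's index to it.
    return "".join("{}{}{}".format(prefix, idx, s) if idx > 0 else s
                   for idx, s in enumerate(string.split(sub)))

string = """ABCDE This is some text.
This is more text. ABCDE, ABCDE.
ABCDE And this is [ABCDE] still more text."""
print(idx_replace(string, "ABCDE", "X"))
Output:
X1 This is some text.
This is more text. X2, X3.
X4 And this is [X5] still more text.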
It sounded like you need to do this a lot, so here is a Python function that takes a search pattern and a sequence number format and returns a function that can be used to clean up a text string. That way, you can make many cleanup functions, each with its own independent sequence numbers.
def cleaner_factory(pattern, seq_no_format='X{}'):
    sn = it.count(1)
    return ft.partial(re.sub, pattern, lambda _: seq_no_format.format(next(sn)))
It can be used on the entire text:
cleaner = cleaner_factory('ABCDE')
print(cleaner(text))
X1 This is some text.
This is more text. X2, X3.
X4 And this is [X5] still more text.
It can also be applied to the text in chunks (e.g. line by line). The sequence number format can also be more elaborate.
cleaner = cleaner_factory('ABCDE', '<<{:03d}>>')
for line in text.splitlines():
    print(cleaner(line))
=>
<<001>> This is some text.
This is more text. <<002>>, <<003>>.
<<004>> And this is [<<005>>] still more text.
I forgot to include the imports:
import functools as ft
import itertools as it
import re
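To see the "independent sequence numbers" point in action, here is a small self-contained sketch. It is a closure-based variant written for this illustration, not the factory above: the name make_numberer and its literal parameter are assumptions. The one deliberate difference is that the marker is passed through re.escape by default, so a marker containing regex metacharacters (such as brackets) is matched literally.

```python
import itertools
import re


def make_numberer(marker, fmt="X{}", literal=True):
    """Return a function that replaces each match with the next sequence number.

    Hypothetical variant of the factory idea: every returned function owns
    its own counter, and the counter keeps running across calls.
    """
    pattern = re.compile(re.escape(marker) if literal else marker)
    counter = itertools.count(1)

    def clean(text):
        # re accepts a callable replacement; it is invoked once per match.
        return pattern.sub(lambda _match: fmt.format(next(counter)), text)

    return clean


xs = make_numberer("ABCDE")
refs = make_numberer("[ref]", fmt="<<{:03d}>>")

first = xs("ABCDE one, ABCDE two")
second = xs("ABCDE three")          # continues from where `first` stopped
cited = refs("See [ref] and [ref].")  # separate counter, starts at 1

print(first)
print(second)
print(cited)
```

Because each cleaner keeps its counter alive between calls, feeding it a file line by line gives the same numbering as feeding it the whole text at once.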
Kotlin at https://pastebin.com/rrAZM3gA