Data Laundry, Revisited
March 22, 2019
Data laundry is the act of cleaning data, as when it arrives in one format and must be translated to another, or when external data must be checked for validity. Some weeks, as this week, data laundry occupies a significant portion of my time at work, so it’s an exercise worth examining. We looked at data laundry in two previous exercises. Today’s exercise in data laundry comes to us from Stack Overflow:
I have a file like this:
AAKRKA HIST1H1B AAGAGAAKRKATGPP AAKRKA HIST1H1E RKSAGAAKRKASGPP AAKRLN ACAT1 LMTADAAKRLNVTPL AAKRLN SUCLG2 NEALEAAKRLNAKEI AAKRLR GTF2F1 VSEMPAAKRLRLDTG AAKRMA VCL NDIIAAAKRMALLMA AAKRPL WIZ YLGSVAAKRPLQEDR AAKRQK MTA2 SSSQPAAKRQKLNPAI would like to kind of merge 2 lines if they are exactly the same in the 1st column. The desired output is:
AAKRKA HIST1H1B,HIST1H1E AAGAGAAKRKATGPP,RKSAGAAKRKASGPP AAKRLN ACAT1,SUCLG2 LMTADAAKRLNVTPL,NEALEAAKRLNAKEI AAKRLR GTF2F1 VSEMPAAKRLRLDTG AAKRMA VCL NDIIAAAKRMALLMA AAKRPL WIZ YLGSVAAKRPLQEDR AAKRQK MTA2 SSSQPAAKRQKLNPASometimes there could be more than two lines starting with the same word. How could I reach the desired output with bash/awk?
Your task is to write a program to solve this simple task of data laundry. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
awk version
Scheme (Chicken):
For something a bit different, here’s a solution using SQLite.
Here’s a solution in Python.
Example Usage:
In SWI-Prolog 7 and up:
collapse([], []).
collapse([X], [X]).
collapse([[A,B1,C1],[A,B2,C2]|Rest], CollapsedRest) :-
atomics_to_string([B1,B2], ",", B3),
atomics_to_string([C1,C2], ",", C3),
collapse([[A,B3,C3]|Rest], CollapsedRest).
collapse([X|Rest], [X|CollapsedRest]) :-
collapse(Rest, CollapsedRest).
read_file_to_lines(Filename, Lines) :-
read_file_to_string(Filename, String, []),
split_string(String, "\n", "\n", Lines).
process(Lines, CollapsedLines) :-
maplist(split_line, Lines, SplitLines),
collapse(SplitLines, CollapsedLines).
split_line(Line, Parts) :-
split_string(Line, " ", " ", Parts).
go :-
read_file_to_lines("data-laundry-input.txt", Lines),
process(Lines, CollapsedLines),
maplist(format("~w ~w ~w~n"), CollapsedLines).
Another AWK solution. (Well, GNU AWK…) It makes various assumptions, such as fields not being printf format strings, etc.
AWK.
BEGIN {
a = b = c = “”
}
$1 == a {
b = b “,” $2
c = c “,” $3
next
}
{
if (a != “”) print a, b, c
a = $1
b = $2
c = $3
}
END {
if (a != “”) print a, b, c
}
Nothing is said about the order of the input or the output, so I have assumed
that the input is already sorted. If we cannot assume that, the program gets
simpler, at the price of storing the data in memory.
{
a = $1
if (a in B) {
B[a] = B[a] “,” $2
C[a] = C[a] “,” $3
} else {
B[a] = $2
C[a] = $3
}
}
END {
for (a in B) print a, B[a], C[a] # | “sort”
}
If the output is supposed to be sorted, then remove the “#”.
Another issue is that I have assumed that the lines have exactly
three fields.