Data Laundry, Revisited
March 22, 2019
Awk is my normal detergent for data laundry:
{ second[$1] = second[$1] "," $2; third[$1] = third[$1] "," $3 }
END { for (i in second) print i, substr(second[i],2), substr(third[i],2) }
The main rule appends the second and third fields to two associative arrays indexed by the first field, writing a comma before each appended value so that records with the same index accumulate into a comma-separated list. The END rule iterates over the indices and uses substr to drop the leading comma on output. Here is the output:
AAKRLN ACAT1,SUCLG2 LMTADAAKRLNVTPL,NEALEAAKRLNAKEI
AAKRMA VCL NDIIAAAKRMALLMA
AAKRQK MTA2 SSSQPAAKRQKLNPA
AAKRPL WIZ YLGSVAAKRPLQEDR
AAKRLR GTF2F1 VSEMPAAKRLRLDTG
AAKRKA HIST1H1B,HIST1H1E AAGAGAAKRKATGPP,RKSAGAAKRKASGPP
I made no assumptions about the order of the input or the output. If you want sorted output, pipe the output through sort. You can run the program at https://ideone.com/sbgLNk.
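For readers who do not use Awk, the same accumulate-by-key idea can be sketched in Python. This is only an illustration, not the program behind the ideone link: the function name collapse_by_key is made up, the sample rows are reconstructed from the output shown above, and skipping lines that do not have exactly three fields is an assumption of the sketch.

```python
from collections import defaultdict

def collapse_by_key(lines):
    """Group three-field records by their first field, joining the rest with commas."""
    second = defaultdict(list)
    third = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # assumption: silently skip blank or malformed lines
        key, b, c = fields
        second[key].append(b)
        third[key].append(c)
    # dicts keep insertion order, so keys come out in first-seen order
    return [f"{k} {','.join(second[k])} {','.join(third[k])}" for k in second]

# Sample rows reconstructed from the output shown in the post.
sample = [
    "AAKRKA HIST1H1B AAGAGAAKRKATGPP",
    "AAKRKA HIST1H1E RKSAGAAKRKASGPP",
    "AAKRLN ACAT1 LMTADAAKRLNVTPL",
    "AAKRLN SUCLG2 NEALEAAKRLNAKEI",
    "AAKRMA VCL NDIIAAAKRMALLMA",
]

for record in collapse_by_key(sample):
    print(record)
```

Building lists and joining them at the end avoids the leading comma that the Awk version has to strip with substr.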
awk version
Scheme (Chicken):
For something a bit different, here’s a solution using SQLite.
Here’s a solution in Python.
Example Usage:
In SWI-Prolog 7 and up:
collapse([], []).
collapse([X], [X]).
collapse([[A,B1,C1],[A,B2,C2]|Rest], CollapsedRest) :-
    atomics_to_string([B1,B2], ",", B3),
    atomics_to_string([C1,C2], ",", C3),
    collapse([[A,B3,C3]|Rest], CollapsedRest).
collapse([X|Rest], [X|CollapsedRest]) :-
    collapse(Rest, CollapsedRest).

read_file_to_lines(Filename, Lines) :-
    read_file_to_string(Filename, String, []),
    split_string(String, "\n", "\n", Lines).

process(Lines, CollapsedLines) :-
    maplist(split_line, Lines, SplitLines),
    collapse(SplitLines, CollapsedLines).

split_line(Line, Parts) :-
    split_string(Line, " ", " ", Parts).

go :-
    read_file_to_lines("data-laundry-input.txt", Lines),
    process(Lines, CollapsedLines),
    maplist(format("~w ~w ~w~n"), CollapsedLines).
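The collapse predicate above merges only neighbouring rows that share a first field, so it relies on equal keys being adjacent in the input. A rough Python sketch of that same adjacent-merge behaviour, using itertools.groupby, might look like the following. The name collapse_adjacent is hypothetical, and the sketch assumes every non-blank line has at least three fields.

```python
from itertools import groupby

def collapse_adjacent(lines):
    """Yield merged records, combining only runs of adjacent rows with the same key."""
    rows = (line.split() for line in lines if line.strip())
    for key, run in groupby(rows, key=lambda row: row[0]):
        run = list(run)  # materialise the run so it can be read twice
        seconds = ",".join(row[1] for row in run)
        thirds = ",".join(row[2] for row in run)
        yield f"{key} {seconds} {thirds}"

# Equal keys that are not adjacent stay separate, as in the Prolog version.
for record in collapse_adjacent(["a x 1", "a y 2", "b z 3", "a w 4"]):
    print(record)
```

Only one run of rows is held in memory at a time, which is the trade-off against the associative-array approach.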
Another AWK solution. (Well, GNU AWK…) It makes various assumptions, such as fields not being printf format strings, etc.
AWK.
BEGIN {
    a = b = c = ""
}
$1 == a {
    b = b "," $2
    c = c "," $3
    next
}
{
    if (a != "") print a, b, c
    a = $1
    b = $2
    c = $3
}
END {
    if (a != "") print a, b, c
}
Nothing is said about the order of the input or the output, so I have assumed
that the input is already sorted. If we cannot assume that, the program gets
simpler, at the price of storing the data in memory.
{
    a = $1
    if (a in B) {
        B[a] = B[a] "," $2
        C[a] = C[a] "," $3
    } else {
        B[a] = $2
        C[a] = $3
    }
}
END {
    for (a in B) print a, B[a], C[a] # | "sort"
}
If the output is supposed to be sorted, then remove the “#”.
Another issue is that I have assumed that the lines have exactly
three fields.
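If the three-field assumption is dropped, one possible generalisation is to keep a list of values per column for each key. The Python sketch below is only an illustration of that idea; collapse_any_width is a made-up name, and how ragged rows should be handled is a guess (here a short row simply contributes nothing to the columns it lacks, so column alignment across rows is not preserved).

```python
def collapse_any_width(lines):
    """Merge records by first field, joining every remaining column with commas."""
    columns = {}  # key -> list of columns, each column a list of values
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        key, rest = fields[0], fields[1:]
        cols = columns.setdefault(key, [])
        while len(cols) < len(rest):
            cols.append([])  # grow to the widest row seen for this key
        for col, value in zip(cols, rest):
            col.append(value)
    return [
        " ".join([key] + [",".join(col) for col in cols])
        for key, cols in columns.items()
    ]

for record in collapse_any_width(["a x 1 p", "a y 2 q", "b z"]):
    print(record)
```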