Data Laundry, Revisited

March 22, 2019

Awk is my normal detergent for data laundry:

{ second[$1] = second[$1] "," $2; third[$1] = third[$1] "," $3 }
END { for (i in second) print i, substr(second[i],2), substr(third[i],2) }

The first line stores the second and third fields in associative arrays indexed by the first field, accumulating fields with identical indices with leading commas before each field, and the second line iterates over the two arrays, deleting the leading comma on output. Here is the output:

AAKRLN ACAT1,SUCLG2 LMTADAAKRLNVTPL,NEALEAAKRLNAKEI
AAKRMA VCL NDIIAAAKRMALLMA
AAKRQK MTA2 SSSQPAAKRQKLNPA
AAKRPL WIZ YLGSVAAKRPLQEDR
AAKRLR GTF2F1 VSEMPAAKRLRLDTG
AAKRKA HIST1H1B,HIST1H1E AAGAGAAKRKATGPP,RKSAGAAKRKASGPP

I made no assumptions about the order of the input or the output. If you want sorted output, pipe the output through sort. You can run the program at https://ideone.com/sbgLNk.

Posted by programmingpraxis

Filed in Exercises

7 Comments »

7 Responses to “Data Laundry, Revisited”

Steve said

March 22, 2019 at 6:33 AM

awk version

$ cat data-laundry-revisited.awk
# Data Laundry, Revisited
#

function createname(str) {
        buffer = "";
        name   = $1;
        n      = split($0,arr);
        for (i = 2; i <= n; i++) {
                buffer = buffer arr[i] " ";
        }
        return substr(buffer,1,length(buffer)-1);
}

function addname(str,buffer,name) {
        n1 = split(buffer,arr1);
        n2 = split(str,arr2);
        for (i = 2; i <= n2; i++) {
                arr2[i] = arr2[i] "," arr1[i-1]
        }
        buffer2 = "";
        for (i = 2; i <= n2; i++) {
                buffer2 = buffer2 arr2[i] " ";
        }
        return substr(buffer2,1,length(buffer2)-1);
}

BEGIN {
        buffer    = "";
        name      = ""
}

{
        if ($1 != name) {
                if (buffer == "") {
                        buffer = createname($0);
                }
                else {
                        print name " " buffer;
                        buffer = createname($0);
                }
        }
        else {
                buffer = addname($0,buffer,name);
        }
}

END {
        if (buffer != "") {
                print name " " buffer;
        }
}

$ cat data-laundry.txt
AAKRKA HIST1H1B AAGAGAAKRKATGPP
AAKRKA HIST1H1E RKSAGAAKRKASGPP
AAKRLN ACAT1 LMTADAAKRLNVTPL
AAKRLN SUCLG2 NEALEAAKRLNAKEI
AAKRLR GTF2F1 VSEMPAAKRLRLDTG
AAKRMA VCL NDIIAAAKRMALLMA
AAKRPL WIZ YLGSVAAKRPLQEDR
AAKRQK MTA2 SSSQPAAKRQKLNPA

$ awk -f data-laundry-revisited.awk data-laundry.txt
AAKRKA HIST1H1E,HIST1H1B RKSAGAAKRKASGPP,AAGAGAAKRKATGPP
AAKRLN SUCLG2,ACAT1 NEALEAAKRLNAKEI,LMTADAAKRLNVTPL
AAKRLR GTF2F1 VSEMPAAKRLRLDTG
AAKRMA VCL NDIIAAAKRMALLMA
AAKRPL WIZ YLGSVAAKRPLQEDR
AAKRQK MTA2 SSSQPAAKRQKLNPA

Tim said

March 22, 2019 at 9:01 AM

Scheme (Chicken):

(import (srfi-1) (srfi-13) (chicken io))

(define (words line)
  (string-tokenize line))

(define (lines-mergeable? line1 line2)
  (equal? (car (words line1))
          (car (words line2))))

(define (merge-lines line1 line2)
  (let ((line1-words (words line1))
        (line2-words (words line2)))
    (string-join
     (cons
      (car line1-words)
      (map (lambda (l) (string-join l ","))
           (zip (cdr line1-words)
                (cdr line2-words)))))))

(define (do-merge input-lines)
  (let loop ((remaining (cdr input-lines))
             (current (car input-lines))
             (output '()))
    (if (null? remaining)
        (reverse (cons current output))
        (let ((next (car remaining)))
          (if (lines-mergeable? next current)
              (loop (cdr remaining) (merge-lines current next) output)
              (loop (cdr remaining) next (cons current output)))))))

(with-input-from-file "laundry-input-file.txt"
  (lambda ()
    (for-each print (do-merge (read-lines)))))

chaw said

March 22, 2019 at 10:18 AM

For something a bit different, here’s a solution using SQLite.

create temp table R(
  a text,
  b text,
  c text);

.separator ' '

.import input.txt R

.once output.txt
select a, group_concat(b) as bs, group_concat(c) as cs
from R
group by a;

Daniel said

March 22, 2019 at 10:25 AM

Here’s a solution in Python.

from collections import OrderedDict
import sys

d = OrderedDict()

with open(sys.argv[1]) as f:
    for line in f:
        key, val1, val2 = line.strip().split()
        if key not in d:
            d[key] = []
        d[key].extend((val1, val2))

for key, vals in d.items():
    vals1 = ','.join(vals[0::2])
    vals2 = ','.join(vals[1::2])
    print('{} {} {}'.format(key, vals1, vals2))

Example Usage:

$ python3 laundry.py input.txt
AAKRKA HIST1H1B,HIST1H1E AAGAGAAKRKATGPP,RKSAGAAKRKASGPP
AAKRLN ACAT1,SUCLG2 LMTADAAKRLNVTPL,NEALEAAKRLNAKEI
AAKRLR GTF2F1 VSEMPAAKRLRLDTG
AAKRMA VCL NDIIAAAKRMALLMA
AAKRPL WIZ YLGSVAAKRPLQEDR
AAKRQK MTA2 SSSQPAAKRQKLNPA

Tovervogel said
March 22, 2019 at 1:03 PM
In SWI-Prolog 7 and up:

collapse([], []).
collapse([X], [X]).
collapse([[A,B1,C1],[A,B2,C2]|Rest], CollapsedRest) :-
atomics_to_string([B1,B2], ",", B3),
atomics_to_string([C1,C2], ",", C3),
collapse([[A,B3,C3]|Rest], CollapsedRest).
collapse([X|Rest], [X|CollapsedRest]) :-
collapse(Rest, CollapsedRest).

read_file_to_lines(Filename, Lines) :-
read_file_to_string(Filename, String, []),
split_string(String, "\n", "\n", Lines).

process(Lines, CollapsedLines) :-
maplist(split_line, Lines, SplitLines),
collapse(SplitLines, CollapsedLines).

split_line(Line, Parts) :-
split_string(Line, " ", " ", Parts).

go :-
read_file_to_lines("data-laundry-input.txt", Lines),
process(Lines, CollapsedLines),
maplist(format("~w ~w ~w~n"), CollapsedLines).

Globules said

March 23, 2019 at 7:16 AM

Another AWK solution. (Well, GNU AWK…) It makes various assumptions, such as fields not being printf format strings, etc.

#!/usr/local/bin/gawk -f

{
    for (i = 2; i <= NF; i++) {
        field[$1][i-2][NR] = $i
    }
}

END {
    PROCINFO["sorted_in"] = "@ind_num_asc"
    for (key in field) {
        printf key
        for (i = 0; i < length(field[key]); i++) {
            s = ""
            for (j in field[key][i]) {
                s = s "," field[key][i][j]
            }
            printf " " gensub(/^,/, "", "1", s)
        }
        print ""
    }
}

$ ./laundry.awk < laundry.txt
AAKRKA HIST1H1B,HIST1H1E AAGAGAAKRKATGPP,RKSAGAAKRKASGPP
AAKRLN ACAT1,SUCLG2 LMTADAAKRLNVTPL,NEALEAAKRLNAKEI
AAKRLR GTF2F1 VSEMPAAKRLRLDTG
AAKRMA VCL NDIIAAAKRMALLMA
AAKRPL WIZ YLGSVAAKRPLQEDR
AAKRQK MTA2 SSSQPAAKRQKLNPA

Richard A. O'Keefe said
August 21, 2019 at 10:09 PM
AWK.

BEGIN {
a = b = c = “”
}
$1 == a {
b = b “,” $2
c = c “,” $3
next
}
{
if (a != “”) print a, b, c
a = $1
b = $2
c = $3
}
END {
if (a != “”) print a, b, c
}

Nothing is said about the order of the input or the output, so I have assumed
that the input is already sorted. If we cannot assume that, the program gets
simpler, at the price of storing the data in memory.

{
a = $1
if (a in B) {
B[a] = B[a] “,” $2
C[a] = C[a] “,” $3
} else {
B[a] = $2
C[a] = $3
}
}
END {
for (a in B) print a, B[a], C[a] # | “sort”
}

If the output is supposed to be sorted, then remove the “#”.
Another issue is that I have assumed that the lines have exactly
three fields.

S	M	T	W	T	F	S
					1	2
3	4	5	6	7	8	9
10	11	12	13	14	15	16
17	18	19	20	21	22	23
24	25	26	27	28	29	30
31

Programming Praxis