Data Laundry, Revisited

March 22, 2019

Data laundry is the act of cleaning data, as when it arrives in one format and must be translated to another, or when external data must be checked for validity. Some weeks, as this week, data laundry occupies a significant portion of my time at work, so it’s an exercise worth examining. We looked at data laundry in two previous exercises. Today’s exercise in data laundry comes to us from Stack Overflow:

I have a file like this:

AAKRKA HIST1H1B AAGAGAAKRKATGPP
AAKRKA HIST1H1E RKSAGAAKRKASGPP
AAKRLN ACAT1 LMTADAAKRLNVTPL
AAKRLN SUCLG2 NEALEAAKRLNAKEI
AAKRLR GTF2F1 VSEMPAAKRLRLDTG
AAKRMA VCL NDIIAAAKRMALLMA
AAKRPL WIZ YLGSVAAKRPLQEDR
AAKRQK MTA2 SSSQPAAKRQKLNPA

I would like to kind of merge 2 lines if they are exactly the same in the 1st column. The desired output is:

AAKRKA HIST1H1B,HIST1H1E AAGAGAAKRKATGPP,RKSAGAAKRKASGPP
AAKRLN ACAT1,SUCLG2 LMTADAAKRLNVTPL,NEALEAAKRLNAKEI
AAKRLR GTF2F1 VSEMPAAKRLRLDTG
AAKRMA VCL NDIIAAAKRMALLMA
AAKRPL WIZ YLGSVAAKRPLQEDR
AAKRQK MTA2 SSSQPAAKRQKLNPA

Sometimes there could be more than two lines starting with the same word. How could I reach the desired output with bash/awk?

Your task is to write a program to solve this simple task of data laundry. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Advertisements

Pages: 1 2

7 Responses to “Data Laundry, Revisited”

  1. Steve said

    awk version

    $ cat data-laundry-revisited.awk
    # Data Laundry, Revisited
    #
    
    function createname(str) {
            buffer = "";
            name   = $1;
            n      = split($0,arr);
            for (i = 2; i <= n; i++) {
                    buffer = buffer arr[i] " ";
            }
            return substr(buffer,1,length(buffer)-1);
    }
    
    function addname(str,buffer,name) {
            n1 = split(buffer,arr1);
            n2 = split(str,arr2);
            for (i = 2; i <= n2; i++) {
                    arr2[i] = arr2[i] "," arr1[i-1]
            }
            buffer2 = "";
            for (i = 2; i <= n2; i++) {
                    buffer2 = buffer2 arr2[i] " ";
            }
            return substr(buffer2,1,length(buffer2)-1);
    }
    
    BEGIN {
            buffer    = "";
            name      = ""
    }
    
    {
            if ($1 != name) {
                    if (buffer == "") {
                            buffer = createname($0);
                    }
                    else {
                            print name " " buffer;
                            buffer = createname($0);
                    }
            }
            else {
                    buffer = addname($0,buffer,name);
            }
    }
    
    END {
            if (buffer != "") {
                    print name " " buffer;
            }
    }
    
    $ cat data-laundry.txt
    AAKRKA HIST1H1B AAGAGAAKRKATGPP
    AAKRKA HIST1H1E RKSAGAAKRKASGPP
    AAKRLN ACAT1 LMTADAAKRLNVTPL
    AAKRLN SUCLG2 NEALEAAKRLNAKEI
    AAKRLR GTF2F1 VSEMPAAKRLRLDTG
    AAKRMA VCL NDIIAAAKRMALLMA
    AAKRPL WIZ YLGSVAAKRPLQEDR
    AAKRQK MTA2 SSSQPAAKRQKLNPA
    
    $ awk -f data-laundry-revisited.awk data-laundry.txt
    AAKRKA HIST1H1E,HIST1H1B RKSAGAAKRKASGPP,AAGAGAAKRKATGPP
    AAKRLN SUCLG2,ACAT1 NEALEAAKRLNAKEI,LMTADAAKRLNVTPL
    AAKRLR GTF2F1 VSEMPAAKRLRLDTG
    AAKRMA VCL NDIIAAAKRMALLMA
    AAKRPL WIZ YLGSVAAKRPLQEDR
    AAKRQK MTA2 SSSQPAAKRQKLNPA
    
    
  2. Tim said

    Scheme (Chicken):

    (import (srfi-1) (srfi-13) (chicken io))
    
    (define (words line)
      (string-tokenize line))
    
    (define (lines-mergeable? line1 line2)
      (equal? (car (words line1))
              (car (words line2))))
    
    (define (merge-lines line1 line2)
      (let ((line1-words (words line1))
            (line2-words (words line2)))
        (string-join
         (cons
          (car line1-words)
          (map (lambda (l) (string-join l ","))
               (zip (cdr line1-words)
                    (cdr line2-words)))))))
    
    (define (do-merge input-lines)
      (let loop ((remaining (cdr input-lines))
                 (current (car input-lines))
                 (output '()))
        (if (null? remaining)
            (reverse (cons current output))
            (let ((next (car remaining)))
              (if (lines-mergeable? next current)
                  (loop (cdr remaining) (merge-lines current next) output)
                  (loop (cdr remaining) next (cons current output)))))))
    
    (with-input-from-file "laundry-input-file.txt"
      (lambda ()
        (for-each print (do-merge (read-lines)))))
    
  3. chaw said

    For something a bit different, here’s a solution using SQLite.

    create temp table R(
      a text,
      b text,
      c text);
    
    .separator ' '
    
    .import input.txt R
    
    .once output.txt
    select a, group_concat(b) as bs, group_concat(c) as cs
    from R
    group by a;
    
  4. Daniel said

    Here’s a solution in Python.

    from collections import OrderedDict
    import sys
    
    d = OrderedDict()
    
    with open(sys.argv[1]) as f:
        for line in f:
            key, val1, val2 = line.strip().split()
            if key not in d:
                d[key] = []
            d[key].extend((val1, val2))
    
    for key, vals in d.items():
        vals1 = ','.join(vals[0::2])
        vals2 = ','.join(vals[1::2])
        print('{} {} {}'.format(key, vals1, vals2))
    

    Example Usage:

    $ python3 laundry.py input.txt
    AAKRKA HIST1H1B,HIST1H1E AAGAGAAKRKATGPP,RKSAGAAKRKASGPP
    AAKRLN ACAT1,SUCLG2 LMTADAAKRLNVTPL,NEALEAAKRLNAKEI
    AAKRLR GTF2F1 VSEMPAAKRLRLDTG
    AAKRMA VCL NDIIAAAKRMALLMA
    AAKRPL WIZ YLGSVAAKRPLQEDR
    AAKRQK MTA2 SSSQPAAKRQKLNPA
    
  5. Tovervogel said

    In SWI-Prolog 7 and up:

    collapse([], []).
    collapse([X], [X]).
    collapse([[A,B1,C1],[A,B2,C2]|Rest], CollapsedRest) :-
    atomics_to_string([B1,B2], ",", B3),
    atomics_to_string([C1,C2], ",", C3),
    collapse([[A,B3,C3]|Rest], CollapsedRest).
    collapse([X|Rest], [X|CollapsedRest]) :-
    collapse(Rest, CollapsedRest).

    read_file_to_lines(Filename, Lines) :-
    read_file_to_string(Filename, String, []),
    split_string(String, "\n", "\n", Lines).

    process(Lines, CollapsedLines) :-
    maplist(split_line, Lines, SplitLines),
    collapse(SplitLines, CollapsedLines).

    split_line(Line, Parts) :-
    split_string(Line, " ", " ", Parts).

    go :-
    read_file_to_lines("data-laundry-input.txt", Lines),
    process(Lines, CollapsedLines),
    maplist(format("~w ~w ~w~n"), CollapsedLines).

  6. Globules said

    Another AWK solution. (Well, GNU AWK…) It makes various assumptions, such as fields not being printf format strings, etc.

    #!/usr/local/bin/gawk -f
    
    {
        for (i = 2; i <= NF; i++) {
            field[$1][i-2][NR] = $i
        }
    }
    
    END {
        PROCINFO["sorted_in"] = "@ind_num_asc"
        for (key in field) {
            printf key
            for (i = 0; i < length(field[key]); i++) {
                s = ""
                for (j in field[key][i]) {
                    s = s "," field[key][i][j]
                }
                printf " " gensub(/^,/, "", "1", s)
            }
            print ""
        }
    }
    
    $ ./laundry.awk < laundry.txt
    AAKRKA HIST1H1B,HIST1H1E AAGAGAAKRKATGPP,RKSAGAAKRKASGPP
    AAKRLN ACAT1,SUCLG2 LMTADAAKRLNVTPL,NEALEAAKRLNAKEI
    AAKRLR GTF2F1 VSEMPAAKRLRLDTG
    AAKRMA VCL NDIIAAAKRMALLMA
    AAKRPL WIZ YLGSVAAKRPLQEDR
    AAKRQK MTA2 SSSQPAAKRQKLNPA
    
  7. Richard A. O'Keefe said

    AWK.

    BEGIN {
    a = b = c = “”
    }
    $1 == a {
    b = b “,” $2
    c = c “,” $3
    next
    }
    {
    if (a != “”) print a, b, c
    a = $1
    b = $2
    c = $3
    }
    END {
    if (a != “”) print a, b, c
    }

    Nothing is said about the order of the input or the output, so I have assumed
    that the input is already sorted. If we cannot assume that, the program gets
    simpler, at the price of storing the data in memory.

    {
    a = $1
    if (a in B) {
    B[a] = B[a] “,” $2
    C[a] = C[a] “,” $3
    } else {
    B[a] = $2
    C[a] = $3
    }
    }
    END {
    for (a in B) print a, B[a], C[a] # | “sort”
    }

    If the output is supposed to be sorted, then remove the “#”.
    Another issue is that I have assumed that the lines have exactly
    three fields.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: