Data Laundry, Revisited
March 22, 2019
Data laundry is the act of cleaning data, as when it arrives in one format and must be translated to another, or when external data must be checked for validity. Some weeks, as this week, data laundry occupies a significant portion of my time at work, so it’s an exercise worth examining. We looked at data laundry in two previous exercises. Today’s exercise in data laundry comes to us from Stack Overflow:
I have a file like this:
AAKRKA HIST1H1B AAGAGAAKRKATGPP AAKRKA HIST1H1E RKSAGAAKRKASGPP AAKRLN ACAT1 LMTADAAKRLNVTPL AAKRLN SUCLG2 NEALEAAKRLNAKEI AAKRLR GTF2F1 VSEMPAAKRLRLDTG AAKRMA VCL NDIIAAAKRMALLMA AAKRPL WIZ YLGSVAAKRPLQEDR AAKRQK MTA2 SSSQPAAKRQKLNPAI would like to kind of merge 2 lines if they are exactly the same in the 1st column. The desired output is:
AAKRKA HIST1H1B,HIST1H1E AAGAGAAKRKATGPP,RKSAGAAKRKASGPP AAKRLN ACAT1,SUCLG2 LMTADAAKRLNVTPL,NEALEAAKRLNAKEI AAKRLR GTF2F1 VSEMPAAKRLRLDTG AAKRMA VCL NDIIAAAKRMALLMA AAKRPL WIZ YLGSVAAKRPLQEDR AAKRQK MTA2 SSSQPAAKRQKLNPASometimes there could be more than two lines starting with the same word. How could I reach the desired output with bash/awk?
Your task is to write a program to solve this simple task of data laundry. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.