Natural Join

June 15, 2010

In relational databases, the natural join of two tables is an operator that takes two tables as input and returns a new table that combines the two input tables on their common keys. If the input tables are sorted, the join simply scans the two tables, writing records for the cross-product of all records with equal keys. For instance, the join of the two tables

Key  Field1  Field2               Key  Field3
 A      w       p                  A      1
 B      x       q        and       A      2
 B      y       r                  B      3
 C      z       s

is the table

Key  Field1  Field2  Field3
 A      w       p       1
 A      w       p       2
 B      x       q       3
 B      y       r       3

We represent a table as a file; each line is a record, and fields are separated by tabs. For simplicity, we’ll assume that the first field in each record is the key.

Your task is to write a program that takes two input files representing tables (you may assume they are sorted) and produces their natural join as output. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

5 Responses to “Natural Join”

  1. […] Praxis – Natural Join By Remco Niemeijer In today’s Programming Praxis exercise we have to implement a program that joins two files with tables in […]

  2. Remco Niemeijer said

    My Haskell solution (see http://bonsaicode.wordpress.com/2010/06/15/programming-praxis-natural-join/ for a version with comments):

    import Data.List
    import Data.List.Split
    import System.Environment
    
    loadFile :: FilePath -> IO [[String]]
    loadFile = fmap (map (splitOn "\t") . lines) . readFile
    
    join :: Eq a => [[a]] -> [[a]] -> [[a]]
    
    join (ks1:f1) (ks2:f2) = union ks1 ks2 :
        [k1 : v1++v2 | (k1:v1) <- f1, (k2:v2) <- f2, k1 == k2]
    join _        _        = []
    
    main :: IO ()
    main = mapM_ putStrLn . map (intercalate "\t") .
           foldl1 join =<< mapM loadFile =<< getArgs
    
  3. kbob said

    Here’s a Python solution. It uses Python’s defaultdict(list) as a multimap from keys to records. make_table() and join_tables() do the work; the rest is just I/O.

    from collections import defaultdict
    
    def make_table(records):
        d = defaultdict(list)
        for rec in records:
            d[rec[0]].append(rec[1:])
        return d
    
    def join_tables(t1, t2):
        return [[k] + r1 + r2
                for k in t1
                for r1 in t1[k]
                for r2 in t2[k]]
             
    def read_file(fname):
        with open(fname) as f:
            return [line.strip('\n').split('\t') for line in f]
    
    def read_table(fname):
        return make_table(read_file(fname))
    
    def join_files(fn1, fn2):
        return join_tables(read_table(fn1), read_table(fn2))
    
    def print_records(records):
        for rec in records:
            print '\t'.join(str(f) for f in rec)
    
    print_records(join_tables(read_table('t1'), read_table('t2')))
    
  4. I’ve posted my Java solution here. The bulk of the code is I/O and the definition of rows and columns as full-fledged classes.

  5. Vikas Tandi said

    my solution in C
    http://codepad.org/k8tw70k0

Leave a comment