Natural Join
June 15, 2010
In relational databases, the natural join of two tables is an operator that takes two tables as input and returns a new table that combines the two input tables on their common keys. If the input tables are sorted, the join simply scans the two tables, writing records for the cross-product of all records with equal keys. For instance, the join of the two tables
Key Field1 Field2 Key Field3
A w p A 1
B x q and A 2
B y r B 3
C z s
is the table
Key Field1 Field2 Field3
A w p 1
A w p 2
B x q 3
B y r 3
We represent a table as a file; each line is a record, and fields are separated by tabs. For simplicity, we’ll assume that the first field in each record is the key.
Your task is to write a program that takes two input files representing tables (you may assume they are sorted) and produces their natural join as output. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
[…] Praxis – Natural Join By Remco Niemeijer In today’s Programming Praxis exercise we have to implement a program that joins two files with tables in […]
My Haskell solution (see http://bonsaicode.wordpress.com/2010/06/15/programming-praxis-natural-join/ for a version with comments):
import Data.List import Data.List.Split import System.Environment loadFile :: FilePath -> IO [[String]] loadFile = fmap (map (splitOn "\t") . lines) . readFile join :: Eq a => [[a]] -> [[a]] -> [[a]] join (ks1:f1) (ks2:f2) = union ks1 ks2 : [k1 : v1++v2 | (k1:v1) <- f1, (k2:v2) <- f2, k1 == k2] join _ _ = [] main :: IO () main = mapM_ putStrLn . map (intercalate "\t") . foldl1 join =<< mapM loadFile =<< getArgsHere’s a Python solution. It uses Python’s defaultdict(list) as a multimap from keys to records. make_table() and join_tables() do the work; the rest is just I/O.
from collections import defaultdict def make_table(records): d = defaultdict(list) for rec in records: d[rec[0]].append(rec[1:]) return d def join_tables(t1, t2): return [[k] + r1 + r2 for k in t1 for r1 in t1[k] for r2 in t2[k]] def read_file(fname): with open(fname) as f: return [line.strip('\n').split('\t') for line in f] def read_table(fname): return make_table(read_file(fname)) def join_files(fn1, fn2): return join_tables(read_table(fn1), read_table(fn2)) def print_records(records): for rec in records: print '\t'.join(str(f) for f in rec) print_records(join_tables(read_table('t1'), read_table('t2')))I’ve posted my Java solution here. The bulk of the code is I/O and the definition of rows and columns as full-fledged classes.
my solution in C
http://codepad.org/k8tw70k0