Distinct Words
January 24, 2017
I’m not sure of the original source of today’s exercise; it could be a homework problem or an interview question:
Given a huge file of words, print all the distinct words in it, in ascending order; the definition of “huge” is “too big to fit in memory.”
Your task is to write a program to print all the distinct words in a huge file. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
Or, of course, you could look in Knuth and find out how to do a polyphase merge sort with replacement selection.
It can be accomplished in Java by adding all the words to a TreeSet, which automatically keeps them distinct and in sorted order.
@John: as @praxis says, it’s likely the set of distinct words will fit into memory, even if the input file won’t, so sorting (polyphase merge or not) isn’t likely to be the best solution – a trie or hashtable would be better.
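To make the trie idea concrete, here is a minimal Haskell sketch (an illustration, not code from any commenter), assuming Data.Map for the child links: each word is inserted character by character, and walking the children in ascending key order yields the distinct words already sorted, so no separate sort pass is needed.

import Data.List (foldl')
import qualified Data.Map.Strict as Map

-- A trie node: a flag marking the end of a stored word, plus children keyed by character.
data Trie = Trie Bool (Map.Map Char Trie)

emptyTrie :: Trie
emptyTrie = Trie False Map.empty

-- Insert one word, creating child nodes as needed.
insert :: String -> Trie -> Trie
insert []       (Trie _   cs) = Trie True cs
insert (c:rest) (Trie end cs) = Trie end (Map.insert c (insert rest child) cs)
  where child = Map.findWithDefault emptyTrie c cs

-- Enumerate stored words; Map.toAscList visits children in ascending order,
-- so the output is already sorted.
toAscList :: Trie -> [String]
toAscList (Trie end cs) =
  [ "" | end ] ++ [ c : w | (c, t) <- Map.toAscList cs, w <- toAscList t ]

main :: IO ()
main = interact (unlines . toAscList . foldl' (flip insert) emptyTrie . words)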
Here is my implementation in Julia. Tested with a real-world example and works well. BTW, even though the “huge” file used as input may not fit into memory, it’s quite likely that the string variable containing the unique words does fit into memory.
function clean_word(w::AbstractString)
    # strip trailing punctuation, then lowercase; falls through (returns nothing)
    # if the word is empty after stripping
    while length(w) > 0
        if w[end] in " .,!;-:'?\n\r"
            w = w[1:end-1]
        else
            return lowercase(w)
        end
    end
end

function dw(fni::AbstractString, fno::AbstractString)
    # distinct words in a huge file (assuming it exists in the current directory)
    words = ""
    fs1 = open(fni)       # file stream for input
    fs2 = open(fno, "w")  # file stream for output
    c = 0                 # unique word counter
    for ln in eachline(fs1)
        uw = unique(split(ln, " "))
        for w in uw
            cw = clean_word(w)
            if typeof(cw) != Void
                # pad with spaces so that e.g. "cat" is not mistaken for a match of "cats"
                if !contains(words, " " * cw * " ")
                    words *= " " * cw * " "
                    write(fs2, cw * "\n")
                    c += 1
                end
            end
        end
    end
    close(fs1)
    close(fs2)
    return c
end
Of course there is room for improvement here, since the auxiliary function clean_word() doesn’t take into account other special characters that some applications may need to remove. Anyway, for a proof of concept, I think this code should do. Thanks
A solution in Racket.
A Haskell version. It consumes O(n) memory, where n is the number of distinct words in the input. (Verified by looking at heap profiles when the input consists of one copy of a dictionary, then again with ten copies. In both cases it uses the same maximum amount of memory.) It works by lazily reading from stdin, splitting the input into a list of words, creating a set from them, converting the set back to a list of words in ascending order, appending newlines to the words, then writing the result to stdout.
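The poster’s code itself isn’t shown here, so the following is only a minimal sketch of the pipeline described above, assuming Data.Set from the containers package:

import qualified Data.Set as Set

-- Lazily read stdin, split into words, deduplicate with a Set,
-- then print the distinct words in ascending order, one per line.
main :: IO ()
main = interact (unlines . Set.toAscList . Set.fromList . words)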