Distinct Words
January 24, 2017
Although the problem didn’t say so, we will assume that the input file consists of one word per line; if that’s not right, you can preprocess the input to make it right.
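For example, on a Unix system one possible preprocessing step is the classic tr idiom, which turns every maximal run of letters into its own line (original.txt here is just a placeholder name for the raw input):

$ tr -cs 'A-Za-z' '\n' < original.txt > infile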
Unless there is some reason to do something different, I would call on the standard Unix sort utility to solve this problem; the -u flag to sort specifies unique output:
$ sort -u infile
If that doesn’t work, the solution depends on the exact problem and on the capabilities of your computing environment. One answer is to write your own program that mimics the operation of sort -u. How you do that depends strongly on your computing environment (language, compiler, and operating system), so we won’t try to tackle it here.
If the set of distinct words in the input will fit into memory, you could use a set data structure, perhaps implementing the set with an underlying balanced binary tree to provide the final output in sorted order. Here’s a program that uses the AVL trees of the Standard Prelude to solve the problem:
(define (sort-unique infile)
  (with-input-from-file infile
    (lambda ()
      (let ((words (make-dict string<?)))
        (do ((word (read-line) (read-line)))
            ((eof-object? word)
              (for-each
                (lambda (word) (display word) (newline))
                (map car (words 'to-list))))
          (set! words (words 'update (lambda (k v) (+ v 1)) word 1)))))))
Given that there are fewer than a quarter of a million words in the English language, and most documents won’t use more than a fraction of them, that will likely work in the majority of cases. You can run the program at http://ideone.com/ogxnFq.
Or, of course, you can look in Knuth and find out how to do a polyphase merge sort with replacement selection.
It can be accomplished in Java by adding all the words to a TreeSet, which automatically takes care of both de-duplication and sorted order.
@John: as @praxis says, it’s likely the set of distinct words will fit into memory even if the input file won’t, so sorting (polyphase merge or not) isn’t likely to be the best solution – a trie or hashtable would be better.
Here is my implementation in Julia. I tested it with a real-world example and it works well. BTW, even though the “huge” file that’s used as an input may not fit into memory, it’s quite likely that the string variable containing the unique words does fit into memory.
function clean_word(w::AbstractString)
    # strip trailing punctuation/whitespace and lower-case the word;
    # returns nothing if the word consists only of such characters
    while length(w) > 0
        if w[end] in " .,!;-:'?\n\r"
            w = w[1:end-1]
        else
            return lowercase(w)
        end
    end
end

function dw(fni::AbstractString, fno::AbstractString)
    # distinct words in a huge file (assuming it exists in the current directory)
    words = ""            # space-separated string of the unique words seen so far
    fs1 = open(fni)       # file stream for inputs
    fs2 = open(fno, "w")  # file stream for outputs
    c = 0                 # unique word counter
    for ln in eachline(fs1)
        uw = unique(split(ln, " "))
        for w in uw
            cw = clean_word(w)
            if cw !== nothing
                # surround with spaces so that "he" is not mistaken for part of "the"
                if !contains(words * " ", " " * cw * " ")
                    words *= " " * cw
                    write(fs2, cw * "\n")
                    c += 1
                end
            end
        end
    end
    close(fs1)
    close(fs2)
    return c
end
Of course there is room for improvement here, since the auxiliary function clean_word() doesn’t take into account some other special characters, which in some applications may need to be removed. Anyway, as a proof of concept, I think this code should do. Thanks
A solution in Racket.
A Haskell version. It consumes O(n) memory, where n is the number of distinct words in the input. (Verified by looking at heap profiles when the input consists of one copy of a dictionary, then again with ten copies. In both cases it uses the same maximum amount of memory.) It works by lazily reading from stdin, splitting the input into a list of words, creating a set from them, converting the set back to a list of words in ascending order, appending newlines to the words, then writing the result to stdout.
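The code itself isn’t reproduced here, but a minimal sketch matching that description, assuming Data.Set from the containers package, could look like this:

import qualified Data.Set as Set

-- Lazily read stdin, split it into words, collect them in a Set (which
-- de-duplicates and orders them), then print one word per line on stdout.
main :: IO ()
main = interact (unlines . Set.toAscList . Set.fromList . words)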