Distinct Words
January 24, 2017
Although the problem didn’t say so, we will assume that the input file consists of one word per line; if that’s not right, you can preprocess the input to make it right.
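For example, on a Unix system one possible preprocessing step is the classic tr idiom, which turns every maximal run of letters into its own line (original.txt here is just a placeholder name for the raw input):

$ tr -cs 'A-Za-z' '\n' < original.txt > infile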
Unless there is some reason to do something different, I would call on the standard Unix sort utility to solve this problem; the -u flag to sort specifies unique output:
$ sort -u infile
If that doesn’t work, the solution depends on the exact problem and on the capabilities of your computing environment. One answer is to write your own program that mimics the operation of sort -u. How you do that depends strongly on your computing environment (language, compiler, and operating system), so we won’t try to tackle it here.
If the set of distinct words in the input will fit into memory, you could use a set data structure, perhaps implementing the set with an underlying balanced binary tree to provide the final output in sorted order. Here’s a program that uses the AVL trees of the Standard Prelude to solve the problem:
(define (sort-unique infile)
  (with-input-from-file infile
    (lambda ()
      (let ((words (make-dict string<?)))
        (do ((word (read-line) (read-line)))
            ((eof-object? word)
              (for-each
                (lambda (word) (display word) (newline))
                (map car (words 'to-list))))
          (set! words (words 'update (lambda (k v) (+ v 1)) word 1)))))))
Given that there are fewer than a quarter of a million words in the English language, and most documents won’t use more than a fraction of them, that will likely work in the majority of cases. You can run the program at http://ideone.com/ogxnFq.
Or, of course, you can look in Knuth and find out how to do a polyphase merge sort with replacement selection.
It can be accomplished in Java by adding all the words to a TreeSet, which automatically takes care of both de-duplication and sorted order.
@John: as @praxis says, it’s likely the set of distinct words will fit into memory even if the input file won’t, so sorting (polyphase merge or not) isn’t likely to be the best solution – a trie or hashtable would be better.
Here is my implementation in Julia. I tested it with a real-world example and it works well. BTW, even though the “huge” file that’s used as an input may not fit into memory, it’s quite likely that the string variable containing the unique words does fit into memory.
function clean_word(w::AbstractString)
    # strip trailing punctuation/whitespace and lower-case the word;
    # returns nothing if the word consists only of such characters
    while length(w) > 0
        if w[end] in " .,!;-:'?\n\r"
            w = w[1:end-1]
        else
            return lowercase(w)
        end
    end
end

function dw(fni::AbstractString, fno::AbstractString)
    # distinct words in a huge file (assuming it exists in the current directory)
    words = ""            # space-separated string of the unique words seen so far
    fs1 = open(fni)       # file stream for inputs
    fs2 = open(fno, "w")  # file stream for outputs
    c = 0                 # unique word counter
    for ln in eachline(fs1)
        uw = unique(split(ln, " "))
        for w in uw
            cw = clean_word(w)
            if cw !== nothing
                # surround with spaces so that "he" is not mistaken for part of "the"
                if !contains(words * " ", " " * cw * " ")
                    words *= " " * cw
                    write(fs2, cw * "\n")
                    c += 1
                end
            end
        end
    end
    close(fs1)
    close(fs2)
    return c
end
Of course there is room for improvement here, since the auxiliary function clean_word() doesn’t take into account some other special characters, which in some applications may need to be removed. Anyway, as a proof of concept, I think this code should do. Thanks
A solution in Racket.
A Haskell version. It consumes O(n) memory, where n is the number of distinct words in the input. (Verified by looking at heap profiles when the input consists of one copy of a dictionary, then again with ten copies. In both cases it uses the same maximum amount of memory.) It works by lazily reading from stdin, splitting the input into a list of words, creating a set from them, converting the set back to a list of words in ascending order, appending newlines to the words, then writing the result to stdout.
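The code itself isn’t reproduced here, but a minimal sketch matching that description, assuming Data.Set from the containers package, could look like this:

import qualified Data.Set as Set

-- Lazily read stdin, split it into words, collect them in a Set (which
-- de-duplicates and orders them), then print one word per line on stdout.
main :: IO ()
main = interact (unlines . Set.toAscList . Set.fromList . words)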