Distinct Words

January 24, 2017

I’m not sure of the original source of today’s exercise; it could be a homework problem or an interview question:

Given a huge file of words, print all the distinct words in it, in ascending order; the definition of “huge” is “too big to fit in memory.”

Your task is to write a program to print all the distinct words in a huge file. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.


Pages: 1 2

6 Responses to “Distinct Words”

  1. John Cowan said

    Or, of course, you look in Knuth and find out how to do a polyphasic merge sort with replacement selection.

  2. Somashaker said

    it can be accomplished in java by adding all those words in Treeset(which will automatically taken care by it)

  3. matthew said

    @John: as @praxis says, it’s likely the set of distinct words will fit in to memory, even if the input file won’t, so sorting (polyphasic merge or not) isn’t likely to be the best solution – a trie or hashtable would be better.

  4. Zack said

    Here is my implementation in Julia. Tested with a real-world example and works well. BTW, even though the “huge” file that’s used as an input may not fit into memory, it’s quite likely that the string var. containing the unique words does fit into memory.

    function clean_word(w::AbstractString)
    while length(w) > 0
    if w[end] in ” .,!;-:’?\n\r”
    w = w[1:end-1]
    return lowercase(w)

    function dw(fni::AbstractString, fno::AbstractString)
    # distinct words in a huge file (assuming it exists in the current directory)
    words = “”
    fs1 = open(fni) # file stream for inputs
    fs2 = open(fno, “w”) # file stream for outputs
    c = 0 # unique word counter

    for ln in eachline(fs1)
    uw = unique(split(ln, ” “))

    for w in uw
    cw = clean_word(w)

    if typeof(cw) != Void
    if !contains(words, cw)
    words *= ” ” * cw
    write(fs2, cw * “\n”)
    c += 1

    return c

    Of course there is room for improvement here, since the aux function clean_word() doesn’t take into account some other special characters, which in some applications may need to be removed. Anyway, for a POC, I think this code should do. Thanks

  5. r. clayton said

    A solution in Racket.

  6. Globules said

    A Haskell version. It consumes O(n) memory, where n is the number of distinct words in the input. (Verified by looking at heap profiles when the input consists of one copy of a dictionary, then again with ten copies. In both cases it uses the same maximum amount of memory.) It works by lazily reading from stdin, splitting the input into a list of words, creating a set from them, converting the set back to a list of words in ascending order, appending newlines to the words, then writing the result to stdout.

    import qualified Data.Set as S
    import qualified Data.Text.Lazy as T
    import qualified Data.Text.Lazy.IO as IO
    main :: IO ()
    main = IO.interact (T.unlines . S.toAscList . S.fromList . T.words)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: