Word Sets

March 17, 2017

[ I was visiting my daughter in Houston this week, and unbeknownst to me her computer was fried by lightning a few weeks ago; she was waiting for me to visit so I could advise her what to do. I’m typing this exercise on my tablet, where typing lots of words is difficult and typing code is nearly impossible. I’ll add code when I get back home in a few days. ]

There is an ambiguity in the problem definition, which is probably purposeful on the part of the interviewer. The word tact can be formed from the three letters a, c and t only if duplicates are permitted. In that case, the simple algorithm is to insert the characters in a set, then reject each word in the list that has a letter not in the set:

(define (alpha-set cs)
  (let ((s (make-vector 26 #f)))
    (do ((cs cs (cdr cs)))
        ((null? cs) s)
      (let ((t (- (char->integer (car cs)) 97)))
        (vector-set! s t #t))))
(define (word-set s word)
  (let loop ((cs (string->list word)))
    (if (null? cs) #t
      (let ((t (- (char->integer (car cs)) 97)))
        (if (vector-ref s t) (loop (cdr cs)) #f)))))
> (define s (alpha-set '(#\a #\c #\t)))
> (word-set s "act")
#t
> (word-set s "tact")
#t
> (word-set s "stop")
#f

Here, the set of input characters is stored in the alpha vector; it could be stored in other ways — a hash table, a binary tree, something else — if desired. We are assuming all the letters are lower case.

If duplicates aren’t allowed, you have to count. Our solution keeps an array of length 26, and initializes each array slot with the number of available copies of the corresponding letter. Then each word is examined, and the count of each letter in the word is reduced each time it appears. If any count is ever negative, the word is rejected:

(define (alpha-count cs)
  (let ((c (make-vector 26 0)))
    (do ((cs cs (cdr cs)))
        ((null? cs) c)
      (let ((t (- (char->integer (car cs)) 97)))
        (vector-set! c t (+ (vector-ref c t) 1))))))
(define (word-count c word)
  (let ((c (vector-copy c)))
    (let loop ((cs (string->list word)))
      (if (null? cs) #t
        (let ((t (- (char->integer (car cs)) 97)))
          (vector-set! c t (- (vector-ref c t) 1))
          (if (negative? (vector-ref c t)) #f
            (loop (cdr cs))))))))
> (define c (alpha-count '(#\a #\c #\t)))
> (word-count c "act")
#t
> (word-count c "tact")
#f
> (word-count c "stop")
#f

Here the alpha vector stores letter counts instead of booleans. The letter counts are updated after each input character, so the vector must be copied afresh for each word that is tested.

You can run the program at http://ideone.com/mQoc3j.
.

Advertisements

Pages: 1 2

20 Responses to “Word Sets”

  1. Rutger said

    Something (not so memory efficient) in Python3.

    from collections import defaultdict
    
    def load_word_dict(words_file):
    	word_dict = defaultdict(list)
    	for line in open(words_file):
    		word = line.strip()
    		key = "".join(sorted(word))
    		word_dict[key].append(word)
    	return word_dict
    
    words_file = '/usr/share/dict/words'
    letters = "".join(sorted('west'))
    wd = load_word_dict(words_file)
    print(wd[letters])
    

    Outputs [‘stew’, ‘west’, ‘wets’, ‘tews’].
    Added bonus: learned a new word tew :)

  2. Globules said

    A Haskell version. Letters can be used more than once. It’s case-sensitive.

    import Data.List.Ordered (nubSort)
    import System.Environment (getArgs)
    
    -- Return true iff the word is composed entirely from the letters.  All letters
    -- must be used, but they can be used more than once.
    madeFrom :: String -> String -> Bool
    word `madeFrom` letters = nubSort word == letters
    
    main :: IO ()
    main = do
      letters <- fmap (nubSort . concat) getArgs
      interact (unlines . filter (`madeFrom` letters)  . words)
    

    Here’s the output of running it with an English dictionary and the Finnish translation of Homer’s Odyssey.

    $ ./wordsets astringently < /usr/share/dict/words
    arrestingly
    astringently
    intransigently
    restrainingly
    staggeringly
    transgressingly
    $ ./wordsets sisältyvää < ~/Downloads/pg52108.txt
    sisältyvää
    liittyvistä
    ystävällisistä
    ystävällisiä
    ystävällisistä
    $ 
    
  3. […] completed the solution to the previous exercise, which you can look at if you wish. We’ll have a new exercise on […]

  4. matthew said

    Here’s some C++ that uses a binary trie to store the character sets of the words given on standard input, represented as bitsets. Given a word and the derived bitset, we can efficiently find all words that can be made up with those characters:

    #include <map>
    #include <vector>
    #include <string>
    #include <iostream>
    #include <stdint.h>
    
    typedef uint32_t T;
    
    struct Node {
      T s; Node *zero; Node *one;
    };
    
    // t is s shifted to just contain the uninserted bits
    void insert(T s, T t, Node *&p) {
      if (!p) p = new Node();
      if (t == 0) p->s = s; // No more bits, we are done
      else if (t&1) insert(s,t>>1,p->one);
      else insert(s,t>>1,p->zero);
    }
    
    void find(T s, T t, const Node *p, std::vector<T> &a) {
      if (!p) return;
      if (p->s) a.push_back(p->s); // Found something!
      if (t == 0) return;          // No more bits
      find(s,t>>1,p->zero,a);
      if (t&1) find(s,t>>1,p->one,a); // 
    }
    
    // Make a bitset from the characters of a
    T mkset(const std::string &a) {
      T s = 0;
      for (int i = 0; i < (int)a.size(); i++) {
        int c = a[i];
        if (c < 'a' || c > 'z') return 0;
        s |= (1 << (c-'a'));
      }
      return s;
    }
    
    int main(int argc, char *argv[]) {
      Node *root = 0;
      std::map<T,std::vector<std::string>> words;
      std::string word;
      while(std::cin >> word) {
        auto s = mkset(word);
        if (s) {
          words[s].push_back(word);
          insert(s,s,root);
        }
      }
      for (int i = 1; i < argc; i++) {
        std::vector<T> a; // Result goes here
        auto s = mkset(argv[i]);
        find(s,s,root,a);
        for (auto i = a.begin(); i != a.end(); i++) {
          auto t = words[*i];
          for (auto j = t.begin(); j != t.end(); j++) {
            std::cout << *j << "\n";
          }
        }
      }
    }
    
    $ ./a.out praxis < /usr/share/dict/words | sort | fmt
    a air airs aria arias as asp asps ass ax axis i iris is p pa pair pairs
    pap papa papas paps par pars pas pass pi pip pips pis piss r rap raps
    rasp rasps rip rips rs s sap saps sari saris sass sax sip sips sir sirs
    sis six spa spar spars spas x
    

    Of course, this works just as well:

    $ egrep '^[praxis]*$' /usr/share/dict/words | sort | fmt
    a air airs aria arias as asp asps ass ax axis i iris is p pa pair pairs
    pap papa papas paps par pars pas pass pi pip pips pis piss r rap raps
    rasp rasps rip rips rs s sap saps sari saris sass sax sip sips sir sirs
    sis six spa spar spars spas x
    
  5. Jebastin J said

    C# Implementation:

    static List GetMatchingWordsSet(List wordsSet, List charSet)
    {
    List lstMatchingWordsSet = new List();

    foreach (var word in wordsSet)
    {
    int wordFlag = 0;
    int wordlength = word.Length;
    for (int i = 0; i < wordlength; i++)
    {
    if (charSet.Contains(Convert.ToChar(word[i].ToString().ToLower())))
    {
    wordFlag++;
    }
    }
    if (wordFlag == word.Length)
    {
    lstMatchingWordsSet.Add(word);
    }
    wordFlag = 0;
    }

    return lstMatchingWordsSet;
    }

    static void Main(string[] args)
    {
    List lstWords = new List { “One”, “Two”, “Three”, “Four”, “Five”, “Six”, “Cat”,”Tin” };
    List lstChars = new List { ‘a’, ‘e’, ‘i’, ‘o’, ‘u’, ‘n’, ‘s’, ‘x’, ‘t’ };
    var MatchingWordsSet = GetMatchingWordsSet(lstWords, lstChars);
    Console.WriteLine(“Matching Words:”);
    foreach (var w in MatchingWordsSet)
    {
    Console.WriteLine(w);
    }
    }

  6. Jebastin J said

    To avoid duplicates the following changes have to be done in my code (my above comment):

    string distinctWord = new String(word.ToLower().Distinct().ToArray());
    if (wordFlag == word.Length && distinctWord.Length == word.Length)
    {
    lstMatchingWordsSet.Add(word);
    }

  7. Mumps Implementation:

    MCL> type fword
    fword (word,file) ; Find matching words in file
    ;
    n char,fword,i,newword,str
    i $g(word)=”” w !,”No word was supplied” q
    i $g(file)=”” w !,*7,”No file was included” q
    o 1:(file:”R”):0
    i ‘$t w !,*7,”Unable to open “,file q
    w !!,”file: “,file
    w !,”word: “,word
    u 1
    s newword=””,str=””
    f i=1:1:$l(word) s char=$e(word,i) s:newword'[char newword=newword_char
    f i=1:1 r fword q:fword=”” d
    . i $$matches(newword,fword) s str=str_fword_”,”
    c 1
    w !,”Matching words: “,$s($l(str):$e(str,1,$l(str)-1),1:”None in “_file
    _)
    q
    ;
    matches (word,fword); See if fword can be built with chars in word
    ;
    n flg,i
    s flg=1
    f i=1:1:$l(fword) d q:’flg
    . s:word'[$e(fword,i) flg=0
    q flg

    MCL> d ^fword(“praxis”,”/usr/share/dict/words”)

    file: /usr/share/dict/words
    word: praxis
    Matching words: a,air,airs,aria,arias,as,asp,asps,ass,ax,axis,i,iris,is,p,pa,pair,pairs,pap,papa,papas,paps,par,pars,pas,pass,pi,pip,pips,pis,piss,r,rap,raps,rasp,rasps,rip,rips,rs,s,sap,saps,sari,saris,sass,sax,sip,sips,sir,sirs,sis,six,spa,spar,spars,spas,x

    MCL> d ^fword(“^~”,”/usr/share/dict/words”)

    file: /usr/share/dict/words
    word: ^~
    Matching words: None in /usr/share/dict/words

  8. Mumps Implementation:

    w !,”Matching words: “,$s($l(str):$e(str,1,$l(str)-1),1:”None in “_file
    _)

    should be:

    w !,”Matching words: “,$s($l(str):$e(str,1,$l(str)-1),1:”None in “_file)

  9. programmingpraxis said

    @bookofstevegraham: We don’t often see mumps here. Perhaps you could give a brief explanation of your program, for those unfamiliar with mumps.

  10. I’ve been asked to give a brief explanation on my implementation. Here goes. I will be including the complete name of commands and functions, instead of the abbreviations.

    Mumps Implementation: (https://en.wikipedia.org/wiki/MUMPS)

    MCL> type fword
    fword (word,file) ; Find matching words in file # Program name, accepting 2 arguments
    ;
    new char,fword,i,newword,str # Create new versions of variables
    if $get(word)=”” write !,”No word was supplied” quit # Quit program if the word was not supplied, or it was = “”
    if $get(file)=”” w !,*7,”No file was included” quit # Quit program if the filename was not supplied, or it was = “”
    open 1:(file:”R”):0 # Open device 1 in read mode and assign the file to it
    if ‘$test write !,*7,”Unable to open “,file quit # When you (try to) open a device with a trailing “:0”, and it is successful, $test is set to 1. If unable it is set to 0.
    # If 0, then quit program
    write !!,”file: “,file # Write filename with 2 lf/cr (!)
    write !,”word: “,word # Write supplied word with lf/cr
    use 1 # Begin using the device/file for input
    set newword=””,str=”” # Initialize variables
    for i=1:1:$length(word) set char=$extract(word,i) set:newword'[char newword=newword_char # Take each unique character in the supplied word and put it in variable newword
    # Mumps allows post-conditionals for most commands. set:newword'[char = if newword does not contain char, set …
    # [ is the operator for contains and ‘ is the operator for negate/not
    for i=1:1 read fword quit:fword=”” do # For each word in file
    . if $$matches(newword,fword) set str=str_fword_”,” # If the file word matches the characters in the supplied word, append it to a comma-delimited list of matched words
    # $$matches indicates a user-created function named matches. _ is the operator for concatenation.
    close 1 # Close device/file and revert to std in/out
    write !,”Matching words: “,$select($length(str):$extract(str,1,$length(str)-1),1:”None in “_file) # Write out list of matched words with lf/cr or a message that there were no matches
    quit # End program
    ;
    matches(word,fword); See if fword can be built with chars in word # Subroutine for matching supplied word with list word
    ;
    new flg,i # Create new versions of variables
    set flg=1 # Initialize flg variable (Assuming the supplied word matches list word)
    for i=1:1:$l(fword) do quit:’flg # For each character in list word, do next line, quit loop if flg = 0
    . set:word'[$e(fword,i) flg=0 # If supplied word does not contain this character in list word, set flg = 0. This will cause the loop to quit upon returning to it
    quit flg # Quit subroutine returning value of flg (1 = it matched, 0 = it did not match)

    MCL> d ^fword(“praxis”,”/usr/share/dict/words”)

    file: /usr/share/dict/words
    word: praxis
    Matching words: a,air,airs,aria,arias,as,asp,asps,ass,ax,axis,i,iris,is,p,pa,pair,pairs,pap,papa,papas,paps,par,pars,pas,pass,pi,pip,pips,pis,piss,r,rap,raps,rasp,rasps,rip,rips,rs,s,sap,saps,sari,saris,sass,sax,sip,sips,sir,sirs,sis,six,spa,spar,spars,spas,x

    MCL> d ^fword(“^~”,”/usr/share/dict/words”)

    file: /usr/share/dict/words
    word: ^~
    Matching words: None in /usr/share/dict/words

  11. programmingpraxis said

    @Bookofstevegraham: Fascinating language. Thank you!

  12. You’re welcome. Thanks for the interest.

  13. Steve Wampler said

    Looking at the solutions so far it looks like there are several interpretations of the problem: (a) look for words that contain exactly the letters in the set (i.e. if there’s only one ‘s’, then
    words can only contain one ‘s’), (b) look for words that only use letters from the set with no repeats, but not necessarily all of them, and (c) words that only letters from the set, but with repeats allowed.

    The following Unicon [works in Icon as well] solution assumes (a):
    ————————————————————-
    procedure main(a)
    w := a[1] | “gomiz”
    every write(matched(w, !&input))
    end

    procedure matched(w, pword)
    every ((*pword = *w), (t := pword), t[bound(upto,!w,t)] := “”)
    return (0 = *\t, pword)
    end

    procedure bound(a[])
    return a[1]!a[2:0]
    end
    ————————————————————-
    Sample run:
    ———————————————————-
    ->anagram west
    ————————————————————

  14. Steve Wampler said

    Hmmm, sample run didn’t come through for Unicon solution:

    ## anagram west </usr/share/dict/words
    stew
    tews
    west
    wets
    ##
    
  15. Steve Wampler: Could you comment on how your solution works? I would guess that most are not familiar with the language.

  16. an icon program meeting sub-challenge (c). 8 lines of code

    procedure main(a)
    # given a set of characters in parameter 1 and a wordlist as a file as parameter 2
    # respond with the words in the wordlist that meet the criteria
    # solution type (c)
    #
    # what criteria?
    #
    # The letters in parameter 1 may be used in the target word
    # 1 or more times or not at all. in any order.
    # for simplicity,
    # I have made no special provision for punctuation
    # so backslash and quotes, for example may behave unexpectedly.
    #
    # This is partly also because different shells will have different behaviour.
    # backslash escaping may help some situations.
    #
    ###############################
    # get the characters
    thechars := cset(a[1]) | stop(“No characters found; the word \”\” is made of no characters”)
    #get the wordlist
    put(a,”C:\\Users\\…\\webster\\WORDS.TXT”)
    # local fix puts a dictionary in position3 if user supplied a dictionary, else at position 2
    thewords := open(a[2],r) | stop(“No words to test. Is the file valid?”)
    #########################
    every aword := !thewords do {
    charsofaword := cset(aword)
    # thediff := charsofaword — thechars
    # cset difference, any letters in diff are those in the word, but not in the
    # characters supplied by the user as parameter 1.
    # if (*() = 0) then write(aword)
    if (*(charsofaword — thechars) = 0) then write(aword)
    }
    end

    run using “cat” on a 20C websters list I had.

    a
    aa
    acca
    act
    acta
    at
    atta
    c
    ca
    cat
    t
    ta
    taa
    tact
    tat
    tatta

  17. Sorry, non-fatal error; “thewords := open(a[2],r)” should read
    “thewords := open(a[2],’rt’)”
    The variable r is initially null, so the 2nd parameter to “open” defaults to ‘rt’. The compiled program executes the same.

  18. Steve Wampler said

    I was asked to provide a commented version of my Unicon solution, since both Unicon and Icon might not be familiar to some folks.
    A little background – Unicon is a successor to Icon. In both languages expressions are capable of generating sequences of results.
    A result sequence might have anywhere from 0 to an infinity of results. Normally expressions only so far as needed to produce a single
    result. If no result is produced or if there are no more results that can be produced then expression evaluation is said to fail, otherwise
    the evaluation succeeds. If an expression fails, subexpressions that can produce further results are automatically back-tracked into
    to produce their next result in an attempt to produce successful evaluation of the entire expression. This process is called goal-directed evaluation.
    The every control structure essentially tells a successful evaluation of its expression that it really failed – forcing goal-directed evaluation
    to exhaust all possibilities.
    Control flow is governed by this success or failure of the evaluation,
    not by the values computed during evaluation.
    (I apologize in advance for the length of this explanation. Unicon and Icon behave differently from most common procedural programming languages
    even though they don’t look all that different.)

    # Evaluation of a program starts in main.  The parameter a is an array
    #   of the arguments.
    procedure main(a)
       w := a[1] | "gomiz"  # If there is an argument, assign it to w,
                                    #   otherwise assign "gomiz" to w
          # The expression !&input can generate all the lines in standard input
          #   As they are generated they are passed to the procedure matched
          #   If evaluation of matched fails for a line, then backtracking
          #   will try the next line until input is exhausted or matched
          #   succeeds.  When matched succeeds, its value is output.
          #   The every simply forces backtracking to continue after
          #   matched succeeds.  Without the every, evaluation would have
          #   stopped after finding the first word that matched.
       every write(matched(w, !&input)
    end
    
    procedure matched(w, pword)
          # Start with a quick check on word lengths, if that fails,
          #   there's no place to backtrack so the evaluation of 
          #   every quickly fails.  If the check succeeds, then assign
          #   a copy of the word taken from the file to t.  Then
          #   for each letter in the set of letters w (!w generates
          #   the individual letters in w, in order), pass that letter,
          #   the value of t, and the function upto to the procedure
          #   bound.  [Read about bound, below.]
          # If bound succeeds, a letter from w has been found in t.
          #   The function upto returns the index of that letter and
          #   The t[bound(upto(!w,t)] := "" expression then has the
          #   effect of removing that letter from t so it won't be
          #   found again.
          # The every keeps backtracking until all of the letters from
          #   !w have been tried.
          # Finally if all of the letters in t have been removed, we
          #   know we have a match.  The return statement's expression
          #   is a short way of saying if 0 = *t then return pword.
       every ((*pword = *w), (t := pword), t[bound(upto,!w,t)] := "")
       return (0 = *\t, pword)
    end
    
    # The procedure bound is a way to limit backtracking into the
    #   results proceduced by a function while still allowing
    #   backtracking into the results produced by the arguments
    #   passed to the function.  The a[] in the parameter list
    #   simply collects all of the arguments into a the array a
    #   (varargs in some other languages).  a[1]!a[2:0] is
    #   the dynamic equivalent of [in this case] calling
    #   upto(c,t), where c is the letter generated by !w in
    #   the call to bound.  A return expression blocks backtracking
    #   from returning to that expression.
    #
    # Bound is used here because upto(c,s) generates all occurences
    #   of the character c in s, but we only want to find a single
    #   occurence (each letter is to be used exactly once).
    #   By using bound the call to upto is effectively blocked
    #   from producing more than one result.
    procedure bound(a[])
       return a[1]!a[2:0]
    end
    
  19. Implemented in the klong programming language (http://t3x.org/klong/), an array language patterned after the k programming language.

    $ kg ./wordsets4.kg mississippi < /usr/share/dict/words
    [i imp imps is ism isms m mi miss ms p pi pimp pimps pip pips pis piss s sip sips sis]

  20. Oops, forgot the program. Try again.

    $ cat wordsets5.kg
    main::{[a fword list word]; word::?.a@0; list::[]; .mi{fword::x; :[~#({word?x}’fword)?[]; list::list,fword; a::””]; .rl()}:~.rl(); .p(list)}

    main()

    $ kg ./wordsets5.kg mississippi < /usr/share/dict/words
    [i imp imps is ism isms m mi miss ms p pi pimp pimps pip pips pis piss s sip sips sis]

    $

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: