Data Structures Homework

August 24, 2018

We don’t need a data structure to solve this problem, nor do we need regular expressions or Java:

tr -cs A-Za-z '\n' < infile | tr A-Z a-z |
    sort | uniq -c | sort -rn | sed 10q

The first tr splits each word (a maximal sequence of letters) onto a separate line, the second tr converts upper case to lower case, the first sort brings like words together, uniq counts them, the second sort puts the list into descending order by count, and sed outputs the top ten. We can do the same thing in Scheme:

(define (word-freq n file-name)
  (define (freq-gt? a b) (> (cdr a) (cdr b)))
  (with-input-from-file file-name (lambda ()
    (let loop ((word (read-word)) (words (list)))
       (if (eof-object? word)
           (take n
             (sort freq-gt? 
               (uniq-c string=?
                 (sort string<? words))))
           (loop (read-word) (cons word words)))))))

And here is the program applied to the text of Lincoln’s Gettysburg Address:

> (word-freq 10 "gettysburg.txt")
((that . 13) (the . 11) (we . 10) (here . 8)
 (to . 8) (a . 7) (and . 6) (can . 5) (for . 5)
 (have . 5))

You can run the program at https://ideone.com/o529HD. We discussed this same exercise previously, where we used a hash table, which isn’t permitted for the current exercise.

Advertisements

Pages: 1 2

5 Responses to “Data Structures Homework”

  1. Rutger said
    text = """Today’s exercise is from a programming student taking a Data Structures course, using Java.
    
    Given a text file, print a list of the ten most common words in the file, along with their frequency of occurrence. You may not use a HashMap or ArrayList, but you may use regular expressions.
    
    Your task is to write a program to find the ten most common words in an input file. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below."""
    
    from collections import Counter
    
    c = Counter()
    
    for w in text.split(' '):
    	c[w] += 1
    
    print(c.most_common(5))
    
    
    # outputs [('a', 7), ('the', 5), ('or', 4), ('to', 4), ('in', 3), ('you', 3), ('exercise', 2), ('is', 2), ('file,', 2), ('of', 2)]
    
  2. Daniel said

    @Rutger, the Counter in Python is a subclass of dict, which is a hash map. However, while the problem prohibits hash maps, it doesn’t specify whether data structures built on hash maps are permitted.

    Here’s a solution in Java. It uses a trie. A linked list was used to hold children to avoid using a HashMap (or ArrayList). For retrieving the most common words, I used a heap, but a linked list would have seemingly sufficed.

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.util.Comparator;
    import java.util.LinkedList;
    import java.util.PriorityQueue;
    import java.util.Scanner;
    
    public class Main {
      static class Pair<T, U> {
        T first;
        U second;
        public Pair(T first, U second) {
          this.first = first;
          this.second = second;
        }
      }
      
      static class Trie {
        static class Node {
          int count = 0;
          Character c = null;
          Node parent = null;
          // Can't use ArrayList or HashMap, so store children in LinkedList.
          // (alternatively a binary search tree could be used here)
          LinkedList<Node> children = new LinkedList<>();
          
          String string() {
            LinkedList<Character> chars = new LinkedList<>();
            Node n = this;
            while (n.c != null) {
              chars.addFirst(n.c);
              n = n.parent;
            }
            StringBuffer sb = new StringBuffer();
            for (char c : chars) {
              sb.append(c);
            }
            return sb.toString();
          }
        }
        
        Node head = new Node();
        
        void insert(String s) {
          Node n = head;
          outer:
          for (int i = 0; i < s.length(); ++i) {
            char c = s.charAt(i);
            for (Node child : n.children) {
              if (child.c.equals(c)) {
                n = child;
                continue outer;
              }
            }
            Node child = new Node();
            child.parent = n;
            child.c = c;
            n.children.add(child);
            n = child;
          }
          ++n.count;
        }
        
        LinkedList<Pair<String,Integer>> mostCommon(int count) {
          PriorityQueue<Pair<String,Integer>> minHeap = new PriorityQueue<>(
              new Comparator<Pair<String,Integer>>() {
            @Override
            public int compare(Pair<String,Integer> o1, Pair<String,Integer> o2) {
              if (o1.second < o2.second) return -1;
              if (o1.second > o2.second) return 1;
              return 0;
            }
          });
          LinkedList<Node> frontier = new LinkedList<>();
          frontier.addAll(head.children);
          while (!frontier.isEmpty()) {
            Node n = frontier.pop();
            for (Node child : n.children) {
              frontier.add(child);
            }
            if (n.count == 0) continue;
            if (minHeap.size() < count) {
              minHeap.add(new Pair<String,Integer>(n.string(), n.count));
            } else if (n.count > minHeap.peek().second) {
              minHeap.remove();
              minHeap.add(new Pair<String,Integer>(n.string(), n.count));
            }
          }
          LinkedList<Pair<String,Integer>> result = new LinkedList<>();
          while (!minHeap.isEmpty()) {
            result.addFirst(minHeap.remove());
          }
          return result;
        }
      }
      
      public static void main(String[] args) throws FileNotFoundException {
        if (args.length != 1) {
          System.err.println("Usage: <PROGRAM> FILE");
          System.exit(1);
        }
        Trie t = new Trie();
        Scanner s = new Scanner(new File(args[0]));
        while (s.hasNext()) {
          String next = s.next();
          // Remove all punctuation and convert to lower case
          String word = next.replaceAll("[^a-zA-Z ]", "").toLowerCase();
          t.insert(word);
        }
        s.close();
        for (Pair<String,Integer> pair : t.mostCommon(10)) {
          System.out.printf("%s: %d\n", pair.first, pair.second);
        }
      }
    }
    

    Here’s example output when running on Pride and Prejudice, the top book on the http://www.gutenberg.org.

    the: 4496
    to: 4207
    of: 3717
    and: 3602
    her: 2215
    i: 2051
    a: 1997
    in: 1919
    was: 1843
    she: 1704
    
  3. Globules said

    Here’s a Haskell version, also run on Pride and Prejudice. (Haskell’s Map is based on “size balanced binary trees”.)

    {-# LANGUAGE TupleSections #-}
    
    import           Data.List (sortOn)
    import qualified Data.Map.Strict as M
    import           Data.Ord (Down(..))
    import qualified Data.Text.Lazy as T
    import qualified Data.Text.Lazy.IO as T
    import           System.Environment (getArgs, getProgName)
    import           Text.Printf (PrintfArg, printf)
    import           Text.Read (readMaybe)
    
    -- Return the distinct values of a list paired with their number of occurrences.
    counts :: (Ord a, Integral b) => [a] -> [(a, b)]
    counts = M.toList . M.fromListWith (+) . map (,1)
    
    -- Print the number of occurrences of a value, followed by the value itself.
    printCount :: PrintfArg a => (a, Int) -> IO ()
    printCount (s, n) = printf "%-8d %s\n" n s
    
    main :: IO ()
    main = do
      prog <- getProgName
      args <- getArgs
      case map readMaybe args of
        [Just n] -> do
          txt <- T.getContents
          mapM_ printCount $ take n $ sortOn (Down . snd) $ counts $ T.words txt
        _        -> error $ "Usage: " ++ prog ++ " number-of-words"
    
    $ tr -Cs '[:alnum:]' ' ' < ~/Downloads/1342.txt | tr 'A-Z' 'a-z' | ./topwords 10
    4507     the
    4243     to
    3730     of
    3658     and
    2225     her
    2070     i
    2012     a
    1937     in
    1847     was
    1710     she
    

    The counts are the same as:

    $ tr -Cs '[:alnum:]' '\n' < ~/Downloads/1342.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn | head -10
    4507 the
    4243 to
    3730 of
    3658 and
    2225 her
    2070 i
    2012 a
    1937 in
    1847 was
    1710 she
    
  4. Daniel said

    @Globules, your solution’s output differs from mine on the same input, which alerted me to a bug in my program. When parsing words, my program converts e.g., “table–nor” to “tablenor”, as opposed to splitting it into separate words “table” and “nor”.

    Here’s the fixed while loop in my main function.

        while (s.hasNext()) {
          String next = s.next();
          // Remove all punctuation and convert to lower case
          String words = next.replaceAll("[^a-zA-Z ]", " ").toLowerCase();
          for (String word : words.split(" ")) {
            t.insert(word); 
          }
        }
    

    And the updated output:

    the: 4507
    to: 4243
    of: 3730
    and: 3658
    her: 2225
    i: 2070
    a: 2012
    in: 1937
    was: 1847
    she: 1710
    
  5. Globules said

    @Daniel, my main motivation for translating all non-alphanumerics to spaces was to deal with abbreviations like “Mr.” and punctuation at the ends of sentences, but it’s possible there’s no simple solution to the problem. For example, I would consider “pre-authorized payment” to be two words, but my “tr” will turn it into three words. (Maybe it would be worth just treating dashes/hyphens specially?)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: