Data Structures Homework

August 24, 2018

Today’s exercise is from a programming student taking a Data Structures course, using Java.

Given a text file, print a list of the ten most common words in the file, along with their frequency of occurrence. You may not use a HashMap or ArrayList, but you may use regular expressions.

Your task is to write a program to find the ten most common words in an input file. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Posted by programmingpraxis

Filed in Exercises

5 Comments »

5 Responses to “Data Structures Homework”

Rutger said

August 24, 2018 at 2:10 PM

text = """Today’s exercise is from a programming student taking a Data Structures course, using Java.

Given a text file, print a list of the ten most common words in the file, along with their frequency of occurrence. You may not use a HashMap or ArrayList, but you may use regular expressions.

Your task is to write a program to find the ten most common words in an input file. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below."""

from collections import Counter

c = Counter()

for w in text.split(' '):
	c[w] += 1

print(c.most_common(5))


# outputs [('a', 7), ('the', 5), ('or', 4), ('to', 4), ('in', 3), ('you', 3), ('exercise', 2), ('is', 2), ('file,', 2), ('of', 2)]

Daniel said

August 25, 2018 at 5:08 AM

@Rutger, the Counter in Python is a subclass of dict, which is a hash map. However, while the problem prohibits hash maps, it doesn’t specify whether data structures built on hash maps are permitted.

Here’s a solution in Java. It uses a trie. A linked list was used to hold children to avoid using a HashMap (or ArrayList). For retrieving the most common words, I used a heap, but a linked list would have seemingly sufficed.

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Comparator;
import java.util.LinkedList;
import java.util.PriorityQueue;
import java.util.Scanner;

public class Main {
  static class Pair<T, U> {
    T first;
    U second;
    public Pair(T first, U second) {
      this.first = first;
      this.second = second;
    }
  }
  
  static class Trie {
    static class Node {
      int count = 0;
      Character c = null;
      Node parent = null;
      // Can't use ArrayList or HashMap, so store children in LinkedList.
      // (alternatively a binary search tree could be used here)
      LinkedList<Node> children = new LinkedList<>();
      
      String string() {
        LinkedList<Character> chars = new LinkedList<>();
        Node n = this;
        while (n.c != null) {
          chars.addFirst(n.c);
          n = n.parent;
        }
        StringBuffer sb = new StringBuffer();
        for (char c : chars) {
          sb.append(c);
        }
        return sb.toString();
      }
    }
    
    Node head = new Node();
    
    void insert(String s) {
      Node n = head;
      outer:
      for (int i = 0; i < s.length(); ++i) {
        char c = s.charAt(i);
        for (Node child : n.children) {
          if (child.c.equals(c)) {
            n = child;
            continue outer;
          }
        }
        Node child = new Node();
        child.parent = n;
        child.c = c;
        n.children.add(child);
        n = child;
      }
      ++n.count;
    }
    
    LinkedList<Pair<String,Integer>> mostCommon(int count) {
      PriorityQueue<Pair<String,Integer>> minHeap = new PriorityQueue<>(
          new Comparator<Pair<String,Integer>>() {
        @Override
        public int compare(Pair<String,Integer> o1, Pair<String,Integer> o2) {
          if (o1.second < o2.second) return -1;
          if (o1.second > o2.second) return 1;
          return 0;
        }
      });
      LinkedList<Node> frontier = new LinkedList<>();
      frontier.addAll(head.children);
      while (!frontier.isEmpty()) {
        Node n = frontier.pop();
        for (Node child : n.children) {
          frontier.add(child);
        }
        if (n.count == 0) continue;
        if (minHeap.size() < count) {
          minHeap.add(new Pair<String,Integer>(n.string(), n.count));
        } else if (n.count > minHeap.peek().second) {
          minHeap.remove();
          minHeap.add(new Pair<String,Integer>(n.string(), n.count));
        }
      }
      LinkedList<Pair<String,Integer>> result = new LinkedList<>();
      while (!minHeap.isEmpty()) {
        result.addFirst(minHeap.remove());
      }
      return result;
    }
  }
  
  public static void main(String[] args) throws FileNotFoundException {
    if (args.length != 1) {
      System.err.println("Usage: <PROGRAM> FILE");
      System.exit(1);
    }
    Trie t = new Trie();
    Scanner s = new Scanner(new File(args[0]));
    while (s.hasNext()) {
      String next = s.next();
      // Remove all punctuation and convert to lower case
      String word = next.replaceAll("[^a-zA-Z ]", "").toLowerCase();
      t.insert(word);
    }
    s.close();
    for (Pair<String,Integer> pair : t.mostCommon(10)) {
      System.out.printf("%s: %d\n", pair.first, pair.second);
    }
  }
}

Here’s example output when running on Pride and Prejudice, the top book on the http://www.gutenberg.org.

the: 4496
to: 4207
of: 3717
and: 3602
her: 2215
i: 2051
a: 1997
in: 1919
was: 1843
she: 1704

Globules said

August 25, 2018 at 6:07 AM

Here’s a Haskell version, also run on Pride and Prejudice. (Haskell’s Map is based on “size balanced binary trees”.)

{-# LANGUAGE TupleSections #-}

import           Data.List (sortOn)
import qualified Data.Map.Strict as M
import           Data.Ord (Down(..))
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as T
import           System.Environment (getArgs, getProgName)
import           Text.Printf (PrintfArg, printf)
import           Text.Read (readMaybe)

-- Return the distinct values of a list paired with their number of occurrences.
counts :: (Ord a, Integral b) => [a] -> [(a, b)]
counts = M.toList . M.fromListWith (+) . map (,1)

-- Print the number of occurrences of a value, followed by the value itself.
printCount :: PrintfArg a => (a, Int) -> IO ()
printCount (s, n) = printf "%-8d %s\n" n s

main :: IO ()
main = do
  prog <- getProgName
  args <- getArgs
  case map readMaybe args of
    [Just n] -> do
      txt <- T.getContents
      mapM_ printCount $ take n $ sortOn (Down . snd) $ counts $ T.words txt
    _        -> error $ "Usage: " ++ prog ++ " number-of-words"

$ tr -Cs '[:alnum:]' ' ' < ~/Downloads/1342.txt | tr 'A-Z' 'a-z' | ./topwords 10
4507     the
4243     to
3730     of
3658     and
2225     her
2070     i
2012     a
1937     in
1847     was
1710     she

The counts are the same as:

$ tr -Cs '[:alnum:]' '\n' < ~/Downloads/1342.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn | head -10
4507 the
4243 to
3730 of
3658 and
2225 her
2070 i
2012 a
1937 in
1847 was
1710 she

Daniel said
August 25, 2018 at 6:36 AM
@Globules, your solution’s output differs from mine on the same input, which alerted me to a bug in my program. When parsing words, my program converts e.g., “table–nor” to “tablenor”, as opposed to splitting it into separate words “table” and “nor”.

Here’s the fixed while loop in my main function.
```
    while (s.hasNext()) {
      String next = s.next();
      // Remove all punctuation and convert to lower case
      String words = next.replaceAll("[^a-zA-Z ]", " ").toLowerCase();
      for (String word : words.split(" ")) {
        t.insert(word); 
      }
    }
```
And the updated output:
```
the: 4507
to: 4243
of: 3730
and: 3658
her: 2225
i: 2070
a: 2012
in: 1937
was: 1847
she: 1710
```
Globules said
August 25, 2018 at 7:05 AM
@Daniel, my main motivation for translating all non-alphanumerics to spaces was to deal with abbreviations like “Mr.” and punctuation at the ends of sentences, but it’s possible there’s no simple solution to the problem. For example, I would consider “pre-authorized payment” to be two words, but my “tr” will turn it into three words. (Maybe it would be worth just treating dashes/hyphens specially?)

Programming Praxis