Data Structures Homework
August 24, 2018
Today’s exercise is from a programming student taking a Data Structures course, using Java.
Given a text file, print a list of the ten most common words in the file, along with their frequency of occurrence. You may not use a HashMap or ArrayList, but you may use regular expressions.
Your task is to write a program to find the ten most common words in an input file. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
text = """Today’s exercise is from a programming student taking a Data Structures course, using Java. Given a text file, print a list of the ten most common words in the file, along with their frequency of occurrence. You may not use a HashMap or ArrayList, but you may use regular expressions. Your task is to write a program to find the ten most common words in an input file. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.""" from collections import Counter c = Counter() for w in text.split(' '): c[w] += 1 print(c.most_common(5)) # outputs [('a', 7), ('the', 5), ('or', 4), ('to', 4), ('in', 3), ('you', 3), ('exercise', 2), ('is', 2), ('file,', 2), ('of', 2)]@Rutger, the Counter in Python is a subclass of dict, which is a hash map. However, while the problem prohibits hash maps, it doesn’t specify whether data structures built on hash maps are permitted.
Here’s a solution in Java. It uses a trie. A linked list was used to hold children to avoid using a HashMap (or ArrayList). For retrieving the most common words, I used a heap, but a linked list would have seemingly sufficed.
import java.io.File; import java.io.FileNotFoundException; import java.util.Comparator; import java.util.LinkedList; import java.util.PriorityQueue; import java.util.Scanner; public class Main { static class Pair<T, U> { T first; U second; public Pair(T first, U second) { this.first = first; this.second = second; } } static class Trie { static class Node { int count = 0; Character c = null; Node parent = null; // Can't use ArrayList or HashMap, so store children in LinkedList. // (alternatively a binary search tree could be used here) LinkedList<Node> children = new LinkedList<>(); String string() { LinkedList<Character> chars = new LinkedList<>(); Node n = this; while (n.c != null) { chars.addFirst(n.c); n = n.parent; } StringBuffer sb = new StringBuffer(); for (char c : chars) { sb.append(c); } return sb.toString(); } } Node head = new Node(); void insert(String s) { Node n = head; outer: for (int i = 0; i < s.length(); ++i) { char c = s.charAt(i); for (Node child : n.children) { if (child.c.equals(c)) { n = child; continue outer; } } Node child = new Node(); child.parent = n; child.c = c; n.children.add(child); n = child; } ++n.count; } LinkedList<Pair<String,Integer>> mostCommon(int count) { PriorityQueue<Pair<String,Integer>> minHeap = new PriorityQueue<>( new Comparator<Pair<String,Integer>>() { @Override public int compare(Pair<String,Integer> o1, Pair<String,Integer> o2) { if (o1.second < o2.second) return -1; if (o1.second > o2.second) return 1; return 0; } }); LinkedList<Node> frontier = new LinkedList<>(); frontier.addAll(head.children); while (!frontier.isEmpty()) { Node n = frontier.pop(); for (Node child : n.children) { frontier.add(child); } if (n.count == 0) continue; if (minHeap.size() < count) { minHeap.add(new Pair<String,Integer>(n.string(), n.count)); } else if (n.count > minHeap.peek().second) { minHeap.remove(); minHeap.add(new Pair<String,Integer>(n.string(), n.count)); } } LinkedList<Pair<String,Integer>> result = new LinkedList<>(); while (!minHeap.isEmpty()) { result.addFirst(minHeap.remove()); } return result; } } public static void main(String[] args) throws FileNotFoundException { if (args.length != 1) { System.err.println("Usage: <PROGRAM> FILE"); System.exit(1); } Trie t = new Trie(); Scanner s = new Scanner(new File(args[0])); while (s.hasNext()) { String next = s.next(); // Remove all punctuation and convert to lower case String word = next.replaceAll("[^a-zA-Z ]", "").toLowerCase(); t.insert(word); } s.close(); for (Pair<String,Integer> pair : t.mostCommon(10)) { System.out.printf("%s: %d\n", pair.first, pair.second); } } }Here’s example output when running on Pride and Prejudice, the top book on the http://www.gutenberg.org.
Here’s a Haskell version, also run on Pride and Prejudice. (Haskell’s Map is based on “size balanced binary trees”.)
{-# LANGUAGE TupleSections #-} import Data.List (sortOn) import qualified Data.Map.Strict as M import Data.Ord (Down(..)) import qualified Data.Text.Lazy as T import qualified Data.Text.Lazy.IO as T import System.Environment (getArgs, getProgName) import Text.Printf (PrintfArg, printf) import Text.Read (readMaybe) -- Return the distinct values of a list paired with their number of occurrences. counts :: (Ord a, Integral b) => [a] -> [(a, b)] counts = M.toList . M.fromListWith (+) . map (,1) -- Print the number of occurrences of a value, followed by the value itself. printCount :: PrintfArg a => (a, Int) -> IO () printCount (s, n) = printf "%-8d %s\n" n s main :: IO () main = do prog <- getProgName args <- getArgs case map readMaybe args of [Just n] -> do txt <- T.getContents mapM_ printCount $ take n $ sortOn (Down . snd) $ counts $ T.words txt _ -> error $ "Usage: " ++ prog ++ " number-of-words"The counts are the same as:
@Globules, your solution’s output differs from mine on the same input, which alerted me to a bug in my program. When parsing words, my program converts e.g., “table–nor” to “tablenor”, as opposed to splitting it into separate words “table” and “nor”.
Here’s the fixed while loop in my main function.
while (s.hasNext()) { String next = s.next(); // Remove all punctuation and convert to lower case String words = next.replaceAll("[^a-zA-Z ]", " ").toLowerCase(); for (String word : words.split(" ")) { t.insert(word); } }And the updated output:
@Daniel, my main motivation for translating all non-alphanumerics to spaces was to deal with abbreviations like “Mr.” and punctuation at the ends of sentences, but it’s possible there’s no simple solution to the problem. For example, I would consider “pre-authorized payment” to be two words, but my “tr” will turn it into three words. (Maybe it would be worth just treating dashes/hyphens specially?)