Data Structures Homework
August 24, 2018
We don’t need a data structure to solve this problem, nor do we need regular expressions or Java:
tr -cs A-Za-z '\n' < infile | tr A-Z a-z |
sort | uniq -c | sort -rn | sed 10q
The first tr splits each word (a maximal sequence of letters) onto a separate line, the second tr converts upper case to lower case, the first sort brings like words together, uniq counts them, the second sort puts the list into descending order by count, and sed outputs the top ten. We can do the same thing in Scheme:
(define (word-freq n file-name)
(define (freq-gt? a b) (> (cdr a) (cdr b)))
(with-input-from-file file-name (lambda ()
(let loop ((word (read-word)) (words (list)))
(if (eof-object? word)
(take n
(sort freq-gt?
(uniq-c string=?
(sort string<? words))))
(loop (read-word) (cons word words)))))))
And here is the program applied to the text of Lincoln’s Gettysburg Address:
> (word-freq 10 "gettysburg.txt") ((that . 13) (the . 11) (we . 10) (here . 8) (to . 8) (a . 7) (and . 6) (can . 5) (for . 5) (have . 5))
You can run the program at https://ideone.com/o529HD. We discussed this same exercise previously, where we used a hash table, which isn’t permitted for the current exercise.
text = """Today’s exercise is from a programming student taking a Data Structures course, using Java. Given a text file, print a list of the ten most common words in the file, along with their frequency of occurrence. You may not use a HashMap or ArrayList, but you may use regular expressions. Your task is to write a program to find the ten most common words in an input file. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.""" from collections import Counter c = Counter() for w in text.split(' '): c[w] += 1 print(c.most_common(5)) # outputs [('a', 7), ('the', 5), ('or', 4), ('to', 4), ('in', 3), ('you', 3), ('exercise', 2), ('is', 2), ('file,', 2), ('of', 2)]@Rutger, the Counter in Python is a subclass of dict, which is a hash map. However, while the problem prohibits hash maps, it doesn’t specify whether data structures built on hash maps are permitted.
Here’s a solution in Java. It uses a trie. A linked list was used to hold children to avoid using a HashMap (or ArrayList). For retrieving the most common words, I used a heap, but a linked list would have seemingly sufficed.
import java.io.File; import java.io.FileNotFoundException; import java.util.Comparator; import java.util.LinkedList; import java.util.PriorityQueue; import java.util.Scanner; public class Main { static class Pair<T, U> { T first; U second; public Pair(T first, U second) { this.first = first; this.second = second; } } static class Trie { static class Node { int count = 0; Character c = null; Node parent = null; // Can't use ArrayList or HashMap, so store children in LinkedList. // (alternatively a binary search tree could be used here) LinkedList<Node> children = new LinkedList<>(); String string() { LinkedList<Character> chars = new LinkedList<>(); Node n = this; while (n.c != null) { chars.addFirst(n.c); n = n.parent; } StringBuffer sb = new StringBuffer(); for (char c : chars) { sb.append(c); } return sb.toString(); } } Node head = new Node(); void insert(String s) { Node n = head; outer: for (int i = 0; i < s.length(); ++i) { char c = s.charAt(i); for (Node child : n.children) { if (child.c.equals(c)) { n = child; continue outer; } } Node child = new Node(); child.parent = n; child.c = c; n.children.add(child); n = child; } ++n.count; } LinkedList<Pair<String,Integer>> mostCommon(int count) { PriorityQueue<Pair<String,Integer>> minHeap = new PriorityQueue<>( new Comparator<Pair<String,Integer>>() { @Override public int compare(Pair<String,Integer> o1, Pair<String,Integer> o2) { if (o1.second < o2.second) return -1; if (o1.second > o2.second) return 1; return 0; } }); LinkedList<Node> frontier = new LinkedList<>(); frontier.addAll(head.children); while (!frontier.isEmpty()) { Node n = frontier.pop(); for (Node child : n.children) { frontier.add(child); } if (n.count == 0) continue; if (minHeap.size() < count) { minHeap.add(new Pair<String,Integer>(n.string(), n.count)); } else if (n.count > minHeap.peek().second) { minHeap.remove(); minHeap.add(new Pair<String,Integer>(n.string(), n.count)); } } LinkedList<Pair<String,Integer>> result = new LinkedList<>(); while (!minHeap.isEmpty()) { result.addFirst(minHeap.remove()); } return result; } } public static void main(String[] args) throws FileNotFoundException { if (args.length != 1) { System.err.println("Usage: <PROGRAM> FILE"); System.exit(1); } Trie t = new Trie(); Scanner s = new Scanner(new File(args[0])); while (s.hasNext()) { String next = s.next(); // Remove all punctuation and convert to lower case String word = next.replaceAll("[^a-zA-Z ]", "").toLowerCase(); t.insert(word); } s.close(); for (Pair<String,Integer> pair : t.mostCommon(10)) { System.out.printf("%s: %d\n", pair.first, pair.second); } } }Here’s example output when running on Pride and Prejudice, the top book on the http://www.gutenberg.org.
Here’s a Haskell version, also run on Pride and Prejudice. (Haskell’s Map is based on “size balanced binary trees”.)
{-# LANGUAGE TupleSections #-} import Data.List (sortOn) import qualified Data.Map.Strict as M import Data.Ord (Down(..)) import qualified Data.Text.Lazy as T import qualified Data.Text.Lazy.IO as T import System.Environment (getArgs, getProgName) import Text.Printf (PrintfArg, printf) import Text.Read (readMaybe) -- Return the distinct values of a list paired with their number of occurrences. counts :: (Ord a, Integral b) => [a] -> [(a, b)] counts = M.toList . M.fromListWith (+) . map (,1) -- Print the number of occurrences of a value, followed by the value itself. printCount :: PrintfArg a => (a, Int) -> IO () printCount (s, n) = printf "%-8d %s\n" n s main :: IO () main = do prog <- getProgName args <- getArgs case map readMaybe args of [Just n] -> do txt <- T.getContents mapM_ printCount $ take n $ sortOn (Down . snd) $ counts $ T.words txt _ -> error $ "Usage: " ++ prog ++ " number-of-words"The counts are the same as:
@Globules, your solution’s output differs from mine on the same input, which alerted me to a bug in my program. When parsing words, my program converts e.g., “table–nor” to “tablenor”, as opposed to splitting it into separate words “table” and “nor”.
Here’s the fixed while loop in my main function.
while (s.hasNext()) { String next = s.next(); // Remove all punctuation and convert to lower case String words = next.replaceAll("[^a-zA-Z ]", " ").toLowerCase(); for (String word : words.split(" ")) { t.insert(word); } }And the updated output:
@Daniel, my main motivation for translating all non-alphanumerics to spaces was to deal with abbreviations like “Mr.” and punctuation at the ends of sentences, but it’s possible there’s no simple solution to the problem. For example, I would consider “pre-authorized payment” to be two words, but my “tr” will turn it into three words. (Maybe it would be worth just treating dashes/hyphens specially?)