Data Structures Homework
August 24, 2018
Today’s exercise is from a programming student taking a Data Structures course, using Java.
Given a text file, print a list of the ten most common words in the file, along with their frequency of occurrence. You may not use a HashMap or ArrayList, but you may use regular expressions.
Your task is to write a program to find the ten most common words in an input file. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
@Rutger, the Counter in Python is a subclass of dict, which is a hash map. However, while the problem prohibits hash maps, it doesn’t specify whether data structures built on hash maps are permitted.
Here’s a solution in Java. It uses a trie. A linked list was used to hold children to avoid using a HashMap (or ArrayList). For retrieving the most common words, I used a heap, but a linked list would have seemingly sufficed.
Here’s example output when running on Pride and Prejudice, the top book on the http://www.gutenberg.org.
Here’s a Haskell version, also run on Pride and Prejudice. (Haskell’s Map is based on “size balanced binary trees”.)
The counts are the same as:
@Globules, your solution’s output differs from mine on the same input, which alerted me to a bug in my program. When parsing words, my program converts e.g., “table–nor” to “tablenor”, as opposed to splitting it into separate words “table” and “nor”.
Here’s the fixed while loop in my main function.
And the updated output:
@Daniel, my main motivation for translating all non-alphanumerics to spaces was to deal with abbreviations like “Mr.” and punctuation at the ends of sentences, but it’s possible there’s no simple solution to the problem. For example, I would consider “pre-authorized payment” to be two words, but my “tr” will turn it into three words. (Maybe it would be worth just treating dashes/hyphens specially?)