Word Breaks
August 12, 2011
Daniel Tunkelang posted this interview question to his blog:
Given an input string and a dictionary of words, segment the input string into a space-separated sequence of dictionary words if possible. For example, if the input string is “applepie” and dictionary contains a standard set of English words, then we would return the string “apple pie” as output.
He also gave a number of constraints: the dictionary provides a single operation, exact string lookup, and is a given to the task; you are not to consider how to implement the dictionary, nor are you to worry about stemming, spelling correction, or other aspects of the dictionary. The output may have more than two words; if there is more than one solution, you only need to return one of them; and your function should indicate if there are no solutions.
Your task is to write a function that solves the “word break” problem. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
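To make the constraints concrete, here is a minimal Python sketch, not the suggested solution. It treats the dictionary as an opaque lookup predicate (called `is_word` here, a name chosen for this example), returns one segmentation, and signals "no solution" with `None`. The toy dictionary is an assumption for illustration only.

```python
from typing import Callable, List, Optional


def word_break(text: str, is_word: Callable[[str], bool]) -> Optional[str]:
    """Return one space-separated segmentation of text, or None if there is none."""
    dead_ends = set()  # start positions already known to lead nowhere

    def solve(start: int) -> Optional[List[str]]:
        if start == len(text):
            return []  # nothing left to segment
        if start in dead_ends:
            return None
        for end in range(start + 1, len(text) + 1):
            prefix = text[start:end]
            if is_word(prefix):  # the only dictionary operation: exact lookup
                rest = solve(end)
                if rest is not None:
                    return [prefix] + rest
        dead_ends.add(start)  # remember the failure so it is not recomputed
        return None

    words = solve(0)
    return None if words is None else " ".join(words)


# Hypothetical toy dictionary standing in for the given one.
toy_dictionary = {"apple", "pie", "a", "app"}
print(word_break("applepie", toy_dictionary.__contains__))
print(word_break("applepi", toy_dictionary.__contains__))
```

Remembering the failed start positions keeps the search from re-exploring the same suffix many times on inputs with lots of short dictionary words.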
A solution in Scheme. Nice!
My straightforward (and naïve) solutions in both Racket and Python:
A Python solution:
Not pretty, to be sure, but it seems to get the job done.
Another Python solution.
I looked at a histogram of word length for the dictionary. In order of decreasing frequency, the word lengths are [8, 7, 9, 6, …]. gen_split() tries to split off a prefix of the input string using the word lengths in order of decreasing frequency.
gen_split() generates different splits of the input string. split() returns just the first one.
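The commenter's code is not reproduced here, so the following is only a sketch of the idea as described. It assumes the dictionary is available as a Python set so that a length histogram can be computed, which goes beyond the exact-lookup-only constraint of the original problem. The helper name `lengths_by_frequency` is made up for this example.

```python
from collections import Counter
from typing import Iterator, List, Optional, Set


def lengths_by_frequency(words: Set[str]) -> List[int]:
    """Word lengths that occur in the dictionary, most common length first."""
    histogram = Counter(len(w) for w in words)
    return [length for length, _ in histogram.most_common()]


def gen_split(text: str, words: Set[str], lengths: List[int]) -> Iterator[List[str]]:
    """Yield every segmentation of text, trying likelier prefix lengths first."""
    if not text:
        yield []
        return
    for length in lengths:
        if length <= len(text) and text[:length] in words:
            for rest in gen_split(text[length:], words, lengths):
                yield [text[:length]] + rest


def split(text: str, words: Set[str]) -> Optional[str]:
    """Return the first segmentation produced by gen_split, or None."""
    first = next(gen_split(text, words, lengths_by_frequency(words)), None)
    return None if first is None else " ".join(first)
```

Because `gen_split` is a generator, `split` pays only for the work needed to reach the first solution, while other callers can still iterate over all of them.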
My Clojure solution
Sorry, let’s try this again.
My Clojure solution
Maybe this will be better:
Consider a procedure, (for-each-partition proc s word?), that walks
proc over all partitions of the given string s into words recognized
by the given predicate, word?, like this:
The problem statement asks for joining the parts into a single string;
that would be easy to do. The example uses this dictionary predicate:
Then an escape procedure can be used to receive the first solution
found by for-each-partition, or #f if there are none, as follows:
It is now possible to add conditions. For example, the first partition
into three words:
(The indentations look all wrong to me after cut-and-paste and I see no
preview button. Let us see.)
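The callback interface described above can be approximated in Python. This is a loose analogue, not the commenter's Scheme: the simple recursive walker is an assumption made for the example, an exception stands in for the escape procedure, and `first_partition`, `accept`, and the toy word set are hypothetical names and data.

```python
from typing import Callable, List, Optional


def for_each_partition(proc: Callable[[List[str]], None],
                       s: str,
                       is_word: Callable[[str], bool]) -> None:
    """Call proc once for every partition of s into recognized words."""
    def walk(start: int, parts: List[str]) -> None:
        if start == len(s):
            proc(list(parts))  # pass a copy; parts is mutated while walking
            return
        for end in range(start + 1, len(s) + 1):
            if is_word(s[start:end]):
                parts.append(s[start:end])
                walk(end, parts)
                parts.pop()

    walk(0, [])


class Found(Exception):
    """Carries a partition out of the walk, standing in for an escape procedure."""
    def __init__(self, parts: List[str]) -> None:
        super().__init__()
        self.parts = parts


def first_partition(s: str,
                    is_word: Callable[[str], bool],
                    accept: Callable[[List[str]], bool] = lambda parts: True
                    ) -> Optional[List[str]]:
    """Return the first partition satisfying accept, or None if there is none."""
    def proc(parts: List[str]) -> None:
        if accept(parts):
            raise Found(parts)  # abandon the rest of the walk

    try:
        for_each_partition(proc, s, is_word)
    except Found as found:
        return found.parts
    return None


# Hypothetical toy predicate; the original example's dictionary is not shown here.
toy_words = {"a", "app", "apple", "le", "pie"}
print(first_partition("applepie", toy_words.__contains__))
print(first_partition("applepie", toy_words.__contains__,
                      accept=lambda parts: len(parts) == 2))
```

Keeping the condition in `accept` means the walker itself never changes when a new requirement, such as a fixed number of words, is added.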
Ok, the indentations look right in the published version, within the
sourcecode brackets. Below is an implementation of for-each-partition.
It works on an agenda of reversed position sequences, so that the
first index in an agenda task is a position where the next word needs
to be found. It memoizes the end positions of the recognized words at
those start positions where it needs to find more ways forward.
(I have no idea what happened to the indentation of line 17 above. It
is right in my editor window. There are tabs, but there are tabs on
many other lines that did not break.)
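The agenda-based approach described in this comment might look roughly like the following in Python. This is a sketch based on the prose only, not a translation of the Scheme code: the stack-shaped agenda, the tuple representation of tasks, and the name `ends_from` are assumptions.

```python
from typing import Callable, Dict, List, Tuple


def for_each_partition(proc: Callable[[List[str]], None],
                       s: str,
                       is_word: Callable[[str], bool]) -> None:
    """Call proc on every partition of s into words, using an explicit agenda."""
    # Each task is a tuple of cut positions, newest first, so task[0] is the
    # position where the next word has to start.
    agenda: List[Tuple[int, ...]] = [(0,)]
    # Memo: start position -> end positions of the words recognized there.
    ends_from: Dict[int, List[int]] = {}
    while agenda:
        task = agenda.pop()
        start = task[0]
        if start == len(s):
            cuts = task[::-1]  # restore left-to-right order
            proc([s[a:b] for a, b in zip(cuts, cuts[1:])])
            continue
        if start not in ends_from:
            ends_from[start] = [end for end in range(start + 1, len(s) + 1)
                                if is_word(s[start:end])]
        # Push longer words first so that shorter ones are popped first.
        for end in reversed(ends_from[start]):
            agenda.append((end,) + task)


# Hypothetical toy predicate for demonstration.
toy_words = {"a", "app", "apple", "le", "pie"}
for_each_partition(print, "applepie", toy_words.__contains__)
```

With the memo in place, the dictionary is consulted for a given start position only once, no matter how many partial partitions reach it.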
Here are my Clojure and Ruby solutions: http://benmabey.com/2011/08/14/word-break-in-clojure-and-ruby-and-laziness-in-ruby.html
using System;

class Program
{
    static void Main(string[] args)
    {
        string[] lookupTable = { "apple", "Bat", "Candle", "Donkey", "Eat", "Sat" };
        Console.WriteLine("Please enter a string");
        string str = Console.ReadLine();
        string strToSearch = string.Empty;
        for (int i = 0; i < str.Length; i++)
        {
            // Skip spaces, including the ones inserted below.
            if (' ' == str[i])
                continue;
            strToSearch += str[i];
            for (int j = 0; j < lookupTable.Length; j++)
            {
                // Case-insensitive comparison against each dictionary word.
                if (strToSearch.ToUpper() == lookupTable[j].ToUpper())
                {
                    // Insert the break right after the current character; searching
                    // with IndexOf would find the first occurrence of a repeated word.
                    str = str.Insert(i + 1, " ");
                    strToSearch = string.Empty;
                    break;
                }
            }
        }
        // Greedy: the shortest matching word is always taken, with no backtracking.
        Console.WriteLine(str.TrimEnd());
    }
}
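This is my C# code:
using System;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

string[] dic = { "a", "brown", "apple", "pie" };
string input = "abrownapplepie";
StringBuilder sb = new StringBuilder();
// Escape each word so it is matched literally.
Regex[] regexes = dic.Select(c => new Regex(Regex.Escape(c))).ToArray();
// Emits each dictionary word found in the input, in dictionary order, so the
// output matches the input only because dic lists the words in that order.
foreach (var reg in regexes)
{
    sb.Append(reg.Match(input).Value + " ");
}
Console.WriteLine(sb.ToString().Trim());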
My 2 cents, in Perl.
In this implementation, I try to find the longest prefix in the input that exists in the dictionary, and then do a recursive call for the rest of the input.
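The Perl source is not reproduced here, so what follows is only a Python sketch of the longest-prefix-first idea. One assumption is worth flagging: this version falls back to shorter prefixes when the longest one leads to a dead end, which the description above does not state either way. The function name is made up for the example.

```python
from typing import Callable, List, Optional


def longest_first_split(text: str,
                        is_word: Callable[[str], bool]) -> Optional[List[str]]:
    """Segment text, preferring the longest dictionary prefix at each step."""
    if not text:
        return []
    for end in range(len(text), 0, -1):  # longest candidate prefix first
        if is_word(text[:end]):
            rest = longest_first_split(text[end:], is_word)
            if rest is not None:
                return [text[:end]] + rest
            # Assumed behavior: otherwise fall through to a shorter prefix.
    return None


# Hypothetical toy dictionary; "applepi" forces a fallback to "apple".
toy_words = {"apple", "applepi", "pie"}
print(longest_first_split("applepie", toy_words.__contains__))
```

Without the fallback, the input "applepie" would fail on this dictionary, because the longest prefix "applepi" leaves an unmatched "e".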
In order to test it, and just for fun, I built myself a small dictionary file (using the 187 most used words in the English language), taken from here:
http://www.manythings.org/vocabulary/lists/l/
And I tested it like this:
And here’s the source:
DP and Recursive Solution with working code at http://www.gohired.in/2014/12/word-break-problem.html
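For readers who want the dynamic-programming idea without following the link, here is a minimal bottom-up sketch in Python. It is not the linked code; the function name, the `prev_cut` table, and the toy dictionary are assumptions made for this example.

```python
from typing import Callable, List, Optional


def word_break_dp(text: str, is_word: Callable[[str], bool]) -> Optional[str]:
    """Return one segmentation of text via bottom-up DP, or None if there is none."""
    n = len(text)
    # prev_cut[end] is the start of a dictionary word text[start:end] such that
    # text[:start] is itself segmentable; None means text[:end] is not segmentable.
    prev_cut: List[Optional[int]] = [None] * (n + 1)
    prev_cut[0] = 0  # sentinel: the empty prefix is trivially segmentable
    for end in range(1, n + 1):
        for start in range(end):
            if prev_cut[start] is not None and is_word(text[start:end]):
                prev_cut[end] = start
                break
    if prev_cut[n] is None:
        return None
    # Walk the recorded cuts backwards to recover the words.
    words = []
    end = n
    while end > 0:
        start = prev_cut[end]
        words.append(text[start:end])
        end = start
    return " ".join(reversed(words))


# Hypothetical toy dictionary for demonstration.
toy_words = {"a", "app", "apple", "le", "pie"}
print(word_break_dp("applepie", toy_words.__contains__))
```

The table has one entry per prefix of the input, so the work is bounded by a quadratic number of dictionary lookups regardless of how many segmentations exist.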