Google Interview Question
April 8, 2016
Today’s exercise is an interview question from Google:
Given a list of words, find the maximum value of the product of the lengths of two words from the list, subject to the constraint that the two words cannot share any letters. For instance, given the words ABCW, BAZ, FOO, BAR, XTFN, and ABCDEF, the pair FOO BAR has a product of 9, the pair BAZ XTFN has a product of 12, and the pair ABCW XTFN has a product of 16, which is the maximum. Note that the pair ABCW ABCDEF doesn’t work because the two words share three letters.
Your task is to write a program to solve Google’s interview question. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
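For illustration (this is not necessarily the suggested solution), a straightforward brute-force sketch in Python compares every pair of words and keeps the largest product among letter-disjoint pairs:

def max_product(words):
    # Largest product of lengths over pairs of words sharing no letters.
    best = 0
    sets = [(len(w), set(w)) for w in words]
    for i, (n1, s1) in enumerate(sets):
        for n2, s2 in sets[i + 1:]:
            if not (s1 & s2):  # no shared letters
                best = max(best, n1 * n2)
    return best

print(max_product(["ABCW", "BAZ", "FOO", "BAR", "XTFN", "ABCDEF"]))  # 16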
Again a nice one for Perl, with its string handling!
Using Factor:
The first version finds the “maximum” pair of words. The key depends on whether the intersection of their respective sets of letters is empty. If it is empty, the key is the product of the lengths of the two words. Otherwise, the key is zero.
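The Factor code isn’t reproduced here; a rough Python rendering of the same idea (names are mine) takes the maximum over all pairs, keyed by the product of lengths when the letter sets are disjoint and by zero otherwise:

from itertools import combinations

def pair_key(pair):
    # Product of lengths if the words share no letters, else zero.
    w1, w2 = pair
    return len(w1) * len(w2) if not set(w1) & set(w2) else 0

def max_pair(words):
    return max(combinations(words, 2), key=pair_key)

print(max_pair(["ABCW", "BAZ", "FOO", "BAR", "XTFN", "ABCDEF"]))
# ('ABCW', 'XTFN')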
This second version builds a dictionary mapping letters to sets of words containing that letter. Then set operations can be used to calculate pairs of words that have no letters in common.
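Again the code itself isn’t shown; in Python the dictionary idea might look like the sketch below: union the word sets for each letter of a word, and anything outside that union is a compatible partner.

from collections import defaultdict

def max_product(words):
    by_letter = defaultdict(set)  # letter -> words containing that letter
    for w in words:
        for c in set(w):
            by_letter[c].add(w)
    best = 0
    all_words = set(words)
    for w in words:
        # Words sharing at least one letter with w (includes w itself).
        sharing = set.union(*(by_letter[c] for c in set(w)))
        for other in all_words - sharing:
            best = max(best, len(w) * len(other))
    return best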
The second line was a comment; it lost a ‘#’ character somewhere.
This works pretty fast for larger numbers of words.
Another Python solution, similar to Paul’s (I borrowed some of your code too, thanks Paul), but using a heap to ensure we examine possible solutions in order. Negative lengths are used because Python’s heapq is a min-heap with a fixed less-than ordering, and also to ensure the solutions come out in lexicographic order. This version prints all valid word pairs:
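A minimal sketch of the approach as described (pushing every pair onto the heap is my assumption about the construction):

import heapq
from itertools import combinations

def print_pairs(words):
    # Negated products make the min-heap pop the largest products first;
    # ties break lexicographically on the word pair.
    heap = [(-len(a) * len(b), a, b) for a, b in combinations(sorted(words), 2)]
    heapq.heapify(heap)
    while heap:
        negprod, a, b = heapq.heappop(heap)
        if not set(a) & set(b):
            print(-negprod, a, b)

print_pairs(["ABCW", "BAZ", "FOO", "BAR", "XTFN", "ABCDEF"])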
Another method. Make a heap of the (-len(word), set(word)) tuples. Then build a sorted list and simultaneously see whether any combination of a new item with the already sorted items has no overlap. If so, the solution is found. Otherwise sort further. This method is fast, as it is not necessary to sort all items and the items are tested for large products first.
As usual, simple solutions are also often the fastest. This one is about as fast as g3 and is much simpler.
Here’s a solution in Python. It isn’t as concise as the posted solution, but the critical difference is that it runs in O(n) time and constant additional space when given n words of input.
The trick is to realize that we don’t need to keep track of each previously seen word, only the maximum word length seen for each combination of letters, and there’s a large-ish but fixed number of possible letter combinations: 2^26-1. This means we can put a constant upper bound on the amount of work done to process each additional word.
In the solution below, each word gets reduced to an “lset”, a string identifying the set of letters it contains. For example, “FOOBAR” has the lset “ABFOR”; “FOO” and “FFOO” both have the lset “FO”. We store the maximum word length seen for each lset in a dictionary that we iterate over for each new word. The maximum number of lsets in the dictionary is 2^26-1.
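Mark’s code isn’t reproduced here, but a sketch of the lset dictionary as described might look like this:

def max_product(words):
    lset_len = {}  # lset -> longest word length seen with that letter set
    best = 0
    for w in words:
        letters = set(w)
        lset = "".join(sorted(letters))
        for other, length in lset_len.items():
            if letters.isdisjoint(other):
                best = max(best, len(w) * length)
        if lset_len.get(lset, 0) < len(w):
            lset_len[lset] = len(w)
    return best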
@Mark. I do not see that your method is O(n). I see a loop over words and an inner loop over the dict lset_len. I tested the method on 355000 words. After half an hour your method did not finish. g3 and g4 finish in about 5 seconds.
@Paul – I think Mark’s point is that the size of the lset dict is bounded as there are only 2^26-1 possible entries, therefore the time spent iterating over all the entries is also bounded – so it’s O(n), but with a rather large constant factor on the upper bound – a nice illustration of the limitations of big-O notation.
Of course, that’s assuming 8-bit characters; doing Unicode would be a different kettle of fish (still theoretically O(n) though).
I meant “characters in A-Z”, not 8-bit characters.
@matthew and @Mark. I understand (now) that one could say that this is O(n). It is certainly true that the lset_len dict saves space and time, as words with the same characters are collapsed into one item. I printed some sizes of the dict and saw that after 15000 (Dutch) words the number of entries in the dict was about 0.88 * 15000. The time and space savings are not very significant. Therefore the loop over lset_len still runs over the majority of the words seen so far. If the number of words were much larger than 2^26-1 (about 67,000,000; the number of words in the Bible is about 783,000, with about 12,000 unique words), then the saving would be significant.
Methods g3 and g4 have a bug. Here is a (more) correct version of g3.
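The corrected code isn’t reproduced here either; a hedged sketch of one correct way to keep the early termination (not necessarily Paul’s actual fix) is to track the best product found so far and stop only when no remaining pair can beat it:

def max_product(words):
    ws = sorted(words, key=len, reverse=True)
    sets = [set(w) for w in ws]
    best = 0
    for i in range(len(ws)):
        if len(ws[i]) ** 2 <= best:
            break  # no pair of the remaining (shorter) words can win
        for j in range(i + 1, len(ws)):
            prod = len(ws[i]) * len(ws[j])
            if prod <= best:
                break  # later partners are shorter still
            if sets[i].isdisjoint(sets[j]):
                best = prod
                break  # longest disjoint partner found for ws[i]
    return best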
Here is a Scala solution:
def googInterview(ls: List[String]) = {
  val words = ls.map(_.toUpperCase).sorted
  val pairs = for {
    xx <- words
    yy <- words if yy > xx && (xx.toSet intersect yy.toSet).isEmpty
  } yield (xx.length * yy.length, xx, yy)
  pairs.max
}
Solution in C#, using LINQ and Extension Methods: