Google Interview Question
April 8, 2016
Here’s our solution:
(define (g xs) (car (sort (lambda (xs ys) (> (car xs) (car ys))) (map (lambda (xs) (cons (* (string-length (car xs)) (string-length (cadr xs))) xs)) (filter (lambda (xs) (null? (intersect (string->list (car xs)) (string->list (cadr xs))))) (cross xs xs))))))
Starting from the inside, cross
forms the cross product of the input words, filter
retains only those pairs that do not share any letters (the intersection of their letters is null), map
decorates each pair with the product of its string-lengths, sort arranges from largest product to smallest, and car
takes the winner. The way to write programs like this is one step at a time: take the cross product, and check that it does what you expect; then filter, and check again; then map, and check again; then sort, and make one last check, then take the winner. It took me only a few minutes to write, and I was confident in my solution at each step along the way. Here it is in action:
> (g '("ABCW" "BAZ" "FOO" "BAR" "XTFN" "ABCDEF")) (16 "ABCW" "XTFN")
This is obviously a brute-force solution. In an interview situation, I’d probably give that first, just to make sure I had a working solution, then start improving. The cross product could be improved to give only one of each pair of permutations (FOO BAR and BAR FOO), thus reducing the work by half. The intersection could be computed by bit arrays. Instead of sorting, you could simply find the maximum. But this solves the problem, simply and obviously, makes good use of the standard library, and would probably have acceptable performance up to a fairly large input.
You can run the program at http://ideone.com/1WgN56, where you will also see the definitions of cross
and intersect
.
Again a nice one for perl as it’s string handling!
Using Factor:
The first version finds the “maximum” pair of words. The key depends on whether the intersection of their respective sets of letters is empty. If it is empty, the key is the product of the lengths of the two words. Otherwise, the key is zero.
This second version builds a dictionary mapping letters to sets of words containing that letter. Then set operations can be used to calculate pairs of words that have no letters in common.
The second line was a comment, it lost a ‘#’ character somewhere.
This works pretty fast for larger number of words.
Another Python solution, similar to Paul’s (I borrowed some your code too, thanks Paul), but using a heap to ensure we examine possible solutions in order. Using negative lengths is because Python heapq uses a fixed less than ordering & also to ensure the solutions come out in lexicographic order. This version prints all valid word pairs:
Another method. Make a heap of the (-len(word), set(word)) tuples. Then build a sorted list and simultanuously see if any combination of a new item with the already sorted items has no overlap. If so, the solution is found. Otherwise sort further. This method is fast as it is not necessary to sort all items and the items are tested for large products first.
As usual, simple solutions are also often the fastest. This one is about as fast as g3 and is much simpler.
Here’s a solution in Python. It isn’t as concise as the posted solution, but the critical difference is that it runs in O(n) time and constant additional space when given n words of input.
The trick is to realize that we don’t need to keep track of each previously seen word, only the maximum word length seen for each combination of letters, and there’s a large-ish but fixed size of possible letter combinations at 2^26-1. This means we can put a constant upper bound on the amount of work done to process each additional word.
In the solution below, each word gets reduced to an “lset”, a string identifying the set of letters it contains. For example, “FOOBAR” has the lset “ABFOR”; “FOO” and “FFOO” both have the lset “FO”. We store the maximum word length seen for each lset in a dictionary that we iterate over for each new word. The maximum number of lsets in the dictionary is 2^26-1.
@Mark. I do not see that your method is O(n). I see a loop over words and an inner loop over the dict lset_len. I tested the method on 355000 words. After half an hour your method did not finish. g3 and g4 finish in about 5 seconds.
@Paul – I think Mark’s point is that the size of the lset dict is bounded as there are only 2^26-1 possibly entries, therefore time iterating over all the entries is also bounded – so it’s O(n) but with a rather large constant factor on the upper bound – nice illustration of the limitations of big-O notation.
Of course, that’s assuming 8-bit characters, doing Unicode would be a different kettle of fish (stll theoretically O(n) though)
I meant “characters in A-Z”, not 8-bit characters.
@matthew and @Mark. I understand (now) that one could say that this is O(n). It is certainly so, that the lset_len dict saves space and time as words with the same characters are collapsed into one item. I printed some sizes of the dict and saw that after 15000 (dutch) words the number of entries in the dict were 0.88 * 15000. The time and space saving are not very significant. Therefore the loop over lset_len is still over the majority of the words seen sofar. If the number of words would be much larger than 2^26-1 (about 67,000,000; the number of words in the Bible is about 783,000 (about 12,000 unique words)) then the saving would be significant.
Methods g3 and g4 have a bug. Here is a (more) correct version of g3.
Here is Scala solution:
def googInterview(ls: List[String]) = {
ls map(x => x.toUpperCase) sorted
val yy = for {
xx <- ls
yy xx && ((xx.toSet intersect yy.toSet).isEmpty)
}yield((xx length)*(yy length), xx,yy)
yy max
}
Solution in C#, using LINQ and Extension Methods: