## Three Amazon Interview Questions

### November 25, 2016

These three questions come from Career Cup:

First:A kidnapper wants to write a ransom note by cutting characters from the text of a magazine. Given two strings containing the characters of the ransom note and the characters of the magazine, write a program to determine if the ransom note can be formed from the magazine.

Second:Write a program that operates in linear time that finds the item in a list that appears the most times consecutively.

Third:Given two finite streams of integers that are too large to fit in memory, write a program that finds the integers that appear in both streams; it must operate in time linear in the length of the longer of the two streams.

Your task is to write the three programs described above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

The first two are simple in Python, as you can use Counter and groupby. The third one I think you can solve using a trie. Load first all integers from the shorter sequence in the trie. The trie is much smaller than the sequence of all numbers as the numbers are compressed in a tree like structure. The length of the trie is dependend only on the length of the integers. Than you can scan the longer sequence and check if you find any number. I tested it for 10^7 numbers ranging from -10^6 to 10^6. For 10^8 numbers I got a MemoryError (so this still is not a real solution). The second (longer) sequence can be as long as you like.

Although it was fun the wite the “trie solution” I do not think it really solves the problem. I will think further.

For the third problem – if the values are 32 bit integers, it only needs 500MB to store a bitset with 2^32 entries.

It was not stated that the integers fit in 32 bits. On the contrary, the question can be (mis)construed as stating that the integers themselves are too large to fit in memory!

It was also not stated which of the two streams is longer, so the solution should not rely on knowing this beforehand.

But I think the worst case, as the problem is stated, really is such that even the distinct integers in the streams don’t fit in memory at once. What then? Perhaps some unstated assumptions are intended after all. Perhaps the interviewer thinks it somehow obvious that any integer fits in some fixed number of bits.

A possibility would be to mergesort the smaller sequence on disk and split it in parts, that fit in memory. Than scan the longer sequence and try to find a match. This means that for each element you would have to load one of the sorted parts and use bisection to try find a match. It is a long process, but it looks a linear for the longer sequence.

@Praxis, I suspect that Amazon would, and should, object to restricting characters to the ASCII set. It’s high time to go past 8-bit sets to something approximately Unicode-sized, and so to the hash-based solutions and stuff that you refer to at the end. (To be pedantic, ASCII is a 7-bit set; we used to use a national variant, 30 years ago; but 8-bit sets are not much better any more.)

It’s also not stated that the shorter sequence is significantly shorter. I think the solution needs to be linear in both.

@Jussi: I can’t see that any solution is possible that doesn’t require assuming that there is some (not too large) upper bound on the integers.

@matthew, neither can I. But then I think the solution is to just collect the distinct elements from the streams into suitable set-like data structures and then compute their intersection. In Python, with hashing:

So it’s either quite simple or impossibly hard.

@Jussi: Fair enough. Good point on Unicode by the way.

It is actually very simple.

A solution is to read the shorter stream1 in chunks and dump the chunks on file. Then read the longer stream2 in chunks and do a set intersection with all the stream1 files. Write the intersections to file (the common integers could also not fit in memory). This works for any size of integers and in principle for any size of streams. The limitation is disk size.

The whole process is (very) slow, of course, as data has to be read from disk often.

If there are double entries in the longer stream, that do not end up in the same chunk, they might both be written to output.

It is not really relevant which stream is the shorter one. If it is not given, you can dump both streams on disk first and see which one is shorter.

It should be linear for the longer sequence, meaning that if you make the longer sequence 10 times longer the time will go up, by about a factor of 10. (This requirement was clearly stated in the question.)

I have written code to do all this, but it is very boring and you need a lot of time to run it.

I am curious, if someone can come up with a solution, that is really fast.

@Paul: So if everything will fit on disk, we can just implement a virtual memory system, read in the two streams & do “for a in s: for b in t: if a==b print(a)” – this is more or less your solution with a chunk size of 1 & as you say, for fixed t, takes time linear in length of s. However, if we really can fit everything on disk, probably better to use some cache-friendly sort (eg. funnel sort or some other merge sort variant) & get an n log n solution or maybe we can do something clever with Bloom filters.

My take on the third question is that the interviewer wants you to say that it is impossible, with a short reason why. If you ask a few questions (bounds on the size of integers, frequency of duplicates, is amount of memory sufficient for bit vector), then determine that it is impossible, that’s even better. If you start making assumptions, you fail. If you propose a sorting solution, you fail. If you propose something obviously too big for memory, you fail. If you start writing code, you fail. Of course, a cagey interviewer might not tell you that you have already failed, and will give you all the rope you need to hang yourself a half dozen times over. But I don’t work at Amazon.

@programmingpraxis: I think the interviewer wants to see how you react to the problem. How do you analyse the problem? Do you search for a solution? Then discuss the implications of the solution. It is a good question and gives much more insight than the first 2 questions.

Here’s a Java solution to the third question that uses a disk-based trie.

A disk-based key-value store with hashing, like BerkeleyDB, could seemingly be used to implement a solution with expected linear time, but not worst-case.

Output:

@Daniel: Nice program, though I fear you would run out of inodes before too long.

Here are some more solutions to the 3 problems, not sure if good enough for Amazon (maybe I should find out – they have a Cambridge, UK office and I might be looking for a new job soon).

For the ransom note, if the note is eg. “SEND MONEY NOW” and the magazine is the Economist, we don’t need to scan the entire magazine, so this just counts the letters needed and then tries to find them in the second string, terminating early when all have been found. Javascript ES6, which should handle Unicode better (I’ve ignored case here):

More ES6 for the second, much the same as the @praxis solution:

Finally, here is the bitset idea coded up in C++ (Javascript didn’t seem to like making 500M arrays). Just generate random numbers for each sequence:

Just noticed that those left shifts are a bit dodgy – should make sure the shifted values are unsigned.

For the third, best case I can think of. If an int is 32 bits or less….

A char array[max_unsigned_int/4]

one stream gets bit 1 the other stream gets bit 2

So you’d translate the int to the index and correct bit and bitwise or everytime you encounter it. So it’d take up roughly a 1GB.

If 4 GB isn’t too much to ask for, you could just cast as unsigned and index would be the int. But still leaving stream 1 with one bit and stream 2 with another. So each index with a value of 3 would be an intersection.