Three Amazon Interview Questions

November 25, 2016

These three questions come from Career Cup:

First: A kidnapper wants to write a ransom note by cutting characters from the text of a magazine. Given two strings containing the characters of the ransom note and the characters of the magazine, write a program to determine if the ransom note can be formed from the magazine.

Second: Write a program that operates in linear time that finds the item in a list that appears the most times consecutively.

Third: Given two finite streams of integers that are too large to fit in memory, write a program that finds the integers that appear in both streams; it must operate in time linear in the length of the longer of the two streams.

Your task is to write the three programs described above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

18 Responses to “Three Amazon Interview Questions”

  1. Paul said

    The first two are simple in Python, as you can use Counter and groupby. The third one I think you can solve using a trie. Load first all integers from the shorter sequence in the trie. The trie is much smaller than the sequence of all numbers as the numbers are compressed in a tree like structure. The length of the trie is dependend only on the length of the integers. Than you can scan the longer sequence and check if you find any number. I tested it for 10^7 numbers ranging from -10^6 to 10^6. For 10^8 numbers I got a MemoryError (so this still is not a real solution). The second (longer) sequence can be as long as you like.

    from collections import Counter
    from itertools import groupby
    
    _end = '_end'
    
    def make_trie(*words):
        root = dict()
        for word in words:
            current_dict = root
            for letter in word:
                current_dict = current_dict.setdefault(letter, {})
            current_dict = current_dict.setdefault(_end, _end)
        return root
    
    def in_trie(trie, word):
        current_dict = trie
        for letter in word:
            current_dict = current_dict.get(letter)
            if current_dict is None:
                return False
        else:
            return _end in current_dict
    
    def q1(magazine, message):
        mag = Counter(magazine)
        mes = Counter(message)
        return all(mag[k] >= mes[k] for k in mes.keys())
    
    def q2(seq):
        return max((len(list(g)), k) for k, g in groupby(seq))[1]
    
    def q3(seq1, seq2):
        'assume seq1 < seq2'
        trie = make_trie(*(str(i) for i in seq1))
        for e in seq2:
            if in_trie(trie, str(e)):
                return e
    
  2. Paul said

    Although it was fun the wite the “trie solution” I do not think it really solves the problem. I will think further.

  3. matthew said

    For the third problem – if the values are 32 bit integers, it only needs 500MB to store a bitset with 2^32 entries.

  4. Jussi Piitulainen said

    It was not stated that the integers fit in 32 bits. On the contrary, the question can be (mis)construed as stating that the integers themselves are too large to fit in memory!

    It was also not stated which of the two streams is longer, so the solution should not rely on knowing this beforehand.

    But I think the worst case, as the problem is stated, really is such that even the distinct integers in the streams don’t fit in memory at once. What then? Perhaps some unstated assumptions are intended after all. Perhaps the interviewer thinks it somehow obvious that any integer fits in some fixed number of bits.

  5. Paul said

    A possibility would be to mergesort the smaller sequence on disk and split it in parts, that fit in memory. Than scan the longer sequence and try to find a match. This means that for each element you would have to load one of the sorted parts and use bisection to try find a match. It is a long process, but it looks a linear for the longer sequence.

  6. Jussi Piitulainen said

    @Praxis, I suspect that Amazon would, and should, object to restricting characters to the ASCII set. It’s high time to go past 8-bit sets to something approximately Unicode-sized, and so to the hash-based solutions and stuff that you refer to at the end. (To be pedantic, ASCII is a 7-bit set; we used to use a national variant, 30 years ago; but 8-bit sets are not much better any more.)

  7. Jussi Piitulainen said

    It’s also not stated that the shorter sequence is significantly shorter. I think the solution needs to be linear in both.

  8. matthew said

    @Jussi: I can’t see that any solution is possible that doesn’t require assuming that there is some (not too large) upper bound on the integers.

  9. Jussi Piitulainen said

    @matthew, neither can I. But then I think the solution is to just collect the distinct elements from the streams into suitable set-like data structures and then compute their intersection. In Python, with hashing:

    set(stream1) & set(stream2)
    

    So it’s either quite simple or impossibly hard.

  10. matthew said

    @Jussi: Fair enough. Good point on Unicode by the way.

  11. Paul said

    It is actually very simple.
    A solution is to read the shorter stream1 in chunks and dump the chunks on file. Then read the longer stream2 in chunks and do a set intersection with all the stream1 files. Write the intersections to file (the common integers could also not fit in memory). This works for any size of integers and in principle for any size of streams. The limitation is disk size.
    The whole process is (very) slow, of course, as data has to be read from disk often.
    If there are double entries in the longer stream, that do not end up in the same chunk, they might both be written to output.
    It is not really relevant which stream is the shorter one. If it is not given, you can dump both streams on disk first and see which one is shorter.
    It should be linear for the longer sequence, meaning that if you make the longer sequence 10 times longer the time will go up, by about a factor of 10. (This requirement was clearly stated in the question.)
    I have written code to do all this, but it is very boring and you need a lot of time to run it.
    I am curious, if someone can come up with a solution, that is really fast.

  12. matthew said

    @Paul: So if everything will fit on disk, we can just implement a virtual memory system, read in the two streams & do “for a in s: for b in t: if a==b print(a)” – this is more or less your solution with a chunk size of 1 & as you say, for fixed t, takes time linear in length of s. However, if we really can fit everything on disk, probably better to use some cache-friendly sort (eg. funnel sort or some other merge sort variant) & get an n log n solution or maybe we can do something clever with Bloom filters.

  13. programmingpraxis said

    My take on the third question is that the interviewer wants you to say that it is impossible, with a short reason why. If you ask a few questions (bounds on the size of integers, frequency of duplicates, is amount of memory sufficient for bit vector), then determine that it is impossible, that’s even better. If you start making assumptions, you fail. If you propose a sorting solution, you fail. If you propose something obviously too big for memory, you fail. If you start writing code, you fail. Of course, a cagey interviewer might not tell you that you have already failed, and will give you all the rope you need to hang yourself a half dozen times over. But I don’t work at Amazon.

  14. Paul said

    @programmingpraxis: I think the interviewer wants to see how you react to the problem. How do you analyse the problem? Do you search for a solution? Then discuss the implications of the solution. It is a good question and gives much more insight than the first 2 questions.

  15. Daniel said

    Here’s a Java solution to the third question that uses a disk-based trie.

    A disk-based key-value store with hashing, like BerkeleyDB, could seemingly be used to implement a solution with expected linear time, but not worst-case.

    import java.io.File;
    import java.io.IOException;
    import java.math.BigInteger;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Random;
    
    // Stream simulates a finite stream of integers.
    class Stream {
        Random rng = new Random();
        BigInteger remaining;
        
        public Stream(BigInteger length) {
            remaining = length;
        }
        
        public boolean hasNext() {
            return remaining.compareTo(BigInteger.ZERO) == 1;
        }
        
        public BigInteger next() {
            if (!hasNext()) return null;
            remaining = remaining.subtract(BigInteger.ONE);
            StringBuffer next = new StringBuffer();
            do next.append(rng.nextInt(10)); while (rng.nextFloat() > 0.5);
            return new BigInteger(next.toString());
        }
    }
    
    // A disk-based set using a trie
    class TrieSet {
        String root;
        
        public TrieSet(String root) {
            this.root = root;
        }
        
        public File path(String s) {
            File path = new File(root);
            for (int i = 0; i < s.length(); ++i) {
                path = new File(path, Character.toString(s.charAt(i)));
            }
            path = new File(path, "_");
            return path;
        }
        
        public void add(String s) throws IOException {
            File path = path(s);
            path.getParentFile().mkdirs();
            path.createNewFile();
        }
        
        public boolean contains(String s) {
            return path(s).exists();
        }
    }
    
    public class Intersection {
        public static void main(String[] args) throws IOException {
            Path root = Files.createTempDirectory("root");
            System.out.println(root);
            TrieSet set = new TrieSet(root.toString());
            Stream s1 = new Stream(BigInteger.valueOf(100000));
            Stream s2 = new Stream(BigInteger.valueOf(10));
            while (s1.hasNext()) {
                // add to set
                BigInteger next = s1.next();
                set.add(next.toString());
            }
            while (s2.hasNext()) {
                // if next is in set, print to stdout
                BigInteger next = s2.next();
                if (set.contains(next.toString())) System.out.println(next);
            }
            // TODO: clean-up temporary directories/files.
        }
    }
    

    Output:

    655
    759
    5
    2
    0
    50
    7
    801
    448
    7
    
  16. matthew said

    @Daniel: Nice program, though I fear you would run out of inodes before too long.

    Here are some more solutions to the 3 problems, not sure if good enough for Amazon (maybe I should find out – they have a Cambridge, UK office and I might be looking for a new job soon).

    For the ransom note, if the note is eg. “SEND MONEY NOW” and the magazine is the Economist, we don’t need to scan the entire magazine, so this just counts the letters needed and then tries to find them in the second string, terminating early when all have been found. Javascript ES6, which should handle Unicode better (I’ve ignored case here):

    function check(s,t) {
        const m = new Map();
        let count = 0
        for (let c of s) {
            m.set(c,(m.get(c)||0)+1);
            count++;
        }
        for (let c of t) {
            const k = m.get(c)
            if (k) {
                m.set(c,k-1);
                count--;
                if (count == 0) break;
            }
        }
        return count == 0;
    }
    
    const test1 = (s,t) => console.log(`'${s}', '${t}': ${check(s,t)}`)
    
    test1("foo","raboof")
    test1("🀴🁓🁮🂈","🁮🀴🀴🂈🁓🁓")
    

    More ES6 for the second, much the same as the @praxis solution:

    function maxconsec(s) {
        let count = 0, maxcount = 0, lastitem = s[0], maxitem
        for (let a of s) {
            if (a === lastitem) {
                count++;
                if (count > maxcount) {
                    maxcount = count
                    maxitem = a
                }
            } else {
                count = 0
                lastitem = a
            }
        }
        return maxitem
    }
    
    const test2 = s => console.log(`${s}: '${maxconsec(s)}'`)
    
    test2([1,1,2,2,2,3,3,3,3,2,2,2,1,1])
    test2("△.♢♢.⬠⬠⬠.⬡⬡⬡⬡.⬠⬠⬠.♢♢.△");
    

    Finally, here is the bitset idea coded up in C++ (Javascript didn’t seem to like making 500M arrays). Just generate random numbers for each sequence:

    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    int main() {
      uint32_t *bitset = new uint32_t[(1<<(32-5))];
      for (int i = 0; i < 10000000; i++) {
        uint32_t a = rand()^(rand()<<16);
        bitset[a>>5] |= 1<<(a&31);
      }
      long int count = 0;
      for (int i = 0; i < 10000000; i++) {
        uint32_t a = rand()^(rand()<<16);
        if (bitset[a>>5] & (1<<(a&31))) count++;
      }
      delete [] bitset;
      printf("%ld\n", count);
    }
    
  17. matthew said

    Just noticed that those left shifts are a bit dodgy – should make sure the shifted values are unsigned.

  18. Tyler said

    For the third, best case I can think of. If an int is 32 bits or less….

    A char array[max_unsigned_int/4]

    one stream gets bit 1 the other stream gets bit 2

    So you’d translate the int to the index and correct bit and bitwise or everytime you encounter it. So it’d take up roughly a 1GB.

    If 4 GB isn’t too much to ask for, you could just cast as unsigned and index would be the int. But still leaving stream 1 with one bit and stream 2 with another. So each index with a value of 3 would be an intersection.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: