Three Amazon Interview Questions

November 25, 2016

These three questions come from Career Cup:

First: A kidnapper wants to write a ransom note by cutting characters from the text of a magazine. Given two strings containing the characters of the ransom note and the characters of the magazine, write a program to determine if the ransom note can be formed from the magazine.

Second: Write a program that operates in linear time that finds the item in a list that appears the most times consecutively.

Third: Given two finite streams of integers that are too large to fit in memory, write a program that finds the integers that appear in both streams; it must operate in time linear in the length of the longer of the two streams.

Your task is to write the three programs described above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Posted by programmingpraxis

Filed in Exercises

19 Comments »

19 Responses to “Three Amazon Interview Questions”

Paul said

November 25, 2016 at 10:40 AM

The first two are simple in Python, as you can use Counter and groupby. The third one I think you can solve using a trie. Load first all integers from the shorter sequence in the trie. The trie is much smaller than the sequence of all numbers as the numbers are compressed in a tree like structure. The length of the trie is dependend only on the length of the integers. Than you can scan the longer sequence and check if you find any number. I tested it for 10^7 numbers ranging from -10^6 to 10^6. For 10^8 numbers I got a MemoryError (so this still is not a real solution). The second (longer) sequence can be as long as you like.

from collections import Counter
from itertools import groupby

_end = '_end'

def make_trie(*words):
    root = dict()
    for word in words:
        current_dict = root
        for letter in word:
            current_dict = current_dict.setdefault(letter, {})
        current_dict = current_dict.setdefault(_end, _end)
    return root

def in_trie(trie, word):
    current_dict = trie
    for letter in word:
        current_dict = current_dict.get(letter)
        if current_dict is None:
            return False
    else:
        return _end in current_dict

def q1(magazine, message):
    mag = Counter(magazine)
    mes = Counter(message)
    return all(mag[k] >= mes[k] for k in mes.keys())

def q2(seq):
    return max((len(list(g)), k) for k, g in groupby(seq))[1]

def q3(seq1, seq2):
    'assume seq1 < seq2'
    trie = make_trie(*(str(i) for i in seq1))
    for e in seq2:
        if in_trie(trie, str(e)):
            return e

Paul said
November 25, 2016 at 10:58 AM
Although it was fun the wite the “trie solution” I do not think it really solves the problem. I will think further.
matthew said
November 25, 2016 at 11:09 AM
For the third problem – if the values are 32 bit integers, it only needs 500MB to store a bitset with 2^32 entries.
Jussi Piitulainen said
November 25, 2016 at 11:41 AM
It was not stated that the integers fit in 32 bits. On the contrary, the question can be (mis)construed as stating that the integers themselves are too large to fit in memory!

It was also not stated which of the two streams is longer, so the solution should not rely on knowing this beforehand.

But I think the worst case, as the problem is stated, really is such that even the distinct integers in the streams don’t fit in memory at once. What then? Perhaps some unstated assumptions are intended after all. Perhaps the interviewer thinks it somehow obvious that any integer fits in some fixed number of bits.
Paul said
November 25, 2016 at 11:47 AM
A possibility would be to mergesort the smaller sequence on disk and split it in parts, that fit in memory. Than scan the longer sequence and try to find a match. This means that for each element you would have to load one of the sorted parts and use bisection to try find a match. It is a long process, but it looks a linear for the longer sequence.
Jussi Piitulainen said
November 25, 2016 at 11:54 AM
@Praxis, I suspect that Amazon would, and should, object to restricting characters to the ASCII set. It’s high time to go past 8-bit sets to something approximately Unicode-sized, and so to the hash-based solutions and stuff that you refer to at the end. (To be pedantic, ASCII is a 7-bit set; we used to use a national variant, 30 years ago; but 8-bit sets are not much better any more.)
Jussi Piitulainen said
November 25, 2016 at 12:00 PM
It’s also not stated that the shorter sequence is significantly shorter. I think the solution needs to be linear in both.
matthew said
November 25, 2016 at 12:06 PM
@Jussi: I can’t see that any solution is possible that doesn’t require assuming that there is some (not too large) upper bound on the integers.
Jussi Piitulainen said
November 25, 2016 at 12:21 PM
@matthew, neither can I. But then I think the solution is to just collect the distinct elements from the streams into suitable set-like data structures and then compute their intersection. In Python, with hashing:
```
set(stream1) & set(stream2)
```
So it’s either quite simple or impossibly hard.
matthew said
November 25, 2016 at 12:43 PM
@Jussi: Fair enough. Good point on Unicode by the way.
Paul said
November 25, 2016 at 10:41 PM
It is actually very simple.
A solution is to read the shorter stream1 in chunks and dump the chunks on file. Then read the longer stream2 in chunks and do a set intersection with all the stream1 files. Write the intersections to file (the common integers could also not fit in memory). This works for any size of integers and in principle for any size of streams. The limitation is disk size.
The whole process is (very) slow, of course, as data has to be read from disk often.
If there are double entries in the longer stream, that do not end up in the same chunk, they might both be written to output.
It is not really relevant which stream is the shorter one. If it is not given, you can dump both streams on disk first and see which one is shorter.
It should be linear for the longer sequence, meaning that if you make the longer sequence 10 times longer the time will go up, by about a factor of 10. (This requirement was clearly stated in the question.)
I have written code to do all this, but it is very boring and you need a lot of time to run it.
I am curious, if someone can come up with a solution, that is really fast.
matthew said
November 26, 2016 at 1:13 PM
@Paul: So if everything will fit on disk, we can just implement a virtual memory system, read in the two streams & do “for a in s: for b in t: if a==b print(a)” – this is more or less your solution with a chunk size of 1 & as you say, for fixed t, takes time linear in length of s. However, if we really can fit everything on disk, probably better to use some cache-friendly sort (eg. funnel sort or some other merge sort variant) & get an n log n solution or maybe we can do something clever with Bloom filters.
programmingpraxis said
November 26, 2016 at 10:46 PM
My take on the third question is that the interviewer wants you to say that it is impossible, with a short reason why. If you ask a few questions (bounds on the size of integers, frequency of duplicates, is amount of memory sufficient for bit vector), then determine that it is impossible, that’s even better. If you start making assumptions, you fail. If you propose a sorting solution, you fail. If you propose something obviously too big for memory, you fail. If you start writing code, you fail. Of course, a cagey interviewer might not tell you that you have already failed, and will give you all the rope you need to hang yourself a half dozen times over. But I don’t work at Amazon.
Paul said
November 27, 2016 at 5:15 PM
@programmingpraxis: I think the interviewer wants to see how you react to the problem. How do you analyse the problem? Do you search for a solution? Then discuss the implications of the solution. It is a good question and gives much more insight than the first 2 questions.

Daniel said

November 29, 2016 at 4:52 AM

Here’s a Java solution to the third question that uses a disk-based trie.

A disk-based key-value store with hashing, like BerkeleyDB, could seemingly be used to implement a solution with expected linear time, but not worst-case.

import java.io.File;
import java.io.IOException;
import java.math.BigInteger;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;

// Stream simulates a finite stream of integers.
class Stream {
    Random rng = new Random();
    BigInteger remaining;
    
    public Stream(BigInteger length) {
        remaining = length;
    }
    
    public boolean hasNext() {
        return remaining.compareTo(BigInteger.ZERO) == 1;
    }
    
    public BigInteger next() {
        if (!hasNext()) return null;
        remaining = remaining.subtract(BigInteger.ONE);
        StringBuffer next = new StringBuffer();
        do next.append(rng.nextInt(10)); while (rng.nextFloat() > 0.5);
        return new BigInteger(next.toString());
    }
}

// A disk-based set using a trie
class TrieSet {
    String root;
    
    public TrieSet(String root) {
        this.root = root;
    }
    
    public File path(String s) {
        File path = new File(root);
        for (int i = 0; i < s.length(); ++i) {
            path = new File(path, Character.toString(s.charAt(i)));
        }
        path = new File(path, "_");
        return path;
    }
    
    public void add(String s) throws IOException {
        File path = path(s);
        path.getParentFile().mkdirs();
        path.createNewFile();
    }
    
    public boolean contains(String s) {
        return path(s).exists();
    }
}

public class Intersection {
    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("root");
        System.out.println(root);
        TrieSet set = new TrieSet(root.toString());
        Stream s1 = new Stream(BigInteger.valueOf(100000));
        Stream s2 = new Stream(BigInteger.valueOf(10));
        while (s1.hasNext()) {
            // add to set
            BigInteger next = s1.next();
            set.add(next.toString());
        }
        while (s2.hasNext()) {
            // if next is in set, print to stdout
            BigInteger next = s2.next();
            if (set.contains(next.toString())) System.out.println(next);
        }
        // TODO: clean-up temporary directories/files.
    }
}

Output:

matthew said

November 29, 2016 at 7:45 AM

@Daniel: Nice program, though I fear you would run out of inodes before too long.

Here are some more solutions to the 3 problems, not sure if good enough for Amazon (maybe I should find out – they have a Cambridge, UK office and I might be looking for a new job soon).

For the ransom note, if the note is eg. “SEND MONEY NOW” and the magazine is the Economist, we don’t need to scan the entire magazine, so this just counts the letters needed and then tries to find them in the second string, terminating early when all have been found. Javascript ES6, which should handle Unicode better (I’ve ignored case here):

function check(s,t) {
    const m = new Map();
    let count = 0
    for (let c of s) {
        m.set(c,(m.get(c)||0)+1);
        count++;
    }
    for (let c of t) {
        const k = m.get(c)
        if (k) {
            m.set(c,k-1);
            count--;
            if (count == 0) break;
        }
    }
    return count == 0;
}

const test1 = (s,t) => console.log(`'${s}', '${t}': ${check(s,t)}`)

test1("foo","raboof")
test1("🀴🁓🁮🂈","🁮🀴🀴🂈🁓🁓")

More ES6 for the second, much the same as the @praxis solution:

function maxconsec(s) {
    let count = 0, maxcount = 0, lastitem = s[0], maxitem
    for (let a of s) {
        if (a === lastitem) {
            count++;
            if (count > maxcount) {
                maxcount = count
                maxitem = a
            }
        } else {
            count = 0
            lastitem = a
        }
    }
    return maxitem
}

const test2 = s => console.log(`${s}: '${maxconsec(s)}'`)

test2([1,1,2,2,2,3,3,3,3,2,2,2,1,1])
test2("△.♢♢.⬠⬠⬠.⬡⬡⬡⬡.⬠⬠⬠.♢♢.△");

Finally, here is the bitset idea coded up in C++ (Javascript didn’t seem to like making 500M arrays). Just generate random numbers for each sequence:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
int main() {
  uint32_t *bitset = new uint32_t[(1<<(32-5))];
  for (int i = 0; i < 10000000; i++) {
    uint32_t a = rand()^(rand()<<16);
    bitset[a>>5] |= 1<<(a&31);
  }
  long int count = 0;
  for (int i = 0; i < 10000000; i++) {
    uint32_t a = rand()^(rand()<<16);
    if (bitset[a>>5] & (1<<(a&31))) count++;
  }
  delete [] bitset;
  printf("%ld\n", count);
}

matthew said
November 29, 2016 at 7:53 AM
Just noticed that those left shifts are a bit dodgy – should make sure the shifted values are unsigned.
Tyler said
December 4, 2016 at 2:29 PM
For the third, best case I can think of. If an int is 32 bits or less….

A char array[max_unsigned_int/4]

one stream gets bit 1 the other stream gets bit 2

So you’d translate the int to the index and correct bit and bitwise or everytime you encounter it. So it’d take up roughly a 1GB.

If 4 GB isn’t too much to ask for, you could just cast as unsigned and index would be the int. But still leaving stream 1 with one bit and stream 2 with another. So each index with a value of 3 would be an intersection.

sbocq said

June 2, 2018 at 7:17 AM

Clojure/Script (try online at http://clojurescript.net/)

1 Ransom

(defn ransom [note magazine]
  (let [available (frequencies magazine)]
    (loop [required (seq (frequencies note))]
      (if-let [[c1 f1] (first required)]
        (when-let [f (available c1)]
          (when (>= f f1)
            (recur (rest required))))
        true))))

Test:
=> (ransom "send money" "four score and seven years ago") nil => (ransom "send money" "four million and seven years ago") true

2 Most frequent consecutive item

(defn most-conseq-item [l]
  (when (seq l)
    (letfn [(cmp [fav-item fav-count new-item new-count]
              (if (> new-count fav-count) [new-item new-count] [fav-item fav-count]))]
      (first (apply cmp (reduce (fn [[fav-item fav-count new-item new-count] next-item]
                                  (if (= next-item new-item)
                                    [fav-item fav-count new-item (inc new-count)]
                                    (into (cmp fav-item fav-count new-item new-count)
                                          [next-item 1])))
                                [nil 0 (first l) 1]
                                (next l)))))))

=> (most-conseq-item "helloo") \l => (most-conseq-item "hellooo") \o

3 Duplicates

Sequences in Clojure can be chunked streams. This naive algo uses 2 integer sets
to store integers already seen in each stream, and one map to track integer
frequencies (max 2). Assuming 32-bit integers and a 64-bit machine, given enough memory,
even if the streams don’t fit, the map and the 2 sets can fit.

(defn dups [s1 s2]
  (letfn [(count-if-not-seen [m seen val]
            (if (and val (not (seen val)))
              (update m val (fnil inc 0))
              m))]
    (->> (loop [m {} i1-seen #{} i2-seen #{} ss1 s1 ss2 s2]
           (let [i1 (first ss1)
                 i2 (first ss2)]
             (if (or i1 i2)
               (recur (-> m
                          (count-if-not-seen i1-seen i1)
                          (count-if-not-seen i2-seen i2))
                      (if i1 (conj i1-seen i1))
                      (if i2 (conj i2-seen i2))
                      (rest ss1)
                      (rest ss2))
               m)))
         (filter #(> (second %) 1))
         (map first))))

Test:
=> (dups "he" "hooee") (\h \e) => (dups "hahahao" "hooee") (\h \o)

Programming Praxis