Amazon Interview Question

November 27, 2012

My reaction to this question is that it is impossible, and the interviewer is testing me to see what I will do with it. The best solution I can come up with requires visiting each of the n points, calculating the distance from each point to the origin using the Pythagorean theorem, inserting each distance/point pair into a heap of size k, then reporting the k closest points by emptying the heap after all the points have been read. But that algorithm is O(n log k), which fails to meet the requirements. In an interview, I would explain that to the interviewer, then challenge him to state an O(n) solution. However, the source from which this question came apparently accepts that solution, on the grounds that log(100) is a constant, so that is the solution we will write, even though it seems to me to be cheating. Bah! Humbug!

(define heap #f)

Normally we would compute the distance of a point (x, y) from (0, 0) as the square root of the sum of x² and y². But since we are only comparing distances, and don't care about their absolute values, we forgo the square root; squaring is monotonic for non-negative arguments, so the ordering is unchanged.

(define (closest points k)
  (set! heap (make-vector k (list -1 0 0))) ; -1 represents infinity
  (do ((points points (cdr points)))
      ((null? points) heap)
    (let ((dist (+ (square (caar points)) (square (cadar points)))))
      (when (lt? (vector-ref heap 0) (list dist)) ; no farther than heap max?
        (vector-set! heap 0 (cons dist (car points)))
        (siftdown k)))))

The heap is a maxheap, with the largest element in the heap stored at the root. Thus, when we compare the distance, we replace the largest-of-the-least only when the new distance is smaller.

Points are represented as three-slot lists with the distance from (0, 0) in the car, the x coordinate in the cadr, and the y coordinate in the caddr. The heap is stored in a vector h of length k with the maximum distance at h[0] and each point h[i] no smaller than its children h[2i+1] and h[2i+2], initialized with all distances equal to infinity. When a new point arrives, it is immediately rejected if its distance is greater than the heap maximum; otherwise it replaces the heap maximum and is sifted down the heap to its final position.
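For comparison, the same bounded max-heap scheme can be sketched in Python. This is only an illustrative sketch, not part of the original article: heapq supplies a min-heap, so squared distances are negated to keep the farthest survivor at the root, mirroring the Scheme code above.

```python
import heapq

def closest(points, k):
    """Return the k points nearest the origin using a size-k heap.

    heapq is a min-heap, so we store negated squared distances,
    which puts the worst (farthest) surviving point at the root.
    """
    heap = []  # entries are (-squared_distance, x, y)
    for x, y in points:
        d = x * x + y * y
        if len(heap) < k:
            heapq.heappush(heap, (-d, x, y))
        elif d < -heap[0][0]:              # closer than the current worst
            heapq.heapreplace(heap, (-d, x, y))
    return [(x, y) for _, x, y in heap]
```

Each point costs at most O(log k) heap work, giving the O(n log k) total discussed above.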

(define (siftdown len)
  (let loop ((parent 0))
    (let ((child (+ parent parent 1)))
      (when (< child len)
        (if (and (< (+ child 1) len) ; (+ child 1) is sibling
                 (lt? (vector-ref heap (+ child 1))
                      (vector-ref heap child)))
            (set! child (+ child 1))) ; the least child
        (when (lt? (vector-ref heap child)
                   (vector-ref heap parent))
          (let ((temp (vector-ref heap child)))
            (vector-set! heap child (vector-ref heap parent))
            (vector-set! heap parent temp))
          (loop child))))))

We described leftist heaps, pairing heaps, maxiphobic heaps, and binomial heaps in previous exercises. The heap shown above is another type of heap, typically used in imperative languages, where it is convenient to store a heap in an array.

The last thing we need is a less-than function that recognizes infinity, which is represented as -1 (all real distances are non-negative). The comparison is wrapped in a not because siftdown is coded as a minheap sift, but we need a maxheap:

(define (lt? pt1 pt2)
  (not
    (if (negative? (car pt1)) #f
      (if (negative? (car pt2)) #t
        (< (car pt1) (car pt2))))))

To demonstrate the function, we generate a thousand random (x, y) points with -1000 <= x, y < 1000, then find the ten closest to (0, 0):

> (define points
    (let loop ((n 1000) (pts (list)))
      (if (zero? n) pts
        (loop (- n 1) (cons (list (randint -1000 1000)
                                  (randint -1000 1000)) pts)))))
> (closest points 10)
(480573 18 693)
(462376 530 -426)
(192596 -436 50)
(221341 -210 421)
(458330 -677 -1)
(158089 380 -117)
(153722 29 -391)
(62057 109 224)
(34064 -92 -160)
(290132 -526 116)

We used square and randint from the Standard Prelude. You can run the program at http://programmingpraxis.codepad.org/ZmUkqk5Y. This exercise is similar to the exercise Ullman's Puzzle, which collected the elements, sorted them in time O(n log n), then took the first k elements. Another solution is to collect all the elements in a list, then use the selection algorithm to identify the k smallest elements.
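That selection-based alternative might look like the following Python sketch (illustrative only), which partitions on a randomly chosen pivot's squared distance in the style of quickselect and recurses into just one side, for expected linear time:

```python
import random

def k_closest_select(points, k):
    """Expected-linear selection of the k points nearest the origin."""
    def sqdist(p):
        x, y = p
        return x * x + y * y

    def select(pts, k):
        if k >= len(pts):
            return pts
        pivot = sqdist(random.choice(pts))
        # three-way partition on squared distance avoids
        # degenerate recursion when many distances are equal
        near  = [p for p in pts if sqdist(p) < pivot]
        equal = [p for p in pts if sqdist(p) == pivot]
        far   = [p for p in pts if sqdist(p) > pivot]
        if k <= len(near):
            return select(near, k)
        if k <= len(near) + len(equal):
            return near + equal[:k - len(near)]
        return near + equal + select(far, k - len(near) - len(equal))

    return select(points, k)
```

The recurrence is the familiar T(n) = T(n/2) + O(n) in expectation, hence O(n).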

Pages: 1 2

16 Responses to “Amazon Interview Question”

  1. [...] today’s Programming Praxis exercise, our goal is to find the 100 points closest to (0, 0) out of a list of [...]

  2. My Haskell solution (see http://bonsaicode.wordpress.com/2012/11/27/programming-praxis-amazon-interview-question/ for a version with comments):

    import Control.Applicative
    import qualified Data.IntMap as I
    import System.Random
    
    closest :: Int -> [(Int, Int)] -> [(Int, Int)]
    closest n = take n . concat . I.elems . I.fromListWith (flip (++)) .
                map (\(x,y) -> (x*x+y*y, [(x,y)]))
    
    points :: IO [(Int, Int)]
    points = take 1000000 <$> liftA2 zip (randomRs (-1000, 1000) <$> newStdGen)
                                         (randomRs (-1000, 1000) <$> newStdGen)
    
    main :: IO ()
    main = print . closest 100 =<< points
    
  3. sbmassey said

Why not insert into a min heap, and limit its size to 100 by removing the top item on each insertion once it has reached the limit? That should make it linear, right?

  4. Paul said

A Python version. In my opinion this is linear, as long as n is much smaller than the length of the xypoints array. I checked it with N (length of xypoints) equal to 100000 and with values for n of 10 and 100. The time taken is about the same.

    from math import hypot
    from heapq import heapify, heappushpop

    def nsmallest(xypoints, n):
        """Use a heap with the maximum of the n lowest at the top of the heap"""
        # first fill the heap with the first n points
        heap = [(-hypot(x, y), (x, y)) for x, y in xypoints[:n]]
        heapify(heap)
        # now push new elements on heap only if distance small enough
        for x, y in xypoints[n:]:
            dist = hypot(x, y)
            if dist < -heap[0][0]:
                heappushpop(heap, (-dist, (x, y))) # remove "worst"
        return heap
    
  5. kawas44 said

    What do you think?
    Because 100 is a constant factor, even if you use insertion sort (into an already-sorted array, since you're building it as you go), the complexity of the algorithm will still be O(n), no?
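The idea in the comment above, keeping an always-sorted array of at most 100 entries, can be sketched in Python (a sketch for illustration, not the commenter's code). Each insertion into a length-k list costs O(k), so with k fixed the scan over n points is O(n) with a constant factor of k:

```python
import bisect

def closest_sorted(points, k):
    """Keep a sorted list of at most k (squared-distance, x, y) triples."""
    best = []  # sorted ascending by squared distance
    for x, y in points:
        d = x * x + y * y
        if len(best) < k:
            bisect.insort(best, (d, x, y))
        elif d < best[-1][0]:
            bisect.insort(best, (d, x, y))
            best.pop()                     # drop the new worst entry
    return [(x, y) for _, x, y in best]
```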

  6. cosmin said

    Actually the problem can be solved in O(n) no matter how big k is (here k = 100). This problem is the 2D version of finding the k smallest elements. Below is an algorithm that uses Quicksort's idea of splitting the list into 2 sub-lists, with the difference that only one recursive call is made. Due to the randomization the complexity is about T(n) = T(n/2) + O(n), which has the solution T(n) = O(n). Still, the worst case could be O(n^2), but this is highly improbable. Other key points worth mentioning:

    [1] If a min-heap (not necessarily with the size fixed to k) is used, the complexity would be O(n + k*log(n)): O(n) for building the heap and O(k*log(n)) for k extractions. But if k < n/log(n) (which is the case here) the complexity will still be O(n).

    [2] There is another algorithm, which splits the list into buckets of 5 elements, that has complexity O(n) in the worst case. A starting point for understanding that algorithm can be found at http://en.wikipedia.org/wiki/Selection_algorithm

    from random import randint
    from math import sqrt
    
    def dist(point):
    	return sqrt(point[0]**2 + point[1]**2)
    
    def k_smallest(k, a):
    	if k == len(a): return a
    	splitElm = a[randint(0, len(a) - 1)]
    	a1 = [e for e in a if dist(e) <= dist(splitElm)]
    	a2 = [e for e in a if dist(e) > dist(splitElm)]
    	if k <= len(a1):
    		return k_smallest(k, a1)
    	else:
    		return a1 + k_smallest(k - len(a1), a2)
    
    print k_smallest(3, [(0, 0), (1, 7), (2, 2), (3, 2), (1, 4), (3, 0)])
    
  7. Andrey said

    If, indeed, the question was phrased like “given a million points (x, y), give an O(n) solution to find the 100 points closest to (0, 0),” it was a trick.

    One thing is that “n” is not defined, which calls for making an assumption or, better yet, changing the question before attempting to answer it.

    Another trick is understanding the math of big-O: if you limit in advance the total number of points, there are no variables affecting the time of calculation, making any valid solution O(1), as in "runs no longer than predetermined time".

    Once you add variables to the task, you can start talking about big-O. “Given n points (x, y), give an O(n) solution to find the 100 points closest to (0, 0)” – now that question is different, since you may be given 5, or a million, or 10^100, or any other number of points, and you want an algorithm that would always run no longer than (constant seconds * n).

    What you assumed originally was yet another question: "Given n points (x, y), give an O(n) solution to find k points closest to (0, 0)". That was a misunderstanding leading to the "cheating" feeling.

  8. razvan said

    My take on this in Java:

  9. cosmin said

    A worst-case O(n) solution using the algorithm described in http://www.cse.ust.hk/~dekai/271/notes/L05/L05.pdf :

    from math import sqrt, ceil
    
    def norm(point):
    	return sqrt(point[0]**2 + point[1]**2)
    
    def k_smallest(k, a, f):
        n, medians = len(a), []
        # solve the particular cases
        if k == n: return a[:]
        if n <= 5: return sorted(a, key=f)[:k]
        # split the list into n/5 buckets and compute their medians
        for i in xrange( int(ceil(n / 5.)) ):
            start, end = 5*i, min(5*(i + 1), n)
            b = sorted(a[start:end], key=f)
            medians.append(b[(end - start) // 2])
        # find the median of the medians by recursing with k=n/2
        splitElm = max(k_smallest(n // 2, medians, f), key=f)
        # split the list in 2 sub-lists picking splitElm as a pivot
        a1 = [e for e in a if f(e) < f(splitElm)]
        a2 = [e for e in a if f(e) >= f(splitElm)]
        # recurse on one of the sub-lists to solve the problem
        if k <= len(a1):
            return k_smallest(k, a1, f)
        else:
            return a1 + k_smallest(k - len(a1), a2, f)

    print k_smallest(3, [(0, 0), (1, 7), (2, 2), (3, 2), (1, 4), (3, 0)], norm)
    
  10. programmingpraxis said

    Wow! This question generated lots of interest. Page hits have been about half again greater than normal for both of the last two days.

    One of the questions raised in the comments is about the time complexity of the suggested solution. It is indeed O(n log k), not O(n), so it’s not linear. Just because k is very much smaller than n doesn’t change the theoretical time complexity, though in practice the extra factor of log k may not be noticeable.

    It looks like I won't pass the Amazon interview, as my algorithm was not linear and thus failed to answer the question. A linear algorithm is possible by using select to find the kth item, as a couple of people have pointed out. A previous exercise gave an expected-time linear algorithm for select based on randomization; on Friday, we'll look at a guaranteed-time linear algorithm for select based on median-of-medians.

    Thanks to all who responded.

  11. HBK said

    The question can be approached as follows
    Given the list of points, we can compute each one's distance from the origin. While computing the distances we also keep track of how many points fall in each unit interval of distance: if a point's distance d satisfies b - 1 < d <= b, we increment the count at index b of a count array. After all the points are processed, index b of the count array holds the number of points with distance in (b - 1, b]. Scanning the count array cumulatively, the first index where the running sum reaches or exceeds 100 gives the radius of the circle that bounds the 100 points closest to the origin. A second linear traversal of the original list then collects points whose distance is no greater than that radius, stopping once 100 points have been taken.

    The time complexity of the algorithm would be 2*O(N) + O(M), where N is the number of points and M is the distance of the point that lies farthest from the origin. It needs O(M) additional space.
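A sketch of this counting approach in Python (illustrative, not the commenter's code), with one adjustment: it buckets by integer squared distance rather than rounded distance, so the counts can be indexed exactly when coordinates are integers:

```python
def closest_by_counting(points, k):
    """Counting-sort flavour: bucket points by integer squared distance.

    Assumes integer coordinates, so squared distances are integers in
    [0, M]; the count array takes O(M) space.
    """
    if not points:
        return []
    sq = [x * x + y * y for x, y in points]
    counts = [0] * (max(sq) + 1)
    for d in sq:
        counts[d] += 1
    # find the smallest radius whose cumulative count reaches k
    total, radius = 0, len(counts) - 1
    for d, c in enumerate(counts):
        total += c
        if total >= k:
            radius = d
            break
    # second pass: collect at most k points within that radius
    result = []
    for (x, y), d in zip(points, sq):
        if d <= radius and len(result) < k:
            result.append((x, y))
    return result
```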

  12. Rob Mayoff said

    Sort the points by distance from the origin in linear time using radix sort.
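A least-significant-digit radix sort on integer squared distances, as the comment above suggests, might be sketched like this (an illustrative sketch using base-256 digits; each counting pass is O(n), and the number of passes depends only on the magnitude of the largest key):

```python
def radix_sorted_by_distance(points):
    """Sort points by integer squared distance with LSD radix sort.

    Uses base-256 digits (8-bit shifts); stable bucketing at each
    pass preserves the order established by earlier passes.
    """
    keyed = [(x * x + y * y, (x, y)) for x, y in points]
    if not keyed:
        return []
    maxkey = max(d for d, _ in keyed)
    shift = 0
    while shift == 0 or (maxkey >> shift) > 0:
        buckets = [[] for _ in range(256)]
        for d, p in keyed:
            buckets[(d >> shift) & 255].append((d, p))
        keyed = [pair for b in buckets for pair in b]
        shift += 8
    return [p for _, p in keyed]
```

The k closest points are then just the first k elements of the sorted list.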

  13. Brian said

    Here’s my solution in Python:

    import math
    from operator import itemgetter
    #   Accepts point set n and closest number of points k
    #   e.g. n=[[3,4],[-1,2],[4,4],[-5,-2],[1,-3]]
    def O(n,k):
        i=[]
        closest=[]
        for point in n:
            dist=math.sqrt(point[0]**2+point[1]**2)
            i.append((point[0],point[1],dist))
        i=sorted(i,key=itemgetter(2))
        for x in range(0,k):
            closest.append([i[x][0],i[x][1]])
        return closest
    

    It has two FOR loops, but one is only for returning k points (100 for the example). The FOR loop that cycles through the million points doesn’t use any logic – it only creates the hypotenuse for sorting to find the closest points.

  14. Philip G said

    The post directly above mine that does a sort can hardly be considered O(n) as the sort is going to be O(n log n) complex.

  15. brooknovak said

    Ha nice – came up with the same solution presented on page 2: forgo computing the roots, and use a max heap.
    I also went further and created a special bin heap operation called SwapTop, which pops the top (max) element and inserts a new element in worst-case O(log n) time, where n = 100.
    Here is my C# solution:

    public static List<int[]> GetClosestPoints (IEnumerable<int[]> points, int x, int y, int k)
    {
    	if (points == null)
    		throw new ArgumentNullException ("points");
    	if (k < 0)
    		throw new ArgumentOutOfRangeException ("k");
    	if (k == 0)
    		return new List<int[]> ();
    
    	var maxheap = new BinHeap<XYPoint>(false);
    	var enumerator = points.GetEnumerator ();
    
    	for (int i = 0; i < k; i++) {
    		if (!enumerator.MoveNext ())
    			return maxheap.Select (pt => pt.Points).ToList ();
    		var p = enumerator.Current;
    		long sqDist = ((long)(p[X] - x) * (long)(p[X] - x)) + ((long)(p[Y] - y) * (long)(p[Y] - y));
    		maxheap.Insert (new XYPoint(p, sqDist)); // O(Log k)
    	}
    
    	while (enumerator.MoveNext()) {
    		var p = enumerator.Current;
    		long sqDist = ((long)(p[X] - x) * (long)(p[X] - x)) + ((long)(p[Y] - y) * (long)(p[Y] - y));
    		if (sqDist < maxheap.Peek ().SquaredDist) {
    			maxheap.SwapTop (new XYPoint(p, sqDist)); // O(Log k)
    		}
    	}
    
    	return maxheap.Select (pt => pt.Points).ToList ();
    }
    
    private struct XYPoint : IComparable {
    	public XYPoint(int[] points, long squareDist) {
    		Points = points;
    		SquaredDist = squareDist;
    	}
    	public int[] Points;
    	public long SquaredDist;
    
    	public int CompareTo(object other)
    	{
    		if (other == null) return 1; 
    		return SquaredDist.CompareTo(((XYPoint)other).SquaredDist);
    	}
    }
    
