Amazon Interview Question

November 27, 2012

My reaction to this question is that it is impossible, and the interviewer is testing me to see what I will do with it. The best solution I can come up with requires visiting each of the n points, calculating the distance from each point to the origin using the Pythagorean theorem, then inserting each distance/point pair into a heap of size k, reporting the kth closest points by emptying the heap after all the points have been read. But that algorithm is o(n log k), which fails to meet the requirements. In an interview, I would explain that to the interviewer, then challenge him to state an o(n) solution. However, the source from which this question came apparently accepts that solution, on the grounds that log(100) is a constant, so that is the solution that we will write, even though it seems to me to be cheating. Bah! Humbug!

(define heap #f)

Normally we would compute the distance of a point (x, y) from (0, 0) as the square root of the sum of x² and y². But since we are only comparing distances, and don’t care about the absolute difference, we forego the square root.

(define (closest points k) (set! heap (make-vector k (list -1 0 0))) (do ((points points (cdr points))) ((null? points) heap) (let ((dist (+ (square (caar points)) (square (cadar points))))) (when (lt? (vector-ref heap 0) (list dist)) (vector-set! heap 0 (cons dist (car points))) (siftdown k)))))

The heap is a maxheap, with the largest element in the heap stored at the root. Thus, when we compare the distance, we replace the largest-of-the-least only when the new distance is smaller.

Points are represented as three-slot lists with distance from (0, 0) in the car, the x coordinate in the cadr, and the y coordinate in the caddr. The heap is stored in a vector h of length k with the maximum distance at h[0] and each point h[i] greater than h[2i] and h[2i+1], initialized with all distances equal to infinity. When a new point is inserted into the heap, it is immediately rejected if its distance is greater than the heap maximum, otherwise it replaces the heap maximum and is sifted down the heap to its final position.

(define (siftdown len) (let loop ((parent 0)) (let ((child (+ parent parent 1))) (when (< child len) (if (and (< (+ child 1) len) ; (+ child 1) is sibling (lt? (vector-ref heap (+ child 1)) (vector-ref heap child))) (set! child (+ child 1))) ; the least child (when (lt? (vector-ref heap child) (vector-ref heap parent)) (let ((temp (vector-ref heap child))) (vector-set! heap child (vector-ref heap parent)) (vector-set! heap parent temp)) (loop child))))))

We described leftist-heaps, pairing heaps, maxiphobic heaps, and binomial heaps in previous exercises. The heap shown above is another type of heap, typically used in imperative languages when it is convenient to store a heap in an array.

The last thing we need is a less-than function that recognizes infinity, which is represented as -1, since all distances are positive; we insert a not because our siftheap is coded as a minheap, but we need a maxheap:

(define (lt? pt1 pt2) (not (if (negative? (car pt1)) #f (if (negative? (car pt2)) #t (< (car pt1) (car pt2))))))

To demonstrate the function, we generate a thousand random (x, y) points with -1000 <= x, y < 1000 then find the ten closest to (0, 0):

> (define points (let loop ((n 1000) (pts (list))) (if (zero? n) pts (loop (- n 1) (cons (list (randint -1000 1000) (randint -1000 1000)) pts))))) > (closest points 10) (480573 18 693) (462376 530 -426) (192596 -436 50) (221341 -210 421) (458330 -677 -1) (158089 380 -117) (153722 29 -391) (62057 109 224) (34064 -92 -160) (290132 -526 116)

We used square and randint from the Standard Prelude. You can run the program at http://programmingpraxis.codepad.org/ZmUkqk5Y. This exercise is similar to the exercise Ullman’s Puzzle, which collected elements, sorted them in time o(n log n), then took the first k elements. Another solution is to collect all the elements in a list, then use the selection algorithm to identify the k smallest elements.

Posted by programmingpraxis

Filed in Exercises

17 Comments »

17 Responses to “Amazon Interview Question”

Programming Praxis – Amazon Interview Question « Bonsai Code said
November 27, 2012 at 9:57 AM
[…] today’s Programming Praxis exercise, our goal is to find the 100 points closest to (0, 0) out of a list of […]

Remco Niemeijer said

November 27, 2012 at 9:57 AM

My Haskell solution (see http://bonsaicode.wordpress.com/2012/11/27/programming-praxis-amazon-interview-question/ for a version with comments):

import Control.Applicative
import qualified Data.IntMap as I
import System.Random

closest :: Int -> [(Int, Int)] -> [(Int, Int)]
closest n = take n . concat . I.elems . I.fromListWith (flip (++)) .
            map (\(x,y) -> (x*x+y*y, [(x,y)]))

points :: IO [(Int, Int)]
points = take 1000000 <$> liftA2 zip (randomRs (-1000, 1000) <$> newStdGen)
                                     (randomRs (-1000, 1000) <$> newStdGen)

main :: IO ()
main = print . closest 100 =<< points

sbmassey said
November 27, 2012 at 12:43 PM
Why not insert into a min heap, and limit its size to 100 by removing the top item on insertion having reached the limit? That should make it linear, right?

Paul said

November 27, 2012 at 12:51 PM

A Python version. To my opinion this is linear, as long as n is much smaller than the length of the xypoints array. I checked it with N (length of xypoints) equal to 100000 and with values for n of 10 and 100. The time taken is about the same.

def nsmallest(xypoints, n):
    """Use a heap with the maximum of the n lowest at the top of the heap"""
    # first fill the heap with the first n points
    heap = [(-hypot(x, y), (x, y)) for x, y in xypoints[:n]]
    heapify(heap)
    # now push new elements on heap only if distance small enough
    for x, y in xypoints[n:]:
        dist = hypot(x, y)
        if dist < -heap[0][0] :
            heappushpop(heap, (-dist, (x, y))) # remove "worst"
    return heap

kawas44 said
November 27, 2012 at 1:55 PM
What do you think?
Because 100 is a factor, even if you use insert sort (in an already sorted array since you’re building it), the complexity of the algorithm will still be O(n), no ?
cosmin said
November 27, 2012 at 2:58 PM
Actually the problem can be solved in O(n) no matter how big k is (here k = 100). This problem is the 2D version of finding the first k smallest elements. Below is an algorithm that uses the Quicksort’s idea of splitting the list in 2 sub-lists, but the difference is that only one recursion is used. Due to the randomization the complexity is about T(n) = T(n/2) + O(n) which has the solution T(n) = O(n). Still the worst case could be O(n^2) but this is highly improbable. Another key points that are worth mentioning:

[1] If a min-heap (not necessarily with the size fixed to k) is used the complexity would be O(n + k*log(n)) (O(n) for building the heap and O(k*log(n)) for k extractions) But if k < n/log(n) (which is the case here) the complexity will still be O(n).

[2] There is another algorithm that splits the list in buckets of 5 elements that has the complexity O(n) in the worst case scenario. A starting point in understanding that algorithm could be found http://en.wikipedia.org/wiki/Selection_algorithm
```
from random import randint
from math import sqrt

def dist(point):
	return sqrt(point[0]**2 + point[1]**2)

def k_smallest(k, a):
	if k == len(a): return a
	splitElm = a[randint(0, len(a) - 1)]
	a1 = [e for e in a if dist(e) <= dist(splitElm)]
	a2 = [e for e in a if dist(e) > dist(splitElm)]
	if k <= len(a1):
		return k_smallest(k, a1)
	else:
		return a1 + k_smallest(k - len(a1), a2)

print k_smallest(3, [(0, 0), (1, 7), (2, 2), (3, 2), (1, 4), (3, 0)])
```
Andrey said
November 27, 2012 at 3:01 PM
If, indeed, the question was phrased like “given a million points (x, y), give an O(n) solution to find the 100 points closest to (0, 0),” it was a trick.

One thing is that “n” is not defined, which calls for making an assumption or, better yet, changing the question before attempting to answer it.

Another trick, is understanding the math of big-O: if you limit in advance the total number of points, there are no variables affecting the time of calculation, making any valid solution O(1), as in “runs no longer than predetermined time”.

Once you add variables to the task, you can start talking about big-O. “Given n points (x, y), give an O(n) solution to find the 100 points closest to (0, 0)” – now that question is different, since you may be given 5, or a million, or 10^100, or any other number of points, and you want an algorithm that would always run no longer than (constant seconds * n).

What you assumed originally, was yet another question: “Given n points (x, y), give an O(n) solution to find k points closest to (0, 0)”. That was a misunderstanding leading to the “cheating” feeling.

razvan said

November 27, 2012 at 8:03 PM

My take on this in Java:

	/*
	* Exercise from https://programmingpraxis.com/2012/11/27/amazon-interview-question/
	* Given a million points (x, y), give an O(n) solution to find the 100 points closest to (0, 0).
	* Well, you have to read "a million" as "n" and "100" as "100".
	* Uses a heap of size 100. The complexity-theoretic trick is that 100 is a constant here.
	*/

	import java.io.File;
	import java.io.FileNotFoundException;
	import java.util.PriorityQueue;
	import java.util.Scanner;


	public class Main {

	static int MAX_PRIORITY_QUEUE_SIZE = 100;

	/**
	* @param args
	*/
	public static void main(String[] args) {
	try {
	PriorityQueue<Point> pq = new PriorityQueue<Point>(MAX_PRIORITY_QUEUE_SIZE, new ReversePointComparator());
	if(args.length == 0) {
	System.out.println("Give me a filename");
	System.exit(0);
	}
	Scanner sc = new Scanner(new File(args[0]));
	int n = sc.nextInt();
	Point num;
	for(int i=0; i<MAX_PRIORITY_QUEUE_SIZE && i<n; i++) {
	num = new Point(sc.nextDouble(), sc.nextDouble());
	pq.add(num);
	}
	for(int i=MAX_PRIORITY_QUEUE_SIZE; i<n; i++)
	{
	num = new Point(sc.nextDouble(), sc.nextDouble());
	if(num.getDist() < pq.peek().getDist())
	{
	pq.poll();
	pq.add(num);
	}
	}
	Point [] pts = new Point[pq.size()];
	pts = pq.toArray(pts);
	for(int i=0; i<pts.length; i++)
	System.out.print(pts[i] + " ");
	System.out.println();
	} catch (FileNotFoundException e) {
	// TODO Auto-generated catch block
	e.printStackTrace();
	}

	}

	}

view raw

Main.java

hosted with ❤ by GitHub


	public class Point {

	double x,y;
	double dist;
	public Point(double d, double e) {
	this.x = d;
	this.y = e;
	dist = Math.sqrt(dd + ee);
	}

	public double getDist() {
	return dist;
	}

	@Override
	public String toString() {
	return "(" + x +" ," + y +")";
	}

	}

view raw

Point.java

hosted with ❤ by GitHub

	import java.util.Comparator;


	public class ReversePointComparator implements Comparator<Point> {

	@Override
	public int compare(Point p1, Point p2) {
	return (int) (p2.getDist() – p1.getDist());
	}

	}

view raw

ReversePointComparator.java

hosted with ❤ by GitHub

cosmin said

November 27, 2012 at 11:47 PM

The worst case O(n) solution using the algorithm described in http://www.cse.ust.hk/~dekai/271/notes/L05/L05.pdf :

from math import sqrt, ceil

def norm(point):
	return sqrt(point[0]**2 + point[1]**2)

def k_smallest(k, a, f):
	n, medians = len(a), []
	# solve the particular cases
	if k == n: return a[:]
	if n <= 5: return sorted(a, key=f)[:k]
	# split the list in n/5 buckets and compute their medians
	for i in xrange( int(ceil(n / 5.)) ):
		start, end = 5*i, min(5*(i + 1), n)
		b = sorted(a[start:end], key=f)
		medians.append(b[(end - start) // 2])
	# find the median of the medians using recursivity with k=n/2
	splitElm = max(k_smallest(n // 2, medians, f))
	# split the list in 2 sub-lists picking splitElm as a pivot 
	a1 = [e for e in a if f(e) < f(splitElm)]
	a2 = [e for e in a if f(e) >= f(splitElm)]
	# use again the recursivity on one of the sub-lists to solve the problem
	if k <= len(a1):
		return k_smallest(k, a1, f)
	else:
		return a1 + k_smallest(k - len(a1), a2, f)

print k_smallest(3, [(0, 0), (1, 7), (2, 2), (3, 2), (1, 4), (3, 0)], norm)

Amazon Interview Question | My Blog said
November 28, 2012 at 3:24 AM
[…] Pages: 1 2 […]
programmingpraxis said
November 29, 2012 at 3:05 AM
Wow! This question generated lots of interest. Page hits have been about half again greater than normal for both of the last two days.

One of the questions raised in the comments is about the time complexity of the suggested solution. It is indeed O(n log k), not O(n), so it’s not linear. Just because k is very much smaller than n doesn’t change the theoretical time complexity, though in practice the extra factor of log k may not be noticeable.

It looks like I won’t pass the Amazon interview, as my algorithm was not linear and thus failed to answer the question. A linear algorithm is possible by using select to find the k’th item, as a couple of people have pointed out. A previous exercise gave an expected-time linear algorithm for select based on randomization; on Friday, we’ll look at a guaranteed-time linear algorithm for select based on median-of-medians.

Thanks to all who responded.
HBK said
December 11, 2012 at 7:02 PM
The question can be approached as follows
Given the list of numbers we can find the corresponding distance from the origin. While computing the distance from the origin we also keep the track on number of points(x, y) that have the distance between a < s <= b. Suppose the distance of a point(x, y) came out to be a value A < S <= B, then we increment the count corresponding to Bth index of the array(where we are keeping the track of points) by one. At the end of the exercise when all the point are processed we will have the total count of points with distance a < s <= b at index b of the count array. Using the count array we can then find the cumulative number of points with distance s = 100. The index where sum exceeds or equals 100 gives the radii of the circle which marks the boundary for the first 100 points that lie closest to the origin. Now another linear traversal on the original points list could be done to include all those points which have distance from the origin d <= radii, till the number of points in the list < 100.

The time complexity of the algorithm would be 2*O(N) + O(M), N = number of points, M = distance of a point that lie farthest from the origin, in worst case.
O(M) additional space complexity.
Rob Mayoff said
December 11, 2012 at 7:52 PM
Sort the points by distance from the origin in linear time using radix sort.

Brian said

December 12, 2012 at 12:42 AM

Here’s my solution in Python:

import math
from operator import itemgetter
#   Accepts point set n and closest number of points k
#   e.g. n=[[3,4],[-1,2],[4,4],[-5,-2],[1,-3]]
def O(n,k):
    i=[]
    closest=[]
    for point in n:
        dist=math.sqrt(point[0]**2+point[1]**2)
        i.append((point[0],point[1],dist))
    i=sorted(i,key=itemgetter(2))
    for x in range(0,k):
        closest.append([i[x][0],i[x][1]])
    return closest

It has two FOR loops, but one is only for returning k points (100 for the example). The FOR loop that cycles through the million points doesn’t use any logic – it only creates the hypotenuse for sorting to find the closest points.

Philip G said
December 30, 2012 at 3:12 PM
The post directly above mine that does a sort can hardly be considered O(n) as the sort is going to be O(n log n) complex.

brooknovak said

June 17, 2013 at 10:01 AM

Ha nice – came up with the same solution present in page 2: forgo computing the roots, and use a max heap.
I also went further an created a special bin heap operation called SwapTop, which Pops the top (max) element and inserts a new element at worst O(log n) time, where n = 100.
Here is my c# solution:

public static List<int[]> GetClosestPoints (IEnumerable<int[]> points, int x, int y, int k)
{
	if (points == null)
		throw new ArgumentNullException ("points");
	if (k < 0)
		throw new ArgumentOutOfRangeException ("n");
	if (k == 0)
		return new List<int[]> ();

	var maxheap = new BinHeap<XYPoint>(false);
	var enumerator = points.GetEnumerator ();

	for (int i = 0; i < k; i++) {
		if (!enumerator.MoveNext ())
			return maxheap.Select (pt => pt.Points).ToList ();
		var p = enumerator.Current;
		long sqDist = ((long)(p[X] - x) * (long)(p[X] - x)) + ((long)(p[Y] - y) * (long)(p[Y] - y));
		maxheap.Insert (new XYPoint(p, sqDist)); // O(Log k)
	}

	while (enumerator.MoveNext()) {
		var p = enumerator.Current;
		long sqDist = ((long)(p[X] - x) * (long)(p[X] - x)) + (long)((p[Y] - y) * (long)(p[Y] - y));
		if (sqDist < maxheap.Peek ().SquaredDist) {
			maxheap.SwapTop (new XYPoint(p, sqDist)); // O(Log k)
		}
	}

	return maxheap.Select (pt => pt.Points).ToList ();
}

private struct XYPoint : IComparable {
	public XYPoint(int[] points, long squareDist) {
		Points = points;
		SquaredDist = squareDist;
	}
	public int[] Points;
	public long SquaredDist;

	public int CompareTo(object other)
	{
		if (other == null) return 1; 
		return SquaredDist.CompareTo(((XYPoint)other).SquaredDist);
	}
}

JavaScript, CSS: interview questions | EugeneBichel's Blog said
October 1, 2015 at 1:01 PM
[…] from an array of integers find 10 numbers closest to a given number amazon […]

Programming Praxis