## Amazon Interview Question

### November 27, 2012

My reaction to this question is that it is impossible, and the interviewer is testing me to see what I will do with it. The best solution I can come up with requires visiting each of the *n* points, calculating the distance from each point to the origin using the Pythagorean theorem, then inserting each distance/point pair into a heap of size *k*, reporting the *k*th closest points by emptying the heap after all the points have been read. But that algorithm is *o*(*n* log *k*), which fails to meet the requirements. In an interview, I would explain that to the interviewer, then challenge him to state an *o*(*n*) solution. However, the source from which this question came apparently accepts that solution, on the grounds that log(100) is a constant, so that is the solution that we will write, even though it seems to me to be cheating. Bah! Humbug!

`(define heap #f)`

Normally we would compute the distance of a point (*x*, *y*) from (0, 0) as the square root of the sum of *x*^{2} and *y*^{2}. But since we are only comparing distances, and don’t care about the absolute difference, we forego the square root.

`(define (closest points k)`

(set! heap (make-vector k (list -1 0 0)))

(do ((points points (cdr points)))

((null? points) heap)

(let ((dist (+ (square (caar points)) (square (cadar points)))))

(when (lt? (vector-ref heap 0) (list dist))

(vector-set! heap 0 (cons dist (car points)))

(siftdown k)))))

The heap is a maxheap, with the largest element in the heap stored at the root. Thus, when we compare the distance, we replace the largest-of-the-least only when the new distance is smaller.

Points are represented as three-slot lists with distance from (0, 0) in the car, the *x* coordinate in the cadr, and the *y* coordinate in the caddr. The heap is stored in a vector *h* of length *k* with the maximum distance at *h*[0] and each point *h*[*i*] greater than *h*[2*i*] and *h*[2*i*+1], initialized with all distances equal to infinity. When a new point is inserted into the heap, it is immediately rejected if its distance is greater than the heap maximum, otherwise it replaces the heap maximum and is sifted down the heap to its final position.

`(define (siftdown len)`

(let loop ((parent 0))

(let ((child (+ parent parent 1)))

(when (< child len)

(if (and (< (+ child 1) len) ; (+ child 1) is sibling

(lt? (vector-ref heap (+ child 1))

(vector-ref heap child)))

(set! child (+ child 1))) ; the least child

(when (lt? (vector-ref heap child)

(vector-ref heap parent))

(let ((temp (vector-ref heap child)))

(vector-set! heap child (vector-ref heap parent))

(vector-set! heap parent temp))

(loop child))))))

We described leftist-heaps, pairing heaps, maxiphobic heaps, and binomial heaps in previous exercises. The heap shown above is another type of heap, typically used in imperative languages when it is convenient to store a heap in an array.

The last thing we need is a less-than function that recognizes infinity, which is represented as -1, since all distances are positive; we insert a `not`

because our `siftheap`

is coded as a minheap, but we need a maxheap:

`(define (lt? pt1 pt2)`

(not

(if (negative? (car pt1)) #f

(if (negative? (car pt2)) #t

(< (car pt1) (car pt2))))))

To demonstrate the function, we generate a thousand random (*x*, *y*) points with -1000 <= *x*, *y* < 1000 then find the ten closest to (0, 0):

`> (define points`

(let loop ((n 1000) (pts (list)))

(if (zero? n) pts

(loop (- n 1) (cons (list (randint -1000 1000)

(randint -1000 1000)) pts)))))

> (closest points 10)

(480573 18 693)

(462376 530 -426)

(192596 -436 50)

(221341 -210 421)

(458330 -677 -1)

(158089 380 -117)

(153722 29 -391)

(62057 109 224)

(34064 -92 -160)

(290132 -526 116)

We used `square`

and `randint`

from the Standard Prelude. You can run the program at http://programmingpraxis.codepad.org/ZmUkqk5Y. This exercise is similar to the exercise Ullman’s Puzzle, which collected elements, sorted them in time *o*(*n* log *n*), then took the first *k* elements. Another solution is to collect all the elements in a list, then use the selection algorithm to identify the *k* smallest elements.

Pages: 1 2

[...] today’s Programming Praxis exercise, our goal is to find the 100 points closest to (0, 0) out of a list of [...]

My Haskell solution (see http://bonsaicode.wordpress.com/2012/11/27/programming-praxis-amazon-interview-question/ for a version with comments):

Why not insert into a min heap, and limit its size to 100 by removing the top item on insertion having reached the limit? That should make it linear, right?

A Python version. To my opinion this is linear, as long as n is much smaller than the length of the xypoints array. I checked it with N (length of xypoints) equal to 100000 and with values for n of 10 and 100. The time taken is about the same.

What do you think?

Because 100 is a factor, even if you use insert sort (in an already sorted array since you’re building it), the complexity of the algorithm will still be O(n), no ?

Actually the problem can be solved in O(n) no matter how big k is (here k = 100). This problem is the 2D version of finding the first k smallest elements. Below is an algorithm that uses the Quicksort’s idea of splitting the list in 2 sub-lists, but the difference is that only one recursion is used. Due to the randomization the complexity is about T(n) = T(n/2) + O(n) which has the solution T(n) = O(n). Still the worst case could be O(n^2) but this is highly improbable. Another key points that are worth mentioning:

[1] If a min-heap (not necessarily with the size fixed to k) is used the complexity would be O(n + k*log(n)) (O(n) for building the heap and O(k*log(n)) for k extractions) But if k < n/log(n) (which is the case here) the complexity will still be O(n).

[2] There is another algorithm that splits the list in buckets of 5 elements that has the complexity O(n) in the worst case scenario. A starting point in understanding that algorithm could be found http://en.wikipedia.org/wiki/Selection_algorithm

If, indeed, the question was phrased like “given a million points (x, y), give an O(n) solution to find the 100 points closest to (0, 0),” it was a trick.

One thing is that “n” is not defined, which calls for making an assumption or, better yet, changing the question before attempting to answer it.

Another trick, is understanding the math of big-O: if you limit in advance the total number of points, there are no variables affecting the time of calculation, making any valid solution O(1), as in “runs no longer than predetermined time”.

Once you add variables to the task, you can start talking about big-O. “Given n points (x, y), give an O(n) solution to find the 100 points closest to (0, 0)” – now that question is different, since you may be given 5, or a million, or 10^100, or any other number of points, and you want an algorithm that would always run no longer than (constant seconds * n).

What you assumed originally, was yet another question: “Given n points (x, y), give an O(n) solution to find k points closest to (0, 0)”. That was a misunderstanding leading to the “cheating” feeling.

My take on this in Java:

https://gist.github.com/4156657

The worst case O(n) solution using the algorithm described in http://www.cse.ust.hk/~dekai/271/notes/L05/L05.pdf :

[...] Pages: 1 2 [...]

Wow! This question generated lots of interest. Page hits have been about half again greater than normal for both of the last two days.

One of the questions raised in the comments is about the time complexity of the suggested solution. It is indeed O(n log k), not O(n), so it’s not linear. Just because k is very much smaller than n doesn’t change the theoretical time complexity, though in practice the extra factor of log k may not be noticeable.

It looks like I won’t pass the Amazon interview, as my algorithm was not linear and thus failed to answer the question. A linear algorithm is possible by using select to find the k’th item, as a couple of people have pointed out. A previous exercise gave an expected-time linear algorithm for select based on randomization; on Friday, we’ll look at a guaranteed-time linear algorithm for select based on median-of-medians.

Thanks to all who responded.

The question can be approached as follows

Given the list of numbers we can find the corresponding distance from the origin. While computing the distance from the origin we also keep the track on number of points(x, y) that have the distance between a < s <= b. Suppose the distance of a point(x, y) came out to be a value A < S <= B, then we increment the count corresponding to Bth index of the array(where we are keeping the track of points) by one. At the end of the exercise when all the point are processed we will have the total count of points with distance a < s <= b at index b of the count array. Using the count array we can then find the cumulative number of points with distance s = 100. The index where sum exceeds or equals 100 gives the radii of the circle which marks the boundary for the first 100 points that lie closest to the origin. Now another linear traversal on the original points list could be done to include all those points which have distance from the origin d <= radii, till the number of points in the list < 100.

The time complexity of the algorithm would be 2*O(N) + O(M), N = number of points, M = distance of a point that lie farthest from the origin, in worst case.

O(M) additional space complexity.

Sort the points by distance from the origin in linear time using radix sort.

Here’s my solution in Python:

It has two FOR loops, but one is only for returning k points (100 for the example). The FOR loop that cycles through the million points doesn’t use any logic – it only creates the hypotenuse for sorting to find the closest points.

The post directly above mine that does a sort can hardly be considered O(n) as the sort is going to be O(n log n) complex.

Ha nice – came up with the same solution present in page 2: forgo computing the roots, and use a max heap.

I also went further an created a special bin heap operation called SwapTop, which Pops the top (max) element and inserts a new element at worst O(log n) time, where n = 100.

Here is my c# solution: