## Array Duplicates

### September 23, 2011

We have today another exercise from our large file of interview questions:

You are given an array with integers between 1 and 1,000,000. One integer is in the array twice. How can you determine which one?

Your task is to write code to solve the array duplicates problem. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

This sentence is confusing me, do you mean the sort can be reduced to O(n)?

> The sorting solution takes time O(n log n) for the sort plus time O(n) for the search (that could be reduced to O(n) if the items are integers

Clojure implementation of the searching solution using an auxiliary data structure.

@William: check out Wikipedia’s list of non-comparison sorts (about halfway through the page), like counting or radix sort.

I haven’t much time today, so here are some quick Python solutions:

William: That’s poorly worded, but yes. Since the input is guaranteed to be integers over a small range, sorting could be done by radix sort or counting sort in

O(n) time.Solution in Go for first two options. Didn’t think of doing the sorting version until I read the solution.

Went looking for a set in Go first, but an array of bools is probably better in this case.

Why not use a modified quicksort that terminates when the comparison function sees two equal numbers? There is no need to continue the sort when the duplicate number is found.

uh, it’s known that the sum of x from 1 to n is a formula: (n(n+1))/2, right? So, take the sum of your entire input array, subtract off the calculated constant, and you’re left with the duplicate value. No sorting or extra arrays necessary. :-)

@Mike: The integers in the array are not guaranteed to be from 1 to 1000000. They could be

anyvalue between 1 and 1000000, and you could have an arbitrary number of array elements. Your method would work otherwise :)The logic is a mere two lines in racket. Unfortunately, this is assuming that no other integer appears more than once in the array, that there is enough space in memory to hold twice the number of required integers. On the other hand, with Racket’s tail-call optimization, this code runs in linear time (set-operations are constant-time hashing calculations).

My mistake, I meant constant space, not constant time. The recursive calls can’t overflow the stack because of Racket’s use of continuation passing style to avoid ever returning from the tail position. Neat stuff!

I prefer burning lots of memory (in C …)

#define MAX_SIZE 1000000

static unsigned short x[(MAX_SIZE/16)+1] ;

// return index of duplicate

int fdup_arr( int * arr, int n )

{

int ival, istep, ipos, ibit, irec ;

memset( &x, 0, sizeof( x)) ;

for ( istep= 0 ; ( n — ) ; istep ++ ) {

ival= *( arr ++) ;

ipos= ival >> 4 ;

ibit= ival & 0xf ;

if ( ( irec= x[ipos] ) && ( irec & ibit ) ) { return istep ; }

x[ipos]= irec | ibit ;

}

return -1 ;

}

A couple of implementations in python 3

The first uses the sorting technique.

The second uses the searching technique.

The third uses the Counter class from the standard library.

It counts the occurances of each number in the array and then

returns the most common one.

Oopps. I pasted the wrong test code above. Here is the correct test code:

[/sourceode]

Yet another scheme solution, now with 100% more call/cc

> j woolverton said

> September 23, 2011 at 11:56 PM

with the array in example below your function doesn’t work.

Please find a small correction.

It takes more memory, but works.

HASH (size) could be optimized, but that was not required initially ;)

#include

#include

#define MAX_SIZE 1000000

#define MAX 10

static unsigned int x[MAX_SIZE];

// return index of duplicate

int fdup_arr( int * arr, int n )

{

int istep;

bzero( x, sizeof( x )) ;

for ( istep = 0 ; n–; istep++ )

{

if ( x[*arr] )

return istep;

x[*arr++]++;

}

return -1 ;

}

void pr_arr( int *arr, int size)

{

int i;

for( i = 0; i < size; i++ )

printf( "%4d", *arr++ );

printf( "\n" );

}

int main ()

{

int arr[MAX];

int i;

for( i=0; i < MAX; i++ )

arr[i] = i + 1;

arr[4] = 7;

pr_arr( arr, MAX );

printf( "Duplicated index: %d\n", fdup_arr( arr, MAX ) );

}

b = b xor a, if b = 0 return a, else a = b, b = arr[i++]

Maybe it is not the fastest code, but the Python code below works:

Using ruby and a hash table …

[javascript]

/**

* Program that found one duplicate given an array with integers between 1 and

* 1,000,000.

*

* @see https://programmingpraxis.com/2011/09/23/array-duplicates/

* @author rmariuzzo

*/

var App = function() {

this.findDuplicate = function(array) {

var duplicated;

for (var i = 0; i < array.length; i++) {

for (var j = 0; j = 0 && array[j] <= 1000000) {

duplicated = array[i];

break;

}

}

if (duplicated !== undefined) {

break;

}

}

return duplicated;

};

};

[/javascript]

In JavaScript using NodeJS:

Another solution is to sort the list and use a binary search to find the duplicate value. The insight is that the index of an element in the sorted list is either equal to the element or one greater than the element.

Moving the indexes is a little different than vanilla binary search though:

if ary[mid] == mid + 1:

beg = mid + 1

elif ary[mid] == mid:

end = mid – 1

Runs in O((N+1)log N).