Array Duplicates

September 23, 2011

There are essentially two solutions: sorting and searching. In the sorting solution, the vector is sorted in a first stage, then the items in the vector are run through sequentially in a second stage, stopping when an item is equal to its predecessor. In the searching solution, each item in the vector is inserted in an auxiliary data structure, stopping when it is found that the item being inserted is already present. The sorting solution takes time O(n log n) for the sort plus time O(n) for the search (that could be reduced to O(n) if the items are integers, as in the problem statement, but not if the items are some other data type), but only requires as much auxiliary space as the sort needs for the recursive stack, assuming quicksort, which is O(log n); it also has the side effect of scrambling the original order of the vector, unless you make a copy first. The searching solution requires time O(n) to run through the items in the vector, but requires O(n) space to store the auxiliary data structure. Here is our version of the sorting solution, which calls the Bentley/McIlroy vector sort from the Standard Prelude:

(define (sorting vec) (define (cmp lt?) (lambda (a b) (if (lt? a b) -1 (if (lt? b a) 1 0)))) (vector-sort! vec (cmp <)) (do ((i 1 (+ i 1))) ((= (vector-ref vec i) (vector-ref vec (- i 1))) (vector-ref vec i))))

We use a bit vector for the searching solution, because the vector items are all integers over a small range. In other contexts, it might make sense to use hash tables or some kind of balanced tree:

(define (searching vec) (let ((bits (make-vector (vector-length vec) #f))) (do ((i 0 (+ i 1))) ((vector-ref bits (vector-ref vec i)) (vector-ref vec i)) (vector-set! bits (vector-ref vec i) #t))))

The problem statement is not clear that all the integers from one to a million appear in the input vector, though in many variants of the problem that is true; you may wish to ask the interviewer to clarify. If it is true, there is another solution based on Gauss' formula; sum the integers in the array and subtract the sum of the first n integers, so that the difference is the duplicated item.

(define (gaussian vec) (let ((n (vector-length vec))) (do ((i 0 (+ i 1)) (sum 0 (+ sum (vector-ref vec i)))) ((= i n) (- sum (* (- n 1) n 1/2))))))

Sample runs appear below. The sorting solution is given last because it modifies the input vector:

(define x (apply vector (shuffle (cons 176 (range 1 1000001))))) > (gaussian x) 1776 > (searching x) 1776 > (sorting x) 1776

There are doubtless other solutions, and variants on the problem; one variant is to find a missing rather than a duplicated item. But we’ve given the basic solutions, so we’ll stop there. You can run the program at http://programmingpraxis.codepad.org/sGxhES1h. If you look at codepad, you’ll see that the sorting solution is much slower than the other two. That’s because our sort is relatively slow. Replacing our sort with the system sort drastically improves the time, though it is still not as fast as the other two methods.

Posted by programmingpraxis

Filed in Exercises

22 Comments »

22 Responses to “Array Duplicates”

William Stevenson said
September 23, 2011 at 9:19 AM
This sentence is confusing me, do you mean the sort can be reduced to O(n)?
> The sorting solution takes time O(n log n) for the sort plus time O(n) for the search (that could be reduced to O(n) if the items are integers

kawas said

September 23, 2011 at 9:43 AM

Clojure implementation of the searching solution using an auxiliary data structure.

(defn find-one-dup [coll]
  (loop [v (first coll) coll (next coll) vset #{}]
    (cond
      (nil? v) nil
      (vset v)   v
      :else (recur (first coll) (next coll) (conj vset v)))))

Graham said

September 23, 2011 at 12:17 PM

@William: check out Wikipedia’s list of non-comparison sorts (about halfway through the page), like counting or radix sort.

I haven’t much time today, so here are some quick Python solutions:

def searching(arr):
    s = set()
    for a in arr:
        if a in s:
            return a
        else:
            s.add(a)

def sorting(arr):
    brr = sorted(arr)
    for i in xrange(1, len(brr)):
        if brr[i] == brr[i-1]:
            return brr[i]

def gauss(arr):
    # requires that arr contains _all_ natural numbers from one to len(arr)
    n = len(arr)
    return sum(arr) - n * (n - 1) // 2

programmingpraxis said
September 23, 2011 at 12:44 PM
William: That’s poorly worded, but yes. Since the input is guaranteed to be integers over a small range, sorting could be done by radix sort or counting sort in O(n) time.

DGel said

September 23, 2011 at 1:23 PM

Solution in Go for first two options. Didn’t think of doing the sorting version until I read the solution.
Went looking for a set in Go first, but an array of bools is probably better in this case.


func searching(data []int) int {
	had_num := make([]bool, len(data) + 1)
	for _, num := range data {
		if had_num[num] {
			return num
		} else {
			had_num[num] = true
		}
	}	
	return -1 // yay for magic numbers
}

func sorting(data []int) int {
	a := make([]int, len(data)) // don't mess up order of original
	for i := range data {
		a[i] = data[i]
	}
	sort.Ints(a)
	for i := 1; i < len(a); i++ {
		if a[i] == a[i - 1] {
			return a[i]
		}
	}
	return -1 // magic number again
}

Peter Mielke said
September 23, 2011 at 1:24 PM
Why not use a modified quicksort that terminates when the comparison function sees two equal numbers? There is no need to continue the sort when the duplicate number is found.
Mike said
September 23, 2011 at 2:13 PM
uh, it’s known that the sum of x from 1 to n is a formula: (n(n+1))/2, right? So, take the sum of your entire input array, subtract off the calculated constant, and you’re left with the duplicate value. No sorting or extra arrays necessary. :-)
Philip G said
September 23, 2011 at 2:46 PM
@Mike: The integers in the array are not guaranteed to be from 1 to 1000000. They could be any value between 1 and 1000000, and you could have an arbitrary number of array elements. Your method would work otherwise :)
j2kun said
September 23, 2011 at 10:34 PM
The logic is a mere two lines in racket. Unfortunately, this is assuming that no other integer appears more than once in the array, that there is enough space in memory to hold twice the number of required integers. On the other hand, with Racket’s tail-call optimization, this code runs in linear time (set-operations are constant-time hashing calculations).
```
#lang racket

(define (find-dupe list-with-dupes)
  (with-handlers ([number? (λ (x) x)])
      (find-dupe-recurse list-with-dupes (set))))

(define (find-dupe-recurse list a-set)
  (cond [(empty? list) (error "no duplicate found")]
        [(set-member? a-set (first list)) (raise (first list))]
        [else (find-dupe-recurse (rest list) (set-add a-set (first list)))]))
```
j2kun said
September 23, 2011 at 11:20 PM
My mistake, I meant constant space, not constant time. The recursive calls can’t overflow the stack because of Racket’s use of continuation passing style to avoid ever returning from the tail position. Neat stuff!
j woolverton said
September 23, 2011 at 11:56 PM
I prefer burning lots of memory (in C …)

#define MAX_SIZE 1000000

static unsigned short x[(MAX_SIZE/16)+1] ;

// return index of duplicate
int fdup_arr( int * arr, int n )
{
int ival, istep, ipos, ibit, irec ;

memset( &x, 0, sizeof( x)) ;

for ( istep= 0 ; ( n — ) ; istep ++ ) {
ival= *( arr ++) ;
ipos= ival >> 4 ;
ibit= ival & 0xf ;
if ( ( irec= x[ipos] ) && ( irec & ibit ) ) { return istep ; }
x[ipos]= irec | ibit ;
}

return -1 ;
}

Mike said

September 24, 2011 at 12:53 AM

A couple of implementations in python 3

The first uses the sorting technique.

The second uses the searching technique.

The third uses the Counter class from the standard library.
It counts the occurances of each number in the array and then
returns the most common one.

#### 1
from itertools import dropwhile
from itertools_recipies import pairwise

def sort(array):
    return next(dropwhile(lambda t:t[0]!=t[1], pairwise(sorted(array))))[0]

#### 2
def search(array):
    seen = set()
    return next(filter(None,(e if e in seen else seen.add(e) for e in array)))

#### 3
from collections import Counter

def counter(array):
    return Counter(array).most_common(1)[0][0]


#test
from random import sample, randint

# make random list of 200,000 integers then randomly duplicate one of them
array = list(set(sample(range(1,1000000),200000)))
i,j = 0,0
while i == j:
    i, j = randint(0,len(array)), randint(0,len(array))
array[i] = array[j]
print("Answer is {}".format(array[i]))

print("finddup1() -> {}".format(finddup1(array)))
print("finddup2() -> {}".format(finddup2(array)))
print("finddup3() -> {}".format(finddup3(array)))

Mike said

September 24, 2011 at 1:01 AM

Oopps. I pasted the wrong test code above. Here is the correct test code:

####test
from random import sample, randint

# make random list of 200,000 integers then randomly duplicate one of them
array = sample(range(1,1000000),200000)

i,j = 0,0
while i == j:
    i, j = randint(0,len(array)), randint(0,len(array))

array[i] = array[j]
print("Answer is {}".format(array[i]))

for f in (sort, search, counter):
    print("{0.__name__}() -> {1}".format(f, f(array)))
##

[/sourceode]

ijp said

September 24, 2011 at 1:56 AM

Yet another scheme solution, now with 100% more call/cc

#!r6rs
(import (rnrs)
        (srfi :27 random-bits)
        (only (srfi :1 lists) iota))

(define (shuffle x)
  ;; definition from https://programmingpraxis.com/contents/standard-prelude/
  (do ((v (list->vector x)) (n (length x) (- n 1)))
      ((zero? n) (vector->list v))
    (let* ((r (random-integer n)) (t (vector-ref v r)))
      (vector-set! v r (vector-ref v (- n 1)))
      (vector-set! v (- n 1) t))))

(define (pick-duplicate list)
  (call-with-current-continuation
   (lambda (break)
     (list-sort (lambda (x y)
                  (if (= x y)
                      (break x)
                      (< x y)))
                list))))


(write (pick-duplicate (shuffle (cons 176 (iota 1000000)))))
(newline)

sergant said
September 24, 2011 at 2:56 AM
> j woolverton said
> September 23, 2011 at 11:56 PM

with the array in example below your function doesn’t work.
Please find a small correction.
It takes more memory, but works.
HASH (size) could be optimized, but that was not required initially ;)

#include
#include

#define MAX_SIZE 1000000
#define MAX 10

static unsigned int x[MAX_SIZE];

// return index of duplicate
int fdup_arr( int * arr, int n )
{
int istep;

bzero( x, sizeof( x )) ;

for ( istep = 0 ; n–; istep++ )
{
if ( x[*arr] )
return istep;

x[*arr++]++;
}

return -1 ;
}

void pr_arr( int *arr, int size)
{
int i;

for( i = 0; i < size; i++ )
printf( "%4d", *arr++ );

printf( "\n" );
}

int main ()
{
int arr[MAX];
int i;

for( i=0; i < MAX; i++ )
arr[i] = i + 1;
arr[4] = 7;

pr_arr( arr, MAX );

printf( "Duplicated index: %d\n", fdup_arr( arr, MAX ) );
}
temp said
September 24, 2011 at 4:48 AM
b = b xor a, if b = 0 return a, else a = b, b = arr[i++]
Paul Hofstra said
September 25, 2011 at 3:49 PM
Maybe it is not the fastest code, but the Python code below works:
```
def find(arr):
    return sum(arr) - sum(set(arr))
```

slabounty said

September 26, 2011 at 3:54 AM

Using ruby and a hash table …

def find_duplicate(a)
    h = {}
    a.each do |v| 
        return v if h[v] != nil
        h[v] = v
    end
end

a = [2, 3, 10, 20, 3, 1, 7, 9, 44, 200]
puts "Duplicate = #{find_duplicate(a)}"

Axio said

September 26, 2011 at 3:55 AM

exception Found of int;;

module IntOrd = struct type t = int let compare = ( - ) end
module IntMap = Map.Make(IntOrd);;

let find_dup2 arr =
  Array.fold_left
  (fun m v ->
    try
      ignore (IntMap.find v m);
      raise (Found(v))
    with Not_found -> IntMap.add v true m)
  IntMap.empty
  arr;;


let my_array = [|1; 2; 4; 3; 1|];;
find_dup2 my_array;;

Rubens Mariuzzo (@rmariuzzo) said
September 26, 2011 at 6:53 PM
[javascript]
/**
* Program that found one duplicate given an array with integers between 1 and
* 1,000,000.
*
* @see https://programmingpraxis.com/2011/09/23/array-duplicates/
* @author rmariuzzo
*/
var App = function() {

this.findDuplicate = function(array) {
var duplicated;
for (var i = 0; i < array.length; i++) {
for (var j = 0; j = 0 && array[j] <= 1000000) {
duplicated = array[i];
break;
}
}
if (duplicated !== undefined) {
break;
}
}
return duplicated;
};

};
[/javascript]

Rubens Mariuzzo (@rmariuzzo) said

September 26, 2011 at 6:55 PM

In JavaScript using NodeJS:

/**
 * Program that found one duplicate given an array with integers between 1 and
 * 1,000,000.
 * 
 * @see https://programmingpraxis.com/2011/09/23/array-duplicates/
 * @author rmariuzzo
 */
var App = function() {
    
    this.findDuplicate = function(array) {
        var duplicated;
        for (var i = 0; i < array.length; i++) {
            for (var j = 0; j = 0 && array[j] <= 1000000) {
                    duplicated = array[i];
                    break;
                }
            }
            if (duplicated !== undefined) {
                break;
            }
        }
        return duplicated;
    };
    
};

tcbcb said
September 30, 2011 at 11:06 PM
Another solution is to sort the list and use a binary search to find the duplicate value. The insight is that the index of an element in the sorted list is either equal to the element or one greater than the element.

Moving the indexes is a little different than vanilla binary search though:

if ary[mid] == mid + 1:
beg = mid + 1
elif ary[mid] == mid:
end = mid – 1

Runs in O((N+1)log N).

Programming Praxis