Sorting Without Duplicates

February 7, 2017

My first thought was to build up a frequency table of the items in the array, perhaps in an AVL tree, then scan the tree to produce the desired output. But that doesn’t meet the requirement to perform the work in-place, and it won’t have O(n) time complexity.

Thus, my suggestion is to sort the array by any desired method, using some kind of counting sort or radix sort to get O(n) time complexity, or any O(n log n) sorting algorithm if you don’t care about that, then scan the array, moving duplicates to the end. When the first portion of the array is sorted without duplicates, reverse the remainder of the array and process that part of the array recursively, stopping when there is no sub-array left to process. That’s a sort followed by a scan, so it’s linear if the sort is linear. The trick is the “moving duplicates to the end” part, which can easily become non-linear. It works better with lists, which even if they’re not in-place at least don’t require more storage than the original. Function partition scans a sorted input to find the first partition, returning it along with the sorted remainder; function sort-dups handles one partition after another, appending them at the end:

(define (partition lt? xs) ; assumes sorted xs
  (let loop ((xs xs) (prev #f) (ys (list)) (zs (list)))
    (cond ((null? xs) (values (reverse ys) (reverse zs)))
          ((equal? (car xs) prev)
            (loop (cdr xs) prev ys (cons (car xs) zs)))
          (else (loop (cdr xs) (car xs) (cons (car xs) ys) zs)))))

(define (sort-dups lt? xs)
  (let loop ((xs xs) (zs (list)))
    (if (null? xs)
        (apply append (reverse zs))
        (call-with-values
          (lambda () (partition lt? xs))
          (lambda (first rest)
            (loop rest (cons first zs)))))))

And here is the function at work on some sample data:

> (sort-dups < (sort < '(2 9 1 5 1 4 9 7 2 1 4)))
(1 2 4 5 7 9 1 2 4 9 1)
> (sort-dups < (sort < '(1 1 1 1 1 1 1 1 1)))
(1 1 1 1 1 1 1 1 1)
> (sort-dups < (sort < '()))
()
> (sort-dups < (sort < '(7 3 1 6 2 5 4)))
(1 2 3 4 5 6 7)

Time complexity of sort-dups is O(n²) in the worst case, where all the integers are identical, but O(n log n) in the average case. I don’t see any way to reduce that, but maybe one of my smart readers has a better idea. I thought of using a hash table to collect the frequencies, but you still have to sort the keys, which makes the problem O(n log n), and you still have to deal with the case of identical integers.

You can run the program at http://ideone.com/j0tZR8.

Posted by programmingpraxis

Filed in Exercises

9 Comments »

9 Responses to “Sorting Without Duplicates”

Paul said

February 7, 2017 at 3:02 PM

In Python. This solution is not in-place.

from collections import Counter
def sortwdup(seq):
    c = Counter(seq)
    result = [[] for _ in range(max(c.values()) if c else 0)]
    for e in sorted(c.keys()):
        for L in result[:c[e]]:
            L.append(e)
    return [r for res in result for r in res]

Rutger said

February 7, 2017 at 3:03 PM

In Python. I believe it is O(n) (as Python counter dictionary takes n inserts) and the second for loop has exactly n elements.

from collections import Counter
from collections import defaultdict

example = [2, 9, 1, 5, 1, 4, 9, 7, 2, 1, 4]


ctr = Counter(example)

result = defaultdict(list)
for v in ctr:
	for i in range(ctr[v]):
		result[i] += [v]
print(result)

# output
#defaultdict(<class 'list'>, {0: [1, 2, 4, 5, 7, 9], 1: [1, 2, 4, 9], 2: [1]})

Paul said
February 7, 2017 at 7:38 PM
@Rutger: You have to sort ctr in line 9, as the order is otherwise not defined. Try for example to multiply all entries of example with 101 and see what happens. (I get then [707, 404, 101, 505, 202, 909, 404, 101, 202, 909, 101])

Jussi Piitulainen said

February 7, 2017 at 8:34 PM

A sort (could be in-place) followed by a reordering that is much like mergesort (but requires not only ordered input but a disjoint split, because I was not able to deal with ties correctly when combining the parts). It might be O(n log n), even in the worst case, and it might not be impossible to do in-place or with less space than here. The combination step attempts to pick a large element whenever possible; it seemed to work for many examples after I disallowed ties. This also is in Python.

from itertools import count

def sortificate(xs):
    '''Return the elements of xs in a chain-list of maximally long
    strictly ascending chains.'''
    return chainify(sorted(xs))

def chainify(xs):
    '''Return the elements of an ascending list of elements in a
    chain-list of maximally long strictly ascending groups.

    Hopefully comparable to mergesort in computational complexity.'''

    if xs == [] or xs[0] == xs[-1]: return xs
    lo, hi = separate(xs)
    return combine(chainify(lo), chainify(hi))

def separate(xs):
    '''Split an ascending list of elements near the middle so every
    small element is stricly smaller than any large element -- not
    able to combine chain-lists correctly when there were ties, so
    ruled ties out.

    Precondition: a split point must exist.'''

    m = len(xs) // 2
    for k in count():
        lo, hi = m - k, m + k
        if xs[lo - 1] < xs[lo]:
            return xs[:lo], xs[lo:]
        elif xs[hi - 1] < xs[hi]:
            return xs[:hi], xs[hi:]

def combine(xs, ys):
    '''Combine chain-lists xs, ys where every x is smaller than any y
    -- was not able to combine all chain-lists correctly when there
    were ties, so ruled ties out.'''

    m, n = len(xs), len(ys)
    rs, j, k = list(range(m + n)), 0, 0
    for t in rs:
        if (k < n
            and (j == m
                 or (0 < t and xs[j] <= rs[t - 1] < ys[k]))):
            rs[t] = ys[k]
            k += 1
        else:
            rs[t] = xs[j]
            j += 1

    return rs

def test(xs):
    '''Check elements are there, print for ocular inspection'''
    rs = sortificate(xs)
    if sorted(xs) != sorted(rs):
        print("FAIL", xs, " ==> ", rs)
    else:
        print(xs, " ==> ", rs)

test([2, 9, 1, 5, 1, 4, 9, 7, 2, 1, 4])

Kevin said
February 7, 2017 at 9:01 PM
I don’t think an O(n) solution is possible. If it was, you could use it to construct an O(n) sort of any array without ties, and apply it to arrays with ties by, for example, using the element’s original index to break them.

Anyway, because of the way its “transpose” function behaves, the Haskell solution is almost embarrassing: we can sort the array, group the elements, use transpose to turn it into a list of the correct subarrays, and concatenate the whole thing together:
```
import Data.List
sortwodup :: (Ord a) => [a] -> [a]
sortwodup = concat . transpose . group . sort
```

Paul said

February 7, 2017 at 9:33 PM

Inspired by@Kevin another Python solution.

from collections import Counter
from itertools import zip_longest
def sortwdup(seq):
    items = [[k]*v for k, v in sorted(Counter(seq).items())]
    result = list(zip_longest(*items))
    return [r for res in result for r in res if r is not None]

Globules said

February 10, 2017 at 3:33 AM

Here’s a Haskell version in the REPL. It’s not in-place.

> transpose . group . sort $ [2, 9, 1, 5, 1, 4, 9, 7, 2, 1, 4 :: Int]
[[1,2,4,5,7,9],[1,2,4,9],[1]]
>

Sorting Without Duplicates, Revisited | Programming Praxis said
February 10, 2017 at 9:01 AM
[…] the comments to the previous exercise, reader Kevin gave a beautiful solution in Haskell; it sorts the input items, groups them into […]
Mike said
February 17, 2017 at 2:01 AM
Uses ‘key’ function of the built in ‘list.sort()’ method to sort the list into the desired order, in-place.

The trick or insight is to generate a key for each item in the list that reflects the number of times that item has been seen as well as the item itself. For example, in the test sequence below the three 1’s would get keys: (0,1), (1,1), (2,1).

The dd is a ‘defaultdict’ that holds an ‘itertools.count()’ counter for each unique value in the list. The key function takes a list item and returns a tuple (next value of the counter associated with that item, item).

Complexity is O(n log n). (The key function is called once for each element in the list, O(n). Sorting is O(n log n)).
```
from collections import defaultdict
from itertools import count

def sortwithoutdups(lst):
    dd = defaultdict(count)
    lst.sort(key=lambda el:(next(dd[el]), el))

# test
d = [2, 9, 1, 5, 1, 4, 9, 7, 2, 1, 4]

print(d)
sortwithoutdups(d)
print(d)

#output
[2, 9, 1, 5, 1, 4, 9, 7, 2, 1, 4]
[1, 2, 4, 5, 7, 9, 1, 2, 4, 9, 1]
```

S	M	T	W	T	F	S
			1	2	3	4
5	6	7	8	9	10	11
12	13	14	15	16	17	18
19	20	21	22	23	24	25
26	27	28

Programming Praxis