List Difference

November 20, 2012

These are all similar to the intersection and union functions of the prior exercise. Here is the quadratic version of the function, which uses member to look up items in a list:

(define (difference1 xs ys)
  (let loop ((xs xs) (zs (list)))
    (cond ((null? xs) zs)
          ((member (car xs) ys)
            (loop (cdr xs) zs))
          (else (loop (cdr xs) (cons (car xs) zs))))))

The O(n log n) version performs an n log n sort followed by a linear-time “merge”-like operation:

(define (difference2 xs ys)
  (let loop ((xs (sort < xs)) (ys (sort < ys)) (zs (list)))
    (cond ((null? xs) zs)
          ((null? ys) (append xs zs))
          ((< (car xs) (car ys))
            (loop (cdr xs) ys (cons (car xs) zs)))
          ((< (car ys) (car xs))
            (loop xs (cdr ys) zs))
          (else (loop (cdr xs) (cdr ys) zs)))))

The linear version of the function is the same as the quadratic version, but uses a hash table instead of a list:

(define (difference3 xs ys)
  (let ((h (make-hash identity = #f 9973)))
    (do ((ys ys (cdr ys))) ((null? ys))
      (h 'insert (car ys) (car ys)))
    (let loop ((xs xs) (zs (list)))
      (cond ((null? xs) zs)
            ((h 'lookup (car xs))
              (loop (cdr xs) zs))
            (else (loop (cdr xs) (cons (car xs) zs)))))))

The timing test is similar to the prior exercise:

(define (time-test difference n)
  (let ((a (rand-list n)) (b (rand-list n)))
    (time (begin (difference a b) (difference b a) (if #f #f)))))

Our results are also similar to the prior exercise. As before, sorting is fast, and our hash tables aren’t:

n       100000  200000  400000  800000
------  ------  ------  ------  ------
first      452    1794    7114   28626
second      16      31      63     125
third       47      94     203     515

We used make-hash, sort and identity from the Standard Prelude. You can run the program at http://programmingpraxis.codepad.org/4FQ7oMpV.

Pages: 1 2

5 Responses to “List Difference”

  1. Paul said

    A Python version. After last weeks exercise, this was not too difficult. A full version can be found here.

    def diff1(a, b):
        """Loop over a and check if ai not in b"""
        return [ai for ai in a if ai not in b]
        
    def diff4(a, b):    
        """Make heaps of a and b. Copy to output elements that are i na but not in b"""
        ha = list(a)
        hb = list(b)
        res = []
        heapq.heapify(ha)
        heapq.heapify(hb)
        while ha and hb:
            if ha[0] < hb[0]:
                res.append(heapq.heappop(ha))
            elif ha[0] > hb[0]:
                heapq.heappop(hb)
            else:
                heapq.heappop(ha)
                heapq.heappop(hb)
        return res + ha # ha can contain a tail
        
    def diff6(a, b):
        """Make set of b and loop over a and check that ai not in set(b)"""
        setb = set(b)
        return [ai for ai in a if ai not in setb]
        
    def diff99(a, b):
        """Make sets of a and b and subtract"""
        return list(set(a) - set(b))
        
    """
    diff1    O(n2)    coeff: 4.01e-06  ratio:    27   alt: O(n3)    Loop over a and check if ai not in b
    diff4    O(nlogn) coeff: 6.06e-07  ratio:     5   alt: O(n)     Make heaps of a and b. Copy to output elements that are i na but not in b
    diff6    O(n)     coeff: 1.38e-08  ratio:     4   alt: O(nlogn) Make set of b and loop over a and check that ai not in set(b)
    diff99   O(nlogn) coeff: 2.07e-08  ratio:     2   alt: O(n)     Make sets of a and b and subtract
    """
    
  2. JP said

    Here’s my try in Racket and Python (including the list intersection and union from the previous post): List algorithms and efficiency

    And just the source code: python, racket

    I went ahead and graphed the timings as well, to show that they were actually O(n2), O(n log n), and O(n). I was actually a little surprised at how well that turned out.

  3. Pedro said

    Hi,
    the algorithm complexity is measured in “list operations”, I gather. For example, sets in Python are implemented as (open-addressed) hashes. This
    may make their instantiation much more complex than the creation of a linked list.
    Interesting problem, though.

    Pedro.

  4. JP said

    Wouldn’t the instantiation of the data structure be a one time cost and thus factored out of the big-O runtime? Or is there a cost to growing the hash as more and more items are added to it?

    I did timing graphs for the three which matched up pretty well, but they were based on the Racket versions rather than Python. Perhaps I should try that.

Leave a comment