## Sparse Sets

### March 9, 2012

We look today at a clever data structure for storing sparse sets of integers on the range 0 .. *u*−1 and performing initialization, lookup, and insertion is time *O*(1) and iteration in *O*(*n*), where *n* is the number of elements in the set. The data structure was studied in a 1993 article “An Efficient Representation for Sparse Sets” by Preston Briggs and Linda Torczon, in Exercise 1.9 (1.8 in the first edition) of Jon Bentley’s book *Programming Pearls*, and in exercise 2.12 of the 1974 book *The Design and Analysis of Computer Algorithms* by Al Aho, John Hopcroft and Jeffrey Ullman; the data structure itself dates to the folklore of computing.

The data structure considers a universe of integers from 0 to *u*−1; depending on the circumstances, the integers probably map to something else, but we don’t care about that. Any given set consists of *n* items chose from the universe; there are no duplicates. Note that *n* ≤ *u*, certainly, and likely *n* is much less than *u* — otherwise, you would probably use a bit vector to represent the set. Note also that we are optimizing for speed at the expense of space, as a bit vector takes *u* bits but our data structure takes 2*u* integers.

Think about a bit vector. Setting a bit is a constant-time operation, as is checking if a bit is set or unset. But initializing the bit vector and iterating over the set elements of the bit vector each take time proportional to the size of the bit vector. Our sparse sets reduce the iteration to time proportional to the size of the set (rather than the size of the universe) and reduce the initialization time to a constant.

The sparse set is represented by two vectors that we will call dense (abbreviated *D*) and sparse (abbreviated *S*). Initially *n*, the number of elements in the set, is zero; the two vectors are uninitialized and may contain anything. To add an element 0 ≤ *k* < *u* to a set that does not already contain *k*, we set *D*[*n*] to *k*, *S*[*k*] to *n*, and increase *n* by 1, an operation that takes constant time. After this, the two vectors point to each other, which gives a test of set membership that also works in constant time: an element *k* is in the set if and only if *S*[*k*] < *n* and *D*[*S*[*k*]] == *k>*. Note that if *k* is not a member of the set, the value of *S*[*k*] doesn’t matter; either it *S*[*k*] will be greater than *n* or it will point to an element of *D* that doesn’t point back to it. The diagram above right shows a set with the elements 5, 1 and 4; the blue boxes may contain any value. To iterate over the elements of the set, read *D*[0 .. *n*−1], which takes time *O*(*n*), and to clear the set make *n* = 0, which takes time *O*(1); note in particular that clearing the set doesn’t require reinitialization. Other operations, including size-of, delete, union, intersection, difference, and set-equality are possible, and equally time-efficient compared to bit vectors, but we won’t discuss them here, since they are seldom used with this representation of sets. A common use of these sparse sets is with register allocation algorithms in compilers, which have a fixed universe (the number of registers in the machine) and are updated and cleared frequently during a single processing run.

Your task is to implement the insert, lookup, iterate and clear operations for sparse sets as described above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Thanks, this was a good thing to learn.

Instead of D, S, n, u global, I packaged them in a single vector. This may not be appropriate when a program needs to manipulate a single such set very fast, but with a sufficiently smart compiler …

here is a Perl version:

use warnings;

use strict;

my @dense;

my @sparse;

my $n;

sub get_number {

my $op = shift;

my $k;

do {

print "Enter number to $op: ";

$k = <STDIN>;

chomp($k);

} while ($k !~ /^\d+$/);

return $k;

}

sub sparse_clear {

$n = 0;

}

sub sparse_insert {

my $k = get_number(‘insert’);

$dense[$n] = $k;

$sparse[$k] = $n;

$n++;

}

sub sparse_lookup {

my $k = get_number(‘lookup’);

( defined($sparse[$k]) &&

($sparse[$k] < $n) &&

($dense[$sparse[$k]] == $k)

) ? print "exists\n" : print "doesn’t exist\n";

}

sub sparse_iterate {

my $largest_sparce = 0;

print "Dense: ";

for (0 .. $n-1) {

print "$dense[$_] ";

$largest_sparce = $dense[$_] if ($dense[$_] > $largest_sparce);

}

print "\n";

print "Sparse ";

for (0 .. $largest_sparce) {

defined($sparse[$_]) ? print "$sparse[$_] " : print "u ";

}

print "\n";

}

sub sparse_quit {

exit;

}

my @ops = qw(clear insert lookup iterate quit);

while(1) {

print "\n";

for (0 .. scalar(@ops)-1) {

print "$_: $ops[$_]\n";

}

print "Enter operation: ";

my $choice = <STDIN>;

chomp($choice);

if ($choice =~ /^[0-4]$/) {

&{\&{"sparse_$ops[$choice]"}}();

}

}

Of course, the easier (and arguably better) way to do this would be to use one of the many existing perl modules which already implement sparse matrix arithmetic: http://search.cpan.org/search?query=sparse&mode=all

Very cool post. We often deal with large sparse data structures at work, but unfortunately our bottleneck is typically available memory and not processing time. Something like this is ideal for modern hardware though.

Just to emphasize your note, the maintenance of the variable n is crucial. Not only does it allow for constant-time initialization, clearing, and element-counting, but it is also the only way to guarantee all elements in the universe have unambiguous values.

Definitely going to give this a try sometime this weekend, thanks for sharing!

And C#:

using System;

using System.Collections;

using System.Collections.Generic;

using System.Linq;

using System.Text;

namespace SparseSet

{

internal sealed class SparseSet : IEnumerable

{

private ushort?[] D = new ushort?[ushort.MaxValue];

private ushort?[] S = new ushort?[ushort.MaxValue];

private ushort _size = 0;

internal bool Insert(ushort value)

{

bool result = !Lookup(value);

if (result)

{

D[_size] = value;

S[value] = _size;

_size++;

}

return result;

}

internal bool Lookup(ushort value)

{

bool result = S[value] < _size &&

S[value].HasValue &&

D[S[value].Value] == value;

return result;

}

internal void Clear()

{

_size = 0;

}

public IEnumerator GetEnumerator()

{

var result = new SparseSetEnumerator(D, _size);

return result;

}

System.Collections.IEnumerator System.Collections.IEnumerable.GetEnumerator()

{

return (IEnumerator)GetEnumerator();

}

}

internal sealed class SparseSetEnumerator : IEnumerator

{

private readonly ushort?[] _values = null;

private readonly ushort _size = 0;

private int _index = -1;

internal SparseSetEnumerator(ushort?[] values, ushort size)

{

_values = values;

_size = size;

}

public ushort Current

{

get

{

return _values[_index].Value;

}

}

public void Dispose()

{

// squeclh

}

object IEnumerator.Current

{

get

{

return (object)_values[_index];

}

}

public bool MoveNext()

{

_index++;

bool moving = _index < _size &&

_values[_index].HasValue;

return moving;

}

public void Reset()

{

_index = -1;

}

}

class Program

{

static void Main(string[] args)

{

var ss = new SparseSet();

ss.Insert(5);

ss.Insert(1);

ss.Insert(4);

ss.Clear();

ss.Insert(5);

ss.Insert(1);

ss.Insert(4);

ss.Clear();

ss.Insert(3);

ss.Insert(2);

ss.Insert(7);

ss.Clear();

ss.Insert(3);

ss.Insert(2);

ss.Insert(7);

ss.Insert(3);

ss.Insert(2);

ss.Insert(7);

ss.Clear();

ss.Insert(3);

ss.Insert(2);

ss.Insert(7);

ss.Insert(5);

ss.Insert(1);

ss.Insert(4);

foreach(ushort us in ss)

{

Console.WriteLine(us);

}

Console.WriteLine("—");

foreach (ushort us in ss)

{

Console.WriteLine(us);

}

ss.Clear();

Console.WriteLine("—");

foreach (ushort us in ss)

{

Console.WriteLine(us);

}

}

}

}

An implementation in Guile (Scheme):

(also here: https://gist.github.com/2011167)

Here is how to do this in Forth. Execute the name of the set to bring it in context.

The INSERT, LOOKUP, ITERATE and SCLEAR words work on the set in context.

0 VALUE #universe

0 VALUE n

0 VALUE Dense

0 VALUE Sparse

: SPARSE-SET ( sz | “name” — )

CREATE DUP , 2* CELLS ALLOT

DOES> @+ TO #universe TO Dense

Dense #universe CELLS + TO Sparse ;

#16 SPARSE-SET test-set

test-set ( make current )

: INSERT ( x — )

DUP #universe >= ABORT” INSERT :: out of range”

DUP Dense n CELL[] !

Sparse []CELL n SWAP !

1 +TO n ;

: LOOKUP ( x — bool )

>R

Sparse R@ CELL[] @ DUP n U>= IF DROP -R FALSE EXIT ENDIF

Dense []CELL @ R> = ;

: ITERATE ( xt — )

LOCAL xt

n 0 ?DO Dense I CELL[] @ xt EXECUTE LOOP ;

: SCLEAR ( — ) CLEAR n ;

SCLEAR

CR .( Insert {5,4,1} ) 5 INSERT 1 INSERT 4 INSERT

CR .( Lookup {5,1,4,15,12} => ) 5 LOOKUP . 1 LOOKUP . 4 LOOKUP . #15 LOOKUP . #12 LOOKUP .

CR .( Iterate with ‘.’ => ) ‘ . ITERATE

:ABOUT CR .” Sparse Sets”

CR .” Try: ( n — ) INSERT”

CR .” ( n — bool ) LOOKUP ”

CR .” ( xt — ) ITERATE ”

CR .” ( — ) SCLEAR ” ;

.ABOUT -sparseset CR

DEPRIVE