Collect Sets Of Ranges
August 25, 2015
A frequent idiom in data processing is the control-break idiom, where some processing must be done every time there is a change in some value. A simple example comes from collecting ranges, for instance, converting the sequence 0, 1, 2, 7, 21, 22, 108, 109 to the ranges 0-2, 7, 21-22, 108-109, where a break occurs whenever two numbers aren’t consecutive.
Writing control-break programs can be difficult, for two reasons. First, you don’t know there is a break until you see the next record after the break, so you either need to look ahead in the input or keep track of what you have seen. Second, there is an implied break at the end of the input, which occurs when there is no record at all. Depending on the situation, either or both of those can be tricky.
Your task is to write a program that converts a sequence to a set of ranges, as shown above. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
Keeps track of what is visited, using a Python generator to yield each range found.
In Python.
It’s a classic problem that Jackson suggested a simple solution for.
You structure the control flow after the data, i.e. one outer loop for the break points, and one inner loop for a range. The point is to have the state implicitly coded into the control flow, rather than manipulating an explicit state in a single loop. In this case this can be done by “reading ahead” before the loop, and then “reading” at the end of the loop.
To put it simple: Follow the structure of the data. This makes the loops very simple and easy to understand.
Oh yeah, the previous comment is the solution as a Scala Stream
F# solution :
Here’s another Haskell solution, no fancy output this time, but using a general splitting function with foldl operating on the partially constructed ranges:
Here’s another Haskell solution, no fancy output this time, but using a general splitting function with foldl operating on the partially constructed ranges:
(No Haskell highlighter – using “text” here)
Just as I would do with a pen and paper, I track the consecutive numbers; their differences is one. Anything other than that marks a breakpoint.
Here is the implementation in JavaScript:
Sample run:
public class Range {
public static String getRange(int a[], int index)
{
if(index == a.length-1)
return “”+a[index];
else{
int i=index;
int start=a[i];
int end=0;
while(a[i+1]==a[i]+1)
i++;
end=a[i];
if(end!=start)
return start+”-“+end+”,”+getRange(a,i+1);
else
return start+”,”+getRange(a,i+1);
}
}
public static void main(String[] args) {
int a[]={1,2,3,4,5,6,8,9,10,-34,-33,36,37,4,8,9,0,1,1,1};
System.out.println(getRange(a,0));
}
}
Another Haskell solution. We allow finding ranges for types other than numbers.
@Globules: nice. Interesting that neither of our fold-based solutions work for infinite lists, whereas my original non-fold-based one does.
So, here is another Haskell solution, using scanl as the looping construct rather than a fold – scanl is like foldl but returns a list of the intermediate results, so we can map over that to extract the actual ranges. To deal with the end of input case, we need to munge the data a little, so we can’t just use concat. Infinite lists are dealt with just fine:
@matthew: Thanks. Yeah, I wasn’t even thinking about infinite lists when I wrote it. (Bad programmer!)
Another annoyance is that in generalizing it to Enum I had to also use Bounded to ensure that pred didn’t throw. But of course now it doesn’t work for Integer. Having maybePred and maybeSucc would be nice.
Aha, yes, I didn’t see that problem. Here’s another iterative solution, this time using unfoldr. It’s rather like Dennis’s nested loop above. Once again, we have to do some jiggerypokery to terminate properly:
itertools.groupby() from the standard library creates an iterator that breaks an iterable into (key, group) pairs.
Here, a class is used to maintain state–the previous value and the current group number. The class implements the __call__ method to be used as the key func for groupby(). The method applies a predicate to the previous and current items. If the predicate is true, the group number is returned. If the predicate is false, the group number is changed (incremented) first. As a result, each group gets a sequential number assigned.
This can also be implemented using a generator/coroutine. The generator’s send() method is used as the key func for groupby().
Definitely the last Haskell solution from me. The pattern here is a sort of map with state, with some special handling of the final state; this can all be nicely captured by the mapAccumL function, as in:
Haskell. A job for a right fold with a lazy combining function, creating the shape of the return value first, and filling it up with data from the recursive result later, like any well-behaved tail recursive “modulo cons” program would do.
(the list comprehension is there just to avoid the verbal and verbose “case” statement).
@Will Ness: Very ingenious. We can rewrite Globules original foldr solution (using a case statement here, I like pattern matching) as:
and get proper behaviour on infinite lists. It does all seem a bit awkward though – I’m not convinced I like any of the functional solutions more than the original nice and simple recursive method.
@matthew: YMMV. I learned to stop worrying, and love the fold. :) It’s easier for me that way: the recursion is already there, `x` is for the value, `r` is for recursive result, and all that’s left for me to do is to decide what goes where. It’s all about the information flow anyway. Here’s (http://stackoverflow.com/a/14403413/849891) where I saw this technique first, in the context of Haskell. In Prolog it’s commonplace, but there we only have the direct recursion (in plain language, that is).
Turnes out there’s also my answer on that SO question, with another solution, a paramophism which might feel more natural.
Also, your “jiggerypokery” `unfoldr` might be related to this (http://stackoverflow.com/q/24519530/849891) “slightly generalizing unfold”, which just might be an apomorphism.
Well, I like a nice fold too, but sometimes things don’t always fit nicely into that pattern. Foldr has nice algebraic properties, but it seems difficult to get right, particularly with lazy lists (and there seems to be a correlation between not working with lazy lists and having a space leak) – maybe that gets easier with practice though, as you suggest. Apomorphism sounds like just the ticket, thanks for that; disjoint unions are an underused datatype.
Incidentally, I came up with this for finding the end points of the ranges, but I’m not sure what to make of it:
So, my original function, which was more or less:
works nicely with infinite lists and all that, but doesn’t easily fit the foldr recursion scheme because of that extra state parameter – sometimes the recursion is ‘f (m,p) s’ and sometimes it is ‘f (p,p) s’. But we can swap the order of the parameters to f and get:
and this does fit the foldr pattern, giving:
where the function on the input list is returning another function that is then applied to initial state. Actually running this with eg. ‘ranges [1..100000000]’ and we get the answer ‘[(1,100000000)]’ in a few seconds, with no memory bloat. Once again, I am very impressed with the Haskell compiler.
On the other hand, all the compiler has really done is inline the definition of foldr, so maybe it’s not that clever.
A solution in Racket Scheme.