Cluster

May 9, 2014

Cluster calls for some kind of dictionary structure to accumulate signature/item pairs. You could use a hash table or balanced tree, but instead we choose a simple (unbalanced) binary tree, on the theory that if the input is not presented in order by signature (if it is, a simple scan through the input in order would be sufficient) then it is unlikely that the tree will become too unbalanced; besides, it gives us a chance to write a simple binary tree insert and in-order traversal. Here’s our cluster:

(define (cluster proc lt? lst)
  (define (insert key value tree)
    (cond ((null? tree)
            (list key (list value) '() '()))
          ((lt? key (car tree))
            (let ((left (insert key value (caddr tree))))
              (list (car tree) (cadr tree) left (cadddr tree))))
          ((lt? (car tree) key)
            (let ((right (insert key value (cadddr tree))))
              (list (car tree) (cadr tree) (caddr tree) right)))
          (else
            (let ((new (cons value (cadr tree))))
              (list key new (caddr tree) (cadddr tree))))))
  (define (in-order tree)
    (if (null? tree) '()
      (append (in-order (caddr tree))
              (list (cadr tree))
              (in-order (cadddr tree)))))
  (let loop ((lst lst) (tree '()))
    (if (null? lst) (in-order tree)
      (loop (cdr lst) (insert (proc (car lst)) (car lst) tree)))))

The tree is represented as a recursive four-element list, with signature in the car, list of values in the cadr, left child in the caddr, and right child in the cadddr. Since the data structure is recursive, so is the code that does insertion, which begins at the root of the tree. If a tree is null, the signature must not appear in the tree, so a new node is built. Otherwise, the signature must already exist in the tree, so the existing node is updated, either by recursively inserting in the left sub-tree if the new signature is less than the current signature, or in the right sub-tree if the new signature is greater than the current signature, or by consing the input element to the current node if the two signatures are equal. The main loop of the function cdrs through the input, inserts each signature/item pair into the tree, and writes the tree when the input is exhausted. Here are some examples:

> (define x '("this" "is" "a" "fun" "and" "useful" "program"))
> (cluster string-length < x)
(("a") ("is") ("and" "fun") ("this") ("useful") ("program"))
> (cluster (lambda (x) (string-ref x 0)) char<? x)
(("and" "a") ("fun") ("is") ("program") ("this") ("useful"))
> (define (anagram s) (list->string (sort char<? (string->list s))))
> (define dict '("pots" "time" "spot" "pans" "item" "tops"))
> (cluster anagram string<? dict)
(("pans") ("item" "time") ("tops" "spot" "pots"))

You can run the program at http://programmingpraxis.codepad.org/tNQg34dH.

About these ads

Pages: 1 2

12 Responses to “Cluster”

  1. Jussi Piitulainen said

    Simple-minded grouping of consecutive items, the requested effect achieved by composing with sort:

    (include "sort-okeefe.scm")

    (define (group-by sign < items)
      (if (null? items)
          items
          (let group ((g (list (car items)))
                      (s (sign (car items)))
                      (items (cdr items)))
            (if (null? items)
                (list (reverse g))
                (let ((t (sign (car items))))
                  (if (or (< s t) (< t s))
                      (cons (reverse g)
                            (group (list (car items)) t (cdr items)))
                      (group (cons (car items) g) s (cdr items))))))))

    (define (cluster sign < items)
      (group-by sign < (sort items (lambda (i j) (< (sign i) (sign j))))))

    Test:

    (define test-data
      (list "oh" "one" "two" "three" "four" "five" "seven" "ate" "nine"))

    (write (cluster string-length < test-data)) (newline)

    Test result:

    (("oh") ("one" "two" "ate") ("four" "five" "nine") ("three" "seven"))

    (The predicate signature needs to have a pair of betas for input.)

  2. programmingpraxis said

    Added a beta. Thanks.

  3. matthew said

    Here’s a Haskell solution: tag the input list with the signature, sort, then run through the list gathering up the sublists. I tried to do something with foldl for this, but didn’t result in any nice simplifications.

    #!/usr/bin/runhaskell
    import Data.List
    
    h1 f (a,_)(b,_) = f a b
    cluster sig comp [] = []
    cluster sig comp s =
      let 
          ((a,n):s') = sortBy (flip (h1 comp)) (zip (map sig s) s)
          split [] _ s t = reverse s:t
          split ((a,n):s) a' s' t
                | a == a' = split s a' (n:s') t
                | otherwise = split s a [n] (reverse s':t)
      in
         split s' a [n] []
    
    s1 = [6,0,8,2,4,3,5,7,9,1];
    s2 = ["one","two","three","four","five","six"]
    
    main =
      print (cluster (`mod` 3) compare s1) >>
      print (cluster (`div` 3) compare s1) >>
      print (cluster (`mod` 3) compare []) >>
      print (cluster length compare s2)
    
  4. Globules said
    import Data.Function (on)
    import Data.List (groupBy, sortBy)
    import Data.Map (elems, fromListWith)
    import Data.Ord (comparing)
    
    --------------------------------------------------------------------------------
    
    cluster1 :: Eq β => (α -> β) -> (β -> β -> Bool) -> [α] -> [[α]]
    cluster1 proc' lt as =
      map (map snd) . groupBy ((==) `on` fst) . sortBy cmp $ zip bs as
      where bs = map proc' as
            cmp = ordFromLt lt `on` fst
    
    -- This is a bit dodgy...
    ordFromLt :: (a -> a -> Bool) -> a -> a -> Ordering
    ordFromLt lt a b = if lt a b then LT else GT
    
    --------------------------------------------------------------------------------
    
    -- Since the requirements imply that there exists an Ord instance for β (not
    -- just an Eq instance), we might as well take advantage of this.  This means
    -- the "lt" argument is no longer needed.
    
    cluster2 :: Ord β => (α -> β) -> [α] -> [[α]]
    cluster2 proc' as =
      map (map snd) . groupBy ((==) `on` fst) . sortBy (comparing fst) $ zip bs as
      where bs = map proc' as
    
    -------------------------------------------------------------------------------
    
    -- But, now that we have an Ord instance we can use a map instead.
    
    cluster3 :: Ord β => (α -> β) -> [α] -> [[α]]
    cluster3 proc' = elems . fromListWith (++) . map (\a -> (proc' a, [a]))
    
    -------------------------------------------------------------------------------
    
    main :: IO ()
    main = do
      let s1 = [6,0,8,2,4,3,5,7,9,1] :: [Int]
          s2 = ["one","two","three","four","five","six"] :: [String]
    
      print $ cluster1 (`mod` 3) (<) s1
      print $ cluster1 (`div` 3) (<) s1
      print $ cluster1 (`mod` 3) (<) ([] :: [Int])
      print $ cluster1 length (<) s2
    
      print $ cluster2 (`mod` 3) s1
      print $ cluster2 (`div` 3) s1
      print $ cluster2 (`mod` 3) ([] :: [Int])
      print $ cluster2 length s2
    
      print $ cluster3 (`mod` 3) s1
      print $ cluster3 (`div` 3) s1
      print $ cluster3 (`mod` 3) ([] :: [Int])
      print $ cluster3 length s2
    
  5. matthew said

    Nice solution. I’m still finding my way around the standard library so lots of new stuff there for me. I was wondering about converting from the Bool function from an Ord function, but I thought being Haskell, we’d probably get one of them anyway.

    Had another think about fold and came up with the following (I’m not sure what the pros and cons of using foldr or foldl are – here foldr works nicely to keep everything in the right order):

    import Data.List
    import Data.Function
    import Control.Arrow
    
    f (a,n) [] = [(a,[n])]
    f (a,n) ((a',s):t) 
      | a == a' = (a',n:s):t
      | otherwise = (a,[n]):(a',s):t
    
    cluster sig comp = 
      map (sig &&& id) >>> sortBy (comp `on` fst) >>> foldr f [] >>> map snd
    
    s1 = [6,0,8,2,4,3,5,7,9,1];
    s2 = ["one","two","three","four","five","six"]
    
    main =
      print (cluster (`mod` 3) compare s1) >>
      print (cluster (`div` 3) compare s1) >>
      print (cluster (`mod` 3) compare []) >>
      print (cluster length compare s2)
    
  6. chmllr said

    Clojure:

    (defn cluster [proc lt? lst]
      (vals (group-by proc (sort-by proc lt? lst))))
    
  7. matthew said

    @chmllr:

    NIce, but will group-by preserve the correct ordering by signatures? If not, then you can move the sort to the end.

  8. @matthew, I don’t know why exactly, but I observed in Clojure, that the sets are always iterated as LIFO queues (maybe the maps are based on Java’s LinkedHashMaps). Now, since sort-by is implemented with the reduce function starting with an empty set, I always get the correct order. However, you’re right: this is a weak assumption to rely on.

    I thought again and I came up with even a shorter solution, without any implicit assumptions:

    (defn cluster [proc lt? lst]
    (partition-by proc (sort-by proc lt? lst)))

  9. matthew said

    @christian: Very nice (of course, we call proc twice for each element, but that’s just one of those tradeoffs).

    Here’s something similar in Haskell, p is partition-by:

    g sig k u (a:s) =
      if k == k' then g sig k (a:u) s
      else reverse u:g sig k' [a] s
      where k' = sig a
    g sig k u [] = [reverse u]
    
    p sig [] = []
    p sig (a:s) = g sig (sig a) [a] s
    
    cluster sig comp = sortBy (comp `on` sig) >>> (p sig)
    
  10. chmllr said

    @matthew: just for curiosity: I didn’t touch Haskell for years, but would this pattern matching based idea work, to implement partition-by? (I think it’s the same what you do above, but without if-the-else and with TCO)

    partition-by sig lst = h [] [] (map (\ x -> [(sig x) x]) lst)
    where
    h tmp acc [] = map reverse acc
    h tmp acc [[s:x]:[s:y]:xs] = h [x:tmp] acc [[s:y]:xs]
    h tmp acc [[s:x]:[q:y]:xs] = h [] [[x:tmp]:acc] [[q:y]:xs]

  11. chmllr said

    PS: oops, one case was missing :-)

    partition-by sig lst = h [] [] (map (\ x -> [(sig x) x]) lst)
        where
            h tmp acc [] = map reverse acc
            h tmp acc [[s:x]:[s:y]:xs] = h [x:tmp] acc [[s:y]:xs]
            h tmp acc [[s:x]:[q:y]:xs] = h [] [[x:tmp]:acc] [[q:y]:xs]
            h tmp acc [[s:x]:[]] = h [] [[x:tmp]:acc] []
    
  12. matthew said

    @chmllr
    That would be nice, but Haskell doesn’t seem to like having the same variable twice in a pattern match:

    cluster4.hs:23:21:
        Conflicting definitions for `s'
        Bound at: cluster4.hs:23:21
                  cluster4.hs:23:27
        In an equation for `h'
    

    We could use a guard expression here (I would have done this but I wanted to bind the result of sig first, maybe there’s some other way of doing this).

    Writing list generating functions as “f a b c = x : f a’ b’ c'” without TCO I suppose seems more natural in Haskell since it will work with lazy lists (which partition-by could do).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 645 other followers

%d bloggers like this: