Cluster

May 9, 2014

Clustering is the process of collecting in groups all of the items from an input collection that share some common feature; for instance, the GROUP BY operator of SQL performs clustering. We will define cluster(proc, lt?, lst) as a function that takes an input list and returns a list of lists; proc computes a signature of each item in the input list, and each sub-list in the output list contains all those elements of the input list with identical signatures, with sub-lists in increasing order of signature according to lt?. The type of cluster is (α → β) × (β × β → boolean) × (list α) → (list (list α)).

Your task is to write the function cluster. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Posted by programmingpraxis

Filed in Exercises

12 Comments »

12 Responses to “Cluster”

Jussi Piitulainen said
May 9, 2014 at 1:04 PM
Simple-minded grouping of consecutive items, the requested effect achieved by composing with sort:
(include "sort-okeefe.scm")
(define (group-by sign < items)   (if (null? items)       items       (let group ((g (list (car items)))                   (s (sign (car items)))                   (items (cdr items)))         (if (null? items)             (list (reverse g))             (let ((t (sign (car items))))               (if (or (< s t) (< t s))                   (cons (reverse g)                         (group (list (car items)) t (cdr items)))                   (group (cons (car items) g) s (cdr items))))))))
(define (cluster sign < items)   (group-by sign < (sort items (lambda (i j) (< (sign i) (sign j))))))

Test:
(define test-data   (list "oh" "one" "two" "three" "four" "five" "seven" "ate" "nine"))
(write (cluster string-length < test-data)) (newline)

Test result:
(("oh") ("one" "two" "ate") ("four" "five" "nine") ("three" "seven"))

(The predicate signature needs to have a pair of betas for input.)
programmingpraxis said
May 9, 2014 at 1:08 PM
Added a beta. Thanks.

matthew said

May 9, 2014 at 11:38 PM

Here’s a Haskell solution: tag the input list with the signature, sort, then run through the list gathering up the sublists. I tried to do something with foldl for this, but didn’t result in any nice simplifications.

#!/usr/bin/runhaskell
import Data.List

h1 f (a,_)(b,_) = f a b
cluster sig comp [] = []
cluster sig comp s =
  let 
      ((a,n):s') = sortBy (flip (h1 comp)) (zip (map sig s) s)
      split [] _ s t = reverse s:t
      split ((a,n):s) a' s' t
            | a == a' = split s a' (n:s') t
            | otherwise = split s a [n] (reverse s':t)
  in
     split s' a [n] []

s1 = [6,0,8,2,4,3,5,7,9,1];
s2 = ["one","two","three","four","five","six"]

main =
  print (cluster (`mod` 3) compare s1) >>
  print (cluster (`div` 3) compare s1) >>
  print (cluster (`mod` 3) compare []) >>
  print (cluster length compare s2)

Globules said

May 10, 2014 at 4:40 AM

import Data.Function (on)
import Data.List (groupBy, sortBy)
import Data.Map (elems, fromListWith)
import Data.Ord (comparing)

--------------------------------------------------------------------------------

cluster1 :: Eq β => (α -> β) -> (β -> β -> Bool) -> [α] -> [[α]]
cluster1 proc' lt as =
  map (map snd) . groupBy ((==) `on` fst) . sortBy cmp $ zip bs as
  where bs = map proc' as
        cmp = ordFromLt lt `on` fst

-- This is a bit dodgy...
ordFromLt :: (a -> a -> Bool) -> a -> a -> Ordering
ordFromLt lt a b = if lt a b then LT else GT

--------------------------------------------------------------------------------

-- Since the requirements imply that there exists an Ord instance for β (not
-- just an Eq instance), we might as well take advantage of this.  This means
-- the "lt" argument is no longer needed.

cluster2 :: Ord β => (α -> β) -> [α] -> [[α]]
cluster2 proc' as =
  map (map snd) . groupBy ((==) `on` fst) . sortBy (comparing fst) $ zip bs as
  where bs = map proc' as

-------------------------------------------------------------------------------

-- But, now that we have an Ord instance we can use a map instead.

cluster3 :: Ord β => (α -> β) -> [α] -> [[α]]
cluster3 proc' = elems . fromListWith (++) . map (\a -> (proc' a, [a]))

-------------------------------------------------------------------------------

main :: IO ()
main = do
  let s1 = [6,0,8,2,4,3,5,7,9,1] :: [Int]
      s2 = ["one","two","three","four","five","six"] :: [String]

  print $ cluster1 (`mod` 3) (<) s1
  print $ cluster1 (`div` 3) (<) s1
  print $ cluster1 (`mod` 3) (<) ([] :: [Int])
  print $ cluster1 length (<) s2

  print $ cluster2 (`mod` 3) s1
  print $ cluster2 (`div` 3) s1
  print $ cluster2 (`mod` 3) ([] :: [Int])
  print $ cluster2 length s2

  print $ cluster3 (`mod` 3) s1
  print $ cluster3 (`div` 3) s1
  print $ cluster3 (`mod` 3) ([] :: [Int])
  print $ cluster3 length s2

matthew said

May 10, 2014 at 10:01 AM

Nice solution. I’m still finding my way around the standard library so lots of new stuff there for me. I was wondering about converting from the Bool function from an Ord function, but I thought being Haskell, we’d probably get one of them anyway.

Had another think about fold and came up with the following (I’m not sure what the pros and cons of using foldr or foldl are – here foldr works nicely to keep everything in the right order):

import Data.List
import Data.Function
import Control.Arrow

f (a,n) [] = [(a,[n])]
f (a,n) ((a',s):t) 
  | a == a' = (a',n:s):t
  | otherwise = (a,[n]):(a',s):t

cluster sig comp = 
  map (sig &&& id) >>> sortBy (comp `on` fst) >>> foldr f [] >>> map snd

s1 = [6,0,8,2,4,3,5,7,9,1];
s2 = ["one","two","three","four","five","six"]

main =
  print (cluster (`mod` 3) compare s1) >>
  print (cluster (`div` 3) compare s1) >>
  print (cluster (`mod` 3) compare []) >>
  print (cluster length compare s2)

chmllr said

May 13, 2014 at 8:02 PM

Clojure:

(defn cluster [proc lt? lst]
  (vals (group-by proc (sort-by proc lt? lst))))

matthew said
May 14, 2014 at 8:33 AM
@chmllr:

NIce, but will group-by preserve the correct ordering by signatures? If not, then you can move the sort to the end.
Сhristian Мüller (@gravitydenier) said
May 14, 2014 at 4:56 PM
@matthew, I don’t know why exactly, but I observed in Clojure, that the sets are always iterated as LIFO queues (maybe the maps are based on Java’s LinkedHashMaps). Now, since sort-by is implemented with the reduce function starting with an empty set, I always get the correct order. However, you’re right: this is a weak assumption to rely on.

I thought again and I came up with even a shorter solution, without any implicit assumptions:

(defn cluster [proc lt? lst]
(partition-by proc (sort-by proc lt? lst)))

matthew said

May 14, 2014 at 6:36 PM

@christian: Very nice (of course, we call proc twice for each element, but that’s just one of those tradeoffs).

Here’s something similar in Haskell, p is partition-by:

g sig k u (a:s) =
  if k == k' then g sig k (a:u) s
  else reverse u:g sig k' [a] s
  where k' = sig a
g sig k u [] = [reverse u]

p sig [] = []
p sig (a:s) = g sig (sig a) [a] s

cluster sig comp = sortBy (comp `on` sig) >>> (p sig)

chmllr said
May 15, 2014 at 5:01 AM
@matthew: just for curiosity: I didn’t touch Haskell for years, but would this pattern matching based idea work, to implement partition-by? (I think it’s the same what you do above, but without if-the-else and with TCO)

partition-by sig lst = h [] [] (map (\ x -> [(sig x) x]) lst)
where
h tmp acc [] = map reverse acc
h tmp acc [[s:x]:[s:y]:xs] = h [x:tmp] acc [[s:y]:xs]
h tmp acc [[s:x]:[q:y]:xs] = h [] [[x:tmp]:acc] [[q:y]:xs]

chmllr said

May 15, 2014 at 5:05 AM

PS: oops, one case was missing :-)

partition-by sig lst = h [] [] (map (\ x -> [(sig x) x]) lst)
    where
        h tmp acc [] = map reverse acc
        h tmp acc [[s:x]:[s:y]:xs] = h [x:tmp] acc [[s:y]:xs]
        h tmp acc [[s:x]:[q:y]:xs] = h [] [[x:tmp]:acc] [[q:y]:xs]
        h tmp acc [[s:x]:[]] = h [] [[x:tmp]:acc] []

matthew said
May 15, 2014 at 6:35 AM
@chmllr
That would be nice, but Haskell doesn’t seem to like having the same variable twice in a pattern match:
```
cluster4.hs:23:21:
    Conflicting definitions for `s'
    Bound at: cluster4.hs:23:21
              cluster4.hs:23:27
    In an equation for `h'
```
We could use a guard expression here (I would have done this but I wanted to bind the result of sig first, maybe there’s some other way of doing this).

Writing list generating functions as “f a b c = x : f a’ b’ c'” without TCO I suppose seems more natural in Haskell since it will work with lazy lists (which partition-by could do).

S	M	T	W	T	F	S
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

Programming Praxis

Cluster

May 9, 2014

12 Responses to “Cluster”

Leave a comment

Categories

Archives

Archives

Programming Praxis

Cluster

May 9, 2014

Share this:

Related

12 Responses to “Cluster”

Leave a comment

Categories

Archives

Archives