Porter Stemming
September 8, 2009
Stemming, in the parlance of searching and information retrieval, is the operation of stripping the suffices from a word, leaving its stem. Google, for instance, uses stemming to search for web pages containing the words connected, connecting, connection and connections when you ask for a web page that contains the word connect.
There are basically two ways to implement stemming. The first approach is to create a big dictionary that maps words to their stems. The advantage of this approach is that it works perfectly (insofar as the stem of a word can be defined perfectly); the disadvantages are the space required by the dictionary and the investment required to maintain the dictionary as new words appear. The second approach is to use a set of rules that extract stems from words. The advantages of this approach are that the code is typically small, and it can gracefully handle new words; the disadvantage is that it occasionally makes mistakes. But, since stemming is imperfectly defined, anyway, occasional mistakes are tolerable, and the rule-based approach is the one that is generally chosen.
In 1979, Martin Porter developed a stemming algorithm that, with minor modifications, is still in use today; it uses a set of rules to extract stems from words, and though it makes some mistakes, most common words seem to work out right. Porter describes his algorithm and provides a reference implementation in C at http://tartarus.org/~martin/PorterStemmer/index.html; the description of the algorithm is repeated on the next page.
Your task is to write a function that stems words according to Porter’s algorithm; you should be aware that this exercise requires rather more code than we usually write, though it’s no harder than usual. When you are finished, you are welcome to read or run a suggested solution, or to post your solution or discuss the exercise in the comments below.
(module porter-stemmer scheme
(define or-map ormap)
(define and-map andmap)
(define (consonant? str i)
(let ((char (list-ref str i)))
(cond ((member char ‘(#\a #\e #\i #\o #\u)) #f)
((equal? char #\y)
(if (= i 0) #t (not (consonant? str (- i 1)))))
(else #t))))
;; m() measures the number of consonant sequences between k0 and j. if c is
;; a consonant sequence and v a vowel sequence, and indicates arbitrary
;; presence,
;; gives 0
;; vc gives 1
;; vcvc gives 2
;; vcvcvc gives 3
;; ….
(define (num-consonant-sequences str)
(letrec ((helper (λ(str lim i prefix-ends-in-consonant?)
(if (0 stem) (1 stem) (cry in step1b
(has-vowel? (rest str)))))
(define (ends-in-double-consonant? str)
(let ((r (reverse str)))
(and (< 1 (length str))
(equal? (first r) (second r))
(consonant? r 0))))
;; cvc(i) is TRUE i-2,i-1,i has the form consonant – vowel – consonant
;; and also if the second c is not w,x or y. this is used when trying to
;; restore an e at the end of a short word. e.g.
;; cav(e), lov(e), hop(e), crim(e), but
;; snow, box, tray.
(define (cvc? str)
(let ((last (- (length str) 1)))
(and (< 1 last)
(consonant? str last)
(not (consonant? str (- last 1)))
(consonant? str (- last 2))
(not (member (list-ref str last) '(#\w #\x #\y))))))
(define (cvcv? str)
(let ((nonvowel (memf (λ(x) (not (member x '(#\a #\e #\i #\o #\u)))) (reverse str))))
(and nonvowel (cvc? (reverse nonvowel)))))
;; ends(s) is TRUE k0,…k ends with the string s.
(define (starts-with str beginning)
“true iff beginning is a substring of str starting at 0”
(or (if (empty? beginning) str false)
(and (not (empty? str))
(equal? (first str) (first beginning))
(starts-with (rest str) (rest beginning)))))
(define (ends-with str ending)
(let ((stem (starts-with (reverse str) (reverse ending))))
(if stem (reverse stem) #f)))
(define (ends str ending)
(ends-with str (string->list ending)))
(define (replace-ending-if str condition ending replacement)
(let ((stem (ends-with str (string->list ending))))
(if stem
(if (condition stem)
(append stem (string->list replacement))
str)
#f)))
(define (->true any) true)
(define (rules str proc replacements)
(or (or-map proc replacements) str))
;; step1ab() gets rid of plurals and -ed or -ing. e.g.
;; caresses -> caress
;; ponies -> poni
;; ties -> ti
;; caress -> caress
;; cats -> cat
;; feed -> feed
;; agreed -> agree
;; disabled -> disable
;; matting -> mat
;; mating -> mate
;; meeting -> meet
;; milling -> mill
;; messing -> mess
;; meetings -> meet
(define (step1a str)
(rules
str
(λ(p) (replace-ending-if str ->true (car p) (cdr p)))
‘((“sses” . “ss”)
(“ies” . “i”)
(“ss” . “ss”)
(“s” . “”))))
(define (step1b str)
(let ((ed-ing-stem (or (ends str “ed”) (ends str “ing”))))
(if ed-ing-stem
(or (replace-ending-if str m>0 “eed” “ee”)
(if (has-vowel? ed-ing-stem)
(or (replace-ending-if str ->true “ated” “ate”)
(replace-ending-if str ->true “ating” “ate”)
(replace-ending-if str ->true “bled” “ble”)
(replace-ending-if str ->true “bling” “ble”)
(replace-ending-if str ->true “ized” “ize”)
(replace-ending-if str ->true “izing” “ize”)
(if (ends-in-double-consonant? ed-ing-stem)
(if (member (list-ref ed-ing-stem (- (length ed-ing-stem) 1)) ‘(#\l #\s #\z))
ed-ing-stem
(reverse (rest (reverse ed-ing-stem))))
#f)
(if (and (m=1 ed-ing-stem) (cvc? ed-ing-stem))
(append ed-ing-stem ‘(#\e))
ed-ing-stem))
str))
str)))
;; step1c() turns terminal y to i when there is another vowel in the stem.
(define (step1c str)
(or (replace-ending-if str has-vowel? “y” “i”) str))
;; step2() maps double suffices to single ones. so -ization ( = -ize plus
;; -ation) maps to -ize etc. Call this function only if the string before the
;; suffix has at least one non-initial consonant-sequence.
(define (step2 str)
(or
(or-map
(lambda (end-pattern)
(replace-ending-if str m>0 (car end-pattern) (cdr end-pattern)))
‘((“ational” . “ate”)
(“tional” . “tion”)
(“enci” . “ence”)
(“anci” . “ance”)
(“izer” . “ize”)
(“bli” . “ble”)
;; -DEPARTURE-
;; To match the published algorithm, replace prev line with
;; (“abli” . “able”)
(“alli” . “al”)
(“entli” . “ent”)
(“eli” . “e”)
(“ousli” . “ous”)
(“ization” . “ize”)
(“ation” . “ate”)
(“ator” . “ate”)
(“alism” . “al”)
(“iveness” . “ive”)
(“fulness” . “ful”)
(“ousness” . “ous”)
(“aliti” . “al”)
(“iviti” . “ive”)
(“biliti” . “ble”)
(“logi” . “log”) ;; DEPARTURE: the published algorithm does not have this line
))
str))
;; step3() deals with -ic-, -full, -ness etc. similar strategy to step2.
(define (step3 str)
(or
(or-map
(lambda (end-pattern)
(replace-ending-if str m>0 (car end-pattern) (cdr end-pattern)))
‘((“icate” . “ic”)
(“ative” . “”) ; huh?
(“alize” . “al”)
(“iciti” . “ic”)
(“ical” . “ic”)
(“ful” . “”) ; huh?
(“ness” . “”) ; huh?
))
str))
;; step4() takes off -ant, -ence etc., in context vcvc.
(define (step4 str)
(or
(or-map
(lambda (ending)
(replace-ending-if str m>1 ending “”))
‘(“al” “ance” “ence” “er” “ic” “able” “ible” “ant” “ement” “ment” “ent”
“ou” ;; for -ous
“ism” “ate” “iti” “ous” “ive” “ize”))
(replace-ending-if str (λ(stem) (m>1 (append stem ‘(#\s)))) “sion” “s”)
(replace-ending-if str (λ(stem) (m>1 (append stem ‘(#\t)))) “tion” “t”)
str))
;; step5 removes a final -e if m() > 1, and changes -ll to -l if m() > 1.
(define (step5a str)
(let ((stem-e (ends str “e”)))
(if (and stem-e
(let ((m (num-consonant-sequences stem-e)))
(or (1 str)) “ll” “l”) str))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(define (puts . stuff)
(map display stuff)
(newline)
(car (reverse stuff)))
(define (steps str)
(let* ((zero (puts “step0: ” (string->list str)))
(one-a (puts “step1a: ” (step1a zero)))
(one-b (puts “step1b: ” (step1b one-a)))
(one-c (puts “step1c: ” (step1c one-b)))
(two (puts “step2: ” (step2 one-c)))
(three (puts “step3: ” (step3 two)))
(four (puts “step4: ” (step4 three)))
(five-a (puts “step5a: ” (step5a four)))
(done (puts “step5b: ” (step5b five-a)))
)
done))
;; In stem(p,i,j), p is a char pointer, and the string to be stemmed is from p[i] to p[j]
;; inclusive. Typically i is zero and j is the offset to the last character of a string,
;; (p[j+1] == ”). The stemmer adjusts the characters p[i] … p[j] and returns the new
;; end-point of the string, k. Stemming never increases word length, so i <= k <= j. To
;; turn the stemmer into a module, declare 'stem' as extern, and delete the remainder of
;; this file.
(define (stem str)
(let ((len (string-length str)))
(if (string
(step5b
(step5a
(step4
(step3
(step2
(step1c
(step1b
(step1a
(string->list str)))))))))))))
(provide stem)
)
#lang scheme
;;; The above should have come with the documentation:
;; The Porter Stemming Algorithm, somewhat mechanically hand translated to Common Lisp by
;; Steven M. Haflich smh@franz.com Feb 2002. Subsequently translated to Scheme and made
;; wholly inefficient, if more readable, by Gregory Marton in 2009.
;; This port has only been tested with PLT Scheme, though there is little PLT-specific code.
;; This code is offered in the hope it will be useful, but with no warranty of correctness,
;; suitability, usability, or anything else.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; This is the Porter stemming algorithm, as presented ins
;; Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14,
;; no. 3, pp 130-137,
;; only differing from it at the points maked –DEPARTURE– below.
;; See also http://www.tartarus.org/~martin/PorterStemmer
;; The algorithm as described in the paper could be exactly replicated
;; by adjusting the points of DEPARTURE, but this is barely necessary,
;; because (a) the points of DEPARTURE are definitely improvements, and
;; (b) no encoding of the Porter stemmer I have seen is anything like
;; as exact as this version, even with the points of DEPARTURE!
;; Release 1
;; Note that only lower case sequences are stemmed. Forcing to lower case
;; should be done before stem(…) is called.
;; This implementation is not a particularly fast one.
;; In particular, there is a common optimization to make the many ends-with
;; tests faster by switching on the last or penultimate letter, which I chose
;; not to use here for the sake of readability.
;; I suspect without proof that the many reversals could be ameliorated by just
;; reversing the string in the first place and pre-reversing the patterns, or
;; by maintaining a last position in the string vector instead of using a list of
;; characters.
;;; and the tests, which refer to files that can be found at:
;;; http://tartarus.org/~martin/PorterStemmer/
(require (planet schematics/schemeunit:3:4))
(require (planet schematics/schemeunit:3/text-ui))
(require “files.ss”)
(require “porter-stemmer.ss”)
(define *the-tests*
(test-suite
“stemming”
(let ((count 0))
(test-case
“stemming”
(for-each-line-in-file
“porter-test.txt”
(lambda (line)
(set! count (+ 1 count)) (when (integer? (/ count 1000)) (display “.”))
;;(display line) (newline)
(let* ((untab (regexp-split #px”\t” (chomp line)))
(original (first untab))
(stemmed (second untab))
(result (stem original)))
(check-equal? result stemmed original))))))))
(define num-errors (run-tests *the-tests*))
(define (test) (exit num-errors))
(provide test)
A very nice Programming Praxis example, as here presented. I have created a link to it from the Porter Stemmer page.
[…] Praxis porter stemming part1 […]
nice example
Reblogged this on {blog;}.