Porter Stemming
September 8, 2009
Here is Porter’s description of his algorithm:
To present the suffix stripping algorithm in its entirety we will need a few difinitions. A \consonant\ in a word is a letter other than A, E, I, O or U, and other than Y preceded by a consonant. (The fact that the term `consonant' is defined to some extent in terms of itself does not make it ambiguous.) So in TOY the consonants are T and Y, and in SYZYGY they are S, Z and G. If a letter is not a consonant it is a \vowel\. A consonant will be denoted by c, a vowel by v. A list ccc... of length greater than 0 will be denoted by C, and a list vvv... of length greater than 0 will be denoted by V. Any word, or part of a word, therefore has one of the four forms: CVCV ... C CVCV ... V VCVC ... C VCVC ... V These may all be represented by the single form [C]VCVC ... [V] where the square brackets denote arbitrary presence of their contents. Using (VC){m} to denote VC repeated m times, this may again be written as [C](VC){m}[V]. m will be called the \measure\ of any word or word part when represented in this form. The case m = 0 covers the null word. Here are some examples: m=0 TR, EE, TREE, Y, BY. m=1 TROUBLE, OATS, TREES, IVY. m=2 TROUBLES, PRIVATE, OATEN, ORRERY. The \rules\ for removing a suffix will be given in the form (condition) S1 -> S2 This means that if a word ends with the suffix S1, and the stem before S1 satisfies the given condition, S1 is replaced by S2. The condition is usually given in terms of m, e.g. (m > 1) EMENT -> Here S1 is `EMENT' and S2 is null. This would map REPLACEMENT to REPLAC, since REPLAC is a word part for which m = 2. The `condition' part may also contain the following: *S - the stem ends with S (and similarly for the other letters). *v* - the stem contains a vowel. *d - the stem ends with a double consonant (e.g. -TT, -SS). *o - the stem ends cvc, where the second c is not W, X or Y (e.g. -WIL, -HOP). And the condition part may also contain expressions with \and\, \or\ and \not\, so that (m>1 and (*S or *T)) tests for a stem with m>1 ending in S or T, while (*d and not (*L or *S or *Z)) tests for a stem ending witha double consonant other than L, S or Z. Elaborate conditions like this are required only rarely. In a set of rules written beneath each other, only one is obeyed, and this will be the one with the longest matching S1 for the given word. For example, with SSES -> SS IES -> I SS -> SS S -> (here the conditions are all null) CARESSES maps to CARESS since SSES is the longest match for S1. Equally CARESS maps to CARESS (S1=`SS') and CARES to CARE (S1=`S'). In the rules below, examples of their application, successful or otherwise, are given on the right in lower case. The algorithm now follows: Step 1a SSES -> SS caresses -> caress IES -> I ponies -> poni ties -> ti SS -> SS caress -> caress S -> cats -> cat Step 1b (m>0) EED -> EE feed -> feed agreed -> agree (*v*) ED -> plastered -> plaster bled -> bled (*v*) ING -> motoring -> motor sing -> sing If the second or third of the rules in Step 1b is successful, the following is done: AT -> ATE conflat(ed) -> conflate BL -> BLE troubl(ed) -> trouble IZ -> IZE siz(ed) -> size (*d and not (*L or *S or *Z)) -> single letter hopp(ing) -> hop tann(ed) -> tan fall(ing) -> fall hiss(ing) -> hiss fizz(ed) -> fizz (m=1 and *o) -> E fail(ing) -> fail fil(ing) -> file The rule to map to a single letter causes the removal of one of the double letter pair. The -E is put back on -AT, -BL and -IZ, so that the suffixes -ATE, -BLE and -IZE can be recognised later. This E may be removed in step 4. Step 1c (*v*) Y -> I happy -> happi sky -> sky Step 1 deals with plurals and past participles. The subsequent steps are much more straightforward. Step 2 (m>0) ATIONAL -> ATE relational -> relate (m>0) TIONAL -> TION conditional -> condition rational -> rational (m>0) ENCI -> ENCE valenci -> valence (m>0) ANCI -> ANCE hesitanci -> hesitance (m>0) IZER -> IZE digitizer -> digitize (m>0) ABLI -> ABLE conformabli -> conformable (m>0) ALLI -> AL radicalli -> radical (m>0) ENTLI -> ENT differentli -> different (m>0) ELI -> E vileli - > vile (m>0) OUSLI -> OUS analogousli -> analogous (m>0) IZATION -> IZE vietnamization -> vietnamize (m>0) ATION -> ATE predication -> predicate (m>0) ATOR -> ATE operator -> operate (m>0) ALISM -> AL feudalism -> feudal (m>0) IVENESS -> IVE decisiveness -> decisive (m>0) FULNESS -> FUL hopefulness -> hopeful (m>0) OUSNESS -> OUS callousness -> callous (m>0) ALITI -> AL formaliti -> formal (m>0) IVITI -> IVE sensitiviti -> sensitive (m>0) BILITI -> BLE sensibiliti -> sensible The test for the string S1 can be made fast by doing a program switch on the penultimate letter of the word being tested. This gives a fairly even breakdown of the possible values of the string S1. It will be seen in fact that the S1-strings in step 2 are presented here in the alphabetical order of their penultimate letter. Similar techniques may be applied in the other steps. Step 3 (m>0) ICATE -> IC triplicate -> triplic (m>0) ATIVE -> formative -> form (m>0) ALIZE -> AL formalize -> formal (m>0) ICITI -> IC electriciti -> electric (m>0) ICAL -> IC electrical -> electric (m>0) FUL -> hopeful -> hope (m>0) NESS -> goodness -> good Step 4 (m>1) AL -> revival -> reviv (m>1) ANCE -> allowance -> allow (m>1) ENCE -> inference -> infer (m>1) ER -> airliner -> airlin (m>1) IC -> gyroscopic -> gyroscop (m>1) ABLE -> adjustable -> adjust (m>1) IBLE -> defensible -> defens (m>1) ANT -> irritant -> irrit (m>1) EMENT -> replacement -> replac (m>1) MENT -> adjustment -> adjust (m>1) ENT -> dependent -> depend (m>1 and (*S or *T)) ION -> adoption -> adopt (m>1) OU -> homologou -> homolog (m>1) ISM -> communism -> commun (m>1) ATE -> activate -> activ (m>1) ITI -> angulariti -> angular (m>1) OUS -> homologous -> homolog (m>1) IVE -> effective -> effect (m>1) IZE -> bowdlerize -> bowdler The suffixes are now removed. All that remains is a little tidying up. Step 5a (m>1) E -> probate -> probat rate -> rate (m=1 and not *o) E -> cease -> ceas Step 5b (m > 1 and *d and *L) -> single letter controll -> control roll -> roll
(module porter-stemmer scheme
(define or-map ormap)
(define and-map andmap)
(define (consonant? str i)
(let ((char (list-ref str i)))
(cond ((member char ‘(#\a #\e #\i #\o #\u)) #f)
((equal? char #\y)
(if (= i 0) #t (not (consonant? str (- i 1)))))
(else #t))))
;; m() measures the number of consonant sequences between k0 and j. if c is
;; a consonant sequence and v a vowel sequence, and indicates arbitrary
;; presence,
;; gives 0
;; vc gives 1
;; vcvc gives 2
;; vcvcvc gives 3
;; ….
(define (num-consonant-sequences str)
(letrec ((helper (λ(str lim i prefix-ends-in-consonant?)
(if (0 stem) (1 stem) (cry in step1b
(has-vowel? (rest str)))))
(define (ends-in-double-consonant? str)
(let ((r (reverse str)))
(and (< 1 (length str))
(equal? (first r) (second r))
(consonant? r 0))))
;; cvc(i) is TRUE i-2,i-1,i has the form consonant – vowel – consonant
;; and also if the second c is not w,x or y. this is used when trying to
;; restore an e at the end of a short word. e.g.
;; cav(e), lov(e), hop(e), crim(e), but
;; snow, box, tray.
(define (cvc? str)
(let ((last (- (length str) 1)))
(and (< 1 last)
(consonant? str last)
(not (consonant? str (- last 1)))
(consonant? str (- last 2))
(not (member (list-ref str last) '(#\w #\x #\y))))))
(define (cvcv? str)
(let ((nonvowel (memf (λ(x) (not (member x '(#\a #\e #\i #\o #\u)))) (reverse str))))
(and nonvowel (cvc? (reverse nonvowel)))))
;; ends(s) is TRUE k0,…k ends with the string s.
(define (starts-with str beginning)
“true iff beginning is a substring of str starting at 0”
(or (if (empty? beginning) str false)
(and (not (empty? str))
(equal? (first str) (first beginning))
(starts-with (rest str) (rest beginning)))))
(define (ends-with str ending)
(let ((stem (starts-with (reverse str) (reverse ending))))
(if stem (reverse stem) #f)))
(define (ends str ending)
(ends-with str (string->list ending)))
(define (replace-ending-if str condition ending replacement)
(let ((stem (ends-with str (string->list ending))))
(if stem
(if (condition stem)
(append stem (string->list replacement))
str)
#f)))
(define (->true any) true)
(define (rules str proc replacements)
(or (or-map proc replacements) str))
;; step1ab() gets rid of plurals and -ed or -ing. e.g.
;; caresses -> caress
;; ponies -> poni
;; ties -> ti
;; caress -> caress
;; cats -> cat
;; feed -> feed
;; agreed -> agree
;; disabled -> disable
;; matting -> mat
;; mating -> mate
;; meeting -> meet
;; milling -> mill
;; messing -> mess
;; meetings -> meet
(define (step1a str)
(rules
str
(λ(p) (replace-ending-if str ->true (car p) (cdr p)))
‘((“sses” . “ss”)
(“ies” . “i”)
(“ss” . “ss”)
(“s” . “”))))
(define (step1b str)
(let ((ed-ing-stem (or (ends str “ed”) (ends str “ing”))))
(if ed-ing-stem
(or (replace-ending-if str m>0 “eed” “ee”)
(if (has-vowel? ed-ing-stem)
(or (replace-ending-if str ->true “ated” “ate”)
(replace-ending-if str ->true “ating” “ate”)
(replace-ending-if str ->true “bled” “ble”)
(replace-ending-if str ->true “bling” “ble”)
(replace-ending-if str ->true “ized” “ize”)
(replace-ending-if str ->true “izing” “ize”)
(if (ends-in-double-consonant? ed-ing-stem)
(if (member (list-ref ed-ing-stem (- (length ed-ing-stem) 1)) ‘(#\l #\s #\z))
ed-ing-stem
(reverse (rest (reverse ed-ing-stem))))
#f)
(if (and (m=1 ed-ing-stem) (cvc? ed-ing-stem))
(append ed-ing-stem ‘(#\e))
ed-ing-stem))
str))
str)))
;; step1c() turns terminal y to i when there is another vowel in the stem.
(define (step1c str)
(or (replace-ending-if str has-vowel? “y” “i”) str))
;; step2() maps double suffices to single ones. so -ization ( = -ize plus
;; -ation) maps to -ize etc. Call this function only if the string before the
;; suffix has at least one non-initial consonant-sequence.
(define (step2 str)
(or
(or-map
(lambda (end-pattern)
(replace-ending-if str m>0 (car end-pattern) (cdr end-pattern)))
‘((“ational” . “ate”)
(“tional” . “tion”)
(“enci” . “ence”)
(“anci” . “ance”)
(“izer” . “ize”)
(“bli” . “ble”)
;; -DEPARTURE-
;; To match the published algorithm, replace prev line with
;; (“abli” . “able”)
(“alli” . “al”)
(“entli” . “ent”)
(“eli” . “e”)
(“ousli” . “ous”)
(“ization” . “ize”)
(“ation” . “ate”)
(“ator” . “ate”)
(“alism” . “al”)
(“iveness” . “ive”)
(“fulness” . “ful”)
(“ousness” . “ous”)
(“aliti” . “al”)
(“iviti” . “ive”)
(“biliti” . “ble”)
(“logi” . “log”) ;; DEPARTURE: the published algorithm does not have this line
))
str))
;; step3() deals with -ic-, -full, -ness etc. similar strategy to step2.
(define (step3 str)
(or
(or-map
(lambda (end-pattern)
(replace-ending-if str m>0 (car end-pattern) (cdr end-pattern)))
‘((“icate” . “ic”)
(“ative” . “”) ; huh?
(“alize” . “al”)
(“iciti” . “ic”)
(“ical” . “ic”)
(“ful” . “”) ; huh?
(“ness” . “”) ; huh?
))
str))
;; step4() takes off -ant, -ence etc., in context vcvc.
(define (step4 str)
(or
(or-map
(lambda (ending)
(replace-ending-if str m>1 ending “”))
‘(“al” “ance” “ence” “er” “ic” “able” “ible” “ant” “ement” “ment” “ent”
“ou” ;; for -ous
“ism” “ate” “iti” “ous” “ive” “ize”))
(replace-ending-if str (λ(stem) (m>1 (append stem ‘(#\s)))) “sion” “s”)
(replace-ending-if str (λ(stem) (m>1 (append stem ‘(#\t)))) “tion” “t”)
str))
;; step5 removes a final -e if m() > 1, and changes -ll to -l if m() > 1.
(define (step5a str)
(let ((stem-e (ends str “e”)))
(if (and stem-e
(let ((m (num-consonant-sequences stem-e)))
(or (1 str)) “ll” “l”) str))
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
(define (puts . stuff)
(map display stuff)
(newline)
(car (reverse stuff)))
(define (steps str)
(let* ((zero (puts “step0: ” (string->list str)))
(one-a (puts “step1a: ” (step1a zero)))
(one-b (puts “step1b: ” (step1b one-a)))
(one-c (puts “step1c: ” (step1c one-b)))
(two (puts “step2: ” (step2 one-c)))
(three (puts “step3: ” (step3 two)))
(four (puts “step4: ” (step4 three)))
(five-a (puts “step5a: ” (step5a four)))
(done (puts “step5b: ” (step5b five-a)))
)
done))
;; In stem(p,i,j), p is a char pointer, and the string to be stemmed is from p[i] to p[j]
;; inclusive. Typically i is zero and j is the offset to the last character of a string,
;; (p[j+1] == ”). The stemmer adjusts the characters p[i] … p[j] and returns the new
;; end-point of the string, k. Stemming never increases word length, so i <= k <= j. To
;; turn the stemmer into a module, declare 'stem' as extern, and delete the remainder of
;; this file.
(define (stem str)
(let ((len (string-length str)))
(if (string
(step5b
(step5a
(step4
(step3
(step2
(step1c
(step1b
(step1a
(string->list str)))))))))))))
(provide stem)
)
#lang scheme
;;; The above should have come with the documentation:
;; The Porter Stemming Algorithm, somewhat mechanically hand translated to Common Lisp by
;; Steven M. Haflich smh@franz.com Feb 2002. Subsequently translated to Scheme and made
;; wholly inefficient, if more readable, by Gregory Marton in 2009.
;; This port has only been tested with PLT Scheme, though there is little PLT-specific code.
;; This code is offered in the hope it will be useful, but with no warranty of correctness,
;; suitability, usability, or anything else.
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; This is the Porter stemming algorithm, as presented ins
;; Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14,
;; no. 3, pp 130-137,
;; only differing from it at the points maked –DEPARTURE– below.
;; See also http://www.tartarus.org/~martin/PorterStemmer
;; The algorithm as described in the paper could be exactly replicated
;; by adjusting the points of DEPARTURE, but this is barely necessary,
;; because (a) the points of DEPARTURE are definitely improvements, and
;; (b) no encoding of the Porter stemmer I have seen is anything like
;; as exact as this version, even with the points of DEPARTURE!
;; Release 1
;; Note that only lower case sequences are stemmed. Forcing to lower case
;; should be done before stem(…) is called.
;; This implementation is not a particularly fast one.
;; In particular, there is a common optimization to make the many ends-with
;; tests faster by switching on the last or penultimate letter, which I chose
;; not to use here for the sake of readability.
;; I suspect without proof that the many reversals could be ameliorated by just
;; reversing the string in the first place and pre-reversing the patterns, or
;; by maintaining a last position in the string vector instead of using a list of
;; characters.
;;; and the tests, which refer to files that can be found at:
;;; http://tartarus.org/~martin/PorterStemmer/
(require (planet schematics/schemeunit:3:4))
(require (planet schematics/schemeunit:3/text-ui))
(require “files.ss”)
(require “porter-stemmer.ss”)
(define *the-tests*
(test-suite
“stemming”
(let ((count 0))
(test-case
“stemming”
(for-each-line-in-file
“porter-test.txt”
(lambda (line)
(set! count (+ 1 count)) (when (integer? (/ count 1000)) (display “.”))
;;(display line) (newline)
(let* ((untab (regexp-split #px”\t” (chomp line)))
(original (first untab))
(stemmed (second untab))
(result (stem original)))
(check-equal? result stemmed original))))))))
(define num-errors (run-tests *the-tests*))
(define (test) (exit num-errors))
(provide test)
A very nice Programming Praxis example, as here presented. I have created a link to it from the Porter Stemmer page.
[…] Praxis porter stemming part1 […]
nice example
Reblogged this on {blog;}.