Porter Stemming

September 8, 2009

Here is Porter’s description of his algorithm:


To present the suffix stripping algorithm in its entirety we will need a few
difinitions.

A \consonant\ in a word is a letter other than A, E, I, O or U, and other
than Y preceded by a consonant. (The fact that the term `consonant' is
defined to some extent in terms of itself does not make it ambiguous.) So in
TOY the consonants are T and Y, and in SYZYGY they are S, Z and G. If a
letter is not a consonant it is a \vowel\.

A consonant will be denoted by c, a vowel by v. A list ccc... of length
greater than 0 will be denoted by C, and a list vvv... of length greater
than 0 will be denoted by V. Any word, or part of a word, therefore has one
of the four forms:

    CVCV ... C
    CVCV ... V
    VCVC ... C
    VCVC ... V

These may all be represented by the single form

    [C]VCVC ... [V]

where the square brackets denote arbitrary presence of their contents.
Using (VC){m} to denote VC repeated m times, this may again be written as

    [C](VC){m}[V].

m will be called the \measure\ of any word or word part when represented in
this form. The case m = 0 covers the null word. Here are some examples:

    m=0    TR,  EE,  TREE,  Y,  BY.
    m=1    TROUBLE,  OATS,  TREES,  IVY.
    m=2    TROUBLES,  PRIVATE,  OATEN,  ORRERY.

The \rules\ for removing a suffix will be given in the form

    (condition) S1 -> S2

This means that if a word ends with the suffix S1, and the stem before S1
satisfies the given condition, S1 is replaced by S2. The condition is
usually given in terms of m, e.g.

    (m > 1) EMENT ->

Here S1 is `EMENT' and S2 is null. This would map REPLACEMENT to REPLAC,
since REPLAC is a word part for which m = 2.

The `condition' part may also contain the following:

*S  - the stem ends with S (and similarly for the other letters).

*v* - the stem contains a vowel.

*d  - the stem ends with a double consonant (e.g. -TT, -SS).

*o  - the stem ends cvc, where the second c is not W, X or Y (e.g.
       -WIL, -HOP).

And the condition part may also contain expressions with \and\, \or\ and
\not\, so that

    (m>1 and (*S or *T))

tests for a stem with m>1 ending in S or T, while

    (*d and not (*L or *S or *Z))

tests for a stem ending witha double consonant other than L, S or Z.
Elaborate conditions like this are required only rarely.

In a set of rules written beneath each other, only one is obeyed, and this
will be the one with the longest matching S1 for the given word. For
example, with

    SSES -> SS
    IES  -> I
    SS   -> SS
    S    ->

(here the conditions are all null) CARESSES maps to CARESS since SSES is
the longest match for S1. Equally CARESS maps to CARESS (S1=`SS') and CARES
to CARE (S1=`S').

In the rules below, examples of their application, successful or otherwise,
are given on the right in lower case. The algorithm now follows:

Step 1a

    SSES -> SS                         caresses  ->  caress
    IES  -> I                          ponies    ->  poni
                                       ties      ->  ti
    SS   -> SS                         caress    ->  caress
    S    ->                            cats      ->  cat

Step 1b

    (m>0) EED -> EE                    feed      ->  feed
                                       agreed    ->  agree
    (*v*) ED  ->                       plastered ->  plaster
                                       bled      ->  bled
    (*v*) ING ->                       motoring  ->  motor
                                       sing      ->  sing

If the second or third of the rules in Step 1b is successful, the following
is done:

    AT -> ATE                       conflat(ed)  ->  conflate
    BL -> BLE                       troubl(ed)   ->  trouble
    IZ -> IZE                       siz(ed)      ->  size
    (*d and not (*L or *S or *Z))
       -> single letter
                                    hopp(ing)    ->  hop
                                    tann(ed)     ->  tan
                                    fall(ing)    ->  fall
                                    hiss(ing)    ->  hiss
                                    fizz(ed)     ->  fizz
    (m=1 and *o) -> E               fail(ing)    ->  fail
                                    fil(ing)     ->  file

The rule to map to a single letter causes the removal of one of the double
letter pair. The -E is put back on -AT, -BL and -IZ, so that the suffixes
-ATE, -BLE and -IZE can be recognised later. This E may be removed in step
4.

Step 1c

    (*v*) Y -> I                    happy        ->  happi
                                    sky          ->  sky

Step 1 deals with plurals and past participles. The subsequent steps are
much more straightforward.

Step 2

    (m>0) ATIONAL ->  ATE           relational     ->  relate
    (m>0) TIONAL  ->  TION          conditional    ->  condition
                                    rational       ->  rational
    (m>0) ENCI    ->  ENCE          valenci        ->  valence
    (m>0) ANCI    ->  ANCE          hesitanci      ->  hesitance
    (m>0) IZER    ->  IZE           digitizer      ->  digitize
    (m>0) ABLI    ->  ABLE          conformabli    ->  conformable
    (m>0) ALLI    ->  AL            radicalli      ->  radical
    (m>0) ENTLI   ->  ENT           differentli    ->  different
    (m>0) ELI     ->  E             vileli        - >  vile
    (m>0) OUSLI   ->  OUS           analogousli    ->  analogous
    (m>0) IZATION ->  IZE           vietnamization ->  vietnamize
    (m>0) ATION   ->  ATE           predication    ->  predicate
    (m>0) ATOR    ->  ATE           operator       ->  operate
    (m>0) ALISM   ->  AL            feudalism      ->  feudal
    (m>0) IVENESS ->  IVE           decisiveness   ->  decisive
    (m>0) FULNESS ->  FUL           hopefulness    ->  hopeful
    (m>0) OUSNESS ->  OUS           callousness    ->  callous
    (m>0) ALITI   ->  AL            formaliti      ->  formal
    (m>0) IVITI   ->  IVE           sensitiviti    ->  sensitive
    (m>0) BILITI  ->  BLE           sensibiliti    ->  sensible

The test for the string S1 can be made fast by doing a program switch on
the penultimate letter of the word being tested. This gives a fairly even
breakdown of the possible values of the string S1. It will be seen in fact
that the S1-strings in step 2 are presented here in the alphabetical order
of their penultimate letter. Similar techniques may be applied in the other
steps.

Step 3

    (m>0) ICATE ->  IC              triplicate     ->  triplic
    (m>0) ATIVE ->                  formative      ->  form
    (m>0) ALIZE ->  AL              formalize      ->  formal
    (m>0) ICITI ->  IC              electriciti    ->  electric
    (m>0) ICAL  ->  IC              electrical     ->  electric
    (m>0) FUL   ->                  hopeful        ->  hope
    (m>0) NESS  ->                  goodness       ->  good

Step 4

    (m>1) AL    ->                  revival        ->  reviv
    (m>1) ANCE  ->                  allowance      ->  allow
    (m>1) ENCE  ->                  inference      ->  infer
    (m>1) ER    ->                  airliner       ->  airlin
    (m>1) IC    ->                  gyroscopic     ->  gyroscop
    (m>1) ABLE  ->                  adjustable     ->  adjust
    (m>1) IBLE  ->                  defensible     ->  defens
    (m>1) ANT   ->                  irritant       ->  irrit
    (m>1) EMENT ->                  replacement    ->  replac
    (m>1) MENT  ->                  adjustment     ->  adjust
    (m>1) ENT   ->                  dependent      ->  depend
    (m>1 and (*S or *T)) ION ->     adoption       ->  adopt
    (m>1) OU    ->                  homologou      ->  homolog
    (m>1) ISM   ->                  communism      ->  commun
    (m>1) ATE   ->                  activate       ->  activ
    (m>1) ITI   ->                  angulariti     ->  angular
    (m>1) OUS   ->                  homologous     ->  homolog
    (m>1) IVE   ->                  effective      ->  effect
    (m>1) IZE   ->                  bowdlerize     ->  bowdler

The suffixes are now removed. All that remains is a little tidying up.

Step 5a

    (m>1) E     ->                  probate        ->  probat
                                    rate           ->  rate
    (m=1 and not *o) E ->           cease          ->  ceas

Step 5b

    (m > 1 and *d and *L) -> single letter
                                    controll       ->  control
                                    roll           ->  roll
Advertisement

Pages: 1 2 3

6 Responses to “Porter Stemming”

  1. (module porter-stemmer scheme

    (define or-map ormap)
    (define and-map andmap)

    (define (consonant? str i)
    (let ((char (list-ref str i)))
    (cond ((member char ‘(#\a #\e #\i #\o #\u)) #f)
    ((equal? char #\y)
    (if (= i 0) #t (not (consonant? str (- i 1)))))
    (else #t))))

    ;; m() measures the number of consonant sequences between k0 and j. if c is
    ;; a consonant sequence and v a vowel sequence, and indicates arbitrary
    ;; presence,

    ;; gives 0
    ;; vc gives 1
    ;; vcvc gives 2
    ;; vcvcvc gives 3
    ;; ….

    (define (num-consonant-sequences str)
    (letrec ((helper (λ(str lim i prefix-ends-in-consonant?)
    (if (0 stem) (1 stem) (cry in step1b
    (has-vowel? (rest str)))))

    (define (ends-in-double-consonant? str)
    (let ((r (reverse str)))
    (and (< 1 (length str))
    (equal? (first r) (second r))
    (consonant? r 0))))

    ;; cvc(i) is TRUE i-2,i-1,i has the form consonant – vowel – consonant
    ;; and also if the second c is not w,x or y. this is used when trying to
    ;; restore an e at the end of a short word. e.g.

    ;; cav(e), lov(e), hop(e), crim(e), but
    ;; snow, box, tray.

    (define (cvc? str)
    (let ((last (- (length str) 1)))
    (and (< 1 last)
    (consonant? str last)
    (not (consonant? str (- last 1)))
    (consonant? str (- last 2))
    (not (member (list-ref str last) '(#\w #\x #\y))))))

    (define (cvcv? str)
    (let ((nonvowel (memf (λ(x) (not (member x '(#\a #\e #\i #\o #\u)))) (reverse str))))
    (and nonvowel (cvc? (reverse nonvowel)))))

    ;; ends(s) is TRUE k0,…k ends with the string s.

    (define (starts-with str beginning)
    “true iff beginning is a substring of str starting at 0”
    (or (if (empty? beginning) str false)
    (and (not (empty? str))
    (equal? (first str) (first beginning))
    (starts-with (rest str) (rest beginning)))))

    (define (ends-with str ending)
    (let ((stem (starts-with (reverse str) (reverse ending))))
    (if stem (reverse stem) #f)))
    (define (ends str ending)
    (ends-with str (string->list ending)))

    (define (replace-ending-if str condition ending replacement)
    (let ((stem (ends-with str (string->list ending))))
    (if stem
    (if (condition stem)
    (append stem (string->list replacement))
    str)
    #f)))

    (define (->true any) true)
    (define (rules str proc replacements)
    (or (or-map proc replacements) str))

    ;; step1ab() gets rid of plurals and -ed or -ing. e.g.

    ;; caresses -> caress
    ;; ponies -> poni
    ;; ties -> ti
    ;; caress -> caress
    ;; cats -> cat

    ;; feed -> feed
    ;; agreed -> agree
    ;; disabled -> disable

    ;; matting -> mat
    ;; mating -> mate
    ;; meeting -> meet
    ;; milling -> mill
    ;; messing -> mess

    ;; meetings -> meet

    (define (step1a str)
    (rules
    str
    (λ(p) (replace-ending-if str ->true (car p) (cdr p)))
    ‘((“sses” . “ss”)
    (“ies” . “i”)
    (“ss” . “ss”)
    (“s” . “”))))

    (define (step1b str)
    (let ((ed-ing-stem (or (ends str “ed”) (ends str “ing”))))
    (if ed-ing-stem
    (or (replace-ending-if str m>0 “eed” “ee”)
    (if (has-vowel? ed-ing-stem)
    (or (replace-ending-if str ->true “ated” “ate”)
    (replace-ending-if str ->true “ating” “ate”)
    (replace-ending-if str ->true “bled” “ble”)
    (replace-ending-if str ->true “bling” “ble”)
    (replace-ending-if str ->true “ized” “ize”)
    (replace-ending-if str ->true “izing” “ize”)
    (if (ends-in-double-consonant? ed-ing-stem)
    (if (member (list-ref ed-ing-stem (- (length ed-ing-stem) 1)) ‘(#\l #\s #\z))
    ed-ing-stem
    (reverse (rest (reverse ed-ing-stem))))
    #f)
    (if (and (m=1 ed-ing-stem) (cvc? ed-ing-stem))
    (append ed-ing-stem ‘(#\e))
    ed-ing-stem))
    str))
    str)))

    ;; step1c() turns terminal y to i when there is another vowel in the stem.

    (define (step1c str)
    (or (replace-ending-if str has-vowel? “y” “i”) str))

    ;; step2() maps double suffices to single ones. so -ization ( = -ize plus
    ;; -ation) maps to -ize etc. Call this function only if the string before the
    ;; suffix has at least one non-initial consonant-sequence.

    (define (step2 str)
    (or
    (or-map
    (lambda (end-pattern)
    (replace-ending-if str m>0 (car end-pattern) (cdr end-pattern)))
    ‘((“ational” . “ate”)
    (“tional” . “tion”)
    (“enci” . “ence”)
    (“anci” . “ance”)
    (“izer” . “ize”)
    (“bli” . “ble”)
    ;; -DEPARTURE-
    ;; To match the published algorithm, replace prev line with
    ;; (“abli” . “able”)
    (“alli” . “al”)
    (“entli” . “ent”)
    (“eli” . “e”)
    (“ousli” . “ous”)
    (“ization” . “ize”)
    (“ation” . “ate”)
    (“ator” . “ate”)
    (“alism” . “al”)
    (“iveness” . “ive”)
    (“fulness” . “ful”)
    (“ousness” . “ous”)
    (“aliti” . “al”)
    (“iviti” . “ive”)
    (“biliti” . “ble”)
    (“logi” . “log”) ;; DEPARTURE: the published algorithm does not have this line
    ))
    str))

    ;; step3() deals with -ic-, -full, -ness etc. similar strategy to step2.

    (define (step3 str)
    (or
    (or-map
    (lambda (end-pattern)
    (replace-ending-if str m>0 (car end-pattern) (cdr end-pattern)))
    ‘((“icate” . “ic”)
    (“ative” . “”) ; huh?
    (“alize” . “al”)
    (“iciti” . “ic”)
    (“ical” . “ic”)
    (“ful” . “”) ; huh?
    (“ness” . “”) ; huh?
    ))
    str))

    ;; step4() takes off -ant, -ence etc., in context vcvc.

    (define (step4 str)
    (or
    (or-map
    (lambda (ending)
    (replace-ending-if str m>1 ending “”))
    ‘(“al” “ance” “ence” “er” “ic” “able” “ible” “ant” “ement” “ment” “ent”
    “ou” ;; for -ous
    “ism” “ate” “iti” “ous” “ive” “ize”))
    (replace-ending-if str (λ(stem) (m>1 (append stem ‘(#\s)))) “sion” “s”)
    (replace-ending-if str (λ(stem) (m>1 (append stem ‘(#\t)))) “tion” “t”)
    str))

    ;; step5 removes a final -e if m() > 1, and changes -ll to -l if m() > 1.
    (define (step5a str)
    (let ((stem-e (ends str “e”)))
    (if (and stem-e
    (let ((m (num-consonant-sequences stem-e)))
    (or (1 str)) “ll” “l”) str))

    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

    (define (puts . stuff)
    (map display stuff)
    (newline)
    (car (reverse stuff)))

    (define (steps str)
    (let* ((zero (puts “step0: ” (string->list str)))
    (one-a (puts “step1a: ” (step1a zero)))
    (one-b (puts “step1b: ” (step1b one-a)))
    (one-c (puts “step1c: ” (step1c one-b)))
    (two (puts “step2: ” (step2 one-c)))
    (three (puts “step3: ” (step3 two)))
    (four (puts “step4: ” (step4 three)))
    (five-a (puts “step5a: ” (step5a four)))
    (done (puts “step5b: ” (step5b five-a)))
    )
    done))

    ;; In stem(p,i,j), p is a char pointer, and the string to be stemmed is from p[i] to p[j]
    ;; inclusive. Typically i is zero and j is the offset to the last character of a string,
    ;; (p[j+1] == ”). The stemmer adjusts the characters p[i] … p[j] and returns the new
    ;; end-point of the string, k. Stemming never increases word length, so i <= k <= j. To
    ;; turn the stemmer into a module, declare 'stem' as extern, and delete the remainder of
    ;; this file.

    (define (stem str)
    (let ((len (string-length str)))
    (if (string
    (step5b
    (step5a
    (step4
    (step3
    (step2
    (step1c
    (step1b
    (step1a
    (string->list str)))))))))))))
    (provide stem)

    )

  2. #lang scheme
    ;;; The above should have come with the documentation:

    ;; The Porter Stemming Algorithm, somewhat mechanically hand translated to Common Lisp by
    ;; Steven M. Haflich smh@franz.com Feb 2002. Subsequently translated to Scheme and made
    ;; wholly inefficient, if more readable, by Gregory Marton in 2009.

    ;; This port has only been tested with PLT Scheme, though there is little PLT-specific code.
    ;; This code is offered in the hope it will be useful, but with no warranty of correctness,
    ;; suitability, usability, or anything else.

    ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

    ;; This is the Porter stemming algorithm, as presented ins

    ;; Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14,
    ;; no. 3, pp 130-137,

    ;; only differing from it at the points maked –DEPARTURE– below.

    ;; See also http://www.tartarus.org/~martin/PorterStemmer

    ;; The algorithm as described in the paper could be exactly replicated
    ;; by adjusting the points of DEPARTURE, but this is barely necessary,
    ;; because (a) the points of DEPARTURE are definitely improvements, and
    ;; (b) no encoding of the Porter stemmer I have seen is anything like
    ;; as exact as this version, even with the points of DEPARTURE!

    ;; Release 1

    ;; Note that only lower case sequences are stemmed. Forcing to lower case
    ;; should be done before stem(…) is called.

    ;; This implementation is not a particularly fast one.
    ;; In particular, there is a common optimization to make the many ends-with
    ;; tests faster by switching on the last or penultimate letter, which I chose
    ;; not to use here for the sake of readability.
    ;; I suspect without proof that the many reversals could be ameliorated by just
    ;; reversing the string in the first place and pre-reversing the patterns, or
    ;; by maintaining a last position in the string vector instead of using a list of
    ;; characters.

    ;;; and the tests, which refer to files that can be found at:
    ;;; http://tartarus.org/~martin/PorterStemmer/

    (require (planet schematics/schemeunit:3:4))
    (require (planet schematics/schemeunit:3/text-ui))
    (require “files.ss”)
    (require “porter-stemmer.ss”)

    (define *the-tests*
    (test-suite
    “stemming”
    (let ((count 0))
    (test-case
    “stemming”
    (for-each-line-in-file
    “porter-test.txt”
    (lambda (line)
    (set! count (+ 1 count)) (when (integer? (/ count 1000)) (display “.”))
    ;;(display line) (newline)
    (let* ((untab (regexp-split #px”\t” (chomp line)))
    (original (first untab))
    (stemmed (second untab))
    (result (stem original)))
    (check-equal? result stemmed original))))))))

    (define num-errors (run-tests *the-tests*))
    (define (test) (exit num-errors))
    (provide test)

  3. A very nice Programming Praxis example, as here presented. I have created a link to it from the Porter Stemmer page.

  4. […] Praxis porter stemming part1 […]

  5. kishor parkhe said

    nice example

  6. nsmeef said

    Reblogged this on {blog;}.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: