I Before E

November 29, 2013

Much as I love Scheme, it’s not the ideal language for every task. Here I choose Awk, because it provides field-splitting and regular-expression matching:

awk '
    /^#/ { next }
    $1 ~ /CIE/
    $1 ~ /[ABD-Z]EI/ &&
        $0 !~ / EY[012]/
' c06d

The first rule skips the header lines, which begin with #, at the start of the file. The second rule finds words like ANCIENT and SCIENCE that are spelled I-before-E even after C. The third rule, which spans two lines, finds words in which E-before-I follows some letter other than C and there is no long-A in the pronunciation; it finds words like FOREIGN and HEIGHT. The entire output contains 2299 lines, which you can see on the next page.
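To see which rule catches which word, here is a minimal sketch of the same program with each match tagged. The `classify` function name, the two tag strings, and the six sample entries are assumptions made for illustration; the entries imitate the dictionary's WORD-then-phonemes layout and are not copied from c06d.

```shell
# Hypothetical helper: the same three rules, but each match is
# printed together with a tag naming the rule that caught it.
classify() {
    awk '
        /^#/ { next }                              # skip header lines
        $1 ~ /CIE/ { print $1, "ie-after-c" }      # I-before-E even after C
        $1 ~ /[ABD-Z]EI/ && $0 !~ / EY[012]/ {     # E-before-I, not after C,
            print $1, "ei-without-long-a"          # and no long-A phoneme
        }
    '
}

# Made-up sample input in the dictionary layout (illustrative only).
classify <<'EOF'
## sample header
ANCIENT  EY1 N CH AH0 N T
SCIENCE  S AY1 AH0 N S
FOREIGN  F AO1 R AH0 N
HEIGHT  HH AY1 T
WEIGHT  W EY1 T
RECEIVE  R AH0 S IY1 V
EOF
```

On this sample, ANCIENT and SCIENCE are tagged by the first pattern, and FOREIGN and HEIGHT by the second. WEIGHT is skipped because its pronunciation contains a long-A, and RECEIVE is skipped because its EI follows C.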

You can see the program at http://programmingpraxis.codepad.org/PEAjJaQJ.



4 Responses to “I Before E”

  1. Paul said

    A very interesting problem. I looked at your solution and I think I see some problems. In the CMU list you find, for example:
    ATHEISM AH0 TH AY1 S AH0 M (– identified by your method –)
    ATHEISM(1) EY1 TH IY0 IH2 Z AH0 M
    ATHEIST EY1 TH IY0 AH0 S T
    ATHEISTIC EY2 TH IY0 IH1 S T IH0 K
    It is clear that none of these entries obeys the rule, as none of the EIs maps to EY1 or EY2. Lines 2 through 4 do contain an EY1 or EY2 sound, but it belongs to the leading A, not to the EI.
    And what about IE combinations that sound like EY0, EY1 or EY2? Such combinations are probably not in the list, but I did not see this checked in your script.

  2. programmingpraxis said

    @Paul: Obviously my approach is imperfect. The solution is for you to write a better program.

  3. Paul said

    This is IMO a very tough problem. The list contains a lot of English words, but also many foreign names (German, Scottish, etc.). My attempt can be found here and the list of exceptions (2521 in total) is here.
    First I tried to find the location of the “IE” or “EI” in the word and then compare that with the location of “EY” in the phonetics. This is not perfect, as the characters of the word often do not map to the phonetics. Then I tried to convert the word and the phonetics to a character string like “CVCVCSC”, where C, V and S stand for consonant, vowel and special (IE or EI), and then map the two character strings onto each other. That works somewhat better, but it is still not perfect.

  4. programmingpraxis said

    Agreed. This is a very tough problem. Like you, I had trouble matching the spelling with the phonetics, and finally I just gave up; sometimes a simple solution that is mostly right is better than a much more complicated solution that is better, but still only mostly right.

    I was amused to find this exercise in a beginning programming course. I think they ignored the “when sounded as AY” part and just looked at the “except after C” part, which is easy enough.
