Element Words

April 14, 2017

My periodic table doesn’t include some of the fancy new chemicals that have been recently created:

(define elements (map string-downcase (map symbol->string
  '(H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc
  Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr
  Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs Ba La Ce Pr
  Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt
  Au Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk
  Cf Es Fm Md No Lr Rf Db Sg Bh Hs Mt))))

A word is an element word if its first character is in the list of elements, or its first two characters are in the list of elements, and all the remaining characters after the first or second also form an element word:

(define (eword? str) ; element-word
  (let loop ((cs (string->list str)))
    (or (null? cs)
        (and (member (string (car cs)) elements)
             (loop (cdr cs)))
        (and (pair? (cdr cs))
             (member (list->string (take 2 cs)) elements)
             (loop (cddr cs))))))

For example, the word “phosphorus” can be formed as shown below, with “ho” formed either as hydrogen and oxygen or as holmium:

p  phosphorous
ho holmium
s  sulfur
p  phosphorous
h  hydrogen
o  oxygen
ru ruthenium
s  sulfur
> (eword? "phosphorus")
#t

We give the dictionary as a list of words; you could arrange to read the words from a file if you prefer:

(define (element-words words)
  (let loop ((words words) (max-len 0) (max-words (list)))
    (if (null? words) (reverse max-words)
      (let* ((word (car words)) (word-len (string-length word)))
        (cond ((< word-len max-len) (loop (cdr words) max-len max-words))
              ((not (eword? word)) (loop (cdr words) max-len max-words))
              ((< max-len word-len) (loop (cdr words) word-len (list word)))
              (else (loop (cdr words) max-len (cons word max-words))))))))

Here is a list of element names, which we use as a dictionary:

(define element-names (map string-downcase (map symbol->string
  '(Hydrogen Helium Lithium Beryllium Boron Carbon Nitrogen
  Oxygen Fluorine Neon Sodium Magnesium Aluminum Silicon
  Phosphorus Sulfur Chlorine Argon Potassium Calcium Scandium
  Titanium Vanadium Chromium Manganese Iron Cobalt Nickel
  Copper Zinc Gallium Germanium Arsenic Selenium Bromine
  Krypton Rubidium Strontium Yttrium Zirconium Niobium
  Molybdenum Technetium Ruthenium Rhodium Palladium Silver
  Cadmium Indium Tin Antimony Tellurium Iodine Xenon Cesium
  Barium Lanthanum Cerium Praseodymium Neodymium Promethium
  Samarium Europium Gadolinium Terbium Dysprosium Holmium
  Erbium Thulium Ytterbium Lutetium Hafnium Tantalum Tungsten
  Rhenium Osmium Iridium Platinum Gold Mercury Thallium Lead
  Bismuth Polonium Astatine Radon Francium Radium Actinium
  Thorium Protactinium Uranium Neptunium Plutonium Americium
  Curium Berkelium Californium Einsteinium Fermium Mendelevium
  Nobelium Lawrencium Rutherfordium Dubnium Seaborgium Bohrium
  Hassium Meitnerium))))

There are twelve element names that can be formed from the list of elements, of which the longest is phosphorus:

> (filter eword? element-names)
("carbon" "neon" "silicon" "phosphorus" "iron" "copper"
  "arsenic" "krypton" "tin" "xenon" "bismuth" "astatine")
> (element-words element-names)
("phosphorus")

You can run the program at http://ideone.com/6woBy6.

Advertisements

Pages: 1 2

8 Responses to “Element Words”

  1. Paul said

    In Python using an english dictionary of about 125000 words.

    from string import ascii_lowercase
    from glob import glob
    
    WORDS = [w.strip().lower()
             for f in glob("D:\Data\Development\spell\en\english.?")
             for w in open(f)]
    WORDS = sorted(WORDS, key=len, reverse=True)
    
    ECHAR = """H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe
    Co Ni Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I
    Xe Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au
    Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf Db
    Sg Bh Hs Mt""".lower().split()
    
    ECOMB = {c: set(w for w in ECHAR if w.startswith(c)) for c in ascii_lowercase}
    
    def check_word(word):
        Q = [(word, [])]
        while Q:
            remaining, elements = Q.pop()
            if not remaining:
                return elements
            for e in ECOMB.get(remaining[0], "???"):
                if remaining.startswith(e):
                    Q.append((remaining[len(e):], elements+[e]))
        return False
    
    print("Number of words:", len(WORDS))
    for word in WORDS:
        res = check_word(word)
        if res:
            print(word, len(word), " ".join(res))
    
    # Number of words: 125346
    # nonrepresentationalism 22 no n re p re se n ta ti o na li sm
    
  2. matthew said

    Stand back, I know Regular Expressions:

    $ egrep '^(h|he|li|be|b|c|n|o|f|ne|na|mg|al|si|p|s|cl|ar|k|ca|sc|ti|v|cr|mn|fe|co|ni|cu|zn|ga|ge|as|se|br|kr|rb|sr|y|zr|nb|mo|tc|ru|rh|pd|ag|cd|in|sn|sb|te|i|xe|cs|ba|la|ce|pr|nd|pm|sm|eu|gd|tb|dy|ho|er|tm|yb|lu|hf|ta|w|re|os|ir|pt|au|hg|tl|pb|bi|po|at|rn|fr|ra|ac|th|pa|u|np|pu|am|cm|bk|cf|es|fm|md|no|lr|rf|db|sg|bh|hs|mt)*$' /usr/share/dict/words | 
    awk 'length > max_length { max_length = length; longest_line = $0 } END { print longest_line }'
    nonrepresentational
    

    I don’t awk though and nicked that bit from ДМИТРИЙ МАЛИКОВ on Stack Exchange: https://unix.stackexchange.com/questions/24509/how-to-print-the-longest-line-in-a-file

  3. john said

    Using Perl 5. A dictionary of words should be supplied as a filename or on STDIN.


    #!/usr/bin/env perl

    use strict;
    use warnings;

    use v5.22;

    my @elements = qw(
        ac al am sb ar as at
        ba bk be bi bh b br
        cd ca cf c ce cs cl cr co cu cm
        ds db dy
        es er eu
        fm f fr
        gd ga ge au
        hf hs he ho h
        in i ir fe
        kr
        la lr pb li lu
        mg mn mt md hg mo
        nd ne np ni nb n no
        os o
        pd p pt pu po k pr pm pa
        ra rn re rh rg rb ru rf
        sm sc sg se si ag na sr s
        ta tc te tb tl th tm sn ti w
        uub uuh uuo uup uuq uus uut uuu u
        v
        xe
        yb y
        zn zr
        );

    my $pattern = '^(' . join('|', @elements) . ')+$';

    foreach (sort {-length($a) <=> -length($b)} <>) {
        if (m/$pattern/) {
            say;
            last;
        }
    }

  4. Mike said
    import re
    from mendeleev import element
    
    symbols='|'.join(e.symbol.lower() for e in element(list(range(1,119))))
    
    regex = re.compile("(" + symbols + ")+")
    
    with open('/data/12dicts/5desk.txt', 'rt') as words:
        print(max(filter(regex.match, words), key=len))
    

    Builds a regular expression using element symbol information from the mendeleev library. Because re.match() returns a MatchObject (a true value) when there is a match or None (a false value), otherwise, re.match() can be used with filter() to get the words that can be built using the element symbols. max(), with key=len, returns the longest word.

    For my wordlist, the longest is “otorhinolaryngologist” at 21 letters.

  5. Globules said

    Here’s a Haskell version. (I’m currently too lazy to keep track of the parse.)

    import Data.Char (toLower)
    import Data.List.Ordered (member, sortBy)
    import Data.List (find)
    import Data.Ord (Down(..), comparing)
    
    -- Elements from https://en.wikipedia.org/wiki/Periodic_table as of 2017-04-15.
    elements :: [String]
    elements = words $ map toLower
                 "Ac Ag Al Am Ar As At Au B Ba Be Bh Bi Bk Br C Ca Cd Ce Cf Cl Cm \
                 \Cn Co Cr Cs Cu Db Ds Dy Er Es Eu F Fe Fl Fm Fr Ga Gd Ge H He Hf \
                 \Hg Ho Hs I In Ir K Kr La Li Lr Lu Lv Mc Md Mg Mn Mo Mt N Na Nb  \
                 \Nd Ne Nh Ni No Np O Og Os P Pa Pb Pd Pm Po Pr Pt Pu Ra Rb Re Rf \
                 \Rg Rh Rn Ru S Sb Sc Se Sg Si Sm Sn Sr Ta Tb Tc Te Th Ti Tl Tm   \
                 \Ts U V W Xe Y Yb Zn Zr"
    
    composedOf :: Ord a => [a] -> [[a]] -> Bool
    []       `composedOf` _    = True
    [x]      `composedOf` elts = [x] `member` elts
    (x:y:xs) `composedOf` elts = [x, y] `member` elts &&     xs `composedOf` elts ||
                                    [x] `member` elts && (y:xs) `composedOf` elts
    
    main :: IO ()
    main = do
      ws <- fmap (sortBy (comparing (Down . length)) . words) getContents
      case find ((`composedOf` elements) . map toLower) ws of
        Nothing -> putStrLn "Didn't find any matching words."
        Just w  -> putStrLn $ "The longest matching word is: " ++ w
    
    $ ./elemword < /usr/share/dict/words
    The longest matching word is: nonrepresentationalism
    
  6. Paul said

    @Mike: Your line 6 should read:

    regex = re.compile("(" + symbols + ")+$")
    

    Then the regex matches until the end of the word. The word “otorhinolaryngologist” is not a valid match (only the “o” is matched.

  7. matthew said

    This reminds me of this problem: https://programmingpraxis.com/2014/07/25/number-words & here’s an adapted version of my C++ solution there – it’s essentially the recursive solution, but using dynamic programming to avoid repeated identical recursions. It also prints the number of ways found to construct each word, and piping the output through sort tells us that as before the 19-letter “nonrepresentational” is longest, but with only 2 variations, but the 17-letter “inconspicuousness” has 24 variations. Interestingly, “casinos”, with a mere 7 letters, has 13 variations:

    c:as:in:os
    ca:s:in:os
    c:as:i:n:os
    ca:s:i:n:os
    ca:si:n:os
    c:as:i:no:s
    ca:s:i:no:s
    ca:si:no:s
    c:as:in:o:s
    ca:s:in:o:s
    c:as:i:n:o:s
    ca:s:i:n:o:s
    ca:si:n:o:s
    
    #include <string.h>
    #include <vector>
    #include <string>
    #include <iostream>
          
    const char *elements[] = {
      "h","he","li","be","b","c","n","o","f","ne","na","mg","al","si","p","s",
      "cl","ar","k","ca","sc","ti","v","cr","mn","fe","co","ni","cu","zn","ga",
      "ge","as","se","br","kr","rb","sr","y","zr","nb","mo","tc","ru","rh",
      "pd","ag","cd","in","sn","sb","te","i","xe","cs","ba","la","ce","pr",
      "nd","pm","sm","eu","gd","tb","dy","ho","er","tm","yb","lu","hf","ta",
      "w","re","os","ir","pt","au","hg","tl","pb","bi","po","at","rn","fr",
      "ra","ac","th","pa","u","np","pu","am","cm","bk","cf","es","fm","md",
      "no","lr","rf","db","sg","bh","hs","mt"
    };
    const int NELEMENTS = sizeof(elements)/sizeof(*elements);
    
    struct Node {
      Node(Node *p, Node *n, std::string &s)
        : prefix(p), next(n), label(s),
          count((p ? p->count : 1) + (n ? n->count : 0)) {}
      Node *prefix, *next;
      std::string label;
      long count;
    };
         
    int main(int argc, char *argv[]) {
      std::string s;
      while (std::cin >> s) {
        int slen = s.size();
        std::vector<Node *> table(slen+1);
        for (int i = 1; i <= slen; i++) {
          for (int j = 0; j < NELEMENTS; j++) {
            std::string label = elements[j];
            int llen = label.size();
            if (llen <= i && s.compare(i-llen, llen, label) == 0) {
              // Check there is a solution to extend
              if (i == llen || table[i-llen] != NULL) {
                table[i] = new Node(table[i-llen],table[i],label);
              }
            }
          }
        }
        if (table[slen] != NULL) {
          std::cout << s.size() << " " << table[slen]->count << " " << s << "\n";
        }
      }
    }
    
  8. Jussi Piitulainen said

    Using an HFST (Helsinki Finite-State Tools) component to extract the matching words. Braces in the regex formalism delimit string literals, and 0 stands for the empty string which is mapped to a newline at the end of each match.

    ! elemental.pmatch - matches words that consist of elements, adds a
    ! newline to each match (not sure why boundary conditions are not
    ! needed in current versions; hoping that there is a nicer way to
    ! specify a newline; to be investigated)
    
    define element {h} | {he} | {li} | {be} | {b} | {c} | {n} | {o} |
     {f} | {ne} | {na} | {mg} | {al} | {si} | {p} | {s} | {cl} | {ar} |
     {k} | {ca} | {sc} | {ti} | {v} | {cr} | {mn} | {fe} | {co} | {ni} |
     {cu} | {zn} | {ga} | {ge} | {as} | {se} | {br} | {kr} | {rb} | {sr} |
     {y} | {zr} | {nb} | {mo} | {tc} | {ru} | {rh} | {pd} | {ag} | {cd} |
     {in} | {sn} | {sb} | {te} | {i} | {xe} | {cs} | {ba} | {hf} | {ta} |
     {w} | {re} | {os} | {ir} | {pt} | {au} | {hg} | {tl} | {pb} | {bi} |
     {po} | {at} | {rn} | {fr} | {ra} | {rf} | {db} | {sg} | {bh} | {hs} |
     {mt} | {ds} | {rg} | {cn} | {nh} | {fl} | {mc} | {lv} | {ts} | {og} |
     {la} | {ce} | {pr} | {nd} | {pm} | {sm} | {eu} | {gd} | {tb} | {dy} |
     {ho} | {er} | {tm} | {yb} | {lu} | {ac} | {th} | {pa} | {u} | {np} |
     {pu} | {am} | {cm} | {bk} | {cf} | {es} | {fm} | {md} | {no} | {lr} ;
    
    define TOP element+ 0:{
    };
    

    Using Bash to run the matcher and extract the longest matches.

    # Compiles elemental.pmatch, runs the transducer on a plain text file
    # to match instances of the pattern named TOP on each line, only
    # writing out the matches; transducer is made so that this produces
    # one word per line; select the longest of those.
    
    hfst-pmatch2fst -i elemental.pmatch -o elemental.hfst
    
    hfst-pmatch --newline --extract-patterns elemental.hfst < ~/pg52108.txt |
        sort -u |
        while read word
        do
            echo "${#word}	${word}"
        done |
        sort -nr | head
    

    The longest match in the document is “kirkaspiirteinen” ‘bright-featured’, at 16 characters.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: