Element Words
April 14, 2017
Today’s exercise is either an interview question or somebody’s homework, I’m not sure:
Given a list of all the symbols of the chemicals in the periodic table, and a list of all the words in the English language, find the longest word that can be made using symbols of the chemicals in the periodic table.
Your task is to write a program to find the longest word that can be formed from the periodic table. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
In Python using an english dictionary of about 125000 words.
from string import ascii_lowercase from glob import glob WORDS = [w.strip().lower() for f in glob("D:\Data\Development\spell\en\english.?") for w in open(f)] WORDS = sorted(WORDS, key=len, reverse=True) ECHAR = """H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf Db Sg Bh Hs Mt""".lower().split() ECOMB = {c: set(w for w in ECHAR if w.startswith(c)) for c in ascii_lowercase} def check_word(word): Q = [(word, [])] while Q: remaining, elements = Q.pop() if not remaining: return elements for e in ECOMB.get(remaining[0], "???"): if remaining.startswith(e): Q.append((remaining[len(e):], elements+[e])) return False print("Number of words:", len(WORDS)) for word in WORDS: res = check_word(word) if res: print(word, len(word), " ".join(res)) # Number of words: 125346 # nonrepresentationalism 22 no n re p re se n ta ti o na li smStand back, I know Regular Expressions:
$ egrep '^(h|he|li|be|b|c|n|o|f|ne|na|mg|al|si|p|s|cl|ar|k|ca|sc|ti|v|cr|mn|fe|co|ni|cu|zn|ga|ge|as|se|br|kr|rb|sr|y|zr|nb|mo|tc|ru|rh|pd|ag|cd|in|sn|sb|te|i|xe|cs|ba|la|ce|pr|nd|pm|sm|eu|gd|tb|dy|ho|er|tm|yb|lu|hf|ta|w|re|os|ir|pt|au|hg|tl|pb|bi|po|at|rn|fr|ra|ac|th|pa|u|np|pu|am|cm|bk|cf|es|fm|md|no|lr|rf|db|sg|bh|hs|mt)*$' /usr/share/dict/words | awk 'length > max_length { max_length = length; longest_line = $0 } END { print longest_line }' nonrepresentationalI don’t awk though and nicked that bit from ДМИТРИЙ МАЛИКОВ on Stack Exchange: https://unix.stackexchange.com/questions/24509/how-to-print-the-longest-line-in-a-file
Using Perl 5. A dictionary of words should be supplied as a filename or on STDIN.
#!/usr/bin/env perl
use strict;
use warnings;
use v5.22;
my @elements = qw(
ac al am sb ar as at
ba bk be bi bh b br
cd ca cf c ce cs cl cr co cu cm
ds db dy
es er eu
fm f fr
gd ga ge au
hf hs he ho h
in i ir fe
kr
la lr pb li lu
mg mn mt md hg mo
nd ne np ni nb n no
os o
pd p pt pu po k pr pm pa
ra rn re rh rg rb ru rf
sm sc sg se si ag na sr s
ta tc te tb tl th tm sn ti w
uub uuh uuo uup uuq uus uut uuu u
v
xe
yb y
zn zr
);
my $pattern = '^(' . join('|', @elements) . ')+$';
foreach (sort {-length($a) <=> -length($b)} <>) {
if (m/$pattern/) {
say;
last;
}
}
import re from mendeleev import element symbols='|'.join(e.symbol.lower() for e in element(list(range(1,119)))) regex = re.compile("(" + symbols + ")+") with open('/data/12dicts/5desk.txt', 'rt') as words: print(max(filter(regex.match, words), key=len))Builds a regular expression using element symbol information from the mendeleev library. Because re.match() returns a MatchObject (a true value) when there is a match or None (a false value), otherwise, re.match() can be used with filter() to get the words that can be built using the element symbols. max(), with key=len, returns the longest word.
For my wordlist, the longest is “otorhinolaryngologist” at 21 letters.
Here’s a Haskell version. (I’m currently too lazy to keep track of the parse.)
import Data.Char (toLower) import Data.List.Ordered (member, sortBy) import Data.List (find) import Data.Ord (Down(..), comparing) -- Elements from https://en.wikipedia.org/wiki/Periodic_table as of 2017-04-15. elements :: [String] elements = words $ map toLower "Ac Ag Al Am Ar As At Au B Ba Be Bh Bi Bk Br C Ca Cd Ce Cf Cl Cm \ \Cn Co Cr Cs Cu Db Ds Dy Er Es Eu F Fe Fl Fm Fr Ga Gd Ge H He Hf \ \Hg Ho Hs I In Ir K Kr La Li Lr Lu Lv Mc Md Mg Mn Mo Mt N Na Nb \ \Nd Ne Nh Ni No Np O Og Os P Pa Pb Pd Pm Po Pr Pt Pu Ra Rb Re Rf \ \Rg Rh Rn Ru S Sb Sc Se Sg Si Sm Sn Sr Ta Tb Tc Te Th Ti Tl Tm \ \Ts U V W Xe Y Yb Zn Zr" composedOf :: Ord a => [a] -> [[a]] -> Bool [] `composedOf` _ = True [x] `composedOf` elts = [x] `member` elts (x:y:xs) `composedOf` elts = [x, y] `member` elts && xs `composedOf` elts || [x] `member` elts && (y:xs) `composedOf` elts main :: IO () main = do ws <- fmap (sortBy (comparing (Down . length)) . words) getContents case find ((`composedOf` elements) . map toLower) ws of Nothing -> putStrLn "Didn't find any matching words." Just w -> putStrLn $ "The longest matching word is: " ++ w@Mike: Your line 6 should read:
regex = re.compile("(" + symbols + ")+$")Then the regex matches until the end of the word. The word “otorhinolaryngologist” is not a valid match (only the “o” is matched.
This reminds me of this problem: https://programmingpraxis.com/2014/07/25/number-words & here’s an adapted version of my C++ solution there – it’s essentially the recursive solution, but using dynamic programming to avoid repeated identical recursions. It also prints the number of ways found to construct each word, and piping the output through sort tells us that as before the 19-letter “nonrepresentational” is longest, but with only 2 variations, but the 17-letter “inconspicuousness” has 24 variations. Interestingly, “casinos”, with a mere 7 letters, has 13 variations:
#include <string.h> #include <vector> #include <string> #include <iostream> const char *elements[] = { "h","he","li","be","b","c","n","o","f","ne","na","mg","al","si","p","s", "cl","ar","k","ca","sc","ti","v","cr","mn","fe","co","ni","cu","zn","ga", "ge","as","se","br","kr","rb","sr","y","zr","nb","mo","tc","ru","rh", "pd","ag","cd","in","sn","sb","te","i","xe","cs","ba","la","ce","pr", "nd","pm","sm","eu","gd","tb","dy","ho","er","tm","yb","lu","hf","ta", "w","re","os","ir","pt","au","hg","tl","pb","bi","po","at","rn","fr", "ra","ac","th","pa","u","np","pu","am","cm","bk","cf","es","fm","md", "no","lr","rf","db","sg","bh","hs","mt" }; const int NELEMENTS = sizeof(elements)/sizeof(*elements); struct Node { Node(Node *p, Node *n, std::string &s) : prefix(p), next(n), label(s), count((p ? p->count : 1) + (n ? n->count : 0)) {} Node *prefix, *next; std::string label; long count; }; int main(int argc, char *argv[]) { std::string s; while (std::cin >> s) { int slen = s.size(); std::vector<Node *> table(slen+1); for (int i = 1; i <= slen; i++) { for (int j = 0; j < NELEMENTS; j++) { std::string label = elements[j]; int llen = label.size(); if (llen <= i && s.compare(i-llen, llen, label) == 0) { // Check there is a solution to extend if (i == llen || table[i-llen] != NULL) { table[i] = new Node(table[i-llen],table[i],label); } } } } if (table[slen] != NULL) { std::cout << s.size() << " " << table[slen]->count << " " << s << "\n"; } } }Using an HFST (Helsinki Finite-State Tools) component to extract the matching words. Braces in the regex formalism delimit string literals, and 0 stands for the empty string which is mapped to a newline at the end of each match.
! elemental.pmatch - matches words that consist of elements, adds a ! newline to each match (not sure why boundary conditions are not ! needed in current versions; hoping that there is a nicer way to ! specify a newline; to be investigated) define element {h} | {he} | {li} | {be} | {b} | {c} | {n} | {o} | {f} | {ne} | {na} | {mg} | {al} | {si} | {p} | {s} | {cl} | {ar} | {k} | {ca} | {sc} | {ti} | {v} | {cr} | {mn} | {fe} | {co} | {ni} | {cu} | {zn} | {ga} | {ge} | {as} | {se} | {br} | {kr} | {rb} | {sr} | {y} | {zr} | {nb} | {mo} | {tc} | {ru} | {rh} | {pd} | {ag} | {cd} | {in} | {sn} | {sb} | {te} | {i} | {xe} | {cs} | {ba} | {hf} | {ta} | {w} | {re} | {os} | {ir} | {pt} | {au} | {hg} | {tl} | {pb} | {bi} | {po} | {at} | {rn} | {fr} | {ra} | {rf} | {db} | {sg} | {bh} | {hs} | {mt} | {ds} | {rg} | {cn} | {nh} | {fl} | {mc} | {lv} | {ts} | {og} | {la} | {ce} | {pr} | {nd} | {pm} | {sm} | {eu} | {gd} | {tb} | {dy} | {ho} | {er} | {tm} | {yb} | {lu} | {ac} | {th} | {pa} | {u} | {np} | {pu} | {am} | {cm} | {bk} | {cf} | {es} | {fm} | {md} | {no} | {lr} ; define TOP element+ 0:{ };Using Bash to run the matcher and extract the longest matches.
# Compiles elemental.pmatch, runs the transducer on a plain text file # to match instances of the pattern named TOP on each line, only # writing out the matches; transducer is made so that this produces # one word per line; select the longest of those. hfst-pmatch2fst -i elemental.pmatch -o elemental.hfst hfst-pmatch --newline --extract-patterns elemental.hfst < ~/pg52108.txt | sort -u | while read word do echo "${#word} ${word}" done | sort -nr | headThe longest match in the document is “kirkaspiirteinen” ‘bright-featured’, at 16 characters.