Prime Anagrams
October 15, 2019
There are many algorithms for recognizing anagrams. Here is an unusual one:
Assign each character to a prime number, then map a string to the corresponding primes, and compute their product. Two strings are anagrams only if they share a product.
Your task is to use prime numbers to recognize anagrams. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.
Nice idea. Here we just use the first 26 primes to encode lowercase letters, and assign primes in the ETAOINSHRDLU order of frequency, keeping the size of the coding numbers relatively small:
The highest code in that dictionary is “chlorofluorocarbons”, 8205628381610704807655035.
Here’s a solution in C using GMP.
Example:
That second ‘7’ in the list of primes should be ’73’!
If we use int for the product we will occasionally get overflow. I believe anagrams will still give the same product even with overflow but does anyone know whether two non anagrams can give the same product if overflow occurs during calculation of the product?
@Ernie, I believe it’s necessary that “two non anagrams can give the same product if overflow occurs during calculation of the product”. There are infinitely many products, so mapping them to a finite set (which would occur in the scenario with overflow) would result in different products getting mapped to the same code.
In my solution above I used the GMP library to avoid overflow, since there is no built-in type in C for arbitrarily sized integers.
Well, this modification of my program, taking the encoding mod 232 doesn’t produce any false positives, but mod 228:
finds the following non-anagrams:
Gah, that was meant to say “2^32” and “2^28” (I used the Python double asterisk though).
@Ernie, here’s an example where “two non anagrams can give the same product if overflow occurs during calculation of the product”.
Suppose that integers are 16-bit. For any string, if we append 16 of the same character, the one that maps to prime number 2, then the product after overflow would be 0.
For example, if we have strings “programming” and “praxis”, and the character “z” maps to prime number 2, then the following different strings would both have product 0 after overflow.
“programmingzzzzzzzzzzzzzzzz”
“praxiszzzzzzzzzzzzzzzz”
(“the following different strings” should say “the following non-anagrams”)
In my solution, for 16-bit codes, all of ‘biochemical’, ‘flushes’, ‘grackle’, ‘mournfulness’, ‘pectorals’, ‘songsters’ and ‘supported’, none of which are anagrams of each other, have code 41950, along with ‘enlists’, ‘listens’, ‘silents’ and ‘tinsels’, which are indeed anagrams.
Using ulong arithmetic modulo 2^61 – 1 I get no false positives from the dictionary enable1.txt
[…] task is to use prime numbers to recognize anagrams. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments […]