Prime Anagrams

October 15, 2019

We begin by computing the first 256 primes:

(define ps (list->vector (primes 1620)))

Then we can compute the signature of a word:

(define (signature str)
  (fold-right (lambda (c n)
               (modulo (* (vector-ref ps (char->integer c)) n)
                       2147483647)) ; 2**31-1
              1
              (string->list str)))

We take the modulus because otherwise the product grows very large. That introduces the possibility of a collision, but it’s very low probability. Two strings are anagrams if they have the same signature:

(define (anagram? str1 str2)
  (= (signature str1) (signature str2)))

Here’s an example:

> (anagram? "time" "emit")
#t

The letter t has ascii value 116, and the 116th prime is 643; the other primes are 577 for i, 601 for m, and 557 for e, giving a product of 124198529327, which is congruent to 1791961448 (mod 2447483647). You can run the program at https://ideone.com/vUa9Hq.

Posted by programmingpraxis

Filed in Exercises

12 Comments »

12 Responses to “Prime Anagrams”

matthew said

October 15, 2019 at 5:06 PM

Nice idea. Here we just use the first 26 primes to encode lowercase letters, and assign primes in the ETAOINSHRDLU order of frequency, keeping the size of the coding numbers relatively small:

import sys
import functools
import operator
primes = [5,71,41,29,2,47,61,19,7,97,79,31,43,13,
          11,67,89,23,17,3,37,7,59,83,53,101]
def encode(s):
    plist = (primes[ord(c)-ord('a')] for c in s)
    return functools.reduce(operator.mul,plist,1)

codes = {}
for line in sys.stdin:
    word = line.rstrip()
    if all("a" <= c <= "z" for c in word):
        codes.setdefault(encode(word),[]).append(word)

for n in sorted(codes.keys()):
    s = codes[n]
    if len(s) >= 7: print(n,sorted(s))

$ python3 anagram.py < /usr/share/dict/words
261970 ['pares', 'parse', 'pears', 'rapes', 'reaps', 'spare', 'spear']
480930 ['carets', 'caster', 'caters', 'crates', 'reacts', 'recast', 'traces']

The highest code in that dictionary is “chlorofluorocarbons”, 8205628381610704807655035.

Daniel said

October 15, 2019 at 5:49 PM

Here’s a solution in C using GMP.

// $ gcc -o anagrams anagrams.c -lgmp

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#include <gmp.h>

bool check_anagrams(char* s1, char* s2) {
  static unsigned primes[] = {
       2,    3,    5,    7,   11,   13,   17,   19,   23,   29,   31,   37,   41,   43,   47,   53,
      59,   61,   67,   71,   73,   79,   83,   89,   97,  101,  103,  107,  109,  113,  127,  131,
     137,  139,  149,  151,  157,  163,  167,  173,  179,  181,  191,  193,  197,  199,  211,  223,
     227,  229,  233,  239,  241,  251,  257,  263,  269,  271,  277,  281,  283,  293,  307,  311,
     313,  317,  331,  337,  347,  349,  353,  359,  367,  373,  379,  383,  389,  397,  401,  409,
     419,  421,  431,  433,  439,  443,  449,  457,  461,  463,  467,  479,  487,  491,  499,  503,
     509,  521,  523,  541,  547,  557,  563,  569,  571,  577,  587,  593,  599,  601,  607,  613,
     617,  619,  631,  641,  643,  647,  653,  659,  661,  673,  677,  683,  691,  701,  709,  719,
     727,  733,  739,  743,  751,  757,  761,  769,  773,  787,  797,  809,  811,  821,  823,  827,
     829,  839,  853,  857,  859,  863,  877,  881,  883,  887,  907,  911,  919,  929,  937,  941,
     947,  953,  967,  971,  977,  983,  991,  997, 1009, 1013, 1019, 1021, 1031, 1033, 1039, 1049,
    1051, 1061, 1063, 1069, 1087, 1091, 1093, 1097, 1103, 1109, 1117, 1123, 1129, 1151, 1153, 1163,
    1171, 1181, 1187, 1193, 1201, 1213, 1217, 1223, 1229, 1231, 1237, 1249, 1259, 1277, 1279, 1283,
    1289, 1291, 1297, 1301, 1303, 1307, 1319, 1321, 1327, 1361, 1367, 1373, 1381, 1399, 1409, 1423,
    1427, 1429, 1433, 1439, 1447, 1451, 1453, 1459, 1471, 1481, 1483, 1487, 1489, 1493, 1499, 1511,
    1523, 1531, 1543, 1549, 1553, 1559, 1567, 1571, 1579, 1583, 1597, 1601, 1607, 1609, 1613, 1619,
  };
  mpz_t prod1;
  mpz_t prod2;
  mpz_init(prod1);
  mpz_init(prod2);
  mpz_set_ui(prod1, 1);
  mpz_set_ui(prod2, 1);
  for (char* s = s1; *s != '\0'; ++s) {
    mpz_mul_ui(prod1, prod1, primes[(unsigned)(*s)]);
  }
  for (char* s = s2; *s != '\0'; ++s) {
    mpz_mul_ui(prod2, prod2, primes[(unsigned)(*s)]);
  }
  bool result = mpz_cmp(prod1, prod2) == 0;
  mpz_clear(prod1);
  mpz_clear(prod2);
  return result;
}

int main(int argc, char* argv[]) {
  if (argc != 3) return EXIT_FAILURE;
  bool anagrams = check_anagrams(argv[1], argv[2]);
  printf("%d\n", anagrams);
  return EXIT_SUCCESS;
}

Example:

$ ./anagrams programming praxis
0

$ ./anagrams praxis xsairp
1

matthew said
October 15, 2019 at 6:41 PM
That second ‘7’ in the list of primes should be ’73’!
Ernie said
October 15, 2019 at 6:51 PM
If we use int for the product we will occasionally get overflow. I believe anagrams will still give the same product even with overflow but does anyone know whether two non anagrams can give the same product if overflow occurs during calculation of the product?
Daniel said
October 15, 2019 at 7:11 PM
@Ernie, I believe it’s necessary that “two non anagrams can give the same product if overflow occurs during calculation of the product”. There are infinitely many products, so mapping them to a finite set (which would occur in the scenario with overflow) would result in different products getting mapped to the same code.

In my solution above I used the GMP library to avoid overflow, since there is no built-in type in C for arbitrarily sized integers.

matthew said

October 15, 2019 at 7:15 PM

Well, this modification of my program, taking the encoding mod 232 doesn’t produce any false positives, but mod 228:

import sys
import functools
import operator
primes = [5,71,41,29,2,47,61,19,7,97,79,31,43,13,11,67,89,23,17,3,37,73,59,83,53,101]
def encode(s):
    plist = (primes[ord(c)-ord('a')] for c in s)
    return functools.reduce(operator.mul,plist,1) % 2**28
codes = {}
for line in sys.stdin:
    word = line.rstrip()
    if all("a" <= c <= "z" for c in word):
        codes.setdefault(encode(word),[]).append(word)

for n in sorted(codes.keys()):
    s = codes[n]
    if len(s) >= 2:
        for i in range(1,len(s)):
            if sorted(s[0]) != sorted(s[i]): print(s[0],s[1])

finds the following non-anagrams:

dissuaded strewn
harangued picker
chalkier dissecting
airfield instinctive

matthew said
October 15, 2019 at 7:16 PM
Gah, that was meant to say “2^32” and “2^28” (I used the Python double asterisk though).
Daniel said
October 15, 2019 at 7:53 PM
@Ernie, here’s an example where “two non anagrams can give the same product if overflow occurs during calculation of the product”.

Suppose that integers are 16-bit. For any string, if we append 16 of the same character, the one that maps to prime number 2, then the product after overflow would be 0.

For example, if we have strings “programming” and “praxis”, and the character “z” maps to prime number 2, then the following different strings would both have product 0 after overflow.

“programmingzzzzzzzzzzzzzzzz”

“praxiszzzzzzzzzzzzzzzz”
Daniel said
October 15, 2019 at 7:55 PM
(“the following different strings” should say “the following non-anagrams”)
matthew said
October 15, 2019 at 8:18 PM
In my solution, for 16-bit codes, all of ‘biochemical’, ‘flushes’, ‘grackle’, ‘mournfulness’, ‘pectorals’, ‘songsters’ and ‘supported’, none of which are anagrams of each other, have code 41950, along with ‘enlists’, ‘listens’, ‘silents’ and ‘tinsels’, which are indeed anagrams.
Ernie said
October 18, 2019 at 8:04 PM
Using ulong arithmetic modulo 2^61 – 1 I get no false positives from the dictionary enable1.txt
Prime Anagrams – essay studess said
November 6, 2019 at 4:53 AM
[…] task is to use prime numbers to recognize anagrams. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments […]

Programming Praxis

Prime Anagrams

October 15, 2019

12 Responses to “Prime Anagrams”

Leave a comment

Categories

Archives

Archives

Programming Praxis

Prime Anagrams

October 15, 2019

Share this:

Related

12 Responses to “Prime Anagrams”

Leave a comment

Categories

Archives

Archives