## Pearson Hashing

### May 25, 2018

Cryptographic hashes, also called message digests, are ubiquitous in modern cryptography — Brce Schneier calls them “the workhorses of modern cryptography” — used among other things in digital signatures, message authentication codes, as fingerprints to detect duplicate data, and as checksums to detect data corruption. Today we look at a simple example of a hash function, not suitable for cryptography, but useful for non-cryptographic purposes.

Peter Pearson describes a hash function based on a permutation table Tof the numbers 0 inclusive to 256 exclusive. Starting from an initial hash of 0, each byte of the message is accessed in order, added to the current hash mod 256, then the hash is replaced by the value in the permutation table corresponding to that sum. In some implementations, including Pearson’s original, the two bytes are XOR’ed rather than added mod 256.

If eight bits isn’t enough, Pearson gives a simple algorithm for extending the algorithm to sixteen bits. First, compute the normal 8-bit hash. Then, add 1 (mod 256) to the first character of the string, and compute the normal 8-bit hash of the modified string; since a modification to a single bit provides a large change in the hash value, the two hashes will be independent of each other. Finally, concatenate the two hashes. Pearson doesn’t mention it, but this scheme can be extended to any multiple of eight bits.

Your task is to write a program that computes the 8-bit and 16-bit Pearson hashes of an input string, given a suitable permutation table. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Pages: 1 2

### 7 Responses to “Pearson Hashing”

1. chaw said

Here is a solution in standard R7RS Scheme that uses the table from
the paper and provides 8 variants.

sample strings in the 8-bit version. Replacing the original table
with a couple of randomly generated ones that I tried gets rid of the
collisions. I could also reproduce the collisions using
@programmingpraxis’s implementation (with the fixed table from the
paper).

``` (import (scheme base) (scheme write) (scheme file) (srfi 8) (only (srfi 151) bitwise-xor)) ;; Table I from Pearson's paper (p. 678) (define pearson-table #(1 87 49 12 176 178 102 166 121 193 6 84 249 230 44 163 14 197 213 181 161 85 218 80 64 239 24 226 236 142 38 200 110 177 104 103 141 253 255 50 77 101 81 18 45 96 31 222 25 107 190 70 86 237 240 34 72 242 20 214 244 227 149 235 97 234 57 22 60 250 82 175 208 5 127 199 111 62 135 248 174 169 211 58 66 154 106 195 245 171 17 187 182 179 0 243 132 56 148 75 128 133 158 100 130 126 91 13 153 246 216 219 119 68 223 78 83 88 201 99 122 11 92 32 136 114 52 10 138 30 48 183 156 35 61 26 143 74 251 94 129 162 63 152 170 7 115 167 241 206 3 150 55 59 151 220 90 53 23 131 125 173 15 238 79 95 89 16 105 137 225 224 217 160 37 123 118 73 2 157 46 116 9 145 134 228 207 212 202 215 69 229 27 188 67 124 168 252 42 4 29 108 21 247 19 205 39 203 233 40 186 147 198 192 155 33 164 191 98 204 165 180 117 76 140 36 210 172 41 54 159 8 185 232 113 196 231 47 146 120 51 65 28 144 254 221 93 189 194 139 112 43 71 109 184 209)) ;; Naively map a Unicode character to the range 0..255. ;; (map char-hash-byte (string->list "Hello, दुनिया!")) (define (char-hash-byte c) (let loop ((n (char->integer c))) (receive (q r) (floor/ n 256) (if (zero? q) r (loop (bitwise-xor r (loop q))))))) (define (make-pearson-hash table char-num-proc mix-proc) (lambda (str) (let ((h 0)) (string-for-each (lambda (c) (set! h (vector-ref table (mix-proc h (char-num-proc c))))) str) h))) (define sample-strings '("Hello" "Hello," "Hell,o" "Hel,lo" "Hello, World!" "Hell,o World!" "Hel,lo World!" "Hello, दुनिया!" "Hell,o दुनिया!" "Hel,lo दुनिया!")) (define (demo-hash hproc) (display (map hproc sample-strings)) (newline)) (demo-hash (make-pearson-hash pearson-table char->integer (lambda (a b) (modulo (+ a b) 256)))) (demo-hash (make-pearson-hash pearson-table char->integer (lambda (a b) (modulo (bitwise-xor a b) 256)))) (demo-hash (make-pearson-hash pearson-table char-hash-byte (lambda (a b) (modulo (+ a b) 256)))) (demo-hash (make-pearson-hash pearson-table char-hash-byte bitwise-xor)) (define (make-pearson-hash-2 table char-num-proc mix-proc) (let ((phash (make-pearson-hash table char-num-proc mix-proc)) (tsize (vector-length table))) (lambda (str) (if (string=? "" str) (phash str) (+ (* tsize (phash str)) (phash (string-append (string (integer->char (modulo (+ 1 (char-num-proc (string-ref str 0))) tsize))) (string-copy str 1)))))))) (demo-hash (make-pearson-hash-2 pearson-table char->integer (lambda (a b) (modulo (+ a b) 256)))) (demo-hash (make-pearson-hash-2 pearson-table char->integer (lambda (a b) (modulo (bitwise-xor a b) 256)))) (demo-hash (make-pearson-hash-2 pearson-table char-hash-byte (lambda (a b) (modulo (+ a b) 256)))) (demo-hash (make-pearson-hash-2 pearson-table char-hash-byte bitwise-xor)) ```

``` (229 197 189 197 163 230 163 218 173 218) (57 85 72 2 117 74 156 171 17 131) (229 197 189 197 163 230 163 243 112 243) (57 85 72 2 117 74 156 240 219 47) (58783 50679 48563 50547 41844 59028 41854 56038 44426 55945) (14715 21955 18575 644 30134 19159 40158 43811 4497 33726) (58783 50679 48563 50547 41844 59028 41854 62310 28693 62392) (14715 21955 18575 644 30134 19159 40158 61524 56163 12160) ```

2. Daniel said

Here’s a solution in C.

```#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint8_t T[] = {
1,87,49,12,176,178,102,166,121,193,6,84,249,230,44,163,
14,197,213,181,161,85,218,80,64,239,24,226,236,142,38,200,
110,177,104,103,141,253,255,50,77,101,81,18,45,96,31,222,
25,107,190,70,86,237,240,34,72,242,20,214,244,227,149,235,
97,234,57,22,60,250,82,175,208,5,127,199,111,62,135,248,
174,169,211,58,66,154,106,195,245,171,17,187,182,179,0,243,
132,56,148,75,128,133,158,100,130,126,91,13,153,246,216,219,
119,68,223,78,83,88,201,99,122,11,92,32,136,114,52,10,
138,30,48,183,156,35,61,26,143,74,251,94,129,162,63,152,
170,7,115,167,241,206,3,150,55,59,151,220,90,53,23,131,
125,173,15,238,79,95,89,16,105,137,225,224,217,160,37,123,
118,73,2,157,46,116,9,145,134,228,207,212,202,215,69,229,
27,188,67,124,168,252,42,4,29,108,21,247,19,205,39,203,
233,40,186,147,198,192,155,33,164,191,98,204,165,180,117,76,
140,36,210,172,41,54,159,8,185,232,113,196,231,47,146,120,
51,65,28,144,254,221,93,189,194,139,112,43,71,109,184,209
};

uint8_t pearson1(char* str) {
uint8_t h = 0;
while (*str) {
h = T[h ^ *str];
++str;
}
return h;
}

uint16_t pearson2(char* str) {
uint8_t h1 = pearson1(str);
uint8_t h2 = 0;
if (*str) {
h2 = T[h2 ^ (*str + 1)];
++str;
}
while (*str) {
h2 = T[h2 ^ *str];
++str;
}
uint16_t h = h1;
h <<= 8;
h += h2;
return h;
}

int main(int argc, char* argv[]) {
if (argc != 2) {
fprintf(stderr, "Usage: %s <STR>\n", argv[0]);
return EXIT_FAILURE;
}
printf("%d\n", pearson1(argv[1]));
printf("%d\n", pearson2(argv[1]));
return EXIT_SUCCESS;
}
```

Example:

```\$ ./pearson "Programming Praxis"
16
4298
```
3. V said

Solution in Ruby.

```def pearson_hash(message, table, size_in_bytes)
size_in_bytes.times.reduce("") do |memo, i|
hash = 0
message[0] = ((message[0].ord + i) % 256).chr
message.chars do |char|
hash = table[(hash + char.ord) % 256]
end
memo += hash.to_s
end
end

# use `(0..255).to_a.shuffle` to generate a new one
table = [
68, 6, 218, 45, 29, 132, 105, 22, 192, 155, 146, 66, 226, 231, 123, 25, 212,
64, 156, 52, 71, 220, 77, 194, 178, 36, 34, 153, 181, 130, 99, 39, 108, 215,
92, 119, 55, 216, 175, 72, 168, 169, 136, 185, 57, 53, 1, 233, 125, 237, 78,
184, 48, 61, 201, 161, 60, 217, 124, 173, 32, 15, 176, 172, 70, 9, 129, 100,
43, 142, 222, 207, 158, 189, 206, 82, 163, 44, 2, 17, 177, 20, 102, 174, 190,
126, 122, 109, 144, 252, 58, 121, 160, 127, 46, 221, 73, 112, 250, 4, 38, 18,
135, 209, 133, 69, 63, 152, 151, 115, 191, 110, 243, 230, 85, 167, 148, 246,
203, 74, 128, 238, 150, 204, 79, 81, 84, 180, 154, 104, 199, 208, 164, 182,
30, 143, 14, 241, 80, 42, 223, 249, 219, 3, 227, 239, 254, 186, 171, 232,
145, 98, 107, 157, 106, 75, 51, 251, 228, 26, 200, 89, 183, 83, 193, 210,
255, 67, 37, 211, 111, 59, 7, 35, 101, 47, 236, 234, 224, 137, 138, 159, 5,
134, 91, 13, 118, 139, 165, 240, 56, 149, 49, 33, 23, 86, 116, 179, 96, 16,
27, 21, 88, 93, 87, 90, 19, 54, 113, 50, 28, 120, 202, 62, 188, 245, 198,
195, 117, 94, 24, 170, 205, 244, 141, 0, 225, 103, 95, 214, 229, 11, 65, 97,
31, 248, 76, 114, 147, 41, 242, 196, 197, 235, 12, 247, 253, 140, 8, 10, 40,
162, 131, 213, 166, 187
]

# Test

puts pearson_hash('Hola', table, 1)
puts pearson_hash('Programming Praxis', table, 1)
puts pearson_hash('Programming Praxis', table, 4)
```

Outputs:

25
165
165121139220

4. Daniel said

Here are updated functions from my program above. I added casts to unsigned 8-bit ints for memory safety, to prevent reading outside array T’s boundary.

```uint8_t pearson1(char* str) {
uint8_t h = 0;
while (*str) {
h = T[h ^ (uint8_t)*str];
++str;
}
return h;
}

uint16_t pearson2(char* str) {
uint8_t h1 = pearson1(str);
uint8_t h2 = 0;
if (*str) {
h2 = T[h2 ^ (uint8_t)((uint8_t)*str + 1)];
++str;
}
while (*str) {
h2 = T[h2 ^ (uint8_t)*str];
++str;
}
uint16_t h = h1;
h <<= 8;
h += h2;
return h;
}
```
5. Daniel said

@V, in your example that sets size_in_bytes=4, the result, 165121139220, is represented with 38 bits (more than 32 bits = 4 bytes).

Here’s a modification that uses bit-shifting, instead of concatenating strings between “0” and “255”.

```def pearson_hash(message, table, size_in_bytes)
output = 0
size_in_bytes.times do |i|
hash = 0
message.each_char.with_index do |char, j|
hash = table[(hash + char.ord + (j == 0 ? i : 0)) % 256]
end
output = (output << 8) + hash
end
output
end
```

I also modified the function so that the first letter of the message is non-destructively incremented by the index of each iteration (a opposed to being incremented by 0, 1, 3, then 6, over four iterations).

6. […] the previous exercise we studied a simple cryptographic hash. Today we will use that hash to write an equally-simple […]

7. Alex B said

Bit surprised we haven’t seen a Python one yet.

```def pearson_hash(message, byte_length=1, table=None, hash_type=1):
# message is expected as bytes, bytearray or a utf-8 string
if not table:
table = [
1,87,49,12,176,178,102,166,121,193,6,84,249,230,44,163,
14,197,213,181,161,85,218,80,64,239,24,226,236,142,38,200,
110,177,104,103,141,253,255,50,77,101,81,18,45,96,31,222,
25,107,190,70,86,237,240,34,72,242,20,214,244,227,149,235,
97,234,57,22,60,250,82,175,208,5,127,199,111,62,135,248,
174,169,211,58,66,154,106,195,245,171,17,187,182,179,0,243,
132,56,148,75,128,133,158,100,130,126,91,13,153,246,216,219,
119,68,223,78,83,88,201,99,122,11,92,32,136,114,52,10,
138,30,48,183,156,35,61,26,143,74,251,94,129,162,63,152,
170,7,115,167,241,206,3,150,55,59,151,220,90,53,23,131,
125,173,15,238,79,95,89,16,105,137,225,224,217,160,37,123,
118,73,2,157,46,116,9,145,134,228,207,212,202,215,69,229,
27,188,67,124,168,252,42,4,29,108,21,247,19,205,39,203,
233,40,186,147,198,192,155,33,164,191,98,204,165,180,117,76,
140,36,210,172,41,54,159,8,185,232,113,196,231,47,146,120,
51,65,28,144,254,221,93,189,194,139,112,43,71,109,184,209
]
try:
message = bytearray(message)
except TypeError:
message = bytearray(message, encoding='utf8')

output = 0

for cycle in range(byte_length):
h = 0
message[0] = (message[0]+cycle) % 256
for byte in message:
if hash_type == 1:
h = table[(h+byte) % 256]
elif hash_type == 2:
h = table[h ^ byte]
else:
raise ValueError('Unknown hash type - try 1 or 2')
output = (output << 8) + h

return output
```