## String Subsets

### November 23, 2010

According to his blog, this interview question got Paul Tyma a programming job at Google.

We begin with the obvious brute-force solution: iterate through the substring, checking that each character is present in the main string. This is harder than it looks, because we need to be sure that we don’t re-use a letter from the main string. Our solution uses an array of booleans parallel to the main string, flipping the corresponding value from `#f` to `#t` whenever a character is used:

```(define (subset1? main sub)   (let ((v (make-vector (string-length main) #f)))     (define (find c s n)       (let ((k (string-find c s n)))         (if (not k) #f           (if (not (vector-ref v k))               (begin (vector-set! v k #t) k)               (find c s (+ k 1))))))     (let loop ((i 0))       (cond ((= (string-length sub) i) #t)             ((find (string (string-ref sub i)) main 0) (loop (+ i 1)))             (else #f)))))```

This solution is O(nm), where n is the length of the main string and m is the length of the substring; of course, this is the same as O(n2), on the assumption that m is of similar length to n. Another solution is to first sort both strings, then step through them; if you find a character in the substring that isn’t matched in the main string, or if you reach the end of the main string while characters remain in the substring, stop and return `#f`, or return `#t` if you reach the end of the substring without a failure:

```(define (subset2? main sub)   (let loop ((m (sort char<? (string->list main)))              (s (sort char<? (string->list sub))))     (cond ((null? m) (null? s))           ((null? s) #t)           ((char<? (car m) (car s)) (loop (cdr m) s))           ((char<? (car s) (car m)) #f)           (else (loop (cdr m) (cdr s))))))```

That solution has time complexity O(n log n); it’s O(n log n) for each of the sorts plus O(n) to step through the string in parallel, but only the leading term counts.

We next look at two O(n) solutions. Both work by forming a list of the counts of each character in the main string, then subtracting for each character in the substring; if a count ever becomes negative the test fails. One solution uses a hash table to store the counts, the other uses an array indexed by the ordinal of the character.

```(define (subset3? main sub)   (let ((h (make-hash char->integer char=? 0 256)))     (do ((i 0 (+ i 1))) ((= (string-length main) i))       (let ((c (string-ref main i)))         (h 'update c (lambda (k v) (+ v 1)) 1)))     (let loop ((i 0))       (if (= (string-length sub) i) #t         (let ((c (string-ref sub i)))           (if (zero? (h 'lookup c)) #f             (begin (h 'update c (lambda (k v) (- v 1)) 0)                    (loop (+ i 1)))))))))```

```(define (subset4? main sub)   (let ((v (make-vector 256 0)))     (do ((i 0 (+ i 1))) ((= (string-length main) i))       (let ((k (char->integer (string-ref main i))))         (vector-set! v k (+ (vector-ref v k) 1))))     (let loop ((i 0))       (if (= (string-length sub) i) #t         (let ((k (char->integer (string-ref sub i))))           (if (zero? (vector-ref v k)) #f             (begin (vector-set! v k (- (vector-ref v k) 1))                    (loop (+ i 1)))))))))```

In the story, Guy used an algorithm based on prime numbers; map each character to a prime number, multiply all the characters in the main string, then divide by each character in the substring, reporting a non-match if any division fails:

```(define (prime? n) ; trial division   (cond ((or (not (integer? n)) (negative? n))           (error 'prime? "must be non-negative integer"))         ((< n 2) #f) ((even? n) (= n 2))         (else (let loop ((d 3))                 (cond ((< n (* d d)) #t)                       ((zero? (modulo n d)) #f)                       (else (loop (+ d 2))))))))```

```(define (nth-prime n) ; counting from 1, not 0   (if (= n 1) 2     (let loop ((n (- n 2)) (p 3))       (cond ((zero? n) p)             ((prime? (+ p 2))               (loop (- n 1) (+ p 2)))             (else (loop n (+ p 2)))))))```

```(define (subset5? main sub)   (let loop ((n 1) (m (map add1 (map char->integer (string->list main))))                    (s (map add1 (map char->integer (string->list sub)))))     (if (pair? m) (loop (* n (nth-prime (car m))) (cdr m) s)       (if (null? s) #t         (let ((d (nth-prime (car s))))           (if (zero? (modulo n d)) (loop (/ n d) m (cdr s)) #f))))))```

That algorithm has a sorta-kinda time complexity of O(n), as long as the strings aren’t too long. The constant is larger than the two previous algorithms, however, because of the need to map from characters to primes, and if the strings aren’t tiny, the cost of the multiplications and divisions using big-integer arithmetic will add a factor of log n to the time complexity.

If I were interviewing, I’d expect the candidate to come up with at least one of the O(n) versions, perhaps after considering one of the other versions; any candidate who seriously put forward Guy’s solution would be blackballed. And once a candidate did come up with one of the O(n) versions, I might propose Guy’s solution just to see if the candidate found what was wrong with it. As a practical matter, the sorting version isn’t bad — it’s best to think of sorting as a linear algorithm with a largish constant — and the choice between the two O(n) solutions is a matter of the size of the alphabet; for ascii, the array solution is best, but for unicode, hashing will probably win. I probably shouldn’t admit this, but as I was writing this exercise the only one of the five versions that worked the first time was the sorting version; it’s my favorite solution, short, simple and clear, hard to get wrong, there are no extraneous variables to be confused or incremented at the wrong time (I did both those things as I implemented them), and it’s easy to see that it must be right.

We used hash tables and `string-find` from the Standard Prelude. You can run the program at http://programmingpraxis.codepad.org/qH8N4FfU.

Pages: 1 2

### 22 Responses to “String Subsets”

1. […] today’s Programming Praxis exercise,we have to write functions to determine if one string is a subset of […]

2. My Haskell solution (see http://bonsaicode.wordpress.com/2010/11/23/programming-praxis-string-subsets/ for comments and the preceding versions):

```import qualified Data.IntMap as I

subsetOf4 :: Enum a => [a] -> [a] -> Bool
subsetOf4 xs ys = I.null \$ I.differenceWith
(\x y -> if x <= y then Nothing else Just x) (f xs) (f ys)
where f = I.fromListWith (+) . map (flip (,) 1 . fromEnum)
```
3. here is a commented version http://www.gleocadie.net/?p=355&lang=en#more-355

#include
#include
#include

int *freq_table(char *s)
{
int i;
int *ft = (int*)calloc(256, sizeof(int));

for (i = 0; i < strlen(s); i++)
ft[(int)s[i]]++;

return ft;
}

int is_subset(int *ft, char *s)
{
int i;

for (i = 0; i < strlen(s); i++){
int off = (int)s[i];
if (ft[off] == 0)
return -42;
ft[off]–;
}
return 42;
}

int main(int argc, char** argv)
{
int *ft;

if (argc != 3) {
printf("You should provide 2 arguments %d\n", argc);
return -1;
}
ft = freq_table(argv);
if (is_subset(ft, argv) == 42)
fprintf(stdout, "%s is subset of %s\n", argv, argv);
else
fprintf(stdout, "%s is not subset of %s\n", argv, argv);

return 0;
}

4. slabounty said

A simple ruby solution

5. slabounty said

Hmmm … lost the code somehow …

```class String
def subset_of?(s)
self.chars do |c|
return false if !s.include? c
s.sub!(c, "")
end
true
end
end

puts "DA".subset_of?("ABCD")
puts "DXA".subset_of?("ABCD")
```
6. Bogdan Popa said

Felt like using dict comprehensions.

```#!/usr/bin/env python3
"""usage: subset.py string second_string
"""
import sys

def count(s):
l = []
for i in range(len(s)):
c, n = s[i], 1
if c in l:
continue
for k in s[i + 1:]:
if k == c:
n += 1
yield c, n
l.append(c)

def is_subset(s, s2):
ap = {c: n for c, n in count(s)}
ap2 = {c: n for c, n in count(s2)}
for key in ap2.keys():
try:
if ap2[key] > ap[key]:
return False
except KeyError:
return False
return True

def main(args):
try:
print(is_subset(args, args))
except IndexError:
print(__doc__, end="")
return 1
return 0

if __name__ == "__main__":
sys.exit(main(sys.argv))
```
7. My PHP version

```function subset_of(\$needle,\$haystack) {
foreach(str_split(\$needle) as \$i) {
\$pos = strpos(\$haystack,\$i);
if (\$pos === FALSE) return FALSE;
\$haystack[\$pos] = '';
}
return TRUE;
}
```

subset_of(‘DA’,’ABCD’); // TRUE

8. In Python 2.7 using the Counter class:

from collections import Counter

def contains_all(a, b):
counts = Counter(a)
counts.subtract(b)
return min(counts.values()) >= 0

9. Graham said

Python solution, with comments about expected running time to give justification of design choices:

```
#!/usr/bin/env python2.6

def is_str_subset(str1, str2):
d = {}      # dict of letter --> count for letter in str2
for letter in str2:
# theta(m) where m = |str2|; most expensive part here, but cheaper on its own
# than sorting m first to quickly search [theta(m lg m)] and same as linearly
# searching unsorted str2
try:
d[letter] += 1
except KeyError:
d[letter] = 1
for letter in str1:
# O(n) where n = |str1|; n < m, so total work is theta(m)
try:
if d[letter] == 0:
return False
else:
d[letter] -= 1
except KeyError:
return False
return True

if __name__ == '__main__':
str1 = "DA"
str2 = "ABCD"
print "Is str1 a string subset of str2?\t%s" % is_str_subset(str1, str2)
print "Is str3 a string subset of str2?\t%s" % is_str_subset(str3, str2)
```
10. Clever said

Overly verbose yes – but quick and easy to understand I hope.

```# First string is the hopeful subset, second string is the set
def stringSubset (str1, str2):
# Contains a count for each letter in the strings
count1 = {}
count2 = {}

# Walk through the first string counting characters
for c in str1:
if not count1.has_key(c):
count1[c] = 1
else:
count1[c] = count1[c] + 1

# Walk through the second string counting characters
for c in str2:
if not count2.has_key(c):
count2[c] = 1
else:
count2[c] = count2[c] + 1

# Walk through subset dictionary
# Return false if any character count doesn't match
for char in count1.keys():
if not count1[char] == count2[char]:
return False

# False was not returned above therefore we can return true
return True

# Test
print(stringSubset("DA","ABCD"))
```
11. Axio said

;; Use a better one next time, please.
(define (char-sort l)
(if (null? l)
l
(let ((h (car l))
(t (cdr l)))
(append
(char-sort (filter (lambda (x) (char<=? x h)) t))
(list h)
(char-sort (filter (lambda (x) (char>? x h)) t))))))

(define (string-subset? s2 s1)
(define (magic set1 set2)
(cond
((null? set1) #t)
((null? set2) #f)
((char=? (car set1) (car set2)) (magic (cdr set1) (cdr set2)))
((char<? (car set1) (car set2)) #f)
(else (magic set1 (cdr set2)))))
(and (<= (string-length s1) (string-length s2))
(magic (char-sort (string->list s1))
(char-sort (string->list s2)))))

12. Clever:
you code is clear, but you miss that “DA” is subset of “ABCDA”.
The guilty part is:

```for char in count1.keys():
if not count1[char] == count2[char]:
return False
```

```for char in count1.keys():
if not count1[char] <= count2[char]:
return False
```

and those tests would have passed:

```# Test
print(stringSubset("DA","ABCD"))
print(stringSubset("DA","ABCDA"))
----
True
False
True
True
```
13. Clever said

Je vous remercie, Grégory! I completely glossed over that with my insufficient testing.

14. Kevin said

Some of the details definitely not necessary…just for readability. Feedback please! I’m trying to learn :D

```#include <iostream>
#include <string>
#include <algorithm>
using namespace std;

int main( int argc, char* argv[] )
{
cout << "Give me two strings:" << endl;
string str1, str2, str1bak, str2bak;
cin >> str1;
cin >> str2;
str1bak = str1;
str2bak = str2;

if(str1.length() > str2.length())
{
cout << "1st string is larger than the 2nd so it cannot be a string subset!" << endl;
return 0;
}

sort(str1.begin(), str1.end());
sort(str2.begin(), str2.end());

if(str1.length() == str2.length() && str1 == str2)
{
cout << "These two strings are set equivalent!" << endl;
return 0;
}
else if(str1.length() != str2.length())
{
int i=0;
while(str1 != "")
{
if(str1 == str2[i])
{
str1 = str1.substr(1);
}
else if(str1 < str2[i])
{
cout << str1bak << " is not a string subset of " << str2bak << endl;
return 0;
}
i++;
}
cout << str1bak << " is a string subset of " << str2bak << endl;
}

return 0;
}
```
15. Steve Valentine said

In the following algorithm, if you get to the end of the proposed subset string, it is subset else it is not.

#include
#include
#include

int main () {

using namespace std;
using namespace boost::assign;
typedef multiset MUL_ST;
typedef multiset::iterator MS_I;

MUL_ST target;
MUL_ST subset;

target += ‘a’, ‘a’, ‘a’, ‘b’, ‘c’, ‘c’, ‘c’, ‘d’, ‘u’, ‘z’, ‘b’;
target += ‘b’, ‘b’, ‘c’, ‘d’, ‘u’, ‘z’, ‘v’;
subset += ‘b’, ‘b’, ‘c’, ‘d’, ‘u’, ‘v’, ‘v’;

cout << "\n"; BOOST_FOREACH(char c, subset) cout << c << " "; cout << "\n";

MS_I si = subset.begin(); MS_I ti = target.begin();

while ( si != subset.end() && ti != target.end() ) {
if ( *si *ti ) {
++ti;
if ( ti == target.end() ) goto judgement;
}
if ( *si != *ti ) goto judgement;
++si; ++ti;
}

judgement:
if ( si == subset.end() ) cout << "is a subset of" << "\n";
else cout << "is NOT a subset of" << "\n";

BOOST_FOREACH(char c, target) cout << c << " "; cout << "\n";

return 0;
}

16. lojze said

Language: Clojure
(defn subset-of [s1 s2]
(every? #(<= (count (filter #{%} s1)) (count (filter #{%} s2))) s1))

17. Khanh Nguyen said

In F#

```let rec remove_first (c:char) (l:char list) =
match l with
| [] -> []
| x::xs -> if x = c then xs else x :: (remove_first c xs)

let rec charlist_subsets (a:char list) (b:char list) =
match a, b with
| _, [] -> true
| a, x::xs -> let t = remove_first x a
not(t = a) && (charlist_subsets t xs)

let string_subsets (a:string) (b:string) =
charlist_subsets (a.ToCharArray() |> List.ofArray) (b.ToCharArray() |> List.ofArray)

string_subsets "ABCD" "DA"

```
18. Guillaume said
```; Language: scheme (Gambit)
; - do not sort any string,
; - count characters only on the target substring, because it is smaller,
; - stop walking the longer string as soon as possible (success or failure).

(define (string-subset? s sub)
(if (< (string-length s) (string-length sub))
#f
(let* ((tc (table-count sub)) ; count characters on the target substring
(n (table-length tc))  ; number of unique characters
(nfound 0))

(let loop ((rest (string->list s)))

(cond ((= n nfound) #t)  ; success -> stop walking the longer string
((null? rest) #f)  ; failure -> stop walking the longer string
(else
(let* ((first (car rest))
(v (- (table-ref tc first) 1)))
(table-set! tc (- v 1))
(and (= v 0) ; done for this type of character
(set! nfound (+ nfound 1)))
(loop (cdr rest)))))))))

(define (table-count s)
(let ((ret (make-table init: 0)))

(let loop ((rest (string->list s)))

(if (null? rest)
ret
(let* ((first (car rest))
(next (cdr rest)))
(table-set! ret first (+ (table-ref ret first) 1))
(if (null? next)
ret
(loop next)))))))

(string-subset? "ABCD" "")
; #t
(string-subset? "ABCD" "DA")
; #t
; #f
```
19. Jebb said

Three Python solutions, similar to the ones you proposed:

```#!/usr/bin/python2.7

def is_subset(string, substring):
# This runs in O(n2)
L = list(string)
for letter in substring:
if letter in L:
L.remove(letter)
else:
return False
return True

def is_subset_faster(string, substring):
# This runs in the same order as the sorting function (O(n log n)?)
lstring = sorted(list(string))
lsubstring = sorted(list(substring))
i = 0
for letter in lsubstring:
if i == len(lstring):
return False
while letter > lstring[i]:
if i < len(lstring):
i += 1
else:
return False
if letter == lstring[i]:
i += 1
else:	#letter < lstring[i]: letter in substring not in string
return False
return True

def is_subset_fastest(string, substring):
# This runs in O(n)
sLetters = {}
subLetters = {}
for letter in string:
if letter not in sLetters:
sLetters[letter] = 1
else:
sLetters[letter] += 1
for letter in substring:
if letter not in subLetters:
subLetters[letter] = 1
else:
subLetters[letter] += 1
for letter in subLetters.keys():
print letter
if (letter not in sLetters.keys()) or (sLetters[letter] < subLetters[letter]):
return False
return True
```

I was delighted to see that I came up with the same two O(n2) and O(nlogn) solutions that you proposed… too bad I didn’t see the O(n) solution (that one was a “D’oh” moment, really) before reading the comments. Guess I won’t quit my day job yet :-)

20. Jonathan said

A common lisp version…

```(defun string-subset (substring total-set)
"Returns true if the substring only contains characters that are in total-set."
(let ((n (length substring))
(m (length substring))
(freq (make-hash-table))
(uniques 0))
(and (<= n m)
(progn
(iter (for c in-vector substring)
(if (gethash c freq)
(incf (gethash c freq))
(progn
(setf (gethash c freq) 1)
(incf uniques))))
(iter (for c in-vector total-set)
(for count = (gethash c freq))
(when count
(cond
((= count 1)
(progn
(decf (gethash c freq))
(decf uniques)
(when (zerop uniques)
(return-from string-subset t))))
((> count 1)
(decf (gethash c freq)))))
(finally (return-from string-subset nil)))))))
```
21. Sasha B said

A much shorter solution involves removing the word from the subset one letter at a time.

```def isSubset(word, subset):
"""
Determines whether all of the characters in subset appear in word.
The second string cannot be longer than the first string.
>>> isSubset("ABCD", "DA")
True
False
"""
assert len(subset) <= len(word), "The second string cannot be longer than the first string."
# Go along the subset and try and remove corresponding letters from the word
for letter in subset:
if word.replace(letter, '') != word: # it was replaced correctly
word = word.replace(letter, '') # remove the letter and continue
else:
return False # couldn't find the letter to replace, so its not there
return True # all letters exist in the word
```
22. This works out.

```def issubstring(src, strfind):
if len(strfind) > len(src):
return False
iIndex = 0
iLocation = 0
for s in src:
if iLocation == len(strfind):
return True
if s == strfind[iLocation]:
iLocation+=1
else:
iLocation = 0
iIndex+=1

return iLocation == len(strfind) - 1
```