String Subsets
November 23, 2010
According to his blog, this interview question got Paul Tyma a programming job at Google.
We begin with the obvious brute-force solution: iterate through the substring, checking that each character is present in the main string. This is harder than it looks, because we need to be sure that we don’t re-use a letter from the main string. Our solution uses an array of booleans parallel to the main string, flipping the corresponding value from #f
to #t
whenever a character is used:
(define (subset1? main sub)
(let ((v (make-vector (string-length main) #f)))
(define (find c s n)
(let ((k (string-find c s n)))
(if (not k) #f
(if (not (vector-ref v k))
(begin (vector-set! v k #t) k)
(find c s (+ k 1))))))
(let loop ((i 0))
(cond ((= (string-length sub) i) #t)
((find (string (string-ref sub i)) main 0) (loop (+ i 1)))
(else #f)))))
This solution is O(nm), where n is the length of the main string and m is the length of the substring; of course, this is the same as O(n2), on the assumption that m is of similar length to n. Another solution is to first sort both strings, then step through them; if you find a character in the substring that isn’t matched in the main string, or if you reach the end of the main string while characters remain in the substring, stop and return #f
, or return #t
if you reach the end of the substring without a failure:
(define (subset2? main sub)
(let loop ((m (sort char<? (string->list main)))
(s (sort char<? (string->list sub))))
(cond ((null? m) (null? s))
((null? s) #t)
((char<? (car m) (car s)) (loop (cdr m) s))
((char<? (car s) (car m)) #f)
(else (loop (cdr m) (cdr s))))))
That solution has time complexity O(n log n); it’s O(n log n) for each of the sorts plus O(n) to step through the string in parallel, but only the leading term counts.
We next look at two O(n) solutions. Both work by forming a list of the counts of each character in the main string, then subtracting for each character in the substring; if a count ever becomes negative the test fails. One solution uses a hash table to store the counts, the other uses an array indexed by the ordinal of the character.
(define (subset3? main sub)
(let ((h (make-hash char->integer char=? 0 256)))
(do ((i 0 (+ i 1))) ((= (string-length main) i))
(let ((c (string-ref main i)))
(h 'update c (lambda (k v) (+ v 1)) 1)))
(let loop ((i 0))
(if (= (string-length sub) i) #t
(let ((c (string-ref sub i)))
(if (zero? (h 'lookup c)) #f
(begin (h 'update c (lambda (k v) (- v 1)) 0)
(loop (+ i 1)))))))))
(define (subset4? main sub)
(let ((v (make-vector 256 0)))
(do ((i 0 (+ i 1))) ((= (string-length main) i))
(let ((k (char->integer (string-ref main i))))
(vector-set! v k (+ (vector-ref v k) 1))))
(let loop ((i 0))
(if (= (string-length sub) i) #t
(let ((k (char->integer (string-ref sub i))))
(if (zero? (vector-ref v k)) #f
(begin (vector-set! v k (- (vector-ref v k) 1))
(loop (+ i 1)))))))))
In the story, Guy used an algorithm based on prime numbers; map each character to a prime number, multiply all the characters in the main string, then divide by each character in the substring, reporting a non-match if any division fails:
(define (prime? n) ; trial division
(cond ((or (not (integer? n)) (negative? n))
(error 'prime? "must be non-negative integer"))
((< n 2) #f) ((even? n) (= n 2))
(else (let loop ((d 3))
(cond ((< n (* d d)) #t)
((zero? (modulo n d)) #f)
(else (loop (+ d 2))))))))
(define (nth-prime n) ; counting from 1, not 0
(if (= n 1) 2
(let loop ((n (- n 2)) (p 3))
(cond ((zero? n) p)
((prime? (+ p 2))
(loop (- n 1) (+ p 2)))
(else (loop n (+ p 2)))))))
(define (subset5? main sub)
(let loop ((n 1) (m (map add1 (map char->integer (string->list main))))
(s (map add1 (map char->integer (string->list sub)))))
(if (pair? m) (loop (* n (nth-prime (car m))) (cdr m) s)
(if (null? s) #t
(let ((d (nth-prime (car s))))
(if (zero? (modulo n d)) (loop (/ n d) m (cdr s)) #f))))))
That algorithm has a sorta-kinda time complexity of O(n), as long as the strings aren’t too long. The constant is larger than the two previous algorithms, however, because of the need to map from characters to primes, and if the strings aren’t tiny, the cost of the multiplications and divisions using big-integer arithmetic will add a factor of log n to the time complexity.
If I were interviewing, I’d expect the candidate to come up with at least one of the O(n) versions, perhaps after considering one of the other versions; any candidate who seriously put forward Guy’s solution would be blackballed. And once a candidate did come up with one of the O(n) versions, I might propose Guy’s solution just to see if the candidate found what was wrong with it. As a practical matter, the sorting version isn’t bad — it’s best to think of sorting as a linear algorithm with a largish constant — and the choice between the two O(n) solutions is a matter of the size of the alphabet; for ascii, the array solution is best, but for unicode, hashing will probably win. I probably shouldn’t admit this, but as I was writing this exercise the only one of the five versions that worked the first time was the sorting version; it’s my favorite solution, short, simple and clear, hard to get wrong, there are no extraneous variables to be confused or incremented at the wrong time (I did both those things as I implemented them), and it’s easy to see that it must be right.
We used hash tables and string-find
from the Standard Prelude. You can run the program at http://programmingpraxis.codepad.org/qH8N4FfU.
[…] today’s Programming Praxis exercise,we have to write functions to determine if one string is a subset of […]
My Haskell solution (see http://bonsaicode.wordpress.com/2010/11/23/programming-praxis-string-subsets/ for comments and the preceding versions):
here is a commented version http://www.gleocadie.net/?p=355&lang=en#more-355
#include
#include
#include
int *freq_table(char *s)
{
int i;
int *ft = (int*)calloc(256, sizeof(int));
for (i = 0; i < strlen(s); i++)
ft[(int)s[i]]++;
return ft;
}
int is_subset(int *ft, char *s)
{
int i;
for (i = 0; i < strlen(s); i++){
int off = (int)s[i];
if (ft[off] == 0)
return -42;
ft[off]–;
}
return 42;
}
int main(int argc, char** argv)
{
int *ft;
if (argc != 3) {
printf("You should provide 2 arguments %d\n", argc);
return -1;
}
ft = freq_table(argv[1]);
if (is_subset(ft, argv[2]) == 42)
fprintf(stdout, "%s is subset of %s\n", argv[2], argv[1]);
else
fprintf(stdout, "%s is not subset of %s\n", argv[2], argv[1]);
return 0;
}
A simple ruby solution
Hmmm … lost the code somehow …
Felt like using dict comprehensions.
My PHP version
subset_of(‘DA’,’ABCD’); // TRUE
subset_of(‘DAD’,’ABCD’); // FALSE
In Python 2.7 using the Counter class:
from collections import Counter
def contains_all(a, b):
counts = Counter(a)
counts.subtract(b)
return min(counts.values()) >= 0
Python solution, with comments about expected running time to give justification of design choices:
Overly verbose yes – but quick and easy to understand I hope.
;; Use a better one next time, please.
(define (char-sort l)
(if (null? l)
l
(let ((h (car l))
(t (cdr l)))
(append
(char-sort (filter (lambda (x) (char<=? x h)) t))
(list h)
(char-sort (filter (lambda (x) (char>? x h)) t))))))
(define (string-subset? s2 s1)
(define (magic set1 set2)
(cond
((null? set1) #t)
((null? set2) #f)
((char=? (car set1) (car set2)) (magic (cdr set1) (cdr set2)))
((char<? (car set1) (car set2)) #f)
(else (magic set1 (cdr set2)))))
(and (<= (string-length s1) (string-length s2))
(magic (char-sort (string->list s1))
(char-sort (string->list s2)))))
Clever:
you code is clear, but you miss that “DA” is subset of “ABCDA”.
The guilty part is:
you would have had used:
and those tests would have passed:
Je vous remercie, Grégory! I completely glossed over that with my insufficient testing.
Some of the details definitely not necessary…just for readability. Feedback please! I’m trying to learn :D
In the following algorithm, if you get to the end of the proposed subset string, it is subset else it is not.
#include
#include
#include
int main () {
using namespace std;
using namespace boost::assign;
typedef multiset MUL_ST;
typedef multiset::iterator MS_I;
MUL_ST target;
MUL_ST subset;
target += ‘a’, ‘a’, ‘a’, ‘b’, ‘c’, ‘c’, ‘c’, ‘d’, ‘u’, ‘z’, ‘b’;
target += ‘b’, ‘b’, ‘c’, ‘d’, ‘u’, ‘z’, ‘v’;
subset += ‘b’, ‘b’, ‘c’, ‘d’, ‘u’, ‘v’, ‘v’;
cout << "\n"; BOOST_FOREACH(char c, subset) cout << c << " "; cout << "\n";
MS_I si = subset.begin(); MS_I ti = target.begin();
while ( si != subset.end() && ti != target.end() ) {
if ( *si *ti ) {
++ti;
if ( ti == target.end() ) goto judgement;
}
if ( *si != *ti ) goto judgement;
++si; ++ti;
}
judgement:
if ( si == subset.end() ) cout << "is a subset of" << "\n";
else cout << "is NOT a subset of" << "\n";
BOOST_FOREACH(char c, target) cout << c << " "; cout << "\n";
return 0;
}
Language: Clojure
(defn subset-of [s1 s2]
(every? #(<= (count (filter #{%} s1)) (count (filter #{%} s2))) s1))
In F#
Three Python solutions, similar to the ones you proposed:
I was delighted to see that I came up with the same two O(n2) and O(nlogn) solutions that you proposed… too bad I didn’t see the O(n) solution (that one was a “D’oh” moment, really) before reading the comments. Guess I won’t quit my day job yet :-)
A common lisp version…
A much shorter solution involves removing the word from the subset one letter at a time.
This works out.