String Subsets

November 23, 2010

According to his blog, this interview question got Paul Tyma a programming job at Google.

We begin with the obvious brute-force solution: iterate through the substring, checking that each character is present in the main string. This is harder than it looks, because we need to be sure that we don’t re-use a letter from the main string. Our solution uses an array of booleans parallel to the main string, flipping the corresponding value from #f to #t whenever a character is used:

(define (subset1? main sub) (let ((v (make-vector (string-length main) #f))) (define (find c s n) (let ((k (string-find c s n))) (if (not k) #f (if (not (vector-ref v k)) (begin (vector-set! v k #t) k) (find c s (+ k 1)))))) (let loop ((i 0)) (cond ((= (string-length sub) i) #t) ((find (string (string-ref sub i)) main 0) (loop (+ i 1))) (else #f)))))

This solution is O(nm), where n is the length of the main string and m is the length of the substring; of course, this is the same as O(n²), on the assumption that m is of similar length to n. Another solution is to first sort both strings, then step through them; if you find a character in the substring that isn’t matched in the main string, or if you reach the end of the main string while characters remain in the substring, stop and return #f, or return #t if you reach the end of the substring without a failure:

(define (subset2? main sub) (let loop ((m (sort char<? (string->list main))) (s (sort char<? (string->list sub)))) (cond ((null? m) (null? s)) ((null? s) #t) ((char<? (car m) (car s)) (loop (cdr m) s)) ((char<? (car s) (car m)) #f) (else (loop (cdr m) (cdr s))))))

That solution has time complexity O(n log n); it’s O(n log n) for each of the sorts plus O(n) to step through the string in parallel, but only the leading term counts.

We next look at two O(n) solutions. Both work by forming a list of the counts of each character in the main string, then subtracting for each character in the substring; if a count ever becomes negative the test fails. One solution uses a hash table to store the counts, the other uses an array indexed by the ordinal of the character.

(define (subset3? main sub) (let ((h (make-hash char->integer char=? 0 256))) (do ((i 0 (+ i 1))) ((= (string-length main) i)) (let ((c (string-ref main i))) (h 'update c (lambda (k v) (+ v 1)) 1))) (let loop ((i 0)) (if (= (string-length sub) i) #t (let ((c (string-ref sub i))) (if (zero? (h 'lookup c)) #f (begin (h 'update c (lambda (k v) (- v 1)) 0) (loop (+ i 1)))))))))

(define (subset4? main sub) (let ((v (make-vector 256 0))) (do ((i 0 (+ i 1))) ((= (string-length main) i)) (let ((k (char->integer (string-ref main i)))) (vector-set! v k (+ (vector-ref v k) 1)))) (let loop ((i 0)) (if (= (string-length sub) i) #t (let ((k (char->integer (string-ref sub i)))) (if (zero? (vector-ref v k)) #f (begin (vector-set! v k (- (vector-ref v k) 1)) (loop (+ i 1)))))))))

In the story, Guy used an algorithm based on prime numbers; map each character to a prime number, multiply all the characters in the main string, then divide by each character in the substring, reporting a non-match if any division fails:

(define (prime? n) ; trial division (cond ((or (not (integer? n)) (negative? n)) (error 'prime? "must be non-negative integer")) ((< n 2) #f) ((even? n) (= n 2)) (else (let loop ((d 3)) (cond ((< n (* d d)) #t) ((zero? (modulo n d)) #f) (else (loop (+ d 2))))))))

(define (nth-prime n) ; counting from 1, not 0 (if (= n 1) 2 (let loop ((n (- n 2)) (p 3)) (cond ((zero? n) p) ((prime? (+ p 2)) (loop (- n 1) (+ p 2))) (else (loop n (+ p 2)))))))

(define (subset5? main sub) (let loop ((n 1) (m (map add1 (map char->integer (string->list main)))) (s (map add1 (map char->integer (string->list sub))))) (if (pair? m) (loop (* n (nth-prime (car m))) (cdr m) s) (if (null? s) #t (let ((d (nth-prime (car s)))) (if (zero? (modulo n d)) (loop (/ n d) m (cdr s)) #f))))))

That algorithm has a sorta-kinda time complexity of O(n), as long as the strings aren’t too long. The constant is larger than the two previous algorithms, however, because of the need to map from characters to primes, and if the strings aren’t tiny, the cost of the multiplications and divisions using big-integer arithmetic will add a factor of log n to the time complexity.

If I were interviewing, I’d expect the candidate to come up with at least one of the O(n) versions, perhaps after considering one of the other versions; any candidate who seriously put forward Guy’s solution would be blackballed. And once a candidate did come up with one of the O(n) versions, I might propose Guy’s solution just to see if the candidate found what was wrong with it. As a practical matter, the sorting version isn’t bad — it’s best to think of sorting as a linear algorithm with a largish constant — and the choice between the two O(n) solutions is a matter of the size of the alphabet; for ascii, the array solution is best, but for unicode, hashing will probably win. I probably shouldn’t admit this, but as I was writing this exercise the only one of the five versions that worked the first time was the sorting version; it’s my favorite solution, short, simple and clear, hard to get wrong, there are no extraneous variables to be confused or incremented at the wrong time (I did both those things as I implemented them), and it’s easy to see that it must be right.

We used hash tables and string-find from the Standard Prelude. You can run the program at http://programmingpraxis.codepad.org/qH8N4FfU.

Posted by programmingpraxis

Filed in Exercises

22 Comments »

22 Responses to “String Subsets”

Programming Praxis – String Subsets « Bonsai Code said
November 23, 2010 at 10:46 AM
[…] today’s Programming Praxis exercise,we have to write functions to determine if one string is a subset of […]

Remco Niemeijer said

November 23, 2010 at 10:47 AM

My Haskell solution (see http://bonsaicode.wordpress.com/2010/11/23/programming-praxis-string-subsets/ for comments and the preceding versions):

import qualified Data.IntMap as I

subsetOf4 :: Enum a => [a] -> [a] -> Bool
subsetOf4 xs ys = I.null $ I.differenceWith
    (\x y -> if x <= y then Nothing else Just x) (f xs) (f ys)
    where f = I.fromListWith (+) . map (flip (,) 1 . fromEnum)

Grégory LEOCADIE said
November 23, 2010 at 10:49 AM
here is a commented version http://www.gleocadie.net/?p=355&lang=en#more-355

#include
#include
#include

int *freq_table(char *s)
{
int i;
int *ft = (int*)calloc(256, sizeof(int));

for (i = 0; i < strlen(s); i++)
ft[(int)s[i]]++;

return ft;
}

int is_subset(int *ft, char *s)
{
int i;

for (i = 0; i < strlen(s); i++){
int off = (int)s[i];
if (ft[off] == 0)
return -42;
ft[off]–;
}
return 42;
}

int main(int argc, char** argv)
{
int *ft;

if (argc != 3) {
printf("You should provide 2 arguments %d\n", argc);
return -1;
}
ft = freq_table(argv[1]);
if (is_subset(ft, argv[2]) == 42)
fprintf(stdout, "%s is subset of %s\n", argv[2], argv[1]);
else
fprintf(stdout, "%s is not subset of %s\n", argv[2], argv[1]);

return 0;
}
slabounty said
November 23, 2010 at 5:06 PM
A simple ruby solution

slabounty said

November 23, 2010 at 5:07 PM

Hmmm … lost the code somehow …

class String
    def subset_of?(s)
        self.chars do |c|
            return false if !s.include? c
            s.sub!(c, "")
        end
        true
    end
end

puts "DA".subset_of?("ABCD")
puts "DXA".subset_of?("ABCD")
puts "DAD".subset_of?("ABCD")

Bogdan Popa said

November 23, 2010 at 6:43 PM

Felt like using dict comprehensions.

#!/usr/bin/env python3
"""usage: subset.py string second_string
"""
import sys

def count(s):
    l = []
    for i in range(len(s)):
        c, n = s[i], 1
        if c in l:
            continue
        for k in s[i + 1:]:
            if k == c:
                n += 1
        yield c, n
        l.append(c)

def is_subset(s, s2):
    ap = {c: n for c, n in count(s)}
    ap2 = {c: n for c, n in count(s2)}
    for key in ap2.keys():
        try:
            if ap2[key] > ap[key]:
                return False
        except KeyError:
            return False
    return True

def main(args):
    try:
        print(is_subset(args[1], args[2]))
    except IndexError:
        print(__doc__, end="")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv))

Ian Tyndall said

November 23, 2010 at 8:25 PM

My PHP version

function subset_of($needle,$haystack) {
    foreach(str_split($needle) as $i) {
        $pos = strpos($haystack,$i);
        if ($pos === FALSE) return FALSE;
        $haystack[$pos] = '';
    }
    return TRUE;
}

subset_of(‘DA’,’ABCD’); // TRUE
subset_of(‘DAD’,’ABCD’); // FALSE

Kent Johnson said
November 23, 2010 at 10:45 PM
In Python 2.7 using the Counter class:

from collections import Counter

def contains_all(a, b):
counts = Counter(a)
counts.subtract(b)
return min(counts.values()) >= 0

Graham said

November 24, 2010 at 2:26 AM

Python solution, with comments about expected running time to give justification of design choices:


#!/usr/bin/env python2.6

def is_str_subset(str1, str2):
    d = {}      # dict of letter --> count for letter in str2
    for letter in str2:
# theta(m) where m = |str2|; most expensive part here, but cheaper on its own
# than sorting m first to quickly search [theta(m lg m)] and same as linearly
# searching unsorted str2
        try:
            d[letter] += 1
        except KeyError:
            d[letter] = 1
    for letter in str1:
# O(n) where n = |str1|; n < m, so total work is theta(m)
        try:
            if d[letter] == 0:
                return False
            else:
                d[letter] -= 1
        except KeyError:
            return False
    return True


if __name__ == '__main__':
    str1 = "DA"
    str2 = "ABCD"
    str3 = "DAD"
    print "Is str1 a string subset of str2?\t%s" % is_str_subset(str1, str2)
    print "Is str3 a string subset of str2?\t%s" % is_str_subset(str3, str2)

Clever said

November 24, 2010 at 7:33 AM

Overly verbose yes – but quick and easy to understand I hope.

# First string is the hopeful subset, second string is the set
def stringSubset (str1, str2):
	# Contains a count for each letter in the strings
	count1 = {}
	count2 = {}
	
	# Walk through the first string counting characters
	for c in str1:
		if not count1.has_key(c):
			count1[c] = 1
		else:
			count1[c] = count1[c] + 1
			
	# Walk through the second string counting characters
	for c in str2:
		if not count2.has_key(c):
			count2[c] = 1
		else:
			count2[c] = count2[c] + 1
			
	# Walk through subset dictionary
	# Return false if any character count doesn't match
	for char in count1.keys():
		if not count1[char] == count2[char]:
			return False
			
	# False was not returned above therefore we can return true
	return True

# Test
print(stringSubset("DA","ABCD"))
print(stringSubset("DA","DAD"))

Axio said
November 24, 2010 at 9:12 AM
;; Use a better one next time, please.
(define (char-sort l)
  (if (null? l)
    l
    (let ((h (car l))
          (t (cdr l)))
      (append
        (char-sort (filter (lambda (x) (char<=? x h)) t))
        (list h)
        (char-sort (filter (lambda (x) (char>? x h)) t))))))

(define (string-subset? s2 s1)
  (define (magic set1 set2)
    (cond
      ((null? set1) #t)
      ((null? set2) #f)
      ((char=? (car set1) (car set2)) (magic (cdr set1) (cdr set2)))
      ((char<? (car set1) (car set2)) #f)
      (else (magic set1 (cdr set2)))))
  (and (<= (string-length s1) (string-length s2))
       (magic (char-sort (string->list s1))
              (char-sort (string->list s2)))))

Grégory LEOCADIE said

November 24, 2010 at 9:27 AM

Clever:
you code is clear, but you miss that “DA” is subset of “ABCDA”.
The guilty part is:

for char in count1.keys():
  if not count1[char] == count2[char]:
    return False

you would have had used:

for char in count1.keys():
  if not count1[char] <= count2[char]:
    return False

and those tests would have passed:

# Test
print(stringSubset("DA","ABCD"))
print(stringSubset("DAD","ABCD"))
print(stringSubset("DA","DAD"))
print(stringSubset("DA","ABCDA"))
----
True
False
True
True

Clever said
November 25, 2010 at 12:28 AM
Je vous remercie, Grégory! I completely glossed over that with my insufficient testing.

Kevin said

November 25, 2010 at 8:14 AM

Some of the details definitely not necessary…just for readability. Feedback please! I’m trying to learn :D

#include <iostream>
#include <string>
#include <algorithm>
using namespace std;

int main( int argc, char* argv[] )
{
	cout << "Give me two strings:" << endl;
	string str1, str2, str1bak, str2bak;
	cin >> str1;
	cin >> str2;
	str1bak = str1;
	str2bak = str2;

	if(str1.length() > str2.length())
	{
		cout << "1st string is larger than the 2nd so it cannot be a string subset!" << endl;
		return 0;
	}

	sort(str1.begin(), str1.end());
	sort(str2.begin(), str2.end());

	if(str1.length() == str2.length() && str1 == str2)
	{
		cout << "These two strings are set equivalent!" << endl;
		return 0;
	}
	else if(str1.length() != str2.length())
	{
		int i=0;
		while(str1 != "")
		{
			if(str1[0] == str2[i])
			{
				str1 = str1.substr(1);
			}
			else if(str1[0] < str2[i])
			{
				cout << str1bak << " is not a string subset of " << str2bak << endl;
				return 0;
			}
			i++;
		}
		cout << str1bak << " is a string subset of " << str2bak << endl;
	}

	return 0;
}

Steve Valentine said
November 26, 2010 at 4:05 PM
In the following algorithm, if you get to the end of the proposed subset string, it is subset else it is not.

#include
#include
#include

int main () {

using namespace std;
using namespace boost::assign;
typedef multiset MUL_ST;
typedef multiset::iterator MS_I;

MUL_ST target;
MUL_ST subset;

target += ‘a’, ‘a’, ‘a’, ‘b’, ‘c’, ‘c’, ‘c’, ‘d’, ‘u’, ‘z’, ‘b’;
target += ‘b’, ‘b’, ‘c’, ‘d’, ‘u’, ‘z’, ‘v’;
subset += ‘b’, ‘b’, ‘c’, ‘d’, ‘u’, ‘v’, ‘v’;

cout << "\n"; BOOST_FOREACH(char c, subset) cout << c << " "; cout << "\n";

MS_I si = subset.begin(); MS_I ti = target.begin();

while ( si != subset.end() && ti != target.end() ) {
if ( *si *ti ) {
++ti;
if ( ti == target.end() ) goto judgement;
}
if ( *si != *ti ) goto judgement;
++si; ++ti;
}

judgement:
if ( si == subset.end() ) cout << "is a subset of" << "\n";
else cout << "is NOT a subset of" << "\n";

BOOST_FOREACH(char c, target) cout << c << " "; cout << "\n";

return 0;
}
lojze said
November 27, 2010 at 4:21 AM
Language: Clojure
(defn subset-of [s1 s2]
(every? #(<= (count (filter #{%} s1)) (count (filter #{%} s2))) s1))

Khanh Nguyen said

November 29, 2010 at 9:28 PM

In F#

let rec remove_first (c:char) (l:char list) =
    match l with
    | [] -> []
    | x::xs -> if x = c then xs else x :: (remove_first c xs)

let rec charlist_subsets (a:char list) (b:char list) = 
    match a, b with 
    | _, [] -> true
    | a, x::xs -> let t = remove_first x a 
                  not(t = a) && (charlist_subsets t xs)

let string_subsets (a:string) (b:string) = 
    charlist_subsets (a.ToCharArray() |> List.ofArray) (b.ToCharArray() |> List.ofArray)

string_subsets "ABCD" "DA"
string_subsets "ABCD" "DAD"

Guillaume said

November 30, 2010 at 1:55 PM

; Language: scheme (Gambit)
; - do not sort any string,
; - count characters only on the target substring, because it is smaller,
; - stop walking the longer string as soon as possible (success or failure).

(define (string-subset? s sub)
  (if (< (string-length s) (string-length sub))
      #f
      (let* ((tc (table-count sub)) ; count characters on the target substring
             (n (table-length tc))  ; number of unique characters
             (nfound 0))
        
        (let loop ((rest (string->list s)))
          
          (cond ((= n nfound) #t)  ; success -> stop walking the longer string
                ((null? rest) #f)  ; failure -> stop walking the longer string
                (else
                 (let* ((first (car rest))
                        (v (- (table-ref tc first) 1)))
                   (table-set! tc (- v 1))
                   (and (= v 0) ; done for this type of character
                        (set! nfound (+ nfound 1)))
                   (loop (cdr rest)))))))))
        
(define (table-count s)
  (let ((ret (make-table init: 0)))

    (let loop ((rest (string->list s)))

      (if (null? rest)
          ret
          (let* ((first (car rest))
                 (next (cdr rest)))
            (table-set! ret first (+ (table-ref ret first) 1))
            (if (null? next)
                ret
                (loop next)))))))

(string-subset? "ABCD" "")
; #t
(string-subset? "ABCD" "DA")
; #t
(string-subset? "ABCD" "DAD")
; #f

Jebb said

December 5, 2010 at 7:06 PM

Three Python solutions, similar to the ones you proposed:

#!/usr/bin/python2.7

def is_subset(string, substring):
# This runs in O(n2)
	L = list(string)
	for letter in substring:
		if letter in L:
			L.remove(letter)
		else:
			return False	
	return True

def is_subset_faster(string, substring):
# This runs in the same order as the sorting function (O(n log n)?)
	lstring = sorted(list(string))
	lsubstring = sorted(list(substring))
	i = 0
	for letter in lsubstring:
		if i == len(lstring):
			return False
		while letter > lstring[i]:
			if i < len(lstring):
				i += 1
			else:
				return False
		if letter == lstring[i]:
			i += 1
		else:	#letter < lstring[i]: letter in substring not in string
			return False
	return True

def is_subset_fastest(string, substring):
# This runs in O(n)
	sLetters = {}
	subLetters = {}
	for letter in string:
		if letter not in sLetters:
			sLetters[letter] = 1
		else:
			sLetters[letter] += 1
	for letter in substring:
		if letter not in subLetters:
			subLetters[letter] = 1
		else:
			subLetters[letter] += 1
	for letter in subLetters.keys():
		print letter
		if (letter not in sLetters.keys()) or (sLetters[letter] < subLetters[letter]):
			return False
	return True

I was delighted to see that I came up with the same two O(n2) and O(nlogn) solutions that you proposed… too bad I didn’t see the O(n) solution (that one was a “D’oh” moment, really) before reading the comments. Guess I won’t quit my day job yet :-)

Jonathan said

December 12, 2010 at 11:16 PM

A common lisp version…

(defun string-subset (substring total-set)
  "Returns true if the substring only contains characters that are in total-set."
  (let ((n (length substring))
        (m (length substring))
        (freq (make-hash-table))
        (uniques 0))
    (and (<= n m)
      (progn
        (iter (for c in-vector substring)
          (if (gethash c freq)
            (incf (gethash c freq))
            (progn
              (setf (gethash c freq) 1)
              (incf uniques))))
        (iter (for c in-vector total-set)
              (for count = (gethash c freq))
          (when count
            (cond
              ((= count 1)
                (progn
                  (decf (gethash c freq))
                  (decf uniques)
                  (when (zerop uniques)
                    (return-from string-subset t))))
              ((> count 1)
                (decf (gethash c freq)))))
          (finally (return-from string-subset nil)))))))

Sasha B said

January 17, 2011 at 5:16 AM

A much shorter solution involves removing the word from the subset one letter at a time.

def isSubset(word, subset):
    """
    Determines whether all of the characters in subset appear in word.
    The second string cannot be longer than the first string.
    >>> isSubset("ABCD", "DA")
    True
    >>> isSubset("ABCD", "DAD")
    False
    """
    assert len(subset) <= len(word), "The second string cannot be longer than the first string."
    # Go along the subset and try and remove corresponding letters from the word
    for letter in subset:
        if word.replace(letter, '') != word: # it was replaced correctly
            word = word.replace(letter, '') # remove the letter and continue
        else:
            return False # couldn't find the letter to replace, so its not there
    return True # all letters exist in the word

k A r T h I k______ said

December 20, 2011 at 7:28 AM

This works out.

def issubstring(src, strfind):
    if len(strfind) > len(src):
        return False
    iIndex = 0
    iLocation = 0
    for s in src:
        if iLocation == len(strfind):
            return True
        if s == strfind[iLocation]:
            iLocation+=1
        else:
            iLocation = 0
        iIndex+=1
     
    return iLocation == len(strfind) - 1

Programming Praxis

String Subsets

November 23, 2010

22 Responses to “String Subsets”

Leave a comment

Categories

Archives

Archives

Programming Praxis

String Subsets

November 23, 2010

Share this:

Related

22 Responses to “String Subsets”

Leave a comment

Categories

Archives

Archives