String Data Type
February 5, 2019
We start with some decisions: Strings are immutable. They will be represented as a linked list of characters. Indexes start at zero. Characters may be either ascii or unicode, as the underlying Scheme implementation admits; all the functions in our string data type works in either case. Here we define a sample string, for testing, and procedures to convert back and forth between Scheme strings and our new data type, which we call str
to distinguish it from native Scheme string
s:
(define string->str string->list) (define str->string list->string) (define s (string->str "Programming Praxis"))
> s (#\P #\r #\o #\g #\r #\a #\m #\m #\i #\n #\g #\space #\P #\r #\a #\x #\i #\s) > (str->string s) "Programming Praxis"
Computing the length of a string, or extracting a particular character from a string, are simple. We provide a substring function in two arities, with or without an endpoint, from a starting position to 1 past an ending position; if the endpoint is omitted, it is taken to be the end of the string:
(define str-length length) (define str-ref list-ref) (define str-substring (case-lambda ((s start) (str-substring s start (+ (str-length s)))) ((s start past) (take (- past start) (drop start s)))))
> (str-length s) 18 > (str-ref s 8) #\i > (str-substring s 8) (#\i #\n #\g #\space #\P #\r #\a #\x #\i #\s) > (str-substring s 8 13) (#\i #\n #\g #\space #\P)
The last function we provide concatenates one or more strings:
(define str-concat append)
> (str-concat (string->str "Programming") (string->str " ") (string->str "Praxis")) (#\P #\r #\o #\g #\r #\a #\m #\m #\i #\n #\g #\space #\P #\r #\a #\x #\i #\s)
We leaned heavily on Scheme’s native lists, which is fine. I’m not going to implement it here, but it is interesting to consider how to port our library to C. The str
data type could be represented by a struct
with a length field and an array of characters of the given length. Then most of the library is obvious, but we have a problem with str->string
: what can we do with a str
that contains \0? We could ignore the problem and let our programs abort, but that is impolite. We could silently delete the \0 character, or convert it to some other character (a space character, or some weird unicode character), but both options merely push the problem somewhere else. We could throw some kind of exception. There is no good answer. In my program that inspired this exercise, I checked each str
as it was built, and if an attempt was made to insert a \0, my program wrote a message to an error log and ignored the attempt to add the \0 character, which was a “good enough” solution even if it wasn’t “good.”
You can run the program at https://ideone.com/2FBUDe.
Don’t know that it is very portable, but if I wanted a string library in C++ I might start off with something like this:
Didn’t know that was going to happen with the circled letters. Only seems to have mangled the “M”, strange.