String Data Type

February 5, 2019

We start with some decisions: Strings are immutable. They will be represented as a linked list of characters. Indexes start at zero. Characters may be either ascii or unicode, as the underlying Scheme implementation admits; all the functions in our string data type works in either case. Here we define a sample string, for testing, and procedures to convert back and forth between Scheme strings and our new data type, which we call str to distinguish it from native Scheme strings:

(define string->str string->list)
(define str->string list->string)
(define s (string->str "Programming Praxis"))
> s
(#\P #\r #\o #\g #\r #\a #\m #\m #\i #\n #\g #\space #\P #\r #\a #\x #\i #\s)
> (str->string s)
"Programming Praxis"

Computing the length of a string, or extracting a particular character from a string, are simple. We provide a substring function in two arities, with or without an endpoint, from a starting position to 1 past an ending position; if the endpoint is omitted, it is taken to be the end of the string:

(define str-length length)
(define str-ref list-ref)
(define str-substring (case-lambda
  ((s start) (str-substring s start (+ (str-length s))))
  ((s start past) (take (- past start) (drop start s)))))
> (str-length s)
18
> (str-ref s 8)
#\i
> (str-substring s 8)
(#\i #\n #\g #\space #\P #\r #\a #\x #\i #\s)
> (str-substring s 8 13)
(#\i #\n #\g #\space #\P)

The last function we provide concatenates one or more strings:

(define str-concat append)
> (str-concat (string->str "Programming")
              (string->str " ")
              (string->str "Praxis"))
(#\P #\r #\o #\g #\r #\a #\m #\m #\i #\n #\g #\space #\P #\r #\a #\x #\i #\s)

 

We leaned heavily on Scheme’s native lists, which is fine. I’m not going to implement it here, but it is interesting to consider how to port our library to C. The str data type could be represented by a struct with a length field and an array of characters of the given length. Then most of the library is obvious, but we have a problem with str->string: what can we do with a str that contains \0? We could ignore the problem and let our programs abort, but that is impolite. We could silently delete the \0 character, or convert it to some other character (a space character, or some weird unicode character), but both options merely push the problem somewhere else. We could throw some kind of exception. There is no good answer. In my program that inspired this exercise, I checked each str as it was built, and if an attempt was made to insert a \0, my program wrote a message to an error log and ignored the attempt to add the \0 character, which was a “good enough” solution even if it wasn’t “good.”

You can run the program at https://ideone.com/2FBUDe.

Pages: 1 2

2 Responses to “String Data Type”

  1. matthew said

    Don’t know that it is very portable, but if I wanted a string library in C++ I might start off with something like this:

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>
    #include <assert.h>
    #include <locale.h>
    #include <algorithm>
    #include <new>
    
    template <typename T>
    class Buffer {
    public:
      explicit Buffer(size_t l) : len(l), refcount(0) {}
      static Buffer *allocate(size_t l) {
        char *p = new char[offsetof(Buffer<T>,data)+(l+1)*sizeof(T)];
        Buffer *s = new (p) Buffer(l); // Placement new
        s->data[l] = T(0); // Null terminate
        return s;
      }
      T &operator[](size_t i) {
        assert(i <= len);
        return data[i];
      }
      void incref() { refcount++; }
      void decref() {
        refcount--;
        if (refcount == 0) {
          delete [] reinterpret_cast<char*>(this);
        }
      }
      size_t length() { return len; }
    private:
      size_t len;
      uint32_t refcount;
      T data[1];
    };
    
    template <typename T>
    class String {
    public:
      explicit String(size_t length)
        : buffer(Buffer<T>::allocate(length)) {             
        buffer->incref();
      }
      explicit String(const T *p) {
        const T *q = p;
        while(*q) q++;
        buffer = Buffer<T>::allocate(q-p);
        buffer->incref();
        std::copy(p,q,&(*buffer)[0]);
      }
      String(const String &s)
        : buffer(s.buffer) {
        buffer->incref();
      }
      ~String() {
        buffer->decref();
      }
      String &operator=(const String &s) {
        buffer->decref();
        buffer = s.buffer;
        buffer->incref();
        return *this;
      }
      String substring(size_t start, size_t len) {
        assert(start + len <= length());
        String s(len);
        std::copy(&(*buffer)[start],
                  &(*buffer)[start+len],
                  &(*s.buffer)[0]);
        return s;
      }
      T &operator[](size_t i) {
        assert(buffer);
        return (*buffer)[i];
      }
      size_t length() const {
        return buffer->length();
      }
    private:
      Buffer<T> *buffer;
    };
    
    int main() {
      setlocale(LC_CTYPE, "");
      String<wchar_t> s(L"abcdefghijklmnopqrstuvwxyz");
      printf("%S\n",&s[0]);
      for (int i = 0; i < 26; i++) {
        s[i] = 0x24b6+i;
      }
      printf("%S\n",&s[0]);
      s = s.substring(13,13);
      printf("%S\n",&s[0]);
    }
    
    $ ./a.out
    abcdefghijklmnopqrstuvwxyz
    ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏ
    ⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏ
    
  2. matthew said

    Didn’t know that was going to happen with the circled letters. Only seems to have mangled the “M”, strange.

Leave a comment