String Data Type

February 5, 2019

Different programming languages handle strings differently. In C, for instance, strings are null-terminated arrays of characters, so the \0 character is not permitted in strings; other languages permit any character in a string. Some languages are limited to ASCII, others admit Unicode. Some languages allow strings to be mutated, while others make strings immutable. Some languages number the characters in a string starting from 0, others from 1. If you’re porting a program from one language to another, dealing with strings can be a headache. (You can probably guess accurately at the motivation for today’s exercise.)
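To make the null-terminator point concrete, here is a small sketch in C++ contrasting a C-style length with a counted length. The helper names c_length and counted_length are made up for illustration; they are not part of the exercise.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Length as C sees it: strlen scans until the first '\0',
// so anything after an embedded '\0' is invisible.
std::size_t c_length(const char *p) { return std::strlen(p); }

// Length of a counted string built from an explicit byte count,
// so embedded '\0' bytes are kept as ordinary characters.
std::size_t counted_length(const char *p, std::size_t n) {
  return std::string(p, n).size();
}
```

Given the five bytes a, b, \0, c, d, the first function sees only the two characters before the terminator, while the second keeps all five.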

Today’s exercise calls for you to write a portable string data type that can be easily moved from one language to another. You are free to define strings as you wish; not everyone will have the same needs. Whatever you decide, you should provide conversions to and from the strings in your native language, as well as some basic operations on strings: determine their length, find the character at a particular position, modify a character if you decide strings should be mutable, extract a substring, concatenate two strings, perhaps others. Make your data type as simple or as elaborate as you wish.
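One possible shape for such a type, sketched in C++ under some assumptions of my own: strings are counted sequences of Unicode code points, indexed from 0, mutable, and the "native" string is taken to be std::u32string. The class name PString and all of its member names are hypothetical; this is one way to cover the listed operations, not a suggested solution.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical portable string: a counted sequence of code points,
// indexed from 0 and mutable. Any code point, including 0, is allowed.
class PString {
public:
  PString() {}
  // Conversion from the native string type (assumed: std::u32string).
  explicit PString(const std::u32string &s) : cp(s.begin(), s.end()) {}
  // Conversion back to the native string type.
  std::u32string to_native() const {
    return std::u32string(cp.begin(), cp.end());
  }
  std::size_t length() const { return cp.size(); }
  // Character at position i; at() throws std::out_of_range if i is bad.
  std::uint32_t get(std::size_t i) const { return cp.at(i); }
  // Modify the character at position i in place.
  void set(std::size_t i, std::uint32_t c) { cp.at(i) = c; }
  // Copy of len code points starting at position start.
  PString substring(std::size_t start, std::size_t len) const {
    assert(start <= cp.size() && len <= cp.size() - start);
    PString r;
    r.cp.assign(cp.begin() + start, cp.begin() + start + len);
    return r;
  }
  // New string holding this string followed by other.
  PString concat(const PString &other) const {
    PString r(*this);
    r.cp.insert(r.cp.end(), other.cp.begin(), other.cp.end());
    return r;
  }
private:
  std::vector<std::uint32_t> cp;
};
```

Storing a length rather than a terminator sidesteps the \0 restriction, and fixing the index base and mutability up front means those decisions travel with the type when it is ported.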

Your task is to write a portable string data type that can be easily moved from one language to another. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.


2 Responses to “String Data Type”

  1. matthew said

    Don’t know that it is very portable, but if I wanted a string library in C++ I might start off with something like this:

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>
    #include <assert.h>
    #include <locale.h>
    #include <algorithm>
    #include <new>
    
    template <typename T>
    class Buffer {
    public:
      explicit Buffer(size_t l) : len(l), refcount(0) {}
      static Buffer *allocate(size_t l) {
        char *p = new char[offsetof(Buffer<T>,data)+(l+1)*sizeof(T)];
        Buffer *s = new (p) Buffer(l); // Placement new
        s->data[l] = T(0); // Null terminate
        return s;
      }
      T &operator[](size_t i) {
        assert(i <= len);
        return data[i];
      }
      void incref() { refcount++; }
      void decref() {
        refcount--;
        if (refcount == 0) {
          delete [] reinterpret_cast<char*>(this);
        }
      }
      size_t length() { return len; }
    private:
      size_t len;
      uint32_t refcount;
      T data[1];
    };
    
    template <typename T>
    class String {
    public:
      explicit String(size_t length)
        : buffer(Buffer<T>::allocate(length)) {             
        buffer->incref();
      }
      explicit String(const T *p) {
        const T *q = p;
        while(*q) q++;
        buffer = Buffer<T>::allocate(q-p);
        buffer->incref();
        std::copy(p,q,&(*buffer)[0]);
      }
      String(const String &s)
        : buffer(s.buffer) {
        buffer->incref();
      }
      ~String() {
        buffer->decref();
      }
      String &operator=(const String &s) {
        s.buffer->incref(); // Incref first so self-assignment is safe
        buffer->decref();
        buffer = s.buffer;
        return *this;
      }
      String substring(size_t start, size_t len) {
        assert(start + len <= length());
        String s(len);
        std::copy(&(*buffer)[start],
                  &(*buffer)[start+len],
                  &(*s.buffer)[0]);
        return s;
      }
      T &operator[](size_t i) {
        assert(buffer);
        return (*buffer)[i];
      }
      size_t length() const {
        return buffer->length();
      }
    private:
      Buffer<T> *buffer;
    };
    
    int main() {
      setlocale(LC_CTYPE, "");
      String<wchar_t> s(L"abcdefghijklmnopqrstuvwxyz");
      printf("%ls\n",&s[0]); // %ls is the standard wide-string conversion
      for (int i = 0; i < 26; i++) {
        s[i] = 0x24b6+i; // U+24B6 is CIRCLED LATIN CAPITAL LETTER A
      }
      printf("%ls\n",&s[0]);
      s = s.substring(13,13);
      printf("%ls\n",&s[0]);
    }
    
    $ ./a.out
    abcdefghijklmnopqrstuvwxyz
    ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏ
    ⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏ
    
  2. matthew said

    Didn’t know that was going to happen with the circled letters. Only seems to have mangled the “M”, strange.
