Titlecase
April 12, 2016
The only tricky part of this exercise is the definition of “word.” We’ll say that a word starts with a letter or digit and ends at the next whitespace character. That way, “PRAXIS” is a word, as is “isn’t”, the “p” in “\”praxis\”” (that’s a string of eight characters, of which the first and last are double-quotes) gets capitalized, and the “p” in “123p” doesn’t. We’ll process the string in two passes: The first pass, which is implicit in the string-downcase function, makes all characters lower case. The second pass uses a boolean auxiliary variable inword, initially #f, which becomes #t when it sees an alphanumeric character not preceded by an alphanumeric character and is reset to #f when it sees whitespace:
(define (titlecase str)
  (let ((str (string-downcase str)) (inword #f))
    (do ((i 0 (+ i 1)))
        ((= i (string-length str)) str)
      (if inword
          (when (char-whitespace? (string-ref str i))
            (set! inword #f))
          (when (and (char-alphanumeric? (string-ref str i))
                     (or (zero? i)
                         (not (char-alphanumeric? (string-ref str (- i 1))))))
            (set! inword #t)
            (string-set! str i (char-upcase (string-ref str i))))))))
That’s hard to read only because all the Scheme string and character functions have long names. Here are some examples:
> (titlecase "programming PRAXIS")
Programming Praxis
> (titlecase "\"prax'is\"")
"Prax'is"
> (titlecase "123p")
123p
You can run the program at http://ideone.com/5vw6ur, where you will also see the definition of char-alphanumeric?.
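Since char-alphanumeric? is not a standard Scheme procedure, titlecase will not run without a definition for it. The following is only a plausible sketch, assumed here and not necessarily the definition used at the link: it treats a character as alphanumeric when the standard predicates char-alphabetic? or char-numeric? accept it.

```scheme
;; Assumed helper, not necessarily the definition at the ideone link:
;; a character counts as alphanumeric if it is a letter or a digit.
(define (char-alphanumeric? c)
  (or (char-alphabetic? c)
      (char-numeric? c)))
```

With this definition, (char-alphanumeric? #\p) and (char-alphanumeric? #\1) are true, while the double-quote and apostrophe characters are rejected, which is what the examples above rely on.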
Having re-read the problem, I actually want to split on spaces only…
Alternate Perl solution:
[sourcecode lang="perl"]
#!perl
use strict;
use warnings;
sub tc { return join ' ', map { ucfirst lc } split /\s+/, $_[0]; }
[/sourcecode]
A Haskell version. The resulting strings are printed in quotes to demonstrate that the original whitespace is maintained.
Here’s a simple FSA for the problem in C++, templated so we can do both traditional and wide strings:
All of these solutions are buggy, because they don’t deal properly with the full complexities of Unicode. I believe the sample implementation of SRFI 129 will work correctly assuming only correct char-upcase and char-downcase from the underlying Scheme; the additional maps required are provided by the titlemaps.scm file in the same repo. The code isn’t exactly efficient, but it works. Note that it uppercases any letter preceded by a non-letter, and space is not special (so “foo-bar” titlecases as “Foo-Bar”).
Here’s part of the rationale from the SRFI, which explains why R6RS gets this only partly right:
@John: good point. How does that code deal with Dutch IJ (when not represented as a single codepoint) or the notorious Turkish ‘I’?
It would have to be tailored for specific languages. You want “ijzeren” (meaning “iron”) titlecased as “IJzeren”, but you don’t want “ijtihad” (romanized Arabic for “diligence” or “independent judgment”, literally “struggle with oneself”) to become “IJtihad”. The R6RS/R7RS definitions of char-upcase and char-downcase specifically exclude the Turkish/Azeri and Lithuanian special cases of casing.
@John: tricky stuff indeed. And just as tricky is the question of word boundaries – won’t your function titlecase “isn’t” to “Isn’T”, though? Not sure what the best thing to do is here; maybe just follow http://unicode.org/reports/tr29/#Word_Boundaries (‘The correct interpretation of hyphens in the context of word boundaries is challenging’).
TR 29 is indeed the Right Thing, but SRFI 129 deliberately doesn’t specify it. It’s another of those language-sensitive issues: “doesn’t” is a single word and shouldn’t become “Does’Nt”, but “l’assommoir” (French slang meaning something like “the joint” or “the dive”, and the title of a novel by Zola) is underlyingly two words and should become “L’Assommoir”. Language-insensitive code can only do so much.
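To make the trade-off concrete, here is a minimal sketch of the language-insensitive rule discussed in this thread: uppercase any letter preceded by a non-letter, lowercase every other letter. It is not the SRFI 129 sample implementation; the name titlecase-nonletter is made up for illustration, and it relies only on the standard char-alphabetic?, char-upcase and char-downcase, so it ignores the Unicode title-case mappings entirely.

```scheme
;; Illustrative sketch only (hypothetical name, not the SRFI 129 code):
;; uppercase a letter when the previous character is not a letter,
;; lowercase it otherwise, and leave non-letters untouched.
(define (titlecase-nonletter str)
  (let loop ((chars (string->list str))
             (prev-letter? #f)          ; was the previous character a letter?
             (acc '()))                 ; result characters, in reverse
    (if (null? chars)
        (list->string (reverse acc))
        (let* ((c (car chars))
               (letter? (char-alphabetic? c)))
          (loop (cdr chars)
                letter?
                (cons (cond ((not letter?) c)
                            (prev-letter? (char-downcase c))
                            (else (char-upcase c)))
                      acc))))))

(titlecase-nonletter "foo-bar")      ; => "Foo-Bar"
(titlecase-nonletter "l'assommoir")  ; => "L'Assommoir"
(titlecase-nonletter "isn't")        ; => "Isn'T"
```

The same rule that gets “L'Assommoir” right produces “Isn'T”, which is why no single language-insensitive rule can satisfy both cases. Note also that this sketch turns "123p" into "123P", unlike the version in the post.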
@John – Sure – I was about to say we could assume an English milieu for this problem, but even then that doesn’t help – I can’t think of a case where there is more than 1 letter before the apostrophe that does get capitalized, but for just 1 letter we can have both forms (“Y’All”, “O’Clock”, “P’s and Q’s”, “o’er the lea”), though assuming capitalization for the second part (to cover O’Shaughnessy and L’Escargot) with an explicit list of exceptions might be adequate.
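A rough sketch of that heuristic for a single word might look like the following. Everything here is an assumption made for illustration: the name titlecase-word, the ASCII apostrophe, and in particular the contents of the exception list, which would need to be curated by hand.

```scheme
;; Hypothetical exception list: words with one letter before the
;; apostrophe whose second part should stay lowercase.
(define apostrophe-exceptions '("o'er" "p's" "q's"))

;; Illustrative sketch (hypothetical name): capitalize the first
;; character of a word, and also the character after an apostrophe
;; when exactly one character precedes the apostrophe and the word
;; is not a listed exception.
(define (titlecase-word word)
  (let* ((w (string-downcase word))
         (n (string-length w)))
    (if (zero? n)
        w
        (let ((out (string-copy w)))    ; fresh, mutable copy
          (string-set! out 0 (char-upcase (string-ref out 0)))
          (when (and (> n 2)
                     (char=? (string-ref w 1) #\')
                     (not (member w apostrophe-exceptions)))
            (string-set! out 2 (char-upcase (string-ref out 2))))
          out))))

(titlecase-word "o'shaughnessy")  ; => "O'Shaughnessy"
(titlecase-word "o'er")           ; => "O'er"
(titlecase-word "isn't")          ; => "Isn't"
```

Words with more than one letter before the apostrophe are never capitalized after it, so this sketch yields “De'ath” rather than “De'Ath”.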
There is the name “De’Ath” of course, though some spell it “De’ath” (e.g. https://en.wikipedia.org/wiki/Wilfred_De'ath, though Wikipedia seems a bit confused on the matter). There was a lecturer at university called De’Ath, not sure how he pronounced it, but he was pretty universally known as Doctor Death.
In the novel Murder Must Advertise, Lord Peter Wimsey uses the pseudonym “Death Bredon”, which is actually his two middle names. When asked about his first name, he says: “It’s spelt Death. Pronounce it any way you like. Most of the people who are plagued with it make it rhyme with teeth, but personally I think it sounds more picturesque when rhymed with breath.”
@John: Nice quote, thanks. I wonder if The Nine Tailors has anything to help with the current problem.
Solution in C#, using LINQ and Extension Methods:
Solution in C#: