Non-Breaking Hyphen
October 2, 2020
We begin by creating a string with an embedded Unicode character:
(define s (string-append "1000136650" (string #\x7db) "1"))
> s
"1000136650ߛ1"
For some reason, the end of the string is reversed when copied from my terminal window to WordPress. Then we write a function asciistr
similar to the Oracle function of the same name:
(define (asciistr s) (map char->integer (string->list s)))
> (asciistr s)
(49 48 48 48 49 51 54 54 53 48 2011 49)
Now you can see the string is in the correct sequence. You can run the program at https://ideone.com/O6Dumh, where you will see
that Guile substitutes a different character for the non-breaking hyphen.
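For comparison, here is a minimal Python sketch of the same idea. It assumes the intended character is U+2011 NON-BREAKING HYPHEN (the Scheme snippet above uses #\x7db, which is 2011 in decimal). The names code_points and asciistr_like are made up for this illustration, and the escaping in asciistr_like only approximates what Oracle's ASCIISTR does (a backslash followed by four hex digits per UTF-16 code unit); it is not a drop-in replacement.

```python
def code_points(s):
    """Return the Unicode code point of every character, like the Scheme asciistr."""
    return [ord(ch) for ch in s]


def asciistr_like(s):
    """Escape non-ASCII characters as \\XXXX (hypothetical helper, Oracle-style)."""
    out = []
    for ch in s:
        if ord(ch) < 128:
            out.append(ch)
        else:
            # Emit one \XXXX group per UTF-16 code unit so that characters
            # outside the Basic Multilingual Plane become surrogate pairs.
            units = ch.encode("utf-16-be")
            for i in range(0, len(units), 2):
                out.append("\\%04X" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)


if __name__ == "__main__":
    s = "1000136650" + "\u2011" + "1"
    print(code_points(s))
    print(asciistr_like(s))
```

Either form makes the odd character visible even when the terminal or the browser renders it as something that looks like an ordinary hyphen.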
A non-breaking hyphen is 0x2011.
I’m not sure what the “correct” behaviour is here. Obviously crashing the system wasn’t it, but should this field just be left untouched? Or should the non-breaking hyphen have been normalised to a regular hyphen? And if it’s the latter, is this an easy thing to do in general? (How many other characters look like hyphens?) Honestly, raising an error in the case that the invoice number contains non-numeric characters seems like the least evil thing. Otherwise we’ll wind up having to resort to OCR somewhere down the line.
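The two options raised here, normalising or rejecting, can be sketched in Python. This is only an illustration under stated assumptions: it treats "looks like a hyphen" as "has Unicode general category Pd (dash punctuation)", which is one reasonable approximation but misses look-alikes such as U+2212 MINUS SIGN (category Sm) and U+00AD SOFT HYPHEN (category Cf). It also assumes an invoice number must consist of ASCII digits only. The function names are hypothetical.

```python
import sys
import unicodedata


def is_dash_like(ch):
    """True if the character is in general category Pd (dash punctuation)."""
    return unicodedata.category(ch) == "Pd"


def normalize_dashes(s):
    """Replace every Pd character with the plain ASCII hyphen-minus."""
    return "".join("-" if is_dash_like(ch) else ch for ch in s)


def count_dash_like():
    """Count the Pd characters known to this Python's Unicode database."""
    return sum(1 for cp in range(sys.maxunicode + 1) if is_dash_like(chr(cp)))


def validate_invoice_number(s):
    """Raise ValueError unless s is a non-empty run of ASCII digits (assumed rule)."""
    if not (s.isascii() and s.isdigit()):
        bad = sorted({"U+%04X" % ord(ch) for ch in s if not ("0" <= ch <= "9")})
        raise ValueError("invoice number contains non-digit characters: %s" % bad)
    return s


if __name__ == "__main__":
    print(normalize_dashes("1000136650\u20111"))
    print(count_dash_like())
    try:
        validate_invoice_number("1000136650\u20111")
    except ValueError as err:
        print(err)
```

Running count_dash_like gives a rough answer to "how many other characters look like hyphens?" for whichever Unicode version the interpreter ships with. Rejecting the input outright avoids having to maintain such a list at all.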
Anyway, thanks for the amazing supply of interesting problems!
@Tim: The correct behavior is to decide to accept Unicode characters either everywhere or nowhere. If you decide to accept Unicode characters in the user interface, the database backend should accept them. If you decide the database backend will not accept Unicode characters, the user interface should reject them (strip them, or beep at the user).
Historically, the vendor had a custom user interface and database backend, both of which used plain ASCII. Parts of the system date to the 1980s. Within the last two years the vendor updated the user interface so it could be used from any web browser (their advertising now features the word “mobile” prominently), but kept the database backend. Apparently the user interface now allows Unicode, but the database backend doesn’t. I filed a bug report, but it will be ignored (some bugs in the bug-tracking database are over twenty years old).
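As a rough sketch of the "nowhere" policy, the user-interface layer could either reject or strip anything outside ASCII before it reaches an ASCII-only backend. The two Python functions below are hypothetical helpers written for this illustration; nothing here reflects how the vendor's system actually works.

```python
def reject_non_ascii(text):
    """Return text unchanged, or raise ValueError listing the offending characters."""
    bad = [(i, "U+%04X" % ord(ch)) for i, ch in enumerate(text) if ord(ch) > 127]
    if bad:
        raise ValueError("non-ASCII characters at (index, code point): %s" % bad)
    return text


def strip_non_ascii(text):
    """Silently drop every character that cannot be encoded as ASCII."""
    return text.encode("ascii", errors="ignore").decode("ascii")


if __name__ == "__main__":
    field = "1000136650\u20111"
    print(strip_non_ascii(field))
    try:
        reject_non_ascii(field)
    except ValueError as err:
        print(err)
```

Stripping is the quieter choice but changes what the user typed; rejecting is the equivalent of beeping at the user and leaves the decision with them.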
Frankly, I am perfectly happy with the current state of affairs. If the vendor wrote good programs, I would be out of a job.
Here’s a solution in Python.
Output:
@Praxis – Working in Japanese text, where Unicode is not the norm, and sometimes with ‘standard’ encodings not conforming to the standards, I feel your pain here.
Just one comment though: the non-breaking hyphen is U+2011 (0x2011, 8209 in decimal); U+07DB (0x7db, 2011 in decimal) is NKo Letter Sa. Interestingly, that character has the Right-to-Left bidirectional class, which may be why the order is reversed when copying from the terminal to WordPress.
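The two code points can be checked against the Unicode character database with Python's standard unicodedata module. This is a small sketch; describe is a hypothetical helper name, and the values it returns depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def describe(ch):
    """Collect the basic Unicode properties of a single character."""
    return {
        "code_point": "U+%04X" % ord(ch),
        "decimal": ord(ch),
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),
        "bidi": unicodedata.bidirectional(ch),
    }


if __name__ == "__main__":
    # chr(0x2011) is the non-breaking hyphen; chr(0x7db) is the character
    # that the Scheme snippet in the post actually embeds.
    for ch in (chr(0x2011), chr(0x7DB)):
        print(describe(ch))
```

The "bidi" field is the property relevant to the reversed display: a character with a strong right-to-left class can cause adjacent digits to be laid out in a different visual order even though the stored sequence is unchanged.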