Non-Breaking Hyphen

October 2, 2020

Today’s exercise tells the story of a problem I faced in my day job. A user apparently copy/pasted a field from Word, Excel or Outlook into our data processing system, which assumes plain ascii (the underlying database, Oracle, handles unicode properly, but the system on top of Oracle doesn’t). Unfortunately, although the field looked okay,

10001366650-1

the dash was actually a non-breaking hyphen, unicode U+2011 (hexadecimal 2011, or 8209₁₀), which broke the system in a rather large way. The field in question was the vendor invoice number in an accounts payable system, and when the check paying that invoice was written, the check-writing program dropped the remittance advice from that check, so every subsequent check had the wrong remittance advice attached. The error wasn’t discovered immediately, so some of the checks had already been mailed, which made recovery difficult (we couldn’t just void the check run and start over, because restarting the check run meant the check numbers wouldn’t match). So it was a grand old mess. In case you’re curious, I demonstrated the error with this SQL statement

select asciistr(fabinvh_vend_inv_code)
from   fabinvh
where  fabinvh_code = 'I2104519'

which returned

1000136650\20111
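
For readers without an Oracle instance to hand, the effect of that query can be approximated in a few lines of Python. This is only a sketch: the function borrows the name `asciistr` for illustration, it assumes every character lies in the Basic Multilingual Plane, and it does not try to reproduce every detail of the Oracle function. The input string is reconstructed from the output shown above.

```python
def asciistr(text: str) -> str:
    """Rough approximation of Oracle's ASCIISTR for BMP characters.

    ASCII characters pass through unchanged; anything else becomes a
    backslash followed by four hexadecimal digits.
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.append('\\{:04X}'.format(ord(ch)))
    return ''.join(out)


# The invoice number with a non-breaking hyphen (U+2011) before the last digit.
print(asciistr('1000136650\u20111'))  # prints 1000136650\20111
```

The escape `\2011` runs straight into the trailing `1`, which is why the result reads `\20111`.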

Unicode is nearly thirty years old. Users have the right to expect that their systems handle unicode characters properly.

Your task is to write a program that detects unicode/ascii errors; you might also tell us about any unicode horror stories you have faced. When you are finished, you are welcome to read or run a suggested solution, or to post your own solution or discuss the exercise in the comments below.

Posted by programmingpraxis

Filed in Exercises

5 Responses to “Non-Breaking Hyphen”

Dale said
October 2, 2020 at 7:02 PM
A non-breaking hyphen is 0x2011
Tim said
October 5, 2020 at 3:49 PM
I’m not sure what the “correct” behaviour is here. Obviously crashing the system wasn’t it, but should this field just be left untouched? Or should the non-breaking hyphen have been normalised to a regular hyphen? And if it’s the latter, is this an easy thing to do in general? (How many other characters look like hyphens?) Honestly, raising an error in the case that the invoice number contains non-numeric characters seems like the least evil thing. Otherwise we’ll wind up having to resort to OCR somewhere down the line.

Anyway, thanks for the amazing supply of interesting problems!
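
The two options raised in this comment, normalising or rejecting, can be sketched in Python. Everything here is an assumption made for illustration: the function names are hypothetical, and the rule that an invoice number may contain only ASCII digits and hyphens is a guess based on the example in the post, not a rule the real system is known to use. The normaliser relies on the Unicode general category `Pd` (dash punctuation), which includes the non-breaking hyphen but not lookalikes such as the minus sign U+2212 (category `Sm`), so it is not a complete answer to the question of how many characters look like hyphens.

```python
import sys
import unicodedata


def normalise_dashes(text: str) -> str:
    """Replace every character in category Pd (dash punctuation) with '-'."""
    return ''.join('-' if unicodedata.category(ch) == 'Pd' else ch
                   for ch in text)


def validate_invoice_number(text: str) -> str:
    """Return text unchanged if it holds only ASCII digits and hyphens.

    Raises ValueError otherwise. The allowed alphabet is an assumption.
    """
    allowed = set('0123456789-')
    bad = sorted({ch for ch in text if ch not in allowed})
    if bad:
        listing = ', '.join('U+{:04X}'.format(ord(ch)) for ch in bad)
        raise ValueError('unexpected characters: ' + listing)
    return text


# How many dash-like characters does this Python's Unicode database know?
# The count depends on the Unicode version bundled with the interpreter.
dashes = [cp for cp in range(sys.maxunicode + 1)
          if unicodedata.category(chr(cp)) == 'Pd']
print(len(dashes), 'characters in category Pd')

pasted = '10001366650\u20111'
print(validate_invoice_number(normalise_dashes(pasted)))  # prints 10001366650-1
```

Calling `validate_invoice_number(pasted)` without normalising first raises `ValueError`, which corresponds to the "least evil" option of rejecting the input outright.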
programmingpraxis said
October 5, 2020 at 4:24 PM
@Tim: The correct behavior is to decide whether to accept unicode characters either everywhere or nowhere. If you decide to accept unicode characters in the user interface, the database backend should accept them. If you decide the database backend will not accept unicode characters, the user interface should reject them (strip them, or beep at the user).

What has happened historically is that the vendor had a custom user interface and database backend; both used plain ascii. Parts of the system date to the 1980s. The vendor recently (in the last two years) updated the user interface so it could be used from any web browser, but kept the database backend (their advertising now features the word “mobile” prominently). Apparently, the user interface now allows unicode, but the database backend doesn’t. I filed a bug report, but it will be ignored (there are some bugs in the bug-tracking database that are over twenty years old).

Frankly, I am perfectly happy with the current state of affairs. If the vendor wrote good programs, I would be out of a job.

Daniel said

October 7, 2020 at 4:54 AM

Here’s a solution in Python.

def is_ascii(string: str) -> bool:
    try:
        string.encode('ascii')
        return True
    except UnicodeEncodeError:
        return False

print(is_ascii('programming-praxis'))
print(is_ascii('programming\u2011praxis'))

Output:

True
False
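
A yes/no answer does not say where the problem is, which matters when the offending character looks identical to an ordinary hyphen. One possible extension of the check above, using a hypothetical helper named `find_non_ascii`, reports the position, code point and Unicode name of each non-ASCII character:

```python
import unicodedata


def find_non_ascii(text: str) -> list:
    """Return (index, code point, name) for every non-ASCII character."""
    return [(i, 'U+{:04X}'.format(ord(ch)), unicodedata.name(ch, '<unnamed>'))
            for i, ch in enumerate(text)
            if ord(ch) > 127]


print(find_non_ascii('programming-praxis'))
print(find_non_ascii('programming\u2011praxis'))
```

Output:

```
[]
[(11, 'U+2011', 'NON-BREAKING HYPHEN')]
```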

Alex B said
October 7, 2020 at 8:26 AM
@Praxis – Working in Japanese text, where Unicode is not the norm, and sometimes with ‘standard’ encodings not conforming to the standards, I feel your pain here.

Just one comment though, non-breaking hyphen is U+2011 (0x2011, 8209₁₀); U+07db (0x7db, 2011₁₀) is Nko Letter Sa. Interestingly, this has Right-To-Left Bidirectional Class, which may be why the order is reversed when copying from terminal to WordPress.
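
The hexadecimal/decimal mix-up described in this comment is easy to check with Python's standard `unicodedata` module. This is a small sketch; the helper name `describe` is made up for illustration.

```python
import unicodedata


def describe(code_point: int) -> tuple:
    """Return (U+XXXX label, character name, bidirectional class)."""
    ch = chr(code_point)
    return ('U+{:04X}'.format(code_point),
            unicodedata.name(ch),
            unicodedata.bidirectional(ch))


# 2011 read as hexadecimal versus 2011 read as decimal.
print(0x2011, describe(0x2011))
print(2011, describe(2011))
```

The first line reports U+2011 NON-BREAKING HYPHEN (decimal 8209); the second reports U+07DB NKO LETTER SA with bidirectional class `R`, matching the comment.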
