Keith Devens .com

Friday, January 9, 2004

Unicode for XML

Via Internet Alchemy, Entry-Level Unicode for XML looks like exactly what I want to read. I'm writing parsers for a non-XML markup language, and many of the Unicode issues I face are exactly the same as the ones XML has. The parsers have been essentially done for a long time now, but I have to make sure they all work with Unicode. I also own a copy of Unicode Demystified, which is excellent, though this article is, as expected, more targeted to my purposes than the book.

If the language supports Unicode directly, it's pretty much a piece of cake to support Unicode. I don't expect much work at all in Java, Python, etc. But one of my main questions is this: for languages that don't support Unicode natively, such as PHP and C/C++, can I just store text in strings as UTF-8, and let whatever application my libraries hand data off to worry about converting it to a 16-bit format if it wants to?


Comments

Jon Hanna (http://www.hackcraft.net/) wrote:

You aren't guaranteed to be able to do that with C++ (nor, I understand, with PHP). Really, it depends on the application (if it doesn't say it'll handle UTF-8 in this way then it probably won't).
If the application uses UTF-16 or UTF-32 then the code to transcode from UTF-8 is simple and can work efficiently on a streaming basis. If it's UCS-2 it's easy enough (you just need to work out what you're going to do if the UTF-8 contains a character UCS-2 doesn't contain). If the 16-bit encoding you refer to is something else then the complexity will vary according to the encoding in question.
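To make the "simple and streaming" point concrete, here is a minimal sketch of a byte-at-a-time UTF-8 to UTF-16 transcoder. The names (`Utf8ToUtf16`, `utf8_to_utf16`, `kReplacement`) are hypothetical and not from any library, and replacing malformed input with U+FFFD is an assumption about error handling, not something the discussion specifies. It rejects overlong forms and encoded surrogates.

```cpp
#include <cassert>
#include <string>

// Hypothetical names throughout; a sketch, not a library API.
constexpr char32_t kReplacement = 0xFFFD;  // U+FFFD REPLACEMENT CHARACTER

class Utf8ToUtf16 {
public:
    // Feed one byte; appends zero, one, or two UTF-16 code units to `out`.
    void feed(unsigned char byte, std::u16string& out) {
        if (remaining_ == 0) {
            if (byte < 0x80)                emit(byte, out);                 // ASCII
            else if ((byte & 0xE0) == 0xC0) start(byte & 0x1F, 1, 0x80);     // 2-byte lead
            else if ((byte & 0xF0) == 0xE0) start(byte & 0x0F, 2, 0x800);    // 3-byte lead
            else if ((byte & 0xF8) == 0xF0) start(byte & 0x07, 3, 0x10000);  // 4-byte lead
            else                            emit(kReplacement, out);         // invalid lead
        } else if ((byte & 0xC0) == 0x80) {
            // Continuation byte: shift in its six payload bits.
            cp_ = (cp_ << 6) | (byte & 0x3F);
            if (--remaining_ == 0) {
                // Values below min_ are overlong encodings; reject them.
                emit(cp_ < min_ ? kReplacement : cp_, out);
            }
        } else {
            // Sequence was cut short: flag it, then reprocess this byte.
            remaining_ = 0;
            emit(kReplacement, out);
            feed(byte, out);
        }
    }

    // Call once at end of input to flag a truncated final sequence.
    void finish(std::u16string& out) {
        if (remaining_ != 0) {
            remaining_ = 0;
            emit(kReplacement, out);
        }
    }

private:
    void start(char32_t bits, int remaining, char32_t min) {
        assert(remaining >= 1 && remaining <= 3);
        cp_ = bits;
        remaining_ = remaining;
        min_ = min;
    }

    static void emit(char32_t cp, std::u16string& out) {
        // Surrogates and values past U+10FFFF are not valid scalar values.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = kReplacement;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }

    char32_t cp_ = 0;
    char32_t min_ = 0;
    int remaining_ = 0;
};

// Convenience wrapper for input that is already fully in memory.
std::u16string utf8_to_utf16(const std::string& bytes) {
    Utf8ToUtf16 transcoder;
    std::u16string out;
    for (unsigned char b : bytes) transcoder.feed(b, out);
    transcoder.finish(out);
    return out;
}
```

The transcoder keeps only a few integers of state, so input can arrive in chunks of any size. For a UCS-2 target, the surrogate-pair branch in `emit` is where the "what do I do with a character UCS-2 doesn't contain" decision would go.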

∴ Jon Hanna | 27-Feb-2004 7:08am est | http://www.hackcraft.net/ | #4009

Keith (http://keithdevens.com/) wrote:

> If the 16-bit encoding you refer to is something else...

I just meant something like what Java and C# use natively to store Unicode strings.

> You aren't guaranteed to be able to do that with C++ (nor, I understand, with PHP).

Why not? PHP strings and C++ strings are 8-bit clean. Obviously asking the language for the length of the string won't return the correct number of characters unless the string contains only ASCII characters, but would you recommend against storing the raw UTF-8 data in a string like that for other reasons?

Keith | 27-Feb-2004 2:36pm est | http://keithdevens.com/ | #4013

Jon Hanna (http://www.hackcraft.net/) wrote:

> I just meant something like what Java and C# use natively to store Unicode strings.

Grand so, few worries there.

> Why not? PHP strings and C++ strings are 8-bit clean.

Yes, you can put UTF-8 in them, and you will generally be safe with them. Surprises can arise when you come to use a function that isn't expecting UTF-8, and strlen(), for example, will return the number of code units rather than the number of code points, as you state. But if you're hip to the possibility of stuff like that then you can be quite safe in storing stuff in UTF-8.
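The code-unit versus code-point difference is easy to see in a few lines of C++. This is a rough sketch only: `utf8_code_points` and `check_example` are hypothetical names, and the counter assumes the string already holds well-formed UTF-8 (it counts every byte that is not a continuation byte, 10xxxxxx).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Counts code points in a UTF-8 string by skipping continuation bytes.
// Assumes well-formed UTF-8; malformed input gives a meaningless count.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) ++count;  // lead byte or ASCII
    }
    return count;
}

// "café" stored as raw UTF-8: the é takes two bytes (0xC3 0xA9).
void check_example() {
    const std::string cafe = "caf\xC3\xA9";
    assert(std::strlen(cafe.c_str()) == 5);  // code units (bytes)
    assert(cafe.size() == 5);                // also bytes
    assert(utf8_code_points(cafe) == 4);     // code points
}
```

The string itself round-trips untouched, which is the "8-bit clean" part; only functions that assume one byte per character give surprising answers.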

That said, I find UTF-16 is easier to work with than UTF-8 in C++. YMMV.

∴ Jon Hanna | 9-Mar-2004 12:52pm est | http://www.hackcraft.net/ | #4099
