CJKV Information Processing 2nd ed.

Posted by samzenpus on Wednesday July 08, 2009 @02:00PM from the read-all-about-it dept.

stoolpigeon writes "At the end of last year, I made a move from an IT shop focused on supporting the US side of our business to a department that provides support to our operations outside the US. This was the first time I've worked in an international context and found myself, on a regular basis, running into long-time assumptions that were no longer true. My first project was implementing a third-party, web-based HR system for medium-sized offices. I found myself constantly missing important issues because I had such a narrow approach to the problem space. Sure, I've built applications and databases that supported Unicode, but I've never actually implemented anything with them but the same types of systems I'd built in the past with ASCII. But a large portion of the world's population is in Asia, and ASCII is certainly not going to cut it there. Fortunately, a new edition of Ken Lunde's classic CJKV Information Processing has become available, and it has really opened my eyes." Keep reading for the rest of JR's review.

CJKV Information Processing 2nd ed.
author	Ken Lunde
pages	898
publisher	O'Reilly Media, Inc.
rating	10/10
reviewer	JR Peck
ISBN	978-0-596-51447-1
summary	Chinese, Japanese, Korean and Vietnamese computing.

CJKV Information Processing has a long history that goes back to the 1980s. It began as a simple text document, JAPAN.INF, available via FTP on a number of servers. This document was excerpted, refined, and published in 1993 as Lunde's first book, Understanding Japanese Information Processing. Shortly afterward JAPAN.INF became CJK.INF, and the foundation for the first edition of CJKV Information Processing was born. The first edition was published in 1999, and it is safe to say that a number of important things have changed over the last 10 years. In the preface, Lunde cites four major developments that prompted this second edition: the emergence of Unicode, OpenType, and the Portable Document Format (PDF) as preferred tools, and the maturing of the web to the point where it uses Unicode and handles a wider range of languages and their character sets.

Lunde sets out not to create an exhaustive reference on the languages themselves, but rather an exhaustive guide to the considerations that come into play when processing CJKV information. As Lunde states, "...this book focuses heavily on how CJKV text is handled on computer systems in a very platform-independent way..." Given the complexity of the topic, the breadth of the work and the degree to which it is independent of any specific technology, outside a heavy bias for Unicode, are extremely impressive. A glance over the table of contents shows just how true this is. Chapter 9, Information Processing Techniques, has sections touching on C/C++, Java, Perl, Python, Ruby, Tcl and others. These are brief, with most examples in Java, but that they are all directly addressed shows a great awareness of the options out there. The sections that deal with operating system issues have the same breadth. Chapter 10, OSes, Text Editors, and Word Processors, doesn't just hit the top Mac and Windows items. It looks at FreeBSD, Linux, Mac OS X, MS Vista, MS-DOS, Plan 9, OpenSolaris, Unix and more. There are also sections for what Lunde calls hybrid environments, such as Boot Camp, CrossOver Mac, Gnome, KDE, VMware Fusion, Wine and the X Window System. Interestingly, the word processor section covers AbiWord and KWord but not OpenOffice.org. The point stands: for anyone looking to support CJKV, this book will probably cover your platform and give you, at the very least, a starting point with your chosen tool set.

That said, an extremely specific implementation is not what Lunde is out to offer up. This is the very opposite of a 'cook book' approach. This also makes the book extremely useful to anyone dealing with internationalization, globalization or localization issues regardless of character set or language. Lunde teaches the underlying principles of how writing systems and scripts work. He then moves to how computer systems deal with these various writing systems and scripts. The focus is always on CJKV but the principles will hold true in any setting. This continues to be the case as Lunde talks about character sets, encoding, code conversion and a host of other issues that surround handling characters. Typography is included, as well as input and output methods. In each case Lunde covers the basics and points out areas of concern and where exceptions may cause issues. The author is nothing if not thorough in this regard. His knowledge of the problem space is at times downright staggering. Lunde also touches on dictionaries as well as publishing in print and on the web.

The first three chapters set the table for the rest of the book with an overview of the issues that will be addressed, information on the history and usage of the writing systems and scripts covered, and the character set standards that exist. This was a fascinating glimpse into CJKV languages and, once again, into how other languages are dealt with as well. I think there is a lot here that would be extremely informative to a person who wants to learn more about CJKV, even if they are not a developer who will be working with one of the languages. That's only the first quarter of the book, so I don't know that it would be worth it from just that perspective, but it is definitely a nice benefit of Lunde's approach.

The style is very readable, but I wouldn't just hand this to someone who didn't have some familiarity with text processing issues on computer systems. While there is no requirement to know or understand one of the CJKV languages, understanding how computer systems process data and information is important. I did not know anything about CJKV languages prior to reading the book and have learned quite a bit. What I learned was not limited to the CJKV arena. The experience I had was very similar to when I studied ancient Greek in school. Learning Greek, I learned much more about English grammar than I had ever picked up before. Reading CJKV Information Processing, I learned quite a bit more about the issues involved in things like character encoding and typography for every language, not just these four. But in dealing with CJKV specifically I've found that Lunde's work is indispensable. It is not just my go-to reference, it's essentially my only reference. If any other works do come my way, this is the standard against which they will be judged.

There are thirteen indexes including a nice glossary. Nine of them are character sets, which were printed out in the longer first edition. In this second edition, there is a note on each, with a URL pointing to a PDF with the information. It seemed odd, but each URL gets its own page. This means there are nine pages with nothing but the title of the index and a URL. Fortunately they are all in the same directory, which can be reached directly from the book's page at the O'Reilly site. It seems it would have made sense to just list them all on a single page, but maybe it was necessary for some reason. It's a minute flaw in what is a great book.



one page min. per index / appendix / chapter (Score:4, Interesting)

by WillAdams ( 45638 ) writes: on Wednesday July 08, 2009 @02:19PM (#28625883) Homepage

is likely a limitation of the use of FrameMaker to compose the document and an unwillingness to set up new styles to put them together (unfortunately O'Reilly hasn't used TeX for a title since _Making TeX Work_), and was probably let stand since they needed a particular page count to come out to even signatures anyway.
William

Overlaps with "Unicode Explained"? (Score:4, Interesting)

by tcopeland ( 32225 ) writes: <tom@thomasleecAU ... d.com minus poet> on Wednesday July 08, 2009 @02:33PM (#28626131) Homepage

When I was working on my JavaCC book [generating...javacc.com] I bought Jukka Korpela's Unicode Explained [amazon.com] and it was *extremely* helpful. After reading it I actually felt comfortable using various tools to convert from one encoding to another, discussing multibyte character sets, and so forth. It helped me write the Unicode chapter in my book with some confidence. It was the first time I had used vi to enter Unicode characters... fun times.
That said, it sounds like "CJKV Information Processing" covers some of the same ground. Has anyone read both?

Fonts and encoding (Score:4, Interesting)

by jbolden ( 176878 ) writes: on Wednesday July 08, 2009 @04:25PM (#28627825) Homepage

I own the first edition of CJKV but I find Fonts and encodings [google.com] to be far more useful. Obviously if you are working heavily in any of these languages the 2nd best book is worth having, but I'd say that F&E feels like a systematic treatment while CJKV feels like 1000 pages of web articles on the topic.

Re:Oh boy... (Score:3, Interesting)

by SL Baur ( 19540 ) writes: <steve@xemacs.org> on Wednesday July 08, 2009 @05:04PM (#28628327) Homepage Journal

I'm not sure why this was modded offtopic.
s/English/ASCII/ and I got plenty of complaints along those lines in my mailbox over the years. Supporting Asian languages can be expensive in terms of processing time. Japanese companies *can* be insular, been there done that. I have no experience with the CKV part.
Fortunately the state of the art in computing hardware has improved over the years and it's not as expensive as it used to be.
Their English web presence leaves something to be desired, but I agree with their mission statement - http://www.m17n.org/index.html [m17n.org] Those are the guys who first did Asian language support for emacs. I worked with them for a year in Japan.

Re:The Absolute Minimum..." (Score:5, Interesting)

by bcrowell ( 177657 ) writes: on Wednesday July 08, 2009 @06:11PM (#28629145) Homepage

Nice article -- thanks for providing the link! I liked this: "There Ain't No Such Thing As Plain Text. If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly."
This is not a hard problem to solve in the case of email and web pages, which can have encoding given in headers. (If you validate your page using the w3c validator, it will warn you if you didn't supply an encoding.) It's also not an insanely hard problem for strings in memory; the encoding can be either set by your encoding convention or handled behind the scenes by your language (as in perl).
What really sucks is files. For instance, I wrote this [lightandmatter.com] extremely simple terminal-based personal calendar program in perl, and it's actually attracted a decent number of users. It's internationalized in 11 languages. Well, one day a user sends me an email complaining that the program is giving him mysterious error messages. He sends me his calendar file, which is a plain text file with some Swedish in it. I run the program on my machine with his calendar file, and it works fine. I can't reproduce the bug. We go through a few rounds of confused communication before I finally realize that he must have had the file encoded in Latin-1 on his end, whereas my program is documented as requiring utf-8. So now my program has to include the following cruft:
    sub file_is_valid_utf8 {
      my $f = shift;
      open(F,"<:raw",$f) or return 0;
      local $/;
      my $x=<F>;
      close F;
      return is_valid_utf8($x);
    }

    # What's passed to this routine has to be a stream of bytes, not a utf8 string in which the characters are complete utf8 characters.
    # That's why you typically want to call file_is_valid_utf8 rather than calling this directly.
    sub is_valid_utf8 {
      my $x = shift;
      return utf8::decode(my $dummy = $x);
    }

Yech. It requires reading the file twice, and it's not even 100% reliable.
This is the kind of situation where the Unix philosophy, based on plain text files and little programs that read and write them, really runs into a problem. With hindsight, it would have been really, really helpful if Unix filesystems could have included just a smidgen more metadata, enough to specify the character encoding.
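
The same check can be sketched in Python, as an illustration and not a port of the Perl above: read the bytes once, attempt a strict UTF-8 decode, and keep the decoded text so the file does not have to be read a second time. The function names `decode_if_utf8` and `read_utf8_file` are hypothetical.

```python
from typing import Optional


def decode_if_utf8(raw: bytes) -> Optional[str]:
    """Return the decoded text if raw is valid UTF-8, otherwise None."""
    try:
        # errors="strict" makes the decoder raise on any malformed sequence.
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


def read_utf8_file(path: str) -> Optional[str]:
    """Read a file once, as bytes, and validate it while decoding."""
    with open(path, "rb") as f:
        return decode_if_utf8(f.read())


if __name__ == "__main__":
    swedish = "Sm\u00f6rg\u00e5sbord"
    print(decode_if_utf8(swedish.encode("utf-8")))    # decodes normally
    print(decode_if_utf8(swedish.encode("latin-1")))  # None: not valid UTF-8
```

The "not 100% reliable" caveat still applies. A strict decode rejects most Latin-1 text that contains accented letters, but some Latin-1 byte sequences are also valid UTF-8 (the bytes 0xC3 0xA9 are "Ã©" in Latin-1 and "é" in UTF-8), so a successful decode only shows that the bytes could be UTF-8.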

Re:The Absolute Minimum..." (Score:3, Interesting)

by spitzak ( 4019 ) writes: on Wednesday July 08, 2009 @09:11PM (#28630931) Homepage

What you are encountering is a typical moron implementation of UTF-8.
For some reason otherwise intelligent programmers lose their minds when presented with UTF-8. They act as though the program will crash instantly if they ever make a pointer that points at the middle of a character, or if they fail to correctly count the "characters" in a string and dare to use an offset or number of bytes. I am not really certain what causes these diseases but being exposed to decades of character==byte ASCII programming seems responsible.
One way I try to correct this is to get them to think about "words" the same way they are thinking about "characters". Do they panic that there is not a fast method of moving by N words? Do they panic that it is possible to split a string in the middle of a word and thus produce two incorrectly-spelled words? Do they think that copying text in fixed-sized blocks from one location to another will somehow garble it because at the midpoint a word got split into two parts? No, not if they have any brains at all. However for some reason when they see multibyte characters all sense goes out the window.
Here is how you solve it: you keep the text as UTF-8 and you treat it as an array of BYTES! Smoke will NOT come out of your computer because you don't continuously think about the "characters"; in fact the text will, amazingly enough, be remembered, and the bits will not change because you failed to continuously look at the character boundaries or you dared to not count how many there were! When it comes time to display it, you parse out each UTF-8 character and draw it on the screen (and also do all that complex Pango-like layout). At the same time, through an amazing ability of the UTF-8 decoder to recognize that it can't decode something, you will, FOR FREE, find the errors. You can then render the bytes of the error sequence in another way, perhaps by choosing the matching character from CP1252.
This completely avoids the need for "metadata" and "BOM" and all that other crap, and magically works when a user accidentally pastes text from different encodings together, something that no metadata can ever solve.
This isn't rocket science or magic, but for some reason it appears to be for a lot of people. You included, and many many other intelligent people. Come on, everybody, please think a little!
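
A minimal Python sketch of the approach described above, assuming that undecodable bytes should be shown as their CP1252 equivalents. The text stays as raw bytes until display time, and a custom decode error handler substitutes for each invalid sequence. `codecs.register_error` is part of the standard library; the handler name `cp1252_fallback` and the function `decode_for_display` are hypothetical.

```python
import codecs


def _cp1252_fallback(err):
    """Error handler: reinterpret the undecodable bytes as CP1252."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # A few bytes (0x81, 0x8D, ...) are undefined in Python's cp1252 codec,
    # so fall back to U+FFFD for those instead of raising again.
    return bad.decode("cp1252", errors="replace"), err.end


codecs.register_error("cp1252_fallback", _cp1252_fallback)


def decode_for_display(raw: bytes) -> str:
    """Decode at display time only; storage and copying stay byte-based."""
    return raw.decode("utf-8", errors="cp1252_fallback")


if __name__ == "__main__":
    # Bytes pasted together from two encodings, as in the comment above.
    mixed = "na\u00efve ".encode("utf-8") + "caf\u00e9 au lait".encode("latin-1")
    print(decode_for_display(mixed))
```

Valid UTF-8 passes through unchanged, and the fallback runs only on the sequences the decoder rejects, which is the "for free" error detection the comment describes.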

Re:The Absolute Minimum..." (Score:3, Interesting)

by david.given ( 6740 ) writes: <dg@cowlark.com> on Thursday July 09, 2009 @12:37AM (#28632401) Homepage Journal

I liked this: "There Ain't No Such Thing As Plain Text. If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly."
My Unicode mantra is:
"You can't do random access on strings. No, not even if you turn it into UCS-2. Or UCS-4. Yes, Java is lying to you."
This is because a Unicode printable thing can span multiple bytes and multiple code points. You can't find the nth character in a string, firstly because Unicode doesn't really have such a concept as a character, and secondly because you don't know where it is. This Java code:
char c = s.charAt(4);
...doesn't do what people think it does --- it returns the UTF-16 code unit at index 4 (the fifth one), which may actually contain only part of a code point, and that code point may actually only contain part of a glyph, and trying to string slice without first checking you're at the end of a glyph is going to cause people from countries that use combining characters to hate you, because your app will break.
So in essence, in order to manipulate strings, you need to step through them from one glyph to the next, each of which may occupy an arbitrary number of bytes. So you might as well use UTF-8.
A while back I wrote a word processor using this technique: WordGrinder [sf.net]. It worked surprisingly well; the whole thing is 6300 lines of code and the first version took a month to write. I'll admit that I chickened out with RTL and entry and display of combining characters, but the text storage core can cope with them just fine.
But it does require a rather different philosophy for managing text than in the good old ASCII days, which is a pain in the arse sometimes...
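
The distinction drawn above between bytes, UTF-16 code units, code points, and what the user sees can be illustrated with a short Python sketch. Python's `len()` on a `str` counts code points, whereas Java's `String.length()` counts UTF-16 code units. The grouping function `approx_clusters` is a hypothetical simplification that only attaches combining marks to the preceding base character; it is not the full Unicode segmentation algorithm.

```python
import unicodedata


def utf16_code_units(s: str) -> int:
    """Number of 16-bit units the string occupies in UTF-16."""
    return len(s.encode("utf-16-le")) // 2


def approx_clusters(s: str) -> list:
    """Group each base code point with the combining marks that follow it.

    Simplified assumption: a nonzero canonical combining class means
    "belongs to the previous cluster". Real segmentation has more rules.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# 'e' + COMBINING ACUTE ACCENT, then MUSICAL SYMBOL G CLEF (outside the BMP).
s = "e\u0301\U0001D11E"

if __name__ == "__main__":
    print("UTF-8 bytes:      ", len(s.encode("utf-8")))
    print("UTF-16 code units:", utf16_code_units(s))
    print("code points:      ", len(s))
    print("approx. clusters: ", len(approx_clusters(s)))
```

All four counts differ for this string, so an index that is safe in one unit can land in the middle of something in another. Stepping through `approx_clusters(s)` instead of indexing directly is one way to avoid cutting a base letter off from its accent.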
