Submitting a review for consideration is easy; please first read Slashdot's book review guidelines. Updated: 2008114 by samzenpus
| CJKV Information Processing 2nd ed. | |
| author | Ken Lunde |
| pages | 898 |
| publisher | O'Reilly Media, Inc. |
| rating | 10/10 |
| reviewer | JR Peck |
| ISBN | 978-0-596-51447-1 |
| summary | Chinese, Japanese, Korean and Vietnamese computing. |
QUE? (Score:2, Funny)
http://lmgtfy.com/?q=CJKV [lmgtfy.com]
Re: (Score:1)
CJKV is.... (Score:5, Informative)
The term CJKV means CJK plus Vietnamese, which in the past used Hán tự (Chinese characters) and Chữ Nôm prior to adopting Quốc Ngữ.
http://en.wikipedia.org/wiki/CJK_characters [wikipedia.org]
uh (Score:1, Insightful)
why modded troll?
"and it has really opened my eyes" (Score:2, Funny)
Interesting slant.
Re: (Score:2)
I was familiar with the first three, the kanji used to represent China (the character for middle), Japan (the character for the sun) and Korea (the character for... Korea), but I didn't realize the last one was the character for Vietnam. It normally means to wake or cause. FWIW, the old name for Vietnam in Japanese seems to be "etsunan", which I guess is pretty close phonetically.
Hanzi/Kanji for "Viet", and other trivia (Score:2)
Doh, nope! :) Actually, it's not the character for "wake" or "cause", i.e. okiru or okosu, but rather the character for "exceed" or "pass through", as in the Japanese words koeru or kosu.
The etsu part in Japanese is pronounced yuè in Mandarin Chinese (link [mandarintools.com]). The "u" is kinda pinched in pro
Re: (Score:2)
Re: (Score:2)
Well, FWIW, the koeru kanji looks not too far from the okiru kanji; they both have the same bushu or radical, the bit going down the left and extending across underneath, which happens to be one of the larger bushu too. :) And, for that matter, there are two kanji used for koeru / kosu, one with the on reading (i.e., the reading(s) generally used in compounds and that came originally from Chinese) of etsu, and the other read as chô, as in chô kawaii!
So no worries, hey, it's Japanese. Whee!
Che
Re: (Score:1)
Thanks!
And to think someone thought that CHKV was a programming language :)
one page min. per index / appendix / chapter (Score:4, Interesting)
is likely a limitation of the use of FrameMaker to compose the document and an unwillingness to set up new styles to put them together (unfortunately O'Reilly hasn't used TeX for a title since _Making TeX Work_), and was probably let stand since they needed a particular page count to come out to even signatures anyway.
William
Re: (Score:2)
Just after I posted I wondered if O'Reilly was still so wedded to FM and wished I could've taken the time to research the matter.
Thanks for correcting my wrong assumption and setting the record straight.
William
Re: (Score:2)
Sure, that was intended as a joke, but a lot of protocols in the computer world sure feel like they were invented with the assumption that everyone communicated in nothing but 7-bit US-ASCII.
After all, why else would we need Quoted-Printable and Base64 encoding, which let you put non-7bit data into 7-bit US-ASCII?
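To make that concrete, here is a small Java sketch (not from the thread) of what Base64 does with a non-ASCII character. The class name `SevenBitDemo` and its helper methods are made up for illustration; `java.util.Base64` is the standard JDK class (Java 8 and later).

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class SevenBitDemo {
    // Encode text as UTF-8 bytes, then as Base64, which uses only 7-bit ASCII characters.
    static String toBase64(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(utf8);
    }

    // Reverse the process: Base64 text back to bytes, bytes back to a string.
    static String fromBase64(String encoded) {
        byte[] utf8 = Base64.getDecoder().decode(encoded);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // e-acute is the two bytes C3 A9 in UTF-8, neither of which fits in 7 bits.
        String encoded = toBase64("\u00e9");
        System.out.println(encoded);                              // w6k=
        System.out.println(fromBase64(encoded).equals("\u00e9")); // true
    }
}
```

Note that the receiver still has to be told the bytes were UTF-8; Base64 only solves the 7-bit transport problem, not the character set problem.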
And then we have character sets. It's a total mess. It started (most likely) with US-ASCII, and eventually ended up at the all-encompassing Unicode. But along the way, we gained dozens of "legacy
Re: (Score:3, Interesting)
I'm not sure why this was modded offtopic.
s/English/ASCII/ and I got plenty of complaints along those lines in my mailbox over the years. Supporting Asian languages can be expensive in terms of processing time. Japanese companies *can* be insular, been there done that. I have no experience with the CKV part.
Fortunately the state of the art in computing hardware has improved over the years and it's not as expensive as it used to be.
Their English web presence leaves something to be desired, but I agree wit
Overlaps with "Unicode Explained"? (Score:4, Interesting)
When I was working on my JavaCC book [generating...javacc.com] I bought Jukka Korpela's Unicode Explained [amazon.com] and it was *extremely* helpful. After reading it I actually felt comfortable using various tools to convert from one encoding to another, discussing multibyte character sets, and so forth. It helped me write the Unicode chapter in my book with some confidence. It was the first time I had used vi to enter Unicode characters... fun times.
That said, it sounds like "CJKV Information Processing" covers some of the same ground. Has anyone read both?
Overlaps, sure; equivalent, I don't think so. (Score:3, Insightful)
It's going to have a big overlap, but the additional, crucially important material with CJKV processing is the non-Unicode encoding systems that have been used for those scripts, and the input methods that are used to enter the scripts into the computer. A general-purpose Unicode book will not go into a lot of depth about either of these topics.
"The Absolute Minimum..." (Score:5, Informative)
"The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" is also a very good -- but very much shorter -- introduction to Unicode.
http://www.joelonsoftware.com/articles/Unicode.html [joelonsoftware.com]
I frequently send this to people that I need to work with who don't "get" it.
Re: "The Absolute Minimum..." (Score:5, Interesting)
Nice article -- thanks for providing the link! I liked this: "There Ain't No Such Thing As Plain Text. If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly."
This is not a hard problem to solve in the case of email and web pages, which can have encoding given in headers. (If you validate your page using the w3c validator, it will warn you if you didn't supply an encoding.) It's also not an insanely hard problem for strings in memory; the encoding can be either set by your encoding convention or handled behind the scenes by your language (as in perl).
What really sucks is files. For instance, I wrote this [lightandmatter.com] extremely simple terminal-based personal calendar program in perl, and it's actually attracted a decent number of users. It's internationalized in 11 languages. Well, one day a user sends me an email complaining that the program is giving him mysterious error messages. He sends me his calendar file, which is a plain text file with some Swedish in it. I run the program on my machine with his calendar file, and it works fine. I can't reproduce the bug. We go through a few rounds of confused communication before I finally realize that he must have had the file encoded in Latin-1 on his end, whereas my program is documented as requiring utf-8. So now my program has to include the following cruft:
Yech. It requires reading the file twice, and it's not even 100% reliable.
This is the kind of situation where the Unix philosophy, based on plain text files and little programs that read and write them, really runs into a problem. With hindsight, it would have been really, really helpful if Unix filesystems could have included just a smidgen more metadata, enough to specify the character encoding.
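The poster's Perl snippet is not reproduced in this copy of the thread. As a rough illustration only (in Java rather than Perl, and not the poster's code), the usual heuristic is to attempt a strict UTF-8 decode and fall back to Latin-1 when it fails. The class name `EncodingGuess` and the method `decode` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class EncodingGuess {
    // Try a strict UTF-8 decode; if the bytes are not valid UTF-8, assume Latin-1.
    // This is a heuristic: some Latin-1 byte sequences also happen to be valid UTF-8.
    static String decode(byte[] raw) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE5};            // a-ring in Latin-1, invalid as UTF-8
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA5}; // a-ring in UTF-8
        System.out.println(decode(latin1).equals(decode(utf8))); // true
    }
}
```

If the whole file is already in memory as bytes, this avoids reading the file from disk twice, but the guess remains a guess, which matches the "not even 100% reliable" complaint above.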
Re: (Score:1)
Java I/O is better here. (Score:1)
AFAIK it's not possible to do it in a 100% reliable fashion, but there are technical solutions where the file doesn't need to be read twice. Java, despite all of its flaws, handles this sort of thing pretty well, so I'll use that as an example.
In Java, there is a distinction between byte-based and character-based I/O. InputStream [sun.com] and OutputStream [sun.com] are byte-based I/O classes; Reader [sun.com] and Writer [sun.com] are character-based. Then you have clas
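The rest of this comment is cut off here. A minimal sketch of the byte/character split it describes, assuming the standard `InputStreamReader` bridge class is what the poster goes on to mention; the class name `ReaderDemo` and the method `firstLine` are invented for the example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReaderDemo {
    // Wrap a byte stream in a Reader that decodes with an explicitly chosen charset.
    static String firstLine(InputStream bytes, Charset charset) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(bytes, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xC3, (byte) 0xA5, '\n'}; // a-ring encoded as UTF-8
        String asUtf8 = firstLine(new ByteArrayInputStream(data), StandardCharsets.UTF_8);
        String asLatin1 = firstLine(new ByteArrayInputStream(data), StandardCharsets.ISO_8859_1);
        System.out.println(asUtf8.length());   // 1
        System.out.println(asLatin1.length()); // 2
    }
}
```

The same bytes yield one character or two depending on the charset handed to the reader, which is why the charset is worth passing explicitly rather than relying on the platform default.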
Re: (Score:3, Interesting)
What you are encountering is a typical moron implementation of UTF-8.
For some reason otherwise intelligent programmers lose their minds when presented with UTF-8. They act as though the program will crash instantly if they ever make a pointer that points at the middle of a character, or if they fail to correctly count the "characters" in a string and dare to use an offset or number of bytes. I am not really certain what causes these diseases but being exposed to decades of character==byte ASCII programming s
Re: (Score:2)
Indeed. Which is why Bush hid the facts [wikipedia.org].
Re: (Score:3, Interesting)
I liked this: "There Ain't No Such Thing As Plain Text. If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly."
My Unicode mantra is:
"You can't do random access on strings. No, not even if you turn it into UCS-2. Or UCS-4. Yes, Java is lying to you."
This is because a Unicode printable thing can span multiple bytes and multiple code points. You can't find the nth character in a string, firstly because Unicode doesn't really have such a concept as a character, and secondly because you don't know where it is. This Java code:
char c = s.charAt(4);
...doesn't do what people think it does --- it returns the 4th UTF-
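The comment is truncated at this point. As a hedged illustration of the general issue (not the poster's own example): Java strings are indexed by UTF-16 code units, so a character outside the Basic Multilingual Plane occupies two `char` slots. The class name `CodeUnitDemo` is made up; the `String` and `Character` methods are standard JDK calls.

```java
class CodeUnitDemo {
    // U+1D11E (musical G clef) lies outside the BMP, so Java stores it as two chars.
    static final String CLEF = new String(Character.toChars(0x1D11E));

    public static void main(String[] args) {
        String s = "a" + CLEF + "b";
        System.out.println(s.length());                             // 4 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length()));        // 3 (code points)
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true: half of a pair
        // Find the char index of the third code point instead of assuming index 2.
        int idx = s.offsetByCodePoints(0, 2);
        System.out.println(s.charAt(idx));                          // b
    }
}
```

Even code point indexing is not the whole story, since one displayed character can consist of several code points.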
Re: (Score:1)
It's been interesting reading different people's replies to my post. One thing I've noticed is that each of us is talking about the language he's most familiar with. I was writing about a situation I encountered with perl. You're talking about java. Other people are talking about C.
Your comment applies to java but not to perl. In perl, you really can do random access on strings. All the int
Re: (Score:2)
Actually, I do it mostly in C --- I picked Java for that example because it has a really simple example of getting it wrong.
And when you say Perl supports random access of Unicode strings, are you sure it's not just giving you random access to an array of Unicode code points --- which is also wrong? Remember that a single Unicode glyph can be made up of an arbitrary number of code points.
Even in European languages, trying to split a string between the combining accent code point and the base character c
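This comment is also cut short. A small Java sketch of the combining-accent point, using the JDK's `java.text.BreakIterator` and `java.text.Normalizer`; the class name `CombiningDemo` and the helper `graphemes` are hypothetical, and the JDK's boundary rules are only an approximation of what a user perceives as one character.

```java
import java.text.BreakIterator;
import java.text.Normalizer;
import java.util.Locale;

class CombiningDemo {
    // Count user-perceived characters using the JDK's character-boundary rules.
    static int graphemes(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301"; // 'e' followed by COMBINING ACUTE ACCENT
        System.out.println(decomposed.length());   // 2 code points, 2 chars
        System.out.println(graphemes(decomposed)); // 1 displayed character
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(composed.length());     // 1 after NFC composition
    }
}
```

Splitting `decomposed` at index 1 would separate the accent from its base letter, which is the failure described above. Normalizing to NFC helps here, but not every combination has a precomposed form.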
Re: (Score:2)
Interesting point. Some documentation: man perlunicode [perl.org], man perluniintro [perl.org], Unicode::Normalize [perl.org]. I spent some time studying these, and concluded that I didn't understand enough to answer your question :-)
Re: (Score:1)
If you work with them it is easier, hopefully you can try to get them fired or at least coerced into doing it right.
With free software programmed by volunteers it is even worse. Many such volunteers are great coders, but they come from ASCII countries and as such don't "get" why tail should perform worse than it used to, or why they should care about character width instead of strlen, or why they should update an algorithm they borrowed from K&R 30 years ago.
Truth is, with UTF-8 while you lose the c
Re: (Score:2)
Re: (Score:2)
> Gee, my methods were different than most: I married a Ukrainian woman.
Hehe, yeah, actually, my wife is Romanian, so all my JavaCC Unicode examples involve s-with-cedilla and stuff like that :-) Bună ziua!
Great Book. Could use an Arabic supplement. (Score:3, Informative)
Re: (Score:2)
Spungo's Law says... (Score:1)
...nearly every week there will be a new O'Reilly book on something you've never heard of.
Re: (Score:1)
I love that law, very funny!!
How's this different from ASDF processing? (Score:1)
Re: (Score:2)
It's well known that CJKV is more like QPZA than ASDF, although TYRX process is probably better documented than either.
Recent developments in RWRI technology have seen a lot of uptake by the IRWR community, leading some to believe that ASDF is on its way out entirely.
That's a completely clear and informative SUMMARY of the issue, right?
Fonts and encoding (Score:4, Interesting)
I own the first edition of CJKV but I find Fonts and encodings [google.com] to be far more useful. Obviously if you are working heavily in any of these languages the 2nd best book is worth having, but I'd say that F&E feels like a systematic treatment while CJKV feels like 1000 pages of web articles on the topic.
Re: (Score:1)
And just when you think you mastered it all.... (Score:1)
will cause PathDoesNotExistException.
You need to go through the whole code base and remove any case-changes that happen with the letter "i" or letter "I".
Because Turkish is the ONLY alphabet where the uppercase version of 7bit "i" has 8 bits! Undotted i [wikipedia.org]
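The code that triggered the exception is missing from this copy of the comment. A short Java sketch of the underlying behavior, assuming the case change was done with the default (Turkish) locale; the class name `TurkishCaseDemo` and the helper `lowerForIdentifier` are invented here, while `toLowerCase(Locale)` and `Locale.ROOT` are standard JDK API.

```java
import java.util.Locale;

class TurkishCaseDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Case-fold machine-readable text (paths, keys) independently of the user's locale.
    static String lowerForIdentifier(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Under Turkish rules, 'I' lowercases to dotless i (U+0131), not ASCII 'i'.
        System.out.println("FILE".toLowerCase(TURKISH).equals("file")); // false
        System.out.println(lowerForIdentifier("FILE").equals("file"));  // true
        // And ASCII 'i' uppercases to dotted capital I (U+0130).
        System.out.println("i".toUpperCase(TURKISH).equals("\u0130"));  // true
    }
}
```

Passing an explicit locale such as `Locale.ROOT` for identifiers is one alternative to stripping out every case change; whether it fits depends on whether the string is meant for the machine or for the user.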
Re: (Score:3, Insightful)
Changing the case of a path SHOULD cause it to refer to a different path.
Here is 5 cents, go buy yourself a better computer.
Re: (Score:1)