Processing Chinese Text with PHP


Localization

Localization usually consists of translating the externalized text into various languages. It might seem that, since Mandarin is the standard language both on the Mainland and in Taiwan, no separate translation should be necessary. However, since the political systems separated, the two variants have grown apart, especially in the new vocabulary needed for technological developments. Therefore, separate translations are usually required for the Mainland and Taiwan language variants. Confusingly, these are often called simplified and traditional translations, terms that should only distinguish the way the text is written.

Locales

A locale is an object describing a combination of language, country, and, optionally, encoding. Locale strings usually contain two parts, and the codes differ between Windows and UNIX / Linux. The first part describes the language. On UNIX / Linux it is 'zh' for Mandarin Chinese (zhongwen). The second part describes the location variant. Common locations are CN (People's Republic of China, PRC), HK (Hong Kong), SG (Singapore), and TW (Taiwan). For Chinese, some of the more common locale strings are:

Locale                     Locale String (Windows)                                         Locale String (UNIX)
Chinese (Hong Kong)        Chinese_hong kong ("hkg", "hong kong", or "hong-kong")          zh_HK
Chinese (PRC - Mainland)   Chinese_china ("china", "chn", "pr china", or "pr-china")       zh_CN
Chinese (Singapore)        Chinese_singapore ("sgp" or "singapore")                        zh_SG
Chinese (Taiwan)           Chinese_taiwan ("twn" or "taiwan")                              zh_TW

Since the Han Chinese dialects share the same written form, there is no need to go into the different dialects, such as Taiwanese, Cantonese, Shanghainese, etc. When we talk about Taiwanese Chinese in this context we do not mean the Taiwanese dialect but the Taiwanese variant of Mandarin. The wrinkle here, however, is that the pronunciation representations are totally different for each dialect. There are even a number of different representations for Mandarin Chinese, although, outside Taiwan, the world seems to be standardizing on Hanyu Pinyin. In addition, non-Han Chinese languages, such as Tibetan, Mongolian, Yi, and so on, are a totally separate issue that I will not cover.

An optional final part of the locale, separated by a period, specifies the character set. For example, zh_TW.big5 denotes the Taiwanese variant of Mandarin Chinese using the Big 5 character set.
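To make the three parts concrete, here is a minimal sketch that splits a UNIX-style locale string into language, territory, and character set. The function name parseLocale is hypothetical (it is not part of PHP), and the pattern only covers the simple language[_TERRITORY][.charset] form, not Windows names or "@modifier" suffixes.

```php
<?php
// Hypothetical helper (not part of PHP): splits a UNIX-style locale string
// such as "zh_TW.big5" into its language, territory, and character set.
function parseLocale(string $locale): ?array
{
    // Language: 2-3 lowercase letters. Territory and charset are optional.
    $pattern = '/^([a-z]{2,3})(?:_([A-Z]{2}))?(?:\.([A-Za-z0-9-]+))?$/';
    if (!preg_match($pattern, $locale, $m)) {
        return null; // not in the language[_TERRITORY][.charset] form
    }
    return [
        'language'  => $m[1],
        // Unmatched optional groups are either missing or empty strings.
        'territory' => ($m[2] ?? '') !== '' ? $m[2] : null,
        'charset'   => ($m[3] ?? '') !== '' ? $m[3] : null,
    ];
}

print_r(parseLocale('zh_TW.big5'));
```

A Windows-style name such as Chinese_taiwan does not match this pattern and yields null, so a real application would need a separate lookup for those names.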

The most common use for Chinese variants is to decide whether to display traditional or simplified text. By convention, the Mainland and Singapore use simplified and the rest use traditional.
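The convention can be sketched as a small lookup on the territory code. The function name scriptForLocale is hypothetical, and returning null when no territory is present is an assumption; an application may prefer a default instead.

```php
<?php
// Hypothetical helper: guesses the script from the territory code, following
// the convention that the Mainland (CN) and Singapore (SG) use simplified text.
function scriptForLocale(string $locale): ?string
{
    // Pull out the two-letter territory that follows "zh_", if any.
    if (preg_match('/^zh_([A-Z]{2})/', $locale, $m)) {
        return in_array($m[1], ['CN', 'SG'], true) ? 'simplified' : 'traditional';
    }
    // Assumption: without a territory we make no guess.
    return null;
}

echo scriptForLocale('zh_TW.big5'), "\n"; // traditional
echo scriptForLocale('zh_SG'), "\n";      // simplified
```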

On Windows you can find the locale strings at the Microsoft Developer Network web site [MSDN1]; they are determined by the Win32 NLS API. On UNIX or Linux you can list the available locale strings by typing the command /usr/bin/locale -a.

To set the locale use the setlocale function, which takes a category constant and a locale string, for example setlocale(LC_ALL, 'zh_CN'). The category constants are:

Constant       Description
LC_ALL         everything
LC_COLLATE     string comparison
LC_CTYPE       character classification and conversion
LC_MONETARY    money
LC_NUMERIC     decimal separator
LC_TIME        date and time formatting
LC_MESSAGES    system responses
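Because locale names differ between systems, it can help to pass setlocale several candidates and check the result. The sketch below relies on setlocale accepting an array and returning the first locale it managed to set, or false if none worked. The candidate names with a .UTF-8 or .utf8 suffix are assumptions about what is installed on a given machine.

```php
<?php
// Try UNIX-style names first, then a Windows-style name. setlocale() returns
// the name of the first locale it could set, or false if none is available.
$candidates = ['zh_CN.UTF-8', 'zh_CN.utf8', 'zh_CN', 'Chinese_china'];
$result = setlocale(LC_ALL, $candidates);

if ($result === false) {
    // Nothing matched, so the previous locale is still in effect.
    echo "No Chinese (PRC) locale is installed; the locale is unchanged.\n";
} else {
    echo "Locale set to: $result\n";
}
```

Checking for false matters because a failed call leaves the previous locale in place without raising an error.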

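To find out the current locale without changing it, pass "0" as the locale string to setlocale. For example,

<?php
// "0" leaves the locale untouched and only returns the current setting
print setlocale(LC_ALL, '0');
?>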

Names of People in Chinese

Translation of people's names from Chinese was inconsistent until recently. Although this is arguably a topic for professional translation, I would like to mention it. Chinese people list their surname first and commonly have two further characters that complete the full name. When 'translating' we usually write the Hanyu Pinyin form without the diacritics. However, many locales use dialect variants, most commonly Cantonese, and Taiwan does not use Hanyu Pinyin but an older phonetic system. Most professional translation services will use Hanyu Pinyin, but this may conflict with the 'English' names that people have previously given themselves. Inconsistencies can also arise because of spacing within the Pinyin. For example, should 胡锦涛 be Hú Jǐn tāo or Hú Jǐntāo?
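Dropping the diacritics from Hanyu Pinyin can be sketched with a plain character map. The function name stripToneMarks is hypothetical, and the sketch assumes the source file and the input are UTF-8 with precomposed lowercase vowels; uppercase toned vowels and combining marks are not handled. The ü is kept, since how to write it without the umlaut is a separate convention.

```php
<?php
// Hypothetical helper: removes the four tone marks from lowercase pinyin vowels.
// Assumes UTF-8 input with precomposed characters (no combining marks).
function stripToneMarks(string $pinyin): string
{
    static $map = [
        'ā' => 'a', 'á' => 'a', 'ǎ' => 'a', 'à' => 'a',
        'ē' => 'e', 'é' => 'e', 'ě' => 'e', 'è' => 'e',
        'ī' => 'i', 'í' => 'i', 'ǐ' => 'i', 'ì' => 'i',
        'ō' => 'o', 'ó' => 'o', 'ǒ' => 'o', 'ò' => 'o',
        'ū' => 'u', 'ú' => 'u', 'ǔ' => 'u', 'ù' => 'u',
        // Only the tone is removed here; the umlaut itself is kept.
        'ǖ' => 'ü', 'ǘ' => 'ü', 'ǚ' => 'ü', 'ǜ' => 'ü',
    ];
    // strtr() with an array replaces whole byte sequences, so it is safe
    // to use on valid UTF-8 text.
    return strtr($pinyin, $map);
}

echo stripToneMarks('Hú Jǐntāo'), "\n"; // Hu Jintao
```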

The next question, then, is whether we should also move the surname to the end when we give the 'English' form. The common answer is yes, but be careful of exceptions. The third question is what to do about the second and third characters. This has been a very confusing issue, leaving Chinese people with different combinations of 'first' and 'middle' names. The recent convention is to join the pinyin for the second and third characters without a space, to avoid any confusion over a middle name. For example, 毛泽东, the name of the venerable Chairman Mao, would be written under this system as Zedong Mao. Of course, we are more used to seeing Mao Zedong or Mao Ze Dong, from which the possible confusion over a middle name or initial should be obvious.
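The convention just described (given-name syllables joined without a space, surname moved to the end) can be sketched as follows. The function name westernOrderName is hypothetical, and the input is assumed to be plain ASCII pinyin with the tone marks already removed.

```php
<?php
// Hypothetical helper: builds the 'English' form of a Chinese name from
// toneless ASCII pinyin. The surname goes last and the given-name syllables
// are joined without a space.
function westernOrderName(string $surname, array $givenSyllables): string
{
    // Join the syllables and capitalize only the first letter: Ze + Dong -> Zedong
    $given = ucfirst(strtolower(implode('', $givenSyllables)));
    return $given . ' ' . ucfirst(strtolower($surname));
}

echo westernOrderName('Mao', ['Ze', 'Dong']), "\n"; // Zedong Mao
```

As the text warns, there are exceptions, so a real application should let a stored preferred 'English' name override the generated form.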



About 关于本网站 © chinesenotes.com 2007-2010. Please send comments to alex@chinesenotes.com.