This page describes the design of the dictionary. Readers interested in lexicography, contributing to the dictionary, or dictionary users wanting a deeper knowledge of the dictionary may find this page useful. See the Style Guide for details on writing dictionary entries.
The Chinese Notes dictionary is a Chinese-English dictionary. Its most notable features are that it is machine readable, so that it can be incorporated into a text reader, and that it contains a large number of literary Chinese and Buddhist terms for reading historic texts.
The Chinese Notes project is currently in its third life:
- The initial project was a basic dictionary written in PHP, beginning in about 2006. Newspaper articles were collected informally. They were not organized formally as a corpus but just as a collection of web pages. PHP software was written to serve the web pages.
- The NTI Reader project was split off to ntireader.org to separate out Buddhist content. In the process the project was open-sourced, the Buddhist corpus was expanded, Python software was developed to process the corpus, and methodology was added for word sense disambiguation based on word sense frequency in the tagged part of the corpus.
- The Chinese Notes project is being open sourced, the site overhauled, the corpus is being structured and expanded, Go software for processing the corpus is being added, and the writing of the dictionary entries is being formalized. This process is just getting started.
Type and Purpose of the Dictionary
The dictionary is a Chinese to English dictionary, including both literary Chinese and modern Chinese. The dictionary describes how words are used in the corpus of texts listed in the Texts menu, the companion NTI Reader site, and other texts. The main purposes of the dictionary are:
- To support the Chinese Notes text reader: In order to do this the dictionary necessarily contains modern Chinese, literary Chinese, and special vocabularies. The primary purpose is to decode Chinese text but some help is provided to encode text. The user will do this by navigating from one of the words in the dictionary entry notes or the corpus through a mouse-over, clicking on a Chinese word to bring up the popover, or searching on a text string in the lookup tool on the main page.
- Look up individual Chinese words: The user will enter a single word in the lookup text field on the main page. Again, the primary purpose is decoding. That is, a user would look up a Chinese word to find the meaning explained in English. The 'English' field of the dictionary entries provides the equivalent English terms to assist with decoding. Encoding could be for either Chinese or English text. For Chinese text, it means looking up a Chinese word in order to learn how to use it. The dictionary entry notes provide help with this for a limited number of words. For English text, it means getting assistance in optimally translating the term into English. The examples and explanations in the dictionary entry notes are also intended for this.
- As a translation database to find terms similar to a phrase that a user is trying to translate. For this purpose many phrases and named entities are stored.
- Do miscellaneous searches: The user may not know the exact Chinese word but may know something about it, such as a part of the word, one of the matching English words, or the pinyin text, or may navigate to the word through a synonym or other kind of related word. An important feature to support this is word-to-meaning-to-word. For example, a user can look up a word, find synonyms of the word, and follow links to the details of the synonyms. Also, a user may search on an English term and find definitions of Chinese words that include that English term. However, that does not make the dictionary bidirectional because all the headwords are Chinese.
- Natural language text processing: A machine readable dictionary for doing automated text analysis for a number of purposes, including ngram analysis and discovery of new words in individual texts. Importantly, it is freely reusable by other people with the raw text files available on GitHub.
- A basis in literary Chinese for the NTI Buddhist Text Reader: Most Buddhist dictionaries only give the specialist terms. However, most of the words in the texts are non-specialist literary Chinese words, which take considerable training to master.
- Word etymology - The user may want to know something about the word etymology, such as whether it was used in early texts.
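As a rough illustration of the natural language text processing purpose listed above, the following sketch counts character n-grams in a text and reports frequent ones that are not yet known words. It is not the project's own tooling. The function names, the `known` word set, and the sample text are assumptions made for the example, and the character range covers only the basic CJK Unified Ideographs block.

```python
import re
from collections import Counter

# Runs of characters in the basic CJK Unified Ideographs block.
# Punctuation and other characters break a run, so n-grams never span them.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def ngram_counts(text, n=2):
    """Count every sequence of n consecutive Chinese characters in text."""
    counts = Counter()
    for run in CJK_RUN.findall(text):
        for i in range(len(run) - n + 1):
            counts[run[i:i + n]] += 1
    return counts


def candidate_new_words(text, known_words, n=2, min_count=2):
    """Return (ngram, count) pairs seen at least min_count times
    that are not already in the known word set."""
    counts = ngram_counts(text, n)
    return [
        (gram, count)
        for gram, count in counts.most_common()
        if count >= min_count and gram not in known_words
    ]


# Hypothetical inputs: a tiny text and a tiny set of known words.
sample = "大规模生产。大规模建设。"
known = {"大", "生产", "建设"}
candidates = candidate_new_words(sample, known)
print(candidates)
```

In this sample both 大规 and 规模 occur twice and are reported as candidates, although only 规模 is a real word. Frequency alone cannot tell them apart, so the output of a sketch like this is a list for a lexicographer to review rather than a list of words.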
The dictionary includes a general Chinese-English dictionary, a named-entity (people, places, publication titles, etc.) database of Chinese names with their English equivalents, and a phrase memory databank. The number of named entities could potentially grow to millions. For this reason they are managed separately from the general dictionary. The general dictionary is stored in the file words.txt. The modern named entities are stored in the file modern_named_entities.txt. There are other files for Buddhist named entities and translated phrases in the same directory.
User Profiles and Assumed Skills
Anybody can use the dictionary without restriction but there are certain features designed for particular users and certain skills required to make full use of it.
Non-native speakers in the process of learning Chinese will be able to make best use of the dictionary. Features of the dictionary, such as the lookup feature that breaks a text string into words, will help these users. That feature identifies the individual words and saves the time of repeated lookups. The notes explaining points on word function will also help these users.
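One common way to break a text string into words with a dictionary is forward greedy longest match. The sketch below shows that technique only as an illustration. The source does not say which algorithm the lookup feature uses, and the function name, the `max_len` parameter, and the small headword set are assumptions.

```python
def segment(text, dictionary, max_len=4):
    """Split text into tokens using forward greedy longest match.

    At each position the longest dictionary entry that matches is taken.
    A single character is emitted as its own token when nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if size == 1 or chunk in dictionary:
                tokens.append(chunk)
                i += size
                break
    return tokens


# Hypothetical headword set built from examples used on this page.
headwords = {"男扮女装", "表示", "个", "大", "规模"}
print(segment("表示大规模", headwords))
```

Greedy matching is simple and fast but can choose the wrong split when words overlap, which is one reason a reader may still need to check the individual entries.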
Users studying literary Chinese will benefit from the inclusion of many literary Chinese word senses. The references to literary Chinese sources will help these users.
Native speakers wanting to translate Chinese into English may benefit from the dictionary although it does not give very much guidance on optimal English word choice.
The basic skills expected of users are:
- Understanding of Hanyu pinyin
- Installation of fonts on their computer to display Chinese characters
- A basic understanding of Chinese grammar, which will be helpful for understanding the parts of speech listed. In particular, there are a number of differences from English grammar, such as Chinese measure words and particles.
Corpora and software for processing corpora transformed lexicography beginning in the 1980s (Atkins and Rundell, 2008, p. 53). This project has evolved from informal collection of words to more formal collection using corpora. The Chinese Notes project consists of four parts:
- The dictionary data
- Several corpora of texts (under the Texts menu)
- Software for processing the corpora texts and generating HTML pages (a Go command line tool available in GitHub)
- Software for serving interactive web pages (PHP software in GitHub)
Previously, dictionary compilers created citations for words to demonstrate their use and ensure that the dictionary entries were reliable. (Atkins and Rundell, 2008, pp. 53-54) This has been replaced by corpora and software for processing the corpora to identify the most common words and how they are used. The corpus in this project is arbitrarily selected, as opposed to a carefully selected and balanced corpus like the British National Corpus (BNC). (Atkins and Rundell, 2008, p. 53) The main criteria for selection of the text entries in Chinese Notes are feasibility, copyright, and personal interest. One difference between Chinese Notes and other dictionaries is the open inclusion of the corpus and the linking of dictionary entries with corpus data. The hope here is that the data will be useful for users wanting to dig deeper into word meaning and use, and that it will not be too confusing for other users who do not want to do that.
The headwords in the dictionary consist of Chinese words written in simplified characters. If the simplified and traditional characters differ then the word written in traditional characters is provided as well. This enables the headwords to be written in either simplified or traditional form. The headwords include single character words, like 个 gè 'unit,' multi-character words, like 表示 biǎoshì 'to express,' and multiword expressions, like 男扮女装 nán bàn nǚ zhuāng 'a man dressed as a woman.' Multiword expressions may be proper nouns, idioms, or other kinds of phrases. A multiword expression may be included as a headword if the whole is more than the sum of its parts or if it is a set phrase that must be written in a certain way. Set phrases are often literary Chinese embedded in modern Chinese, such as idioms from the Analects of Confucius 《論語》. The headwords are stored in the file headwords.txt.
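To make the simplified and traditional handling concrete, here is a minimal sketch of an index that lets a lookup start from either written form. The tuple layout, the id values, and the function name are assumptions for illustration. The actual layout of headwords.txt is not described here.

```python
def build_headword_index(headwords):
    """Map every written form (simplified or traditional) to headword ids.

    Each input row is (headword_id, simplified, traditional), where
    traditional is None when it does not differ from the simplified form.
    A list of ids is kept because one written form could, in principle,
    belong to more than one headword.
    """
    index = {}
    for headword_id, simplified, traditional in headwords:
        index.setdefault(simplified, []).append(headword_id)
        if traditional and traditional != simplified:
            index.setdefault(traditional, []).append(headword_id)
    return index


# Hypothetical rows; the ids are made up for the example.
rows = [
    (1, "个", "個"),
    (2, "表示", None),
    (3, "男扮女装", "男扮女裝"),
]
index = build_headword_index(rows)
print(index["個"] == index["个"])
```

With an index like this, a text in traditional characters and a text in simplified characters resolve to the same headword entries.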
Each entry in the lexical_units.txt file is a lexical unit. The fields that may vary for a given headword include the word written in traditional text, Hanyu pinyin, part of speech, and English equivalent. Other fields that provide additional information about a lexical unit include concept (in English and Chinese), topic (in English and Chinese), subtopic (in English and Chinese), image (a file name), mp3 (the name of a sound file), and notes. The notes field may contain explanatory notes, references, and examples. The file field in the lexical_units.txt file is the headword id.
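The relationship between headwords and lexical units can be pictured with a small record type. This is a sketch under assumptions: the class name, the attribute names `headword_id`, `english`, and `grammar`, and the id values are hypothetical, only a subset of the fields listed above is modeled, and the real column order of lexical_units.txt is not shown here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class LexicalUnit:
    """One sense of a headword (a subset of the fields described above)."""
    headword_id: int            # links the unit back to its headword
    simplified: str
    traditional: Optional[str]  # None when identical to the simplified form
    pinyin: str
    english: str
    grammar: str                # part of speech
    topic_en: str = ""
    topic_cn: str = ""
    notes: str = ""


def group_by_headword(units):
    """Collect the lexical units that share a headword id."""
    grouped = defaultdict(list)
    for unit in units:
        grouped[unit.headword_id].append(unit)
    return dict(grouped)


# Hypothetical data: two senses of 天 and one sense of 大.
units = [
    LexicalUnit(1, "天", None, "tiān", "sky", "noun"),
    LexicalUnit(1, "天", None, "tiān", "heaven", "noun"),
    LexicalUnit(2, "大", None, "dà", "big", "adjective"),
]
grouped = group_by_headword(units)
print([unit.english for unit in grouped[1]])
```

Grouping by headword id keeps each sense as its own record, so one headword can carry several lexical units with different English equivalents or parts of speech.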
This field is the part of speech of the Chinese word. It should match one of the entries in the grammar.txt file in GitHub. These parts of speech are intended for both modern Chinese and literary Chinese. However, it should be kept in mind that parts of speech are arguable for modern Chinese and especially flexible in literary Chinese.
The English section of the lexical unit should include direct translations or near synonyms in English.
Domain labels indicate specialized fields that words are used in, such as linguistics or medicine. (Atkins and Rundell, 2008, p. 227) The topic_cn and topic_en fields are in Chinese and English respectively. The parent_cn and parent_en are subdomains. For example, History | China, i.e. Chinese History as a subdomain of the domain History.
The notes section provides additional information about the word. For modern Chinese words, notes explaining the use of the word should be given and examples included. Examples should illustrate the word sense, with an English gloss for each Chinese example. The English gloss is given in single quotes. The source of the example is given in English or pinyin and in Chinese. For example, 㣇：脩豪獸。 'Yì: a grand beast with long hair.' (Shuo Wen Jie Zi, Scroll 10 《說文解字‧卷十》) Synonyms in the source language should be provided to illustrate the word sense.
Collocators of headwords should be included to clarify word sense and indicate how to use the headword. Collocators of adjectives are usually the nouns they modify. (Atkins and Rundell, 2008, p. 217) For example, in the noun phrase 大规模 'large scale,' 规模 guīmó is a collocator for the headword 大 dà 'big' (adjective). Since space is not limited in electronic dictionaries, the full form is given instead of the traditional abbreviated form ~规模 '~ scale.' This also makes the gloss more easily readable. Collocators of nouns are the words that modify them. For example, 蓝天 'blue sky,' which also helps differentiate the sense of the headword, in this case sky versus heaven, which is another sense of 天 tiān.
- Atkins, B.T.S., Rundell, M., 2008. The Oxford Guide to Practical Lexicography. Oxford University Press, Oxford.