Chinese Notes

Dictionary Design

This page describes the design of the dictionary. Readers interested in lexicography, contributing to the dictionary, or dictionary users wanting a deeper knowledge of the dictionary may find this page useful. See the Style Guide for details on writing dictionary entries.

The Chinese Notes project is currently in its third life:

  1. The initial project was a basic dictionary written in PHP, beginning in about 2006. A set of newspaper articles was gathered informally. It was not organized as a formal corpus but simply kept as a collection of web pages. PHP software was written to serve the web pages.
  2. The NTI Reader project was split off to ntireader.org to separate out Buddhist content. In the process the project was open-sourced, the Buddhist corpus was expanded, Python software was developed to process the corpus, and a methodology was added for word sense disambiguation based on word sense frequency in the tagged part of the corpus.
  3. The Chinese Notes project is being open-sourced, the site overhauled, the corpus structured and expanded, Go software for processing the corpus added, and the writing of dictionary entries formalized. This process is just getting started.

Type and Purpose of the Dictionary

The dictionary is a Chinese-to-English dictionary covering both literary Chinese and modern Chinese. The modern Chinese part could be considered a learners' dictionary with a focus on literary (aka classical) Chinese. The dictionary describes how words are used in the corpus of texts listed in the Texts menu, the companion NTI Reader site, and other texts. The main purposes of the dictionary are

User Profiles and Assumed Skills

Anybody can use the dictionary without restriction but there are certain features designed for particular users and certain skills required to make full use of it.

Non-native speakers who are learning Chinese will be able to make the best use of the dictionary. Features such as the lookup tool that breaks a text string into words will help these users: it identifies the individual words and saves the time of looking each one up separately. The notes explaining points on word function will also help these users.
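As an illustration of how a lookup feature can break a text string into words, here is a minimal sketch of greedy longest-match segmentation against a set of headwords. This is not necessarily the method the site uses. The function name `segment` and the tiny headword set are assumptions made for the example.

```python
def segment(text, headwords):
    """Split text into tokens, preferring the longest headword at each position.

    Hypothetical helper for illustration only; characters that match no
    headword are emitted as single-character tokens.
    """
    # Longest headword length bounds how far ahead we need to look.
    max_len = max(1, max((len(w) for w in headwords), default=1))
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a headword matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in headwords:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Illustrative headword set built from words mentioned on this page.
sample_headwords = {"表示", "大规模", "规模", "蓝天", "大", "天"}
sample_tokens = segment("大规模表示蓝天", sample_headwords)
```

Each token can then be looked up separately. Greedy matching is simple, but it may pick the wrong boundary when a shorter word is intended, which is one reason corpus frequency data is useful.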

Users studying literary Chinese will benefit from the inclusion of many literary Chinese word senses. The references to literary Chinese sources will help these users.

Native speakers wanting to translate Chinese into English may benefit from the dictionary, although it does not give much guidance on optimal English word choice.

The basic skills expected of users are:

Corpora

Corpora and software for processing corpora transformed lexicography beginning in the 1980s. (Atkins and Rundell, 2008, p. 53) This project has evolved from informal collection of words to more formal collection using corpora. The Chinese Notes project actually consists of four parts:

Previously, dictionary compilers created citations for words to demonstrate their use and ensure that the dictionary entries were reliable. (Atkins and Rundell, 2008, pp. 53-54) This practice has been replaced by corpora and software that processes the corpora to identify the most common words and how they are used. The corpus in this project is arbitrarily selected, as opposed to a carefully selected and balanced corpus like the British National Corpus (BNC). (Atkins and Rundell, 2008, p. 53) The main criteria for selection of the text entries in Chinese Notes are feasibility, copyright, and personal interest. One difference between Chinese Notes and other dictionaries is the open inclusion of the corpus and the linking of dictionary entries with corpus data. The hope is that the data will be useful for users wanting to dig deeper into word meaning and use, and that it will not be too confusing for users who do not want to do that.
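To make the idea of identifying the most common words concrete, the following sketch counts word frequencies over a corpus that has already been split into words. It is an illustration only, not the project's own software. The function name `word_frequencies` and the three tiny sample documents are assumptions.

```python
from collections import Counter


def word_frequencies(documents):
    """Count how often each token occurs across already-segmented documents."""
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    return counts


# Hypothetical corpus: each document is a list of word tokens.
sample_corpus = [["大", "规模"], ["蓝天"], ["大", "蓝天", "大"]]
sample_freq = word_frequencies(sample_corpus)
top_two = sample_freq.most_common(2)  # the two most frequent words with counts
```

Linking such counts to dictionary entries is one way corpus data can be exposed alongside an entry.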

Dictionary Entries

The headwords in the dictionary consist of Chinese words written in simplified characters. If the simplified and traditional characters differ, then the word written in traditional characters is provided as well. This enables the headwords to be written in either simplified or traditional form. The headwords include single character words, like gè 'unit,' multi-character words, like 表示 biǎoshì 'to express,' and multiword expressions, like 男扮女装 nán bàn nǚ zhuāng 'a man dressed as a woman.' Multiword expressions may be proper nouns, idioms, or other kinds of phrases. A multiword expression may be included as a headword if its meaning is more than the sum of its parts or if it is a set phrase that must be written in a certain way. Set phrases are often literary Chinese embedded in modern Chinese, such as idioms from the Analects of Confucius 《論語》. The headwords are stored in the file headwords.txt.

Each entry in the lexical_units.txt file is a lexical unit. The fields that may vary for a given headword include the word written in traditional characters, Hanyu pinyin, part of speech, and English equivalent. Other fields that provide additional information about a lexical unit include concept (in English and Chinese), topic (in English and Chinese), subtopic (in English and Chinese), image (a file name), mp3 (the name of a sound file), and notes. The notes field may contain explanatory notes, references, and examples. The file field in the lexical_units.txt file is the headword id.
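The fields described above can be pictured as a simple record. The sketch below is hedged: the page does not state the file's delimiter or field order, so the tab separator, the field order, the names `simplified`, `english`, `grammar`, `concept_cn`, `concept_en`, and `headword_id`, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class LexicalUnit:
    # Field order is an assumption; the page lists the fields but not their order.
    simplified: str
    traditional: str
    pinyin: str
    english: str
    grammar: str
    concept_cn: str
    concept_en: str
    topic_cn: str
    topic_en: str
    parent_cn: str
    parent_en: str
    image: str
    mp3: str
    notes: str
    headword_id: str


def parse_lexical_unit(line):
    """Parse one line, assuming tab-separated values in the order above."""
    values = line.rstrip("\n").split("\t")
    expected = len(fields(LexicalUnit))
    if len(values) != expected:
        raise ValueError(f"expected {expected} fields, got {len(values)}")
    return LexicalUnit(*values)


# Hypothetical line: the grammar value "verb" and headword id "1" are made up.
sample_line = "\t".join(
    ["表示", "表示", "biǎoshì", "to express", "verb"] + [""] * 9 + ["1"]
) + "\n"
sample_unit = parse_lexical_unit(sample_line)
```

Rejecting lines with the wrong number of fields catches a common data entry slip, such as a stray tab inside the notes field.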

Grammar

This field is the part of speech of the Chinese word. It should match one of the entries in the grammar.txt file in GitHub. These parts of speech are intended for both modern Chinese and literary Chinese. However, it should be kept in mind that parts of speech are arguable for modern Chinese and especially flexible in literary Chinese.
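A check that each part of speech matches an allowed entry might look like the following sketch. It assumes that grammar.txt lists one part of speech per line, which the page does not state. The function names and the sample values are hypothetical.

```python
def load_parts_of_speech(lines):
    """Collect allowed parts of speech, assuming one entry per line."""
    return {line.strip() for line in lines if line.strip()}


def invalid_grammar_values(units, allowed):
    """Return the (word, part_of_speech) pairs whose part of speech is not allowed."""
    return [(word, pos) for word, pos in units if pos not in allowed]


# Stand-in for the contents of grammar.txt; the real entries may differ.
allowed_pos = load_parts_of_speech(["noun\n", "verb\n", "adjective\n", "\n"])
problems = invalid_grammar_values(
    [("表示", "verb"), ("大", "adjectiv")], allowed_pos
)
```

In practice the lines would come from the file itself, for example by passing an open file object to `load_parts_of_speech`.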

English

The English section of the lexical unit should include direct translations or near synonyms in English.

Domain Labels

Domain labels indicate specialized fields that words are used in, such as linguistics or medicine. (Atkins and Rundell, 2008, p. 227) The topic_cn and topic_en fields are in Chinese and English respectively. The parent_cn and parent_en fields are subdomains. For example, History | China denotes Chinese History as a subdomain of the domain History.

Notes

The notes section provides additional information about the word. For modern Chinese words, notes explaining the use of the word should be given and examples included. Examples should illustrate the word sense with an English gloss of the Chinese example. The English gloss is given in single quotes. The source of the example is given in English or pinyin and Chinese. For example, 。 'Yì: a grand beast with long hair.' (Shuo Wen Jie Zi, Scroll 10 《說文解字》) Synonyms in the source language should be provided to illustrate the word sense.

Collocators of headwords should be included to clarify word sense and indicate how to use the headword. Collocators of adjectives are usually the nouns they modify. (Atkins and Rundell, 2008, p. 217) For example, in the noun phrase 大规模 'large scale,' 规模 guīmó is a collocator for the headword dà 'big' (adjective). Since space is not limited in electronic dictionaries, the full form is given instead of the traditional abbreviated form ~规模 '~ scale.' This also makes the gloss more easily readable. Collocators of nouns are the words that modify them. For example, 蓝天 'blue sky' also helps differentiate the sense of the headword, in this case 'sky' versus 'heaven,' which is another sense of tiān.

Reference

  1. Atkins, B.T.S., Rundell, M., 2008. The Oxford Guide to Practical Lexicography. Oxford University Press, Oxford.