This file is provided as-is by Unicode, Inc. (The Unicode Consortium). No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. If this file has been provided on magnetic media by Unicode, Inc., the sole remedy for any claim will be exchange of defective media within 90 days of receipt.

Unicode Technical Report #1
Draft Proposals
ASCII plain text version without charts

Copyright 1992 Unicode Inc. All Rights Reserved

Until the end of the review period in August 1993: Permission is granted to freely reproduce this report in small quantities for purposes of review provided this notice remains affixed.

Review period closes August 15, 1993

Introduction

This Technical Report is comprised of three concrete proposals to which the Unicode Technical Committee is strongly committed in their current form. These are: Ethiopian, Burmese, and Khmer. These proposals have been reviewed internally and have been relatively stable over a period of time. The committee believes they represent good technical solutions for the proposed scripts, and therefore also recommends that specific codepoints within the body of Unicode be allocated for them, as follows:

    Burmese      U+0F00    U+0F7F
    Khmer        U+0F80    U+0FFF
    Ethiopian    U+1200    U+125F

Specific open issues for each of these are addressed in the respective draft block introductions. These open issues do not detract substantially from the solidity of the proposals.

Acknowledgements

* The Burmese proposal was written by Andy Daniels, with contributions by Lloyd Anderson and Glenn Adams.
* The Khmer proposal was written by Andy Daniels.
* The Ethiopian proposal was written by Joe Becker.

The following individuals contributed greatly to the production of this report: Lloyd Anderson, Glenn Adams, Lee Collins

Burmese U+0F00 -> 0F7F

The Burmese script is used to write Burmese, the majority language of Myanmar (formerly Burma), and Pali. Variations and extensions of the script are used to write other languages of the region, such as Shan and Mon, and also to write Sanskrit.

The Burmese script derives from 11th century Mon. The Mon script itself is probably borrowed directly from South India. The earliest Mon inscription, found at Lopburi in Thailand, dates from the eighth century and is written in the Pallava script used at the Hinayana Buddhist center of Conjeeveram in the area of Madras on the east coast of India. In A.D. 1057 one of the first Burmese kings, Aniruddha, conquered Thaton, a major Mon center, and brought back with him to Pagan the most learned monks, artists and artisans of the Mon. The first inscription in Burmese dates from the following year and is written in an alphabet almost identical with that of the Mon inscriptions. Aside from rounding of the originally square characters, this script has remained largely unchanged to the present.

The Burmese script therefore ultimately derives from Brahmi, and so shares the structural features of its relatives: consonant symbols include an inherent vowel; various signs are placed before, above, below and after a consonant to indicate a vowel other than the inherent one; ligatures and conjuncts are used to indicate consonant clusters.

In the course of its adaptation to non-Indo-Aryan languages, the Burmese script has acquired some features that distinguish it from other Indic scripts. The killer, or virama, participates in some common constructions that would be clumsy to handle the way they would be in the other Indic scripts, so the control function of the virama is separated from the diacritic function of the killer.
The virama, 0F4D, is used to form conjunct consonants, while the killer, 0F52, is a simple diacritic and has no effect on character shaping. The killer is also combined with the VOWEL SIGN O (0F4B) to form the low level tone vowel "o." When used this way, this symbol is known as hyei htou, or "thrust forward."

Burmese distinguishes a set of "medial" consonants. Originally conjunct forms of YA PALE, YA GAU, WA and HA, they are used in modern Burmese to form new letters and to spell certain vowel and consonant combinations. They are treated here as no different from any other conjunct and should be coded using the virama.

ISSUE: There's no reason from the point of view of the rendering engine to have separate codes for the medials. Some implementors feel that the medials should nevertheless have separate codes. Including them introduces alternate spellings for the same syllables, something that should be avoided. If there are compelling reasons for including the medials, there is certainly room to add them.

When a syllable has more than one medial, it is recommended that they appear in the order in which such syllables are traditionally spelled. That is, HA HTOU comes before YA PIN or YA YI, which come before WA HSWE. Note that YA PIN and YA YI cannot appear in the same syllable in Burmese. For example, "cwei" ("to drop off") is coded as 0F15+0F5C+0F5E+0F47. "Hmyu" ("to delight, allure") is coded as 0F2E+0F5F+0F5D+0F42. This differs from the order in which medials are normally written.

ISSUE: This rule is not strictly necessary, but regularizes the spelling and simplifies rendering, string comparison and other functions.

Burmese has several glyphs that are used with varying semantics; these are given a separate code point for each different usage.
The following pairs of letters look the same, but must be distinguished in the text stream:

    EHKAYA U (0F09) and NYA GALEI (0F5B)
    GA NGE (0F17) and DIGIT HYI (0F6E)
    WA (0F35) and DIGIT THOUN NYA (0F66)
    YA GAU (0F30) and DIGIT HKUN NI (0F6D)
    DIGIT LEI (0F6A) and SYMBOL LAGAUN (0F73)

The last two pairs are distinguished in some fonts but not in others. Also, the LETTER O (0F13) is distinguished from the sequence 0F48+0F5D, and the ZA MYINZWE is distinguished from 0F1A+0F5C.

Symbols not found as single characters are formed from sequences of the basic characters given here. For example, tha ji ("great tha") is coded by the sequence 0F38+0F4D+0F38, i.e., it is a conjunct formed from two THAs. Kinzi is a conjunct formed from LETTER NGA followed by some other consonant, that is, the sequence 0F19+0F4D+Consonant. Low level tone "o" has already been noted. Level tone "ou" is to be coded as 0F41+0F3F. Other combinations follow similarly.

The LETTER A, though classified here as a vowel, is actually a consonant. Thus it can combine with any of the vowel symbols.

The tone mark AUKA MYI is often written to the left of a subscript vowel sign or medial consonant. It should, nevertheless, come after the vowel or medial in the text stream. It is also used with killed consonants in writing closed syllables. In this case, too, the AUKA MYI should come after the ATHA in the text stream. For example, the word /hyun./ (short, high falling tone) should be represented as 0F30+0F5F+0F5E+0F02+0F51.

The SYMBOL HNAI is only used in the literary combination 0F73+0F19+0F52+0F03, meaning "the aforementioned."

Burmese does not use any whitespace between words. If word boundary indications are desired, for example for the use of automatic line layout algorithms, U+200B, ZERO WIDTH SPACE, is to be used.

Block Structure: Burmese characters are mapped to their corresponding ISCII slots whenever possible. Gaps in the block result mainly from this mapping.
Several ranges of code points are reserved for future expansion. A notable exception to the ISCII mapping is the pair NYA GALEI and NYA JI. Historically, NYA GALEI is a simple palatal nasal, while NYA JI is a ligature representing a double NYA GALEI. NYA JI, however, has come to be regarded as the primary form of the letter in Burmese, so it is assigned to the "preferred" ISCII slot for the palatal nasal (U+0F1E), and NYA GALEI is placed at U+0F5F.

    U+0F00 to U+0F01    Unassigned
    U+0F02 to U+0F03    Various signs
    U+0F04              Unassigned
    U+0F05 to U+0F14    Independent vowels
    U+0F15 to U+0F39    Consonants
    U+0F3A to U+0F3E    Unassigned
    U+0F3F to U+0F4C    Dependent vowel signs
    U+0F4D              Virama
    U+0F4E to U+0F50    Unassigned
    U+0F51 to U+0F52    Tone marks
    U+0F53 to U+0F5F    Unassigned, reserved for extensions
    U+0F60 to U+0F63    Additional dependent vowel signs
    U+0F64 to U+0F65    Unassigned
    U+0F66 to U+0F6F    Digits
    U+0F70 to U+0F73    Special symbols
    U+0F74 to U+0F77    Unassigned, reserved for additional symbols

Note: The transliteration used here follows D. Haigh Roop, An Introduction to the Burmese Writing System (1972). Tone indications are left out of the character names.

ISSUE: As with Khmer, if there is a more standard transliteration, it should be used.

ISSUE: Old Burmese has a small subscript LETTER A, which is the precursor of the tone mark AUKA MYI and appears exactly where modern Burmese would use the latter. This can probably be treated as a font difference. There is also a superscript form of YA GAU, similar in use to the Indic repha. This can probably be accommodated in the shaping rules. This is not a major issue as there is plenty of room to add these characters. Further investigation is required.
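
The conjunct rules in the Burmese introduction (tha ji as 0F38+0F4D+0F38, kinzi as 0F19+0F4D+Consonant) can be illustrated with a minimal sketch. The function names conjunct, kinzi and fmt are hypothetical helpers introduced only for this illustration, and the values are the draft code points of this proposal held as plain integers; they are not assumed to match any published version of Unicode.

```python
# Draft code points from this proposal, kept as plain integers.
# Assumption: these are proposal values only, not published Unicode values.
VIRAMA = 0x0F4D      # BURMESE VIRAMA: control function, forms conjuncts
LETTER_NGA = 0x0F19  # BURMESE LETTER NGA: first element of kinzi
LETTER_THA = 0x0F38  # BURMESE LETTER THA


def conjunct(*consonants):
    """Interleave VIRAMA between consonants to code a conjunct."""
    if not consonants:
        raise ValueError("a conjunct needs at least one consonant")
    seq = [consonants[0]]
    for cp in consonants[1:]:
        seq += [VIRAMA, cp]
    return seq


def kinzi(consonant):
    """Kinzi: LETTER NGA + VIRAMA + some other consonant."""
    return conjunct(LETTER_NGA, consonant)


def fmt(seq):
    """Render a sequence in the 0Fxx+0Fxx notation used in the text."""
    return "+".join(f"{cp:04X}" for cp in seq)


print(fmt(conjunct(LETTER_THA, LETTER_THA)))  # 0F38+0F4D+0F38 (tha ji)
print(fmt(kinzi(0x0F15)))                     # 0F19+0F4D+0F15 (kinzi over KA JI)
```

The killer (0F52) is deliberately absent from these helpers, since the text treats it as a simple diacritic with no effect on character shaping.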
DRAFT 03 Nov 1992; rev 92/11/25
DRAFT BURMESE CHARACTER NAMES

0F00 0F01

@ Various Signs
0F02 BURMESE THEIDHEI TIN = little thing put on anusvara, niggahita
0F03 BURMESE HYEIGA PAU = dots ahead visarga
0F04

@ Independent Vowels
0F05 BURMESE LETTER A
0F06
0F07 BURMESE PALI EHKAYA I = letter pali I
0F08 BURMESE EHKAYA I = letter I
0F09 BURMESE EHKAYA U = letter U x Burmese nya galei -> 0F5B
0F0A
0F0B BURMESE LETTER VOCALIC R Sanskrit
0F0C BURMESE LETTER VOCALIC L Sanskrit
0F0D 0F0E
0F0F BURMESE EHKAYA EI = letter EI
0F10 0F11 0F12
0F13 BURMESE LETTER O x sra
0F14

@ Consonants
0F15 BURMESE KA JI = great ka
0F16 BURMESE HKA GWEI = curved hka
0F17 BURMESE GA NGE = small ga x Burmese digit hyi -> 0F6E
0F18 BURMESE GA JI = great ga
0F19 BURMESE LETTER NGA
0F1A BURMESE SA LOUN = round sa
0F1B BURMESE HSA LEIN = twisted hsa
0F1C BURMESE ZA GWE = split za
0F1D BURMESE ZA MYINZWE = bridle za x cya
0F1E BURMESE NYA JI = great nya
0F1F BURMESE TA TALINJEI = bier-hook ta
0F20 BURMESE HTA WUNBE = duck hta
0F21 BURMESE DA YINGAU = crooked-breasted da
0F22 BURMESE DA YEIHMOU = water-dipper da
0F23 BURMESE NA JI = great na
0F24 BURMESE TA WUNBU = pot-bellied ta
0F25 BURMESE HTA HSINDU = elephant-fetter hta
0F26 BURMESE DA DWEI = twisted da
0F27 BURMESE DA AUHCAI = bottom-indented da
0F28 BURMESE NA NGE = small na
0F29
0F2A BURMESE PA ZAU = steep-sided pa
0F2B BURMESE HPA OUHTOU = capped hpa
0F2C BURMESE BA LAHCAI = top-indented ba
0F2D BURMESE BA GOUN = hump-backed ba
0F2E BURMESE LETTER MA
0F2F BURMESE YA PALE = supine ya
0F30 BURMESE YA GAU = crooked ya x Burmese digit hkun ni -> 0F6D
0F31
0F32 BURMESE LETTER LA
0F33 BURMESE LA JI = great la
0F34
0F35 BURMESE LETTER WA x Burmese digit thoun nya -> 0F66
0F36 BURMESE LETTER SANSKRIT SHA Sanskrit
0F37 BURMESE LETTER SANSKRIT SSA Sanskrit
0F38 BURMESE LETTER THA
0F39 BURMESE LETTER HA
0F3A 0F3B 0F3C 0F3D

@ Vowel Signs
0F3E BURMESE YEI HCA = line drawn down
0F3F BURMESE LOUNJI TIN = big circle put on
0F40 BURMESE LOUNJI TIN HSAN HKA = big circle put on with a grain of rice
0F41 BURMESE TAHCAUN NGIN = one stroke drawn out
0F42 BURMESE HNAHCAUN NGIN = two strokes drawn out
0F43 BURMESE VOWEL SIGN VOCALIC R Sanskrit
0F44 BURMESE VOWEL SIGN VOCALIC RR Sanskrit
0F45 0F46
0F47 BURMESE THAWEI HTOU = thrust in front
0F48 BURMESE NAU PYI = thrown backwards
0F49 0F4A
0F4B BURMESE VOWEL SIGN O
0F4C

@ Virama
0F4D BURMESE VIRAMA x Burmese atha -> 0F52
0F4E 0F4F 0F50

@ Tone Marks
0F51 BURMESE AUKA MYI = stopped below
0F52 BURMESE ATHA = killer = hyei htou, "thrust forward" x Burmese virama -> 0F4D
0F53 0F54 0F55 0F56 0F57 0F58 0F59 0F5A 0F5B 0F5C 0F5D 0F5E

@ Consonants
0F5F BURMESE NYA GALEI = little nya x Burmese ehkaya u -> 0F09

@ Vowel Signs
0F60 BURMESE LETTER VOCALIC RR Sanskrit
0F61 BURMESE LETTER VOCALIC LL Sanskrit
0F62 BURMESE VOWEL SIGN VOCALIC L Sanskrit
0F63 BURMESE VOWEL SIGN VOCALIC LL Sanskrit
0F64 0F65

@ Digits
0F66 BURMESE DIGIT THOUN NYA = digit zero x Burmese wa -> 0F35
0F67 BURMESE DIGIT TI = digit one
0F68 BURMESE DIGIT HNI = digit two
0F69 BURMESE DIGIT THOUN = digit three
0F6A BURMESE DIGIT LEI = digit four x Burmese symbol lagaun -> 0F73
0F6B BURMESE DIGIT NGA = digit five
0F6C BURMESE DIGIT HCAU = digit six
0F6D BURMESE DIGIT HKUN NI = digit seven x Burmese ya gau -> 0F30
0F6E BURMESE DIGIT HYI = digit eight x Burmese ga nge -> 0F17
0F6F BURMESE DIGIT KOU = digit nine

@ Various symbols
0F70 BURMESE SYMBOL YWEI
0F71 BURMESE SYMBOL EHKAYA I
0F72 BURMESE SYMBOL HNAI
0F73 BURMESE SYMBOL LAGAUN x Burmese digit lei -> 0F6A
0F74 0F75 0F76 0F77 0F78 0F79 0F7A 0F7B 0F7C 0F7D 0F7E 0F7F

Khmer U+0F80 -> 0FDF

Cambodian, also known as Khmer, is the official language of Cambodia. Mutually intelligible dialects are also spoken in northeastern Thailand and the Mekong Delta region of Vietnam. Although Khmer is not itself an Indo-European language, much of its administrative, military and literary vocabulary is borrowed from Sanskrit.
With the advent of Theravada Buddhism at the beginning of the fifteenth century, Khmer began to borrow Pali words, and continues to use Pali as a major source of neologisms today. There is also much cross-borrowing between Thai and Khmer, as well as a relatively recent infusion of French words and a smattering of Chinese and Vietnamese loanwords in colloquial speech.

The Khmer script, called a'saa kmae ("Khmer letters"), like the Thai, Lao, Burmese, Old Mon and other scripts, is descended from the Brahmi script of South India. The exact geographical source, or possibly sources, has not been determined, but there is a great similarity between the earliest inscriptions in the region and the Pallava script of the Coromandel coast of India.

Structurally, the Khmer script stays very close to its southern Brahmi origins. There is a set of 35 consonants, each with an inherent vowel sound. Additional signs are placed before, above, below and after the consonants to indicate vowels other than the inherent one. Consonant clusters are represented by conjunct consonants, where the first consonant of the cluster maintains its full form and succeeding consonants are written as subscripts.

The Khmer language has a much richer set of vowels than the Indo-Aryan languages for which the ancestral script was used. By the same token, there is a much smaller set of consonant sounds. The Khmer script is adapted to the language by adding extra vowel signs and various diacritic marks, and by using the choice of consonant as well as of vowel signs to determine the particular vowel sound represented. Thus most vowel signs do not have a single value but must be interpreted in the context of the associated consonant. This is very similar to the situation in Thai and Lao, where different consonant symbols have the same sound but encode different tones.

There are two basic styles of script in modern Khmer, each with two major variations.
They are the a'saa criang ("slanted script") and the a'saa muul ("round script"). There is no fundamental structural difference between them, however, so the "standing" variant of the slanted script is chosen here as representative.

Representation: The Khmer script follows the model of Devanagari and other Indic scripts. The basic unit is the syllabic cluster consisting of a series of consonants separated by WIRIAM (0FC5), optionally followed by one or both of the pronunciation shifters MUSEKATOAN (0FCA) and TRUYSAP (0FCB), followed by an optional vowel, followed by diacritics and quality marks. For example, the word /knyom/, "I," is coded as the string 0F81+0FC5+0F89+0FB5+0FC2.

In cases where there is already some other superscript in the cluster, the two pronunciation shifters are written as the subscript symbol kbiah kraom, which looks much like VOWEL SIGN O. This vowel sign is not to be used for this purpose. It is the responsibility of the presentation software to select the correct appearance of the shifter. For example, /sii/, "to eat," should be coded as 0F9F+0FCB+0FB4, not as 0F9F+0FB7+0FB4.

RAWBAT (0FCC) historically corresponds to the Devanagari repha, that is, to an initial /r/. It has lost this function in Khmer and instead is considered a simple diacritic similar to TOANDAKHIAT in both reading and sorting. There are also many cases of consonant clusters with initial /r/ that should be written with a full RAW and not a RAWBAT, so a separate character is provided for it.

Khmer writing does not normally separate words with white space as European languages do. If it is desirable to represent word boundaries in the text stream, for example, for use by automatic line layout algorithms, U+200B, ZERO WIDTH SPACE, should be used.

Two relatively rare symbols in modern usage are not included here. They are pnek moan, the "cock's eye," and "komout."
They are identical in form and function to the Thai characters FONGMAN and KHOMUT, respectively, so the latter two should be used when these symbols are needed.

Block Structure:

    U+0F80 to U+0FA2    Consonants
    U+0FA3 to U+0FB1    Independent Vowels
    U+0FB2 to U+0FC1    Vowel Signs
    U+0FC2 to U+0FC4    Quality Marks
    U+0FC5              Virama
    U+0FC6 to U+0FC7    Unassigned
    U+0FC8 to U+0FCF    Diacritics
    U+0FD0 to U+0FD9    Digits
    U+0FDA to U+0FDE    Symbols and Punctuation
    U+0FDF              Unassigned

ISSUES:

The independent vowels LETTER AO TYPE 2 and LETTER AW TYPE 2 are variant forms of LETTER AO TYPE 1 and LETTER AW TYPE 1, respectively. It is not believed that they are in free variation: LETTER AO TYPE 2 occurs only in the combination "aoy," while LETTER AW TYPE 2 is only cited in a few references, but not used. There is an opportunity to unify these pairs. Note that LETTER UW and LETTER OU are also listed as variants, but they are actually not in free variation, so both must be provided.

It may be desirable to add the vowel sign AM instead of using the combination AA+NIKAHAT. This would simplify a common special case in sorting.

The punctuation marks KHAN and BARIYAOSAN may be unified with some other characters, just as Indic dandas have been. A likely candidate for the former is Thai PAI YAN NOI. Such a unification, as well as that of the "cock's eye" and "cow piss" characters, presents an interesting challenge to the font mechanism of a Unicode rendering engine: Different glyphs may be required for the same character when used in conjunction with different scripts. This seems like a needless complication for what are otherwise simple, non-combining characters.

It may be more desirable from a political standpoint to follow either the Thai or the ISCII coding schemes. Sample charts have been produced showing how this may be done. If this is indeed the path taken, those charts should be expanded to include all characters in this proposal.
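
Independently of the issues listed here, the syllabic cluster order given under Representation, combined with the ranges in the Block Structure table, can be sketched as a small checker. This is only an illustration under stated assumptions: the name is_cluster and its helpers are hypothetical, the shifters are treated as optional because the /knyom/ example contains none, either shifter order is accepted, and the values are this proposal's draft code points held as plain integers.

```python
# Draft Khmer code points from this proposal (plain integers, not real Unicode).
WIRIAM = 0x0FC5                # KHMER SIGN WIRIAM (virama)
SHIFTERS = {0x0FCA, 0x0FCB}    # MUSEKATOAN, TRUYSAP


def is_consonant(cp):
    return 0x0F80 <= cp <= 0x0FA2


def is_vowel_sign(cp):
    return 0x0FB2 <= cp <= 0x0FC1


def is_mark(cp):
    """Diacritics other than the shifters, plus quality marks."""
    is_diacritic = 0x0FC8 <= cp <= 0x0FCF and cp not in SHIFTERS
    is_quality_mark = 0x0FC2 <= cp <= 0x0FC4
    return is_diacritic or is_quality_mark


def is_cluster(seq):
    """Check: consonants joined by WIRIAM, shifters, vowel sign, marks."""
    n = len(seq)
    if n == 0 or not is_consonant(seq[0]):
        return False
    i = 1
    # Further consonants must each be preceded by WIRIAM.
    while i + 1 < n and seq[i] == WIRIAM and is_consonant(seq[i + 1]):
        i += 2
    # Assumption: zero, one or both shifters, each at most once.
    seen = set()
    while i < n and seq[i] in SHIFTERS and seq[i] not in seen:
        seen.add(seq[i])
        i += 1
    # Optional vowel sign.
    if i < n and is_vowel_sign(seq[i]):
        i += 1
    # Trailing diacritics and quality marks.
    while i < n and is_mark(seq[i]):
        i += 1
    return i == n


print(is_cluster([0x0F81, 0x0FC5, 0x0F89, 0x0FB5, 0x0FC2]))  # True, /knyom/
print(is_cluster([0x0F9F, 0x0FCB, 0x0FB4]))                  # True, /sii/
print(is_cluster([0x0F9F, 0x0FB4, 0x0FCB]))                  # False, shifter after vowel
```

The checker only tests ordering; it does not detect the discouraged use of VOWEL SIGN O (0FB7) in place of a shifter, which is a valid vowel sign as far as the sequence structure is concerned.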
The vowel encoding takes an ISCII-like approach, coding as single characters vowels that consist of two or more disjoint glyphs. If vowel symbols are instead decomposed into their constituent glyphs and those coded separately, there is then no advantage to the code point assignments made here. In such a case, the assignments should be made according to the Thai pattern.

The romanization scheme here is rather ad hoc. If a more commonly accepted one exists, the character names should be changed accordingly.

Draft 03 October 1992; rev 92/11/25
DRAFT KHMER CHARACTER NAMES

@ Consonants
0F80 KHMER LETTER KAA
0F81 KHMER LETTER KHAA
0F82 KHMER LETTER KAW
0F83 KHMER LETTER KHAW
0F84 KHMER LETTER NGAW
0F85 KHMER LETTER CAA
0F86 KHMER LETTER CHAA
0F87 KHMER LETTER CAW
0F88 KHMER LETTER CHAW
0F89 KHMER LETTER NYAW
0F8A KHMER LETTER DAA
0F8B KHMER LETTER RETROFLEX THAA
0F8C KHMER LETTER DAW
0F8D KHMER LETTER RETROFLEX THAW
0F8E KHMER LETTER NAA
0F8F KHMER LETTER TAA
0F90 KHMER LETTER THAA
0F91 KHMER LETTER TAW
0F92 KHMER LETTER THAW
0F93 KHMER LETTER NAW
0F94 KHMER LETTER BAA
0F95 KHMER LETTER PHAA
0F96 KHMER LETTER PAW
0F97 KHMER LETTER PHAW
0F98 KHMER LETTER MAW
0F99 KHMER LETTER YAW
0F9A KHMER LETTER RAW
0F9B KHMER LETTER LAW
0F9C KHMER LETTER WAW
0F9D KHMER LETTER SHAA Sanskrit
0F9E KHMER LETTER SSAA Sanskrit
0F9F KHMER LETTER SAA
0FA0 KHMER LETTER HAA
0FA1 KHMER LETTER LAA
0FA2 KHMER LETTER QAA glottal stop

@ Independent Vowels
0FA3 KHMER LETTER E
0FA4 KHMER LETTER EY
0FA5 KHMER LETTER O
0FA6 KHMER LETTER UW
0FA7 KHMER LETTER OU
0FA8 KHMER LETTER AE
0FA9 KHMER LETTER AY
0FAA KHMER LETTER AO TYPE 1
0FAB KHMER LETTER AO TYPE 2
0FAC KHMER LETTER AW TYPE 1
0FAD KHMER LETTER AW TYPE 2
0FAE KHMER LETTER RIK
0FAF KHMER LETTER RII
0FB0 KHMER LETTER LIK
0FB1 KHMER LETTER LII

@ Vowel Signs
0FB2 KHMER VOWEL SIGN AA
0FB3 KHMER VOWEL SIGN E
0FB4 KHMER VOWEL SIGN EY
0FB5 KHMER VOWEL SIGN U
0FB6 KHMER VOWEL SIGN UI
0FB7 KHMER VOWEL SIGN O x kbiah kraom
0FB8 KHMER VOWEL SIGN OU
0FB9 KHMER VOWEL SIGN UA
0FBA KHMER VOWEL SIGN AU
0FBB KHMER VOWEL SIGN IE
0FBC KHMER VOWEL SIGN IU
0FBD KHMER VOWEL SIGN EI
0FBE KHMER VOWEL SIGN AE
0FBF KHMER VOWEL SIGN AY
0FC0 KHMER VOWEL SIGN AO
0FC1 KHMER VOWEL SIGN AW

@ Quality Marks
0FC2 KHMER SIGN NIKAHAT = sra am = damla
0FC3 KHMER SIGN REAHMUK = wihsakea = wihsancani
0FC4 KHMER SIGN YUKALEAPINTU = coc pi

@ Virama
0FC5 KHMER SIGN WIRIAM virama
0FC6 0FC7

@ Diacritics
0FC8 KHMER VOWEL SIGN BANTA = sangkat = reahsannya
0FC9 KHMER VOWEL SIGN SANYOK SANNYA
0FCA KHMER SIGN MUSEKATOAN = tmin kandao vowel pronunciation shifter
0FCB KHMER SIGN TRUYSAP vowel pronunciation shifter
0FCC KHMER SIGN RAWBAT = rephea
0FCD KHMER SIGN TOANDAKHIAT = samlap = patdesaet
0FCE KHMER SIGN KAKABAT = caung kaek
0FCF KHMER SIGN AHSDA = leik prabuy

@ Digits
0FD0 KHMER DIGIT ZERO
0FD1 KHMER DIGIT ONE
0FD2 KHMER DIGIT TWO
0FD3 KHMER DIGIT THREE
0FD4 KHMER DIGIT FOUR
0FD5 KHMER DIGIT FIVE
0FD6 KHMER DIGIT SIX
0FD7 KHMER DIGIT SEVEN
0FD8 KHMER DIGIT EIGHT
0FD9 KHMER DIGIT NINE

@ Symbols and Punctuation
0FDA KHMER CURRENCY SYMBOL RIAL
0FDB KHMER LEIK TO = amendit sannya repetition sign
0FDC KHMER CAMNOC PI KUH x (division sign -> 00F7) x (tibetan comma -> 1038) colon, semicolon
0FDD KHMER KHAN full stop, ellipsis, abbreviation
0FDE KHMER BARIYAOSAN end of section
0FDF

Proposal for Ethiopian Encoding

The Ethiopian proposal consists of a list of questions/issues, a chart, a character names list, and a block introduction. The content is based on UTC/1991-026 "On the Extended Ethiopic Alphabet" of February 26, 1991 and its later adjustments by Lloyd Anderson, combined with features of the Xerox Amharic implementation by Joe Becker. The character names are based on those in DP 10646, which came from WG2/N459 "Ethiopian character sets" by Michael Mann.

QUESTIONS FOR REVIEWERS:

1. Is this collection missing any important, well-established "extension" letters for writing less-common languages?

2. Are the glyphs in the charts appropriate?

3.
Can you supply documentation to support the specification of the following two characters?

    121D ETHIOPIAN CONSONANT GG
    1237 ETHIOPIAN VOWEL PHONETIC AE

In particular, does U+1237 occur (as a vowel, not as a mark of "w" rounding) on any consonant other than U+1211? Should the combination of U+1237 with U+1211 simply be encoded as a distinct consonant (to be added between current U+1211 and U+1212)?

4. Are the following characters specified correctly?

    1256 ETHIOPIAN COMMA modern usage like colon
    1257 ETHIOPIAN COLON modern usage like semicolon
    1259 ETHIOPIAN NEW COMMA modern usage

5. Do syllable glyph variants ever occur distinctively within the same text, or are they merely font design choices like the glyph variants of Latin "a" or "g"?

ISSUES:

* In this design, no provision is made for coding the syllable glyphs; it is intended that they be excluded from Unicode/10646 BMP. If we learn that glyph variants may occur distinctively, then we may need to define some additional means for specifying glyph variants within plain text.

* Should we define an Ethiopian White Space character which can be easily guaranteed to have the same (minimum) width as U+1255 ETHIOPIAN WORDSPACE? Current opinion is that this is unnecessary.

Ethiopian (U+1200 -> U+125F)

The Ethiopian script, which originally evolved for the archaic language Ge'ez, is currently used to write several languages of Eastern Africa, including Amharic, Tigre, and Oromo. The script continues to be extended for writing languages that have little tradition of printed typography; new characters to cover such extensions may be added to the standard later as definitive information about them becomes available.

Encoding Principles. The visible glyphs of the Ethiopian script are not the objects shown in the encoding chart. The elements of the encoding are the letters of the alphabet underlying the script; thus the encoding is (roughly) phonetic rather than glyphic.
These alphabetic letters are expected to be the units of keyboard input and all text representation short of rendering.

Rendering. Each visible glyph of the Ethiopian script represents a syllable rather than a single letter. The syllables can all be treated as simple (consonant + vowel) pairs, so that each glyph can be thought of as a ligature of two underlying letters. Thus the syllable "MA" would be represented in the encoding as U+1203 ETHIOPIAN CONSONANT M plus U+1233 ETHIOPIAN VOWEL A. The syllable glyphs themselves are not intended to be incorporated in this encoding. The individual consonant or vowel codes should not be isolated (i.e. unpaired) in normal final text, and their rendering in such circumstances is an option of the implementation. One possibility is to use special symbols for the individual letters, as is done in the code charts here.

Chart Symbols Representing Individual Letters. Since the Ethiopian glyphs are normally syllabic, the script provides no unambiguous way of representing the underlying individual letters. Therefore in the code charts and names list, a convention has been adopted in which consonant letters are represented by their "first" form surrounded by a dotted circle, and vowel letters are represented by a typical glyph fragment attached to a dotted circle. This is not intended to imply direct glyphic composition of those forms, but merely to signify the underlying letters.

Encoding/Rendering of "First Form" Syllables. The circled consonants in the charts U+1200 -> U+1224 are underlying letters; they should not be confused with rendered full first form syllable glyphs. As with all glyphs in the script, the first form syllables are encoded as simple (consonant + vowel) pairs. Thus the glyph "MAE" would be represented in the encoding as U+1203 ETHIOPIAN CONSONANT M plus U+1230 ETHIOPIAN VOWEL AE.
This pair would then be rendered via a "ligature" MAE whose appearance would resemble the chart symbol for U+1203 ETHIOPIAN CONSONANT M without the circle.

Encoding/Rendering of Lone Consonants ("Sixth Form" Syllables). The sixth form syllable glyphs are sometimes pronounced as though they were lone consonants (i.e. the vowel is dropped in speech), but this does not change their encoding. As with all glyphs in the script, the sixth form syllables are encoded as simple (consonant + vowel) pairs. Thus the spoken lone consonant "M" would be represented in the encoding as U+1203 ETHIOPIAN CONSONANT M plus U+1235 ETHIOPIAN VOWEL SCHWA.

Variant Glyph Forms. The script sometimes provides different glyph forms to represent the same syllables. It is assumed that these alternatives do not vary freely, in other words that it is appropriate for a given font to contain only one selected glyph form for each syllable. Therefore no mechanism is provided for specifying glyph variants within a plain text stream of characters. The situation is analogous to that of the glyph variants of Latin "a" or "g".

Letter Names. The Ethiopian script often has multiple letters corresponding to the same Latin letter, making it difficult to assign unique Latin names. Therefore the names list makes use of certain devices (such as doubling a Latin letter in the name) merely to create uniqueness; this has no relation to the phonetics of the Ethiopian letters.

Encoding Order and Sorting. The order of the letters in the encoding is based on the traditional alphabetical order. This order differs from the sort order used for one or another language, if only because in many languages various pairs or triplets of letters are treated as equivalent in the first sorting pass.
For example, an Amharic dictionary is likely to start out with a section headed by three letters:

    U+1200 ETHIOPIAN CONSONANT H
    U+1202 ETHIOPIAN CONSONANT HH
    U+120E ETHIOPIAN CONSONANT X

Thus the encoding order cannot and does not implement a collation procedure for any particular language using this script.

Space Characters. The traditional word separator is U+1255 ETHIOPIAN WORDSPACE ( : ), but in modern usage a plain white wordspace is becoming common. The ASCII character U+0020 SPACE is suitable for the latter usage, although its (minimum) width is not guaranteed to be the same as that of the traditional wordspace.

Diacritical Marks. The mark U+030E NON-SPACING DOUBLE VERTICAL LINE ABOVE may occasionally be used to indicate emphasis or gemination. If this or other diacritical marks are used, they follow the vowel letter of the syllable to which they apply.

Encoding Structure. The Unicode block for the Ethiopian script is divided into the following ranges:

    U+1200 to U+1224    Consonant phonetic letters
    U+1225 to U+122F    Currently unassigned
    U+1230 to U+123D    Vowel phonetic letters (U+1239 is an intentional gap)
    U+123E to U+123F    Currently unassigned
    U+1240 to U+1254    Numbers (U+1240 is an intentional gap)
    U+1255 to U+125B    Punctuation
    U+125C to U+125F    Currently unassigned

Draft October 30, 1992; rev 93/01/08
ETHIOPIAN CHARACTER NAMES LIST

@ Consonant phonetic letters
1200 ETHIOPIAN CONSONANT H
1201 ETHIOPIAN CONSONANT L
1202 ETHIOPIAN CONSONANT HH
1203 ETHIOPIAN CONSONANT M
1204 ETHIOPIAN CONSONANT SZ
1205 ETHIOPIAN CONSONANT R
1206 ETHIOPIAN CONSONANT S
1207 ETHIOPIAN CONSONANT SH
1208 ETHIOPIAN CONSONANT Q
1209 ETHIOPIAN CONSONANT QH
120A ETHIOPIAN CONSONANT B
120B ETHIOPIAN CONSONANT V
120C ETHIOPIAN CONSONANT T
120D ETHIOPIAN CONSONANT C
120E ETHIOPIAN CONSONANT X
120F ETHIOPIAN CONSONANT N
1210 ETHIOPIAN CONSONANT NY
1211 ETHIOPIAN CONSONANT GLOTTAL
1212 ETHIOPIAN CONSONANT K
1213 ETHIOPIAN CONSONANT XX
1214 ETHIOPIAN CONSONANT W
1215 ETHIOPIAN CONSONANT NULL
1216 ETHIOPIAN CONSONANT Z
1217 ETHIOPIAN CONSONANT ZH
1218 ETHIOPIAN CONSONANT Y
1219 ETHIOPIAN CONSONANT D
121A ETHIOPIAN CONSONANT DD Oromo
121B ETHIOPIAN CONSONANT J
121C ETHIOPIAN CONSONANT G
121D ETHIOPIAN CONSONANT GG Bilen
121E ETHIOPIAN CONSONANT TH
121F ETHIOPIAN CONSONANT CH
1220 ETHIOPIAN CONSONANT PH
1221 ETHIOPIAN CONSONANT TS
1222 ETHIOPIAN CONSONANT TZ
1223 ETHIOPIAN CONSONANT F
1224 ETHIOPIAN CONSONANT P
1225 1226 1227 1228 1229 122A 122B 122C 122D 122E 122F

@ Vowel phonetic letters
1230 ETHIOPIAN VOWEL AE
1231 ETHIOPIAN VOWEL U
1232 ETHIOPIAN VOWEL I
1233 ETHIOPIAN VOWEL A
1234 ETHIOPIAN VOWEL E
1235 ETHIOPIAN VOWEL SCHWA
1236 ETHIOPIAN VOWEL O
1237 ETHIOPIAN VOWEL PHONETIC AE used primarily with U+1211 ETHIOPIAN CONSONANT GLOTTAL
1238 ETHIOPIAN VOWEL WAE
1239
123A ETHIOPIAN VOWEL WI
123B ETHIOPIAN VOWEL WA
123C ETHIOPIAN VOWEL WE
123D ETHIOPIAN VOWEL W
123E 123F

@ Numbers
1240
1241 ETHIOPIAN NUMBER ONE
1242 ETHIOPIAN NUMBER TWO
1243 ETHIOPIAN NUMBER THREE
1244 ETHIOPIAN NUMBER FOUR
1245 ETHIOPIAN NUMBER FIVE
1246 ETHIOPIAN NUMBER SIX
1247 ETHIOPIAN NUMBER SEVEN
1248 ETHIOPIAN NUMBER EIGHT
1249 ETHIOPIAN NUMBER NINE
124A ETHIOPIAN NUMBER TEN
124B ETHIOPIAN NUMBER TWENTY
124C ETHIOPIAN NUMBER THIRTY
124D ETHIOPIAN NUMBER FORTY
124E ETHIOPIAN NUMBER FIFTY
124F ETHIOPIAN NUMBER SIXTY
1250 ETHIOPIAN NUMBER SEVENTY
1251 ETHIOPIAN NUMBER EIGHTY
1252 ETHIOPIAN NUMBER NINETY
1253 ETHIOPIAN NUMBER HUNDRED
1254 ETHIOPIAN NUMBER TEN THOUSAND

@ Punctuation
1255 ETHIOPIAN WORDSPACE
1256 ETHIOPIAN COMMA modern usage like colon
1257 ETHIOPIAN COLON modern usage like semicolon
1258 ETHIOPIAN PERIOD
1259 ETHIOPIAN NEW COMMA modern usage
125A ETHIOPIAN QUESTION MARK archaic
125B ETHIOPIAN PARAGRAPH SEPARATOR archaic
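
As a minimal sketch of the (consonant + vowel) pairing rule described under Rendering, and of the first-pass equivalence mentioned under Encoding Order and Sorting, the following may help. The names syllable, is_well_paired, first_pass_key and AMHARIC_FIRST_PASS are hypothetical and introduced only for illustration. The equivalence table contains just the one group of three letters cited in the text and is not a complete Amharic collation. All values are this proposal's draft code points held as plain integers.

```python
# Draft Ethiopian code points from this proposal (plain integers).
def is_consonant(cp):
    return 0x1200 <= cp <= 0x1224


def is_vowel(cp):
    # U+1239 is an intentional gap in the vowel range.
    return 0x1230 <= cp <= 0x123D and cp != 0x1239


def syllable(consonant, vowel):
    """Encode one syllable glyph as a (consonant + vowel) pair."""
    if not is_consonant(consonant) or not is_vowel(vowel):
        raise ValueError("expected a consonant letter and a vowel letter")
    return [consonant, vowel]


def is_well_paired(seq):
    """True if no consonant or vowel letter occurs isolated (unpaired)."""
    i = 0
    while i < len(seq):
        if is_consonant(seq[i]):
            if i + 1 >= len(seq) or not is_vowel(seq[i + 1]):
                return False
            i += 2
        elif is_vowel(seq[i]):
            return False
        else:
            i += 1  # punctuation, numbers, diacritics, spaces
    return True


# Illustrative only: H, HH and X treated as equivalent in a first pass.
AMHARIC_FIRST_PASS = {0x1202: 0x1200, 0x120E: 0x1200}


def first_pass_key(seq):
    """Map each letter to the representative of its equivalence group."""
    return [AMHARIC_FIRST_PASS.get(cp, cp) for cp in seq]


ma = syllable(0x1203, 0x1233)      # "MA"
lone_m = syllable(0x1203, 0x1235)  # spoken lone "M" (sixth form)
print(is_well_paired(ma + [0x1255] + lone_m))  # True
print(is_well_paired([0x1203]))                # False
print(first_pass_key(syllable(0x1202, 0x1233)) ==
      first_pass_key(syllable(0x120E, 0x1233)))  # True
```

Keeping the equivalence table separate from the code point order reflects the statement that the encoding order does not implement a collation procedure for any particular language.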