ISO
INTERNATIONAL ORGANIZATION FOR STANDARDIZATION
ORGANISATION INTERNATIONALE DE NORMALISATION

ISO-IEC JTC1/SC2/WG2
Multi-Octet Coded Character Set

ISO-IEC JTC1/SC2/WG2 N884
X3L2/93-017 (Project 396-D)
April 11, 1993

Title: Concerning Future Allocations
Source: Rick McGowan & Joe Becker
Status: Recommended by X3L2 to the attention of WG2
Requested Action: Consideration at WG2 meeting #23, May 24-28, 1993

PART I: At-a-Glance Summaries of the Allocation Situation

I.A: Introduction

This paper is intended to address the question of future code allocations for scripts and characters in IS 10646-1 Universal Multiple-Octet Coded Character Set, particularly in its Basic Multilingual Plane (BMP), and correspondingly in Unicode 1.1.

The goal of 10646/Unicode code assignments is to provide a uniform basis for computer processing and electronic interchange of all textual information. The user community for this standard is very large and heterogeneous, including not only world language communities computerizing their own scripts, but also bibliographic and scholarly communities storing and studying the entire history of human text communication. This broad user community has already brought forth more entities that are potential candidates for character encoding than there are codepoints remaining in the 16-bit space of BMP/Unicode. The purpose of this section is to document this surplus of character encoding candidates in summary form.

Because of the shortage of codespace, future assignments in the BMP/Unicode need to be approached via systematic planning. This paper does not aim to present a final resolution of allocation issues, but rather to initiate discussion by examining the factors that must be taken into consideration. The paper includes:

  I.   Overview of codespace situation based on size estimates of future encodings
  II.  Discussion of possible architectural innovations in BMP structure and usage
  III. Discussion of principles for encoding in the BMP in the future
  IV.  Suggestive allocations of prime candidate scripts and symbols in the A-Zone

I.B: Current BMP codespace allocation

In discussing codespace quantities, three units may be of interest:

  cell     single character code value, also called a codepoint
  column   16 cells, visualized as columns in the character charts (this term is not defined in IS 10646)
  row      256 cells, visualized as rows across the square code planes in large-scale diagrams of the codespace; when magnified in character charts, a row may be visualized as a square of 16 x 16 cells

In terms of cells, the statistics of IS 10646-1 and Unicode 1.1 are as follows:

   6467     9.9%   cells of restricted use
  34167    52.1%   cells assigned
  24902    38.0%   cells available
  65536   100%     cells total

These numbers are slightly misleading because some isolated "available" cells within script blocks are essentially unavailable for future assignment (e.g. 03A2 within the Greek alphabet). A more useful accounting unit is the column, since script blocks are assigned on the basis of this unit. In terms of columns, the statistics of IS 10646-1 and Unicode 1.1 are as follows:

    404     9.9%   columns of restricted use
   2283    55.7%   columns assigned
   1409    34.4%   columns available
   4096   100%     columns total

These numbers are still slightly optimistic, since some isolated columns are also essentially unavailable, but in general columns will be the most useful unit for the approximate kind of figuring done in this paper.

The remainder of this section is keyed to the attached bar-chart illustration Figure 1: ISO 10646 BMP / Unicode Codespace Allocations (at the end of the document). The above allocation numbers are illustrated in the top section of Figure 1, labeled "Current". The horizontal axis in Figure 1 indicates codepoints, with the two main vertical lines demarcating the 16-bit space of 0 to 65536 cells. The first horizontal bar in the "Current" section represents the BMP/Unicode in its zone divisions as described in IS 10646.
The second horizontal bar represents the major current allocation regions. The third horizontal bar in the "Current" section represents the same allocations, but compacted so that the available cells are all shown on the right-hand end. This bar serves as the basis for the rest of the figure.

I.C: Three oversized groups of potential candidates

To establish that there are more potential candidates for character encoding than there are codepoints remaining in the BMP, this section exhibits three large groups of candidates, each one of which would exceed the available space if accepted in toto. (The three groups are not entirely disjoint from each other, but will be treated as disjoint for the purposes of this discussion.)

Raised in Standards Documents

In the course of design discussions about IS 10646 within ISO SC2/WG2 and other standards groups, many sets of entities have been raised as potential future character encoding requests. The formal proposal status of these character sets varies widely, and will not be examined in detail here. However, it is safe to say that such proposals are often quite tenacious, and WG2 has been understandably reluctant to turn down a qualified formal request for code assignments. The following table lists the major components of this collection; the numbers assume unification within this group and with existing codes:

  Set               Country    Document            Chars   Cols
  HCS-B             China      ballot & N 822       8279    518
  CNS additions     Taiwan     (CJK-JRG mins)       7450    466
  Chu Nom           Vietnam    NPCT (CJK-5-11)      1642    103
  Latin & al.       Germany    WG2 N 762            1536     97
  Tibetan "ligs."   China      WG2 N 470            1500     94
  Yi (modern)       China      WG2 N 470            1165     73
  Math symbols      (?)        IEC TR 9573-13        700     44
  Ethiopian syls    U.K.       ballot & N 807        416     26
  Misc.             Misc.      Misc. WG2 docs        320     21
  CJK radicals      China      WG2 N 824             315     20
  Misc. symbols     China      WG2 N 825             288     18
  Videotex          Denmark    WG2 N 838             170     10
  Misc. ISO Stds.   ISO/TC46   WG2 N 809             148      9
  Sinhala           U.K.       ballot & N 803        128      8
  Tibetan           China,UK   WG2 N 826 & al.       100      6
  Arab numerics     Egypt      DIS-2 ballot           90      6
  DIS-1 Mongolian   China      DIS-2 ballot           85      6

  Total                                            23789   1525
  Available in BMP/Unicode (currently-empty rows)  21500   1343

This group is illustrated in Figure 1 as the horizontally-striped bar "Raised in Standards Documents". While not all of these potential candidates have been formally proposed for inclusion in the 10646 BMP/Unicode, the fact is that there is not room for all of them.

Hieroglyphic & Non-Alphabetic Scripts

The Unicode Technical Committee chartered its Scripts & Symbols Subcommittee to accumulate potential future character encoding requests; these labors have been undertaken largely by Rick McGowan. This research has collected a group of scripts that use large character sets of a hieroglyphic or other non-alphabetic nature (not including CJKV ideographs). The informational and proposal status of these character sets varies widely, and will not be examined in detail here. Despite the limited application of some of these scripts, they are genuine characters of human written language, and their encoding to permit computer processing may indeed be of interest to Unicode Consortium members. The following table lists the major components of this collection; the numbers are rough estimates.

  Type                                              Cols
  Han inspired                                      1200
  Hieroglyphs, New World & Oceania                   250
  Other Middle-Eastern, Classical precursors          70
  Cuneiform, ideographic types                        60
  Hieroglyphs, Classical                              40

  Total                                             1620
  Available in BMP/Unicode (currently-empty rows)   1343

This group is illustrated in Figure 1 as the ////////-striped bar "Hieroglyphic & Non-Alphabetic Scripts". Again, while not all of these large-set scripts have been formally proposed for inclusion in the 10646 BMP/Unicode, the fact is that there is not room for all of them.

Other (e.g. added Han, Hangul, Presentation Forms)

The collections enumerated above do not exhaust the large sets of entities that may potentially be advanced for character code assignment.
Possibilities include the remainder of the CCCII collection of about 75,000 CJKV ideographs, or at least the extensions that bring Taiwan CNS 11643 to 48,027 (plus an added 6,696) characters, an indeterminate but large number of archaic Hangul syllables, a nearly infinite number of precomposed diacritical mark combinations, and many other entities categorized as "presentation forms".

The group consisting of the union of these collections is illustrated in Figure 1 as the open-ended \\\\\\\\-striped bar "Other (e.g. added Han, Hangul, Presentation Forms)". Although the eventual size of the residual encoding demand is indeterminate, the fact is that there is not room for all of it in the BMP/Unicode.

I.D: Prime candidate alphabets and symbols for A-Zone allocation

In contrast to the three oversized groups just discussed, the Unicode Scripts & Symbols Subcommittee has enumerated two small but extremely important collections that do fit appropriately into the BMP/Unicode, namely the top-priority items in the lists of yet-unencoded alphabets and symbols. These top-ranked candidates among the remaining alphabets and symbol sets may be called "prime candidates" for encoding. Consistent with the BMP's design, the prime candidates do fit into the A-Zone Alphabets region and Symbols region. A detailed definition and suggestive allocation of these sets is presented in Part IV of this document.

The group of prime candidates is illustrated in the "Proposed" section of Figure 1 as the cross-hatched bar "Prime Candidate Alpha[bets] and Symbols". The remaining bars in the "Proposed" section represent other suggested components discussed in the remainder of this paper.

I.E: Summary

The prime candidates among the yet-unencoded alphabets and symbol sets do fit into their designated unassigned regions within the A-Zone of the BMP/Unicode.
However, there remain three large groups of potential character code allocation requirements: those raised in standards documents, hieroglyphic & non-alphabetic scripts, and other large additions. Not one of these groups, much less all three, can fit in the remaining unassigned codespace in the BMP/Unicode. If the (slight) overlap among the three groups were unified away, and if half or even two-thirds of the candidate entities in these groups were rejected for encoding, the remainder still would not fit. This shortage of codespace in the BMP/Unicode motivates the discussion in the following two sections.

PART II: Responses via innovation in BMP structure and usage

Because there is not enough space in BMP/Unicode for the potential encoding candidates, an approach of extending this standard by simply assigning new characters on a first-come-first-served basis would run out of encoding space well before the needs of the user community were satisfied. The prospect of such an occurrence would inevitably lead to great friction among organizations with competing claims on the remaining BMP/Unicode territory. Such conflict would severely threaten the standard, both technically and politically.

Therefore it is necessary to devise a new and structurally different approach to future code allocation, involving some degree of architectural modification to the standard. This section briefly discusses two such proposals that have been made: a "swap zone" and "extended UCS-2". At the X3L2 meeting where these matters were discussed, there was unanimous preference for extended UCS-2 over the swap zone design, so these alternatives are not presented here as though they are regarded equally.

II.A: Swap zone

In the ISO-IEC SC2/WG2 meetings where the current structure of the 10646 BMP was decided, there was discussion of the potential use of the O-Zone as a target zone for swapping in designated quarters of supplemental planes.
The topic was, however, deferred, and no formal definition of such a regime was produced. The Japan national body negative vote on DIS 10646-1.2 explicitly requested that the O-Zone be expanded from 64 to 94 rows "for swapping-in the existing standards"; again no formal details were provided. In the absence of detailed proposals, general comments may be made on these two broad variants of the swap zone idea.

Considering the Japanese innovation first, this proposal to expand the O-Zone and swap in existing standards was explicitly rejected by WG2, because it is opposed to the basic intent of this standard, which is to assign a single code position to each graphical character, not to provide an indexing mechanism into existing standards. The expansion of the O-Zone to 94 rows is impossible in any case.

The original idea of a swap zone for non-duplicated quarters of supplemental planes, presumably involving an exterior protocol and/or invocation mechanism, is not excluded on the above grounds. It could succeed in permitting characters encoded in supplemental planes to be accessed from within UCS-2 format. However, it does require an identification protocol or invocation information which is costly to manage, and which may become separated from the text data. Once the swap identification information is lost, the original semantics of the swapped character codes cannot be restored by any further recipient of the text. Although such a design could work in some circumstances, vulnerability to irretrievable loss of character semantics makes it unsafe for general-purpose blind interchange of text data.

This problem led to the proposal of the "extended UCS-2" design, which achieves a similar purpose but without introducers or controls, and with no danger of losing the semantics of a coded character.
II.B: Extended UCS-2

In Figure 1, the "Plan C" section represents the result of Proposal for Extended UCS-2, being also a Proposal for Extended Unicode, document X3L2/93-016 of January 21, 1993. To summarize that document, two zones of 1024 BMP codes each are reserved for "High Half" and "Low Half" values of four-octet canonical character forms. Then an additional coded representation form for UCS called "Extended Two-octet BMP Form" is defined, in which the sequence of a two-octet High Half code followed by a two-octet Low Half code is interpreted as a single occurrence of the corresponding four-octet canonical character form. A fully precise specification is contained in document X3L2/93-016.

This innovation is a rather simple addition that is fully compatible with the design (and any existing implementations) of IS 10646-1 and Unicode 1.1. It provides for 1024 x 1024 = 1,048,576 code values from supplemental planes to be easily included in a UCS-2/Unicode stream. In supplementary planes allocated so as to make use of this feature, each plane would contain at most 1024 assigned codes (a few such blocks are visualized in the lower-right corner of Figure 1).

The provision of over 1,000,000 codes accessible from within UCS-2/Unicode is far more than sufficient to encompass all known future encoding candidates. So this design does appear to be a resolution of the code surplus problem with a minimal amount of additional mechanism and compatibility problems.

Under this proposal, the remaining 16-bit BMP/Unicode space, principally the O-Zone, would have no special properties, and would be available for ordinary character encoding of 16K carefully selected candidates. The extended UCS-2 design does remove a great deal of the perceived pressure to include everything within the BMP, yet there still needs to be a plan that more precisely defines what it means for these characters to be "carefully selected". That topic is the subject of the following section of this paper.
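
As an informal illustration of the pairing arithmetic summarized in II.B, the following Python sketch shows how two zones of 1024 codes each can address 1024 x 1024 values, and how a reader of a 16-bit stream can recognize a pair from the code values alone, with no introducer or control. It is only a sketch under stated assumptions: the zone starting positions HIGH_BASE and LOW_BASE are hypothetical placeholders, the treatment of an unpaired half is an arbitrary choice, and the actual correspondence between a pair and its four-octet canonical form is defined only in document X3L2/93-016, which this sketch does not reproduce.

```python
# Sketch of the High Half / Low Half pairing arithmetic described in II.B.
# ASSUMPTIONS (not taken from this paper): the two zone start values below
# are hypothetical placeholders; the real zones are specified in X3L2/93-016.

HALF_SIZE = 1024      # codes per zone, as stated in the text
HIGH_BASE = 0xD800    # hypothetical start of the "High Half" zone
LOW_BASE = 0xDC00     # hypothetical start of the "Low Half" zone


def is_high(code):
    """True if the 16-bit code lies in the (assumed) High Half zone."""
    return HIGH_BASE <= code < HIGH_BASE + HALF_SIZE


def is_low(code):
    """True if the 16-bit code lies in the (assumed) Low Half zone."""
    return LOW_BASE <= code < LOW_BASE + HALF_SIZE


def pair_to_index(high, low):
    """Map a (High Half, Low Half) pair to an index in 0..1,048,575."""
    if not (is_high(high) and is_low(low)):
        raise ValueError("not a High Half / Low Half pair")
    return (high - HIGH_BASE) * HALF_SIZE + (low - LOW_BASE)


def index_to_pair(index):
    """Inverse of pair_to_index."""
    if not 0 <= index < HALF_SIZE * HALF_SIZE:
        raise ValueError("index out of range")
    hi, lo = divmod(index, HALF_SIZE)
    return HIGH_BASE + hi, LOW_BASE + lo


def decode(units):
    """Yield ('bmp', code) or ('ext', index) items from a list of 16-bit codes.

    An unpaired half is passed through as an ordinary 'bmp' item here;
    that handling is an assumption of this sketch.
    """
    i = 0
    while i < len(units):
        u = units[i]
        if is_high(u) and i + 1 < len(units) and is_low(units[i + 1]):
            yield ("ext", pair_to_index(u, units[i + 1]))
            i += 2
        else:
            yield ("bmp", u)
            i += 1
```

Because each half is identifiable by its value range, a recipient can interpret a pair without any external state, which is the property that distinguishes this design from the swap zone discussed in II.A.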
PART III: Encoding Principles

The initial edition of IS 10646-1 and Unicode 1.1 is focused on covering characters that are in common usage, as requested by standards body member nations and organizations, and as attested by documentation in existing international, national, and industry computer and communication standards and typographic collections.

The future character code allocations that will extend this standard have quite a different nature. Almost by definition the user communities for the less-common characters will be limited and not politically organized, there will be no more widely-implemented pre-existing standards, and precise information about character collections may be difficult to obtain, or even disputed.

At the same time, as shown previously in this paper, the remaining BMP/Unicode codespace is a limited resource that requires careful management. This means establishing agreement on orderly procedures to meet the goals of future encoding. This section proposes a structured approach to gathering future character collection candidates, a range of alternatives for assigning them, and a set of goals and guidelines for managing the remaining BMP/Unicode codespace.

III.A: Repertoire research, collection, and feedback

Future character candidates will generally have smaller user communities than the characters already included in the standard, and in many cases these communities may be fragmented into various native, emigre, bibliographic, and scholarly components. Therefore, the first phase in addressing future character collections needs to be an explicit process of research, repertoire collection, and feedback solicitation.

These processes have already begun, notably in the CJK-JRG for ideographic characters, and the Unicode Scripts & Symbols Subcommittee for non-ideographic characters. These groups have been performing research, collecting candidates, and circulating draft repertoires back to the potential user communities for feedback.
Insofar as possible, these organized research efforts should be developed into a single formal channel for new character set proposals, replacing direct decision by the working groups as the mechanism for future encodings.

In the future, research will be necessary in order to establish basic information about proposed characters that are less frequently used. For example, current CJK-JRG criteria are to establish the following for each newly proposed character: who uses this character (or needs to) and what for; what computer systems currently implement it, and/or fonts and/or standards contain it; and what dictionaries, reference materials, or other sources establish its usage.

Because the user communities for less-frequent characters may be scattered, there needs to be a feedback phase to solicit consensus on this information. Such study may require considerable delay, and there is no guarantee that enough consistent information about a character or set may be found to qualify it as a genuine candidate for encoding.

The end-product of this research phase is a proposed repertoire with associated information about each character candidate. When this has been sufficiently reviewed to ensure consensus among the user community, it becomes input to the next stages of the decision-making process.

III.B: Code assignment alternatives

Once a repertoire of proposed character candidates is defined, there are several possibilities for how the candidates may be addressed in the standard. These possibilities may be treated as a set of ordered filters through which each candidate is to be screened.

(1) Do not encode. Some candidates may be unified with existing character codes, or may be noted as micro-variants of existing characters. A candidate which can be represented dynamically as a composite sequence of combining forms probably should not be encoded, since in most cases adding it would merely create a duplicate spelling.
In general, if a candidate can be represented via a sequence of already-encoded characters, then adding it might bring more problems than benefits.

(2) Treat as a glyph. The relation of 10646/Unicode characters to glyph encodings has not yet been fully specified, but even now it is possible to recognize certain sets of entities that are valid glyphs but not valid characters. Some cases may be rather clear-cut, such as linking forms in Latin handwriting-emulation fonts. Other cases are already beginning to be formally recognized, such as the CJK-JRG's exclusion of further typeface variants from character status. Other categories might include stylistic micro-variants, ligatures, and syllable clusters. It is not a good idea to add more collections of purely-glyphic entities to the BMP.

(3) Encode in Private Use Area. The Private Use Area may be useful and appropriate for encoding some character sets that have only a limited user community. Also, it is possible that some complex scripts could be quite difficult to implement, and a Private Use encoding could be used as a test-bed until a single workable encoding and implementation were agreed upon by the user community. In the latter case, this new "de-facto standard" could then be re-proposed with greater confidence for permanent encoding.

(4) Encode on supplementary plane. Section I of this document suggests that inevitably the majority of remaining candidates will be encoded on supplementary planes. For less-frequently-used characters this need not be a major disadvantage, especially if a feature such as Extended UCS-2 permits these to be accessed within a UCS-2 stream. Foresight and agreed criteria are necessary for planning supplementary plane allocations in an orderly way.

(5) Encode on BMP. Encoding of further characters on the BMP should begin to be regarded as an exceptional decision, something of a last resort in case none of the preceding mechanisms can apply.
However, because of the need to shepherd the remaining BMP/Unicode space, we also need to develop explicit goals and guidelines for what candidates may be appropriate to encode there.

III.C: Goals and guidelines for BMP encoding

Having successfully encoded most common usage characters in the initial edition of IS 10646-1 and Unicode 1.1, the standards community would be well served by an explicit set of goals and guidelines for what it hopes to accomplish with the assignments that will be made into the remaining BMP/Unicode space. Below are some proposals for such goal statements.

Goal: The BMP should contain the basic elements of all scripts

The main intention of the term Basic Multilingual Plane needs to be clearly stated. Presumably this plane is to be devoted to covering the breadth of characters used for human written language. If this is the case, there may be times when a small obscure alphabet needs to be given priority over equally obscure additions to vast sets such as the CJKV ideographs. In particular, this guideline argues that the remaining A-Zone space should be given its natural application to the "Remaining alphabets" group (Section I.C above) and a few more common symbols.

Goal: BMP encodings should have high utility

Generally the BMP should be devoted to high-utility characters widely implemented in some form of communication systems. These include, for example, hardcopy typographic systems that are awaiting computerization, and characters recognizable and useful to a large user community. The "utility" of a character in a computer / communication standard can be measured (at least in theory) by such factors as: number of publications (e.g., newspapers or books) using the character, the size of the community who can recognize the character, etc.
Characters of more limited use should be considered for supplementary plane encodings, for example large sets of characters for obscure dead scripts and large sets of individual personal name CJKV ideographs.

Goal: BMP encodings should take into account their user communities

"Utility" also means that the encodings in the BMP should actually be available in implementation to some community of users. For less-frequent characters, the community of users becomes smaller, and its direct participation in the standardization process becomes more important. At the same time it becomes more difficult, because users may exist in scattered geographic and political communities, and in addition have a geographically-distributed scholarly and bibliographic community (for dead scripts, of course, only the latter). The political community may be a smaller unit than a country, and may even be oppressed by the country that contains it.

In all cases where these communities have organized themselves to address the matter of computer encoding, their input should be given especial weight, particularly if it is embodied in specific implementations, and especially if those are actually in use within the community. The whole user community should also be consulted for information on the script and its elements. In situations where community consensus is lacking (e.g., where an encoding proposal arrives from a single source), assignment into the BMP should be deferred until user consensus can be obtained.

Non-Goal: The BMP need not cover all entities in future standards

It is not necessary, though it may often be desirable, that all entities in future international, national, and industry computer and communication standards be included in the BMP. The initial edition practice of covering pre-existing standards was used as a means of evaluating established utility, as well as ensuring compatibility with existing practice.
Entities contained in new standards may or may not have proven utility, and may or may not establish themselves in common usage. Although new standards will continue to be a valuable and important source of candidates for the BMP, inclusion in a new standard will not in and of itself qualify an entity for encoding in the BMP.

PART IV: Overview of Potential Future Encodings

IV.A: Candidate scripts and symbols for A-Zone allocation

This section will introduce specifics of future code collection and allocation, building on the general principles discussed in the previous section. The approach is based on detailed listings of scripts and symbols that are not encoded in 10646 BMP, hence not in Unicode 1.1. A full listing of scripts and symbol sets known to the Unicode Technical Committee (with equivalent name aliases) is maintained by the Scripts & Symbols Subcommittee, and is available on request.

There are three primary unallocated zones in the BMP:

  rows 12-1D     192 columns   A-Zone: Alphabets
  rows 28-2F     128 columns   A-Zone: Symbols
  rows A0-DF    1024 columns   O-Zone

Consistent with the BMP's design, most of the "prime candidates" among remaining alphabetic characters will fit into the A-Zone Alphabets region, and most of the prime candidates among remaining symbolic characters will fit into the A-Zone Symbols region. Therefore the discussion in this section will focus on an approach for allocating alphabet and symbol candidates into the A-Zone.

Planning for the O-Zone will be more difficult, and possible O-Zone allocations are not directly addressed in this document. However, the analysis below of prime candidates for A-Zone encoding leads to a list of leftover "secondary candidates" which fail to make the A-Zone cutoff but which should be considered for O-Zone encoding.

IV.B: Categories and listings of scripts

Given a listing of scripts, there are many dimensions along which the candidates can be evaluated for BMP allocation.
Candidate scripts may be living or extinct, may contain small or large numbers of characters, may support a great or limited published literature, may be clearly-defined or obscure, and so forth.

The following overview discussion compresses this elaborate multi-dimensional analysis into a simple linear set of seven major script categories, A through G. This linear ordering admittedly does not do justice to the details of each script's situation, but it does give an approximate ranking of the candidates as is needed in order to undertake BMP allocation. The categories are as follows:

A  Contemporary
There exists a modern community of native users who produce new printed matter in the script (newspapers, magazines, books, signs). Examples: Burmese, Maldivian, Syriac.

B  Specialized (Small)
There exists a limited community of users (e.g. liturgical) who produce new printed material in the script. Generally these scripts have few native users, or are not in day-to-day use for ordinary communication. Examples: Javanese, Pahlavi, personal name ideographs. (Large sets of this description are moved to category F.)

C  Major Extinct (Small)
There exists a relatively large body of literature in the script, and a relatively large scholarly community studying it. Examples: Etruscan, Linear B. (Large sets of this description are moved to category F.)

D  Attested Extinct (Small)
There exists a relatively limited literature in the script, and a relatively small scholarly community studying it. Examples: Samaritan, Meroitic. (Large sets of this description are moved to category F.)

---- A-Zone Cutoff Line ----

E  Minor Extinct
The utility of publicly encoding these scripts is open to question. They may be secondary candidates for encoding elsewhere on the BMP, or their limited scholarly communities may wish to encode them in the Private Use Area. Examples: Khotanese, Lahnda.

F  Hieroglyphic or Ideographic
The script has a large character set (10 or more columns, i.e.
160 or more characters), which essentially means hieroglyphic or ideographic scripts. A large character set is almost by definition obscure, since it is difficult to obtain information or agreement on the precise membership of the set. The following examples are ordered by the category to which they would otherwise have belonged if they had had small character sets:

  (B) Lolo, Moso, Yi
  (C) Akkadian, Chu Nom, Egyptian Hieroglyphics
  (D) Hittite(Luwian), Khitan, Mayan Hieroglyphics, Nuchen

G  Obscure
The script is not deciphered or understood completely, or is not well attested by substantial literature or scholarly community. Its community of users, if any, may wish to encode it in the Private Use Zone. Examples: Xixia, Rongo-rongo, Osmanya.

The bottom-line result is that scripts in categories A through D will fit in the Alphabets region of the A-Zone, whereas scripts in categories E through G will not.

The detailed listing below is sorted first by category and then by script name. The letter code in the first column represents the script's Status in the UTC Scripts Subcommittee register:

  P  a concrete proposal exists for the script and has been published
  R  the required repertoire is more-or-less completely known
  I  some information is known, but probably not the complete repertoire

There follows the number of Columns required for encoding the script. A number followed by a question mark is a "best estimate" based upon the information currently available; otherwise the numbers are well established. It has not been possible yet to get even estimated numbers for some scripts. At the bottom of each category is an estimated total of columns required for that category; this number includes only those scripts in the group whose size is known or estimated.

Scholars could doubtless find various faults with the categorization and content estimates (particularly of the extinct scripts in D and E), and these may be refined in the future as information becomes available.
But at the moment these estimates offer the best available data for approaching the problem of future code allocation.

Status  Cols  Name & Commentary

Category A
  P      8   Burmese
  R      6   Cree (Evans syllabic signs)
  P      8   Ethiopian (see also Ge'ez)
  R      4   Karenni (Kayah Li)
  P      8   Khmer (Cambodian)
  R      1   Lisu
  P      3   Maldivian {RL} [Dihevi]
  P      8   Mongolian
  -      4?  Pollard phonetic
  P      8   Sinhala
  P      5   Tai Lu
  P      3   Tai Nua
  P      6   Tibetan
  P      3   Tifinagh
        75

Category B
  I      6   Balinese
  P      2   Batak
  P      2   Buginese (Makassar)
  P      6   Cherokee
  P      6   Glagolitic (Glagolitsa)
  -       ?  Han Ideographs (personal names)
  I      6   Hmong
  P      6   Javanese
  P      6   Lepcha (Rong)
  P      6   Limbu (Indic type)
  -      4   Maghreb (see Arabic?)
  P      4   Mangyan
  P      2   Syriac (Nestorian, Estrangela, etc)
  P      2   Tagalog (Tagbanuwa, etc.)
  -       ?  Tamil Granta (extension to Tamil?)
        58

Category C
  R      3   Ahom
  P      2   Aramaic
  P      3   Avestan (Pahlavi)
  I      6   Brahmi (Asoka)
  -      6   Cham
  P      8   Cretan Linear B (see also Mycenaean)
  R      2   Egyptian Hieroglyphic Basic Alphabet Only
  P      3   Etruscan (+ Oscan) {RL}
  R      4   Khamti
  I      6   Kharoshthi
  P      3   Old Persian cuneiform
  P      2   Phoenician
  R      6   Rejang
  P      3   Runes (Germanic, Anglo-Saxon, Scandinavian)
  R      4   Siddham
  P      2   Ugaritic cuneiform
        63

Category D
  R      7   Albanian (Buthakukye, Elbassan, Veso Bei's)
  P      3   Balti {RL}
  I      6   Box-headed script
  -       ?  Han Ideographs (archaic & rare)
  -      4?  Mandaean {RL}
  P      4   Manipuri (see also Bengali)
  P      2   Meroitic {RL}
  P      3   Numidian {BT or RL}
  P      2   Ogham
  -      2?  Parthian
  R      4   'Phags-pa
  R      6   Pyu
  R      3   Samaritan
  -      6?  Satavahana
  P      3   South Arabian {RL}
        55

Category E
  -      6?  Chakma
  -      6?  Chola
  -      6?  Ge'ez (see also Ethiopian)
  -      2?  Iberian
  -      6?  Kaithi
  -      6?  Khotanese
  -      4?  Kok Turki runes (Orkhon) + Old Hungarian (descendant)
  -      6?  Kuoyu
  -      6?  Lahnda
  R      2   Lycian {RL}
  R      2   Lydian {RL}
  -      6?  Manchu
  -      4?  Manichaean
  -      6?  Modi
  -      3?  Sogdian (Uzbekistan)
  -      6?  Tankri
  -      4?  Uighur (see Mongolian?)
        81

Category F
  I     32?  Lolo
  I     85?  Moso (Nasi) ideograms
  -     32?  Moso phonetic
  R     75?  Yi
  -    320?  Akkadian (Assyrian,Babylonian,Sumerian,etc)
  R    144?  Chu Nôm [Vietnamese, Annamese]
  R    320?  Egyptian Hieroglyphic (+Demotic, Hieratic)
  I     12   Hittite hieroglyphic & syllabic (Luwian)
  -    315?  Khitan (Liao, Khidan)
  -     64?  Mayan hieroglyphics
  -    310?  Nuchen (Jurcen, Ju-chen, Niu-chih)
      1709

Category G
  -     64?  Aymara
  -     64?  Aztec pictograms
  -     40?  Bamum (Cameroon)
  -       ?  Carian
  R      3   Chinook (form of shorthand)
  R      6   Cretan Linear A
  R      4   Cypriote syllabary
  I      4?  Cypro-Minoan (Enkomi+Ugarit)
  R      3   Deseret (Mormon)
  -       ?  Han Ideographs (hapax legomena)
  -     32?  Indus Valley
  -       ?  Jindai (Shinto, Japan)
  -     32?  Kauder (Micmac Indians)
  I      4   Osmanya (Somalian)
  -     31?  Paucartambo
  -      3?  Phaistos disk script
  -      6?  Proto-Byblic
  -     32?  Proto-Elamic
  -     25?  Rongo-rongo (Easter Island script)
  -       ?  Sidetic
  -     64?  Vai (Liberia)
  -      8?  Woleai (Caroline)
  I    315?  Xixia (Tangut)
        74

Total: 2648

IV.C: Categories and listings of symbols

The word "symbol" in various contexts is applied to many entities that should not necessarily receive character codes for use in encoding inline text sequences. A set of rough guidelines has been developed for when a graphic "symbol" might be a candidate for character code assignment:

Guidelines for inclusion:
* If the symbol is commonly used in inline text
* If the symbol itself has a name, e.g. "ampersand", "hammer-and-sickle", "caduceus"

Guidelines for exclusion:
* If the symbol is not normally used in inline text
* If the symbol is merely a drawing (stylized or not) of something, e.g. this is intended to exclude pictures of cows, dragons, etc.
* If the symbol is "pictographic", i.e. it represents merely that which it is a picture of
* If the symbol is usually used only in 2-Dimensional diagrams, e.g. circuit components, weather chart symbols
* If the symbol is composable, e.g.
   a slash through some other symbol indicating negation

Given this initial set of filters, the categories for symbols are parallel to a subset of the categories for alphabets:

A   Contemporary
    There exists a modern computer or typographic community of users who produce new printed matter (newspapers, magazines, books, signs) using the symbols in inline text.

B   Specialized
    There exists a limited community of users (e.g. technical) who produce printed material using the symbols in inline text. Generally these symbols have few users, or are not in day-to-day use for ordinary communication.

---- A-Zone Cutoff Line ----

E   Minor
    The utility of publicly encoding these symbols is open to question. The very limited community using them may wish to encode them via the Private Use Area. Examples: Bliss "Semantography"

G   Questionable usage in inline text
    This set corresponds to the "Guidelines for exclusion" above. There are so many non-candidates in this category that it is broken down into subcategories in the detailed listing that follows.

Status  Cols  Name & Commentary

Category A
-       44?   SGML mapped (IEC TR 9573-13?)
-       4     Control chars (ISO 2047)
-       4     Keyboards (ISO 9995-7)
-       ?     Existing industry fonts, TeX
-       4     CJK Related Symbols
        56

Category B
-       10    Videotex
-       ?     More technical, currency, fractions, etc.
-       ?     More geometrics, dingbats/pi
        10

Category E
-       6     Bliss "Semantography"
-       ?     Corporate collections
        6

Category G
  Just-plain-symbols
-       19    Manufacturing
-       16    Agriculture
-       13    Business
-       13    Communications
-       10    Medicine, pharmacy
-       ?     Many other fields...
-       ?     Logotypes (not encoded)
  Symbols from signs & markings
-       32    Traffic
-       16    Controls
-       13    Travel
-       13    Sports & recreation
-       7     Shop signs
-       7     Hospitals
-       7     Safety
-       4     Religion
-       4     Handling of goods
-       4     Hobo signs
  Two-dimensional diagrammatic languages
-       16    Cartographic incl. military, geological
-       16    Tech drawing, electronics, flowcharts
-       16    Architecture
-       7     Alchemy
-       7     Meteorology
-       20    Western music notation
-       ?     Non-western music notations
-       ?     Chemical formulae
-       ?     Biology
-       ?     Astronomy & astrology
-       ?     Notations, e.g. choreography
  Arbitrary graphics (endless collections)
-       ?     Patterns & borders
-       ?     Pictographic drawings of things
  Miscellaneous
-       4     Gestures
-       ?     More enclosed (circled, parenthesized, etc.)
        264

Comments: The "44? columns" for SGML mapped is an estimate of the SGML repertoire not already encoded. The CJK Related Symbols are best assigned in their own region, rows 30-33. It remains to be decided whether Videotex is a reasonable candidate for encoding.

The bottom-line result is that symbols in categories A and B will fit in the Symbols region of the A-Zone. Symbols in categories E and G need not be considered for encoding.

IV.D: Suggestive Allocations of Scripts and Symbols in the A-Zone

The final goal of all the preceding analysis is to generate an actual allocation of prime candidate scripts and symbols into the remaining A-Zone space. The listing below presents a suggestive sample of how such an allocation might look. This "strawman" is intended mainly as a focus of discussion and not as a definitive solution, especially as regards the actual allocation of cells.

The proposal was generated simply by beginning with the A category and allocating more-or-less linearly through the B, C, and D categories. Rarely is a script split across more than one row. Right-to-left scripts are generally placed in a special area reserved for them in rows 07-08. Space is also allocated in rows 2C-2F for the so-called Low Half Zone of the extension technique described in section II.B.

Each line in the listing represents a row of 16 columns, or 256 cells. Newly-proposed script allocations are greatly indented, and followed by a colon and the number of columns consumed by the script. The number of resulting empty columns is indicated for convenience in square brackets at the ends of some lines.
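The row arithmetic described above can be checked mechanically. The following is a minimal sketch in Python and is not part of the proposal; the function name row_summary and the (script, columns) data layout are illustrative assumptions. It takes the suggested allocations for one row and reports the columns used, the columns left empty, and the cells consumed, using the row 08 entries of the listing as the example.

```python
COLUMNS_PER_ROW = 16    # from the text: one row = 16 columns
CELLS_PER_COLUMN = 16   # from the text: one column = 16 cells


def row_summary(allocations):
    """Return (columns_used, columns_free, cells_used) for one BMP row.

    `allocations` is a list of (script_name, columns) pairs; this layout
    is a hypothetical convenience, not a format defined by the paper.
    """
    used = sum(cols for _name, cols in allocations)
    if used > COLUMNS_PER_ROW:
        raise ValueError(f"row over-allocated: {used} columns")
    return used, COLUMNS_PER_ROW - used, used * CELLS_PER_COLUMN


# Row 08 as given in the listing: five right-to-left scripts.
row_08 = [("Phoenician", 2), ("Etruscan", 3), ("Balti", 3),
          ("Meroitic", 2), ("Parthian", 2)]

print(row_summary(row_08))  # (12, 4, 192)
```

The four free columns reported for row 08 agree with the "[+4]" annotation on that row in the listing.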
Row  Currently Assigned Contents      Suggested Allocations

00   ASCII, 8859-1
01   European Latin, Extended Latin
02   Std Phonetic, Mod letters        Lisu:1
03   General diac, Greek
04   Cyrillic
05   Armenian, Hebrew                 [+3]
06   Arabic
07   (RL)                             Maldivian:3 Maghreb:4 Numidian:3
     (RL)                             Syriac:2, Aramaic:2, Samaritan:2
08   (RL)                             Phoenician:2 Etruscan:3 Balti:3
     (RL)                             Meroitic:2 Parthian:2 [+4]
09   Devanagari, Bengali
0A   Gurmukhi, Gujarati
0B   Oriya, Tamil
0C   Telugu, Kannada
0D   Malayalam                        Sinhala:8
0E   Thai, Lao
0F                                    Burmese:8 Khmer:8
10   Georgian                         Tibetan:8
11                                    Mongolian:4 Ethiopian:8 Karenni:4
12                                    Cree:6 Pollard:4 TaiLu:5 TaiNua:1
13                                    TaiNua:2 Tifinagh:3 Cham:6 Runes:3 EgyptianAlpha:2
14                                    Bali:6 Java:6 Batak:2 Buginese:2
15                                    Cherokee:6 Glagolitic:6 Hmong:4
16                                    Hmong:2 Lepcha:6 Limbu:6 Tagalog:2
17                                    Mangyan:4 Avestan:3 Brahmi:6 Ahom:3
18                                    Khamti:3 Kharoshthi:6 Rejang:6 [+1]
19                                    Siddham:4 Ugaritic:2 OldPersian:3 'PhagsPa:4 SouthArabian:3
1A                                    Albanian:7 BoxHead:6 Ogham:2 [+1]
1B                                    LinearB:8 Pyu:6 [+2]
1C                                    Manipuri:4 Satavahana:6 Mandaean:4 [+2]
1D                                    [+16]
1E   Latin & Greek precomposed
1F   Latin & Greek precomposed
20   Punc, Sup/Sub, Crncy, Diac
21   Syms, Number Forms, Arrows
22   Math operators
23   Misc Technical                   MiscSymbols:13
24   Control Pix, OCR, Encl Alpha
25   Form & Cht, Geometrics
26   Misc Dingbats                    Dingbats:9
27   Zapf Dingbats                    Dingbats:4
28                                    SGML:16
29                                    SGML:16
2A                                    SGML:12 Control chars:4
2B                                    Keyboards:4 Videotex:10 [+2]
2C                                    Extension Low Half Zone:16
2D                                    Extension Low Half Zone:16
2E                                    Extension Low Half Zone:16
2F                                    Extension Low Half Zone:16


Plaintext version of Figure 1: ISO 10646 BMP / Unicode Codespace Allocations
(in this rendition of the figure, 1 character = 2048 cells = 128 columns)
(lengths of bars are not entirely to scale, due to roundoff error)

    0   2   4   6   8   A   C   E F
    |         |         |       |   |
----|-------------------------------|----
 C  |                               |
 u  | A-Zone  | I-Zone  |O-Zone | R |
 r  |                               |
 r  |XXaXaXXXXXXXXXXXXXXooooooooRRRR|
 e  |                               |
 n  |RRRXXXXXXXXXXXXXXXXXXaaoooooooo|
 t  |                               |
----|-------------------------------|----------------------------
 P  |                               |
 r  |                               |
 o  |RRRXXXXXXXXXXXXXXXXXX          |
 j  |                     pp        |
 e  |                       dddddddddddd
 c  |                               |   hhhhhhhhhhh
 t  |                               |              +++++++++++...
 e  |                               |
 d  |                               |
----|-------------------------------|----------------------------
 P  |                               |
 r  |                               |
 o  |RRRXXXXXXXXXXXXXXXXXX          | m m m m m m m
 p  |                     pp        |
 o  |                       e       | m m m m m m m
 s  |                        sssssss|
 e  |                               |
 d  |                               |
----|-------------------------------|----------------------------

Key:
    R = Restricted
    X = Assigned
    a = A-Zone unassigned
    o = O-Zone unassigned
    p = Prime Candidate Alphabets and Symbols
    d = Raised in Standards Documents
    h = Hieroglyphic & Non-Alphabetic Scripts
    + = Other (e.g. added Han, Hangul, Presentation Forms)
    e = Reserved: Extension Half-Code Values
    s = Carefully Selected High-Usage Remainder
    m = > 1,000,000 more in blocks of 1,024 (on supplemental planes)