ISO
	    INTERNATIONAL ORGANIZATION FOR STANDARDIZATION
	     ORGANISATION INTERNATIONALE DE NORMALISATION

			  ISO-IEC JTC1/SC2/WG2
		    Multi-Octet Coded Character Set

						  ISO-IEC JTC1/SC2/WG2 N884
						X3L2/93-017 (Project 396-D)
							     April 11, 1993

Title:		    Concerning Future Allocations
Source:		    Rick McGowan & Joe Becker
Status:		    Recommended by X3L2 to the attention of WG2
Requested Action:   Consideration at WG2 meeting #23, May 24-28, 1993


PART I:  At-a-Glance Summaries of the Allocation Situation

I.A:  Introduction

This paper is intended to address the question of future code allocations for
scripts and characters in IS 10646-1 Universal Multiple-Octet Coded Character
Set, particularly in its Basic Multilingual Plane (BMP), and correspondingly
in Unicode 1.1.

The goal of 10646/Unicode code assignments is to provide a uniform basis for
computer processing and electronic interchange of all textual information.
The user community for this standard is very large and heterogeneous,
including not only world language communities computerizing their own scripts,
but also bibliographic and scholarly communities storing and studying the
entire history of human text communication.

This broad user community has already brought forth more entities that are
potential candidates for character encoding than there are codepoints
remaining in the 16-bit space of BMP/Unicode.  The purpose of this section is
to document this surplus of character encoding candidates in summary form.

Because of the shortage of codespace, future assignments in the BMP/Unicode
need to be approached via systematic planning.  This paper does not aim to
present a final resolution of allocation issues, but rather to initiate
discussion by examining the factors that must be taken into consideration.
The paper includes:

  I.	Overview of codespace situation based on size estimates of
  	future encodings

 II.	Discussion of possible architectural innovations in BMP structure
 	and usage

III.	Discussion of principles for encoding in the BMP in the future

 IV.	Suggestive allocations of prime candidate scripts and symbols
 	in the A-Zone

I.B:  Current BMP codespace allocation

In discussing codespace quantities, three units may be of interest:

cell	single character code value, also called a codepoint

column	16 cells, visualized as columns in the character charts
	(this term is not defined in IS 10646)

row	256 cells, visualized as rows across the square code planes in
	large-scale diagrams of the codespace; when magnified in character
	charts, a row may be visualized as a square of 16 x 16 cells

In terms of cells, the statistics of IS 10646-1 and Unicode 1.1 are as follows:

 6467	 9.9% cells of restricted use
34167	52.1% cells assigned
24902	38.0% cells available

65536	100% cells total

These numbers are slightly misleading because some isolated "available" cells
within script blocks are essentially unavailable for future assignment (e.g.
03A2 within the Greek alphabet).  A more useful accounting unit is the column,
since script blocks are assigned on the basis of this unit.  In terms of
columns, the statistics of IS 10646-1 and Unicode 1.1 are as follows:

 404	 9.9% columns of restricted use
2283	55.7% columns assigned
1409	34.4% columns available

4096	100% columns total

These numbers are still slightly optimistic, since some isolated columns are
also essentially unavailable, but in general columns will be the most useful
unit for the approximate kind of figuring done in this paper.
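
The percentages above can be rechecked mechanically.  The following Python
fragment is only an illustrative sketch: the counts are those quoted in this
section, while the helper name `share` and the variable names are assumptions
introduced here and do not come from IS 10646 or Unicode.

```python
CELLS_PER_COLUMN = 16    # a column is 16 cells
CELLS_PER_ROW = 256      # a row is 256 cells
BMP_CELLS = 65536        # size of the 16-bit space

# Counts quoted above for IS 10646-1 and Unicode 1.1.
cells = {"restricted": 6467, "assigned": 34167, "available": 24902}
columns = {"restricted": 404, "assigned": 2283, "available": 1409}


def share(counts):
    """Return each count as a percentage of the total, to one decimal place."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


cell_share = share(cells)
column_share = share(columns)
total_columns = BMP_CELLS // CELLS_PER_COLUMN
total_rows = BMP_CELLS // CELLS_PER_ROW

print(cell_share)
print(column_share)
print(total_columns, total_rows)
```

Counting in columns rather than cells moves the "available" share from 38.0%
to 34.4%, which is the effect described in the text.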

The remainder of this section is keyed to the attached bar-chart illustration
Figure 1: ISO 10646 BMP / Unicode Codespace Allocations (at the end of the
document).  The above allocation numbers are illustrated in the top section
of Figure 1, labeled "Current".  The horizontal axis in Figure 1 indicates
codepoints, with the two main vertical lines demarcating the 16-bit space of
0 to 65536 cells.  The first horizontal bar in the "Current" section
represents the BMP/Unicode in its zone divisions as described in IS 10646.
The second horizontal bar represents the major current allocation regions.
The third horizontal bar in the "Current" section represents the same
allocations, but compacted so that the available cells are all shown on the
right-hand end.  This bar serves as the basis for the rest of the figure.

I.C:  Three oversized groups of potential candidates

To establish that there are more potential candidates for character encoding than
there are codepoints remaining in the BMP, this section exhibits three large
groups of candidates, each one of which would exceed the available space if
accepted in toto.  (The three groups are not entirely disjoint from each
other, but will be treated as disjoint for the purposes of this discussion.)

Raised in Standards Documents

In the course of design discussions about IS 10646 within ISO SC2/WG2 and
other standards groups, many sets of entities have been raised as potential
future character encoding requests.  The formal proposal status of these
character sets varies widely, and will not be examined in detail here.
However, it is safe to say that such proposals are often quite tenacious, and
WG2 has been understandably reluctant to turn down a qualified formal request
for code assignments.

The following table lists the major components of this collection; the numbers
assume unification within this group and with existing codes:

	Set               Country      Document         Chars     Cols
	
	HCS-B             China        ballot & N 822     8279     518
	CNS additions     Taiwan       (CJK-JRG mins)     7450     466
	Chu Nom           Vietnam      NPCT (CJK-5-11)    1642     103
	Latin & al.       Germany      WG2 N 762          1536      97
	Tibetan "ligs."   China        WG2 N 470          1500      94
	Yi (modern)       China        WG2 N 470          1165      73
	Math symbols      (?)          IEC TR 9573-13      700      44
	Ethiopian syls    U.K.         ballot & N 807      416      26
	Misc.             Misc.        Misc. WG2 docs      320      21
	CJK radicals      China        WG2 N 824           315      20
	Misc. symbols     China        WG2 N 825           288      18
	Videotex          Denmark      WG2 N 838           170      10
	Misc. ISO Stds.   ISO/TC46     WG2 N 809           148       9
	Sinhala           U.K.         ballot & N 803      128       8
	Tibetan           China,UK     WG2 N 826 & al.     100       6
	Arab numerics     Egypt        DIS-2 ballot         90       6
	DIS-1 Mongolian   China        DIS-2 ballot         85       6
    
	Total                                            23789    1525
	Available in BMP/Unicode (currently-empty rows)  21500    1343

This group is illustrated in Figure 1 as the horizontally-striped bar "Raised
in Standards Documents".  While not all of these potential candidates have
been formally proposed for inclusion in the 10646 BMP/Unicode, the fact is
that there is not room for all of them.
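
As a check on the column figures, the shortfall for this group can be tallied
directly.  The sketch below is illustrative only; it uses the "Cols" values
from the table above in table order, and the variable names are assumptions
introduced here.

```python
# "Cols" values from the table above, in table order.
requested_columns = [
    518, 466, 103, 97, 94, 73, 44, 26, 21, 20, 18, 10, 9, 8, 6, 6, 6,
]

# Columns available in currently-empty rows, as stated in the table.
AVAILABLE_COLUMNS = 1343

total_requested = sum(requested_columns)
shortfall = total_requested - AVAILABLE_COLUMNS

print(total_requested, shortfall)
```

A positive shortfall means that the group, taken whole, exceeds the empty rows
of the BMP.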

Hieroglyphic & Non-Alphabetic Scripts

The Unicode Technical Committee chartered its Scripts & Symbols Subcommittee
to accumulate potential future character encoding requests; these labors have
been undertaken largely by Rick McGowan.  This research has collected a group
of scripts that use large character sets of a hieroglyphic or other
non-alphabetic nature (not including CJKV ideographs).  The informational and
proposal status of these character sets varies widely, and will not be
examined in detail here. Despite the limited application of some of these
scripts, they are genuine characters of human written language, and their
encoding to permit computer processing may indeed be of interest to Unicode
Consortium members.

The following table lists the major components of this collection; the numbers
are rough estimates.

Type						Cols

Han inspired					1200
Hieroglyphs, New World & Oceania		 250
Other Middle-Eastern, Classical precursors	  70
Cuneiform, ideographic types	   		  60
Hieroglyphs, Classical	  			  40

Total						1620
Available in BMP/Unicode (currently-empty rows)	1343

This group is illustrated in Figure 1 as the ////////-striped bar
"Hieroglyphic & Non-Alphabetic Scripts".  Again, while not all of these
large-set scripts have been formally proposed for inclusion in the 10646
BMP/Unicode, the fact is that there is not room for all of them.

Other (e.g. added Han, Hangul, Presentation Forms)

The collections enumerated above do not exhaust the large sets of entities
that may potentially be advanced for character code assignment.  Possibilities
include the remainder of the CCCII collection of about 75,000 CJKV ideographs,
or at least the extensions that bring Taiwan CNS 11643 to 48,027 (plus an
added 6,696) characters, an indeterminate but large number of archaic Hangul
syllables, a nearly infinite number of precomposed diacritical mark
combinations, and many other entities categorized as "presentation forms".

The group consisting of the union of these collections is illustrated in
Figure 1 as the open-ended \\\\\\\\-striped bar "Other (e.g. added Han,
Hangul, Presentation Forms)".  Although the eventual size of the residual
encoding demand is indeterminate, the fact is that there is not room for all
of it in the BMP/Unicode.

I.D:  Prime candidate alphabets and symbols for A-Zone allocation

In contrast to the three oversized groups just discussed, the Unicode Scripts
& Symbols Subcommittee has enumerated two small but extremely important
collections that do fit appropriately into the BMP/Unicode, namely the
top-priority items in the lists of yet-unencoded alphabets and symbols.  These
top-ranked candidates among the remaining alphabets and symbol sets may be
called "prime candidates" for encoding.

Consistent with the BMP's design, the prime candidates do fit into the A-Zone
Alphabets region and Symbols region.  A detailed definition and suggestive
allocation of these sets is presented in Part IV of this document.

The group of prime candidates is illustrated in the "Proposed" section of
Figure 1 as the cross-hatched bar "Prime Candidate Alpha[bets] and Symbols".
The remaining bars in the "Proposed" section represent other suggested
components discussed in the remainder of this paper.

I.E:  Summary

The prime candidates among the yet-unencoded alphabets and symbol sets do fit
into their designated unassigned regions within the A-Zone of the BMP/Unicode.

However, there remain three large groups of potential character code
allocation requirements: those raised in standards documents, hieroglyphic &
non-alphabetic scripts, and other large additions.  Not one of these groups,
much less all three, can fit in the remaining unassigned codespace in the
BMP/Unicode.  If the (slight) overlap among the three groups were unified
away, and if half or even two-thirds of the candidate entities in these groups
were rejected for encoding, the remainder still would not fit.

This shortage of codespace in the BMP/Unicode motivates the discussion in the
following two sections.


PART II:  Responses via innovation in BMP structure and usage

Because there is not enough space in BMP/Unicode for the potential encoding
candidates, an approach of extending this standard by simply assigning new
characters on a first-come-first-served basis would run out of encoding space
well before the needs of the user community were satisfied.  The prospect of
such an occurrence would inevitably lead to great friction among organizations
with competing claims on the remaining BMP/Unicode territory.  Such conflict
would severely threaten the standard, both technically and politically.

Therefore it is necessary to devise a new and structurally different approach
to future code allocation, involving some degree of architectural modification
to the standard.  This section briefly discusses two such proposals that have
been made: a "swap zone" and "extended UCS-2".  At the X3L2 meeting where
these matters were discussed, there was unanimous preference for extended
UCS-2 over the swap zone design, so these alternatives are not presented here
as though they are regarded equally.

II.A:  Swap zone

In the ISO-IEC SC2/WG2 meetings where the current structure of the 10646 BMP
was decided, there was discussion of the potential use of the O-Zone as a
target zone for swapping in designated quarters of supplemental planes.  The
topic was, however, deferred, and no formal definition of such a regime was
produced.  The Japan national body's negative vote on DIS 10646-1.2 explicitly
requested that the O-Zone be expanded from 64 to 94 rows "for swapping-in the
existing standards"; again no formal details were provided.  In the absence
of detailed proposals, general comments may be made on these two broad
variants of the swap zone idea.

Considering the Japanese innovation first, this proposal to expand the O-Zone
and swap in existing standards was explicitly rejected by WG2, because it is
opposed to the basic intent of this standard, which is to assign a single code
position to each graphical character, not to provide an indexing mechanism
into existing standards.  The expansion of the O-Zone to 94 rows is impossible
in any case.

The original idea of a swap zone for non-duplicated quarters of supplemental
planes, presumably involving an exterior protocol and/or invocation mechanism,
is not excluded on the above grounds.  It could succeed in permitting
characters encoded in supplemental planes to be accessed from within UCS-2
format.  However, it does require an identification protocol or invocation
information which is costly to manage, and which may become separated from
the text data.  Once the swap identification information is lost, the original
semantics of the swapped character codes cannot be restored by any further
recipient of the text.

Although such a design could work in some circumstances, vulnerability to
irretrievable loss of character semantics makes it unsafe for general-purpose
blind interchange of text data.  This problem led to the proposal of the
"extended UCS-2" design, which achieves a similar purpose but without
introducers or controls, and with no danger of losing the semantics of a coded
character.

II.B:  Extended UCS-2

In Figure 1, the "Plan C" section represents the result of "Proposal for
Extended UCS-2, being also a Proposal for Extended Unicode", document
X3L2/93-016 of January 21, 1993.  To summarize that document, two zones of
1024 BMP codes each are reserved for "High Half" and "Low Half" values of
four-octet canonical character forms.  Then an additional coded representation
form for UCS called "Extended Two-octet BMP Form" is defined, in which the
sequence of a two-octet High Half code followed by a two-octet Low Half code
is interpreted as a single occurrence of the corresponding four-octet
canonical character form.  A fully precise specification is contained in
document X3L2/93-016.

This innovation is a rather simple addition that is fully compatible with the
design (and any existing implementations) of IS 10646-1 and Unicode 1.1.  It
provides for  1024 x 1024 = 1,048,576  code values from supplemental planes to
be easily included in a UCS-2/Unicode stream.  In supplementary planes
allocated so as to make use of this feature, each plane would contain at most
1024 assigned codes (a few such blocks are visualized in the lower-right
corner of Figure 1).
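
The pairing arithmetic can be sketched as follows.  This is not the
specification of X3L2/93-016, which is not reproduced in this paper.  The
positions of the two reserved zones are passed in as parameters because they
are not stated here, the example base values are placeholders chosen purely
for illustration, and the result is a plain index rather than an actual
four-octet canonical form.  All function and constant names are assumptions.

```python
HALF_ZONE_SIZE = 1024  # each reserved zone holds 1024 BMP codes (from the text)


def pair_to_index(high, low, high_base, low_base):
    """Map a (High Half, Low Half) code pair to an index in 0 .. 1024*1024 - 1.

    high_base and low_base are the first codes of the two reserved zones.
    Their real values are defined in X3L2/93-016, not in this sketch.
    """
    hi = high - high_base
    lo = low - low_base
    if not (0 <= hi < HALF_ZONE_SIZE and 0 <= lo < HALF_ZONE_SIZE):
        raise ValueError("code lies outside its reserved half zone")
    return hi * HALF_ZONE_SIZE + lo


def index_to_pair(index, high_base, low_base):
    """Inverse of pair_to_index: split an index into its two half codes."""
    if not 0 <= index < HALF_ZONE_SIZE * HALF_ZONE_SIZE:
        raise ValueError("index out of range")
    hi, lo = divmod(index, HALF_ZONE_SIZE)
    return high_base + hi, low_base + lo


# Placeholder zone positions, for illustration only (an assumption).
EXAMPLE_HIGH_BASE = 0xD800
EXAMPLE_LOW_BASE = 0xDC00

last = pair_to_index(EXAMPLE_HIGH_BASE + 1023, EXAMPLE_LOW_BASE + 1023,
                     EXAMPLE_HIGH_BASE, EXAMPLE_LOW_BASE)
print(last + 1)  # number of distinct pairs
```

Because the two zones are disjoint, a receiver can tell from a single code
whether it is a High Half or a Low Half, with no external invocation
information of the kind a swap zone would require.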

The provision of over 1,000,000 codes accessible from within UCS-2/Unicode is
far more than sufficient to encompass all known future encoding candidates.
So this design does appear to resolve the problem of surplus encoding
candidates, with a minimum of additional mechanism and with few compatibility
problems.

Under this proposal, the remaining 16-bit BMP/Unicode space, principally the
O-Zone, would have no special properties, and would be available for ordinary
character encoding of 16K carefully selected candidates. The extended UCS-2
design does remove a great deal of the perceived pressure to include
everything within the BMP, yet there still needs to be a plan that more
precisely defines what it means for these characters to be "carefully
selected".  That topic is the subject of the following section of this paper.


PART III:  Encoding Principles

The initial edition of IS 10646-1 and Unicode 1.1 is focused on covering
characters that are in common usage, as requested by standards body member
nations and organizations, and as attested by documentation in existing
international, national, and industry computer and communication standards
and typographic collections.

The future character code allocations that will extend this standard have
quite a different nature.  Almost by definition the user communities for the
less-common characters will be limited and not politically organized, there
will be no further widely-implemented pre-existing standards to draw on,
and precise information about character collections may be difficult to
obtain, or even disputed.

At the same time, as shown previously in this paper, the remaining BMP/Unicode
codespace is a limited resource that requires careful management.  This means
establishing agreement on orderly procedures to meet the goals of future
encoding.  This section proposes a structured approach to gathering future
character collection candidates, a range of alternatives for assigning them,
and a set of goals and guidelines for managing the remaining BMP/Unicode
codespace.

III.A:  Repertoire research, collection, and feedback

Future character candidates will generally have smaller user communities than
the characters already included in the standard, and in many cases these
communities may be fragmented into various native, emigre, bibliographic, and
scholarly components.  Therefore, the first phase in addressing future
character collections needs to be an explicit process of research, repertoire
collection, and feedback solicitation.

These processes have already begun, notably in the CJK-JRG for ideographic
characters, and the Unicode Scripts & Symbols Subcommittee for non-ideographic
characters.  These groups have been performing research, collecting
candidates, and circulating draft repertoires back to the potential user
communities for feedback.  Insofar as possible, these organized research
efforts should be developed into a single formal channel for new character
set proposals, replacing direct decision by the working groups as the
mechanism for future encodings.

In the future, research will be necessary in order to establish basic
information about proposed characters that are less frequently used.  For
example, current CJK-JRG criteria are to establish the following for each
newly proposed character:  who uses this character (or needs to) and what for;
what computer systems currently implement it, and/or fonts and/or standards
contain it; and what dictionaries, reference materials, or other sources
establish its usage.

Because the user communities for less-frequent characters may be scattered,
there needs to be a feedback phase to solicit consensus on this information.
Such study may require considerable delay, and there is no guarantee that
enough consistent information about a character or set may be found to qualify
it as a genuine candidate for encoding.

The end-product of this research phase is a proposed repertoire with
associated information about each character candidate.  When this has been
sufficiently reviewed to ensure consensus among the user community, it becomes
input to the next stages of the decision-making process.

III.B:  Code assignment alternatives

Once a repertoire of proposed character candidates is defined, there are
several possibilities for how the candidates may be addressed in the standard.
These possibilities may be treated as a set of ordered filters through which
each candidate is to be screened.

(1) Do not encode.  Some candidates may be unified with existing character
codes, or may be noted as micro-variants of existing characters.  A candidate
which can be represented dynamically as a composite sequence of combining
forms probably should not be encoded, since in most cases adding it would
merely create a duplicate spelling.  In general, if a candidate can be
represented via a sequence of already-encoded characters, then adding it might
bring more problems than benefits.

(2) Treat as a glyph.  The relation of 10646/Unicode characters to glyph
encodings has not yet been fully specified, but even now it is possible to
recognize certain sets of entities that are valid glyphs but not valid
characters.  Some cases may be rather clear-cut, such as linking forms in
Latin handwriting-emulation fonts.  Other cases are already beginning to be
formally recognized, such as the CJK-JRG's exclusion of further typeface
variants from character status.  Other categories might include stylistic
micro-variants, ligatures, and syllable clusters.  It is not a good idea to
add more collections of purely-glyphic entities to the BMP.

(3) Encode in Private Use Area.  The Private Use Area may be useful and
appropriate for encoding some character sets that have only a limited user
community.  Also, it is possible that some complex scripts could be quite
difficult to implement, and a Private Use encoding could be used as a test-bed
until a single workable encoding and implementation were agreed upon by the
user community.  In the latter case, this new "de-facto standard" could then
be re-proposed with greater confidence for permanent encoding.

(4) Encode on supplementary plane.  Part I of this document suggests that
inevitably the majority of remaining candidates will be encoded on
supplementary planes.  For less-frequently-used characters this need not be
a major disadvantage, especially if a feature such as Extended UCS-2 permits
these to be accessed within a UCS-2 stream.  Foresight and agreed criteria
are necessary for planning supplementary plane allocations in an orderly way.

(5) Encode on BMP.  Encoding of further characters on the BMP should begin to
be regarded as an exceptional decision, something of a last resort in case
none of the preceding mechanisms can apply.  However, because of the need to
shepherd the remaining BMP/Unicode space, we also need to develop explicit
goals and guidelines for what candidates may be appropriate to encode there.
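
The ordering of the five alternatives above can be sketched as a simple
screening function.  This is an illustration of the "ordered filters" idea
only.  In practice each test is a matter of committee judgement, and the
yes/no fields of the hypothetical `Candidate` record below are assumptions
introduced here, not criteria defined by this paper.

```python
from dataclasses import dataclass
from enum import Enum


class Disposition(Enum):
    DO_NOT_ENCODE = 1
    TREAT_AS_GLYPH = 2
    PRIVATE_USE_AREA = 3
    SUPPLEMENTARY_PLANE = 4
    BMP = 5


@dataclass
class Candidate:
    name: str
    # Hypothetical judgements recorded during repertoire research.
    representable_by_existing: bool = False  # unifiable, or a composite sequence
    purely_glyphic: bool = False             # a valid glyph but not a character
    limited_or_experimental: bool = False    # small community, or test-bed stage
    less_frequently_used: bool = False       # acceptable on a supplementary plane


def screen(candidate):
    """Return the first alternative, in order (1) to (5), that applies."""
    if candidate.representable_by_existing:
        return Disposition.DO_NOT_ENCODE
    if candidate.purely_glyphic:
        return Disposition.TREAT_AS_GLYPH
    if candidate.limited_or_experimental:
        return Disposition.PRIVATE_USE_AREA
    if candidate.less_frequently_used:
        return Disposition.SUPPLEMENTARY_PLANE
    # Last resort: none of the preceding mechanisms applies.
    return Disposition.BMP


print(screen(Candidate("example ligature", purely_glyphic=True)))
```

Placing the BMP last in the sequence reflects the view that BMP encoding is
the exceptional outcome rather than the default.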

III.C:  Goals and guidelines for BMP encoding

Having successfully encoded most common usage characters in the initial
edition of IS 10646-1 and Unicode 1.1, the standards community would be well
served by an explicit set of goals and guidelines for what it hopes to
accomplish with the assignments that will be made into the remaining
BMP/Unicode space.  Below are some proposals for such goal statements.

Goal: The BMP should contain the basic elements of all scripts

The main intention of the term Basic Multilingual Plane needs to be clearly
stated.  Presumably this plane is to be devoted to covering the breadth of
characters used for human written language.  If this is the case, there may
be times when a small obscure alphabet needs to be given priority over equally
obscure additions to vast sets such as the CJKV ideographs.  In particular,
this guideline argues that the remaining A-Zone space should be given its
natural application to the "Remaining alphabets" group (Section I.D above)
and a few more common symbols.

Goal: BMP encodings should have high utility

Generally the BMP should be devoted to high-utility characters widely
implemented in some form of communication systems.  These include, for
example, hardcopy typographic systems that are awaiting computerization, and
characters recognizable and useful to a large user community.  The "utility"
of a character in a computer / communication standard can be measured (at
least in theory) by such factors as: number of publications (e.g., newspapers
or books) using the character, the size of the community who can recognize
the character, etc.  Characters of more limited use should be considered for
supplementary plane encodings, for example large sets of characters for
obscure dead scripts and large sets of individual personal name CJKV
ideographs.

Goal: BMP encodings should take into account their user communities

"Utility" also means that the encodings in the BMP should actually be
available in implementation to some community of users.  For less-frequent
characters, the community of users becomes smaller, and its direct
participation in the standardization process becomes more important.  At the
same time it becomes more difficult, because users may exist in scattered
geographic and political communities, and in addition have a
geographically-distributed scholarly and bibliographic community (for dead
scripts, of course, only the latter).  The political community may be a
smaller unit than a country, and may even be oppressed by the country that
contains it.

In all cases where these communities have organized themselves to address the
matter of computer encoding, their input should be given especial weight,
particularly if it is embodied in specific implementations, and especially if
those are actually in use within the community.  The whole user community
should also be consulted for information on the script and its elements.  In
situations where community consensus is lacking (e.g., where an encoding
proposal arrives from a single source), assignment into the BMP should be
deferred until user consensus can be obtained.

Non-Goal: The BMP need not cover all entities in future standards

It is not necessary, though it may often be desirable, that all entities in
future international, national, and industry computer and communication
standards be included in the BMP.  The initial edition practice of covering
pre-existing standards was used as a means of evaluating established utility,
as well as ensuring compatibility with existing practice.  Entities contained
in new standards may or may not have proven utility, and may or may not
establish themselves in common usage.  Although new standards will continue
to be a valuable and important source of candidates for the BMP, inclusion in
a new standard will not in and of itself qualify an entity for encoding in
the BMP.


PART IV:  Overview of Potential Future Encodings

IV.A:  Candidate scripts and symbols for A-Zone allocation

This section will introduce specifics of future code collection and
allocation, building on the general principles discussed in the previous
section.  The approach is based on detailed listings of scripts and symbols
that are not encoded in 10646 BMP, hence not in Unicode 1.1.  A full listing
of scripts and symbol sets known to the Unicode Technical Committee (with
equivalent name aliases) is maintained by the Scripts & Symbols Subcommittee,
and is available on request.

There are three primary unallocated zones in the BMP:

rows 12-1D	192 columns	A-Zone: Alphabets
rows 28-2F	128 columns	A-Zone: Symbols
rows A0-DF	1024 columns	O-Zone
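
The column counts above follow from the row ranges at 16 columns per row.
The fragment below is a small illustrative sketch of that conversion; the
function name `columns_in_rows` is an assumption introduced here.

```python
COLUMNS_PER_ROW = 16  # a row is 256 cells and a column is 16 cells


def columns_in_rows(first_row, last_row):
    """Number of columns in an inclusive range of BMP rows."""
    return (last_row - first_row + 1) * COLUMNS_PER_ROW


zones = {
    "A-Zone: Alphabets": columns_in_rows(0x12, 0x1D),
    "A-Zone: Symbols": columns_in_rows(0x28, 0x2F),
    "O-Zone": columns_in_rows(0xA0, 0xDF),
}

for name, cols in zones.items():
    print(name, cols)
```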

Consistent with the BMP's design, most of the "prime candidates" among
remaining alphabetic characters will fit into the A-Zone Alphabets region,
and most of the prime candidates among remaining symbolic characters will fit
into the A-Zone Symbols region.  Therefore the discussion in this section will
focus on an approach for allocating alphabet and symbol candidates into the
A-Zone.

Planning for the O-Zone will be more difficult, and possible O-Zone
allocations are not directly addressed in this document.  However, the analysis
below of prime candidates for A-Zone encoding leads to a list of leftover
"secondary candidates" which fail to make the A-Zone cutoff but which should
be considered for O-Zone encoding.

IV.B:  Categories and listings of scripts

Given a listing of scripts, there are many dimensions along which the
candidates can be evaluated for BMP allocation.  Candidate scripts may be
living or extinct, may contain small or large numbers of characters, may
support a great or limited published literature, may be clearly-defined or
obscure, and so forth.  The following overview discussion compresses this
elaborate multi-dimensional analysis into a simple linear set of seven major
script categories, A through G.  This linear ordering admittedly does not do
justice to the details of each script's situation, but it does give an
approximate ranking of the candidates as is needed in order to undertake BMP
allocation.  The categories are as follows:

A	Contemporary
There exists a modern community of native users who produce new printed matter
in the script (newspapers, magazines, books, signs).  Examples: Burmese,
Maldivian, Syriac.

B	Specialized (Small)
There exists a limited community of users (e.g. liturgical) who produce new
printed material in the script.  Generally these scripts have few native
users, or are not in day-to-day use for ordinary communication.  Examples:
Javanese, Pahlavi, personal name ideographs.  (Large sets of this description
are moved to category F.)

C	Major Extinct (Small)
There exists a relatively large body of literature in the script, and a
relatively large scholarly community studying it.  Examples: Etruscan,
Linear B.  (Large sets of this description are moved to category F.)

D	Attested Extinct (Small)
There exists a relatively limited literature in the script, and a relatively
small scholarly community studying it.  Examples: Samaritan, Meroitic.  (Large
sets of this description are moved to category F.)

			    ---- A-Zone Cutoff Line ----

E	Minor Extinct
The utility of publicly encoding these scripts is open to question.  They may
be secondary candidates for encoding elsewhere on the BMP, or their limited
scholarly communities may wish to encode them in the Private Use Area.
Examples: Khotanese, Lahnda.

F	Hieroglyphic or Ideographic
The script has a large character set (10 or more columns, i.e. 160 or more
characters), which essentially means hieroglyphic or ideographic scripts.  A
large character set is almost by definition obscure, since it is difficult to
obtain information or agreement on the precise membership of the set.  The
following examples are ordered by the category to which they would otherwise
have belonged if they had had small character sets:
		(B) Lolo, Moso, Yi
		(C) Akkadian, Chu Nom, Egyptian Hieroglyphics
		(D) Hittite(Luwian), Khitan, Mayan Hieroglyphics, Nuchen

G	Obscure
The script is not deciphered or understood completely, or is not well attested
by substantial literature or scholarly community.  Its community of users, if
any, may wish to encode it in the Private Use Area.  Examples: Xixia,
Rongo-rongo, Osmanya.

The bottom-line result is that scripts in categories A through D will fit in
the Alphabets region of the A-Zone, whereas scripts in categories E through
G will not.

The detailed listing below is sorted first by category and then by script
name.  The letter code in the first column represents the script's Status in
the UTC Scripts Subcommittee register:

	P	a concrete proposal exists for the script and has been published
	R	the required repertoire is more-or-less completely known
	I	some information is known, but probably not the complete repertoire

There follows the number of Columns required for encoding the script.  A
number followed by a question mark is a "best estimate" based upon the
information currently available; otherwise the numbers are well established.
It has not been possible yet to get even estimated numbers for some scripts.
At the bottom of each category is an estimated total of columns required for
that category; this number includes only those scripts in the group whose size
is known or estimated.
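
The totalling rule just described can be sketched as follows: a bare "?" is
skipped, while a number followed by "?" is counted at its estimated value.
The fragment is illustrative only.  The helper names are assumptions, and the
sample data are the Cols fields of Categories B and D from the listing that
follows, in listing order.

```python
def parse_cols(field):
    """Return the column count in a Cols field, or None if no size is given.

    "6" is a well-established size, "4?" is a best estimate that is still
    counted, and a bare "?" means no estimate is available.
    """
    digits = field.rstrip("?")
    return int(digits) if digits else None


def category_total(fields):
    """Sum the sizes that are known or estimated, skipping the unknown ones."""
    sizes = (parse_cols(f) for f in fields)
    return sum(n for n in sizes if n is not None)


category_b = ["6", "2", "2", "6", "6", "?", "6", "6", "6", "6",
              "4", "4", "2", "2", "?"]
category_d = ["7", "3", "6", "?", "4?", "4", "2", "3", "2", "2?",
              "4", "6", "3", "6?", "3"]

print(category_total(category_b), category_total(category_d))
```

Because unsized scripts contribute nothing to the sum, each category total is
a lower bound on the space that category would eventually require.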

Scholars could doubtless find various faults with the categorization and
content estimates (particularly of the extinct scripts in D and E), and these
may be refined in the future as information becomes available.  But at the
moment these estimates offer the best available data for approaching the
problem of future code allocation.

Status	Cols	Name & Commentary

Category A
P	8	Burmese
R	6	Cree (Evans syllabic signs)
P	8	Ethiopian (see also Ge'ez)
R	4	Karenni (Kayah Li)
P	8	Khmer (Cambodian)
R	1	Lisu
P	3	Maldivian {RL} [Divehi]
P	8	Mongolian
-	4?	Pollard phonetic
P	8	Sinhala
P	5	Tai Lu
P	3	Tai Nua
P	6	Tibetan
P	3	Tifinagh
		75

Category B
I	6	Balinese
P	2	Batak
P	2	Buginese (Makassar)
P	6	Cherokee
P	6	Glagolitic (Glagolitsa)
-	?	Han Ideographs (personal names)
I	6	Hmong
P	6	Javanese
P	6	Lepcha (Rong)
P	6	Limbu (Indic type)
-	4	Maghreb	(see Arabic?)
P	4	Mangyan
P	2	Syriac (Nestorian, Estrangela, etc.)
P	2	Tagalog (Tagbanuwa, etc.)
-	?	Tamil Granta (extension to Tamil?)
		58

Category C
R	3	Ahom
P	2	Aramaic
P	3	Avestan (Pahlavi)
I	6	Brahmi (Asoka)
-	6	Cham
P	8	Cretan Linear B (see also Mycenaean)
R	2	Egyptian Hieroglyphic Basic Alphabet Only
P	3	Etruscan (+ Oscan) {RL}
R	4	Khamti
I	6	Kharoshthi
P	3	Old Persian cuneiform
P	2	Phoenician
R	6	Rejang
P	3	Runes (Germanic, Anglo-Saxon, Scandinavian)
R	4	Siddham
P	2	Ugaritic cuneiform
		63

Category D
R	7	Albanian (Buthakukye, Elbassan, Veso Bei's)
P	3	Balti {RL}
I	6	Box-headed script
-	?	Han Ideographs (archaic & rare)
-	4?	Mandaean {RL}
P	4	Manipuri (see also Bengali)
P	2	Meroitic {RL}
P	3	Numidian {BT or RL}
P	2	Ogham
-	2?	Parthian
R	4	'Phags-pa
R	6	Pyu
R	3	Samaritan
-	6?	Satavahana
P	3	South Arabian {RL}
		55

Category E
-	6?	Chakma
-	6?	Chola
-	6?	Ge'ez (see also Ethiopian)
-	2?	Iberian
-	6?	Kaithi
-	6?	Khotanese
-	4?	Kok Turki runes (Orkhon) + Old Hungarian (descendant)
-	6?	Kuoyu
-	6?	Lahnda
R	2	Lycian {RL}
R	2	Lydian {RL}
-	6?	Manchu
-	4?	Manichaean
-	6?	Modi
-	3?	Sogdian (Uzbekistan)
-	6?	Tankri
-	4?	Uighur (see Mongolian?)
		81

Category F
I	32?	Lolo
I	85?	Moso (Nasi) ideograms
-	32?	Moso phonetic
R	75?	Yi
-	320?	Akkadian (Assyrian, Babylonian, Sumerian, etc.)
R	144?	Chu Nôm [Vietnamese, Annamese]
R	320?	Egyptian Hieroglyphic (+Demotic, Hieratic)
I	12	Hittite hieroglyphic & syllabic (Luwian)
-	315?	Khitan (Liao, Khidan)
-	64?	Mayan hieroglyphics
-	310?	Nuchen (Jurcen, Ju-chen, Niu-chih)
		1709

Category G
-	64?	Aymara
-	64?	Aztec pictograms
-	40?	Bamum (Cameroon)
-	?	Carian
R	3	Chinook (form of shorthand)
R	6	Cretan Linear A
R	4	Cypriote syllabary
I	4?	Cypro-Minoan (Enkomi+Ugarit)
R	3	Deseret (Mormon)
-	?	Han Ideographs (hapax legomena)
-	32?	Indus Valley
-	?	Jindai (Shinto, Japan)
-	32?	Kauder (Micmac Indians)
I	4	Osmanya (Somalian)
-	31?	Paucartambo
-	3?	Phaistos disk script
-	6?	Proto-Byblic
-	32?	Proto-Elamic
-	25?	Rongo-rongo (Easter Island script)
-	?	Sidetic
-	64?	Vai (Liberia)
-	8?	Woleai (Caroline)
I	315?	Xixia (Tangut)
		74		Total: 2648
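
To give a feel for the size of the demand above the cutoff line, the short
Python sketch below adds up the category totals for A through D as printed in
the listing and converts them to cells and rows.  The variable names are
illustrative only, and the figures inherit all the uncertainty of the
estimates on which the totals are based.

```python
CELLS_PER_COLUMN = 16   # one column holds 16 cells
COLUMNS_PER_ROW = 16    # one row holds 16 columns (256 cells)

# Category totals, in columns, as printed in the listing above.
category_columns = {"A": 75, "B": 58, "C": 63, "D": 55}

a_to_d_columns = sum(category_columns.values())
a_to_d_cells = a_to_d_columns * CELLS_PER_COLUMN
a_to_d_rows = a_to_d_columns / COLUMNS_PER_ROW

print(a_to_d_columns)  # 251 columns
print(a_to_d_cells)    # 4016 cells
print(a_to_d_rows)     # 15.6875, i.e. a little under 16 rows
```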

IV.C:  Categories and listings of symbols

The word "symbol" in various contexts is applied to many entities that should
not necessarily receive character codes for use in encoding inline text
sequences.  A set of rough guidelines has been developed for when a graphic
"symbol" might be a candidate for character code assignment:

Guidelines for inclusion:

 * If the symbol is commonly used in inline text

 * If the symbol itself has a name, e.g. "ampersand", "hammer-and-sickle",
   "caduceus"

Guidelines for exclusion:

 * If the symbol is not normally used in inline text

 * If the symbol is merely a drawing (stylized or not) of something, e.g. this
   is intended to exclude pictures of cows, dragons, etc.

 * If the symbol is "pictographic", i.e. it represents merely that which it is
   a picture of

 * If the symbol is usually used only in 2-Dimensional diagrams, e.g. circuit
   components, weather chart symbols

 * If the symbol is composable, e.g. a slash through some other symbol
   indicating negation

Given this initial set of filters, the categories for symbols are parallel to
a subset of the categories for alphabets:

A	Contemporary
There exists a modern computer or typographic community of users who produce
new printed matter (newspapers, magazines, books, signs) using the symbols in
inline text.

B	Specialized
There exists a limited community of users (e.g. technical) who produce printed
material using the symbols in inline text.  Generally these symbols have few
users, or are not in day-to-day use for ordinary communication.

			    ---- A-Zone Cutoff Line ----

E	Minor
The utility of publicly encoding these symbols is open to question.  The very
limited community using them may wish to encode them via the Private Use Area.
Example: Bliss "Semantography".

G	Questionable usage in inline text
This set corresponds to the "Guidelines for exclusion" above.  There are so
many non-candidates in this category that it is broken down into subcategories
in the detailed listing that follows.

Status	Cols	Name & Commentary

Category A
-	44?	SGML mapped (IEC TR 9573-13?)
-	4	Control chars (ISO 2047)
-	4	Keyboards (ISO 9995-7)
-	?	Existing industry fonts, TeX
-	4	CJK Related Symbols
		56

Category B
-	10	Videotex
-	?	More technical, currency, fractions, etc
-	?	More geometrics, dingbats/pi
		10

Category E
-	6	Bliss "Semantography"
-	?	Corporate collections
		6

Category G
Just-plain-symbols
-	19	Manufacturing
-	16	Agriculture
-	13	Business
-	13	Communications
-	10	Medicine, pharmacy
-	?	Many other fields...
-	?	Logotypes (not encoded)
Symbols from signs & markings
-	32	Traffic
-	16	Controls
-	13	Travel
-	13	Sports & recreation
-	7	Shop signs
-	7	Hospitals
-	7	Safety
-	4	Religion
-	4	Handling of goods
-	4	Hobo signs
Two Dimensional diagrammatic languages
-	16	Cartographic incl. military, geological
-	16	Tech drawing, electronics, flowcharts
-	16	Architecture
-	7	Alchemy
-	7	Meteorology
-	20	Western music notation
-	?	Non-western music notations
-	?	Chemical formulae
-	?	Biology
-	?	Astronomy & astrology
-	?	Notations, e.g. choreography

Arbitrary graphics (endless collections)
-	?	Patterns & borders
-	?	Pictographic drawings of things
Miscellaneous
-	4	Gestures
-	?	More enclosed (circled, parenthesized, etc.)
		264

Comments:  The "44? columns" for SGML mapped is an estimate of the SGML
repertoire not already encoded.  The CJK Related Symbols are best assigned in
their own region, rows 30-33.  It remains to be decided whether Videotex is a
reasonable candidate for encoding.

The bottom-line result is that symbols in categories A and B will fit in the
Symbols region of the A-Zone.  Symbols in categories E and G need not be
considered for encoding.

IV.D:  Suggestive Allocations of Scripts and Symbols in the A-Zone

The final goal of all the preceding analysis is to generate an actual
allocation of prime candidate scripts and symbols into the remaining A-Zone
space.  The listing below presents a suggestive sample of how
such an allocation might look.  This "strawman" is intended mainly as a focus
of discussion and not as a definitive solution, especially as regards the
actual allocation of cells.

The proposal was generated simply by beginning with the A category and
allocating more-or-less linearly through the B, C, and D categories.  Rarely
is a script split across more than one row.  Right-to-left scripts are
generally placed in a special area reserved for them in rows 07-08.  Space is
also allocated in rows 2C-2F for the so-called Low Half Zone of the extension
technique described in section II.B.

Each line in the listing represents a row of 16 columns, or 256 cells.
Newly-proposed script allocations are greatly indented, and followed by a
colon and the number of columns consumed by the script.  The number of
resulting empty columns is indicated for convenience in square brackets at
the ends of some lines.

Row	Currently Assigned Contents	Suggested Allocations

00	ASCII, 8859-1
01	European Latin, Extended Latin
02	Std Phonetic, Mod letters	Lisu:1
03	General diac, Greek
04	Cyrillic
05	Armenian, Hebrew [+3]
06	Arabic
07		(RL) Maldivian:3 Maghreb:4 Numidian:3
			(RL) Syriac:2 Aramaic:2 Samaritan:2
08		(RL) Phoenician:2 Etruscan:3 Balti:3
			(RL) Meroitic:2 Parthian:2 [+4]
09	Devanagari, Bengali
0A	Gurmukhi, Gujarati
0B	Oriya, Tamil
0C	Telugu, Kannada
0D	Malayalam	Sinhala:8
0E	Thai, Lao
0F		Burmese:8 Khmer:8
10	Georgian	Tibetan:8
11		Mongolian:4 Ethiopian:8 Karenni:4
12		Cree:6 Pollard:4 TaiLu:5 TaiNua:1
13		TaiNua:2 Tifinagh:3 Cham:6 Runes:3
			EgyptianAlpha:2
14		Bali:6 Java:6 Batak:2 Buginese:2
15		Cherokee:6 Glagolitic:6 Hmong:4
16		Hmong:2 Lepcha:6 Limbu:6 Tagalog:2
17		Mangyan:4 Avestan:3 Brahmi:6 Ahom:3
18		Khamti:3 Kharoshthi:6 Rejang:6 [+1]
19		Siddham:4 Ugaritic:2 OldPersian:3
			'PhagsPa:4 SouthArabian:3
1A		Albanian:7 BoxHead:6 Ogham:2 [+1]
1B		LinearB:8 Pyu:6 [+2]
1C		Manipuri:4 Satavahana:6 Mandaean:4 [+2]
1D		[+16]
1E	Latin & Greek precomposed
1F	Latin & Greek precomposed
20	Punc, Sup/Sub, Crncy, Diac
21	Syms, Number Forms, Arrows
22	Math operators
23	Misc Technical	MiscSymbols:13
24	Control Pix, OCR, Encl Alpha
25	Form & Cht, Geometrics
26	Misc Dingbats	Dingbats:9
27	Zapf Dingbats	Dingbats:4
28		SGML:16
29		SGML:16
2A		SGML:12 Control chars:4
2B		Keyboards:4 Videotex:10 [+2]
2C		Extension Low Half Zone:16
2D		Extension Low Half Zone:16
2E		Extension Low Half Zone:16
2F		Extension Low Half Zone:16
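
The row convention used in this listing (16 columns per row, with any unused
columns noted in square brackets) lends itself to a mechanical check.  The
Python sketch below is one possible way to do it, applied to a few rows that
have no currently assigned contents.  The function name `columns_used` and the
choice to join continuation lines into a single string are assumptions of this
example.

```python
import re

COLUMNS_PER_ROW = 16


def columns_used(allocation):
    """Add up the Name:N column counts plus any [+N] empty-column note."""
    allocated = sum(int(n) for n in re.findall(r":(\d+)", allocation))
    empty = sum(int(n) for n in re.findall(r"\[\+(\d+)\]", allocation))
    return allocated + empty


# Suggested allocations copied from the listing above; a row that spans
# two lines in the listing is joined into one string here.
SAMPLE_ROWS = {
    "08": "Phoenician:2 Etruscan:3 Balti:3 Meroitic:2 Parthian:2 [+4]",
    "12": "Cree:6 Pollard:4 TaiLu:5 TaiNua:1",
    "1A": "Albanian:7 BoxHead:6 Ogham:2 [+1]",
    "2B": "Keyboards:4 Videotex:10 [+2]",
}

for row, allocation in SAMPLE_ROWS.items():
    used = columns_used(allocation)
    status = "ok" if used == COLUMNS_PER_ROW else "check"
    print(row, used, status)
```

Rows that already carry assigned contents (for example 0D or 10) cannot be
checked this way, because the listing does not state how many columns the
existing contents occupy.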


Plaintext version of Figure 1: ISO 10646 BMP / Unicode Codespace Allocations

(in this rendition of the figure, 1 character = 2048 cells = 128 columns)
(lengths of bars are not entirely to scale, due to roundoff error)

        0   2   4   6   8   A   C   E  F
        |       |       |       |       |
    ----|-------------------------------|----
    C   |                               |
    u   | A-Zone  | I-Zone  |O-Zone | R |
    r   |                               |
    r   |XXaXaXXXXXXXXXXXXXXooooooooRRRR|
    e   |                               |
    n   |RRRXXXXXXXXXXXXXXXXXXaaoooooooo|
    t   |                               |
    ----|-------------------------------|----------------------------
    P   |                               |
    r   |                               |
    o   |RRRXXXXXXXXXXXXXXXXXX          |
    j   |                     pp        |
    e   |                       dddddddddddd
    c   |                               |   hhhhhhhhhhh
    t   |                               |              +++++++++++...
    e   |                               |
    d   |                               |
    ----|-------------------------------|----------------------------
    P   |                               |
    r   |                               |
    o   |RRRXXXXXXXXXXXXXXXXXX          |   m   m   m   m   m   m   m
    p   |                     pp        |
    o   |                       e       |   m   m   m   m   m   m   m
    s   |                        sssssss|
    e   |                               |
    d   |                               |
    ----|-------------------------------|----------------------------

Key:

    R = Restricted
    X = Assigned
    a = A-Zone unassigned
    o = O-Zone unassigned

    p = Prime Candidate Alphabets and Symbols
    d = Raised in Standards Documents
    h = Hieroglyphic & Non-Alphabetic Scripts
    + = Other (e.g. added Han, Hangul, Presentation Forms)

    e = Reserved: Extension Half-Code Values
    s = Carefully Selected High-Usage Remainder
    m = > 1,000,000 more in blocks of 1,024 (on supplemental planes)
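
The scale of the figure and the size of the extension space can be worked
through in a few lines of Python.  This is only a sketch: the constant names
are invented for the example, and the assumption that the High Half Zone is
the same size as the Low Half Zone suggested for rows 2C-2F is made here for
illustration, since the extension technique itself is described in section
II.B rather than in this listing.

```python
CELLS_PER_COLUMN = 16
COLUMNS_PER_ROW = 16
CELLS_PER_ROW = CELLS_PER_COLUMN * COLUMNS_PER_ROW   # 256 cells
BMP_CELLS = 2 ** 16                                  # 16-bit codespace

# Scale of the figure: 1 character = 2048 cells.
CELLS_PER_FIGURE_CHAR = 2048
columns_per_figure_char = CELLS_PER_FIGURE_CHAR // CELLS_PER_COLUMN
nominal_bar_width = BMP_CELLS // CELLS_PER_FIGURE_CHAR

# Low Half Zone as suggested in IV.D: rows 2C through 2F.
low_half_rows = 0x2F - 0x2C + 1
low_half_cells = low_half_rows * CELLS_PER_ROW

# ASSUMPTION: the High Half Zone has as many cells as the Low Half Zone.
high_half_cells = low_half_cells
extension_codes = high_half_cells * low_half_cells

print(columns_per_figure_char)  # 128 columns per figure character
print(nominal_bar_width)        # 32 characters for the whole BMP
print(low_half_cells)           # 1024 cells
print(extension_codes)          # 1048576
```

Under that assumption each high half value selects one block of 1,024 codes,
and the product is consistent with the "> 1,000,000 more in blocks of 1,024"
entry in the key.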