Unicode for ancient languages

From The Digital Classicist Wiki


Unicode is the de facto standard for the consistent encoding, representation, and handling of text in most of the world's writing systems. Maintained by a nonprofit organization, the Unicode Consortium, it is the basis on which we can create and edit text in mixed alphabets and reliably share that data with other people, changes in fonts notwithstanding. That is, any Unicode-compliant text remains constant no matter what font is used to display it. If software tries to display Unicode-compliant text in a font that does not support a particular alphabet and ends up showing boxes, the underlying data is still fine; switching to a font that does support the alphabet will show this to be the case.

With more than 110,000 characters, Unicode is as complex as human writing itself, and so it needs to be organized. Because computers work on the binary system, it was considered ideal to number the characters in Unicode in hexadecimal notation, which uses the digits 0 through 9 and the letters A through F. (Decimal 10 is hexadecimal A; decimal 16 = hex 10; decimal 79 = hex 4F.) It is therefore helpful to think of Unicode as a very long ribbon sixteen characters wide. That ribbon is divided into Unicode blocks, each one corresponding, more or less, to a particular alphabet. (This ribbon image is illustrated nicely here.) Even more important, it is common to refer to Unicode characters by the assigned hexadecimal value in four or more digits, prefaced by "U+", and to state the official name of the Unicode character fully in uppercase. Thus U+0020 is SPACE, U+00AE is ®, the REGISTERED SIGN, and U+044E is ю, CYRILLIC SMALL LETTER YU.
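The "U+" notation and the official names can be reproduced with Python's standard unicodedata module. The following is a minimal sketch; the helper name describe is only an illustration and not part of any library.

```python
import unicodedata


def describe(ch: str) -> str:
    """Format one character as 'U+XXXX OFFICIAL NAME'."""
    # ord() gives the code point as an integer; 04X prints it in
    # uppercase hexadecimal, zero-padded to at least four digits.
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# The three examples from the paragraph above.
for ch in (" ", "®", "ю"):
    print(describe(ch))

# Going the other way: from a hexadecimal value back to the character.
print(chr(0x044E) == "ю")   # True
print(format(79, "X"))      # 4F
```

Characters outside the first 65,536 code points simply print with five or six hexadecimal digits, which matches the "four or more digits" convention.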

Special Features in Unicode Classicists Should Know

Because writing systems from antiquity have been introduced to Unicode in phases, there are special exceptions or behaviors that classicists should be aware of.

The Greek alphabet is split principally between two major blocks: Greek and Coptic U+0370..03FF and Greek Extended U+1F00..1FFF. The first block retains its original name, but the Coptic alphabet has been given its own Unicode block: Coptic U+2C80..2CFF. Other Unicode blocks that carry Greek characters:

  • Ancient Greek Numbers U+10140..1018F (note that the end of this block includes papyrological characters that are not numbers)
  • Linear B Syllabary U+10000..1007F
  • Linear B Ideograms U+10080..100FF
  • Aegean Numbers U+10100..1013F
  • Byzantine Musical Symbols U+1D000..1D0FF
  • General Punctuation U+2000..206F (note especially the punctuation marks toward the end)
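To check which of these blocks a given character falls into, a small lookup table is enough. The sketch below copies the ranges listed above by hand; the names GREEK_RELATED_BLOCKS and block_of are hypothetical, and the table covers only the blocks mentioned here, not the whole of Unicode.

```python
from typing import Optional

# (first, last, block name), copied by hand from the ranges listed above.
GREEK_RELATED_BLOCKS = [
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x1F00, 0x1FFF, "Greek Extended"),
    (0x2000, 0x206F, "General Punctuation"),
    (0x2C80, 0x2CFF, "Coptic"),
    (0x10000, 0x1007F, "Linear B Syllabary"),
    (0x10080, 0x100FF, "Linear B Ideograms"),
    (0x10100, 0x1013F, "Aegean Numbers"),
    (0x10140, 0x1018F, "Ancient Greek Numbers"),
    (0x1D000, 0x1D0FF, "Byzantine Musical Symbols"),
]


def block_of(ch: str) -> Optional[str]:
    """Return the block name for ch, or None if it is not in the table."""
    cp = ord(ch)
    for first, last, name in GREEK_RELATED_BLOCKS:
        if first <= cp <= last:
            return name
    return None


print(block_of("\u03B1"))      # alpha: Greek and Coptic
print(block_of("\u1F00"))      # alpha with psili: Greek Extended
print(block_of("\U00010000"))  # first Linear B syllabogram
print(block_of("a"))           # None, Latin is not in this table
```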

If you are using combining characters, these are generally drawn from the Combining Diacritical Marks block, U+0300..036F. The same applies to common punctuation, which comes from the general-purpose blocks rather than from the Greek ones.
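One way to see which combining characters sit behind a precomposed Greek letter is to decompose it with Unicode normalization form NFD. This is a minimal sketch using Python's standard unicodedata module; the function name combining_marks is an assumption made for illustration.

```python
import unicodedata


def combining_marks(text: str) -> list:
    """Return the code points of all combining marks in text after NFD."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero combining class for marks
    # and 0 for base characters.
    return [ord(c) for c in decomposed if unicodedata.combining(c)]


# U+1F04 is GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA.
for cp in combining_marks("\u1F04"):
    in_block = 0x0300 <= cp <= 0x036F
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))} in U+0300..036F: {in_block}")
```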

A false distinction was introduced to Unicode between the oxia (acute) and tonos, resulting in wrongly duplicated code points. See Greek Unicode duplicated vowels for a full discussion.
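In practice the duplication means that two strings can look identical and still fail a plain equality test. The sketch below, which assumes Python's standard unicodedata module and a hypothetical helper same_text, compares strings after normalizing both to NFC; whether normalizing is appropriate for a given project is a separate editorial decision.

```python
import unicodedata

ALPHA_TONOS = "\u03AC"  # GREEK SMALL LETTER ALPHA WITH TONOS (Greek and Coptic)
ALPHA_OXIA = "\u1F71"   # GREEK SMALL LETTER ALPHA WITH OXIA (Greek Extended)


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(ALPHA_TONOS == ALPHA_OXIA)           # False: different code points
print(same_text(ALPHA_TONOS, ALPHA_OXIA))  # True: equal once normalized
```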

The Greek question mark is simply the common semicolon (U+003B). There is a separate GREEK QUESTION MARK, U+037E, but the Unicode Character Database gives it a canonical decomposition to U+003B, so normalization replaces it with the semicolon.
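The relationship between the two code points can be inspected directly. This short sketch relies only on Python's standard unicodedata module; the sample word is arbitrary.

```python
import unicodedata

GREEK_QUESTION_MARK = "\u037E"

# The decomposition field of the character database points at U+003B.
print(unicodedata.name(GREEK_QUESTION_MARK))
print(unicodedata.decomposition(GREEK_QUESTION_MARK))

# Any normalization form therefore replaces U+037E with the semicolon.
question = "\u03C4\u03AF" + GREEK_QUESTION_MARK  # tau, iota with tonos, question mark
normalized = unicodedata.normalize("NFC", question)
print(normalized.endswith(";"))             # True
print(GREEK_QUESTION_MARK in normalized)    # False
```

A consequence worth keeping in mind is that text which has passed through a normalizing tool will no longer contain U+037E at all, so searches should look for the semicolon.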

Unicode introduced a distinction between uppercase and lowercase numerals for six, ninety, and nine hundred: U+03DA, U+03DB (Ϛϛ), and U+03DE..U+03E1 (ϞϟϠϡ). There are no rules that dictate which form is preferred.
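Since no rule dictates which form to prefer, a project may simply want to pick one and apply it consistently. The following sketch folds the three uppercase numeral signs to their lowercase counterparts and leaves everything else untouched; the table NUMERAL_SIGNS and the function to_lowercase_numerals are hypothetical names, and the choice of lowercase is arbitrary.

```python
import unicodedata

# numeric value -> (uppercase form, lowercase form)
NUMERAL_SIGNS = {
    6: ("\u03DA", "\u03DB"),    # stigma
    90: ("\u03DE", "\u03DF"),   # koppa
    900: ("\u03E0", "\u03E1"),  # sampi
}

# Translation table mapping each uppercase sign to its lowercase form.
_TO_LOWER = {ord(upper): lower for upper, lower in NUMERAL_SIGNS.values()}


def to_lowercase_numerals(text: str) -> str:
    """Replace uppercase stigma, koppa and sampi; leave other characters as is."""
    return text.translate(_TO_LOWER)


for value, (upper, lower) in NUMERAL_SIGNS.items():
    print(value, unicodedata.name(upper), "/", unicodedata.name(lower))
```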

Standardization of Glyphs Not in Unicode

If you think a glyph deserves to be included in Unicode but are not certain, it is best to start with the Unicode discussion list and Deborah Anderson at Berkeley.

If a new character is in order and you need to create a proposal, you may wish to study how other proposals have been developed.

  • The Thesaurus Linguae Graecae has made several Unicode proposals for the encoding of Ancient Greek characters and symbols (several TLG proposals online).
  • The EAGLE committee has also made recommendations for Latin epigraphic symbols.

If a glyph is not considered to have the merit of being included in Unicode, but is thought important for specialized fonts, you may wish to study how others have been designed and developed:

  • David Perry (creator of the Cardo font) has also done work in this area.
  • MUFI (the Mediaeval Unicode Font Initiative) is a group with related concerns.
  • Athena Ruby, from Dumbarton Oaks, has encoded a number of specialized symbols and characters by tying the glyphs not only to the Private Use Area but also to their proper code points (where they exist).
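Text that relies on such fonts may contain Private Use Area characters, which have no meaning outside the font that defines them. A quick way to locate them, sketched here with Python's standard unicodedata module and a hypothetical function name, is to look for the general category "Co" (private use).

```python
import unicodedata


def private_use_chars(text: str) -> list:
    """Return (index, 'U+XXXX') for every private-use character in text."""
    return [
        (i, f"U+{ord(c):04X}")
        for i, c in enumerate(text)
        if unicodedata.category(c) == "Co"  # "Co" = Other, private use
    ]


# U+E000 is the first code point of the Private Use Area.
print(private_use_chars("ab\uE000"))
print(private_use_chars("\u03B1\u03B2\u03B3"))
```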

Other Resources