Collations for Ancient Languages in XSLT and XQuery

From The Digital Classicist Wiki
Revision as of 15:10, 18 May 2017 by GabrielBodard (Talk | contribs)

When writing XSLT stylesheets and XQuery queries, classicists often need to alphabetize their material in an order determined by the language or by other considerations. For example, scholars working with Latin may wish to conflate i with j and u with v, and those working with Greek may wish to have ϙ (qoppa) collated in the alphabet, or to include characters that fall outside the Greek and Coptic and Greek Extended Unicode blocks.

The best solution is to use XSLT 3.0 (https://www.w3.org/TR/xslt-30/), XQuery 3.0 (https://www.w3.org/TR/xquery-30/), or XQuery 3.1 (https://www.w3.org/TR/xquery-31/), which all support the Unicode Collation Algorithm as specified in the Functions and Operators specification (https://www.w3.org/TR/xpath-functions-31/#uca-collations). Earlier W3C recommendations on XSLT (1.0, 2.0) and XQuery (1.0) provide for language-sensitive sorting through attributes such as @collation, but they leave it to individual transformation engines to decide how specific collations are constructed and retrieved, so collation identifiers written for one engine may not work in another.
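As a minimal sketch of what this looks like in practice, the stylesheet below sorts a few Greek words with a UCA collation URI of the form defined in the Functions and Operators 3.1 specification linked above. The sample words and the choice of the lang=el and strength=primary parameters are illustrative assumptions, not part of the original page; primary strength is chosen so that accents, breathings, and case are ignored when comparing.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Entry point for running the stylesheet without a source document -->
  <xsl:template name="xsl:initial-template">
    <!-- Illustrative sample data (an assumption for this sketch) -->
    <xsl:variable name="gr" select="('ὦμος', 'ἀνήρ', 'Ζεύς', 'βίος')"/>
    <xsl:for-each select="$gr">
      <!-- UCA collation: Greek language rules, primary strength
           (base letters only; diacritics and case are ignored) -->
      <xsl:sort select="."
          collation="http://www.w3.org/2013/collation/UCA?lang=el;strength=primary"/>
      <xsl:value-of select=". || '&#10;'"/>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

With a processor that supports this collation URI, the words should come out in the order alpha, beta, zeta, omega. The stylesheet has to be invoked through its initial template (with Saxon, for instance, via the -it option).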

Examples of XSLT/XQuery Collations

Greek

Latin
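
Conflating i with j and u with v can be approximated without any engine-specific collation by sorting on a normalized key. The following is only a sketch under stated assumptions: the input structure (a words element containing word children) is hypothetical, and the key is a plain fn:translate() mapping compared by code point rather than a true collation.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical input: <words><word>...</word>...</words> -->
  <xsl:template match="/words">
    <xsl:for-each select="word">
      <!-- Sort key: lower-case the word, then map j to i and v to u.
           The codepoint collation is named explicitly because the
           default collation is implementation-defined. -->
      <xsl:sort select="translate(lower-case(.), 'jv', 'iu')"
          collation="http://www.w3.org/2005/xpath-functions/collation/codepoint"/>
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Under this key "Jam" and "iam" compare as equal, as do "vinum" and "uinum", while the output keeps the original spelling of each word.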

Syriac

  • Syriac Reference Portal: romanized transliteration scheme: definition and application -- intended to work with the Saxon engine.

Alternatives

  • It is possible to use the fn:translate() function as a processor-independent collation method. This method maps each character to the code point of its base letter and relies on sorting by code point to alphabetize. Here is an example of how to sort, case-insensitively, a sequence of Greek words stored in the variable $gr (line breaks have been introduced into the two long strings to improve display on the screen; when using this code, remove all line breaks inside the select attribute):
<sort select="translate($gr,'ἀἁἂἃἄἅἆἇἈἉἊἋἌἍἎἏὰάᾀᾁᾂᾃᾄᾅᾆᾇᾈᾉᾊᾋᾌᾍᾎᾏᾰᾱᾲᾳᾴᾶᾷᾸᾹᾺΆᾼΆΑάαΒβϐΓγΔδἐἑἒἓἔ
ἕἘἙἚἛἜἝὲέῈΈΈΕέεϵ϶ΖζἠἡἢἣἤἥἦἧἨἩἪἫἬἭἮἯὴήᾐᾑᾒᾓᾔᾕᾖᾗᾘᾙᾚᾛᾜᾝᾞᾟῂῃῄῆῇῊΉῌͰͱΉΗήηΘθϑϴἰἱἲἳἴἵἶἷἸἹἺἻἼἽἾἿὶίῐῑ
ῒΐῖῗῘῙῚΊΊΐΙΪίιϊϳΚκϏϗϰΛλΜμΝνΞξὀὁὂὃὄὅὈὉὊὋὌὍὸόῸΌΌΟοόΠπϺϻῤῥῬΡρϱϼΣςσϲϹϽϾϿΤτὐὑὒὓὔὕὖὗὙὛὝὟὺύῠῡῢΰῦῧῨῩ
ῪΎΎΥΫΰυϋύϒϓϔΦφϕΧχΨψὠὡὢὣὤὥὦὧὨὩὪὫὬὭὮὯὼώᾠᾡᾢᾣᾤᾥᾦᾧᾨᾩᾪᾫᾬᾭᾮᾯῲῳῴῶῷῺΏῼΏΩωώϖϚϛϜϝϞϟϘϙͲͳϠϡϷϸϢϣϤϥϦϧϨϩϪϫϬϭ
Ϯϯ᾽ι᾿῀῁῍῎῏῝῞῟῭΅`´῾ʹ͵Ͷͷͺͻͼͽ;΄΅·',
'ααααααααααααααααααααααααααααααααααααααααααααααααααβββγγδδεεεεεεεεεεεεεεεεεεεεεεζζηηηηηηηηηη
ηηηηηηηηηηηηηηηηηηηηηηηηηηηηηηηηηηηηηηθθθθιιιιιιιιιιιιιιιιιιιιιιιιιιιιιιιιιιιικκκκκλλμμννξξο
οοοοοοοοοοοοοοοοοοοππϻϻρρρρρρρσσσσσσσσττυυυυυυυυυυυυυυυυυυυυυυυυυυυυυυυυυυφφφχχψψωωωωωωωωωωω
ωωωωωωωωωωωωωωωωωωωωωωωωωωωωωωωωωωωω ϛϛϝϝϟϟϙϙϠϠϡϡϸϸϣϣϥϥϧϧϩϩϫϫϭϭϯϯ')"/>
  • A simpler expression, which should produce broadly the same result, normalizes the string to decomposed Unicode (NFD) and then strips out the combining diacritics:
lower-case(translate(normalize-unicode($gr,'NFD'),
'&#x0300;&#x0301;&#x0304;&#x0306;&#x0308;&#x0313;&#x0314;&#x0342;&#x0345;',''))
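
To show how the decompose-and-strip approach might be wired into a sort, here is a hedged sketch of a complete XSLT 2.0 stylesheet. The function name ex:greek-key, its namespace, and the input structure (a words element containing word children) are hypothetical names introduced for illustration. The list of marks includes the macron and breve (U+0304, U+0306) so that vowels such as ᾱ and ᾰ reduce to α, and final sigma is mapped to σ; both are choices made for this sketch.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ex="http://example.org/ns/sort-key"
    exclude-result-prefixes="xs ex">

  <xsl:output method="text"/>

  <!-- Hypothetical helper: builds a sort key from a Greek word.
       1. normalize-unicode(..., 'NFD') splits letters from their diacritics.
       2. translate() maps final sigma to sigma (the first character of each
          string) and deletes the combining marks, which have no counterpart
          in the third argument.
       3. lower-case() removes case distinctions. -->
  <xsl:function name="ex:greek-key" as="xs:string">
    <xsl:param name="word" as="xs:string"/>
    <xsl:sequence select="lower-case(
        translate(normalize-unicode($word, 'NFD'),
          'ς&#x0300;&#x0301;&#x0304;&#x0306;&#x0308;&#x0313;&#x0314;&#x0342;&#x0345;',
          'σ'))"/>
  </xsl:function>

  <!-- Hypothetical input: <words><word>...</word>...</words> -->
  <xsl:template match="/words">
    <xsl:for-each select="word">
      <!-- Compare keys by code point; naming the collation explicitly
           avoids depending on the processor's default collation -->
      <xsl:sort select="ex:greek-key(.)"
          collation="http://www.w3.org/2005/xpath-functions/collation/codepoint"/>
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Unlike the long fn:translate() table, this key does not fold variant letter forms such as ϐ or ϑ into their standard counterparts, so the two methods can differ on texts that use those characters.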