LiLa: Linking Latin
- Marco Passarotti
The LiLa: Linking Latin project (2018-2023) is building a Linked Data Knowledge Base of Linguistic Resources and Natural Language Processing (NLP) tools for Latin. LiLa collects and connects both existing and newly-generated (meta)data. The former are mostly linguistic resources (corpora, lexica, ontologies, dictionaries, thesauri) and NLP tools (tokenisers, lemmatisers, PoS-taggers, morphological analysers and dependency parsers) for Latin. These are currently available from different providers under different licences. As for newly-generated (meta)data, LiLa assesses a set of selected linguistic resources by expanding their lexical and/or textual coverage. In particular, LiLa (a) enhances a large amount of Latin texts with PoS-tagging and lemmatisation, (b) harmonises the annotation of the three Universal Dependencies treebanks for Latin, (c) improves the lexical coverage of the Latin WordNet and the valency lexicon Latin-Vallex, and (d) expands the textual coverage of the Index Thomisticus Treebank. Furthermore, LiLa builds a set of newly-trained models for PoS-tagging and lemmatisation, and works on developing and testing the best performing NLP pipeline for such a task. Connections between datasets are edges labelled with a restricted set of values (metadata) taken from a vocabulary of knowledge description.
LiLa meets the so-called FAIR Guiding Principles for scientific data management and stewardship, which state that scholarly data must be Findable, Accessible, Interoperable and Reusable.