LiLa: Linking Latin
- Marco Passarotti
The LiLa: Linking Latin project (2018-2023) is building a Linked Data Knowledge Base of linguistic resources and Natural Language Processing (NLP) tools for Latin. LiLa collects and connects both existing and newly-generated (meta)data. The former are mostly linguistic resources (corpora, lexica, ontologies, dictionaries, thesauri) and NLP tools (tokenisers, lemmatisers, PoS-taggers, morphological analysers and dependency parsers) for Latin, which are currently available from different providers under different licences.
As for newly-generated (meta)data, LiLa enriches a set of selected linguistic resources by expanding their lexical and/or textual coverage. In particular, LiLa (a) annotates a large number of Latin texts with PoS-tagging and lemmatisation, (b) harmonises the annotation of the three Universal Dependencies treebanks for Latin, (c) improves the lexical coverage of the Latin WordNet and of the valency lexicon Latin-Vallex, and (d) expands the textual coverage of the Index Thomisticus Treebank. Furthermore, LiLa trains a set of new models for PoS-tagging and lemmatisation, and works on developing and testing the best-performing NLP pipeline for these tasks.
Connections between datasets are represented as edges, each labelled with a value (metadata) taken from a restricted vocabulary for knowledge description.
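The idea of connecting datasets through edges whose labels come from a restricted vocabulary can be illustrated with a minimal Python sketch. It is not LiLa's implementation: the class `KnowledgeGraph`, the labels in `ALLOWED_LABELS` and the node identifiers such as `token:amat` are all hypothetical names chosen for illustration, and a real Linked Data knowledge base would use RDF and published ontologies rather than plain Python objects.

```python
from typing import NamedTuple


# Hypothetical edge labels for illustration only; not LiLa's actual vocabulary.
ALLOWED_LABELS = frozenset({"hasLemma", "hasPartOfSpeech", "occursIn"})


class Edge(NamedTuple):
    """A labelled, directed connection between two nodes."""
    source: str
    label: str
    target: str


class KnowledgeGraph:
    """A toy graph that only accepts edge labels from a fixed vocabulary."""

    def __init__(self, allowed_labels):
        self.allowed_labels = frozenset(allowed_labels)
        self.edges = set()

    def connect(self, source, label, target):
        # Reject any label that is not part of the restricted vocabulary.
        if label not in self.allowed_labels:
            raise ValueError(f"label {label!r} is not in the vocabulary")
        edge = Edge(source, label, target)
        self.edges.add(edge)
        return edge

    def neighbours(self, source, label):
        # Return, in sorted order, all targets reachable from `source`
        # via edges carrying `label`.
        return sorted(
            e.target
            for e in self.edges
            if e.source == source and e.label == label
        )


if __name__ == "__main__":
    graph = KnowledgeGraph(ALLOWED_LABELS)
    # Hypothetical identifiers: a token in a corpus and an entry in a lexicon.
    graph.connect("token:amat", "hasLemma", "lemma:amo")
    graph.connect("token:amat", "occursIn", "text:example")
    print(graph.neighbours("token:amat", "hasLemma"))
```

Restricting the labels to a closed set is what lets resources from different providers be queried uniformly: every consumer of the graph knows in advance which kinds of connection can occur.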
LiLa meets the so-called FAIR Guiding Principles for scientific data management and stewardship, which state that scholarly data must be Findable, Accessible, Interoperable and Reusable.
LiLa is based at the CIRCSE Research Centre of the Università Cattolica del Sacro Cuore in Milan, Italy, and has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (Grant Agreement No 769994).