To conduct thorough cross-linguistic comparisons, we must gather linguistic data that are maximally comparable across individual data points, resources, and language families. Despite the substantial increase in digitally accessible data for the world's languages in recent decades, comparable data remain scarce. The situation is compounded by past data collections not having been archived for long-term durability. As a result, quite a few datasets have disappeared from the internet and are no longer available. Inspired by the GenBank database, where scholars can deposit nucleotide sequences publicly, we have created Lexibank, a collection of cross-linguistic datasets in standardised formats, which offers access to word forms, sound inventories, and lexical features for more than 2,000 language varieties derived from 100 individual high-quality datasets.
Lexibank data can be analysed and used in numerous ways. By assembling lexical data for a large number of languages, Lexibank offers multiple possibilities for researchers investigating cross-linguistic aspects of the lexicon of human languages. For specific semantic domains, Lexibank allows scholars to expand previous studies on colour term evolution, body part terminology, or emotion semantics. With respect to the relation between lexical form and meaning, Lexibank offers the most extensive collection of lexical data with standardised transcriptions and semantic glosses, allowing scholars to test individual hypotheses on sound symbolism in the world’s languages. With respect to the investigation of general aspects of lexical organisation, Lexibank offers one of the largest cross-linguistic collections of form-meaning pairs, allowing scholars to search for various factors that shape the lexicon of the world’s languages. For historical language comparison, the Lexibank wordlist collection offers the largest assembly of expert judgements on historically related (cognate) words available to date. Given that computational cognate detection methods are still unable to compete with experts, our collection offers rich material to test and train new methods in the future. Similarly, given that the Lexibank collection unifies data on a global basis, scholars can use it to test new methods for the automated identification of borrowings or to expand upon previous approaches to the automated detection of contact areas.
Lexibank significantly contributes to the 'FAIRness' of cross-linguistic datasets by making data Findable, Accessible, Interoperable, and Reusable. By offering a detailed, replicable workflow for unifying and standardising lexical datasets in various formats, it fulfils the foundational objective of the Cross-Linguistic Data Formats initiative, promoting reproducible research in linguistics.
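
As a rough illustration of what standardised form-meaning pairs make possible, the following minimal sketch groups word forms by concept across languages. It assumes a forms table with the columns `ID`, `Language_ID`, `Parameter_ID`, and `Form`, as used in wordlists following the Cross-Linguistic Data Formats; the inline rows, the language and concept identifiers, and the helper `forms_by_concept` are hypothetical and serve only as a toy example, not as an excerpt from Lexibank or as part of its workflow.

```python
import csv
import io
from collections import defaultdict

# Toy forms table. The column names are assumed to follow the CLDF wordlist
# convention; the rows and identifiers are illustrative, not taken from Lexibank.
FORMS_CSV = """ID,Language_ID,Parameter_ID,Form
1,german,hand,hant
2,english,hand,hænd
3,german,foot,fuːs
4,english,foot,fʊt
5,french,hand,mɛ̃
"""


def forms_by_concept(handle):
    """Return a mapping concept -> {language: form} from a forms table."""
    grouped = defaultdict(dict)
    for row in csv.DictReader(handle):
        grouped[row["Parameter_ID"]][row["Language_ID"]] = row["Form"]
    return dict(grouped)


# Read the inline table; a real dataset would be read from its forms file.
table = forms_by_concept(io.StringIO(FORMS_CSV))
for concept, forms in sorted(table.items()):
    print(concept, forms)
```

Because every dataset shares the same column layout, the same few lines could in principle be applied to any dataset in the collection, which is the practical benefit of the standardisation described above.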