Linguistic Diversity

Three fundamental facts about language demand explanation:
- Why are there approximately 7,000 languages spoken today?
- Why is their distribution across the globe so uneven?
- Why do they differ so much in almost all aspects of language, including lexicon, phonology, morphology and grammar?
Answering big questions about the patterns and causes of linguistic diversity requires well-sampled global data in standardised formats. Previous research on linguistic diversity has often been hampered by idiosyncratic data formats and limited regional samples, frequently with a Eurocentric focus. Linguistic data are famously messy: different projects use their own transcription systems, concept labels, part-of-speech tags, glossing conventions, and language identifiers. Standard cross-linguistic data formats are crucial because they make heterogeneous linguistic datasets genuinely comparable by enforcing shared identifiers for languages, concepts, and sounds. They turn individually curated datasets into interoperable resources that can be combined, replicated, and extended without labour-intensive cleaning. By providing machine-readable structures, they enable large-scale computational analyses (phylogenetic, typological, and statistical) that would otherwise be impossible. They also support cumulative science by ensuring transparency, versioning, and reproducibility across research groups. In short, standard formats are the backbone that allows evolutionary language science to function as a modern, data-intensive discipline.
A major thrust of research in the DLCE has been the creation of global linguistic databases. Our projects, such as Lexibank (List et al., 2022), Grambank (Skirgård et al., 2023), and Numeralbank (see the ERC QUANTA project), have created unprecedented global databases, containing thousands of languages and hundreds of features. By systematically encoding and interlinking linguistic structures, these resources allow us to move from anecdotal comparisons to rigorous statistical inferences. Over the last three years, we have substantially expanded and refined Lexibank 1. With Lexibank 2 (Blum et al., 2025), the number of datasets almost doubled (from 76 to 134), the number of languages covered increased from 2,028 to 3,107, and the number of word forms rose from 709,000 to 1.7 million, while duplicate entries present in Lexibank 1 were reduced. All included datasets now have standardised phonetic transcriptions and are consistently linked to Glottolog, Concepticon, and CLTS. Additionally, Lexibank 2 adds pre-computed phonological and semantic features, thereby facilitating rich, large-scale comparative analyses. This is supported by uniform SQLite access, enabling fast queries and study prototyping.
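As a rough illustration of the kind of query that shared identifiers and uniform SQLite access make possible, the following Python sketch builds a tiny in-memory database and runs two aggregate queries over it. The table names, column names, identifiers, and rows are all hypothetical placeholders chosen for this example; they are not the actual Lexibank 2 schema or data.

```python
import sqlite3

# Toy in-memory database. Table names, column names, identifiers, and rows
# are invented for illustration; they are NOT the real Lexibank 2 schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE languages (id TEXT PRIMARY KEY, name TEXT, glottocode TEXT);
CREATE TABLE forms (
    id INTEGER PRIMARY KEY,
    language_id TEXT REFERENCES languages(id),
    concepticon_gloss TEXT,
    segments TEXT
);
""")
conn.executemany(
    "INSERT INTO languages VALUES (?, ?, ?)",
    [
        ("lang_a", "Language A", "aaaa0001"),  # placeholder identifier
        ("lang_b", "Language B", "bbbb0002"),  # placeholder identifier
    ],
)
conn.executemany(
    "INSERT INTO forms (language_id, concepticon_gloss, segments) VALUES (?, ?, ?)",
    [
        ("lang_a", "HAND", "m a n o"),
        ("lang_a", "WATER", "a k w a"),
        ("lang_b", "HAND", "h a n t"),
    ],
)


def forms_per_language(connection):
    """Return (glottocode, number of forms) pairs, largest count first."""
    query = """
        SELECT l.glottocode, COUNT(f.id) AS n_forms
        FROM languages AS l
        LEFT JOIN forms AS f ON f.language_id = l.id
        GROUP BY l.glottocode
        ORDER BY n_forms DESC, l.glottocode
    """
    return connection.execute(query).fetchall()


def shared_concepts(connection):
    """Return concept glosses attested in more than one language."""
    query = """
        SELECT concepticon_gloss
        FROM forms
        GROUP BY concepticon_gloss
        HAVING COUNT(DISTINCT language_id) > 1
        ORDER BY concepticon_gloss
    """
    return [row[0] for row in connection.execute(query)]


print(forms_per_language(conn))
print(shared_concepts(conn))
```

Because every form in this sketch is keyed to a shared language identifier and a shared concept label, cross-dataset questions reduce to ordinary joins and aggregations. Queries of the same shape should carry over to a real database once the placeholder names are replaced with those of the actual schema.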
The approximately 7,000 languages spoken across the globe vary in the order in which they arrange words and in the constructions they use to combine segments into higher-order units. They can also differ markedly in how information is grammatically expressed. Some languages always mark categories such as gender, number, case, and tense, while others never or only optionally mark them. Furthermore, sentences that consist of many words in some languages can be translated by a single word in other languages. This linguistic diversity is not randomly distributed. We expect it to be shaped by human cognition, geographical proximity and genealogical descent. However, an accurate understanding of the actual structural diversity of languages, the factors that shape that variation, and what is at stake when the world loses languages has been hampered by the lack of accessible, systematically sampled global data.
To remedy this situation, we created a global network of over 100 linguists and computer scientists to construct the Grambank database. The initial Grambank 1.0 database covered 2,467 language varieties, capturing a wide range of grammatical phenomena in 195 features, from word order to verbal tense, nominal plurals, and many other well-studied comparative linguistic variables. The coverage spanned 215 different language families and 101 isolates from all inhabited continents and geographic regions. We used these data to test the relative roles of genealogical inheritance and geographical diffusion in shaping grammatical diversity and to assess the consequences of language loss (Skirgård et al., 2023). More recently, we have used the Grambank data to test the role of sociodemographic factors in shaping linguistic complexity (Shcherbakova et al., 2023; see the section on language and culture) and to evaluate claims about putative linguistic universals (Verkerk et al., 2025; see the section on language and cognition). An updated and expanded version of Grambank (Grambank 2.0) will shortly be released, containing 596,895 datapoints for 3,130 language varieties.
Representative publications
Blum, F., Barrientos Ugarte, C., Englisch, J., Forkel, R., Greenhill, S. J., Rzymski, C., & List, J.-M. (2025). Lexibank 2: Pre-computed features for large-scale lexical data. Open Research Europe, 5: 126.
Skirgård, H., Haynie, H. J., Blasi, D. E., Hammarström, H., Collins, J., Latarche, J. J., Lesage, J., Weber, T., Witzlack-Makarevich, A., Passmore, S., Chira, A.-M., Maurits, L., Dinnage, R., Dunn, M., Reesink, G., Singer, R., Bowern, C., Epps, P., Hill, J., Vesakoski, O., Robbeets, M., Abbas, N. K., Auer, D., Bakker, N. A., Barbos, G., Borges, R. D., Danielsen, S., Dorenbusch, L., Dorn, E., Elliott, J., Falcone, G., Fischer, J., Ghanggo Ate, Y., Gibson, H., Göbel, H.-P., Goodall, J. A., Gruner, V., Harvey, A., Hayes, R., Heer, L., Herrera Miranda, R. E., Hübler, N., Huntington-Rainey, B., Ivani, J. K., Johns, M., Just, E., Kashima, E., Kipf, C., Klingenberg, J. V., König, N., Koti, A., Kowalik, R. G., Krasnoukhova, O., Lindvall, N. L., Lorenzen, M., Lutzenberger, H., Martins, T. R., Mata German, C., van der Meer, S., Montoya Samamé, J., Müller, M., Muradoglu, S., Neely, K., Nickel, J., Norvik, M., Oluoch, C. A., Peacock, J., Pearey, I. O., Peck, N., Petit, S., Pieper, S., Poblete, M., Prestipino, D., Raabe, L., Raja, A., Reimringer, J., Rey, S. C., Rizaew, J., Ruppert, E., Salmon, K., Sammet, J., Schembri, R., Schlabbach, L., Schmidt, F. W., Skilton, A., Smith, W. D., de Sousa, H., Sverredal, K., Valle, D., Vera, J., Voß, J., Witte, T., Wu, H., Yam, S., Ye, J., Yong, M., Yuditha, T., Zariquiey, R., Forkel, R., Evans, N., Levinson, S. C., Haspelmath, M., Greenhill, S. J., Atkinson, Q. D., & Gray, R. D. (2023). Grambank reveals the importance of genealogical constraints on linguistic diversity and highlights the impact of language loss. Science Advances, 9: eadg6175.