Workshop on Cross-Linguistic Data Formats 2023 – Graphs and Text

Ten years ago, in early 2013, the Cross-Linguistic Linked Data project (CLLD) was kicked off at the MPI EVA. It soon became one of the main drivers of an initiative towards the standardization of cross-linguistic data, culminating in the CLDF specification in 2018, which grew out of a series of workshops on “Language Comparison with Linguistic Databases” (see here). Looking back at this history, we want to explore what role CLDF can play in the future of the field and whether targeted workshops are a useful way to govern the standard.

This workshop brings together researchers using cross-linguistic data, publishers of such data and tool builders, i.e. representatives of the community from which the CLDF standard grew and for which it is intended.

The main goals during the two workshop days are

  • an update on what’s happening regarding cross-linguistic, data-intensive research, concentrating specifically on graphs and texts as two new major goals of standardization that we intend to tackle soon, and

  • a shared understanding of the role CLDF (or standardization in general) can play in research (ideally resulting in a “community of practice”, driving the future of standardization efforts).

Cross-Linguistic Data-Intensive Research

Several large-scale data collection projects have come to fruition over the last decade (List et al. 2022; Skirgård et al. 2023), providing ever more input for research that takes into account the world’s linguistic diversity.

Inferring phylogenetic trees from cognate-coded lexical data (Sagart et al. 2019; Greenhill et al. 2023; Heggarty et al. 2023) can probably be regarded as the de facto standard application of such data, but research in psychology using colexification networks (Jackson et al. 2019; Brochhagen et al. 2023) and the quest for language universals (Dediu 2023) are just two more examples of research that routinely uses cross-linguistic data.

During this workshop, we hope to learn about more research questions that require cross-linguistic data to be answered.

The Role of Standards

Standardization may sometimes be perceived as counterproductive in research, because it seems essentially at odds with “cutting-edge” methodology or individual researchers’ intuition and freedom of thought (Bauman 2011). But clearly, standing on the shoulders of giants becomes easier when solid steps lead there.

During the workshop, we hope to identify steps that – in retrospect – led in the right direction and to understand which current research paths are well-trodden enough to become candidates for further standardization.

Next Steps Towards Expanding CLDF

Finally, the workshop will serve as an experiment in figuring out how to govern a standard like CLDF. In the best case, the next version of CLDF will be shaped by requirements gathered, lessons learned, and opportunities identified during the workshop.


Bauman, Syd. 2011. “Interchange Vs. Interoperability.” In Proceedings of Balisage: The Markup Conference 2011. Mulberry Technologies.

Brochhagen, Thomas, Gemma Boleda, Eleonora Gualdoni, and Yang Xu. 2023. “From Language Development to Language Evolution: A Unified View of Human Lexical Creativity.” Science 381 (6656): 431–36.

Dediu, Dan. 2023. “Ultraviolet Light Affects the Color Vocabulary: Evidence from 834 Languages.” Frontiers in Psychology 14.

Greenhill, Simon J., Hannah J. Haynie, Robert M. Ross, Angela Chira, Johann-Mattis List, Lyle Campbell, Carlos A. Botero, and Russell D. Gray. 2023. “A Recent Northern Origin for the Uto-Aztecan Family.” Language 0 (0).

Heggarty, Paul, Cormac Anderson, Matthew Scarborough, Benedict King, Remco Bouckaert, Lechosław Jocz, Martin Joachim Kümmel, et al. 2023. “Language Trees with Sampled Ancestors Support a Hybrid Model for the Origin of Indo-European Languages.” Science 381 (6656).

Jackson, Joshua Conrad, Joseph Watts, Teague R. Henry, Johann-Mattis List, Peter J. Mucha, Robert Forkel, Simon J. Greenhill, Russell D. Gray, and Kristen Lindquist. 2019. “Emotion Semantics Show Both Cultural Variation and Universal Structure.” Science 366 (6472): 1517–22.

List, Johann-Mattis, Robert Forkel, Simon J. Greenhill, Christoph Rzymski, Johannes Englisch, and Russell D. Gray. 2022. “Lexibank, a Public Repository of Standardized Wordlists with Computed Phonological and Lexical Features.” Scientific Data 9 (316): 1–31.

Sagart, Laurent, Guillaume Jacques, Yunfan Lai, Robin Ryder, Valentin Thouzeau, Simon J. Greenhill, and Johann-Mattis List. 2019. “Dated Language Phylogenies Shed Light on the Ancestry of Sino-Tibetan.” Proceedings of the National Academy of Sciences of the United States of America 116: 10317–22.

Skirgård, Hedvig, Hannah J. Haynie, Damián E. Blasi, Harald Hammarström, Jeremy Collins, Jay J. Latarche, Jakob Lesage, et al. 2023. “Grambank Reveals the Importance of Genealogical Constraints on Linguistic Diversity and Highlights the Impact of Language Loss.” Science Advances 9 (16).

Invited participants

(listed in alphabetical order by participant's last name)

  • Sascha Alexeyenko
  • Laura Becker
  • Christian Bentz
  • Katja Bocklage
  • Thomas Brochhagen
  • Anna Di Natale
  • Promis Dodzi Kpoglu
  • Jeff Good
  • John Mansfield
  • Barbara Meisterernst
  • Jessica Nieder
  • Sebastian Nordhoff
  • Matthias Pache
  • Michele Pullini
  • Arne Rubehn



Thursday, December 14

Time | Speaker | Title
08:40 - 09:00 | everyone | Registration: picking up name tags, lunch & reception tickets
09:00 - 09:30 | Johann-Mattis List & Robert Forkel | Introduction (slides)
09:30 - 10:00 | Sascha Alexeyenko | CLLD apps as a tool for the construction of datasets (slides)
10:00 - 10:30 | Jeff Good | Extending CLDF to multilingual data (slides)
10:30 - 11:00 | COFFEE BREAK
11:00 - 11:30 | John Mansfield | Areal colexification and partial colexification in northern Australia (slides)
11:30 - 12:00 | Thomas Brochhagen | Challenges and insights from cross-linguistic word-meaning associations: A roadmap for the study of loose colexification (slides)
12:00 - 12:30 | Annika Tjuka & Johann-Mattis List | Representing semantic networks in Concepticon (slides)
12:30 - 14:00 | LUNCH BREAK
14:00 - 14:30 | Christian Bentz | Collecting character sequences for paleolithic signs and written languages
14:30 - 15:00 | Sebastian Nordhoff | Generating CLDF from heterogeneous input in the Open Text Collections project: Input from FLEx, ELAN, tex
15:00 - 15:30 | Barbara Meisterernst | The morpho-syntax of Archaic Chinese verbs: Loss of morphology as trigger for the emergence of analytic structures (slides)
15:30 - 16:00 | COFFEE BREAK
16:00 - 17:30 | Practice Session 1 | Text formats in CLDF

Friday, December 15

Time | Speaker | Title
09:00 - 10:30 | Practice Session 2 | CLICS4 and networks in CLDF
10:30 - 11:00 | COFFEE BREAK
11:00 - 12:30 | everyone | Discussion
12:30 - 13:30 | LUNCH BREAK
13:30 - 14:00 | Wrapping up

You can download our book of abstracts here.


Deadline for registration was November 30th, 2023.


Questions? Please send any queries regarding this workshop to us here.