Workshop on Cross-Linguistic Data Formats 2023 – Graphs and Text

Ten years ago, in early 2013, the Cross-Linguistic Linked Data project (CLLD) was kicked off at the MPI EVA. It soon became one of the main drivers of an initiative towards the standardization of cross-linguistic data, culminating in the CLDF specification in 2018, which grew out of a series of workshops on “Language Comparison with Linguistic Databases” (see here). Looking back at this history, we want to explore what role CLDF can play in the future of the field and whether targeted workshops are a useful way to govern the standard.

This workshop brings together researchers using cross-linguistic data, publishers of such data and tool builders, i.e. representatives of the community from which the CLDF standard grew and for which it is intended.

The main goals during the two workshop days are:

an update on what’s happening regarding cross-linguistic, data-intensive research, concentrating specifically on graphs and texts as two new major goals of standardization that we intend to tackle soon, and
a shared understanding of the role CLDF (or standardization in general) can play in research (ideally resulting in a “community of practice”, driving the future of standardization efforts).

Cross-Linguistic Data-Intensive Research

Several large-scale data collection projects have come to fruition over the last decade (List et al. 2022; Skirgård et al. 2023), providing ever more input for research that takes into account the world’s linguistic diversity.

Inferring phylogenetic trees from cognate-coded lexical data (Sagart et al. 2019; Greenhill et al. 2023; Heggarty et al. 2023) can probably be regarded as the de facto standard application of such data, but research in psychology using colexification networks (Jackson et al. 2019; Brochhagen et al. 2023) and the quest for language universals (Dediu 2023) are just two more examples of research that routinely uses cross-linguistic data.

During this workshop, we hope to learn about more research questions that require cross-linguistic data to be answered.

The Role of Standards

Standardization may sometimes be perceived as counterproductive in research, because it seems essentially at odds with “cutting-edge” methodology or individual researchers’ intuition and freedom of thought (Bauman 2011). But clearly, standing on the shoulders of giants becomes easier when solid steps lead there.

During the workshop, we hope to identify steps that – in retrospect – led in the right direction, and to understand which current research paths are well-trodden enough to become candidates for further standardization.

Next Steps Towards Expanding CLDF

Finally, the workshop will serve as an experiment in figuring out how to govern a standard like CLDF. In the best case, the next version of CLDF will be shaped by requirements gathered, lessons learned, and opportunities identified during the workshop.

References

Bauman, Syd. 2011. “Interchange Vs. Interoperability.” In Proceedings of Balisage: The Markup Conference 2011. Mulberry Technologies. https://doi.org/10.4242/balisagevol7.bauman01.

Brochhagen, Thomas, Gemma Boleda, Eleonora Gualdoni, and Yang Xu. 2023. “From Language Development to Language Evolution: A Unified View of Human Lexical Creativity.” Science 381 (6656): 431–36. https://doi.org/10.1126/science.ade7981.

Dediu, Dan. 2023. “Ultraviolet Light Affects the Color Vocabulary: Evidence from 834 Languages.” Frontiers in Psychology 14. https://doi.org/10.3389/fpsyg.2023.1143283.

Greenhill, Simon J., Hannah J. Haynie, Robert M. Ross, Angela Chira, Johann-Mattis List, Lyle Campbell, Carlos A. Botero, and Russell D. Gray. 2023. “A Recent Northern Origin for the Uto-Aztecan Family.” Language 0 (0).

Heggarty, Paul, Cormac Anderson, Matthew Scarborough, Benedict King, Remco Bouckaert, Lechosław Jocz, Martin Joachim Kümmel, et al. 2023. “Language Trees with Sampled Ancestors Support a Hybrid Model for the Origin of Indo-European Languages.” Science 381 (6656). https://doi.org/10.1126/science.abg0818.

Jackson, Joshua Conrad, Joseph Watts, Teague R. Henry, Johann-Mattis List, Peter J. Mucha, Robert Forkel, Simon J. Greenhill, Russell D. Gray, and Kristen Lindquist. 2019. “Emotion Semantics Show Both Cultural Variation and Universal Structure.” Science 366 (6472): 1517–22.

List, Johann-Mattis, Robert Forkel, Simon J. Greenhill, Christoph Rzymski, Johannes Englisch, and Russell D. Gray. 2022. “Lexibank, a Public Repository of Standardized Wordlists with Computed Phonological and Lexical Features.” Scientific Data 9 (316): 1–31.

Sagart, Laurent, Guillaume Jacques, Yunfan Lai, Robin Ryder, Valentin Thouzeau, Simon J. Greenhill, and Johann-Mattis List. 2019. “Dated Language Phylogenies Shed Light on the Ancestry of Sino-Tibetan.” Proceedings of the National Academy of Sciences of the United States of America 116: 10317–22.

Skirgård, Hedvig, Hannah J. Haynie, Damián E. Blasi, Harald Hammarström, Jeremy Collins, Jay J. Latarche, Jakob Lesage, et al. 2023. “Grambank Reveals the Importance of Genealogical Constraints on Linguistic Diversity and Highlights the Impact of Language Loss.” Science Advances 9 (16). https://doi.org/10.1126/sciadv.adg6175.

Invited participants

(listed in alphabetical order by participant's last name)

Sascha Alexeyenko
Laura Becker
Christian Bentz
Katja Bocklage
Thomas Brochhagen
Anna Di Natale
Promis Dodzi Kpoglu
Jeff Good
John Mansfield
Barbara Meisterernst
Jessica Nieder
Sebastian Nordhoff
Matthias Pache
Michele Pullini
Arne Rubehn

Program

Thursday, December 14

08:40 - 09:00	everyone	registration: picking up name tags, lunch & reception tickets
09:00 - 09:30	Johann-Mattis List & Robert Forkel	Introduction (slides)
09:30 - 10:00	Sascha Alexeyenko	CLLD apps as a tool for the construction of datasets (slides)
10:00 - 10:30	Jeff Good	Extending CLDF to multilingual data (slides)
10:30 - 11:00	COFFEE BREAK
11:00 - 11:30	John Mansfield	Areal colexification and partial colexification in northern Australia (slides)
11:30 - 12:00	Thomas Brochhagen	Challenges and insights from cross-linguistic word-meaning associations: A roadmap for the study of loose colexification (slides)
12:00 - 12:30	Annika Tjuka & Johann-Mattis List	Representing semantic networks in Concepticon (slides)
12:30 - 14:00	LUNCH BREAK
14:00 - 14:30	Christian Bentz	Collecting character sequences for paleolithic signs and written languages
14:30 - 15:00	Sebastian Nordhoff	Generating CLDF from heterogeneous input in the Open Text Collections project: Input from FLEx, ELAN, tex
15:00 - 15:30	Barbara Meisterernst	The morpho syntax of Archaic Chinese verbs: Loss of morphology as trigger for the emergence of analytic structures (slides)
15:30 - 16:00	COFFEE BREAK
16:00 - 17:30	Practice Session 1	Text formats in CLDF
17:30 - 20:00	WORKSHOP RECEPTION

Friday, December 15

09:00 - 10:30	Practice Session 2	CLICS4 and networks in CLDF
10:30 - 11:00	COFFEE BREAK
11:00 - 12:30	everyone	Discussion
12:30 - 13:30	LUNCH BREAK
13:30 - 14:00	everyone	Wrapping up

You can download our book of abstracts here.

Registration

Deadline for registration was November 30th, 2023.

Organizers

Contact

Questions? Please send any queries regarding this workshop to us here.

Max Planck Institute for Evolutionary Anthropology