

Project leader

  • Frank Seifart

Project members

  • Hans-Jörg Bibiko
  • Balthasar Bickel
  • Swintha Danielsen
  • Roland Meyer
  • Sebastian Nordhoff
  • Brigitte Pakendorf
  • Jan Strunk
  • Alena Witzlack-Makarevich
  • Taras Zakharko

Research assistants

  • Helen Geyer
  • Lisa Steinbach
  • Evgeniya Zhivotova

The relative frequencies of nouns, pronouns, and verbs cross-linguistically

(Volkswagen Foundation DoBeS grant 86 292)

This project investigated the relative frequencies of core parts of speech, such as nouns, verbs, and pronouns, in spoken language corpora of seven languages that represent a wide range of areal and typological diversity. We focused on two research questions:

  1. Why do languages vary so drastically in the relative frequencies of noun, pronoun, and verb tokens employed in discourse? Our pilot study for this project suggested that in some languages (such as Chintang) the overall number of nouns and pronouns taken together roughly equals the overall number of verbs, while in others (such as Sri Lanka Malay) the number of nouns and pronouns taken together is roughly double the number of verbs. What typological or other differences between languages can explain these differences in the use of parts of speech? One hypothesis we tested is that argument indexing on verbs may make the overt realization of arguments as nouns or pronouns unnecessary, and may thus explain the low frequencies of nouns and pronouns in some languages.
  2. Why do the relative frequencies of nouns, pronouns, and verbs vary within texts? Our pilot study showed that, consistently across languages, nouns are used particularly frequently at the beginning of narrative texts, reflecting the introduction of new discourse participants, as expected. Furthermore, noun frequencies alternated in a characteristic, sinusoidal pattern as narrative texts unfolded, with regular peaks of heavy noun use roughly every 10-15 clauses. These peaks may reflect universal cognitive constraints on the activation of discourse participants: once a participant's activation has decayed, ultimately because of the limits of short-term memory, it has to be re-introduced by a full lexical noun.
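
The two measures behind these research questions can be illustrated with a minimal Python sketch. It is not the project's actual pipeline: the coarse tags "N", "PRO", and "V", the function names, the window size, and the toy data are all assumptions made for illustration. The first function computes the ratio of nouns and pronouns taken together to verbs for a whole text; the second computes the share of noun tokens in a sliding window of clauses, which is one simple way to look for recurring peaks of noun use as a text unfolds.

```python
from collections import Counter

# Hypothetical coarse tags; real corpora use their own tagsets.
NOUN, PRONOUN, VERB = "N", "PRO", "V"


def nominal_to_verb_ratio(tags):
    """Return (nouns + pronouns) / verbs for a flat sequence of POS tags."""
    counts = Counter(tags)
    if counts[VERB] == 0:
        raise ValueError("no verb tokens in input")
    return (counts[NOUN] + counts[PRONOUN]) / counts[VERB]


def noun_share_by_window(clauses, window=5):
    """Share of noun tokens in each sliding window of `window` clauses.

    `clauses` is a list of clauses, each given as a list of POS tags.
    """
    shares = []
    for start in range(len(clauses) - window + 1):
        tags = [t for clause in clauses[start:start + window] for t in clause]
        shares.append(tags.count(NOUN) / len(tags) if tags else 0.0)
    return shares


# Toy data, invented purely for illustration.
toy_clauses = [
    ["N", "V", "N"],
    ["PRO", "V"],
    ["V"],
    ["N", "N", "V"],
]
flat = [t for clause in toy_clauses for t in clause]

print(nominal_to_verb_ratio(flat))            # 5 nominals / 4 verbs
print(noun_share_by_window(toy_clauses, 2))   # one value per 2-clause window
```

A ratio near 1 would correspond to the pattern reported for Chintang, and a ratio near 2 to the pattern reported for Sri Lanka Malay. With real data, the window size would have to be chosen with the reported 10-15 clause spacing of the peaks in mind.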

We also investigated the influence of further factors on the relative frequencies of nouns, pronouns, and verbs, such as the degree of speakers’ and listeners’ mutual acquaintance (known/familiar vs. unknown) and text genres. In this context we empirically tested the assumed universality of ‘nouniness’ of formal genres.

The newly available data compiled in the DoBeS framework allowed us to develop and then appropriately address these research questions for the first time, as they require data from diverse languages that are annotated for parts of speech by experts, time-aligned, and described with detailed metadata with respect to speakers’ social status, mutual acquaintance, etc. These data allowed us to capture subtle language usage patterns and explore their relation to typological differences between languages, narrative strategies, and other linguistic and non-linguistic factors. This project thus further developed documentary linguistics, connecting it with areas such as corpus linguistics, morphological typology, syntactic theory, discourse studies, and cognitive linguistics. In order to connect our findings with research on well-known languages such as English, we additionally carried out analyses on published corpora of English.

The methods applied include computational techniques for the quantitative analysis of textual data of the type produced by DoBeS projects, with as little additional manual annotation as possible. This permitted us to analyze the large amounts of data necessary to detect and appropriately describe the subtle patterns under investigation. The work also involved developing solutions for a number of technological and computational issues in cross-corpus studies, which are additional outcomes of this project.
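
One issue that typically arises in cross-corpus work of this kind is that each corpus comes with its own part-of-speech tagset. The following Python sketch shows one possible way to map corpus-specific tags onto a shared coarse tagset before counting. The corpus identifiers, the tag inventories, and the function name are hypothetical and do not reflect the tagsets or tools actually used in the project.

```python
# Hypothetical per-corpus tag inventories mapped to a shared coarse tagset.
# None of these tags are taken from the project's actual corpora.
TAG_MAPS = {
    "corpus_a": {"n": "N", "pro": "PRO", "vi": "V", "vt": "V"},
    "corpus_b": {"NOUN": "N", "PRON": "PRO", "VERB": "V"},
}


def harmonize(corpus_id, tags, other="OTHER"):
    """Map corpus-specific POS tags to the shared coarse tagset.

    Tags without an entry in the mapping are collapsed into `other`.
    """
    mapping = TAG_MAPS[corpus_id]
    return [mapping.get(tag, other) for tag in tags]


print(harmonize("corpus_a", ["n", "vt", "adv"]))
print(harmonize("corpus_b", ["PRON", "VERB", "NOUN"]))
```

Keeping the mapping in a plain dictionary per corpus means that language experts can review and correct it without touching the counting code.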


Language         Family            Region        Number of speakers  Language expert
Baure            Arawakan          Amazonia      84                  Swintha Danielsen
Chintang         Tibeto-Burman     Himalaya      ~1,500              Balthasar Bickel
Bora             Boran             Amazonia      ~1,500              Frank Seifart
N|uu             Southern Khoisan  South Africa  6                   Alena Witzlack-Makarevich
Sri Lanka Malay  Austronesian      Sri Lanka     ~45,000             Sebastian Nordhoff
Ėven             Tungusic          Siberia       ~2,500              Brigitte Pakendorf
Sakha (Yakut)    Turkic            Siberia       ~360,000            Brigitte Pakendorf


Seifart, Frank, Roland Meyer, Taras Zakharko, Balthasar Bickel, Swintha Danielsen, Sebastian Nordhoff, and Alena Witzlack-Makarevich. 2010. Cross-linguistic variation in the noun-to-verb ratio: Exploring automatic tagging and quantitative corpus analysis. Paper presented at the DoBeS Workshop “Advances in Documentary Linguistics”, Nijmegen, 14-15 October 2010.

Seifart, Frank. 2011. Cross-linguistic variation in the noun-to-verb ratio: The role of verb morphology and narrative strategies. Poster presented at the Association for Linguistic Typology 9th Biennial Conference, The University of Hong Kong, 21-24 July 2011.


  • The relative frequencies of nouns, pronouns, and verbs in discourse: An international workshop. Leipzig, August 12-13, 2013.

  • Related project: Referentiality Project at Universität Erfurt
  • Funding agency: Volkswagen Foundation
  • Funding scheme: DoBeS (Dokumentation bedrohter Sprachen, “Documentation of Endangered Languages”)