(Volkswagen Foundation DoBeS grant 86 292)
This project investigated the relative frequencies of core parts of speech, such as nouns, verbs, and pronouns, in spoken language corpora of seven languages that represent a wide range of areal and typological diversity. We focused on two research questions:
- Why do languages vary so drastically in the relative frequencies of noun, pronoun, and verb tokens employed in discourse? Our pilot study for this project suggested that in some languages (such as Chintang) the overall number of nouns and pronouns taken together roughly equals the overall number of verbs, while in others (such as Sri Lanka Malay) the number of nouns and pronouns taken together is roughly double the number of verbs. What typological or other differences between languages can explain these differences in the use of parts of speech? One of the hypotheses we tested concerned argument indexing on verbs: where it is present, it may make the overt realization of arguments as nouns or pronouns unnecessary, and may thus explain the low frequencies of nouns and pronouns in some languages.
- Why do the relative frequencies of nouns, pronouns, and verbs vary within texts? Our pilot study showed that, consistently across languages, nouns are used particularly frequently at the beginning of narrative texts, reflecting the introduction of new discourse participants, as expected. Furthermore, there were characteristic, sinusoidal alternations in the frequency of noun use as narrative texts unfold, with regular peaks of heavy noun use roughly every 10-15 clauses. These peaks may reflect universal cognitive constraints on the activation of discourse participants: once a participant's activation has decayed, ultimately because of the limits of short-term memory, the participant has to be re-introduced by a full lexical noun.
We also investigated the influence of further factors on the relative frequencies of nouns, pronouns, and verbs, such as the degree of speakers’ and listeners’ mutual acquaintance (known/familiar vs. unknown) and text genres. In this context we empirically tested the assumed universality of ‘nouniness’ of formal genres.
The newly available data compiled in the DoBeS framework allowed us to develop and then appropriately address these research questions for the first time, as they require data from diverse languages that are annotated for parts of speech by experts, time-aligned, and described with detailed metadata with respect to speakers’ social status, mutual acquaintance, etc. These data allowed us to capture subtle language usage patterns and explore their relation to typological differences between languages, narrative strategies, and other linguistic and non-linguistic factors. This project thus further developed documentary linguistics, connecting it with areas such as corpus linguistics, morphological typology, syntactic theory, discourse studies, and cognitive linguistics. In order to connect our findings with research on well-known languages such as English, we additionally carried out analyses on published corpora of English.
The methods applied include computational techniques for the quantitative analysis of textual data of the type produced by DoBeS projects, with as little additional manual annotation as possible. This permitted us to analyze the large amount of data necessary to detect and appropriately describe the subtle patterns under investigation. The work also involved developing solutions to a number of technological and computational issues in cross-corpus studies, which are additional outcomes of this project.
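As a rough illustration of the kind of measure involved, the following sketch computes the ratio of nouns and pronouns to verbs from part-of-speech tags, together with the share of nouns in each clause. The tag labels (`N`, `PRO`, `V`), the clause-by-clause data layout, the toy data, and the function names are assumptions made for this example; they are not the project's actual annotation scheme or tooling.

```python
from collections import Counter


def np_to_verb_ratio(tags):
    """Return (nouns + pronouns) / verbs for a sequence of POS tags."""
    counts = Counter(tags)
    verbs = counts["V"]
    if verbs == 0:
        raise ValueError("no verb tokens in input")
    return (counts["N"] + counts["PRO"]) / verbs


def noun_share_per_clause(clauses):
    """Return, for each clause, the fraction of its tokens tagged as nouns."""
    shares = []
    for clause in clauses:
        # An empty clause contributes a share of 0.0 rather than dividing by zero.
        shares.append(clause.count("N") / len(clause) if clause else 0.0)
    return shares


# Toy data (hypothetical): each inner list holds the POS tags of one clause.
clauses = [
    ["N", "V", "N"],
    ["PRO", "V"],
    ["V"],
    ["N", "N", "V", "PRO"],
]

all_tags = [tag for clause in clauses for tag in clause]
ratio = np_to_verb_ratio(all_tags)
shares = noun_share_per_clause(clauses)

print(ratio)
print(shares)
```

A ratio near 1 would correspond to the pattern reported for Chintang, and a ratio near 2 to the pattern reported for Sri Lanka Malay. Plotting the per-clause noun shares against clause position is one simple way to look for recurring peaks of noun use within a text.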