CHM by Benjamin Molineaux

Prolegomena to a Corpus of Historical Mapudungun

1606-1930

Welcome to the documentation page for the development of a Corpus of Historical Mapudungun (CHM).

This space will contain information on the process, principles and source texts involved in generating a linguistically tagged corpus of the earliest writings in Mapudungun, spanning 1606 to 1930.

A Corpus of Historical Mapudungun is the main proposed outcome of a Leverhulme Early Career Fellowship awarded to Benjamin Molineaux at the University of Edinburgh's Angus McIntosh Centre for Historical Linguistics and due to run between April 2018 and March 2021.

About Mapudungun

Mapudungun is the ancestral language of the Mapuche people of south-central Chile and Argentina. Today Mapudungun is spoken mostly in pockets of Chile's 8th, 9th, 10th and 14th regions by an estimated 250,000 speakers. In Argentina, numbers of self-reportedly competent speakers are around 8,400. In both countries, monolingualism is vanishingly rare, with the range of interlocutors and topics for the use of the language having grown progressively smaller.

The genetic affiliation of Mapudungun is uncertain. A number of claims have been made, ranging from relation to near-neighbours to the north – such as Quechua, Aymara and Pano-Tacanan – and to the south – Kawésqar, Yaghan and Chon (Tierra del Fuego, now extinct) – as well as membership in more distant families such as Arawakan, Mayan, or Aztec and Uto-Aztecan. With strong evidence lacking for any of these theories, the language is often presumed to be an isolate. From a regional-typological perspective, however, it can be grouped with other Andean-type languages with agglutinating, strictly ordered morphology, with a special affinity towards Quechua and Aymara, as far as the tendency for suffixation goes.

Mapudungun is considered a polysynthetic, agglutinating, head-marking language. Insofar as these categories can be considered useful, their fundamental locus of instantiation in the language is the verb, which displays – aside from intricate (obligatory and optional) inflectional morphology – a wide array of derivational and compounding processes. This richness of verbal morphology is in stark contrast with the noun, where, barring compounding (which is highly productive), morphological structure is markedly sparse, displaying no gender and practically no case or obligatory number marking.

The Historical Record of Mapudungun

The first formal description of the language, by Jesuit Luys de Valdivia, was published in 1606 and held that ‘no other language than this runs down from the city of Coquimbo and its surroundings to the island of Chiloé and beyond, and from the foot of the great snow-covered mountain-range to the sea’ (‘To the reader’). In the same text, Mapudungun is claimed to be mostly homogeneous, with some regional variation in vocabulary, though ‘the precepts and rules of this art are general for all the provinces’. Whether or not Valdivia’s assessment was correct, the past 410 years have seen drastic changes both in the language’s geographic distribution and in its range of use. The CHM aims to describe the variation within the historical record by transcribing and coding a large proportion of the written record for the language. Most of the earliest material comes from Christian missionaries, though there are also texts from explorers and military men, as well as, later on, a few texts with more explicitly academic and cultural aims.

Source Texts

The Source Material for the CHM

This section currently contains a list of relevant documents representing data for historical Mapudungun (1606-1930). It will eventually contain brief descriptions (metadata) for the texts, as well as links to the image-based PDF files. As Optical Character Recognition (OCR) outputs are hand checked, plain-text and web-based transcriptions of the materials are being published alongside the image-based PDFs.

Note that some of the original works are only partially transcribed, as the focus is on those parts where Mapudungun language samples are present, rather than those parts where broader descriptions of language or culture are given in Spanish or any other non-Mapudungun language.

Core CHM texts
Potential (bonus) CHM texts

Corpus encoding

As the objective of the corpus is to provide a view into the synchrony and diachrony of lexical, morphological and phonological features, texts are being parsed at all three of these levels. The first stage of this process – lemmatisation – identifies the key root-elements, as well as the part-of-speech (POS) category for each word, providing a single identifiable label and reducing both morphological and spelling heterogeneity (see 1).

(1)

   Form            Lemma   Transl.     POS
a. kude-kefuingu   kuden   'to play'   V
b. kuthe-kalape    kuden   'to play'   V

XML

a. <w xml:lang="arn" lemma="kuden" pos="V" corresp="play">kudekefuingu</w>
b. <w xml:lang="arn" lemma="kuden" pos="V" corresp="play">kuthekalape</w>
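
As a rough illustration of how the lemma attribute reduces spelling and morphological heterogeneity, the following Python sketch reads the two <w> elements from (1) and groups the attested forms under their shared lemma. The wrapping <text> element and the function name forms_by_lemma are assumptions made for this example only; they are not part of the CHM encoding.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The two <w> elements from example (1), wrapped in a hypothetical <text> root
# so that they form a single well-formed XML document.
SAMPLE = """<text>
<w xml:lang="arn" lemma="kuden" pos="V" corresp="play">kudekefuingu</w>
<w xml:lang="arn" lemma="kuden" pos="V" corresp="play">kuthekalape</w>
</text>"""


def forms_by_lemma(xml_text):
    """Map each (lemma, POS) pair to the list of word forms tagged with it."""
    groups = defaultdict(list)
    for w in ET.fromstring(xml_text).iter("w"):
        groups[(w.get("lemma"), w.get("pos"))].append(w.text)
    return dict(groups)


if __name__ == "__main__":
    for (lemma, pos), forms in forms_by_lemma(SAMPLE).items():
        print(lemma, pos, forms)
```

Both spellings end up under the single key ("kuden", "V"), which is what allows a search by lemma to retrieve forms that differ in orthography and inflection.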

The second stage is morphological parsing, which identifies individual morphemes beyond the root and labels them according to function (as in 2). The result of both these processes is a TEI-standard XML text with the relevant tags embedded. A full 10% of the total word-types in the corpus texts has been tagged in this way, and a machine-learning algorithm is being developed to tag the remainder of the material both at the level of the lemma and the morpheme. Additional hand corrections will be necessary in order to complete the process.

(2)

   Form              Morphemes
a. kude-ke-fu-ingu   ROOT(play)-HABIT-BROKEN.IMPLICATURE-IND.3.DUAL
b. kuthe-ka-la-pe    ROOT(play)-CONT-NEG-IMP.3.SG

XML

a. <w> <m baseForm="kude" type="root" corresp="play">kude</m><m baseForm="ke" type="habit">ke</m><m baseForm="fu" type="BI">fu</m><m baseForm="ingu" type="ind.3.d">ingu</m></w>
b. <w> <m baseForm="kude" type="root" corresp="play">kuthe</m><m baseForm="ka" type="cont">ka</m><m baseForm="la" type="neg">la</m><m baseForm="pe" type="imp.3.sg">pe</m></w>
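
The morpheme-level markup can be read back into an interlinear-style gloss. The Python sketch below is only an illustration under stated assumptions: it takes <w> elements shaped like those in (2) and returns the surface segmentation, the normalised segmentation (from baseForm) and the string of functional labels. The function name gloss is hypothetical.

```python
import xml.etree.ElementTree as ET

# <w> elements modelled on example (2); the labels are those used in the XML.
WORD_A = (
    '<w><m baseForm="kude" type="root" corresp="play">kude</m>'
    '<m baseForm="ke" type="habit">ke</m>'
    '<m baseForm="fu" type="BI">fu</m>'
    '<m baseForm="ingu" type="ind.3.d">ingu</m></w>'
)
WORD_B = (
    '<w><m baseForm="kude" type="root" corresp="play">kuthe</m>'
    '<m baseForm="ka" type="cont">ka</m>'
    '<m baseForm="la" type="neg">la</m>'
    '<m baseForm="pe" type="imp.3.sg">pe</m></w>'
)


def gloss(word_xml):
    """Return (surface segmentation, normalised segmentation, label string)."""
    surface, base, labels = [], [], []
    for m in ET.fromstring(word_xml).iter("m"):
        surface.append(m.text)
        base.append(m.get("baseForm"))
        label = m.get("type")
        if m.get("corresp"):
            # Roots carry a translation in the corresp attribute.
            label = f"{label}({m.get('corresp')})"
        labels.append(label)
    return "-".join(surface), "-".join(base), "-".join(labels)


if __name__ == "__main__":
    for word in (WORD_A, WORD_B):
        print(*gloss(word), sep="\t")
```

For the second word, the surface segmentation keeps the attested spelling kuthe while the normalised segmentation gives kude, so the two tokens can be matched on their root despite the spelling difference.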

The final stage of the tagging will be grapho-phonological parsing (cf. Kopaczyk et al. 2018), which entails providing sound values for each word (as in 3), following a list of spelling-based rules for each text. The results should effectively reconstruct the phonic structure of each text, such that it can be compared with others from different periods and locations, helping to map phonological change from the bottom up.

(3)

   Form      Sound    Lemma   Transl.     Source          Dialect
a. <vúta>    [vɨta]   fücha   'old/big'   Valdivia 1606   North
b. <fücha>   [fɨʧa]   fücha   'old/big'   Augusta 1916    Central/Coast

XML

a. <m> <c ipa="v">v</c><c ipa="ɨ">ú</c><c ipa="t">t</c><c ipa="a">a</c></m>
b. <m> <c ipa="f">f</c><c ipa="ɨ">ü</c><c ipa="ʧ">ch</c><c ipa="a">a</c></m>
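
To make the idea of text-specific, spelling-based rules concrete, here is a minimal Python sketch. It assumes that each source text has its own table mapping graphemes (including digraphs such as ch) to sound values, and that the longest matching grapheme is applied first. The rule tables below are read off example (3) purely for illustration; they are not the project's actual rule lists, and the names RULES, segment and to_xml are hypothetical.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Illustrative rule tables only, inferred from example (3).
RULES = {
    "Valdivia 1606": {"v": "v", "ú": "ɨ", "t": "t", "a": "a"},
    "Augusta 1916": {"f": "f", "ü": "ɨ", "ch": "ʧ", "a": "a"},
}


def segment(form, rules):
    """Split a spelling into (grapheme, IPA) pairs, longest grapheme first."""
    # Normalise so that accented letters compare equal however they were typed.
    rules = {unicodedata.normalize("NFC", g): ipa for g, ipa in rules.items()}
    form = unicodedata.normalize("NFC", form)
    longest = max(len(g) for g in rules)
    pairs, i = [], 0
    while i < len(form):
        for size in range(min(longest, len(form) - i), 0, -1):
            chunk = form[i:i + size]
            if chunk in rules:
                pairs.append((chunk, rules[chunk]))
                i += size
                break
        else:
            raise ValueError(f"no rule for {form[i]!r} in {form!r}")
    return pairs


def to_xml(form, source):
    """Build an <m> element with one <c ipa="..."> child per grapheme."""
    m = ET.Element("m")
    for grapheme, ipa in segment(form, RULES[source]):
        ET.SubElement(m, "c", ipa=ipa).text = grapheme
    return ET.tostring(m, encoding="unicode")


if __name__ == "__main__":
    print(to_xml("vúta", "Valdivia 1606"))
    print(to_xml("fücha", "Augusta 1916"))
```

Keeping one rule table per text means that the same letter can receive different sound values in different sources, which is the property the comparison across periods and locations depends on.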

The front end of the corpus – soon to be available in beta form – will provide search options (in both English and Spanish) at all three levels of tagging (word, morpheme and sound), as well as allowing users to correlate these features across texts and with relevant non-linguistic metadata such as date, location, author and genre. A simpler browser version will also be available for non-linguists, allowing for texts to be browsed and downloaded with parallel translations.

Towards a New World philology

The careful transcription and tagging of the historical material that is proposed for this project follows a long tradition at the Angus McIntosh Centre, working mostly on early English and Scots. The importance of these methods, however, was not lost on some of the more prominent Mapudungun scholars, as evidenced in the following passage by Dr. Rudolf Lenz, who conducted some of the earliest explicitly academic studies of the language:

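Aun el lenguaje vulgar que no tiene ninguna lengua literaria al lado puede ser una cosa mucho ménos determinada de lo que comunmente se cree, i no siempre podrá justificarse que en la edicion filolójica de un testo de siglos pasados se uniforme la ortografia del autor en todos los casos. Cuando la ortografía vacila en lenguas que se escriben poco, esto puede espresar el empleo de diferentes pronunciaciones en una misma palabra, o puede tener la causa de que ninguna de las diferentes maneras de escribir corresponda bien a la pronunciacion. Mucho mas raro será que el autor se haya simplemente equivocada al escribir lo que pronunciaba.

[English translation: Even a vernacular that has no literary language alongside it may be a much less determinate thing than is commonly believed, and it will not always be justifiable, in the philological edition of a text from past centuries, to standardise the author's orthography in every case. When spelling wavers in languages that are seldom written, this may reflect the use of different pronunciations for one and the same word, or it may be because none of the different ways of writing corresponds well to the pronunciation. Much rarer will be the case that the author has simply made a mistake in writing down what he pronounced.]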

Rudolf Lenz (1897:132)