3. Phylogenetics

3.1 Using the phylogenetic method to study language evolution

3.1.1 How can phylogenetic methods be used to study language evolution?

Quantitative methods of studying language evolution require data collection and comparison, much like the biological study of human evolution. Whereas the study of human evolution compares the physical or genetic characteristics of biological species, the study of language evolution compares linguistic data.

The concept of comparing lexical cognates in order to measure the distance between languages seems to have come from the French explorer Dumont D’Urville (Petroni & Serva, 2011). In D’Urville 1832 (cited in Petroni & Serva, 2011: 54), he used 115 lexical items and assigned cognates a distance from 0 to 1. His list included all but three of the items in the 100-word Swadesh list, which is widely used today to generate lexical distances between languages. The percentage of cognates that two languages share on the Swadesh list can be computed to estimate the distance between the languages of interest. Such wordlists can then be used to build phylogenetic trees.
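As a minimal sketch of the percentage-of-shared-cognates idea, the snippet below turns two cognate-coded wordlists into a distance between 0 and 1. The function name `lexical_distance`, the four meanings and the cognate-class labels are all hypothetical and chosen only for illustration; they are not taken from D’Urville or from the Swadesh list.

```python
def lexical_distance(cognates_a, cognates_b):
    """Return 1 minus the fraction of shared cognates.

    Each argument maps a meaning to a cognate-class label. Only meanings
    attested in both wordlists are compared.
    """
    shared_meanings = [m for m in cognates_a if m in cognates_b]
    if not shared_meanings:
        raise ValueError("the two wordlists have no meanings in common")
    matches = sum(1 for m in shared_meanings if cognates_a[m] == cognates_b[m])
    return 1.0 - matches / len(shared_meanings)


# Hypothetical cognate-class labels for four meanings in two languages.
lang_a = {"sun": 1, "water": 1, "two": 1, "dog": 1}
lang_b = {"sun": 1, "water": 2, "two": 1, "dog": 2}

# Two of the four meanings share a cognate class, so the distance is 0.5.
print(lexical_distance(lang_a, lang_b))
```

A distance of 0 means every compared meaning is cognate, and a distance of 1 means none is.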

Phylogenetic analysis is based on a data matrix in which the rows represent the languages to be studied and each column represents a linguistic feature, or character (Nichols & Warnow, 2008). Different languages may have different forms of a character, and these are called ‘states’ of the character (Barbancon et al., 2013). There are three types of linguistic characters: lexical, phonological and morphological. For lexical characters, states correspond to cognate classes. When two or more languages have cognate forms for a meaning, they are assigned the same character state.

Fig. 5. Cognate classes across English, French, Russian and Ingush
(Nichols & Warnow, 2008: 765)

In Fig. 5, sun, soleil and solnce are cognates, so they are assigned the same state (1). Maalx is not cognate with them, so it is assigned a different state (2).
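The coding step described for Fig. 5 can be sketched as follows. The helper `code_states` and its input layout (a list of cognate sets, each a set of language names) are assumptions made for illustration; the forms and the cognate judgement are the ones given above.

```python
def code_states(forms, cognate_sets):
    """Assign each language a state number for one lexical character.

    Languages that belong to the same cognate set receive the same state.
    """
    states = {}
    for state, members in enumerate(cognate_sets, start=1):
        for language in members:
            states[language] = state
    uncoded = set(forms) - set(states)
    if uncoded:
        raise ValueError(f"no cognate judgement for: {sorted(uncoded)}")
    return states


# Forms for the meaning 'sun', as in Fig. 5.
forms = {"English": "sun", "French": "soleil", "Russian": "solnce", "Ingush": "maalx"}
# Cognate judgement: the first three forms are cognate, the Ingush form is not.
cognate_sets = [{"English", "French", "Russian"}, {"Ingush"}]

states = code_states(forms, cognate_sets)
for language, form in forms.items():
    print(language, form, states[language])
```

Repeating this for every meaning in a wordlist yields one column of the data matrix per meaning.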

Fig. 6. Data matrix for linguistic phylogenetic analysis
(Pagel, 2009: 405)

Data matrix M in Fig. 6 is an example of an input to phylogenetic analysis. The first column of M represents a meaning that has four distinct states or cognate classes of words (0, 1, 2 and 3), whereas the second column denotes a meaning that has only two distinct states or cognate classes (0 and 1), and so on.
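To make the row-and-column layout concrete, here is a small sketch with a hypothetical matrix. It is not the matrix from Fig. 6; it only mirrors the description, with four states in the first column and two in the second. The name `states_per_character` is likewise an illustrative choice.

```python
# Hypothetical data matrix: one row per language, one column per meaning.
M = [
    [0, 0, 1],
    [1, 0, 0],
    [2, 1, 1],
    [3, 1, 0],
]


def states_per_character(matrix):
    """List the distinct states found in each column (character)."""
    return [sorted(set(column)) for column in zip(*matrix)]


for index, found in enumerate(states_per_character(M), start=1):
    print(f"character {index}: states {found}")
```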

For phonological characters, states represent the presence or absence of certain sound changes in the history of the language; phonological characters therefore have only two states. Morphological characters, like lexical characters, have states that correspond to cognate classes, but they represent inflectional markers rather than lexical items. The assumption is that if two languages display the same state for the same character, they share a common ancestry. However, borrowing may produce what looks like shared inheritance but is in fact a result of language contact. Parallel development and back-mutation, which are manifestations of a phenomenon called homoplasy, can also produce shared states that cannot be attributed to shared inheritance.

Many phylogeny reconstruction methods used to generate language trees are standard methods used in molecular phylogenetics.

Distance-based methods first transform a character matrix into a distance matrix, in which the distances between the languages of interest are defined. A tree is then constructed from the distance matrix (Nichols & Warnow, 2008). UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is an algorithm that repeatedly joins the two languages, or groups of languages, in the matrix that have the smallest distance. This method assumes that the data in the character matrix produce distances that evolve like clockwork, in other words that they obey a lexical clock. NJ (Neighbour Joining) joins the pair of languages with the smallest corrected distance (one that accounts for unseen state changes), and it does not need the clock assumption to hold (Barbancon et al., 2013).
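The two steps of a distance-based method can be sketched as below. This is an illustrative implementation under stated assumptions, not the procedure of any cited study: the distance is taken to be the uncorrected fraction of characters on which two languages differ, the four languages A to D and their states are hypothetical, and the function returns only the tree shape as nested tuples, without branch lengths.

```python
def hamming_distance(row_a, row_b):
    """Fraction of characters on which two languages show different states."""
    differing = sum(1 for a, b in zip(row_a, row_b) if a != b)
    return differing / len(row_a)


def distance_matrix(names, matrix):
    """Step 1: turn the character matrix into a distance matrix."""
    return [[hamming_distance(matrix[a], matrix[b]) for b in names] for a in names]


def upgma(names, dist):
    """Step 2: repeatedly join the two closest clusters (UPGMA)."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(d, key=d.get)  # closest pair of clusters
        tree_i, size_i = clusters.pop(i)
        tree_j, size_j = clusters.pop(j)
        del d[(i, j)]
        for k in clusters:
            d_ik = d.pop((min(i, k), max(i, k)))
            d_jk = d.pop((min(j, k), max(j, k)))
            # Size-weighted mean equals the average over all member pairs.
            d[(k, next_id)] = (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
        clusters[next_id] = ((tree_i, tree_j), size_i + size_j)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree


# Hypothetical character matrix: four languages, four lexical characters.
matrix = {
    "A": [1, 1, 1, 1],
    "B": [1, 1, 1, 2],
    "C": [1, 2, 2, 3],
    "D": [2, 2, 3, 3],
}
names = list(matrix)
tree = upgma(names, distance_matrix(names, matrix))
print(tree)  # (('A', 'B'), ('C', 'D'))
```

NJ would differ in the pair-selection step, where it uses corrected distances instead of the raw minimum, which is why it does not depend on the clock assumption.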

Other methods include Maximum Parsimony (MP), Maximum Compatibility (MC) and Bayesian analyses (Nichols & Warnow, 2008). MP seeks the tree that requires the smallest number of character state changes, whereas MC seeks the tree on which the largest number of characters are compatible, that is, have evolved without homoplasy. Bayesian methods estimate the probability of each tree being the true tree and produce a probability distribution over the set of trees. The Gray and Atkinson method is one of the Bayesian methods used to construct language trees.
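The quantity that MP minimises can be illustrated by counting state changes on a fixed tree. The sketch below uses Fitch's algorithm, a standard way of obtaining the minimum number of changes for one character on a binary tree; it scores candidate trees but does not search the space of trees. The matrix, the two candidate trees and the function names are hypothetical.

```python
def fitch_changes(tree, states):
    """Return (candidate ancestral states, minimum changes) for one character.

    A tree is either a language name (leaf) or a pair of subtrees.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_changes(left, states)
    right_set, right_cost = fitch_changes(right, states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: at least one change is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, matrix):
    """Sum the minimum number of state changes over all characters."""
    n_characters = len(next(iter(matrix.values())))
    total = 0
    for c in range(n_characters):
        states = {language: row[c] for language, row in matrix.items()}
        total += fitch_changes(tree, states)[1]
    return total


# Hypothetical character matrix: four languages, four lexical characters.
matrix = {
    "A": [1, 1, 1, 1],
    "B": [1, 1, 1, 2],
    "C": [1, 2, 2, 3],
    "D": [2, 2, 3, 3],
}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

print(parsimony_score(tree_1, matrix))  # 6
print(parsimony_score(tree_2, matrix))  # 7
```

Under MP, tree_1 would be preferred over tree_2 because it requires fewer state changes.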

Unlike phonological comparison of linguistic data, in which the sounds of words are compared across languages, the resemblance-based model compares words that appear morphologically similar, both within a language family and across different language families. These words are known as cognate sets: words in different languages that are related semantically and morphologically (Dunn et al., 2005). Relying on cognate sets helps to lay out the larger language families that are already known, such as Indo-European, Austronesian and Sino-Tibetan. In addition, the resulting phylogenetic trees allow researchers to detect possible relationships between languages that historical linguists did not detect when constructing phylogenetic trees manually.

Blood	Bone	Breast	Come	Die	Dog	Drink	Ear
Eye	Fire	Fish	Full	Hand	Hear	Horn	I
Knee	Leaf	Liver	Louse	Mountain	Name	New	Night
Nose	One	Path	Person	See	Skin	Star	Stone
Sun	Tongue	Tooth	Tree	Two	Water	We	You (sg)
