Indo-European phylogenetics with R

A tutorial introduction

In: Indo-European Linguistics
Author: David Goldstein, University of California, Los Angeles, CA, USA

Open Access

Abstract

The last twenty or so years have witnessed a dramatic increase in the use of computational methods for inferring linguistic phylogenies. Although the results of this research have been controversial, the methods themselves are an undeniable boon for historical and Indo-European linguistics, if for no other reason than that they allow the field to pursue questions that were previously intractable. After a review of the advantages and disadvantages of computational phylogenetic methods, I introduce the following methods of phylogenetic inference in R: maximum parsimony; distance-based methods (UPGMA and neighbor joining); and maximum likelihood estimation. I discuss the strengths and weaknesses of each of these methods and in addition explicate various measures associated with phylogenetic estimation, including homoplasy indices and bootstrapping. Phylogenetic inference is carried out on the Indo-European dataset compiled by Don Ringe and Ann Taylor, which includes phonological, morphological, and lexical characters.

1 Introduction

Phylogenetic trees model linguistic descent. More specifically, they are hypotheses about the order of lineage-splitting events from an often unobservable common ancestor to a set of observable descendants (Bowern & Koch 2004: 8–9, Pagel 2017: 152). The phylogeny of the Indo-European languages is a matter of long-standing debate (for a recent overview, see Ringe 2017). Widmer (2018: 374) writes that “Auch in der Indogermanistik gibt es keinen Konsens, wie die Topologie des Stammbaums der indogermanischen Sprachen im Einzelnen aussieht.”1 The members of late clades are clear (that is, we are in no doubt about which languages belong to, e.g., the Celtic clade), but the order in which early clades formed has evaded consensus—with the notable exception of Anatolian, which is widely believed to be a sister to Proto-Nuclear-Indo-European:

Figure 1: Nuclear IE star phylogeny

Citation: Indo-European Linguistics 8, 1 (2020) ; 10.1163/22125892-20201000

One of the major questions of Indo-European linguistics is the order in which clades formed within Nuclear-Indo-European.

In the last twenty or so years, the methods of phylogenetic estimation have changed dramatically. Jäger (2015: 12752) goes so far as to declare: “Computational phylogenetics is in the process of revolutionizing historical linguistics.”2 In fact, the situation is more complex, and Jäger’s statement premature.

On the one hand, it is true that computational phylogenetics has expanded the toolkit of historical linguistics. At the same time, the first wave of research in computational linguistic phylogenetics has engendered extensive controversy (see, e.g., Pereltsvaig & Lewis 2015 along with the reviews of Bowern 2017 and Verkerk 2017). It is no surprise then that skepticism towards computational linguistic phylogenetics runs high in certain circles (see, e.g., Heggarty 2006, Nichols & Warnow 2008: 760).

A more accurate assessment of the current status of computational phylogenetics is that it offers an enormous amount of potential. This potential does not necessarily lie in the ability to overturn long-standing conclusions of the field. Rather, these new methods enable Indo-Europeanists to investigate aspects of language change that were previously intractable (such as estimating branch lengths, rates of character change, and rates of diversification).

It is essential to understand both the advantages and disadvantages of the various computational phylogenetic methods (cf. Bowern 2018). Although it is possible to answer questions with computational methods that are otherwise intractable, computational methods are not in and of themselves “superior” to traditional methods. Reliable results can only come from the use of computational methods in concert with traditional analysis. Furthermore, although computational linguistic phylogenetics will undoubtedly yield exciting results, this success will not come at the expense of traditional comparative linguistic research, since the relationship between these two approaches is one of mutual symbiosis.

The goal of this article is to enable historical linguists who have no experience with computational methods to estimate phylogenies with R and RStudio (R Core Team 2019).3 Although the focus of this tutorial is decidedly on basic knowledge, I provide a substantial introduction to each method (maximum parsimony, UPGMA, NJ, and maximum likelihood).4 The recent overview papers by Nichols & Warnow (2008), Dunn (2015), Bowern (2018), and Garrett (2018) make excellent companion pieces to the practical orientation of this article.

The remainder of the paper is organized as follows. Section 2 discusses the advantages and disadvantages of computational estimation of linguistic phylogenies. Section 3 introduces R and RStudio, the software that we will use for phylogenetic analysis. Building on this, section 4 introduces the dataset and guides the reader through the process of reading data into R. Sections 5 through 8 form the core of the article. These sections present parsimony methods, distance-based methods, and maximum likelihood methods of phylogenetic inference. Valedictory remarks bring the paper to a close in section 9.

Before discussing the advantages and disadvantages of computational linguistic phylogenetics, I need to say a word about the descriptive terms used throughout this paper. I generally prefer the terms current in evolutionary biology to those used in historical linguistics (a practice shared by Lass 1997). The former offers a much richer conceptual vocabulary for phylogenetic analysis than historical linguistics and I see no reason to pass on this bounty.5 Following Ewens & Grant (2005: 497), I avoid the term (phylogenetic) reconstruction in favor of (phylogenetic) estimation or inference, since reconstruction suggests that the process of inferring past linguistic states is free of uncertainty, which is simply not the case. Claims about linguistic prehistory can rarely (if ever) be made with certainty. Concerning the phylogeny of the Indo-European languages, none of the trees in the literature (or presented below) is the true tree (cf. Garrett 2006: 43). The true tree is currently unknowable, because it is unclear how many branches or languages of Indo-European have vanished from the historical record. When it comes to phylogenies and ancestral states, our goal is the best approximation of the true tree and the true state given the extant data.6

2 The advantages and disadvantages of computational methods

The dataset introduced below contains characters from 24 taxa (i.e., languages, or tips of the phylogenetic tree). For a dataset of this size, there are 563,862,029,680,583,509,947,946,875 possible unrooted bifurcating trees.7 The number of possible rooted trees is 25,373,791,335,626,257,947,657,609,375 (Felsenstein 1978a, Felsenstein 2004: 19–36, Baum & Smith 2013: 187–90). In either case, the possible tree space is overwhelming. Although a specialist knows that wide swaths of this tree space are incorrect, it is nevertheless beyond human capabilities to assess which of the many viable candidate trees best fits the data.
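These counts follow from the standard double-factorial formulas: for n taxa there are (2n − 5)!! unrooted and (2n − 3)!! rooted bifurcating trees. The following is a minimal sketch in base R (the helper name double_factorial and the object names are illustrative assumptions). Because R stores such values as double-precision numbers, only roughly the first fifteen digits of the printed results are exact:

```r
# Product of the odd integers 1, 3, 5, ..., k (k is assumed to be odd).
double_factorial <- function(k) prod(seq(1, k, by = 2))

n.taxa <- 24
n.unrooted <- double_factorial(2 * n.taxa - 5)  # (2n - 5)!! = 43!!
n.rooted   <- double_factorial(2 * n.taxa - 3)  # (2n - 3)!! = 45!!

# Print in scientific notation, since the trailing digits of a double are not exact.
format(c(unrooted = n.unrooted, rooted = n.rooted), scientific = TRUE, digits = 4)
```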

It is well known that languages can emit weak or even conflicting phylogenetic signals. Phylogenetic algorithms enable us to make principled decisions on how to handle such cases. This is important, because in such cases researchers can be influenced by phylogenetic analyses that they want to be true. As Efron & Tibshirani (1993: 1) put it, “we are all too good at picking out non-existent patterns that happen to suit our purposes.” McMahon & McMahon (2005: 68–69) and Scarborough (2016: 33) discuss this issue in more detail.

Computational phylogenetics enables us to explore dimensions of linguistic history that are rarely if ever discussed in the traditional scholarship. The Indo-European literature has focused almost exclusively on the question of topology.8 That is of course an important question, but there are other aspects of the history of the Indo-European languages that should also be pursued. For example, we know little about how the rates of change among different components of language (phonology, morphology, syntax, and the lexicon) vary over time (see Nettle 1999a, Nettle 1999b, Clackson 2000).

Computational methods also enable researchers to assess the extent to which the data provide evidence for a particular clade. This is absolutely crucial to any phylogenetic analysis. In making inferences about events that reach back several millennia in time, we do not deal in certainties. We therefore need tools that enable us to acknowledge this uncertainty and the limitations of our data:

The field of phylogenetics should not be seen as an attempt to build trees, but rather to examine alternative trees and then quantify the extent to which data support or reject different phylogenetic conclusions.

Baum & Smith (2013: 265), emphasis in original

To this end, I introduce bootstrap analysis in section 6 below.

Finally, computational methods—in particular maximum likelihood estimation and Bayesian inference—enable historical linguists to infer phylogenies based on specific models of linguistic change (known as transition models; see section 8.3 below). Such models encode assumptions, for instance, about the probability of change and whether certain directions of change are more or less likely. With these methods, it thus becomes possible to incorporate a theory of language change into phylogenetic inference.

For all the advantages of computational methods, they are not without their pitfalls, perhaps the most threatening of which is the tendency to confuse model sophistication (or model precision) with model accuracy (cf. Pereltsvaig & Lewis 2015: 7–10 on scientism). Simply because the sophistication of computational phylogenetic methods outstrips that of traditional methods, one might come to think that these methods (in particular Bayesian inference) will automatically yield a superior approximation to the true tree. Another concern along similar lines is that computational methods can encourage researcher absenteeism, inasmuch as they can lead one to think that computational power can make up for datasets that are either flawed or characterized by conflicting phylogenetic signals. That is of course impossible. The computational methods presented below are only as good as the data culled for analysis.

Some have argued that the transmission of genes is fundamentally different from the transmission of linguistic knowledge (e.g., Andersen 2006, Lewis & Pereltsvaig 2012, Pereltsvaig & Lewis 2015: 149–56).9 Armed with such a view, one might question whether the computational methods that have been developed for the phylogenetic estimation of species are suitable for linguistic data (see Bowern 2018: 283–84). What unites evolutionary biology and historical linguistics is not so much the phenomena that they investigate, but rather the nature of the questions that they pursue. Both fields aim to draw inferences about prehistory from observable data. Provided that the models and underlying assumptions are compatible with linguistic change, there is no reason why methods developed for the evolution of species should be unsuitable for linguistic history. Pagel (2017: 152) draws attention to the crucial point that both genetic information and linguistic properties can be represented as digital systems of inheritance (cf. Bowern 2018: 284). It is true that some methods or models developed for evolutionary biology will not be applicable to linguistic data, but one cannot conclude from such incompatibility that methods of computational phylogenetics in general cannot be used on linguistic data.

2.1 Computational phylogenetics and traditional subgrouping

If one accepts the need for computational phylogenetics, the question arises of what the relationship between computational and traditional methods should be. Computational linguistic phylogenetics faces the following conundrum. If the methods produce novel results at odds with traditional subgrouping, they may be dismissed as incorrect (the most salient example of this is the debate that has surrounded Gray & Atkinson 2003 and Bouckaert et al. 2012). If the methods recapitulate the results of traditional analyses, then they may be deemed otiose. Consequently, one can come away with the impression that there is no place in the field for computational methods, in as much as they are at best unnecessary and at worst misguided.

First and foremost, computational methods should not be viewed as a replacement of traditional subgrouping as based on the comparative method (Ringe, Warnow, & Taylor 2002: 66, Bowern 2017: 427). Computational methods should be used in conjunction with the traditional methods known to yield reliable results:

[T]raditional subgrouping is logically coherent and methodologically unobjectionable: in order to subgroup a particular subset of the family’s languages together, one demands that they exclusively share clear and linguistically significant innovations which are unusual enough that they could not reasonably have arisen more than once independently. To put it in biologist’s terms, one recognises a clade by the presence of unique synapomorphies, rigorously excluding any traits that might conceivably be analogous rather than homologous. This is so clearly correct that we have no intention of even questioning it.10

Ringe, Warnow, & Taylor (2002: 65–66)

There are various ways in which traditional subgrouping and computational phylogenetics can complement one another. For instance, computational methods can play a confirmatory role. If computational methods come to the same answers that the field achieved without the aid of a computer, that is worth knowing. (It would be worth knowing because it would mean that we have an algorithm that approximates the method of phylogenetic inference among historical linguists.) In a similar vein, if some traditional phylogenetic analyses are at odds with computational results, that is also important. In addition, computational methods can be used to guide us out of an impasse. There are many aspects of the history of the archaic Indo-European languages for which traditional methods have not yet yielded a consensus answer. As the quotation from Widmer above reveals, there is a lot of uncertainty surrounding the topology of Indo-European, for instance.

Subgroups are standardly established on the basis of shared innovations. To identify an innovation one has to be able to identify an ancestral state. In some cases, this is not a challenge. For instance, given a language with only oral vowels and nasal consonant codas and a related language with nasal vowels but no nasal consonant codas, the nasal vowels of the latter are very likely the innovation. In other cases, determining the innovation is more challenging. The continued uncertainty of whether the augment was present in PIE is one such example.11

Not only does subgrouping depend on the inference of ancestral states, but the inference of ancestral states also depends upon subgrouping. When a cognate lexical item is attested in, say, three taxa then one has to decide how far back its ancestral lexical item should be projected—that is, whether to some intermediate interior node or to PIE itself. Phylogeny plays a crucial role in assessing such questions (for further discussion, see, e.g., Mallory & Adams 2006: 106–10, Olander 2018). The upshot is a chicken-and-egg scenario in which subgrouping and ancestral-state inference can be mutually dependent endeavors.

3 Software

R is a statistical programming language built on the S language (Wickham 2014). R offers many advantages, foremost of which is that it is free, general purpose software. It boasts over 4,000 libraries, which include a wide array of packages for phylogenetic analysis. The analyses and tree graphs presented below were all carried out in version 3.5.3 of R.12 R can be downloaded at https://www.r-project.org.

Once R has been installed, one should also download the Integrated Development Environment (IDE) RStudio (available at https://www.rstudio.com).13 I urge the reader to use RStudio (as opposed to R) for carrying out the phylogenetic analyses below.

Once you have R and RStudio installed, you will need to install packages for phylogenetic analysis. The two most important packages for our purposes are ape (Paradis 2012) and phangorn (Schliep 2011, Schliep 2018b). Packages can be downloaded to your hard drive with the following command (the ‘#’ symbol is used for comments in R; entering them in the R console in RStudio will have no effect):
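A minimal sketch of this installation step, assuming that both packages are fetched from CRAN with the base function install.packages():

```r
# Download ape and phangorn from CRAN to the local package library.
# This needs an internet connection and only has to be done once per machine.
install.packages(c("ape", "phangorn"))
```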

Typically you will download packages from CRAN, The Comprehensive R Archive Network (https://cran.rstudio.com). As explained below, however, packages can be downloaded from other sources, such as BioConductor or GitHub.

Once the packages have been downloaded, they need to be loaded into the current session, which can be done with the library() function:
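A minimal sketch of the loading step, assuming that ape and phangorn have already been installed:

```r
# Attach the packages so that their functions can be called directly
# in the current session.
library(ape)
library(phangorn)
```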

Once the packages are loaded into your working environment, their functions will be at your disposal.

At this point, you may want to create a new script file in RStudio rather than work directly in the R console. To do this, open RStudio and go to File > New File > R Script in the menu bar. A new script file will then appear above the console pane. You should put the commands for loading the above packages in the preamble of the document. All of the code below for phylogenetic inference and visualization of trees is available along with the datasets used in this tutorial at http://doi.org/10.5281/zenodo.3417299.

For plotting trees, one can use the packages ggdendro and ggtree (Yu et al. 2017), which extend the ggplot2 package. The trees below were produced with ggtree version 1.14.6 (Yu et al. 2017). In contrast to the other packages described in this tutorial, ggtree is not available on CRAN. It is available from BioConductor via the package BiocManager, which can be installed with the following code:
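A sketch of this step, assuming the usual route of installing the BiocManager package from CRAN (the check with requireNamespace() merely avoids reinstalling it when it is already present):

```r
# BiocManager is itself distributed via CRAN; install it only if it is missing.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
```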

Once BiocManager is loaded, ggtree is installed and loaded as follows:
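A sketch of these two steps, assuming that BiocManager is already installed and that an internet connection is available:

```r
# Load BiocManager, install ggtree from BioConductor, and attach ggtree
# for the current session.
library(BiocManager)
BiocManager::install("ggtree")
library(ggtree)
```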

4 The dataset

The phylogenetic trees presented in the subsequent sections are based on the phonological (Ringe & Taylor 2007b), morphological (Ringe & Taylor 2007a), and lexical characters (Ringe & Taylor 2002, Ringe, Warnow, & Taylor 2012) in the screened dataset created by Don Ringe and Ann Taylor (Nakhleh, Ringe, & Warnow 2005: 178; for a critical assessment of the dataset, see Drinka 2013: 383–85). It contains twenty-two phonological characters; twelve morphological ones; and 259 lexical characters, for a total of 293 characters.

The dataset uses multistate character values. The augment, which is character M2 in Ringe & Taylor (2007a), will serve as an illustrative example (for further examples, see Nakhleh, Ringe, & Warnow 2005: 410–18):

(1) Multi-state character encoding for the augment

Hittite       2    Avestan            1    Luvian        10    Gothic           15
Armenian      1    Old Church Slavic  5    Lycian        11    Old Norse        16
Greek         1    Lithuanian         6    Tocharian A   12    Old High German  17
Albanian      3    Old English        7    Old Persian    1    Welsh            18
Tocharian B   4    Old Irish          8    Old Prussian  13    Oscan            19
Vedic         1    Latin              9    Latvian       14    Umbrian          20

The value 1 denotes the presence of the augment. Character values from 2 onwards denote its absence.14
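To see what this coding implies, the values in (1) can be entered as a named vector in base R. The sketch below is purely illustrative (the object name augment is an assumption and not part of the dataset files). It shows that the taxa with the augment share state 1, whereas every taxon without it has a state of its own, so that on this coding the absence of the augment does not group any two taxa together:

```r
# Character M2 (the augment), entered by hand from example (1).
augment <- c(Hittite = 2, Armenian = 1, Greek = 1, Albanian = 3,
             `Tocharian B` = 4, Vedic = 1, Avestan = 1,
             `Old Church Slavic` = 5, Lithuanian = 6, `Old English` = 7,
             `Old Irish` = 8, Latin = 9, Luvian = 10, Lycian = 11,
             `Tocharian A` = 12, `Old Persian` = 1, `Old Prussian` = 13,
             Latvian = 14, Gothic = 15, `Old Norse` = 16,
             `Old High German` = 17, Welsh = 18, Oscan = 19, Umbrian = 20)

# Taxa that share the augment (state 1).
names(augment)[augment == 1]

# Frequency of every state other than 1.
table(augment)[-1]
```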

At the risk of stating a truism, I want to stress the critical importance of character selection and encoding (cf. Nakhleh et al. 2005: 172, Geisler & List 2010).15 This is by far the most important component of phylogenetic analysis. No matter the sophistication of the method of phylogenetic inference, if the linguistic analysis of the data is flawed (e.g., incorrect coding of cognates or poorly selected characters), the estimated phylogeny will also be flawed (cf. Johnson 2008: 250, Chang et al. 2015: 221). In an era of ever increasing technological sophistication, it is more important than ever that we be able to distinguish accuracy and precision, two phenomena that, though often mistaken for one another, are in fact worlds apart.16 Simply because a method is more sophisticated or yields more precise answers (e.g., an estimated time depth for Proto-Indo-European) does not mean that such answers automatically lay greater claim to the truth.

4.1 Reading data into R

The Indo-European character datasets curated by Don Ringe and Ann Taylor are available on Luay Nakhleh’s website (https://www.cs.rice.edu/~nakhleh/CPHL/).17 We read the data into R from the web as follows:

The Indo-European character data is now the R object screened.df (the object bears the extension .df because it is a data structure known as a dataframe). The argument stringsAsFactors = FALSE enables the values in the table to be treated as character strings and fill = NA is needed because the rows do not all have the same number of elements. This argument inserts NA in cells of the table to make the rows equal in length.
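A minimal sketch of such a reading step on a toy stand-in for the data file. The three rows in raw.lines and the object name toy.df are invented for illustration; a real call would pass the URL of the data file as the first argument instead of text =. Since read.table() documents fill as a logical flag, the sketch passes fill = TRUE in order to pad short rows with NA:

```r
# Toy stand-in for the downloaded file: whitespace-separated rows of unequal length.
raw.lines <- c("P1 1 1 2",
               "P2 1 2",
               "M2 2 1 1")

toy.df <- read.table(text = raw.lines,
                     stringsAsFactors = FALSE,  # keep labels as character strings
                     fill = TRUE)               # pad short rows with NA

toy.df
```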

A few things need to be changed before we can analyze the data (character M11 is removed per Ringe & Taylor 2007a: 9–10):

For several of the phylogenetic analyses below, I use version 2.4 of the package phangorn (Schliep 2018b), which requires that the data be in the phyDat structure. The following code transforms the above dataframe into a phyDat object (see further Schliep 2017):
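A sketch of such a conversion on a hypothetical six-taxon toy matrix (the matrix toy, its values, and the object name toy.phydat are assumptions made for illustration and are not the tutorial's data). With type = "USER", phyDat() needs the set of admissible character states in the argument levels:

```r
library(phangorn)

# Hypothetical toy data: one row per taxon, one column per character.
toy <- do.call(rbind, strsplit(c(Hittite = "111111", Luvian = "111111",
                                 Greek = "222211", Armenian = "222211",
                                 Vedic = "221122", Avestan = "221122"), ""))

# Convert the character matrix into a phyDat object with user-defined states.
toy.phydat <- phyDat(toy, type = "USER", levels = c("1", "2"))
toy.phydat
```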

The object screened.phydat will serve as the input to most of the phylogenetic analyses below. To see what the object contains just type its name into the console:

screened.phydat

## 24 sequences with 293 character and 282 different site patterns.

## The states are 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 32

Finally, most of the methods below infer unrooted trees. To establish the branching order among clades, we need to select an outgroup. Since Anatolian is now agreed by many to have been the first clade to branch off (e.g., Melchert & Oettinger 2009: 53–54, Melchert forthcoming), the Anatolian languages in the dataset (that is, Hittite, Lycian, and Luvian) will serve as the outgroup. It is created as follows:
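A minimal sketch of how such an outgroup object can be defined. The labels are assumed to match the taxon names used in (1); the labels in the data file itself may be spelled differently and would have to be adjusted accordingly:

```r
# Character vector with the labels of the outgroup taxa (assumed spellings).
anatolian <- c("Hittite", "Lycian", "Luvian")
anatolian
```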

Below I use the object anatolian in the specification of the outgroup.

5 Parsimony methods

We begin with parsimony methods (Fitch 1971, Stewart 1993, Swofford et al. 1996: 415–26, Kitching et al. 1998, Felsenstein 2004: 1–146, Albert 2005, Swofford & Sullivan 2009, Nunn 2011: 30–33, Baum & Smith 2013: 173–215, Yang 2014: 95–100, Warnow 2018: 63–69), which resemble traditional methods of subgrouping. Maximum parsimony methods are based on an optimality criterion: the tree that requires fewest changes for a given dataset is optimal. (The total number of steps for a dataset on a given tree is known as the length of the tree.) More specifically, the optimal tree minimizes the amount of homoplasy.18 Underlying this method is the assumption that language change is slow (in the sense that the characters have only undergone a small number of transitions) and that we should therefore prefer phylogenies that minimize the number of changes posited for the data.19

There are several algorithms for calculating the parsimony score of a tree for a given dataset, the most prominent of which are Fitch, Sankoff, and Dollo. In Fitch parsimony, a change between any two states is possible, and all changes count for just one step (Fitch 1971, Felsenstein 2004: 11–13). Sankoff parsimony also allows a change between any two states (Sankoff 1975, Felsenstein 2004: 13–16). The crucial difference is that Sankoff parsimony assumes a cost matrix for transitions between any two given states.20
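The difference between the two counting schemes can be illustrated with a small sketch. Everything in it is a toy assumption (the six-taxon matrix, the two candidate trees, and a cost matrix that charges two steps for every change); it uses the phangorn function parsimony(), whose default method is Fitch parsimony:

```r
library(ape)
library(phangorn)

# Hypothetical toy data: one row per taxon, one column per character.
toy <- do.call(rbind, strsplit(c(Hittite = "111111", Luvian = "111111",
                                 Greek = "222211", Armenian = "222211",
                                 Vedic = "221122", Avestan = "221122"), ""))
toy.phydat <- phyDat(toy, type = "USER", levels = c("1", "2"))

# Two candidate topologies: one that fits the toy characters and one that does not.
good.tree <- read.tree(text = "((Hittite,Luvian),(Greek,Armenian),(Vedic,Avestan));")
bad.tree  <- read.tree(text = "((Hittite,Greek),(Luvian,Vedic),(Armenian,Avestan));")

# Fitch parsimony: every change counts as one step.
fitch.good <- parsimony(good.tree, toy.phydat)
fitch.bad  <- parsimony(bad.tree, toy.phydat)

# Sankoff parsimony with an assumed cost matrix: every change costs two steps.
cost <- matrix(c(0, 2,
                 2, 0), nrow = 2, byrow = TRUE)
sankoff.good <- parsimony(good.tree, toy.phydat, method = "sankoff", cost = cost)

c(fitch.good = fitch.good, fitch.bad = fitch.bad, sankoff.good = sankoff.good)
```

The tree with the lower score is preferred under either scheme; the cost matrix only changes how heavily each inferred transition is weighted.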

Another form of parsimony that is relevant to linguistic phylogenetics is Dollo parsimony. According to this model, a trait can be acquired once, and if lost it can never be regained (Farris 1977). This form of parsimony is of interest to historical linguistics because it has a correlate in the domain of sound change, namely Garde’s Principle (Garde 1961), which states that phonological mergers cannot be undone (Hoenigswald 1960: 75–82, 87–98). So once two phonemes merge, their ancestral distribution cannot be recovered. (For a discussion of this phenomenon and apparent exceptions, see Silverman 2012: 62–77.)

For up to about twenty taxa, the branch and bound algorithm (introduced in section 5.1 below) is guaranteed to find the most parsimonious tree. For larger datasets, we need recourse to a heuristic search algorithm, which I introduce in section 5.5 below. In contrast to the branch and bound method, these search algorithms are not guaranteed to find the most parsimonious tree.

5.1 Branch and bound

The branch and bound algorithm is guaranteed to find the most parsimonious tree(s) (Felsenstein 2004: 38). The algorithm does not, however, calculate the length of all possible trees, but rather exploits the following insight to exclude regions of unparsimonious trees (see further Felsenstein 2004: 60–64): adding taxa to a tree will never decrease its length (Baum & Smith 2013: 189, Huson, Rupp, & Scornavacca 2010: 35). That is, whatever homoplasy exists on a tree will never be reduced by adding taxa to the tree. So if a partial tree containing only a subset of the taxa already has a parsimony score higher than the current bound (i.e., the score of the best tree found so far), then every tree that is built by adding the remaining taxa to this partial tree will be less parsimonious than the current best tree (Baum & Smith 2013: 189). Thus the branch and bound algorithm reduces the tree space by eliminating swaths that cannot contain an optimal tree and thereby drastically reduces the number of trees for which a parsimony score is calculated.

The main disadvantage of this technique is that it is very slow and can only really be used for datasets that contain at most ten to twenty taxa. The package phangorn contains the function bab(), which will find all most parsimonious trees from a given dataset (depending on your computer, you may have to wait up to ten minutes to get the command prompt back):

With the bab() function, one can specify a start tree (i.e., a tree used to initiate the search) by adding tree = inside the parentheses. (Options of a function such as this one are known as arguments.) Here I opted not to do that by setting the value of this argument to NULL. Doing so causes a ratchet search (introduced below in section 5.5) to be performed to find a start tree.

The output of the bab() function is an object of the class multiPhylo (see further Paradis 2012: 55–56). For our dataset, the branch and bound search returns fifteen maximally parsimonious trees. By calling the function parsimony() from the phangorn package (see further Paradis 2012: 165–66), we can confirm that the parsimony scores (or p-scores) are identical:21
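A sketch of a branch and bound search followed by a check of the parsimony scores. It runs on a hypothetical six-taxon toy dataset rather than on the tutorial's data, so it finishes almost instantly; the object names toy.phydat and bb.trees are assumptions made for illustration:

```r
library(phangorn)

# Hypothetical toy data: one row per taxon, one column per character.
toy <- do.call(rbind, strsplit(c(Hittite = "111111", Luvian = "111111",
                                 Greek = "222211", Armenian = "222211",
                                 Vedic = "221122", Avestan = "221122"), ""))
toy.phydat <- phyDat(toy, type = "USER", levels = c("1", "2"))

# Branch and bound search without a user-supplied start tree.
bb.trees <- bab(toy.phydat, tree = NULL, trace = 0)

class(bb.trees)    # the result is a set of trees
length(bb.trees)   # number of most parsimonious trees found

# One parsimony score per tree in the set.
parsimony(bb.trees, toy.phydat)
```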

There are fifteen p-scores, one for each tree.

5.1.1 Rooting the trees and adding branch lengths

The branch and bound algorithm returns unrooted trees, which we can confirm with the function is.rooted():

To root the trees we call the function root():

This code sets Anatolian as the outgroup of each of the trees from the branch and bound algorithm. To check that the trees are in fact rooted, we again call the function is.rooted():
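The sequence of checking, rooting, and checking again can be sketched on a single hand-written toy tree. The Newick string and the object names are assumptions made for illustration; resolve.root = TRUE asks ape to return a tree with a bifurcating root:

```r
library(ape)

# A hypothetical unrooted tree: three subtrees attached to one basal node.
unrooted <- read.tree(text = "((Hittite,Luvian),(Greek,Armenian),(Vedic,Avestan));")
is.rooted(unrooted)

# Root the tree on the (toy) Anatolian outgroup.
anatolian <- c("Hittite", "Luvian")
rooted <- root(unrooted, outgroup = anatolian, resolve.root = TRUE)
is.rooted(rooted)
```

In ape, root() and is.rooted() also accept a set of trees of class multiPhylo, in which case is.rooted() returns one value per tree.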

The trees produced by the branch and bound algorithm also lack branch lengths. To add branch lengths to the trees, we call acctran() from the phangorn package:

This function estimates branch lengths via a method known as accelerated transformation. Homoplastic characters can lead to multiple maximally parsimonious trees. The central idea of accelerated transformation is to assign character-state changes as early as possible on the tree (that is, as close to the root as possible), which maximizes character-state reversals (for more on the calculation of branch length, see Swofford & Maddison 1987, Felsenstein 2004: 70–72).

The following code returns the length of each branch on the first tree:

The branch lengths represent the number of inferred changes. By changing the index in the double brackets, one can obtain the branch lengths for other trees. Summing the length of each branch, we obtain the p-score observed above:
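The relation between branch lengths and the p-score can be sketched on a toy example. The six-taxon data, the hand-written tree, and the object names are assumptions made for illustration; for a set of trees one would first pick out a single tree with double brackets, e.g. trees[[1]]:

```r
library(ape)
library(phangorn)

# Hypothetical toy data: one row per taxon, one column per character.
toy <- do.call(rbind, strsplit(c(Hittite = "111111", Luvian = "111111",
                                 Greek = "222211", Armenian = "222211",
                                 Vedic = "221122", Avestan = "221122"), ""))
toy.phydat <- phyDat(toy, type = "USER", levels = c("1", "2"))

toy.tree <- read.tree(text = "((Hittite,Luvian),(Greek,Armenian),(Vedic,Avestan));")

# Add branch lengths by accelerated transformation.
toy.tree <- acctran(toy.tree, toy.phydat)

toy.tree$edge.length            # inferred number of changes on each branch
sum(toy.tree$edge.length)       # total length of the tree
parsimony(toy.tree, toy.phydat) # p-score of the same tree
```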

5.2 Visualization

Phylogenetic trees can be plotted with the plot() function. Here for instance is the first of the branch and bound trees:
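A minimal plotting sketch, assuming a hand-written toy tree in place of the branch and bound trees (for a set of trees, a single tree would be selected with double brackets, e.g. trees[[1]]):

```r
library(ape)

# A hypothetical rooted toy tree in Newick format.
toy.tree <- read.tree(text = "((Hittite,Luvian),((Greek,Armenian),(Vedic,Avestan)));")

# plot() dispatches to ape's plotting method for objects of class phylo.
plot(toy.tree)
```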

Figure 2: Branch and bound tree 1

The output will appear in the plot pane in the lower right corner of the RStudio window. By clicking the Export menu, one can save it as a file.22

Trees two and six of the branch and bound trees are plotted below. To the right of each tree I include a heatmap of the phonological and morphological characters in the dataset so that one can get a sense of the underlying data. Tree two is paired with the phonological characters from the dataset, while tree six is paired with the morphological. In the interest of enhancing the visualization, the original multistate characters were transformed into binary characters. The binary dataset and the code used for the transformation are available at http://doi.org/10.5281/zenodo.3417299.

Figure 3: Branch and bound tree 2 with binary phonological characters

In the first three rows of the heatmap, the Anatolian languages show an almost uniform block of 0 values. We see in characters P4 through P7 some of the innovations (i.e., 1 values) that define Proto-Nuclear-Indo-European. (For a description of the change represented by each column, see Ringe & Taylor 2007b and Ringe & Taylor 2007a.)

Figure 4: Branch and bound tree 6 with binary morphological characters

I have highlighted a portion of the Anatolian clade and Albanian because these are the loci of variation among the fifteen branch and bound trees. In the next section, I explore these fifteen branch and bound trees further with consensus and maximum clade credibility trees.

Given that the true Indo-European tree is not known, evaluation of phylogenetic methods is challenging (Nichols & Warnow 2008: 760). Since there is no debate among Indo-Europeanists about the members of clades such as Slavic, Celtic, and Germanic, below I use correct assignment of languages to recognized clades as the baseline evaluation measure.

5.3 Maximum clade credibility tree

We can summarize the set of branch and bound trees with a maximum clade credibility tree. The function maxCladeCred() evaluates each tree according to the frequency of each clade within the set of trees.
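A sketch of this summary on three hand-written toy trees, two of which share the same topology (the trees and the object names are assumptions made for illustration):

```r
library(ape)
library(phangorn)

# Hypothetical set of rooted trees; the first two are identical.
t1 <- "((Hittite,Luvian),((Greek,Armenian),(Vedic,Avestan)));"
t3 <- "((Hittite,Luvian),(Greek,(Armenian,(Vedic,Avestan))));"
toy.trees <- read.tree(text = c(t1, t1, t3))

# Select the tree whose clades are most frequent across the set.
mcc.tree <- maxCladeCred(toy.trees)
plot(mcc.tree)
```

In this toy set the clade (Greek, Armenian) occurs in two of the three trees, so the topology of the first tree is the one that scores best.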

Trees with clades that are more frequent will have higher scores. The tree with the highest score is then selected as the maximum clade credibility tree:

Figure 5: Maximum clade credibility tree from branch and bound search

5.4 Consensus trees

Another way to summarize a set of trees is with a consensus tree (Paradis 2012: 179–82), which reduces a set of trees to a single tree. There are two types of consensus trees, strict consensus trees and majority-rule consensus trees. In a strict consensus tree, the clades that are not observed in all the trees of a set are represented as polytomies, that is, as multifurcating branches.23 In a majority-rule consensus tree, the clades not observed in a majority of trees are represented as polytomous. To create a consensus tree, use the ape function consensus(). By default, a strict consensus tree is calculated:
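A sketch of a strict consensus on three hand-written toy trees (the trees and the object names are assumptions made for illustration). Because the trees disagree about the position of Greek and Armenian, the strict consensus is expected to contain a polytomy, which is.binary() detects:

```r
library(ape)

# Hypothetical set of trees; the first two are identical.
t1 <- "((Hittite,Luvian),((Greek,Armenian),(Vedic,Avestan)));"
t3 <- "((Hittite,Luvian),(Greek,(Armenian,(Vedic,Avestan))));"
toy.trees <- read.tree(text = c(t1, t1, t3))

# Strict consensus (the default): keep only groupings found in every tree.
strict.tree <- consensus(toy.trees)

is.binary(strict.tree)  # a fully bifurcating tree would be reported as binary
plot(strict.tree)
```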

5.4
d242103628e991

Figure 6

Strict consensus tree from branch and bound search

The length of each branch is now uniform because this particular tree was not among the branch and bound trees. (In fact, attempting to calculate the branch lengths of this tree with acctran() will yield an error message that the tree must be binary.)24 The multifurcations reveal uncertainty at a number of points in the tree, in particular with the internal structure of Anatolian and the order of lineage-splitting events among Albanian, Greco-Armenian, Indo-Iranian, and the clade comprising Balto-Slavic, Germanic, and Italo-Celtic.

To calculate a majority-rule consensus tree, use the argument p = 0.5:

Figure 7

Majority-rule consensus tree from branch and bound search

This tree contains all of the clades that occur in at least fifty percent of the branch and bound trees. The branches are all now bifurcating with the exception of Anatolian.
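
The difference between the two kinds of consensus tree can be seen on a toy example. This is a minimal sketch, assuming three hypothetical trees over placeholder taxa A to E that disagree about whether A groups with B or with C. The strict consensus should collapse the disputed grouping into a polytomy, whereas the majority-rule consensus should retain (A,B), which occurs in two of the three trees.

```r
library(ape)

# Three hypothetical trees (toy data; the labels are placeholders)
toy_trees <- read.tree(text = c(
  "(((A,B),C),(D,E));",
  "(((A,B),C),(D,E));",
  "(((A,C),B),(D,E));"
))

strict   <- consensus(toy_trees)           # default p = 1: strict consensus
majority <- consensus(toy_trees, p = 0.5)  # majority-rule consensus

# A consensus tree with polytomies is not binary
strict_is_binary   <- is.binary(strict)
majority_is_binary <- is.binary(majority)
```

Checking is.binary() before calling functions that require a fully resolved tree is one way to anticipate the error mentioned above for acctran().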

5.5 Heuristic search

With large datasets, the size of the possible tree space makes it unfeasible to calculate the p-score of each tree. Various heuristic searches have therefore been developed. In phangorn these rely on branch-swapping methods. The basic idea behind such methods is to generate a number of trees by rearranging parts of an original tree and then moving to the one that has the best parsimony score (see further Huson, Rupp, & Scornavacca 2010: 37–40). This process is iterated until no improvement in the length of the tree can be found. The reader should be aware that the heuristic searches below are not guaranteed to find the most parsimonious tree(s), since there is the possibility that they can get stuck in local optima (roughly speaking, local optima are regions of the tree space that are good relative to other areas, but not the best).25

The phangorn package implements a parsimony-based heuristic search known as the ratchet. The ratchet search relies on a branch-swapping algorithm known as tree bisection and reconnection (TBR). I refer the reader to Nixon (1999) and Felsenstein (2004: 51–52) for the details of the algorithm. The following code estimates a maximum parsimony tree with a ratchet search (which returns unrooted trees):

Figure 8

Parsimony ratchet tree

The parsimony ratchet is generally considered the most reliable among the branch-swapping heuristic search methods.
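
A ratchet search can be sketched on toy data as follows. The character matrix below is hypothetical (six placeholder taxa and seven binary characters, not the Ringe-Taylor dataset), and the object names toy_chars and toy_data are assumptions made for this illustration. Because the ratchet is stochastic, a seed is set so that the run is repeatable.

```r
library(phangorn)

# Hypothetical binary characters (rows = taxa, columns = characters).
# Columns 1-2 support (A,B), 3-4 support (D,E,F), 5-6 support (E,F);
# column 7 is shared by A and E and conflicts with the others.
toy_chars <- matrix(c(
  "1","1","0","0","0","0","1",  # A
  "1","1","0","0","0","0","0",  # B
  "0","0","0","0","0","0","0",  # C
  "0","0","1","1","0","0","0",  # D
  "0","0","1","1","1","1","1",  # E
  "0","0","1","1","1","1","0"   # F
), nrow = 6, byrow = TRUE,
   dimnames = list(c("A", "B", "C", "D", "E", "F"), NULL))
toy_data <- phyDat(toy_chars, type = "USER", levels = c("0", "1"))

set.seed(1)  # the ratchet involves random resampling
ratchet_tree <- pratchet(toy_data, trace = 0)

# Depending on the phangorn version, several equally good trees may be
# returned; in that case keep the first one.
if (inherits(ratchet_tree, "multiPhylo")) ratchet_tree <- ratchet_tree[[1]]

# p-score of the tree that the search settled on
ratchet_score <- parsimony(ratchet_tree, toy_data)
```

For this toy matrix the shortest possible tree has eight steps: the six mutually compatible characters need one change each, and the conflicting seventh character needs two.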

Two other branch-swapping algorithms are implemented in phangorn: nearest neighbor interchanges (NNI; Felsenstein 2004: 38–41, Huson, Rupp, & Scornavacca 2010: 38) and subtree pruning and regrafting (SPR; Felsenstein 2004: 41–44, Huson, Rupp, & Scornavacca 2010: 38–39). To perform these searches, call the function optim.parsimony(). With the argument rearrangements, one specifies “SPR” or “NNI” rearrangements (the former is the default value). NNI and SPR searches can be used after the parsimony ratchet to see if any further optimization of the parsimony score is possible:
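
The following sketch shows the general shape of such a call on toy data. The character matrix and the deliberately poor starting tree are hypothetical, and the object names are assumptions made for this illustration. In a real analysis the starting tree would be the tree returned by the ratchet search.

```r
library(ape)
library(phangorn)

# Hypothetical binary characters (rows = taxa, columns = characters)
toy_chars <- matrix(c(
  "1","1","0","0","0","0","1",  # A
  "1","1","0","0","0","0","0",  # B
  "0","0","0","0","0","0","0",  # C
  "0","0","1","1","0","0","0",  # D
  "0","0","1","1","1","1","1",  # E
  "0","0","1","1","1","1","0"   # F
), nrow = 6, byrow = TRUE,
   dimnames = list(c("A", "B", "C", "D", "E", "F"), NULL))
toy_data <- phyDat(toy_chars, type = "USER", levels = c("0", "1"))

# A deliberately poor (hypothetical) starting tree and its p-score
start_tree <- read.tree(text = "((A,E),C,(D,(B,F)));")
p_start <- parsimony(start_tree, toy_data)

# Hill-climb from the starting tree with each kind of rearrangement
nni_tree <- optim.parsimony(start_tree, toy_data,
                            rearrangements = "NNI", trace = 0)
spr_tree <- optim.parsimony(start_tree, toy_data,
                            rearrangements = "SPR", trace = 0)

p_nni <- parsimony(nni_tree, toy_data)
p_spr <- parsimony(spr_tree, toy_data)
```

Comparing p_nni and p_spr with p_start shows how far each search was able to shorten the tree. Neither search can make the tree longer than the one it started from.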

In this case, optimization was unable to find a better tree. The p-score of the above tree is 3612, which is the same value we obtained above from the branch and bound search. To confirm that the phylogenies are identical, we use all.equal.phylo():
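
A comparison of this kind can be illustrated with hypothetical four-taxon trees. In the sketch below, t1 and t2 are two different Newick spellings of the same rooted topology, while t3 is a different topology. Edge lengths are ignored because the toy trees have none, and isTRUE() is used so that any non-TRUE result counts as a mismatch.

```r
library(ape)

# Two spellings of one hypothetical rooted topology
t1 <- read.tree(text = "((A,B),(C,D));")
t2 <- read.tree(text = "((D,C),(B,A));")

# A different hypothetical topology
t3 <- read.tree(text = "((A,C),(B,D));")

same_12 <- isTRUE(all.equal.phylo(t1, t2, use.edge.length = FALSE))
same_13 <- isTRUE(all.equal.phylo(t1, t3, use.edge.length = FALSE))
```

Here same_12 records whether the two spellings are recognized as the same phylogeny, and same_13 whether the genuinely different tree is told apart from it.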

5.6 Measuring homoplasy and consistency

There are other measures of tree support besides tree length. Here I introduce two, the consistency index and the retention index, both of which provide measures of homoplasy on a tree. Homoplasy refers to a situation in which character states develop more than once on a tree. Two types of situations result in homoplasy (Baum & Smith 2013: 93). The first is parallel independent innovation. Changes that are common (e.g., palatalization of velars before front vowels) are good candidates for homoplastic characters. The second type of situation that can result in homoplasy is so-called “Duke of York” changes. To draw again on sound change, a trajectory [a] > [o] > [a] is homoplastic. In the evolutionary biology literature, this phenomenon is known as backmutation.

A character is consistent on a given tree if it exhibits the minimum number of changes (i.e., if it shows no homoplasy). The minimum number of changes is always the observed number of character states minus one. For a binary character with values 1 and 0, the minimum number of changes is 1 (i.e., two observed character states minus one). Any tree that accounts for the distribution of the character states 1 and 0 with a single change is consistent with that character. If a tree requires more changes than the minimum, the character is homoplastic on that tree.

The consistency index is a measure of the consistency of a tree:

$$\mathrm{CI} = \frac{m}{s} \tag{2}$$

Here $m$ is the minimum number of steps required by a tree. As mentioned above, this is equal to the number of observed character states minus one. $s$ is the length of the tree, that is, the actual number of steps on the tree. To calculate the consistency index for a tree, the values of the numerator and denominator are summed for all characters before division. Values of the consistency index range from 1 to close to 0. A consistency index of 1 means that all characters are perfectly consistent on the tree (that is, there is no homoplasy). This situation arises of course when $m$, the minimum number of changes, equals $s$, the actual number of changes.

The consistency index is not without its problems (Sanderson & Donoghue 1989, Archie & Felsenstein 1993, Egan 2006: 73). For one, there is a negative correlation between the consistency index and the number of taxa: the consistency index falls as the number of taxa rises (Sanderson & Donoghue 1989). This correlation is explained by the fact that as the number of nodes (i.e., lineage-splitting events) increases, there are more opportunities for homoplasy (Hauser & Boyajian 1997: 97). So with larger datasets, the accuracy of the consistency index is questionable. Second, it is difficult to compare consistency indices across datasets. Third, autapomorphies (unique innovations) and symplesiomorphies (shared inherited traits) both inflate the consistency index, although neither of these situations should affect it since neither involves homoplasy. Finally, the absence of conventions for interpreting consistency indices means that it is not clear what constitutes a high or low value.

The retention index was intended as an improvement on the consistency index (Farris 1989, Lipscomb 1998). Unlike the latter, the former can range from 0 to 1. Like the consistency index, the retention index is the ratio of the observed number of changes and the minimum number of changes, but it is more complex in that it takes into account the maximum number of possible changes. One can think of it as the proportion of the observed number of synapomorphies (i.e., shared innovations) to the maximum possible number of synapomorphies (Egan 2006: 73, Klingenberg & Gidaszewski 2010: 250). It is calculated as follows:

$$\mathrm{RI} = \frac{g - s}{g - m} \tag{3}$$

Here $g$ is the maximum number of steps required by a tree, while $m$ and $s$ are, as in the consistency index, the minimum number of steps and the observed number of steps. To calculate the maximum number of steps on the tree, we count, for each character, how many taxa exhibit each observed state. We select the lowest count in each case and then sum up that value for every character in the dataset.

If the retention index equals one, a character is maximally consistent, i.e., $s = m$. If the retention index equals zero, a character is maximally homoplastic, i.e., $s = g$. (This would mean in addition that the character is parsimony uninformative, i.e., that we cannot use it to make any inferences about the topology of the tree.)
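
Both indices can be checked by hand on a small example. The sketch below assumes a hypothetical matrix of three binary characters for six placeholder taxa and a hypothetical tree. I write m, s, and g for the minimum, observed, and maximum number of steps (labels chosen for this illustration). The hand calculation is then set beside the values returned by the phangorn functions CI() and RI().

```r
library(ape)
library(phangorn)

# Hypothetical data: three binary characters for six taxa
toy_chars <- matrix(c(
  "1","0","1",  # A
  "1","0","0",  # B
  "0","0","0",  # C
  "0","0","0",  # D
  "0","1","1",  # E
  "0","1","0"   # F
), nrow = 6, byrow = TRUE,
   dimnames = list(c("A", "B", "C", "D", "E", "F"), NULL))
toy_data <- phyDat(toy_chars, type = "USER", levels = c("0", "1"))
toy_tree <- read.tree(text = "((A,B),C,(D,(E,F)));")

# Characters 1 and 2 each need one change on this tree; character 3
# (shared by A and E) needs two, so the tree length is 1 + 1 + 2.
m <- 3                              # one change per binary character
s <- parsimony(toy_tree, toy_data)  # observed number of steps
g <- 6                              # rarer state occurs in two taxa per character

ci_by_hand <- m / s
ri_by_hand <- (g - s) / (g - m)

ci <- CI(toy_tree, toy_data)
ri <- RI(toy_tree, toy_data)
```

The third character is the only homoplastic one, so it alone pulls both indices below one.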

Here are the consistency and retention indices for the trees optimized with nearest neighbor interchange:
