## Abstract

The paper investigates the task of inferring a phylogenetic tree of languages from the collection of word lists made available by the *Automated Similarity Judgment Project*. This task involves three steps: (1) computing pairwise word distances, (2) aggregating word distances to a distance measure between languages and inferring a phylogenetic tree from these distances, and (3) evaluating the result by comparing it to expert classifications. For the first task, weighted alignment will be used, and a method to determine weights empirically will be presented. For the second task, a novel method will be developed that attempts to minimize the bias resulting from missing data. For the third task, several methods from the literature will be applied to a large collection of language samples to enable statistical testing. It will be shown that the language distance measure proposed here leads to substantially more accurate phylogenies than a method relying on unweighted Levenshtein distances between words.

## 1. Introduction

Recent years have seen the introduction of many proposals to use phylogenetic inference techniques from bioinformatics in order to extract information about genetic relations from languages. There are essentially two basic approaches being currently employed. *Character-based* methods start by defining a set of features—*characters*—to classify languages. The feature values are assumed to be inert in language change. Therefore the number of shared feature values between two languages can be taken as a measure of their relatedness. Methods such as Maximum Parsimony, Maximum Likelihood, and Bayesian phylogenetic inference take a classification of languages according to a list of features as input and produce a phylogenetic tree including changes in feature values along the branches as output (see, for instance, Felsenstein, 2004, for a comprehensive overview). Suitable features may be cognate classes of basic vocabulary items (as used by, for example, Gray and Atkinson, 2003, and Bouckaert et al., 2012) or grammatical features (Dunn et al., 2005, and subsequent work).

The second approach uses *distance-based* techniques of phylogenetic inference. These methods start from a matrix of pairwise distances between languages that ideally correspond to the time that has passed since the split of the latest common ancestor of the two languages compared along the two lineages leading to those languages. Phylogenetic inference produces a tree where the path length between two leaf nodes is as close as possible to their pairwise distance. Such methods are suitable when dealing with raw data that are not organized in a feature matrix, such as lists of non-cognate-coded basic vocabulary items.

Extracting phylogenetic information from word lists usually proceeds in three steps (see, for instance, Downey et al., 2008 or Holman et al., 2008): (a) the similarity/distance between words from different languages is determined using some kind of alignment algorithm, (b) these word distances are aggregated to pairwise distances between languages, and (c) a phylogenetic tree is inferred. As for the final step, the *Neighbor Joining* algorithm (Saitou and Nei, 1987) has emerged as the *de facto* standard.

The quality of a phylogeny thus inferred can be assessed by comparing it to expert classifications. How such a comparison is to be performed is an active area of investigation; see Wichmann et al. (2010), Greenhill (2011), Huff and Lonsdale (2011), Pompei et al. (2011) for some recent contributions.

In this study I will propose three innovations pertaining to this research program:

a similarity score between words that is computed via weighted alignment, including a procedure to obtain the required weights in a data-driven way,

a novel method to aggregate word similarity scores into distances between languages, and

a generalization of existing methods for evaluating the quality of distance measures between word lists, using expert classifications as gold standard.

The study is carried out using version 15 of the *Automated Similarity Judgment Project* (ASJP) database (Wichmann et al., 2012), a collection of Swadesh lists for more than 5,800 languages^{1} which are phonetically transcribed in a uniform way. Only the 40 most stable Swadesh concepts are used in this paper. After excluding artificial languages, creoles, extinct and reconstructed languages, 5,481 word lists were kept in the database. Attested loan words are not excluded. Diacritics in the phonetic transcriptions are ignored.

As a baseline for comparison, I use the method described in Holman et al. (2008) to compute language distances from ASJP word lists.

The structure of the paper is as follows. Section 2 reviews Holman et al.’s proposal. The novel method for aggregating word similarity scores is developed in Section 3. Section 4 discusses the issue of how to evaluate distance measures between languages and presents a comparative evaluation of the different aggregation methods. Section 5 introduces weighted word alignment and presents the procedure to train the required weights with ASJP data. It also provides a thorough empirical comparison of the distance measure obtained in this way with alternative approaches. In Section 6, the method developed here is compared to Kondrak’s (2002) ALINE system. Section 7 contains some final discussion and conclusions.

## 2. State of the Art: The LDND Score

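Holman et al. (2008) propose a method to compute distances between ASJP word lists based on the *edit distances* between individual words. The edit distance or *Levenshtein distance* between two words is the minimal number of insertions, deletions, and substitutions of single symbols needed to transform one word into the other.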

In the example in Fig. 1 (showing the alignment between the English and Latin words for *horn*, spelled according to the ASJP transcription system; the ASJP symbols are explained in the Appendix), this value would be 2. To control for varying word length, Holman et al. normalize this measure by dividing it by the length of the longer word. The resulting normalized Levenshtein distance, *LDN*, takes values between 0 and 1.
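As a minimal sketch of the two quantities just introduced, the following Python functions compute the Levenshtein distance and its length-normalized variant. The function names `levenshtein` and `ldn` are chosen here for illustration and are not taken from Holman et al. (2008). The example words are ASJP transcriptions that appear elsewhere in this paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimal number of single-symbol insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, symbol_a in enumerate(a, start=1):
        current = [i]
        for j, symbol_b in enumerate(b, start=1):
            cost = 0 if symbol_a == symbol_b else 1
            current.append(min(previous[j] + 1,          # delete symbol_a
                               current[j - 1] + 1,       # insert symbol_b
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def ldn(a: str, b: str) -> float:
    """Levenshtein distance divided by the length of the longer word."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(ldn("fiS", "fisk"))  # English/Swedish 'fish'
print(ldn("bon", "ben"))   # English/Swedish 'bone'
```

Note that every ASJP symbol is treated as an atomic character here, so *S* and *s* count as different sounds.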

The normalized Levenshtein distance provides a distance measure between words, with 0 indicating identity and 1 indicating maximal difference. To obtain a distance measure between two word lists, it seems suggestive to simply average over the LDN scores between corresponding words from the languages to be compared. However, if two languages have small and strongly overlapping sound inventories, the number of chance hits is high as compared to a language pair with large and dissimilar sound inventories. On average, the LDN values between unrelated words will be smaller in the former than in the latter case. To control for this effect, the authors propose a method to calibrate the average LDN score between synonymous word pairs to the specific language pair to be compared.

This is best illustrated with an example. Table 1 shows the pairwise LDN scores for some English and Swedish vocabulary items from ASJP.

The average of the values along the diagonal—i.e. between words with identical meanings—for the full matrix is

Holman et al. define the *LDND* score of two languages (Levenshtein Distance Normalized and Divided) as the mean LDN score along the diagonal, divided by the mean LDN score off the diagonal. For the comparison of English and Swedish, this amounts to

## 3. Quantifying the Evidence for Genetic Relatedness of Languages

### 3.1. The Evidence for Relatedness

As spelled out in the previous section, the LDND score aggregates distances between words to distances between languages (i.e., word lists over a given concept list) $L_1$ and $L_2$ by (a) computing the distances between all word pairs from $L_1$ and $L_2$, (b) computing the average distance between synonymous words and the average distance between non-synonymous words, and (c) dividing the former by the latter.
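The three steps can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the implementation of Holman et al. (2008): each word list is assumed to be a plain Python list with exactly one word per concept, in the same concept order, and with no missing entries. The names `ldnd`, `english` and `swedish` are hypothetical, and the four-item lists only reuse word pairs mentioned in this paper.

```python
from statistics import mean


def levenshtein(a, b):
    """Plain edit distance with unit costs."""
    previous = list(range(len(b) + 1))
    for i, symbol_a in enumerate(a, start=1):
        current = [i]
        for j, symbol_b in enumerate(b, start=1):
            cost = 0 if symbol_a == symbol_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def ldn(a, b):
    """Edit distance normalized by the length of the longer word."""
    return levenshtein(a, b) / max(len(a), len(b))


def ldnd(words_1, words_2):
    """Mean LDN of synonymous pairs divided by mean LDN of all other pairs.

    words_1[i] and words_2[i] are assumed to express the same concept i.
    """
    diagonal, off_diagonal = [], []
    for i, w1 in enumerate(words_1):
        for j, w2 in enumerate(words_2):
            (diagonal if i == j else off_diagonal).append(ldn(w1, w2))
    return mean(diagonal) / mean(off_diagonal)


# Toy lists: 'fish', 'bone', 'stone', 'see'
english = ["fiS", "bon", "ston", "si"]
swedish = ["fisk", "ben", "sten", "se"]
print(ldnd(english, swedish))
```

With real ASJP data, missing entries and multiple synonyms per concept would have to be handled before the two means are taken.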

In this section I will propose an alternative method for aggregating a matrix of distances between words from two word lists into a distance between the corresponding languages.

To illustrate the underlying intuition, consider again the matrix of LDN scores between English and Swedish words illustrated in Table 1. The distribution of off-diagonal scores is shown in Fig. 2.

**Figure 2**. Off-diagonal LDN scores: English vs. Swedish

Citation: Language Dynamics and Change 3, 2 (2013) ; 10.1163/22105832-13030204

The word pair *fiS/fisk* ‘fish’ has an LDN score of 0.5. Only a small fraction of the off-diagonal entries are as low as or lower than the score for *fiS/fisk*. Intuitively, the fact that this fraction is so small provides evidence that *fiS* and *fisk* (and therefore English and Swedish) are related. Likewise, each of the other diagonal entries provides a certain amount of evidence for the languages to be related, depending on its position within the distribution of off-diagonal entries.

To make this precise, let us assume that

The ASJP data contain missing entries at many positions. To deal with this issue, we assume that there are

The *rank* of a diagonal entry

(It is tacitly assumed that only those pairs

For the time being we assume that there are no ties; this issue will be taken up later on.

As the sizes of word lists may differ between languages, we normalize the rank by dividing it by the maximal possible rank, which is the number of off-diagonal entries

*normalized rank*

If the languages in question are unrelated, the entries along the diagonal are drawn from the same distribution as the off-diagonal entries. Therefore we expect each rank (between

This is illustrated in Fig. 3. The left panel shows the distribution of diagonal entries (left boxplot) and off-diagonal entries (right boxplot) for the comparison of English and Swedish.

It is clearly visible that the diagonal scores are on average much lower than the off-diagonal scores.

The right panel shows the same data for the comparison of English with Swahili. The two languages are unrelated, and the diagonal entries are distributed similarly to the off-diagonal entries.

For a pair of unrelated languages we expect the normalized ranks for diagonal entries to be uniformly distributed between 0 and 1.

**Figure 3**. Distribution of diagonal and off-diagonal LDN scores: English/Swedish and English/Swahili


**Figure 4**. Distribution of normalized ranks from related and from unrelated language pairs


As expected, the normalized ranks for pairs of related languages are heavily skewed towards small values, while the values for unrelated languages approximately follow a uniform distribution.^{2}
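The normalized ranks discussed here can be computed along the following lines. This is a sketch under stated assumptions rather than the exact definition used in this paper: the rank of a diagonal entry is taken to be the number of off-diagonal entries that are not larger than it, missing entries are encoded as `None` and simply skipped, and ties receive no special treatment. The function name `normalized_ranks` and the toy matrix are hypothetical.

```python
def normalized_ranks(matrix):
    """Normalized rank of every diagonal entry among the off-diagonal entries.

    matrix[i][j] is the distance between the word for concept i in the first
    list and the word for concept j in the second list; None marks a missing
    entry.
    """
    n = len(matrix)
    off_diagonal = [matrix[i][j] for i in range(n) for j in range(n)
                    if i != j and matrix[i][j] is not None]
    if not off_diagonal:
        raise ValueError("at least one off-diagonal entry is required")
    ranks = []
    for i in range(n):
        d = matrix[i][i]
        if d is None:
            continue
        # Assumed convention: count the off-diagonal entries not larger than d.
        rank = sum(1 for x in off_diagonal if x <= d)
        ranks.append(rank / len(off_diagonal))
    return ranks


toy = [[0.50, 0.40, 1.00],
       [0.80, 0.75, 0.70],
       [1.00, 0.90, 0.10]]
print(normalized_ranks(toy))
```

For related languages most values returned by such a function should be close to 0, while for unrelated languages they should spread roughly evenly over the unit interval.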

Figure 5 displays the same data as histograms with logarithmic binning in a log-log plot.

**Figure 5**. Distribution of normalized ranks from related languages; log-log scale


The values for the related languages lie approximately on a straight line with a negative slope. This indicates that the normalized ranks are distributed according to a *power law* (see, for instance, Clauset et al., 2009 on power law distributions in empirical data). This means that there are real numbers

We can thus approximate the empirical distribution by a continuous probability density function

The distribution of normalized ranks for unrelated languages can be approximated by a constant density function:

Suppose we have to decide whether or not two languages are related on the basis of normalized ranks of all translation pairs. So we compare two hypotheses:

If we make the simplifying assumption that the normalized ranks for the individual translation pairs are stochastically independent, this amounts to

The posterior log odds thus come out as

While we do not know the prior probabilities

However, this only holds for a constant

Let us call this quantity the *Evidence for Relatedness* (ER).
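The chain of reasoning leading up to this quantity can be sketched as a short derivation. The notation is assumed here for illustration and need not match the symbols of the original equations: $nr_i$ is the normalized rank of the $i$-th translation pair, $n$ the number of translation pairs, $R$ and $\bar{R}$ the hypotheses that the languages are related or unrelated, and $c$ and $\alpha > 0$ the constants of the power law.

```latex
\begin{align*}
p(nr_i \mid R) &\approx c \cdot nr_i^{-\alpha},
\qquad
p(nr_i \mid \bar{R}) \approx 1, \\[4pt]
\log \frac{P(R \mid nr_1, \dots, nr_n)}{P(\bar{R} \mid nr_1, \dots, nr_n)}
 &= \log \frac{P(R)}{P(\bar{R})}
  + \sum_{i=1}^{n} \log \frac{p(nr_i \mid R)}{p(nr_i \mid \bar{R})} \\
 &\approx \log \frac{P(R)}{P(\bar{R})} + n \log c
  - \alpha \sum_{i=1}^{n} \log nr_i .
\end{align*}
```

Under these assumptions, and for a fixed number $n$ of translation pairs, the only term that depends on the data is $-\sum_i \log nr_i$: the smaller the normalized ranks, the higher the posterior log odds in favor of relatedness.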

To turn this into an operational definition, one further amendment needs to be made.

Recall that there may be ties, i.e. pairs

This leads to the following final definition:

Definition 1 (Evidence for Relatedness)

It is reasonable to assume that the Evidence for Relatedness becomes stronger the closer two languages are related. ER can thus be considered a similarity measure between languages. It can easily be transformed into a distance measure. ER is maximized if we compare a word list to itself: it contains no homonymies and no missing entries. In this case, all

The theoretical minimum for the ER score is achieved if all diagonal entries are larger than all off-diagonal entries. In this scenario all

The *Distance based on Evidence of Relatedness* (dER) is then defined as follows:

Definition 2 (Distance based on Evidence of Relatedness)

The dER score always assumes a value between

### 3.2. Correcting for Missing Entries

If the word lists to be compared contain missing entries, the dER measure relies on a maximum likelihood estimate of the

Suppose the languages

^{3}

The ER score is defined as the mean of

^{4}

The sum (and thus the average) of

*Erlang distribution*. However, this distribution can be approximated by a normal distribution (also with mean

Definition 3 (Corrected Evidence for Relatedness)

According to this definition, the mean and variance of the ERC scores for unrelated languages do not depend on

^{5}

Just like the ER score, the ERC score is a similarity measure between languages. It can be turned into a distance measure analogously to Definition 2.

Definition 4 (Distance based on Corrected Evidence of Relatedness)

For ASJP data with

To get an idea for the numerical magnitudes, dERC scores for some language pairs are given in Table 2.

It might seem counter-intuitive that the dERC of English to itself is larger than 0. This reflects the fact that the ASJP-list for English contains one pair of homonyms: both ‘I’ and ‘eye’ are transcribed as *Ei*. Therefore the probability of a chance identity is assessed as positive, and therefore the probability of the two lists being identical despite the languages being unrelated is assessed as positive, if very small.

## 4. Empirical Evaluation

As will become clear later on, the main motivation for developing dERC is that this method of aggregation is also applicable to string distance measures with mathematical properties different from LDN.

A standard way to assess the quality of a distance measure between languages is to relate it to an expert classification. In this paper I will make use of three different expert classifications of languages:^{6}

the two-level classification according to the *World Atlas of Language Structures*, Haspelmath et al. (2008), abbreviated as *WALS* hereafter,

the classification according to *Ethnologue*, Lewis (2009), abbreviated as *Ethn*, and

the classification according to Hammarström (2010), abbreviated as *Hstr*.

I will use three methods to compare a distance matrix to an expert classification:

Triplet distance: This method has been used in Greenhill (2011), and it is closely related to the Goodman-Kruskal Gamma measure used in Wichmann et al. (2010).

A triplet of languages $(A, B, C)$ is *resolved* if and only if the expert tree contains a node that dominates $A$ and $B$ but not $C$. It is correctly classified by the distance measure if and only if $d(A,B) < \min\{d(A,C), d(B,C)\}$. The triplet distance of the distance measure to the expert tree is the proportion of all resolved triplets that are classified incorrectly.^{7}

The triplet distance (TD) measure has the advantage that it only uses comparisons between distances rather than numerical values. It is therefore invariant under all monotonic transformations of the distance measure, including non-linear ones. Also, it does not rely on a phylogenetic algorithm that may introduce its own bias.

Generalized Robinson-Foulds distance: The Robinson-Foulds distance (Robinson and Foulds, 1981) is a standard distance measure between unrooted trees over the same set of leaves. As an illustration, consider the trees in Fig. 6.

The two trees have four and two internal branches, respectively. Each internal branch in an unrooted tree induces a bipartition of the set of leaves. The bipartitions induced by the internal branches on the right are identical to the bipartitions in the tree on the left. Additionally, the tree on the left contains two internal branches that have no counterpart in the tree on the right.

The Robinson-Foulds distance is the number of internal branches in both trees that have no counterpart in the other tree, divided by the total number of internal branches in both trees. In the example, this number is $2/6 = 1/3$.

**Figure 6**. Example trees


However, this number is somewhat misleading. The tree on the left is binary branching, while the one on the right is not. The tree on the left contains all bipartitions that we find in the tree on the right, so the former approximates the information contained in the latter as closely as is possible for a binary branching tree.

This is a standard situation when comparing a tree that has been constructed by a phylogenetic inference algorithm such as Neighbor Joining (which is necessarily binary branching) with an expert tree that is not binary branching. To take this asymmetry into account, I follow Pompei et al. (2011) in using the *generalized Robinson-Foulds distance* (GRF). The GRF of a binary branching tree $T_1$ to another (perhaps non-binary branching) tree $T_2$ is defined as the proportion of internal branches in $T_2$ that do not have a counterpart in $T_1$. In the example, the distance of the first to the second tree comes out as $0$. The GRF is always a number between $0$ and $1$, with $1$ indicating total disagreement and $0$ optimal agreement.

Generalized quartet distance: Another commonly used distance measure between unrooted trees is the *quartet distance* (Estabrook et al., 1985). Given an unrooted tree and four leaves $a$, $b$, $c$ and $d$, the tree induces the *butterfly* $ab|cd$ if and only if one of the bipartitions that is induced by its internal branches separates $\{a, b\}$ from $\{c, d\}$. If there is no internal branch separating the quartet into two pairs, the tree induces a *star* on the quartet of leaves.

Given two unrooted trees over the same set of leaves, their quartet distance is the proportion of quartets over their leaves that have different topologies in the two trees. In the example trees in Fig. 6, we have 7 leaves and therefore $\binom{7}{4} = 35$ quartets. Of these 35 quartets, 16 have different topologies in the two trees, so the quartet distance is $16/35 \approx 0.46$. Similar to the generalized Robinson-Foulds distance defined above, I will follow Pompei et al. (2011) in using a generalized version of the quartet distance that takes the asymmetry between binary branching inferred trees and multiply branching expert trees into account. The *generalized quartet distance* (GQD) between an inferred tree and an expert tree is the proportion of butterflies in the expert tree having a different topology in the inferred tree. For the example in Fig. 6, the fit is perfect, i.e. the GQD equals 0.

The quartet measures are less intuitive than the corresponding Robinson-Foulds measures, but they have the advantage of being more tolerant of small errors. For instance, exchanging two leaves in one of two large trees may have a dramatic effect on the GRF, while the GQD changes only slightly.
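The tree-based measures described above can be sketched compactly once a tree is reduced to the set of bipartitions induced by its internal branches. The following Python code is an illustration under stated assumptions: unrooted trees are encoded as nested tuples of leaf names, the function names are hypothetical, and the two seven-leaf trees are made up for this sketch (they mimic the situation described for Fig. 6, a binary tree with four internal branches versus a less resolved tree with two, but they are not the trees of that figure).

```python
from itertools import combinations


def leaves(tree):
    """Set of leaf names below a node (a leaf is a string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Bipartitions of the leaf set induced by the internal branches."""
    all_leaves = leaves(tree)
    splits = set()

    def visit(node, is_root):
        if isinstance(node, str):
            return
        clade = leaves(node)
        rest = all_leaves - clade
        if not is_root and len(clade) >= 2 and len(rest) >= 2:
            splits.add(frozenset([clade, rest]))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return splits


def rf(tree_1, tree_2):
    """Unmatched internal branches divided by all internal branches."""
    b1, b2 = bipartitions(tree_1), bipartitions(tree_2)
    return len(b1 ^ b2) / (len(b1) + len(b2))


def grf(inferred, expert):
    """Proportion of expert branches without a counterpart in the inferred tree."""
    b_inf, b_exp = bipartitions(inferred), bipartitions(expert)
    return len(b_exp - b_inf) / len(b_exp) if b_exp else 0.0


def quartet_topology(splits, quartet):
    """Butterfly induced on the quartet, or None for a star."""
    for split in splits:
        side = quartet & tuple(split)[0]
        if len(side) == 2:
            return frozenset([side, quartet - side])
    return None


def gqd(inferred, expert):
    """Proportion of expert butterflies with a different inferred topology."""
    s_inf, s_exp = bipartitions(inferred), bipartitions(expert)
    butterflies = different = 0
    for names in combinations(sorted(leaves(expert)), 4):
        quartet = frozenset(names)
        expected = quartet_topology(s_exp, quartet)
        if expected is None:
            continue  # stars in the expert tree are ignored
        butterflies += 1
        if quartet_topology(s_inf, quartet) != expected:
            different += 1
    return different / butterflies if butterflies else 0.0


binary_tree = ((("a", "b"), "c"), "d", ("e", ("f", "g")))
expert_tree = (("a", "b"), "c", "d", "e", ("f", "g"))
print(rf(binary_tree, expert_tree), grf(binary_tree, expert_tree),
      gqd(binary_tree, expert_tree))
```

For these toy trees the plain Robinson-Foulds distance is positive even though the binary tree contains every bipartition of the less resolved tree, which is exactly the asymmetry that the generalized measures are meant to remove.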

In the following I will compare the three distance measures between languages discussed so far: LDND, dER and dERC. Let us first look at the triplet distances to the three expert classifications mentioned above, WALS, Ethnologue, and Hammarström (2010). The comparison was performed with the full

For all three expert classifications, we find a slight improvement both from LDND to dER and again from dER to dERC, even though the differences are quite small.
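The triplet distance used for this comparison can be sketched as follows. The code is a minimal illustration under stated assumptions: the expert classification is encoded as a tuple of taxonomic labels per language (here family and genus), the correctness criterion is taken to be that the pair sharing the deeper expert node has the strictly smallest distance, and the distance values are invented for the example. The names `shared_depth` and `triplet_distance` are hypothetical.

```python
from itertools import combinations


def shared_depth(path_a, path_b):
    """Number of leading taxonomic levels that two classification paths share."""
    depth = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        depth += 1
    return depth


def triplet_distance(dist, taxonomy):
    """Proportion of resolved triplets that the distance measure gets wrong.

    dist maps an alphabetically sorted pair of language names to a distance;
    taxonomy maps a language name to its path in the expert classification.
    """
    def d(a, b):
        return dist[tuple(sorted((a, b)))]

    resolved = wrong = 0
    for triple in combinations(sorted(taxonomy), 3):
        for c in triple:
            a, b = [x for x in triple if x != c]
            inner = shared_depth(taxonomy[a], taxonomy[b])
            outer = max(shared_depth(taxonomy[a], taxonomy[c]),
                        shared_depth(taxonomy[b], taxonomy[c]))
            if inner > outer:  # an expert node dominates a and b but not c
                resolved += 1
                if not d(a, b) < min(d(a, c), d(b, c)):
                    wrong += 1
    return wrong / resolved if resolved else 0.0


taxonomy = {
    "English": ("Indo-European", "Germanic"),
    "Swedish": ("Indo-European", "Germanic"),
    "Italian": ("Indo-European", "Romance"),
    "Swahili": ("Niger-Congo", "Bantoid"),
}
# Invented distances, for illustration only
dist = {
    ("English", "Swedish"): 0.40, ("English", "Italian"): 0.70,
    ("Italian", "Swedish"): 0.75, ("English", "Swahili"): 0.95,
    ("Italian", "Swahili"): 0.90, ("Swahili", "Swedish"): 0.97,
}
print(triplet_distance(dist, taxonomy))
```

Because only comparisons between distances enter the computation, applying any strictly increasing transformation to all values in `dist` leaves the result unchanged.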

To compute the GRF, for each of the three pairwise distance matrices a phylogenetic tree is computed via the Neighbor Joining algorithm, and those are compared to the three expert classifications both via GRF and via GQD. The results are shown in Table 4.

**Table 4**. Generalized Robinson-Foulds distances and generalized quartet distances for LDND, dER and dERC

These figures seem to indicate that LDND performs best according to WALS and Ethn, while dER comes out better for Hstr. These numbers are arguably misleading, however. The GRF relies on the Neighbor Joining tree, which is quite sensitive to the properties of the specific data set. This can be illustrated with the following small experiment. Ten mutually disjoint subsets of ASJP were drawn, each containing 275 word lists. For each of these subsets, the Neighbor Joining trees for LDND, dER and dERC were computed and compared to the WALS classification according to GRF and GQD. The results are shown in Table 5.

Both the numerical values of GRF and GQD and the relative ordering of the three distance measures differ widely. For instance, LDND leads to the lowest GRF value five times, dER three times, and dERC five times.

To detect the quality of different distance measures despite the noisiness of phylogenetic inference, I drew 1,000 random samples from the 5,000+ ASJP word lists, each containing 500 word lists, and averaged over the various tree distance measures to the expert classifications.^{8} The results are given in Table 6 and the distributions are visualized as box plots in Fig. 7.

**Figure 7**. Generalized Robinson-Foulds and quartet distances for 1,000 random samples


From these data we can conclude that dERC gives slightly better results than dER for all evaluations, so the correction for missing entries does have a positive effect. The comparison between LDND and dERC is equivocal. On average, dERC is slightly better for GRF and slightly worse for GQD.

As the 1,000 samples used here are not stochastically independent,^{9} it is not possible to perform meaningful statistical tests, so it is impossible to say whether there are significant quality differences between LDND and dERC. In any event, the differences are very small.

## 5. Weighted String Alignment

### 5.1. The Method

Levenshtein alignment only distinguishes between identical and non-identical sounds. To achieve a better approximation of the etymologically correct alignment of cognate words, a graded notion of similarity between sounds seems more appropriate. The ALINE system from Kondrak (2002), for instance, uses a sophisticated hand-crafted notion of segment similarity that draws on insights from phonology. ALINE will be discussed in more detail in the next section. The present section reviews an alternative approach that has been used in previous work in bioinformatics (see, for instance, Durbin et al., 1989) and computational dialectometry (cf., among others, Wieling et al., 2009).

The approach emerges from considerations similar to those used in the previous section in the derivation of the ER measure.^{10} Suppose we want to compare two strings

The prior odds

We start with the (unrealistic) assumption that in case

Let

If

Of course the assumptions made here—stochastic independence of positions within a sequence and of evolutionary changes at different positions within a sequence—are wildly unrealistic. Nevertheless this null model leads to workable results, as we will see later on.

Putting the pieces together, we have

In the bioinformatics tradition, this quantity is called the *log odds* score. In computational linguistics, it is also known under the name of *Pointwise Mutual Information* (PMI, see Church and Hanks, 1990).^{11} I will follow the latter terminology here. Some notation:

We now turn to the issue of insertions and deletions. Suppose we evaluate a specific hypothesis about the historical relation between the two strings, including insertions and deletions. Such a hypothesis can be represented as an *alignment* between the sequences including gaps. As an example, consider the German and Swedish words for *star*, *Stern/stjärna*, which are *StErn/SEnE* in the ASJP transcription. The etymologically correct alignment is

The gap symbol “-” represents a position where either a segment has been deleted or a segment has been added in the other language.

*aligned*strings. In the example,

Following standard practice in bioinformatics, I assume that there is a uniform PMI score for gaps, regardless of the segment the gap is matched with:

The constant

*gap penalty*.

^{12}

However, both in biological evolution and in language change, insertions and deletions frequently operate on contiguous chunks of segments. For instance, in language comparison we frequently find *partial cognates*, i.e. word pairs in which one item is morphologically complex (or is etymologically derived from a morphologically complex word) and the other word is cognate to just one morpheme of the first word. Consider the Latin and Italian words for *mountain*, *mons/montagna*, transcribed as *mons/monta5a* in ASJP. The Italian word is probably derived from the Latin *montaneus* ‘mountainous,’ a denominal adjectivization of *mons*. So the correct alignment is

The three gaps at the end of the upper sequence are the reflex of a single historical process, i.e. suffixation plus semantic change.

Since gaps frequently come in chunks, the penalty for a gap in

*affine gap penalties*. There are two positive constants

The same applies *mutatis mutandis* to gaps within

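With these provisos, the PMI score of an alignment of two strings is obtained by adding up the PMI scores of its aligned segment pairs and applying the gap penalties to its gaps. The *Needleman-Wunsch algorithm* (Needleman and Wunsch, 1970) is a simple generalization of the Levenshtein alignment algorithm that, for a given substitution matrix and given gap penalties, finds an alignment that maximizes this score.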
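A global alignment with affine gap penalties can be scored with the standard three-table dynamic program (often attributed to Gotoh). The sketch below is a generic illustration, not the implementation used for this paper. Its assumptions are: the first gap of a contiguous run costs `gap_open`, every further gap of the same run costs `gap_extend`, a run of gaps in one string directly followed by a run in the other pays the opening penalty again, and the substitution score is passed in as a function. The toy scoring function stands in for a PMI matrix.

```python
import math


def affine_alignment_score(x, y, score, gap_open, gap_extend):
    """Best global alignment score of x and y under affine gap penalties."""
    neg = -math.inf
    n, m = len(x), len(y)
    # match[i][j]: best score of x[:i], y[:j] with x[i-1] aligned to y[j-1]
    # gap_y[i][j]: best score with x[i-1] aligned to a gap
    # gap_x[i][j]: best score with y[j-1] aligned to a gap
    match = [[neg] * (m + 1) for _ in range(n + 1)]
    gap_y = [[neg] * (m + 1) for _ in range(n + 1)]
    gap_x = [[neg] * (m + 1) for _ in range(n + 1)]
    match[0][0] = 0.0
    for i in range(1, n + 1):
        gap_y[i][0] = -gap_open - (i - 1) * gap_extend
    for j in range(1, m + 1):
        gap_x[0][j] = -gap_open - (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match[i][j] = score(x[i - 1], y[j - 1]) + max(
                match[i - 1][j - 1], gap_y[i - 1][j - 1], gap_x[i - 1][j - 1])
            gap_y[i][j] = max(match[i - 1][j] - gap_open,
                              gap_x[i - 1][j] - gap_open,
                              gap_y[i - 1][j] - gap_extend)
            gap_x[i][j] = max(match[i][j - 1] - gap_open,
                              gap_y[i][j - 1] - gap_open,
                              gap_x[i][j - 1] - gap_extend)
    return max(match[n][m], gap_y[n][m], gap_x[n][m])


def toy_score(a, b):
    """Stand-in for a PMI matrix: +1 for identical symbols, -1 otherwise."""
    return 1.0 if a == b else -1.0


# Latin/Italian 'mountain' in ASJP transcription
print(affine_alignment_score("mons", "monta5a", toy_score, 2.0, 1.0))
```

With an opening penalty larger than the extension penalty, the three trailing gaps of the *mons/monta5a* example are charged as one long gap rather than as three independent events, which is the behavior motivated above.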

### 5.2. Parameter Estimation

To reliably estimate the PMI scores for all segment pairs, one would ideally need a very large corpus of correctly aligned sequence pairs. In bioinformatics such databases do indeed exist, and several carefully crafted substitution matrices for different domains have been constructed (see Durbin et al., 1989 for details). In dialectometric work (such as Wieling et al., 2012), such data are fairly easy to obtain because dialectometric data are organized in cognate sets, and the linguistically correct alignment between cognate words from different dialects of the same language can be reliably constructed with automatic means.

When dealing with cross-linguistic data from a wide variety of languages such as the ASJP data, the situation is more difficult. Sizeable amounts of expert cognacy judgments only exist for a small number of language families (mainly for Indo-European based on the pioneering work of Dyen et al., 1992, and for Austronesian, see Greenhill et al., 2008). Also, and more importantly, the ultimate goal of this entire enterprise is to do language classification automatically. Therefore, information about language family affiliation should not be utilized for parameter training to avoid circularity.

In the following, a heuristic method is described to extract a large corpus of probable cognate pairs from the ASJP word lists, which can be used for parameter training. The method only relies on the word lists themselves; no additional information about cognacy relations or the genetic affiliation of the languages involved is being used.

To avoid the pitfall of overtraining, I split the ASJP database into two sets of about equal size, the *training set* and the *test set*. For training purposes, I use only the former. The resulting model will then be tested against the latter.

To make sure the two sets are really independent (or at least to approximate this ideal as far as possible with cross-linguistic data), the two sets were constructed in such a way that each WALS family belongs completely either to the training set or to the test set.^{13} To be more specific, the set of WALS families was placed in a random order, and languages and language families were added to the training set in this order as long as its size did not exceed half the size of the entire database. The remaining families constitute the test set. The training set contains 2,723 and the test set 2,758 word lists. The lists of language families in the two sets are provided in the Online Supporting Material.
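A family-wise split of this kind can be sketched as follows. The code is an illustration under assumptions that go beyond the description above: families are shuffled with a seeded pseudo-random generator, and the procedure stops at the first family that would push the training set beyond half of the database (rather than skipping it and trying smaller families). The function name and the family sizes are hypothetical.

```python
import random


def split_by_family(family_sizes, seed=0):
    """Assign whole families to a training set and a test set.

    family_sizes maps a family name to its number of word lists.
    Returns (training_families, test_families).
    """
    total = sum(family_sizes.values())
    order = sorted(family_sizes)
    random.Random(seed).shuffle(order)
    training, size = [], 0
    for position, family in enumerate(order):
        if size + family_sizes[family] > total / 2:
            # Assumption: stop at the first family that would overshoot.
            return training, order[position:]
        training.append(family)
        size += family_sizes[family]
    return training, []


# Hypothetical family sizes, for illustration only
sizes = {"A": 5, "B": 3, "C": 8, "D": 2, "E": 2}
train, test = split_by_family(sizes, seed=1)
print(train, test)
```

Since no family is ever divided, closely related word lists cannot end up on both sides of the split.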

Relying on the training set only, I used the following procedure for constructing a sufficiently large corpus of probable cognate pairs, which can be used for parameter training:

All language pairs that have a dERC distance below a given threshold are considered to be *probably related*.

For a pair of probably related languages $L_1$ and $L_2$ and a concept $c$, all entries for $c$ in the ASJP list for $L_1$ and in the ASJP list for $L_2$ are considered. The pair of words with the lowest LDN score is considered a *potential cognate*.

All pairs of potential cognates are then aligned with the Levenshtein algorithm. If there are multiple optimal alignments, only one of them is considered.^{14}

This yields a set of aligned sequence pairs, from which the PMI score is estimated for each pair of segment types.

Assuming certain values for the gap penalties (more on this later), in the next step the set of potential cognate pairs is aligned with the Needleman-Wunsch algorithm, using the estimated parameters.

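Additionally, I assume a threshold for the PMI score of an aligned pair: only potential cognate pairs whose PMI score reaches this threshold are considered *probable cognates*. The set of aligned probable cognates is then used to re-estimate the PMI scores in the way described above.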

The re-estimation of parameters is repeated 10 times. Experience shows that the estimated parameter values do not change substantially anymore after that.
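The estimation step inside this loop can be sketched with relative frequencies. The code below is a simplified illustration and not necessarily the estimator used in this paper: correspondences are counted symmetrically, columns containing a gap are skipped, segment frequencies are taken from the non-gap columns only, and no smoothing is applied, so unseen segment pairs receive no score at all. The function name is hypothetical, and the gapped alignment of *fiS/fisk* is an assumption made for the example.

```python
import math
from collections import Counter


def estimate_pmi(aligned_pairs, gap="-"):
    """PMI scores for all segment pairs observed in a set of alignments.

    aligned_pairs is an iterable of (upper, lower) strings of equal length.
    """
    pair_counts, segment_counts = Counter(), Counter()
    for upper, lower in aligned_pairs:
        for a, b in zip(upper, lower):
            if a == gap or b == gap:
                continue  # gaps are handled by separate gap penalties
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1  # treat correspondences as symmetric
            segment_counts[a] += 1
            segment_counts[b] += 1
    total_pairs = sum(pair_counts.values())
    total_segments = sum(segment_counts.values())
    pmi = {}
    for (a, b), count in pair_counts.items():
        joint = count / total_pairs
        chance = (segment_counts[a] / total_segments) * \
                 (segment_counts[b] / total_segments)
        pmi[(a, b)] = math.log(joint / chance)
    return pmi


# English/Swedish 'bone', 'stone', 'fish'
alignments = [("bon", "ben"), ("ston", "sten"), ("fiS-", "fisk")]
scores = estimate_pmi(alignments)
print(scores[("o", "e")], scores[("n", "n")])
```

In an iterative scheme like the one described above, the resulting scores would be fed back into the weighted alignment step, and the counts would then be recomputed from the new alignments.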

The appropriate choice for the meta-parameters

It can be seen that the distribution is dominated by a bell-shaped curve with the maximum at about

As pointed out above, there is no straightforward way to estimate gap penalties from a training corpus. Appropriate values for

The training procedure supplies a PMI matrix for a given vector

**Figure 8**. Distribution of dERC scores in the training set


As a heuristic to assess the quality of a parameter configuration, I sampled 1,000 pairs of probably related languages, i.e. languages with a dERC

As even a single evaluation step is computationally quite expensive, advanced methods of optimization such as simulated annealing proved to be impractical. Therefore I performed a simple downhill Nelder-Mead style optimization (cf. Nelder and Mead, 1965), starting from several manually chosen initial positions. The lowest value of the target function was achieved with

^{15}

For a selection of sounds, the optimal PMI scores thus derived are shown in Table 7. (The full matrix is provided in the Online Supporting Material.) Not surprisingly, the entries along the diagonal are all positive, i.e. alignment of two identical elements provides the strongest evidence for relatedness. Additionally, we find positive PMI scores for several sound pairs that are known to be frequently historically related via sound shifts, such as *p/b*, *d/t*, *d/8* (where the ASJP symbol *8* represents voiceless and voiced dental fricatives; cf. Table 15) and *s/h*. The latter case is especially interesting because the two sounds are articulatorily dissimilar, but the sound shift from *s* to *h* is known to be quite common (see, for instance, Ferguson, 1990).

**Figure 9**. PMI scores: hierarchical clustering


Figure 9 displays a hierarchical clustering of the ASJP sound symbols according to their PMI scores.^{16} We find a primary split between vowels and consonants. The consonants are further divided into three large groups, which largely correspond to the dental, labial, and velar/uvular sounds. The only exception to this pattern according to place of articulation is the position of *h* and *x* (the voiceless and voiced velar fricatives), which are clustered together with the *s*-sounds within the larger cluster of dental sounds. This is probably a reflex of the already mentioned diachronic cline from *s* to *h*.

Following the example of Wieling et al. (2012)—who obtained PMI scores essentially in the same way but using data from different dialects of the same language—I performed non-metric multidimensional scaling with the PMI scores among the vowels. The result is displayed in Fig. 10. We find that the articulatory vowel triangle is reproduced to a good approximation, with the schwa (ASJP symbol *3*) in the center.

Brown et al. (2013) also use the ASJP data to estimate the probability of different sound correspondences across the languages of the world. Their method is quite different from the one developed here, so a comparison of the results provides a certain validity check.

**Figure 10**. Vowel PMI scores: multidimensional scaling


The authors use a highly conservative heuristic to identify regular sound correspondences. According to this method, a pair of languages exhibits a regular correspondence between two sounds if both languages belong to the same genus, and there are at least two concepts such that the ASJP entry for the concept from one language can be transformed into its translation in the other language by replacing all occurrences of the first sound by the second (and vice versa).

To use the running example of the English/Swedish comparison again, there are only two regular correspondences that can be detected from the 40-item word lists: *o-e* (*bon*/*ben* ‘bone’ and *ston*/*sten* ‘stone’); and *i-e* (*liv3r*/*lev3r* ‘liver’ and *si*/*se* ‘see’).
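The heuristic summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Brown et al.'s implementation: the function name `regular_correspondences` and the parameter `min_concepts` are hypothetical, each character is treated as one ASJP segment, the genus check is assumed to have been done beforehand, and "vice versa" is read as requiring the reverse replacement to recover the original word.

```python
from collections import defaultdict


def regular_correspondences(list1, list2, min_concepts=2):
    """Return sound pairs (x, y) supported by at least `min_concepts` concepts.

    list1, list2: dicts mapping a concept to its ASJP transcription in two
    languages that are assumed to belong to the same genus.
    """
    support = defaultdict(list)
    for concept in sorted(set(list1) & set(list2)):
        w1, w2 = list1[concept], list2[concept]
        if len(w1) != len(w2) or w1 == w2:
            continue
        # Collect the distinct mismatching segment pairs; exactly one is allowed.
        diffs = {(a, b) for a, b in zip(w1, w2) if a != b}
        if len(diffs) != 1:
            continue
        (x, y), = diffs
        # Replacing ALL occurrences must yield the translation, in both directions.
        if w1.replace(x, y) == w2 and w2.replace(y, x) == w1:
            support[(x, y)].append(concept)
    return {pair: cs for pair, cs in support.items() if len(cs) >= min_concepts}


# The four word pairs quoted in the text (first member / second member).
first = {"bone": "bon", "stone": "ston", "liver": "liv3r", "see": "si"}
second = {"bone": "ben", "stone": "sten", "liver": "lev3r", "see": "se"}
found = regular_correspondences(first, second)
```

On these four word pairs, `found` contains exactly the two correspondences named in the text, *o-e* and *i-e*, each supported by two concepts.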

A certain genus is *available* for a correspondence

The *PG* score (“percentage of available genera”) of a correspondence is the relative frequency (expressed in percent) of genera exhibiting the correspondence at least once among all genera that are available for that correspondence.

Figure 11 plots the PG scores of all consonant pairs that have a positive PG score in the supporting online material from Brown et al. (2013) against their corresponding PMI score.

**Figure 11**. PG scores vs. PMI scores


The

### 5.3. Aggregation

In Table 1 the LDN scores for several English/Swedish word pairs were given. Table 8 gives the corresponding PMI scores.

It can be seen that the PMI notion of string similarity is more fine-grained than LDN. For instance, while both *du/yu* and *vi/wi* receive a positive score (i.e. they are more likely to be related than not), the absolute value for the latter is much higher. This reflects the fact that a correspondence between *v* and *w* is more likely than one between *d* and *y*. The pair *fisk/fiS* has an even higher PMI score because (a) the words are longer than *vi/wi*, i.e. the evidence they provide is stronger, and (b) the correspondence *s/S* is very likely. This is counterbalanced only by a single gap penalty.

The distribution of PMI scores for the English/Swedish comparison on the diagonal and off the diagonal is shown in the left panel of Fig. 12. The right panel shows the same data for the comparison English/Swahili.

In comparison to the corresponding plots for LDN, the PMI values are much more spread out. Otherwise, we find a qualitatively similar pattern (setting aside the inessential difference that LDN is a distance and PMI a similarity measure). For a pair of related languages, the diagonal entries are mostly much higher than the off-diagonal entries, while both collections appear to be drawn from the same distribution for a pair of unrelated languages.

The *normalized ranks* of PMI scores are now computed according to the definition given in Section 3, with LDN scores replaced by PMI scores and

Therefore the theoretical justification for the dERC-style aggregation of normalized ranks to a distance measure between languages also applies to PMI scores.

**Figure 12**. Distribution of diagonal and off-diagonal PMI scores: English/Swedish and English/Swahili


**Figure 13**. Distribution of normalized PMI ranks from related languages; log-log scale


Table 9 compares the dERC/LDN scores and dERC/PMI scores for the language pairs from Table 2.

These numbers convey the impression that the dERC/PMI scores for related languages are generally lower than the corresponding dERC/LDN scores, while the scores for unrelated languages are randomly distributed around

### 5.4. Empirical Evaluation

The methods described in Section 4 to compare different distance measures will now be used to evaluate the quality of dERC/PMI against LDND and the LDN-based version of dERC (dERC/LDN). Only the word lists from the test set will be used for this comparison.

The triplet distances to the three expert classifications are given in Table 10 and visualized in Fig. 14.

**Figure 14**. Triplet distances for LDN, dERC/LDN and dERC/PMI


We find a slight improvement from LDND to dERC/LDN and a more substantial improvement to dERC/PMI.^{17}

From the distance matrices for the test set for LDND, dERC/LDN and dERC/PMI, the corresponding phylogenetic trees were computed with Neighbor Joining. The generalized Robinson-Foulds distances and quartet distances are given in Table 11.

**Table 11**. Generalized Robinson-Foulds distances and generalized quartet distances for LDND, dERC/LDN and dERC/PMI

The results are not decisive, with dERC/PMI giving the lowest GRF scores and dERC/LDN the lowest GQD scores. However, as discussed in Section 4, evaluating different Neighbor Joining trees for a single data set can be highly misleading. Therefore the same procedure as above is applied here: 1,000 random samples of word lists from the test set, each comprising 500 doculects, are generated, Neighbor Joining trees for LDND, dERC/LDN and dERC/PMI are computed, and all three trees are compared to the three expert trees regarding both GRF and GQD. The results are depicted in Fig. 15 and the mean values are given in Table 12.
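The sampling protocol described above can be sketched as follows. This is a minimal illustration, not the code used for the reported results: `mean_score_over_samples` is a hypothetical name, `score_fn` is a placeholder for the whole pipeline (distance matrix, Neighbor Joining tree, GRF or GQD against one expert tree), and drawing each sample without replacement is an assumption.

```python
import random
from statistics import mean


def mean_score_over_samples(doculects, score_fn, n_samples=1000,
                            sample_size=500, seed=0):
    """Average an evaluation score over random subsets of the doculects.

    score_fn: hypothetical callable that takes a list of doculects and
    returns one number, e.g. the GQD of the inferred tree to an expert tree.
    """
    rng = random.Random(seed)  # fixed seed so the samples are reproducible
    scores = []
    for _ in range(n_samples):
        # Each sample is drawn without replacement (an assumption).
        sample = rng.sample(doculects, sample_size)
        scores.append(score_fn(sample))
    return mean(scores), scores


# Toy usage with stand-in data: 20 "doculects" and a dummy score function
# that counts the distinct doculects in each sample.
toy_mean, toy_scores = mean_score_over_samples(
    list(range(20)), lambda s: len(set(s)), n_samples=5, sample_size=10)
```

Evaluating all competing distance measures on the same list of samples (same seed) keeps the comparison paired, which is what makes statistical testing across measures straightforward.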

**Figure 15**. Evaluation results: distribution of GRF and GQD for 1,000 random samples of 500 languages each


The mean values for the 1,000 samples display a pattern similar to that of the triplet distances: LDND and dERC/LDN perform about equally well (with a slight advantage for the former), while dERC/PMI leads to lower distance scores. As the aggregation method for dERC/LDN and dERC/PMI is identical, we can conclude that the PMI-based method of measuring string similarities leads to better phylogenetic inference than (normalized) Levenshtein distance.

As a further test, I performed a version of *cross-validation*.^{18} In general,

Cross-validation requires the individual subsets to be independent from each other. As discussed above, obtaining mutually independent subsamples of a cross-linguistic database such as ASJP that are representative of the data set as a whole is a non-trivial issue. As an approximation, I performed 4-fold cross-validation, where the subsets correspond to the four continental areas Africa (including all Afro-Asiatic languages), Eurasia, the Indo-Pacific region (including Australia), and America.

For each continental area

In 11 out of 12 cases, dERC/PMI provides the best results (the exception being the Hstr classification for America, where LDND is slightly better). The general pattern for Africa, Eurasia and the Indo-Pacific is similar to the test set above: LDND and dERC/LDN are about equally good, while dERC/PMI is about

The average correlation of the PMI matrices obtained during cross-validation with the PMI matrix obtained from the training set is

### 5.5. Discussion

A possible objection against the general approach developed here concerns the risk of circularity. As an anonymous reviewer points out, it might be problematic to perform automatic language classification on the basis of parameters that are trained with data from a database “which was […] obtained through some other type of (manual) analysis.” Let us therefore carefully review what kind of information goes into the training procedure and what kind of information we get out of it.

The construction of the training corpus of word pairs relied on guessing a value for

dERC/LDN scores are determined on the basis of pairwise LDN scores for words from the word lists to be compared. No further information about the genetic affiliation of the languages involved is being used here, and LDN scores are obtained from Levenshtein distances, a general-purpose string comparison method that does not rely on any specifically linguistic information.

**Figure 16**. Triplet distances for LDN, dERC/LDN and dERC/PMI: continental areas


Once the training corpus is constructed, initial PMI scores are estimated using Levenshtein alignment. In subsequent steps, Needleman-Wunsch alignment is performed ten times, each time using the PMI score estimates from the previous step. For given values of

*unsupervised learning*.

The test procedures in turn use the aggregate distance between word lists thus obtained to do phylogenetic inference and to compare the results to expert classifications. (Triplet distance relies on classifying triplets of languages, so this also involves a kind of phylogenetic inference.) So the information that is obtained from the parametrized model, namely language classification, is of an entirely different nature than the information that went into it, namely word lists.

Another potential objection concerns the fact that the overall gain in accuracy—about

Both for the triplet distance and for GQD, the baseline of completely randomly distributed distances is not

but because there are only three rooted binary trees for a triplet and three butterflies for a quartet of languages. Even very crude distance measures achieve a much higher accuracy than suggested by these baselines. To illustrate this point, I defined such a crude measure: for each word list, the vector of relative frequencies of occurrence of sounds is computed. The cosine similarity between two languages is then defined as the cosine of the angle between these vectors, and the *cosine distance* is the difference between 1 and the cosine similarity score. So this distance measure only quantifies how much the frequency patterns of unigrams differ between word lists, without any reference to the meaning of the words. The Neighbor Joining tree derived from these distances for the entire ASJP database already achieves a GQD of to WALS, to Ethn and to Hstr.

The practically achievable minimum GQD (and likewise for triplet distance and GRF) is arguably somewhat above 0. First, the expert classifications contain controversial units (such as Altaic, Australian, Niger-Congo and Trans-New Guinea in WALS), which may partially be wrong. In this case it would not be a defect of an automatic classification if those units are not detected. Second, the 40-item Swadesh lists arguably do not always contain the information that human experts would need to establish a genetic relationship between a group of languages.
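The crude cosine distance described above is simple enough to sketch directly. The following is an illustrative reading of that description, not the code behind the reported figures; the function names are hypothetical, and each character of an ASJP transcription is treated as one sound.

```python
import math
from collections import Counter


def sound_frequencies(words):
    """Relative frequency of each sound symbol in a word list."""
    counts = Counter(symbol for word in words for symbol in word)
    total = sum(counts.values())
    return {symbol: count / total for symbol, count in counts.items()}


def cosine_distance(words1, words2):
    """1 minus the cosine of the angle between the two frequency vectors."""
    f1, f2 = sound_frequencies(words1), sound_frequencies(words2)
    dot = sum(value * f2.get(symbol, 0.0) for symbol, value in f1.items())
    norm1 = math.sqrt(sum(v * v for v in f1.values()))
    norm2 = math.sqrt(sum(v * v for v in f2.values()))
    return 1.0 - dot / (norm1 * norm2)


# Usage with the four word pairs quoted earlier in this section.
d = cosine_distance(["bon", "ston", "liv3r", "si"],
                    ["ben", "sten", "lev3r", "se"])
```

Because the measure ignores which word expresses which concept, shuffling either word list leaves the distance unchanged, which is exactly why it serves as a crude baseline.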

To make a rough guess, the maximum GQD (achievable by a simple-minded distance measure such as the cosine distance) for a given data set may be around 35%–40% and the minimum GQD that can possibly be attained by automatic methods from 40-item Swadesh lists may be around 3%. Each gain in accuracy of a certain percentage thus actually amounts to a much higher proportion (by a factor of about 3) of this range.
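The arithmetic behind this factor can be spelled out using only the guessed bounds above; the symbols $G_{\max}$, $G_{\min}$ and $\Delta$ are introduced here purely for illustration. With $G_{\max}$ between 35% and 40% and $G_{\min} \approx 3\%$, the usable range spans about 32 to 37 percentage points, so a gain of $\Delta$ percentage points covers

$$\frac{\Delta}{G_{\max} - G_{\min}} \approx \frac{\Delta}{37} \text{ to } \frac{\Delta}{32}$$

of that range, i.e. roughly 2.7 to 3.1 percent of the range for every percentage point gained.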

## 6. Comparison to ALINE

The PMI scores for word similarities used here are obtained via weighted string alignment. There have been several proposals in the literature on computational historical linguistics and computational dialectometry to employ weighted alignment for this purpose. Some of them use empirically determined log-odds scores as weights like the present proposal (cf. Wieling et al., 2012), while others (see, for instance, Covington, 1996; Somers, 1998; Heeringa, 2004, among others) assume linguistically motivated hand-crafted substitution weights for segment pairs. The most sophisticated approach along the latter lines is perhaps the ALINE system by Kondrak (2002). A detailed discussion of ALINE would go beyond the scope of this article, so I will just mention the essential features.

In ALINE, each sound is represented by a vector of phonetic features, such as *syllabic*, *back*, *place* etc. These features have real numbers as values. The similarity between two segments is computed from their differences in feature values, weighted by the salience of these features.

Additionally, ALINE captures *compressions* and *expansions*, i.e. alignments of a single segment in one word with two adjacent segments in the other word. Kondrak uses the cognate pair Latin *factum*/Spanish *hecho* ‘fact’ to illustrate this point. In the etymologically correct alignment, the Spanish affricate [ʧ] should be matched with the [t] and the [k] in the Latin word simultaneously. ALINE defines weights for aligning a single sound with a consecutive sequence of two sounds as well.

The present proposal uses the Needleman-Wunsch algorithm for string alignment. This algorithm finds the optimal *global* alignment, i.e. an alignment of the full sequences. ALINE uses *half-global* alignment instead. This means that in both strings to be compared, final subsequences can be ignored if this leads to a better alignment score. Half-global alignment is motivated by the observation that the right periphery of words is especially unstable in language change.
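The difference between global and half-global alignment can be illustrated with a small dynamic-programming sketch. It rests on several simplifying assumptions: the scoring function `toy_sub` and the linear gap penalty are hypothetical stand-ins (the present proposal uses PMI weights with affine gap penalties, and ALINE uses feature-based weights), and half-global alignment is read here as "the best alignment of any prefix of one word with any prefix of the other". It is not a reimplementation of either system.

```python
def alignment_scores(a, b, sub, gap=-1.0):
    """Return (global score, half-global score) for sequences a and b.

    sub: function giving the similarity score of two segments.
    gap: score added for each gap position (linear, not affine).
    """
    n, m = len(a), len(b)
    # D[i][j] = best score for aligning a[:i] with b[:j] in full.
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(D[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                          D[i - 1][j] + gap,
                          D[i][j - 1] + gap)
    global_score = D[n][m]  # both sequences must be consumed completely
    # Half-global: trailing material in either sequence may be dropped for free.
    half_global_score = max(max(row) for row in D)
    return global_score, half_global_score


def toy_sub(x, y):
    """Hypothetical weights: reward identity, penalize any mismatch."""
    return 2.0 if x == y else -1.0


scores = alignment_scores("fisk", "fiS", toy_sub)
```

With these toy weights, the pair *fisk/fiS* receives a global score of 2.0 but a half-global score of 4.0, because the half-global variant stops after the shared onset *fi* instead of paying for the mismatch and the gap at the right edge.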

In Huff (2010) and Huff and Lonsdale (2011), the system PyAline is described, a freely available Python implementation of ALINE that includes substitution scores of ASJP sound classes. PyAline also contains an implementation of Downey et al.’s (2008) method to aggregate ALINE alignment scores to distances between languages. This facilitates a comparison with the distance measures defined here. In Huff and Lonsdale (2011) such a comparison with LDND is discussed. The authors conclude that both measures perform about equally well in phylogenetic inference.

Downey et al.’s aggregation method differs in two essential ways from ERC. First, word similarities are *normalized*. Given alignment scores (which are similarity scores), the normalized ALINE distance between two words

Second, Downey et al. (2008) define the distance between two languages as the average normalized ALINE distance between translation pairs. This amounts to taking the average of the diagonal in the matrix of individual word distances, while the off-diagonal entries are not taken into account. Let us call this distance measure

These differences in detail make a comparison to dERC difficult, because it has to be factored out whether possible differences in performance are due to the different alignment weights, the different alignment algorithm, the normalization step or the difference in the aggregation scheme. As an additional complication, PyAline’s alignment algorithm is implemented in plain Python, which makes it comparatively slow. There are highly efficient Python libraries for the Needleman-Wunsch algorithm used for the computation of PMI scores, which makes the computation of a pairwise dERC/PMI distance matrix for several thousands of word lists feasible. For PyAline this is not realistic.^{19}

For these reasons, I will defer a detailed comparison of the present proposal with ALINE to another occasion and only report the results of a small pilot study here that could be carried out with moderate computational effort.

From the test set, 10,000 triplets were sampled that are resolved according to WALS. They were used to estimate the triplet distance to WALS for (a) LDND, (b) dERC/PMI, (c)

The results are given in Table 14 and displayed in Fig. 17.

**Figure 17**. Estimated triplet distances


The estimates for LDND and dERC/PMI are

The results indicate that

With the proviso that these results are still preliminary, they seem to suggest (a) that weighted alignment improves the accuracy of phylogenetic inference in comparison to plain Levenshtein-style alignment, and (b) that empirically determined PMI scores are superior to hand-crafted weighting schemes.

## 7. Conclusion

This paper aims at making three contributions to the current discussion in the field of computational historical linguistics: (1) it argues for the usage of weighted alignment using empirically obtained weights for determining word distances, (2) it proposes a novel method to aggregate word similarities/distances to distances between languages, and (3) it presents several protocols for evaluating automatically generated phylogenies that extend existing proposals.

The results from the previous sections show that weighted alignment improves the accuracy of language distance measures when compared to Levenshtein distance methods. The method used here—the Needleman-Wunsch algorithm using log-odds scores and affine gap penalties—was developed in the context of bioinformatics and is justified by the properties of biomolecular evolution. The model assumptions that underlie its mathematical foundations are actually not met in the case of sound change. It rests on the simplifying assumptions that mutations at different positions are stochastically independent and that mutation probabilities are constant across lineages. The latter assumption, especially, is highly problematic when applied to sound change since specific sound changes are known to be historically contingent events that apply to the entire lexicon of a language. Therefore a more adequate model would have to use a different substitution matrix for each pair of related languages, which captures the history of sound changes along the two lineages from the latest common ancestor. It is in principle possible to obtain these substitution matrices empirically, but this would arguably require much larger word lists than the commonly used Swadesh lists.

Also, work on automatic cognate recognition (see, for instance, List, 2012) has shown that the quality of word alignments improves considerably if *multiple sequence alignment* is used. It is to be expected that language distance measures using multiple alignments will also lead to more accurate phylogenetic inference. An additional advantage of using multiple sequence alignments is that they can be used for character-based methods, which are known to be more accurate than distance-based methods.

A further direction that may lead to higher accuracy is the usage of resampling methods such as bootstrapping and jackknifing, which can be used at various points in the inference process. In this paper, individual word alignment scores were calibrated by comparing them to the distribution of alignment scores across all non-synonymous word pairs from the two languages to be compared. Sampling a large number of these scores with replacement will arguably lead to a more accurate estimate of this distribution. Furthermore, sampling 40 Swadesh concepts with replacement a large number (

^{20}

Regarding the evaluation described in the previous section, the main innovation presented here is the use of a large collection of random samples of languages to assess the quality of a distance measure. According to my own experience, results obtained in this way are much more robust and informative than evaluation results for a single collection of languages.

## Acknowledgments

This research was supported by the ERC Advanced Grant 324246 *Language Evolution: The Empirical Turn* (EVOLAEMP).

The work being described in this article benefited considerably from discussions with Johann-Mattis List, Taraka Rama, Søren Wichmann and Martijn Wieling, which is gratefully acknowledged. Kate Bellamy, Michael Dunn, Eric Holman, Søren Wichmann and three anonymous reviewers from LDC pointed out various mistakes in a previous version of this article. Thanks also to Thomas Zastrow for setting up the hardware which made this work possible.

## Software Used

All word alignments and distance measure computations were performed using (Numeric) Python. Levenshtein alignment and Needleman-Wunsch alignment were done using the *Levenshtein* package and the *pairwise2* module of the *Biopython* package (Cock et al., 2009; http://biopython.org) respectively.

For the Neighbor Joining algorithm, Joseph Felsenstein’s *Phylip* package (Felsenstein, 1989; http://evolution.genetics.washington.edu/phylip/) was used. Quartet fits were computed with Christian Pedersen’s *qdist* package (http://birc.au.dk/software/qdist/). Thanks to its author and to Thomas Mailund for their help in finding and installing this software.

For manipulating and visualizing phylogenetic trees as well as for computing Robinson-Foulds distances, the Python toolkit *ETE* (http://ete.cgenomics.org/) and Daniel Huson’s *Dendroscope* software (http://ab.inf.uni-tuebingen.de/software/dendroscope/) proved highly useful.

## Online Supporting Material

Descriptions of the training set and the test set, as well as the PMI scores obtained in the way described in Subsection 5.2, are contained in an online document that can be downloaded from http://www.sfs.uni-tuebingen.de/~gjaeger/publications/ldcBenchmarkingSI.pdf, and from http://dx.doi.org./10.1163/22105832-13030204; booksandjournals.brillonline.com/content/22105832/3/2 (click on tab Supplements).

## References

Bouckaert, Remco, Philippe Lemey, Michael Dunn, Simon J. Greenhill, Alexander V. Alekseyenko, Alexei J. Drummond, Russell D. Gray, Marc A. Suchard, and Quentin D. Atkinson. 2012. Mapping the origins and expansion of the Indo-European language family. *Science* 337: 957–960.

Brown, Cecil H., Eric Holman, and Søren Wichmann. 2013. Sound correspondences in the world’s languages. *Language* 89: 4–29.

Church, Kenneth Ward and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. *Computational Linguistics* 16: 22–29.

Clauset, Aaron, Cosma Rohilla Shalizi, and Mark E.J. Newman. 2009. Power-law distributions in empirical data. *SIAM Review* 51: 661–703.

Cock, Peter J.A., Tiago Antao, Jeffrey T. Chang, Brad A. Chapman, Cymon J. Cox, Andrew Dalke, Iddo Friedberg, Thomas Hamelryck, Frank Kauff, Bartek Wilczynski, and Michiel J.L. de Hoon. 2009. Biopython: Freely available Python tools for computational molecular biology and bioinformatics. *Bioinformatics* 25: 1422–1423 (doi: 10.1093/bioinformatics/btp163).

Covington, Michael A. 1996. An algorithm to align words for historical comparison. *Computational Linguistics* 22: 481–496.

Downey, Sean S., Brian Hallmark, Murray P. Cox, Peter Norquest, and J. Stephen Lansing. 2008. Computational feature-sensitive reconstruction of language relationships: Developing the ALINE distance for comparative historical linguistic reconstruction. *Journal of Quantitative Linguistics* 15: 340–369.

Dunn, Michael, Angela Terrill, Ger Reesink, Robert A. Foley, and Stephen C. Levinson. 2005. Structural phylogenetics and the reconstruction of ancient language history. *Science* 309: 2072–2075.

Durbin, Richard, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. 1989. *Biological Sequence Analysis*. Cambridge, UK: Cambridge University Press.

Dyen, Isidore, Joseph B. Kruskal, and Paul Black. 1992. An Indoeuropean classification: A lexicostatistical experiment. *Transactions of the American Philosophical Society* 82: 1–132.

Estabrook, George F., F.R. McMorris, and Christopher A. Meacham. 1985. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. *Systematic Biology* 34: 193–200.

Felsenstein, Joseph. 1989. Phylip-Phylogeny Inference Package (Version 3.2). *Cladistics* 5: 164–166.

Felsenstein, Joseph. 2004. *Inferring Phylogenies*. Sunderland: Sinauer Inc. Publishers.

Ferguson, Charles A. 1990. From esses to aitches: Identifying pathways of diachronic change. In William A. Croft, Suzanne Kemmer, and Keith Denning (eds.), *Studies in Typology and Diachrony: Papers Presented to Joseph H. Greenberg on His 75th Birthday*, 59–78. Philadelphia: John Benjamins.

Gray, Russell D. and Quentin D. Atkinson. 2003. Language-tree divergence times support the Anatolian theory of Indo-European origin. *Nature* 426: 435–439.

Greenhill, Simon J. 2011. Levenshtein distances fail to identify language relationships accurately. *Computational Linguistics* 37: 689–698.

Greenhill, Simon J., Robert Blust, and Russell D. Gray. 2008. The Austronesian Basic Vocabulary Database: From bioinformatics to lexomics. *Evolutionary Bioinformatics* 4: 271–283.

Hammarström, Harald. 2010. A full-scale test of the language farming dispersal hypothesis. *Diachronica* 27: 197–213.

Haspelmath, Martin, Matthew S. Dryer, David Gil, and Bernard Comrie. 2008. The World Atlas of Language Structures online. Munich: Max Planck Digital Library. http://wals.info/.

Heeringa, Wilbert Jan. 2004. *Measuring Dialect Pronunciation Difference Using Levenshtein Distance*. PhD dissertation, University of Groningen.

Holman, Eric W., Søren Wichmann, Cecil H. Brown, Viveka Velupillai, André Müller, and Dik Bakker. 2008. Advances in automated language classification. In Antti Arppe, Kaius Sinnemäki, and Urpo Nikanne (eds.), *Quantitative Investigations in Theoretical Linguistics*, 40–43. Helsinki: University of Helsinki.

Huff, Paul. 2010. *PyAline: Automatically Growing Language Family Trees Using the ALINE Distance*. PhD dissertation, Brigham Young University.

Huff, Paul and Deryle Lonsdale. 2011. Positing language relationships using ALINE. *Language Dynamics and Change* 1: 128–162.

Kondrak, Grzegorz. 2002. *Algorithms for Language Reconstruction*. PhD dissertation, University of Toronto.

Lewis, M. Paul (ed.). 2009. *Ethnologue: Languages of the World*. 16th ed. Dallas, TX: SIL International. http://www.ethnologue.com.

List, Johann-Mattis. 2012. *Sequence Comparison in Historical Linguistics*. PhD dissertation, University of Düsseldorf.

Needleman, Saul B. and Christian D. Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. *Journal of Molecular Biology* 48: 443–453.

Nelder, John A. and Roger Mead. 1965. A simplex method for function minimization. *The Computer Journal* 7: 308–313.

Pompei, Simone, Vittorio Loreto, and Francesca Tria. 2011. On the accuracy of language trees. *PLoS ONE* 6: e20109.

Robinson, David F. and Leslie R. Foulds. 1981. Comparison of phylogenetic trees. *Mathematical Biosciences* 53: 131–147.

Saitou, Naruya and Masatoshi Nei. 1987. The neighbor-joining method: A new method for reconstructing phylogenetic trees. *Molecular Biology and Evolution* 4: 406–425.

Somers, Harold L. 1998. Similarity metrics for aligning children’s articulation data. In *Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics*, vol. 2, 1227–1232. Montreal: Association for Computational Linguistics.

Ward, Joe H., Jr. 1963. Hierarchical grouping to optimize an objective function. *Journal of the American Statistical Association* 58: 236–244.

Wichmann, Søren, Eric W. Holman, Dik Bakker, and Cecil H. Brown. 2010. Evaluating linguistic distance measures. *Physica A: Statistical Mechanics and Its Applications* 389: 3632–3639.

Wichmann, Søren, André Müller, Viveka Velupillai, Annkathrin Wett, Cecil H. Brown, Zarina Molochieva, Julia Bischoffberger, Eric W. Holman, Sebastian Sauppe, Pamela Brown, Dik Bakker, Johann-Mattis List, Dmitry Egorov, Oleg Belyaev, Matthias Urban, Harald Hammarström, Agustina Carrizo, Robert Mailhammer, Helen Geyer, David Beck, Evgenia Korovina, Pattie Epps, Pilar Valenzuela, and Anthony Grant. 2012. The ASJP Database (version 15). Downloadable at http://email.eva.mpg.de/ wichmann/listss15.zip (accessed November 6, 2013).

Wieling, Martijn, Eliza Margaretha, and John Nerbonne. 2012. Inducing a measure of phonetic similarity from pronunciation variation. *Journal of Phonetics* 40: 307–314.

Wieling, Martijn, Jelena Prokić, and John Nerbonne. 2009. Evaluating the pairwise string alignment of pronunciations. In *Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education*, 26–34. Athens: Association for Computational Linguistics.

## Appendix: ASJP Transcription Code

Tables 15 and 16 contain the description of the ASJP code (quoted verbatim from Brown et al., 2013).

As Søren Wichmann (p.c.) points out, *doculects* would be a more precise term, as the database comprises languages, dialects, and reconstructed word lists of protolanguages. For simplicity’s sake, however, I will use the terms *language* and *doculect* synonymously throughout this paper.

The empirical distribution is not entirely uniform; it also has a slight bias towards smaller values. This may be due to long-distance relationships between languages from different families, to similarities arising from language contact, or to universal biases in the sound-meaning relationship such as onomatopoeia. These effects are small in comparison to the bias for related languages, however, so the uniform distribution is still a good approximation.

This is an instance of a more general rule. If

If

The general form of the exponential distribution is

This follows from elementary laws of probability theory, i.e. the facts that

The reader may excuse my eclectic usage of both Bayesian and frequentist arguments in this section.

All three classifications are provided as metadata in the ASJP database.

The triplet distance is only informative if the languages to be compared exist at the same point in time, i.e. if any two related languages have the same time depth from their common ancestor. If this condition is not met, the triplet distance might be misleading. For instance, it might very well be that Old English and Gothic are closer to each other than Old English is to modern Dutch. Nevertheless the correct classification places Old English and modern Dutch in one group (the West Germanic languages) and Gothic in another one, namely East Germanic. This problem could be avoided by evaluating quartets instead of triplets and inducing an unrooted tree. I refrain from doing so here because the number of quartets over a set of languages exceeds the number of triplets by a factor in the order of magnitude of the number of languages. For large data sets, the triplet distance, but not the quartet distance, can still be computed with realistic computational effort. To avoid the problem of difference in time depth, in this article I only use data from languages that are either currently alive or recently extinct.

The choice of exactly 1,000 samples containing exactly 500 languages each is arbitrary. The criterion for choosing these numbers was that the number of samples should be sufficiently large to be able to detect trends, and that each sample should not be too small, but small enough to make 1,000 iterations computationally feasible.

It might seem natural to evaluate the different distance measures for the individual language families and to average the results, because different language families are our best approximation of independent samples when it comes to cross-linguistic data. This protocol has been followed, for instance, by Pompei et al. (2011). Such a procedure strikes me as misleading, though, because it only assesses how well the *internal* classifications of language families are recoverable based on the different distance measures. However, it is equally important to take into account how well the competing measures separate different language families. My somewhat pessimistic conclusion is that it is not possible to create a sufficient number of samples from cross-linguistic data that are both independent from each other and representative of the population as a whole.

The following discussion draws heavily on Durbin et al. (1989).

The PMI score is defined in terms of the binary rather than the natural logarithm. This difference is inessential, however, because it amounts to a constant factor.
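As a brief check of this remark, the change of base is a standard logarithm identity (with $x$ standing for the ratio inside the PMI logarithm):

$$\log_2 x = \frac{\ln x}{\ln 2} \approx 1.4427 \,\ln x$$

PMI scores computed with either base therefore differ only by this positive constant factor, which leaves their signs and their relative ranking unchanged.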

This was suggested to me by Eric Holman (p.c.).

To be precise: the implementation of the Levenshtein alignment algorithm I used (the Python package *Levenshtein*) only outputs one alignment, even if there are others that are equally good.
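To illustrate why only one of several equally good alignments may be reported, here is a minimal pure-Python sketch of unweighted Levenshtein alignment with a single backtrace. It is not the code of the *Levenshtein* package; the function name and the tie-breaking order (substitution or match first, then deletion, then insertion) are assumptions made for this illustration.

```python
def levenshtein_alignment(a, b):
    """Return (distance, alignment) for strings a and b.

    The alignment is a list of (x, y) pairs, where None marks a gap.
    Only one optimal alignment is returned, even if several exist.
    """
    n, m = len(a), len(b)
    # d[i][j] holds the edit distance between a[:i] and b[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + cost,  # match or substitution
                          d[i - 1][j] + 1,         # deletion from a
                          d[i][j - 1] + 1)         # insertion from b
    # Backtrace: the fixed order of the checks below resolves ties,
    # so exactly one optimal path is followed.
    alignment = []
    i, j = n, m
    while i > 0 or j > 0:
        cost = 0 if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else 1
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + cost:
            alignment.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            alignment.append((a[i - 1], None))
            i -= 1
        else:
            alignment.append((None, b[j - 1]))
            j -= 1
    alignment.reverse()
    return d[n][m], alignment


distance, alignment = levenshtein_alignment("ab", "ba")
print(distance, alignment)
```

For the pair "ab" and "ba" there are several alignments with distance 2 (two substitutions, or a gap at each end); the backtrace follows only the one favoured by its tie-breaking order.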

The value of the target function at this point is

To perform the clustering, PMI scores were transformed into distances by subtracting them from the maximal PMI score. For the hierarchical clustering, Ward’s method was used; see Ward (1963).
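The transformation from scores to distances can be sketched as follows. The sound pairs and PMI values are invented purely for illustration, and the function name is hypothetical; the clustering step itself is not shown.

```python
def pmi_to_distance(pmi):
    """Turn PMI scores into distances by subtracting each from the maximum.

    pmi maps pairs of sound classes to PMI scores. The pair with the
    highest score receives distance 0; all other distances are positive.
    """
    max_pmi = max(pmi.values())
    return {pair: max_pmi - score for pair, score in pmi.items()}


# Invented toy scores, not values from the article.
toy_pmi = {("p", "b"): 2.5, ("p", "k"): 0.5, ("p", "a"): -1.0}
toy_dist = pmi_to_distance(toy_pmi)
print(toy_dist)
```

The resulting distances could then be passed to any implementation of Ward's method, for example SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"`; whether that particular library was used for the article is not stated in the source.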

It might be surprising that the triplet distances given in Table 3 (which were calculated for the entire ASJP) are in the 20% range, while the values for the test set are in the 10–15% range. This reflects the fact that the task of automatically classifying a given set of word lists has something like an inherent level of difficulty. The low scores for the test set might have something to do with the fact that almost one third of it consists of Austronesian languages. Therefore a substantial proportion of the triplets to be evaluated consists of two Austronesian languages and one non-Austronesian language, and the signal distinguishing Austronesian from the rest of the world’s languages is fairly strong.

This was suggested by an anonymous reviewer.

On the hardware currently at my disposal, computing the distance matrix for the full test set with PyAline would take more than a week.

This kind of bootstrapping is commonly used in character-based phylogenetic inference, including work in historical linguistics such as Gray and Atkinson (2003).

Durbin et al. (1989) give a probabilistic interpretation of gap penalties, according to which a gap penalty is the logarithm of the probability of observing a gap. However, this derivation relies on the tacit assumption that sequences are so long that they can be considered infinite. As words are rather short, this leads to a systematic overestimation of gap penalties. Therefore, gap penalties have no obvious probabilistic interpretation in the context of computational linguistics.