Abstract
The paper investigates the task of inferring a phylogenetic tree of languages from the collection of word lists made available by the Automated Similarity Judgment Project. This task involves three steps: (1) computing pairwise word distances, (2) aggregating word distances to a distance measure between languages and inferring a phylogenetic tree from these distances, and (3) evaluating the result by comparing it to expert classifications. For the first task, weighted alignment will be used, and a method to determine weights empirically will be presented. For the second task, a novel method will be developed that attempts to minimize the bias resulting from missing data. For the third task, several methods from the literature will be applied to a large collection of language samples to enable statistical testing. It will be shown that the language distance measure proposed here leads to substantially more accurate phylogenies than a method relying on unweighted Levenshtein distances between words.
1. Introduction
Recent years have seen the introduction of many proposals to use phylogenetic inference techniques from bioinformatics to extract information about genetic relations from languages. There are essentially two basic approaches currently being employed. Character-based methods start by defining a set of features—characters—to classify languages. The feature values are assumed to be inert in language change. Therefore the number of shared feature values between two languages can be taken as a measure of their relatedness. Methods such as Maximum Parsimony, Maximum Likelihood, and Bayesian phylogenetic inference take a classification of languages according to a list of features as input and produce a phylogenetic tree including changes in feature values along the branches as output (see, for instance, Felsenstein, 2004, for a comprehensive overview). Suitable features may be cognate classes of basic vocabulary items (as used by, for example, Gray and Atkinson, 2003, and Bouckaert et al., 2012) or grammatical features (Dunn et al., 2005, and subsequent work).
The second approach uses distance-based techniques of phylogenetic inference. These methods start from a matrix of pairwise distances between languages that ideally correspond to the time that has passed since the split of the latest common ancestor of the two languages compared along the two lineages leading to those languages. Phylogenetic inference produces a tree where the path length between two leaf nodes is as close as possible to their pairwise distance. Such methods are suitable when dealing with raw data that are not organized in a feature matrix, such as lists of non-cognate-coded basic vocabulary items.
Extracting phylogenetic information from word lists usually proceeds in three steps (see, for instance, Downey et al., 2008 or Holman et al., 2008): (a) the similarity/distance between words from different languages is determined using some kind of alignment algorithm, (b) these word distances are aggregated to pairwise distances between languages, and (c) a phylogenetic tree is inferred. As for the final step, the Neighbor Joining algorithm (Saitou and Nei, 1987) has emerged as the de facto standard.
The quality of a phylogeny thus inferred can be assessed by comparing it to expert classifications. How such a comparison is to be performed is an active area of investigation; see Wichmann et al. (2010), Greenhill (2011), Huff and Lonsdale (2011), Pompei et al. (2011) for some recent contributions.
In this study I will propose three innovations pertaining to this research program:

1. a similarity score between words that is computed via weighted alignment, including a procedure to obtain the required weights in a data-driven way,

2. a novel method to aggregate word similarity scores into distances between languages, and

3. a generalization of existing methods for evaluating the quality of distance measures between word lists, using expert classifications as gold standard.
The study is carried out using version 15 of the Automated Similarity Judgment Project (ASJP) database (Wichmann et al., 2012), a collection of Swadesh lists for more than 5,800 languages^{1} which are phonetically transcribed in a uniform way. Only the 40 most stable Swadesh concepts are used in this paper. After excluding artificial languages, creoles, extinct and reconstructed languages, 5,481 word lists were kept in the database. Attested loan words are not excluded. Diacritics in the phonetic transcriptions are ignored.
As a baseline for comparison, I use the method described in Holman et al. (2008) to compute language distances from ASJP word lists.
The structure of the paper is as follows. Section 2 reviews Holman et al.’s proposal. The novel method for aggregating word similarity scores is developed in Section 3. Section 4 discusses the issue of how to evaluate distance measures between languages and presents a comparative evaluation of the different aggregation methods. Section 5 introduces weighted word alignment and presents the procedure to train the required weights with ASJP data. It also provides a thorough empirical comparison of the distance measure obtained in this way with alternative approaches. In Section 6, the method developed here is compared to Kondrak’s (2002) ALINE system. Section 7 contains some final discussion and conclusions.
2. State of the Art: The LDND Score
Holman et al. (2008) propose a method to compute distances between ASJP word lists based on the edit distances between individual words. The edit distance or Levenshtein distance between two words is the minimal number of insertions, deletions, and substitutions of a single symbol needed to transform one word into the other.
In the example in Fig. 1 (showing the alignment between the English and Latin words for horn, spelled according to the ASJP transcription system; the ASJP symbols are explained in the Appendix), this value would be 2. To control for varying word length, Holman et al. normalize this measure by dividing it by the length of the longest word. In the example, this amounts to 2/5 = 0.4.
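In code, the Levenshtein distance and its normalized variant (LDN) can be computed as follows. The transcriptions `horn`/`kornu` are assumed to match the example in Fig. 1, which is not reproduced here:

```python
def levenshtein(a, b):
    """Minimal number of insertions, deletions, and substitutions
    needed to transform word a into word b (dynamic programming)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # (mis)match
    return d[m][n]

def ldn(a, b):
    """Levenshtein distance normalized by the length of the longer word."""
    return levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("horn", "kornu"))  # 2
print(ldn("horn", "kornu"))          # 0.4
```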
The normalized Levenshtein distance provides a distance measure between words, with 0 indicating identity and 1 indicating maximal difference. To obtain a distance measure between two word lists, a natural first idea is to simply average the LDN scores between corresponding words from the languages to be compared. However, if two languages have small and strongly overlapping sound inventories, the number of chance hits is high as compared to a language pair with large and dissimilar sound inventories. On average, the LDN values between unrelated words will be smaller in the former than in the latter case. To control for this effect, the authors propose a method to calibrate the average LDN score between synonymous word pairs to the specific language pair to be compared.
This is best illustrated with an example. Table 1 shows the pairwise LDN scores for some English and Swedish vocabulary items from ASJP.
LDN scores English/Swedish
The average of the values along the diagonal—i.e. between words with identical meanings—for the full matrix is
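The LDND calibration described in this section can be condensed into a short Python sketch: the mean LDN over synonymous (diagonal) pairs is divided by the mean LDN over non-synonymous (off-diagonal) pairs. The four-item word lists below are invented for illustration; real ASJP lists contain up to 40 items:

```python
def levenshtein(a, b):
    # compact dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def ldn(a, b):
    return levenshtein(a, b) / max(len(a), len(b))

def ldnd(list1, list2):
    """LDND: mean LDN over synonymous pairs divided by mean LDN over
    non-synonymous pairs (calibration for chance similarity)."""
    n = len(list1)
    diag = [ldn(list1[i], list2[i]) for i in range(n)]
    off = [ldn(list1[i], list2[j])
           for i in range(n) for j in range(n) if i != j]
    return (sum(diag) / len(diag)) / (sum(off) / len(off))

# hypothetical 4-item word lists for two related languages
eng = ["fiS", "hand", "ston", "bon"]
swe = ["fisk", "hand", "sten", "ben"]
print(ldnd(eng, swe))  # well below 1: synonyms are closer than chance
```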
3. Quantifying the Evidence for Genetic Relatedness of Languages
3.1. The Evidence for Relatedness
As spelled out in the previous section, the LDND score aggregates distances between words into distances between languages (i.e., word lists over a given concept list) by

1. computing the distances between all word pairs from the two lists,

2. computing the average distance between synonymous and the average distance between non-synonymous words, and

3. dividing the former by the latter.
In this section I will propose an alternative method for aggregating a matrix of distances between the words from two languages into a distance between these languages.
To illustrate the underlying intuition, consider again the matrix of LDN scores between English and Swedish words illustrated in Table 1. The distribution of off-diagonal scores is shown in Fig. 2.
The word pair fiS/fisk ‘fish’ has an LDN score of 0.5.
To make this precise, let us assume that
The ASJP data contain missing entries at many positions. To deal with this issue, we assume that there are
The rank of a diagonal entry
(It is tacitly assumed that only those pairs
For the time being we assume that there are no ties; this issue will be taken up later on.
As the sizes of word lists may differ between languages, we normalize the rank by dividing it by the maximal possible rank, which is the number of off-diagonal entries
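One plausible reading of this rank computation, sketched in Python. The 3×3 matrix of LDN scores is invented; ranks are counted zero-based here, and ties are ignored, as in the text:

```python
def normalized_ranks(matrix):
    """For each diagonal entry, count how many off-diagonal entries of
    the same matrix are smaller, and normalize by the number of
    off-diagonal entries.  Ties are ignored in this sketch."""
    n = len(matrix)
    off = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    ranks = []
    for i in range(n):
        d = matrix[i][i]
        smaller = sum(1 for x in off if x < d)  # zero-based rank
        ranks.append(smaller / len(off))        # normalize
    return ranks

# toy 3x3 LDN matrix: small diagonal values hint at relatedness
m = [[0.10, 0.90, 0.80],
     [0.70, 0.75, 0.95],
     [0.85, 0.90, 0.15]]
print(normalized_ranks(m))
```

For related languages most diagonal entries beat all off-diagonal ones, so their normalized ranks pile up near 0; for unrelated languages they scatter over the whole unit interval.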
If the languages in question are unrelated, the entries along the diagonal are drawn from the same distribution as the off-diagonal entries. Therefore we expect each rank (between
This is illustrated in Fig. 3. The left panel shows the distribution of diagonal entries (left boxplot) and off-diagonal entries (right boxplot) for the comparison of English and Swedish.
It is clearly visible that the diagonal scores are on average much lower than the off-diagonal scores.
The right panel shows the same data for the comparison of English with Swahili. The two languages are unrelated, and the diagonal entries are similarly distributed as the off-diagonal entries.
For a pair of unrelated languages we expect the normalized ranks for diagonal entries to be uniformly distributed between 0 and 1.
As expected, the normalized ranks for pairs of related languages are heavily skewed towards small values, while the values for unrelated languages approximately follow a uniform distribution.^{2}
Figure 5 displays the same data as histograms with logarithmic binning in a log-log plot.
The values for the related languages lie approximately on a straight line with a negative slope. This indicates that the normalized ranks are distributed according to a power law (see, for instance, Clauset et al., 2009 on power law distributions in empirical data). This means that there are real numbers C and γ such that the probability density of the normalized ranks is approximately proportional to x^{-γ}.
We can thus approximate the empirical distribution by a continuous probability density function
The distribution of normalized ranks for unrelated languages can be approximated by a constant density function that equals 1 on the unit interval.
Suppose we have to decide whether or not two languages are related on the basis of normalized ranks of all translation pairs. So we compare two hypotheses:
If we make the simplifying assumption that the normalized ranks for the individual translation pairs are stochastically independent, this amounts to
The posterior log odds thus come out as
While we do not know the prior probabilities
However, this only holds for a constant
Let us call this quantity the Evidence for Relatedness (ER).
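To make the comparison of the two hypotheses concrete, here is a small illustrative computation. The densities are assumptions made for the sake of the example, not the formulas of this paper: ranks of related languages are taken to follow the power-law density f(x) = α·x^{α−1} with 0 < α < 1 (skewed toward small ranks), ranks of unrelated languages the uniform density g(x) = 1, and α = 0.5 is an arbitrary choice:

```python
import math

def summed_log_odds(ranks, alpha=0.5):
    """Summed log odds for 'related' (assumed power-law density
    alpha * x**(alpha - 1)) versus 'unrelated' (uniform density 1).
    An illustrative stand-in for the ER formula; alpha and the
    densities are assumptions, not the paper's trained values."""
    return sum(math.log(alpha * r ** (alpha - 1)) for r in ranks)

related = [0.01, 0.05, 0.02, 0.1]    # ranks skewed toward 0
unrelated = [0.4, 0.7, 0.55, 0.9]    # roughly uniform ranks
print(summed_log_odds(related))      # positive: favors relatedness
print(summed_log_odds(unrelated))    # negative: favors chance
```

Positive summed log odds favor the relatedness hypothesis; in the paper the quantity is furthermore averaged over translation pairs and adjusted for ties and missing entries.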
To turn this into an operational definition, one further amendment needs to be made.
Recall that there may be ties, i.e. pairs of entries with identical scores.
This leads to the following final definition:
Definition 1 (Evidence for Relatedness)
It is reasonable to assume that the Evidence for Relatedness becomes stronger the closer two languages are related. ER can thus be considered a similarity measure between languages. It can easily be transformed into a distance measure. ER is maximized if we compare a word list to itself: it contains no homonymies and no missing entries. In this case, all
The theoretical minimum for the ER score is achieved if all diagonal entries are smaller than all off-diagonal entries. In this scenario all
Definition 2 (Distance based on Evidence of Relatedness)
The dER score always assumes a value between
3.2. Correcting for Missing Entries
If the word lists to be compared contain missing entries, the dER measure relies on a maximum likelihood estimate of the
Suppose the languages
The ER score is defined as the mean of
The sum (and thus the average) of
Definition 3 (Corrected Evidence for Relatedness)
According to this definition, the mean and variance of the ERC scores for unrelated languages do not depend on
Just like the ER score, the ERC score is a similarity measure between languages. It can be turned into a distance measure analogously to Definition 2.
Definition 4 (Distance based on Corrected Evidence of Relatedness)
For ASJP data with
To get an idea for the numerical magnitudes, dERC scores for some language pairs are given in Table 2.
dERC scores
It might seem counterintuitive that the dERC of English to itself is larger than 0. This reflects the fact that the ASJP list for English contains one pair of homonyms: both ‘I’ and ‘eye’ are transcribed as Ei. Therefore the probability of a chance identity is assessed as positive, and consequently the probability of the two lists being identical despite the languages being unrelated is assessed as positive, if very small.
4. Empirical Evaluation
As will become clear later on, the main motivation for developing dERC is that this method of aggregation is also applicable to string distance measures with mathematical properties different from LDN.
A standard way to assess the quality of a distance measure between languages is to relate it to an expert classification. In this paper I will make use of three different expert classifications of languages:^{6}

1. the two-level classification according to the World Atlas of Language Structures, Haspelmath et al. (2008), abbreviated as WALS hereafter,

2. the classification according to Ethnologue, Lewis (2009), abbreviated as Ethn, and

3. the classification according to Hammarström (2010), abbreviated as Hstr.
I will use three methods to compare a distance matrix to an expert classification:

Triplet distance: This method has been used in Greenhill (2011), and it is closely related to the Goodman-Kruskal Gamma measure used in Wichmann et al. (2010).

A triplet of languages {l1, l2, l3} is resolved if and only if the expert tree contains a node that dominates l1 and l2 but not l3. It is correctly classified by the distance measure d if and only if d(l1, l2) < d(l1, l3) and d(l1, l2) < d(l2, l3). The triplet distance of the distance measure to the expert tree is the proportion of all resolved triplets that are classified incorrectly.^{7}
The triplet distance (TD) measure has the advantage that it only uses comparisons between distances rather than numerical values. It is therefore invariant under all monotonic transformations of the distance measure, including nonlinear ones. Also, it does not rely on a phylogenetic algorithm that may introduce its own bias.

Generalized Robinson-Foulds distance: The Robinson-Foulds distance (Robinson and Foulds, 1981) is a standard distance measure between unrooted trees over the same set of leaves. As an illustration, consider the trees in Fig. 6.

The two trees have four and two internal branches, respectively. Each internal branch in an unrooted tree induces a bipartition of the set of leaves. The bipartitions induced by the internal branches on the right are identical to the bipartitions in the tree on the left. Additionally, the tree on the left contains two internal branches that have no counterpart in the tree on the right.

The Robinson-Foulds distance is the number of internal branches in both trees that have no counterpart in the other tree, divided by the total number of internal branches in both trees. In the example, this number is 2/6 = 1/3.

However, this number is somewhat misleading. The tree on the left is binary branching, while the one on the right is not. The tree on the left contains all bipartitions that we find in the tree on the right, so the former approximates the information contained in the latter as closely as is possible for a binary branching tree.

This is a standard situation when comparing a tree that has been constructed by a phylogenetic inference algorithm such as Neighbor Joining—which is necessarily binary branching—with an expert tree that is not binary branching. To take this asymmetry into account, I follow Pompei et al. (2011) in using the generalized Robinson-Foulds distance (GRF). The GRF of a binary branching tree T1 to another (perhaps non-binary branching) tree T2 is defined as the proportion of internal branches in T2 that do not have a counterpart in T1. In the example, the distance of the first to the second tree comes out as 0. The GRF is always a number between 0 and 1, with 1 indicating total disagreement and 0 optimal agreement.
Generalized quartet distance: Another commonly used distance measure between unrooted trees is the quartet distance (Estabrook et al., 1985). Given an unrooted tree and four leaves a, b, c, and d, the tree induces the butterfly ab|cd if and only if one of the bipartitions that is induced by its internal branches separates {a, b} from {c, d}. If there is no internal branch separating the quartet into two pairs, the tree induces a star on the quartet of leaves.
Given two unrooted trees over the same set of leaves, their quartet distance is the proportion of quartets over their leaves that have different topologies in the two trees. In the example trees in Fig. 6, we have 7 leaves and therefore 35 quartets. Of these 35 quartets, 16 have different topologies in the two trees, so the quartet distance is 16/35.
Similar to the generalized Robinson-Foulds distance defined above, I will follow Pompei et al. (2011) in using a generalized version of the quartet distance that takes the asymmetry between binary branching inferred trees and multiply branching expert trees into account. The generalized quartet distance (GQD) between an inferred tree and an expert tree is the proportion of butterflies in the expert tree having a different topology in the inferred tree. For the example in Fig. 6, the fit is perfect, i.e. the GQD equals 0.
The quartet measures are less intuitive than the corresponding Robinson-Foulds measures, but they have the advantage of being more tolerant of small errors. For instance, exchanging two leaves in one of two large trees may have a dramatic effect on the GRF, while the GQD changes only slightly.
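For concreteness, here is a small Python sketch of the Robinson-Foulds computation via bipartitions, together with its generalized variant. Trees are encoded as nested tuples; the toy trees mirror the situation described above (a binary tree refining a non-binary expert tree) but are not the trees of Fig. 6:

```python
def leaves(t):
    if isinstance(t, str):
        return frozenset([t])
    return frozenset().union(*(leaves(c) for c in t))

def splits(t, universe=None):
    """Non-trivial bipartitions of the leaf set induced by the internal
    branches of a rooted encoding of an unrooted tree."""
    if universe is None:
        universe = leaves(t)
    out = set()
    def walk(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        if 1 < len(side) < len(universe) - 1:
            out.add(frozenset([side, universe - side]))
        for c in node:
            walk(c)
    for c in t:  # the root node itself corresponds to no branch
        walk(c)
    return out

def rf(t1, t2):
    """Plain (symmetric) Robinson-Foulds distance."""
    u = leaves(t1)
    s1, s2 = splits(t1, u), splits(t2, u)
    return len(s1 ^ s2) / (len(s1) + len(s2))

def grf(binary_tree, expert_tree):
    """Generalized Robinson-Foulds: proportion of the expert tree's
    internal branches missing from the inferred binary tree."""
    u = leaves(expert_tree)
    s1, s2 = splits(binary_tree, u), splits(expert_tree, u)
    return len(s2 - s1) / len(s2) if s2 else 0.0

# toy trees over six leaves; the binary tree refines the expert tree
binary = ((("a", "b"), "c"), (("d", "e"), "f"))
expert = (("a", "b", "c"), ("d", "e", "f"))
print(rf(binary, expert))   # penalizes the extra resolution
print(grf(binary, expert))  # 0.0: every expert branch is recovered
```

The contrast between the two outputs illustrates why the generalized measure is preferable when comparing binary inferred trees to multiply branching expert trees.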
In the following I will compare the three distance measures between languages discussed so far: LDND, dER and dERC. Let us first look at the triplet distances to the three expert classifications mentioned above, WALS, Ethnologue, and Hammarström (2010). The comparison was performed with the full set of 5,481 word lists.
Triplet distances for LDND, dER and dERC
For all three expert classifications, we find a slight improvement both from LDND to dER and again from dER to dERC, even though the differences are quite small.
To compute the GRF, for each of the three pairwise distance matrices a phylogenetic tree is computed via the Neighbor Joining algorithm, and those are compared to the three expert classifications both via GRF and via GQD. The results are shown in Table 4.
Generalized Robinson-Foulds distances and generalized quartet distances for LDND, dER and dERC
These figures seem to indicate that LDND performs best according to WALS and Ethn, while dER comes out better for Hstr. These numbers are arguably misleading, however. The GRF relies on the Neighbor Joining tree, which is quite sensitive to the properties of the specific data set. This can be illustrated with the following small experiment. Ten mutually disjoint subsets of ASJP were drawn, each containing 275 word lists. For each of these subsets, the Neighbor Joining trees for LDND, dER and dERC were computed and compared to the WALS classification according to GRF and GQD. The results are shown in Table 5.
Generalized Robinson-Foulds and quartet distances for ten random samples
Both the numerical values of GRF and GQD and the relative ordering of the three distance measures differ widely. For instance, LDND leads to the lowest GRF value five times, dER three times, and dERC five times.
To assess the quality of the different distance measures despite the noisiness of phylogenetic inference, I drew 1,000 random samples from the 5,000+ ASJP word lists, each containing 500 word lists, and averaged over the various tree distance measures to the expert classifications.^{8} The results are given in Table 6 and the distributions are visualized as box plots in Fig. 7.
Generalized Robinson-Foulds and quartet distances for 1,000 random samples
From these data we can conclude that dERC gives slightly better results than dER for all evaluations, so the correction for missing entries does have a positive effect. The comparison between LDND and dERC is equivocal. On average, dERC is slightly better for GRF and slightly worse for GQD.
As the 1,000 samples used here are not stochastically independent,^{9} it is not possible to perform meaningful statistical tests, so it is impossible to say whether there are significant quality differences between LDND and dERC. In any event, the differences are very small.
5. Weighted String Alignment
5.1. The Method
Levenshtein alignment only distinguishes between identical and non-identical sounds. To achieve a better approximation of the etymologically correct alignment of cognate words, a graded notion of similarity between sounds seems more appropriate. The ALINE system from Kondrak (2002), for instance, uses a sophisticated hand-crafted notion of segment similarity that draws on insights from phonology. ALINE will be discussed in more detail in the next section. The present section reviews an alternative approach that has been used in previous work in bioinformatics (see, for instance, Durbin et al., 1989) and computational dialectometry (cf., among others, Wieling et al., 2009).
The approach emerges from considerations similar to those used in the previous section in the derivation of the ER measure.^{10} Suppose we want to compare two strings
The prior odds
We start with the (unrealistic) assumption that in case
Let
If
Of course the assumptions made here—stochastic independence of positions within a sequence and of evolutionary changes at different positions within a sequence—are wildly unrealistic. Nevertheless this null model leads to workable results, as we will see later on.
Putting the pieces together, we have
In the bioinformatics tradition, this quantity is called the log odds score. In computational linguistics, it is also known under the name of Pointwise Mutual Information (PMI, see Church and Hanks, 1990).^{11} I will follow the latter terminology here. Some notation:
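As an illustration of the log odds idea, PMI scores for segment pairs can be estimated from counts of aligned segments roughly as follows: the joint probability of a pair under cognacy is compared against the product of the marginal segment probabilities. The counts below are invented toy data, not ASJP statistics:

```python
import math
from collections import Counter

def pmi_scores(aligned_pairs):
    """PMI (log odds) per segment pair: log p(a,b) / (q(a) * q(b)),
    with p estimated from aligned pairs and q from the marginal
    segment frequencies.  Toy estimator; no smoothing is applied."""
    pair_counts = Counter(aligned_pairs)
    total = sum(pair_counts.values())
    seg_counts = Counter()
    for (a, b), c in pair_counts.items():
        seg_counts[a] += c
        seg_counts[b] += c
    seg_total = sum(seg_counts.values())
    return {(a, b): math.log((c / total) /
                             ((seg_counts[a] / seg_total) *
                              (seg_counts[b] / seg_total)))
            for (a, b), c in pair_counts.items()}

# invented alignment counts: s mostly aligns with s, sometimes with h or t
pairs = ([("s", "h")] * 5 + [("s", "s")] * 20 +
         [("h", "h")] * 10 + [("s", "t")] * 2)
scores = pmi_scores(pairs)
print(scores[("s", "s")], scores[("s", "h")], scores[("s", "t")])
```

Identical pairs that co-occur often receive positive scores; rare chance pairings tend toward zero or negative values, though very rare segments can inflate their own PMI, which is why smoothing matters in practice.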
We now turn to the issue of insertions and deletions. Suppose we evaluate a specific hypothesis about the historical relation between
The gap symbol “-” represents a position where either a segment has been deleted or a segment has been added in the other language.
Following standard practice in bioinformatics, I assume that there is a uniform PMI score for gaps, regardless of the segment the gap is matched with:
The constant
However, both in biological evolution and in language change, insertions and deletions frequently operate on contiguous chunks of segments. For instance, in language comparison we frequently find partial cognates, i.e. word pairs in which one item is morphologically complex (or is etymologically derived from a morphologically complex word) and the other word is cognate to just one morpheme of the first word. Consider the Latin and Italian words for mountain, mons/montagna, transcribed as mons/monta5a in ASJP. The Italian word is probably derived from the Latin montaneus ‘mountainous,’ a denominal adjectivization of mons. So the correct alignment is
The three gaps at the end of the upper sequence are the reflex of a single historical process, i.e. suffixation plus semantic change.
Since gaps frequently come in chunks, the penalty for a gap in
The same applies mutatis mutandis to gaps within
With these provisos, the PMI score of an alignment of
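A Gotoh-style dynamic program implementing PMI-weighted alignment with distinct gap opening and extension costs might look as follows. The substitution scores and gap parameters are illustrative placeholders, not the trained values from this paper:

```python
NEG = float("-inf")

def nw_score(x, y, sub, gap_open=-2.5, gap_ext=-1.75):
    """Needleman-Wunsch/Gotoh alignment score with affine gap costs:
    opening a gap costs gap_open, extending it only gap_ext (gaps tend
    to come in chunks).  `sub` maps segment pairs to PMI-style scores;
    unseen pairs fall back to an assumed default of -1.0."""
    m, n = len(x), len(y)
    M = [[NEG] * (n + 1) for _ in range(m + 1)]  # match/mismatch state
    X = [[NEG] * (n + 1) for _ in range(m + 1)]  # gap in y
    Y = [[NEG] * (n + 1) for _ in range(m + 1)]  # gap in x
    M[0][0] = 0.0
    for i in range(1, m + 1):
        X[i][0] = gap_open + gap_ext * (i - 1)
    for j in range(1, n + 1):
        Y[0][j] = gap_open + gap_ext * (j - 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = sub.get((x[i - 1], y[j - 1]),
                        sub.get((y[j - 1], x[i - 1]), -1.0))
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1],
                          Y[i - 1][j - 1]) + s
            X[i][j] = max(M[i - 1][j] + gap_open, X[i - 1][j] + gap_ext)
            Y[i][j] = max(M[i][j - 1] + gap_open, Y[i][j - 1] + gap_ext)
    return max(M[m][n], X[m][n], Y[m][n])

# illustrative substitution scores (not the trained PMI matrix)
sub = {("m", "m"): 2.0, ("o", "o"): 2.0, ("n", "n"): 2.0,
       ("t", "t"): 2.0, ("s", "s"): 2.0, ("s", "t"): 0.5,
       ("a", "a"): 2.0, ("5", "5"): 2.0}
print(nw_score("mons", "monta5a", sub))
```

On the partial-cognate example mons/monta5a, the three trailing gaps cost one opening plus two cheaper extensions, so the overall score stays positive despite the unmatched suffix.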
5.2. Parameter Estimation
To reliably estimate the PMI scores for all segment pairs, one would ideally need a very large corpus of correctly aligned sequence pairs. In bioinformatics such databases do indeed exist, and several carefully crafted substitution matrices for different domains have been constructed (see Durbin et al., 1989 for details). In dialectometric work (such as Wieling et al., 2012), such data are fairly easy to obtain because dialectometric data are organized in cognate sets, and the linguistically correct alignment between cognate words from different dialects of the same language can be reliably constructed with automatic means.
When dealing with crosslinguistic data from a wide variety of languages such as the ASJP data, the situation is more difficult. Sizeable amounts of expert cognacy judgments only exist for a small number of language families (mainly for Indo-European based on the pioneering work of Dyen et al., 1992, and for Austronesian, see Greenhill et al., 2008). Also, and more importantly, the ultimate goal of this entire enterprise is to do language classification automatically. Therefore, information about language family affiliation should not be utilized for parameter training to avoid circularity.
In the following, a heuristic method is described to extract a large corpus of probable cognate pairs from the ASJP word lists, which can be used for parameter training. The method only relies on the word lists themselves; no additional information about cognacy relations or the genetic affiliation of the languages involved is being used.
To avoid the pitfall of overtraining, I split the ASJP database into two sets of about equal size, the training set and the test set. For training purposes, I use only the former. The resulting model will then be tested against the latter.
To make sure the two sets are really independent—or at least to approximate this ideal as far as possible with crosslinguistic data—the two sets were constructed in such a way that each WALS family belongs completely to either the training set or the test set.^{13} To be more specific, the set of WALS families was placed in a random order and languages and language families were added to the training set in this order, as long as its size did not exceed half the size of the entire database. The remaining families constitute the test set. The training set contains 2,723 and the test set 2,758 word lists. The lists of language families in the two sets are provided in the Online Supporting Material.
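One plausible implementation of this family-respecting split; the family inventory is invented, and the handling of families that would push the training set past the half-way mark (skipped here rather than truncated) is an assumption:

```python
import random

def split_by_family(families, seed=1):
    """Split doculects into training and test sets such that every
    family lands wholly in one set: shuffle the families, then add
    them to the training set as long as it stays at or below half
    the total size; the rest go to the test set."""
    fams = list(families.items())  # family name -> list of doculects
    rng = random.Random(seed)
    rng.shuffle(fams)
    total = sum(len(docs) for _, docs in fams)
    train, test, train_size = [], [], 0
    for fam, docs in fams:
        if train_size + len(docs) <= total // 2:
            train.extend(docs)
            train_size += len(docs)
        else:
            test.extend(docs)
    return train, test

# hypothetical families with invented member doculects
families = {"A": ["a1", "a2", "a3"], "B": ["b1", "b2"],
            "C": ["c1"], "D": ["d1", "d2", "d3", "d4"]}
train, test = split_by_family(families)
print(len(train), len(test))
```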
Relying on the training set only, I used the following procedure for constructing a sufficiently large corpus of probable cognate pairs, which can be used for parameter training:

1. All language pairs that have a dERC distance below a given threshold are considered to be probably related.

2. For each pair of probably related languages and each concept, all entries for that concept in the two ASJP word lists are considered. The pair of words with the lowest LDN score is considered a potential cognate.

3. All pairs of probable cognates are then aligned with the Levenshtein algorithm. If there are multiple optimal alignments, only one of them is considered.^{14}
This yields a set of aligned sequence pairs, from which the relevant quantity is estimated for each pair of segment types.
Assuming certain values for the gap penalties (more on this later), in the next step the set of potential cognate pairs is aligned with the Needleman-Wunsch algorithm, using the estimated parameters.
Additionally, I assume a threshold
The re-estimation of parameters is repeated 10 times. Experience shows that the estimated parameter values no longer change substantially after that.
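The overall re-estimation loop can be sketched as follows. The cognate pairs are invented, the gap cost is a simple linear penalty rather than the affine scheme described above, and the thresholding steps are omitted; this is a schematic sketch, not the paper's training pipeline:

```python
import math
from collections import Counter

def align(x, y, score, gap=-2.0):
    # Needleman-Wunsch with traceback; returns aligned segment pairs
    # (gap positions are simply not counted in this sketch)
    m, n = len(x), len(y)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + gap
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = max(D[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          D[i - 1][j] + gap, D[i][j - 1] + gap)
    pairs, i, j = [], m, n
    while i > 0 and j > 0:
        if D[i][j] == D[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            pairs.append((x[i - 1], y[j - 1])); i, j = i - 1, j - 1
        elif D[i][j] == D[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return pairs

def estimate_pmi(pair_counts):
    total = sum(pair_counts.values())
    marg = Counter()
    for (a, b), c in pair_counts.items():
        marg[a] += c; marg[b] += c
    mt = sum(marg.values())
    return {(a, b): math.log((c / total) / (marg[a] / mt * marg[b] / mt))
            for (a, b), c in pair_counts.items()}

# hypothetical probable-cognate pairs (invented, not from ASJP)
cognates = [("fiS", "fisk"), ("hand", "hand"), ("ston", "sten"), ("bon", "ben")]

pmi = {}
score = lambda a, b: 2.0 if a == b else -1.0  # iteration 0: identity scores
for it in range(10):  # the paper repeats re-estimation 10 times
    counts = Counter()
    for w1, w2 in cognates:
        counts.update(align(w1, w2, score))
    pmi = estimate_pmi(counts)
    score = lambda a, b, p=pmi: p.get((a, b), p.get((b, a), -1.0))
print(sorted(pmi)[:5])
```

The loop alternates between aligning the candidate cognates with the current scores and re-estimating the scores from the resulting segment-pair counts; on such a tiny toy set it stabilizes almost immediately.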
The appropriate choice for the metaparameters
It can be seen that the distribution is dominated by a bellshaped curve with the maximum at about
As pointed out above, there is no straightforward way to estimate gap penalties from a training corpus. Appropriate values for
The training procedure supplies a PMI matrix for a given vector
As a heuristic to assess the quality of a parameter configuration, I sampled 1,000 pairs of probably related languages, i.e. languages with a dERC
As even a single evaluation step is computationally quite expensive, advanced methods of optimization such as simulated annealing proved to be impractical. Therefore I performed a simple downhill Nelder-Mead-style optimization (cf. Nelder and Mead, 1965), starting from several manually chosen initial positions. The lowest value of the target function was achieved with
For a selection of sounds, the optimal PMI scores thus derived are shown in Table 7. (The full matrix is provided in the Online Supporting Material.) Not surprisingly, the entries along the diagonal are all positive, i.e. alignment of two identical elements provides the strongest evidence for relatedness. Additionally, we find positive PMI scores for several sound pairs that are known to be frequently historically related via sound shifts, such as p/b, d/t, d/8 (where the ASJP symbol 8 represents voiceless and voiced dental fricatives; cf. Table 15) and s/h. The latter case is especially interesting because the two sounds are articulatorily dissimilar, but the sound shift from s to h is known to be quite common (see, for instance, Ferguson, 1990).
PMI scores
Figure 9 displays a hierarchical clustering of the ASJP sound symbols according to their PMI scores.^{16} We find a primary split between vowels and consonants. The consonants are further divided into three large groups, which largely correspond to the dental, labial, and velar/uvular sounds. The only exception to this pattern according to place of articulation is the position of h and x (the voiceless and voiced velar fricatives), which are clustered together with the ssounds within the larger cluster of dental sounds. This is probably a reflex of the already mentioned diachronic cline from s to h.
Following the example of Wieling et al. (2012)—who obtained PMI scores essentially in the same way but using data from different dialects of the same language—I performed nonmetric multidimensional scaling with the PMI scores among the vowels. The result is displayed in Fig. 10. We find that the articulatory vowel triangle is reproduced to a good approximation, with the schwa (ASJP symbol 3) in the center.
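As a rough stand-in for the non-metric MDS used here, classical (Torgerson) MDS on a toy vowel dissimilarity matrix can be computed with NumPy. The distances are invented, derived in spirit from PMI scores (e.g. as the maximal PMI minus the pairwise PMI); the vowel set and all values are assumptions for illustration:

```python
import numpy as np

def classical_mds(dist, dim=2):
    """Classical (Torgerson) MDS -- a simple stand-in for non-metric
    MDS: double-center the squared distance matrix and project onto
    the leading eigenvectors."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ d2 @ J
    vals, vecs = np.linalg.eigh(B)             # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:dim]         # take the largest ones
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0, None))

# invented vowel dissimilarities; the schwa (3) is equidistant to all
vowels = ["i", "u", "a", "3"]
dist = np.array([[0.0, 2.0, 2.0, 1.2],
                 [2.0, 0.0, 2.0, 1.2],
                 [2.0, 2.0, 0.0, 1.2],
                 [1.2, 1.2, 1.2, 0.0]])
coords = classical_mds(dist)
print(coords.shape)  # (4, 2)
```

With these toy values, i, u, and a come out as the corners of a triangle and the schwa lands near the center, mirroring the pattern reported for the real PMI scores.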
Brown et al. (2013) also use the ASJP data to estimate the probability of different sound correspondences across the languages of the world. Their method is quite different from the one developed here, so a comparison of the results provides a certain validity check.
The authors use a highly conservative heuristic to identify regular sound correspondences. According to this method, a pair of languages counts as attesting a regular correspondence between two sounds if and only if

1. the two languages belong to the same genus, and

2. there are at least two concepts such that, for each of them, the ASJP entry from the one language can be transformed into its translation in the other language by replacing all occurrences of the first sound by the second (and vice versa).
To use the running example of the English/Swedish comparison again, there are only two regular correspondences that can be detected from the 40-item word lists: o/e (bon/ben ‘bone’ and ston/sten ‘stone’); and i/e (liv3r/lev3r ‘liver’ and si/se ‘see’).
A certain genus is available for a correspondence
Figure 11 plots the PG scores of all consonant pairs that have a positive PG score in the supporting online material from Brown et al. (2013) against their corresponding PMI score.
The
5.3. Aggregation
In Table 1 the LDN scores for several English/Swedish word pairs were given. Table 8 gives the corresponding PMI scores.
PMI scores English/Swedish
It can be seen that the PMI notion of string similarity is more fine-grained than LDN. For instance, while both du/yu and vi/wi receive a positive score (i.e. they are more likely to be related than not), the absolute value for the latter is much higher. This reflects the fact that a correspondence between v and w is more likely than one between d and y. The pair fisk/fiS has an even higher PMI score because (a) the words are longer than vi/wi, i.e. the evidence they provide is stronger, and (b) the correspondence s/S is very likely. This is counterbalanced only by a single gap penalty.
The distribution of PMI scores for the English/Swedish comparison on the diagonal and off the diagonal is shown in the left panel of Fig. 12. The right panel shows the same data for the comparison English/Swahili.
In comparison to the corresponding plots for LDN, the PMI values are much more spread out. Beyond that, we find a similar qualitative pattern (setting aside the inessential difference that LDN is a distance and PMI a similarity measure). For a pair of related languages, the diagonal entries are mostly much higher than the off-diagonal entries, while both collections appear to be drawn from the same distribution for a pair of unrelated languages.
The normalized ranks of PMI scores are now computed according to the definition given in Section 3, with LDN scores replaced by PMI scores and the direction of the ranking reversed, since PMI is a similarity rather than a distance measure.
Therefore the theoretical justification for the dERC-style aggregation of normalized ranks to a distance measure between languages also applies to PMI scores.
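The rank computation can be sketched as follows, under the assumption (the precise definition is the one given in Section 3) that the normalized rank of a diagonal entry is the fraction of off-diagonal scores that are at least as high; low ranks then indicate likely cognates.

```python
# Sketch of normalized ranks for a word-similarity matrix between two
# languages. Rows are words of language A, columns their translations in B;
# diagonal entries are synonymous pairs.
def normalized_ranks(sim):
    n = len(sim)
    off = [sim[i][j] for i in range(n) for j in range(n) if i != j]
    ranks = []
    for i in range(n):
        # fraction of off-diagonal scores at least as high as the diagonal one
        higher = sum(1 for s in off if s >= sim[i][i])
        ranks.append(higher / len(off))
    return ranks

# Toy 3x3 PMI matrix: the first two pairs look cognate, the third does not.
sim = [
    [5.0, -1.0, 0.5],
    [-0.5, 4.0, -2.0],
    [0.0, 1.0, -3.0],
]
ranks = normalized_ranks(sim)
```

Here the two plausible cognate pairs receive rank 0, while the third pair receives rank 1, i.e. every unrelated comparison scores at least as well.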
Table 9 compares the dERC/LDN scores and dERC/PMI scores for the language pairs from Table 2.
Table 9: dERC scores
These numbers convey the impression that the dERC/PMI scores for related languages are generally lower than the corresponding dERC/LDN scores, while the scores for unrelated languages are randomly distributed around a common value for both measures.
5.4. Empirical Evaluation
The methods described in Section 4 to compare different distance measures will now be used to evaluate the quality of the dERC/PMI against LDND and the LDN-based version of dERC (dERC/LDN). Only the word lists from the test set will be used for this comparison.
The triplet distances to the three expert classifications are given in Table 10 and visualized in Fig. 14.
Table 10: Triplet distances for LDND, dERC/LDN and dERC/PMI
We find a slight improvement from LDND to dERC/LDN and a more substantial improvement to dERC/PMI.^{17}
From the distance matrices for the test set for LDND, dERC/LDN and dERC/PMI, the corresponding phylogenetic trees were computed with Neighbor Joining. The generalized Robinson-Foulds distances and quartet distances are given in Table 11.
Table 11: Generalized Robinson-Foulds distances and generalized quartet distances for LDND, dERC/LDN and dERC/PMI
The results are not decisive, with dERC/PMI giving the lowest GRF scores and dERC/LDN the lowest GQD scores. However, as discussed in Section 4, evaluating different Neighbor Joining trees for a single data set can be highly misleading. Therefore the same procedure as above is applied here: 1,000 random samples of word lists from the test set, each comprising 500 doculects, are generated, Neighbor Joining trees for LDND, dERC/LDN and dERC/PMI are computed, and all three trees are compared to the three expert trees regarding both GRF and GQD. The results are depicted in Fig. 15 and the mean values are given in Table 12.
Table 12: Generalized Robinson-Foulds and quartet distances for 1,000 random samples
The mean values for the 1,000 samples display a pattern similar to the triplet distances: LDND and dERC/LDN perform about equally well (with a slight advantage for the former), while dERC/PMI leads to lower distance scores. As the aggregation method for dERC/LDN and dERC/PMI is identical, we can conclude that the PMI-based method of measuring string similarities leads to better phylogenetic inference than (normalized) Levenshtein distance.
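The resampling procedure just described can be sketched as follows. `neighbor_joining` and `gqd` are hypothetical stand-ins for the actual tree inference and tree comparison implementations used in the paper.

```python
# Schematic version of the resampling evaluation: repeatedly draw random
# subsets of doculects, build a Neighbor Joining tree per distance measure,
# and score it against an expert tree.
import random

def resampling_evaluation(doculects, distance_fns, expert_tree,
                          n_samples=1000, sample_size=500,
                          neighbor_joining=None, gqd=None):
    scores = {name: [] for name in distance_fns}
    for _ in range(n_samples):
        sample = random.sample(doculects, sample_size)
        for name, dist in distance_fns.items():
            tree = neighbor_joining(sample, dist)   # infer tree for sample
            scores[name].append(gqd(tree, expert_tree))
    # mean score per distance measure over all samples
    return {name: sum(v) / len(v) for name, v in scores.items()}
```

The same loop can be run with a GRF scorer in place of `gqd`, as in the comparison above.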
As a further test, I performed a version of cross-validation.^{18}
Cross-validation requires the individual subsets to be independent of each other. As discussed above, obtaining mutually independent subsamples of a cross-linguistic database such as ASJP that are representative of the data set as a whole is a non-trivial issue. As an approximation, I performed 4-fold cross-validation, where the subsets correspond to the four continental areas: Africa (including all Afro-Asiatic languages), Eurasia, the Indo-Pacific region (including Australia), and America.
For each continental area, the PMI parameters were trained on the data from the other three areas, and evaluation was carried out on the area in question.
Triplet distances for LDND, dERC/LDN and dERC/PMI: continental areas
In 11 out of 12 cases, dERC/PMI provides the best results (the exception being the Hstr classification for America, where LDND is slightly better). The general pattern for Africa, Eurasia and the Indo-Pacific is similar to that for the test set above: LDND and dERC/LDN are about equally good, while dERC/PMI is clearly better.
The average correlation of the PMI matrices obtained during cross-validation with the PMI matrix obtained from the training set is
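The leave-one-area-out scheme described above can be sketched as follows. `train_pmi` and `evaluate` are hypothetical stand-ins for the paper's actual parameter estimation and evaluation procedures.

```python
# Minimal sketch of 4-fold cross-validation by continental area: hold out one
# area at a time and train the PMI parameters on the other three.
AREAS = ["Africa", "Eurasia", "Indo-Pacific", "America"]

def cross_validate(doculects, train_pmi, evaluate):
    """doculects: dicts with at least an 'area' key."""
    results = {}
    for held_out in AREAS:
        train = [d for d in doculects if d["area"] != held_out]
        test = [d for d in doculects if d["area"] == held_out]
        pmi_params = train_pmi(train)        # estimate PMI scores on 3 areas
        results[held_out] = evaluate(test, pmi_params)
    return results
```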
5.5. Discussion
A possible objection against the general approach developed here concerns the risk of circularity. As an anonymous reviewer points out, it might be problematic to perform automatic language classification on the basis of parameters that are trained with data from a database “which was […] obtained through some other type of (manual) analysis.” Let us therefore carefully review what kind of information goes into the training procedure and what kind of information we get out of it.
The construction of the training corpus of word pairs relied on guessing a value for
dERC/LDN scores are determined on the basis of pairwise LDN scores for words from the word lists to be compared. No further information about the genetic affiliation of the languages involved is being used here, and LDN scores are obtained from Levenshtein distances, a general-purpose string comparison method that does not rely on any specifically linguistic information.
Once the training corpus is constructed, initial PMI scores are estimated using Levenshtein alignment. In subsequent steps, Needleman-Wunsch alignment is performed ten times, each time using the PMI score estimates from the previous step. For given values of
The test procedures in turn use the aggregate distance between word lists thus obtained to do phylogenetic inference and to compare the results to expert classifications. (Triplet distance relies on classifying triplets of languages, so this also involves a kind of phylogenetic inference). So the information that is obtained from the parametrized model—language classification—is of an entirely different nature than the information that went into it, namely word lists.
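The iterative estimation procedure described above can be sketched as follows. The helper functions are hypothetical stand-ins for the paper's actual alignment and estimation steps.

```python
# Sketch of the iterative PMI estimation loop: start from Levenshtein
# alignments, estimate PMI scores, then realign with Needleman-Wunsch using
# the current scores, and repeat ten times.
def estimate_pmi(word_pairs, initial_align, nw_align, pmi_from_alignments,
                 iterations=10):
    # step 0: alignments from a weight-free method (Levenshtein)
    alignments = [initial_align(a, b) for a, b in word_pairs]
    scores = pmi_from_alignments(alignments)
    for _ in range(iterations):
        # realign using the current score estimates, then re-estimate
        alignments = [nw_align(a, b, scores) for a, b in word_pairs]
        scores = pmi_from_alignments(alignments)
    return scores
```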
Another potential objection concerns the fact that the overall gain in accuracy reported above may appear modest at first sight. Such percentages have to be assessed against the range of scores that is practically achievable, however.

Both for the triplet distance and for GQD, the baseline of completely randomly distributed distances is not 1 but 2/3, because there are only three rooted binary trees for a triplet and three butterflies for a quartet of languages.
Even very crude distance measures achieve a much higher accuracy than these baselines suggest. To illustrate this point, I defined such a crude measure: for each word list, the vector of relative frequencies of occurrence of sounds is computed. The cosine similarity between two languages is then defined as the cosine of the angle between these vectors, and the cosine distance is one minus the cosine similarity. So this distance measure only quantifies how much the frequency patterns of unigrams differ between word lists, without any reference to the meaning of the words. The Neighbor Joining tree derived from these distances for the entire ASJP database already achieves GQD values well below the random baseline with respect to WALS, Ethn and Hstr.
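The crude baseline measure just described can be written down directly; the toy word lists below are the ASJP-style transcriptions used as running examples earlier in the paper.

```python
# Cosine distance between sound-frequency vectors: each word list is reduced
# to the relative frequencies of its sounds, and the distance between two
# languages is one minus the cosine of these vectors.
import math
from collections import Counter

def sound_frequencies(word_list):
    counts = Counter(ch for word in word_list for ch in word)
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()}

def cosine_distance(freqs_a, freqs_b):
    sounds = set(freqs_a) | set(freqs_b)
    dot = sum(freqs_a.get(s, 0.0) * freqs_b.get(s, 0.0) for s in sounds)
    norm_a = math.sqrt(sum(v * v for v in freqs_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freqs_b.values()))
    return 1.0 - dot / (norm_a * norm_b)

english = ["bon", "ston", "liv3r", "si"]   # ASJP-style transcriptions
swedish = ["ben", "sten", "lev3r", "se"]
d = cosine_distance(sound_frequencies(english), sound_frequencies(swedish))
```

Note that the measure is entirely meaning-blind: shuffling the translations within a word list would not change it.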
The practically achievable minimum GQD (and likewise for triplet distance and GRF) is arguably somewhat above 0. First, the expert classifications contain controversial units (such as Altaic, Australian, Niger-Congo and Trans-New Guinea in WALS), which may partially be wrong. In this case it would not be a defect of an automatic classification if those units are not detected. Second, the 40-item Swadesh lists arguably do not always contain the information that human experts would need to establish a genetic relationship between a group of languages.
To make a rough guess, the maximum GQD (achievable by a simple-minded distance measure such as the cosine distance) for a given data set may be around 35%–40%, and the minimum GQD that can possibly be attained by automatic methods from 40-item Swadesh lists may be around 3%. Each gain in accuracy of a certain percentage thus actually amounts to a much higher proportion (by a factor of about 3) of this range.
6. Comparison to ALINE
The PMI scores for word similarities used here are obtained via weighted string alignment. There have been several proposals in the literature on computational historical linguistics and computational dialectometry to employ weighted alignment for this purpose. Some of them use empirically determined log-odds scores as weights like the present proposal (cf. Wieling et al., 2012), while others (see, for instance, Covington, 1996; Somers, 1998; Heeringa, 2004) assume linguistically motivated hand-crafted substitution weights for segment pairs. The most sophisticated approach along the latter lines is perhaps the ALINE system by Kondrak (2002). A detailed discussion of ALINE would go beyond the scope of this article, so I will just mention the essential features.
In ALINE, each sound is represented by a vector of phonetic features, such as syllabic, back, place etc. These features have real numbers as values. The similarity between two segments is computed from their differences in feature values, weighted by the salience of these features.
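This feature-based notion of segment similarity can be illustrated schematically. The feature values and salience weights below are purely illustrative and do not reproduce Kondrak's (2002) actual settings.

```python
# Schematic ALINE-style segment similarity: each segment is a vector of
# phonetic feature values, and similarity decreases with the salience-weighted
# sum of feature differences. All numbers are illustrative placeholders.
FEATURES = ["place", "manner", "voice"]
SALIENCE = {"place": 40.0, "manner": 50.0, "voice": 10.0}

SEGMENTS = {
    "t": {"place": 0.85, "manner": 1.0, "voice": 0.0},
    "d": {"place": 0.85, "manner": 1.0, "voice": 1.0},
    "k": {"place": 0.50, "manner": 1.0, "voice": 0.0},
}

def segment_similarity(a, b, max_score=100.0):
    penalty = sum(SALIENCE[f] * abs(SEGMENTS[a][f] - SEGMENTS[b][f])
                  for f in FEATURES)
    return max_score - penalty
```

With these toy values, t and d (differing only in the low-salience feature voice) come out as more similar than t and k (differing in the higher-salience feature place).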
Additionally, ALINE captures compressions and expansions, i.e. alignments of a single segment in one word with two adjacent segments in the other word. Kondrak uses the cognate pair Latin factum/Spanish hecho ‘fact’ to illustrate this point. In the etymologically correct alignment, the Spanish affricate [ʧ] should be matched with the [t] and the [k] in the Latin word simultaneously. ALINE defines weights for aligning a single sound with a consecutive sequence of two sounds as well.
The present proposal uses the Needleman-Wunsch algorithm for string alignment. This algorithm finds the optimal global alignment, i.e. an alignment of the full sequences. ALINE uses half-global alignment instead. This means that in both strings to be compared, final subsequences can be ignored if this leads to a better alignment score. Half-global alignment is motivated by the observation that the right periphery of words is especially unstable in language change.
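The difference between the two alignment regimes can be sketched in a single dynamic program. The simple match/mismatch/gap scores below are placeholders for trained PMI weights or ALINE's feature-based weights.

```python
# Needleman-Wunsch alignment score, with an optional half-global variant in
# which trailing material in either word may be skipped without penalty.
def align(a, b, match=1.0, mismatch=-1.0, gap=-1.0, half_global=False):
    n, m = len(a), len(b)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + sub,
                          D[i - 1][j] + gap,
                          D[i][j - 1] + gap)
    if not half_global:
        return D[n][m]
    # Free trailing gaps: best score anywhere in the last row or last column.
    return max(max(D[n]), max(row[m] for row in D))
```

For a pair like fisk/fis, the global score pays for the final gap, while the half-global score ignores the trailing k, reflecting the instability of the right periphery of words.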
In Huff (2010) and Huff and Lonsdale (2011), the system PyAline is described, a freely available Python implementation of ALINE that includes substitution scores of ASJP sound classes. PyAline also contains an implementation of Downey et al.’s (2008) method to aggregate ALINE alignment scores to distances between languages. This facilitates a comparison with the distance measures defined here. In Huff and Lonsdale (2011) such a comparison with LDND is discussed. The authors conclude that both measures perform about equally well in phylogenetic inference.
Downey et al.’s aggregation method differs in two essential ways from ERC. First, word similarities are normalized. Given alignment scores (which are similarity scores), the normalized ALINE distance between two words a and b is defined as 1 − 2·s(a, b)/(s(a, a) + s(b, b)), where s is the ALINE similarity score.
Second, Downey et al. (2008) define the distance between two languages as the average normalized ALINE distance between translation pairs. This amounts to taking the average of the diagonal in the matrix of individual word distances, while the off-diagonal entries are not taken into account. Let us call this distance measure
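Downey et al.'s aggregation can be sketched as follows, assuming the usual score normalization 1 − 2·s(a, b)/(s(a, a) + s(b, b)) for similarity scores s.

```python
# Sketch of Downey et al. (2008)-style aggregation: word-level similarity
# scores are normalized to distances, and the language distance is the mean
# normalized distance over translation pairs (the diagonal only; off-diagonal
# comparisons are ignored).
def normalized_distance(s_ab, s_aa, s_bb):
    # assumed normalization: identical words get distance 0
    return 1.0 - 2.0 * s_ab / (s_aa + s_bb)

def language_distance(pairs):
    """pairs: list of (s(a,b), s(a,a), s(b,b)) for synonymous word pairs."""
    dists = [normalized_distance(*p) for p in pairs]
    return sum(dists) / len(dists)
```

This makes the contrast with ERC-style aggregation concrete: here only diagonal comparisons enter the language distance, whereas dERC ranks each diagonal score against the off-diagonal ones.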
These differences in detail make a comparison to dERC difficult, because it has to be factored out whether possible differences in performance are due to the different alignment weights, the different alignment algorithm, the normalization step or the difference in the aggregation scheme. As