The paper investigates the task of inferring a phylogenetic tree of languages from the collection of word lists made available by the Automated Similarity Judgment Project. This task involves three steps: (1) computing pairwise word distances, (2) aggregating word distances to a distance measure between languages and inferring a phylogenetic tree from these distances, and (3) evaluating the result by comparing it to expert classifications. For the first step, weighted alignment will be used, and a method to determine weights empirically will be presented. For the second step, a novel method will be developed that attempts to minimize the bias resulting from missing data. For the third step, several methods from the literature will be applied to a large collection of language samples to enable statistical testing. It will be shown that the language distance measure proposed here leads to substantially more accurate phylogenies than a method relying on unweighted Levenshtein distances between words.
Recent years have seen the introduction of many proposals to use phylogenetic inference techniques from bioinformatics in order to extract information about genetic relations from languages. There are essentially two basic approaches being currently employed. Character-based methods start by defining a set of features—characters—to classify languages. The feature values are assumed to be inert in language change. Therefore the number of shared feature values between two languages can be taken as a measure of their relatedness. Methods such as Maximum Parsimony, Maximum Likelihood, and Bayesian phylogenetic inference take a classification of languages according to a list of features as input and produce a phylogenetic tree including changes in feature values along the branches as output (see, for instance, Felsenstein, 2004, for a comprehensive overview). Suitable features may be cognate classes of basic vocabulary items (as used by, for example, Gray and Atkinson, 2003, and Bouckaert et al., 2012) or grammatical features (Dunn et al., 2005, and subsequent work).
The second approach uses distance-based techniques of phylogenetic inference. These methods start from a matrix of pairwise distances between languages that ideally correspond to the time that has passed since the split of the latest common ancestor of the two languages compared along the two lineages leading to those languages. Phylogenetic inference produces a tree where the path length between two leaf nodes is as close as possible to their pairwise distance. Such methods are suitable when dealing with raw data that are not organized in a feature matrix, such as lists of non-cognate-coded basic vocabulary items.
Extracting phylogenetic information from word lists usually proceeds in three steps (see, for instance, Downey et al., 2008 or Holman et al., 2008): (a) the similarity/distance between words from different languages is determined using some kind of alignment algorithm, (b) these word distances are aggregated to pairwise distances between languages, and (c) a phylogenetic tree is inferred. As for the final step, the Neighbor Joining algorithm (Saitou and Nei, 1987) has emerged as the de facto standard.
The quality of a phylogeny thus inferred can be assessed by comparing it to expert classifications. How such a comparison is to be performed is an active area of investigation; see Wichmann et al. (2010), Greenhill (2011), Huff and Lonsdale (2011), Pompei et al. (2011) for some recent contributions.
In this study I will propose three innovations pertaining to this research program:
A similarity score between words that is computed via weighted alignment, including a procedure to obtain the required weights in a data-driven way,
a novel method to aggregate word similarity scores into distances between languages, and
a generalization of existing methods for evaluating the quality of distance measures between word lists, using expert classifications as gold standard.
The study is carried out using version 15 of the Automated Similarity Judgment Project (ASJP) database (Wichmann et al., 2012), a collection of Swadesh lists for more than 5,800 languages1 that are phonetically transcribed in a uniform way. Only the 40 most stable Swadesh concepts are used in this paper. After excluding artificial languages, creoles, and extinct and reconstructed languages, 5,481 word lists were kept in the database. Attested loan words are not excluded. Diacritics in the phonetic transcriptions are ignored.
As a baseline for comparison, I use the method described in Holman et al. (2008) to compute language distances from ASJP word lists.
The structure of the paper is as follows. Section 2 reviews Holman et al.’s proposal. The novel method for aggregating word similarity scores is developed in Section 3. Section 4 discusses the issue of how to evaluate distance measures between languages and presents a comparative evaluation of the different aggregation methods. Section 5 introduces weighted word alignment and presents the procedure to train the required weights with ASJP data. It also provides a thorough empirical comparison of the distance measure obtained in this way with alternative approaches. In Section 6, the method developed here is compared to Kondrak’s (2002) ALINE system. Section 7 contains some final discussion and conclusions.
2. State of the Art: The LDND Score
Holman et al. (2008) propose a method to compute distances between ASJP word lists based on the edit distances between individual words. The edit distance or Levenshtein distance between two words $x$ and $y$ is defined as the minimal number of edit operations (insertion, deletion, replacement) necessary to transform $x$ into $y$. Alternatively, it can be defined as the minimal number of mismatches in an alignment of the two words.
In the example in Fig. 1 (showing the alignment between the English and Latin words for horn, spelled according to the ASJP transcription system; the ASJP symbols are explained in the Appendix), this value would be 2. To control for varying word length, Holman et al. normalize this measure by dividing it by the length of the longer word. In the example, this amounts to $2/5 = 0.4$. By definition, the normalized Levenshtein distance LDN takes values between 0 and 1.
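For concreteness, the definition just given can be sketched in a few lines of Python. This is an illustration of LDN as defined above, not the ASJP reference implementation; the example words in the comment are the horn/kornu pair discussed in the text.

```python
def levenshtein(a, b):
    """Minimal number of insertions, deletions, and substitutions
    transforming string a into string b (dynamic programming)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                            # deletion
                          d[i][j - 1] + 1,                            # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))   # (mis)match
    return d[m][n]

def ldn(a, b):
    """Levenshtein distance normalized by the length of the longer word."""
    return levenshtein(a, b) / max(len(a), len(b))

# ldn("horn", "kornu") gives 2/5 = 0.4, as in the example above.
```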
The normalized Levenshtein distance provides a distance measure between words, with 0 indicating identity and 1 indicating maximal difference. To obtain a distance measure between two word lists, it might seem natural to simply average the LDN scores between corresponding words from the languages to be compared. However, if two languages have small and strongly overlapping sound inventories, the number of chance hits is high compared to a language pair with large and dissimilar sound inventories. On average, the LDN values between unrelated words will be smaller in the former than in the latter case. To control for this effect, the authors propose a method to calibrate the average LDN score between synonymous word pairs to the specific language pair being compared.
This is best illustrated with an example. Table 1 shows the pairwise LDN scores for some English and Swedish vocabulary items from ASJP.
The authors define the LDND score of two languages (Levenshtein Distance Normalized and Divided) as the mean LDN score along the diagonal of the full matrix, i.e. between words with identical meanings, divided by the mean LDN score off the diagonal, i.e. between word pairs with different meanings. For the comparison of English and Swedish, this is the ratio of these two averages computed over the full matrix.
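The LDND aggregation just described can be sketched as follows. This is a minimal illustration, not Holman et al.'s code: the two word lists are assumed to be synonym-aligned, with None marking a missing entry.

```python
def levenshtein(a, b):
    """Classical edit distance, rolling-array dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # (mis)match
        prev = cur
    return prev[-1]

def ldn(a, b):
    """Normalized Levenshtein distance."""
    return levenshtein(a, b) / max(len(a), len(b))

def ldnd(list1, list2):
    """Mean diagonal LDN divided by mean off-diagonal LDN over the
    concepts attested in both synonym-aligned lists."""
    pairs = [(w1, w2) for w1, w2 in zip(list1, list2) if w1 and w2]
    diag = [ldn(w1, w2) for w1, w2 in pairs]
    off = [ldn(pairs[i][0], pairs[j][1])
           for i in range(len(pairs)) for j in range(len(pairs)) if i != j]
    return (sum(diag) / len(diag)) / (sum(off) / len(off))
```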
3. Quantifying the Evidence for Genetic Relatedness of Languages
3.1. The Evidence for Relatedness
As spelled out in the previous section, the LDND score aggregates distances between words into a distance between languages (i.e., word lists over a given concept list) by
computing the distances between all word pairs from the two lists,
computing the average distance between synonymous and the average distance between non-synonymous words, and
dividing the former by the latter.
In this section I will propose an alternative method for aggregating a matrix of distances between words from two word lists into an overall distance measure between the corresponding languages.
To illustrate the underlying intuition, consider again the matrix of LDN scores between English and Swedish words illustrated in Table 1. The distribution of off-diagonal scores is shown in Fig. 2.
The word pair fiS/fisk ‘fish’ has an LDN score of 0.5, marked by the dashed line in Fig. 2. Only 4 off-diagonal entries have a lower score, and 31 entries have the same score. This means that a randomly picked pair of non-synonymous words has only a very small chance of being more similar to each other than fiS/fisk. Intuitively, the fact that this fraction is so small provides evidence that fiS and fisk (and therefore English and Swedish) are related. Likewise, each of the other diagonal entries provides a certain amount of evidence for the languages being related, depending on its position within the distribution of off-diagonal entries.
To make this precise, let us assume that $l^1$ and $l^2$ are the word lists from the languages to be compared. The $i$-th entry of $l^1$ is denoted by $l^1_i$, and likewise for $l^2$. It is assumed that $l^1_i$ and $l^2_i$ are pairwise synonymous for all $i$. $d_{ij}$ is the LDN distance between $l^1_i$ and $l^2_j$.
The ASJP data contain missing entries at many positions. To deal with this issue, we assume that there are $n$ concepts for which both $l^1$ and $l^2$ contain entries. If $l^1$ does not contain an entry for concept $i$, $l^1_i$ and the distances $d_{ij}$ are undefined, and likewise for $l^2$.
The rank of a diagonal entry $d_{ii}$, written as $r_i$, is the position that $d_{ii}$ would assume if it were added to the set of off-diagonal entries and the resulting set were sorted in increasing order. Formally, we have

$$r_i = |\{\langle j, k\rangle : j \neq k \wedge d_{jk} < d_{ii}\}| + 1. \qquad (1)$$

(It is tacitly assumed that only those pairs $\langle j, k\rangle$ are counted for which $d_{jk}$ is defined.)
For the time being we assume that there are no ties; this issue will be taken up later on.
As the sizes of word lists may differ between languages, we normalize the rank by dividing it by the maximal possible rank, which is the number of off-diagonal entries, $n(n-1)$. This leads to the definition of the normalized rank $nr_i$, which always assumes a value in the interval $(0, 1]$:

$$nr_i = \frac{r_i}{n(n-1)}.$$
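A sketch of this rank computation in Python, ignoring ties for the moment (the tie treatment via geometric means is introduced later in the text; the matrix is assumed to contain only defined entries):

```python
def normalized_ranks(dist):
    """Normalized ranks of the diagonal of a square distance matrix:
    the position each diagonal entry would take among the off-diagonal
    entries, divided by their number n*(n-1). Ties are ignored in
    this sketch."""
    n = len(dist)
    off = [dist[j][k] for j in range(n) for k in range(n) if j != k]
    return [(sum(1 for v in off if v < dist[i][i]) + 1) / len(off)
            for i in range(n)]
```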
If the languages in question are unrelated, the entries along the diagonal are drawn from the same distribution as the off-diagonal entries. Therefore we expect each rank (between $1$ and $n(n-1)$) to be equally likely. However, if the languages are related, we expect some diagonal entries to be small in comparison to the off-diagonal entries, i.e. we expect their ranks to be small.
This is illustrated in Fig. 3. The left panel shows the distribution of diagonal entries (left boxplot) and off-diagonal entries (right boxplot) for the comparison of English and Swedish.
It is clearly visible that the diagonal scores are on average much lower than the off-diagonal scores.
The right panel shows the same data for the comparison of English with Swahili. The two languages are unrelated, and the diagonal entries are similarly distributed as the off-diagonal entries.
For a pair of unrelated languages we expect the normalized ranks for diagonal entries to be uniformly distributed between 0 and 1. If the languages are related, this distribution should be skewed towards small values. To test this, I drew 100,000 ranks from a random selection of ASJP language pairs that belong to the same genus according to WALS (Haspelmath et al., 2008), and another sample of 100,000 ranks from language pairs that each represent different WALS families. The histograms of the two distributions are shown in Fig. 4.
As expected, the normalized ranks for pairs of related languages are heavily skewed towards small values, while the values for unrelated languages approximately follow a uniform distribution.2
Figure 5 displays the same data as histograms with logarithmic binning in a log-log plot.
The values for the related languages lie approximately on a straight line with a negative slope. This indicates that the normalized ranks are distributed according to a power law (see, for instance, Clauset et al., 2009 on power law distributions in empirical data). This means that there are real numbers $\alpha$ and $C$ such that the probability of observing a normalized rank $x$ falls off as

$$P(x) \approx C x^{-\alpha}.$$

We can thus approximate the empirical distribution by a continuous probability density function $f_1$ with

$$f_1(x) = C x^{-\alpha}.$$

The distribution of normalized ranks for unrelated languages can be approximated by a constant density function:

$$f_0(x) = 1 \quad \text{for } x \in (0, 1].$$
Suppose we have to decide whether or not two languages are related on the basis of the normalized ranks $nr_1, \ldots, nr_n$ of all translation pairs. So we compare two hypotheses: $H_0$ (the languages are unrelated) and $H_1$ (the languages are related). According to Bayes’ formula, the posterior odds of the two hypotheses are

$$\frac{P(H_1 \mid nr_1, \ldots, nr_n)}{P(H_0 \mid nr_1, \ldots, nr_n)} = \frac{P(nr_1, \ldots, nr_n \mid H_1)}{P(nr_1, \ldots, nr_n \mid H_0)} \cdot \frac{P(H_1)}{P(H_0)}.$$

If we make the simplifying assumption that the normalized ranks for the individual translation pairs are stochastically independent, this amounts to

$$\frac{P(H_1 \mid nr_1, \ldots, nr_n)}{P(H_0 \mid nr_1, \ldots, nr_n)} = \frac{P(H_1)}{P(H_0)} \cdot \prod_{i=1}^{n} \frac{f_1(nr_i)}{f_0(nr_i)}.$$

The posterior log odds thus come out as

$$\log \frac{P(H_1)}{P(H_0)} + \sum_{i=1}^{n} \left(\log C - \alpha \log nr_i\right).$$

While we do not know the prior probabilities $P(H_0)$ and $P(H_1)$, we can determine the term $-\sum_{i=1}^{n} \log nr_i$ empirically. The posterior log odds are a monotonically increasing linear function of this quantity.
However, this only holds for a constant $n$. Recall that $n$ is the number of concepts for which both word lists to be compared contain an entry. Due to missing data in ASJP, $n$ may assume different values for different language pairs. The maximal number for $n$ is $N = 40$. If $n < N$, we have to estimate the normalized ranks for the missing entry pairs. The maximum likelihood estimation is that the average value of the missing $\log nr$-values equals the average of the known values. Therefore the quantity $\frac{N}{n}\sum_{i=1}^{n} \log nr_i$ provides the maximum likelihood estimator of the full sum. As $N$ is constant, the estimated posterior log odds are a monotonically increasing function of the quantity

$$-\frac{1}{n}\sum_{i=1}^{n} \log nr_i.$$
Let us call this quantity the Evidence for Relatedness (ER).
To turn this into an operational definition, one further amendment needs to be made.
Recall that there may be ties, i.e. pairs $\langle i, j\rangle \neq \langle k, l\rangle$ with $d_{ij} = d_{kl}$. To put it another way, suppose we form the set of off-diagonal distances together with $d_{ii}$ and sort it in increasing order. The quantity $|\{\langle j, k\rangle : j \neq k \wedge d_{jk} < d_{ii}\}|$ in Eq. (1) is the number of items preceding $d_{ii}$ in this sequence; adding 1 yields the rank of $d_{ii}$. If there are ties, the rank may not be uniquely defined. In this case we compute the evidence for relatedness for concept $i$ for each possible rank and form the average. The possible ranks in the definition below are the set of ranks that $d_{ii}$ can assume in such a sequence. The normalized rank $nr_i$ is then the geometric mean of all possible ranks, divided by the number of off-diagonal entries $n(n-1)$. Forming the geometric rather than the arithmetic mean ensures that the logarithm of the normalized rank equals the arithmetic mean of the logarithms of the possible values of $r_i$.
This leads to the following final definition:
Definition 1 (Evidence for Relatedness)

$$ER(l^1, l^2) = -\frac{1}{n}\sum_{i=1}^{n} \log nr_i.$$
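Combining the rank computation, the geometric-mean tie treatment, and the averaging, the ER score can be sketched as follows. This is a reading of the definitions given in the text, not the author's code; the input is the $n \times n$ LDN matrix over the concepts attested in both word lists.

```python
import math

def er_score(dist):
    """Evidence for Relatedness: minus the mean log normalized rank
    of the diagonal entries, with ties resolved via the geometric
    mean of all possible ranks."""
    n = len(dist)
    off = [dist[j][k] for j in range(n) for k in range(n) if j != k]
    m = len(off)                          # n * (n - 1) off-diagonal entries
    total_log_nr = 0.0
    for i in range(n):
        below = sum(1 for v in off if v < dist[i][i])
        equal = sum(1 for v in off if v == dist[i][i])
        # possible ranks are below+1, ..., below+equal+1; the log of
        # their geometric mean is the mean of their logs
        mean_log_rank = (sum(math.log(r)
                             for r in range(below + 1, below + equal + 2))
                         / (equal + 1))
        total_log_nr += mean_log_rank - math.log(m)
    return -total_log_nr / n
```

For a self-comparison with all diagonal entries strictly below the off-diagonal ones, every normalized rank is $1/(n(n-1))$ and the score reduces to $\log n(n-1)$, matching the maximum discussed below.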
It is reasonable to assume that the Evidence for Relatedness becomes stronger the more closely two languages are related. ER can thus be considered a similarity measure between languages. It can easily be transformed into a distance measure. ER is maximized if we compare a word list to itself, provided it contains no homonyms and no missing entries. In this case, all $d_{ii} = 0$, and all off-diagonal entries are strictly larger, so each $r_i = 1$. The number of off-diagonal entries equals $N(N-1) = 1560$, so for all $i$, $nr_i = 1/1560$. Hence the ER score is

$$ER_{max} = \log 1560 \approx 7.35.$$

The theoretical minimum for the ER score is achieved if all diagonal entries are larger than all off-diagonal entries. In this scenario all $nr_i = 1$ and hence $ER = 0$. The Distance based on Evidence of Relatedness (dER) is then defined as follows:
Definition 2 (Distance based on Evidence of Relatedness)

$$dER(l^1, l^2) = \frac{ER_{max} - ER(l^1, l^2)}{ER_{max}}.$$

The dER score always assumes a value between 0 and 1. Note that it does not depend on the values of $\alpha$ and $C$, so no parameter fitting is necessary.
3.2. Correcting for Missing Entries
If the word lists to be compared contain missing entries, the dER measure relies on a maximum likelihood estimate of the scores of the missing entries. As a consequence, the absolute value both of positive and of negative evidence is overestimated in this case. While it seems unproblematic to underestimate the similarity between two languages if the available, incomplete word lists do not provide evidence for relatedness, the error in the opposite direction is potentially more serious. In the case of incomplete word lists, chance similarities receive a higher weight than is actually justified. To correct this, high ER scores should be discounted somewhat in proportion to the amount of missing entries.
Suppose the languages $L_1$ and $L_2$ are completely unrelated. Then the scores $nr_i$ are, to a good approximation, drawn from a uniform distribution over the interval $(0, 1]$. The term $-\log nr_i$ then follows a standard exponential distribution, with 1 as its mean and standard deviation.3
The ER score is defined as the mean of $n$ (approximately) independent variables that, if the languages are unrelated, are drawn from a distribution with mean 1 and variance 1. So the ER score is a random variable with mean 1 and variance $1/n$ if the languages are unrelated.4
The sum (and thus the average) of exponentially distributed variables follows an Erlang distribution. However, this distribution can be approximated by a normal distribution (also with mean 1 and variance $1/n$) if $n$ is sufficiently large. This follows from the Central Limit Theorem. So we can transform the ER score into a variable that is distributed according to a standard normal distribution in the following way:
Definition 3 (Corrected Evidence for Relatedness)

$$ERC(l^1, l^2) = \sqrt{n}\,\big(ER(l^1, l^2) - 1\big).$$

According to this definition, the mean and variance of the ERC scores for unrelated languages do not depend on $n$, i.e. on the number of missing entries. This enables statistical hypothesis testing for the null hypothesis $H_0$: “$L_1$ and $L_2$ are unrelated” vs. the alternative hypothesis $H_1$: “$L_1$ and $L_2$ are related.” The $p$-value for a given ERC score $x$ is simply the probability that a standard normally distributed variable has a value $\geq x$ (technically, this is $\frac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})$, where $\mathrm{erfc}$ is the complementary error function), regardless of the number of missing entries.5
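The correction and the associated p-value are then straightforward to compute; this sketch standardizes an ER score as read from the definitions above and uses the complementary error function for the normal tail:

```python
import math

def erc_score(er, n):
    """Corrected Evidence for Relatedness: standardize ER, which has
    mean 1 and variance 1/n for unrelated languages, to an
    approximately standard normal variable."""
    return (er - 1.0) * math.sqrt(n)

def p_value(erc):
    """One-sided p-value under H0 ('the languages are unrelated'):
    probability that a standard normal variable exceeds erc."""
    return 0.5 * math.erfc(erc / math.sqrt(2.0))
```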
Just like the ER score, the ERC score is a similarity measure between languages. It can be turned into a distance measure analogously to Definition 2.
Definition 4 (Distance based on Corrected Evidence of Relatedness)

$$dERC(l^1, l^2) = \frac{ERC_{max} - ERC(l^1, l^2)}{ERC_{max} - ERC_{min}}.$$

For ASJP data with $N = 40$, $ERC_{max} = \sqrt{40}\,(\log 1560 - 1) \approx 40.2$ and $ERC_{min} = -\sqrt{40} \approx -6.3$.
To get an idea for the numerical magnitudes, dERC scores for some language pairs are given in Table 2.
It might seem counter-intuitive that the dERC of English to itself is larger than 0. This reflects the fact that the ASJP list for English contains one pair of homonyms: both ‘I’ and ‘eye’ are transcribed as Ei. The probability of a chance identity is therefore assessed as positive, and hence the probability of the two lists being identical despite the languages being unrelated is also assessed as positive, albeit very small.
4. Empirical Evaluation
As will become clear later on, the main motivation for developing dERC is that this method of aggregation is also applicable to string distance measures with mathematical properties different from LDN.
A standard way to assess the quality of a distance measure between languages is to relate it to an expert classification. In this paper I will make use of three different expert classifications of languages:6
The two-level classification according to the World Atlas of Language Structures, Haspelmath et al. (2008), abbreviated as WALS hereafter,
the classification according to Ethnologue, Lewis (2009), abbreviated as Ethn, and
the classification according to Hammarström (2010), abbreviated as Hstr.
I will use three methods to compare a distance matrix to an expert classification:
Triplet distance: This method has been used in Greenhill (2011), and it is closely related to the Goodman-Kruskal Gamma measure used in Wichmann et al. (2010).
A triplet of languages $\langle a, b, c\rangle$ is resolved if and only if the expert tree contains a node that dominates $a$ and $b$ but not $c$. It is correctly classified by the distance measure $d$ if and only if $d(a, b) < \min(d(a, c), d(b, c))$. The triplet distance of the distance measure to the expert tree is the proportion of all resolved triplets that are classified incorrectly.7
The triplet distance (TD) measure has the advantage that it only uses comparisons between distances rather than numerical values. It is therefore invariant under all monotonic transformations of the distance measure, including non-linear ones. Also, it does not rely on a phylogenetic algorithm that may introduce its own bias.
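A sketch of the triplet distance, assuming the expert tree has already been reduced to its resolved triplets and pairwise distances are stored under frozenset keys (both representational choices are mine, not from the paper):

```python
def triplet_distance(dist, resolved_triplets):
    """Proportion of expert-resolved triplets classified incorrectly.
    Each triplet (a, b, c) means the expert tree groups a and b
    against c; dist maps frozenset({x, y}) to the language distance.
    A triplet counts as correct iff d(a,b) is strictly below both
    d(a,c) and d(b,c)."""
    wrong = 0
    for a, b, c in resolved_triplets:
        d_ab = dist[frozenset((a, b))]
        if not (d_ab < dist[frozenset((a, c))]
                and d_ab < dist[frozenset((b, c))]):
            wrong += 1
    return wrong / len(resolved_triplets)
```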
Generalized Robinson-Foulds distance: The Robinson-Foulds distance (Robinson and Foulds, 1981) is a standard distance measure between unrooted trees over the same set of leaves. As an illustration, consider the trees in Fig. 6.
The two trees have four and two internal branches, respectively. Each internal branch in an unrooted tree induces a bipartition of the set of leaves. The bipartitions induced by the internal branches on the right are identical to the bipartitions in the tree on the left. Additionally, the tree on the left contains two internal branches that have no counterpart in the tree on the right.
The Robinson-Foulds distance is the number of internal branches in both trees that have no counterpart in the other tree, divided by the total number of internal branches in both trees. In the example, this number is $2/6 = 1/3$.
However, this number is somewhat misleading. The tree on the left is binary branching, while the one on the right is not. The tree on the left contains all bipartitions that we find in the tree on the right, so the former approximates the information contained in the latter as closely as is possible for a binary branching tree.
This is a standard situation when comparing a tree that has been constructed by a phylogenetic inference algorithm such as Neighbor Joining, which is necessarily binary branching, with an expert tree that is not binary branching. To take this asymmetry into account, I follow Pompei et al. (2011) in using the generalized Robinson-Foulds distance (GRF). The GRF of a binary branching tree $T$ to another (perhaps non-binary branching) tree $T'$ is defined as the proportion of internal branches in $T'$ that do not have a counterpart in $T$. In the example, the distance of the first to the second tree comes out as $0$. The GRF is always a number between 0 and 1, with 1 indicating total disagreement and 0 optimal agreement.
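A sketch of the GRF for trees represented as nested tuples (the tuple representation and the helper names are assumptions for illustration, not from the paper):

```python
def leaves(tree):
    """Leaf labels of a tree given as nested tuples."""
    if isinstance(tree, tuple):
        return [l for child in tree for l in leaves(child)]
    return [tree]

def bipartitions(tree, all_leaves):
    """Nontrivial bipartitions induced by internal branches, each
    represented canonically by its smaller side (ties broken
    lexicographically)."""
    parts = set()
    def walk(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        side = frozenset().union(*(walk(child) for child in node))
        if 2 <= len(side) <= len(all_leaves) - 2:
            other = all_leaves - side
            parts.add(min(side, other, key=lambda s: (len(s), sorted(s))))
        return side
    walk(tree)
    return parts

def grf(inferred, expert):
    """Generalized Robinson-Foulds distance: proportion of the expert
    tree's internal branches whose bipartition has no counterpart in
    the (binary) inferred tree."""
    all_leaves = frozenset(leaves(expert))
    exp_parts = bipartitions(expert, all_leaves)
    inf_parts = bipartitions(inferred, all_leaves)
    return len(exp_parts - inf_parts) / len(exp_parts)
```

A binary tree that refines a multifurcating expert tree thus scores a perfect 0, which is the point of the generalization.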
Generalized quartet distance: Another commonly used distance measure between unrooted trees is the quartet distance (Estabrook et al., 1985). Given an unrooted tree and four leaves $a$, $b$, $c$, and $d$, the tree induces the butterfly $ab|cd$ if and only if one of the bipartitions induced by its internal branches separates $\{a, b\}$ from $\{c, d\}$. If there is no internal branch separating the quartet into two pairs, the tree induces a star on the quartet of leaves.
Given two unrooted trees over the same set of leaves, their quartet distance is the proportion of quartets over their leaves that have different topologies in the two trees. In the example trees in Fig. 6, we have 7 leaves and therefore $\binom{7}{4} = 35$ quartets. Of these 35 quartets, 16 have different topologies in the two trees, so the quartet distance is $16/35$.
Similar to the generalized Robinson-Foulds distance defined above, I will follow Pompei et al. (2011) in using a generalized version of the quartet distance that takes the asymmetry between binary branching inferred trees and multiply branching expert trees into account. The generalized quartet distance (GQD) between an inferred tree and an expert tree is the proportion of butterflies in the expert tree having a different topology in the inferred tree. For the example in Fig. 6, the fit is perfect, i.e. the GQD equals 0.
The quartet measures are less intuitive than the corresponding Robinson-Foulds measures, but they have the advantage of being more tolerant of small errors. For instance, exchanging two leaves in one of two large trees may have a dramatic effect on the GRF, while the GQD changes only slightly.
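Given the bipartitions induced by each tree (each represented by one side as a frozenset of leaves, however they were extracted), the GQD can be sketched as:

```python
from itertools import combinations

def quartet_topology(parts, quartet):
    """Topology induced on a 4-leaf quartet by a tree's bipartitions:
    the butterfly as a frozenset of its two pairs, or None if the
    tree induces a star on the quartet."""
    q = set(quartet)
    for side in parts:
        inside = frozenset(q & side)
        if len(inside) == 2:
            return frozenset({inside, frozenset(q - inside)})
    return None

def gqd(inferred_parts, expert_parts, leaf_set):
    """Generalized quartet distance: proportion of butterflies in the
    expert tree that have a different topology in the inferred tree
    (star quartets of the expert tree are ignored)."""
    butterflies = different = 0
    for quartet in combinations(sorted(leaf_set), 4):
        expert_top = quartet_topology(expert_parts, quartet)
        if expert_top is None:
            continue
        butterflies += 1
        if quartet_topology(inferred_parts, quartet) != expert_top:
            different += 1
    return different / butterflies
```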
In the following I will compare the three distance measures between languages discussed so far: LDND, dER and dERC. Let us first look at the triplet distances to the three expert classifications mentioned above, WALS, Ethnologue, and Hammarström (2010). The comparison was performed with the full ASJP word lists that come from living or recently extinct languages and dialects. The results are shown in Table 3.
For all three expert classifications, we find a slight improvement both from LDND to dER and again from dER to dERC, even though the differences are quite small.
To compute the GRF, for each of the three pairwise distance matrices a phylogenetic tree is computed via the Neighbor Joining algorithm, and those are compared to the three expert classifications both via GRF and via GQD. The results are shown in Table 4.
These figures seem to indicate that LDND performs best according to WALS and Ethn, while dER comes out better for Hstr. These numbers are arguably misleading, however. The GRF relies on the Neighbor Joining tree, which is quite sensitive to the properties of the specific data set. This can be illustrated with the following little experiment. 10 mutually disjoint subsets of ASJP were drawn, each containing 275 word lists. For each of these subsets, the Neighbor Joining trees for LDND, dER and dERC were computed and compared to the WALS classification according to GRF and GQD. The results are shown in Table 5.
Both the numerical values of GRF and GQD and the relative ordering of the three distance measures differ widely. For instance, LDND leads to the lowest GRF value five times, dER three times, and dERC five times.
To detect the quality of different distance measures despite the noisiness of phylogenetic inference, I drew 1,000 random samples from the 5,000+ ASJP word lists, each containing 500 word lists, and averaged over the various tree distance measures to the expert classifications.8 The results are given in Table 6 and the distributions are visualized as box plots in Fig. 7.
From these data we can conclude that dERC gives slightly better results than dER for all evaluations, so the correction for missing entries does have a positive effect. The comparison between LDND and dERC is equivocal. On average, dERC is slightly better for GRF and slightly worse for GQD.
As the 1,000 samples used here are not stochastically independent,9 it is not possible to perform meaningful statistical tests, so it is impossible to say whether there are significant quality differences between LDND and dERC. In any event, the differences are very small.
5. Weighted String Alignment
5.1. The Method
Levenshtein alignment only distinguishes between identical and non-identical sounds. To achieve a better approximation of the etymologically correct alignment of cognate words, a graded notion of similarity between sounds seems more appropriate. The ALINE system from Kondrak (2002), for instance, uses a sophisticated hand-crafted notion of segment similarity that draws on insights from phonology. ALINE will be discussed in more detail in the next section. The present section reviews an alternative approach that has been used in previous work in bioinformatics (see, for instance, Durbin et al., 1989) and computational dialectometry (cf., among others, Wieling et al., 2009).
The approach emerges from considerations similar to those used in the previous section in the derivation of the ER measure.10 Suppose we want to compare two strings $x$ and $y$ (such as sequences of DNA bases or protein molecules, or words that we suspect to be cognates) and figure out whether or not they developed from a common ancestor. We have two hypotheses, $H_0$: “$x$ and $y$ are unrelated” and $H_1$: “$x$ and $y$ are related.” According to Bayesian logic, the following holds:

$$\frac{P(H_1 \mid x, y)}{P(H_0 \mid x, y)} = \frac{P(x, y \mid H_1)}{P(x, y \mid H_0)} \cdot \frac{P(H_1)}{P(H_0)}.$$

The prior odds are unknown, so we focus on the first term on the right-hand side, which expresses the strength of the evidence that the particular data point provides for $H_1$.
We start with the (unrealistic) assumption that in case $H_1$ is true, no insertions or deletions of segments have taken place, so $x$ and $y$ have the same length $n$. If $H_1$ is true, the segments of $x$ and $y$ are pairwise historically related. Under the simplifying assumption that point mutations in biological evolution and individual sound shifts in language change are mutually independent, the probability of observing strings $x$ and $y$ given $H_1$ is the product of the individual probabilities that $x_i$ and $y_i$ are historically related:

$$P(x, y \mid H_1) = \prod_{i=1}^{n} P(x_i, y_i \mid H_1).$$

Let $a$ and $b$ be two segments. The quantity $q(a, b)$ is defined as the probability that a specific segment in some sequence (biomolecule/word) developed into an $a$ along one phylogenetic branch and into a $b$ along another branch. Under this interpretation, $q$ is symmetric, i.e. $q(a, b) = q(b, a)$. If the substitution matrix, i.e. the values of $q(a, b)$, is known for all segment pairs, we have

$$P(x, y \mid H_1) = \prod_{i=1}^{n} q(x_i, y_i).$$

If $x$ and $y$ are unrelated, the paired segments of $x$ and $y$ are just randomly picked segments. Let $p(a)$ be the probability of occurrence of $a$ at an arbitrary position within an arbitrary sequence. With the simplifying assumption that the occurrences of segments at different positions within a sequence are independent of each other, i.e. the sequences have no grammar, we have

$$P(x, y \mid H_0) = \prod_{i=1}^{n} p(x_i)\, p(y_i).$$
Of course the assumptions made here—stochastic independence of positions within a sequence and of evolutionary changes at different positions within a sequence—are wildly unrealistic. Nevertheless this null model leads to workable results, as we will see later on.
Putting the pieces together, we have

$$\log \frac{P(x, y \mid H_1)}{P(x, y \mid H_0)} = \sum_{i=1}^{n} \log \frac{q(x_i, y_i)}{p(x_i)\, p(y_i)}.$$

In the bioinformatics tradition, this quantity is called the log odds score. In computational linguistics, it is also known under the name of Pointwise Mutual Information (PMI, see Church and Hanks, 1990).11 I will follow the latter terminology here. Some notation:

$$s(a, b) := \log \frac{q(a, b)}{p(a)\, p(b)}.$$
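In code, the PMI scores are a one-liner over the estimated probabilities; any probability values used below are toy numbers for illustration, since the real values come from the parameter estimation described in Section 5.2.

```python
import math

def pmi_matrix(q, p):
    """PMI (log odds) scores s(a, b) = log(q(a,b) / (p(a) p(b))) for
    all segment pairs in the joint distribution q. q maps segment
    pairs to joint probabilities; p maps segments to marginal
    probabilities."""
    return {(a, b): math.log(q[(a, b)] / (p[a] * p[b])) for (a, b) in q}
```

Pairs aligned more often than chance predicts get a positive score, pairs aligned less often a negative one.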
We now turn to the issue of insertions and deletions. Suppose we evaluate a specific hypothesis about the historical relation between $x$ and $y$, which includes assumptions about segments being inserted or deleted. This leads to an alignment between the sequences including gaps. As an example, consider the German and Swedish words for star, Stern/stjärna, which are StErn/SEnE in the ASJP transcription. The etymologically correct alignment is

S t E r n -
S - E - n E
The gap symbol “-” represents a position where either a segment has been deleted or a segment has been added in the other language. $x_i$ and $y_i$ now refer to the positions in the aligned strings, so an $x_i$ or $y_i$ may be the gap symbol.
Following standard practice in bioinformatics, I assume that there is a uniform PMI score for gaps, regardless of the segment the gap is matched with:

$$s(a, \text{-}) = s(\text{-}, a) = -g.$$

The constant $g$, which is positive, is referred to as the gap penalty.12
However, both in biological evolution and in language change, insertions and deletions frequently operate on contiguous chunks of segments. For instance, in language comparison we frequently find partial cognates, i.e. word pairs in which one item is morphologically complex (or is etymologically derived from a morphologically complex word) and the other word is cognate to just one morpheme of the first word. Consider the Latin and Italian words for mountain, mons/montagna, transcribed as mons/monta5a in ASJP. The Italian word is probably derived from the Latin montaneus ‘mountainous,’ a denominal adjectivization of mons. So the correct alignment is
The three gaps at the end of the upper sequence are the reflex of a single historical process, i.e. suffixation plus semantic change.
Since gaps frequently come in chunks, the penalty for a gap in $y$ at position $i$ should be lower if $y_{i-1}$ is also a gap than if it is a regular segment. This is captured by the notion of affine gap penalties. There are two positive constants $g_o$ (the penalty for opening a gap) and $g_e$ (the penalty for extending a gap) with $g_e < g_o$ such that, if $y_i$ is a gap symbol,

$$s(x_i, y_i) = \begin{cases} -g_o & \text{if } y_{i-1} \text{ is not a gap symbol,}\\ -g_e & \text{if } y_{i-1} \text{ is a gap symbol.}\end{cases}$$

The same applies mutatis mutandis to gaps within $x$.
With these provisos, the PMI score of an alignment of $x$ and $y$ gives an estimate of the strength of evidence that $x$ and $y$ provide for $H_1$ under a specific alignment. The upper bound thereof is the maximal PMI score over all alignments of $x$ with $y$. The Needleman-Wunsch algorithm (Needleman and Wunsch, 1970) is a simple generalization of the Levenshtein alignment algorithm that, for a given substitution matrix and gap penalties $g_o$ and $g_e$, efficiently (i.e. in quadratic time) finds this optimal alignment and its PMI score.
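For illustration, here is the Needleman-Wunsch recursion with a single linear gap penalty. The affine-penalty variant described above maintains three dynamic-programming matrices (Gotoh's algorithm) but follows the same pattern; this simplified sketch uses one.

```python
def needleman_wunsch(x, y, s, gap):
    """Maximal global alignment score for strings x and y, where
    s(a, b) is the segment-pair PMI score and gap (> 0) is a linear
    penalty subtracted per gap symbol."""
    m, n = len(x), len(y)
    S = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = -gap * i
    for j in range(1, n + 1):
        S[0][j] = -gap * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j - 1] + s(x[i - 1], y[j - 1]),  # (mis)match
                          S[i - 1][j] - gap,                         # gap in y
                          S[i][j - 1] - gap)                         # gap in x
    return S[m][n]
```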
5.2. Parameter Estimation
To reliably estimate the PMI scores for all segment pairs, one would ideally need a very large corpus of correctly aligned sequence pairs. In bioinformatics such databases do indeed exist, and several carefully crafted substitution matrices for different domains have been constructed (see Durbin et al., 1989 for details). In dialectometric work (such as Wieling et al., 2012), such data are fairly easy to obtain because dialectometric data are organized in cognate sets, and the linguistically correct alignment between cognate words from different dialects of the same language can be reliably constructed with automatic means.
When dealing with cross-linguistic data from a wide variety of languages such as the ASJP data, the situation is more difficult. Sizeable amounts of expert cognacy judgments only exist for a small number of language families (mainly for Indo-European based on the pioneering work of Dyen et al., 1992, and for Austronesian, see Greenhill et al., 2008). Also, and more importantly, the ultimate goal of this entire enterprise is to do language classification automatically. Therefore, information about language family affiliation should not be utilized for parameter training to avoid circularity.
In the following, a heuristic method is described to extract a large corpus of probable cognate pairs from the ASJP word lists, which can be used for parameter training. The method only relies on the word lists themselves; no additional information about cognacy relations or the genetic affiliation of the languages involved is being used.
To avoid the pitfall of overtraining, I split the ASJP database into two sets of about equal size, the training set and the test set. For training purposes, I use only the former. The resulting model will then be tested against the latter.
To make sure the two sets are really independent—or at least to approximate this ideal as far as possible with cross-linguistic data—the two sets were constructed in such a way that each WALS family either completely belongs to the training set or the test set.13 To be more specific, the set of WALS families was placed in a random order and languages and language families were added to the training set in this order, as long as its size did not exceed half the size of the entire database. The remaining families constitute the test set. The training set contains 2,723 and the test set 2,758 word lists. The lists of language families in the two sets are provided in the Online Supporting Material.
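The family-level split can be sketched as follows. The function name and the dictionary format mapping WALS families to their word lists are illustrative assumptions.

```python
import random

def family_split(families, seed=0):
    """Split a {family: [word lists]} mapping into a training and a test
    set such that no family is divided between the two halves.

    Families are visited in random order and assigned to the training set
    as long as it stays within half the size of the whole database; the
    remaining families constitute the test set."""
    total = sum(len(lists) for lists in families.values())
    order = list(families)
    random.Random(seed).shuffle(order)
    train, test, train_size = [], [], 0
    for fam in order:
        if train_size + len(families[fam]) <= total / 2:
            train.extend(families[fam])
            train_size += len(families[fam])
        else:
            test.extend(families[fam])
    return train, test
```

Splitting at the family level rather than the doculect level is what approximates independence: closely related word lists never end up on opposite sides of the split.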
Relying on the training set only, I used the following procedure for constructing a sufficiently large corpus of probable cognate pairs, which can be used for parameter training:
All language pairs that have a dERC distance below a given threshold θ1 are considered to be probably related.
For a pair of probably related languages L1 and L2 and a concept c, all pairs of entries for c in the ASJP list for L1 and in the ASJP list for L2 are considered. The pair of words with the lowest LDN score is considered a potential cognate pair.
All pairs of probable cognates are then aligned with the Levenshtein algorithm. If there are multiple optimal alignments, only one of them is considered.14
This yields a set of aligned sequence pairs. The quantity p(x, y) is estimated as the relative frequency of alignments of x with y (in either direction) within this corpus. The quantity q(x) is estimated as the relative frequency of occurrence of the segment type x within the entire ASJP database (or the subset thereof that is used for training purposes). This gives an estimate for the PMI score

s(x, y) = log [ p(x, y) / ( q(x) · q(y) ) ]

for each pair of segment types.
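A sketch of this estimation step, assuming the standard pointwise mutual information formula s(x, y) = log(p(x, y) / (q(x) q(y))). The function name and the data format (a list of aligned string pairs plus the concatenated segment inventory) are illustrative assumptions.

```python
from collections import Counter
from math import log

def estimate_pmi(alignments, all_segments):
    """Estimate PMI scores from a corpus of aligned word pairs.

    `alignments` is a list of (x, y) pairs of equally long aligned strings;
    `all_segments` is the segment material of the whole database, used for
    the marginal frequencies q."""
    pair_counts = Counter()
    for x, y in alignments:
        for a, b in zip(x, y):
            if a != "-" and b != "-":   # gaps get a uniform penalty instead
                pair_counts[frozenset((a, b))] += 1   # symmetric: either direction
    n_pairs = sum(pair_counts.values())
    seg_counts = Counter(all_segments)
    n_segs = sum(seg_counts.values())
    q = {s: c / n_segs for s, c in seg_counts.items()}
    return {
        (a, b): log(pair_counts[frozenset((a, b))] / n_pairs / (q[a] * q[b]))
        for a in q for b in q if pair_counts[frozenset((a, b))] > 0
    }
```

Counting unordered segment pairs implements the "in either direction" clause, so the resulting score matrix is symmetric by construction.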
Assuming certain values for the gap penalties g_o and g_e (more on this later), in the next step the set of potential cognate pairs is aligned with the Needleman-Wunsch algorithm, using the estimated parameters.
Additionally, I assume a threshold θ2. All potential cognate pairs with a PMI score above θ2 are treated as probable cognates. The set of aligned probable cognates is then used to re-estimate the PMI scores in the way described above.
The re-estimation of parameters is repeated 10 times. Experience shows that the estimated parameter values no longer change substantially after that.
The appropriate choice for the meta-parameters θ1 and θ2 remains to be determined. As for the former, it is instructive to look at the distribution of dERC distances, which is displayed in Fig. 8 as a histogram.
It can be seen that the distribution is dominated by a bell-shaped curve. As pointed out in connection with the derivation of dERC (Section 3.2), we expect the ERC for unrelated languages to follow approximately a standard normal distribution. The transformation leading from ERC to dERC turns this into a normal distribution with a shifted mean, so the histogram shows that the vast majority of language pairs behave under dERC as if they were unrelated. We also see that the distribution is not symmetric: there are more values below than above the predicted mean value. The threshold θ1 should be chosen in such a way that the probability that an unrelated language pair has a dERC below θ1 is very small, while at the same time ensuring that there are still sufficiently many language pairs with a dERC below θ1 to make parameter training possible. The somewhat arbitrary choice of θ1 adopted here fulfills both requirements: the probability that a standard normally distributed variable exceeds the corresponding ERC value is sufficiently small, while there are 20,505 language pairs with a dERC score below θ1 in the training set, which gives rise to a number of potential cognate pairs large enough for parameter estimation.
As pointed out above, there is no straightforward way to estimate gap penalties from a training corpus. Appropriate values for g_o and g_e have to be found via optimization. The same holds for θ2.
The training procedure supplies a PMI matrix for a given parameter vector (θ2, g_o, g_e), which in turn, together with g_o and g_e, defines a PMI score for pairs of strings, i.e. a similarity measure for word pairs. Using the dERC aggregation procedure with LDN scores replaced by negative PMI scores, this gives us a distance measure between languages. Let us call it dERC/PMI.
As a heuristic to assess the quality of a parameter configuration, I sampled 1,000 pairs of probably related languages, i.e. languages with a dERC below θ1. The mean dERC/PMI over these 1,000 language pairs is treated as the target function to be minimized. According to the way dERC/PMI is computed, this amounts to maximizing the ranks of translation pairs, i.e. maximizing the similarity between synonymous word pairs while at the same time minimizing the similarity between non-synonymous word pairs. As we can assume that there are many cognates among translation pairs from probably related languages, minimizing the mean dERC/PMI is tantamount to maximizing similarity between cognates and minimizing similarity between non-cognates.
As even a single evaluation step is computationally quite expensive, advanced methods of optimization such as simulated annealing proved to be impractical. Therefore I performed a simple downhill Nelder-Mead style optimization (cf. Nelder and Mead, 1965), starting from several manually chosen initial positions. The parameter vector achieving the lowest value of the target function supplies the values of θ2, g_o and g_e used in the sequel.15
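The multi-start optimization can be sketched with SciPy's Nelder-Mead implementation. The wrapper below and the toy objective, which stands in for the expensive mean-dERC/PMI target and whose minimum location is arbitrary, are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def best_of_starts(objective, starts):
    """Run Nelder-Mead downhill optimization from several manually chosen
    initial positions and keep the best result.

    `objective` stands in for the (expensive) mean dERC/PMI of a sample of
    probably related language pairs, viewed as a function of the parameter
    vector (threshold, gap opening penalty, gap extension penalty)."""
    results = [minimize(objective, np.asarray(x0, dtype=float),
                        method="Nelder-Mead")
               for x0 in starts]
    return min(results, key=lambda r: r.fun)

# Toy stand-in objective with an arbitrarily chosen minimum at
# (1.0, 2.5, 0.75); the real target would re-train PMI scores and
# aggregate word distances internally.
def toy_objective(p):
    return (p[0] - 1.0) ** 2 + (p[1] - 2.5) ** 2 + (p[2] - 0.75) ** 2

best = best_of_starts(toy_objective, [(0, 0, 0), (5, 5, 5)])
```

Restarting from several initial simplexes is a cheap guard against the local minima that a single Nelder-Mead run can get stuck in.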
For a selection of sounds, the optimal PMI scores thus derived are shown in Table 7. (The full matrix is provided in the Online Supporting Material.) Not surprisingly, the entries along the diagonal are all positive, i.e. alignment of two identical elements provides the strongest evidence for relatedness. Additionally, we find positive PMI scores for several sound pairs that are known to be frequently historically related via sound shifts, such as p/b, d/t, d/8 (where the ASJP symbol 8 represents voiceless and voiced dental fricatives; cf. Table 15) and s/h. The latter case is especially interesting because the two sounds are articulatorily dissimilar, but the sound shift from s to h is known to be quite common (see, for instance, Ferguson, 1990).
Figure 9 displays a hierarchical clustering of the ASJP sound symbols according to their PMI scores.16 We find a primary split between vowels and consonants. The consonants are further divided into three large groups, which largely correspond to the dental, labial, and velar/uvular sounds. The only exception to this pattern according to place of articulation is the position of h and x (the voiceless and voiced velar fricatives), which are clustered together with the s-sounds within the larger cluster of dental sounds. This is probably a reflex of the already mentioned diachronic cline from s to h.
Following the example of Wieling et al. (2012)—who obtained PMI scores essentially in the same way but using data from different dialects of the same language—I performed non-metric multidimensional scaling with the PMI scores among the vowels. The result is displayed in Fig. 10. We find that the articulatory vowel triangle is reproduced to a good approximation, with the schwa (ASJP symbol 3) in the center.
Brown et al. (2013) also use the ASJP data to estimate the probability of different sound correspondences across the languages of the world. Their method is quite different from the one developed here, so a comparison of the results provides a certain validity check.
The authors use a highly conservative heuristic to identify regular sound correspondences. According to this method, a pair of languages L1 and L2 exhibits a regular correspondence between the segments x and y if and only if:
L1 and L2 belong to the same genus, and
there are at least two concepts such that the ASJP entry for L1 can be transformed into its translation in L2 by replacing all occurrences of x by y (and vice versa).
To use the running example of the English/Swedish comparison again, there are only two regular correspondences that can be detected from the 40-item word lists: o-e (bon/ben ‘bone’ and ston/sten ‘stone’); and i-e (liv3r/lev3r ‘liver’ and si/se ‘see’).
A genus is available for a correspondence x-y if both segments, x and y, occur in at least one language within this genus. The PG score (“percentage of available genera”) of a correspondence is the relative frequency (expressed in percent) of genera exhibiting the correspondence at least once among all genera that are available for that correspondence.
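A sketch of the PG score computation. The `genus_data` format, which records for each genus its segment inventory and the set of regular correspondences found there, is an illustrative assumption.

```python
def pg_score(correspondence, genus_data):
    """Percentage of available genera exhibiting a regular correspondence.

    `genus_data` maps each genus to a pair (segments, regular_pairs): the
    set of segments occurring in languages of the genus and the set of
    segment pairs found there as regular correspondences."""
    x, y = correspondence
    # A genus is available iff both segments occur in it at all.
    available = [g for g, (segs, _) in genus_data.items()
                 if x in segs and y in segs]
    if not available:
        return 0.0
    # Correspondences are unordered, so check both orientations.
    exhibiting = [g for g in available
                  if (x, y) in genus_data[g][1] or (y, x) in genus_data[g][1]]
    return 100.0 * len(exhibiting) / len(available)
```

Restricting the denominator to available genera prevents rare segments from being penalized merely for being absent from most inventories.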
Figure 11 plots the PG scores of all consonant pairs that have a positive PG score in the supporting online material from Brown et al. (2013) against their corresponding PMI score.
The PG axis is logarithmically transformed while the PMI axis is linear. We see a strong positive relation (indicated by the regression line) between the two measures. In fact, the logarithms of the positive PG scores are strongly correlated with the corresponding PMI scores. This underscores that the empirically induced PMI scores do in fact reflect genuine patterns of regular sound correspondences.
In Table 1 the LDN scores for several English/Swedish word pairs were given. Table 8 gives the corresponding PMI scores.
It can be seen that the PMI notion of string similarity is more fine-grained than LDN. For instance, while both du/yu and vi/wi receive a positive score (i.e. they are more likely to be related than not), the absolute value for the latter is much higher. This reflects the fact that a correspondence between v and w is more likely than one between d and y. The pair fisk/fiS has an even higher PMI score because (a) the words are longer than vi/wi, i.e. the evidence they provide is stronger, and (b) the correspondence s/S is very likely. This is counterbalanced only by a single gap penalty.
The distribution of PMI scores for the English/Swedish comparison on the diagonal and off the diagonal is shown in the left panel of Fig. 12. The right panel shows the same data for the comparison English/Swahili.
In comparison to the corresponding plots for LDN, the PMI values are much more spread out. Beyond that, we find a similar qualitative pattern (apart from the inessential difference that LDN is a distance and PMI a similarity measure). For a pair of related languages, the diagonal entries are mostly much higher than the off-diagonal entries, while both collections appear to be drawn from the same distribution for a pair of unrelated languages.
The normalized ranks of PMI scores are now computed according to the definition given in Section 3, with LDN scores replaced by PMI scores and minimization replaced by maximization (since PMI is a similarity rather than a distance measure). As shown in Fig. 13, the PMI-based normalized ranks from related languages follow approximately a power law distribution and those from unrelated languages a uniform distribution, just like the LDN-based normalized ranks.
Therefore the theoretical justification for the dERC-style aggregation of normalized ranks to a distance measure between languages also applies to PMI scores.
Table 9 compares the dERC/LDN scores and dERC/PMI scores for the language pairs from Table 2.
These numbers convey the impression that the dERC/PMI scores for related languages are generally lower than the corresponding dERC/LDN scores, while for unrelated languages the scores are randomly distributed around the mean predicted for unrelated pairs under both measures.
5.4. Empirical Evaluation
The methods described in Section 4 for comparing different distance measures will now be used to evaluate the quality of dERC/PMI against LDND and the LDN-based version of dERC (dERC/LDN). Only the word lists from the test set will be used for this comparison.
The triplet distances to the three expert classifications are given in Table 10 and visualized in Fig. 14.
We find a slight improvement from LDND to dERC/LDN and a more substantial improvement to dERC/PMI.17
From the distance matrices for the test set for LDND, dERC/LDN and dERC/PMI, the corresponding phylogenetic trees were computed with Neighbor Joining. The generalized Robinson-Foulds distances and quartet distances are given in Table 11.
The results are not decisive, with dERC/PMI giving the lowest GRF scores and dERC/LDN the lowest GQD scores. However, as discussed in Section 4, evaluating different Neighbor Joining trees for a single data set can be highly misleading. Therefore the same procedure as above is applied here: 1,000 random samples of word lists from the test set, each comprising 500 doculects, are generated, Neighbor Joining trees for LDND, dERC/LDN and dERC/PMI are computed, and all three trees are compared to the three expert trees regarding both GRF and GQD. The results are depicted in Fig. 15 and the mean values are given in Table 12.
The mean values for the 1,000 samples display a similar pattern to the triplet distances: LDND and dERC/LDN perform about equally well (with a slight advantage for the former), while dERC/PMI leads to lower distance scores. As the aggregation method for dERC/LDN and dERC/PMI is identical, we can conclude that the PMI-based method of measuring string similarities leads to better phylogenetic inference than (normalized) Levenshtein distance.
As a further test, I performed a version of cross-validation.18 In general, n-fold cross-validation means that a data set is split into n subsets of equal size. One subset is then singled out as the test set. The remaining data are used for training, and the model thus obtained is tested against the test set. This is repeated for each subset.
Cross-validation requires the individual subsets to be independent of each other. As discussed above, obtaining mutually independent subsamples of a cross-linguistic database such as ASJP that are representative of the data set as a whole is a non-trivial issue. As an approximation, I performed 4-fold cross-validation, where the subsets correspond to the four continental areas Africa (including all Afro-Asiatic languages), Eurasia, the Indo-Pacific region (including Australia), and America.
For each continental area C, PMI scores were trained with the languages outside C, following the method described in the previous section. For the threshold θ2 and the gap penalties g_o and g_e, the values obtained from the original training set were used. Using the PMI scores thus induced, the pairwise dERC/PMI values for the languages in C were computed, and the triplet distances to WALS, Ethn and Hstr were determined and compared to the corresponding values for LDND and dERC/LDN. The results are given in Table 13 and displayed in Fig. 16.
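The leave-one-area-out scheme can be sketched generically as follows; `train_fn` and `eval_fn` are hypothetical stand-ins for the PMI training and triplet-distance evaluation procedures described above.

```python
def leave_one_area_out(doculects, areas, train_fn, eval_fn):
    """4-fold cross-validation over continental areas.

    For each area, parameters are trained on all doculects outside it and
    evaluation is carried out on the area itself. `doculects` maps each
    doculect to its area; `train_fn` takes the training doculects and
    returns a model; `eval_fn` scores the held-out doculects under it."""
    results = {}
    for area in areas:
        held_out = [d for d, a in doculects.items() if a == area]
        training = [d for d, a in doculects.items() if a != area]
        model = train_fn(training)
        results[area] = eval_fn(held_out, model)
    return results
```

Because whole areas (rather than random doculects) are held out, no language ever contributes to the parameters it is later evaluated with.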
In 11 out of 12 cases, dERC/PMI provides the best results (the exception being the Hstr classification for America, where LDND is slightly better). The general pattern for Africa, Eurasia and the Indo-Pacific is similar to that for the test set above: LDND and dERC/LDN are about equally good, while dERC/PMI is noticeably better. For America, all three distance measures perform roughly equally well.
The PMI matrices obtained during cross-validation correlate strongly with the PMI matrix obtained from the training set, and the four cross-validation matrices likewise correlate strongly with each other. This indicates that the patterns of regular sound correspondences across different samples of language families are highly similar.
A possible objection against the general approach developed here concerns the risk of circularity. As an anonymous reviewer points out, it might be problematic to perform automatic language classification on the basis of parameters that are trained with data from a database “which was […] obtained through some other type of (manual) analysis.” Let us therefore carefully review what kind of information goes into the training procedure and what kind of information we get out of it.
The construction of the training corpus of word pairs relied on guessing a value for the threshold θ1. This guess is to some degree arbitrary, but it was motivated by a visual inspection of the distribution of dERC/LDN scores.
dERC/LDN scores are determined on the basis of pairwise LDN scores for words from the word lists to be compared. No further information about the genetic affiliation of the languages involved is being used here, and LDN scores are obtained from Levenshtein distances, a general-purpose string comparison method that does not rely on any specifically linguistic information.
Once the training corpus is constructed, initial PMI scores are estimated using Levenshtein alignment. In subsequent steps, Needleman-Wunsch alignment is performed ten times, each time using the PMI score estimates from the previous step. For given values of θ2 and the gap penalties g_o and g_e, the PMI scores thus obtained define a string similarity score, which is in turn fed into the dERC aggregation scheme to yield a distance measure between word lists. The parameters θ2, g_o and g_e are estimated via optimization in a way that minimizes the average distance within a sample of language pairs. This sample was collected using only dERC/LDN scores and θ1. So the only information that enters the entire procedure consists of the plain word lists. No knowledge about the languages involved is used anywhere in parameter training. In the terminology of machine learning, PMI scores are obtained via unsupervised learning.
The test procedures in turn use the aggregate distance between word lists thus obtained to do phylogenetic inference and to compare the results to expert classifications. (Triplet distance relies on classifying triplets of languages, so this also involves a kind of phylogenetic inference). So the information that is obtained from the parametrized model—language classification—is of an entirely different nature than the information that went into it, namely word lists.
Another potential objection concerns the fact that the overall gain in accuracy, both for triplet distances and for GRF and GQD, may still appear small. However, three considerations should be kept in mind here:
Both for the triplet distance and for GQD, the baseline of completely randomly distributed distances is not 1 but 2/3, because there are only three rooted binary trees for a triplet and three butterflies for a quartet of languages, so a random classification is correct in one third of the cases.
Even very crude distance measures achieve a much higher accuracy than suggested by these baselines. To illustrate this point, I defined such a crude measure: for each word list, the vector of relative frequencies of occurrence of sounds is computed. The cosine similarity between two languages is then defined as the cosine of the angle between these vectors, and the cosine distance as 1 minus the cosine similarity. So this distance measure only quantifies how much the frequency patterns of unigrams differ between word lists, without any reference to the meaning of the words. The Neighbor Joining tree derived from these distances for the entire ASJP database already achieves GQD values to WALS, Ethn and Hstr well below the random baseline.
The practically achievable minimum GQD (and likewise for triplet distance and GRF) is arguably somewhat above 0. First, the expert classifications contain controversial units (such as Altaic, Australian, Niger-Congo and Trans-New Guinea in WALS), which may partially be wrong. In this case it would not be a defect of an automatic classification if those units are not detected. Second, the 40-item Swadesh lists arguably do not always contain the information that human experts would need to establish a genetic relationship between a group of languages.
To make a rough guess, the maximum GQD (achievable by a simple-minded distance measure such as the cosine distance) for a given data set may be around 35%–40%, and the minimum GQD that can possibly be attained by automatic methods from 40-item Swadesh lists may be around 3%. Each gain in accuracy of a certain percentage thus actually amounts to a much higher proportion (by a factor of about 3) of this range.
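The crude unigram-frequency baseline described above can be implemented in a few lines. The function name is an illustrative assumption; note that working with raw counts yields the same cosine as relative frequencies, since the cosine is scale-invariant.

```python
from collections import Counter
from math import sqrt

def cosine_distance(wordlist1, wordlist2):
    """Distance based only on unigram (sound) frequencies: 1 minus the
    cosine of the angle between the two frequency vectors, with no
    reference to the meaning of the words."""
    c1 = Counter("".join(wordlist1))
    c2 = Counter("".join(wordlist2))
    dot = sum(c1[s] * c2[s] for s in set(c1) & set(c2))
    norm = (sqrt(sum(v * v for v in c1.values()))
            * sqrt(sum(v * v for v in c2.values())))
    return 1.0 - dot / norm
```

Identical sound-frequency profiles yield distance 0, and word lists with disjoint sound inventories yield distance 1, which makes the measure a natural "worst reasonable" baseline.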
6. Comparison to ALINE
The PMI scores for word similarities used here are obtained via weighted string alignment. There have been several proposals in the literature on computational historical linguistics and computational dialectometry to employ weighted alignment for this purpose. Some of them use empirically determined log-odds scores as weights like the present proposal (cf. Wieling et al., 2012), while others (see, for instance, Covington, 1996; Somers, 1998; Heeringa, 2004, among others) assume linguistically motivated hand-crafted substitution weights for segment pairs. The most sophisticated approach along the latter lines is perhaps the ALINE system by Kondrak (2002). A detailed discussion of ALINE would go beyond the scope of this article, so I will just mention the essential features.
In ALINE, each sound is represented by a vector of phonetic features, such as syllabic, back, place etc. These features have real numbers as values. The similarity between two segments is computed from their differences in feature values, weighted by the salience of these features.
Additionally, ALINE captures compressions and expansions, i.e. alignments of a single segment in one word with two adjacent segments in the other word. Kondrak uses the cognate pair Latin factum/Spanish hecho ‘fact’ to illustrate this point. In the etymologically correct alignment, the Spanish affricate [ʧ] should be matched with the [t] and the [k] in the Latin word simultaneously. ALINE defines weights for aligning a single sound with a consecutive sequence of two sounds as well.
The present proposal uses the Needleman-Wunsch algorithm for string alignment. This algorithm finds the optimal global alignment, i.e. an alignment of the full sequences. ALINE uses half-global alignment instead. This means that in both strings to be compared, final subsequences can be ignored if this leads to a better alignment score. Half-global alignment is motivated by the observation that the right periphery of words is especially unstable in language change.
In Huff (2010) and Huff and Lonsdale (2011), the system PyAline is described, a freely available Python implementation of ALINE that includes substitution scores of ASJP sound classes. PyAline also contains an implementation of Downey et al.’s (2008) method to aggregate ALINE alignment scores to distances between languages. This facilitates a comparison with the distance measures defined here. In Huff and Lonsdale (2011) such a comparison with LDND is discussed. The authors conclude that both measures perform about equally well in phylogenetic inference.
Downey et al.’s aggregation method differs in two essential ways from ERC. First, word similarities are normalized: given alignment scores s (which are similarity scores), the normalized ALINE distance between two words w1 and w2 is defined as

d(w1, w2) = 1 − 2 · s(w1, w2) / ( s(w1, w1) + s(w2, w2) ).
Second, Downey et al. (2008) define the distance between two languages as the average normalized ALINE distance between translation pairs. This amounts to taking the average of the diagonal in the matrix of individual word distances, while the off-diagonal entries are not taken into account. Let us call this distance measure the average ALINE distance.
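A sketch of this aggregation, assuming the normalization 1 − 2 s(w1, w2) / (s(w1, w1) + s(w2, w2)). The toy position-wise similarity score used in the test is an illustrative assumption, not the ALINE scoring function.

```python
def normalized_distance(score, w1, w2):
    """Map a similarity score into a [0, 1] distance by comparing
    s(w1, w2) against the two self-similarities (assuming the
    normalization 1 - 2 s(w1,w2) / (s(w1,w1) + s(w2,w2)))."""
    return 1.0 - 2.0 * score(w1, w2) / (score(w1, w1) + score(w2, w2))

def average_aline_distance(score, list1, list2):
    """Language distance as the average normalized word distance over
    translation pairs, i.e. the diagonal of the word-distance matrix
    only; off-diagonal entries are ignored."""
    pairs = list(zip(list1, list2))
    return sum(normalized_distance(score, a, b) for a, b in pairs) / len(pairs)
```

Ignoring the off-diagonal entries is exactly what distinguishes this scheme from dERC, which calibrates each diagonal score against the distribution of non-synonymous comparisons.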
These differences in detail make a comparison to dERC difficult, because it has to be factored out whether possible differences in performance are due to the different alignment weights, the different alignment algorithm, the normalization step or the difference in the aggregation scheme. As an additional complication, PyAline’s alignment algorithm is implemented in plain Python, which makes it comparatively slow. There are highly efficient Python libraries for the Needleman-Wunsch algorithm used for the computation of PMI scores, which makes the computation of a pairwise dERC/PMI distance matrix for several thousands of word lists feasible. For PyAline this is not realistic.19
For these reasons, I will defer a detailed comparison of the present proposal with ALINE to another occasion and only report the results of a small pilot study here that could be carried out with moderate computational effort.
From the test set, 10,000 triplets were sampled that are resolved according to WALS. They were used to estimate the triplet distance to WALS for (a) LDND, (b) dERC/PMI, (c) the average ALINE distance, and (d) dERC/ALINE. The latter measure uses the normalized ALINE distances between words and aggregates them according to the dERC scheme.
The results are given in Table 14 and displayed in Fig. 17.
The estimates for LDND and dERC/PMI are close to the exact values computed from the full test set (see Table 10). This suggests that the estimates are actually quite accurate.
The results indicate that the average ALINE distance is substantially less accurate than LDND. However, this is arguably due to the aggregation method rather than to the ALINE method of computing word distances. Combining ALINE word distances with dERC-style aggregation gives results that are better than LDND but still not as good as dERC/PMI.
With the proviso that these results are still preliminary, they seem to suggest (a) that weighted alignment improves the accuracy of phylogenetic inference in comparison to plain Levenshtein-style alignment, and (b) that empirically determined PMI scores are superior to hand-crafted weighting schemes.
This paper aims at making three contributions to the current discussion in the field of computational historical linguistics: (1) it argues for the usage of weighted alignment using empirically obtained weights for determining word distances, (2) it proposes a novel method to aggregate word similarities/distances to distances between languages, and (3) it presents several protocols for evaluating automatically generated phylogenies that extend existing proposals.
The results from the previous sections show that weighted alignment improves the accuracy of language distance measures when compared to Levenshtein distance methods. The method used here—the Needleman-Wunsch algorithm using log-odds scores and affine gap penalties—was developed in the context of bioinformatics and is justified by the properties of biomolecular evolution. The model assumptions that underlie its mathematical foundations are actually not met in the case of sound change. It rests on the simplifying assumptions that mutations at different positions are stochastically independent and that mutation probabilities are constant across lineages. The latter assumption, especially, is highly problematic when applied to sound change since specific sound changes are known to be historically contingent events that apply to the entire lexicon of a language. Therefore a more adequate model would have to use a different substitution matrix for each pair of related languages, which captures the history of sound changes along the two lineages from the latest common ancestor. It is in principle possible to obtain these substitution matrices empirically, but this would arguably require much larger word lists than the commonly used Swadesh lists.
Also, work on automatic cognate recognition (see, for instance, List, 2012) has shown that the quality of word alignments improves considerably if multiple sequence alignment is used. It is to be expected that language distance measures using multiple alignments will also lead to more accurate phylogenetic inference. An additional advantage of using multiple sequence alignments is that they can be used for character-based methods, which are known to be more accurate than distance-based methods.
A further direction that may lead to higher accuracy is the usage of resampling methods such as bootstrapping and jackknifing, which can be used at various points in the inference process. In this paper, individual word alignment scores were calibrated by comparing them to the distribution of alignment scores across all pairs of non-synonymous word pairs from the two languages to be compared. Sampling a large number of these scores with replacement will arguably lead to a more accurate estimate of this distribution. Furthermore, sampling the 40 Swadesh concepts with replacement a large number of times (say, 1,000) and performing phylogenetic inference with each sample individually will result in a large number of slightly different inferred trees. These can be used to generate a consensus tree and to quantify the confidence in the language grouping thus obtained.20
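The concept-level bootstrap can be sketched as follows; the function name is an illustrative assumption, and the downstream tree inference and consensus steps are omitted.

```python
import random

def bootstrap_samples(concepts, n_samples=1000, seed=0):
    """Resample the concept list with replacement `n_samples` times.

    Each sample has the same length as the original list (e.g. the 40
    Swadesh concepts); each would feed one run of phylogenetic inference,
    and the resulting trees can then be summarized as a consensus tree
    with support values."""
    rng = random.Random(seed)
    return [[rng.choice(concepts) for _ in concepts]
            for _ in range(n_samples)]
```

Because sampling is with replacement, individual samples typically repeat some concepts and omit others, which is what spreads the inferred trees around the point estimate.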
Regarding the evaluation described in the previous section, the main innovation presented here is the use of a large collection of random samples of languages to assess the quality of a distance measure. According to my own experience, results obtained in this way are much more robust and informative than evaluation results for a single collection of languages.
This research was supported by the ERC Advanced Grant 324246 Language Evolution: The Empirical Turn (EVOLAEMP).
The work being described in this article benefited considerably from discussions with Johann-Mattis List, Taraka Rama, Søren Wichmann and Martijn Wieling, which is gratefully acknowledged. Kate Bellamy, Michael Dunn, Eric Holman, Søren Wichmann and three anonymous reviewers from LDC pointed out various mistakes in a previous version of this article. Thanks also to Thomas Zastrow for setting up the hardware which made this work possible.
All word alignments and distance measure computations were performed using (Numeric) Python. Levenshtein alignment and Needleman-Wunsch alignment were done using the Levenshtein package and the pairwise2 module of the Biopython package (Cock et al., 2009; http://biopython.org) respectively.
For the Neighbor Joining algorithm, Joseph Felsenstein’s Phylip package (Felsenstein, 1989; http://evolution.genetics.washington.edu/phylip/) was used. Quartet fits were computed with Christian Pedersen’s qdist package (http://birc.au.dk/software/qdist/). Thanks to its author and to Thomas Mailand for their help in finding and installing this software.
For manipulating and visualizing phylogenetic trees as well as for computing Robinson-Foulds distances, the Python toolkit ETE (http://ete.cgenomics.org/) and Daniel Huson’s Dendroscope software (http://ab.inf.uni-tuebingen.de/software/dendroscope/) proved highly useful.
Online Supporting Material
Descriptions of the training set and the test set, as well as the PMI scores obtained in the way described in Subsection 5.2, are contained in an online document that can be downloaded from http://www.sfs.uni-tuebingen.de/~gjaeger/publications/ldcBenchmarkingSI.pdf, and from http://dx.doi.org./10.1163/22105832-13030204; booksandjournals.brillonline.com/content/22105832/3/2 (click on tab Supplements).
Bouckaert, Remco, Philippe Lemey, Michael Dunn, Simon J. Greenhill, Alexander V. Alekseyenko, Alexei J. Drummond, Russell D. Gray, Marc A. Suchard, and Quentin D. Atkinson. 2012. Mapping the origins and expansion of the Indo-European language family. Science 337: 957–960.
Brown, Cecil H., Eric Holman, and Søren Wichmann. 2013. Sound correspondences in the world’s languages. Language 89: 4–29.
Church, Kenneth Ward and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics 16: 22–29.
Clauset, Aaron, Cosma Rohilla Shalizi, and Mark E.J. Newman. 2009. Power-law distributions in empirical data. SIAM Review 51: 661–703.
Cock, Peter J.A., Tiago Antao, Jeffrey T. Chang, Brad A. Chapman, Cymon J. Cox, Andrew Dalke, Iddo Friedberg, Thomas Hamelryck, Frank Kauff, Bartek Wilczynski, and Michiel J.L. de Hoon. 2009. Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25: 1422–1423 (doi: 10.1093/bioinformatics/btp163).
Covington, Michael A. 1996. An algorithm to align words for historical comparison. Computational Linguistics 22: 481–496.
Downey, Sean S., Brian Hallmar, Murray P. Cox, Peter Norquest, and J. Stephen Lansing. 2008. Computational feature-sensitive reconstruction of language relationships: Developing the ALINE distance for comparative historical linguistic reconstruction. Journal of Quantitative Linguistics 15: 340–369.
Dunn, Michael, Angela Terrill, Ger Reesink, Robert A. Foley, and Stephen C. Levinson. 2005. Structural phylogenetics and the reconstruction of ancient language history. Science 309: 2072–2075.
Durbin, Richard, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. 1989. Biological Sequence Analysis. Cambridge, UK: Cambridge University Press.
Dyen, Isidore, Joseph B. Kruskal, and Paul Black. 1992. An Indoeuropean classification: A lexicostatistical experiment. Transactions of the American Philosophical Society 82: 1–132.
Estabrook, George F., F.R. McMorris, and Christopher A. Meacham. 1985. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Biology 34: 193–200.
Felsenstein, Joseph. 1989. Phylip-Phylogeny Inference Package (Version 3.2). Cladistics 5: 164–166.
Felsenstein, Joseph. 2004. Inferring Phylogenies. Sunderland: Sinauer Inc. Publishers.
Ferguson, Charles A. 1990. From esses to aitches: Identifying pathways of diachronic change. In William A. Croft, Suzanne Kemmer, and Keith Denning (eds.), Studies in Typology and Diachrony: Papers Presented to Joseph H. Greenberg on His 75th Birthday, 59–78. Philadelphia: John Benjamins.
Gray, Russell D. and Quentin D. Atkinson. 2003. Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426: 435–439.
Greenhill, Simon J. 2011. Levenshtein distances fail to identify language relationships accurately. Computational Linguistics 37: 689–698.
Greenhill, Simon J., Robert Blust, and Russell D. Gray. 2008. The Austronesian Basic Vocabulary Database: From bioinformatics to lexomics. Evolutionary Bioinformatics 4: 271–283.
Hammarström, Harald. 2010. A full-scale test of the language farming dispersal hypothesis. Diachronica 27: 197–213.
Haspelmath, Martin, Matthew S. Dryer, David Gil, and Bernard Comrie. 2008. The World Atlas of Language Structures online. Munich: Max Planck Digital Library. http://wals.info/.
Heeringa, Wilbert Jan. 2004. Measuring Dialect Pronunciation Difference Using Levenshtein Distance. PhD dissertation, University of Groningen.
Holman, Eric W., Søren Wichmann, Cecil H. Brown, Viveka Velupillai, André Müller, and Dik Bakker. 2008. Advances in automated language classification. In Antti Arppe, Kaius Sinnemäki, and Urpu Nikanne (eds.), Quantitative Investigations in Theoretical Linguistics, 40–43. Helsinki: University of Helsinki.
Huff, Paul. 2010. PyAline: Automatically Growing Language Family Trees Using the ALINE Distance. PhD dissertation, Brigham Young University.
Huff, Paul and Deryle Lonsdale. 2011. Positing language relationships using ALINE. Language Dynamics and Change 1: 128–162.
Kondrak, Grzegorz. 2002. Algorithms for Language Reconstruction. PhD dissertation, University of Toronto.
Lewis, M. Paul (ed.). 2009. Ethnologue: Languages of the World. 16th ed. Dallas, TX: SIL International. http://www.ethnologue.com.
List, Johann-Mattis. 2012. Sequence Comparison in Historical Linguistics. PhD dissertation, University of Düsseldorf.
Needleman, Saul B. and Christian D. Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48: 443–453.
Nelder, John A. and Roger Mead. 1965. A simplex method for function minimization. The Computer Journal 7: 308–313.
Pompei, Simone, Vittorio Loreto, and Francesca Tria. 2011. On the accuracy of language trees. PLoS ONE 6: e20109.
Robinson, David F. and Leslie R. Foulds. 1981. Comparison of phylogenetic trees. Mathematical Biosciences 53: 131–147.
Saitou, Naruya and Masatoshi Nei. 1987. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution 4: 406–425.
Somers, Harold L. 1998. Similarity metrics for aligning children’s articulation data. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, vol. 2, 1227–1232. Montreal: Association for Computational Linguistics.
Ward, Joe H., Jr. 1963. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 58: 236–244.
Wichmann, Søren, Eric W. Holman, Dik Bakker, and Cecil H. Brown. 2010. Evaluating linguistic distance measures. Physica A: Statistical Mechanics and Its Applications 389: 3632–3639.
Wichmann, Søren, André Müller, Viveka Velupillai, Annkathrin Wett, Cecil H. Brown, Zarina Molochieva, Julia Bishoffberger, Eric W. Holman, Sebastian Sauppe, Pamela Brown, Dik Bakker, Johann-Mattis List, Dmitry Egorov, Oleg Belyaev, Matthias Urban, Harald Hammarström, Agustina Carrizo, Robert Mailhammer, Helen Geyer, David Beck, Evgenia Korovina, Pattie Epps, Pilar Valenzuela, and Anthony Grant. 2012. The ASJP Database (version 15). Downloadable at http://email.eva.mpg.de/~wichmann/listss15.zip (accessed November 6, 2013).
Wieling, Martijn, Eliza Margaretha, and John Nerbonne. 2012. Inducing a measure of phonetic similarity from pronunciation variation. Journal of Phonetics 40: 307–314.
Wieling, Martijn, Jelena Prokić, and John Nerbonne. 2009. Evaluating the pairwise string alignment of pronunciations. In Proceedings of the EACL 2009 Workshop on Language Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and Education, 26–34. Athens: Association for Computational Linguistics.
Appendix: ASJP Transcription Code
Tables 15 and 16 contain the description of the ASJP code (quoted verbatim from Brown et al., 2013).
As Søren Wichmann (p.c.) points out, doculects would be a more precise term, as the database comprises languages, dialects, and reconstructed word lists of protolanguages. For simplicity’s sake, however, I will use the terms language and doculect synonymously throughout this paper.
The empirical distribution is not entirely uniform; it has a slight bias towards smaller values. This may be due to long-distance relationships between languages from different families, to similarities resulting from language contact, or to universal biases in the sound-meaning relationship such as onomatopoeia. These effects are small in comparison to the bias for related languages, however, so the uniform distribution is still a good approximation.
This is an instance of a more general rule. If $X$ is a uniformly distributed random variable over the interval $[0,1]$ and $f$ is a strictly monotonically decreasing function over $[0,1]$, then $f(X)$ is distributed according to the density function $-(f^{-1})'$. Here is the derivation: Let $p$ be the probability density function of $f(X)$.
If $y = f(x)$ for some $x \in [0,1]$, then $P(f(X) \leq y) = P(X \geq f^{-1}(y)) = 1 - f^{-1}(y)$ and $p(y) = \frac{d}{dy}\bigl(1 - f^{-1}(y)\bigr) = -(f^{-1})'(y)$.
The general form of the exponential distribution is $p(x) = \lambda e^{-\lambda x}$ (for $x \geq 0$), where both the mean and the standard deviation equal $1/\lambda$. For our special case, $\lambda = 1$.
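The connection between the two preceding notes can be checked numerically. The following sketch assumes, purely for illustration, the strictly monotonically decreasing transformation f(x) = -ln(x); applied to uniform samples it yields an exponential distribution with lambda = 1, so the sample mean and standard deviation should both be close to 1:

```python
import math
import random

random.seed(0)

# Draw uniform samples from (0, 1] and apply f(x) = -ln(x), an
# illustrative strictly monotonically decreasing transformation
# (an assumption made here for the sake of the example).
n = 100_000
ys = [-math.log(1.0 - random.random()) for _ in range(n)]

# By the rule in the preceding note, f(X) has density
# -(f^{-1})'(y) = e^{-y}: an exponential distribution with
# lambda = 1, hence mean = standard deviation = 1.
mean = sum(ys) / n
sd = math.sqrt(sum((y - mean) ** 2 for y in ys) / n)
print(round(mean, 3), round(sd, 3))
```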
This follows from elementary laws of probability theory, i.e. the facts that $E(X+Y) = E(X) + E(Y)$, that $\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$ if $X$ and $Y$ are independent, and that $\mathrm{Var}(cX) = c^2\,\mathrm{Var}(X)$.
The reader may excuse my eclectic usage of both Bayesian and frequentist arguments in this section.
All three classifications are provided as metadata in the ASJP database.
The triplet distance is only informative if the languages to be compared exist at the same point in time, i.e. if any two related languages have the same time depth from their common ancestor. If this condition is not met, the triplet distance might be misleading. For instance, it might very well be that Old English and Gothic are closer to each other than Old English is to modern Dutch. Nevertheless, the correct classification places Old English and modern Dutch in one group—the West Germanic languages—and Gothic in another one, namely East Germanic. This problem could be avoided by evaluating quartets instead of triplets and inducing an unrooted tree. I refrain from doing so here because the number of quartets over a set of languages exceeds the number of triplets by a factor on the order of the number of languages. For large data sets, the triplet distance, but not the quartet distance, can thus still be computed with realistic computational effort. To avoid the problem of differences in time depth, in this article I only use data from languages that are either currently alive or recently extinct.
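To make the size comparison concrete: a sample of n languages contains n-choose-3 triplets but n-choose-4 quartets, so their ratio is (n - 3)/4, which grows linearly with n. A quick check, using the sample size of 500 employed in the evaluation:

```python
from math import comb

n = 500  # sample size used in the evaluation below
triplets = comb(n, 3)  # number of language triplets
quartets = comb(n, 4)  # number of language quartets

# The ratio simplifies to (n - 3) / 4, i.e. it grows linearly
# with n, which is why quartet distances become computationally
# infeasible for large samples.
print(triplets, quartets, quartets / triplets)
```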
The choice of exactly 1,000 samples containing exactly 500 languages each is arbitrary. The criterion for choosing these numbers was that the number of samples should be sufficiently large to detect trends, and that each sample should not be too small, while remaining small enough to make 1,000 iterations computationally feasible.
It might seem natural to evaluate the different distance measures for the individual language families and to average the results, because different language families are our best approximation of independent samples when it comes to cross-linguistic data. This protocol has been followed, for instance, by Pompei et al. (2011). Such a procedure strikes me as misleading, though, because it only assesses how well the internal classification of each language family can be recovered from the different distance measures. It is equally important to take into account how well the competing measures separate different language families. My somewhat pessimistic conclusion is that it is not possible to create a sufficient number of samples from cross-linguistic data that are both independent of each other and representative of the population as a whole.
The following discussion draws heavily on Durbin et al. (1989).
The PMI score is defined in terms of the binary rather than the natural logarithm. This difference is inessential, however, because it amounts to a constant factor.
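Concretely, log2(x) = ln(x)/ln(2), so the two variants of the score differ by the constant factor 1/ln(2) ≈ 1.44, which leaves all comparisons between alignments unchanged. A minimal illustration with made-up probabilities:

```python
import math

# Made-up probabilities for an illustrative sound correspondence:
# joint probability of the pair vs. the product of the marginals.
p_xy, p_x, p_y = 0.02, 0.05, 0.08
ratio = p_xy / (p_x * p_y)

pmi_binary = math.log2(ratio)  # PMI with the binary logarithm
pmi_natural = math.log(ratio)  # PMI with the natural logarithm

# The two scores differ only by the constant factor 1 / ln 2.
print(pmi_binary / pmi_natural)
```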
Durbin et al. (1989) give a probabilistic interpretation of gap penalties, according to which the gap penalty is the logarithm of the probability of observing a gap. However, this derivation relies on the tacit assumption that sequences are so long that they can be considered infinite. As words are rather short, this leads to a systematic overestimation of gap penalties. Therefore gap penalties have no obvious probabilistic interpretation in the context of computational linguistics.
This was suggested to me by Eric Holman (p.c.).
To be precise: the implementation of the Levenshtein alignment algorithm I used (the Python package Levenshtein) only outputs one alignment, even if there are others that are equally good.
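That ties between alignments genuinely arise is easy to see with a self-contained sketch (a standard dynamic-programming formulation, not the package’s actual code): for some word pairs the edit-distance table admits several optimal paths, and a backtracking implementation reports only one of them.

```python
from functools import lru_cache

def levenshtein_table(a, b):
    """Fill the standard edit-distance table d, where d[i][j] is the
    distance between the prefixes a[:i] and b[:j]."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # (mis)match
    return d

def count_optimal_alignments(a, b):
    """Count the distinct optimal paths through the table; a
    backtracking aligner silently picks just one of them."""
    d = levenshtein_table(a, b)

    @lru_cache(maxsize=None)
    def paths(i, j):
        if i == 0 and j == 0:
            return 1
        total = 0
        if i > 0 and d[i][j] == d[i - 1][j] + 1:
            total += paths(i - 1, j)
        if j > 0 and d[i][j] == d[i][j - 1] + 1:
            total += paths(i, j - 1)
        if (i > 0 and j > 0 and
                d[i][j] == d[i - 1][j - 1]
                + (0 if a[i - 1] == b[j - 1] else 1)):
            total += paths(i - 1, j - 1)
        return total

    return paths(len(a), len(b))

distance = levenshtein_table("ab", "ba")[2][2]
n_alignments = count_optimal_alignments("ab", "ba")
print(distance, n_alignments)
```

For the pair "ab"/"ba", for instance, the distance is 2, but three distinct optimal alignments achieve it.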
The value of the target function at this point is , while the baseline, i.e. the mean dERC/LDN, is .
To perform the clustering, PMI scores were transformed into distances by subtracting them from the maximal PMI score. For the hierarchical clustering, Ward’s method was used; see Ward (1963).
It might be surprising that the triplet distances given in Table 3—which were calculated for the entire ASJP database—are in the 20% range, while the values for the test set are in the 10–15% range. This reflects the fact that the task of automatically classifying a given set of word lists has something like an inherent level of difficulty. The low scores for the test set may be related to the fact that almost one third of it consists of Austronesian languages. A substantial proportion of the triplets to be evaluated therefore contains two Austronesian languages and one non-Austronesian language, and the signal distinguishing Austronesian from the rest of the world’s languages is fairly strong.
This was suggested by an anonymous reviewer.
On the hardware currently at my disposal, computing the distance matrix for the full test set with PyAline would take more than a week.
This kind of bootstrapping is commonly used in character-based phylogenetic inference, including work in historical linguistics such as Gray and Atkinson (2003).