1 Chinese Dialect Classification and Chinese Dialect History
The sociolinguistic situation in China is incredibly complex: For a long time, countless, mostly mutually unintelligible, varieties of Chinese have been developing against the backdrop of a common culture and writing system. Influenced by centripetal forces of varying prestige languages, centrifugal forces of geographic distance, and different waves of migrations, the modern Chinese dialects constitute a language family whose linguistic divergence resembles that of the Romance languages (Norman 1988: 187–188, Wang 1997), but whose history is so complex and intertwined that it seems impossible to describe it by means of a classical family tree (Norman 2003: 76).
1.1 Traditional Approaches to Chinese Dialect Classification
Figure 1 shows the geographic distribution of the ten major dialect groups proposed by Wurm and Liu (1987).1 Despite a long tradition of research, there is little agreement among scholars regarding the exact classification of the Chinese dialects. Although the majority of scholars probably agree on the distinction of the seven major groups of Mandarin (Guānhuà 官話), Xiāng 湘, Gàn 贛, Wú 吳, Hakka (Kèjiā 客家), Yuè 粵, and Mǐn 閩 (Norman 1988: 181), there is less agreement regarding the status of the three more recently proposed groups of Jìn 晉, Huīzhōu 徽州, and Pínghuà 平話 (Kurpaska 2010: 64–73, Yan 2006: 8–19), and there is even less agreement regarding the further subgrouping of the major groups.
Traditionally, Chinese dialect classification is based on the comparison of the sound system of Middle Chinese with the sound systems of the modern dialects (Yan 2006: 9). Assuming that all Chinese dialects more or less go back to the language variety encoded in the extant rhyme books published in the 6th century A.D. (traditionally labeled ‘Middle Chinese’ or Zhōnggǔ Hànyǔ 中古漢語), it is straightforward to determine how specific features, such as the voiced plosives, or tonal categories, are reflected in a given modern dialect variety. In order to compare the features of a given dialect variety with those of Middle Chinese, scholars usually rely on character readings. Based on the pronunciation of diagnostic characters in the target dialect, the phonological differences, compared to the Middle Chinese system, are determined and employed for classification.2 Table 1 gives examples of dialect readings of characters with Middle Chinese labial stops as the initial (b, p, pʰ), with each of the dialect varieties representing one of the seven major dialect groups (data taken from Běijīng Dàxué 1962).
1.2 Current Challenges in Chinese Dialectology
In the last twenty years, more and more scholars have begun to question the traditional approach to Chinese dialect classification, demanding a paradigm shift in Chinese dialectology (Norman and Coblin 1995, Branner 2000, Baxter 2006). The traditional classification criteria are at the core of the problems. The tendency to reduce rather than to increase the number of features used in dialect classification is “essentially arbitrary” (Baxter 2006: 75), since the classification will directly depend on the initial choice of features (ibid.). Feature reduction will also necessarily lead to a loss of information and deprive genetic classifications of their discriminating power. The reflexes of Middle Chinese initials may be sufficient to identify most standard dialect groups (Lǐ 2005). However, in such a classification, the features are merely used to distinguish certain dialect groups, while they do not explain how they developed.
Although most classifications proposed so far are explicitly based on historical criteria, few of them explicitly try to account for the genealogical development of the Chinese dialects (Yóu 1992: 85–114, Norman 1988: 210–214, Wáng 2009, Sagart 2011). If the goal of classification was only a partitioning of the dialects into different groups that bear no historical implications, it would be much more useful to employ synchronic criteria, such as mutual intelligibility (Yóu 1992: 49–50, Zav’jalova 1996: 18), to accomplish this goal. However, if the purpose of dialect classification is to infer ‘relationships of common ancestry’ (Baxter 2006: 75), it is obvious that the classification needs to be based on features which are diagnostic for genetic relations. Here, most scholars opt for the use of lexical features—namely basic, or core vocabulary (Norman 1988: 181–183, Branner 2000: 25–28, Sagart 2011), since the borrowing of core vocabulary ranks rather low in the standard ‘scale of borrowability’ (Aikhenvald 2007: 7). However, since vocabulary can generally be easily exchanged (Weinreich 1953), basic vocabulary is no miracle cure against the “horizontal forces” of language history. In order to improve Chinese dialect classification, it seems inevitable to improve on those methods that can help to distinguish borrowed from inherited traits.
1.3 Phylogenetic Networks
Given the complex sociolinguistic situation in China, it is not surprising that many scholars claim that the family tree model (first proposed by Schleicher 1853, see Figure 2A) is inadequate to model Chinese dialect history, since it ignores the horizontal (areal) dimension of language relations that has played such an important role for the development of the dialects into their current shape (Sagart 2001, Norman 2003). Unfortunately, the alternative model, the Wave theory (Schmidt 1872, see Figure 2B), is also not very helpful, since it ignores the vertical dimension of language relations that is—of course—also constitutive for the history of Chinese dialects. Network models show a way out of the dilemma, since they can be easily used to display both vertical and horizontal language relations (see Figure 2C), as illustrated early on by Southworth (1964) for the Indo-European languages, and in a recent paper by Wáng (2009) for the Chinese dialects. In the following section, I will try to illustrate how methods for phylogenetic network reconstruction, which have their origin in evolutionary biology but were recently adapted to linguistic needs, can be used to test and explore hypotheses on Chinese dialect classification.
2 Reconstruction of Phylogenetic Networks
2.1 Distance- and Character-Based Approaches
It is common to make a distinction between distance- and character-based methods for phylogenetic reconstruction. The main difference between these different families of methods lies in the aggregation of information: distance-based methods aggregate information on the taxonomic level. Similarities and differences between all taxonomic units (language varieties) are reduced to distance scores. Character-based methods aggregate information on the level of the items that are selected to define the taxonomic units. Character-based methods yield concrete, individual evolutionary scenarios for each character in the dataset.3
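The difference in aggregation can be illustrated with a minimal Python sketch. The toy cognate sets, the variety labels A to C, and the function name `jaccard_distances` are invented for illustration, and the Jaccard distance is only one conceivable distance measure, not necessarily the one used in the studies cited here.

```python
from itertools import combinations

# Invented toy data: cognate set ID -> varieties in which it has a reflex.
COGNATE_SETS = {
    "moon-1": {"A", "B"},
    "moon-2": {"C"},
    "sun-1": {"A", "B", "C"},
}


def jaccard_distances(cognate_sets):
    """Distance-based view: collapse all characters into one score per pair."""
    taxa = sorted(set().union(*cognate_sets.values()))
    # Collect, for every variety, the cognate sets it takes part in.
    inventory = {
        t: {cid for cid, reflexes in cognate_sets.items() if t in reflexes}
        for t in taxa
    }
    distances = {}
    for a, b in combinations(taxa, 2):
        shared = len(inventory[a] & inventory[b])
        total = len(inventory[a] | inventory[b])
        distances[(a, b)] = 1 - shared / total
    return distances


# Character-based view: every cognate set keeps its own distribution.
for cogid, reflexes in sorted(COGNATE_SETS.items()):
    print(cogid, sorted(reflexes))

# Distance-based view: only pairwise scores survive the aggregation.
print(jaccard_distances(COGNATE_SETS))
```

Once the distances are computed, it is no longer possible to tell which cognate sets were responsible for a given score, whereas the character-based representation retains this information.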
The most popular distance-based methods for phylogenetic network reconstruction are based on techniques that produce splits networks (Huson et al. 2010: 87–126). The most popular algorithms are split decomposition (Bandelt and Dress 1992) and NeighborNet (Bryant and Moulton 2004). For both algorithms, the popular SplitsTree software package offers user-friendly implementations (Huson 1998). Splits networks are quite popular in historical linguistics and have been used in many studies on different language families (McMahon et al. 2005, Ben Hamed 2005, Ben Hamed and Wang 2006). However, the new insights these methods provide are rather limited. Only very general conclusions regarding the tree-likeness of the data can be drawn and the results are extremely difficult to interpret. Rates of borrowing cannot be calculated, nor can individual borrowing events be inferred.
The limited information provided by splits networks is illustrated in Figure 3, where a splits network of ten Chinese dialect varieties is displayed.4 As can be seen from the network, there seems to be a considerable number of conflicting signals. Wēnzhōu 溫州, a Wú dialect, for example, appears in an intermediate position between the Wú (Shànghǎi 上海) and the Mǐn dialects (Fúzhōu 福州 and Táiběi 台北). Chéngdū 成都, a South-Western Mandarin dialect, is placed between the Shànghǎi and Chángshā 長沙 (Xiāng) dialects. However, the network does not show us why some varieties occur in conflicting positions, and it also does not show the historical processes underlying the conflicts.
Character-based methods for phylogenetic network reconstruction overcome many shortcomings of distance-based approaches. Since they handle the evolution of individual characters (like cognate sets, or phonetic features), they offer distinct hypotheses regarding character development and offer a unified representation of the tree- and wave-like components of language history. These methods are very common in evolutionary biology (Huson and Scornavacca 2011, Koonin et al. 2001). In linguistics, however, there are only a few proposals so far (Wang and Minett 2005, Nakhleh et al. 2005). Unfortunately, none of these approaches is publicly available, be it in the form of standalone software tools or as software libraries. This makes their replication and application to new datasets extremely difficult and tedious. An alternative character-based method, which is freely available, is the so-called minimal lateral network (mln) approach (Dagan and Martin 2007, Dagan et al. 2008, Nelson-Sathi et al. 2011, List et al. 2014 a and b), which is implemented as part of a Python library for quantitative tasks in historical linguistics (List and Moran 2013, List et al. 2015).
The mln approach was originally designed to study microbial evolution. The basic idea of the approach is to test how well a given phylogeny (a so-called reference tree) explains the evolution of a set of characters. Instead of simply testing the explanatory power of phylogenies, the mln approach tests different models in which increasing amounts of horizontal transfer are allowed, and then it selects the model that provides the best explanation for the evolution of the characters. In a pilot study by Nelson-Sathi et al. (2011), the mln approach was used to assess borrowing frequencies during Indo-European language history. In List et al. (2014a), an improved version of this approach was presented and successfully applied to a small set of 40 Indo-European languages, where it identified 72% of all known borrowings in the data. In List et al. (2014b), the method was applied to Chinese dialect data, where it revealed a much higher proportion of characters suggestive of borrowing than was inferred for Indo-European (48–55% in Chinese vs. 31% in Indo-European).
The benefits of character-based network approaches as opposed to distance-based approaches are illustrated in Figure 4, where the same ten dialect varieties were analyzed with the help of the mln approach.5 As can be seen from the figure, the network yields a direct hypothesis regarding the intermediate position of Wēnzhōu and draws a horizontal edge between Fúzhōu and Wēnzhōu, thus indicating that these two varieties share a considerable number of characters which they do not share with other varieties. It also shows that the position of Chéngdū in the splits network can be more readily explained by a specific contact relation with Nánchāng 南昌, together with a more ancient contact relation between the Shànghǎi dialect and the ancestor of the Chéngdū, Píngyáo 平遙, and Chángshā dialects. We note further that the network has a root, and internal nodes, representing the ancestors of the contemporary varieties. The network thus contains an explicit temporal dimension. The varying size of the nodes further indicates how many characters (word forms for this specific dataset) are attested for each of the contemporary varieties, and how many word forms are inferred for ancestral varieties.
2.2 Minimal Lateral Networks
The mln approach (List et al. 2014 a and b) takes as input a reference tree and a set of phyletic patterns. Phylogenetic networks are inferred within a three-stage approach. In the first stage (1), gain-loss mapping techniques are used to infer a range of different gain-loss models that explain how the characters could have evolved along the reference tree. In a second stage (2), the best model is chosen by comparing the ancestral and contemporary vocabulary size distributions. In the third stage (3), a minimal lateral network is reconstructed from the gain-loss scenarios inferred by the best model.
Gain-loss mapping (glm, Cohen et al. 2010, Mirkin et al. 2003, List et al. 2014 a and b) is a standard technique in evolutionary biology. It can be used for various purposes, such as the assessment of the tree-likeness of a given dataset, the inference of horizontal gene transfer events, or the reconstruction of ancestral character states. The basic goal of all glm approaches is to infer gain-loss scenarios that explain how a given phyletic pattern developed along a reference tree. A phyletic pattern is a matrix representation of the distribution of cognate sets in a given set of languages. The matrix displays whether cognate sets have reflexes in a given language or not. For each language, a given cognate set is represented by two states: presence (1) or absence (0). Depending on the cognate sets being investigated, different patterns can be observed. This is illustrated in Table 2 where translations of ‘to count’ in three Romance and three Germanic languages are split into two cognate sets and coded as phyletic patterns. Given a reference tree that reflects the general evolution of languages, a gain-loss scenario (gls) explains the evolution of a character in terms of events. Events are state changes from ancestral to descendant nodes of the reference tree, with gain events being defined as changes from state 0 to state 1, and loss events being defined as changes from state 1 to state 0. In lexicostatistical terms, a loss means that a given word form is no longer used to express a given concept. It does not mean that the word is completely lost from the language. As in the case of ‘count,’ a possible English cognate for the German and the Danish form would be tell, which goes back to the same Proto-Germanic root. However, since tell ceased to denote the action of ‘counting things,’ it is not included in our sample, and its current state in English is represented as absent (0) in the phyletic pattern in Table 2.
Figure 5 shows two possible gain-loss scenarios for the first phyletic pattern from Table 2. In scenario A, one gain event and two loss events are inferred. The scenario thus implies that English count was inherited from the common ancestor of the Romance and the Germanic languages with its reflexes being lost in German and Danish. Scenario B, however, implies that no ancestral form of count was present in the common ancestor of the Germanic languages and that it originated independently in English and the ancestor of the Romance languages. We know, of course, that the second scenario is the right one, since English count was borrowed from Old French conter. Given that the independent origin of characters in different branches of a language family is rather rare, we can make the (simplifying) assumption that patchy cognate sets, i.e. cognate sets for which a given gain-loss scenario suggests multiple gain events, are the result of language contact.
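The two scenarios can be written down and counted in a few lines of Python. This is a sketch under stated assumptions: the flat Germanic and Romance nodes and the placeholder labels Romance1 to Romance3 are simplifications introduced here and do not reproduce the exact tree of Figure 5 or the languages of Table 2, and the function name `count_events` is hypothetical.

```python
# Simplified reference tree (child -> parent). Node names and the flat
# subgroups are placeholders chosen for this sketch.
PARENT = {
    "Germanic": "Root", "Romance": "Root",
    "English": "Germanic", "German": "Germanic", "Danish": "Germanic",
    "Romance1": "Romance", "Romance2": "Romance", "Romance3": "Romance",
}

# Phyletic pattern of the first cognate set (the one containing 'count'):
# presence (1) or absence (0) in each contemporary language.
PATTERN = {"English": 1, "German": 0, "Danish": 0,
           "Romance1": 1, "Romance2": 1, "Romance3": 1}

# Scenario A: present in the common ancestor, lost in German and Danish.
SCENARIO_A = dict(PATTERN, Root=1, Germanic=1, Romance=1)
# Scenario B: independent origins in English and in the Romance ancestor.
SCENARIO_B = dict(PATTERN, Root=0, Germanic=0, Romance=1)


def count_events(states, parent=PARENT, root="Root"):
    """Count gain (0 -> 1) and loss (1 -> 0) events in a gain-loss scenario."""
    gains = 1 if states[root] == 1 else 0  # presence at the root is one origin
    losses = 0
    for child, par in parent.items():
        if states[par] == 0 and states[child] == 1:
            gains += 1
        elif states[par] == 1 and states[child] == 0:
            losses += 1
    return gains, losses


print("Scenario A (gains, losses):", count_events(SCENARIO_A))
print("Scenario B (gains, losses):", count_events(SCENARIO_B))
```

A cognate set is patchy under a given scenario whenever the first number returned by `count_events` is larger than one, as is the case for scenario B.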
In order to find an appropriate gain-loss scenario for a given phyletic pattern, it is important to find criteria that define the appropriateness of gain-loss scenarios. To know whether the second of the two scenarios in Figure 5 is the right one does not help us to select the correct scenario in cases where we do not know the history of the languages in such great detail. Internally, it is easy to define gain-loss models that favor one of the scenarios by either restricting the maximum number of gain events (restriction-based approaches, Nelson-Sathi et al. 2011), or by defining specific “penalties” for gain and loss events (parsimony-based approaches, List et al. 2014 a and b). Externally, however, specific criteria are needed to determine the best model for a given dataset. Nelson-Sathi et al. (2011) follow Dagan and Martin (2007) in using ancestral vocabulary size distributions as a heuristic to determine an optimal gain-loss model. The vocabulary size distribution (vsd) of a given language is defined as the number of words the language uses to express a given set of concepts. The basic idea of this approach is that the number of words that are used to express a given number of concepts in ancestral languages should not differ greatly from the number of words used to express the same concepts in contemporary ones. As illustrated in Figure 6A-C, models that overestimate the tree-likeness of the data yield ancestral vsds that grow drastically (A), while models that overestimate the amount of horizontal transfer yield drastically shrinking vsds (B). The preference should be given to models that yield well-balanced vsds throughout all nodes of the tree (C).
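The logic of the vsd criterion can be sketched as follows. The vocabulary sizes are invented toy numbers, the function name `best_model` is hypothetical, and the comparison of means is a deliberately crude stand-in for a proper comparison of the ancestral and contemporary distributions.

```python
from statistics import mean


def best_model(contemporary_sizes, ancestral_sizes_by_model):
    """Pick the model whose mean ancestral vocabulary size is closest to
    the mean contemporary vocabulary size (crude stand-in heuristic)."""
    target = mean(contemporary_sizes)
    return min(
        ancestral_sizes_by_model,
        key=lambda model: abs(mean(ancestral_sizes_by_model[model]) - target),
    )


# Invented toy numbers: words per variety for a fixed set of concepts.
contemporary = [210, 205, 215]
ancestral = {
    "3:1": [320, 290],  # too tree-like: ancestral vocabularies grow
    "2:1": [212, 208],  # balanced
    "1:1": [150, 140],  # too much transfer: ancestral vocabularies shrink
}

print(best_model(contemporary, ancestral))
```

The point of the sketch is only the direction of the effect: penalizing gains too heavily pushes word forms back into the ancestors, while penalizing them too lightly empties the ancestral vocabularies.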
Once an appropriate gain-loss model for a given dataset has been determined, the results of the glm analysis can be interpreted and further analyzed in different ways. The simplest way is to sort out all patchy cognate sets and to investigate these cases individually. Patchy cognate sets can have different origins: They may result from (a) independent convergent evolution, (b) any form of contact between the involved languages or their descendants, including direct, but also semantic transfer, or (c) errors in the data. Given that independent convergent evolution is not a very frequent process (neither in biology nor in linguistics) and that errors in the data should not occur in an ideal world, it is straightforward to assume that the patchiness of the cognate sets results from contact. For a global representation of all patchy cognate sets inferred for a given dataset, one can reconstruct a rooted phylogenetic network (Dagan et al. 2008, Nelson-Sathi et al. 2011) which displays patterns of vertical and horizontal inheritance. The reference tree represents vertical relations and serves as the basic hypothesis regarding the genetic classification of the language varieties under investigation. Additional edges drawn between the nodes of the tree represent the number of times multiple gain events were inferred and indicate potential contact relations between the respective nodes or their descendants in the tree.
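How the additional edges can be derived from the inferred gain events is sketched below. The cognate set IDs and node labels are invented, the function name `lateral_edges` is hypothetical, and linking every pair of origins is a simplifying assumption; for sets with more than two origins, other ways of connecting the origins are conceivable.

```python
from collections import Counter
from itertools import combinations

# Invented toy result of a gain-loss analysis:
# cognate set ID -> nodes of the reference tree at which a gain was inferred.
ORIGINS = {
    "count-1": ["English", "Proto-Romance"],
    "count-2": ["Proto-Germanic"],
    "mountain-1": ["English", "Proto-Romance"],
    "moon-3": ["NodeX", "NodeY", "NodeZ"],
}


def lateral_edges(origins):
    """Weight each pair of nodes by the number of patchy cognate sets
    for which both nodes were inferred as origins."""
    edges = Counter()
    for nodes in origins.values():
        if len(nodes) > 1:  # only patchy cognate sets add lateral edges
            for a, b in combinations(sorted(nodes), 2):
                edges[(a, b)] += 1
    return edges


for (a, b), weight in lateral_edges(ORIGINS).most_common():
    print(f"{a} <-> {b}: {weight} cognate set(s)")
```

Cognate sets with a single origin contribute nothing to the lateral edges and are fully explained by the reference tree.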
3 Minimal Lateral Networks of Chinese Dialects: A Case Study
Few people will deny that neither Stammbaum nor Welle is sufficient to reflect Chinese dialect history in all its complexity. But do network approaches at their current stage offer a serious alternative to the two well-established models of language classification? In the following section, the application of network approaches in Chinese dialectology will be illustrated by applying the mln approach to Chinese dialect data. The mln approach is chosen for two reasons. Firstly, it offers a character-based framework to study language history. It yields direct, clear-cut hypotheses regarding the history of individual words and may thus be directly compared to alternative proposals. Secondly, the implementation of the mln is freely available and scholars can easily replicate and further explore the results of the analysis.6
The data used for this study was originally employed in a study by Ben Hamed and Wang (2006) and kindly provided by the second author. In this study, the authors investigated the suitability of lexicostatistical data for Chinese dialect classification by testing various algorithms, both for the reconstruction of phylogenetic trees and networks. The data consists of lexicostatistical wordlists for 23 Chinese dialect varieties and covers the seven traditional dialect groups. Each of the wordlists contains translations for 200 basic concepts (following Swadesh 1952 and 1955). Due to cases of synonymy where more than one translation or an optional variant was available for the same concept, the total number of words in the dataset amounts to 5349. All words were further coded for cognacy by assigning words which go back to the same ancestor form to a common cognate set. The original data offers different variants of cognate coding. For this study, a strict coding variant was chosen. This means that words are only assigned to the same cognate set if they match completely. This is a simplifying assumption, since—due to the frequency of compounding in Chinese—Chinese dialect words frequently exhibit partial cognate relations in which words contain cognate morphemes without matching completely. This is illustrated in Table 3, where translations of the concept ‘moon’ in four dialects share the common morpheme yuè 月 ‘moon’ (Middle Chinese *ŋjot) but use different (or no) suffixes (data taken from Hóu 2004). In order to handle these cases of partial cognacy adequately, specific models allowing for multiple character states are necessary (Nunn 2011: 59–60). Since the mln approach currently only handles binary character states (i.e. presence vs. absence), the words need to be assigned to three different cognate sets in the analysis, as indicated by the different numbers for each cognate set in the table. 
This coding procedure yields a total of 1513 cognate sets distributed over the 23 dialect varieties. However, since 922 of the cognate sets occur only once in the data and are thus not informative for the mln approach, only 591 cognate sets were left for the analysis. The full dataset as it was used in this study is provided in Supplementary Material ii.
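The filtering step can be sketched in Python as follows. The toy cognate sets and the function name `informative_sets` are invented, and the sketch assumes that a cognate set occurring only once corresponds to a set with a reflex in a single variety.

```python
def informative_sets(cognate_sets):
    """Keep only cognate sets with reflexes in at least two varieties."""
    return {
        cogid: varieties
        for cogid, varieties in cognate_sets.items()
        if len(varieties) > 1
    }


# Invented toy data: cognate set ID -> varieties with a reflex.
toy = {
    "moon-1": {"A", "B"},
    "moon-2": {"C"},          # singleton, not informative
    "sun-1": {"A", "B", "C"},
}
print(sorted(informative_sets(toy)))

# The counts reported in the text are consistent with each other.
assert 1513 - 922 == 591
```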
3.1.2 Reference Trees
Using a reference tree is crucial for the application of the mln approach. The reference tree represents the basic hypothesis regarding the vertical direction of character evolution and serves as the “genetic backbone.” Given the multitude of different proposals for Chinese dialect classification, it is useful to test several reference trees instead of employing only a single one when applying the mln approach. For this study, a total of seven different reference trees was prepared. Three reference trees reflect independent hypotheses regarding Chinese dialect classification, as proposed in the literature. Three trees were reconstructed with the help of automatic methods for phylogenetic reconstruction. Additionally, a random tree was used to test the impact of arbitrariness on the mln approach. All numbers reported for the random tree reflect the average results of 50 trials.
The first independent hypothesis is Laurent Sagart’s Arbre des Dialectes Chinois (Sagart 2011, personal communication). Based on distinct innovations for each split in the family tree, the Mǐn dialects were the first to split off, followed by Hakka and Yuè, which form a distinct subgroup. In its original form, Sagart’s proposal contains not only the seven major dialect groups but also Wǎxiāng 瓦鄉 and Càijiā 蔡家, two archaic varieties which Sagart assumes to be the first varieties to split off the Sinitic branch of Sino-Tibetan. The second independent hypothesis is Jerry Norman’s Southern Chinese Hypothesis (Norman 1988: 210–214) which divides Chinese dialects into a northern, a central, and a southern zone (including Hakka, Yuè, and Mǐn), based on a small set of lexical and phonetic features. The third hypothesis is the Hànyǔ Fāngyán Shùxíngtú 漢語方言樹形圖 (“Tree chart of Chinese dialects”) by Yóu Rǔjié 游汝杰 (Yóu 1992: 91–106). This hypothesis is based on a thorough investigation of known population movements and the identification of dialect varieties with the respective ancient populations. Among others, it places Wú and Mǐn as close neighbors, as well as Gàn and Hakka. The three hypotheses, Sagart’s Arbre, Norman’s Southern Chinese, and Yóu’s Shùxíngtú, are all illustrated in Figure 7. All hypotheses show only the subgrouping for the major dialect groups of Chinese. The finer groupings were taken from a maximum parsimony analysis of the full dataset described in the study by Ben Hamed and Wang (2006: 46).
Automatic reference trees were reconstructed using two popular distance-based methods for tree reconstruction, namely the Unweighted Pair Group Method with Arithmetic Mean (upgma, Sokal and Michener 1958) and Neighbor-Joining (Saitou and Nei 1987), and maximum parsimony as a popular character-based method. For the distance-based methods, the calculations were repeated for this study, using the upgma and Neighbor-Joining implementations provided by the LingPy library (List et al. 2015). For the character-based parsimony analysis, the results were taken from the study by Ben Hamed and Wang (2006: 47). Reference trees with full resolution, including all 23 dialect varieties for all automatic approaches and the independent hypotheses are given in Supplementary Material iii.
The data was analyzed using the default settings of the most recent version of the mln approach as implemented in LingPy (http://lingpy.org, Version 2.4.1-alpha, List et al. 2015). In the default settings, five different gain-loss models are tested. The models differ regarding the ratio between the penalties for gain and loss events, ranging from 3:1 through 2:1 to 1:1. The higher the penalty for gain events in comparison to loss events, the stronger the tendency of the model to favor vertical over horizontal transmission. Hence, the 3:1 model is the most conservative one, and the 1:1 model the most innovative. The mln method also contains a specific parameter which handles the amount of parallel evolution allowed during model evaluation. According to the default settings, this parameter was set to 0.05, allowing ancestral vocabulary distributions to grow by 5% without affecting the choice of the best model. Using these settings, the mln approach was applied to all seven reference trees. Scripts that reproduce the analyses described in this study, along with a computer-readable version of the data and the reference trees, are given in Supplementary Material i (C and D).
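The effect of the penalty ratio can be illustrated with a small weighted-parsimony sketch in plain Python. It does not use or reproduce the LingPy implementation: the function names `best` and `count_origins`, the toy tree with its placeholder labels, and the tie-breaking rule (equal costs are resolved in favor of fewer gains) are all assumptions made for this illustration.

```python
INF = float("inf")

# Toy reference tree as nested tuples; leaves are strings (placeholder labels).
TREE = (("English", "German", "Danish"), ("Romance1", "Romance2", "Romance3"))
# Phyletic pattern: presence (1) or absence (0) of one cognate set.
PATTERN = {"English": 1, "German": 0, "Danish": 0,
           "Romance1": 1, "Romance2": 1, "Romance3": 1}


def best(node, pattern, gain, loss):
    """Return ((cost, gains) if node is absent, (cost, gains) if present)
    for the cheapest gain-loss scenario below this node."""
    if isinstance(node, str):  # leaf: the observed state costs nothing
        return ((INF, 0), (0, 0)) if pattern[node] else ((0, 0), (INF, 0))
    absent, present = (0, 0), (0, 0)
    for child in node:
        c0, c1 = best(child, pattern, gain, loss)
        # Parent absent: child stays absent, or the character is gained.
        opt0 = min(c0, (c1[0] + gain, c1[1] + 1))
        # Parent present: child stays present, or the character is lost.
        opt1 = min(c1, (c0[0] + loss, c0[1]))
        absent = (absent[0] + opt0[0], absent[1] + opt0[1])
        present = (present[0] + opt1[0], present[1] + opt1[1])
    return absent, present


def count_origins(tree, pattern, gain, loss):
    """Number of gain events in the cheapest scenario for the whole tree."""
    r0, r1 = best(tree, pattern, gain, loss)
    # Presence at the root counts as one gain (the origin of the set).
    return min(r0, (r1[0] + gain, r1[1] + 1))[1]


for gain, loss in [(3, 1), (2, 1), (1, 1)]:
    print(f"{gain}:{loss} ->", count_origins(TREE, PATTERN, gain, loss), "origin(s)")
```

For this toy pattern, the conservative 3:1 weighting prefers a single origin with two losses, whereas the 1:1 weighting prefers two independent origins, which mirrors the contrast between vertical and horizontal explanations described above.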
It is important to get a clearer picture of the general accuracy of the mln approach. In List et al. (2014a), this was done by testing to which degree known borrowings in an Indo-European dataset were readily identified by the method. In that study, the mln approach identified 72% of all known borrowings correctly. Since we lack a list of known borrowings for the Chinese dialects, we cannot apply this test to the current dataset. However, there is a simple but effective method that may provide similar insights into the power of the mln approach. This method is based on the “seeding” of fake borrowings (Dessimoz et al. 2008). Before analyzing the data, a certain number of dialect pairs is randomly selected with one dialect serving as a donor and one as a recipient. A certain number of words is then transferred from each donor to each recipient, thereby replacing the original word form in the given meaning slot.
Here, this method was applied to varying numbers of donor-recipient pairs (3, 6, 9, 12, and 15 pairs) with a varying borrowing rate of at most 25%.7 For each number of donor-recipient pairs, 50 different datasets with fake borrowings were produced and analyzed using the reference trees of the three independent hypotheses, the random trees, and two automatic methods for tree reconstruction, namely upgma and Neighbor-Joining. Note that both automatic methods were applied to the data after the borrowings were introduced. The automatic analyses were thus forced to deal with the artificial reticulation in the data.
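The seeding procedure can be sketched in Python as follows. The toy wordlists, the function name `seed_borrowings`, and the choice to draw the borrowing rate uniformly between 0 and the maximum are assumptions made for this illustration and may differ from the scripts in the supplementary material.

```python
import random


def seed_borrowings(wordlists, n_pairs, max_rate, rng):
    """Return a copy of the wordlists with fake borrowings, plus the list
    of seeded events as (donor, recipient, concept) triples."""
    seeded = {variety: dict(words) for variety, words in wordlists.items()}
    varieties = sorted(wordlists)
    events = []
    for _ in range(n_pairs):
        donor, recipient = rng.sample(varieties, 2)
        rate = rng.uniform(0, max_rate)  # assumed: uniform borrowing rate
        concepts = sorted(wordlists[donor])
        for concept in rng.sample(concepts, int(rate * len(concepts))):
            # The donor's original form replaces the recipient's form.
            seeded[recipient][concept] = wordlists[donor][concept]
            events.append((donor, recipient, concept))
    return seeded, events


# Invented toy wordlists: variety -> concept -> word form label.
TOY_WORDLISTS = {
    variety: {f"concept{i}": f"{variety}-{i}" for i in range(8)}
    for variety in ("A", "B", "C", "D")
}

seeded, events = seed_borrowings(
    TOY_WORDLISTS, n_pairs=3, max_rate=0.25, rng=random.Random(42)
)
print(len(events), "fake borrowings seeded")
```

Since the seeded events are recorded, the horizontal edges inferred from the manipulated data can afterwards be checked against them.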
The averaged results for the tests are given in Table 5. For lower degrees of borrowing (up to 9 dialect pairs), all independent hypotheses perform fairly well on the borrowing-detection task. With higher degrees of borrowing, however, performance decreases drastically, approximating the results of the random analyses. This is not surprising, since, with higher degrees of borrowing involving more and more dialects, cognate sets will have reflexes in more and more dialects and start to resemble the inherited items. The mln method thus successively loses its most important evidence: the patchiness of phyletic patterns. In contrast to the more or less satisfying performance of the analyses based on the independent hypotheses, the performance of the analyses based on automatic reference trees is alarmingly bad, showing only minor (if any) improvements compared to the random analysis. The reason for the bad performance of the automatic methods is that they are based on the ‘infected’ data. The introduction of artificial borrowings reduces and confuses the phylogenetic signal. A phylogenetic signal, however, is needed to guarantee that borrowings can be readily detected. As a result, the phylogenetic reconstruction methods fail to detect the vertical signal in the data, and the resulting reference trees are not suitable to detect horizontal signals with the help of the mln approach. This shows the importance of selecting suitable reference trees when analyzing data with help of minimal lateral networks. It also shows how important it is for historical linguistics to carry out closer investigations on the borrowability of linguistic features. Scripts for the replication of this analysis are provided in Supplementary Material i (B).
3.4.1 General Results
Table 4 shows the general results of the analysis for all seven hypotheses.8 As can be seen from the table, the 2:1 model shows the best fit with the vsd criterion in all analyses. The results for the six regular hypotheses do not differ much with respect to the number of inferred origins (gain events) per cognate set, ranging from 1.41 (Arbre) to 1.50 (upgma) gain events per cognate set on average. The same holds for the proportion of cognate sets which cannot be readily explained with help of the reference tree (column ‘Overall’ in the table). Here, the Arbre requires the lowest amount of horizontal evolution (34% of the cognate sets cannot be explained by vertical inheritance alone), and the Shùxíngtú the largest (39%). In contrast to the minor differences between the six regular hypotheses, the random analysis (based on the averaged scores of 50 trials) shows an exceptionally high amount of 1.91 gain events and a large amount of inferred borrowing events. This shows the importance of choosing a good reference tree when applying the mln approach.
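The two summary figures can be computed from the inferred gain events as sketched below. The counts are invented toy values, the function name `summarize` is hypothetical, and the sketch assumes that a cognate set counts as not explained by the reference tree whenever more than one origin is inferred for it.

```python
def summarize(origins_per_set):
    """Return the average number of origins per cognate set and the
    proportion of cognate sets with more than one inferred origin."""
    n = len(origins_per_set)
    average = sum(origins_per_set.values()) / n
    patchy = sum(1 for k in origins_per_set.values() if k > 1) / n
    return average, patchy


# Invented toy result: cognate set ID -> number of inferred gain events.
toy_origins = {"c1": 1, "c2": 1, "c3": 2, "c4": 1, "c5": 3}
average, patchy = summarize(toy_origins)
print(f"origins per set: {average:.2f}, patchy sets: {patchy:.0%}")
```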
Note that the differences in terms of origins and percentage of horizontal transfer events do not necessarily qualify or disqualify a given hypothesis. They just show how well a given hypothesis explains the data on which it is tested. A hypothesis which reveals that large amounts of the data cannot be explained by vertical transmission alone may be more realistic than one that reduces horizontal transmission to a minimum. If horizontal transmission is a genuine aspect of the language family under investigation, the former hypothesis may come much closer to the real history than the latter. In order to explore the different hypotheses in more detail, it is therefore important to take a closer look at the direct inferences of the hypotheses. Apart from the overall amounts of inferred horizontal events displayed in Table 4, the results for specific subsets are also given in the table. The column ‘Yakhontov’ lists the amount of inferred borrowings for all cognate sets which denote concepts belonging to Sergei Yakhontov’s list of the 35 most stable basic vocabulary items (Starostin 1995). The column ‘Swadesh-100’ lists the results for those cognate sets which appear on Morris Swadesh’s (1909–1967) list of the 100 most stable basic vocabulary items (Swadesh 1955) but not on Yakhontov’s list. The column ‘Swadesh-200’ shows the results for the rest of the 200 items, which belong neither to Yakhontov’s nor to Swadesh’s 100-item list. Assuming that the likelihood of a concept being borrowed is the lowest for Yakhontov’s list, followed by Swadesh’s 100-item list, it is interesting to compare how the proportions of inferred horizontal events are distributed over the sublists. Here, the Arbre hypothesis yields the closest reflection of the sublist structure, proposing the smallest amount of horizontal transfer events for Yakhontov’s 35-item list, followed by Swadesh’s 100-item list, and the remainder of events in a third group.
Interestingly, the sublist structure is also reflected in the random analysis, although the differences between the lists are less apparent.
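The two summary figures discussed above, the average number of origins per cognate set and the proportion of cognate sets that cannot be explained by vertical inheritance alone, can be illustrated with a minimal Python sketch. It assumes that a cognate set counts as requiring a horizontal explanation whenever more than one origin is inferred for it. The function name `summarize`, the cognate set labels, and all numbers are hypothetical toy values, not data or code from this study.

```python
from statistics import mean

# Hypothetical toy input: cognate set id -> (sublist, number of inferred origins).
toy_origins = {
    "sun-1": ("Yakhontov", 1),
    "sun-2": ("Yakhontov", 2),
    "moon-1": ("Yakhontov", 1),
    "bark-1": ("Swadesh-100", 1),
    "bark-2": ("Swadesh-100", 3),
    "rope-1": ("Swadesh-200", 2),
    "rope-2": ("Swadesh-200", 2),
}


def summarize(origins, sublist=None):
    """Return (average origins per cognate set, share of sets with > 1 origin).

    If `sublist` is given, only cognate sets from that sublist are counted.
    """
    counts = [n for (name, n) in origins.values() if sublist in (None, name)]
    average = mean(counts)
    # Assumption: more than one origin means vertical inheritance alone
    # does not suffice to explain the cognate set.
    share = sum(1 for n in counts if n > 1) / len(counts)
    return average, share


for label in (None, "Yakhontov", "Swadesh-100", "Swadesh-200"):
    print(label or "Overall", summarize(toy_origins, label))
```

Comparing the shares across the three sublists corresponds to the comparison of the columns ‘Yakhontov’, ‘Swadesh-100’, and ‘Swadesh-200’ described above.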
Figure 8 shows the minimal horizontal network reconstructed from the Shùxíngtú.9 The three heaviest edges in the mln are between Méixiàn 梅縣 and Guǎngzhōu 廣州 (9 cognate sets), between the common ancestor of Shànghǎi 上海 and Sūzhōu 蘇州 and the common ancestor of Mǐn, Gàn, Hakka, Yuè, and Xiāng (7 cognate sets), and between Chángshā 長沙 and Mandarin (6 cognate sets). In Table 6, the three heaviest horizontal edges as inferred by the independent hypotheses are displayed, along with the forms that cause reticulation. The forms are further marked, depending on the sublists to which they belong. Concepts which appear on Yakhontov’s list are shaded in black, and concepts appearing on Swadesh’s 100-item list are shaded in gray. Numbers (in white font on dark gray shading) reflect the Leipzig-Jakarta Index (Tadmor 2009) of conceptual items which are supposed to be very stable and highly resistant to borrowing (the smaller the number, the higher the supposed stability and resistance). As can be seen by scanning the data in the table, the numbers reported above are also reflected in the three heaviest edges of the mlns, with the Arbre involving less stable concepts than the Southern Chinese Hypothesis or the Shùxíngtú. This holds not only for the rankings proposed by Yakhontov and Swadesh, but also for the Leipzig-Jakarta Index (which is based on empirical data). This suggests that the Arbre mirrors the vertical history of the Chinese dialects more closely than the alternative hypotheses.
Whether the links inferred by the methods reflect real reticulation heavily depends on the reference tree. The nine horizontal edges which the Shùxíngtú infers for Guǎngzhōu and Méixiàn, for example, may as well reflect a closer genetic relationship between Hakka and Yuè (as proposed by the Arbre and the Southern Chinese Hypothesis). However, since the Shùxíngtú places Yuè and Hakka in separate branches, it cannot explain the nine forms shared by the two dialects as the result of common inheritance and therefore has to explain them as horizontal transfer events.
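The dependence on the reference tree can be made concrete with a small Python sketch. It uses a deliberately simplified model in which losses are not allowed at all, so that every maximal clade whose leaves all carry a cognate set counts as one origin. This is not the weighted gain-loss model used in the mln analyses, and the two four-taxon trees are hypothetical toy topologies rather than the actual Arbre or Shùxíngtú.

```python
def leaves(tree):
    """Collect the leaf names of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return {tree}
    names = set()
    for child in tree:
        names |= leaves(child)
    return names


def count_origins(tree, present):
    """Count maximal clades in which every leaf carries the cognate set.

    Simplifying assumption: no losses are allowed, so each such clade
    needs its own origin.
    """
    if leaves(tree) <= present:
        return 1
    if isinstance(tree, str):
        return 0
    return sum(count_origins(child, present) for child in tree)


# Hypothetical toy trees: the two varieties are sisters in the first tree
# and sit on separate branches in the second.
tree_sisters = (("Guǎngzhōu", "Méixiàn"), ("Běijīng", "Shànghǎi"))
tree_apart = (("Guǎngzhōu", "Shànghǎi"), ("Méixiàn", "Běijīng"))
shared = {"Guǎngzhōu", "Méixiàn"}

print(count_origins(tree_sisters, shared))  # one origin: common inheritance
print(count_origins(tree_apart, shared))    # two origins: one must be a transfer
```

With the first tree, the shared form is explained by a single origin in the common ancestor. With the second tree, the same distribution requires two origins, and a network method would link them by a horizontal edge.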
3.4.2 Inferred Ancestral States and the Plausibility of Reference Trees
Since the mln approach so heavily depends on the selection of the ‘correct’ reference tree, it is unsuitable to use the method directly to assess which of a given set of reference trees is the most plausible one. Without further external evidence, minimal lateral networks do not tell us whether a given hypothesis is wrong, or whether a given hypothesis is less suitable than another one. Nevertheless, minimal lateral networks display the consequences of classification hypotheses in terms of distinct historical scenarios of character evolution, and these consequences can be evaluated with the help of external evidence.
Figure 9 shows an evolutionary scenario for the development of translations for ‘sun’ (rìtóu 日頭 and tàiyáng 太陽) in the Chinese dialects as proposed by the mln method. This scenario is far from being correct, since it proposes that the form tàiyáng 太陽 developed in the common ancestor of Wú and Mandarin, although the Shànghǎi dialect’s form [tʰa33ɦiã44] 太陽 is a recent borrowing from Standard Chinese (Qián 2007: 24–27). Nevertheless, the scenario is better than many other possible scenarios, since it correctly places only the cognate set rìtóu 日頭 in the ancestor languages of all dialects in the sample, rather than the late Mandarin innovation tàiyáng 太陽. Since the mln approach regularly yields these scenarios for all characters in the data, we can test the plausibility of different classification hypotheses by checking how well the scenarios match with external evidence.
As a simple way to do so, it was examined which cognate sets are correctly traced back to the root of the reference trees. As external evidence, an Old Chinese wordlist, which was provided along with the original data used in Ben Hamed and Wang (2006), was used. Since not all word forms of this word list have reflexes in the data, only 148 forms were used for testing, and all other semantic slots were excluded from the analysis. In addition to the original mln analysis (henceforth labeled BestModel), a further analysis, in which no horizontal transfer events were allowed, was carried out (henceforth labeled LossOnly). The results for the two analyses are given in Table 7. As can be seen from the table, the models which do not allow for horizontal transfer trace more cognate sets back to the root of the tree. In the case of Southern Chinese and Shùxíngtú, this results in a hypothetical proto-language that contains more than 200 words to express fewer than 150 concepts. Once horizontal transfer is allowed, the number of cognate sets traced back to the proto-language shrinks drastically. Although all LossOnly analyses show a larger proportion of cognate sets which are correctly traced back to the root, their precision is much lower than that of the mln analyses, since they produce a larger number of false positives. This clearly shows that the mln approach yields more realistic reconstructions of ancestral states than pure loss-only models of character evolution.
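The evaluation described above can be sketched in a few lines of Python. Precision is taken here as the share of cognate sets traced back to the root that are attested in the external wordlist, and recall as the share of attested sets that are recovered at the root. The function name and the cognate set identifiers are hypothetical and only illustrate the trade-off. They do not reproduce the figures in Table 7.

```python
def precision_recall(traced_to_root, attested):
    """Compare cognate sets placed at the root with an external wordlist."""
    traced, attested = set(traced_to_root), set(attested)
    hits = traced & attested
    precision = len(hits) / len(traced) if traced else 0.0
    recall = len(hits) / len(attested) if attested else 0.0
    return precision, recall


# Hypothetical toy data: identifiers of cognate sets.
old_chinese = {"c1", "c2", "c3", "c5", "c6"}
best_model = {"c1", "c2", "c3", "c4"}
loss_only = {"c1", "c2", "c3", "c5", "c7", "c8", "c9", "c10"}

print(precision_recall(best_model, old_chinese))  # fewer sets, higher precision
print(precision_recall(loss_only, old_chinese))   # more sets, more false positives
```

In this toy case, the loss-only analysis recovers more attested forms but also places many unattested cognate sets at the root, which lowers its precision.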
Comparing the performance of the three hypotheses, the Arbre shows the highest precision, with 76% of all cognate sets traced back to the root occurring in the external wordlist of Old Chinese. The Shùxíngtú traces more cognate sets correctly back to the root, but it also produces a larger number of false positives, resulting in a precision score of 72%. The Southern Chinese Hypothesis shows the lowest precision of all three independent classification proposals. Recalling that the Arbre also shows the best reflection of item stability as shown by the comparison of the proportion of inferred horizontal transfer events for the different sublists, one may preliminarily conclude that Sagart’s Arbre des Dialectes Chinois offers a more plausible genetic classification of the Chinese dialects than the Southern Chinese Hypothesis or the Hànyǔ Fāngyán Shùxíngtú hypothesis. Given that Sagart’s hypothesis is based on a carefully selected number of shared innovations, as opposed to the mix of innovations and retentions underlying Norman’s Southern Chinese Hypothesis, and the identification of known population movements with contemporary dialects underlying Yóu Rǔjié’s Shùxíngtú, this underlines the importance of shared innovations in language classification, which was already outlined by Karl Brugmann (1849–1919) in 1886 (Brugmann 1886).
What can we learn from this study? Are minimal lateral networks the key to reconstructing Chinese dialectal history, or are they just another fancy tool for visualization that explains problems away rather than facing them? The answer lies somewhere in the middle: minimal lateral networks surely do not solve all long-standing problems of Chinese dialectology at once. However, they are much more than the shiny images they yield in the end. Minimal lateral networks show the consequences of a hypothesis about the vertical paths of language history. In this respect, they are much more transparent than splits networks which only assess the tree-likeness of a given dataset. However, since minimal lateral networks are more transparent, they are also more easily exposed to criticism. Erroneous reference trees lead to erroneous inferences about horizontal language relations. If two closely related varieties are wrongly placed on distant branches of the reference tree, the mln method will explain their genetic relations as a result of contact. This is a weakness of the method, but it is not a flaw.
But how good are the minimal lateral networks after all? When testing how well artificially seeded borrowings were detected by the mln method, it became evident how important it is to have an initial guess regarding the “true” vertical history of the languages under investigation. This shows that minimal lateral networks are by no means a miracle cure against the horizontal forces of language history. They can help to identify the weak spots in our data, provided we have a good idea regarding the general vertical history of the languages being investigated. Yet, without an initial hypothesis regarding the genetic history of a language family, they cannot tell us which items in our data are inherited and which are borrowed. Several factors contribute to this specific weakness of the minimal lateral network approach, the most important ones being (a) the underlying model of lexical change, and (b) the restriction of the model to use only tree-topologies to find conflicting signals.
The underlying model of lexical change is problematic, since it models lexical change as a simple process involving cognate gain and cognate loss. It is beyond question that such a model is far too simple to reflect lexical change realistically and accurately. In the current implementation of the mln approach, for example, Shànghǎi [ɦyɪʔ11 kuɑ̃23] 月光, Wēnzhōu [ȵy21 kuɔ35 vai13] 月光佛, and Běijīng [yɛ51 liɑŋ1] 月亮 (all ‘moon’, data from Hóu 2004) are all treated as etymologically unrelated words. Since the mln approach can only handle binary character states of character presence or character absence, their evolution is modeled independently of each other, although it is obvious that the word forms from Shànghǎi and Wēnzhōu are more closely related to each other than to the form in Běijīng. It is straightforward to assume that the suffix [vai(2)13] was later added to the word in Wēnzhōu via the regular processes of suffixation (compare Wēnzhōu [ȵi21dɤu35vai13] 日頭佛 ‘sun’), and that the common ancestor of Wēnzhōu and Shànghǎi would denote ‘moon’ with a word of the form 月光, while the common ancestor of Wēnzhōu, Běijīng and Shànghǎi would not. What we are dealing with here are oblique etymological relations (List 2014: 41–44), or, in a narrower sense, partial cognate relations, which are very frequent in the Sino-Tibetan languages. Binary gain-loss models of lexical change cannot handle these relations. In order to handle them, more complex models which allow for multiple character states and character transition probabilities are needed (Nunn 2011: 59–60). In biology, the use of multi-state models is quite common. It is, however, difficult to apply them to linguistic data, since most existing approaches have been specifically designed to deal with biological data, allowing only for a limited number of possible character states or restricting transition probabilities to biologically meaningful values, which would not make sense in linguistic applications.
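The difference between binary cognate coding and partial cognate relations can be illustrated with the three words for ‘moon’ mentioned above. The following Python sketch is purely illustrative. The segmentation into morphemes by character and the two helper functions are assumptions made for this example and are not part of the mln implementation.

```python
# Words for 'moon', segmented into morphemes (written as characters).
moon = {
    "Shànghǎi": ["月", "光"],
    "Wēnzhōu": ["月", "光", "佛"],
    "Běijīng": ["月", "亮"],
}


def full_cognate_ids(words):
    """Binary coding: every distinct whole word forms its own cognate set."""
    ids = {}
    return {
        variety: ids.setdefault(tuple(morphemes), len(ids) + 1)
        for variety, morphemes in words.items()
    }


def shared_morphemes(word_a, word_b):
    """Partial cognacy: list the morphemes of word_a that recur in word_b."""
    return [morpheme for morpheme in word_a if morpheme in word_b]


print(full_cognate_ids(moon))
print(shared_morphemes(moon["Shànghǎi"], moon["Wēnzhōu"]))
print(shared_morphemes(moon["Shànghǎi"], moon["Běijīng"]))
```

Under whole-word coding, the three forms receive three unrelated identifiers. The morpheme-level comparison shows that Shànghǎi and Wēnzhōu share two morphemes, while Shànghǎi and Běijīng share only one, which is the kind of graded relation a binary gain-loss model cannot express.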
The restriction to use tree-topologies as the sole evidence for proposed horizontal language relations is another shortcoming of the mln approach. Given that processes of horizontal transfer go along with areal proximity between language varieties, it seems tempting to take geographic information into account when dealing with questions of language contact. List et al. (2014b) illustrate how minimal spatial networks can be reconstructed from minimal lateral networks by taking areal information into account and drawing horizontal edges between language varieties on a geographic map. The study shows that minimal spatial networks add a useful perspective to the mln approach and may help to reveal cases of close language contact due to areal proximity. However, since they only reflect contemporary language varieties, they can only reveal recent cases of language contact. This is because deeper relations, involving contact between ancestral languages which are represented as internal nodes in minimal lateral networks, are not displayed. This reflects a general problem of using geographic information when trying to identify contact relations between languages: Given that we know that waves of migration may play a crucial role during language history, contemporary areal data often provides little evidence regarding deeper layers of contact, and our methods should carefully balance the different pieces of evidence on which they are based.
The obvious shortcomings involving the mln approach show that the frequently mentioned parallels between biological evolution and language history cannot be strained ad infinitum. The application of biological methods in historical linguistics should always be based on a careful adaptation rather than a simple transfer. Nevertheless, even in their current, simplified form, minimal lateral networks show promising results. They are a useful starting point for further investigations in historical linguistics in general, and in Chinese dialectology in particular, and it seems worthwhile to develop and test them further.
The supplementary material accompanying this study consists of four parts, including the source code to replicate all analyses discussed in this study (i), a pdf file listing the dataset in human-readable format (ii), a pdf file containing the plots of all reference trees used in this study (iii), and a pdf file containing the plots of mln analyses for all reference trees (iv). The supplementary material can be found at: http://dx.doi.org/10.5281/zenodo.16760.
This research was partially supported by the DFG grant 261553824 (http://gepris.dfg.de/gepris/projekt/261553824). I would like to thank Wáng Fēng for being so kind as to provide the data on Chinese dialects that was used in this study, and the two anonymous reviewers for their helpful comments.
Běijīng Dàxué 北京大學 (ed.). 1962. 《漢語方音字匯》 [Chinese dialect character pronunciation list]. 北京：文字改革出版社.
Dessimoz, Christophe, Daniel Margadant and Gaston H. Gonnet. 2008. DLIGHT – Lateral gene transfer detection using pairwise evolutionary distances in a statistical framework. In: M. Vingron and L. Wong (eds.). Research in Computational Molecular Biology. Berlin and Heidelberg: Springer. 315–330.
Hóu Jīngyī 侯精一 (ed.). 2004. 《現代漢語方言音庫》 [Phonological database of Chinese dialects]. 上海：上海教育出版社.
Lǐ Xiǎofán 李小凡. 2005. 〈漢語方言分區方法再認識〉 [Reevaluating the classification of the Chinese dialects]. 《方言》 4: 356–363.
List, Johann-Mattis, Steven Moran, Peter Bouda, Johannes Dellert, Taraka Rama and Robert Forkel. 2015. LingPy. Python library for automatic tasks in historical linguistics. Version 2.4.1-alpha (Uploaded on 2015-03-16). URL: http://lingpy.org. doi (Zenodo): http://dx.doi.org/10.5281/zenodo.16093.
Mirkin, Boris G., Trevor I. Fenner, Michael Y. Galperin and Eugene V. Koonin. 2003. Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evolutionary Biology 3 (2).
Qián Nǎiróng 錢乃榮. 2007. 《上海方言》 [The Shànghǎi dialect]. 上海：文匯出版社.
Sagart, Laurent. 2011. Classifying Chinese dialects/Sinitic languages on shared innovations. Paper presented at the Séminaire Sino-Tibétain du CRLAO (Paris 03/28/2011).
Wáng Hóngjūn 王洪君. 2009. 〈兼顧演變、推平和層次的漢語方言歷史關系模型〉 [A historical relation model of Chinese dialects with multiple perspectives of evolution level and stratum]. 《方言》 3: 204–218.
Wurm, S. A. and Liu Yongquan (eds.). 1987. 《中國語言地圖集》 [Language atlas of China]. Hong Kong: Longman Group.
Yóu Rǔjié 游汝杰. 1992. 《漢語方言學導論》 [Chinese dialectology]. 上海：上海教育出版社.
Zhōngguó Shèhuì Kēxuéyuàn Yǔyán Yánjiūsuǒ 中國社會科學院語言研究所 (ed.). 1955. 《方言調查字表》 [List of characters for research on Chinese dialects]. 北京：商務印書館.
1 Figure with modifications taken from http://en.wikipedia.org/wiki/File:Map_of_sinitic_dialect_-_English_version.svg.
2 For a typical diagnostic character list, compare the Fāngyán diàochá zìbiǎo 方言調查字表 (Zhōngguó Shèhuì Kēxuéyuàn 1955).
3 Note that the conversion to distances or characters is crucial for all approaches making use of bioinformatic software. The nature of the original data, be it words, cognate sets, or even structural data (as used, for example, in Dunn et al. 2005).
4 The network was reconstructed with the help of the NeighborNet algorithm applied to a small excerpt of the dataset by Hóu (2004) as prepared for the study by List et al. (2014). The network was visualized with the help of the SplitsTree software package (Huson 1998). The original dataset and the distance matrix fed to the algorithm are available in Supplementary Material i.
5 The minimal lateral network was reconstructed using the same dataset as used for the NeighborNet analysis and the most recent implementation of the mln method provided as part of the LingPy software package (List and Moran 2013, Version 2.4.1-alpha, List et al. 2015). The original dataset and a script that reproduces the analysis are available in Supplementary Material i.
6 For a coding example on how the mln method can be applied to Chinese dialect data, see: https://gist.github.com/LinguList/7481097.
7 The borrowing rate was chosen in such a way that it cannot exceed 25%, but it can be much less.
8 Scripts that replicate the analyses discussed in this section are provided in Supplementary Material i (C).
9 Plots for the minimal lateral networks of all other hypotheses are given in Supplementary Material iv.