One of the best-known types of non-independence between languages is caused by genealogical relationships, that is, descent from a common ancestor. These relationships can be represented by language family trees, which are more or less resolved and more or less controversial. In theory, one can argue that language families should be built through the strict application of the comparative method of historical linguistics, but in practice this is not always the case, and there are several proposed classifications of languages into families, each with its own advantages and disadvantages. A major stumbling block shared by most of them is that they are relatively difficult to use with computational methods, in particular phylogenetic ones. This is due to their lack of standardization, coupled with the general unavailability of branch length information, which encodes the amount of evolution taking place along the branches of the family tree. In this paper I introduce a method (and its implementation in R) that converts the language classifications provided by four widely used databases (Ethnologue, WALS, AUTOTYP and Glottolog) into the Newick format, the de facto standard in phylogenetics, aligns the four most widely used conventions for unique identifiers of linguistic entities (ISO 639-3, WALS, AUTOTYP and Glottocode), and adds branch length information from a variety of sources (the tree’s own topology, an externally given numeric constant, or a distance matrix). The R scripts, input data and resulting Newick trees are available under liberal open-source licenses in a GitHub repository (