Making genealogical language classifications available for phylogenetic analysis

Newick trees, unified identifiers, and branch length

In: Language Dynamics and Change
Author: Dan Dediu, Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands

Abstract

One of the best-known types of non-independence between languages is caused by genealogical relationships due to descent from a common ancestor. These can be represented by (more or less resolved and controversial) language family trees. In theory, one can argue that language families should be built through the strict application of the comparative method of historical linguistics, but in practice this is not always the case, and there are several proposed classifications of languages into language families, each with its own advantages and disadvantages. A major stumbling block shared by most of them is that they are relatively difficult to use with computational methods, and in particular with phylogenetics. This is due to their lack of standardization, coupled with the general non-availability of branch length information, which encapsulates the amount of evolution taking place on the family tree. In this paper I introduce a method (and its implementation in R) that converts the language classifications provided by four widely used databases (Ethnologue, WALS, AUTOTYP and Glottolog) into the de facto Newick standard generally used in phylogenetics, aligns the four most-used conventions for unique identifiers of linguistic entities (ISO 639-3, WALS, AUTOTYP and Glottocode), and adds branch length information from a variety of sources (the tree’s own topology, an externally given numeric constant, or a distance matrix). The R scripts, input data and resulting Newick trees are available under liberal open-source licenses in a GitHub repository (https://github.com/ddediu/lgfam-newick), to encourage and promote the use of phylogenetic methods to investigate linguistic diversity and its temporal dynamics.
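
To make the conversion described in the abstract concrete, the following is a minimal sketch in Python of how a nested classification might be serialized to Newick with a constant branch length. It is not the paper's R implementation: the function names `quote_label` and `to_newick`, the `(name, children)` tuple layout, and the toy Germanic subtree are all assumptions made for illustration, and only the "externally given numeric constant" option for branch lengths is shown.

```python
import re


def quote_label(name):
    """Single-quote a Newick label if it contains whitespace or reserved characters."""
    if re.search(r"[\s()\[\]':;,]", name):
        # Inside a quoted label, a single quote is escaped by doubling it.
        return "'" + name.replace("'", "''") + "'"
    return name


def to_newick(node, branch_length=None, is_root=True):
    """Serialize a (name, children) tuple to a Newick string.

    If branch_length is given, the same constant is attached to every
    non-root node (an illustrative choice, not the paper's algorithm).
    """
    name, children = node
    text = quote_label(name)
    if children:
        inner = ",".join(to_newick(child, branch_length, False) for child in children)
        text = "(" + inner + ")" + text
    if branch_length is not None and not is_root:
        text += ":" + format(branch_length, "g")
    return text + (";" if is_root else "")


# Hypothetical toy classification; leaves are labeled with ISO 639-3 codes.
toy = ("Germanic", [
    ("West Germanic", [("eng", []), ("deu", [])]),
    ("North Germanic", [("swe", [])]),
])

print(to_newick(toy))                   # topology only
print(to_newick(toy, branch_length=1))  # constant branch length on every edge
```

Labels containing spaces are quoted so that the resulting string stays parseable by standard Newick readers; branch lengths derived from the tree's topology or from a distance matrix would need additional logic that this sketch does not attempt.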
