Gaming artificial phylogenies

in Language Dynamics and Change

The reconstruction of phylogenies of cultural artefacts is an open problem that combines theoretical and computational challenges. Existing benchmarks rely either on simulated phylogenies, which unavoidably build in hypotheses about the underlying evolutionary mechanisms, or on real-data phylogenies, for which the true evolutionary history is unknown. Here we introduce a web-based game, Copystree, in which users create phylogenies of manuscripts through successive copying actions in a fully monitored setup. While players enjoy the experience, Copystree makes it possible to build artificial phylogenies whose evolutionary processes do not obey any predefined theoretical mechanism and are instead generated with the unpredictability of human creativity. We present an analysis of the data gathered during the first set of experiments and use the resulting artificial phylogenies for a first test of existing phylogenetic algorithms.







  • Figure 1

    User interface of Copystree. The challenge of the game is to copy a text as accurately as possible under time constraints, while the readability of the text is progressively and artificially reduced. The text to be copied is presented to the player in a non-editable graphic format, to prevent cut-and-paste actions, and input is allowed only through a standard HTML text field. At the end of each gaming session, players are given a score based on the similarity between the copy they produced and the text they were prompted with. Higher similarities result in higher scores. The details of the scoring system are not disclosed to players.

  • Figure 2

    Evolution of a single copy. To mimic the degradation processes that manuscripts and old books undergo during their lifetime, each copy of a text is associated with an independent phylogenetic lineage, through which the text is progressively degraded. Each fragment can thus be copied several times, each time with a different level of degradation, each new copy being the starting point of a new lineage. Because of the reduced readability of the original text, several variants, for example new words (here highlighted in red), may emerge in the new copies.

  • Figure 3

    Phylogenetic structures of the artificial phylogenies. A: Schematic illustration of the creation of an artificial phylogeny. Starting from the original text (the root, brown circle), a binary tree is generated via successive copying actions. In each round, a player is presented with a text to copy, chosen from among the elements of the tree available for copying (represented by empty red squares). When copying is completed, the empty square becomes a solid red square and branches into two new nodes of the tree: the copy of the text entered by the player (a green circle) and another empty red square representing the degraded version of the text just copied. This operation is repeated through the successive rounds of the game. At each point in time, the phylogeny consists of a set of artificial texts (squares) and a set of copies (green circles). Only the artificial texts not yet copied (the empty red squares) are available for further copying. Artificial texts are generated to mimic the aging process of each copy, while each copy represents a new phylogenetic lineage (as shown in Fig. 2). A lineage is declared inactive and is no longer available for copying (black square) if the same fragment is skipped by users more than three times. B: A non-binary tree embeds the evolutionary relationships between all the copies of panel A. In this topology, copies 1, 5 and 7 are actually ancestral nodes of the copies below them, which is made explicit by giving the branches above them zero length. In this way, in the “true phylogeny” used as the reference for the inference, all the copies are treated as terminal nodes (this is needed because all inference algorithms infer a tree in which all the copies are leaves), while ancestral copies are still correctly reported as identical to the internal nodes above them.
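The zero-length-branch convention described for panel B of Figure 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `record_copy`, `newick_subtree` and `true_phylogeny` are hypothetical, the original text is treated like any ancestral copy, and every branch that is not a zero-length one gets a placeholder length of 1 because the caption does not say how the other lengths are set.

```python
from collections import defaultdict


def record_copy(children, source_text, new_copy):
    """Register new_copy as a direct descendant of source_text (hypothetical helper)."""
    children[source_text].append(new_copy)


def newick_subtree(children, node, length):
    """Return a Newick fragment in which every text appears as a leaf."""
    kids = children.get(node, [])
    if not kids:
        return f"{node}:{length}"
    # An ancestral text stays in the tree as a leaf on a zero-length branch,
    # hanging from the unnamed internal node that represents it.
    parts = [f"{node}:0"]
    parts += [newick_subtree(children, kid, 1) for kid in kids]  # 1 = placeholder length
    return "(" + ",".join(parts) + f"):{length}"


def true_phylogeny(children, root):
    """Newick string for the whole tree; the root carries no branch length."""
    subtree = newick_subtree(children, root, 0)
    return subtree.rsplit(":", 1)[0] + ";"


# Usage: the original text is copied twice, and copy c1 is itself copied twice.
children = defaultdict(list)
for source, copy in [("root", "c1"), ("root", "c2"), ("c1", "c3"), ("c1", "c4")]:
    record_copy(children, source, copy)
print(true_phylogeny(children, "root"))
# prints (root:0,(c1:0,c3:1,c4:1):1,c2:1);
```

Keeping ancestral copies as zero-length leaves lets the reference tree be compared with inferred trees, in which every copy is necessarily a leaf.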

  • Figure 4

    Degradation processes simulating the aging process of a text. Top: Dots, where circular colored spots of different sizes are randomly located at different positions to cover portions of the text. Center: Deletion of single characters in random positions of the text and replacement with blank spaces. Bottom: Multiple Deletions, with the deletion of up to three neighboring characters in randomly chosen locations of the text.
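The two text-based degradation processes of Figure 4 can be illustrated with a short Python sketch. The function names, the `n_deletions` parameter and the use of a seeded `random.Random` are assumptions made for the example. The caption states that single deleted characters are replaced with blank spaces. Applying the same replacement to runs of neighboring characters is an assumption here.

```python
import random


def delete_single(text, n_deletions, rng):
    """Replace up to n_deletions randomly chosen characters with blank spaces."""
    chars = list(text)
    positions = rng.sample(range(len(chars)), k=min(n_deletions, len(chars)))
    for pos in positions:
        chars[pos] = " "
    return "".join(chars)


def delete_multiple(text, n_deletions, rng, max_run=3):
    """Blank out runs of 1..max_run neighboring characters at random locations."""
    chars = list(text)
    if not chars:
        return text
    for _ in range(n_deletions):
        start = rng.randrange(len(chars))
        run = rng.randint(1, max_run)
        # Runs are clipped at the end of the text.
        for pos in range(start, min(start + run, len(chars))):
            chars[pos] = " "
    return "".join(chars)


# Usage: degrade the same fragment with both processes.
rng = random.Random(42)  # fixed seed so the degradation is reproducible
fragment = "It lies not in our power to love or hate"
print(delete_single(fragment, 5, rng))
print(delete_multiple(fragment, 3, rng))
```

Both functions preserve the length of the text, so character positions in the degraded version still line up with the source.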

  • Figure 5

    Statistics of the database collected in the preliminary session. A: Scatter plot for the size of the artificial phylogenies (x axis) for the three languages adopted to generate the phylogenies. Different colors denote different degradation processes (see legend). B: Histogram of the cumulative gaming time per user. C: Histogram of the number of copies per user.

  • Figure 6

    An example of an artificial phylogeny. A: The root of the artificial phylogeny, taken from Hero and Leander by Christopher Marlowe. During the gaming sessions, this text was copied 10 times, following the scheme illustrated in Fig. 3. Here we report the non-binary phylogeny describing the diversification process of the set of copies. B: Examples of two copies belonging to this artificial phylogeny. The texts differ from the root because of accidental typos (marked in blue) and because new words have emerged during the evolution (marked in red). C: Variants that emerged during the evolution of the text. Numbers indicate the tree branch (as marked in panel A) where the variant appeared. Several events of parallel evolution can be identified, where the same word has emerged in two independent lineages (words marked in orange).

  • Figure 7

    Mutation rates. A: Mean value and standard deviation of the edit distance (top) and number of variants (i.e. different words) measured between two consecutive copies (bottom), for the three different degradation processes considered. B: Same information as in A but evaluated as a function of the number of copies away from the original text. We show in grey the expected value (plus/minus standard deviation) of both the edit distance and the number of observed variants, under the hypothesis of independent changes (i.e. linear extrapolations of the values of A after many copies). C: Examples of the evolution of a text after multiple copies; changes are highlighted in blue. In the first case, a typo introduced after the first copy is restored in the subsequent one. In the second case, the introduction of a new variant in the first copy induces a change of the semantic content of the sentence, which is retained in the subsequent copy. These examples are taken from the tree of copies of Hero and Leander by Christopher Marlowe (same as Fig. 6).
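The quantities plotted in Figure 7 can be sketched as follows. The edit distance is the standard Levenshtein distance. `count_variants` and `expected_after` are hypothetical helpers, and both rest on assumptions: a variant is counted as a distinct word of the child that does not occur in the parent, and the independence baseline uses a mean that grows linearly with the number of copies k and a standard deviation that grows as the square root of k. The caption only mentions a linear extrapolation, so the square-root scaling is an assumption that follows from independence.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def count_variants(parent, child):
    """Number of distinct words in child that do not occur in parent (assumed definition)."""
    parent_words = set(parent.lower().split())
    return sum(1 for word in set(child.lower().split()) if word not in parent_words)


def expected_after(k, mean_per_copy, std_per_copy):
    """Mean and standard deviation after k copies if changes were independent."""
    return k * mean_per_copy, math.sqrt(k) * std_per_copy


# Usage: compare two consecutive copies, then extrapolate to four copies.
parent = "who ever loved that loved not at first sight"
child = "who ever lived that loved not at first sigh"
print(edit_distance(parent, child), count_variants(parent, child))
print(expected_after(4, 2.0, 1.0))
```

Measured values that fall below this baseline would indicate that later copies undo or reuse earlier changes, as in the restored typo of panel C.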

  • Figure 8

    Comparison between binary and non-binary trees. Top: Example of a compatible (blue) and a non-compatible (red) edge between a non-binary tree (left, orange) and a binary tree, as considered in the Generalized Robinson-Foulds distance. Bottom: Example of a compatible (blue) and a non-compatible (red) quartet between a non-binary tree (left, orange) and a binary tree, as considered in the Generalized Quartet Distance.
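The edge-compatibility test illustrated in the top panel of Figure 8 can be sketched in Python. This is one plausible reading, not the paper's definition of the Generalized Robinson-Foulds distance, whose exact form and normalization are not given here. Trees are written as nested tuples of leaf labels. An edge is identified with the split of the leaves it induces. Two splits are taken to be compatible when at least one of the four pairwise intersections of their sides is empty. The name `incompatible_fraction` is hypothetical.

```python
def leaves_and_clades(tree):
    """Return (leaf set, leaf sets below each internal node) for a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves, clades = frozenset(), []
    for child in tree:
        child_leaves, child_clades = leaves_and_clades(child)
        leaves |= child_leaves
        clades += child_clades
    clades.append(leaves)
    return leaves, clades


def nontrivial_splits(tree):
    """Return (leaf set, splits with at least two leaves on each side)."""
    all_leaves, clades = leaves_and_clades(tree)
    splits = set()
    for clade in clades:
        rest = all_leaves - clade
        if len(clade) > 1 and len(rest) > 1:
            splits.add(frozenset([clade, rest]))
    return all_leaves, splits


def compatible(split_a, split_b):
    """Two splits can coexist in one tree if some pair of sides is disjoint."""
    return any(not (x & y) for x in split_a for y in split_b)


def incompatible_fraction(reference, inferred):
    """Fraction of the inferred tree's edges that contradict the reference tree."""
    ref_leaves, ref_splits = nontrivial_splits(reference)
    inf_leaves, inf_splits = nontrivial_splits(inferred)
    if ref_leaves != inf_leaves:
        raise ValueError("trees must share the same leaf set")
    if not inf_splits:
        return 0.0
    bad = sum(1 for s in inf_splits
              if not all(compatible(s, r) for r in ref_splits))
    return bad / len(inf_splits)


# Usage: a non-binary reference tree against a binary inferred tree.
reference = (("a", "b", "c"), "d", "e")
inferred = ((("a", "b"), "d"), ("c", "e"))
print(incompatible_fraction(reference, inferred))
```

Under this reading, an inferred edge that merely resolves a multifurcation of the reference tree is not penalized. Only edges that contradict the reference tree count as errors.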

  • Figure S1

    Accuracy of the reconstruction. We study here the accuracy of the reconstructed phylogenetic trees of our dataset as a function of the size of the phylogeny, i.e., the amount of copied text. In this plot we include all the phylogenies of our dataset (i.e., all three degradation processes considered together); trees are then grouped into classes of 5 elements, for which we show the mean value of the GQD and GRF (y axis) as a function of the mean size of the phylogeny N (x axis). Trees were reconstructed with the three distance-based algorithms considered in this context (see main text): FastME, Neighbor-Joining (NJ) and Fast-SBiX.
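The grouping used in Figure S1 (classes of 5 trees, with the mean score plotted against the mean size) can be sketched as follows. The helper name `bin_by_size` is hypothetical. Sorting the trees by size before grouping and allowing a smaller final class are assumptions that the caption does not spell out.

```python
from statistics import mean


def bin_by_size(records, class_size=5):
    """Group (size, score) pairs into classes and average each class.

    records: iterable of (phylogeny size N, distance score) pairs.
    Returns a list of (mean size, mean score) pairs, one per class.
    """
    ordered = sorted(records)  # assumption: classes are formed after sorting by size
    binned = []
    for start in range(0, len(ordered), class_size):
        group = ordered[start:start + class_size]
        binned.append((mean(size for size, _ in group),
                       mean(score for _, score in group)))
    return binned


# Usage with made-up numbers (not data from the paper).
records = [(n, n / 10) for n in range(1, 11)]
for mean_size, mean_score in bin_by_size(records):
    print(mean_size, round(mean_score, 3))
```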
