
PoeTree: Poetry Treebanks in Czech, English, French, German, Hungarian, Italian, Portuguese, Russian, Slovenian and Spanish

In: Research Data Journal for the Humanities and Social Sciences
Authors:
Petr Plecháč Institute of Czech Literature, Czech Academy of Sciences, Prague, Czechia

Silvie Cinková Institute of Czech Literature, Czech Academy of Sciences, Prague, Czechia
Charles University, Prague, Czechia

Robert Kolár Institute of Czech Literature, Czech Academy of Sciences, Prague, Czechia

Artjoms Šeļa Institute of Polish Language, Polish Academy of Sciences, Warsaw, Poland

Mirella De Sisto Tilburg University, Tilburg, the Netherlands

Lara Nugues University of Basel, Basel, Switzerland

Thomas Haider University of Passau, Passau, Germany

Neža Kočnik University of Ljubljana, Ljubljana, Slovenia

Open Access

Abstract

This article presents a set of standardised corpora of poetry comprising over 330,000 poems in ten languages (Czech, English, French, German, Hungarian, Italian, Portuguese, Russian, Slovenian, and Spanish). Each corpus has been deduplicated, enriched with Universal Dependencies, provided with additional metadata, and converted into a unified JSON structure.

  1. Related data set “PoeTree. Poetry Treebanks in Czech, English, French, German, Hungarian, Italian, Portuguese, Russian, Slovenian and Spanish” with doi www.doi.org/10.5281/zenodo.10907309 in repository “Zenodo”
  2. Also access the data through the REST API (as of April 2024): https://versologie.cz/poetree/api_doc
  3. Also access the data through the Python library (as of April 2024): www.pypi.org/project/poetree
  4. Also access the data through the R library (as of April 2024): www.github.com/perechen/poetRee

1. Introduction

With advances in computational literary studies, the demand for open multilingual datasets has been increasing, be it for the purpose of comparative literary research (Storey & Mimno, 2020; Šeļa et al., 2022), as a benchmark for new stylometric methods (Du et al., 2022; Plecháč, 2021), or as training data for multi-lingual models that aim to enhance literary text annotation and processing pipelines (Bamman, 2021; Byszuk et al., 2020; de la Rosa, 2023). Several relevant resources are already available for prose fiction, including the European Literary Text Collection or ELTeC (Odebrecht et al., 2021) and benchmark corpora built by the Computational Stylistics Group (2023). In addition to these, the expansive DraCor project (Fischer et al., 2019) contains dramatic texts across numerous languages and periods. This leaves poetry, the last of the three main literary genres, without a dedicated resource, a situation that hinders research in computational poetics and comparative poetry studies.

Several monolingual corpora of poetry have already been built (Bobenhausen & Hammerich, 2015; Delente & Renault, 2021; Grishina et al., 2009; Haider, 2021a; Horváth et al., 2022; Mittmann et al., 2019; Navarro-Colorado et al., 2017; Plecháč & Kolár, 2015; Ruiz Fabo et al., 2021), yet their structures and tag sets are not mutually compatible, and the depth of their annotation varies. While the recently released Python library Averell (Díaz Medina et al., 2021) aims to transform these resources into a unified JSON output, it is not well adapted to the structural peculiarities of the original datasets. Consequently, a large part of the data is lost (out of more than 18,000 poems in the French corpus, for example, only 5,081 make it to the JSON output; similarly, almost 15,000 poems are lost from the Italian corpus).

In this article, we present a dataset entitled PoeTree (Poetry Treebanks), comprising poetry corpora in ten different languages (Czech, English, French, German, Hungarian, Italian, Portuguese, Spanish, Slovenian, and Russian), with a total of more than 330,000 poems / 89,000,000 tokens. All texts have been deduplicated, morphologically tagged, and parsed for syntactic dependencies with UDPipe. All information is encoded in a shared simple JSON structure.

2. Resources

  1. PoeTree deposited at Zenodo – doi:www.doi.org/10.5281/zenodo.10907309
  2. Other access points (as of April 2024)
    1. REST API – url:https://versologie.cz/poetree/api_doc
    2. Python library – url:www.pypi.org/project/poetree
    3. R library – url:www.github.com/perechen/poetRee
  3. Temporal coverage: 13th century-20th century; 2009–2023 (construction)

The data stem from the following resources (we refer to each corpus by its ISO 639-1 language code):

  1. cs: The Corpus of Czech Verse (Plecháč & Kolár, 2015)
  2. de: German Poetry Corpus (TextGrid and DTA) (Bobenhausen & Hammerich, 2015; Haider, 2021a; 2021b; 2023; 2024)
  3. es: Corpus of Spanish Golden Age Sonnets (Navarro-Colorado et al., 2017) + Diachronic Spanish Sonnet Corpus (Ruiz Fabo et al., 2021)
  4. fr: Corpus Malherbə (Delente & Renault, 2021)
  5. hu: ELTE Poetry Corpus (Horváth et al., 2022)
  6. it: Biblioteca italiana (2023)
  7. pt: Poemas (Mittmann et al., 2019)
  8. ru: Corpus of Russian Poetry (Grishina et al., 2009)

To the best of our knowledge, there are currently two open corpora of English poetry (Parrish, 2018, and Haider, 2021b, 2023), both based on texts available at Project Gutenberg (2023). The former is known, however, to be heavily contaminated by non-versified documents (fiction, comments, etc.; cf. Pace-Sigge, 2019), while the latter, in its efforts to clean the data of these contaminants, seems to go too far, omitting a large part of the original data. In light of this, we have decided to compile en from scratch for the sake of the PoeTree collection. The texts were acquired from Project Gutenberg through GutenTag (Brooke et al., 2015). Although each text has been manually checked for tagging errors, some of these were beyond repair. This concerns chunks of verse that the system misclassified as prose and whose lines were merged into a single paragraph. These parts were thus omitted from the final corpus. Such omissions are rather infrequent, but they remain a known defect of en.

In addition to these, the Slovenian corpus (sl) has been compiled from texts available on the Wikisource platform, part of which were published within the project “Slovenska leposlovna klasika”, financed by the Ministry of Culture of the Republic of Slovenia.

As Figure 1 shows, the corpora vary widely in both size and time coverage. PoeTree is thus by no means a balanced dataset, nor does it aim to be one: balancing poetry in “big” languages with long-standing traditions against poetry in “small” ones would necessarily mean discarding data from the former. If certain research tasks require balance, we think downsampling methods will better meet researchers’ needs.
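One simple form of such downsampling can be sketched as follows. This is only an illustration under stated assumptions: the input shape (pairs of language code and poem identifier) and the function name `downsample_by_language` are hypothetical and not part of PoeTree or its libraries.

```python
import random
from collections import defaultdict


def downsample_by_language(poems, per_language=None, seed=0):
    """Draw an equally sized random sample of poem ids for every language.

    `poems` is an iterable of (language_code, poem_id) pairs (a hypothetical
    input shape). If `per_language` is None, the size of the smallest corpus
    is used; a larger requested size is capped at that value.
    """
    by_lang = defaultdict(list)
    for lang, poem_id in poems:
        by_lang[lang].append(poem_id)
    if not by_lang:
        return {}
    smallest = min(len(ids) for ids in by_lang.values())
    k = smallest if per_language is None else min(per_language, smallest)
    rng = random.Random(seed)  # fixed seed so that the sample is reproducible
    return {lang: sorted(rng.sample(ids, k)) for lang, ids in sorted(by_lang.items())}


# Toy data: ten "cs" poems and three "sl" poems yield three poems per language.
toy = [("cs", i) for i in range(10)] + [("sl", i) for i in range(3)]
balanced = downsample_by_language(toy)
```

Sampling by period or by author within each language would follow the same pattern with a different grouping key.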

A visual representation of the number of poems available for each language in each time period. Focal points are the late 18th century in Russian, around 1500 in Italian, the 19th century in French, English and Czech, and from 1600 to 1900 in German.
Figure 1

Number of poems (duplicates excluded) matched to the years of birth of their authors (25-year range)

Citation: Research Data Journal for the Humanities and Social Sciences 9, 1 (2024) ; 10.1163/24523666-bja10044

3. Cleaning Data

The aforementioned resources occasionally contained texts written in foreign languages. We tried to minimise such cases by means of automatic language detection. For each poem, we used the langdetect Python library (Danilák, 2021) to obtain the probabilities of candidate languages. Where the probability of the corpus language was lower than 0.99, the poem was checked manually and, where appropriate, removed. (Even with the threshold set this high, the number of poems to inspect was in the lower hundreds.)
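The thresholding step can be sketched as below. This is a minimal illustration rather than the authors' script: the helper names are hypothetical, and the only langdetect calls used are `detect_langs` (which returns candidates with `.lang` and `.prob` attributes) and `DetectorFactory.seed`. The import is kept inside the helper so that the thresholding logic can be read and run without the third-party dependency.

```python
def needs_manual_check(lang_probs, expected_lang, threshold=0.99):
    """Return True if the expected language scores below the threshold.

    `lang_probs` maps language codes to detection probabilities; a language
    that is missing from the mapping counts as probability 0.
    """
    return lang_probs.get(expected_lang, 0.0) < threshold


def detect_probabilities(text):
    """Return {language code: probability} for `text` using langdetect."""
    # Third-party dependency, imported lazily (pip install langdetect).
    from langdetect import DetectorFactory, detect_langs

    DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded
    return {candidate.lang: candidate.prob for candidate in detect_langs(text)}


# Hand-written probabilities standing in for detector output:
flag_mixed = needs_manual_check({"fr": 0.57, "it": 0.43}, "fr")
flag_clean = needs_manual_check({"cs": 0.999996}, "cs")
```

In practice one would call `needs_manual_check(detect_probabilities(poem_text), corpus_language)` for every poem and review the flagged ones by hand.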

Another problem was posed by the existence of duplicates, which is to say multiple identical or slightly differing texts entering a single corpus from different editions. We aimed to identify these by means of approximate substring matching which – unlike the plain vanilla edit distance – is also able to capture cases such as A. Ducros’ poem ‘Les rubans de Marie’, which occurs in fr twice: once encoded as a single poem (coming from the 1854 book Les Capricieuses) and once split into four parts ‘Ruban blanc’, ‘Ruban bleu’, ‘Ruban vert’, and ‘Ruban noir’ (coming from the 1896 collection Les Caresses d’antan). The procedure was as follows:

  1. Let the similarity of poems A and B, containing |A| and |B| characters respectively with |A| ≥ |B|, be defined as:
    $$\mathrm{sim}(A, B) = 1 - \frac{\min\bigl(\mathrm{lev}(a_1, B), \ldots, \mathrm{lev}(a_n, B)\bigr)}{|B|}$$
    where $\{a_1, \ldots, a_n\}$ is the set of all possible substrings of A and $\mathrm{lev}(a_x, B)$ is the Levenshtein distance between $a_x$ and B.
  2. For each author in each corpus construct an undirected graph where nodes represent their poems and an edge exists between A and B if sim(A, B) > 0.75. (In all corpora, the distributions of poem pairs’ similarities are strongly bimodal [see Figure 2], with a major peak between 0.4 and 0.5 [completely unrelated texts] and a minor peak at 1 [completely identical texts]. The threshold of 0.75, above which poems are considered duplicates, roughly corresponds to the distributions’ local minima.) An example of such a graph for M. Arnold is given in Figure 3a.
  3. For each component of each graph mark one of its nodes as a primary variant and the rest as its duplicates in the following way:
    1. if the component is complete (see Figure 3b):
      1. limit the primary variant candidates to the poems with the highest number of lines
      2. if multiple candidates remain and if the year of creation/publication is known for all of them, limit the candidate set to the earliest ones
      3. if multiple candidates remain, select the primary variant at random
    2. else if the component is a star and the central node is a poem with the highest number of lines (see Figure 3c), mark the central node as the primary variant
    3. else: determine the primary variant manually (see Figure 3d)
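Steps 1 and 2 can be sketched as follows. This is an illustrative reimplementation, not the authors' deduplication script: the minimum Levenshtein distance between B and any substring of A is computed with the standard dynamic-programming scheme for approximate substring matching (free start and free end in A), and the function names are hypothetical.

```python
from itertools import combinations


def min_substring_distance(a, b):
    """Smallest Levenshtein distance between `b` and any substring of `a`."""
    # Row 0 is all zeros: a match may start at any position of `a` for free.
    prev = [0] * (len(a) + 1)
    for i, char_b in enumerate(b, start=1):
        curr = [i] + [0] * len(a)
        for j, char_a in enumerate(a, start=1):
            curr[j] = min(
                prev[j] + 1,                        # char_b has no counterpart
                curr[j - 1] + 1,                    # char_a has no counterpart
                prev[j - 1] + (char_a != char_b),   # match or substitution
            )
        prev = curr
    # Minimum over the last row: a match may end at any position of `a`.
    return min(prev)


def sim(a, b):
    """Similarity as defined in step 1; the longer text plays the role of A."""
    if len(a) < len(b):
        a, b = b, a
    if not b:
        return 1.0
    return 1 - min_substring_distance(a, b) / len(b)


def similarity_graph(poems, threshold=0.75):
    """Build the graph of step 2 for one author; `poems` maps id to text."""
    graph = {poem_id: set() for poem_id in poems}
    for p, q in combinations(poems, 2):
        if sim(poems[p], poems[q]) > threshold:
            graph[p].add(q)
            graph[q].add(p)
    return graph


# Toy data: a poem that also occurs as a separately published part.
toy_poems = {
    "whole": "ruban blanc. ruban bleu. ruban vert.",
    "part": "ruban bleu.",
    "other": "xyzxyzxyzxyz",
}
toy_graph = similarity_graph(toy_poems)
```

Because the shorter text is matched against substrings of the longer one, the separately published part reaches a similarity of 1 with the whole poem, which plain edit distance over the full texts would not give. The quadratic cost per pair means that a real run over a large corpus would need a faster implementation.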

A collection of bar graphs. They show the level of similarity for poems which occur in multiple corpora for each language. Low levels of similarity are virtually non-existent. The peak lies around 50% similarity. Then we find a smaller number of pairs with higher similarity. Another peak occurs around 100% similarity, denoting the perfectly similar poems.
Figure 2

Distribution of poem pairs’ similarities in each corpus


A collection of connection graphs. Some have the form of a circle: multiple separated points on an outer ring connect to multiple unconnected points on an inner ring. Others have the form of a triangle, a cross or a line.
Figure 3

Deduplication through undirected graphs


Note: Orange indicates a primary variant. (A) Graph representing all poems of M. Arnold (en). (B) Two complete clusters from the graph of F. Hölderlin (de). ‘Der Wanderer’ (1797) is marked as a duplicate since it has fewer lines than ‘Der Wanderer’ (1800). Within the other component, ‘Die Dioskuren’ is ruled out on account of its length; ‘An Eduard’ (1801a) is then randomly selected as the primary variant since the two remaining poems have the same number of lines and come from the same year. (C) Star component from the graph of A. Ducros (fr). The central node is selected as the primary variant as it has the highest number of lines. (D) Component from the graph of E. Lešehrad (cs) to be resolved manually.

In this way, 20,999 complete components (88% of which comprised just two nodes) and 585 star components were deduplicated automatically, while 75 components were processed manually (usually concerning cases when a certain poem was continuously reworked up to the point that the similarity of the initial and the final variant was below the threshold). Figure 4 gives the corpora sizes after both language detection and deduplication steps.
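The distinction between complete, star, and manually resolved components can be sketched as below. This is an illustration of the selection rules described above, not the authors' code: the `Poem` record, the function names, and the line counts in the toy data are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Poem:
    poem_id: str
    n_lines: int
    year: Optional[int] = None  # year of creation/publication, if known


def components(graph):
    """Return the connected components of an adjacency dict {node: set}."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(graph[node] - comp)
        seen |= comp
        result.append(comp)
    return result


def is_complete(comp, graph):
    """True if every node of the component is linked to every other node."""
    return all(graph[node] >= comp - {node} for node in comp)


def star_centre(comp, graph):
    """Return the central node if the component is a star, otherwise None."""
    for centre in comp:
        leaves = comp - {centre}
        if graph[centre] >= leaves and all(graph[leaf] & comp == {centre} for leaf in leaves):
            return centre
    return None


def primary_variant(comp, graph, poems, rng=None):
    """Return the id of the primary variant, or None for manual resolution."""
    rng = rng or random.Random(0)
    longest = max(poems[node].n_lines for node in comp)
    if is_complete(comp, graph):
        candidates = [n for n in comp if poems[n].n_lines == longest]
        if len(candidates) > 1 and all(poems[n].year is not None for n in candidates):
            earliest = min(poems[n].year for n in candidates)
            candidates = [n for n in candidates if poems[n].year == earliest]
        return rng.choice(sorted(candidates))
    centre = star_centre(comp, graph)
    if centre is not None and poems[centre].n_lines == longest:
        return centre
    return None


# Toy data with invented line counts: a pair, a star, and a chain.
poems = {p.poem_id: p for p in [
    Poem("w1797", 80, 1797), Poem("w1800", 108, 1800),
    Poem("a", 10), Poem("b", 30), Poem("c", 12),
    Poem("x", 20), Poem("y", 5), Poem("z", 20),
]}
graph = {
    "w1797": {"w1800"}, "w1800": {"w1797"},
    "a": {"b"}, "b": {"a", "c"}, "c": {"b"},
    "x": {"y"}, "y": {"x", "z"}, "z": {"y"},
}
choices = {frozenset(c): primary_variant(c, graph, poems) for c in components(graph)}
```

In the toy data the pair resolves to the longer, later variant, the star resolves to its centre, and the chain whose middle poem is the shortest is left for manual resolution.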

A bar graph depicting the number of duplicate or foreign-language poems in the corpora for each language. Czech and German have the largest collections of about 80,000 entries, of which 10,000 are non-unique. Collections of about 40,000 entries exist in English, Italian and Russian.
Figure 4

The number of poems in each corpus after language detection and deduplication


Since this deduplication procedure may not suit every use case, each poem record preserves not only whether the poem was marked as a duplicate according to the steps outlined above but also its similarity to its 20 nearest neighbours, so that PoeTree users can apply other deduplication criteria. Deduplication scripts are available at https://github.com/versotym/poetree_deduplication. Interactive similarity graphs may be inspected in detail at https://versologie.cz/poetree/deduplication.

4. Enriching Data

Where available, author records are enriched with a VIAF ID and a Wikidata entity ID (the former identifier was already present in the ELTE Poetry Corpus and the Diachronic Spanish Sonnet Corpus). This makes it possible not only to unify pen names and alternate spellings under a single identity, but also to acquire additional metadata such as date of birth, date of death, and country of citizenship.

We enrich each poem with lemmatization, morphological tagging, and syntactic parsing according to the Universal Dependencies annotation scheme, using the multilingual UDPipe 2 parser (Straka, 2018). Lemmatization and morphological tags allow for retrieving words in specific contexts. Typically, a researcher might want to retrieve grammatical collocations that denote entities and their properties, or events with their participants and circumstances. This easily translates into nouns and their attributes, verbs and their arguments (subject, objects), and adjuncts (adverbials). Unlike bag-of-words approaches or ordinary linear searches, syntactic parsing allows for direct queries about syntactic elements, abstracting from auxiliary words, modifiers, and nested clauses that might obscure them.
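A query of the kind described above, retrieving nouns together with their adjectival attributes, can be sketched as follows. This is an assumption-laden illustration: the token keys `lemma`, `upos`, `head`, and `deprel` are borrowed from CoNLL-U field names and may differ from the actual PoeTree token schema, and `head` is assumed to hold the `id` of the governing token (`None` for the root). The parse of the toy line is written by hand.

```python
def noun_attributes(tokens):
    """Return (noun lemma, adjective lemma) pairs linked by an `amod` relation.

    `tokens` is a list of dicts for one line; the key names are hypothetical
    stand-ins modelled on CoNLL-U fields.
    """
    by_id = {token["id"]: token for token in tokens}
    pairs = []
    for token in tokens:
        if token.get("deprel") != "amod":
            continue
        head = by_id.get(token.get("head"))
        if head is not None and head.get("upos") in {"NOUN", "PROPN"}:
            pairs.append((head["lemma"], token["lemma"]))
    return pairs


# Hand-annotated toy line: "The silent moon rises"
line = [
    {"id": 0, "form": "The", "lemma": "the", "upos": "DET", "head": 2, "deprel": "det"},
    {"id": 1, "form": "silent", "lemma": "silent", "upos": "ADJ", "head": 2, "deprel": "amod"},
    {"id": 2, "form": "moon", "lemma": "moon", "upos": "NOUN", "head": 3, "deprel": "nsubj"},
    {"id": 3, "form": "rises", "lemma": "rise", "upos": "VERB", "head": None, "deprel": "root"},
]
pairs = noun_attributes(line)
```

The same pattern extends to verbs and their arguments by filtering on relations such as `nsubj` or `obj` instead of `amod`.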

The quality of morphosyntax-based information extraction depends on the quality of the automatic parsing. However, most language models have been trained on modern non-fiction, with the consequence that the rate at which they generate adequate results on older texts, especially in the case of poetry, may be lower than documented. The only way to assess the performance of a parser on a particular domain is by evaluating it on a manually annotated data set from that domain. We have performed such an evaluation of the largest Czech model (based on Prague Dependency Treebank newswire texts from the 1990s) on a random sample of 29 poems from the Czech PoeTree section. The performance was indeed lower in all standard metrics (see Figure 5), and a semi-manual error analysis revealed several systematic errors that would hamper proper extraction of relevant syntactic relations, especially concerning nouns as modifiers of other nouns, which tended to be attached instead as arguments of the nearest governing predicate (cf. Cinková et al., 2024). This finding reveals the need for a domain adaptation of the Czech model to older Czech (poetry) and calls for the same procedure to be performed on the other PoeTree languages.
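Two metrics commonly reported in such evaluations are the unlabelled and labelled attachment scores (UAS and LAS). The sketch below shows how they are computed from aligned gold and predicted annotations. The source does not list which metrics were used, so their choice here is an assumption, and the function name and toy annotations are hypothetical.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two aligned lists of (head, deprel) pairs.

    UAS counts tokens whose head is correct; LAS additionally requires the
    dependency label to be correct.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sample")
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return uas_hits / len(gold), las_hits / len(gold)


# Toy annotations for four tokens (head index, relation); 0 marks the root.
gold = [(2, "det"), (2, "amod"), (4, "nsubj"), (0, "root")]
pred = [(2, "det"), (4, "amod"), (4, "obj"), (0, "root")]
uas, las = attachment_scores(gold, pred)
```

In the toy data the second token is attached to the predicate rather than to its noun, the kind of error described above, which lowers both scores. The third token has the right head but the wrong label, which lowers LAS only.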

A bar graph depicting the success rate of different kinds of data enrichment and tagging in PoeTree, as compared to the Prague Dependency Treebank (PDT) as a benchmark. The success rate of PoeTree lies between 60 and 100% in these categories, while PDT achieves a success rate of above 90%.
Figure 5

UDPipe 2 performance with Prague Dependency Treebank (PDT) as compared to the testing portion of PoeTree.CS (6591 tokens)


5. Standardisation

In PoeTree, each poem is stored as a standalone JSON file with a standardised structure. We considered using TEI-XML, a format widely used in the community, but ultimately decided against it as a storage format for a corpus that focuses on encoding linguistic data rather than source and editorial information. JSON also provides operational ease in our case, as it can be easily manipulated across different coding frameworks and approaches, including by research communities that are not familiar with TEI. We invite the creation of converters and wrappers that adapt our corpus to TEI schemas or present it in other custom formats.

The top-level keys of the JSON structure are shown in Table 1. The last three keys hold complex data structures. The ‘source’ key holds an object comprising metadata on the particular book edition from which a text comes (see Table 2). The ‘author’ key may hold either an object (schema shown in Table 3) or an array of such objects in the case of poems with multiple authors or multi-author books where the authorship of particular poems is unknown. The ‘body’ key holds the text of the poem itself. It is an array where each element corresponds to a single line (see Table 4). In each line, there is a ‘words’ key which holds the linguistic analysis provided by UDPipe 2. The default CoNLL-U format (Universal Dependencies, 2013) is split into an array whose elements correspond to particular tokens (see Table 5). Note that unlike CoNLL-U we do not encode multiwords as standalone tokens, but rather delegate this information to an optional ‘multiword’ key of their components. For instance, while the Spanish phrase ‘Esperaré del mal’ gives 5 tokens in CoNLL-U:

1    Esperaré    …
2-3  del         …
2    de          …
3    el          …
4    mal         …

in PoeTree this is encoded as a 4-element list:

[
  {"id": 0, "form": "Esperaré", …},
  {"id": 1, "form": "de", …, "multiword": {"form": "del", "id": 1}},
  {"id": 2, "form": "el", …, "multiword": {"form": "del", "id": 1}},
  {"id": 3, "form": "mal", …}
]
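The conversion illustrated by this example can be sketched as follows. It is a minimal reconstruction from the example alone, not the authors' converter: the function name is hypothetical, only the `id` and `form` fields are carried over, and the `id` inside `multiword` is assumed to be the id of the first component, which reproduces the example but is not confirmed by the source.

```python
def conllu_to_poetree_tokens(conllu_lines):
    """Convert CoNLL-U token lines of one sentence to a PoeTree-style list.

    Multiword range lines (ids such as "2-3") do not become tokens; their
    surface form is attached to each component under a `multiword` key.
    """
    tokens = []
    pending = None  # (first id, last id, surface form) of the open range
    for line in conllu_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        columns = line.split("\t")
        token_id, form = columns[0], columns[1]
        if "-" in token_id:
            first, last = (int(part) for part in token_id.split("-"))
            pending = (first, last, form)
            continue
        if "." in token_id:
            continue  # empty nodes (ids such as "5.1") are ignored here
        number = int(token_id)
        token = {"id": number - 1, "form": form}  # ids become 0-based
        if pending is not None and pending[0] <= number <= pending[1]:
            # Assumption: the multiword id is the id of its first component.
            token["multiword"] = {"form": pending[2], "id": pending[0] - 1}
        tokens.append(token)
    return tokens


# The example phrase, with the remaining eight CoNLL-U columns left empty.
rows = [("1", "Esperaré"), ("2-3", "del"), ("2", "de"), ("3", "el"), ("4", "mal")]
sample = [f"{token_id}\t{form}\t" + "\t".join(["_"] * 8) for token_id, form in rows]
converted = conllu_to_poetree_tokens(sample)
```

A full converter would also carry over the lemma, tags, head, and relation columns, whose key names in PoeTree are defined in the token schema (Table 5).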

[Tables 1–5: top-level keys; ‘source’ object; ‘author’ object; ‘body’ line elements; ‘words’ token elements]

6. Conclusion and Future Plans

PoeTree in its current state offers an extensive dataset suitable for various tasks in (but not limited to) the fields of NLP, stylometry, and computational literary studies.

In the upcoming two years, we aim to evaluate the parser performance on other languages represented in PoeTree and, most importantly, to enrich PoeTree with rhyme detection, descriptions of fixed forms (sonnet, sestina, etc.), topic modelling, and recognised named entities linked to a common knowledge base (Wikidata). Furthermore, we plan to incorporate and standardise the annotation of poetic metres from the original resources (where available and permitted by the license) and to perform our own machine-driven metre detection in the remaining corpora. This would make PoeTree the only existing full-text dataset with comparative information on poetic forms that is aligned across languages. We hope this will enable research that was not possible before: from the evolution of poetic forms to the tracing of literary contacts across cultures, and answers to fundamental questions about the connection between form and meaning from a historical perspective.

Acknowledgements

The creation of this dataset was supported by the Czech Science Foundation (project GA23-07727S).

References

  • Bamman, D. (2021). BookNLP. GitHub. www.github.com/booknlp/booknlp.

  • Biblioteca italiana. (2023). Biblioteca italiana. www.bibliotecaitaliana.it.

  • Bobenhausen, K., & Hammerich, B. (2015). Métrique littéraire, métrique linguistique et métrique algorithmique de l’allemand mises en jeu dans le programme Metricalizer². Langages, 199, 67–87. www.cairn.info/revue-langages-2015-3-page-67.htm?contenu=article.

  • Brooke, J., Hammond, A., & Hirst, G. (2015). GutenTag: an NLP-driven tool for digital humanities research in the Project Gutenberg Corpus. In A. Feldman, A. Kazantseva, S. Szpakowicz, & C. Koolen (Eds.), Proceedings of the fourth workshop on computational linguistics for literature (pp. 42–47). Association for Computational Linguistics. www.doi.org/10.3115/v1/W15-0705.

  • Byszuk, J., Woźniak, M., Kestemont, M., Leśniak, A., Łukasik, W., Šeļa, A., & Eder, M. (2020). Detecting direct speech in multilingual collection of 19th-century novels. In R. Sprugnoli, & M. Passarotti (Eds.), Proceedings of LT4HALA 2020 – 1st workshop on language technologies for historical and ancient languages (pp. 100–104). ELRA. www.lrec-conf.org/proceedings/lrec2020/workshops/LT4HALA/pdf/2020.lt4hala-1.15.pdf.

  • Cinková, S., Plecháč, P., & Popel, M. (2024). Rhymes and syntax. Morpho-syntactic analysis of the Czech poetry. Primerjalna Književnost, 47(2), 65–88. www.doi.org/10.3986/pkn.v47.i2.04.

  • Computational Stylistics Group. (2023). Resources. https://computationalstylistics.github.io/resources.

  • Danilák, M. (2021). Langdetect. GitHub. www.github.com/Mimino666/langdetect.

  • de la Rosa, J., Pérez Pozo, Á., Ros, S., & González-Blanco, E. (2023). ALBERTI, a multilingual domain specific language model for poetry analysis. arXiv. www.doi.org/10.48550/arXiv.2307.01387.

  • Delente, É., & Renault, R. (2021). Projet Anamètre: présentation, limites et avancées. In A.-S. Bories, G. Purnelle, & H. Marchal (Eds.), Plotting poetry: On mechanically-enhanced reading (pp. 73–92). Presses universitaires de Liège.

  • Díaz Medina, A., Pérez Pozo, Á., & de la Rosa, J. (2021). Averell: A corpus management tool to transform poetic corpora into a JSON format compliant with the POSTDATA ontology (v1.2.2). Zenodo. www.doi.org/10.5281/zenodo.5702404.

  • Du, K., Dudar, J. & Schöch, C. (2022). Evaluation of measures of distinctiveness: classification of literary texts on the basis of distinctive words. Journal of Computational Literary Studies, 1(1). www.doi.org/10.48694/jcls.102.

  • Fischer, F., Börner, I., Göbel, M., Hechtl, A., Kittel, C., Milling, C., & Trilcke, P. (2019). Programmable corpora: Introducing DraCor, an infrastructure for the research on European drama. In Proceedings of DH2019. Utrecht University. www.doi.org/10.5281/zenodo.4284002.

  • Grishina, E., Korchagin, K., Plungian, V., & Sichinava, D. (2009). Poeticheskii korpus v ramkah nkria: obschaia struktura i perspektivy ispolzovania. In Natsionalnii korpus russkogo iazyka: 2006–2008. Novye rezultaty i perspektivy (pp. 71–113). Nestor-Istoria.

  • Haider, T. (2021a). A German Poetry Corpus / Deutsches Lyrik Korpus (DLK). GitHub. www.github.com/tnhaider/DLK.

  • Haider, T. (2021b). Metrical tagging in the wild: Building and annotating poetry corpora with rhythmic features. In Proceedings of the 16th conference of the European Chapter of the Association for Computational Linguistics: Main Volume (pp. 3715–3725). Association for Computational Linguistics. www.doi.org/10.18653/v1/2021.eacl-main.325.

  • Haider, T. (2023). A computational stylistics of poetry: Distant reading and modeling of German and English verse. Doctoral Thesis. In: OPUS University of Stuttgart. www.doi.org/10.18419/opus-12721.

  • Haider, T. (2024). A large annotated reference corpus of New High German Poetry. In Proceedings of LREC-COLING. Torino. www.aclanthology.org/2024.lrec-main.59/.

  • Horváth, P., Kundráth, P., Indig, B., Fellegi, Z., Szlávich, E., Borbála Bajzát, T., Sárközi-Lindner, Z., Vida, B., Karabulut, A., Timári, M., & Palkó, G. (2022). ELTE Poetry Corpus: a machine annotated database of canonical Hungarian poetry. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, & S. Piperidis (Eds.), Proceedings of the 13th conference on language resources and evaluation (LREC 2022) (pp. 3471–3478). ELRA. www.aclanthology.org/2022.lrec-1.372.

  • Mittmann, A., Esteves, E., & Luiz dos Santos, A. (2019). What rhythmic signature says about poetic corpora. In P. Plecháč, B. P. Scherr, T. Skulacheva, H. Bermúdez-Sabel, & R. Kolár (Eds.), Quantitative approaches to versification (pp. 153–172). ICL CAS. https://versologie.cz/conference2019/proceedings/mittmann-pergher-dossantos.pdf.

  • Navarro-Colorado, B., Ribez Lafoz, M., & Sánchez, N. (2017). Metrical annotation of a large corpus of Spanish sonnets: representation, scansion and evaluation. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the tenth international conference on language resources and evaluation (LREC 2016) (pp. 4360–4364). ELRA. www.lrec-conf.org/proceedings/lrec2016/pdf/453_Paper.pdf.

  • Odebrecht, C., Burnard, L., & Schöch, C. (2021). European Literary Text Collection (ELTeC): April 2021 release with 14 collections of at least 50 novels (v1.1.0) [Data set]. Zenodo. www.doi.org/10.5281/zenodo.4662444.

  • Pace-Sigge, M. (2019). Typical phraseological units in poetic texts. In G. Corpas Pastor, & R. Mitkov (Eds.), Computational and corpus-based phraseology (pp. 330–344). Springer. www.doi.org/10.1007/978-3-030-30135-4_24.

  • Parrish, A. (2018). A Gutenberg Poetry Corpus. GitHub. https://github.com/aparrish/gutenberg-poetry-corpus.

  • Plecháč, P. (2021). Versification and authorship attribution. Institute of Czech Literature, CAS and Karolinum Press. www.doi.org/10.14712/9788024648903.

  • Plecháč, P., & Kolár, R. (2015). The Corpus of Czech Verse. Studia Metrica et Poetica, 2(1), 107–118. www.doi.org/10.12697/smp.2015.2.1.05.

  • Project Gutenberg. (2023). Project Gutenberg. www.gutenberg.org.

  • Ruiz Fabo, P., Bermúdez Sabel, H., Martínez Cantón, C., & González-Blanco, E. (2021). The diachronic Spanish sonnet corpus: TEI and linked open data encoding, data distribution, and metrical findings. Digital Scholarship in the Humanities, 36 (Supplement_1, June 2021), i68–i80. www.doi.org/10.1093/llc/fqaa035.

  • Šeļa, A., Plecháč, P., & Lassche, A. (2022). Semantics of European poetry is shaped by conservative forces: The relationship between poetic meter and meaning in accentual-syllabic verse. PLOS ONE, 17(4), Article e0266556. www.doi.org/10.1371/journal.pone.0266556.

  • Storey, G., & Mimno, D. (2020). Like two Pis in a pod: Author similarity across time in the Ancient Greek Corpus. Journal of Cultural Analytics, 5(2). www.doi.org/10.22148/001c.13680.

  • Straka, M. (2018). UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In D. Zeman, & J. Hajič (Eds.), Proceedings of CoNLL 2018: the SIGNLL conference on computational natural language learning (pp. 197–207). Association for Computational Linguistics. www.aclanthology.org/K18-2020.

  • Universal Dependencies (2013). CoNLL-U Format. www.universaldependencies.org/format.html.
