Mining Big Data: A Philologist’s Perspective

In: From Data to Evidence in English Language Research
Author: Tanja Rütten


In this contribution, I argue that big data has a lot to learn from small data in terms of philological meta-data annotation. This includes, inter alia, information about genre, genre networks, author and discourse community, as well as information about the intended and actual readerships and circulation patterns of the texts contained in a corpus. I illustrate these issues by discussing the Dictionary of Old English Corpus (DOEC), particularly the prognostic texts contained in it. Investigating the specific functions, circulation patterns and discourse strategies of Old English prognostications, I focus on two points. First, I show how data-mining the DOEC could be improved by philological annotation. This would allow better contextualisation of statistical linguistic data, and it would also foreground coherent and persistent linguistic patterns of minority genres. Second, I show that the weaknesses and improvements discussed for the DOEC also pertain to other big data corpora, both diachronic and synchronic. While it is much more difficult to remedy the lack of philological annotation in truly big data, or even to suggest a very basic outline for it, I argue that such annotation is the only feasible way to interpret statistical data meaningfully, re-appraising Geoffrey Leech’s claim of “total accountability” of all linguistic evidence.

