This paper draws on our personal experience of working with a large diachronic corpus: 1.3 billion words of Guardian and Independent news text spanning 1984–2013 and still growing. Big data is thus, for us, both quantitative and temporal. The data exist as raw text and as analysed databases, created by AVIATOR (1990–93), APRIL (1997–2000), WebCorpLSE (2000–) and other tools. We also refer to the COCA corpus ().
Our research focus is on lexis, and such big data is therefore desirable (; ). The lexicon comprises a few high-frequency words, many more medium- to low-frequency words, and a majority of hapax legomena. Big data increases the scope and enhances the granularity of study, allowing rare and intuitively inaccessible features to be glimpsed (c). Thirty-plus years of diachronic text bring the corpus linguist an evolving understanding of language innovation and change (; ).
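The Zipfian shape of the lexicon described above can be made concrete with a toy sketch (not from the paper): counting word frequencies in even a very small text already shows a handful of high-frequency words alongside a long tail of hapax legomena, words occurring exactly once. The function name and sample sentence below are our own illustrative choices.

```python
# Illustrative sketch: frequency profile of a tiny whitespace-tokenised text.
# A real corpus pipeline would use proper tokenisation; this only shows the
# few-frequent-words / many-hapaxes distribution the text describes.
from collections import Counter

def frequency_profile(text: str):
    """Return (top 3 most common words, sorted hapax legomena)."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    hapaxes = sorted(w for w, c in counts.items() if c == 1)
    return counts.most_common(3), hapaxes

top, hapaxes = frequency_profile(
    "the cat sat on the mat and the dog saw the cat"
)
# 'the' dominates, 'cat' recurs, and most other words are hapaxes
```

In a billion-word corpus the same profile holds at scale, which is why hapaxes alone can constitute enormous numbers of tokens for analysis.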
On the other hand, big data presents challenges for the corpus linguist. High- and even medium-frequency search words and affixes begin to retrieve too much data, while hapax legomena, mainly studied for the patterns they show with particular sub-word elements, constitute enormous numbers of tokens for analysis, supplemented by typographical and tagging errors in the corpus “sump” (Clear, 1986). Moreover, whilst a very large corpus undoubtedly allows microscopic analysis, it also reveals details of language use which complicate descriptions, and can entice the linguist down time-consuming paths of enquiry which prove fruitless or excessive. At this point in corpus-linguistic history, large-scale language corpora are available in advance of the tools needed for their automated analysis.
Through small case studies, the paper illustrates some of the opportunities and challenges of big data that we have experienced recently in our work in corpus-based lexicology and in two allied fields: socio-pragmatics and lexical morphology.