Corpus-based Studies of Lexical and Semantic Variation: The Importance of Both Corpus Size and Corpus Design

In: From Data to Evidence in English Language Research
Author: Mark Davies


Small corpora (e.g. 1–5 million words) are often adequate for the study of high-frequency syntactic constructions, but they are typically inadequate for the study of lexical and semantic phenomena, especially for medium- and lower-frequency words. “Mega corpora”, on the other hand, may contain billions of words from easily obtainable web pages, but they are often just a huge “blob” of texts with no structure that lends itself to the study of variation. In this paper, we discuss three corpora of English – COCA, COHA, and GloWbE – which are very large (about 100 times the size of comparable corpora like ICE or the Brown family of corpora), but which also have a corpus design, architecture, and interface that lend themselves to the in-depth study of variation. With such corpora, we are able to examine genre-based, historical, and dialectal variation in lexis and meaning in ways that would be difficult or impossible with comparable corpora.


