Search Results

You are looking at 1 - 9 of 9 items for

  • Author or Editor: Geoffrey Leech

Geoffrey Leech

Abstract

This paper runs counter to the majority of papers in this volume in focusing on the argument that, while welcoming opportunities to use new resources and methods, we should not neglect to improve and refine the resources and methods we already have.

The path of progress in corpus linguistics is strewn with unfinished business. Because no other realistic course is available, corpus linguists have understandably been following the path of practicality, pragmatism and opportunism. By and large, we have built up the resources and techniques of the present generation by taking advantage of what is already available and what can be relatively easily obtained. Our research efforts have consequently been limited and skewed by what resources we have been able to lay our hands on.

In this paper, I illustrate the skewing effect with reference to corpus design and composition, focusing on the desiderata of representativeness, ‘balancedness’ and comparability. After arguing that we need to give more consideration to these basic requirements, I briefly address the issue of representativity (a term used to mean ‘the degree to which a corpus is representative’) in relation to the use of the world-wide web as a source of corpus data, both with respect to ‘the web as corpus’ and with respect to ‘corpus building from www-material’.

Geoffrey Leech

Abstract

This chapter begins by considering the contrast between the data-driven paradigm characteristic of corpus linguistics and the theory-oriented paradigm characteristic of some other schools of linguistics, particularly those espousing a generative framework. To illustrate the corpus linguistics paradigm in detail, I present a case study of grammatical differences observed in the LOB and FLOB corpora and also other corpora of the early 1960s and the early 1990s. By abductive or inductive inference from the observed data, (fallible) descriptive generalizations can be made, and tentative conclusions of theoretical interest can be drawn. In conclusion, I argue that corpus linguistics is not purely observational or descriptive in its goals, but also has theoretical implications. However, like a theory-driven inquiry in the classic formulation of Popper’s hypothetico-deductive method (1972: 297), a corpus linguistic investigation can only lay claim to provisional truths, and therefore requires confirmation or refutation by further research findings.

Geoffrey Leech and Nicholas Smith

Abstract

The quartet of corpora analysed in this paper are the Brown Corpus (AmE, 1961), LOB Corpus (BrE, 1961), Frown Corpus (AmE, 1992) and FLOB Corpus (BrE, 1991). The POS-tagged versions of these matching corpora provide the basis for tracking frequency changes in grammatical usage in written English 1961-1991/2 and for comparing similar changes in AmE and BrE. For example, there have been significant increases in the use of semi-modals, the present progressive, that-relativization, nouns (in particular proper nouns), s-genitives, and verb and negative contractions. Counterbalancing some of these changes, there have been significant decreases in the use of core modals, the passive voice, wh-relativization, and of-genitives. In general, the changes in AmE are more extreme than those in BrE. We discuss these changes in terms of general diachronic processes, particularly socially determined processes such as colloquialization and Americanization.

Geoffrey Leech and Nicholas Smith

Abstract

The creation of the Lanc-31 corpus (familiarly known as B-LOB, ‘Before LOB’) completes a trio of matching corpora of standard written British English (1931, 1961, 1991) on the model of the Brown corpus. The short-term history of English in the twentieth century can therefore now be examined using three equidistant, broadly-sampled and comparable corpora of the written language, and it is possible to trace how far trends of change already observed in the comparison of LOB (1961) and F-LOB (1991) have themselves been undergoing change over the period in question.

We will present in outline the recent history of a considerable range of grammatical features insofar as it can be learned from frequency counts from these three equivalently-sampled corpora. In many cases examined, the trend of increasing or decreasing frequency observed in the later period (1961-91) is found to be a continuation of a similar trend in the earlier period (1931-61). In other cases there is change in the rate or direction of change. In other words, there is both constancy and change in the rate of change. We provide tentative explanations of these changes, where appropriate, in terms of grammaticalization, colloquialization, Americanization and densification. Comparable developments in American English, based on analysis of the equivalent Brown and Frown corpora, are traced for the 1961-92 period, and provide insight into the relation between the two regional varieties, mostly showing AmE trends to be in advance of those for BrE.

Paul Rayson, Andrew Wilson and Geoffrey Leech

Abstract

This paper examines the relationship between part-of-speech frequencies and text typology in the British National Corpus Sampler. Four pairwise comparisons of part-of-speech frequencies were made: written language vs. spoken language; informative writing vs. imaginative writing; conversational speech vs. ‘task-oriented’ speech; and imaginative writing vs. ‘task-oriented’ speech. The following variation gradient was hypothesized: conversation – task-oriented speech – imaginative writing – informative writing; however, the actual progression was: conversation – imaginative writing – task-oriented speech – informative writing. It thus seems that genre and medium interact in a more complex way than originally hypothesized. However, this conclusion has been made on the basis of broad, pre-existing text types within the BNC, and, in future, the internal structure of these text types may need to be addressed.

Edited by Ezra Black, Roger Garside and Geoffrey Leech

This book is about building computer programs that parse (analyze, or diagram) sentences of real-world English. The English we are concerned with might be a corpus of everyday, naturally-occurring prose, such as the entire text of this morning's newspaper.
Most programs that now exist for this purpose are not very successful at finding the correct analysis for everyday sentences. In contrast, the programs described here make use of a more successful statistically-driven approach.
Our book is, first, a record of a five-year research collaboration between IBM and Lancaster University. Large numbers of real-world sentences were fed into the memory of a program for grammatical analysis (including a detailed grammar of English) and processed by statistical methods. The idea is to single out the correct parse, among all those offered by the grammar, on the basis of probabilities. Second, this is a how-to book, showing how to build and implement a statistically-driven broad-coverage grammar of English. We even supply our own grammar, with the necessary statistical algorithms, and with the knowledge needed to prepare a very large set (or corpus) of sentences so that it can be used to guide the statistical processing of the grammar's rules.

English Corpus Linguistics: Looking back, Moving forward

Papers from the 30th International Conference on English Language Research on Computerized Corpora (ICAME 30). Lancaster, UK, 27-31 May 2009

Edited by Sebastian Hoffmann, Paul Rayson and Geoffrey Leech

This book showcases sixteen papers from the landmark 30th conference of the International Computer Archive of Modern and Medieval English (ICAME), held at Lancaster University in May 2009. The theme of the book, ‘looking back, moving forward’, follows that of the conference, where participants reflected on the extraordinary growth of corpus linguistics over three decades as well as looking ahead to further developments in the future. A separate volume, appearing as an e-publication in the VARIENG series from the University of Helsinki, focuses on the methodological and historical dimensions of corpus linguistics. This volume features papers on present-day English and the recent history of English, drawing on the increasing availability of corpora covering the last hundred years or so of the language. Contributors to the volume study numerous topics and datasets including recent diachronic change, regional and new Englishes, learner corpora, academic written English, parallel and translation corpora, corpora of pop lyrics and computer-mediated communication. Overall the volume represents the state of the art in English corpus linguistics and offers a peek into the future directions for the field.

Geoffrey Leech, Nicholas Smith and Paul Rayson

Abstract

This paper has two related purposes. First, our goal is to explain the results of recent research on twentieth century British (as well as American) English, using equivalent corpora of general written (published) English known as the ‘Brown Family’ of corpora. Limiting our attention to British corpora, the ‘Brown Family’ contains three matching corpora of a million words each, the BLOB, LOB and F-LOB corpora, sampled at roughly thirty-year intervals (1931±3 years, 1961 and 1991). (A fourth corpus from 1901±3 is under development, and one-third of it will be used in the latter part of this paper.) These enable us to trace the changing history of written (published) British English over a sixty-year period. Through changes in frequency in grammatical categories and constructions across a variety of genres, we observe largely consistent patterns of change which lend themselves to explanations in terms of what may be called general stylistic trends. To these trends we give such names as colloquialization (movement towards spoken norms of usage), densification (movement towards denser or more compact expression of meaning) and democratization (the trend towards avoidance of discrimination or inequality in the linguistic treatment of individuals). Only the first two of these trends will be explored in this paper.

In the second part of the paper, we show how general stylistic norms, such as are provided by the ‘Brown Family’ corpora, can be used as a reference norm against which statistical deviations identify some of the characteristic features of style of an individual author or an individual text. For this we make use of Rayson’s Wmatrix software (http://ucrel.lancs.ac.uk/wmatrix/) for comparing (groups of) texts in terms of lexical, grammatical and semantic characteristics. Although the comparison is in some respects lacking in accuracy, it identifies typical style markers of an individual text, ordering them in terms of their differentness from the reference norm. It remains to be seen how far this computational technique can place the elusive notion of authorial style on an objective footing, but results so far are promising.