Joan C. Beal

Abstract

Contrasting the aims and methodologies of corpus linguists and variationists, Charles Meyer writes that the latter ‘have been more interested in spoken language’ and ‘have tended to collect data for private use and have not generally made public their data sets’ (2006: 169). Since the advent of sociolinguistics in the 1960s, individual scholars and research teams have been amassing recordings of spoken data, often for the purpose of investigating variation across a limited number of linguistic features. Surprisingly little of this material has, however, been made accessible to the wider community of scholars. As John Widdowson points out, ‘much of this data remains hidden and inaccessible, scattered in numerous, often obscure, repositories’ (2003: 81). What is more, these valuable legacy materials are often kept in inadequate storage facilities and on obsolescent media, leaving them in danger of being lost forever.

The Newcastle Electronic Corpus of Tyneside English (NECTE) was created with the aid of a Resource Enhancement Grant from the then AHRB, with the primary objective of ‘rescuing’ legacy materials from the Tyneside Linguistic Survey, collected c.1969, and creating an accessible corpus by combining these with more recently collected data from the Phonological Variation and Change project, collected c.1994. More specifically, the resultant corpus was designed to be of use to as wide a range of end-users as possible and is therefore available in a number of formats: sound, phonetic transcription, orthographic transcription and grammatical mark-up. The challenges posed by this project, and the ways in which the project team overcame them, will be the main focus of this paper, and should provide useful pointers to anybody intending to embark on creating a corpus of spoken language, whether from legacy materials or from newly collected data. The topics to be covered are: (i) the ethical and legal issues involved in making accessible data collected in an era before ethics review or the UK’s 1998 Data Protection Act; (ii) the challenges involved in gathering metadata and digitising ‘old’ audio material; and (iii) standards of transcription and mark-up. Finally, there will be some discussion of plans to process other ‘legacy’ materials, and of progress made towards developing common standards, as set out in Kretzschmar et al. (2006).
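Purely as an illustration of what such a multi-format resource involves, the sketch below bundles the four aligned representations of a single interview in Python. All field names and sample values are hypothetical and invented for the example; they are not NECTE’s actual encoding or tagset.

```python
# A minimal sketch only: one way to hold the four aligned representations of a
# single interview.  Field names and sample content are hypothetical, not NECTE's.
from dataclasses import dataclass, field

@dataclass
class InterviewRecord:
    speaker_id: str          # anonymised speaker code
    audio_file: str          # digitised sound recording
    orthographic: str        # orthographic transcription
    phonetic: str            # phonetic transcription (Unicode IPA)
    grammatical: str         # grammatically tagged version of the orthography
    metadata: dict = field(default_factory=dict)  # age, sex, occupation, date, ...

# Hypothetical example record (content invented for illustration):
rec = InterviewRecord(
    speaker_id="speaker-001",
    audio_file="speaker-001.wav",
    orthographic="well I was born in Gateshead",
    phonetic="wɛl a wəz bɔːn ɪn ɡatsˈhɛd",
    grammatical="well_UH I_PN was_VBD born_VVN in_PRP Gateshead_NP",
    metadata={"survey": "TLS", "recorded": "c. 1969"},
)
```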

Joan C. Beal and Ranjan Sen

Abstract

This paper gives an account of plans for constructing a searchable database of eighteenth-century English phonology. The project incorporates data from pronouncing dictionaries and other texts dealing with pronunciation published in the second half of the eighteenth century. The data will be recorded in the form of Unicode transcriptions of as many of the approximately 1,700 words used to exemplify John Wells’ (1982) Standard Lexical Sets as appear in the eighteenth-century texts. Although all of the eighteenth-century texts purported to describe the ‘best’ English, they were compiled by authors from different parts of the English-speaking world (mainly different regions of England, Scotland and Ireland, but including some from North America) and so can provide evidence for the geographical diffusion of innovations (Beal 1999; C. Jones 2006). The paper describes the design of this database and presents the results of a pilot study demonstrating how such a database can be used to answer questions concerning the chronological, social, geographical and phonological distribution of the variation between /hw/ ~ /w/ ~ /h/ in WHICH, WHO, NOWHERE, etc., which is of interest to sociolinguists, dialectologists and historical phonologists.
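For illustration only, the sketch below shows one way a database and pilot query of this kind might be laid out. The table and column names, the variant coding, and the SQL itself are assumptions made for the example, not the project’s actual design.

```python
# Illustrative sketch: a minimal relational layout for sources, lexical-set
# keywords and attested transcriptions, plus a query of the pilot-study type.
# All names here are hypothetical, not the project's schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source (        -- one pronouncing dictionary or treatise
    source_id   INTEGER PRIMARY KEY,
    author      TEXT,
    year        INTEGER,
    region      TEXT         -- e.g. 'Scotland', 'Ireland', 'North America'
);
CREATE TABLE keyword (       -- headwords exemplifying Wells's Standard Lexical Sets
    keyword_id  INTEGER PRIMARY KEY,
    headword    TEXT,        -- e.g. 'which', 'who', 'nowhere'
    lexical_set TEXT
);
CREATE TABLE attestation (   -- one transcription of one keyword in one source
    source_id   INTEGER REFERENCES source,
    keyword_id  INTEGER REFERENCES keyword,
    ipa         TEXT,        -- Unicode IPA transcription
    onset       TEXT         -- coded variant: 'hw', 'w' or 'h'
);
""")

# A query of the kind the pilot study asks: how is the /hw/ ~ /w/ ~ /h/
# variation distributed across regions and decades?
rows = conn.execute("""
SELECT s.region, (s.year / 10) * 10 AS decade, a.onset, COUNT(*) AS n
FROM attestation a
JOIN source  s ON s.source_id  = a.source_id
JOIN keyword k ON k.keyword_id = a.keyword_id
WHERE k.headword IN ('which', 'who', 'nowhere')
GROUP BY s.region, decade, a.onset
""").fetchall()
```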