Search Results

Restricted Access

Nelleke Oostdijk

Abstract

This paper investigates a number of phenomena that are considered characteristic of spoken language use, in particular disfluencies such as hesitations, false starts and self-corrections. The aim is to gain insight into the nature, frequency and distribution of these phenomena, so that we may consider their implications for the construction of a parser geared towards the analysis of spoken language data. The study is based on the normalized data found in the spoken part of ICE-GB.

Nelleke Oostdijk

Abstract

In June 1998 the Spoken Dutch Corpus project was started, a five-year project aimed at the compilation and annotation of a 10-million-word corpus of contemporary standard Dutch as spoken in the Netherlands and Flanders. This paper describes the corpus as it is currently under construction. More specifically, it discusses the various considerations that have guided its design.

Corpus-Based Research into Language

In honour of Jan Aarts

Edited by Nelleke Oostdijk and Pieter de Haan

For over two decades Jan Aarts has been actively involved in corpus linguistic research. He was the instigator of a large number of projects, and he was responsible for what has become known as the Nijmegen approach to corpus linguistics. It is thanks to him that words like TOSCA and LDB have become household names in the corpus linguistic community.
The present volume has been compiled in his honour. The contributions in it cover a wide range of topics in the field of corpus linguistic research, especially those in which Jan Aarts takes a keen interest: corpus encoding and tagging, parsing and databases, and the linguistic exploration of corpus data. The contributions discuss work done in this field outside Nijmegen, for the obvious reason that we do not wish to present him with a report on work in which he is himself involved.

Inge de Mönnink, Niek Brom and Nelleke Oostdijk

Abstract

In corpus linguistics, but also in computational linguistics and information retrieval, there is an increasing demand for the automatic classification of large amounts of text. In his research, Biber uses the Multi-Feature/Multi-Dimension (MF/MD) method to obtain a classification of English texts. A major disadvantage of his approach is its heavy reliance on frequency counts of complex grammatical features that are hard to retrieve automatically. In this paper, we investigate whether Biber’s MF/MD method can be used for automatic text classification. For this purpose, the MF/MD method is applied to the ICE-GB corpus, using three different sets of linguistic features. The results indicate that automatic text classification is indeed feasible using word class tags as input for the MF/MD method.

English Language Corpora

Design, Analysis and Exploitation. Papers from the thirteenth International Conference on English Language Research on Computerized Corpora, Nijmegen 1992

Edited by Jan Aarts, Pieter de Haan and Nelleke Oostdijk