This book is about building computer programs that parse (analyze, or diagram) sentences of a real-world English. The English we are concerned with might be a corpus of everyday, naturally-occurring prose, such as the entire text of this morning's newspaper.
Most programs that now exist for this purpose are not very successful at finding the correct analysis for everyday sentences. In contrast, the programs described here make use of a more successful statistically-driven approach.
Our book is, first, a record of a five-year research collaboration between IBM and Lancaster University. Large numbers of real-world sentences were fed into the memory of a program for grammatical analysis (including a detailed grammar of English) and processed by statistical methods. The idea is to single out the correct parse, among all those offered by the grammar, on the basis of probabilities. Second, this is a how-to book, showing how to build and implement a statistically-driven broad-coverage grammar of English. We even supply our own grammar, with the necessary statistical algorithms, and with the knowledge needed to prepare a very large set (or corpus) of sentences so that it can be used to guide the statistical processing of the grammar's rules.