A Syntactic Feature Counting Method for Selecting Machine Translation Training Corpora

in Corpus Linguistics Beyond the Word

Recently, the idea of “domain tuning,” or customizing lexicons to improve results in machine translation and summarization tasks, has driven the need for better testing and training corpora. Traditional methods of automated document identification rely on word-based features to determine the genre, domain, or authorship of a document. However, selecting good training corpora, especially for machine translation systems, requires automated document selection methods that do not rely on these traditional lexically based techniques. Because syntactic structures and syntactic feature densities can heavily affect machine translation quality, syntactic feature-based methods of document selection should be used in choosing training and testing corpora. This paper provides evidence that document genres can be distinguished on the basis of syntactic-tag densities alone, supporting the idea that automated document identification is possible using non-lexical methods. Such methods would be ideal for creating corpora that are syntactically as well as lexically balanced across both genre and subject matter.
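
To make the notion of a syntactic-tag density concrete, the following is a minimal sketch, not the chapter's actual procedure. It assumes each document has already been reduced to a sequence of part-of-speech tags by some tagger, treats a tag's density as its share of all tags in the document, and assigns a document to the genre whose density profile is nearest in Euclidean distance. The function names, the four-tag tagset, and the toy tag sequences are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt

# Hypothetical, deliberately tiny tagset; a real tagger emits many more tags.
TAGSET = ("NN", "VB", "DT", "PRP")


def tag_densities(tags, tagset=TAGSET):
    """Return each tag's share of all tags in one document."""
    counts = Counter(tags)
    total = len(tags)
    if total == 0:
        return {t: 0.0 for t in tagset}
    return {t: counts[t] / total for t in tagset}


def euclidean(a, b):
    """Euclidean distance between two density profiles with the same keys."""
    return sqrt(sum((a[t] - b[t]) ** 2 for t in a))


def nearest_genre(doc_tags, genre_profiles, tagset=TAGSET):
    """Pick the genre whose density profile is closest to the document's."""
    densities = tag_densities(doc_tags, tagset)
    return min(genre_profiles, key=lambda g: euclidean(densities, genre_profiles[g]))


# Made-up tag sequences standing in for genre reference samples (assumption).
genre_profiles = {
    "news": tag_densities(["DT", "NN", "VB", "DT", "NN", "NN"]),
    "dialogue": tag_densities(["PRP", "VB", "PRP", "VB", "NN", "PRP"]),
}

# A made-up unseen document, again represented only by its tags.
unseen = ["PRP", "VB", "DT", "NN", "PRP", "VB"]
predicted = nearest_genre(unseen, genre_profiles)
```

Note that no word forms enter the comparison at any point: the document is characterized only by how its tags are distributed, which is the sense in which such a method is independent of lexical content. A nearest-profile rule is only one possible choice; any classifier over the density vectors could be substituted.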

Corpus Linguistics Beyond the Word

Corpus Research from Phrase to Discourse

