Accurate Stemming of Dutch for Text Classification

in Computational Linguistics in the Netherlands 2001

Abstract

This paper investigates the use of stemming for classification of Dutch (email) texts. We introduce a stemmer that combines dictionary lookup (implemented efficiently as a finite state automaton) with a rule-based backup strategy, and show that it outperforms the Dutch Porter stemmer in accuracy without being substantially slower.
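
The combination of dictionary lookup with a rule-based backup can be sketched roughly as follows. This is only an illustration under stated assumptions: the lexicon entries, the suffix list, and the names `LEXICON`, `FALLBACK_SUFFIXES`, and `stem` are hypothetical and are not taken from the paper, and a plain Python dict stands in for the finite state automaton the paper uses for lookup.

```python
# Toy lexicon mapping word forms to stems.
# Hypothetical entries for illustration; not the paper's dictionary.
LEXICON = {
    "liep": "loop",
    "liepen": "loop",
    "lopen": "loop",
    "huizen": "huis",
    "boeken": "boek",
}

# Hypothetical fallback suffixes, tried in this order (longest first).
# Not the paper's rule set and not the Dutch Porter rules.
FALLBACK_SUFFIXES = ("heden", "en", "e", "s", "t")


def stem(word: str) -> str:
    """Return a stem: dictionary lookup first, suffix stripping as backup."""
    w = word.lower()

    # Step 1: dictionary lookup (the paper implements this as a
    # finite state automaton; a dict is used here for brevity).
    if w in LEXICON:
        return LEXICON[w]

    # Step 2: rule-based backup for out-of-dictionary words.
    # Strip the first matching suffix, keeping at least 3 characters.
    for suffix in FALLBACK_SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]

    # No rule applies: return the (lowercased) word unchanged.
    return w
```

In this sketch an irregular form such as "liepen" is resolved by the lexicon, while an unlisted word such as "fietsen" falls through to the suffix rules.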

For text classification, the most important property of a stemmer is the number of words it (correctly) reduces to the same stem. Here the dictionary-based system also outperforms Porter. However, when a Bayesian text classification system is evaluated with no stemming, the Porter stemmer, or the dictionary-based stemmer on an email classification task and a newspaper topic classification task, the differences in accuracy are not significant. We conclude with an analysis of why this is the case.
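
One simple way to look at how many words a stemmer reduces to the same stem is to group a vocabulary by stem and inspect the resulting classes. The sketch below is an assumption about how such a count could be made, not necessarily the measure used in the paper; `conflation_classes`, `mean_class_size`, `identity`, and `toy_stemmer` are hypothetical names, and the check says nothing about whether the conflations are correct.

```python
from collections import defaultdict


def conflation_classes(words, stemmer):
    """Group distinct words by the stem that `stemmer` assigns to them."""
    classes = defaultdict(set)
    for w in words:
        classes[stemmer(w)].add(w)
    return dict(classes)


def mean_class_size(words, stemmer):
    """Average number of distinct words per stem (1.0 means no conflation)."""
    classes = conflation_classes(words, stemmer)
    if not classes:
        return 0.0
    distinct_words = sum(len(members) for members in classes.values())
    return distinct_words / len(classes)


def identity(word):
    """Baseline: no stemming."""
    return word


def toy_stemmer(word):
    """Hypothetical stemmer: strips a final 'en' from words longer than 4."""
    if word.endswith("en") and len(word) > 4:
        return word[:-2]
    return word


vocabulary = ["boek", "boeken", "fiets", "fietsen"]
no_stemming = mean_class_size(vocabulary, identity)
with_stemming = mean_class_size(vocabulary, toy_stemmer)
```

A larger mean class size indicates more aggressive conflation; judging correctness would additionally require a gold-standard mapping from words to stems.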

Computational Linguistics in the Netherlands 2001

Selected Papers from the Twelfth CLIN Meeting
