We report on a project investigating the linguistic properties of English scientific texts on the basis of a corpus of journal articles from nine academic disciplines. The goal of the project is to gain insights into registers emerging at the boundaries between computer science and some other discipline (e.g., bioinformatics, computational linguistics, computational engineering). The questions we focus on in this paper are (a) how characteristic the corpus is of the meta-register it represents, and (b) how different or similar the subcorpora are in terms of the more specific registers they instantiate. We analyze the corpus using several data-mining techniques, including feature ranking, clustering, and classification, to see how the subcorpora group in terms of selected linguistic features. The results show that our corpus is well distinguished in terms of the meta-register of scientific writing; we also find distinctive features for the subcorpora that serve as indicators of register diversification. Apart from presenting the results of our analyses, we reflect upon and assess the use of data mining for the tasks of corpus exploration and analysis.
The overall goal of our research is to uncover the linguistic options for expressing negative attitude and experience. To this end, a small corpus of English newsgroup texts about relationship problems and eating disorders, part of the Englische & Deutsche Newsgroup Texte – Annotiertes Korpus (EDNA corpus), has been annotated manually, using Systemic Functional Grammar (Halliday 1994) as the theoretical foundation. The EDNA corpus now contains information about Theme-Rheme structure, modality and negative polarity as well as process types, i.e. types of verbs such as action, relational and mental. In this paper, we focus on modality and negative polarity. We start by looking at syntactic negation, both at clause rank (he didn’t love me) and at phrase rank (I will leave, no second chances). In addition, morphological negation is considered, e.g. That would be unfair. In a second step, we study the use of epistemic modality (i.e. the likelihood of a proposition) and root modality (expressing obligation, inclination or ability) in the EDNA corpus. How do authors position themselves towards their audience and towards what they are saying by using modal auxiliaries (e.g. can, could, may) and modal adjuncts (e.g. certainly, maybe)? Following that, we consider how modality and negative polarity combine. In our corpus, 15% of all clauses carry a negative polarity marker, whereas as many as 23% of clauses with a modality marker are also negated. Furthermore, clauses expressing root modality are more likely to be negated than clauses expressing epistemic modality (38% and 15% respectively). Our results suggest that modality markers and negative polarity markers attract each other.
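The comparison behind the attraction claim can be made explicit with a small calculation. The sketch below is illustrative only: it takes the percentages reported above and divides each conditional negation rate by the overall negation rate. The ratio (called `lift` here) is an assumption introduced for illustration and is not a measure named in the abstract.

```python
def lift(p_neg_given_condition, p_neg_overall):
    """Ratio of a conditional negation rate to the overall negation rate.

    Values above 1 mean negation is more frequent under the condition
    than in the corpus as a whole. 'lift' is a hypothetical helper name.
    """
    return p_neg_given_condition / p_neg_overall

# Percentages as reported in the abstract, expressed as proportions.
P_NEG_OVERALL = 0.15          # all clauses carrying a negative polarity marker
P_NEG_GIVEN = {
    "any modality marker": 0.23,
    "root modality": 0.38,
    "epistemic modality": 0.15,
}

for condition, rate in P_NEG_GIVEN.items():
    print(f"{condition}: {lift(rate, P_NEG_OVERALL):.2f}")
```

A ratio near 1 would indicate that negation is about as frequent under that condition as in the corpus overall.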
We present an information-theoretic approach to investigating diachronic change in scientific English. Our main assumption is that scientific English has become increasingly dense over time, i.e. linguistic constructions that allow dense packing of information are used increasingly often.
So far, diachronic change in scientific writing has been investigated by means of frequency-based approaches (see e.g. Biber). We use information-theoretic measures (entropy, surprisal) to assess features previously stated to change over time and to discover new, latent features from the data itself that are involved in diachronic change.
For this, we use the Royal Society Corpus (RSC), which spans the period from 1665 to 1869. We present three kinds of analyses: nominal compounding (typical of academic writing), modal verbs (shown to have changed in frequency over time), and an analysis based on part-of-speech trigrams to detect new features that change diachronically. We show how information-theoretic measures help to investigate, evaluate and detect features involved in diachronic change.
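To make the two measures concrete, the following minimal sketch computes surprisal and Shannon entropy over part-of-speech trigrams. It assumes simple relative-frequency estimates and uses invented toy tag sequences. The function names, the tag labels and the estimation method are assumptions for illustration, not the procedure applied to the RSC.

```python
import math
from collections import Counter

def trigrams(tags):
    """Return the consecutive part-of-speech trigrams of a tag sequence."""
    return [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]

def distribution(items):
    """Relative-frequency (maximum-likelihood) distribution over items."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: n / total for item, n in counts.items()}

def surprisal(p):
    """Surprisal in bits of an event with probability p."""
    return -math.log2(p)

def entropy(dist):
    """Shannon entropy in bits: the expected surprisal under dist."""
    return sum(p * surprisal(p) for p in dist.values())

# Toy tag sequences, invented for illustration only (not RSC data).
period_a = ["DT", "NN", "VB", "DT", "NN", "IN", "DT", "NN"]
period_b = ["NN", "NN", "NN", "VB", "NN", "NN", "NN", "IN"]

for label, tags in [("period A", period_a), ("period B", period_b)]:
    dist = distribution(trigrams(tags))
    print(label, round(entropy(dist), 3))
```

Comparing such values across time slices is one simple way to see whether the distribution of a feature becomes more or less predictable. A study on real data would need larger samples and, most likely, smoothed probability estimates.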