This paper describes the first phase of the CEXI project at the University of Bologna in Forlì, involving the selection of the texts to be included in the corpus and decisions about how to process them. The aim of the project is to create a resource that both students and researchers can use to learn about translation and translating. The English Italian Translational Corpus can be described as a bi-directional, parallel, translation-driven corpus. Its core component will consist of about 4.6 million words, or 368 text samples of 10,000 to 15,000 words each, taken from Italian original texts and their translations into English and vice versa, all published between 1975 and 2000. The core corpus will subsequently be flanked by additional unidirectional parallel collections, which will better reflect the specific characteristics of the two very different translation populations sampled, and it will be expanded to include full texts and text types excluded from the core component.
The paper deals with issues such as representativeness, balance and directionality with respect to the Italian- and English-language book publishing sectors. It details the composition of the different sub-components of CEXI and shows that creating a corpus involves a series of compromises between what is ideally desirable and what is possible given practical and theoretical limitations.