Improving Access to the Dutch Historical Censuses with Linked Open Data

The Dutch Historical Censuses (1795-1971) contain statistics that describe almost two centuries of history in the Netherlands. These censuses were conducted once every 10 years (with some exceptions) from 1795 to 1971. Researchers have used their wealth of demographic, occupational, and housing information to answer fundamental questions in social and economic history. However, accessing these data has traditionally been a time-consuming and knowledge-intensive task. In this paper, we describe the outcomes of the CEDAR project, which make access to the digitized assets of the Dutch Historical Censuses easier, faster, and more reliable. This is achieved by using the data publishing paradigm of Linked Data from the Semantic Web. We use a digitized sample of 2,288 census tables to produce a linked dataset of more than 6.8 million statistical observations. The dataset is modeled using the RDF Data Cube, Open Annotation, and PROV vocabularies. The contributions of representing this dataset as Linked Data are: (1) a uniform database interface for efficient querying of census data; (2) a standardized and reproducible data harmonization workflow; and (3) an augmentation of the dataset through richer connections to related resources on the Web.


Introduction
The Dutch historical censuses were conducted 17 times from 1795 until 1971, once every 10 years. For each of these, the government counted the entire population of the Netherlands, and aggregated the results in three different censuses: demographic (with variables such as gender or age), occupational (about jobs of citizens), and housing (about public and private buildings and other living facilities). After 1971, this exhaustive data collection stopped, mostly due to social opposition, and authorities switched to municipal registers and sampling. Nevertheless, the data collected in the 1795-1971 period is of special interest to historians and social scientists because of three facts: (1) it is based on counting the whole Dutch population, instead of sampling; (2) it provides an unprecedented level of detail, hardly comparable to modern censuses due to privacy regulations; and (3) the survey microdata from which the aggregations were originally built is almost entirely lost.
The 1795-1971 census results were published in books containing the aggregated statistical tables, and several institutes, among which the Central Bureau of Statistics (CBS) and the International Institute of Social History (IISH), hold paper copies. Offering access to these data in a systematic way has always been a priority in these and other institutions. In an effort to improve this access, part of the tables in the historical census books have been digitized as 300,000 scanned images in various projects between the CBS, the IISH and several institutes of the Royal Netherlands Academy of Arts and Sciences (KNAW), such as Data Archiving and Networked Services (DANS) and the Netherlands Interdisciplinary Demographic Institute (NIDI). In addition, these projects have translated part of these scans, by manual input, into more structured formats, resulting in a collection of 507 machine-readable Excel spreadsheets containing 2,288 census tables.
In this paper, we describe the digitally archived results of converting these machine-readable census spreadsheets into Linked Data, a paradigm for publishing structured data on the Web (Heath & Bizer, 2011). We refer to these results as the CEDAR RDF Database. The CEDAR RDF Database contains the final database of the harmonized Dutch historical censuses, encoded using the Resource Description Framework (RDF) in two variants: a complete, although only partially harmonized, RDF conversion of the 2,288 tables of the 1795-1971 period (the "cedar" variant); and a partial, fully harmonized RDF conversion of 140 tables of the 1859-1920 period (the "cedar-mini" variant). This database, its variants, and their associated documentation and metadata are available online in several forms: through a SPARQL endpoint, as a deposit in the DANS archiving system EASY, and as a website.
Rather than pursuing the answer to a specific research question, the construction of the CEDAR RDF Database lays the foundation of a platform that helps researchers and practitioners discover and answer their own research questions. This platform consists of the dataset itself and a set of services on top of it. These services give users access to Dutch historical registers in an unprecedented longitudinal and comparable way, overcoming to a large extent the historical difficulties of analyzing census data across time. In addition, these services use the archived census source materials to provide accurate provenance information, and minimize the effort required from users in data cleaning and data preparation.

Problem
The Dutch historical censuses were collected with different information needs at given times, implying deep changes in their structure, codes, variables and survey questions (Ashkpour, Meroño-Peñuela & Mandemakers, 2015). These changes make comparisons across time hard, and consequently make the use of the historical censuses for longitudinal analysis cumbersome.
Addressing this requires extensive manual input from a domain expert. Moreover, this data munging is repeated over and over by different researchers when they start new research. Concretely, previous research (Ashkpour, Meroño-Peñuela & Mandemakers, 2015) shows that the access problems of this dataset are:
• Aggregated data. The original surveys describing individuals are not available, and only the aggregated census results remain. This hampers a longitudinal harmonization of census data across multiple years, since variables aggregated differently are hardly comparable. For example, the number of female bakers in Haarlem in 1849 and 1859 might be incomparable if the census of 1849 counted bakers without distinguishing gender, and the census of 1859 only counted female bakers in Noord-Holland (the province where Haarlem is).
• Changing variables. The possible values that a variable can take vary over time. Although some census variables are completely stable (for example, the possible values of sex are always male, female and unknown), some others are very dynamic (for example, the many possible, epoch-dependent values of occupation).
• Missing codes. As a consequence of the previous point, a variable in a specific census year might have an aggregation level that does not have an equivalent in another census year. For example, for the classification of housing type, we have very specific codes for counting people in barracks (e.g., Kazerne der Marechaussee, Artilleriekazernes and so on) or forts (e.g., Fort Isabelle and Fort Kijkduin); as we do not have this detailed information for all years, we need to aggregate these housing types according to their function into the higher code Military Buildings.
• Structural heterogeneity. The collection is curated to be visually faithful to the book originals. This means that the spreadsheet layouts are historically coherent, but also very difficult to query in a homogeneous way.
• Inconsistency. Data quality is an important issue within this dataset, and data errors have been previously detected (Ashkpour, Meroño-Peñuela & Mandemakers, 2015). We distinguish between two different kinds of data errors: those already present in the source books, and those introduced by the digitization process. Typically these take the form of spelling mistakes and variants, contents of columns that have shifted to another column, and columns wrongly merged during transcription. Numeric errors are usually detected by statistical methods, e.g. by checking conformance to Benford's Law (Benford, 1938).
• Non reproducibility. Perhaps more importantly, and as a consequence of its poor and loosely connected data representation paradigms (either books, scanned images, or digital spreadsheets), previous socio-historical research that uses this dataset as a source is hardly reproducible. To improve this reproducibility, new paradigms are needed for representing, at a very fine-grained level, both the original data and the processes that affect those data.
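The Benford's Law check mentioned under Inconsistency can be illustrated with a short sketch. It is not the procedure used in the project: the function names, the sample values, and the choice of the mean absolute deviation as a conformance score are all assumptions made here for illustration.

```python
import math
from collections import Counter


def benford_expected():
    """Expected first-digit probabilities under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(n):
    """Leading decimal digit of a positive integer."""
    return int(str(n)[0])


def benford_deviation(counts):
    """Mean absolute deviation between observed and expected first-digit shares.

    Zero and negative values are skipped, since Benford's Law only
    describes the leading digit of positive numbers.
    """
    values = [c for c in counts if c > 0]
    if not values:
        raise ValueError("need at least one positive count")
    observed = Counter(first_digit(v) for v in values)
    expected = benford_expected()
    total = sum(
        abs(observed.get(d, 0) / len(values) - expected[d]) for d in range(1, 10)
    )
    return total / 9


# Hypothetical cell values from one transcribed table (not real census data).
sample = [1203, 187, 25, 1, 342, 19, 11, 2750, 140, 96, 13, 410]
print(round(benford_deviation(sample), 3))
```

A score close to zero suggests that the leading digits follow the expected distribution; a table whose score is much larger than that of its neighbours would be a candidate for manual inspection. Any threshold would have to be chosen empirically.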

Methods
Figure 1 shows the integration pipeline that we used to convert the Dutch historical censuses dataset into the CEDAR RDF Database, a fully fledged, 5-star Linked Dataset. This process starts at the left of Figure 1, where the original spreadsheets are retrieved from their authoritative archive.
The first step after retrieving the original spreadsheets is cell markup. In it, a group of trained experts annotates these spreadsheets with a style/color code, producing tables like the one shown in Figure 2. This style/color markup is used to distinguish key information provided in the table, identifying row properties (i.e. variable names, like sex and age), column and row headers (i.e. variable values, like male and female), and data (such as 8 people). This markup is used in the next step to correctly interpret the observations contained in the table. Colour markup is manually added and does not belong to the original data.
In the third step, we combine systems and expert knowledge to harmonize the variables, values and classification systems of the census, using a set of mappings. Our harmonization approach builds on a flexible, yet structured, workflow. This workflow allows researchers to iteratively discover the peculiarities of such a challenging dataset and provide (different) interpretations of the data in an accountable way (Ashkpour, Mandemakers & Boonstra 2016). To do so, we propose an approach called source-oriented harmonization. By making the harmonization process more structured and explicit, we aim to stimulate similar efforts across other datasets and projects. The produced mappings (the outcome of the harmonization) are rules that establish how a specific string in the original tables (e.g. Diamantsnijders) should be represented as a unique, standard, harmonized code across the whole dataset (e.g. hisco:88030). The mappings also indicate the variable to which they must be applied (e.g. occupation). Multiple classification systems are used in these mappings, depending on their level of genericity and availability. To reuse as many as we can, and also to assess how widely they are used, we use LSD Dimensions (Meroño-Peñuela, 2014). LSD (for Linked Statistical Data) Dimensions is an index of statistical variables and values currently published as Linked Data on the Web, where users can find standard identifiers for variables, such as "sdmx:sex", and their corresponding values, such as "sdmx:sex-male" and "sdmx:sex-female". This way, we reuse many variable names and code lists such as gender, age, reference area (place), reference period (time), and so on, mostly based on their SDMX (Statistical Data and Metadata eXchange) counterparts.
In the last step, the execution of these rules over the raw data yields what we call the release data, a 4-star Linked Dataset that can be homogeneously queried, and whose results can be consistently replicated.
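To make the shape of such mapping rules concrete, the following sketch applies them to one raw observation. It is only an illustration under stated assumptions: the dictionary layout, the function names, and the source strings Mannen and Vrouwen are hypothetical, and the real rules are expressed as RDF rather than as a Python dictionary. The codes hisco:88030, sdmx:sex-male and sdmx:sex-female are the examples given in the text.

```python
# Each rule maps (variable, original string) to a harmonized code.
# The layout of this table is an assumption made for illustration.
MAPPINGS = {
    ("occupation", "diamantsnijders"): "hisco:88030",
    ("sex", "mannen"): "sdmx:sex-male",      # hypothetical source string
    ("sex", "vrouwen"): "sdmx:sex-female",   # hypothetical source string
}


def harmonize(variable, raw_value, mappings=MAPPINGS):
    """Return the harmonized code for a raw string, or None if no rule applies."""
    key = (variable, raw_value.strip().lower())
    return mappings.get(key)


def harmonize_observation(raw_dims, mappings=MAPPINGS):
    """Harmonize every dimension of one observation.

    Returns the harmonized dimensions plus the list of (variable, value)
    pairs for which no rule exists, so that gaps stay visible instead of
    being silently dropped.
    """
    harmonized, unmapped = {}, []
    for variable, raw_value in raw_dims.items():
        code = harmonize(variable, raw_value, mappings)
        if code is None:
            unmapped.append((variable, raw_value))
        else:
            harmonized[variable] = code
    return harmonized, unmapped


raw = {"occupation": "Diamantsnijders", "sex": "Vrouwen"}
print(harmonize_observation(raw))
```

Keeping the unmapped pairs separate reflects the iterative nature of the workflow: they are exactly the strings for which an expert still has to supply a rule.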

Data
The output of the presented method generates 6,800,175 census observations. The distribution of variables (dimensions) among these observations is shown in Table 1. The CEDAR RDF Database is released in two different variants: "cedar" and "cedar-mini". The "cedar" variant contains an RDF conversion of the complete 1795-1971 data series, although only partially harmonized. The "cedar-mini" variant contains an RDF conversion of the shorter period 1859-1920, but with highly curated harmonization mappings. For a complete description of the differences between the two subsets, we refer the readers to (Meroño-Peñuela, Ashkpour, Guéret & Schlobach, 2015) and to the dataset documentation in EASY.
The dataset is modeled using the RDF Data Cube, Open Annotation, and PROV vocabularies. These schemas are specifically designed to publish statistical data, annotation data, and provenance data on the Web as Linked Data, respectively. We choose these vocabularies due to their fit with the [...] 2). We use Open Annotations to make statements about missing codes, table layouts (structural heterogeneity), and data inconsistencies (see Section "Problem"), without modifying the original contents of the tables. Finally, we use PROV to describe data provenance and all data transformations we perform, from the original table values to their final harmonized form. This allows us to address the issues of aggregated data, changing variables, and reproducibility (see Section "Problem"), since users can not only make use of the data, but also follow the trail of how it came to be. The reproducibility of analyses on this dataset is further enhanced by the availability of publicly accessible tables that can be linked and referenced, rather than having researchers copy and type their own data for each research study.
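As a rough illustration of how one census cell could look when expressed with these vocabularies, the sketch below builds the triples of a single observation and prints them as N-Triples. The vocabulary terms (qb:Observation, qb:dataSet, the SDMX sex dimension and obsValue measure, prov:wasDerivedFrom) are real; the http://example.org/ resource URIs, the identifiers, and the function names are hypothetical and do not reproduce the URI scheme of the actual dataset.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
QB = "http://purl.org/linked-data/cube#"
PROV = "http://www.w3.org/ns/prov#"
SDMX_DIM = "http://purl.org/linked-data/sdmx/2009/dimension#"
SDMX_MEASURE = "http://purl.org/linked-data/sdmx/2009/measure#"
SDMX_CODE = "http://purl.org/linked-data/sdmx/2009/code#"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
EX = "http://example.org/cedar/"  # hypothetical base URI, not the real one


def observation_triples(obs_id, dataset_id, source_cell_id, sex_code, count):
    """Triples for one observation, with a provenance link to its source cell."""
    obs = EX + "observation/" + obs_id
    return [
        (obs, RDF_TYPE, QB + "Observation"),
        (obs, QB + "dataSet", EX + "dataset/" + dataset_id),
        (obs, SDMX_DIM + "sex", SDMX_CODE + sex_code),
        # Literals are represented as (lexical form, datatype URI) pairs.
        (obs, SDMX_MEASURE + "obsValue", (str(count), XSD_INTEGER)),
        (obs, PROV + "wasDerivedFrom", EX + "cell/" + source_cell_id),
    ]


def to_ntriples(triples):
    """Serialize URI and typed-literal triples as N-Triples lines."""
    lines = []
    for subject, predicate, obj in triples:
        if isinstance(obj, tuple):
            rendered = '"%s"^^<%s>' % obj
        else:
            rendered = "<%s>" % obj
        lines.append("<%s> <%s> %s ." % (subject, predicate, rendered))
    return "\n".join(lines)


# Hypothetical identifiers for a cell that counts 8 women.
print(to_ntriples(observation_triples("o1", "occupation-1889", "sheet1-C12", "sex-F", 8)))
```

The prov:wasDerivedFrom link is what lets a user walk back from a harmonized value to the spreadsheet cell it came from.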
Remarkably, the CEDAR RDF Database can be queried in a straightforward way for data that can be used to support a hypothesis or an answer to a meaningful historical question (see Section "Data usability & New possibilities"). After executing the relevant queries, these data are typically returned in a matter of seconds; before the CEDAR RDF Database existed, gathering them implied manual labour in the order of hours or even days.

Table 2. Example queries that can be answered using the cedar-mini subset, with the number of tables and cells used to answer them.
According to the Design Principles of Linked Data, a Linked Dataset is only "5-star Linked Data" if its resources are connected to external Linked Datasets as well. In order to make the CEDAR RDF Database a 5-star Linked Dataset, we issue links as depicted by [...] in the technical papers cited in the references section, and at http://lod.cedar-project.nl/

Data usability & New possibilities
To improve the usability of the resulting dataset, we make available various interfaces for accessing it. The simplest of these is a website which makes all data available for download. For more fine-grained data querying needs, we have set up a SPARQL endpoint where all the data can be queried using the SPARQL query language. We also make publicly available a number of example SPARQL queries. However, using these queries requires Linked Data expertise and knowledge of the SPARQL query language. To address this, we also publish a Linked Data API and a query frontend using recent tools (Meroño-Peñuela & Hoekstra, 2017). This interface allows both humans and machines to access the data in a systematic way. For instance, the query houseType_all returns all house-related information, while houseType_params allows users to specify concrete values for the various variables in the query.
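The idea of a parameterized query can be sketched as follows. The example only builds the query text and the request URL, following the SPARQL 1.1 Protocol convention of passing the query in a `query` parameter, and sends nothing over the network. The endpoint URL, the function names, and the assumption that the census year is stored as a plain literal under sdmx-dimension:refPeriod are hypothetical; the published example queries should be consulted for the actual data model.

```python
from urllib.parse import urlencode

# Doubled braces are literal braces; {year} and {limit} are the parameters.
QUERY_TEMPLATE = """\
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#>
PREFIX sdmx-measure: <http://purl.org/linked-data/sdmx/2009/measure#>

SELECT ?obs ?area ?value WHERE {{
  ?obs a qb:Observation ;
       sdmx-dimension:refPeriod "{year}" ;
       sdmx-dimension:refArea ?area ;
       sdmx-measure:obsValue ?value .
}} LIMIT {limit}
"""


def build_query(year, limit=100):
    """Fill in the template; int() rejects anything that is not a number."""
    return QUERY_TEMPLATE.format(year=int(year), limit=int(limit))


def build_request_url(endpoint, query):
    """GET URL for a SPARQL endpoint, with the query URL-encoded."""
    return endpoint + "?" + urlencode({"query": query})


# Hypothetical endpoint; replace with the address of the real service.
url = build_request_url("http://example.org/sparql", build_query(1889))
print(url)
```

Changing the single `year` argument yields the same question for a different census, which is the kind of reuse that a parameterized API query makes possible.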
The resulting database, and its outgoing links, open up old and new research possibilities, some of them already in place. For instance, NLGIS, a web portal that displays Dutch demographics of the 19th century in maps, makes use of our data to extend its coverage to variables only present in the historical censuses. But perhaps the most immediate potential of the CEDAR database is the ability to reproduce the results of existing historical research, and use queries to build the same data that researchers have used in the past to answer their research questions. For example, (Boonstra, Doorn, van Horik, van Maarseveen & Oudhof, 2007) offers an invaluable collection of data-driven research on the digital version of the censuses, each of its chapters addressing one specific historical research question. The data used in these chapters could now be easily reconstructed by using the CEDAR database and its provided queries and APIs. For example, in the chapter "Beter wonen? Woningmarkt en residentiële segregatie in Amsterdam 1850-1940", H.M. Laloli investigates whether, and to what extent, the quality of housing in Amsterdam increased in the period 1850-1940. To do so, he gathers data about houses from the Dutch historical censuses of this period, in particular on Bewoonde and Onbewoonde huizen (occupied or unoccupied houses, Table 1, p. 160). These variables are included in the CEDAR database, and the original cross-year harmonized tables can be rebuilt through its API. Another example research question in (Boonstra, Doorn, van Horik, van Maarseveen & Oudhof, 2007) asks to what degree economic specialization occurred in the Netherlands in the 19th and 20th centuries. Here, the mapping with HISCO is a fundamental but time-consuming task that can now be done and further refined with more direct access to the data.
This ease of reproducing existing studies can also be understood as an opportunity to extend them and augment their coverage. For example, census data on the historical Dutch female labour force have been used in comparative studies with other nations, finding that Dutch women might have had a higher participation in the labour market than previously suggested (Schmidt & van Nederveen Meerkerk, 2012). Although it would require other sources to use Linked Data as their data representation paradigm, extending this study to cover databases of other countries would not only be feasible, but would only require running the same queries over different datasets; a practice that we already see happening in other areas of socioeconomic history (Hoekstra et al., 2018).
In addition to reproducing old research questions, new research questions suggested by semantically rich links can now be more easily investigated: for instance, what is the relationship between exported goods in ships and their manufacturing industries in the Netherlands during the 18th and 19th centuries? Or: to what extent did the demographic characteristics of artists and artisans change after the Golden Age in different disciplines? One of the most interesting methodological outcomes is the transposability of research questions: the same queries can be reused and executed over different sets of the data by just changing one parameter, which can be useful in comparative studies.

Concluding remarks
The Dutch historical census dataset is surrounded by a history of its own, in which many have devoted life-long efforts to improving access to the most important collection of historical statistics about the past of the Netherlands. This paper summarizes the CEDAR RDF Database, an archived dataset that solves some issues related to the accessibility of these data, in particular their harmonization, their querying, and the reproducibility of studies. This adds only one step to a long data curation tradition, and the authors hope that it will be only the first of many still to come.
Many challenges remain open for the future, but we have a special interest in investigating the genericity of our methods, as well as their applicability to other datasets. As mentioned in our contributions, we will transpose the same research questions, in the form of Linked Data queries, to other linked datasets, to investigate whether these queries can be re-executed on different data with minimal changes. With respect to the applicability of our methodology to other datasets, we have already successfully published a number of statistical tables as Linked Data by following the same workflow; but in order to strengthen this point within the same domain we will aim at other international census publications, such as the New Zealand censuses and Britain's Histpop reports.

Figure 1. Integration pipeline for the CEDAR RDF Database. The workflow starts at the archiving system, where the original Excel files are stored and retrieved using its API. Raw data is produced after interpreting complex table layout. These raw data are later transformed into harmonized data by applying integration rules encoded as Open Annotations. Red arrows indicate that manual input is required.

Figure 2. One of the census tables of the dataset (occupation census of 1889, province of Noord-Holland). Colour markup is manually added and does not belong to the original data.

Concretely, we use expert knowledge to link the census data to: (1) occupations in the Historical International Standard Classification of Occupations (HISCO) (van Leeuwen, Maas & Miles, 2002); (2) occupations in the ICONCLASS system, with alternative occupation descriptions; (3) historical Dutch municipalities in gemeentegeschiedenis.nl (Zandhuis, den Engelse & Mac Gillavry, 2015), which link to DBpedia, GeoNames, codes of the CBS (Centraal Bureau voor de Statistiek), and codes of the Amsterdamse Code (van der Meer & Boonstra, 2006); and (4) maritime trade registers of the Dutch Ships and Sailors dataset (de Boer, Leinenga, van Rossum & Hoekstra, 2014).

Figure 3. External datasets in the Linked Open Data (LOD) cloud to which the CEDAR RDF Database is connected, achieving the 5th Linked Data star.

Table 1. Dimensions of the dataset. The second column indicates how many observations in the dataset refer to each dimension. The third column indicates the proportion of observations referring to each dimension with respect to the total number of observations (6.8M).