
The Longitudinal IntermediaPlus (2014–2016): A Case Study in Structuring Unstructured Big Data


In: Research Data Journal for the Humanities and Social Sciences
Authors:
Inga Brentel, Department for Communication and Media Studies, Institute of Social Science, Heinrich-Heine-University, Düsseldorf, Germany, inga.brentel@uni-duesseldorf.de

and
Kristi Winters, GESIS, Cologne, Germany, kristi.winters@gesis.org

Open Access

Abstract

This article details the novel structure developed to handle, harmonize and document big data for reuse and long-term preservation. ‘The Longitudinal IntermediaPlus (2014–2016)’ big data dataset is uniquely rich: it covers an array of German online media extendable to cross-media channels and user information. The metadata file for this dataset, and its documentation, were recently deposited as their own MySQL database called charmstana_sample_14-16.sql (https://data.gesis.org/sharing/#!Detail/10.7802/2030) (cs16), which is suitable for generating descriptive statistics. Analogous to the ‘Data View’ in SPSS, the charmstana_analysis (ca) contains the dataset’s numerical values. Both the cs16 and ca MySQL files are needed to conduct analysis on the full database. The research challenge was to process large-scale datasets into one longitudinal, big-data source suitable for academic research, and according to the FAIR principles. The authors review four methodological recommendations that can serve as a framework for solving big-data structuring challenges, using the harmonization software CharmStats.


Online publication date: 6-7-2021

  1. Related data set “Meta-Information on the Sample of the Media-Analysis Data: The Longitudinal IntermediaPlus (2014–2016)” with DOI www.doi.org/10.7802/2030 in repository “GESIS”

1. Introduction

This article will explain the novel structure developed to handle, harmonize and document big data for reuse and long-term preservation. The Longitudinal IntermediaPlus (2014–2016) big data dataset is unique in its richness: it covers an array of German online media extendable to cross-media channels and information on the users.1 These data are suitable for investigating, inter alia, media use (online and potentially offline), inequalities between social or geographic factors, routines of different (social) groups, media concentration tendencies, and audience and market fragmentation in Germany or in comparison with Germany. The metadata file for this dataset and its documentation were recently deposited as their own MySQL database called charmstana_sample_14-16.sql (cs16) and are available for download from GESIS – Leibniz Institute for the Social Sciences (see Brentel et al., 2020). Similar to the ‘Variable View’ in SPSS, the cs16 database contains metadata on the full dataset and is suitable for generating an array of descriptive statistics. The cs16 file can be used for German media-market analysis on the structural level, for example, the distribution of genres in the German online media market (see Kampes, 2020) or the genre portfolio of different media brands. Importantly, it also details our structuring solutions for the original, unstructured big data media files, including information extraction and the conceptual structure of the full database.

Analogous to the ‘Data View’ in SPSS, the charmstana_analysis (ca) is the MySQL database file containing the dataset’s numerical values. Both the cs16 and ca MySQL files are needed to conduct analyses on The Longitudinal IntermediaPlus (2014–2016) database or to extract variables of interest for analysis. The full ca database (<100 GB) is embargoed until summer 2021 (information current at the time of publication), when it will be published with GESIS. However, prior to its full release, and upon e-mail request to the lead author, a chosen variable set of interest can be made accessible to researchers.
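For orientation, the sketch below shows how the two files might be combined once both are loaded onto a MySQL server: variables are selected via the cs16 metadata, and the matching values are pulled from the ca. This is a minimal sketch, assuming illustrative table and column names (variable, data, variable_id); only the database names and the example variable names are taken from this article, not from a published schema.

    -- Minimal sketch; table and column names are assumptions,
    -- not the published cs16/ca schema.
    SELECT m.variable_name,
           a.respondent_id,
           a.`value`
    FROM   `charmstana_sample_14-16`.variable AS m
    JOIN   charmstana_analysis.data AS a
           ON a.variable_id = m.variable_id
    WHERE  m.variable_name IN ('education', 'hh_aboalice', 'ga_alst_ct_pi');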

Publications addressing the unique challenges of big-data quality standards have emerged in recent years (inter alia, Jürgens & Jungherr, 2016; van Atteveldt & Peng, 2018; van Atteveldt et al., 2019b; Wilkinson et al., 2016). This paper triangulates the conceptual literature (Dienlin et al., 2021; Peter et al., 2020; RatSWD, 2019; RatSWD, 2020; van Atteveldt & Peng, 2018; van Atteveldt et al., 2019b; Wilkinson et al., 2016), the practical literature from the field of social media (Jürgens & Jungherr, 2016) and the literature on text-analysis data (inter alia, Berman, 2013, pp. 2ff.; Lee et al., 2014) by documenting the lessons learned from our big data handling challenges and our technical solutions for big, semi-unstructured tracking and survey data. Four practical recommendations are provided in the conclusion that conform to scientific standards of transparency and reproducibility, following the FAIR principles of Wilkinson et al. (2016), and can be applied beyond text analysis and social media data.

2. Original Datasets: Description of the Big Data Dataset and the Research Problem

ag.ma’s IntermediaPlus dataset combines digital trace data on online media use with representative survey data for the German population (over 14 years old). Due to rigorous operationalization by well-recognized academic institutes for data collection (cf. Arbeitsgemeinschaft Media-Analyse e.V., 2020), high-quality data are produced. The dataset includes, inter alia, information on cross-media use and on press media, radio, TV and online media (Arbeitsgemeinschaft Media-Analyse e.V., 2014). The ag.ma IntermediaPlus data bundles have six Variable Sections (see Appendix). These bundles result from a joint venture of the German Media-Analysis agencies (ag.ma, AGOF and GfK/AGF), which unites media-planning actors from the commercial branch and from the broadcasting and electronic media vendors branch. Each brought its own perspective, data needs and data-use perspectives, such as the interpretation of media penetration.

Previously, these data were inaccessible because: 1) they are owned, and embargoed for two years, by companies (and although requests for the data for research were possible, access was not guaranteed); and 2) the structure is closed, it was sparsely documented, and learning the data and the technical requirements demanded significant effort: a common issue when working with big data (Tekiner & Keane, 2013; van Atteveldt & Peng, 2018). Because big data is broadly unstructured, the challenge of handling and tracking large amounts of metadata information became apparent quickly. The Variable Sections metadata in the original dataset were often identical: a single question wording was repeated across many items, each naming a routine activity, a free-time activity or a media offering, with the same structure and metadata information year on year (e.g., identical question wording and variable values multiplied the total number of variables significantly). With 4,000-plus online media-use variables to process for each year, all with identical metadata, we needed automated looping code that also produced aggregated data documentation sufficiently detailed to conform to the FAIR principles (Mons et al., 2017, pp. 51f.). Similarly, changes in the Variable Sections covering respondents’ belongings, socio-demographics and household characteristics needed to be tracked and made visible to users.

3. Method: Planning and Digitizing the Workflow

Big data transformation requires advance planning. Before this project, these data were stored as a large-scale, semi-unstructured data source, also known as a data silo, which is closed in its own storage structure and inaccessible to researchers. We pooled and transformed the Media-Analysis commercial media-market data source into a structured, fully documented and harmonized big-data dataset called The Longitudinal IntermediaPlus Data Source (2014–2016). A customized ‘per variable’ tracking system was designed, and the harmonization process was documented, via CharmStats. The Pressmedia and Radio bundles of Media-Analysis were harmonized in the same traceable and sustainable way (Brentel & Jandura, 2018, 2021; Jandura & Brentel, 2021; Jandura et al., 2021).

Our data-processing solution was the use of automated “loops” developed by the lead author for use in the open-source variable harmonization software CharmStats. CharmStats can generate recoding syntax in several statistical languages and fetch all the metadata associated with a project into a digital report. This allowed us to generate a user-friendly output table for a variable, with all relevant metadata information, across time, displaying any changes (see Figure 5). These detailed data documents, with information informed by the FAIR principles, resulted in a highly reusable, high-quality database.
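CharmStats implements this looping internally in Java; purely to illustrate the idea, a single MySQL query over a hypothetical metadata table can emit one SPSS recode statement per variable instead of 4,000-plus hand-written ones. The table, columns and recode scheme below are invented for the example (the “ga_” prefix is explained in section 4); none of this is CharmStats’ actual code.

    -- Sketch of the looping idea only, under assumed names.
    -- Emits one SPSS RECODE line per full-entity media-use variable.
    SELECT CONCAT('RECODE ', variable_name,
                  ' (1=1) (2=0) (ELSE=SYSMIS) INTO ', variable_name, '_h.')
             AS spss_syntax
    FROM   `charmstana_sample_14-16`.variable
    WHERE  variable_name LIKE 'ga\_%';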

We developed four steps to solve big-data processing challenges when working with large-scale, relatively unstructured categorical and metric variables across different years (see Figure 1):

  1. Step 1: Plan out a codebook documentation structure strategy, conforming to the FAIR principles of Findability and Accessibility. This documentation structure strategy guided the shape of the files.
  2. Step 2: Identify the metadata at the study, question and variable levels to be imported into CharmStats from *.sav files or Open Office spreadsheets, which can be enriched with hand-entered information (e.g., bibliographical information and notes documenting harmonization decisions).
  3. Step 3: Complete the variable harmonization and data-processing work in CharmStats.
  4. Step 4: Use CharmStats to produce outputs, including the dataset documentation report, the recoding syntax for data processing in statistical software, and codebook reports with complete data documentation, all conforming to the FAIR principles of Interoperability and Reusability.

Figure 1

Four steps of success to produce the IntermediaPlus 2014–2016 longitudinal dataset

Citation: Research Data Journal for the Humanities and Social Sciences 6, 1 (2021) ; 10.1163/24523666-06010001

4. Creating a Big Data Structure

Social scientists may view big-data sources as quite attractive: they are often free and rich sources of data. However, would-be big-data users face unique data handling challenges before they can do any data analysis (Japec et al., 2015; Jungherr et al., 2018, pp. 255f.). These data silos are usually stored as unstructured, large-scale big-data collections, requiring substantial data handling before they can be used (Foster et al., 2017, pp. 7ff.; Maroto, 2016; van Atteveldt & Peng, 2018). For those researchers who want to work with unstructured or semi-unstructured big data formats, but who are used to structured datasets, we propose the use of these data handling and documentation standards.2

Our conceptual approach included first identifying important information in the unstructured or semi-unstructured data for information extraction (cf. Warin & Sanger, 2014).3 We used this identified information to re-organize the unstructured big data into an understandable and re-usable structured big data dataset. Big data can be organized in a variety of ways, but our structuring was driven by which structures would best answer our research questions, while also adding secondary reuse value for the research community.4 We structured these big data on different data “levels”, as set out in Figure 2.

  1. The first structuring level reflects the practical reality of the existing data structure itself, namely Full Entity, Single Entities, and Combined Outlines.
  2. The second-order structuring level was conceptual: the typology for business models following Wirtz (2018, pp. 307ff.).
  3. Our third-order, and final, structuring level was genres (see Kampes, 2020).

Figure 2

The structuring levels for the IntermediaPlus dataset

Citation: Research Data Journal for the Humanities and Social Sciences 6, 1 (2021) ; 10.1163/24523666-06010001

These structures reduced the number of variables to about half the number of relevant entities. The structures also guided the separation of different online entities into smaller groups, and those smaller groups facilitated processing metadata in CharmStats, while simultaneously enriching the data with more information by automatically adding paradata for later filtering or analysis. This can be achieved with the structure as indicated by the variable names (see Brentel et al., 2020: data documentation, Description of the work carried out). By way of example, the variables for full-entity online offerings begin with the prefix “ga_” (Gesamt-Angebot, full offering), while those for a single entity start with “ea_” (Einzel-Angebot, single offering), and genre-category labels are displayed with a hashtag, for example, “#Digital”.
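Because this structure is carried in the variable names themselves, it can be queried directly. The hedged sketch below counts single-entity offerings per genre; the genre column is an assumption about the metadata table, whereas the prefixes and the hashtag convention come from the data documentation.

    -- Assumes an illustrative metadata table with a `genre` column;
    -- the actual cs16 schema may differ.
    SELECT genre,
           COUNT(*) AS n_single_entities
    FROM   `charmstana_sample_14-16`.variable
    WHERE  variable_name LIKE 'ea\_%'   -- "ea_" = Einzel-Angebot
      AND  genre LIKE '#%'              -- genre labels carry a hashtag
    GROUP BY genre
    ORDER BY n_single_entities DESC;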

5. The CharmStats Workflow

CharmStats is a software solution that offers a structured workflow to overcome big data challenges.5 Developed at GESIS – Leibniz Institute for the Social Sciences, it breaks down data processing into a metadata-based workflow. Built on DDI metadata standards, CharmStats stores metadata at the level of:

  1. the study, such as study name, collection dates and collection area;
  2. the question, such as multi-lingual question wording and show-card response options;
  3. the variable, including response options and corresponding labels; and
  4. the value, such as a comment added to a response value.
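As a rough mental model of these four levels, the MySQL sketch below lays them out as linked tables. It is illustrative only: the actual CharmStats schema is more elaborate, and its table and column names may differ.

    -- Illustrative sketch of the four DDI-inspired metadata levels.
    CREATE TABLE study (
      study_id         INT PRIMARY KEY,
      study_name       VARCHAR(255),
      collection_start DATE,
      collection_end   DATE,
      collection_area  VARCHAR(255)
    );
    CREATE TABLE question (
      question_id INT PRIMARY KEY,
      study_id    INT,
      wording     TEXT,
      language    CHAR(2),    -- multi-lingual wordings, one row each
      FOREIGN KEY (study_id) REFERENCES study (study_id)
    );
    CREATE TABLE variable (
      variable_id    INT PRIMARY KEY,
      question_id    INT,
      variable_name  VARCHAR(64),
      variable_label VARCHAR(255),
      FOREIGN KEY (question_id) REFERENCES question (question_id)
    );
    CREATE TABLE `value` (
      variable_id INT,
      `value`     INT,
      label       VARCHAR(255),
      comment     TEXT,       -- note attached to a response value
      FOREIGN KEY (variable_id) REFERENCES variable (variable_id)
    );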

Users import metadata from SPSS (*.sav) files or Open Office spreadsheets, then connect Source and Target Variable metadata in the MySQL database via interactive interfaces. The metadata organization in CharmStats, as shown in Figure 3, allows information to be reused in efficient ways. Users can connect, save, and retrieve metadata connections to generate individual digital documents or to produce codebooks, making them reproducible.6 The workflow traces a user’s work to generate individualized data documentation outputs, including graphs that visualize the recoding structure, and auto-generates code in SPSS, Stata, MPlus and SAS. To accommodate our project’s data needs, CharmStats was adapted to include large-scale data processing for categorical and metric variables. Figure 4 presents an overview of the CharmStats workflow.

Figure 3

Representation of types of metadata handled by CharmStats

Citation: Research Data Journal for the Humanities and Social Sciences 6, 1 (2021) ; 10.1163/24523666-06010001

Figure 4

Conceptual representation of the CharmStats workflow and outputs

Citation: Research Data Journal for the Humanities and Social Sciences 6, 1 (2021) ; 10.1163/24523666-06010001

To manage the approximately 21,500 variables in the original ag.ma IntermediaPlus datasets, first a list of Variable Sections (e.g., Media Use, Free Time Activities, Socio-demographics) and then a list of variables per Section (e.g., hh_aboalice, education, ga_alst_ct_pi) were needed to structure the dataset. Using CharmStats, we generated digital documentation of our harmonization process using bespoke report templates based on our reporting needs. These digital reports tracked variable changes across the years as detailed documentation, per harmonized variable (as shown in Figure 5). The reports include researcher comments, replication information for later users, and hyperlinks within the document, so that users need not scroll through the 18,000 variables of the final data documentation, making it user-friendly and transparent and facilitating future replication.
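Using the illustrative tables sketched above, such a per-variable view across years could be retrieved along the following lines (the layout remains an assumption, not the deposited schema):

    -- Sketch: trace one harmonized variable across the pooled years
    -- via its study-level collection dates.
    SELECT s.collection_start,
           v.variable_name,
           q.wording
    FROM   variable v
    JOIN   question q USING (question_id)
    JOIN   study    s USING (study_id)
    WHERE  v.variable_name = 'education'
    ORDER BY s.collection_start;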

Figure 5

Sample of individualized data documentation

Citation: Research Data Journal for the Humanities and Social Sciences 6, 1 (2021) ; 10.1163/24523666-06010001

Note: Based on data documentation in Brentel et al. (2020).

6. Resulting Data: The Longitudinal IntermediaPlus Data Source (2014–2016)

  1. The Longitudinal IntermediaPlus deposited at GESIS, DOI: www.doi.org/10.7802/2030
  2. Temporal coverage: 2014–2016

These cross-sectional datasets were pooled and transformed into one longitudinal, big-data source, The Longitudinal IntermediaPlus (2014–2016) (<100 GB). The metadata file for this dataset, and its documentation, are available for download from GESIS as their own MySQL database called charmstana_sample_14-16.sql (cs16). The deposited cs16 metadata and its data documentation detail the exact shape of the variables. The cs16 facilitates metadata analysis of the German media market and enables users of the ca to compile their variable ‘set of interest’ to create their own, bespoke version of the full MySQL database for analysis.

The charmstana_analysis (ca) is the MySQL database file containing the dataset’s numerical values. The forthcoming ca MySQL database has a specialized structure to facilitate complex computational analysis and efficient computing performance despite its size. The pooled, longitudinal ca dataset has around 18,000 variables and more than 1.6 million German respondents. These data are suitable for investigating, inter alia, media use (online and potentially offline), inequalities between social or geographic factors, routines of different (social) groups, media concentration tendencies, and audience and market fragmentation in Germany or in comparison with Germany. Variables included are socio-demographic characteristics, free-time activities, the respondents’ belongings and social class. There are variables for online media market characteristics, for example, the media provider, marketer, genre, business model or origin of an online media offering. Geographical variables allow analysis at the level of governing districts. The high number of cases in the dataset, and its statistical representativeness for Germany, enable complex statistical methods, such as network analysis, and analyses of specific (sub)groups without small-N problems (van Atteveldt & Peng, 2018, pp. 83f.; van Atteveldt et al., 2019a, p. 3). Both the cs16 and ca MySQL files are needed to conduct analysis on the full database or to extract variables of interest for analysis. The published metadata and data documentation in the cs16 are needed to understand the ca MySQL database and to prepare the query for the researcher’s own, customized version of this database.
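In practice, a researcher’s ‘set of interest’ could be materialized as a bespoke working table in a single statement, along the lines of this hedged sketch (table and column names are assumptions, as before):

    -- Sketch: build a customized extract from the full ca database.
    CREATE TABLE my_extract AS
    SELECT a.respondent_id,
           a.variable_id,
           a.`value`
    FROM   charmstana_analysis.data AS a
    JOIN   `charmstana_sample_14-16`.variable AS m USING (variable_id)
    WHERE  m.variable_name IN ('education', 'ga_alst_ct_pi');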

7. Conclusion

We confronted the challenge of harmonizing large-scale, semi-unstructured, cross-sectional data silos covering several years, a data preparation challenge that future researchers will also have to confront. We found that digitizing the repetitive work of recoding and documenting using CharmStats was very useful in this large-scale data project. The transformed dataset is now pooled, clearly structured, longitudinal, readable and understandable, and it comes with findable, accessible, interoperable and reusable documentation ready for academic use. The inclusion of precise recoding instructions for SPSS, Stata, MPlus and SAS is of particular value to researchers.

To replicate this process, or apply it to another data silo, we offer four recommendations:

  1. use the inherent data structures and decide on a conceptual structure to match your research interest;
  2. import the relevant metadata into CharmStats using SPSS or an Open Office spreadsheet program;
  3. harmonize the metadata as per your pre-defined structure, making use of CharmStats features, such as automated data processing; and
  4. produce your outputs and use a template option to export your bespoke documentation.

For more information on the Longitudinal IntermediaPlus (2014–2016) or the longitudinal datasets for Radio (1977–2015) and Pressmedia (1954–2015) visit the GESIS website for Media-Analysis data or email the lead author, Inga Brentel.

Acknowledgements

This work was supported by the Digital Society research program funded by the Ministry of Culture and Science of the German State of North Rhine-Westphalia. The original data, as well as information on data collection, were kindly provided by ag.ma and AGOF.

References

  • Arbeitsgemeinschaft Media-Analyse e.V. (2014). Datensatz Codeplan MA 14.

  • Arbeitsgemeinschaft Media-Analyse e.V. (2020). Datenerhebung der ma Intermedia PLuS. https://www.agma-mmc.de/media-analyse/ma-intermedia-plus/datenerhebung.

  • Berman, J. J. (2013). Principles of big data: Preparing, sharing, and analyzing complex information. Safari Tech Books Online. Morgan Kaufmann.

  • Brentel, I., & Jandura, O. (2018). Media-Analyse: Radio – Langfristdaten. https://www.doi.org/10.7802/1620.

  • Brentel, I., & Jandura, O. (2021). Media-Analyse: Pressemedien – Langfristdaten (Version 2.0). https://www.doi.org/10.7802/2157.

  • Brentel, I., Kampes, C. F., & Jandura, O. (2020). Meta-Information des Samples der Media-Analyse Daten: IntermediaPlus (2014–2016; Version: 1.0.0). SoWiDataNet. GESIS. https://www.doi.org/10.7802/2030.

  • Breznau, N., Rinke, E. M. & Wuttke, A. (2019). OSSC19 Crowdsourced Replication Initiative, Mannheim Centre For European Social Research (MZES), University of Mannheim. https://harmonization.gesis.org/#!BrowseResults/?searchval=Crowdsource.

  • Cai, L., & Zhu, Y. (2015). The challenges of data quality and data quality assessment in the Big Data Era. Data Science Journal, 14, p. 2. http://www.doi.org/10.5334/dsj-2015-002.
  • Dienlin, T., Johannes, N., Bowman, N. D., Masur, P. K., Engesser, S., Kümpel, A. S., Lukito, J., Bier, L. M., Zhang, R., Johnson, B. K., Huskey, R., Schneider, F. M., Breuer, J., Parry, D. A., Vermeulen, I., Fisher, J. T., Banks, J., Weber, R., Ellis, D. A., … de Vreese, C. (2021). An agenda for open science in communication. Journal of Communication, 71(1), 1–26. https://www.doi.org/10.1093/joc/jqz052.
  • Foster, I., Ghani, R., Jarmin, R. S., Kreuter, F. & Lane, J. (2017). Big data and social science: A practical guide to methods and tools. CRC Press.

  • Gudivada, V., Apon, A., & Ding, J. (2017). Data quality considerations for big data and machine learning: Going beyond data cleaning and transformations. International Journal on Advances in Software, 10, 1–20.
  • Jandura, O., & Brentel, I. (2021, forthcoming). Media-Analyse-Daten: Radio-Tranche (2010–2015; MA-Radio). GESIS Datenarchiv, Köln. ZA5762 Datenfile Version 1.0.0. https://www.doi.org/10.4232/1.13662.
  • Jandura, O., Brentel, I., & Babic, D. (2021). Media-Analyse-Daten: Pressemedien-Tranche (2010–2015). GESIS Datenarchiv, Köln. ZA5761 Datenfile Version 1.0.0. https://www.doi.org/10.4232/1.13661.

  • Japec, L., Kreuter, F., Berg, M., Biemer, P., Decker, P., Lampe, C., Lane, J., O’Neil, C., & Usher, A. (2015). Big Data in survey research. Public Opinion Quarterly, 79(4), 839–880. https://www.doi.org/10.1093/poq/nfv039.
  • Jungherr, A., Jürgens, P., & Schoen, H. (2018). 12 Twitter-Daten in der Wahlkampfforschung: Datensammlung, Aufarbeitung und Analysebeispiele. In A. Blätte, J. Behnke, K.-U. Schnapp, & C. Wagemann (Eds.), Schriftenreihe der Sektion Methoden der Politikwissenschaft der Deutschen Vereinigung für Politische Wissenschaft. Computational Social Science: Die Analyse von Big Data (1st Ed., pp. 255–294). Nomos Verlagsgesellschaft.
  • Jürgens, P., & Jungherr, A. (2016). A tutorial for using Twitter data in the social sciences: Data collection, preparation, and analysis. SSRN Electronic Journal. Advance online publication. https://www.doi.org/10.2139/ssrn.2710146.
  • Kampes, C. F. (2020). Welche Genres existieren für Online-Medienangebote? Eine Analyse der Themenstruktur aus Anbietersicht. In W. Deiters, S. Geisler, F. Hörner, & A. K. Knaup (Eds.), Die Kommunikation und ihre Technologien. Interdisziplinäre Perspektiven auf Digitalisierung (pp. 13–44). Transcript Verlag.
  • Lee, K., Noh, Y., Yoon, S., & Cho, Y. (2014). Structuring of unstructured big data and visual interpretation. Journal of the Korean Data and Information Science Society, 25(6), 1431–1438. https://www.doi.org/10.7465/jkdi.2014.25.6.1431.
  • Maroto, C. (2016). A data lake architecture with Hadoop and open source search engines: Using enterprise data lakes for modern analytics and business intelligence. Retrieved from https://www.dzone.com/articles/a-data-lake-architecture-with-hadoop-and-open-sour.

  • Mons, B., Neylon, C., Velterop, J., Dumontier, M., da Silva Santos, L. O. B., & Wilkinson, M. D. (2017). Cloudy, increasingly FAIR; revisiting the FAIR Data guiding principles for the European Open Science Cloud. Information Services & Use, 37(1), 49–56. https://www.doi.org/10.3233/ISU-170824.
  • Peter, C., Breuer, J., Masur, P. K., Scharkow, M., & Schwarzenegger, C. (2020, December 11). Empfehlungen zum Umgang mit Forschungsdaten in der Kommunikationswissenschaft. Retrieved from https://www.dgpuk.de/sites/default/files/AG_Forschungsdaten%20Empfehlungen%20DGPuK_0.pdf.

  • Rat für Sozial- und Wirtschaftsdaten [RatSWD] (2019). Big Data in den Sozial-, Verhaltens- und Wirtschaftswissenschaften: Datenzugang und Forschungsdatenmanagement – Mit Gutachten “Web Scraping in der unabhängigen wissenschaftlichen Forschung” (Output No. 4[6]). https://www.doi.org/10.17620/02671.39.
  • Rat für Sozial- und Wirtschaftsdaten [RatSWD] (2020). Datenerhebung mit neuer Informationstechnologie: Empfehlungen zu Datenqualität und -management, Forschungsethik und Datenschutz (Output No. 6[6]). https://www.doi.org/10.17620/02671.47.

  • Tekiner, F., & Keane, J. A. (2013, October 13–16). Big Data framework. In Proceedings: 2013 IEEE International Conference on Systems, Man and Cybernetics: SMC 2013, Manchester, United Kingdom (pp. 1494–1499). IEEE Computer Society. https://www.doi.org/10.1109/SMC.2013.258.
  • van Atteveldt, W., Margolin, D., Shen, C., Trilling, D., & Weber, R. (2019a). A roadmap for computational communication research. Computational Communication Research, 1(1), 1–11. https://www.doi.org/10.5117/CCR2019.1.001.VANA.
  • van Atteveldt, W., & Peng, T.-Q. (2018). When communication meets computation: Opportunities, challenges, and pitfalls in computational communication science. Communication Methods and Measures, 12(2–3), 81–92. https://www.doi.org/10.1080/19312458.2018.1458084.
  • van Atteveldt, W., Strycharz, J., Trilling, D., & Welbers, K. (2019b). Toward open computational communication science: A practical road map for reusable data and code. International Journal of Communication, 13, 3935–3954.
  • Warin, T., & Sanger, W. (2014). Structuring big data: How financial models may help. Journal of Computer Science and Information Technology, 2, 1–20.
  • Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J. J., Appleton, G., Axton, M., Baak, A., & Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, Article 160018. https://www.doi.org/10.1038/sdata.2016.18.
  • Wirtz, B. W. (2018). Electronic business (6th Ed.). Springer.

Appendix

Table 1. The six Variable Sections of the ag.ma IntermediaPlus data bundles (table not reproduced).

Notes

1. The mixed-methods design data collection includes some 100,000 cases for the daily tracking of about 4,000 webpages, a combination of on-site and in-app questionnaires, and a classic CATI questionnaire survey carried out twice a year. It excludes online media offerings under public law, such as ZDF and ARD.

2. For information on a big-data data quality framework from data science, see Cai & Zhu, 2015. The FAIR principles match this framework.

3. Information extraction is a methodological approach used in computer science; see, for example, Gudivada et al., 2017, p. 5.

4. For more information, see the study description and documentation archived at GESIS.

5. The Coding and Harmonization of Statistics software CharmStats is a Java-based, open-source, free software that stores to, retrieves from and connects across harmonization metadata via a MySQL database.

6. In addition to the big datasets used, CharmStats was used to document code for a crowdsourced replication initiative experiment (Breznau et al., 2019).
