Analyses of social media data have generated considerable excitement among social scientists; however, these data are rarely used in research on religion. This chapter presents a case for collecting and analyzing social media data. These data provide a useful adjunct to the survey and census data that support the bulk of religious demographic research. Additionally, this chapter presents a relatively simple and reproducible analysis that demonstrates how social media data can be matched to religious demographic data. Drawing upon a dataset of 4.8 million tweets geocoded to the counties from which the tweets originated, this chapter demonstrates that the denominational composition of U.S. counties is related to the frequency at which Twitter users post words associated with seven emotional categories: anger, fear, joy, sadness, disgust, overall positive sentiment, and overall negative sentiment. Using this analysis as a starting point, the chapter concludes with suggestions for future research that more fully leverages the computational tools that are increasingly available to social scientists to study social media data.
The editorial essay in a 2011 special issue of Science titled “Social Scientists Wade into the Tweet Stream” (Miller 2011) signaled the emergence of social media platforms as sites for social scientific inquiry. Whereas computer scientists and computational linguists had long utilized text as data, the publication of the Science special issue seemed to mark a watershed moment for the mainstreaming of social media analytics in disciplines where quantitative research almost always utilized data from surveys, censuses, or laboratory experiments.
Since the publication of that special issue, a vibrant, interdisciplinary scholarly community has developed around social media analytics; their research demonstrates that social media output meaningfully reflects real-world outcomes. For instance, the incidence of heart disease mortality can be predicted with a high degree of accuracy using county-level measurements of negative psychological language use on Twitter (Eichstaedt et al. 2015). Social media posts predict flu trends (Lampos and Cristianini 2010; Culotta 2010) and other public health outcomes (Paul and Dredze 2011). Twitter data
Of course, social media data come with unique benefits and limitations. On one hand, “trace data” from online interactions are collected via unobtrusive means, eliminating one source of bias introduced by the researcher. These data are available at great scale, and the tools used to analyze them are increasingly available and easy to implement. Text data lend themselves to computational analyses that can discern patterns from messy inputs, and recent advances in natural language processing have made it possible to ascertain a degree of meaning from conversational text data. These data yield insights into attitudes, opinions, and emotions that are otherwise difficult or impossible to measure. On the other hand, social media data are not representative in the same sense as data from randomly sampled survey respondents or randomly assigned experimental participants, a limitation that challenges the assumptions of parametric models.
Despite the limitations, recent years have seen major advances in both the software to analyze social media data and the theoretical tools to interpret the findings. Sentiment analyses that classify the emotional content of text data are now routine (e.g., Kumari et al. 2015). Advances in natural language processing have yielded increasingly sophisticated tools for mining social media data (Hirschberg and Manning 2015). Unsupervised machine learning techniques enable inductive approaches to data analysis that automate older methods of hand-coding topics in text data (Dumais 2004; Hong and Davison 2010). At the same time, the conceptual tools that enable social scientists to meaningfully interpret computational analyses have advanced (Goggins and Petakovic 2014). Theoretical work in this space considers the special factors that motivate internet users to join social media platforms and post content online. In short, the field of social media analytics is advancing rapidly, with increasingly sophisticated tools, methods, and theory to drive ongoing research.
Meanwhile, a wholly separate literature demonstrates the importance of religious context. This literature, which mostly relies on survey data, demonstrates that the religious characteristics of whole geographic areas are important for a variety of outcomes. These effects are unlikely to be due to covariates of religion such as class or race. In other words, religion seems to have an independent effect that is not reducible to other characteristics of communities. In some cases, religion might have a causal influence on concentrated disadvantage. For instance, recent work considers the effect of religious geographic context on racial residential segregation (Blanchard 2007), aggregate mortality risk (Blanchard, Bartkowski, Matthews, and Kerley 2008), and regional divorce rates (Glass and Levchak 2014). This literature collectively suggests that people, even those with little theological knowledge or no religious affiliation at all, are affected by the religious characteristics of the places where they live. Religious geographic context plays an important role in shaping the local subcultural norms and values that predominate in an area, and the religious
These two growing but disparate areas of research, social media analytics and religious demographic research, have much to offer one another. Social media data are readily available at scale and provide rich insights into attitudes and behaviors of large groups of people; religious contextual research has already established the importance of the religious characteristics of geographic areas. This chapter works toward uniting these literatures. How, for instance, are the religious characteristics of U.S. counties related to the emotional content of social media posts originating from those counties? Does the county-level proportion of evangelicals or mainline Protestants predict the frequency with which Twitter users post words associated with fear, anger, joy, sadness, or disgust? More broadly, how can social media data further our understanding of religion and society in ways that more conventional methods cannot? The following analysis is an initial foray that connects social media data to religious demographic data in hopes that future research will continue in this line of inquiry.
The following analysis uses data on the religious composition of U.S. counties to predict the frequency with which Twitter users post words associated with seven emotional categories: anger, fear, joy, sadness, disgust, overall positive sentiment, and overall negative sentiment. The analysis took place in three stages. First, the tweets were collected and matched to the counties from which each post originated. Second, an automated lookup function was used to compare each tweet to dictionaries of words associated with various emotional sentiments (see “data sources” for information about the emotion dictionaries). Finally, the aggregated Twitter data were matched to religious and other county-level demographic data to determine how language use varied as a function of denominational makeup.
Tweets. The data and syntax (written in R and Stata) used in this project are available online.[1] This analysis utilized approximately 4.8 million tweets collected between February and October 2017. The dataset was collected by downloading tweets in real time via Twitter’s streaming API, which provides a sample of all tweets within the parameters set by the user.[2] The tweets were not searched on a keyword; rather, to obtain
Emotion lexicon. The quanteda[4] package in R was used to construct a corpus of documents (tweets) that were compared to the NRC Word-Emotion Association Lexicon, a crowdsourced list of approximately 14,000 English words associated with eight categories of emotions.[5] The frequency of words associated with each emotional category (anger, fear, etc.) was calculated for each tweet. After the tweets were aggregated to the county level, the overall percentage of words from each county that were associated with each emotion was obtained. For instance, in the average county, about 1.5% of all words used by Twitter users were matched to dictionary items associated with joy (see Table 7.1).
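The lookup-and-aggregate step can be illustrated with a minimal sketch. The original analysis used quanteda in R with the full NRC lexicon; the Python below only mirrors the general idea. The three-word lexicon, the simple tokenizer, and the county identifiers are hypothetical stand-ins, not the study's data or code.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicon mapping each word to its emotion categories.
# The NRC lexicon has roughly 14,000 entries; these three are stand-ins.
TOY_LEXICON = {
    "happy": {"joy", "positive"},
    "afraid": {"fear", "negative"},
    "awful": {"anger", "disgust", "negative"},
}


def tokenize(text):
    """Lowercase a tweet and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def county_emotion_shares(tweets, lexicon):
    """Return {county: {emotion: percent of all words matched to it}}.

    `tweets` is an iterable of (county_id, text) pairs. The denominator
    is every word posted from the county, matched or not.
    """
    word_totals = defaultdict(int)
    emotion_counts = defaultdict(lambda: defaultdict(int))
    for county, text in tweets:
        for token in tokenize(text):
            word_totals[county] += 1
            for emotion in lexicon.get(token, ()):
                emotion_counts[county][emotion] += 1
    return {
        county: {
            emotion: 100.0 * count / total
            for emotion, count in emotion_counts[county].items()
        }
        for county, total in word_totals.items()
    }


# Hypothetical tweets already matched to county identifiers.
sample = [
    ("01001", "So happy today"),
    ("01001", "Traffic is awful"),
    ("01003", "I am afraid of storms"),
]
shares = county_emotion_shares(sample, TOY_LEXICON)
```

Because a single word can belong to several categories, the percentages for one county need not sum to 100.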
County-level data. Denominational composition data are from the 2010 Religious Congregations and Membership Study, a county-by-county enumeration of religious congregations conducted by the Association of Statisticians of American Religious Bodies. Other county-level data in this analysis are from the 2010 census and, for crime rates, the FBI’s Uniform Crime Reports from the year 2000. County-level voting data for the 2016 presidential election were scraped from Townhall.com.[6]
Table 7.1 presents descriptive statistics for all the variables in this analysis. Results show that, on average, about 3% of the words originating from a county were associated with generally positive sentiment. Means for the other emotional sentiments ranged from about 0.9% to about 1.7%. In terms of denominational composition, the average county contained about 503 religious adherents per 1,000 population. Of the four denominational sub-groups in the analysis, evangelical Protestants represented the largest population and members of historically black Protestant denominations represented the smallest group.
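Matching the aggregated Twitter measures to the county-level sources amounts to a join on a shared county identifier. The sketch below is one way to express that join and is not the project's own code. The identifier format, the field names, and all values are hypothetical; the 30-tweet minimum follows the inclusion rule the chapter states for counties in the dataset.

```python
def merge_county_data(tweet_stats, county_covariates, min_tweets=30):
    """Inner-join two dicts keyed by county identifier.

    Counties with fewer than `min_tweets` tweets are dropped, as are
    counties that are missing from the covariate source.
    """
    merged = {}
    for county, stats in tweet_stats.items():
        if stats["n_tweets"] < min_tweets:
            continue
        covariates = county_covariates.get(county)
        if covariates is None:
            continue
        merged[county] = {**stats, **covariates}
    return merged


# Hypothetical aggregated Twitter measures (values are illustrative only).
tweet_stats = {
    "01001": {"n_tweets": 120, "pct_joy": 1.4},
    "01003": {"n_tweets": 12, "pct_joy": 2.0},   # too few tweets
    "01005": {"n_tweets": 45, "pct_joy": 1.1},   # no covariates below
}

# Hypothetical county covariates (values are illustrative only).
county_covariates = {
    "01001": {"adherents_per_1000": 510.0},
    "01003": {"adherents_per_1000": 480.0},
}

merged = merge_county_data(tweet_stats, county_covariates)
```

An inner join is assumed here because a county lacking either tweets or covariates cannot contribute to a model that uses both.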
The expression of emotional language online varied regionally across the United States. The accompanying map represents each county in the dataset as one dot (only counties with at least 30 tweets are included). The dots are colored by the percent
The next stage of analysis involved matching the Twitter data to the county-level religion data to determine whether emotional sentiment varied as a function of religious context. The accompanying chart plots summaries of emotional sentiment by county-level denominational adherence rates. The dots on the chart represent the percentages of all words that express a specific emotional sentiment under different religious scenarios. Orange dots represent counties with exceptionally high religious adherence rates (above the 90th percentile); blue dots represent counties with very low religious adherence rates (below the 10th percentile). Filled circles represent statistically significant differences, whereas empty circles represent differences that are not significantly different from zero.
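One simple way to produce a high-versus-low contrast of this kind is to average the outcome among counties above the 90th and below the 10th percentile of an adherence rate. The sketch below assumes raw group means, which may differ from the procedure behind the chart, and it omits the significance test. Variable names and values are hypothetical.

```python
from statistics import mean, quantiles


def compare_extremes(counties, predictor, outcome):
    """Return (mean outcome below the 10th percentile of the predictor,
    mean outcome above the 90th percentile of the predictor).

    `counties` is a list of dicts that each contain both keys.
    """
    values = [county[predictor] for county in counties]
    deciles = quantiles(values, n=10)  # nine cut points
    p10, p90 = deciles[0], deciles[-1]
    low = [c[outcome] for c in counties if c[predictor] < p10]
    high = [c[outcome] for c in counties if c[predictor] > p90]
    return mean(low), mean(high)


# Hypothetical data: 20 counties in which negative sentiment falls
# steadily as the adherence measure rises.
counties = [
    {"adherence": float(i), "pct_negative": 2.0 - 0.05 * i}
    for i in range(1, 21)
]
low_mean, high_mean = compare_extremes(counties, "adherence", "pct_negative")
```

A fuller version would attach a test of the difference between the two groups, which is what distinguishes filled from empty circles in the chart.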
Some patterns are apparent in the chart. First, words associated with positive sentiment are used more frequently than words in any other emotional category. In terms of denominational composition, the total religious adherence rate tended to predict lower levels of negative emotional sentiments such as disgust and general negativity; however, it also predicted a lower prevalence of positive sentiment. Some denominational
One critique of these results is that the patterns may reflect not religious differences between counties but other covariates, such as basic demographic differences. To address this possibility, the final step in this analysis models negative emotional expression using the religious variables plus controls for earnings, sex and racial composition, crime, politics, inequality, population size, and region. The entries in Table 7.2 are standardized ordinary least squares regression coefficients. Each of the focal predictors (denominational adherence rates) is included in a separate regression model because they all strongly covary with one another. The results suggest that the patterns shown in
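The modeling strategy, standardized OLS coefficients with each focal predictor entered in its own model alongside a common set of controls, can be sketched in plain Python. The chapter's models were estimated in R and Stata; this is only an illustration of the technique, and the variable names and numbers are hypothetical.

```python
from statistics import mean, pstdev


def zscore(values):
    """Rescale a variable to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def standardized_betas(y, predictors):
    """OLS on z-scored variables; returns one coefficient per predictor.

    No intercept is needed because every z-scored variable has mean zero.
    """
    zy = zscore(y)
    zx = [zscore(col) for col in predictors]
    k = len(zx)
    xtx = [[sum(p * q for p, q in zip(zx[i], zx[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * q for p, q in zip(zx[i], zy)) for i in range(k)]
    return solve(xtx, xty)


def fit_separate_models(y, focal, controls):
    """Fit one model per focal predictor, each with the same controls,
    and return the focal predictor's standardized coefficient."""
    return {
        name: standardized_betas(y, [column] + controls)[0]
        for name, column in focal.items()
    }


# Hypothetical county-level values, for illustration only.
pct_negative = [1.2, 0.9, 1.5, 1.1, 0.8, 1.4, 1.0, 1.3]
focal = {
    "evangelical_rate": [300, 120, 80, 250, 400, 60, 310, 150],
    "mainline_rate": [40, 90, 130, 60, 30, 140, 70, 110],
}
controls = [[10.1, 10.5, 10.9, 10.3, 10.0, 11.0, 10.4, 10.7]]
betas = fit_separate_models(pct_negative, focal, controls)
```

Entering the focal predictors one at a time avoids asking a single model to separate the effects of adherence rates that move closely together, at the cost of coefficients that are not adjusted for one another.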
Discussion and Conclusion
The current project unites two separate, but growing, bodies of literature: social media analytics and religious demographic research. The former has established that social media output reflects, to varying degrees, the emotions, attitudes, and behaviors of whole populations of people, making social media platforms a meaningful site for social scientific inquiry. The second body of literature demonstrates that the religious composition of geographic areas is a key predictor of local attitudinal norms and structural characteristics of communities.
Social media data are useful for social science in general and for religious demographic research in particular. These data, and the methods used to analyze them, offer a fresh approach to longstanding theoretical questions in the social scientific study of religion. Ideas about how religiously derived norms affect whole geographic areas have occupied social scientists’ attention since Weber and Durkheim, both of whom posed religious demographic questions but lacked the quality data available to contemporary researchers. In addition to survey research, ancillary data from unconventional sources can help corroborate existing patterns and expand research into lines of inquiry that are otherwise difficult to access. In short, social media data are relevant for religious demographic research.
This chapter presents an example analysis that uses easily attainable social media data and methods that involve simple matching functions between tweets and established dictionaries of words. The intention is to demonstrate that meaningful insights can be obtained using methods that require little computational sophistication. The results of this analysis suggest that the religious characteristics of geographic areas predict differences in language use that withstand adjustments for basic demographic controls. Some interesting denominational patterns emerged, such as the fact that mainline Protestant adherence rates predicted more positive and less negative emotional expressions on Twitter. This finding is consistent with literature that portrays mainline Protestants as prosocial, ecumenical, “civic good guys” (Putnam and Campbell 2010:458) whose presence in communities is generally regarded as beneficial (see also Beyerlein and Hipp 2006; Chaves, Giesel, and Tsitsos 2002). Importantly, however, the findings in this analysis
The possibilities for future research in this area are immense. In the immediate sense, every part of the current analysis can be done using more sophisticated methods, including implementations of supervised and unsupervised machine learning to better ascertain the emotional content of tweets. Moving beyond the present example, future research could focus on any number of other relevant topics, such as the connection between religious demography and online hate speech or communication among members of social movements. Different sources of data could be used, such as internet searches, blog posts, or social media output in different languages around the world. In short, internet data offer tremendous promise and will hopefully play a larger role in the future of religious demographic research.
Chaves, Mark, Helen M. Giesel, and William Tsitsos. 2002. “Religious Variations in Public Presence: Evidence from the National Congregations Study.” In The Quiet Hand of God: Faith-Based Activism and the Public Role of Mainline Protestantism, edited by Robert Wuthnow and John H. Evans, 108–28. Berkeley: University of California Press.
[1] The online repository is accessible at https://github.com/marsha5813/yird2018.
[2] See the API documentation at https://developer.twitter.com/.
[3] Because users can enable or disable the location function, most tweets are not geotagged. Previous research suggests that slightly fewer than 1% of tweets include geographic metadata. For a review of socio-demographic differences between Twitter users who do and do not share their location, see Sloan and Morgan (2015).
[4] See the package documentation at http://docs.quanteda.io/.
[5] For more information, see http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm.
[6] These data were obtained and tabulated by GitHub user Tony McGovern at https://github.com/tonmcg/County_Level_Election_Results_12-16.