Is Social Media Data Relevant for Religious Demographic Research?

in Yearbook of International Religious Demography 2018


Analyses of social media data have generated considerable excitement among social scientists; however, these data are rarely used in research on religion. This chapter presents a case for collecting and analyzing social media data. These data provide a useful adjunct to the survey and census data that support the bulk of religious demographic research. Additionally, this chapter presents a relatively simple and reproducible analysis that demonstrates how social media data can be matched to religious demographic data. Drawing upon a dataset of 4.8 million tweets geocoded to the counties from which the tweets originated, this chapter demonstrates that the denominational composition of U.S. counties is related to the frequency at which Twitter users post words associated with seven emotional categories: anger, fear, joy, sadness, disgust, overall positive sentiment, and overall negative sentiment. Using this analysis as a starting point, the chapter concludes with suggestions for future research that more fully leverage the computational tools that are increasingly available to social scientists to study social media data.


The editorial essay in a 2011 special issue of Science, titled “Social Scientists Wade into the Tweet Stream” (Miller 2011), signaled the emergence of social media platforms as sites for social scientific inquiry. Whereas computer scientists and computational linguists had long utilized text as data, the publication of the Science special issue seemed to mark a watershed moment for the mainstreaming of social media analytics in disciplines where quantitative research almost always utilized data from surveys, censuses, or laboratory experiments.

Since the publication of that special issue, a vibrant, interdisciplinary scholarly community has developed around social media analytics; its research demonstrates that social media output meaningfully reflects real-world outcomes. For instance, the incidence of heart disease mortality can be predicted with a high degree of accuracy using county-level measurements of negative psychological language use on Twitter (Eichstaedt et al. 2015). Social media posts predict flu trends (Lampos and Cristianini 2010; Culotta 2010) and other public health outcomes (Paul and Dredze 2011). Twitter data can be used to track trends in human emotion expressed through language (Golder and Macy 2011), and social media research on gender demonstrates how online gendered discourses reflect and perpetuate real-world inequities (Courtney and Rice 2013; Schwartz et al. 2013; Bamman, Eisenstein, and Schnoebelen 2014).

Of course, social media data come with unique benefits and limitations. On one hand, “trace data” from online interactions are collected via unobtrusive means, eliminating one source of bias introduced by the researcher. These data are available at great scale, and the tools used to analyze them are increasingly available and easy to implement. Text data lend themselves to computational analyses that can discern patterns from messy inputs, and recent advances in natural language processing have made it possible to ascertain a degree of meaning from conversational text data. These data yield insights into attitudes, opinions, and emotions that are otherwise difficult or impossible to measure. On the other hand, social media data are not representative in the same sense as data from randomly sampled survey respondents or randomly assigned experimental participants, a limitation that challenges the assumptions of parametric models.

Despite the limitations, recent years have seen major advances in both the software to analyze social media data and the theoretical tools to interpret the findings. Sentiment analyses that classify the emotional content of text data are now routine (e.g., Kumari et al. 2015). Advances in natural language processing have yielded increasingly sophisticated tools for mining social media data (Hirschberg and Manning 2015). Unsupervised machine learning techniques enable inductive approaches to data analysis that automate older methods of hand-coding topics in text data (Dumais 2004; Hong and Davison 2010). At the same time, the conceptual tools that enable social scientists to meaningfully interpret computational analyses have advanced (Goggins and Petakovic 2014). Theoretical work in this space considers the special factors that motivate internet users to join social media platforms and post content online. In short, the field of social media analytics is advancing rapidly, with increasingly sophisticated tools, methods, and theory to drive ongoing research.

Meanwhile, a wholly separate literature demonstrates the importance of religious context. This literature, which mostly relies on survey data, demonstrates that the religious characteristics of whole geographic areas are important for a variety of outcomes. These effects are not likely due to covariates of religion such as class or race. In other words, religion seems to have an independent effect that is not reducible to other characteristics of communities. In some cases, religion might have a causal influence on concentrated disadvantage. For instance, recent work considers the effect of religious geographic context on racial residential segregation (Blanchard 2007), aggregate mortality risk (Blanchard, Bartkowski, Matthews, and Kerley 2008), and regional divorce rates (Glass and Levchak 2014). This literature collectively suggests that people, even those with little theological knowledge or no religious affiliation at all, are affected by the religious characteristics of the places they live. Religious geographic context plays an important role in shaping the local subcultural norms and values that predominate in an area, and the religious make-up of communities, cities, states, and nations is deeply intertwined with the taken-for-granted norms that shape the nature of social interaction.

These two growing but disparate areas of research, social media analytics and religious demographic research, have much to offer one another. Social media data are readily available at scale and provide rich insights into attitudes and behaviors of large groups of people; religious contextual research has already established the importance of the religious characteristics of geographic areas. This paper works toward uniting these literatures. How, for instance, are the religious characteristics of U.S. counties related to the emotional content of social media posts originating from those counties? Does the county-level proportion of evangelicals or mainline Protestants predict the frequency with which Twitter users post words associated with fear, anger, joy, sadness, or disgust? More broadly, how can social media data further our understanding of religion and society in ways that more conventional methods cannot? The following analysis is an initial foray that connects social media data to religious demographic data in hopes that future research will continue in this line of inquiry.

Analytic Strategy

The following analysis uses data on the religious composition of U.S. counties to predict the frequency with which Twitter users post words associated with seven emotional categories: anger, fear, joy, sadness, disgust, overall positive sentiment, and overall negative sentiment. The analysis took place in three stages. First, the tweets were collected and matched to the counties from which each post originated. Second, an automated lookup function was used to compare each tweet to dictionaries of words associated with various emotional sentiments (see “Data Sources” for information about the emotion dictionaries). Finally, the aggregated Twitter data were matched to religious and other county-level demographic data to determine how language use varied as a function of denominational makeup.
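The third stage, matching aggregated Twitter measures to county-level records, amounts to a join on a shared county identifier. The sketch below is only an illustration of that idea: the identifiers, field names, and values are hypothetical, and the chapter's own syntax (written in R and Stata) may be organized quite differently.

```python
def merge_by_county(twitter_by_county, religion_by_county):
    """Inner join: keep only counties present in both sources."""
    merged = {}
    for county_id, sentiment in twitter_by_county.items():
        if county_id in religion_by_county:
            # Combine the two records into one row per county
            merged[county_id] = {**sentiment, **religion_by_county[county_id]}
    return merged


# Hypothetical toy records; none of these values come from the chapter's data.
toy_twitter = {
    "c1": {"pct_positive": 3.1},
    "c2": {"pct_positive": 2.7},
}
toy_religion = {
    "c1": {"adherents_per_1000": 520},
    "c3": {"adherents_per_1000": 410},
}

merged = merge_by_county(toy_twitter, toy_religion)
print(merged)
```

An inner join is used here so that every retained county has both a sentiment measure and a religion measure; counties missing from either source are dropped.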

Data Sources

Tweets. The data and syntax (written in R and Stata) used in this project are available online.[1] This analysis utilized approximately 4.8 million tweets collected between February and October 2017. The dataset was collected by downloading tweets in real time via Twitter’s streaming API. The streaming API provides a sample of all tweets within the parameters set by the user.[2] The tweets were not searched on a keyword; rather, to obtain a sample of “typical” tweets from the broader tweet stream, a geographic bounding box was specified around the contiguous United States. The bounding box filter ensured that only geotagged tweets were received.[3] The geographic coordinates of the tweets were matched to the 2,929 counties from which the tweets originated. This analysis restricts the dataset to the 1,978 counties with at least 30 tweets.
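As a rough illustration of the geographic matching and the 30-tweet threshold described above, the following sketch assigns coordinates to polygons with a ray-casting test and then drops sparsely covered counties. It is a minimal sketch under stated assumptions: the bounding-box values, the square toy "counties," and all function names are hypothetical, and the chapter does not say which point-in-polygon method or boundary files were actually used.

```python
from collections import Counter

# Hypothetical bounding box (west, south, east, north) roughly covering the
# contiguous United States; the chapter does not report the exact values.
CONUS_BBOX = (-125.0, 24.0, -66.0, 50.0)


def in_bbox(lon, lat, bbox=CONUS_BBOX):
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north


def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def assign_county(lon, lat, counties):
    """Return the id of the first polygon containing the point, else None."""
    for county_id, polygon in counties.items():
        if point_in_polygon(lon, lat, polygon):
            return county_id
    return None


def counties_with_enough_tweets(tweet_coords, counties, min_tweets=30):
    """Count tweets per county and keep counties at or above the threshold."""
    counts = Counter()
    for lon, lat in tweet_coords:
        if not in_bbox(lon, lat):
            continue
        county_id = assign_county(lon, lat, counties)
        if county_id is not None:
            counts[county_id] += 1
    return {cid: n for cid, n in counts.items() if n >= min_tweets}


# Toy example: two adjacent square "counties" and made-up tweet coordinates.
toy_counties = {
    "A": [(-100, 40), (-99, 40), (-99, 41), (-100, 41)],
    "B": [(-99, 40), (-98, 40), (-98, 41), (-99, 41)],
}
toy_tweets = [(-99.5, 40.5)] * 30 + [(-98.5, 40.5)] * 5 + [(0.0, 0.0)]

kept = counties_with_enough_tweets(toy_tweets, toy_counties)
print(kept)  # {'A': 30}
```

In practice a spatial library with real county boundary files would replace the linear scan over polygons; the pure-Python version is shown only to make the logic explicit.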

Emotion lexicon. The quanteda[4] package in R was used to construct a corpus of documents (tweets) that were compared to the NRC Word-Emotion Association Lexicon, a crowdsourced list of approximately 14,000 English words associated with eight categories of emotions.[5] The frequency of words associated with each emotional category (anger, fear, etc.) was calculated for each tweet. After aggregating the tweets to the county level, the overall percentage of words from each county that were associated with each emotion was obtained. For instance, in the average county, about 1.5% of all words used by Twitter users were matched to dictionary items associated with joy (see Table 7.1).
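The dictionary lookup and county-level aggregation can be sketched in a few lines. The example below is written in Python rather than with the quanteda package the chapter used, and the tiny lexicon, the tokenizer, and the function names are all hypothetical stand-ins; the entries are not taken from the NRC lexicon.

```python
import re
from collections import defaultdict

# Toy stand-in for an emotion dictionary. These entries are illustrative
# only and are NOT drawn from the NRC Word-Emotion Association Lexicon.
TOY_LEXICON = {
    "joy": {"happy", "love", "celebrate"},
    "anger": {"furious", "hate"},
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def county_emotion_percentages(tweets, lexicon=TOY_LEXICON):
    """tweets: iterable of (county_id, text) pairs.

    Returns {county_id: {emotion: percent of all words matched}}.
    """
    total_words = defaultdict(int)
    matched = defaultdict(lambda: defaultdict(int))
    for county_id, text in tweets:
        tokens = tokenize(text)
        total_words[county_id] += len(tokens)
        for emotion, words in lexicon.items():
            matched[county_id][emotion] += sum(1 for t in tokens if t in words)
    return {
        county_id: {e: 100.0 * matched[county_id][e] / n for e in lexicon}
        for county_id, n in total_words.items()
        if n > 0
    }


# Made-up tweets for two hypothetical counties.
toy_tweets = [
    ("A", "So happy today"),
    ("A", "I love this town"),
    ("B", "I hate traffic so much"),
]

result = county_emotion_percentages(toy_tweets)
print(result)
```

Words are pooled across all tweets in a county before the percentage is taken, which matches the description of an overall percentage of words per county; averaging per-tweet percentages instead would weight short tweets more heavily.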

County-level data. Denominational composition data are from the 2010 Religious Congregations and Membership Study, a county-by-county enumeration of religious congregations conducted by the Association of Statisticians of American Religious Bodies. Other county-level data in this analysis are from the 2010 Census and, for crime rates, the FBI’s Uniform Crime Reports from the year 2000. County-level voting data for the 2016 presidential election were scraped from


Table 7.1 presents descriptive statistics for all the variables in this analysis. Results show that, on average, about 3% of the words originating from a county were associated with generally positive sentiment. Means for the other emotional sentiments ranged from about 0.9% to about 1.7%. In terms of denominational composition, the average county contained about 503 religious adherents per 1,000 population. Of the four denominational sub-groups in the analysis, evangelical Protestants represented the largest population and members of historically black Protestant denominations represented the smallest group.

The expression of emotional language online varied regionally across the United States. Figure 7.1 presents a map in which every dot represents one county in the dataset (only counties with at least 30 tweets are included). The dots are colored by the percent of all words originating from the county associated with “positive sentiment” in the NRC Emotion Lexicon. Blue dots represent counties with relatively few positive words (fewer than 3% of all words); orange dots represent counties with many positive words (between 3% and 6% of all words). The geographic distribution of positive words suggests that few positive words are used in the Deep South and along the southern Atlantic seaboard. Tweets originating from the Midwest and the Northeast tended to include more positive words. Note that any of the seven emotional categories in this analysis can be mapped; however, in the interest of space, this chapter includes only one map, both to visualize the general coverage area of the counties in the dataset and to demonstrate that sentiment expression on Twitter varies regionally.

Figure 7.1. Map showing counties with at least 30 tweets (1 dot = 1 county), colored by frequency of positive language

The next stage of analysis involved matching the Twitter data to the county-level religion data to determine whether emotional sentiment varied as a function of religious context. Figure 7.2 plots summaries of emotional sentiment by county-level denominational adherence rates. The dots on the chart represent the percentages of all words that express a specific emotional sentiment under different religious scenarios. Orange dots represent counties with exceptionally high religious adherence rates (above the 90th percentile); blue dots represent counties with very low religious adherence rates (below the 10th percentile). Filled circles represent statistically significant differences, whereas empty circles represent differences that are not significantly different from zero.

Figure 7.2. Plots of emotional language use by denominational rates
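The comparison summarized in the figure, counties above the 90th percentile of an adherence rate versus counties below the 10th percentile, can be sketched as a difference in group means with an accompanying test statistic. The chapter does not name the significance test it used, so the Welch-style t statistic below is an assumption, as are the function names and the synthetic data.

```python
import math
import statistics


def percentile_cutoffs(values, low=10, high=90):
    """Return the low-th and high-th percentile cut points of values."""
    # With n=100, quantiles() returns 99 cut points; index k-1 is the
    # k-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return cuts[low - 1], cuts[high - 1]


def high_low_difference(adherence, sentiment):
    """Compare mean sentiment in high- versus low-adherence counties.

    Returns (difference in means, Welch t statistic). The choice of the
    Welch statistic is an assumption, not the chapter's stated method.
    """
    lo_cut, hi_cut = percentile_cutoffs(adherence)
    low_group = [s for a, s in zip(adherence, sentiment) if a < lo_cut]
    high_group = [s for a, s in zip(adherence, sentiment) if a > hi_cut]
    diff = statistics.mean(high_group) - statistics.mean(low_group)
    se = math.sqrt(
        statistics.variance(high_group) / len(high_group)
        + statistics.variance(low_group) / len(low_group)
    )
    return diff, diff / se


# Synthetic data for 100 hypothetical counties: sentiment rises slightly
# with adherence, plus a small alternating perturbation.
adherence = list(range(1, 101))
sentiment = [3.0 + 0.01 * a + (0.05 if a % 2 else -0.05) for a in adherence]

diff, t_stat = high_low_difference(adherence, sentiment)
print(round(diff, 3), round(t_stat, 2))
```

A filled circle in the figure would correspond to a statistic large enough to reject a zero difference at the chosen significance level.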

Some patterns are apparent in Figure 7.2. First, words associated with positive sentiment are used more frequently than words in any other emotional category. In terms of denominational composition, the total religious adherence rate tended to predict lower levels of negative emotional sentiments such as disgust and general negativity; however, it also predicted a lower prevalence of positive sentiment. Some denominational patterns are also worth noting. For instance, in all the emotional categories except anger, the mainline Protestant adherence rate was consistently associated with more positive and less negative emotional expression. The difference was most striking in overall positive sentiment, where tweets from highly mainline Protestant counties contained about two-thirds of a standard deviation more positive words than counties with few mainliners. Contrast this with the evangelical adherence rate, which predicted a lower prevalence of positive language. The black Protestant adherence rate was associated with fewer positive words, more negative words, less joy, less sadness, and less fear. Finally, highly Catholic counties tended to have greater expressions of fear and anger relative to counties with fewer Catholics.

A possible critique of the results in Figure 7.2 is that the patterns are due not to religious differences between counties but to other covariates such as basic demographic differences. To address that critique, the final step in this analysis models negative emotional expression with religious variables plus controls for earnings, sex and racial composition, crime, politics, inequality, population size, and region. The entries in Table 7.2 are standardized ordinary least squares regression coefficients. Each of the focal predictors (denominational adherence rates) is included in a separate regression model because they all strongly covary with one another. The results suggest that the patterns shown in Figure 7.2 generally withstand the inclusion of control variables, with some exceptions. The total adherence rate no longer significantly predicted negative sentiment expression after including controls. The evangelical adherence rate did significantly predict the outcome in Table 7.2, whereas the difference-in-means shown in Figure 7.2 was not statistically significant and the effect was in the opposite direction. Finally, although black Protestantism seems strongly related to negative emotion in Figure 7.2, it was not a significant predictor in Table 7.2. Evangelical and mainline adherence both predicted less negative emotional expression, and the Catholic rate predicted an increase in negative emotional expression.
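Standardized OLS coefficients of the kind reported in Table 7.2 can be obtained by z-scoring the outcome and every predictor and then solving the least squares normal equations. The sketch below does this in plain Python for one focal predictor and one control; the variable names and toy values are hypothetical, and the chapter's actual models (estimated in R or Stata with many more controls) are not reproduced here.

```python
import statistics


def zscore(values):
    """Center on the mean and scale by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def standardized_ols(y, predictors):
    """predictors: {name: list of values}. Returns {name: standardized beta}."""
    names = list(predictors)
    zy = zscore(y)
    zx = [zscore(predictors[name]) for name in names]
    k = len(names)
    # Normal equations (X'X) beta = X'y. No intercept column is needed
    # because every z-scored variable has mean zero.
    xtx = [[sum(p * q for p, q in zip(zx[i], zx[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * q for p, q in zip(zx[i], zy)) for i in range(k)]
    return dict(zip(names, solve_linear_system(xtx, xty)))


# Hypothetical toy data: the outcome depends only on the focal predictor.
focal_rate = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
control = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
outcome = [2.0 * v + 1.0 for v in focal_rate]

coefs = standardized_ols(outcome, {"focal_rate": focal_rate, "control": control})
print({name: round(beta, 6) for name, beta in coefs.items()})
```

Fitting one model per focal predictor, as the chapter does, would mean calling `standardized_ols` once for each adherence rate while keeping the same set of controls.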

Discussion and Conclusion

The current project unites two separate, but growing, bodies of literature: social media analytics and religious demographic research. The former has established that social media output reflects, to varying degrees, the emotions, attitudes, and behaviors of whole populations of people, making social media platforms a meaningful site for social scientific inquiry. The second body of literature demonstrates that the religious composition of geographic areas is a key predictor of local attitudinal norms and structural characteristics of communities.

Social media data are useful for social science in general and for religious demographic research in particular. These data, and the methods used to analyze them, leverage new tools and new data to provide a fresh approach to longstanding theoretical questions in the social scientific study of religion. Ideas about how religiously-derived norms affect whole geographic areas have occupied social scientists’ attention since Weber and Durkheim, both of whom posed religious demographic questions but lacked the quality data available to contemporary researchers. In addition to survey research, ancillary data from unconventional sources can help corroborate existing patterns and expand research into lines of inquiry that are otherwise difficult to access. In short, social media data are relevant for religious demographic research.

This paper presents an example analysis that uses easily attainable social media data and methods that involve simple matching functions between tweets and established dictionaries of words. The intention is to demonstrate that meaningful insights can be obtained using methods that require little computational sophistication. The results of this analysis suggest that the religious characteristics of geographic areas predict differences in language use that withstand adjustments for basic demographic controls. Some interesting denominational patterns emerged, such as the fact that mainline Protestant adherence rates predicted more positive and less negative emotional expressions on Twitter. This finding is consistent with literature that portrays mainline Protestants as prosocial, ecumenical, “civic good guys” (Putnam and Campbell 2010:458) whose presence in communities is generally regarded as beneficial (see also Beyerlein and Hipp 2006; Chaves, Giesel, and Tsitsos 2002). Importantly, however, the findings in this analysis do not confirm causality. Some unmeasured factor could drive both religious participation and emotional expression. The goal of this research is not to provide definitive causal arguments but to demonstrate how social media data can be matched to religious demographic data.

The possibilities for future research in this area are immense. In the immediate sense, every part of the current analysis can be done using more sophisticated methods, including implementations of supervised and unsupervised machine learning to better ascertain the emotional content of tweets. Moving beyond the present example, future research could focus on any number of other relevant topics, such as the connection between religious demography and online hate speech or communication among members of social movements. Different sources of data could be used, such as internet searches, blog posts, or social media output in different languages around the world. In short, internet data offer tremendous promise and will hopefully play a larger role in the future of religious demographic research.


  • Bamman, David, Jacob Eisenstein, and Tyler Schnoebelen. 2014. “Gender Identity and Lexical Variation in Social Media.” Journal of Sociolinguistics 18 (2): 135–60.

  • Beyerlein, Kraig, and John R. Hipp. 2006. “From Pews to Participation: The Effect of Congregation Activity and Context on Bridging Civic Engagement.” Social Problems 53 (1): 97–117.

  • Blanchard, Troy C. 2007. “Conservative Protestant Congregations and Racial Residential Segregation: Evaluating the Closed Community Thesis in Metropolitan and Nonmetropolitan Counties.” American Sociological Review 72 (3): 416–33.

  • Blanchard, Troy C., John P. Bartkowski, Todd L. Matthews, and Kent R. Kerley. 2008. “Faith, Morality and Mortality: The Ecological Impact of Religion on Population Health.” Social Forces 86 (4): 1591–620.

  • Chaves, Mark, Helen M. Giesel, and William Tsitsos. 2002. “Religious Variations in Public Presence: Evidence from the National Congregations Study.” In The Quiet Hand of God: Faith-Based Activism and the Public Role of Mainline Protestantism, edited by Robert Wuthnow and John H. Evans, 108–28. Berkeley: University of California Press.

  • Courtney Walton, S., and Ronald E. Rice. 2013. “Mediated Disclosure on Twitter: The Roles of Gender and Identity in Boundary Impermeability, Valence, Disclosure, and Stage.” Computers in Human Behavior 29 (4): 1465–74.

  • Culotta, Aron. 2010. “Towards Detecting Influenza Epidemics by Analyzing Twitter Messages.” In Proceedings of the First Workshop on Social Media Analytics, 115–22. SOMA ’10. New York, NY, USA: ACM.

  • Dumais, Susan T. 2004. “Latent Semantic Analysis.” Annual Review of Information Science and Technology 38 (1): 188–230.

  • Eichstaedt, Johannes C., Hansen Andrew Schwartz, Margaret L. Kern, Gregory Park, Darwin R. Labarthe, Raina M. Merchant, Sneha Jha, et al. 2015. “Psychological Language on Twitter Predicts County-Level Heart Disease Mortality.” Psychological Science. Published online January 20, 2015.

  • Glass, Jennifer, and Philip Levchak. 2014. “Red States, Blue States, and Divorce: Understanding the Impact of Conservative Protestantism on Regional Variation in Divorce Rates.” AJS 119 (4): 1002–46.

  • Goggins, Sean, and Eva Petakovic. 2014. “Connecting Theory to Social Technology Platforms: A Framework for Measuring Influence in Context.” American Behavioral Scientist 58 (10): 1376–92.

  • Golder, Scott A., and Michael W. Macy. 2011. “Diurnal and Seasonal Mood Vary with Work, Sleep, and Daylength Across Diverse Cultures.” Science 333 (6051): 1878–81.

  • Hirschberg, Julia, and Christopher D. Manning. 2015. “Advances in Natural Language Processing.” Science 349 (6245): 261–66.

  • Hong, Liangjie, and Brian D. Davison. 2010. “Empirical Study of Topic Modeling in Twitter.” In 80–88. 1964870: ACM.

  • Kumari, Pooja, Shikha Singh, Devika More, Dakshata Talpade, and Manjiri Pathak. 2015. “Sentiment Analysis of Tweets.” International Journal of Science Technology & Engineering 1 (10): 130–34.

  • Lampos, Vasileios, Tijl De Bie, and Nello Cristianini. 2010. “Flu Detector - Tracking Epidemics on Twitter.” In Machine Learning and Knowledge Discovery in Databases, 599–602. Lecture Notes in Computer Science. Springer, Berlin, Heidelberg.

  • Miller, Greg. 2011. “Social Scientists Wade into The Tweet Stream.” Science 333 (6051): 1814–15.

  • Paul, Michael J., and Mark Dredze. 2011. “You Are What You Tweet: Analyzing Twitter for Public Health.” ICWSM 20: 265–72.

  • Putnam, Robert D., and David E. Campbell. 2010. American Grace: How Religion Divides and Unites Us. New York: Simon & Schuster.

  • Schwartz, H. Andrew, Johannes C. Eichstaedt, Margaret L. Kern, Lukasz Dziurzynski, Stephanie M. Ramones, Megha Agrawal, Achal Shah, et al. 2013. “Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach.” PLoS ONE 8 (9): 1–16.

  • Sloan, Luke, and Jeffrey Morgan. 2015. “Who Tweets with Their Location? Understanding the Relationship between Demographic Characteristics and the Use of Geoservices and Geotagging on Twitter.” PLoS ONE 10 (11): e0142209.

The online repository is accessible at

See the api documentation at

Because users can enable or disable the location function, most tweets are not geotagged. Previous research suggests that slightly fewer than 1% of tweets include geographic metadata. For a review of socio-demographic differences between Twitter users that do and do not share their location, see Sloan and Morgan (2015).

See the package documentation at

These data were obtained and tabulated by GitHub user Tony McGovern at
