Quantifying the dynamics of topical fluctuations in language

The availability of large diachronic corpora has provided the impetus for a growing body of quantitative research on language evolution and meaning change. The central quantities in this research are token frequencies of linguistic elements in the texts, with changes in frequency taken to reflect the popularity or selective fitness of an element. However, corpus frequencies may change for a wide variety of reasons, including purely random sampling effects, or because corpora are composed of contemporary media and fiction texts within which the underlying topics ebb and flow with cultural and socio-political trends. In this work, we introduce a computationally simple model for controlling for topical fluctuations in corpora, the topical-cultural advection model, and demonstrate how it provides a robust baseline of variability in word frequency changes over time. We validate the model on a diachronic corpus spanning two centuries, and a carefully-controlled artificial language change scenario, and then use it to correct for topical fluctuations in historical time series. Finally, we use the model to show that the emergence of new words typically corresponds with the rise of a trending topic. This suggests that some lexical innovations occur due to growing communicative need in a subspace of the lexicon, and that the topical-cultural advection model can be used to quantify this.


Introduction
Elements of a language, be they words or syntactic constructions, never exist by themselves, but in some context. Contexts, or topics, tend to change with the times, along with the world that they describe. These changes are expected to be reflected in (representative, balanced) diachronic corpora. If a particular topic (be it computers, cuisine or terrorism) rises or falls in public interest or newsworthiness, it would be reasonable to expect a similar effect in the corpus frequencies of lexical elements relevant to the given topic, particularly content words such as nouns. It follows from this that the changing popularity of some words, apparent from raw corpus frequencies, might well be explained simply by the rise or fall of their most prevalent topics, rather than being a product of other aspects driving language change, such as sociolinguistic prestige or inherent contextual fitness. This paper seeks to investigate this idea, which we believe is rather intuitive and widely held, yet to our knowledge has not been formalized in a quantitative way. We will argue that by doing so, we arrive at an informative baseline in frequency-based approaches to lexical dynamics and language change in general. In particular, we show its potential for quantifying topic-driven innovations in the lexicon, and its utility in distinguishing selection-driven change from changes stemming from language-external factors, which manifest as topical fluctuations.
More precisely, we introduce a quantitative measure of topical change that we call advection, a term borrowed from physics where it is used to denote the transport of a substance by the bulk motion of a fluid. The analogy is that words are swept along by movements (increases or decreases in frequency) of associated topics. We implement a topical advection measure using a readily interpretable computational technique based on a robust method from distributional semantics. This approach requires very little tuning of global parameters and produces reasonable results given a sufficiently large corpus. As we will show, it is capable of capturing the effect of changing topic frequencies on the frequencies of individual words.
We begin in Section 2 by providing a brief overview of the state of the art of corpus-based evolutionary language dynamics research and identify the difficulties associated with disentangling different contributions to word frequency changes that may be of interest. We introduce the topical-cultural advection model in Section 3, and define our measure of advection in terms of the frequency change of associated words. We first show (Section 4.1) that advection is positively correlated with word frequency changes in the Corpus of Historical American English (COHA), indicating that the model successfully captures a component in language change. In Section 4.2 we test the advection model by showing that it correctly associates word frequency changes with a stylistic shift in an artificially-constructed corpus. We then show how it can be used to perform a frequency time series decomposition (Section 4.3), and finally (Section 4.4) how it also allows us to quantify the propensity for new words to emerge alongside trending topics.
We conclude that topical advection should be controlled for in any corpus-based research which relies on the (changing) frequencies of lexical items to make claims about patterns or mechanisms of language change. While this paper focuses on language, we believe that the same basic approach could also be utilized in studying the rise and fall of other products of human culture, given appropriate databases or corpora.

Background: corpus-based approaches to lexical dynamics and language evolution
A question that often arises in corpus-based evolutionary language dynamics is the causal origin of language change. A key difficulty lies in disentangling the many different possible causes of language change, some of which may be of greater or lesser interest. Some changes may be a result of purely random effects, such as imprecise reproductions by learners, or subtle changes in pronunciation that accumulate over time (Jespersen, 1922), because individual speakers have access only to a finite sample of utterances. In evolutionary terms, this amounts to the problem of teasing apart drift from evolutionary selection in language change. Even where one can identify a systematic component to a change (selection, in evolutionary terms), factors that might be of greater interest from a linguistic perspective, e.g., inherent preference for a simpler variant in competition with other biases (Culbertson and Kirby, 2016), need to be disentangled from those that are driven by changes in society and culture, or uneven sampling of genres or topics in a corpus (Hinrichs et al., 2015; Lijffijt et al., 2012; Pechenick et al., 2015; Szmrecsanyi, 2016). Such considerations have come to the fore due to a sharp increase in the availability of quantitative data over the last decades. These datasets record how languages are used (corpora), what their distinguishing features are (typological databases) and to what extent languages are used (demographic databases). This development has given rise to the field of language dynamics, which has been described as an interdisciplinary approach to language change, evolution, and interlanguage competition, relying on large databases and quantitative modeling, including simulation-based approaches (Wichmann, 2008). Since our contribution applies to corpus research first and foremost, our focus in the following brief review will be on this strand of language dynamics.

Previous research
Of greatest utility from the perspective of understanding language change are large diachronic collections of language use, as from these one can extract trajectories of change and dynamics of competition between communicative variants. One body of research aims to quantify statistical laws of language change over time, those of word growth and decline, and relationships between word frequencies and lexical evolution (Cuskley et al., 2014;Feltgen et al., 2017;Keller and Schultz, 2013;Keller and Schultz, 2014;Lieberman et al., 2007;Newberry et al., 2017;Pagel et al., 2007). This has also involved claims of the effects of real-world events (like wars) on these processes (Bochkarev et al., 2014;Petersen et al., 2012;Wijaya and Yeniterzi, 2011).
There is also an emerging strand of research investigating semantic change and language dynamics from the point of view of meaning, using diachronic corpora and distributional semantics methods. These include the various flavors of Latent Semantic Analysis (Deerwester et al., 1990) and word2vec (Mikolov et al., 2013). This research broadly falls into two categories. On the one hand, there are methodological proposals and critiques accompanied by exploratory results (Dubossarsky et al., 2017; Frermann and Lapata, 2016; Gulordava and Baroni, 2011; Hamilton et al., 2016b; Jatowt and Duh, 2014; Kulkarni et al., 2015; Sagi et al., 2011; Schlechtweg et al., 2017; Wijaya and Yeniterzi, 2011). On the other, there are applications of these methods, usually with more specific linguistic questions in mind (Dautriche et al., 2016; Dubossarsky et al., 2016; Hamilton et al., 2016a; Perek, 2016; Rodda et al., 2016; Xu and Kemp, 2015). Notably, all of these approaches are, one way or another, based on (co-occurrence) frequencies of words, and as such naturally subject to sampling biases potentially introduced by uneven representation of topics and genres in a corpus.
We believe our contribution is also relevant for traditional corpus linguistics, or research more geared towards investigating specific phenomena in some target language(s), if it involves counting frequencies of words or other elements of speech in diachronic corpora, and using these counts in explanatory models. In all of these cases, it is necessary to deal with factors that serve to confound the explanatory factor of interest, for example, those that are specifically linguistic, such as various language processing and transmission biases. In particular, as noted above, there is a need to separate random and systematic effects, and frequency changes arising from changes in topic and genre across the corpus and over time. We expand on both confounds below.

Confound 1: language change involves drift
It is widely agreed that not all language change is necessarily caused by selection by speakers for certain variants or utterances, but also involves random processes (i.e., drift, or neutral evolution) (Andersen, 1987;Blythe, 2012;Hamilton et al., 2016a;Jespersen, 1922;Newberry et al., 2017;Reali and Griffiths, 2010;Sapir, 1921). Naturally, this should be taken into account in a diachronic study of language. This requires some way of distinguishing changes resulting from drift and those, potentially more interesting ones, resulting from selection.
Our proposal is by no means the first attempt to construct some form of baseline or null model against which potential cases of directed change can be compared. There have been various proposals to carry over the selection and neutral drift paradigm from evolutionary biology, where drift refers to cases of differential replication without selection (cf. Croft, 2000). It has been argued that a prerequisite for studying language change through this paradigm would be the construction of well-informed null models (Blythe, 2012). Proposals in this vein tend to rely directly on or draw from Kimura's neutral model of evolution and the Wright-Fisher model (Ewens, 2004; Kimura, 1994). Alleles are equated with linguistic variants and neutral evolution (drift) with (neutral, random) language change (Reali and Griffiths, 2010).
Adopting this framework, Newberry et al. (2017) claim to show that tests developed in genetics for distinguishing drift and selection of variants in biological evolution can be applied to frequency time series of competing linguistic variants. In particular, they apply the Frequency Increment Test (Feder et al., 2014), and do so on three test cases of changes in the grammar of the English language. They conclude that this constitutes a systematic approach for distinguishing changes likely resulting from linguistic selection rather than drift. With the culturomics proposal (Michel et al., 2011) in mind, Sindi and Dale (2016) propose another model to detect departures from neutral evolution in word frequency variation, based on comparing frequency series with randomly generated baselines.
In a slightly different sense, the notion of '(linguistic) drift' has also been used previously in a computational semantics study (Hamilton et al., 2016a). Drift is defined there as semantic change stemming from (presumably regularly ongoing) change in language, not a reflection of considerable change in the culture that a particular language codifies. The latter is labeled as 'cultural shift', which is claimed to be more common in nouns than verbs. Detecting 'significant' changes in word meaning has also been attempted (Kulkarni et al., 2015), with the two aforementioned approaches using a similar distributional semantics method for determining semantic similarity across time, and the latter employing a significance detection method similar to that of Feder et al. (2014).
The concept of linguistic drift is also commonly utilized in computational modeling of experimental communication data, where the null model, without communicative biases (such as bias for egocentric coordination or superior expression, cf. Tamariz et al. (2014)) would consist of randomized changes, or drift. The question of distinguishing selection from drift has also arisen more widely in cultural evolution, for example, in the contexts of prehistoric pottery (Crema et al., 2016), keywords in academic publishing (Bentley, 2008) and baby names (Hahn and Bentley, 2003).
Another take on neutral evolution was proposed by Stadler et al. (2016), who demonstrated using a simulation model that language change may also self-actuate without selection but via momentum, whereby variants simply become more popular by virtue of having gradually become more popular. The model also produced S-shaped frequency change curves, which have been argued to be a characteristic of language change (Blythe and Croft, 2012). Relatedly, a similar S-shaped trajectory was seen in a model where a neutral process of language acquisition interacts with a dynamic social network structure (Kauhanen, 2017).

Confound 2: language is not independent of its environment
No linguistic element exists in isolation: we use language to communicate about salient events in the world, and the language in use in a given time period therefore indirectly reflects the events, concerns and preoccupations of that time. These reflections should be observable in a representative corpus. The potential effects of real-world changes and hot media topics on corpus-based language usage patterns have been noted in multiple recent studies (see below). However, the way this is approached varies between studies with different aims. We observe at least three ways the connection between language use and real-world change has been considered: as a minor by-product of corpora; as an assumption for language-based culture research; and thirdly, as a factor to be necessarily accounted for in linguistic analysis. All of these deserve further discussion.

Topical-cultural impact on corpora as an inconsequentiality
In a study of mathematical approaches to detecting selection (against drift, cf. Section 2.2) Sindi and Dale (2016) observe that words with very similar frequency change patterns also qualitatively belong to similar semantic clusters or topics (e.g., words related to war increasing during periods of war at similar rates). Since their focus is on evolutionary selection dynamics, the topical effect is discussed in passing. Keller and Schultz (2013) look into word formation dynamics and also observe qualitatively that cultural changes seem to be reflected in the dynamics of the larger morpheme families, but do not explore further.

Topical-cultural impact on corpora as an assumption
The field of 'culturomics' is based on the assumption that changes in the sociocultural environment of a language should be reflected in the concurrent usage of its lexical items. Word frequencies in large diachronic collections of texts (such as Google Books) are seen as an interesting way of observing and studying historical real-world changes (Bentley et al., 2014;Michel et al., 2011). It has also been noted that times of change and conflict, such as wars and revolutions, are observable in language dynamics, such as the emergence of new words (Bochkarev et al., 2014;Bochkarev et al., 2015) and word growth rates (Petersen et al., 2012). Petersen et al. (2012) conclude that "[t]opical words in media can display long-term persistence patterns /.../ and can result in a new word having larger fitness than related 'out-of-date' words". Sociopolitical change can in some cases be observed in the contemporary (distributional) semantics of words, e.g., Kennedy being associated with senator before and president after the year of his election (Wijaya and Yeniterzi, 2011). There have been at least two claims of correlations between changes in language and political processes (Frimer et al. (2015) on the US Congress, Caruana-Galizia (2015) on Nazi Germany), although these have both recently been criticized for methodological errors resulting in spurious correlations (Koplenig, 2017b). The culturomics approach, and research based on the Google Books corpus in particular, has been recently criticized for ignoring important issues such as metadata of the texts underlying the corpus (Koplenig, 2017a) and unbalanced sampling of topics, genres or authors in corpus composition (Pechenick et al., 2015).

Topical-cultural impact on corpora as a problem
While the relationship between topicality and language use allows us to use language as a window into changes in the world, as claimed by practitioners of culturomics, it poses a problem if we want to use fluctuations in those same patterns of language use as a diagnostic for linguistic, rather than sociocultural, change. In recent years a number of authors have drawn attention to the importance of controlling for contextual factors such as genre and topic, with some voicing the concern that studying language change via corpus frequencies of linguistic elements alone could potentially be very much misleading. We review some of these below. Lijffijt et al. (2012) are concerned with testing the assumption that a single-genre general purpose corpus should be relatively homogeneous over time. They find that the period of the English Civil War had an identifiable effect on word frequencies in the Corpus of Early English Correspondence, which they attribute to the over-representation of war-related topics and authors with a military background, violating the assumption. In a corpus study on the English which/that alternation, Hinrichs et al. (2015) emphasize the importance of controlling for genre and register, since those alternating variants are associated with different genres. In a study on the evolution of the English genitive markers, Szmrecsanyi (2016), lamenting the unreliability of corpus frequencies in general, reasons that while a "proper" grammatical change has taken place, "[a] good deal of the diachronic frequency variability in the dataset can be traced back to environmental changes in the textual habitat". They point out that the shifting nature of the topics in the news section of their diachronic English language corpus (in particular, the coverage of non-animate entities such as collective bodies) plays a role in the changing frequencies of of-genitives, their object of study.
Topical effects have also been suggested to play a role in word survival dynamics and semantic change. In a synchronic sociolinguistic study of Māori loanwords in New Zealand English, Calude et al. (2017) point out that simple across-corpus loanword frequencies could be misleading in terms of loanword success, since "certain words and concepts can become more widely used because they might be relevant to certain topics of conversation". Studying the success of loanwords in French news corpora, Chesley and Baayen (2010) similarly ask if topic matters: is the occurrence of many financial borrowings the result of a high proportion of financial articles in the corpus, or are financial borrowings just more likely to become entrenched? Their conclusion is that, without information on topics, there is simply no way to tell. Investigating the rise and decline of words in online newsgroups, Altmann et al. (2011) find that while diffusion among users (speakers) is the primary determinant of the success of a word, spread across the conversation threads within newsgroups (which could also be seen as "topics") also plays a significant role, with both being better predictors than raw frequency. Using a distributional semantics approach, Rodda et al. (2016) find qualitative support for the idea that the diffusion of Christianity drove semantic change in Ancient Greek, but point to the over-representation of certain genres in their corpus and call for more research on the effects of corpus composition.
Although many corpora do include metadata on genres and registers, fine-grained topics (which may well change rapidly within genres like daily news) are more often than not missing from the picture. Consequently, there appears to be a widely articulated need across various branches of corpus-based language research for a method to control for topical fluctuations in corpora, as they are recognized to have potentially far-reaching effects on linguistic analyses based on such data, particularly if they make use of frequencies of linguistic elements. The method we introduce below aims to address that issue.
for words specific to certain topics, and less pronounced (or absent altogether) for words with a more general meaning. While we do not claim that our approach offers a remedy to all the concerns reviewed above, we will show that it does provide a simple, easily implemented and intuitive baseline for controlling for topic-related effects arising from sociocultural change or uneven sampling of a corpus. In this section we define the topical-cultural advection model. To aid readability, we defer certain technical details of the implementation to a Technical Appendix.

Definition of the model
In its simplest form, the topic of a target word in the topical-cultural advection model is defined as the set of words that are most strongly associated with the target word in terms of co-occurrence over a particular period of time. The context sets should be re-evaluated for each period subsample in a corpus, to accommodate natural semantic change of words (which would also entail changes in context).
The advection value of a word in time period t is defined as the weighted mean of the changes in frequencies (compared to the previous period) of those associated words. More precisely, the topical advection value for a word ω at time period t is

$$\mathrm{advection}(\omega; t) := \mathrm{weightedMean}\big(\{\mathrm{logChange}(N_i; t) \mid i = 1, \ldots, m\},\; W\big)$$

where N is the set of m words associated with the target at time t and W is the set of weights (to be defined below) corresponding to those words. The weighted mean is simply

$$\mathrm{weightedMean}(X, W) := \frac{\sum_i x_i w_i}{\sum_i w_i}$$

where $x_i$ and $w_i$ are the i-th elements of the sets X and W respectively. The log change for period t for each of the associated words ω′ is given by the change in the logarithm of its frequencies from the previous to the current period. That is,

$$\mathrm{logChange}(\omega'; t) := \log[f(\omega'; t) + 1] - \log[f(\omega'; t-1) + 1]$$

where f(ω′; t) is the number of occurrences of word ω′ in the time period t. Note we add 1 to these frequency counts, to avoid log(0) appearing in the expression.
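As an illustration, the definitions above can be sketched in a few lines of Python. This is a minimal sketch rather than the implementation used for the results reported here: the function names, the dictionary-based data layout and the toy numbers are assumptions made for the example, and the weights are simply given (their derivation is discussed next).

```python
import math

def log_change(freq_now, freq_prev):
    """Log frequency change between consecutive periods, adding 1 to avoid log(0)."""
    return math.log(freq_now + 1) - math.log(freq_prev + 1)

def weighted_mean(values, weights):
    """Weighted mean of `values`, with `weights` given in the same order."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total

def advection(context_weights, freqs_now, freqs_prev):
    """Advection of a target word, given its topic as {context word: weight}.

    freqs_now and freqs_prev map words to their frequencies in period t and t-1;
    a word missing from either mapping is treated as having frequency 0.
    """
    words = list(context_weights)
    changes = [log_change(freqs_now.get(w, 0), freqs_prev.get(w, 0)) for w in words]
    return weighted_mean(changes, [context_weights[w] for w in words])

# Hypothetical toy data: a two-word "topic" for some target word.
topic = {"keyboard": 2.0, "software": 1.0}
prev_period = {"keyboard": 9, "software": 4}
this_period = {"keyboard": 19, "software": 4}
adv = advection(topic, this_period, prev_period)
```

In the toy data only the more heavily weighted context word increases in frequency, so the resulting advection value is positive but smaller than the log change of that word alone.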
The crucial ingredient in the model is the set of weights W for the words in N. Here, we adopt the positive pointwise mutual information (PPMI) score (Church and Hanks, 1990). We provide details of how PPMI is calculated in the Technical Appendix. The idea is that PPMI assigns a higher score to words that are strongly associated, based on their co-occurrence with other words. While a very general, high frequency word may occur more often in the vicinity of a target word than some specific, low frequency word, the conceptual association between the target and the general word is likely quite low, as the latter co-occurs with many other words as well, while the topic-specific one likely does not. PPMI captures this notion and downweights co-occurrence counts with such general words. In terms of the advection model, weighting the frequency changes of the context words by their association scores leads to a better model, as context words more strongly associated with the target more likely belong to the same underlying topic.
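To make the downweighting of general words concrete, the following sketch computes PPMI in its common textbook form, max(0, log2[p(w, c) / (p(w) p(c))]), from a table of co-occurrence counts. The exact variant used in this paper is specified in the Technical Appendix and may differ in detail; the function names, the data layout and the counts below are assumptions made for the example.

```python
import math
from collections import Counter

def ppmi_scores(cooc):
    """PPMI for each (target, context) pair in `cooc`, a dict of pair -> count."""
    total = sum(cooc.values())
    target_totals = Counter()
    context_totals = Counter()
    for (target, context), n in cooc.items():
        target_totals[target] += n
        context_totals[context] += n
    scores = {}
    for (target, context), n in cooc.items():
        if n == 0:
            continue  # unseen pairs get no score
        pmi = math.log2(n * total / (target_totals[target] * context_totals[context]))
        scores[(target, context)] = max(0.0, pmi)  # keep only positive associations
    return scores

def top_contexts(scores, target, m=75):
    """The m context words most strongly associated with `target`, with weights."""
    ranked = sorted(
        ((c, s) for (t, c), s in scores.items() if t == target and s > 0),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return dict(ranked[:m])

# Hypothetical counts: "the" co-occurs often with everything, "software" only with "computer".
cooc = {
    ("computer", "the"): 50, ("computer", "software"): 10,
    ("dog", "the"): 50, ("dog", "bone"): 10,
}
scores = ppmi_scores(cooc)
computer_topic = top_contexts(scores, "computer")
```

Although "the" co-occurs with "computer" five times as often as "software" does in this toy table, it receives a score of zero because it is equally frequent around every target, so only "software" ends up in the topic.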

Connections with previous work
This model builds on the core notions and recent developments in distributional (vector) semantics, where the meaning and topic of a word are defined through its vector of co-occurring words. These vector spaces may be learned directly from data (Mikolov et al., 2013) or be based on term co-occurrence matrices (Deerwester et al., 1990; Pennington et al., 2014). In all of these approaches, two words with similar vectors (across dimension reduced vector spaces, or across the vocabulary of context words) are considered to have similar meaning. A common measure of similarity is the cosine of the angle between the two vectors. Recently, an alternative has been proposed in the form of the APSyn measure (Santus et al., 2016), which involves comparing the rankings of the topmost associated context words instead of the whole vocabulary. The intuition is that only the m most associated context words hold relevant information about the target word, while most of the words are likely irrelevant. Santus et al. (2016) demonstrate the capacity of APSyn to perform as well as (given sufficiently large m), and in some cases better than, the vector cosine. Considering only top ranking contexts is also similar to the approach of Hamilton et al. (2016a), who use cosine similarity between word vectors between time periods to measure semantic change and, as a second measure, the extent of the change in a word's similarity to its top nearest neighbors. We adopt this approach of considering only the top most associated context words here to determine a "topic" for each word, using PPMI as the association score.
It is nevertheless worthwhile to compare our PPMI-weighted approach with a more traditional topic model. To this end, we also implemented the advection measure using Latent Dirichlet Allocation (LDA) (Blei et al., 2003). In this approach, each of the k latent topics is assigned a frequency change value based on the frequency changes in the vocabulary, weighted by their association with the topic (as a latent topic is essentially a distribution across the vocabulary). The topical advection value of a target word is then the mean over the changes in the topics, weighted according to the probability that the word belongs to each given topic. The details of this calculation are given in the Technical Appendix.
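The two weighting steps of the LDA-based variant can be sketched as follows. This is a rough illustration under stated assumptions, not the calculation from the Technical Appendix: it assumes that a topic model has already been fitted elsewhere and that its topic-word distributions and the target word's topic probabilities are available as plain Python structures. All names and numbers are hypothetical.

```python
def topic_changes(topic_word_dists, word_log_changes):
    """One frequency change value per latent topic.

    topic_word_dists: list of dicts, one per topic, mapping word -> weight in that topic.
    word_log_changes: dict mapping word -> log frequency change in the period.
    """
    changes = []
    for dist in topic_word_dists:
        shared = [w for w in dist if w in word_log_changes]
        norm = sum(dist[w] for w in shared)
        if norm == 0:
            changes.append(0.0)  # assumption: a topic with no observed words counts as unchanged
        else:
            changes.append(sum(dist[w] * word_log_changes[w] for w in shared) / norm)
    return changes

def lda_advection(word_topic_probs, changes):
    """Mean topic change, weighted by the target word's probability of each topic."""
    norm = sum(word_topic_probs)
    return sum(p * c for p, c in zip(word_topic_probs, changes)) / norm

# Hypothetical two-topic model and a target word that mostly belongs to the first topic.
topics = [{"war": 0.5, "army": 0.5}, {"tea": 1.0}]
log_changes = {"war": 0.4, "army": 0.2, "tea": -0.1}
per_topic = topic_changes(topics, log_changes)
soldier_advection = lda_advection([0.9, 0.1], per_topic)
```

Unlike the PPMI-weighted version, the topic here is a set of distributions over the whole vocabulary rather than a short list of context words.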
As will be seen below in Section 4.1, the descriptive power of the two models is rather similar. While LDA is widely used, we feel that our simple PPMI-weighted model has certain advantages. In addition to being almost parameter free (as described above), it is much less computationally complex (and thus faster), and its results are easily interpretable. Specifically, each "topic" of a target is a short list of top context words (meaning the advection value, being the weighted mean of their log frequency change values, is on the same scale as the target word log frequency change values). It is also straightforward to observe the behavior of a target word's topic and calculate its advection value both before and after it has entered the language or gone out of use, by re-using the context word list and the corresponding weights from a period where the target word was already (or still) frequent enough for its topic to be inferred.

Results of applying the advection model in a number of language change scenarios
We now turn to two large, representative, POS-tagged corpora, in order to get a sense of how well the topical-cultural advection model performs, and proceed to demonstrate a number of useful applications. We preface the results with a few crucial technical details that apply to all the following subsections, while leaving a more thorough description of the parameterizations of the models and relevant corpus preprocessing steps to the Technical Appendix.
The word counts for each time period (segment) in a corpus were normalized as frequencies per million words. Since cultural effects are likely the most pronounced on content words, particularly nouns (see also Hamilton et al. (2016a)), we only consider common noun targets in the following analyses. For the context vectors (see Section 3.1), we exclude stop words and use only content words (based on POS tags), and use the top m = 75 context words. We set a (rather conservative) threshold of minimum 100 occurrences per period for words to be included in the model. If a word occurs fewer than 100 times in a corpus period, it will not be assigned a context vector (and thus also no advection value for this period), nor will it be used as a context word. This comes down to a classical statistical sampling problem: if the sample is too small, inference is not going to be reliable. If a word only occurs a few times, then its context vector (topic) is more likely to be composed of quite random words, in a random ranking, while if a word is observed numerous times, the ranking of its (recurring) context words becomes more reliable.
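The normalization and the inclusion threshold amount to two simple operations on the per-period word counts, sketched below. The function names and the toy counts are assumptions made for the example, and the POS-based filtering of targets and context words is left out.

```python
def per_million(counts):
    """Convert the raw counts of one period into frequencies per million words.

    `counts` is assumed to cover every token of the period, so that its sum
    is the period's total size.
    """
    total = sum(counts.values())
    return {word: 1_000_000 * n / total for word, n in counts.items()}

def eligible_words(counts, min_count=100):
    """Words frequent enough to get a context vector and to serve as context words."""
    return {word for word, n in counts.items() if n >= min_count}

# Hypothetical counts for a tiny "period" of 1000 tokens.
period_counts = {"computer": 150, "abacus": 40, "data": 810}
freqs = per_million(period_counts)
kept = eligible_words(period_counts)
```

Here "abacus" falls below the threshold and is therefore excluded both as a target and as a context word for this period.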
This however also means that it is not possible to calculate the advection value for low frequency words like recent innovations and words going out of usage. Since these correspond to periods of particular interest for such words, we experimented with using a "smoothing" procedure to improve the informativeness of the topics. Specifically, the 'smoothed' data, used for deriving the topics, comprises text from a target period and its preceding period (word counts still correspond to the frequencies in the target period). This procedure increases the chance of inclusion for relevant context words that would otherwise not be present due to being too low frequency in one or both of the periods. Consequently, it also improves the precision of the advection measure for words decreasing in frequency in a given target period.
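One way to picture the smoothing step is as pooling the co-occurrence counts of the target period with those of its preceding period before topics are derived, while frequency changes are still computed from the target period's own counts. The sketch below assumes that co-occurrence counts are stored per period as dictionaries; pooling counts in this way approximates concatenating the texts (it ignores context windows that would span the boundary between the two periods). The function name and the numbers are hypothetical.

```python
from collections import Counter

def smoothed_cooc(cooc_by_period, t):
    """Co-occurrence counts of period t pooled with those of period t-1, if any."""
    combined = Counter(cooc_by_period[t])
    if t > 0:
        combined.update(cooc_by_period[t - 1])  # Counter.update adds counts
    return combined

# Hypothetical (target, context) counts for two consecutive periods.
periods = [
    {("drone", "pilot"): 60, ("drone", "bee"): 70},
    {("drone", "pilot"): 50, ("drone", "camera"): 80},
]
pooled = smoothed_cooc(periods, 1)
```

The pooled table contains context words from both periods, so a context word that is too rare in either period alone has a better chance of being retained.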

Advection and diachronic language change
We use the Corpus of Historical American English (COHA) (Davies, 2010) as a test set in order to evaluate the extent to which the model is capable of accounting for variance in word frequency changes. The COHA spans two centuries, starting with 1810, is binned into decade-length subcorpora by default, and is meant to be balanced across genres for each period. To test the descriptive power of the two aforementioned implementations of the advection model, PPMI-based and LDA-based, we correlate the log frequency change values of common nouns between successive decades in the COHA corpus to their respective advection values (their log topic frequency change values in the same decades). We calculate the correlations for two samples of nouns. The first one includes data points of frequency changes across 19 decades of all nouns that occur above the chosen frequency threshold of 100 occurrences at least in one period (or in the concatenation of two periods, in case of the smoothed versions). The second one, the persistent subset, includes only nouns that always remain above this 100-word-per-period threshold and furthermore also occur persistently between 20 and 1000 times per million words in the corpus in each decade from 1900-2000. Our results are presented in Table 1.

Table 1: The $R^2$ values resulting from correlating frequency changes and advection values based on the two methods, with and without smoothing (all p < 0.001). Left half: models using all nouns that occur above the threshold at least in one period; frequency change data points from 19 decades (more data points in the smoothed versions, because concatenated data results in more words being above the minimal threshold). Right half, separated by a double line: models using the persistent subset.
We find that, as expected, frequency changes correlate significantly and positively with advection, and that the smoothing operation further improves the correlation. We find that we obtain similar $R^2$ values for both the PPMI-weighted and LDA-based methods (in fact, slightly higher for the less complex PPMI version introduced here). The correlation between advection and word frequency changes is also evident from Figure 1. The different scales on the axes indicate that words experience more rapid changes in either direction than topics, as one might expect, topic values being averages of context word frequency changes. We also find that the correlations are stable across the different time periods (see Technical Appendix). These results clearly show that topical fluctuations can be expected to explain a significant amount of change in word frequencies, which one might otherwise be tempted to attribute to other processes, such as selection, be it due to content or formal biases, sociolinguistic prestige, phonological neighborhood effects (e.g. Newberry et al., 2017), or other changes in the transmission biases of speakers (cf. Enfield, 2014). As such, the topical-cultural advection measure serves as a useful baseline in any quantitative model predicting frequency changes in linguistic elements.
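For readers who wish to reproduce this kind of comparison on their own data, the $R^2$ of a simple correlation between advection values and log frequency changes can be computed as the squared Pearson correlation coefficient. The sketch below assumes this is an adequate stand-in for the statistic reported in Table 1; the function name and the data points are hypothetical and are not taken from COHA.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy * sxy / (sxx * syy)

# Hypothetical data points: one (advection, log frequency change) pair per noun and decade.
advection_values = [0.05, -0.10, 0.20, 0.00, 0.15]
frequency_changes = [0.30, -0.40, 0.50, 0.10, 0.20]
r2 = r_squared(advection_values, frequency_changes)
```

Because each data point pairs a word's own log frequency change with the weighted mean change of its context words, both variables are on the same log scale.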

Artificially-constructed language change based on genres in a synchronic corpus
Having established that advection constitutes one (small but significant) contribution to word frequency change in general, we now test whether our model can identify instances where it is the main contribution to change. This is difficult to determine with natural data, as one does not know a priori what the driver of change is. To deal with this problem in a controlled way, we construct an artificial corpus wherein the main component of change between two time periods is a known stylistic shift. We should then find that changes in word frequencies are strongly correlated with topics that are more prevalent in one style than the other.
Specifically, we employ the Corpus of Contemporary American English (COCA) (Davies, 2008), which is the synchronic cousin of COHA. This consists of contemporary American English language texts from the relatively short time period 1990-2012, but crucially is spread across multiple (labelled) genres. We used the 'academic' and 'spoken' subcorpora to construct a "change" in language from academic to spoken style and content, by defining the former as one "period" and the latter as the following one. We then measured the log frequency changes of nouns, as in the previous section, and their respective advection (log topic frequency change) values. Again, the advection measure correlates positively with frequency change, and describes a notable amount of its variability: in our favoured PPMI-based model, we find R² = 0.52 without smoothing and R² = 0.8 with smoothing applied. 7 This is to say, the advection model appears to successfully pick up on the genre change, reflected in the high (positive) correlation value: the decrease in academic and increase in spoken style word frequencies correspond to the fall of the academic and rise of the spoken topics or genres. Importantly from the perspective of validating our model, the R² values are higher here than in the analysis of COHA. Presumably there are other forces affecting word frequencies in the COHA besides topic fluctuations, and the changes between decades are likely less stark. Table 2 lists the words most strongly affected by the simulated language change for comparison.

Using advection for time series decomposition to control for topical fluctuations
Having measured the descriptive power of the advection model and demonstrated how it behaves with re-evaluated topics over time, we now turn to an application of the model to deal with the confounds set out in Section 2.3.3. When it comes to predicting frequency changes of words or any other linguistic elements between periods of time, the advection measure can be included as a control variable in a predictive model (see Section 4.1). In the case of time series analysis (i.e., involving multiple changes over time), it is possible to utilize the advection measure as a form of (in the following example, additive) time series decomposition, by carrying out the following operation. For a given word, for every period data point: subtract the advection value (log topic frequency change) from the word's log frequency change value. We make use of the PPMI vectors based model here: the advection values therein are averages over individual word log frequency changes, so the two quantities are on the same natural scale and can simply be subtracted from each other.
The time series of word frequencies can be subsequently reformed as the (exponential of the) cumulative sum of the resulting log change values. The operation described above is analogous to seasonal decomposition, a commonly applied approach in (multi-year) time series analysis to control for seasonal ups and downs (e.g., heating costs in cold and warm seasons). In our case, the "seasonality" (advection) is not inferred from the time series itself, but calculated independently. Another way of looking at this is as a way of distinguishing the metaphorical "word of the day", one that is selected for, from a word that just comes and goes with the "topic of the day". Adjusting for topics has the potential to be useful in carrying out more objective tests of linguistic selection (cf. Newberry et al., 2017), by controlling for the topical-cultural element. Figure 2 illustrates the results of the aforementioned decomposition operation on four words with different frequency trajectories. 8 While the word car rises steadily across time, so do its topics (initially relating to trains, later to automobiles); discounting the topic renders the frequency time series quite flat. Similarly, payment peaks at times of recession, as does its topic. negro peaks above its topic in the 1900s, only to fall almost completely out of use later, while its topic (i.e., set of related words) carries on. 9 Finally, while happiness is decreasing (with a notable drop in the 1930s), its related topic(s) do as well, resulting in a slightly less drastic adjusted time series.
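The decomposition operation described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code behind Figure 2: the function name is hypothetical, and anchoring the adjusted series at the first observed frequency is a choice made here for readability.

```python
import math
from itertools import accumulate

def adjust_series(start_freq, word_log_changes, advection_values):
    """Topic-adjusted frequency series for one word.

    start_freq: frequency in the first period (the anchor; an assumption of this sketch).
    word_log_changes: log frequency change of the word for each later period.
    advection_values: log topic frequency change for the same periods.
    """
    # Subtract the advection value from the word's log change, period by period.
    residuals = [w - a for w, a in zip(word_log_changes, advection_values)]
    # Re-form the series as the exponential of the cumulative sum of residuals.
    return [start_freq] + [start_freq * math.exp(c) for c in accumulate(residuals)]

# Hypothetical word whose rise is fully matched by its topic: the adjusted series is flat.
print(adjust_series(50.0, [0.2, 0.3], [0.2, 0.3]))
```

With all advection values set to zero the function simply reproduces the original frequency series, which is a convenient sanity check.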
One obvious concern with using the advection measure for decomposition (subtracting topic frequency change from word frequency change) is that it might over-correct frequency changes and interfere with observing genuine competition in language, whereby one lexical element is replaced with a synonym over time. To investigate this possibility, we construct a second artificial corpus in which a set of words are replaced with (invented) synonyms in a controlled way.
Our manipulation of the preparsed COHA corpus (cf. Section 4.1) was as follows. We selected four nouns of various frequencies that each: occur frequently enough in the corpus during the past century to evaluate their topics; exhibit relative stability across 11 periods (decades 1900-2000) in terms of both their occurrence frequency and their meaning (the latter based on the APSyn measure (cf. Section 3.2) applied to their context word vectors); and have small (absolute) advection values. The words roof (frequency at period 1: 160 per million words), reason (707), town (733), and face (1898) satisfied these criteria.
We then generated artificial competing synonyms by replacing a linearly-increasing 10 proportion of the occurrences of each of the four target words with an invented "synonym" (word′) in the corpus. For example, at period 1, the invented synonym town′ appears nowhere in the manipulated corpus, while in period 2, 10% of the occurrences of town are replaced with town′, 20% in period 3, and so on up to 100% in period 11. Importantly, the replacement positions in the corpus were sampled at random, in order to simulate a scenario where the two synonyms are used freely (i.e., without regard for any contextual factors like style or genre).
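The manipulation just described amounts to random replacement at a linearly growing rate. The sketch below illustrates it on a tokenized period subcorpus; the function names and the synonym label `town_prime` are hypothetical, and rounding the number of replaced tokens to the nearest integer is an assumption of this sketch.

```python
import random

def linear_schedule(n_periods):
    """Replacement proportions rising linearly from 0 in the first period to 1 in the last."""
    return [p / (n_periods - 1) for p in range(n_periods)]

def replace_with_synonym(tokens, target, synonym, proportion, rng):
    """Replace a given proportion of the occurrences of `target` with `synonym`,
    choosing the replaced positions uniformly at random."""
    positions = [i for i, tok in enumerate(tokens) if tok == target]
    n_replace = round(proportion * len(positions))
    chosen = set(rng.sample(positions, n_replace))
    return [synonym if i in chosen else tok for i, tok in enumerate(tokens)]

rng = random.Random(0)  # fixed seed so the toy run is repeatable
period_tokens = ["town"] * 20 + ["roof"] * 5  # hypothetical period subcorpus
for period, proportion in enumerate(linear_schedule(11), start=1):
    manipulated = replace_with_synonym(period_tokens, "town", "town_prime", proportion, rng)
    print(period, manipulated.count("town"), manipulated.count("town_prime"))
```

Because positions are drawn without regard to context, the two forms end up sharing the same topics, which is what makes the scenario a clean test of the correction.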
On applying the advection correction to each of the original words and their synonyms, we find their frequencies are shifted slightly from their known values. However, if we plot the advection-adjusted fraction of occurrences of a word or its invented synonym, the linear replacement is clearly apparent, as can be seen from Figure 3. In other words, advection-based decomposition does not obscure genuine (although in this case artificial) cases of selection.

Advection predicts lexical innovation
McMahon (1994: 194) notes that "new words are most likely to survive, and indeed to be created in the first place, if they are felt to be necessary in the society concerned. This is a difficult notion to formalize, but a well-established one". Previous empirical research has linked vocabulary size with communicative need as well. Studying colour words in 110 languages across the world, Gibson et al. (2017) argue that the communicative needs arising from the environment where these languages are spoken dictate (to an extent) the colour naming systems that emerge. In another cross-linguistic study, Regier et al. (2016) show that the need for efficient communication, which varies across cultures and environments, does seem to drive vocabulary size (in their case, of words for 'ice' and 'snow').
From a historical perspective, this suggests the hypothesis that an increasingly popular topic (i.e. exhibiting positive advection) would be expected to attract new words, providing the detailed vocabulary required. Conversely, a new word would be expected to exhibit a strong positive advection at its period of first occurrence, compared to the advection values of its topic in previous periods. We are now equipped to test the latter hypothesis.
We identified a test set of 133 "successful" novel common nouns from the COHA that meet the following criteria: our successful novel nouns appear as new words in the 1970s to 2000s, and, importantly, occur with high enough total frequency across (at least some of) these decades for their topics to be reliably modeled (it is in this sense that the nouns are "successful"). Notably, each period of COHA includes a rather large number of new words, but most of them occur at very low frequencies. Figure 4 illustrates the differences in subcorpora sizes across decades in the corpus and the number of new nouns per period. 11 To remedy the small sample problem particularly relevant to new words (that often start out at low frequencies), we again used the simple "smoothing" technique (see introduction of Section 4), this time concatenating data from all of the last four decades for the purposes of constructing the topic vectors. We chose only new words from the last few decades of the corpus in order to carry out the following comparison. As each topic consists of a list of words, we computed their advection values (log frequency changes) across ten decades preceding the decade where the target word would first occur in the corpus. 12 This allowed us to measure how many of the (successful) new words belong to topics that exhibit higher advection than before in the period where the new word first appears. For 55% of novel nouns out of the 133, the advection value of the topic associated with the word was found to be above the upper bound of the 95% confidence interval of the mean of its advection values over the previous 100 years. 38% fell around the means, and 7% were below the lower bound of their respective confidence intervals. 13 We also conducted a t-test in the following manner to test the apparent tendency.
We calculated the z-score of the advection value of each of the 133 new words at the decade of first occurrence, using the mean and standard deviation values of the previous decades (separately for each of the new words). A one-sample t-test on this set of z-scores indicated that its mean is significantly (p < 0.001) above zero -or in other words, the advection values of new words are on average significantly higher at the time of entry than in preceding decades.
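The test just described can be sketched with the standard library alone. This is a hedged illustration, not the analysis code: the function names and toy numbers are hypothetical, the sample standard deviation is assumed, and only the t statistic is computed. Obtaining the p-value would additionally require the t distribution, for which a statistics library (for instance SciPy's `ttest_1samp`) could be used.

```python
import math
import statistics

def entry_z_score(previous_advection, entry_advection):
    """z-score of the advection value at the period of first occurrence, relative to
    the mean and (sample) standard deviation of the topic's earlier advection values."""
    mean = statistics.mean(previous_advection)
    sd = statistics.stdev(previous_advection)
    return (entry_advection - mean) / sd

def one_sample_t(values, mu=0.0):
    """t statistic for testing whether the mean of `values` differs from `mu`."""
    n = len(values)
    return (statistics.mean(values) - mu) / (statistics.stdev(values) / math.sqrt(n))

# Hypothetical new words: (advection in the ten preceding decades, advection at entry).
new_words = [
    ([0.01, -0.02, 0.03, 0.00, 0.02, -0.01, 0.01, 0.02, 0.00, 0.01], 0.09),
    ([0.05, 0.02, -0.03, 0.04, 0.01, 0.00, 0.02, -0.01, 0.03, 0.02], 0.06),
    ([-0.02, 0.01, 0.00, 0.02, -0.01, 0.03, 0.01, 0.00, 0.02, -0.02], 0.01),
]
z_scores = [entry_z_score(prev, entry) for prev, entry in new_words]
print([round(z, 2) for z in z_scores], round(one_sample_t(z_scores), 2))
```

A positive t statistic indicates that, on average, topics are rising faster than usual when their new words enter the corpus.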
These findings suggest that the appearance of new words does indeed correspond to the rise of certain topics, or the increasing communicative need for new words. Figure 5 illustrates this effect for three novel words that enter into the corpus at different advection values.
12 Importantly, the advection calculation only took into account words that actually occur (frequency above 0) in a given decade. For example, cappuccino, a relevant topic word for latte, does not appear before the 1970s, so its 0-frequency changes are not allowed to bias the earlier advection values to be closer to 0. Although some topic words are also new, like in this example, most topic words do occur in previous decades. The average vector length across all the test words and decades was 65 out of 75 (the chosen vector length m used throughout this study), the shortest being 35 words.
13 We also checked if the large number of new words above their mean advection values could possibly be due to some particular semantic cluster of words that might all belong to a similar (trending) topic and thus inflate the results. We computed the APSyn similarity (Santus et al., 2016) on all pairs of the topic vectors of the 133 nouns and found them to be sufficiently dissimilar, with a mean of 0.004 and a maximum similarity of 0.3, on a standardized scale of 0 to 1 (where 1 stands for identical).
Figure 5: The dashed and dotted green line: the advection (log change) values of the topic from the word's entry period (corresponds to the left hand axis; above 0 indicates increase, below 0 decrease in frequency). The black circle emphasizes the advection value in the word's entry period that is compared against the mean of the preceding advection values; the mean is indicated with the horizontal solid red line, with its confidence interval as the light red shading around the mean line. Note that the value is above the mean for latte, around the mean for pantsuit, and below the mean for recliner. The dark green words: the top topic words (lemmas) of the target word in the entry period.

Discussion
A number of factors operating on the level of the individual speaker that potentially influence linguistic selection have been proposed and tested, either in experimental settings, simulations, or corpora with speaker metadata. These include the competing pressures of learnability, expressivity and efficiency (Carr et al., 2017; Kanwal et al., 2017; Kirby et al., 2008; Smith et al., 2013), egocentricity and content biases (Tamariz et al., 2014), socially conditioned variation (Samara et al., 2017), and various other social effects (Calude et al., 2017; Labov, 2011; Lev-Ari and Peperkamp, 2014). While language change is perpetuated by the utterance selections of individual speakers over time, some factors also influencing selection may be seen as properties of the population, or those of the linguistic system, such as various structural-phonological properties (e.g., the aforementioned Szmrecsanyi (2016)), phonological dispersion and clustering (Dautriche et al., 2017; Dautriche et al., 2016; Newberry et al., 2017), polysemy (Calude et al., 2017; Hamilton et al., 2016b), social network properties (Baxter et al., 2009; Castelló et al., 2013), relative inter-community prestige (Abrams and Strogatz (2003) and following literature) and community consensus (Pierrehumbert et al., 2014).
A language corpus is essentially a sample of aggregated utterance selections by (a sample from) the population of speakers. In principle, these and other factors could be tested on a corpus, as some have been: a diachronic one in the case of claims about change dynamics, and a synchronic one if the claims concern properties of language as such. Models connecting individual-level biases and population-level observations have been recently proposed as well (Kandler and Powell, 2018; Kandler et al., 2017). In the former, diachronic case, if the analysis were to involve changing frequencies over time, then the topical-cultural advection model would be straightforwardly applicable as a factor of control or baseline change. It could likely also improve tests for selection and drift (Newberry et al., 2017) by adjusting for the component of fluctuating topics presumably driven by socio-cultural processes or "newsworthiness". While contextual suitability for a topic could be argued to be a signal for selection on its own, our model remains applicable, allowing for a quantification of that signal, or to be used as a predictor on its own, as shown in Section 4.4.
In the case of natural language, our contribution does require a certain amount of data to be reliable (in terms of inference of the topics, cf. Section 4). As such, it is directly applicable to (sufficiently large) diachronic corpora (i.e., involving 2 or more time periods), but less likely so in experimental settings. Then again, if an artificial language (e.g. Kirby et al., 2008) with a limited number of elements were to be used, then the model may be applicable in some cases (i.e., if the frequency of elements over time or iterations is of import, the experiment involves free production, the co-occurrence of elements yields reliable topics, and these topics are suspected to influence frequencies via advection). In principle, the advection model could also be used in other domains of cultural evolution, where there is diachronic data available about the systematic co-occurrence of traits or properties (in lieu of context words) of cultural elements (in lieu of target words, such as nouns in the previous sections).
In a sense, our model also orthogonally complements the momentum model of Stadler et al. (2016). They demonstrate, using a simulation of language evolution, that change can self-perpetuate without selection, when a linguistic variant gains enough momentum in its frequency changes over time. While they model momentum from the frequency change of a variant itself, we model the frequency change of a variant potentially driven by the frequency change in its immediate contextual topic (not itself), or what could be called 'topical momentum'. It should be noted, however, that claims of causation call for caution.

Conclusions
We presented the topical-cultural advection model, along with two potential implementations, as a straightforward method capable of capturing topical effects in frequency changes of linguistic elements over time. In particular, we demonstrated that the model accounts for a considerable amount of variability in noun frequency changes between decades in a corpus spanning two centuries, retains its capacity when used on an artificially sampled corpus where a change in style and contents has been simulated, and can, to an extent, predict lexical innovation, based on increases in topic frequencies. We also introduced a way of using the advection measure for time series decomposition to distinguish (presumably selection-driven) changes from topical fluctuations (or potentially uneven corpus sampling) in frequency time series of linguistic elements. We conclude that the topical-cultural advection model adds an important analytical approach to the toolkit for corpus-based lexical dynamics research, or any investigation drawing inference from changing frequencies of linguistic (or other cultural) elements over time.

Technical appendix

Notes on preprocessing and parameters
We take a number of preprocessing steps to ensure a reasonable quality in the inference of the topic vectors that underlie the advection model. Both in the case of COHA and COCA, we exclude stop words (and also a list of known OCR errors) and use only content words (based on corpus POS tags). While COHA and COCA distinguish proper and common nouns in their tagging, we noticed quite a few proper nouns were tagged as common ones, hence we decided to remove all capitalized words (this is particularly relevant in the context of Section 4.4, where we needed to avoid detecting mistagged proper nouns as innovative common nouns).
We used a context window of 5 words on both sides of the target word (after the removal of stop words, etc.), linearly weighted by distance, for inferring co-occurrence. The co-occurrence matrices were subsequently weighted, using the positive pointwise mutual information (PPMI) between each target word w and context word c:
\[ \mathrm{PPMI}(w, c) = \max\left( 0, \; \log \frac{P(w, c)}{P(w)\,P(c)} \right) \]
This is essentially a weighting scheme that gives more weight to co-occurrence values of word pairs that occur together but not so much with other words, and less weight to pairs that co-occur with everything. Since we set a threshold of 100 occurrences per period for a word to be included, we circumvent the known small values bias of PPMI. Since we use positive PMI, all co-occurrence values end up as ≥ 0. See e.g. the Jurafsky and Martin (2009) textbook for further details and examples.
For the advection model based on vectors drawn from a PPMI-weighted co-occurrence matrix, we use the top m = 75 context words as the topic (having observed that very small values lead to less reliable topics, while considerably larger values deteriorate the results in some cases). Importantly, the word counts (that underlie the log change values, which in turn make up the advection values) for each period were normalized to per million frequencies using the total word count in that period (periods corresponding to decades by default in COHA).
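The steps from co-occurrence counting to topic extraction can be sketched as below. This is a minimal sketch under stated assumptions: all function names are hypothetical, the exact form of the linear distance weighting (weight 1 for adjacent words, falling to 1/window at the window edge) is assumed, base-2 logarithms are an arbitrary choice, and dropping 0-weight context words from the topic is a choice made here.

```python
import math
from collections import defaultdict

def cooccurrence(tokens, window=5):
    """Distance-weighted co-occurrence counts within +/- `window` tokens."""
    counts = defaultdict(lambda: defaultdict(float))
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                # Assumed linear weighting: closer context words count more.
                counts[target][tokens[j]] += (window - abs(i - j) + 1) / window
    return counts

def ppmi_weights(cooc):
    """Turn {target: {context: count}} into {target: {context: PPMI}}."""
    total = sum(sum(row.values()) for row in cooc.values())
    row_sums = {t: sum(row.values()) for t, row in cooc.items()}
    col_sums = defaultdict(float)
    for row in cooc.values():
        for context, n in row.items():
            col_sums[context] += n
    weighted = {}
    for target, row in cooc.items():
        weighted[target] = {}
        for context, n in row.items():
            if n <= 0:
                continue  # no co-occurrence, nothing to weight
            p_joint = n / total
            p_target = row_sums[target] / total
            p_context = col_sums[context] / total
            pmi = math.log2(p_joint / (p_target * p_context))
            weighted[target][context] = max(0.0, pmi)  # keep only positive PMI
    return weighted

def top_context_words(weighted_row, m=75):
    """The topic of a target: its m highest-weighted context words with their weights."""
    ranked = sorted(weighted_row.items(), key=lambda kv: kv[1], reverse=True)
    return [(context, w) for context, w in ranked[:m] if w > 0]

toy = {"latte": {"coffee": 4.0, "the": 1.0}, "car": {"road": 4.0, "the": 1.0}}
print(top_context_words(ppmi_weights(toy)["latte"]))
```

In the toy matrix, the context word shared by both targets receives no weight, while the context word specific to one target is retained as its topic.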

Algorithmic description
1. Preprocessing steps
1.1 (optional) Basic text cleaning (using a list of OCR errors, a list of stop and function word tags, words shorter than 3 characters), keep only content words; remove all capitalized words to avoid proper nouns
1.2 (optional) Affix tags to words in the POS class of interest (e.g., nouns in our case; more tags and more specific tags improve disambiguation, but also increase sparsity)
[...] (0)).

(A) Topics and advection (if using the PPMI vectors based approach)
3.1 Generate term co-occurrence matrices for each period (e.g., target words as rows and context words as columns), using a context window of some length (we used ±5, and linearly weighted context words by distance within the window)
3.1.1 (optional) If targeting a specific POS class, filter the matrices by keeping only rows with the previously affixed tag
3.1.2 (optional) Filter by setting a frequency threshold for a word to be included (we used a threshold of 100 raw occurrences per period, or per concatenated dataset if using smoothing)
3.2 Apply positive pointwise mutual information (PPMI) weighting to each matrix
3.3 Retrieve and store relevant context words for each target, in each period (i.e., sort each row of each matrix and store the top m context words, along with their PPMI weights in that row; we used m = 75)
3.4 (optional) To apply the "smoothing" operation, concatenate data from pairs of successive periods instead, and apply the previous 3 steps
3.5 For each target word ω, in each period t, calculate its advection value as the weighted mean of the log frequency changes of its top context words:
\[ \mathrm{advection}(\omega; t) = \frac{\sum_{i=1}^{m} w_i \, \mathrm{logChange}(c_i; t)}{\sum_{i=1}^{m} w_i} \]
where c_1, ..., c_m are the top m context words of ω in period t and w_1, ..., w_m are their PPMI weights.

(B) Topics and advection (if using the LDA-based approach)
3.1 Train Latent Dirichlet Allocation (Blei et al., 2003) models for all period subcorpora (we used the following parameters: α = β = 0.1, k = 500, maximum allowed iterations: 5000)
3.2 For each word ω in each period t, calculate its advection value:
3.2.1 Given the k topics, τ, identified by LDA, we determine the number of times n(ω, τ) that each word ω appears in each of the topics τ. From this we can define the two conditional distributions $p(\omega|\tau) = n(\omega, \tau) / \sum_{\omega'} n(\omega', \tau)$ and $p(\tau|\omega) = n(\omega, \tau) / \sum_{\tau'} n(\omega, \tau')$. Given a word frequency change logChange(ω; t) at time t, its contribution to the change of the topic τ is logChange(ω; t) p(τ|ω).
4. (optional) Measure the descriptive power of the advection model by correlating the advection value of each word in each period to its respective log frequency change value.
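For the PPMI vectors based approach, the advection value of a target word is a weighted mean of the log frequency changes of its topic words. The sketch below illustrates this under stated assumptions: the function names and toy frequencies are hypothetical, and topic words with zero frequency in either of the two periods are simply skipped (one possible way of keeping zero-frequency words from biasing the value).

```python
import math

def log_change(freq_now, freq_prev):
    """Log change of a per-million frequency between two successive periods."""
    return math.log(freq_now / freq_prev)

def advection(topic, freqs_now, freqs_prev):
    """Weighted mean log frequency change of a topic.

    topic: list of (context_word, ppmi_weight) pairs for one target word in one period.
    freqs_now, freqs_prev: per-million frequencies of words in periods t and t - 1.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for word, weight in topic:
        f_now = freqs_now.get(word, 0.0)
        f_prev = freqs_prev.get(word, 0.0)
        if f_now <= 0 or f_prev <= 0:
            continue  # assumption of this sketch: skip words absent from either period
        weighted_sum += weight * log_change(f_now, f_prev)
        weight_total += weight
    return weighted_sum / weight_total if weight_total > 0 else float("nan")

# Hypothetical topic of "latte" with per-million frequencies in two decades.
topic = [("coffee", 2.0), ("milk", 1.0), ("cappuccino", 5.0)]
freqs_prev = {"coffee": 10.0, "milk": 10.0}
freqs_now = {"coffee": 20.0, "milk": 10.0, "cappuccino": 3.0}
print(round(advection(topic, freqs_now, freqs_prev), 4))
```

The resulting values, one per target word and period, are the quantities correlated with the words' own log frequency changes in step 4.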
Additional remarks on the model and data processing

Variations in operationalizing the test corpus
The results presented in Section 4.1 were based on correlating all frequency changes of the target words in all decades to their respective advection values in the respective periods. We also ran the correlations period by period and although the results varied somewhat between periods, they appeared quite consistent (cf. lower panel of Figure 6). Accordingly, running mixed effects regression models with intercepts and slopes for periods appeared to improve the overall descriptive power (based on conditional pseudo-R²) somewhat.
The aforementioned results were based on comparing frequency changes between decade-length bins of the COHA. We also experimented with different temporal distances to see if the model behaves considerably differently, with the following results (see Fig. 6).

Corpus size
We used fairly large corpora (COHA and COCA) for our analyses. The performance of the advection model on smaller corpora would probably vary, and is of course open for experimentation in terms of the parameters, thresholds and possibly the topical-semantic smoothing as described above. Superior results could potentially be achieved using more sophisticated methods of topic modeling with carefully optimized parameterizations (we did also implement the advection measure using a traditional topic modeling algorithm, LDA, which did not end up outperforming our simpler model in terms of descriptive power). Notably, the advection model is not expected to work as well with highly polysemous or general words (and homonyms), as it would with words with a more specific meaning (unless the meanings are somehow disambiguated and sense-tagged). Polysemy, however, is a widespread problem across most NLP tasks, not only the one at hand.
We re-evaluated the topics of words for every period to accommodate for natural semantic change. In principle this is not necessary, if the meaning of a word is known to be very stable across time. In this case, the context vector from a single period, or aggregated across periods, could be used. The latter would also remedy the inherent problem of inferring context vectors for low-frequency words.

Semantic change, corpus smoothing
We note that the advection model should not be affected by the recent critique by Dubossarsky et al. (2017), who show that distributed semantic change measures tend to be biased by differences in frequency. In particular, they call into question the entire enterprise of automatically measuring meaning change, attempting to replicate previous studies (Dubossarsky et al., 2015; Hamilton et al., 2016b) and finding that the proposed results either do not hold up or have drastically diminished descriptive power in comparisons against randomized baselines, attributing them to problems in vector space construction methods as well as bias from word frequency.
The same context word vectors we use to determine topic could indeed easily also be used to determine semantic change, by comparing the lists of top context words (cf. Figure 7) between periods either by directly using the APSyn measure (cf. Section 3.2), or comparing the entire (suitably aligned) PPMI context vectors using vector cosine (in case of the former, care should be taken not to include 0-weight words in the topics, since APSyn only considers the rankings of context words in the vector, not their weights).
However, advection (topic frequency change) is meant to be re-evaluated for each corpus period. As such, semantic change is not directly a concern. We did also demonstrate additional results using what we called "smoothing" (Section 4), or concatenating the data from the target period t and the preceding period t − 1 for the purpose of inferring topic vectors. In our experiments, this improved the power of advection to predict frequency change. In principle, smoothing could be applied using any number of t ± n periods; we also experimented with concatenating the entire corpus, and found that the descriptive power of the advection model suffered considerably. We assume semantic change to be the reason, since the context words (using which the advection measure is calculated) relevant to a target in one period may be quite irrelevant in another period, if the use (meaning) of the target differs, leading to uninformative topics.
Finally, we should emphasize that this is not meant as a model of causation: it does not make any claims concerning the direction of the influence (words becoming more or less frequent because of a trending or declining topic, or groups of words, i.e., the topics, following one influential trending word). It does, however, provide a way to measure the correlation between the frequency change of a word and the frequency change in its topic.
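The rank-based comparison of top context word lists mentioned earlier in this subsection can be illustrated as follows. This is a hedged sketch of the APSyn measure as we understand it from Santus et al. (2016): each shared context word contributes the inverse of its average rank in the two lists. The function names are hypothetical, and the standardization to a 0 to 1 scale (dividing by the score of two identical lists) is an assumption of this sketch.

```python
def apsyn(ranked_a, ranked_b):
    """APSyn-style similarity of two ranked context word lists (best-ranked first).
    Only ranks are used; weights are ignored, so 0-weight words should be
    removed from the lists beforehand."""
    rank_a = {word: i for i, word in enumerate(ranked_a, start=1)}
    rank_b = {word: i for i, word in enumerate(ranked_b, start=1)}
    shared = rank_a.keys() & rank_b.keys()
    return sum(1.0 / ((rank_a[w] + rank_b[w]) / 2.0) for w in shared)

def apsyn_standardized(ranked_a, ranked_b):
    """Rescale so that two identical lists score 1 (assumed standardization)."""
    longest = max(len(ranked_a), len(ranked_b))
    max_score = sum(1.0 / r for r in range(1, longest + 1))
    return apsyn(ranked_a, ranked_b) / max_score if max_score else 0.0

# Hypothetical topics of one target word in two periods.
early = ["stage", "theatre", "play", "opera"]
late = ["film", "stage", "movie", "play"]
print(round(apsyn_standardized(early, late), 3))
```

A low score between the topics of the same word in two periods would point to a shift in its typical contexts.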

Using advection to visualise fluctuations of and changes in topics over time
We introduce here another possible way of investigating the rise and fall of topics over time. Figure 7 illustrates how the advection model straightforwardly captures these processes. We chose three English words (actress, security, merchandise) that do not change much in frequency over time in the COHA, and remain frequent enough over a century to allow for their topics to be reliably estimated. Figure 7 depicts both the decade-by-decade advection values (weighted mean log frequency change of context words) of the target words and the (cumulative sums of) log frequency changes of the topics from previous decades. As a topic is a list of words with weights, it is a simple task, particularly in the PPMI vectors based approach, to calculate the weighted mean frequency change of a topic from a given period in any other period.
For example, the topics of actress from the early 20th century wane over time (cf. Figure 7), while the later cinema-related topics stay strong (judging by the topics, the meaning of actress itself has shifted somewhat). The topic of security has shifted from the financial realm to that of physical and national security, yet financial topics in general remain relevant in the corpus.
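Following a topic from one period through later periods can be sketched as below. This is an illustrative sketch, not the code behind Figure 7: the function names and toy frequencies are hypothetical, topic words with zero frequency in either period of a transition are skipped, and a transition with no usable topic words is treated as no change (both assumptions of this sketch).

```python
import math
from itertools import accumulate

def topic_log_change(topic, freqs_now, freqs_prev):
    """Weighted mean log frequency change of a fixed topic between two periods."""
    usable = [(w, math.log(freqs_now[c] / freqs_prev[c]))
              for c, w in topic
              if freqs_now.get(c, 0) > 0 and freqs_prev.get(c, 0) > 0]
    total_weight = sum(w for w, _ in usable)
    if total_weight == 0:
        return 0.0  # assumption: no usable topic words means no measurable change
    return sum(w * change for w, change in usable) / total_weight

def track_topic(topic, freqs_by_period):
    """Cumulative sum of the topic's log changes over successive periods."""
    changes = [topic_log_change(topic, now, prev)
               for prev, now in zip(freqs_by_period, freqs_by_period[1:])]
    return list(accumulate(changes))

# Hypothetical early topic of "actress", followed over three decades.
early_topic = [("stage", 2.0), ("theatre", 1.0)]
decades = [
    {"stage": 40.0, "theatre": 30.0},
    {"stage": 30.0, "theatre": 30.0},
    {"stage": 20.0, "theatre": 15.0},
]
print([round(v, 3) for v in track_topic(early_topic, decades)])
```

A steadily falling cumulative sum, as in this toy case, corresponds to a topic that wanes over time.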
Note that the plot shows the cumulative sum of the log frequency changes of earlier topics (where each such time series is initiated with the advection value of the period where the topic is from). Calculating just the 'frequencies of topics' as a mean of topic word frequencies would not be a sensible thing to do: the words that make up a topic may occur at considerably different frequencies, so their mean would not be particularly meaningful, whereas their mean log change is. We also took care not to include topic words with 0 frequency in a given period in the calculation of the mean log change value for that period (as a change from 0 to 0 would just be 0, biasing the results).
Figure 6: Upper panel: each colored line corresponds to frequency data from one decade (starting with the red 1810s and ending with the pink 1990s). Each point on the line indicates the descriptive power (R²) of the advection model, in describing the variance in frequency changes between the origin and future periods. E.g., the red 1810s line at the 1820s mark indicates that the advection measure (based on re-evaluated topics) describes 26% of the variance in changes in frequencies of (1981) nouns between the 1810s and 1820s; the red 1810s line ending on 0.28 at the 2000s mark means that it describes 28% of the variance in changes in frequencies of (5273) nouns between the 1810s and 2000s (including those which were at 0 frequency in the 1810s). The dashed horizontal line corresponds to the value reported in Section 4.1, of the model using only changes between immediate decades (and not differentiating between decades). Values corresponding to models between immediate decades (like the one in Section 4.1) have been indicated with black circles. The thick grey line, corresponding to the right-hand axis, indicates the number of unique nouns in each decade model that can be assigned a topic in that decade (the corpus being larger in more recent decades). Lower panel: the same data arranged to show the effect of temporal distance on the descriptive power of the advection model: with increased distance, the values slightly improve for some decade subcorpora, but not all.
Figure 7: The round green points connected with the solid line give the advection value of the target word (weighted mean of the log frequency changes in the topic, consisting of relevant context words) as re-calculated for each period (with smoothing, see Section 4). The dark green words: the top topic words (lemmas) of the target word in that period. The dashed lines, ranging in colour from light gray (earlier) to dark blue (more recent topics), show the cumulative sum of the advection (log change) values of previous topics: as each topic is simply a set of words (with weights), it is possible to track how this particular set fares over time.