Dialect differences and linguistic divergence: A cross-linguistic survey of grammatical variation

This article presents a new type of comparative linguistic survey, analysing (socio-)linguistic variation in a database of 1155 grammatical constructions drawn from 42 diverse languages. We focus in particular on variation in the expression of grammatical meanings, and the extent to which grammatical variation differentiates geographic dialects. This is the first study we know of to present a systematic, cross-linguistic survey of dialect differentiation. We identify three main structural types of grammatical variation: FORM, ORDER and OMISSION, and find that in situations of close contact between dialects, where signalling of distinct group identities is more relevant, form variables are more likely to differentiate dialects than the other two types. Order and omission variables usually only differentiate dialects that have minimal contact. Our survey suggests that social signalling may have a substantial role in the divergence of grammars, and provides systematic support for previous proposals regarding convergence and divergence under contact.


Introduction
Language simultaneously expresses semantic content and signals aspects of social identity. Comparative linguistics has explored in great detail how linguistic form maps to semantic content, but there has been little comparative research on the relationship of language structure to social identity. Variationist sociolinguistics offers detailed studies of social identity signalling, but has thus far provided relatively limited cross-linguistic comparison (Stanford 2016; Di Garbo et al. 2021). In this study we address this gap using a new type of data: a survey of linguistic and sociolinguistic variables from a wide range of language families.
Our interest in social signalling lies particularly in the role it plays in fomenting differences between language varieties. Under one model of linguistic differentiation, social groups separate and progressively lose contact, which allows their respective language varieties to gradually become more different from one another (Paul 1888: 25). On this view, linguistic differentiation is a function of time spent apart. But in recent years another type of differentiation has come to light, sometimes known as 'linguistic divergence'. Studies of egalitarian multilingualism have shown that social groups may live in very close interaction, while nonetheless carefully policing their language borders, deploying social norms and interactional etiquette to maintain or enhance the distinctness of their language varieties (e.g. François 2011; Di Carlo 2018; Evans 2019; Epps 2020). Even the differences between closely-related dialects may be carefully cultivated to construct distinctive group affiliations (Morphy 1977; Stanford 2009; Vaughan 2018). Evolutionary theorists conjecture that this sort of lectal differentiation may play a role in controlling access to local networks of mutual assistance (Nettle & Dunbar 1997; Dunbar 2003), and artificial language experiments have lent some support to this, by successfully simulating divergence of lects under social pressures (e.g. Roberts 2010; Sneller & Roberts 2018; Lai et al. 2020). Furthermore, a large-scale study of vocabulary differentiation supports the concept of 'punctuational bursts', that is, language varieties changing their vocabulary more rapidly as part of the process of social-group fission (Atkinson et al. 2008). All these studies point to the potential for language varieties to diverge because of the social interaction between groups. We can therefore define 'linguistic divergence' as differentiation that is driven by language contact, rather than the absence of contact.
Linguistic divergence raises a series of important questions for the study of language change, such as: What types of linguistic structure are affected? What parameters of social interaction promote divergence? What types of group relations? How might divergence be incorporated into our models of language phylogeny?
In this study we focus on one small part of the puzzle. We study pairs or clusters of dialects, that is, closely related language varieties that are associated with distinctive geographic territories, drawing our data from 42 reference grammars of geographically and genetically dispersed languages. We extract data on GRAMMATICAL VARIABLES in these languages, which are grammatical meanings or functions that can be expressed in more than one way (for example, in English future tense can be expressed by both will and gonna). For each of these grammatical variables, the crucial question is whether it distinguishes dialects, or cuts across dialects. We develop a simple structural typology of grammatical variation, and ask whether dialects are more likely to be differentiated by some types rather than others. We also estimate degrees of social contact between dialect groups, as this appears to play an important role in the patterning of structural types. While our method offers a substantially new type of data on language dynamics, it also has natural limitations, as it is essentially a convenience sample of whichever grammatical variables the authors of reference grammars happen to mention. Nonetheless, the method provides valuable data for answering the specific question of whether certain types of variation are more likely to be dialectal than others.
Our typology identifies three structural types of grammatical variation, exemplified below: (1) FORM variables, in which distinct grammatical markers appear in the same linear position; (2) ORDER variables, involving the same linguistic elements in different linear orders; and (3) OMISSION variables, distinguished by the presence/absence of a grammatical marker.1 Form and order variables correspond to the paradigmatic and syntagmatic dimensions of language, respectively (Bloomfield 1933;Saussure 1959), while omission variables are related to notions such as redundancy and underspecification.
The remainder of this article runs as follows. Section §2 establishes our framework for conceptualising dialect groups, linguistic variables and social signalling. Section §3 introduces the database used for this study. Section §4 describes the three structural types of grammatical variation, with informal observations on how these relate to sociolinguistics and dialect relations. Section §5 provides a formal quantitative analysis, lending support to the hypothesis that the structural types function differently with respect to dialect contact and linguistic divergence. Section §6 summarises our findings and discusses implications for further research.

Dialect differences, social contact and social signalling
In this study we consider language varieties to be in a 'dialectal' relationship whenever they share the vast majority of their grammar, phonology and lexicon, but nonetheless are different enough to be recognised as distinct varieties. We call such relations 'dialectal' when they are based on geography (e.g. different villages, provinces or regions), as opposed to other types of language variety associated with socio-economic groups, subcultures, formal registers etc. Note that this approach to dialects is independent of ethno-linguistic naming practices. For example, there is a dialectal relationship between varieties spoken on the Aguaytía and San Alejandro rivers in Peru, both of which are known by the label Kakataibo (Zariquiey 2011; Zariquiey 2018: 3). But there is also a dialectal relationship between Emmi and Mendhe, two very similar varieties associated with neighbouring clan estates in northern Australia, which do not share an ethno-linguistic label (Ford 1998).
The concept of 'dialect' has often been used for unwritten regional varieties in relation to written, supra-regional 'standard' varieties (e.g. Haugen 1988; Chambers & Trudgill 1998; Abraham 2006). But this approach is only relevant in those circumstances where there is a supra-local standard, such as a national language. The current study is broadly concerned with human languages, most of which have no supra-local standard form.2 Therefore most of the dialectal relations in this study are between pairs of closely-related, regional varieties, usually without any political hegemony of one over the other.3
Dialectal differences develop through the identification of social groups with distinct geographic territories, provoking fission of an erstwhile shared language variety into two distinct varieties. Because people tend to interact with those who live near them, geography plays a major role in the development of sociolinguistic groupings (Paul 1888: 23ff.; Trudgill 1986: 39). For example, from the tenth century until the sixteenth century there was a fairly discrete and integrated community living on the island of Jersey, speaking a shared variety of Norman French. In the sixteenth century, forty families from Jersey moved to the smaller island of Sark, leading to the subsequent divergence of a Sark dialect from the Jersey dialect (Liddicoat 1994: 6).
Language differentiation does not always follow a smooth trajectory of separation. Language varieties may remain in close contact for hundreds or thousands of years, remaining similar because they maintain continual contact and share many linguistic innovations. There are also instances where dialects are in a process of convergence, rather than divergence (Trudgill 1986). More generally, language histories are not always made up of neat iterative splits (e.g. Garrett 2006; François 2014), and the formal similarity involved in dialectal relationships can arise from any type of relatively recent social contact. But irrespective of these diverse histories, dialect differentiation is an important first step in the larger process of diversification that eventually leads to radically distinct languages. In this study our main findings are synchronic observations of how social contact patterns with certain types of grammatical differentiation. But we will also use these findings to consider dialectal relations as one stage in a larger diachronic process, making predictions about likely trajectories of language change (§6).

Variables and dialects
We conceptualise linguistic variation in terms of VARIABLES, where a variable involves two or more expressions that have the same semantic content (Weinreich et al. 1968: 159). Given a pair of variant expressions, x1 ~ x2, an individual language user, at a given point of time, has a particular probability of selecting one variant or the other. At the extremes are individuals who categorically select just one variant or the other.
2 Fijian is the only language included in this study for which dialect variables are reported to be in a standard vs vernacular relationship. This involves Standard Fijian, based on the dialect of Bau but used as a lingua franca across Fiji, and Boumaa Fijian, spoken in a region on the island of Taveuni (Dixon 1988).
3 Recent lexicostatistical research provides a very different approach to dialectal relations, distinguishing dialect pairs from different-language pairs based on degrees of phonological distinction between their vocabularies, as measured by orthographic Levenshtein distances (Wichmann 2019).
We assume that individuals form geographically associated social groups, which are groups of individuals who associate with a geographic region and tend to interact with each other more than they interact with those outside the group (Croft 2000: 20). However, there is also some degree of interaction between individuals in different groups. Figure 1(a) shows two social groups, one above and one below. Within each group there are dense social connections (solid lines), but there are also some social connections (dotted lines) running between groups. Each individual is shaded in greyscale, representing their probability of using variants x1 ~ x2 (cf. Blythe & Croft 2021). For variable (a) we see that most individuals use both variants, and the two groups have similar distributions. We can say that this is an 'intra-group' or 'non-dialectal' variable. For variable (b) we see the same pair of social groups, but here variant selection is strongly biased towards x1 in the top group, and x2 in the bottom group. We can say that this is a 'dialectal' variable, noting that it does not require a categorical split between groups, but only that there is a notable difference between groups with respect to the variable.
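The distinction between intra-group and dialectal variables can be illustrated with a minimal sketch. It assumes that each individual is represented by a probability of selecting x1, and that a 'notable difference' is operationalised as a gap between group means; the function name, the threshold of 0.3 and the sample values are hypothetical choices made for illustration, not part of the coding used in this study.

```python
from statistics import mean

def classify_variable(group_a, group_b, threshold=0.3):
    """Classify a variable from per-individual probabilities of selecting x1.

    `threshold` is a hypothetical cut-off for a 'notable difference'
    between groups; no categorical split is required.
    """
    gap = abs(mean(group_a) - mean(group_b))
    return "dialectal" if gap >= threshold else "intra-group"

# Variable (a): both groups mix the variants in similar proportions.
similar = classify_variable([0.4, 0.5, 0.6, 0.5], [0.5, 0.6, 0.4, 0.5])

# Variable (b): the top group favours x1, the bottom group favours x2.
biased = classify_variable([0.9, 0.8, 1.0, 0.7], [0.1, 0.2, 0.0, 0.3])

print(similar, biased)
```

Under these assumed values the first variable has no gap between group means, while the second has a large one, even though not every individual in the second case is categorical.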

Social signalling and language structure
In Figure 1(b) above, we use dashed lines to represent interaction between individuals in different social groups. Where this interaction is substantial, it is plausible that dialect differences provide cues about group affiliation. Because there is substantial interaction between the groups, individuals would have some exposure to both dialectal variants, and could form conscious or subconscious associations between variants and group identity. Recent studies have suggested that group interaction of this type can result in linguistic divergence, that is, the differentiation of language varieties facilitated by social contact between groups who speak different varieties (e.g. Di Carlo 2018; Evans 2019; Epps 2020). Thus linguistic differentiation may be driven by social proximity, rather than social distance (Gal 2016: 127); and dialects, which may eventually become different enough to be counted as separate languages, are the product of 'active differentiation among local communities' (François 2012: 92). This is the linguistic instantiation of a more general anthropological process of contact-driven 'schismogenesis' (Bateson 1935). Proximity and social contact provide the context for social signalling, or 'social-indexicality', whereby linguistic cues are associated with distinctive social groups (Agha 2003; Silverstein 2003; Eckert 2008; Tabouret-Keller 2017; Eckert 2019; for historical background see Jahr 2017).4
However, dialectal differences may also develop without any social signalling. This is especially likely once dialect groups have such reduced contact that individuals would not have sufficient exposure to both variants, and the group identities would be less relevant to managing social relations. When groups are socially separated, over time they may develop dialectal differences independent from social signalling; furthermore, such innovations have less chance of spreading between the groups, due to lack of contact.
Previous work in sociolinguistics has considered whether certain types of language structure are more or less amenable to social signalling. There may be cognitive or communicative constraints that make it easier to associate certain types of formal distinction with social identity. Various terms have been applied to this, such as 'marker vs indicator' (Labov 1972), 'metapragmatic awareness' (Silverstein 1981), 'pragmatic salience' (Errington 1985) and 'sociolinguistic salience' (Kerswill & Williams 2002; Rácz 2013; Levon & Buchstaller 2015). Recent research has highlighted an important distinction between social-indexicality and conscious awareness, which need not go hand-in-hand, and can be difficult to disentangle (e.g. Campbell-Kibler 2016; Drager & Kirtley 2016). Research on sociolinguistic salience has largely focused on phonetic variation, though there is also evidence of speakers being strongly aware of at least some grammatical variables (Squires 2016).
One enduring idea has been that 'surface' linguistic forms are more capable of social signalling than 'deep' linguistic structure (e.g. Labov 1993; Hinskens 1998; see also Eckert 2019). Thus phonology and lexicon are more social-indexical, while morphology is less so, and syntax is the least social-indexical of all (Romaine 1981; Cheshire 1987; Dediu et al. 2013: 311). At the same time, it can be difficult to determine when variables should be counted as surface forms, and when apparent 'surface forms' actually reflect variation in underlying structure (Meyerhoff & Walker 2012). In parallel to the socio-variationist literature, studies in comparative linguistics suggest that communities in contact tend to differentiate themselves using the forms of lexemes or grammatical markers, while subconsciously converging in their morphosyntactic structures. These case studies have generally shown that languages or dialects in close contact tend to diverge in their lexical forms, while converging in their grammatical structure, again pointing at sociolinguistic awareness as the key factor favouring differentiation of lexicon but not grammar (Gumperz & Wilson 1971; Grace 1981; Ross 1996; Ross 2001; François 2011).
4 There may also be purely cognitive, as opposed to socio-cultural, factors at work in linguistic divergence. In a study of bilingual production, where the two languages (Dutch and English) share a large number of similar forms, speakers exhibited a bias against those forms of ambiguous provenance, which suggests that bilingual processing could drive lexicons apart (Ellison & Miceli 2017).
In this study we follow the lead of these earlier works in testing whether some dimensions of language have greater potential for social signalling than others. Rather than applying a surface vs depth model (as in the socio-variationist work cited above), which depends on specific analyses of structural layers, we instead focus on paradigmatic versus syntagmatic dimensions of surface structure (see §4 below), since these can be applied in a relatively theory-neutral way based on linguistic documentation. And rather than making a lexicon vs grammar split (as in the comparative linguistic work cited above), we focus purely on the expression of grammatical meanings, while distinguishing the structural properties of form, order and omission. Our findings on form vs order variables can be roughly equated to previous findings on lexicon vs structure under language contact, though we set aside the question of grammatico-semantic isomorphism, which has also played a major role in such studies (see §4.1 below). The most important contribution of our survey is to go beyond case studies of individual contact situations, and make generalisations about the process of linguistic differentiation based on a diverse sample. For this, however, we require new methods in comparative linguistics.

Collating a cross-linguistic sample of dialect differences
Typological databases usually distil the information in reference grammars, in order to assign one type of grammatical expression to each language. Linguistic variation is a kind of noise that such databases must filter out. In this study we take the opposite approach, specifically targeting whatever reference grammars report to be variable (see also Di Garbo et al. 2021). We extracted data from reference grammars of 42 languages (see map in Figure 2), representing 28 different language families, and all inhabited continents. Although this is only a small sample of the world's linguistic diversity, it is uniquely systematic and wide-ranging within the nascent field of comparative sociolinguistics. All languages in the current sample are spoken languages, though we aim to include signed languages in future work.5
Variables were added to the database by searching reference grammars for mentions of variation, using a combination of keyword searches and reading (see Supplementary Information for further details). A grammatical variable was coded wherever the text reports two or more ways of expressing the same grammatical meaning or function. In cases where more than two variants are documented, we annotate just two variants (see Supplementary Information D). Grammatical meanings are relatively abstract categories, such as future, negation, continuous aspect, first-person, directionality; or functions such as focus, subordination or transitivity (Lehmann 1995; Hopper & Traugott 2003; Boye & Harder 2012). This method yielded 1155 grammatical variables, for each of which we coded the structural type and dialectal status into a spreadsheet, later transformed into an R datatable (see examples in Table 2 below). Most grammars also mention a range of phonological and lexical variables, which we noted for further research but have not included in this study.
Our data was selected to represent a diverse range of languages and social situations, but the sample is not formally balanced either by language family or region (see Supplementary Information A). Most language families are represented by a single language, but a few (mostly larger families) have multiple languages. Furthermore, the number of datapoints contributed by each language family varies widely, since some grammars yielded more variables than others. Figure 3 shows the number of grammatical variables contributed by each language family (using maximal language families as annotated in Glottolog (Hammarström et al. 2022)), and how many of these are dialectal variables. Most families contributed between 10 and 50 data points, while others contributed 100 or more. Austronesian and Sino-Tibetan contributed more data because our sample includes several grammars, representing distinct branches of these families. In the case of Athapaskan, and to a lesser extent Basque, we sampled just one grammar from each family (Athapaskan: Rice 1989; Basque: Hualde & Urbina 2003), but these two sources were unusually rich in grammatical variables. We accept this imbalance because it allows us to capture all the information provided by the reference grammars, while the statistical problem can be adequately managed by using a mixed-effects regression model with language families as group effects (§5). For 26 of the 28 language families, at least one of the grammatical variables was reported to be dialectal.
We annotate variables wherever the source presents expressions as having the same meaning, but we do not attempt to further investigate whether these expressions have exactly the same connotations or truth-conditional semantics. It is also worth noting that some variables may involve pragmatic conditioning (especially for order variables, see §4.2), description of which is largely beyond the scope of the source grammars. Sociolinguists working on grammatical variation have long recognised this as a difficult problem (cf. Lavandera 1978; Romaine 1981; Cheshire 1987 inter alia).
We take an onomasiological approach, i.e. using meanings as our reference points, rather than forms. Consequently a grammatical variable is annotated wherever two expressions can convey the same grammatical meaning; but one or both of these expressions may also be capable of expressing other meanings. For example, the meaning FUTURE may be expressed alternately by a specific future tense marker, or by a marker that spans both future and present (i.e. non-past) meanings. This is an important point to which we return below (§4.1). We also note that our data coding is not directly comparable to some other studies of grammatical change (e.g. Greenhill et al. 2017; Matsumae et al. 2021), which use features from the World Atlas of Language Structures (Dryer & Haspelmath 2013). Only a subset of our grammatical variables correspond to features coded in the atlas.
Reference grammars usually attest variation in a succinct, impressionistic form, glossing over the nuances of variant distributions. In our Figure 1(b) schema, we noted that variants may each have some usage in each group, while nonetheless making a stochastic group distinction. This is mirrored in reference grammars, which sometimes describe categorical dialectal variables, and sometimes note that one variant is 'more common' in one dialect than another. Our coding represents both of these situations as dialectal variables, without distinguishing categorical from stochastic types.
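The coding scheme described in this section can be pictured as one record per grammatical variable. The following is a minimal sketch only: the class, its field names and the two rows are hypothetical illustrations built from examples mentioned in the text, not entries copied from the actual spreadsheet, and the structural type given for the Kugu Nganhcara row is an inference from the definition of order variables.

```python
from dataclasses import dataclass
from typing import Optional

STRUCTURAL_TYPES = {"form", "order", "omission"}

@dataclass(frozen=True)
class GrammaticalVariable:
    """One coded variable: a meaning with two annotated variants (hypothetical schema)."""
    language: str
    meaning: str
    variant_1: str
    variant_2: str
    structural_type: str
    dialects: Optional[str]  # None plays the role of 'Dialects = NA' (intra-group)

    def __post_init__(self):
        if self.structural_type not in STRUCTURAL_TYPES:
            raise ValueError(f"unknown structural type: {self.structural_type}")

    @property
    def is_dialectal(self) -> bool:
        # Categorical and 'more common in dialect X' reports are both coded as dialectal.
        return self.dialects is not None

# Illustrative rows only (not taken from the database).
word_order = GrammaticalVariable(
    "Kugu Nganhcara", "clause constituent order", "SOV", "SVO", "order", None)
plural_you = GrammaticalVariable(
    "English", "second-person plural pronoun", "youse", "y'all", "form",
    "Australian vs southern USA")

print(word_order.is_dialectal, plural_you.is_dialectal)
```

Keeping dialectal status as a single yes/no property mirrors the decision not to distinguish categorical from stochastic dialectal variables.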

Limitations of the method
An important limitation of our method is that some reference grammars pay closer attention to dialectology than others, meaning that our primary data is partial and approximate. Grammar writers may present something as non-dialectal or 'free variation', when closer inspection would show it to be dialectal. Alternatively, grammar writers may mistakenly report something as dialectal, when it is actually intra-group variation. We must therefore assume that there is a certain degree of noise in our data sources. Furthermore, individual grammar writers have different propensities to report variation at all (as suggested by the wide range of counts in Figure 3 above), and each has their own particular areas of interest. We therefore cannot draw any conclusions about whether certain grammatical meanings are more likely to have variable expression than others.
Another limitation of the data is the difficulty of coding up the grammars in a fully reproducible way. Coding was performed by all three authors of this article, with most grammars being coded by multiple authors to improve consistency (see Supplementary Information F). We found that our coding of structural types and dialectal status was quite consistent, but it was difficult to achieve consistency on exactly how many variables are identified in a given section of a reference grammar. We therefore do not treat the absolute number of variables as an interpretable finding, instead focusing on patterns in structural types and dialectal status.
Although our database cannot claim to be either comprehensive or fully reproducible, we have no reason to expect that these limitations should invalidate the findings presented in this study. We ask only whether certain types of grammatical variation are more likely to be dialectal than others, and the limitations of the reference-grammar method do not appear to impact on this question. The methodological limitations would invalidate our findings if there were systematic inaccuracies in whether structural types are identified as dialectal or intra-group, but we do not have any reason to expect such systematic errors.
Compared to the reference grammars used in this study, dedicated sociolinguistic studies could provide more detailed information about specific variables and their (stochastic) group associations. But variationist sociolinguistics does not offer a large enough sample of variables from diverse languages, as the field is still heavily focused on a small number of politically dominant, cosmopolitan languages (Stanford 2016; Mansfield & Stanford 2017). We preferred reference grammars because they provide a more diverse linguistic sample. But another important advantage is that grammars include information on both dialectal and non-dialectal variables, which is crucial to identifying which types of structure are more or less likely to differentiate dialects.

Coding social contact
As well as coding multiple linguistic variables, for each reference grammar we also coded degrees of social contact between dialect groups. Reference grammars provide information on social relations, either directly by reporting on social interaction, or indirectly in comments on mutual intelligibility of dialects, geographic proximity etc. We used this information to create a rubric for assigning dialectal relations to three degrees of social distance: Close, Medium, Distant (see Supplementary Information E). This is an admittedly coarse and informal measure, which does not capture the nuances of social relations among groups. Nor does it capture diachronic dynamics, with social relations changing from one historical period to another. Nonetheless, it was important to parameterise social contact in our data since the dialect relations reported in the grammars clearly encompassed very different degrees of contact, as illustrated by the following examples.7
The Kugu Nganhcara grammar (Smith & Johnson 2000) reports the very closest type of dialect relations. The language as a whole is reported to have about 300 speakers, but within this population speakers identify with six different patriclans, each of which is associated with a distinct geographic territory, and has its own dialect or 'clan lect' (Smith & Johnson 2000: 358). However, rather than living separate lives on their separate territories, people from each clan group are highly mobile, and often live intermingled in the same residential groups, for example when jointly exploiting natural resources. The mingling of residential groups is also ensured by clan exogamy (marriage between people from different clans). Thus there is extensive interaction between speakers of different clan lects, and we code Kugu Nganhcara dialect relations as Close.
An intermediate level of contact is found in the grammar of Channel Island French (Liddicoat 1994), which focuses on dialects from the islands of Jersey and Sark. As mentioned above, the Sark community split off from Jersey in the sixteenth century. Both dialect groups have had predominantly agricultural livelihoods since then, with social interaction organised around local villages and their markets. This implies a lower level of contact between the two dialect groups. On the other hand, the distance between the islands is small and easily navigable (about 30 km), and the agricultural communities have been involved in significant cross-channel trade. We assigned this dialect relation a Medium contact value.
A Distant dialect relation is found in Somali (Saeed 1999), a language spoken by several million people across a large region. Northern dialects are spoken by pastoralists living on relatively arid country, and southern dialects are spoken by agriculturalists living in a river delta some hundreds of kilometres to the south. Mutual intelligibility is asymmetrical, with southerners able to use the northern dialect as a lingua franca, but northerners being less familiar with the southern dialect.
Note that for most grammars (e.g. Kugu Nganhcara), we coded the same degree of social distance for all dialect relations. But for other grammars (e.g. !Xun), some dialect relations were judged to be more distant than others. This is also the case in Hup (Epps 2008), where the grammar reports a generally high level of mutual intelligibility, and notes that the main social groups, patrilineal clans, live alongside each other in shared villages. On this basis the central and eastern dialect areas of Hup are coded as a Close dialect relationship. However, the western dialect speakers have less interaction with the central and eastern groups, and the central/eastern speakers say the western dialect is 'hard to understand' (Epps 2008: 13). On this basis, we assigned a Distant relationship between the western dialect and the other two.
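The contact codings described in the examples above can be summarised in a small lookup. This is a sketch under stated assumptions: the enum, the dictionaries and the function name are hypothetical, and the only values used are the four codings reported in the text (a language-wide default, with pair-specific overrides for Hup).

```python
from enum import IntEnum

class Contact(IntEnum):
    """Three degrees of social distance, ordered from closest to most distant."""
    CLOSE = 1
    MEDIUM = 2
    DISTANT = 3

# Language-wide codings, used when all dialect relations share one value.
DEFAULT_CONTACT = {
    "Kugu Nganhcara": Contact.CLOSE,
    "Channel Island French": Contact.MEDIUM,
    "Somali": Contact.DISTANT,
}

# Pair-specific codings, for grammars where some relations are more distant.
PAIR_CONTACT = {
    ("Hup", frozenset({"central", "eastern"})): Contact.CLOSE,
    ("Hup", frozenset({"central", "western"})): Contact.DISTANT,
    ("Hup", frozenset({"eastern", "western"})): Contact.DISTANT,
}

def contact_between(language, dialect_a, dialect_b):
    """Return the coded contact level for a dialect pair (order-insensitive)."""
    key = (language, frozenset({dialect_a, dialect_b}))
    if key in PAIR_CONTACT:
        return PAIR_CONTACT[key]
    return DEFAULT_CONTACT[language]

print(contact_between("Hup", "western", "central").name)
print(contact_between("Channel Island French", "Jersey", "Sark").name)
```

Using an ordered enum keeps the three levels comparable (Close < Medium < Distant), while leaving the measure as coarse as the rubric itself.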
Table 3 shows the coding of some example variables. Most of the examples shown here are dialectal variables, but Kugu Nganhcara SOV ~ SVO, and Nishnaabemwin nominal conjunction and plural marking, are examples of intra-group variables (Dialects = NA). There are two dialectal variables from Hup, but one of these is between the Close villages, while the other is between the Distant western dialect and the other areas.8 The coding of the Type column will be explained in the following sections.
The following subsections describe each type in turn, and make general observations about their dialectal or non-dialectal status.

Form variables
A FORM variable is where variant expressions of a grammatical meaning are distinguished by the form of a grammatical marker, but in other respects the construction is the same. A well-studied example in English involves negative predicates, which vary in the form of the negative auxiliary/copula, e.g. she isn't home ~ she ain't home. This has a social signalling function, marking social class, stance and style (Levinson 1988; Cheshire et al. 2005). English has other well-known form variables that also have strong social connotations. These include 'negative concord', involving paradigmatic contrast between negative determiners any ~ no (Wolfram 1969), and the verbal progressive suffix -ing ~ -in (Campbell-Kibler 2010). Latin American Spanish offers another well-studied example, in the expression of second-person singular subject, where the voseo phenomenon involves distinctive 2SG markers both in free pronouns and in verbal suffixes. In some areas voseo is a salient marker of regional dialects, for example in Colombia (Collazos 2015
A particularly flamboyant example of dialectal form variation is in Bininj Gun-wok, where certain verbal prefixes index patrilineal clan heritage, which affords rights to territorial estates (Garde 2008). What makes this example so striking is that the prefixes do not carry any semantic content: they are semantically vacuous 'fillers', used purely for social signalling:
(5) Bininj Gun-wok; Djordi vs Kurulk vs Mok clan-lects (Garde 2008: 150-154)
a. yi-njarra-kinje-men
b. yi-bayid-kinje-men
Form variables may emerge either from sound changes, or grammaticalisation pathways. A phonologically induced example can be seen in the American English 1SG.FUT auxiliary, where African-American dialects have innovated I'm'a, marking a point of differentiation from the I'm gonna of other dialects. Here phonological erosion has been applied differently in different dialects. Although such variables have a phonological dimension, we still treat them as grammatical variables wherever the sound change appears to be specific to a grammatical marker, as opposed to being a regular sound change. This criterion therefore includes some variants that are phonologically similar, such as the Kharia clitic forms in (7) below. Form variation via grammaticalisation pathways can be seen in the second-person plural pronoun in English dialects, which may take the form youse (e.g. Australian) or y'all (e.g. southern USA), exhibiting different grammaticalisation paths in the development of the pluralising suffix.
Form variables are the most frequent type of grammatical variable in our data, accounting for 57% (N=654) of all variables annotated. There is at least one form variable reported in each of the 42 languages. Form variables are also the type with the highest proportion of dialectal instances, with 58% (N=380) of form variables being dialectal. Form variation of grammatical markers therefore appears to be a cross-linguistically frequent type of dialectal differentiation.
The types of grammatical markers involved in form variables include affixes, clitics and function words, and encompass a wide range of grammatical meanings and functions.¹⁰ Examples of pronominal form variation can be found in Fijian free pronouns (6), Kharia pronominal clitics on irrealis middle verbs (7) and Bininj Gun-wok verbal agreement prefixes (8).
(6) Fijian; Standard/Bau dialect vs Boumaa dialect (Dixon 1988
As mentioned above, we define grammatical variables as two ways of expressing a grammatical meaning/function, even if these two expressions may themselves have differences of functional range (e.g. a specific FUT marker vs a more general NON-PAST marker). When markers with different functional ranges distinguish dialects, this implies that the dialects are not isomorphic in their grammatico-semantic structure. Therefore, our findings on dialects being differentiated by the form of grammatical markers should not be interpreted as showing that dialects differ only in their 'surface' forms, since the grammatico-semantic structure is also different in many instances (cf. Grace 1981). Investigation of such non-isomorphisms may reveal further important patterns of grammatical divergence; however, this is beyond the scope of the current study.

Order variables
An ORDER variable is where two variant expressions are composed of the same combination of forms, but the linear ordering is different.¹¹ Order variables may involve positioning of a grammatical marker, or re-ordering of lexical elements, without a change of meaning. For example, Spanish object clitics may be positioned either after an infinitive verb or before the finite verb:
(14) Spanish; intra-group (Schwenter & Torres Cacoullos 2014)
a. no puede manejar=los
   NEG can.3SG.PRS manage=3PL.M.OBJ
b. no los=puede manejar
'She can't manage them.'
An example that involves reordering of lexical elements is the English verb-particle construction, where a transitive verb-particle lexical construction may occur as two adjacent elements preceding the object NP, or may embrace the object NP:
(15) English; intra-group (Haddican et al. 2020; Röthlisberger & Tagliamonte 2020)
a. pick up
Studies of variable order have revealed a range of conditioning factors such as semantics, phonology and information structure. Spanish object clitic placement (as in 14) is primarily influenced by object topicality and animacy, and the degree of grammaticalisation of the finite verb (Schwenter & Torres Cacoullos 2014: 524). Basic constituent order (SOV, OVS etc.) is variable in many languages, where it is generally influenced by information structure (Payne 1992). English particle placement (as in 15) is primarily influenced by the phonological weight of the object NP (Haddican et al. 2020; Röthlisberger & Tagliamonte 2020). In Tagalog, variable ordering of nouns with adjective modifiers has been shown to be strongly influenced by phonotactics at word boundaries (Shih & Zuraw 2017). In all these instances, variant selection is largely determined by factors relating to production planning and the referential structure of discourse (Tamminga et al. 2016), but social signalling appears to be largely absent.
There are relatively few instances in the sociolinguistics literature where order variation is reported to differentiate dialects, but there are some. For example, although English particle verb order is primarily driven by phonology, the centuries of separation between American and British Englishes have facilitated a divergence of frequencies. Both American and British Englishes are slowly increasing their frequency of VOP, to the detriment of VPO, but this change is slightly more advanced in Britain (Haddican et al. 2020; Röthlisberger & Tagliamonte 2020). This is a kind of slow-moving, stochastic dialectal drift, which appears to be facilitated by reduced social contact, rather than being driven by group interaction and social signalling. On the other hand, stochastic divergence may eventually become categorical, and this may then lead to a more sociolinguistically salient variable. For example, Dutch has variable order of a sentence-final participle and auxiliary verb, which in some areas is associated with regional dialects (De Sutter 2005). In English, some north-western British dialects developed a double-object dative construction that uses a different order from other dialects (16) (Gast 2007; Siewierska & Hollmann 2007; Gerwin 2013). Note that this is distinct from the more familiar 'dative alternation', which involves an additional preposition.¹² Because the north-western form (16a) is not used at all in other dialects, this difference may be more salient to language users, compared to a stochastic divergence.
(16) British English; e.g. Manchester vs other dialects (Gast 2007)
a. give it me
   V Th Rec
b. give me it
Drawing on language contact literature, we can conceptualise order variables as involving differences of PATTERN, as opposed to form variables that involve differences of MATTER (Matras & Sakel 2007; Gardani 2020). The contact literature investigates several types of pattern-borrowing, with the general finding that contact-driven convergence affects patterns more than matter, due to social constraints that tie particular matter to a particular language (Gumperz & Wilson 1971; Matras & Sakel 2007: 857). Conversely, we might expect that in situations of socially mediated linguistic divergence, matter will provide more social signalling than patterns. This would therefore predict that form variables will be more likely to differentiate dialects than order variables.
In our database there are fewer order variables (N=149) than form variables (N=654), though there was at least one order variable in each of the 42 grammars. A minority of these order variables (22%, N=33) are reported to differentiate dialects, though as we will see below, a clearer pattern is revealed once we break this down according to degrees of social contact.
Our order variables range across diverse phrase and word structures. There are several examples of variable orderings in basic constituent order (17), and also word order within the NP (18
There are also some order variables (N=36) that involve the positioning of affixes and clitics. Most of these involve variation in the position of an affix within a word (24, 25). But there are also some that involve an affix or clitic that attaches at variable positions in the phrase, for example in Tundra Nenets where in some contexts person agreement can be hosted by either verb or noun (26).
(24) Bantawa; intra-group (Doornenbal 2009: 274)
a. kʰim kʰar-a-kʰa-ci
   house go-PST-see-DU
b. kʰim kʰar-a-ci-kʰa
'You two go home please!'
(25) Urarina; intra-group (Olawsky 2006: 480, 524)
a. itçau-rʉ-rehete=lʉ
   live-PL-HAB=REM
b. itçau-rehete-kʉre=lʉ
'They used to live.'
(26) Tundra Nenets; intra-group (Nikolaeva 2014: 323, 329)¹⁶
a. yil′e-qm′a mərin′i
   live-PFV.AN city.PL.1SG
   'the cities where I lived'
b. to-qma-m′i yal′a-r′i-x°na
   come-PFV.AN-1SG day-LIM-LOC
   'on the same day when I came'
Only a minority of order variables are dialectal, but the dialectal instances are found across word, affix and clitic constituent levels. For example in Ma'di, past transitive clauses are SOV in one dialect area and SVO in another (27).¹⁷ In Turung, speakers in some villages sometimes use a different basic constituent order from those in others (28). This is an instance where most dialects have a fixed order (SOV) while some villages show variation SOV ~ SVO, which is presumed to have arisen from contact with a neighbouring SVO language (Morey 2010: 513). Like the English double-object dative example above (16), the
15 ABIL = ability; AGT = agent; NR = non-relational.
16 AN = 'action nominal'. Some linguists might take variable sites of attachment as evidence for clitic rather than affix status. But since there is no consensus on how to distinguish affixes from clitics (Spencer & Luis 2012: 220), in our coding we simply follow the authors of grammars in how they distinguish clitics vs affixes.
17 There is a slight caveat: the Ma'di SOV ~ SVO variable is not quite purely syntagmatic in its variation, as there is also a difference in tonal verb inflection between the two variants, where Lokai SOV uses a non-past low tone on the verb, but 'Burolo SVO does not. However, the ordering of constituents can be considered the primary dimension of variation and therefore we coded this as an order variable. Note also that both Lokai and 'Burolo have SVO order for uninflected verbs that encode present or future tense (p.541).
There are also a few instances of dialectal variation in affix order (N=6). Examples are illustrated here from Slave negated verbs with incorporated postpositions (32) and the Bininj Gun-wok immediacy affix (33). An example of dialectal order variation in clitics is found in Somali, where interrogatives may host pronominal and negative clitics in either order (34).
'not you?'

Omission variables
The third major type of grammatical variable in our data is the OMISSION variable, where the difference between two variants consists solely in the presence/absence of a grammatical marker. A well-studied omission variable is found in French verbal negation, where the particle ne is variably present or absent (35). The single-marked version has over time become dominant in speech, and spread across geographic dialects, while the double-marked version remains in writing (Ashby 1981; Armstrong 2002; Martineau & Mougeon 2003).
(35) French negation; written vs spoken (Ashby 1981)
je (ne) sais pas
1SG NEG know NEG
'I don't know'
Other well-studied examples include the presence/absence of an overt relativiser in some English relative clause types (Jaeger 2010; Wasow et al. 2011), and optional case markers in Japanese (Kurumada & Jaeger 2015) and in various Australian languages (McGregor 2006; Gaby 2008; Meakins 2015). In all these instances, the omissible grammatical marker is to some extent semantically redundant, and its presence/absence is largely determined by informational context.
The concept of coding (a)symmetry has often been applied to contrastive grammatical meanings, such as asymmetrical markedness relations in SG vs PL, but it can also be applied to our structural types of variation in expressing the same meanings.²¹ While form variables are symmetrical differences between two alternant expressions, an omission variable is an asymmetrical difference. Just as coding asymmetries are thought to be governed by principles of efficient communication (Haspelmath 2021), we might expect that omission variables should be governed primarily by informational efficiency, instead of social signalling.
Omission variables are less frequent than form variables, but more frequent than order variables. They account for 26% (N=261) of all grammatical variables we identified, and at least one omission variable was identified for 40 of the 42 languages. Of these omission variables, 26% (N=67) are dialectal, which is about the same rate as for order variables.
As in the well-studied examples above, omission variables in our data often appear to be driven by redundancy. For example in the Tundra Nenets omission variable shown earlier in this paper (3, repeated for convenience as 36), scalar comparison is expressed by the juxtaposition of two NPs, with an ablative suffix to mark the standard of comparison. Optionally, a comparative suffix may appear on the adjective denoting the scalar property, but we might assume that the comparative meaning of the construction is already clear without this marker.
(36) Tundra Nenets; intra-group (Nikolaeva 2014: 174)
tʹuku° pəni° taki° pəne-xəd° səwa(-rka)
this coat that coat-ABL good(-COMP)
'This coat is better than that one.'
Tundra Nenets also provides one of the few examples of an omission variable with a dialectal association. Negative clauses always begin with a NEG particle, but speakers from the eastern region additionally use a -q 'connegative' suffix on the verb, which is often (but not always) omitted by speakers from the western region.

Testing the relationship between structural type and dialectal status
In the previous section we noted the percentage of each variable type that is dialectal; however, we can better understand the behaviour of structural types with respect to dialect differentiation by breaking our data down by degree of social contact between groups. Figures 4(a,b) illustrate grammatical variables, with dialectal variables coloured pink and intra-group variables dark blue, grouped by degrees of social distance. Figure 4a shows raw count data, and 4b shows percentages of dialectal vs intra-group. Figure 4b shows that around half of form variables are dialectal, and this tendency is quite consistent across degrees of distance. For order and omission variables, however, only a minority are dialectal in settings of Close or Medium contact, while around half are dialectal in settings of Distant contact.
To test for a relationship between structural types, social distance and dialectal differentiation, we fitted a mixed-effects logistic regression model as follows.²² The outcome to be predicted is whether a grammatical variable is dialectal or not, and the fixed effects are structural type and social distance. We modelled the fixed effects using treatment coding (a.k.a. 'dummy coding'), with form variables in situations of close contact as the baseline or 'reference levels'. The model then estimates the effect of switching to either an order or omission variable, the effect of increasing social distance, and finally an interaction effect of both switching type and increasing distance. Order and omission are coded as treatment contrasts, each being compared against form variables. Distance is coded as a polynomial contrast, testing for a linear or quadratic change in dialectal status as distance increases: Close < Medium < Distant (Schad et al. 2020).
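The coding scheme described above can be made concrete with a minimal sketch, which builds one row of a fixed-effects design matrix: treatment contrasts for structural type (form as the reference) and orthonormal polynomial contrasts (linear and quadratic) for the three ordered distance levels. The function name, column names and exact contrast values are illustrative assumptions rather than details taken from the original analysis; the contrast values are the standard ones for three equally spaced levels.

```python
import math

TYPES = ("form", "order", "omission")        # "form" is the reference level
DISTANCES = ("Close", "Medium", "Distant")   # ordered factor

# Orthonormal polynomial contrasts for three equally spaced levels
# (assumed here; not confirmed by the source).
LINEAR = (-1 / math.sqrt(2), 0.0, 1 / math.sqrt(2))
QUADRATIC = (1 / math.sqrt(6), -2 / math.sqrt(6), 1 / math.sqrt(6))

# Hypothetical column labels for the fixed effects and their interactions.
COLUMNS = (
    "intercept", "order", "omission", "dist_L", "dist_Q",
    "order:dist_L", "order:dist_Q", "omission:dist_L", "omission:dist_Q",
)


def design_row(var_type, distance):
    """Return the fixed-effects predictors for one grammatical variable."""
    if var_type not in TYPES:
        raise ValueError(f"unknown structural type: {var_type}")
    level = DISTANCES.index(distance)  # raises ValueError if unknown
    is_order = 1.0 if var_type == "order" else 0.0
    is_omission = 1.0 if var_type == "omission" else 0.0
    lin, quad = LINEAR[level], QUADRATIC[level]
    return (
        1.0, is_order, is_omission, lin, quad,
        is_order * lin, is_order * quad,
        is_omission * lin, is_omission * quad,
    )


if __name__ == "__main__":
    for name, value in zip(COLUMNS, design_row("order", "Distant")):
        print(f"{name:>16}: {value:+.3f}")
```

Under this sketch a form variable has zeros in every type and interaction column, so the type coefficients are read as departures from form, and the interaction columns let the distance trend differ for order and omission variables.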
Based on our impressionistic analysis of the data, we expect that omission or order types, in situations of close social contact, should have a lower probability of being dialectal. On the other hand, increasing social distance, while focusing on form variables, does not appear to affect the probability of a variable being dialectal. Finally, we expect an interaction between both omission and order variables and social contact: when we increase social distance, omission and order variables should be more likely to be dialectal, compared to their low probability of being dialectal under close contact. We also include random effects in the model to control for undue influence from particular language families (using maximal language families as annotated in Glottolog (Hammarström et al. 2022)).²³ As noted above, our data contains different quantities of data from different language families, but we control for this imbalance by including a random intercept for each family, and a random slope parameter for structural type in each family.²⁴
The model was fitted in R using the lme4 package (Bates et al. 2015), with estimates of the predictors shown in Table 4. The intercept represents the log odds of a form variable, in a close-contact situation, being dialectal. This is not significantly different from zero, i.e. even chances of being dialectal or not. The fixed effects conform to our impressionistic analysis: comparing omission or order types to the form baseline (while keeping close contact as a reference level) produces highly significant, negative effects on the probability of a grammatical expression being dialectal. Meanwhile the effect of social distance, when considered with the form type as a reference level, is not significant.²⁵ But when considering the interaction of structural type with social distance, we find that greater social distance significantly increases the probability of an order variable being dialectal. For omission variables, the social distance effect is not significant, though it trends towards an increased probability of being dialectal. It should be noted that under the treatment coding scheme used here, the negative effect of omission and order variable types is not a general effect, but rather one that holds under the condition of close social distance. The interaction between social distance and variable type suggests that at greater social distances, order variables (and perhaps omission variables) become more like form variables in having a higher probability of distinguishing dialects.
To further evaluate the significance of the interaction between social distance and structural type, we used ANOVA to compare the model in Table 4 with a similar model that does not include the interaction term. This shows that the interaction term reduces the deviance of the model (from 1128 to 1115) and improves the Akaike Information Criterion (from 1150 to 1145). The difference between models is statistically significant (p < 0.01).
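The arithmetic behind this model comparison can be sketched as follows, using only the rounded deviance and AIC values quoted above. The sketch assumes that AIC = deviance + 2k (with k the number of estimated parameters) and that the comparison is a likelihood-ratio test against a chi-square distribution; the function names are illustrative and nothing here reproduces the original R output.

```python
import math


def n_params(aic, deviance):
    """Parameter count implied by AIC = deviance + 2k."""
    return (aic - deviance) / 2


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (even df only).

    For df = 2m the tail has the closed form
    exp(-x/2) * sum_{i=0}^{m-1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Rounded values as reported in the text.
k_reduced = n_params(aic=1150, deviance=1128)  # model without interaction
k_full = n_params(aic=1145, deviance=1115)     # model with interaction
df = int(k_full - k_reduced)                   # extra interaction parameters
lr_stat = 1128 - 1115                          # drop in deviance
p_value = chi2_sf_even_df(lr_stat, df)

if __name__ == "__main__":
    print(f"parameters: {k_reduced:.0f} vs {k_full:.0f}, df = {df}")
    print(f"likelihood-ratio statistic = {lr_stat}, p = {p_value:.4f}")
```

The implied difference of four parameters matches two non-reference types crossed with two polynomial distance terms. Because the quoted deviances are rounded to whole numbers, the p-value from this sketch is only approximate and can fall on either side of 0.01 depending on the unrounded values.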
The family-level intercepts for language families range from -2.02 (Baining) to 2.09 (Athapaskan), with a standard deviation of 1.16. These figures reflect how likely grammatical variables are to be dialectal in different families (at the reference level: form variables in close contact): for example, most variables are reported to be dialectal in Athapaskan, and only a few in Baining. As noted above, we expect there to be differences between grammar writers in how much attention they pay to dialectology, and we suspect that these random intercepts most likely reflect grammar-writing methodologies, rather than actual differences between language families. The family-level slopes for the order type (compared to form) range from -2.20 (Austronesian) to -1.37 (Athapaskan), with a standard deviation of 0.43. The family-level slopes for the omission type range from -2.64 (Baining) to -1.11 (Athapaskan), with a standard deviation of 0.48. Notice that these family-level slopes are all negative, suggesting that order and omission variables are less likely to be dialectal irrespective of language family. The difference between structural types thus appears to be a robust cross-linguistic pattern, rather than being unduly influenced by exceptional families in our data.
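Since these estimates are on the log-odds scale, it can help to convert them to probabilities. The following is a minimal sketch of that conversion using the inverse logit. It assumes the reported family-level intercepts can be read directly as log odds at the reference level; if they are instead deviations from the fixed intercept, that intercept would need to be added first. The function name is illustrative.

```python
import math


def inv_logit(log_odds):
    """Convert log odds to a probability."""
    return 1 / (1 + math.exp(-log_odds))


# A log odds of zero corresponds to even chances.
baseline = inv_logit(0.0)

# Family-level intercepts quoted in the text, read as log odds (assumption).
family_intercepts = {"Baining": -2.02, "Athapaskan": 2.09}
family_probabilities = {
    family: inv_logit(value) for family, value in family_intercepts.items()
}

if __name__ == "__main__":
    print(f"baseline: {baseline:.2f}")
    for family, prob in family_probabilities.items():
        print(f"{family}: {prob:.2f}")
```

Read this way, the two extremes correspond to probabilities of roughly 0.12 and 0.89 that a form variable in close contact is reported as dialectal.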
To further evaluate the model we compared its predictions against the actual data. Figure 5 shows the actual percentages of dialectality in the data (as in Figure 4b above), compared against the model's predicted probabilities of a grammatical variable being dialectal. We use shapes to represent different structural types, and line types to represent actual data versus predictions. As the figure shows, the model is a fairly good fit for the data, though it does not predict the sharp increase in dialectal order variables that is found in the data. This is likely because a substantial number of the dialectal order variables, with distant social contact, come from just a few families (especially Basque and Kxa), and the model attributes this to family-level random effects rather than a general effect. Nonetheless, even when controlling for language family, the model still predicts that omission and order variables are more likely to be dialectal as social contact decreases.²⁶

Summary of findings and implications
One simple finding of our study is that grammatical markers often differentiate dialects. This supports the notion that social signalling can be integrated into the grammatical system, at least if we consider 'surface forms' to be part of the grammar (cf. Cheshire 1987; Labov 1993). When reference grammars report two distinct markers as expressing the same grammatical meaning, in roughly half the instances this is reported to be a dialectal distinction. Although reference grammars cannot be read as comprehensive sources on dialectology, this finding nonetheless suggests that dialect differentiation is frequently intertwined with grammar. Whether this is simply a matter of 'surface forms', or whether it also results in grammatico-semantic differentiation, is an important question for further research (§4.1).
Furthermore, our survey reveals that different structural types of variation exhibit different patterns with respect to dialectality. Dialect differentiation by form variables applies equally under close or distant social contact. If we assume that dialects usually reduce their degree of contact over time, this would imply that much of the form differentiation is established during periods of close contact, and little is added once they move apart. By contrast, order and omission variables are rarely dialectal in situations of close contact, but order variables become more likely to differentiate dialects as they become distant from one another. Since social signalling is only relevant to the extent that groups are in social contact, these findings suggest that form variables are driven to a greater extent by social signalling, compared to order and omission variables. This is compatible with studies of language contact that identify divergence of 'matter' or 'lexicon', alongside convergence of 'pattern' or 'structure'. Our study builds on the previous research by operationalising specific types of linguistic variation, and showing systematic differences between them in a cross-linguistic sample.

Language speciation and diversification
In the introduction we defined 'linguistic divergence' as diversification driven by social contact, as opposed to social separation. The implication of our findings is that linguistic divergence affects not just lexical items, but also grammatical markers such as affixes and function words. Figure 6 extends the schema from Figure 1 above, representing what we conjecture to be typical pathways for form and order variables in linguistic divergence. An initially integrated social group splits into two, and the degree of social contact (dotted lines) between these groups decreases over time. Form variables tend to differentiate dialects soon after group fission, while there is still regular social interaction between members of the groups, performing much of the initial work in 'language speciation'. Form differences persist even after social contact wanes, as a relic of the earlier phase. By contrast, order variables only begin to differentiate groups once social contact wanes. If we project this schema onto a multi-millennial timescale, it would suggest that language families diversifying 'in-situ', with prolonged social contact between related varieties, should exhibit more diversity in the forms of grammatical markers compared to families that diversify in a more dispersed manner. Differentiation of grammatical markers (and lexicon) could become so extensive that these varieties would become quite distinct languages, despite a lack of geographic or social separation (François 2011; François 2012). We hope that future research will be able to test this conjecture, for example by enriching phylogenetic data with information on (historical) degrees of social contact. One preliminary study of this type has investigated contact and lexical divergence in Oceanic languages (Miceli et al. 2016), finding some evidence that more social contact favours more diversification of the lexicon. This is compatible with the findings of the current study, though our findings suggest that social contact may favour diversification not just in open-class lexical items but also in grammatical markers.
If social signalling is more easily achieved with grammatical markers than with linear ordering, this is also consistent with sociolinguistic research. There are many well-studied examples of form variables that are salient markers of social identity, such as isn't ~ ain't in English, or voseo in South American Spanish. For order variables, on the other hand, it is more difficult to point to sociolinguistically salient examples, though there are some rare cases such as British English give me it ~ give it me.

Why do order variables resist social signalling?
Why exactly should order variation exhibit less social signalling, compared to form variation? In the discussion above it was noted that order variables, where they have been studied in detail, have been shown to be strongly influenced by phonological, semantic and pragmatic factors in their contexts of occurrence (Milroy & Gordon 2003: 187; Cheshire et al. 2005). One possible explanation for their lack of social signalling is that these strong language-internal factors inhibit the development of social signalling. If variant selection is strongly predicted by the linguistic contextual factors in each instance of occurrence, this may mean that there is less variance available for social signalling.
As an example of a linguistic variable that is strongly predicted by linguistic contextual factors, take the standard English dative alternation (Bresnan & Nikitina 2009; Bresnan & Ford 2010). Regression modelling of speakers' choice between two alternative dative expressions shows that variant selection is influenced by linguistic factors including definiteness, discourse accessibility, animacy, identity of verb lexeme and the number of words in each constituent. A model combining these predictors achieves 94.5% accuracy on unseen corpus data (Bresnan & Ford 2010: 180), suggesting that although both variants are grammatically acceptable, there is in fact very little variance in their occurrence, once linguistic contextual factors are taken into account. The dative alternation is not known to have a social signalling function, and this may be precisely because there is so little variance left over after linguistic context is factored out (but see Jenset et al. 2018). Similar arguments may apply to omission variables, which are also reported to be highly conditioned by linguistic contextual factors, especially informational redundancy (Wasow et al. 2011; Kurumada & Jaeger 2015).
The argument outlined above is similar to a theory of social signalling in terms of expectations and surprisal (Rácz 2013; Jaeger & Weatherholtz 2016; Lai et al. 2020). Originally developed with respect to phonetic variables, the core proposal is that listeners learn the contextual probabilities of hearing various sounds. For a phonetic variant to be a social signal, it should have a high surprisal (negative log probability) based on purely linguistic context. For example, in British English, glottalisation of stops has low surprisal in coda position, where it occurs quite frequently as a function of language-internal articulatory patterns, and therefore has little potential to be interpreted as a social signal. But in intervocalic position it has higher surprisal, facilitating social signalling by intervocalic t-glottalisation (Rácz 2013: 145). A theory of social signalling in terms of surprisal is compatible with the idea that order variables resist social signalling because they are so heavily conditioned by linguistic context. Again, this would imply that variant selection leaves little residual surprisal, and surprisal is key to the interpretation of social signalling.
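The notion of surprisal used here can be illustrated with a short sketch. Surprisal is the negative log probability of a variant given its linguistic context; base 2 is chosen below so that the result is in bits, which is an assumption since the text does not specify a base. The two probabilities are invented purely for illustration and are not estimates from Rácz (2013) or any other source.

```python
import math


def surprisal_bits(probability):
    """Surprisal of an event in bits: -log2(p)."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Hypothetical contextual probabilities of hearing a glottalised stop.
p_glottal_coda = 0.8           # frequent in this context: low surprisal
p_glottal_intervocalic = 0.1   # rare in this context: high surprisal

if __name__ == "__main__":
    print(f"coda: {surprisal_bits(p_glottal_coda):.2f} bits")
    print(f"intervocalic: {surprisal_bits(p_glottal_intervocalic):.2f} bits")
```

On this view, a variant that is highly predictable from linguistic context carries little surprisal and so leaves listeners little to attribute to social meaning, which is the situation argued above for order and omission variables.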

Figure 1. Linguistic variables and social groups: (a) Where the selection of variants x1 ~ x2 has a similar distribution in two social groups, we call this 'intra-group' or 'non-dialectal' variation; (b) Where the selection of variants x1 ~ x2 aligns with social groups, we call this 'dialectal' variation.

Figure 3. Number of grammatical variables per language family, dialectal and non-dialectal.

Figure 4a. Counts of grammatical variables categorised as intra-group or dialectal, grouped by degrees of social contact.

Figure 5. Actual data versus probabilities predicted by the model.

Figure 6. Schematic model of how form and order variables differentiate dialects over time.

Table 2. Examples of grammatical variable coding (see Supplementary Information for further details and references)
The !Xun constituent order variation applies only with certain verbs.
14 ILL = illative case. Some elements of the glossing are simplified.
1SG NRL-house EMPH manner PROX make-ABIL-NMLZ
'my house that I am able to build like this here' 13