Language–General Auditory–Visual Speech Perception: Thai–English and Japanese–English McGurk Effects

Cross-language McGurk Effects are used to investigate the locus of auditory–visual speech integration. Experiment 1 uses the fact that [ŋ], as in ‘sing’, is phonotactically legal in word-final position in English and Thai, but in word-initial position only in Thai. English and Thai language participants were tested for ‘n’ perception from auditory [m]/visual [ŋ] (A[m]V[ŋ]) in word-initial and -final positions. Despite English speakers’ native language bias to label word-initial [ŋ] as ‘n’, the incidence of ‘n’ percepts to A[m]V[ŋ] was equivalent for English and Thai speakers in final and initial positions. Experiment 2 used the facts that (i) [ð] as in ‘that’ is not present in Japanese, and (ii) English speakers respond more often with ‘tha’ than ‘da’ to A[ba]V[ga], but more often with ‘di’ than ‘thi’ to A[bi]V[gi]. English and three groups of Japanese language participants (Beginner, Intermediate, Advanced English knowledge) were presented with A[ba]V[ga] and A[bi]V[gi] by an English (Experiment 2a) or a Japanese (Experiment 2b) speaker. Despite Japanese participants’ native language bias to perceive ‘d’ more often than ‘th’, the four groups showed a similar phonetic level effect of [a]/[i] vowel context × ‘th’ vs. ‘d’ responses to A[b]V[g] presentations. In Experiment 2b this phonetic level interaction held, but was more one-sided as very few ‘th’ responses were evident, even in Australian English participants. Results are discussed in terms of a phonetic plus post-categorical model, in which incoming auditory and visual information is integrated at a phonetic level, after which there are post-categorical phonemic influences.


Introduction
Visual (face and lip) information plays an important role in speech perception, not only in noisy but otherwise natural conditions (Sumby and Pollack, 1954), but also in clear but unnatural conditions when auditory and visual speech components are mismatched (Dodd, 1977; McGurk and MacDonald, 1976). These pioneering mismatching studies plus four decades of research on what is now called the McGurk Effect show that visual speech information is combined with auditory speech information, albeit unconsciously, whenever it is available (although the nature of the combination may differ from more natural conditions; see Alsius et al. in [ka], which generally yielded 'da' or 'ta' responses, respectively, but when the auditory component was a velar rather than a bilabial, A[ga]V[ba] or A[ka]V[pa], there were 54% and 44% combination responses, respectively. The issue of phonotactically illegal responses such as 'bga' is taken up further in the introduction to Experiment 1.
These early demonstrations sparked investigation of the issues and processes involved in the McGurk effect and auditory-visual integration (see, e.g., Bernstein et al., 2002; Robert-Ribes et al., 1995). These issues include whether auditory-visual integration is specific to speech; where, how, and when the McGurk effect occurs; and the McGurk effect within and between languages. Research in each of these areas is reviewed, especially with respect to how it relates to the psycholinguistic locus of auditory-visual speech integration, which is the focus of the two studies presented here.

Auditory-Visual Integration: Speech vs. Non-Speech
McGurk-type effects have been found with non-speech stimuli, e.g., when auditory and visual information for 'pluck' and 'bow' sounds on a cello are mismatched (Saldaña and Rosenblum, 1993). In a second experiment Saldaña and Rosenblum found the music effect to be weaker than a consonant mismatch McGurk effect, suggesting that, while the McGurk effect may not be specific to speech, different processes may be involved in speech and non-speech McGurk-like effects. This is supported by neural evidence (Baart et al., 2014) that, while peak N1 EEG responses occur earlier in auditory-visual than auditory-alone conditions in both speech and non-speech (sine-wave-speech) modes, only in the speech mode is there modulation of the P2 response and an effect of phonetically-incongruent auditory-visual stimuli on event-related potentials (ERPs) around 200 ms after stimulus onset. Moreover, there is fMRI evidence that lipreading activates the auditory cortex, but only while meaningful speech movements are observed, not for non-speech (gurning) movements (Calvert et al., 1997; Campbell et al., 2001). Thus auditory-visual and visual processing differ for speech and non-speech, though the exact nature of this difference is yet to be specified.

Auditory-Visual Integration: Where?
Based on evidence that auditory-visual integration in primates occurs in the superior temporal sulcus (STS), Beauchamp, Nath and Pasalar (2010) investigated the role of the STS in the McGurk effect. They found the incidence of adults' fusion responses to A[ba]V[ga] (but not auditory-only [ba]) was significantly reduced when a transcranial magnetic stimulation (TMS) signal was delivered to the STS within 100 ms of auditory stimulus onset. This temporal constraint is consistent with other findings: ERPs for incongruent auditory-visual speech presentations 200 ms after stimulus onset (Baart et al., 2014); speeding up of cortical processing of auditory signals if visual speech is presented within 100 ms of auditory signal onset (Van Wassenhove et al., 2015); and fusion responses to McGurk effect stimuli even up to an auditory-visual temporal asynchrony window of 200 ms (Munhall et al., 1996; Van Wassenhove et al., 2007).

Auditory-Visual Integration: How?
The issue of how auditory-visual integration occurs is addressed below, first in a nonlinguistic and then in a linguistic manner.

Common or Separate Metrics for Auditory and Visual Information
A broad distinction between the possible processes involved in the McGurk effect is between integration of auditory and visual information in a common metric space vs. the combination of modality-specific auditory and visual information in a sequential decision process.
The sequential process view is well exemplified by Massaro's (1987, 1996) fuzzy logic model of speech perception (FLMP). In this, incoming auditory, visual, and top-down information are processed independently and in parallel, then evaluated, with features in each source being independently assigned fuzzy truth values for the degree to which they match phoneme prototype features, and finally there is specification of a particular phoneme. Thus auditory and visual perceptual information remain separate rather than being represented in a common metric.
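The evaluation and decision stages described above can be illustrated numerically. The following is a minimal sketch, assuming the multiplicative combination and relative-goodness decision rule commonly associated with the FLMP; the function name, the three response alternatives, and all truth values are hypothetical illustrations rather than values from any study.

```python
def flmp_response_probs(auditory, visual):
    """Combine per-modality fuzzy truth values for each response alternative.

    auditory, visual: dicts mapping a response label to a truth value in
    [0, 1], i.e. the degree to which that source matches the prototype.
    Each source is evaluated independently; the values are only combined
    at the decision stage (multiply, then normalise across alternatives).
    """
    support = {label: auditory[label] * visual[label] for label in auditory}
    total = sum(support.values())
    return {label: value / total for label, value in support.items()}


# Hypothetical truth values for an A[ba]V[ga] presentation.
auditory = {"ba": 0.80, "da": 0.15, "ga": 0.05}  # the sound is most like 'ba'
visual = {"ba": 0.05, "da": 0.45, "ga": 0.50}    # the lips rule out 'ba'

probs = flmp_response_probs(auditory, visual)
best = max(probs, key=probs.get)
print(best, probs)
```

With these illustrative values the alternative with the most combined support is 'da', even though neither modality alone favours it, which is how a model of this kind can produce a fusion response without a common auditory-visual metric.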

Phonetic or Phonemic Auditory-Visual Integration
Another way in which this distinction may be couched is between phonetic and phonemic modes of psycholinguistic processing. The phonetic vs. phonemic distinction rests on language experience. A large number of speech sounds, phones, are used across the world's languages, but only a subset of phones or groupings of phones, phonemes, are used to distinguish meaning in a particular language. Auditory-visual integration at a phonetic level would be in a space in which all language-general phonetic information was available free from specific linguistic experience; conversely, auditory-visual integration at a phonemic level would allow language-specific information to impinge upon the percept, or perhaps more correctly upon the participants' verbal responses. As common metric theories emphasise the abstract nature of this metric, they would be more in line with phonetic processing. As separate metric theories rest on comparison of truth values to learned prototypes, they would be more in line with phonemic processing.
An ingenious set of experiments by Green (Green and Kuhl, 1989, 1991; Green and Miller, 1985) addresses this issue by examining whether established auditory speech perception phenomena also occur when some phonetic information is specified auditorily, and some visually. For example, when identifying speech sounds along an acoustic [bi]-[pi] voice onset time (VOT) continuum, English language participants have higher VOT crossovers (more [bi] responses) when the vowel length of [i] is increased. To test this cross-modally, Green and Miller (1985) paired an auditory [bi]-[pi] continuum spoken at medium rate with visual articulations of [bi] and [pi] spoken at slow and fast rates, and found that visual speech rate information acted upon auditory perception in the same way as did auditory rate information. These results show psycholinguistic-level interaction of auditory and visual information, and that visemes independently specify phonetic information, just as their auditory counterparts do. However, the psycholinguistic level at which this occurs is not wholly specified. To determine this, an extension would be required in which the same auditory-visual integration is specified in a language in which vowel length is phonemic. If the same effects were still obtained then this would support a phonetic level explanation, as the change in vowel length would entail a change of phoneme, not just a phonetic change. A similar, but not identical, line of reasoning is employed in the two experiments reported here.

Auditory-Visual Integration: When?
The common metric and phonetic views imply auditory-visual integration temporally early in psycholinguistic processing, while prototype and phonemic views imply temporally later processing. Visual speech activates auditory cortex (Calvert et al., 1997; Campbell et al., 2001; see above) and brainstem FFRs (Frequency Following Responses) show that as early as 11 ms post-acoustic stimulus onset, auditory-visual speech compared with unimodal speech results in a 1.3 ms delay at the brainstem, with slightly greater suppression for congruent, A[da]V[da], than incongruent, A[da]V[fu], AV pairings (Musacchia et al., 2006).
One interpretation of these studies is that auditory-visual integration and/or visual influence occurs in an auditory brainstem common metric (with differential treatment of congruent and incongruent McGurk-like stimuli, Musacchia et al., 2006), and that such auditory coding continues through to the auditory cortex both generally (Calvert et al., 1997; Campbell et al., 2001; see above) and differentially, as fMRI measures show there is greater posterior parietal cortex activation for concordant than for discordant auditory-visual stimuli (Saito et al., 2005). However, further evidence is required before definitive conclusions can be drawn.

Auditory-Visual Integration: Within and Between Languages
Phoneme classes are the product of linguistic experience. At birth, infants discriminate between phones irrespective of whether they are native or nonnative in the ambient speech environment. Then gradually, and especially from around four months, infants learn to perceive speech in terms of the phoneme classes of the surrounding (native) language or languages. The attunement process is similar across languages, but the result differs: particular language-specific phonemic classes are set up over and above the phonetic substrate. As attunement is attentional rather than structural, there is no absolute loss of perceptual discrimination and within-phoneme (phonetic) information remains available. Thus investigation of auditory-visual integration within and between languages should prove instructive for the issue of phonetic vs. phonemic level processing.

Auditory-Visual Integration: Development Within Languages
The McGurk effect is evident early in infancy (Burnham and Dodd, 2004; Desjardins and Werker, 2004; Rosenblum et al., 1997). For example, 4 1/2-month-old infants perceive A[ba]V[ga] as 'da' or 'tha' significantly more often than as 'ba' (Burnham and Dodd, 2004). Such evidence could be taken to show auditory-visual integration occurs before phonemic processing is evident in development. However, as with the Green studies and their relation to phonetic processing, this evidence is not watertight, as there are indications of phonemic influences even at this tender age.
First, it has been found that English language 10-week-olds match a voice to a talking face, irrespective of whether the face-voice matching is for a native English or a foreign voice; but 20-week-olds only do so for their native English language (Dodd and Burnham, 1988). Such auditory-visual speech perception language-specificity is also evident in visual speech; English-language 4- and 6-month-olds discriminate English and French languages on the basis of visual information alone, but 8-month-olds only do so if they have bilingual English/French experience (Weikum, Vouloumanos, Navarra, Soto-Faraco, Sebastian-Galles and Werker, 2007).
Second, there is recent evidence that infants as young as three months have already set up phonological categories. Thirty-year-old adults who were adopted at three to five months from Korean families and brought up in Dutch families in the Netherlands showed, compared to 30-year-old Dutch controls, better perceptual learning of a now foreign Korean three-way voicing contrast, better generalisation of such learning to different speech (voicing) contrasts, more accurate production of the contrasts, and significant speech perception-production correlations, compared to inferior perception and no such correlations in controls (Choi et al., 2017). These results show that phoneme categories are set up before three to five months and are remarkably strong and resilient. Thus, while 4 1/2-month-old infants' perception of the McGurk effect is striking, it is not necessarily achieved in the absence of phonological categories, so the occurrence of the McGurk effect in infancy is silent with respect to whether phonetic or phonemic processes are involved in the mature McGurk effect. The studies reported here are designed to provide further evidence on the influence of phonetic and phonemic factors on McGurk effect responses.

Auditory-Visual Integration: Development Between Languages
The incidence and nature of the McGurk effect change as a function of language in three ways. First, the McGurk effect is stronger in some languages than others. In a series of studies Sekiyama and her colleagues found a much weaker McGurk effect in Japanese than in English listeners: Japanese listeners are much less influenced by visual speech (Sekiyama and Tohkura, 1991). However, Japanese listeners do increase their reliance on visual information when auditory noise is introduced (Sekiyama, 1994), and Sekiyama et al. (1995) found other conditions (when easy-to-lipread speakers are used, or when speakers lengthen their vowels) under which Japanese listeners' incorporation of visual cues is facilitated. Nevertheless, Japanese listeners generally show less robust McGurk effects than English language listeners. Sekiyama and Burnham (2008) discovered the developmental locus of this departure; while Japanese- and English-language six-year-olds show equivalent McGurk effect strength, between six and eight years McGurk effect strength in English-language children increases, while that for Japanese children remains unchanged between six and eight years, and even beyond, at 11 years and in adulthood. This six- to eight-year-old increase for English language children is thought to be due to the relative complexity, compared to Japanese, of the English phonological repertoire and its more complex phonotactics, combined with the opacity of the English script (which is being learned between six and eight years) (Erdener and Burnham, 2013; Sekiyama and Burnham, 2008).
Second, there are inter-language effects on auditory-visual speech integration. In a 2 × 2 design Sekiyama and Tohkura (1993) investigated American and Japanese listeners' perception of the McGurk effect with stimuli spoken by American and Japanese speakers. Over and above Japanese listeners' generally reduced visual influence, both Japanese and American listeners demonstrated greater visual influence when presented with stimuli in the non-native language. Sekiyama and Tohkura (1993) explained this in terms of auditory ambiguity: when participants hear phones that are phonemically relevant in their language but acoustically deviant, then any extra information that can be used will be used. Similar foreign speaker effects have been found both for Japanese and American speakers (Kuhl, Tsuzaki, Tohkura and Meltzoff, 1994), and also in a range of other cross-language contexts: when Austrian and Hungarian listeners are presented with an Austrian speaker McGurk effect stimulus, more McGurks (more visual influence) are found in the Hungarian listeners; de Gelder, Bertelson, Vroomen and Chen (1995) found McGurk fusion effects in both Dutch and Cantonese language participants listening to a Dutch speaker, but the incidence of visually-influenced blends was greater for the Cantonese listeners; and Fuster-Duran (1996) found that when presented with auditory-visual conflict in pairs of German words, e.g., 'brat-Grad', or in Spanish words, e.g., 'napa-paca', Spanish and German participants incorporated visual information more for foreign than for native auditory-visual words.
Third, in a cross-language study Werker et al. (1992) found more 'tha' identifications of A[ba]V[ða] by English than by French participants; the latter tended to respond with 'da' or 'ta'. However, the frequency of 'tha' identifications increased as a function of English language experience. This linguistic influence could be perceptual: French participants may, through their language experience, have a phonologically-determined perceptual bias against perceiving [ð], which is ameliorated by English-language experience. However, it could equally be a post-perceptual labelling effect: French participants may perceive [ð] just as often as their English-speaking counterparts but could have a phonologically-determined response bias against reporting their perceptual experience of [ð] as 'th'. While Werker et al. (1992) ensured that their French participants could produce [ð] and also found no difference in French participants' incidence of written and spoken 'd' or 't' responses to acoustic-only [ð] presentations, it is nevertheless possible that there was a top-down language-based response bias against a phone not used in the perceivers' native phonology or orthography. The experiments reported here are aimed at obviating such ambiguities, by setting up McGurk effect situations in which both phonetic and phonological information are available and attempting to specify which in fact is used.
This review shows that the McGurk effect may be viewed in two ways. On the one hand the McGurk effect and auditory-visual integration may occur temporally early in processing, in a common abstract auditory-visual space, and involve language-general phonetic psycholinguistic processes. On the other, they may occur temporally late in processing, in disparate auditory and visual spaces, and involve language-specific phonemic psycholinguistic processes. The evidence provides some convergence on where the McGurk effect and auditory-visual integration may occur (e.g., Beauchamp et al., 2010), the temporally early fusion of auditory and visual information (e.g., Musacchia et al., 2006), the effect of low level phonetic information (e.g., Green and Miller, 1985), the early appearance of auditory-visual integration in development (e.g., Burnham and Dodd, 2004; Dodd and Burnham, 1988), and between-language differences (Sekiyama and Burnham, 2008). But whether the McGurk effect is a phonetic or phonemic phenomenon cannot yet be ascertained. The review sets out the nature of this distinction and how it is tied in with other issues: early vs. late processing, and common vs. disparate space. The experiments reported here bear directly on this issue and will provide information on the nature of the psycholinguistic influences on the McGurk effect and, possibly more generally, on auditory-visual integration.

General Hypotheses
The two views of the nature of the McGurk effect set out above may be represented in the following two hypotheses. If:
• Auditory-visual speech integration is language-specific, phonemic, then the prevalence of the McGurk effect should be sensitive to the phoneme classes of the native language of the perceiver; the same set of McGurk stimuli (e.g., an AxVy combination) should result in different emergent percepts depending on the native language of the perceiver, and
• Auditory-visual speech integration is language-general, phonetic, then the same set of McGurk stimuli, AxVy, would result in the same emergent percept, irrespective of the perceiver's language background.
Two experiments designed to test these two alternatives are presented here. Each uses between-language differences in order to tease out phonetic vs. phonemic effects. In the first experiment differences in phonotactic constraints between English and Thai are the research tool, and in the second differences in phoneme repertoire between English and Japanese are the tool. In addition, both use the McGurk effect as a research tool. While it is clear from the above review that incongruent McGurk and congruent auditory-visual stimuli may differ in their integration processes, any McGurk effect results here will point the way for further auditory-visual integration research.

Experiment 1: Phonotactic Constraint Differences – the A[m]V[ŋ] Fusion in Thai vs. English
The logic of this experiment rests on two facts. First, in English, the A[m]V[ŋ] auditory-visual pairing (which incorporates visual [ŋ], as in sing) results in the emergent percept 'n'. Second, in Thai the use of the phoneme [ŋ] is phonotactically unconstrained; it can be used in word-initial, -medial or -final positions, whereas in English it is constrained to use in word-medial and -final positions; it cannot be used in word-initial position. There are mentions of phonotactically illegal phone strings in previous literature. Even the original McGurk and MacDonald (1976) paper mentions possible responses of 'bga' which, while phonotactically illegal in the written form, is probably not in the spoken form, as two spoken stop consonants must have at least a voice onset gap between them (and probably an inserted schwa sound). Here, it is not a phonotactically illegal response that is the target, but rather the use of [ŋ] in the A[m]V[ŋ] stimulus in the initial position, which is of interest because of the phonotactic illegality of [ŋ] in initial position in English.
Thai and English adults were presented with A[m]V[ŋ], either in syllable-final or syllable-initial position, and with other control (auditory-visual, auditory-only and visual-only) stimuli. Given the phonotactic illegality of initial [ŋ] in English and the ambiguity of visual-only [ŋ] (Mills, 1987; Mills and Thiem, 1980), English language participants would be expected to respond 'n' more often in the initial than in the final position for visual-only [ŋ] and matching auditory-visual A[ŋ]V[ŋ]. Over and above this, if auditory-visual processing occurs phonemically, then the relative incidence of 'n' responses to A[m]V[ŋ] should be greater in initial than final position for English but equivalent in initial and final positions for Thai participants. If, on the other hand, auditory-visual processing occurs at a more basic phonetic level, then English participants' responses to A[m]V[ŋ] should also be equivalent in initial and final positions, and should also not differ from those of the Thai participants.

Participants
A total of 24 native Thai and 24 native Australian English adult speakers were tested, 12 males and 12 females in each language group. The mean age of the Australian English participants was 25 years 0 months and of the Thai participants was 31 years 2 months. The Australians were all monolingual with no experience of Thai, or any language in which [ŋ] is used in word-initial position. The optical and acoustic components were then separated and combined to produce auditory-visual, auditory-only, and visual-only stimuli as follows. Auditory-visual congruent:

Procedure
Participants sat in a room facing, and 50 cm away from, a video monitor, with a response key on a table in front of them. The response key contained a central 'ready' key surrounded, in a semicircle, by five response buttons, labelled 'm', 'm-ng', 'n', 'ng-m', and 'ng' for the English participants, and with the corresponding Thai graphemes (for m, m-ng, n, ng-m, and ng) for the Thai participants. There were two phases in the experiment, one with initial position consonant stimuli, and one with final position consonant stimuli, with phase order counter-balanced between participants, and in each phase there was a practice and a test block.
The practice block consisted of 15 practice trials; each of the five consonants/consonant clusters was presented three times, once in each of auditory-visual, auditory-only, and visual-only modes. These trials were included to alert participants to the possible responses they could make, to give them practice at using the five keys, and to eliminate the data of participants whose responses were too slow (see below). In practice trials an output from the experiment control program (i) flashed a reward light on the left side of the monitor for correct responses, and (ii) sounded an error buzzer to inform the participant (and experimenter) of any failure to respond within 2.2 s, or of the participant taking their finger off the 'ready' button prior to the onset of the sound. The participants were told the meaning of these rewards and errors. Such feedback was only given in practice trials, not in test trials.
The instructions given at the beginning of each phase included a request to respond as quickly as possible after the stimulus had appeared on the video monitor. All trials, practice and test, proceeded as follows. To ensure concentration, any particular trial only commenced once the participant pressed the 'ready' key. Once the 'ready' key was pressed, the stimulus was presented on the video monitor and the participant's task was to press one of the five response keys as quickly and accurately as possible. The onset of the acoustic component of the stimulus triggered a computer clock, such that participants' button press reactions were timed from sound onset, or in the case of visual-only trials from the 'onset' of the original but now silent auditory component.
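The timing rules described above can be summarised in a short sketch. This is only an illustration under stated assumptions: the function name, argument names, and status labels are hypothetical and do not describe the original experiment control program; only the 2.2 s limit, the early-release error, and the timing of reactions from sound onset are taken from the text.

```python
RESPONSE_DEADLINE_S = 2.2  # response time limit stated in the Procedure


def classify_trial(sound_onset_s, ready_release_s, key_press_s):
    """Return (status, reaction_time_s) for one trial.

    All times are in seconds on a common clock. For visual-only trials,
    sound_onset_s is the onset of the original (now silent) audio track.
    key_press_s is None if no response key was pressed.
    """
    # Lifting the finger from the 'ready' key before sound onset is an error.
    if ready_release_s < sound_onset_s:
        return ("early_release", None)
    # No response, or a response after the deadline, is an error.
    if key_press_s is None:
        return ("too_slow", None)
    reaction_time_s = key_press_s - sound_onset_s
    if reaction_time_s > RESPONSE_DEADLINE_S:
        return ("too_slow", None)
    return ("ok", reaction_time_s)


# Hypothetical trial: sound at 1.0 s, finger lifted at 1.2 s, key pressed at 1.5 s.
status, rt = classify_trial(1.0, 1.2, 1.5)
print(status, rt)
```

In this sketch a trial yields a reaction time only when the finger stays on the 'ready' key until sound onset and a response key is pressed within the time limit.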

Results
As [ŋ] is phonotactically illegal in initial but not The overall distribution of responses on such trials, the incidence of 'n' responses on such trials, and the reaction times for 'n' responses on such trials are reported in turn below.

Response Distribution on V[ŋ], A[ŋ]V[ŋ], and A[m]V[ŋ] Test Trials
Confusion matrices for each of the three stimulus types containing visual [ŋ], crossed with the five possible responses, are set out in Table 1.
As can be seen in Table 1   This is not due to a differential distribution of responses other than the 'n' (emergent fusion) responses. As can be seen in Table 1

Reaction Times on V[ŋ], A[ŋ]V[ŋ], and A[m]V[ŋ] Test Trials
Reaction times for all stimuli involving a visual [ŋ] are shown in Fig. 2.

V[ŋ] Trials
English subjects' combined reaction times (RTs) on V[ŋ] trials were significantly longer than those of Thais, F(1, 46) = 8.26 (M difference = 201 ms), but there were no other main or interaction effects. English speakers generally took longer to process visually ambiguous [ŋ], presumably due to general lack of experience with this viseme.

A[ŋ]V[ŋ] Trials
RTs for both language groups were generally slower on final than initial conditions for A[ŋ]V[ŋ] trials (M difference = 240 ms), F(1, 46) = 7.72, presumably due to RTs being measured from sound onset, resulting in slower RTs for final consonants. English were generally slower than Thai language participants (M difference = 224 ms), F(1, 46) = 21.71, but there was no interaction with initial/final position, indicating that the slower English RTs are probably due to the general paucity of [ŋ] in the English language (Roberts, 1965), rather than the specific phonotactic illegality of initial [ŋ].
Nevertheless, when the incongruent A[m]V[ŋ] is presented, in English language participants these language-specific and word-position-specific effects disappear; there is equal incidence of 'n' fusion responses by Thai and English language participants in the initial position and in the final position, and the reaction times are also equivalent for the two language groups. It is as if the illegality of [ŋ] in the initial position for English language participants has ceased to exist. These results are consistent with the integration of A[m] and V[ŋ] in a space in which language-specific phonotactic constraints indeed do not exist, in a phonetic language-general space.

Experiment 2: Phonemic Repertoire Differences – Japanese vs. English
The equivalence of the McGurk effect across languages irrespective of phonotactic differences in Experiment 1 could be taken to show that auditory-visual integration occurs phonetically. However, it could be argued that as [ŋ] is a member of the English phoneme repertoire, albeit phonotactically illegal in initial position, English language speakers may have treated [ŋ] as a native phoneme even though it was in an illegal position, and the equivalent McGurk effects may then be said to have occurred in a phonemic space. The different patterns of responses to V[ŋ], A[ŋ]V[ŋ], and A[m]V[ŋ] make this unlikely. Nevertheless, the phonetic-not-phonemic conclusion would be strengthened if equivalent cross-language McGurk effects were found when a particular phone is a phoneme in one language but not in the other.
The next set of experiments rests on four established facts and phenomena: (1) the phone [ð], as in 'that', is phonemic in English but not Japanese, (2) the incongruent A[b]V[g] stimulus presented to English language speakers results predominantly in 'd' or 'th' responses, (3) the relative incidence of different non-auditory responses when presented with incongruent AV stimuli differs as a function of vowel context, and (4) English language speakers' relative incidence of 'd' and 'th' responses to the A[b]V[g] stimulus differs as a function of whether the A[b]V[g] stimulus is A[ba]V[ga] or A[bi]V[gi]. Following a brief elaboration of these points, the results of two experiments with (i) English language and (ii) Japanese language participants are presented.
Hampson et al. (2003)  g] across vowel contexts, Shigano found there were considerably more auditory-based 'b' responses than are found with English language participants (see also Sekiyama and Burnham, 2008), and that over and above this, across vowel contexts, [i] to [a] to [u], the incidence of auditory-based 'b' responses increased (38% to 43% to 67%, respectively) while the incidence of 'd' fusion responses decreased (59% to 30% to 3%, respectively).
Finally, Green (Green, 1996, 1998; Green and Norrix, 1997)
Note that this is different to the way in which Werker et al. (1992) exploited the lack of [ð] in French; there they investigated the incidence of 'tha' (compared with 'ta' or 'da') responses to A[ba]V[ða] in English and French language participants. That design still allows phonological influences to override any phonetic-level effects. In the design here it is the [a]/[i] × 'd'/'th' pattern of results that is of interest, and this should be influenced only by phonetic interactions, that is, in A[b]V[g], the different co-articulatory effects of auditory [b] with [a] or [i] and visual [g] with [a] or [i].
This proposition was tested with a group of Australian English language participants, and three groups of Japanese participants with Beginner, Intermediate, or Advanced level of English proficiency. In Experiment 2a an English speaker presented the stimuli, and in Experiment 2b a Japanese speaker presented the stimuli.

Experiment 2a: the [a]-[i]/'d'-'th' Effect With an English Speaker
Method
Sixteen native adult Australian English speakers, and 48 native adult Japanese speakers were tested, 16 at each of three levels of English proficiency, Beginner, Intermediate and Advanced. All Japanese participants had had standard English lessons for six years in secondary school.
Beginner and Intermediate participants were volunteers from language schools in Sydney. Their English ability was determined at the time of enrolment on the basis of an oral interview and written test (National ELICOS, English Language Intensive Courses for Overseas Students, Accreditation Scheme, used in all accredited language schools). In the Beginner group, there were 11 females and five males (mean age = 24.5 years, range = 17.5-28 years), who had been in Australia for a mean of six weeks (range 1 week-3.5 months) prior to testing. In the Intermediate group there were seven females and nine males (mean age = 23.4 years, range = 17.8-31.5 years), who had spent a mean period of three months (range 2 weeks-12 months) in Australia.
Advanced group participants were required to have worked in Australia or New Zealand for at least one but not more than 10 years and were recruited by advertisement and word of mouth. There were 12 females and four males (mean age = 28.4 years, range = 20-40.9 years), and they had lived in Australia for a mean of three years (range 1.1-8 years). The majority were undergraduate and postgraduate students studying at various institutions in Sydney and some were Japanese lecturers/teachers working in Sydney.
In the Australian English group there were 10 females and six males (mean age = 24.0 years, range = 19-48 years) recruited from the first-year psychology student pool at the University of NSW. All but two of the English participants were monolingual; however, both bilingual participants reported speaking mostly English on a daily basis.
Stimuli
The procedure for testing and data collection was exactly the same as for Experiment 1 except that the response pad with its central 'ready' key had six response buttons (labelled 'b', 'g', 'd', 'th', 'bg' and 'gb') arranged in a semicircle around it.
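As an illustration of how responses on the six-button pad might be scored, the following minimal sketch maps each button label to a response category and tallies percentages. The six labels come from the text; the category names ('auditory', 'visual', 'fusion', 'combination') and the example responses are assumptions made for illustration, not the authors' scoring code or data.

```python
from collections import Counter

# Hypothetical coding scheme: button labels are from the text, category
# names are assumed from common McGurk-effect terminology.
RESPONSE_CATEGORY = {
    "b": "auditory",      # matches the auditory component of A[b]V[g]
    "g": "visual",        # matches the visual component of A[b]V[g]
    "d": "fusion",
    "th": "fusion",
    "bg": "combination",
    "gb": "combination",
}


def category_percentages(responses):
    """Return the percentage of trials falling in each response category."""
    counts = Counter(RESPONSE_CATEGORY[r] for r in responses)
    total = len(responses)
    return {category: 100.0 * n / total for category, n in counts.items()}


# Made-up responses from one participant to A[ba]V[ga] (not study data).
example = ["th", "th", "d", "b", "th", "g", "d", "th"]
print(category_percentages(example))
```

Keeping 'd' and 'th' under a single 'fusion' category mirrors the way overall fusion rates are reported before the two responses are separated.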

Results and Discussion
The native phoneme bias results are presented first, then those for the incongruent trials.

Native Phoneme Bias
Figure 3 shows the percentage of correct ('th') responses on AO, VO, and AV trials, collapsed over vowel context. As expected (because [ð] is not phonemically relevant in Japanese) English language participants made more correct responses than Japanese perceivers, and there was an increase in correct responses as the Japanese perceivers' experience with English increased (Beginner to Intermediate to Advanced). Statistical analyses confirmed these observations. Japanese participants made more errors than Australian English participants on [ð] trials in all three modes: AV, F (1, 60) = 11.16; AO, F (1, 60) = 24.26; and VO, F (1, 60) = 7.39. Additionally, correct responses for Japanese participants improved linearly as a function of English language experience in AV, F (1, 60) = 8.03, and AO trials, F (1, 60) = 10.04, and quadratically in VO trials, F (1, 60) = 7.43.
Thus, as would be expected on the basis of their native phonology, Japanese participants have a native phoneme bias against reporting 'th' when presented with [ð] in auditory-only, visual-only, and auditory-visual conditions. And, as would also be expected, this bias is ameliorated as a function of increasing facility with English.
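The group comparison and trend tests above all have 1 and 60 degrees of freedom, which is consistent with single-degree-of-freedom planned contrasts on four groups of 16 participants (64 - 4 = 60). The sketch below shows how such a contrast F ratio can be computed. It is written under that assumption only: the source does not spell out the exact procedure, the contrast weights are standard textbook choices, and the scores are made up.

```python
def contrast_f(groups, weights):
    """F ratio (df = 1, N - k) for a planned contrast across k independent groups."""
    means = [sum(g) / len(g) for g in groups]
    n_total = sum(len(g) for g in groups)
    df_error = n_total - len(groups)
    # Pooled within-group (error) mean square.
    ss_error = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_error = ss_error / df_error
    # Contrast sum of squares: estimate^2 / sum(w_j^2 / n_j).
    estimate = sum(w * m for w, m in zip(weights, means))
    ss_contrast = estimate ** 2 / sum(w ** 2 / len(g) for w, g in zip(weights, groups))
    return ss_contrast / ms_error, df_error


# Assumed group order: English, Beginner, Intermediate, Advanced.
CONTRASTS = {
    "English vs. Japanese": [3, -1, -1, -1],
    "Linear trend (Japanese groups)": [0, -1, 0, 1],
    "Quadratic trend (Japanese groups)": [0, 1, -2, 1],
}

# Made-up percent-correct scores, 16 per group (not data from the study).
offsets = [-15, -10, -5, 0, 5, 10, 15, 0] * 2
groups = [[mean + o for o in offsets] for mean in (85, 50, 60, 75)]

for name, weights in CONTRASTS.items():
    f_value, df_error = contrast_f(groups, weights)
    print(f"{name}: F(1, {df_error}) = {f_value:.2f}")
```

Each contrast is tested against the same pooled error term, which is why every test shares the same error degrees of freedom.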

Response to Incongruent A[b]V[g] Trials
Given this native phoneme bias, the next question concerns the pattern of 'd' and 'th' responses across [a] and [i] vowel contexts for the Japanese and English language participants. Figure 4 shows   each vowel context, [a] and [i] revealed that English participants (83.6%) gave more fusion responses than did Japanese participants (70.3%), F (1, 60) = 5.11 (as would be expected; Sekiyama and Tohkura, 1991). There were generally more fusion responses in the [i] than [a] vowel environment (total = 84.1% and 69.8%, respectively), F (1, 60) = 31.31, but this difference was due mainly to the Japanese participants (total = 83.6% vs. 56.8% for [i] and [a], respectively) rather than the English language participants (total = 84.4% vs. 82.8%), F (1, 60) = 8.9. In addition, as expected on the basis of a native phoneme bias, Japanese participants made many more 'da' (42.3%) than 'tha' (28.0%) responses whereas English language participants made more 'tha' (73.0%) than 'da' (10.5%) responses, F (1, 60) = 36.25.  F (1, 60) = 10.80, and despite the above differences between Japanese and English language participants in the number of fusion responses and the relative proportion of 'd' and 'th' responses, this overall [a]/[i] vowel × 'd'/'th' response effect did not interact with language background, F (1, 60) = 0.058. As can be seen in Fig. 4, the overall effect is due to more 'th' than 'd' responses in the [a] vowel context, and more 'd' than 'th' responses in the [i] vowel context, exactly as was hypothesised. For McGurk stimuli, English language and Japanese language participants respond to changes in vowel context in a statistically equivalent manner, presumably at a phonetic level of processing.
Although there was no interaction with language background, it is of interest to inspect the individual [a]/[i] vowel × 'd'/'th' plots (see Fig. 5). The effect can be viewed in two ways (and these are set out in graphic form in Ta Japanese group showed a reversal of the effect and for the Advanced Japanese group the advantage was quite small. These departures are not significant, as there was no interaction of the [a]/[i] vowel × 'd'/'th' effect with language, but nevertheless, it could be argued that a more powerful experiment with more participants may be required to detect any language background differences. However, there was sufficient power in the analyses to detect significant language background differences in other respects: for Japanese vs. English language participants there were fewer fusion responses overall; more fusion responses in the [i] than [a] vowel environment; and more 'da' than 'tha' responses overall. Nevertheless, further elaborations would be useful.
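One way to see why the [a]/[i] vowel × 'd'/'th' effect can be the same across groups despite different overall 'd' and 'th' rates is to express it as a difference of differences. The sketch below is illustrative only: the function name and the percentages are hypothetical, and this is not the analysis reported in the paper.

```python
def vowel_by_response_interaction(pct):
    """('th' - 'd') in [a] minus ('th' - 'd') in [i].

    Positive values match the predicted pattern: relatively more 'th'
    in the [a] context and relatively more 'd' in the [i] context.
    """
    return (pct["a"]["th"] - pct["a"]["d"]) - (pct["i"]["th"] - pct["i"]["d"])


# Made-up response percentages (not data from the study).
english_like = {"a": {"th": 70, "d": 10}, "i": {"th": 30, "d": 55}}
japanese_like = {"a": {"th": 30, "d": 25}, "i": {"th": 20, "d": 60}}

print(vowel_by_response_interaction(english_like))   # 85
print(vowel_by_response_interaction(japanese_like))  # 45
```

In this made-up example the second profile has a much stronger overall 'd' preference, yet both scores are positive: the direction of the interaction does not depend on the overall response bias.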

Experiment 2b: the [a]-[i]/'d'-'th' Effect With a Japanese Speaker
The results in Experiment 2a, and the veracity of the conclusion that the same or similar phonetic effects occur for Japanese and English language participants, may rest on the fact that an English speaker was used and that both English and Japanese language participants would expect an English language speaker to use the [ð] consonant. A Japanese speaker would not be expected to do so, and this may change the nature of any fusion responses. Experiment 2b was the same as Experiment 2a except that a Japanese speaker was used as the stimulus person.
Four new groups of 16 participants were tested. In the Beginner group there were 12 females and four males (mean age = 24.1 years, range = 18-31 years) and they had been in Australia for a mean of 4.2 weeks (range: 3-11 weeks); the Intermediate group consisted of 12 females and four males (mean age = 24.1 years, range = 18-31 years) and they had been in Australia for a mean of 4.1 months (range: 2-8 months); and the Advanced group had 11 females and five males (mean age = 26.6 years, range = 20-37 years), and they had been in Australia for a mean of 4.13 years (range: 1.5-9.3 years). The Australian English group, 11 females and five males (mean age = 20.7 years, range = 18-26 years), was recruited from the first-year psychology student pool at the University of NSW.
Figure 6 shows the percentage of 'th' responses on AV, AO, and VO [ð] trials. As can be seen, the results on [ð] trials with a Japanese speaker were similar to those obtained when an Australian English speaker was used, but there were some differences. Both English (mean = 95.31%) and Japanese (mean = 88.28%) participants performed well on AV [ð] trials, and there was no significant difference between them. Similarly, when only visual information for [ð] was available, both English (mean = 94.54%) and Japanese (mean = 85.15%) participants performed well, but the ability to respond to VO [ð] correctly with 'th' increased for Japanese participants as a function of their experience with English, F (1, 60) = 4.22. With only auditory information, however, there was a clear advantage for Australian English over Japanese perceivers in their perception of 'th' from AO [ð] (mean = 68.75% vs. 49.05%), F (1, 60) = 16.62, indicating a native phoneme bias in the Japanese listeners when only auditory information is available.

Response to Incongruent A[b]V[g] Trials
There was a similar native phoneme bias here for Japanese participants perceiving a Japanese speaker as in the previous experiment when an Australian English speaker was used.  Fig. 7; Fig. 8 shows the same separately for each language group.
Compared to Experiment 2a with an English language speaker, here with a Japanese speaker, the percentage of 'th' responses was greatly reduced such that there were far fewer 'th' than 'd' responses for both the Japanese and the English language participants, F (1, 60) = 335.95. This was accompanied by a departure from the usual 'Japanese McGurk effect' (Sekiyama and Burnham, 2008; Sekiyama and Tohkura, 1991): the number of fusion responses was statistically equivalent between Japanese (mean = 56.5%) and English language participants (mean = 62.5%), F (1, 60) = Japanese Advanced, and a reversal of the expected effect for the English language group. As even the English language group did not show the expected result, and as there were very few 'th' responses for the Japanese groups, no meaningful statements can be made about the 'th' response half of the expected results. Nevertheless, all four groups showed the same expected effect for the 'd' responses. Thus there is evidence for similar phonetic level processing by the English and the Japanese participants, but here the evidence for the full [a]/[i] vowel × 'd'/'th' response effect is not as strong as in Experiment 2a, as it appears that the Japanese speaker did not afford the perception of a 'th' consonant from A[b]V[g] even for English language participants. It is perhaps the case that the particular acoustics of Japanese [b], or the particular visible articulatory movements of Japanese speech, or both, affect perception of 'th'. Resolution of this issue awaits further research.

General Discussion
In the introduction, behavioural and brain response data were presented to show where, how, and when, and in what linguistic and non-linguistic contexts, auditory-visual integration occurs, both in artificial (McGurk) contexts and in more natural contexts. However, it was concluded that whether the interaction between auditory and visual percepts of speech occurs at a language-general, phonetic level or at a language-specific, phonemic level is yet to be determined. While further research employing brain response and behavioural identification data would clarify integration of multisensory speech stimuli, the research studies reported here provide new information, and may assist in guiding such further research.
Experiments 1 and 2 provide evidence that in the McGurk effect, auditory and visual speech information is initially integrated at a phonetic level of processing. Experiment 1 provides such evidence despite a difference in phonotactic constraints in Thai and English, and Experiment 2 provides such evidence despite a difference in phoneme response repertoire in English and Japanese.
In This shows a phonetic level of integration. There was no significant interaction of this effect with language group, and inspection of each group reveals similar results across groups with minor deviations. When this experiment was repeated with a Japanese speaker (Experiment 2b), there was a relative absence of 'th' responses, even for the English language participants; and the hypothesis was only supported with respect to the pattern of 'd' responses across the two vowel contexts.
The Japanese results also reveal other influences on auditory-visual speech perception, and may inform what has been called the Japanese McGurk effect: fewer fusion responses by Japanese perceivers. The number of fusion responses is greater in the [i] vowel context, and this is particularly so for Japanese participants looking at a Japanese speaker. This suggests that there is a subtle interplay of phonetic, phonemic, and speaker variables in the McGurk effect. Phonetically, with the [a] vowel the phonetic conditions are more conducive to the perception of 'th'. Phonemically, 'th' is irrelevant for Japanese perceivers. And with regard to the speaker, Japanese speakers' articulation may support the perception of 'd' rather than 'th', even for English perceivers. Thus it is possible that the 'Japanese McGurk effect' (Sekiyama and Burnham, 2008; Sekiyama and Tohkura, 1991) is in part the product of the common use of the [a] vowel in McGurk effect studies (also see Shigeno, 2000).
The results from these studies show that early, low-level (and, with respect to speech, phonetic) processes determine what is perceived when incongruous auditory and visual information is presented. Over and above this, there appear to be later phonemic and even cultural effects on what is reported when incongruous auditory and visual information is presented. We contend that auditory-visual integration occurs in a common representational space with a motoric (Robert-Ribes et al., 1995, articulatory/gestural Mattingly, 1985, 1989; Studdert-Kennedy and Goodell, 1992), or phonetic (Dodd and Burnham, 1988; basis. Whatever the nature of this common metric, the important point is that auditory-visual integration occurs early and directly, devoid of any influence of learned associations or phonological prototypes. Certain visual and auditory information may be clearer with some speakers than others, so at this early stage there will be varying degrees of information available for integration. Following integration into phonetic categories, there can be late post-categorical effects of native phoneme inventory or even phonotactics, and cross-cultural factors. Thus a distinction can be made between the direct McGurk effect (phonetic-level auditory-visual integration), and the reported McGurk effect, additionally influenced by later phonological (post-categorical) effects.
Much has been learned about the world from studying the conditions under which a particular system breaks down, in disciplines as diverse as engineering, medical science and psychology. The McGurk effect entails such a breakdown, in this case in auditory-visual speech perception. In most perception studies, e.g., in illusions such as the Poggendorff illusion (Day and Dickinson, 1976), the breakdown is thought to occur due to going beyond the normal limits of a system or misapplying a usually useful perceptual strategy. However, it has been suggested that the (neural) processes involved in the McGurk effect and normal integration of auditory and visual speech information may be different (see Introduction, and Alsius et al. in this issue). If so, then it is possible that the discovery of auditory-visual fusions by McGurk and MacDonald in 1976 (see MacDonald in this issue) has led to erroneous conclusions regarding auditory-visual speech perception.
Such an extreme conclusion is not warranted, for at the very least the McGurk effect (along with other parallel movements in the Zeitgeist) has led to increased research attention to auditory-visual and more generally intermodal speech and other perception. In addition, the McGurk effect may just tell us something new. Consider as an example the results of Experiment 1 here. Following the above ideas of early direct fusion and later post-categorical effects, with congruent auditory-visual presentations of A[ ]V[ ] in the initial position there is integration, and then phonemic/phonotactic processes come into play such that English language perceivers report 'n' more often, and do so with slower processing time than do Thai language perceivers. However, when the incongruent A[m] replaces A[ ], it appears that the immediate engagement of phonemic post-categorical processes is blocked; perception of A[m]V[ ] appears to employ similar processes (equal reaction times) and results in similar responses (equivalent incidence of 'n' for English and for Thai language perceivers). Such a phenomenon may assist in the understanding of phonetic and phonemic processing: when there is conflict or ambiguity, the processes learned as a product of specific linguistic experience (phonemic processing) can be bypassed in order to access more basic levels of perception (see Burnham, Tyler and Horlyck, 2002). In the McGurk effect this bypass results in an illusion, but such a bypass may lead to more veridical perception in other circumstances, such as in second language learning, when there are naturally-occurring pairings of familiar mouth movements with unfamiliar speech sounds, or of unfamiliar mouth movements with familiar speech sounds.
studies reported here were previously reported in a chapter (Burnham, 1998) and a conference contribution (Burnham and Keane, 1997).