
Ventriloquist Illusion Produced With Virtual Acoustic Spatial Cues and Asynchronous Audiovisual Stimuli in Both Young and Older Individuals

In: Multisensory Research 32(8), 2019. DOI: 10.1163/22134808-20191430
Authors:
Marnix Stawicki, Department of Otorhinolaryngology / Head and Neck Surgery, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands; Graduate School of Medical Sciences, Research School of Behavioral and Cognitive Neurosciences (BCN), University of Groningen, Groningen, The Netherlands

Piotr Majdak, Acoustics Research Institute, Austrian Academy of Sciences, Vienna, Austria

Deniz Başkent, Department of Otorhinolaryngology / Head and Neck Surgery, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands; Graduate School of Medical Sciences, Research School of Behavioral and Cognitive Neurosciences (BCN), University of Groningen, Groningen, The Netherlands
Open Access

Abstract

The ventriloquist illusion, the change in the perceived location of an auditory stimulus when a synchronously presented but spatially discordant visual stimulus is added, has previously been shown in young healthy populations to be a robust paradigm that relies mainly on automatic processes. Here, we propose the ventriloquist illusion as a potential simple test to assess audiovisual (AV) integration in young and older individuals. We used a modified version of the illusion paradigm that was adaptive, nearly bias-free, relied on binaural stimulus reproduction using generic head-related transfer functions (HRTFs) instead of multiple loudspeakers, and was tested with synchronous and asynchronous presentation of AV stimuli (both tone and speech). The minimum audible angle (MAA), the smallest perceptible difference in angle between two sound sources, was compared with and without the visual stimuli in young and older adults with no or minimal sensory deficits. The illusion effect, measured by means of MAAs implemented with HRTFs, was observed with both the synchronous and the asynchronous visual stimulus, but only with the tone and not the speech stimulus. The patterns were similar between young and older individuals, indicating the versatility of the modified ventriloquist illusion paradigm.


1. Introduction

In daily life, perception often relies on the integration of signals from multiple senses (see Note 1) (Beauchamp, 2005; de Gelder and Bertelson, 2003; Ernst and Bülthoff, 2004; Lovelace et al., 2003; Stein and Meredith, 1993; for a more recent review, Chen and Vroomen, 2013). An example of such multisensory integration of auditory and visual stimuli is the ventriloquist illusion, the mislocalization of the source of an auditory stimulus (such as the actual talker) when it is presented with a temporally synchronous but spatially discordant visual stimulus (such as the moving mouth of a puppet; for a review, see Vroomen and de Gelder, 2004). The ventriloquism effect has often been quantified using the spatial ventriloquist paradigm, in which the location of the perceived event for synchronously presented but spatially separated auditory and visual stimuli is reported and compared to the location of the auditory stimulus presented alone (Bermant and Welch, 1976; Bertelson and Aschersleben, 1998). In this paradigm, the perceived auditory target position is pulled towards the visual stimulus (Pick et al., 1969), and a larger lateral displacement is required for a correct location judgment. This pulling yields an increase in the measured location threshold compared to the auditory-only condition. Interestingly, this procedure is similar to the estimation of the minimum audible angle (MAA), the discrimination threshold for two spatially discordant sound events, in other words, the smallest angle that a listener can distinguish between two spatially separated sound sources (Perrott and Saberi, 1990). Thus, when quantifying the amount of audiovisual (AV) interaction in terms of the size of the ventriloquism effect, we can essentially make use of the magnitude of the MAA.

Previous studies have indicated that the ventriloquist illusion reflects a robust and near-optimal bimodal sensory integration (Alais and Burr, 2004), as the illusion could be induced even when individuals were instructed and trained to ignore (Vroomen et al., 1998) or to not attend to the visual stimulus (Vroomen et al., 2001). Combined, these observations imply that the ventriloquist illusion relies mainly on automatic sensory processes (Bertelson et al., 2000).

Due to this robustness and its automatic nature, as well as the relatively simple task involved, the ventriloquist illusion could be a useful tool for characterizing AV integration in a variety of populations, for example, in identifying effects of aging as well as of age-related hearing loss. AV integration and aging have long been of interest for research and clinical purposes, but the relevant studies have at times produced mixed results. With aging, as a result of age-related sensory and cognitive changes, perception of speech becomes challenging, especially in noisy situations or when other forms of distortion (e.g., reverberation) are involved (e.g., Bergman et al., 1976). Visual cues can help improve speech perception (Hoffman et al., 2012; Pichora-Fuller et al., 1995). It was hypothesized that older individuals may rely more on visual speech cues and show an enhanced AV integration to compensate for age-related sensory and cognitive changes (Cienkowski and Carney, 2002; de Boer-Schellekens and Vroomen, 2014; Freiherr et al., 2013; Laurienti et al., 2006; Tye-Murray et al., 2007). However, while some studies supported such superior AV integration in older individuals (e.g., Başkent and Bazo, 2011; Helfer, 1998; Laurienti et al., 2006), others showed a smaller AV benefit in older individuals (Musacchia et al., 2009; Tye-Murray et al., 2008, 2010). Another argument for potentially increased AV integration with aging came from the difficulty older individuals have in inhibiting information from one sensory modality during a task conducted in another modality, resulting in an inherently stronger multisensory integration (e.g., Couth et al., 2018). Lastly, an increase in temporal integration was suggested in older individuals, likely caused by factors such as general cognitive slowing or reduced early sensory memories (Fogerty et al., 2016; Fozard, 1990; Salthouse, 1996). While it is not yet fully understood how the age-related changes in temporal integration are mediated by the auditory and visual modalities (Saija et al., 2019), if the stimulus contribution from each modality varies with age, producing a fused AV percept may become challenging. Opposing this view, increased temporal integration may provide a longer time window within which the auditory and visual inputs are fused into one AV object. Yet, support from the literature has, again, been mixed for this idea, with some studies showing evidence for a longer temporal AV integration window in older individuals and others showing no such evidence (Alm and Behne, 2013; Başkent and Bazo, 2011; Diederich et al., 2008; Hay-McCutcheon, 2009).

A number of factors in these studies may have complicated the interpretation of the findings. In some studies, the baseline auditory-only performance differed between the young and older groups: baseline speech intelligibility was lower and baseline response times were longer in the older group. Hence, tasks relying on speech intelligibility or lipreading might have been affected by age-related sensory and cognitive changes (Pichora-Fuller et al., 1995; Saija et al., 2014; Sommers et al., 2005), complicating the investigation of an age-only effect on AV integration. Differing baselines can prevent a fair between-subject comparison of the relative improvement in performance from added multisensory cues because of the so-called inverse effectiveness, i.e., the stronger effect of adding stimuli conveyed via other senses when the effectiveness of the uni-sensory stimuli is low (Couth et al., 2018; Holmes, 2009; Laurienti et al., 2006; Stein and Meredith, 1993). A simpler task that produces similar auditory-only performance across subject groups would therefore be advantageous for assessing the relative changes in performance that result from AV integration of added visual stimuli.

The ventriloquist illusion does not necessarily rely on speech understanding (in fact, it can be conducted with much simpler auditory stimuli) and, being a mostly automatic process, it may minimize the potential confounds discussed above and provide a useful tool to explore age effects on AV integration. While earlier studies implied that auditory-only MAAs can be affected by aging (Strouse et al., 1998), more recently, Otte et al. (2013) showed that MAAs in azimuth were relatively insensitive to age. An age-insensitive measure, such as the MAA, would hence be expected to produce similar uni-sensory baseline performance in the young and older groups, minimizing the confound of inverse effectiveness.

Hence, in this study, as a first step, we explored the applicability and robustness of the illusion. More specifically, we used a modified version of the ventriloquist illusion measure that (1) relied on MAAs; (2) reduced response bias by measuring left/right judgments of the AV event in an interleaved adaptive staircase procedure (Bertelson and Aschersleben, 1998); (3) was adapted from the free-field procedure (Bertelson and Aschersleben, 1998) to binaural stimulus reproduction via headphones (Wightman and Kistler, 1989), using generic and easily available head-related transfer functions (HRTFs; Shaw, 1974); (4) used both non-speech (tones) and speech (words) stimuli, as these differ in stimulus complexity and in the related perceptual mechanisms, likely inducing differences in the AV integration processes (Lalonde and Holt, 2016; Tuomainen et al., 2005); (5) was tested with both young and older (with nearly normal hearing) individuals, with stimuli adjusted to minimize further potential effects of age-related hearing loss; and (6) used both synchronous and asynchronous A and V stimuli, as temporal synchronicity may modulate AV integration differently for younger and older individuals. We expected that, if the ventriloquist illusion is robust, our modified paradigm, combined with matched baseline auditory-only performance, would provide a simple-to-implement and easy-to-use tool for systematically investigating AV integration in young and older populations.

2. Materials and Methods

2.1. Participants

Two groups of native Dutch speakers, young and older, participated in this study. The inclusion criteria were normal or corrected-to-normal vision, and normal or near-normal hearing. Vision was tested by identifying the visual ‘catch stimulus’ in a 3 × 3 grid in which the other eight stimuli were in the ‘normal’ condition. The visual stimuli used for this purpose were the same as those used during data collection. Participants were seated at a viewing distance of 1.5 m and had to complete this task correctly three times before being allowed to participate. There was no time limit during this vision test, and the visual stimuli were played repeatedly until the odd stimulus was identified.

The inclusion criterion for normal hearing was having hearing thresholds lower than or equal to 25 dB HL at the audiometric test frequencies of 0.25, 0.5, 1, 2, and 4 kHz for both ears, measured with standard clinical audiometry procedures. For the young group, 21 individuals (4 males), all below the age of 30 years (23.4 ± 3.2 yr), participated in the study. For the older group, 64 individuals with self-reported normal hearing were screened. Of these, 47 did not meet the inclusion criterion for normal hearing and were therefore excluded from the study before testing. Of the remaining 17 older participants, two were not able to do the task and were excluded during testing. After the exclusions, the older group consisted of 15 participants (3 males), all above the age of 60 years (64.5 ± 2.7 yr).

Figure 1.

Hearing thresholds shown for the young (Y) and older (O) groups, averaged over the participants and the two ears.


Figure 1 shows the average hearing thresholds for the two groups (Y = young; O = older). Despite the careful hearing screening, there was a small difference in the thresholds between the two groups (which we have also observed in our previous studies on age effects, e.g., Saija et al., 2014, 2019). As a precaution, to explore potential audibility effects, we investigated this difference. Firstly, we focused on the audiometric test frequency of 500 Hz, the frequency of the pure-tone stimulus used in the study. At this test frequency, the average hearing thresholds were 2.3 ± 3.3 dB and 7.2 ± 6.2 dB for the young and older groups, respectively, which did not differ significantly (p = 0.072, Mann–Whitney U test). Secondly, we focused on the audiometric test frequencies between 0.25 and 4 kHz, as this range corresponded to the bandwidth of the lowpass-filtered speech stimuli used in the study. At these test frequencies, the average hearing thresholds were 2.3 ± 3.0 dB and 10.7 ± 4.7 dB for the young and older groups, respectively. While these hearing thresholds differed significantly between the young and older individuals (t = 5.685, p < 0.001, two-tailed t-test with unequal variances), none of the older participants showed hearing threshold deficits larger than 20 dB, so all older participants had hearing within normal limits. The average interaural threshold differences were almost identical: 4.5 ± 1.3 dB and 4.6 ± 1.7 dB for the young and older groups, respectively.

The Medical Ethical Committee of the University Medical Center Groningen approved the study protocol. Before the screening, the participants received written and oral information about the study and provided written informed consent. They were reimbursed for travel expenses and participation time according to departmental policy.

2.2. Auditory Stimuli

Two types of auditory stimuli were used, pure tone and speech. The pure-tone stimulus consisted of four 200-ms long 500-Hz tone bursts with an interval of 1 s between the individual bursts. Each tone burst had 5-ms on/off ramps shaped by a Hann window. The pure-tone stimulus was presented at a sensation level (SL) of 60 dB re the individual hearing threshold measured at 500 Hz and averaged over both ears. By presenting the stimuli at the same SL across participants, we aimed to account for the listener-specific hearing thresholds, which differed slightly across participants (as described in the previous section). The speech stimulus consisted of digital recordings of meaningful consonant–vowel–consonant (CVC) Dutch words, spoken by a female speaker, taken from the corpus of the Nederlandse Vereniging voor Audiologie (NVA; Bosman and Smoorenburg, 1995). We chose this corpus as it is also used as a clinical diagnostic tool with hearing-impaired populations in Dutch clinics. The corpus has 180 unique words ordered into 15 unique lists of 12 words. In clinical assessments, the number of lists is usually increased to 45 by re-ordering the words within a list, and such an extended corpus was also used in our study. The lists are balanced against each other in phonemic distribution. The duration of the words ranges roughly between 700 and 1000 ms. The speech materials used in our study were lowpass-filtered (3-kHz cutoff frequency, 60-dB/octave slope) to further ensure similar audibility between the young and older groups, as hearing thresholds at audiometric test frequencies above 4 kHz were not part of the inclusion criteria. Similar to the tones, the speech stimuli were presented at an individually adjusted SL of 60 dB re the average individual hearing thresholds at 0.5, 1, and 2 kHz for both ears.
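For illustration, a minimal sketch of the tone-burst construction in Python/NumPy follows (our own reconstruction, not the authors' Matlab code); the per-listener scaling to 60 dB SL would be applied at presentation and is not shown.

```python
import numpy as np

FS = 44100  # sampling rate (Hz) used in the study

def tone_burst_train(f0=500.0, burst_dur=0.2, ramp_dur=0.005,
                     n_bursts=4, gap_dur=1.0, fs=FS):
    """Four 200-ms, 500-Hz tone bursts separated by 1-s intervals,
    each with 5-ms Hann-shaped on/off ramps."""
    t = np.arange(int(round(burst_dur * fs))) / fs
    burst = np.sin(2 * np.pi * f0 * t)
    n_ramp = int(round(ramp_dur * fs))
    ramp = 0.5 * (1 - np.cos(np.pi * np.arange(n_ramp) / n_ramp))
    burst[:n_ramp] *= ramp          # fade in
    burst[-n_ramp:] *= ramp[::-1]   # fade out
    gap = np.zeros(int(round(gap_dur * fs)))
    parts = ([burst, gap] * n_bursts)[:-1]  # drop the trailing gap
    return np.concatenate(parts)

stimulus = tone_burst_train()
```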

2.3. Binaural Stimulus Reproduction

Acoustic targets were created by filtering the auditory stimuli (in the case of the speech stimulus, after the lowpass filtering) with spatially up-sampled HRTFs of the KEMAR manikin (Gardner and Martin, 1995). Listener-specific HRTFs were not required because the spatial direction of the virtual stimuli varied only along the horizontal plane, and non-individualized HRTFs are thought to provide sufficient cues for sound localization in horizontal planes (Wenzel et al., 1993).
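In essence, the binaural rendering amounts to convolving the mono auditory stimulus with the head-related impulse response (HRIR) of each ear for the desired direction. A minimal sketch, assuming the HRIR pair has already been selected from the up-sampled KEMAR set:

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(mono, hrir_left, hrir_right):
    """Render a mono stimulus at the virtual direction encoded by a
    left/right pair of head-related impulse responses."""
    left = fftconvolve(mono, hrir_left)
    right = fftconvolve(mono, hrir_right)
    return np.stack([left, right], axis=-1)  # (samples, 2) for headphones
```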

The original HRTFs from the KEMAR manikin were available (Note 2) at a lateral resolution of 5° (see Fig. 2, top panel). This lateral resolution is coarser than the MAAs found in normal-hearing listeners (approximately 1°; Perrott and Saberi, 1990). Thus, the original HRTFs were not sufficient for our study. A super-resolution HRTF set was calculated by directionally up-sampling the original HRTF set to a lateral resolution of 0.5°. The most salient cues for the lateral direction of a sound are the broadband interaural time and level differences (ITDs and ILDs, respectively; Macpherson and Middlebrooks, 2002). Correspondingly, for each ear, the broadband timing and the amplitude spectra of the original HRTF set were directionally interpolated. More specifically, for each ear’s HRTF set, the broadband timing was removed, the amplitude spectra were interpolated, and the interpolated timing information was applied. The broadband timing was removed by replacing the HRTF’s phase spectrum with the minimum-phase spectrum (Oppenheim et al., 1999) corresponding to the HRTF’s amplitude spectrum. For the interpolation of the amplitude spectra, the complex spectra of the minimum-phase HRTFs for two adjacent available directions were averaged according to a weighting that corresponded to the interpolated target direction.
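The two steps, minimum-phase conversion and direction-weighted averaging, can be sketched as follows (a simplified illustration under our own assumptions; the study's actual processing followed the cited references). The minimum-phase counterpart is obtained via the real cepstrum, and averaging the complex spectra of two minimum-phase HRTFs is equivalent to averaging their impulse responses:

```python
import numpy as np

def minimum_phase(hrir, nfft=None):
    """Minimum-phase version of an HRIR via the real cepstrum
    (cf. Oppenheim et al., 1999)."""
    n = len(hrir)
    nfft = nfft or 4 * n
    mag = np.maximum(np.abs(np.fft.fft(hrir, nfft)), 1e-8)  # avoid log(0)
    cep = np.fft.ifft(np.log(mag)).real   # real cepstrum of the magnitude
    w = np.zeros(nfft)                    # causal folding window
    w[0] = 1.0
    w[1:(nfft + 1) // 2] = 2.0
    if nfft % 2 == 0:
        w[nfft // 2] = 1.0
    h_min = np.fft.ifft(np.exp(np.fft.fft(w * cep))).real
    return h_min[:n]

def interpolate_hrirs(h_a, h_b, weight):
    """Directional interpolation between two adjacent minimum-phase
    HRIRs; weight = 0 yields h_a, weight = 1 yields h_b."""
    return (1 - weight) * minimum_phase(h_a) + weight * minimum_phase(h_b)
```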

Figure 2.

Left-ear head-related transfer functions (HRTFs) shown in the time domain (i.e., head-related impulse responses) as a function of the azimuth angle. Top: Original HRTFs (resolution of 5°; Gardner and Martin, 1995). Bottom: Interpolated HRTFs (super resolution of 0.5°). Color: Amplitude of the impulse responses shown in dB.


For the interpolation of the timing, a continuous-direction model of the time-of-arrival (TOA) was applied (Ziegelwanger and Majdak, 2014). The TOA is the broadband delay arising from the propagation path from the sound source to the listener’s ear. For a given direction of a sound, the interaural difference of the TOAs corresponds to the ITD. The TOA model parameters describe the listener’s geometry (head and ears) and configure a continuous-direction function of the broadband TOA. We used this function to calculate TOAs for directions in steps of 0.5°. To this end, for each ear, the model was fit to the HRTF set as described by Ziegelwanger and Majdak (2014), using the implementation from the Auditory Modeling Toolbox (Søndergaard and Majdak, 2013). Then, each minimum-phase HRTF was temporally up-sampled by a factor of 64, circularly shifted by the TOA obtained from the continuous-direction TOA model for the target direction, and then down-sampled to the sampling rate of 44.1 kHz (Fig. 2, lower panel). Note that the temporal oversampling was required to achieve an interaural resolution of 0.35 μs. A brief quality check (see Fig. 2) revealed (1) main peaks at the same temporal positions as those in the original HRTFs, and (2) similar temporal modulations in both the original and the super-resolution HRTFs. Note that, as a result of the conversion to minimum-phase systems, the slowly rising energy before the main peak present in the original HRTFs is not present in the super-resolution HRTFs. In summary, the final HRTF set (Note 3) contained HRTFs with the interpolated amplitude and broadband timing information, associated with the ILD and broadband ITD, respectively, at a lateral resolution of 0.5°.
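The timing step can be sketched as follows (our illustration; the actual TOA values were produced by the continuous-direction model of Ziegelwanger and Majdak, 2014, fitted with the Auditory Modeling Toolbox, and are assumed as given here):

```python
import numpy as np
from scipy.signal import resample_poly

def apply_toa(h_min, toa_s, fs=44100, factor=64):
    """Impose a broadband time-of-arrival (toa_s, in seconds) on a
    minimum-phase HRIR: up-sample by 64, circularly shift, and
    down-sample back to fs. The fine sample period of
    1 / (64 * 44100) s, about 0.35 us, sets the interaural resolution."""
    up = resample_poly(h_min, factor, 1)      # temporal up-sampling
    shift = int(round(toa_s * fs * factor))   # TOA in fine samples
    return resample_poly(np.roll(up, shift), 1, factor)
```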

2.4. Visual Stimulus

The visual stimulus was the same geometric shape for both the tone and speech stimuli. This shape was modulated according to the intensity of the auditory signal, which differed between the tone and speech stimuli. We opted to use the same simple visual stimulus for both stimulus types, instead of using lipreading cues for speech, for several reasons: (1) to ensure consistency between the two stimulus types; (2) to ensure simplicity, for example for potential clinical applications, where a generic visual stimulus would be easier to implement; and (3) to minimize any potential interference from the additional cognitive processing that lipreading of speech may require. The generic visual stimulus consisted of a yellow circle on a black background presented in the center of the screen (Fig. 3). The diameter of the circle was modulated in proportion to a 16-ms moving average of the root-mean-square (RMS) amplitude of the auditory stimuli, with a minimum size of 10 mm and a maximum size of 15 mm. Further, a black square was shown on top of the yellow circle, in its center. The edge length of the square was proportional to the RMS amplitude of the auditory signal, with a minimum size of 0 mm and a maximum size of 3 mm. The size of the objects followed the sound amplitude immediately, limited only by the update rate of the computer monitor. In the catch trials (explained later), the square was rotated by 45°. In order to focus attention on the screen, visual rendering started 1 s prior to the auditory stimulus and showed the yellow circle at its minimum size until the auditory stimulus started.
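A sketch of the amplitude-to-diameter mapping (our reconstruction; the exact normalization of the RMS onto the 10-15-mm range is not specified in the text, so scaling by the stimulus maximum is an assumption):

```python
import numpy as np

def circle_diameters(audio, fs=44100, win_s=0.016,
                     d_min=10.0, d_max=15.0):
    """16-ms moving-average RMS of the auditory signal mapped onto the
    circle diameter in mm; in practice the values would be read out at
    the monitor's 60-Hz update rate."""
    win = int(round(win_s * fs))
    kernel = np.ones(win) / win
    rms = np.sqrt(np.convolve(audio ** 2, kernel, mode='same'))
    rms = rms / (rms.max() + 1e-12)   # assumed normalization to 0..1
    return d_min + (d_max - d_min) * rms
```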

Figure 3.

Snapshots of the visual stimulus (top row) shown with the corresponding auditory speech stimulus (CVC word ‘poes’; bottom row). Top: Visual stimuli in normal and catch trials, alternating from panel to panel, with each panel showing a snapshot taken at a different point in time. Bottom: The auditory stimulus shown as a temporal waveform. The red contour line shows the slow-moving envelope of the auditory signal over time. Black vertical lines mark the points in time of the snapshots. Note the correspondence between the size of the square in the visual stimulus and the envelope amplitude of the auditory stimulus at the specific times marked by the vertical black lines.


2.5. Apparatus

The experiment was conducted in an anechoic chamber. Participants were seated in a chair located at a distance of 1 m from the computer screen. The chair was specifically designed for this study: an adjustable neck rest limited head movements and ensured that the participant faced the screen that displayed the visual stimulus.

The auditory stimuli were digitally created at a sampling rate of 44.1 kHz using Matlab R2009b (Mathworks Inc., Natick, MA, USA) on a Mac computer (Apple Inc.). They were routed via the digital sound interface Audiofire 4 (Echo Digital Audio Corporation, Santa Barbara, CA, USA) to the digital-to-analog converter DA10 (Lavry Engineering Inc., Kingston, WA, USA) and then presented via HD 600 headphones (Sennheiser, Wedemark, Germany). The presentation levels of the auditory stimuli were calibrated with a KEMAR manikin (GRAS, Holte, Denmark) and the sound level meter Type 2610 (Brüel & Kjær, Nærum, Denmark). The linearity of the system was verified for SPLs between 40 and 90 dB, i.e., the SPL range of our auditory stimuli. The visual stimulus was presented on a computer screen with an update rate of 60 frames per second. For the presentation of the auditory and visual stimuli, PsychToolBox-3 (Kleiner et al., 2007) was used. This software is designed in particular to allow synchronized auditory and visual presentation, with an intermodal timing accuracy of approximately 2 ms. We confirmed this with multiple measurements. The audio stimulus delay was quantified by time-stamping the command to present a signal and time-stamping the incoming audio on a microphone mounted on the KEMAR manikin; the measured delay was less than 1 ms. The video stimulus delay was quantified by internal diagnostics, by time-stamping the command to display a target and obtaining the timestamp of the completed visual rendering; this delay was also less than 1 ms. Even when multiple visualization and audio commands were issued during a testing sequence, a delay of more than 1 ms was never measured. Thus, an intermodal timing accuracy of better than 2 ms was obtained. We did not conduct controls for other potential delays. We used PsychToolBox-3 to create the intermodal lags that were part of the experimental design.
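The original delay checks relied on PsychToolBox-3 timestamps. As a generic alternative, the audio-chain delay of a playback system can be estimated by playing a probe signal while recording it on a measurement microphone and locating the cross-correlation peak; the following sketch uses the sounddevice package (our own illustration, not the authors' setup; note that this estimate also includes the acoustic propagation path):

```python
import numpy as np
import sounddevice as sd
from scipy.signal import correlate

def estimate_audio_delay(probe, fs=44100):
    """Play `probe` and record it simultaneously; return the lag (s) of
    the cross-correlation peak between the recording and the probe."""
    rec = sd.playrec(probe, samplerate=fs, channels=1)
    sd.wait()  # block until playback/recording has finished
    xc = correlate(rec[:, 0], probe, mode='full')
    lag = np.argmax(np.abs(xc)) - (len(probe) - 1)
    return lag / fs
```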

2.6. Procedure

All participants were naïve to the experimental protocol. All tests were administered and evaluated by the first author.

MAAs in the horizontal plane were measured using a lateralization task in an adaptive 3-down-1-up staircase procedure (Levitt, 1971), with two simultaneously interleaved runs, one starting at an extreme lateral position on the left and the other on the right, following the procedures of Bertelson and Aschersleben (1998). In each trial, a target auditory stimulus was presented, with or without a visual stimulus, depending on the experimental condition. Participants were asked to make a left/right judgment according to where they lateralized the target source, by saying ‘links’ or ‘rechts’ (Dutch for ‘left’ and ‘right’, respectively). Each run started with an auditory target virtually positioned at a lateral angle of 10°. After three consecutive correct responses, the angle decreased by the current step size; after an incorrect response, the angle increased. The transition from a decreasing to an increasing angle, and vice versa, defined a reversal. The initial step size was 4°, and with each reversal the step size was halved until the minimum of 0.5°, the resolution of our modified HRTFs, was reached. Trials from the two interleaved runs were chosen randomly, such that the participant was not aware of the side actually being tested. Both runs continued until eight reversals were obtained for each. For both runs, the angles measured at the last four reversals were averaged, and the difference between the two averages produced the MAA. Depending on the testing conditions and participant performance, an MAA was acquired in 5 to 15 minutes, after which a pause was given.
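The interleaved staircase logic can be summarized in the following sketch (our illustration of the procedure described above, not the authors' implementation; report(angle) stands for the listener's judgment, +1 for 'rechts' and -1 for 'links', at a signed lateral angle in degrees):

```python
import random

def measure_maa(report, start=10.0, step0=4.0, min_step=0.5,
                n_rev=8, n_avg=4):
    """Two interleaved 3-down-1-up runs (Levitt, 1971), one starting on
    each side, following Bertelson and Aschersleben (1998). Returns the
    MAA as the difference between the means of the last four reversal
    angles of the right- and left-starting runs."""
    runs = [dict(side=s, angle=s * start, step=step0, streak=0,
                 direction=0, reversals=[]) for s in (+1, -1)]
    while any(len(r['reversals']) < n_rev for r in runs):
        # Pick the next trial at random so the tested side is unpredictable
        r = random.choice([r for r in runs if len(r['reversals']) < n_rev])
        if report(r['angle']) == r['side']:   # response consistent with run side
            r['streak'] += 1
            move = -r['side'] if r['streak'] == 3 else 0  # 3-down: toward midline
            if r['streak'] == 3:
                r['streak'] = 0
        else:                                 # 1-up rule
            r['streak'] = 0
            move = r['side']                  # away from midline
        if move:
            if r['direction'] and move != r['direction']:
                r['reversals'].append(r['angle'])   # a reversal occurred
                r['step'] = max(r['step'] / 2, min_step)
            r['direction'] = move
            r['angle'] += move * r['step']
    avg = lambda x: sum(x) / len(x)
    right, left = runs[0], runs[1]
    return avg(right['reversals'][-n_avg:]) - avg(left['reversals'][-n_avg:])
```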

2.7. Testing Conditions

The MAA was measured for each auditory stimulus (pure tone, speech) in combination with three AV conditions, namely, auditory only (NoV), with synchronous visual stimulus (SyncV), and with asynchronous visual stimulus (AsyncV).

Participants were asked to look straight ahead to the monitor wh