Assessing audiovisual saliency and visual-information content in the articulation of consonants and vowels on audiovisual temporal perception

In: Seeing and Perceiving
  • 1 Cognitive Systems Research Institute (CSRI), GR
  • 2 Department of Experimental Psychology, Oxford University, GB

Research has revealed different temporal integration windows between and within different speech-tokens. The limited set of speech-tokens tested to date has not allowed for a proper evaluation of whether such differences are task- or stimulus-driven. We conducted a series of experiments to investigate how the physical differences associated with speech articulation affect the temporal aspects of audiovisual speech perception. Videos of consonants and vowels uttered by three speakers were presented. Participants made temporal order judgments (TOJs) regarding which speech-stream (auditory or visual) had been presented first. The sensitivity of participants’ TOJs and the point of subjective simultaneity (PSS) were analyzed as a function of the place of articulation, manner of articulation, and voicing for consonants, and the height/backness of the tongue and lip-roundedness for vowels. The results demonstrated that, in the case of place of articulation/roundedness, participants were more sensitive to the temporal order of highly-salient speech-signals, with smaller visual-leads at the PSS. This was not the case when the manner of articulation/height was evaluated. These findings suggest that the visual-speech signal provides substantial cues to the auditory-signal that modulate the relative processing times required for the perception of the speech-stream. A subsequent experiment explored how the presentation of different sources of visual-information modulated these findings. Videos of three consonants were presented under natural and point-light (PL) viewing conditions revealing parts of, or the whole, face. Preliminary analysis revealed no differences in TOJ accuracy across viewing conditions. However, the PSS data revealed significant differences between viewing conditions depending on the speech-token uttered (e.g., larger visual-leads for PL-lip/teeth/tongue-only views).
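For readers unfamiliar with the two dependent measures, the sketch below illustrates one common way of deriving a PSS and a sensitivity estimate from TOJ data. It is not the analysis reported in this study: the abstract does not state the fitting procedure. The sketch assumes a cumulative-Gaussian psychometric function fitted by a simple probit-linear regression, expresses sensitivity as a just noticeable difference (JND, the distance between the 50% and 75% points), and codes stimulus onset asynchrony (SOA) as positive when the visual stream leads. The function name `fit_toj`, the clipping parameter `eps`, and all data values are hypothetical.

```python
from statistics import NormalDist


def fit_toj(soas_ms, p_vision_first, eps=0.01):
    """Estimate (PSS, JND) in ms from TOJ response proportions.

    Assumption: responses follow a cumulative Gaussian over SOA, where
    SOA > 0 means the visual stream led the auditory stream.
    """
    norm = NormalDist()
    # Clip proportions away from 0 and 1 so the probit transform stays finite.
    z = [norm.inv_cdf(min(max(p, eps), 1 - eps)) for p in p_vision_first]

    # Ordinary least-squares line through (SOA, z): z = intercept + slope * SOA.
    n = len(soas_ms)
    mean_x = sum(soas_ms) / n
    mean_z = sum(z) / n
    sxx = sum((x - mean_x) ** 2 for x in soas_ms)
    sxz = sum((x - mean_x) * (zi - mean_z) for x, zi in zip(soas_ms, z))
    slope = sxz / sxx
    intercept = mean_z - slope * mean_x

    pss = -intercept / slope           # SOA at which p("vision first") = 0.5
    sigma = 1.0 / slope                # spread of the fitted Gaussian
    jnd = norm.inv_cdf(0.75) * sigma   # distance from the 50% to the 75% point
    return pss, jnd


# Synthetic, noise-free data generated from a known curve (not study data).
soas = [-120, -80, -40, 0, 40, 80, 120, 160]
true_curve = NormalDist(mu=40, sigma=80)
proportions = [true_curve.cdf(s) for s in soas]

pss, jnd = fit_toj(soas, proportions)
print(f"PSS = {pss:.1f} ms, JND = {jnd:.1f} ms")
```

Under this convention, a positive PSS means the visual stream had to lead for the two streams to be perceived as simultaneous, and a smaller JND means higher sensitivity to temporal order. With real data, a maximum-likelihood fit on trial counts would usually be preferable to regression on clipped proportions.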

