
Eyes on Emotion: Dynamic Gaze Allocation During Emotion Perception From Speech-Like Stimuli

In: Multisensory Research 34(1), 2021; DOI: 10.1163/22134808-bja10029
Authors:
Minke J. de Boer: Research School of Behavioural and Cognitive Neurosciences (BCN), University Medical Center Groningen, University of Groningen, Groningen, The Netherlands; Department of Otorhinolaryngology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands; Laboratory for Experimental Ophthalmology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands

Deniz Başkent: Research School of Behavioural and Cognitive Neurosciences (BCN), University Medical Center Groningen, University of Groningen, Groningen, The Netherlands; Department of Otorhinolaryngology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands

Frans W. Cornelissen: Research School of Behavioural and Cognitive Neurosciences (BCN), University Medical Center Groningen, University of Groningen, Groningen, The Netherlands; Laboratory for Experimental Ophthalmology, University Medical Center Groningen, University of Groningen, Groningen, The Netherlands

Open Access

Abstract

The majority of emotional expressions used in daily communication are multimodal and dynamic in nature. Consequently, one would expect that human observers utilize specific perceptual strategies to process emotions and to handle the multimodal and dynamic nature of emotions. However, our present knowledge of these strategies is scarce, primarily because most studies on emotion perception have not fully covered this variation, and instead used static and/or unimodal stimuli with few emotion categories. To resolve this knowledge gap, the present study examined how dynamic emotional auditory and visual information is integrated into a unified percept. Since there is a broad spectrum of possible forms of integration, both eye movements and accuracy of emotion identification were evaluated while observers performed an emotion identification task in one of three conditions: audio-only, visual-only video, or audiovisual video. In terms of adaptations of perceptual strategies, the eye movement results showed a shift in fixations toward the eyes and away from the nose and mouth when audio was added. Notably, in terms of task performance, audio-only performance was generally significantly worse than video-only and audiovisual performance, whereas performance in the latter two conditions was often not different. These results suggest that individuals flexibly and momentarily adapt their perceptual strategies to changes in the available information for emotion recognition, and that these changes can be comprehensively quantified with eye tracking.


1. Introduction

Successful social interactions involve not only an understanding of the verbal content of one’s conversational partner, but also of their emotional expressions. In everyday life, the majority of social interactions take place as face-to-face communication, and emotion perception is thus multimodal and dynamic in nature. Historically, however, emotion perception has been investigated in a single perceptual modality, with static facial emotional expressions being studied most commonly. These unimodal studies have shown that observers can discriminate between broad emotion categories from visual cues, such as activations of specific facial muscle configurations (Bassili, 1979; de Gelder et al., 1997; Ekman and Friesen, 1971) and specific body movements and postures (de Gelder, 2009; Jessen and Kotz, 2013), as well as from auditory cues, such as prosodic speech information (Banse and Scherer, 1996; Juslin and Laukka, 2003).

The vast literature on multisensory perception in general indicates that observers integrate information in an optimal manner, by weighting the unimodal information based on its reliability prior to linearly combining the weighted unimodal signals. Because of this, the multimodal benefit, i.e., the strength of the multimodal integration, is largest when the reliability of the unimodal cues is similar and each sense provides unique information. Conversely, when one sense is much more reliable, such as hearing for time interval estimation, this sense will receive a higher weight and the multisensory signal can be roughly equal to the most reliable unisensory signal (see, e.g., Alais and Burr, 2004; Ernst and Banks, 2002; Ernst and Bülthoff, 2004). However, while it is well established that observers integrate optimally, it is unknown whether they also employ specific perceptual strategies when integrating. For example, how different is the visual exploration of an object when the observer is allowed to touch the object compared to when touching is not allowed? Here, we investigated such multisensory perceptual strategies, and the manner in which they adapt to the presence of multiple modalities, by measuring observers’ viewing behavior in the context of emotion perception.
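For reference, the reliability-weighted combination referred to here is commonly formalized as maximum-likelihood cue integration (e.g., Ernst and Banks, 2002; Alais and Burr, 2004). In a standard formulation, which is not specific to the present study, the bimodal estimate ŝ_AV and its variance are

ŝ_AV = w_A · ŝ_A + w_V · ŝ_V, with w_A = (1/σ_A²) / (1/σ_A² + 1/σ_V²) and w_V = 1 − w_A,

σ²_AV = (σ_A² · σ_V²) / (σ_A² + σ_V²) ≤ min(σ_A², σ_V²),

so the variance reduction (the multimodal benefit) is largest when the unimodal reliabilities are similar, and the bimodal estimate approaches the more reliable unimodal estimate when one variance is much smaller than the other.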

The continual adjustments of weighting unimodal information for multisensory perception make audiovisual integration a flexible process. Consequently, it can be expected that the viewing behavior observers employ also reflects this flexibility. It has long been known that people naturally tend to foveate the regions of an image that are of interest (e.g., Yarbus, 1967). What is of interest in an image is defined by visual saliency (Itti and Koch, 2000), but also by the nature of the perceptual task (see, e.g., Hayhoe and Ballard, 2005). Võ and colleagues (2012) proposed that gaze allocation is a functional, information-seeking process. They performed an eye-tracking study in which participants were asked to rate the likeability of videos featuring pedestrians engaged in interviews. When the video was shown with the corresponding audio, participants mostly looked toward the eyes, nose, and mouth. When the audio signal was removed, there was a decrease in fixations to the face in general, and to the mouth in particular. Thus, despite the fact that the visual signal remained unchanged, the viewing behavior changed, indicating that viewing behavior is not only directed by visual information but also by information in other modalities. These findings led the authors to conclude that gaze is allocated on the basis of information-seeking control processes. On the other hand, one could instead argue that gaze was still mostly guided by saliency. Audiovisual synchrony likely increases the saliency in certain image regions, which are then fixated more often. If the audiovisual synchrony disappears when the video is muted, the saliency of the mouth decreases and it is looked at less. In contrast, Lansing and McConkie (2003), using video recordings of everyday sentences showing only the face of the speaker, found an increase in fixations on the mouth when the video was presented without sound. The participants’ task was quite different from that in Võ et al. (2012), however, as participants were required to repeat the spoken sentence. In that study (Lansing and McConkie, 2003), the mouth provides the majority of the information relevant for the task and gaze is thus directed toward it, and even more so when the task is made more difficult by removing the audio. Hence, while both these studies (Lansing and McConkie, 2003; Võ et al., 2012) used similar stimuli, the findings are drastically different, which would indicate that gaze allocation is indeed a flexible information-seeking process.

While speech sounds are mainly produced with mouth movements, many facial features additionally contribute to emotional expressions. Emotion perception from speech may thus be more complex than speech perception in terms of predicting gaze allocation. Naturally, in face-to-face communication, humans do not observe an isolated face, but a dynamic whole body that contributes with gestures and posture that may be relevant for recognizing emotions. It has been shown that observers can, under some conditions, recognize emotions from bodily expressions equally well as they can from facial expressions (see de Gelder, 2009 for a review). Additionally, studies have shown that emotional prosody (such as pitch, tempo, and intensity) affects what facial emotion is perceived when the emotion in the voice is incongruent with the emotion in the face (de Gelder and Vroomen, 2000; Massaro and Egan, 1996). It has also been shown that visual attention is guided by emotional prosody, where observers look more often at faces expressing the same emotion than at faces expressing a different emotion (Paulmann et al., 2012; Rigoulot and Pell, 2012). However, these studies on the integration of facial expressions with emotional prosody mostly used static images as visual stimuli. It could thus be that observers did not necessarily attribute the face and voice to the same person, or that the emotions were not perceived as being expressed at the same time. In addition, while vocal emotion always unfolds over time, a static image of a facial expression does not, despite the fact that facial expressions are dynamic in real life.

Therefore, in the present study, aiming for enhanced ecological validity, we presented dynamic multimodal emotional stimuli that always contained congruent emotion cues to express one of twelve different emotions, and also included emotions from the same family, such as anger and irritation. The stimuli were obtained from the Geneva Multimodal Emotion Portrayals (GEMEP) core set (Bänziger et al., 2012), which contains audiovisual video recordings of emotional expressions, with actors uttering a short nonsense sentence in an emotional manner. The video recordings show the actor from the waist up and therefore include both facial expressions as well as body, arm, and hand gestures. These stimuli have been shown to be recognizable well above chance level and were rated to be fairly believable and authentic. We used this stimulus set to measure how auditory and visual information is integrated for emotion perception.

For the purpose of this study, we consider information from two modalities as integrated when the addition of a second modality modulates the perception of the first modality (e.g., Etzi et al., 2018; Samermit et al., 2019; Taffou et al., 2013), or vice versa, or when the two modalities are combined into a unified multimodal percept (see Collignon et al., 2008; Kokinous et al., 2015 for similar descriptions). This combination into a unified percept could be indicated by, e.g., a gain in task performance larger or smaller than expected on the basis of independent summation of auditory and visual information or when an illusory percept arises due to the fusion of incongruent visual and auditory information (McGurk effect; McGurk and Macdonald, 1976). Relevant to our study, one form of integration is when observers alter their viewing strategies under different circumstances and tasks (Buchan et al., 2008; Võ et al., 2012).

Here, we used eye tracking to gain insight into observers’ viewing strategies and in what way they extract and make use of information from the stimuli. Based on previous studies examining viewing behavior during emotion perception, we cannot make a clear prediction about which areas will be fixated on most of the time, as most of these studies used static stimuli. However, two scenarios are likely: either gaze is mostly guided by information-seeking processes, or gaze is mostly guided by saliency. From the information-seeking perspective, when the task is to decode a speaker’s emotional state — the focus of the current study — and congruent audio is added to a video signal, the audio signal may help in decoding the emotional information, as the information in the two modalities overlaps to some extent. Hence, auditory information could render certain visual information largely redundant, such as the motion of a speaker’s mouth. Therefore, it may no longer be necessary to look at the mouth to retrieve that information and gaze can be directed elsewhere to examine different, potentially more unique, information. Alternatively, emotion recognition may rely mostly on salience, in which case an observer would always look at the most expressive region, such as the mouth for happy expressions and the eyes in angry expressions (in line with Smith et al., 2005). In this case we do not expect any changes in viewing behavior in response to the presence or absence of audio. Consequently, a change in viewing behavior in response to a change in modalities available can provide complementary information to task performance as a measure of audiovisual integration. In order to analyze what regions of the stimulus participants were looking at, we employed an Area-of-Interest (AOI) based analysis. Our AOIs were dynamic to capture the dynamic nature of the stimuli. Previous studies have shown that, when observing faces, most fixations are on the eyes, nose, and mouth (Groner et al., 1984; Walker-Smith et al., 1977). In addition, it has been shown that hand movements are frequent in emotion expression (Dael et al., 2012), hence observing these movements might be useful as well for identifying the expressed emotion. Therefore, we focused our analysis on the fixations on the eyes, nose, mouth, and hands, which all could drastically change in location over the time course of the video.

To assess the presence of audiovisual integration, we evaluated whether the accuracy scores for emotion identification differed for audio-only, video-only, and audiovisual stimulus presentation. A difference in accuracy is an indication of integration, and the direction of this difference indicates whether any changes in viewing behavior are indeed functional, i.e., lead to better performance. Several studies have shown that emotion perception improves when participants have access to more than one modality conveying the same emotion (de Gelder and Vroomen, 2000; Massaro and Egan, 1996; Paulmann and Pell, 2011). Conversely, other studies have implied that visual information dominates over auditory information and that, consequently, multimodal information may not necessarily improve emotion recognition, and the contribution of the audio may be limited (Bänziger et al., 2009; Jessen et al., 2012; Wallbott and Scherer, 1986). These conflicting findings may be the result of differences in the reliability of the auditory and visual information presented in these studies. Collignon and colleagues (2008) found visual dominance when the stimuli were presented without any noise, but found robust audiovisual integration when they added noise to the visual stimulus. The visual dominance was found despite the fact that the unimodal emotion recognition performance (correct recognition rate) was the same for the noiseless visual and auditory stimuli. Thus, it appears that in noise-free environments, visual information is often treated as more reliable. Based on this, we hypothesized that we would find visual dominance in participants’ accuracy scores.

2. Materials and Methods

2.1. Participants

In total, 23 young healthy participants volunteered to take part in the experiment (ten male, mean age = 23 ± 2.3 years, range: 20–31). One participant did not pass all screening criteria (described below in Section 2.2) and was therefore excluded from the experiment before data collection. Another participant was excluded due to severe difficulties in calibrating the eye tracker. Consequently, 21 participants completed the entire experiment (nine male, mean age = 23 ± 2.4 years, range: 20–31) and were included in the data analysis. The sample size was initially based on similar previous studies on audiovisual emotion perception (e.g., Collignon et al., 2008; Paulmann and Pell, 2011; Skuk and Schweinberger, 2013; Takagi et al., 2015) and was subsequently adjusted to ensure proper counterbalancing of the experimental blocks. All participants were given sufficient information about the nature of the tasks of the experiment, but were otherwise naïve as to the purpose of the study. Written informed consent was collected prior to data collection. The study was carried out in accordance with the Declaration of Helsinki and was approved by the local medical ethics committee (ABR nr: NL60379.042.17).

2.2. Screening

Prior to the experiment, potential participants’ hearing and eyesight were tested to ensure auditory and (corrected) visual functioning was within the normal range.

Normal auditory functioning was confirmed by measuring auditory thresholds for pure tones at audiometric test frequencies between 125 Hz and 8 kHz. Thresholds were determined in a soundproof booth using a staircase method similar to typical audiological procedures. Testing was conducted for each ear, always starting with the right ear. In order to participate in the experiment, audiometric thresholds at all test frequencies needed to be 20 dB HL or better for the better ear.

Normal visual functioning was tested with measurements of visual acuity and contrast sensitivity (CS). These tests were performed using the Freiburg Acuity and Visual Contrast Test (FrACT, version 3.9.8; Bach, 1996, 2006). A visual acuity of at least 1.00 and a logCS of at least 1.80 (corresponding roughly to a 1% luminance difference between target and surround) were used as cutoff thresholds to participate in the experiment. Visual tests were performed on the same computer as used in the main experiment.

Additional exclusion criteria were neurological or psychiatric disorders, dyslexia, and the use of medication that can potentially influence normal brain functioning.

2.3. Stimuli

The stimuli used in this study were taken from the Geneva Multimodal Emotion Portrayals (GEMEP) core set (for a detailed description, see: Bänziger et al., 2012), which consists of 145 audiovisual video recordings (mean duration: 2.5 s, range: 1–7 s) of emotional expressions portrayed by ten professional French-speaking Swiss actors (five female). The vocal content of the expressions consisted of two pseudo-speech sentences with no semantic content but resembling the phonetic sounds of Western languages (“nekal ibam soud molen!” and “koun se mina lod belam?”). Out of the total set of 17 emotions, 12 were selected for the main experiment. The selection was made to produce a well-balanced design, such that all actors portrayed the selected emotions, and further, these emotions could be distributed evenly on the quadrants of the valence-arousal scale (Russell, 1980; see Table 1), thereby balancing positive and negative emotions as well as high- and low-arousal emotions within the selected stimulus set. This resulted in a total of 120 stimuli used in our experiments. The five remaining emotions that were excluded from data collection were used as practice material to acquaint participants with the stimulus materials and the task.

Table 1.

The selected emotion categories used in the experiment. The emotions for the main experiment are distributed over the quadrants of the valence-arousal scale (Russell, 1980). The five additional emotions were used for the practice trials.


The audio from all movie files was edited in Audacity (version 2.1.2; http://audacityteam.org/), to remove any audible noise or clipping from the audio recordings, and saved as 16-bit WAV-files. To do so, in most cases, the editing consisted of using the built-in ‘Noise Reduction’ effect to reduce background noise as much as possible without affecting the speech signal. In rare cases, the files contained clipping, which was removed by manually adjusting the clipped regions of the waveform. Audio recordings were then root-mean-square (RMS)-equalized in intensity level, and re-merged with the corresponding video files (thereby replacing the old audio) using custom-made scripts.
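The equalization itself was done with custom scripts that are not described in detail; purely as an illustration of RMS equalization, a minimal R sketch, assuming the tuneR package, mono 16-bit files, and a hypothetical target level, could look as follows.

library(tuneR)  # assumed package for reading and writing WAV files

rms <- function(x) sqrt(mean(x^2))

equalize_rms <- function(infile, outfile, target_rms = 0.1) {
  w <- readWave(infile)                       # assumes a mono recording
  s <- w@left / 2^(w@bit - 1)                 # scale samples to [-1, 1]
  s <- s * (target_rms / rms(s))              # rescale so the RMS matches the target
  writeWave(Wave(left = round(s * (2^15 - 1)),
                 samp.rate = w@samp.rate, bit = 16),
            outfile)
}

equalize_rms("stimulus_raw.wav", "stimulus_eq.wav")   # hypothetical file names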

2.4. Experimental Setup

Experiments were performed in a silent room, which was dark except for the illumination provided by the screen. Participants were seated in front of a computer screen at a viewing distance of 70 cm with their head in a chin and forehead rest to minimize head movements. Stimuli were displayed and manual responses were recorded using MATLAB (Version R2015b; The Mathworks, Inc., Natick, MA, USA), the Psychophysics Toolbox (Version 3; Brainard, 1997; Kleiner et al., 2007; Pelli, 1997) and the Eyelink Toolbox (Cornelissen et al., 2002) extensions of MATLAB. The stimuli were presented full-screen on a 24.5-inch monitor with a resolution of 1920 × 1080 pixels (43 × 24.8 degrees of visual angle). Average screen luminance was 38 cd/m2. Stimulus presentation was controlled by an Apple MacBook Pro (early 2015 model). Audio was produced by the internal soundcard of this computer and presented binaurally through Sennheiser HD 600 headphones (Sennheiser Electronic GmbH & Co. KG, Wedemark, Germany). The sound level was calibrated to be at a comfortable and audible level, at a long-term RMS average of 65 dB SPL.

To measure eye movements, an Eyelink 1000 Plus eye tracker, running software version 4.51 (SR Research Ltd., Ottawa, ON, Canada), was used. Gaze data were acquired at a sampling frequency of 500 Hz. The eye tracker was mounted on the desk right below the presentation screen. At the start of the experiment, the eye tracker was calibrated using its built-in nine-point calibration routine. Calibration was verified with the validation procedure in which the same nine points were shown again. The experiment was continued if the calibration accuracy was sufficient (average error of less than 0.5° and a maximum error of less than 1.0°). A drift check was performed both at the start of the experiment and after each break. If the drift was too large (i.e., more than 1.0°), the calibration procedure was repeated.

2.5. Procedure

In this study, behavioral and eye-tracking data were obtained to assess accuracy and gaze fixations during emotion identification with dynamic stimuli. In each trial, a central fixation cross appeared for a random duration between 500 and 1500 ms prior to stimulus presentation. The response screen followed each stimulus presentation after 100 ms and remained on screen until the participant made his or her response. The order of events in a typical trial is shown in Fig. 1.

Figure 1.

Schematic representation of the events in a single trial. Participants first were shown a fixation cross (left), followed by the stimulus, presented audiovisually (middle top), visually (middle), or aurally (middle bottom). After stimulus presentation, a response screen (right) with labels indicating the possible emotions appeared and remained on screen until the participant made a (forced) response. Emotion labels were in Dutch, from top right going clockwise they are: opgetogen (joy), geamuseerd (amusement), trots (pride), voldaan (pleasure), opgelucht (relief), geïnteresseerd (interest), geïrriteerd (irritation), ongerust (anxiety), verdrietig (sadness), bang (fear), wanhopig (despair), and woedend (anger).


Participants were asked to identify the emotion presented in one of three stimulus presentation modalities: audio-only (A-only), video-only (V-only), or audio and video combined (AV). They were asked to respond as accurately as possible in a forced-choice discrimination paradigm, by clicking on the label on the response screen corresponding with the identified emotion. Emotion labels were shown and explained before the experiment. Participants were further instructed to blink as little as possible during the trial and maintain careful attention to the stimuli.

In total, each participant was presented with all 120 stimuli (twelve emotions × ten actors) in all three blocks: an A-only block, a V-only block, and an AV block. Block order was counterbalanced between participants. Stimulus order within each block was randomized. Participants were encouraged to take breaks both within and between blocks (breaks were possible after every 40 trials) to maintain concentration and prevent fatigue. Breaks were self-paced and the experiment continued upon the participant pressing the spacebar. Following each break, a drift correction was applied to the eye-tracking calibration. Fifteen practice trials (five practice trials for each modality) preceded the experiment to familiarize participants with the task and stimulus material. In total, the experiment consisted of 375 trials, including the 15 practice trials, and took at most one hour to complete. Feedback on the given responses was provided during the practice trials only.

2.6. Analyses of Behavioral Data

To assess the presence of audiovisual integration, we tested whether performance for emotion identification differed for A-only, V-only, and AV stimulus presentation. We additionally employed a measure that quantifies the size of the effect from audiovisual integration, i.e., whether audiovisual integration is sub-additive (i.e., lower than expected based on the simultaneous and independent processing of both unisensory modalities), additive (i.e., equal to a summation of the auditory and visual evidence), or supra-additive. A supra-additive effect would be indicative of a gain in performance beyond what is gained by independently summing the information from both modalities (Crosse et al., 2016; Stevenson et al., 2014).

Accuracy scores for each emotion and modality were converted to unbiased hit rates (H_u; Wagner, 1993) prior to further analyses, to account for response biases. Unbiased hit rates were then arcsine-transformed to ensure normality and analyzed in R (version 3.6.0; R Foundation for Statistical Computing, Vienna, Austria; https://cran.r-project.org) with a repeated-measures ANOVA (aov_ez from the afex package, version 0.25-1). For the ANOVA, the arcsine-transformed H_u was the dependent variable, and modality (with three levels: A-only, V-only, and AV) and emotion (with 12 levels) were the fixed-effects variables. The Greenhouse–Geisser correction was applied in cases of a violation of the sphericity assumption. Effect sizes are reported as generalized eta-squared (ges). Pairwise comparisons were performed to test main effects (comparing different modalities) and interactions (the effect of modality for each emotion) using lsmeans from the emmeans package (version 1.4.1). For comparisons between modalities, the Bonferroni correction was applied to ensure our conclusions were not based on a too liberal adjustment. For comparing modality differences between emotions, we used the False Discovery Rate (FDR) correction, to ensure no effects were lost due to overly strict adjustment of p-values given the many pairwise comparisons made.
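For illustration, a minimal R sketch of this pipeline is given below; the data frame, its column names, and the exact form of the arcsine transform (here the common arcsine-square-root) are assumptions rather than the authors' code, and emmeans() is used in place of its lsmeans() alias.

library(afex)     # aov_ez
library(emmeans)  # follow-up contrasts

# Unbiased hit rate (Wagner, 1993) computed from one participant-by-modality
# confusion matrix `cm` (rows = presented emotion, columns = chosen emotion).
unbiased_hit_rate <- function(cm) diag(cm)^2 / (rowSums(cm) * colSums(cm))

# `dat` is assumed to hold one row per participant x modality x emotion,
# with the unbiased hit rate in the column `hu`.
dat$hu_asin <- asin(sqrt(dat$hu))   # arcsine(-square-root) transform

fit <- aov_ez(id = "participant", dv = "hu_asin", data = dat,
              within = c("modality", "emotion"))  # afex applies sphericity corrections

pairs(emmeans(fit, ~ modality), adjust = "bonferroni")     # main effect of modality
pairs(emmeans(fit, ~ modality | emotion), adjust = "fdr")  # modality effect per emotion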

For a quantitative assessment of the AV integration effect, we tested whether the measured performance for AV exceeded the statistical facilitation produced by A + V. To quantify the predicted H_u for the independent summation of A and V, we used the following equation (Crosse et al., 2016; Stevenson et al., 2014):

(1) Ĥ_u(AV) = H_u(A) + H_u(V) − H_u(A) · H_u(V)

If the H_u for the AV modality exceeds the predicted Ĥ_u(AV), as assessed by a paired t-test, this indicates that A and V are integrated in a supra-additive manner (see, e.g., Calvert, 2001; Hughes et al., 1994). Paired t-tests were only performed when both the difference between AV and V-only and the difference between AV and A-only were significant.
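A minimal sketch of this test, assuming the per-participant unbiased hit rates are stored in hypothetical vectors hu_a, hu_v, and hu_av (one value per participant):

hu_av_pred <- hu_a + hu_v - hu_a * hu_v   # equation (1): prediction from independent summation
t.test(hu_av, hu_av_pred, paired = TRUE)  # AV above the prediction would indicate supra-additivity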

2.7. Analyses of Eye-Tracking Data

Fixations were extracted from the raw eye-tracking data using the built-in data-parsing algorithm of the Eyelink eye tracker. We performed an AOI-based analysis for fixations made during stimulus presentation (only for the AV and V-only modalities, as in the A-only modality there is no visual stimulus aside from a fixation cross). Trials with blinks longer than 300 ms during stimulus presentation were discarded. The analysis was restricted to fixations made between 200 ms and 1000 ms after stimulus onset. The first 200 ms were discarded because this is the time needed to plan and execute the first eye movement. No data after 1000 ms were taken into account, limiting the analysis to the duration of the shortest movie (1000 ms).
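As an illustration of this selection step, a sketch in R is shown below; the data frames `fix` (one row per parsed fixation, with times in ms relative to stimulus onset) and `blinks` (one row per parsed blink), and their column names, are assumptions.

library(dplyr)

# Trials containing a blink longer than 300 ms during stimulus presentation.
bad_trials <- blinks %>%
  filter(duration > 300) %>%
  distinct(participant, trial)

fix_win <- fix %>%
  anti_join(bad_trials, by = c("participant", "trial")) %>%  # discard those trials
  filter(t_start >= 200, t_start < 1000)                     # keep the 200-1000 ms window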

In the videos, the eyes (left and right), nose, mouth, and hands (left and right) of the speaker were chosen as AOIs. Because the stimuli are dynamic, we created dynamic AOIs. Coordinates of the AOI positions for each movie and each frame were extracted using Adobe® After Effects® CC (Version 15.1.1; Adobe Inc., San Jose, CA, USA). For the face AOIs, these coordinates were obtained by placing an ellipsoid mask on the face area and applying a tracker using the ‘Face Tracking (Detailed Features)’ method, which automatically tracks many features of the face (see Fig. 2 for an example frame with AOIs drawn in). Face track points were visually inspected and manually edited (i.e., moved into the correct place) whenever the tracking software failed to correctly track them.

Figure 2.

Face tracking in Adobe After Effects CC. The yellow line is the ellipsoid mask after automatic alignment to the contours of the face. Each circled cross is a face track point. The colored rectangles indicate the locations of the different areas of interest (AOIs); the red rectangles denote the right- and left-eye AOIs, the purple rectangle shows the nose AOI, and the blue rectangle specifies the mouth AOI.


Coordinates of all obtained face track points for each movie frame were stored in a text file and used to create rectangular AOIs. For the eye AOIs we used the coordinates of the following face track points: ‘Right/Left Eyebrow Outer’ for the x-position of the lateral corner, ‘Right/Left Eyebrow Inner’ for the x-position of the medial corner, ‘Right/Left Eyebrow Middle’ for the top, and the midpoint between the y-positions of ‘Left Pupil’ and ‘Nose tip’ for the bottom, indicating the eye–nose border. Two individual AOIs were created for the left and right eye, which were later merged for analyses. For the nose AOI, we used the eye–nose border as the top, the nose–mouth border (the midpoint between the y-positions of ‘Right Nostril’ and ‘Mouth Top’) as the bottom, the x-position of ‘Right Nostril’ for the left corner, and the x-position of ‘Left Nostril’ for the right corner. For the mouth AOI, we used the x-position of ‘Mouth Right’ for the left corner, the x-position of ‘Mouth Left’ for the right corner, the nose–mouth border for the top, and the y-position of ‘Mouth Bottom’ for the bottom. Each AOI was expanded by 10 pixels on each side (20 pixels across the horizontal and vertical axes), except at the eye–nose and nose–mouth borders. Overlap between AOIs was avoided. The actual size of each AOI varied across actors and frames, e.g., due to some actors being closer to the camera.

For the hand AOIs, the ‘Track Motion’ method was used, in which a single tracker point (per hand) was used to track position. The tracker point was placed approximately in the center of the hand. The track point was manually edited whenever the tracking software failed to correctly track it, which happened often due to the complex movements the hands made in most movies. Figure 3 shows example frames from one movie. After extracting the coordinates, a circle with a radius of 75 pixels around the tracked point was used to create the AOI.
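To make the AOI construction concrete, the sketch below shows how such frame-by-frame AOIs could be represented in R; the track-point column names, the image coordinate convention (origin at the top left, y increasing downward), and the data frame layout are illustrative assumptions, not the After Effects export format or the authors' scripts.

pad <- 10   # expansion in pixels on the free sides of each rectangular AOI

# `tp` is assumed to hold, per movie and frame, the x/y coordinates of the
# exported track points (one row per frame).
right_eye_aoi <- with(tp, data.frame(
  movie = movie, frame = frame,
  x_left   = pmin(r_eyebrow_outer_x, r_eyebrow_inner_x) - pad,
  x_right  = pmax(r_eyebrow_outer_x, r_eyebrow_inner_x) + pad,
  y_top    = r_eyebrow_middle_y - pad,         # eyebrow marks the top of the AOI
  y_bottom = (left_pupil_y + nose_tip_y) / 2   # eye-nose border; no expansion here
))

# Hand AOI: a circle of radius 75 pixels around the tracked hand centre.
in_hand_aoi <- function(x, y, hand_x, hand_y, radius = 75) {
  (x - hand_x)^2 + (y - hand_y)^2 <= radius^2
}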

Figure 3.

Hand tracking using Adobe After Effects CC. In both images, the attach point is at the center (from which the coordinate is extracted), the inner box is the feature region (i.e., what the tracked region looks like), and the outer box is the search region of the tracker (i.e., the region in which the tracker will search for the feature region). Additionally, the tracked points in previous frames can be seen. As can be seen in the left image, tracking works well early in the movie. As the hand starts to change shape later in the movie, however, the tracker errs. This can be seen in the right image, where the tracker loses the hand from sight and tracks the arm and background instead.


Then, for each fixation data point, we checked whether the fixation fell on one of the AOIs (using the coordinates from the movie frame co-occurring with the time of the fixation), leading to one binary vector for each AOI with the same length as the fixation data. These vectors were then averaged per trial, giving a mean fixation proportion on each AOI for each trial. Lastly, the means were arcsine-transformed. A mixed linear regression was performed in R (using lmer from the lme4 package, version 1.1-21) on correct trials only, as we were most interested in examining whether changes in viewing behavior due to changes in modality availability were adaptive, i.e., led to good performance. In line with the analyses of unbiased hit rates, the model included modality, emotion, and AOI as fixed effects, which were allowed to interact with each other. Random intercepts were included for participant and movie, and a random slope for modality was included for both participant and movie if the model still converged (otherwise, only a random slope for modality for participants was included). Overall significance of the main effects and interactions was assessed using the Anova function from the car package (version 3.0-3). Pairwise comparisons were performed using lsmeans to test whether fixation proportions on the different AOIs differed between modalities and emotions. As before, the Bonferroni correction was applied for comparisons between modalities, while the FDR correction was used for comparisons between emotions.
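A minimal sketch of this scoring and modeling step is given below; the data frame `d` (one row per fixation sample and candidate AOI, with the frame-matched AOI bounds), its column names, and the use of the arcsine-square-root transform are assumptions for illustration, not the authors' code.

library(lme4)     # lmer
library(car)      # Anova
library(emmeans)  # follow-up contrasts

# Restrict the analysis to correct trials (a logical `correct` column is assumed).
d_corr <- subset(d, correct)

# Score each fixation sample against the frame-matched rectangular AOI bounds
# (image coordinates, y increasing downward).
d_corr$on_aoi <- with(d_corr, x >= aoi_x_left & x <= aoi_x_right &
                              y >= aoi_y_top  & y <= aoi_y_bottom)

# Mean fixation proportion per AOI and trial, then arcsine(-square-root) transform.
prop <- aggregate(on_aoi ~ participant + movie + trial + modality + emotion + aoi,
                  data = d_corr, FUN = mean)
prop$p_asin <- asin(sqrt(prop$on_aoi))

# Mixed model with random slopes for modality; the random-effects structure
# would be simplified if this specification did not converge.
m <- lmer(p_asin ~ modality * emotion * aoi +
            (1 + modality | participant) + (1 + modality | movie),
          data = prop)
Anova(m)                                           # chi-square tests of effects
pairs(emmeans(m, ~ modality | aoi), adjust = "bonferroni")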

Lastly, we ran a second model to test whether fatigue or boredom, which may have occurred due to the lengthy duration of the experiment, had an effect on fixation patterns, by adding experimental block to the model. There was no significant effect of block on fixation patterns (χ²(1) = 1.79, p = 0.18), ruling out an additional effect of potential boredom or fatigue.

3. Results

Participants identified dynamic emotional expressions presented in movies while their eye movements were recorded. The objective of this study was to see if emotions are processed similarly whether conveyed in a unimodal (A-only, V-only) or multimodal (AV) manner, as measured by performance levels and fixation patterns. To achieve this objective, here we present analyses of accuracy and gaze differences for different modalities and emotions. Accuracy and fixation data for individual participants can be found in Supplementary Figs S1, S2, S3, and S4. Confusion matrices for each modality can be found in Supplementary Fig. S5.

3.1. Accuracy Across Modalities and Emotions

Accuracy scores, expressed as unbiased hit rates (H_u) and averaged over all participants and testing blocks, are shown in Fig. 4. On average, participants performed the task with a mean accuracy of 0.37, well above the chance level of 0.083.

Figure 4.

Task performance for each modality, shown as unbiased hit rates and averaged across all participants and blocks. Each box shows the data between the first and third quartiles. The horizontal black solid line in each box denotes the median while the horizontal black dashed line in each box denotes the mean. The whiskers extend to the lowest/highest value still within 1.5 × interquartile range of the first/third quartile. Dots are outliers. The red line indicates the grand average performance (0.37). The black dotted horizontal line indicates chance level performance (0.083).


A visual inspection of Fig. 4 suggests that performance was lowest for the A-only modality and highest for the AV modality. This was confirmed by the ANOVA, which had H_u as the dependent variable and modality and emotion as independent variables. The model showed a main effect of modality (F(2, 40) = 42.7, p < 0.001, ges = 0.18), a main effect of emotion (F(11, 220) = 53.1, p < 0.001, ges = 0.48), and a significant interaction between modality and emotion (F(22, 440) = 5.2, p < 0.001, ges = 0.07). Bonferroni-adjusted pairwise comparisons showed that performance differed significantly between all modalities (A-only–AV: t(40) = 9.13, p < 0.001; A-only–V-only: t(40) = 5.80, p < 0.001; V-only–AV: t(40) = 3.34, p = 0.006). Performance was thus lowest for A-only (mean accuracy = 45%), intermediate for V-only (mean accuracy = 62%), and highest for AV (mean accuracy = 70%), with all differences between modalities being significant.

Further inspection of the modality-by-emotion interaction showed that, in general, performance was lowest for A-only, intermediate for V-only, and highest for AV, but this was not true for all emotions. In fact, for most emotions (except for Pleasure, Relief and Anxiety), there was no significant difference in performance between V-only and AV. In addition, for some negative valence emotions (Fear and Anger), none of the comparisons between modality pairs produced a significant difference. Lastly, for Pleasure, Relief, and Despair the difference between V-only and A-only was not significant. The complete list of all comparisons is given in Table 2 and further visualized in Fig. 5.

Table 2.

Contrasts for the modality-by-emotion interaction, showing the model estimate differences, with the False Discovery Rate (FDR)-adjusted p-values in parentheses. A positive contrast means that performance in the first condition of the comparison was better than in the second (and vice versa). Significant differences are indicated in bold.


Figure 5.

Task performance for each modality, similar to Fig. 4, but shown for each emotion. The red line in each panel indicates the average performance for that particular emotion. The black dotted horizontal line indicates chance level performance (0.083).


While AV performance was significantly higher than both A-only and V-only performance, indicating that AV integration took place, the AV integration effect was sub-additive, as performance for AV was significantly lower than predicted on the basis of additivity (t(20) = 3.06, p = 0.006; Ĥ_u(AV): 0.52 ± 0.12, H_u(AV): 0.45 ± 0.10). Considering individual emotions, only for Anxiety, Pleasure, and Relief did performance differ both between AV and V-only and between AV and A-only; thus, only for these emotions was it further tested whether AV performance was supra-additive. AV performance was not significantly different from the predicted additive performance for Anxiety [t(20) = 0.006, p = 0.99; Ĥ_u(AV): 0.30 ± 0.16, H_u(AV): 0.30 ± 0.14], for Pleasure [t(20) = 1.33, p = 0.20; Ĥ_u(AV): 0.56 ± 0.05, H_u(AV): 0.51 ± 0.05], or for Relief [t(20) = 1.54, p = 0.14; Ĥ_u(AV): 0.50 ± 0.05, H_u(AV): 0.43 ± 0.05], indicating that the AV integration effect was additive for all three emotions.

Figure 6.

Fixation proportions for correct trials on all areas of interest (AOIs) (face, i.e., eyes, nose, mouth; and hands), across the analyzed time course (a) and averaged over the analyzed time course (b), both averaged over all stimuli and participants. Shaded areas around each line (a) and error bars (b) denote the standard error of the mean (SEM).
