The Multimodal Trust Effects of Face, Voice, and Sentence Content

Trust is an aspect critical to human social interaction, and research has identified many cues that help in the assimilation of this social trait. Two of these cues are the pitch of the voice and the width-to-height ratio of the face (fWHR). Additionally, research has indicated that the content of a spoken sentence itself has an effect on trustworthiness, a finding that has not yet been brought into multisensory research. The current research aims to investigate previously developed theories on trust in relation to vocal pitch, fWHR, and sentence content in a multimodal setting. Twenty-six female participants were asked to judge the trustworthiness of a voice speaking a neutral or romantic sentence while seeing a face. The average pitch of the voice and the fWHR were varied systematically. Results indicate that the content of the spoken message was an important predictor of trustworthiness, extending into multimodality. Further, the mean pitch of the voice and the fWHR of the face appeared to be useful indicators in a multimodal setting. These effects interacted with one another across modalities. The data demonstrate that trust in the voice is shaped by task-irrelevant visual stimuli. Future research is encouraged to clarify whether these findings remain consistent across genders, age groups, and languages.


Introduction
Trust can ensure personal safety in interpersonal relations, making it an important topic to investigate for its societal implications. Numerous contributing factors to trust have been identified throughout years of research. Among others, these include race, sex, age, and emotion (Forde-Smith and Feinberg, 2023). Perhaps more intricate, the human voice and certain facial features have been noted to influence social-trait perception (for studies on the voice: Belin et al., 2017; McAleer et al., 2014; Schild et al., 2020; Tsantani et al., 2016; for studies on the face: Alaei et al., 2020; Merlhiot et al., 2021), including trust (for studies on the voice: Jiang et al., 2020; O'Connor and Barclay, 2017; Schirmer et al., 2020; for studies on the face: Ferguson et al., 2019; Sofer et al., 2015). In terms of vocal acoustics, it has generally been found that a lower fundamental frequency (F0), perceived as lower pitch, is judged as more trustworthy for men but not women in economic domains, yet less trustworthy in romantic situations (Schild et al., 2020). However, other research indicates that people are more trusting of higher-pitched male voices regardless of situational context (O'Connor and Barclay, 2017; Weiss et al., 2021). Further, lower pitch has been linked to higher levels of testosterone (Dabbs and Mallinger, 1999) and to self-reports of sexual infidelity (Schild et al., 2021).
Certain facial features have also received academic interest in explaining trustworthiness. Interestingly, more typical faces are linked to greater attribution of trust than are more attractive faces (Sofer et al., 2015), but more average faces are also judged as more attractive in other studies (Langlois and Roggman, 1990). Indeed, the influence of face typicality remains a debated topic. Participants were, however, able to read others' attachment styles from their faces (Alaei et al., 2020). Further, the facial width-to-height ratio (fWHR) has repeatedly been suggested to influence dominance and trust perception (Arnocky et al., 2018; Banai et al., 2023; Geniole et al., 2014; Merlhiot et al., 2021; Stirrat and Perrett, 2010). A lower fWHR (i.e., a larger distance between the eyes and mouth and a lesser face width) is linked to perceptions of greater integrity (Ormiston et al., 2017), while a higher fWHR (i.e., a smaller distance between the eyes and mouth and a greater face width) has been linked to trust exploitation in trust-game settings (Stirrat and Perrett, 2010) and to dominance (Merlhiot et al., 2021). Links between the fWHR and leader preference (Banai et al., 2023) and self-reported sociosexuality and sex drive (Arnocky et al., 2018) have also been suggested. It is thought that the fWHR exerts an influence on trust, but this relationship is possibly mediated through dominance and aggression (Geniole et al., 2014).
Most of the studies conducted on the trustworthiness of the voice use a correlational design (Belin et al., 2017; Jiang et al., 2020; McAleer et al., 2014; Schild et al., 2020) and use actors as their stimuli (Schirmer et al., 2020). These studies generally measure the mean F0, an important acoustic correlate of perceived pitch, and correlate F0 with social traits that are measured via questionnaires. Studies that experimentally manipulated F0 and inquired about social-trait impressions were conducted by O'Connor and Barclay (2017) and by Tsantani et al. (2016), the first of which found that higher pitch is perceived as more trustworthy in neutral, economic, and mating-related settings; a situational-context effect that is interesting to explore further due to its societal relevance. Contrastingly, Tsantani et al. (2016) identified a tendency to prefer lower-pitched voices in first impressions.
In correlational work on the trustworthiness of the face, the fWHR is typically correlated with social traits (Arnocky et al., 2018; Durkee and Ayers, 2021; Geniole et al., 2014; Ormiston et al., 2017; Stirrat and Perrett, 2010), but Merlhiot et al. (2021) did experimentally manipulate the fWHR and reported that a higher fWHR is linked to higher perceptions of dominance.
In describing human trust dynamics, it is important to include its multimodal nature. In the majority of real-world interactions, people are exposed not only to a face but also to an accompanying voice. Apart from a small number of studies, though, this multimodal nature has remained a neglected variable in trust research. Previous studies indicate that perception of emotions is a cross-modal phenomenon rather than a post-perceptual decision (de Gelder and Vroomen, 2000). In line with this view, researchers have shown that both the influence of the face and that of the voice are important in the assimilation of social-trait judgements (Mileva et al., 2018; Rezlescu et al., 2015). Mileva et al. (2018) studied the assimilation of trust judgements through a series of experiments. The authors concluded that perception of faces and perception of voices work independently from one another in the assimilation of social-trait judgements. Having studied both dominance and trustworthiness, they report that voices have a larger influence when dominance is considered, but this is not true for trustworthiness, where the face shows a measure of superiority. They note, however, that they found no difference in trustworthiness ratings based on vocal pitch. Yet, previously discussed literature provides evidence that this effect exists (e.g., Jiang et al., 2020; O'Connor and Barclay, 2017; Schirmer et al., 2020), and literature published a year after Mileva et al. (2018) added that the situational context of the spoken sentence further exerts an influence (Schild et al., 2020). Further, the study used facial stimuli that were highly diverse in lighting, emotional expression, and pose. As research indicates that the fWHR has an effect on trustworthiness, it is interesting to examine its isolated effects in a multimodal setting. Rezlescu et al. (2015) employed a similar research aim and found that voices and faces hold similar weight in the assimilation of multimodal trustworthiness judgements, and note that the effects of vocal trustworthiness were higher for trustworthy faces than for untrustworthy faces. These results conflict with those of Mileva et al. (2018), who note that the face holds more weight in these judgements.

Downloaded from Brill.com 05/15/2024 03:18:59PM via Open Access. This is an open access article distributed under the terms of the CC BY 4.0 license. https://creativecommons.org/licenses/by/4.0/
The present study builds upon the work by Mileva et al. (2018) and expands on it by incorporating a neutral and a romantic situational context in line with Schild et al. (2020), and by including isolated fWHR manipulations in the visual stimuli. We investigated changes in perceived trustworthiness of a voice when tested unimodally versus multimodally, in an experimental instead of a correlational design. We expected to find that lower-pitched voices, romantically charged sentences, and higher fWHRs are perceived as less trustworthy in unimodal trials, and that this effect transfers to multimodal stimuli.

Stimuli
Visual stimuli were acquired by photographing two male and two female Caucasian Dutch-speaking university students. Photographs were taken using a NIKON D7200 from a fixed distance of 100 cm in a well-lit room with a black background. The models were instructed to refrain from displaying any emotional facial expressions. Next, the photographs were imported into Photoshop (Adobe Inc., 2019), where manipulations to the fWHR were applied. The fWHR was manipulated by changing the distance between the eyes and mouth in systematic steps in each direction. The manipulations consisted of −18.75%, −12.5%, +0.0%, +12.5%, and +18.75% alterations. These distances were based on distance calculations by Photoshop and were chosen because larger steps were intuitively perceived as unnatural. As the pictures were taken from a constant distance, complying with the distance measures provided by Photoshop did not result in inconsistent manipulations across models. When manipulation of the fWHR caused an imbalance in the proportions of the face, additional changes were made to the forehead, chin height, nose height, and face width to accommodate the change in fWHR. All visual stimuli used in this study can be found in Fig. 1. In total, there were four models whose photographs were manipulated with five different fWHRs, resulting in 20 different photographs.
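To make the direction of these manipulations concrete, the arithmetic can be sketched as follows. This is a minimal illustration under assumed baseline values (the 180 px face width and 100 px eye-to-mouth distance are arbitrary, not measurements from the study); it only shows how a change in eye-mouth distance translates into a fWHR.

```python
# Sketch of the fWHR manipulation arithmetic.
# Baseline width and eye-mouth distance are assumed, illustrative values.
BASELINE_WIDTH = 180.0     # face width in pixels (assumed)
BASELINE_DISTANCE = 100.0  # eye-to-mouth distance in pixels (assumed)

STEPS = [-0.1875, -0.125, 0.0, 0.125, 0.1875]  # manipulations used in the study

def fwhr(width, eye_mouth_distance):
    """Facial width-to-height ratio: face width divided by eye-mouth distance."""
    return width / eye_mouth_distance

for step in STEPS:
    distance = BASELINE_DISTANCE * (1 + step)
    # Decreasing the eye-mouth distance increases the fWHR, and vice versa.
    print(f"{step:+.4f}: fWHR = {fwhr(BASELINE_WIDTH, distance):.3f}")
```

Note the sign convention: a −18.75% change in eye-mouth distance yields the highest fWHR of the five conditions.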
Auditory stimuli were recorded from the same models, who were asked to pronounce four sentences in a neutral tone in Dutch. Two sentences were of neutral content and two sentences were romantic in nature (see Table 1). A romantic context was chosen as a condition because previous research indicates that this type of stimulus influences the trustworthiness of the voice (Schild et al., 2020). By introducing this finding into multimodal research, its hypothesized transmodal effects on visual stimuli can be examined.
Figure 1. Width-to-height ratio (fWHR) manipulations for all models. The center column (0.0%) displays the original photographs for all four models. The first two columns display the increased-fWHR conditions relative to the original (i.e., the distance between the eyes and mouth was decreased by 18.75% and 12.5%), and the two columns to the right of center display the decreased-fWHR conditions (i.e., relative to the original, the distance between the eyes and mouth was increased by 12.5% and 18.75%). Images are used with written consent from all models.

Sentences were recorded with an external microphone attached to the camera (RØDE VideoMicro). The audio recordings were manipulated using the voice-editing software Praat (Boersma and Weenink, 2013). The average F0 of each recording was either increased or decreased by 20 Hz. This resulted in 32 clips in total (4 models × 4 sentences × 2 F0 manipulations). An overview of the mean F0 of each model's voice for each sentence can be found in Table 2.
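As a sanity check on the design, the auditory stimulus set can be enumerated programmatically. The model and sentence labels below are hypothetical placeholders, not identifiers from the study:

```python
from itertools import product

models = ["male_1", "male_2", "female_1", "female_2"]       # hypothetical labels
sentences = ["neutral_1", "neutral_2", "romantic_1", "romantic_2"]
pitch_shifts_hz = [-20, +20]  # F0 lowered or raised by 20 Hz in Praat

# Every model speaks every sentence at both pitch manipulations.
clips = list(product(models, sentences, pitch_shifts_hz))
print(len(clips))  # 4 models x 4 sentences x 2 F0 manipulations = 32 clips
```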

Participants
A total of 26 female participants (mean age: 19.50 years, SD: 1.794) took part in return for participation credits. This study adopted a female sample because previous research showed that the effects of situational content on trustworthiness are most prominent in females (Schild et al., 2020). If a significant cross-modal effect of situational content on trustworthiness exists, it is therefore more likely to be found in females. All participants provided written informed consent prior to testing, and the experiment was conducted in accordance with the Declaration of Helsinki. The experiment was approved by the Tilburg University internal ethical review board (project ID: EC-2016.48).

Procedure
The experiments were conducted in a sound-attenuated booth, and participants were seated in front of a full-HD monitor (BenQ XL 2540-B, 24.5 inch, refresh rate 240 Hz). The audio was presented through Sennheiser HD201 headphones at a comfortable listening volume of 65 dB. Stimuli were presented via OpenSesame (Mathôt, 2012). Participants first judged the trustworthiness of the voices, followed by trustworthiness judgements of the faces, and then trustworthiness judgements of audiovisual (AV) trials. Total testing (including breaks) lasted ∼35 min.

Voice Trustworthiness
After hearing an auditory stimulus, participants were asked "How trustworthy did you experience this voice?" and communicated their judgement by pressing one of the number keys on the keyboard (1 to 9), corresponding to a nine-point Likert scale (1 = not trustworthy at all, 9 = highly trustworthy). For each pitch manipulation (i.e., +20 or −20 Hz relative to the original recording) and model (two male and two female models), there were two sentences in the neutral condition and two in the romantic condition, for a total of 32 trials.

Face Trustworthiness
Participants were shown photographs of the four models on the screen (displayed for 3000 ms) and were again asked to indicate how trustworthy they perceived these photographs, following the same response procedure as in the auditory task. In total, there were 20 trials (5 fWHR ratios × 4 models), shown in blocked fashion (i.e., all fWHR ratios for one model were shown before moving on to the next model).

Audiovisual Trustworthiness
After a short break, the audiovisual part of the experiment was conducted, in which the voices were presented together with the faces. Importantly, participants were asked only about the trustworthiness of the voice in the audiovisual trials. The addition of the faces in the audiovisual trials served as a secondary influence on the trustworthiness of the voice. The edited photographs were paired with the manipulated voice clips and presented for 3000 ms. These trials were shown in a fixed order and blocked per model (i.e., all stimulus pairs of one model were presented before moving on to the next model). This fixed order of trials minimized the number of times that one stimulus (i.e., a fWHR manipulation or audio clip) was presented twice in a row. This task consisted of 160 trials (4 models × 5 fWHR ratios × 4 sentences × 2 pitch manipulations). These trials followed the same response procedure as the voice-only and face-only parts.
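The audiovisual trial list, blocked per model as described above, can be sketched the same way. The labels are again hypothetical, and the specific fixed ordering within a block (chosen to avoid immediate stimulus repeats) is not reproduced here:

```python
from itertools import product

models = ["male_1", "male_2", "female_1", "female_2"]  # hypothetical labels
fwhr_steps = [-18.75, -12.5, 0.0, 12.5, 18.75]         # % eye-mouth distance change
sentences = ["neutral_1", "neutral_2", "romantic_1", "romantic_2"]
pitch_shifts_hz = [-20, +20]

trials = []
for model in models:  # blocked: all pairings for one model before the next
    for fwhr, sentence, pitch in product(fwhr_steps, sentences, pitch_shifts_hz):
        trials.append((model, fwhr, sentence, pitch))

print(len(trials))  # 4 models x 5 fWHRs x 4 sentences x 2 pitches = 160 trials
```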

Data Analysis
Separate analyses were run on the unimodal voice, unimodal face, and audiovisual data. The unimodal trust scores of the voice were analysed in a 2 (Content: Neutral or Romantic) × 2 [Pitch: High (F0 + 20 Hz) or Low (F0 − 20 Hz)] × 2 (Model gender: Male or Female) repeated-measures ANOVA. For the unimodal visual stimuli, a 2 (Model gender: Male or Female) × 5 (fWHR manipulation) repeated-measures ANOVA was conducted. Audiovisual data were analysed using a 2 (Content: Neutral or Romantic) × 5 (fWHR manipulation) × 2 [Pitch: Low (F0 − 20 Hz) or High (F0 + 20 Hz)] × 2 (Speaker gender: Male or Female) repeated-measures ANOVA. We applied Greenhouse-Geisser correction when sphericity was violated, but consistently report the unadjusted degrees of freedom. Significant results of the multivariate tests were further examined using paired t-tests.

The repeated-measures ANOVA on the unimodal auditory data showed a significant main effect of content, F(1,25) = 22.509, p < 0.001, ηp² = 0.474: averaged across pitch and model gender, neutral sentences were rated as more trustworthy (M = 6.47) than sentences with romantic content (M = 5.97). The significant main effect of pitch, F(1,25) = 7.876, p = 0.010, ηp² = 0.240, indicated that, averaged across content and model gender, sentences with a low pitch were rated as less trustworthy (M = 6.03) than sentences with a high pitch (M = 6.41). There was no significant main effect of model gender, F(1,25) = 0.010, p = 0.920, ηp² = 0.000, no significant content × model gender interaction, F(1,25) = 1.928, p = 0.177, ηp² = 0.072, and no significant content × pitch × model gender interaction, F(1,25) = 1.248, p = 0.274, ηp² = 0.048. However, the data did indicate a significant content × pitch interaction, F(1,25) = 15.182, p < 0.001, ηp² = 0.303: lower-pitched voices were perceived as especially untrustworthy in romantic settings (M = 5.63) as compared with neutral settings (M = 6.43), regardless of speaker gender. The pitch × model gender interaction approached significance, F(1,25) = 4.217, p = 0.051, ηp² = 0.144, mainly because pitch had a larger effect in male than in female voices (i.e., low-pitched voices were judged as less trustworthy, especially so in male voices). Due to the modest sample size, however, this effect should be interpreted with caution. For male voices, the difference between low- and high-pitch sentences (M = 0.57) was, at least numerically, larger than for female voices (M = 0.20).
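As a consistency check, partial eta squared follows directly from an F value and its degrees of freedom via the standard relation ηp² = F·df_effect / (F·df_effect + df_error); a quick sketch for the two main effects above:

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic and its dfs."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# Main effects from the unimodal auditory ANOVA, both with df = (1, 25):
print(f"content: {partial_eta_squared(22.509, 1, 25):.3f}")  # 0.474
print(f"pitch:   {partial_eta_squared(7.876, 1, 25):.3f}")   # 0.240
```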

Discussion
Female listeners can readily judge whether the voice of a person is trustworthy or not, but it remains unclear which factors contribute to trust in the voice. Here, we examined in an experimental design whether trust in the voice is affected by the pitch of the voice, the gender of the speaker, the sight of a simultaneously presented face whose fWHR was systematically varied, and the content of a sentence (neutral or romantic). We found that each of these factors contributed to trust judgements of the voice. Most importantly, in audiovisual trials we found that the trustworthiness of a face, as manipulated by fWHR, systematically affected judgements of trust in the voice, demonstrating the multimodal nature of trust in voice processing. Furthermore, female faces were generally rated as more trustworthy than male faces, possibly because of a same-sex effect in which females trust females more than males. Sentences with neutral content were also rated as more trustworthy than sentences with romantic content. As a result, a low-pitched male voice speaking romantic sentences was judged least trustworthy when combined with a face with a high fWHR. Perception of trust in the voice is thus a multimodal phenomenon. In the following, we expand on this general conclusion.

Mileva et al. (2018) indicated that trust judgements are influenced more by visual stimuli than by vocal stimuli. We have expanded on this by isolating the fWHR from the face, and vocal pitch and sentence content from the voice. Judging from the effect sizes in the audiovisual task, we find that visual stimuli lead in the assimilation of trust judgements (ηp² = 0.511 for the fWHR, as compared to ηp² = 0.296 for vocal pitch and ηp² = 0.403 for sentence content). Indeed, more variance is associated with the visual fWHR than with auditory vocal pitch or content independently. This is, however, inconsistent with Rezlescu et al. (2015), who found that the voice and face hold equal weight in multimodal trustworthiness judgements. Rezlescu et al. (2015) also found that the effect of trustworthy voices is stronger for trustworthy faces. The present audiovisual data do not indicate a significant interaction between fWHR and vocal pitch, and such an effect could thus not be replicated.
Further, we find that the pattern displayed by the unimodal visual data is echoed in the multimodal data. A higher fWHR is perceived as more trustworthy than a lower fWHR in the unimodal visual data as well as in the neutral and romantic audiovisual data. This similarity in trend across tasks demonstrates that the trustworthiness of a voice is cross-modally shaped by the accompanying face.
In line with previous research (O'Connor and Barclay, 2017; Schild et al., 2020, 2021), the results indicate that content matters: romantic sentences were rated as less trustworthy than neutral sentences by female listeners across modalities. Contrary to Mileva et al. (2018) but in line with Levitan et al. (2018), O'Connor and Barclay (2017), and Schild et al. (2020), vocal pitch showed a significant influence in auditory trials. We add that this effect extends into audiovisual trust perception. Differences in methodology may be responsible for finding a significant effect of vocal pitch: where Mileva et al. (2018) asked participants about the trustworthiness of the person, this study specifically inquired about the person's voice. Further, the current study tested a different population and presented its stimuli in a blocked fashion, which may have influenced results. Pitch also interacted with content in auditory trials: voices with a lower F0 in a romantic context were perceived as least trustworthy. Interestingly, this interaction does not extend into the audiovisual modality. Given that the auditory stimuli were the same in auditory and audiovisual trials, it follows that this disappearance of the effect is due to multimodality. Instead of interacting with vocal pitch, the content of the sentence now interacts with the gender of the speaker as presented by the visual stimulus. There, indeed, romantic sentences presented alongside male faces were experienced as least trustworthy. Female listeners appear to be more cautious when romantic sentences are spoken by male rather than female speakers. This disappearance may also be attributed to a lack of power, since visual inspection suggests a larger effect of pitch in romantic situations than in neutral situations. Future research is encouraged to take note of the suggested existence of this effect. Altogether, this suggests that the interaction effects produced by the tested variables depend on whether the information is presented unimodally or audiovisually.
The main effect of the speaker's gender found in the unimodal visual data, in which males were rated as less trustworthy than females, was only partly retained in the audiovisual data (in romantic sentences) and was not found in the unimodal auditory data. It may be that the face holds few cues to trustworthiness beyond its gender (although the fWHR was a significant predictor), and viewers are more quickly inclined to stereotype the target stimulus. It is then interesting that this effect disappears when these same visual stimuli were tested with an accompanying voice. One would expect this same cognition to follow, with gender remaining a reliable cue to gauge the trustworthiness of a target, but the results indicate that when tested audiovisually, the gender of the speaker was no longer an important cue to trustworthiness. It is possible that in this audiovisual paradigm it is easier for participants to ignore the visual stimuli. Still, there exists a significant interaction between gender and situational context in the audiovisual data, in which romantic sentences spoken by male speakers were perceived as particularly untrustworthy in comparison to female speakers. Here, it may be that the effect of gender becomes modulated by the situational content without being important by itself.
Interestingly, we found that a larger fWHR was associated with higher trustworthiness, while previous research suggests that such ratios are experienced as more dominant and aggressive (Merlhiot et al., 2021); in our data, the ratios found more dominant elsewhere were thus also judged more trustworthy. This is not entirely consistent with previous literature suggesting that more typical faces are perceived as more trustworthy (Sofer et al., 2015), as the present visual data show a peak at the −12.5% manipulation rather than at the original faces. To our knowledge, the fWHR has not yet been investigated in the context of face typicality, and future research is encouraged to investigate its implications.

Limitations
A number of limitations may influence the results of this study. Firstly, the homogeneous nature of its participant pool does not allow for generalizability to the entire population. It is important to reiterate that a female-only sample was used in this study. Previous research suggests that vocal pitch serves as an indicator of relationship fidelity to female listeners (Schild et al., 2020), which was important to consider when testing the influence of situational context. Additionally, the age demographic falls short of being representative of the entire population because the subject pool consisted of psychology students. An optimal sample size was also not obtained due to practical difficulties, so effects with a p-value close to 0.05 ought to be interpreted with caution. Still, our results showed only one such marginally significant effect (namely the pitch × gender effect in auditory settings); the other effects were highly significant and can be interpreted with more confidence. Future research may investigate whether the effects found in this study persist across genders and ages.
Secondly, trust was not manipulated in the semantic sense as it was in previous research, which allowed the situational context of the spoken sentence to shape its definition: neutral sentences inquire about general trustworthiness and romantically charged sentences about mating-related trustworthiness. As this situational marker was missing in the unimodal visual trials, the definition of trust there was left to be determined by the participants. Trust could therefore have entailed different traits to different participants, which may have made our findings less precise.
Lastly, previous research has outlined how accent and prosody profiles influence the trustworthiness of the voice (Jiang and Pell, 2018; Jiang et al., 2018, 2020). The models in this study spoke with a Southern-Dutch accent, and our results therefore cannot predict how cross-modal trustworthiness is influenced by voices with other accents or prosodic profiles, or by other languages. Future research is encouraged to take note of these nuances.

Conclusion
In conclusion, the assimilation of multimodal trustworthiness judgements seems more complex than an addition or averaging of unimodal visual and auditory cues. We show that the manner of assimilation of trust judgements depends on the modality of the provided information and its situational context. When presented with a face, female viewers use gender as a cue to trustworthiness; however, this cognition disappears when the face is accompanied by a voice. Furthermore, we confirm that while the voice and face are reliable predictors of trustworthiness, the situational content of the spoken sentence holds considerable weight in these judgements. Finally, we show that the trustworthiness placed on the voice is largely influenced by the accompanying face, even when participants were asked only about the voice. Adding to the existing literature, we propose that the previously found effects of vocal pitch, fWHR, and situational context on perceived trustworthiness persist into the audiovisual modality.

Figure 2. Unimodal auditory data for neutral and romantic sentences.

Figure 3. Trust scores on unimodally presented photographs. The y-axis represents the group-averaged trust scores for male and female models at each of the five width-to-height ratios (fWHR) of the face. Error bars represent one standard error of the mean.

Figures 4 and 5 visualize the audiovisual data, where participants judged the voice while trying to ignore the face. The repeated-measures ANOVA on the audiovisual trust scores indicated a significant main effect of sentence content, F(1,25) = 16.886, p < 0.001, ηp² = 0.403, indicating that romantic sentences (M = 5.83) were, in general, judged as less trustworthy than neutral sentences (M = 6.35).

Figure 4. Trust scores on neutral audiovisual sentences. The y-axis represents the group-averaged trust scores for male (blue lines) and female (red lines) models at each of the five width-to-height ratios (fWHR) of the face, at low (dotted lines) and high (plain lines) pitch. Error bars represent one standard error of the mean.

Figure 5. Trust scores on romantic audiovisual sentences. The y-axis represents the group-averaged trust scores for male (blue lines) and female (red lines) models at each of the five width-to-height ratios (fWHR) of the face, at low (dotted lines) and high (plain lines) pitch. Error bars represent one standard error of the mean.

Table 1. Sentences with their closest English translation and substantive context.

Table 2. Mean F0 (in Hz) for each model and sentence. The average F0 is displayed for the high- and low-pitch conditions for each of the four models (model gender is indicated in parentheses).