Explaining the Effect of Likelihood Manipulation and Prior through a Neural Network of the Audiovisual Perception of Space

Results in the recent literature suggest that multisensory integration in the brain follows the rules of Bayesian inference. However, how neural circuits can realize such inference, and how it can be learned from experience, is still the subject of active research. The aim of this work is to use a recent neurocomputational model to investigate how the likelihood and prior can be encoded in synapses, and how they affect audio-visual perception, in a variety of conditions characterized by different experience, different cue reliabilities and temporal asynchrony. The model considers two unisensory networks (auditory and visual) with plastic receptive fields and plastic crossmodal synapses, trained during a learning period. During training, visual and auditory stimuli are more frequent and more finely tuned close to the fovea. Model simulations after training have been performed in crossmodal conditions to assess the auditory and visual perception bias: visual stimuli were positioned at different azimuths (±10° from the fovea) and coupled with an auditory stimulus at various audio-visual distances (±20°). The cue reliability has been altered by using visual stimuli with two different contrast levels. Model predictions agree with behavioral data in a variety of conditions characterized by a different role of prior and likelihood. Finally, the effects of a different unimodal or crossmodal prior, re-learning, temporal correlation among input stimuli, and visual damage (hemianopia) are tested, to reveal the possible use of the model in the clarification of important multisensory problems.
Supplementary Material
A. Mathematical Description of the Neural Network

Basal Structure of the Network
The neural network model consists of two chains of N unisensory neurons (Fig. 1, upper panel). Each neuron codes for a particular spatial position in its modality. Moreover, each chain is topologically organized, i.e., proximal neurons code for proximal positions. In the following, we will denote with a first subscript the particular area (auditory or visual) and with a second subscript, after a comma, the neuron position within the area.
Each neuron receives three different kinds of inputs: a sensory input from the environment (say u), a lateral input from neurons of the same modality (say l), and a cross-modal input from neurons of the other modality (say c). The global input (equal to the sum of the previous three contributions) is then passed through a sigmoidal relationship, Φ(·), which accounts for the presence of a lower threshold and upper saturation in neuron activity, and a first-order low-pass filter with time constant τ, which accounts for the neuron integrative capacity.
Hence, for the generic k-th neuron in the modality S (S = A or V for the auditory and visual modalities, respectively) we can write

τ dz_{S,k}(t)/dt = −z_{S,k}(t) + Φ(u_{S,k}(t) + l_{S,k}(t) + c_{S,k}(t))    (1)

where z_{S,k} represents the neuron output, and the sigmoidal relationship is described by the following equation

Φ(x) = 1 / (1 + e^{−s(x − x_0)})    (2)

where s and x_0 are parameters, which set the slope and the position of the sigmoidal relationship. According to Eq. (2), the neuron output activity is normalized between 0 and 1 (zero means a silent neuron, one a maximally activated neuron).
It is worth noting that, for the sake of simplicity, we used the same parameters (τ, s and x_0) for all neurons independently of their modality. This choice was adopted to minimize the number of model assumptions.
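The dynamics of Eqs. (1)-(2) can be sketched as a simple Euler integration. This is a minimal illustration, not the paper's simulation code; the parameter values (τ, s, x_0, the time step, and the number of neurons) are placeholders chosen for readability.

```python
import numpy as np

# Illustrative parameters (assumed, not the paper's values)
tau, s, x0, dt = 3.0, 0.3, 20.0, 0.1  # time constant, slope, center, Euler step
N = 180                                # assumed number of neurons per area

def phi(x):
    """Sigmoidal activation of Eq. (2), normalized between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-s * (x - x0)))

def euler_step(z, u, l, c):
    """One Euler step of Eq. (1): tau*dz/dt = -z + phi(u + l + c)."""
    return z + dt / tau * (-z + phi(u + l + c))

# Relax to steady state under a constant sensory input, with no
# lateral or cross-modal contribution (a degenerate but clear case).
z = np.zeros(N)
for _ in range(500):
    z = euler_step(z, u=25.0, l=0.0, c=0.0)
```

At steady state the output converges to Φ(u + l + c), which is the value used in the learning rules (see "Training the network").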
The expression for the sensory input is computed as the scalar product of the sensory representation of the stimulus (i.e., the vector I_S) and the neuron receptive field (i.e., the vector R_{S,k}):

u_{S,k}(t) = R_{S,k} · I_S(t) = Σ_j r_{S,kj} I_{S,j}(t)

We assumed that the neuron receptive field, R_{S,k}, has initially a large extension, described with a Gaussian function, and then progressively shrinks during training, to fit the width of the external input (see section "Training the network").
The lateral input is computed as follows

l_{S,k}(t) = Σ_j L_{S,kj} z_{S,j}(t)

where L_{S,kj} represents a lateral intra-area synapse connecting the presynaptic neuron j to the postsynaptic neuron k in the same area. Here we used the classical Mexican-hat arrangement: a neuron is excited by proximal neurons in the same area, and inhibited by more distal ones. In particular, we have

L_{S,kj} = L_{ex} e^{−d_{kj}²/(2σ_{ex}²)} − L_{in} e^{−d_{kj}²/(2σ_{in}²)}    (S5)

where L_{ex}, L_{in}, σ_{ex} and σ_{in} are parameters which set the strength and width of the excitatory and inhibitory portions of the Mexican hat, and d_{kj} represents the distance between neurons' preferred positions, i.e., d_{kj} = |k − j|.
It is worth noting that we used the same expression of lateral synapses (Eq. (S5)) in both the auditory and visual areas, to limit the number of model assumptions.
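The Mexican-hat connectivity of Eq. (S5) can be sketched as follows. The constants (L_ex, L_in, σ_ex, σ_in, and the number of neurons) are illustrative placeholders, not the values used in the paper.

```python
import numpy as np

# Illustrative Mexican-hat lateral connectivity (Eq. (S5)).
N = 180
L_ex, L_in = 3.0, 2.2        # excitatory/inhibitory strengths (assumed)
sig_ex, sig_in = 2.0, 8.0    # excitatory/inhibitory widths (assumed)

k = np.arange(N)
d = np.abs(k[:, None] - k[None, :])   # distance between preferred positions
L = (L_ex * np.exp(-d**2 / (2 * sig_ex**2))
     - L_in * np.exp(-d**2 / (2 * sig_in**2)))

# Lateral input produced by a single active neuron at position 90:
z = np.zeros(N)
z[90] = 1.0
lateral_input = L @ z                 # l_k = sum_j L_kj * z_j
```

With σ_in > σ_ex, nearby neurons receive net excitation while more distal ones receive net inhibition, which is what produces a single focused activation bubble in each unisensory area.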
Finally, the cross-modal term in Eq. (1) is computed as the convolution of the vector of cross-modal synapses and the activity in the other unisensory area, i.e.

c_{S,k}(t) = Σ_j w_{SQ,kj} z_{Q,j}(t)

where w_{SQ,kj} represents a cross-modal synapse from the presynaptic neuron j in the area Q to the postsynaptic neuron k in the area S. We assumed that the cross-modal synapses are initially ineffective and are progressively reinforced during the training phase.

Training the Network
Starting from the initial basal value of the synapses, the network has been trained during a training period in which the sensory input representations (i.e., I_A and I_V) have been presented at positions drawn from a random distribution.

The synapses describing the receptive field, r_{S,kj}, and those describing the cross-modal link between the two areas, w_{SQ,kj}, have been trained using a learning rule with a classical Hebbian potentiation factor and a decay term. We can write, in scalar form,

Δr_{S,kj} = γ_R z_{S,k} I_{S,j} − β_R r_{S,kj}    (8)

Δw_{SQ,kj} = γ_W z_{S,k} z_{Q,j} − β_W w_{SQ,kj}    (9)

where γ_R, γ_W set the learning rates and β_R, β_W the decay rates. Eqs. (8) and (9) have been applied, at each step, using the final steady-state values of the neuron output (i.e., when transient phenomena are exhausted).
At the beginning of training all cross-modal synapses are assumed equal to zero. Conversely, the receptive-field synapses have a broad spatial extension and moderate amplitude, identical for the two modalities, i.e.,

r_{S,kj} = r_0 e^{−d_{kj}²/(2σ_R²)}    (10)

where r_0 sets the initial strength of the receptive field, and σ_R establishes its initial spatial extension (σ_R is large, i.e., a wide initial receptive field). Of course, Eq. (10) holds only at the first step of training.
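The initialization and one training step can be sketched as below. This is a hedged illustration: the learning and decay rates, the neuron count, and the exact scalar form of the decay term are assumptions, since the text gives the rules only as "Hebbian potentiation plus decay".

```python
import numpy as np

N = 180
r0, sigma_R = 0.5, 20.0      # initial RF strength and width (assumed values)
gamma, beta = 0.02, 0.005    # learning and decay rates (assumed values)

idx = np.arange(N)
d = np.abs(idx[:, None] - idx[None, :])
r = r0 * np.exp(-d**2 / (2 * sigma_R**2))   # Eq. (10): wide initial receptive fields
w = np.zeros((N, N))                         # cross-modal synapses start ineffective

def train_step(r, w, I_S, z_S, z_Q):
    """Hebbian potentiation with decay (Eqs. (8)-(9)), at steady-state activity."""
    r = r + gamma * np.outer(z_S, I_S) - beta * r
    w = w + gamma * np.outer(z_S, z_Q) - beta * w
    return r, w

# One step with coincident auditory and visual activity at position 90:
z_S = np.zeros(N); z_S[90] = 1.0
z_Q = np.zeros(N); z_Q[90] = 1.0
I_S = np.zeros(N); I_S[90] = 1.0
r, w = train_step(r, w, I_S, z_S, z_Q)
```

Repeated coincident stimulation reinforces the cross-modal synapse between co-active neurons, while the decay term keeps unused synapses weak and shrinks the receptive fields toward the width of the external input.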

Probability Distribution and Spatial Accuracy of the Inputs
According to the previous section, we assumed that the sensory inputs are composed of a deterministic term, which represents the spatial distribution of the input, centered on the stimulus spatial position, and a Gaussian white noise term (zero mean value and assigned standard deviation):

I_{S,j}(t) = E_S e^{−(x_j − ψ_S)²/(2σ_S²)} + n_{S,j}(t)    (S11)

where E_S is the input strength, ψ_S the stimulus position, and σ_S is the standard deviation of the spatial representation. According to physiology, we assumed that the visual inputs are spatially more accurate than the auditory ones (σ_V < σ_A). Conversely, we assumed that the standard deviation of the noise is a given fraction of the input strength, to set the signal-to-noise ratio (see Table 1 in the text).
In order to simulate the presence of better acuity at the center, and reduced acuity at the periphery, we assumed that the SDs of the visual and auditory inputs increase with the eccentricity of the stimulus.
The expression of σ_V has been taken from an empirical curve on visual acuity by Dacey (1993) (see also Ursino et al., 2017 for more details). The auditory acuity also decreases from the center to the periphery, although this effect is difficult to quantify, being influenced by many factors, such as the stimulus intensity and frequency (Middlebrooks and Green, 1991; Wood and Bizley, 2015). Moreover, this effect is less evident and of smaller entity compared with the visual one (Perrott and Saberi, 1990). Hence, we used a simpler relationship, assuming that σ_A increases linearly with the eccentricity of the stimulus. The positions of the two stimuli (i.e., ψ_A and ψ_V in Eq. (S11)) have been randomly generated from the prior probability distribution described below.
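The input generation of Eq. (S11) can be sketched as follows. The linear acuity growth and every numerical value below are placeholders for illustration, not the empirical curves (Dacey, 1993) or the parameters of Table 1.

```python
import numpy as np

N = 180
x = np.arange(N) - N // 2                # preferred positions (deg), fovea at 0

def make_input(psi, E0, sigma0, slope, noise_frac, rng):
    """Gaussian spatial bump (Eq. (S11)) with eccentricity-dependent width
    plus white noise whose SD is a fraction of the input strength."""
    sigma = sigma0 * (1.0 + slope * abs(psi))       # acuity degrades off-center
    bump = E0 * np.exp(-(x - psi)**2 / (2 * sigma**2))
    return bump + rng.normal(0.0, noise_frac * E0, N)

rng = np.random.default_rng(0)
# Visual input: sharper (smaller sigma0); auditory input: broader.
I_V = make_input(psi=10.0, E0=18.0, sigma0=1.0, slope=0.05, noise_frac=0.1, rng=rng)
I_A = make_input(psi=26.0, E0=36.0, sigma0=4.0, slope=0.05, noise_frac=0.1, rng=rng)
```

Scaling the noise SD with the input strength keeps the signal-to-noise ratio fixed across contrast levels, so that changing E_S manipulates cue reliability in a controlled way.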
We assume that both the visual and the auditory input have a greater probability close to the fovea, and a smaller probability at the periphery. This corresponds to having a non-uniform prior in unisensory conditions. The following probabilities have been used to generate the positions of the visual and auditory inputs during training.
Visual unisensory prior: the visual position follows a Gaussian distribution, centered at the fovea,

p_V(ψ_V) = (1/(√(2π) s_V)) e^{−ψ_V²/(2 s_V²)}    (S14)

The standard deviation s_V (which here plays the role of a space constant) has been set at 7 deg; i.e., the visual stimuli become very rare at ±20 deg eccentricity.
Auditory unisensory prior: the auditory position also follows a Gaussian distribution centered at the fovea, with the same form as Eq. (S14) (Eq. (S15)). Its standard deviation is assumed higher than in the visual case: we have s_A = 30 deg, assuming that head movements in auditory unimodal conditions are less efficient than eye movements in visual unimodal conditions to maintain the stimulus close to the center.
Cross-modal prior: in the cross-modal case during training, we assumed that the visual and auditory inputs originate from independent causes with a given probability (say P_ind) but are produced by the same cause, hence originate from proximal spatial positions, with the complementary probability (1 − P_ind). According to the Bayes rule, the joint prior probability can be computed from knowledge of the individual probability of one stimulus, and the conditional probability of the other.
A problem is whether, in cross-modal conditions, the distribution is dominated by the visual prior (more sharply peaked close to the center) or by the auditory one (less sharply peaked close to the center). We assumed that, in 50% of cases, the cross-modal stimuli follow the visual distribution and in the other 50% of cases they follow the auditory one. Hence

p(ψ_A, ψ_V) = ½ p_V(ψ_V) p(ψ_A|ψ_V) + ½ p_A(ψ_A) p(ψ_V|ψ_A)    (S16)

where we used equations (S14) and (S15) for the visual and auditory priors, and the following expression for the conditional probability

p(ψ_A|ψ_V) = P_ind p_A(ψ_A) + (1 − P_ind) (1/(√(2π) s_{AV})) e^{−(ψ_A − ψ_V)²/(2 s_{AV}²)}    (S17)

(and symmetrically for p(ψ_V|ψ_A), where P_ind is the probability that the two stimuli are independent). In writing Eq. (S17) we assumed that the conditional probability is computed as the weighted sum of the prior unimodal distribution, reflecting the moderate possibility that the two stimuli are independent, and a second Gaussian term centered on the position of the other stimulus, reflecting the probability that the auditory and visual events originated from the same source.
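Sampling training positions from this cross-modal prior can be sketched as below: the first stimulus follows the visual or auditory unimodal prior with equal probability, and the second is placed either independently or close to the first. The name P_ind for the independence probability and its numerical value are assumptions; s_V, s_A and s_AV are the values stated in the text.

```python
import numpy as np

s_V, s_A, s_AV = 7.0, 30.0, 1.0   # space constants from the text (deg)
P_ind = 0.2                        # probability of independent causes (assumed value)

def sample_pair(rng):
    """Draw one (first, second) stimulus-position pair from the mixture prior."""
    if rng.random() < 0.5:                       # follow the visual prior
        first, s_other = rng.normal(0.0, s_V), s_A
    else:                                        # follow the auditory prior
        first, s_other = rng.normal(0.0, s_A), s_V
    if rng.random() < P_ind:                     # independent causes
        second = rng.normal(0.0, s_other)
    else:                                        # common cause: proximal positions
        second = first + rng.normal(0.0, s_AV)
    return first, second

rng = np.random.default_rng(1)
pairs = np.array([sample_pair(rng) for _ in range(10000)])
```

With s_AV = 1 deg, most sampled pairs are nearly coincident, which is what lets the Hebbian rule reinforce cross-modal synapses between spatially aligned neurons during training.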
As in the previous work, we used a value of the space constant s_AV = 1 deg, assuming a small audio-visual distance when the two stimuli originate from the same source.

Figure S6. The lesioned network (90% of damaged neurons in the right visual hemifield) was used to replicate an experiment similar to that performed in hemianopic patients (Leo et al., 2008). Simulated results are shown in the upper plots and in vivo data are redrawn in the lower plots. In the network, a visual stimulus was applied either at -10° (intact hemifield) or at +10° (lesioned hemifield) and paired with an auditory stimulus applied at the same spatial position (SP) or at 16° and 32° of spatial disparity (DP16, DP32). The auditory stimuli were presented in unimodal conditions (A), too. The simulations were performed using a visual stimulus with strength 18, an auditory stimulus with strength 36, in noisy conditions (average values are displayed). Plots in the left column show the absolute localization error (absolute difference between the perceived auditory location and the real auditory location) computed in each condition (A, SP, DP16 and DP32) separately for the visual stimulus in the intact and damaged hemifield. Plots in the right column show the percentage of auditory bias [100·(perceived auditory location − real auditory location)/(actual visual-auditory disparity)] in the DP16 and DP32 conditions (collapsed together) for the visual stimulus in the intact and damaged hemifield. According to the network (upper plots), a visual stimulus in the intact hemifield slightly reduces the auditory localization error in the SP condition and strongly increases auditory mislocalization in the DP conditions, producing a high ventriloquism effect; conversely, a visual stimulus in the lesioned hemifield has only a weak impact on the auditory localization error, and the ventriloquism effect radically declines. These network outcomes display good agreement with the in vivo data (lower plots).