Motor-based prediction mediates implicit vocal imitation
by Yuchunzi Wu (a, b, *), Zhili Han (c), Xing Tian (a, b, d, *)
(a) Division of Arts and Sciences, New York University Shanghai, Shanghai, China
(b) NYU-ECNU Institute of Brain and Cognitive Science at NYU Shanghai, Shanghai, China
(c) NingboTech University, Ningbo, Zhejiang, China
(d) Shanghai Key Laboratory of Brain Functional Genomics (Ministry of Education), School of Psychology and Cognitive Science, East China Normal University, Shanghai, China
(*) Correspondence to: Division of Arts and Sciences, New York University Shanghai, Shanghai, China. E-mail addresses: yw2062@nyu.edu (Y. Wu), xing.tian@nyu.edu (X. Tian).
This user research article summarizes the publication: Wu, Y., Han, Z., & Tian, X. (2025). Motor-based prediction mediates implicit vocal imitation. NeuroImage, 310, 121169. https://doi.org/10.1016/j.neuroimage.2025.121169.
Introduction
Phonetic convergence—the unconscious adaptation of one’s speech to resemble the vocal characteristics of an interlocutor—is a fundamental human behaviour that plays an important role in fostering social cohesion and communication efficiency (Pardo et al., 2017). What drives this automatic vocal mimicry? Current theories point to the brain’s predictive mechanisms. When we listen to someone speak, we actively anticipate their sounds using two types of internal signals: memory-based predictions, which capture the speaker’s unique vocal identity, and motor-based predictions, which originate from our own vocal production system and reflect our own voice characteristics (Gambi & Pickering, 2013). Discrepancies between these predictions and the actual incoming speech are thought to drive us to gradually adjust our voice toward the speaker’s.
While memory-based predictions are relatively well understood, it remains unclear whether motor-based predictions suppress or enhance the brain’s sensitivity to acoustic features matching the listener’s own voice. To answer this, we designed a novel EEG-based speaking oddball task, recruiting male participants to listen and respond to a female speaker’s voice. This pitch difference allowed us to cleanly separate the two prediction types: memory-based predictions would reflect the speaker’s higher female pitch, while motor-based predictions would reflect the listeners’ own lower male pitch. Participants were divided into a shadow group, who completed an extended shadowing task to promote vocal convergence, and a non-shadow group, who did not. Using mismatch negativity (MMN), a well-established EEG marker of the brain’s automatic detection of sound deviations (Näätänen et al., 2005), we tested three hypotheses (see Figure 1): that motor-based predictions would (1) have no additional effect beyond memory-based ones, (2) suppress the brain’s sensitivity to listener-matched sounds, or (3) enhance it, with enhancement potentially guiding perceptual learning and vocal adjustment.
Figure 1.
Theoretical model and experimental framework.
(A) Model of predictive processing where a speaker’s utterance triggers a listener’s covert motor imitation and forward-model predictions. Mismatches between these predictions and subsequent input drive phonetic convergence.
(B) Experimental timeline showing behavioural and EEG tasks for the shadow and non-shadow groups.
(C) Overview of EEG tasks: the oddball task (80% standard, 10% each deviant) and control tasks (equal stimulus probability). Participants repeated words in the oddball and speaking control tasks but counted covertly in the counting control task.
(D) Derivation of the corrected MMN index by using the speaking control task as an acoustic baseline.
(E) Testing hypotheses: Memory-based predictions alone would yield comparable MMN responses for both deviants. Motor-based predictions would either reduce sensitivity (smaller MMN for listener-matched low deviant) or enhance sensitivity (larger MMN for low deviant) to predicted acoustic features.
Methods
Participants
Sixty-two native Mandarin-speaking male participants were recruited. Fourteen were excluded because of poor audio recordings or excessive noise in the EEG data, leaving a final sample of 48 participants. These participants were divided equally into two groups of 24: the shadow group (age: M = 22.13, SD = 2.13) and the non-shadow group (age: M = 21.96, SD = 2.39). The study was conducted in accordance with the Declaration of Helsinki and received approval from the institutional review board at New York University Shanghai.
Experimental Design and Material
The experimental stimuli consisted of 40 disyllabic pseudo-Japanese words (e.g., mewa). These words were recorded by a female native Mandarin speaker proficient in Japanese pronunciation. The average pitch of these words was adjusted to 210 Hz, representative of a typical female voice. For the EEG sessions, these standard words were further pitch-shifted to create low (150 Hz) and high (270 Hz) deviant stimuli.
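The two deviants sit 60 Hz below and above the 210 Hz standard, but pitch is usually compared on a logarithmic scale. As a rough illustration only, the sketch below converts the reported frequencies into semitone distances from the standard. The semitone conversion and the helper name `semitone_shift` are assumptions introduced here, not part of the reported stimulus pipeline.

```python
import math

STANDARD_HZ = 210.0                          # mean pitch of the standard words (from the text)
DEVIANT_HZ = {"low": 150.0, "high": 270.0}   # deviant pitches (from the text)


def semitone_shift(target_hz: float, reference_hz: float) -> float:
    """Signed distance from reference to target in semitones (12 per octave)."""
    return 12.0 * math.log2(target_hz / reference_hz)


# Hypothetical helper output: distance of each deviant from the standard.
shifts = {name: semitone_shift(hz, STANDARD_HZ) for name, hz in DEVIANT_HZ.items()}
for name, value in shifts.items():
    print(f"{name} deviant: {value:+.2f} semitones relative to the standard")
```

On this scale the low deviant lies somewhat further from the standard than the high deviant, which is one reason a control task with equal stimulus probabilities is useful as an acoustic baseline.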
The study utilized a comprehensive seven-task structure (Figure 1) to track the progression of phonetic convergence:
1. Learning Task: Participants familiarized themselves with novel Japanese vowel pronunciations by listening to syllable recordings.
2. Pre-record Task: Participants read the pseudo-words aloud to establish a baseline for their natural vocal characteristics.
3. Shadowing Task: Participants in the shadow group repeated the words immediately after hearing the female speaker, providing exposure to her voice.
4. Post-record Task: Participants read the words again to assess the persistence and generalization of vocal changes.
5. Speaking Oddball Task (EEG): Participants listened to a sequence of words (80% standard, 10% low deviant, 10% high deviant) while preparing to repeat the standard word when randomly cued.
6. Speaking Control Task (EEG): Similar to the oddball task, but with standard and deviant stimuli presented at equal frequency (33.3% each) to control for pitch-related acoustic variations.
7. Counting Control Task (EEG): Participants covertly counted the stimuli to determine if neural effects persisted when prepared and heard speech were irrelevant.
The Speaking Oddball design was particularly novel, as it mimicked the sensorimotor dynamics of traditional shadowing while allowing us to use the mismatch negativity (MMN) as a neural index of how internal sound representations are formed.
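The stimulus proportions of the oddball and control tasks can be sketched as a simple sequence generator. This is a minimal illustration, assuming plain random shuffling; the trial counts, the function name `make_sequence`, and the absence of ordering constraints (for example, on consecutive deviants) are assumptions, since the summary does not report them.

```python
import random


def make_sequence(n_trials, p_low=0.10, p_high=0.10, seed=None):
    """Return a shuffled list of 'standard', 'low' and 'high' trial labels."""
    n_low = round(n_trials * p_low)
    n_high = round(n_trials * p_high)
    n_standard = n_trials - n_low - n_high
    sequence = ["standard"] * n_standard + ["low"] * n_low + ["high"] * n_high
    random.Random(seed).shuffle(sequence)  # seeded for reproducibility
    return sequence


# Oddball task: 80% standard, 10% low deviant, 10% high deviant.
# n_trials=500 is a hypothetical value chosen for illustration.
oddball = make_sequence(n_trials=500, seed=0)
counts = {label: oddball.count(label) for label in ("standard", "low", "high")}
print(counts)

# Control tasks: the three stimuli are equally probable.
control = make_sequence(n_trials=300, p_low=1 / 3, p_high=1 / 3, seed=1)
```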
Acoustic Data Processing and Analysis
Vocal responses were captured using a SHURE SM58 microphone and an MP13 Mini-Mic preamplifier, with recordings processed at a sampling rate of 44.1 kHz. To evaluate the extent of phonetic convergence, we analysed two primary dependent variables: MFCC dissimilarity and pitch difference. We calculated the first 13 Mel-frequency cepstral coefficients (MFCCs) for both the participants’ and the speaker’s utterances to capture a broad array of acoustic features. Dynamic time warping was then applied to align these sequences, effectively accounting for temporal variations between the participants and the model speaker. We utilized cosine distances between the aligned MFCCs to measure spectral similarity, where lower values indicated greater vocal convergence. For the pitch analysis, fundamental frequency was extracted using the ProsodyPro script in Praat (Xu, 2013) and compared to the speaker’s average pitch of 210 Hz. Statistical analysis was performed using linear mixed-effects regression models (LMERs) in R.
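The alignment-and-distance step described above can be sketched in a few lines. The code below is an illustrative approximation, not the authors' implementation: it assumes MFCC frames have already been extracted (toy 3-dimensional vectors stand in for the 13-coefficient frames), uses cosine distance as the local cost inside dynamic time warping, and normalises by path length, which is an assumption made here.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two feature frames."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # convention chosen here for all-zero frames
    return 1.0 - dot / (norm_u * norm_v)


def dtw_dissimilarity(seq_a, seq_b):
    """Mean cosine distance along the lowest-cost DTW alignment path."""
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]   # accumulated cost
    steps = [[0] * (m + 1) for _ in range(n + 1)]    # path length so far
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(seq_a[i - 1], seq_b[j - 1])
            # Best predecessor: insertion, deletion, or match.
            prev_cost, prev_steps = min(
                (cost[i - 1][j], steps[i - 1][j]),
                (cost[i][j - 1], steps[i][j - 1]),
                (cost[i - 1][j - 1], steps[i - 1][j - 1]),
            )
            cost[i][j] = prev_cost + d
            steps[i][j] = prev_steps + 1
    return cost[n][m] / steps[n][m]


# Toy frames (hypothetical data): same content, different speaking rate.
speaker = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
slower = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
different = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
print(dtw_dissimilarity(speaker, slower), dtw_dissimilarity(speaker, different))
```

Because warping absorbs the difference in speaking rate, the slower rendition scores close to zero, while the utterance with different content scores higher. Lower values correspond to greater convergence, as in the text.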
EEG Data Processing and Analysis
EEG was recorded with a 32-channel active electrode system (Brain Products actiCHamp amplifier with EasyCap). Electrodes were positioned according to the international 10-20 system, with impedances kept below 10 kΩ. Data were initially referenced to the Cz electrode, and horizontal and vertical electrooculograms were recorded to monitor ocular activity.
The preprocessing pipeline was implemented using the FieldTrip toolbox in MATLAB (Oostenveld et al., 2011). Continuous data were downsampled to 250 Hz and bandpass filtered between 0.1 and 30 Hz. To isolate and remove artifacts from eye blinks and movements, we performed Independent Component Analysis (ICA) using the Extended Infomax algorithm. Following artifact rejection, the data were re-referenced to the global average of all electrodes. Epochs were extracted over a 700-ms window, including a 100-ms pre-stimulus baseline. Any trials with peak-to-peak amplitudes exceeding 100 µV or significant muscle contamination were excluded from further analysis.
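The epoching and amplitude-based rejection steps can be illustrated for a single channel. The actual pipeline used FieldTrip in MATLAB; the Python sketch below only mirrors the reported parameters (250 Hz, 700-ms epochs with a 100-ms baseline, 100 µV peak-to-peak threshold). It assumes the epoch runs from -100 to 600 ms around stimulus onset and that the baseline mean is subtracted, neither of which is stated explicitly. All function names and the toy signal are hypothetical.

```python
SFREQ_HZ = 250            # sampling rate after downsampling (from the text)
BASELINE_MS = 100         # pre-stimulus baseline (from the text)
EPOCH_MS = 700            # total epoch length including baseline (from the text)
PEAK_TO_PEAK_UV = 100.0   # rejection threshold in microvolts (from the text)


def ms_to_samples(ms):
    return round(ms * SFREQ_HZ / 1000)


def extract_epoch(signal_uv, onset_sample):
    """Cut one epoch around a stimulus onset and subtract the baseline mean."""
    n_base = ms_to_samples(BASELINE_MS)
    n_total = ms_to_samples(EPOCH_MS)
    start = onset_sample - n_base
    if start < 0 or start + n_total > len(signal_uv):
        return None  # epoch would run off the recording
    epoch = signal_uv[start:start + n_total]
    baseline_mean = sum(epoch[:n_base]) / n_base
    return [x - baseline_mean for x in epoch]


def is_clean(epoch_uv):
    """Keep the epoch only if its peak-to-peak amplitude is within threshold."""
    return max(epoch_uv) - min(epoch_uv) <= PEAK_TO_PEAK_UV


# Toy single-channel recording (hypothetical): flat signal with one artifact.
signal = [5.0] * 1000
signal[420] = 200.0
onsets = [100, 400, 700]
epochs = [extract_epoch(signal, onset) for onset in onsets]
kept = [e for e in epochs if e is not None and is_clean(e)]
print(f"kept {len(kept)} of {len(onsets)} epochs")
```

The sketch omits filtering, ICA, re-referencing, and the screening for muscle contamination, which the toolbox handles in the real pipeline.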
Statistical evaluation of MMN was conducted using temporal and spatiotemporal cluster-based permutation tests (Maris & Oostenveld, 2007). The temporal analysis focused on the Fz channel, a standard site for observing MMN effects (Näätänen et al., 2007), while the spatiotemporal test explored broader neural patterns across all electrodes. To isolate the neural signatures specifically attributable to mismatch detection, we employed a difference-in-difference (DID) analysis: the deviant-minus-standard ERP difference in the speaking control task was subtracted from the corresponding difference in the oddball task, yielding the corrected MMN. This removes potential confounds related to physical acoustic differences or pitch-related perceptual variations. Finally, we conducted correlational analyses to examine the relationship between participants’ vocal performance and their neural responses within the significant MMN time windows.
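The difference-in-difference logic behind the corrected MMN amounts to two subtractions per time point. The sketch below is a minimal illustration with made-up amplitudes at three time points; the function names are hypothetical and the cluster-based permutation statistics are not shown.

```python
def difference_wave(erp_a, erp_b):
    """Pointwise difference erp_a - erp_b between two ERPs of equal length."""
    if len(erp_a) != len(erp_b):
        raise ValueError("ERPs must have the same number of time points")
    return [a - b for a, b in zip(erp_a, erp_b)]


def corrected_mmn(odd_deviant, odd_standard, ctl_deviant, ctl_standard):
    """(deviant - standard) in the oddball task minus the same contrast in the control task."""
    raw_mmn = difference_wave(odd_deviant, odd_standard)        # mismatch + acoustics
    acoustic_only = difference_wave(ctl_deviant, ctl_standard)  # acoustics alone
    return difference_wave(raw_mmn, acoustic_only)


# Toy amplitudes in microvolts (hypothetical values, three time points).
odd_deviant = [0.0, -3.0, -1.0]
odd_standard = [0.0, -0.5, 0.0]
ctl_deviant = [0.0, -1.0, -0.5]
ctl_standard = [0.0, -0.5, 0.0]

corrected = corrected_mmn(odd_deviant, odd_standard, ctl_deviant, ctl_standard)
print(corrected)
```

Because the deviant and standard differ acoustically in both tasks but only the oddball task makes the deviant rare, whatever survives the second subtraction is attributed to mismatch detection rather than to pitch itself.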
Results
Acoustic Results
The shadow group showed significant vocal learning effects (Figure 2A-B). Participants exhibited greater vocal similarity to the model speaker during the shadowing (β = -0.041, p < .001) and post-record tasks (β = -0.015, p = .037) compared to the initial pre-record baseline. Interestingly, the shadow group also exhibited a significant pitch divergence during the shadowing task, with the pitch difference being notably larger in the shadowing task than in the pre-record task (β = 5.282, t = 2.52, p = .012). Further investigation into syllable-specific changes revealed that this was likely driven by the imitation of the speaker’s stress patterns. In contrast, the non-shadow group showed no significant task effects for either MFCC dissimilarity or pitch difference (Figure 2C-D).
Figure 2.
Acoustic results for both shadow and non-shadow groups across various tasks.
(A) MFCC dissimilarity and (B) pitch difference results for the shadow group.
(C) MFCC dissimilarity and (D) pitch difference results for the non-shadow group.
Each violin plot shows the distribution of data, with corresponding boxplots illustrating the median and interquartile range (IQR). Whiskers extend from the hinges to the farthest values within 1.5 times the IQR. The pre-record task serves as the reference baseline for comparisons, with decreases in values indicating convergence behaviour. ***p < .001, *p < .05.
EEG Results
In the shadow group, we identified significant raw and corrected MMN effects for the low deviant—the stimulus matched to the listener’s male pitch range (Figure 3). For the oddball task, the temporal cluster-based test revealed a significant difference at Fz between the low deviant and the standard from approximately 150 to 210 ms, p = .020, t = -55.82. Crucially, the corrected MMN effect, obtained via the DID analysis between the oddball and speaking control tasks, also revealed a significant negative cluster from 120 to 210 ms across frontal and central regions, p = .011, t = -4.92. Conversely, the high deviant did not elicit a significant MMN effect at Fz in the shadow group. These findings suggested that motor-based predictions increase the brain’s sensitivity to listener-matched acoustic features.
Figure 3.
ERP results for the shadow group.
(A) Low deviant effects. Top Panel: ERP responses at Fz, showing responses to low deviant and standard stimuli for the oddball, speaking control, and counting control tasks (solid lines). Dotted lines represent ERP differences, with shaded areas indicating significant periods (p < .05, cluster-level). Bottom Panel: Spatiotemporal characteristics of ERP differences between low deviant and standard stimuli across all channels, with significant clusters indicated by blue (negative amplitude) and red (positive amplitude) bars (p < .05, cluster-corrected). Topographic displays at the centroid time points are shown in the upper right.
(B) Spatiotemporal characteristics of ERP differences calculated by subtracting the speaking control task responses (low deviant – standard) from the oddball task, with a topographic display at the specified time point.
(C) High deviant effects. Similar to panel A, showing ERP responses to high deviant stimuli at Fz, with corresponding ERP differences. Bottom Panel: Spatiotemporal characteristics of ERP differences between high deviant and standard stimuli, with topographic displays at centroid time points.
(D) Spatiotemporal characteristics of ERP differences calculated by subtracting counting control task responses (high deviant – standard) from those in the speaking control task, with a topographic display at the centroid time point.
The non-shadow group exhibited a different pattern (Figure 4): while they showed a significant MMN effect for the high deviant at Fz from 120 to 180 ms, p = .009, t = -53.20, they lacked the corrected MMN response for the low deviant, p > .05, indicating that their neural responses were primarily driven by acoustic factors rather than refined internal predictions.
Figure 4.
ERP results for the non-shadow group.
(A) Low deviant effects. Top Panel: ERP responses at Fz, showing responses to low deviant and standard stimuli for the oddball, speaking control, and counting control tasks (solid lines). Dotted lines represent ERP differences, with shaded areas indicating significant periods (p < .05, cluster-level). Bottom Panel: Spatiotemporal characteristics of ERP differences between low deviant and standard stimuli across all channels, with significant clusters indicated by blue (negative amplitude) and red (positive amplitude) bars (p < .05, cluster-corrected). Topographic displays at the centroid time points are shown in the upper right.
(B) High deviant effects. Similar to panel A, showing ERP responses to high deviant stimuli at Fz, with corresponding ERP differences. Bottom Panel: Spatiotemporal characteristics of ERP differences between high deviant and standard stimuli across all channels, with topographic displays at centroid time points.
(C) ERP difference responses at Fz and spatiotemporal characteristics of ERP differences calculated by subtracting the speaking control task responses (high deviant – standard) from the oddball task. Solid and dashed lines represent ERP difference responses for the oddball and speaking control tasks, respectively, with the dotted line representing the difference of differences between the two tasks. A topographic display of significant clusters at the centroid time point is included.
Acoustic-Neural Correlations
To bridge the behavioural and neural findings, we conducted correlational analyses between vocal performance and brain responses. In the shadow group, MFCC dissimilarity was positively correlated with the ERP peak amplitude at Fz for the low deviant in the oddball task, r = 0.59, p = .002 (Figure 5A). A similar correlation was observed in the speaking control task, r = 0.41, p = .049 (Figure 5B). These results demonstrated that participants whose vocalizations more closely resembled the speaker’s exhibited larger negative neural responses to the listener-matched pitch. This direct link reinforced the conclusion that more effective vocal learning was associated with the formation of more accurate motor-based predictions and enhanced neural sensitivity to predicted features.
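The direction of this relationship is easy to misread, because a more negative ERP peak is a larger response. The sketch below computes a Pearson correlation on invented values to show how lower MFCC dissimilarity paired with more negative peak amplitudes produces a positive r. The data and the function name `pearson_r` are hypothetical and are not the study's measurements.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-participant values, for illustration only.
mfcc_dissimilarity = [0.30, 0.35, 0.40, 0.45, 0.50]   # lower = closer to the speaker
peak_amplitude_uv = [-3.0, -2.6, -2.1, -1.9, -1.2]    # more negative = larger MMN

r = pearson_r(mfcc_dissimilarity, peak_amplitude_uv)
print(f"r = {r:.2f}")
```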
Figure 5.
Correlational results.
(A) Correlation between MFCC dissimilarity and ERP peak amplitude at Fz elicited by the low deviant in the oddball task for the shadow group.
(B) Correlation between MFCC dissimilarity and ERP peak amplitude at Fz elicited by the low deviant in the speaking control task for the shadow group.
Each scatterplot includes fitted trend lines, with corresponding correlation coefficients (r) and p-values displayed.
Discussion
Our findings support the third hypothesis: motor-based predictions enhance, rather than suppress, the brain’s sensitivity to listener-matched acoustic features. The significant corrected MMN effects observed in the shadow group indicate that motor signals function as top-down attentional modulators during social imitation, increasing neural response gain to facilitate the perceptual learning required for vocal adjustment. This challenges traditional views that motor-based predictions primarily serve to cancel out expected sensory feedback, a mechanism thought to reduce redundancy during self-generated speech.
It is crucial to distinguish this from the role of memory-based predictions. Memory-based signals, derived from speaker exposure, typically reduce sensitivity to identity cues to accommodate natural vocal fluctuations. In our shadow group, this tolerance can explain the absence of a significant MMN for the high deviant, which remained within the speaker’s gender category. Importantly, the non-shadow group, who had limited exposure to the speaker’s voice, showed no such pattern, suggesting that both types of predictions require sufficient speaker exposure to become refined and behaviourally meaningful.
Furthermore, the observed pitch divergence highlights the motor system’s inherent flexibility. Participants prioritised imitating adaptable articulatory patterns, specifically the speaker’s stress structure, over absolute pitch. This behavioural shift suggests that phonetic convergence serves a deeper social function: building rapport and cohesion through shared rhythm. By orchestrating these complex sensorimotor interactions, the human brain effectively navigates the nuances of social communication.
References
Gambi, C., & Pickering, M. J. (2013). Prediction and imitation in speech. Frontiers in Psychology, 4(June), 340. https://doi.org/10.3389/fpsyg.2013.00340
Li, S., Zhu, H., & Tian, X. (2020). Corollary Discharge Versus Efference Copy: Distinct Neural Signals in Speech Preparation Differentially Modulate Auditory Responses. Cerebral Cortex. https://doi.org/10.1093/cercor/bhaa154
Maris, E., & Oostenveld, R. (2007). Nonparametric statistical testing of EEG- and MEG-data. Journal of Neuroscience Methods, 164(1), 177–190. https://doi.org/10.1016/j.jneumeth.2007.03.024
Näätänen, R., Jacobsen, T., & Winkler, I. (2005). Memory‐based or afferent processes in mismatch negativity (MMN): A review of the evidence. Psychophysiology, 42(1), 25–32. https://doi.org/10.1111/j.1469-8986.2005.00256.x
Näätänen, R., Paavilainen, P., Rinne, T., & Alho, K. (2007). The mismatch negativity (MMN) in basic research of central auditory processing: A review. Clinical Neurophysiology, 118(12), 2544–2590. https://doi.org/10.1016/j.clinph.2007.04.026
Oostenveld, R., Fries, P., Maris, E., & Schoffelen, J.-M. (2011). FieldTrip: Open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Computational Intelligence and Neuroscience, 2011, 156869. https://doi.org/10.1155/2011/156869
Pardo, J. S., Urmanche, A., Wilman, S., & Wiener, J. (2017). Phonetic convergence across multiple measures and model talkers. Attention, Perception, & Psychophysics, 79(2), 637–659. https://doi.org/10.3758/s13414-016-1226-0
Whitford, T. J. (2019). Speaking-Induced Suppression of the Auditory Cortex in Humans and Its Relevance to Schizophrenia. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 4(9), 791–804. https://doi.org/10.1016/j.bpsc.2019.05.011
Xu, Y. (2013). A Tool for Large-scale Systematic Prosody Analysis. In Proceedings of Tools and Resources for the Analysis of Speech Prosody (TRASP 2013) (pp. 7–10).





