The Dual-Task Costs of Audiovisual Benefit: Effects of Noise and ‘Native’ Speaker Status

Listeners typically understand speech more accurately when they can see and hear the talker relative to hearing alone. However, seeing the talker’s face does not necessarily reduce the cognitive costs associated with processing speech, as measured by dual-task costs. In difficult listening conditions, dual-task response times may be faster for audiovisual than audio-only speech, but when listening conditions are easy, the presence of a talking face may have no effect on dual-task responses or may even slow responses relative to listening alone. The current study expanded upon this work by including samples of both native and nonnative English speakers and assessing speech intelligibility, subjective listening effort (Experiment 1), and dual-task costs (Experiment 2) for audio-only and audiovisual speech across multiple noise levels. We found that seeing the talker reduces dual-task costs only in difficult listening conditions in which the visual information is necessary to accurately identify the speech. The effects of background noise and speech modality were robust within groups of native as well as nonnative listeners, suggesting that if researchers are interested in studying general phenomena related to speech processing (i.e., rather than specifically studying how language background affects results), these effects would have emerged regardless of whether the sample was limited to native speakers of English. However, the magnitude of some effects differed for native and nonnative listeners.

By Violet A. Brown, Adina Holloway, Amadou Touré, Salma Ali, Alyssa Alvarez, Tiffany Nyamao, Yuxin Lin, Ostap Hrebeniuk, & Julia F. Strand

Impaired Performance in Noise: Disentangling Listening Effort From the Irrelevant Speech Effect

Noise can reduce the intelligibility of spoken language and increase the effort necessary to understand speech. Listening effort, “the deliberate allocation of mental resources to overcome obstacles in goal pursuit when carrying out a [listening] task” (Pichora-Fuller et al., 2016), is commonly assessed by measuring response times to secondary tasks while listening to speech or by testing memory for the content of the speech. Increasing the level of background noise tends to slow responses and impair memory, and these effects are attributed to the resource-intensive process of reevaluating speech that was initially obscured or misheard. However, given that noise can impair performance on cognitive tasks that do not require processing auditory information, it is possible that noise-induced impairments typically ascribed to processing degraded speech may instead reflect increased cognitive load from the presence of noise itself. The current study assessed whether noise, in the absence of a speech task, can affect performance on tasks intended to measure listening effort. In Experiment 1 (positive control), target speech consisting of single words was presented aurally in background noise, and we measured listening effort with three commonly-used paradigms. Experiment 2 was identical except that the target words were presented orthographically rather than aurally. Results showed that noise impaired performance on all three tasks when the target stimuli were presented aurally, consistent with a large body of work in the listening effort literature. Experiment 2 revealed that performance on some tasks was impaired by the presence of masking noise (particularly two-talker babble), indicating some domain-general interference. However, the magnitude of the noise-induced interference effects was markedly smaller in Experiment 2 than in Experiment 1, suggesting that measures of listening effort capture variability attributable to the challenges associated with listening to speech in noise, and do not simply measure distraction or noise-induced cognitive interference.

By Janna W. Wennberg, Naseem H. Dillman-Hasso, Violet A. Brown, & Julia F. Strand

Measuring the Dual-Task Costs of Audiovisual Speech Processing Across Levels of Background Noise

Successful communication requires that listeners not only identify speech, but do so while maintaining performance on other tasks, like remembering what a conversational partner said or paying attention while driving. This set of four experiments systematically evaluated how audiovisual speech—which reliably improves speech intelligibility—affects dual-task costs during speech perception (i.e., one facet of listening effort). Results indicated that audiovisual speech reduces dual-task costs in difficult listening conditions (those in which visual cues substantially benefit intelligibility), but may actually increase costs in easy conditions—a pattern of results that was internally replicated multiple times. This study also presents a novel dual-task paradigm specifically designed to facilitate conducting dual-task research remotely. Given the novelty of the task, this study includes psychometric experiments that establish positive and negative controls, assess convergent validity, measure task sensitivity relative to a commonly-used dual-task paradigm, and generate performance curves across a range of listening conditions. Thus, in addition to evaluating the effects of audiovisual speech across a wide range of background noise levels, this study enables other researchers to address theoretical questions related to the cognitive mechanisms supporting speech processing beyond the specific issues addressed here and without being limited to in-person research.

By Violet A. Brown

Noisy Speech Impairs Retention of Previously Heard Information Only at Short Time Scales

When speech is presented in noise, listeners must recruit cognitive resources to resolve the mismatch between the noisy input and representations in memory. A consequence of this effortful listening is impaired memory for content presented earlier. In the first study on effortful listening, Rabbitt (1968, The Quarterly Journal of Experimental Psychology, 20, 241–248; Experiment 2) found that recall for a list of digits was poorer when subsequent digits were presented with masking noise than without. Experiment 3 of that study extended this effect to more naturalistic, passage-length materials. Although the findings of Rabbitt’s Experiment 2 have been replicated multiple times, no work has assessed the robustness of Experiment 3. We conducted a replication attempt of Rabbitt’s Experiment 3 at three signal-to-noise ratios (SNRs). Results at one of the SNRs (Experiment 1a of the current study) were in the opposite direction from what Rabbitt (1968) reported: speech was recalled more accurately when it was followed by speech presented in noise rather than in the clear. Results at the other two SNRs showed no effect of noise (Experiments 1b and 1c). In addition, reanalysis of a replication of Rabbitt’s seminal finding in his second experiment showed that the effect of effortful listening on previously presented information is transient. Thus, effortful listening caused by noise appears to impair memory only for information presented immediately before the noise, which may account for our finding that noise in the second half of a long passage did not impair recall of information presented in the first half of the passage.

By Violet A. Brown, Katrina Sewell, Jed Villanueva, & Julia F. Strand

The Effects of Temporal Cues, Point-Light Displays, and Faces on Speech Identification and Listening Effort

Among the most robust findings in speech research is that the presence of a talking face improves the intelligibility of spoken language. Talking faces supplement the auditory signal by providing fine phonetic cues based on the placement of the articulators, as well as temporal cues to when speech is occurring. In this study, we varied the amount of information contained in the visual signal, ranging from temporal information alone to a natural talking face. Participants were presented with spoken sentences in energetic or informational masking in four different visual conditions: audio-only, a modulating circle providing temporal cues to salient features of the speech, a digitally rendered point-light display showing lip movement, and a natural talking face. We assessed both sentence identification accuracy and self-reported listening effort. Audiovisual benefit for intelligibility was observed for the natural face in both informational and energetic masking, but the digitally rendered point-light display only provided benefit in energetic masking. Intelligibility for speech accompanied by the modulating circle did not differ from the audio-only conditions in either masker type. Thus, the temporal cues used here were insufficient to improve speech intelligibility in noise, but some types of digital point-light displays may contain enough phonetic detail to produce modest improvements in speech identification in noise.

By Katrina Sewell, Violet A. Brown, Grace Farwell, Maya Rogers, Xingyi Zhang, & Julia F. Strand

Spread the Word: Enhancing Replicability of Speech Research Through Stimulus Sharing

Purpose: The ongoing replication crisis within and beyond psychology has revealed the numerous ways in which flexibility in the research process can affect study outcomes. In speech research, examples of these “researcher degrees of freedom” include the particular syllables, words, or sentences presented; the talkers who produce the stimuli and the instructions given to them; the population tested; whether and how stimuli are matched on amplitude; the type of masking noise used and its presentation level; and many others. In this research note, we argue that even seemingly minor methodological choices have the potential to affect study outcomes. To that end, we present a reanalysis of six existing data sets on spoken word identification in noise to assess how differences in talkers, stimulus processing, masking type, and listeners affect identification accuracy. Conclusions: Our reanalysis revealed relatively low correlations among word identification rates across studies. The data suggest that some of the seemingly innocuous methodological details that differ across studies—details that cannot possibly be reported in text given the idiosyncrasies inherent to speech—introduce unknown variability that may affect replicability of our findings. We therefore argue that publicly sharing stimuli is a crucial step toward improved replicability in speech research.

By Julia F. Strand & Violet A. Brown

Preregistration: Practical Considerations for Speech, Language, and Hearing Research

Purpose: In the last decade, psychology and other sciences have implemented numerous reforms to improve the robustness of our research, many of which are based on increasing transparency throughout the research process. Among these reforms is the practice of preregistration, in which researchers create a time-stamped and uneditable document before data collection that describes the methods of the study, how the data will be analyzed, the sample size, and many other decisions. The current article highlights the benefits of preregistration with a focus on the specific issues that speech, language, and hearing researchers are likely to encounter, and additionally provides a tutorial for writing preregistrations. Conclusions: Although rates of preregistration have increased dramatically in recent years, the practice is still relatively uncommon in research on speech, language, and hearing. Low rates of adoption may be driven by a lack of understanding of the benefits of preregistration (either generally or for our discipline in particular) or uncertainty about how to proceed if it becomes necessary to deviate from the preregistered plan. Alternatively, researchers may see the benefits of preregistration but not know where to start, and gathering this information from a wide variety of sources is arduous and time-consuming. This tutorial addresses each of these potential roadblocks to preregistration and equips readers with tools to facilitate writing preregistrations for research on speech, language, and hearing.

By Violet A. Brown & Julia F. Strand

Speech and Non-Speech Measures of Audiovisual Integration Are Not Correlated

Many natural events generate both visual and auditory signals, and humans are remarkably adept at integrating information from those sources. However, individuals appear to differ markedly in their ability or propensity to combine what they hear with what they see. Individual differences in audiovisual integration have been established using a range of materials, including speech stimuli (seeing and hearing a talker) and simpler audiovisual stimuli (seeing flashes of light combined with tones). Although there are multiple tasks in the literature that are referred to as “measures of audiovisual integration,” the tasks differ widely with respect to both the type of stimuli used (speech versus non-speech) and the nature of the tasks themselves (e.g., some tasks use conflicting auditory and visual stimuli whereas others use congruent stimuli). It is not clear whether these varied tasks are actually measuring the same underlying construct: audiovisual integration. This study tested the relationships among four commonly-used measures of audiovisual integration, two of which use speech stimuli (susceptibility to the McGurk effect and a measure of audiovisual benefit), and two of which use non-speech stimuli (the sound-induced flash illusion and audiovisual integration capacity). We replicated previous work showing large individual differences in each measure but found no significant correlations among any of the measures. These results suggest that tasks that are commonly referred to as measures of audiovisual integration may be tapping into different parts of the same process or different constructs entirely.

By Jonathan M. P. Wilbiks, Violet A. Brown, & Julia F. Strand

Revisiting the Target-Masker Linguistic Similarity Hypothesis

The linguistic similarity hypothesis states that it is more difficult to segregate target and masker speech when they are linguistically similar. For example, recognition of English target speech should be more impaired by the presence of Dutch masking speech than Mandarin masking speech because Dutch and English are more linguistically similar than Mandarin and English. Across four experiments, English target speech was consistently recognized more poorly when presented in English masking speech than in silence, speech-shaped noise, or an unintelligible masker (i.e., Dutch or Mandarin). However, we found no evidence for graded masking effects—Dutch did not impair performance more than Mandarin in any experiment, despite 650 participants being tested. This general pattern was consistent when using both a cross-modal paradigm (in which target speech was lipread and maskers were presented aurally; Experiments 1a and 1b) and an auditory-only paradigm (in which both the targets and maskers were presented aurally; Experiments 2a and 2b). These findings suggest that the linguistic similarity hypothesis should be refined to reflect the existing evidence: There is greater release from masking when the masker language differs from the target speech than when it is the same as the target speech. However, evidence that unintelligible maskers impair speech identification to a greater extent when they are more linguistically similar to the target language remains elusive.

By Violet A. Brown, Naseem H. Dillman-Hasso, ZhaoBin Li, Lucia Ray, Ellen Mamantov, Kristin J. Van Engen, & Julia F. Strand

Revisiting the Relationship Between Implicit Racial Bias and Audiovisual Benefit for Nonnative-Accented Speech

Speech intelligibility is improved when the listener can see the talker in addition to hearing their voice. Notably, though, previous work has suggested that this “audiovisual benefit” for nonnative (i.e., foreign-accented) speech is smaller than the benefit for native speech, an effect that may be partially accounted for by listeners’ implicit racial biases (Yi et al., 2013, The Journal of the Acoustical Society of America, 134[5], EL387–EL393). In the present study, we sought to replicate these findings in a substantially larger sample of online participants. In a direct replication of Yi et al. (Experiment 1), we found that audiovisual benefit was indeed smaller for nonnative-accented relative to native-accented speech. However, our results did not support the conclusion that implicit racial biases, as measured with two types of implicit association tasks, were related to these differences in audiovisual benefit for native and nonnative speech. In a second experiment, we addressed a potential confound in the experimental design: to ensure that the difference in audiovisual benefit was caused by a difference in accent rather than a difference in overall intelligibility, we reversed the overall difficulty of each accent condition by presenting them at different signal-to-noise ratios. Even when native speech was presented at a much more difficult intelligibility level than nonnative speech, audiovisual benefit for nonnative speech remained smaller. In light of these findings, we discuss alternative explanations of reduced audiovisual benefit for nonnative speech, as well as methodological considerations for future work examining the intersection of social, cognitive, and linguistic processes.

By Drew J. McLaughlin, Violet A. Brown, Sita Carraturo, & Kristin J. Van Engen