Audio-Aero Tactile Integration in Speech Perception: Methodology

Info: 2721 words (11 pages) Dissertation Methodology
Published: 23rd Nov 2021

Chapter III - Statement of the problem

Studies investigating audio-tactile integration that used syllables as stimuli and a closed-choice response paradigm have demonstrated multisensory integration, but this integration could not be replicated when sentences were used as stimuli with an open-choice response paradigm. That study did not follow a continuous hierarchical progression: the stimulus type was upgraded abruptly from monosyllables to five-word sentences, which is a radical change. In addition, a comparatively more demanding response type, the open-choice paradigm, was chosen for the study. Thus, the unanswered question is which of the two factors, the use of sentences or the open-choice paradigm, led to the null result in the continuous speech perception paper (Derrick et al., 2016). To determine this, it is essential to take a step back and investigate multisensory integration using syllables in an open-choice paradigm.

3.1 Study aim

The present study aims to identify whether the benefits of audio-tactile integration hold for a monosyllable identification task at varying signal-to-noise ratios (SNRs) when participants do not have to make a forced choice between two alternatives but are instead presented with a more ecologically valid open-choice condition.

3.2 Hypothesis

The research question is whether aero-tactile information influences syllable perception in an open-choice identification task. This will be investigated by testing the following two hypotheses.

The SNR at the 80% accuracy level will interact with phoneme and air flow such that:

Hypothesis 1: The SNR at the 80% accuracy level will be lower when listening to congruent audio-tactile stimuli than when listening to audio-only stimuli.

Hypothesis 2: The SNR at the 80% accuracy level will be higher when listening to incongruent audio-tactile stimuli than when listening to audio-only stimuli.

3.3 Justification

The proposed study extends the previous work of Gick and Derrick (2009), but with an open-choice response design instead of a two-way forced-choice task. This response format was chosen because it has been shown to provide a more conservative estimate of the participant's percept in previous studies (Colin et al., 2005; Massaro, 1998). Moreover, an open-choice design allows a better assessment of the precision of audio-tactile integration in speech perception, as the possibility of participants guessing can be minimized. Hence, the outcome of this study will extend our knowledge of how effectively tactile information is integrated to enhance auditory speech perception in a more natural setting with minimal cues.

In addition, simplifying the stimulus type to monosyllables allows us to identify the conditions under which audio-tactile integration occurs, without the confounding factors that require higher cognitive and linguistic processing (semantic information, context information, utterance length, etc.) and that were present in the continuous speech studies (Derrick et al., 2016).

3.4 Significance

This study would be a valuable contribution to the multisensory speech perception literature, both for fundamental scientific work and for further research. Moreover, insights from the study could inform evidence-based clinical practice, especially for training communication skills in individuals with sensory deficits.

Chapter IV - Methodology

The current study builds on the methodology of the original aero-tactile integration paper (Gick & Derrick, 2009), coupling an acoustic speech signal with small puffs of air on the skin. The difference in the present study is that participants are free to choose their response without any constraints, based on their own perceptual judgement, rather than having to choose between two response alternatives. The University of Canterbury Human Ethics Committee reviewed and approved this study on 15 May 2017 (Approval number 2017-21 LR). See Appendix for a copy of the approval letter.

4.1 Participants

Forty-four (44) healthy participants (40 females and 4 males) with a mean age of 23.34 years were recruited for the study. The inclusion criteria set for the recruitment process were:

Native English speaker
Aged between 18 and 45
No history of speech, language or hearing issues

Of the 44 participants tested, seven did not meet the language criterion (New Zealand, Canadian, United States or United Kingdom English) and three had pure tone thresholds above 25 dB in either ear, leaving 34 participants. In addition, five of these participants reached a ceiling effect in some conditions, so their data could not be included in full: they were unable to correctly identify some of the stimuli at a +10 dB SNR level, suggesting that they had difficulty performing the task even in an effectively noiseless environment. None had to be excluded completely, but participant 2’s “ka” and “ba”, participant 6’s “pa”, participant 8’s “ka”, participant 14’s “ka”, and participant 37’s “ga”, “da” and “ba” data had to be excluded due to these ceiling effects. None of the participants had a history of speech or language delays.

Most participants (n = 34) were undergraduate speech-language therapy students; the remainder (n = 10) were recruited via email, Facebook, an advertisement on the New Zealand Institute of Language, Brain and Behaviour (NZILBB) website and around the university. Undergraduate students received credits for their research participation, while other volunteers were given a $10 gift voucher as compensation for their time. As part of the recruitment process, participants received an information sheet (Appendix), which was discussed with them before any of the procedures began. Following this discussion, if they chose to participate, they were asked to sign a written consent form (Appendix).

All participants were asked to complete a questionnaire (Appendix) detailing demographic information on age, dialect and history of speech, language and hearing difficulties. As part of the initial protocol, participants underwent an audiological screening. Pure tone audiometry was carried out at 500 Hz, 1 kHz, 2 kHz and 4 kHz using an Interacoustics AS608 screening audiometer. Pure tone thresholds were calculated, and hearing sensitivity was considered to be within the normal range if the threshold was less than or equal to 25 dB. Participants not meeting the inclusion criteria could still choose to complete the study to gain research experience. This resulted in data for 7 non-native English speakers.
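The screening rule above can be sketched in a few lines of Python. This is only an illustration: the text does not say whether the 25 dB limit was applied to each frequency separately or to an average, so the sketch assumes it applies to every tested frequency in each ear, and the data structure and example thresholds are hypothetical.

```python
# Screening frequencies and the 25 dB limit are taken from the text.
SCREENING_FREQUENCIES_HZ = (500, 1000, 2000, 4000)
PASS_LIMIT_DB = 25


def passes_screening(thresholds_db):
    """Return True if every threshold in both ears is at or below the limit.

    thresholds_db is a hypothetical structure of the form
    {"left": {500: 10, ...}, "right": {500: 15, ...}}.
    Assumption: the limit is applied per frequency and per ear.
    """
    return all(
        thresholds_db[ear][freq] <= PASS_LIMIT_DB
        for ear in ("left", "right")
        for freq in SCREENING_FREQUENCIES_HZ
    )


# Made-up example thresholds, for illustration only.
normal = {
    "left": {500: 10, 1000: 15, 2000: 20, 4000: 25},
    "right": {500: 5, 1000: 10, 2000: 15, 4000: 20},
}
raised = {
    "left": {500: 10, 1000: 15, 2000: 20, 4000: 25},
    "right": {500: 5, 1000: 10, 2000: 30, 4000: 20},
}

print(passes_screening(normal))
print(passes_screening(raised))
```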

4.2 Recording procedure and stimulus

The speaker was asked to come into a sound-attenuated booth in the lab, and speech audio was recorded using a Sennheiser MKH-416 microphone attached to a Sound Devices USB-Pre2 microphone amplifier fed into a PC. Video of the English syllables, labials (/pa/ and /ba/) and velars (/ka/ and /ga/), spoken by a female native New Zealand English speaker, was recorded using a video camera (Panasonic Lumix DMC-LX100). The speaker spoke with her lips ~1 cm away from a custom-made airflow estimator system that does not interfere with audio speech production. She produced twenty repetitions of each stimulus; the stimuli were presented in randomized order on a screen to be read aloud.

To produce the air puff, an 80 ms long 12 kHz sine wave was used to drive the pump action of the Aerotak system (Derrick & De Rybel, 2015). This system stores the audio signal and the air flow signal in the left and right channels of a stereo audio output, respectively. The stored audio is used to drive a conversion unit that splits the signal into a headphone output (to both ears) and a right-channel air pump drive signal, which is sent to a piezoelectric pump mounted on a tripod.
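The stereo layout described above can be illustrated with a minimal Python sketch. It is not the Aerotak implementation: the sample rate, the burst onset position, and the function names are assumptions made for illustration. Only the 80 ms duration, the 12 kHz frequency, and the left/right channel assignment come from the text.

```python
import math

SAMPLE_RATE_HZ = 44100  # assumption: the text does not state a sample rate


def pump_drive_burst(duration_s=0.080, freq_hz=12000.0, sample_rate=SAMPLE_RATE_HZ):
    """Return an 80 ms, 12 kHz sine burst as a list of samples in [-1, 1]."""
    n_samples = round(duration_s * sample_rate)
    return [
        math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
        for i in range(n_samples)
    ]


def stereo_frames(audio, burst, burst_onset):
    """Pair speech audio (left) with the pump drive signal (right).

    burst_onset is a hypothetical sample index at which the burst starts;
    the right channel is silent everywhere else.
    """
    right = [0.0] * len(audio)
    for i, sample in enumerate(burst):
        if burst_onset + i < len(audio):
            right[burst_onset + i] = sample
    return list(zip(audio, right))


burst = pump_drive_burst()
speech = [0.0] * 5000  # placeholder for a recorded syllable
frames = stereo_frames(speech, burst, burst_onset=100)
print(len(burst), len(frames))
```

Keeping both signals in one stereo file means the puff stays sample-aligned with the speech on playback, which is presumably why the system stores them this way.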

Auditory stimuli

The speech stimuli of the English syllables were matched for duration (390-450 ms each), fundamental frequency (falling pitch from 90 Hz to 70 Hz) and intensity (70 decibels). Using an automated process, speech token recordings were randomly superimposed 10,000 times within a 10-second looped sound file to generate speech noise for the speaker. According to Jansen and colleagues (2010) and Smits and colleagues (2004), this method of noise generation results in a noise spectrum virtually identical to the long-term spectrum of the speaker's speech tokens, and thus ensures accurate signal-to-noise ratios for each speaker and token. Speech tokens and noise samples were adjusted to the same A-weighted sound level prior to mixing at different signal-to-noise ratios.
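The superposition procedure can be sketched as follows. This is a simplified illustration, not the study's software: the function names are hypothetical, tokens are plain lists of samples, and level matching uses unweighted RMS rather than the A-weighted level reported above.

```python
import math
import random


def superimpose_tokens(tokens, loop_len, n_copies, seed=0):
    """Add randomly chosen tokens at random offsets into a looped buffer.

    Samples that run past the end wrap around to the start, so the
    result can be played as a seamless loop.
    """
    rng = random.Random(seed)
    loop = [0.0] * loop_len
    for _ in range(n_copies):
        token = rng.choice(tokens)
        start = rng.randrange(loop_len)
        for i, sample in enumerate(token):
            loop[(start + i) % loop_len] += sample
    return loop


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def match_rms(samples, target_rms):
    """Scale samples to a target RMS (a stand-in for A-weighted matching)."""
    gain = target_rms / rms(samples)
    return [s * gain for s in samples]


# Tiny demonstration; the study used 10,000 copies in a 10-second loop.
toy_tokens = [[1.0, -1.0, 0.5]]
noise = superimpose_tokens(toy_tokens, loop_len=100, n_copies=50, seed=1)
noise = match_rms(noise, target_rms=0.1)
print(len(noise))
```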

Tactile stimuli

In order to best match the airflow produced by the airflow generation system to the dynamics of airflow produced in speech, the air flow outputs were generated by a Murata MZB1001T02 piezoelectric device (Tokyo, Japan) controlled through the Aerotak system, as described in Derrick, de Rybel, and Fiasson (2015). This system extracts a signal representing turbulent airflow during speech from the recorded speech samples. The stimulus syllables were passed through the air flow extraction algorithm to generate a signal for driving a system that presents air flow to the skin of participants simultaneously with the audio stimuli.

4.3 Stimulus presentation

The experiment was run individually for each participant. The entire procedure lasted approximately 40 minutes. Data were collected using an Apple MacBook Air laptop in a sound-attenuated room, for four underlying tokens each of ‘ba’, ‘pa’, ‘ga’, and ‘ka’. Stimuli were embedded in speech noise generated using the same techniques described in Derrick et al. (2016), with the exception that the software used was custom R and FFMPEG. The SNR ranged from -20 to 10 dB in 0.1 dB increments. From -20 to 0 dB SNR, the signal was decreased and the noise was kept stable. From 0 to 10 dB SNR, the signal was kept at the same volume and the noise was decreased. Thus, the overall amplitude was kept stable throughout the experiment.
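The two-sided gain rule described above can be written out as a short Python sketch. The function names are hypothetical and the gains are expressed relative to an arbitrary reference level; the study itself used custom R and FFMPEG.

```python
def mixing_gains_db(snr_db):
    """Return (signal_gain_db, noise_gain_db) for a requested SNR.

    Below 0 dB the signal is attenuated and the noise stays at the
    reference level; above 0 dB the signal stays at the reference level
    and the noise is attenuated.
    """
    if not -20 <= snr_db <= 10:
        raise ValueError("SNR outside the -20 to 10 dB range used in the study")
    if snr_db <= 0:
        return snr_db, 0.0
    return 0.0, -snr_db


def db_to_amplitude(gain_db):
    """Convert a gain in dB to a linear amplitude factor."""
    return 10 ** (gain_db / 20)


# All SNR levels from -20 to 10 dB in 0.1 dB steps.
snr_levels = [round(-20 + 0.1 * i, 1) for i in range(301)]

signal_db, noise_db = mixing_gains_db(-6.0)
print(signal_db - noise_db)  # difference equals the requested SNR
```

Because one of the two gains is always 0 dB, the louder component never exceeds the reference level, which matches the aim of keeping the overall amplitude stable.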

The pump has the following specifications: the 5-95% rise time is 30 ms (Derrick et al., 2015), with a maximum pressure of 1.5 kPa during loud speech and a maximum flow rate of 0.8 l/m, which corresponds to about a twelfth of that of actual speech.

The correct responses were lower-case ‘pa’, ‘ba’, ‘ga’, and ‘ka’/‘ca’, based on the underlying audio signal. Whenever the participant responded correctly, the SNR decreased, thereby increasing the task difficulty. Similarly, for every incorrect response the SNR increased, making the signal clearer and simplifying the task for the participant. Thus, the result of each trial allowed a re-tuning of the SNR for each syllable to compensate for how easy the individual recording was for perceivers to detect in noise. This method of assigning stimulus values based on the preceding response is known as an adaptive staircase. The auditory signals were degraded with speech-based noise, and the signal-to-noise ratio was varied using software implementing an adaptive transformed up-down staircase to obtain a psychometric curve of perception based on the 80% accuracy response in noise. The transformed up-down method (QUEST staircase) was adopted because it is a reasonably fast and typical method (Watson & Pelli, 1983). Eight adaptive staircases were set up for the token stimuli, and each QUEST staircase thus had 32 repetitions.
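To make the staircase logic concrete, the sketch below uses a simple three-down, one-up rule, which converges near 79.4% correct. This is a deliberately simpler substitute for the QUEST procedure used in the study, not a reproduction of it; the class name, the 2 dB step size, and the starting SNR are assumptions. Lowering the SNR makes the task harder, and raising it makes the task easier.

```python
class ThreeDownOneUpStaircase:
    """Minimal transformed up-down staircase (not QUEST).

    Three consecutive correct responses lower the SNR by one step;
    any incorrect response raises it by one step.
    """

    def __init__(self, start_snr_db=0.0, step_db=2.0, min_db=-20.0, max_db=10.0):
        self.snr_db = start_snr_db
        self.step_db = step_db      # assumption: fixed step size
        self.min_db = min_db        # SNR range taken from the text
        self.max_db = max_db
        self.correct_run = 0

    def update(self, correct):
        """Record one response and return the SNR for the next trial."""
        if correct:
            self.correct_run += 1
            if self.correct_run == 3:
                self.snr_db = max(self.min_db, self.snr_db - self.step_db)
                self.correct_run = 0
        else:
            self.correct_run = 0
            self.snr_db = min(self.max_db, self.snr_db + self.step_db)
        return self.snr_db


staircase = ThreeDownOneUpStaircase()
for response in (True, True, True, False):
    print(staircase.update(response))
```

A staircase that keeps hitting `max_db` corresponds to the ceiling cases described in the participants section, where a listener could not identify a syllable even at +10 dB SNR.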

4.4 Procedure

The study was designed to examine the influence of audio-aero-tactile integration on speech perception using an open-choice task. Each participant's perception was assessed using randomized presentation of 6 possible combinations of auditory-only, congruent and incongruent auditory and aero-tactile stimuli of the English syllables /pa/, /ba/, /ga/ and /ka/. Each participant heard 32 tokens of each syllable without air flow and 32 tokens of each syllable with air flow generated from the underlying sound file, for a total of 256 tokens. Each token took about 6.5 seconds per participant on average. Once the initial protocol was complete, participants were seated in a sound-attenuated booth wearing sound-isolating headphones (Panasonic Stereo Headphones RP-HT265). They were presented with the auditory stimuli via the headphones at a comfortable loudness level through an experiment designed in PsychoPy software (Peirce, 2007, 2009). Tactile stimuli were delivered at the suprasternal notch via the air pump, which was aimed at the participant's neck at a pressure of ~7 cm H₂O and fixed at ~2.2 cm from the skin surface. This location was chosen because it is one where participants typically receive no direct airflow during their own speech production. Participants' integrated perception was estimated by asking them to type the perceived syllable into the experiment control program, which indicated whether the answer was correct based on the software-provided expected outcome.

Participants were told that they might experience some noise and unexpected puffs of air along with the syllables, each consisting of a consonant and a vowel, during the task. They were asked to type the syllables that they heard and to press the enter key to record their responses. Since the experimental part of the study, which required active listening, took about 20 minutes of the total procedure, participants could take short listening breaks if they needed one. The researcher stayed inside the experiment room with the participant during the experiment to check that the placement was not disturbed and to ensure that the participant was comfortable.
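The trial count can be checked with a short Python sketch: four syllables, each with and without air flow, at 32 presentations per combination gives 256 trials. The function names and the fixed random seed are hypothetical. The congruent/incongruent labels assume that a puff paired with an aspirated stop (‘pa’, ‘ka’) counts as congruent and a puff paired with an unaspirated stop (‘ba’, ‘ga’) as incongruent.

```python
import random
from collections import Counter
from itertools import product

SYLLABLES = ("pa", "ba", "ga", "ka")
AIR_PUFF = (False, True)
REPETITIONS = 32
ASPIRATED = {"pa", "ka"}


def condition_label(syllable, air_puff):
    """Label a trial; the congruency mapping is an assumption (see lead-in)."""
    if not air_puff:
        return "audio-only"
    return "congruent" if syllable in ASPIRATED else "incongruent"


def build_trials(seed=0):
    """Return all (syllable, air_puff) trials in randomized order."""
    trials = [
        (syllable, puff)
        for syllable, puff in product(SYLLABLES, AIR_PUFF)
        for _ in range(REPETITIONS)
    ]
    random.Random(seed).shuffle(trials)
    return trials


trials = build_trials()
counts = Counter(trials)
print(len(trials), len(counts))
```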
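Scoring a typed response against the expected syllable can be sketched as below. This is an illustration rather than the PsychoPy experiment code: the function name is hypothetical, and the sketch assumes that responses are compared case-sensitively after stripping surrounding whitespace, with ‘ca’ accepted as a spelling of ‘ka’.

```python
# Accepted lower-case spellings for each target syllable.
ACCEPTED_SPELLINGS = {
    "pa": {"pa"},
    "ba": {"ba"},
    "ga": {"ga"},
    "ka": {"ka", "ca"},
}


def is_correct(target, typed_response):
    """Return True if the typed response matches the target syllable.

    Assumption: only surrounding whitespace is ignored; upper-case
    input is treated as incorrect.
    """
    return typed_response.strip() in ACCEPTED_SPELLINGS[target]


print(is_correct("ka", "ca"))
print(is_correct("pa", "ba"))
```

Because the task is open-choice, any string outside the accepted set, including syllables that were never presented, simply counts as incorrect.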

4.5 Data Analysis

Of the forty-four participants who took part in the study, data from the thirty-four (34) who met the inclusion criteria were analyzed to answer the research question. Data were first entered and sorted in a Microsoft Excel 2016 spreadsheet. The 32 repetitions were extracted for each staircase, and statistics were run on the final SNR, which corresponds to the 80% accuracy level. Descriptive statistics were run first.

Box plots were used to visualize the variation of SNR with place of articulation (graphs).

The variation of SNR between the audio-only and audio-tactile conditions for each target stimulus was plotted using a box plot (graph).

Generalized linear mixed-effects models (GLMM), run in R (R Core Team, 2016), were fitted to the interaction between aspiration [aspirated (‘pa’ and ‘ka’) vs. unaspirated (‘ba’ and ‘ga’) stops], place [labial (‘pa’ and ‘ba’) vs. velar (‘ga’ and ‘ka’)], and artificial air puff (present vs. absent).

Model fitting was then performed in a stepwise backwards iterative fashion, and models were back-fit using the Akaike information criterion (AIC) as the measure of quality of fit. This technique isolates the statistical model that provides the best fit for the data, allowing interactions to be eliminated in a statistically appropriate manner. The final model was:
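As a reminder of how AIC-based comparison works, the sketch below computes AIC = 2k - 2 ln(L) for candidate models and keeps the one with the lowest value. The log-likelihoods and parameter counts are made-up placeholders, not results from this study, and the function names are hypothetical.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def best_by_aic(candidates):
    """Return the name of the candidate with the lowest AIC.

    candidates maps a model name to (log_likelihood, n_params).
    """
    return min(candidates, key=lambda name: aic(*candidates[name]))


# Hypothetical values for illustration only.
candidates = {
    "full": (-1500.0, 12),
    "reduced": (-1501.0, 8),
}

for name, (loglik, k) in candidates.items():
    print(name, aic(loglik, k))
print(best_by_aic(candidates))
```

In this made-up case the reduced model wins because dropping four parameters costs only one unit of log-likelihood. Backward elimination applies that trade-off at each step.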

SNR ~ place * manner + (1 + (place * manner) | participant)

In this model, the SNR at 80% accuracy was the response variable. The model terms were: 1) place of articulation (labial vs. velar), 2) manner of articulation (voiced vs. voiceless), 3) the interaction of place and manner, and 4) the full-factorial random effect covering place and manner of articulation by participant.
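Written out, the model formula corresponds to the equation below. This is a sketch under stated assumptions: a Gaussian response with an identity link, 0/1 coding of place ($P$) and manner ($M$), and hypothetical symbols; $i$ indexes observations and $j$ indexes participants.

```latex
\begin{aligned}
\mathrm{SNR}_{ij} ={}& \beta_0 + \beta_1 P_{ij} + \beta_2 M_{ij} + \beta_3 P_{ij} M_{ij} \\
 &+ u_{0j} + u_{1j} P_{ij} + u_{2j} M_{ij} + u_{3j} P_{ij} M_{ij} + \varepsilon_{ij}, \\
(u_{0j}, u_{1j}, u_{2j}, u_{3j})^{\top} \sim{}& \mathcal{N}(\mathbf{0}, \Sigma), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\end{aligned}
```

The $\beta$ terms are the fixed effects shared by all participants. The $u_j$ terms let each participant have their own intercept and their own place, manner and interaction slopes, which is what the `(1 + (place * manner) | participant)` part of the formula specifies.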