Chapter III - Statement of the problem
Studies investigating audio-tactile integration with syllables as stimuli and a closed-choice response paradigm have demonstrated multisensory integration, but the effect could not be replicated when sentences were used as stimuli with an open-choice response paradigm. That study did not follow a continuous hierarchical progression: the stimulus type was upgraded abruptly from monosyllables to five-word sentences, which is a radical change. In addition, a comparatively sophisticated response type, the open-choice paradigm, was adopted in the same study. The unanswered question is therefore which of the two factors, the use of sentences or the open-choice paradigm, led to the null result in the continuous speech perception paper (Derrick et al., 2016). To determine this, it is essential to take a step back and investigate multisensory integration using syllables in an open-choice paradigm.
3.1 Study aim
The present study aims to identify whether the benefits of audio-tactile integration hold for a monosyllable identification task at varying signal-to-noise ratios (SNRs) when participants do not have to make a forced choice between two alternatives but are instead presented with a more ecologically valid open-choice condition.
The research question is whether aero-tactile information influences syllable perception in an open-choice identification task. This will be investigated by testing the following two hypotheses:
SNR at 80% accuracy level will interact with phoneme and air flow such that:
Hypothesis 2: The SNR at the 80% accuracy level will be higher when listening to incongruent audio-tactile stimuli than when listening to audio-only stimuli.
The proposed study extends the previous work of Gick and Derrick (2009), but with an open-choice response design instead of a two-way forced-choice task. This response format was chosen because it has been shown to provide a more conservative estimate of the participant's percept in previous studies (Colin et al., 2005; Massaro, 1998). Moreover, an open-choice design allows a better assessment of the precision of audio-tactile integration in speech perception, as the possibility of participant guessing can be minimized. Hence, the outcome of this study will extend our knowledge of how effectively tactile information is integrated to enhance auditory speech perception in a more natural setting with minimal cues.
In addition, simplifying the stimulus type to monosyllables would allow us to identify the conditions under which audio-tactile integration occurs, without the confounding factors that require higher cognitive and linguistic processing (semantic information, context information, utterance length, etc.) and that were present in the continuous speech studies (Derrick et al., 2016).
This study would be a valuable contribution to the multisensory speech perception literature, both for fundamental scientific studies and for further research. Moreover, insights from the study could inform evidence-based clinical practice, especially for training communication skills in individuals with sensory deficits.
The current study builds on the methodology of the original aero-tactile integration paper (Gick & Derrick, 2009), coupling an acoustic speech signal with small puffs of air on the skin. The difference in the present study is that participants are free to choose their response without any constraints, based on their own perceptual judgement, rather than having to choose between two response alternatives. The University of Canterbury Human Ethics Committee reviewed and approved this study on 15 May 2017 (approval number 2017-21 LR). See Appendix for a copy of the approval letter.
Forty-four (44) healthy participants (40 females and 4 males), with a mean age of 23.34 years, were recruited for the study. The inclusion criteria set for the recruitment process were:
- Native English speaker
- Aged between 18 and 45
- No history of speech, language or hearing issues
Of the 44 participants tested, seven did not meet the language criterion (New Zealand, Canadian, United States or United Kingdom English) and three had pure tone thresholds above 25 dB in either ear, leaving 34 participants. Of these, five reached a ceiling effect in some conditions, so their data could not be included in full. They were unable to correctly identify some of the stimuli at +10 dB SNR, suggesting they had difficulty performing the task even in an effectively noiseless environment. None had to be excluded completely, but participant 2’s “ka” and “ba”, participant 6’s “pa”, participant 8’s “ka”, participant 14’s “ka”, and participant 37’s “ga”, “da” and “ba” data were excluded because of these ceiling effects. None of the participants had a history of speech or language delays.
Most participants (n = 34) were undergraduate speech-language therapy students; the remaining participants (n = 10) were recruited via email, Facebook, advertisements on the New Zealand Institute of Language, Brain and Behaviour (NZILBB) website and around the university. Undergraduate students received credits for their research participation, while other volunteers were given a $10 gift voucher as compensation for their time. As part of the recruitment process, participants received an information sheet (Appendix), which was discussed with them before any of the procedures began. Following this discussion, if they chose to participate, they were asked to sign a written consent form (Appendix).
All participants were asked to complete a questionnaire (Appendix) detailing demographic information on age, dialect and history of speech, language and hearing difficulties. As part of the initial protocol, participants underwent an audiological screening. Pure tone audiometry was carried out at 500 Hz, 1 kHz, 2 kHz and 4 kHz using an Interacoustics AS608 screening audiometer. Pure tone thresholds were calculated, and hearing sensitivity was considered to be within the normal range if the threshold was less than or equal to 25 dB. Participants not meeting the inclusion criteria could still choose to complete the study to gain research experience. This resulted in data for 7 non-native English speakers.
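The screening rule described above can be written as a small check. This is only an illustrative sketch: the function name, the dictionary layout and the assumption that the 25 dB cut-off is applied at every screened frequency in each ear are not taken from the study.

```python
# Frequencies screened in the study (Hz).
SCREEN_FREQS_HZ = (500, 1000, 2000, 4000)


def passes_screening(thresholds_db, cutoff_db=25):
    """Return True if every screened threshold is at or below the cut-off.

    thresholds_db is a hypothetical layout:
    {"left": {500: dB, ...}, "right": {500: dB, ...}}.
    Assumption: the cut-off applies per frequency, per ear.
    """
    return all(
        thresholds_db[ear][freq] <= cutoff_db
        for ear in ("left", "right")
        for freq in SCREEN_FREQS_HZ
    )
```

Under this reading, a single threshold above 25 dB in either ear is enough to fail the screening.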
4.2 Recording procedure and stimulus
The speaker was recorded in a sound-attenuated booth in the lab. Speech audio was recorded using a Sennheiser MKH-416 microphone attached to a Sound Devices USB-Pre2 microphone amplifier fed into a PC. Video of the English syllables, labials (/pa/ and /ba/) and velars (/ka/ and /ga/), spoken by a female native New Zealand English speaker, was recorded using a video camera (Panasonic Lumix DMC-LX100), with the speaker's lips ~1 cm away from a custom-made airflow estimator system that does not interfere with audio speech production. The speaker produced twenty repetitions of each stimulus, and the stimuli were presented in randomized order to be read aloud off a screen.
To produce the air puff, an 80 ms long 12 kHz sine wave was used to drive the pump action of the Aerotak (Derrick & De Rybel, 2015). This system stores the audio signal and the air flow signal in the left and right channels of a stereo audio output, respectively. The stored audio drives a conversion unit that splits the signal into a headphone output (to both ears) and a right-channel air pump drive signal sent to a piezoelectric pump mounted on a tripod.
The speech stimuli of the English syllables were matched for duration (390-450 ms each), fundamental frequency (falling pitch from 90 Hz to 70 Hz) and intensity (70 decibels). Using an automated process, speech token recordings were randomly superimposed 10,000 times within a 10-second looped sound file to generate speech noise for the speaker. According to Jansen and colleagues (2010) and Smits and colleagues (2004), this method of noise generation results in a noise spectrum virtually identical to the long-term spectrum of the speaker's speech tokens, thus ensuring accurate signal-to-noise ratios for each speaker and token. Speech tokens and noise samples were adjusted to the same A-weighted sound level prior to mixing at different signal-to-noise ratios.
To best match the airflow produced by the airflow generation system to the dynamics of airflow in speech, the air flow outputs were generated by a Murata MZB1001T02 piezoelectric device (Tokyo, Japan), controlled through the Aerotak system, as described in Derrick, de Rybel, and Fiasson (2015). This system extracts a signal representing turbulent airflow during speech from the recorded speech samples. The stimulus syllables were passed through the air flow extraction algorithm to generate a signal for driving a system that presents air flow to the skin of participants simultaneously with the audio stimuli.
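The noise-generation idea described above (summing many randomly placed copies of the speech tokens into a looped buffer, then equalising levels) can be sketched as follows. This is a minimal illustration, not the automated process used in the study: the function names and parameters are hypothetical, samples are plain Python lists, and level matching uses unweighted RMS in place of the A-weighting the study applied.

```python
import random


def make_speech_noise(tokens, loop_len, n_overlays=10000, seed=0):
    """Sum randomly chosen tokens at random offsets into a circular buffer.

    tokens: list of sample lists (floats); loop_len: buffer length in samples.
    """
    rng = random.Random(seed)
    loop = [0.0] * loop_len
    for _ in range(n_overlays):
        token = rng.choice(tokens)
        start = rng.randrange(loop_len)
        for i, sample in enumerate(token):
            # Wrap around the end so the buffer loops without a seam.
            loop[(start + i) % loop_len] += sample
    return loop


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return (sum(s * s for s in samples) / len(samples)) ** 0.5


def match_level(samples, target_rms):
    """Scale samples to a target RMS (plain RMS here, not A-weighted)."""
    current = rms(samples)
    if current == 0:
        return list(samples)
    return [s * target_rms / current for s in samples]
```

Because the noise is built from the tokens themselves, its long-term spectrum follows that of the speech material, which is the property the cited authors rely on.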
4.3 Stimulus presentation
The experiment was run individually for each participant. The entire procedure lasted approximately 40 minutes. Data were collected using an Apple MacBook Air laptop in a sound-attenuated room, with four underlying tokens each of ‘ba’, ‘pa’, ‘ga’, and ‘ka’. Stimuli were embedded in speech noise generated using the same techniques described in Derrick et al. (2016), except that the software used was custom R and FFmpeg code. The SNR ranged from -20 to 10 dB in 0.1 dB increments. From -20 to 0 dB SNR, the signal was decreased and the noise was kept stable. From 0 to 10 dB SNR, the signal was kept at the same volume and the noise was decreased. Thus, the overall amplitude was kept stable throughout the experiment.
The pump has the following specifications: a 5-95% rise time of 30 ms (Derrick et al., 2015), a maximum pressure of 1.5 kPa during loud speech, and a maximum flow rate of 0.8 l/min, which corresponds to about a twelfth of that of actual speech.
The correct responses were lower-case ‘pa’, ‘ba’, ‘ga’, and ‘ka’/‘ca’, based on the underlying audio signal. Whenever the participant responded accurately, the SNR decreased, thereby increasing the task difficulty. Similarly, for every incorrect response, the SNR increased, making the signal clearer and simplifying the task for the participant. Thus, the results of each trial allowed the SNR for each syllable to be re-tuned to compensate for how easy the individual recording was for perceivers to detect in noise. This method of assigning stimulus values based on the preceding response is the procedure of an adaptive staircase. The auditory signals were degraded with speech-based noise, and the signal-to-noise ratio was varied using software implementing an adaptive transformed up-down staircase to obtain a psychometric curve of perception based on the 80% accuracy response in noise. The transformed up-down method (QUEST staircase) was adopted because it is a reasonably fast and typical method (Watson & Pelli, 1983). Eight adaptive staircases were set, one for each token stimulus, and each QUEST staircase had 32 repetitions.
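The mixing rule above, where the signal is attenuated below 0 dB SNR and the noise is attenuated above it, can be illustrated with a short sketch. It assumes that signal and noise have already been equalised so that unit gain on both corresponds to 0 dB SNR; the function names are hypothetical and this is not the custom R and FFmpeg code used in the study.

```python
def gains_for_snr(snr_db):
    """Return (signal_gain, noise_gain) as linear amplitude factors.

    Assumption: gain 1.0 on both channels corresponds to 0 dB SNR.
    """
    if snr_db <= 0:
        # Negative SNR: attenuate the signal, keep the noise fixed.
        return 10 ** (snr_db / 20), 1.0
    # Positive SNR: keep the signal fixed, attenuate the noise.
    return 1.0, 10 ** (-snr_db / 20)


def snr_steps(low=-20.0, high=10.0, step=0.1):
    """List the SNR levels from low to high inclusive."""
    count = round((high - low) / step) + 1
    return [round(low + i * step, 1) for i in range(count)]
```

With these defaults the grid runs from -20 to 10 dB in 0.1 dB steps, and the louder of the two components always stays at unit gain.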
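The response scoring and the adaptive logic described above can be sketched as follows. This is not the QUEST procedure the study used (QUEST places each trial using a Bayesian estimate of the threshold); it is a simpler 3-down/1-up transformed staircase, which converges near 79% correct, swapped in here for illustration. The class name, start level, step size and the whitespace stripping are all assumptions.

```python
# Accepted typed responses for each underlying audio syllable.
ACCEPTED = {"pa": {"pa"}, "ba": {"ba"}, "ga": {"ga"}, "ka": {"ka", "ca"}}


def is_correct(target, typed):
    """Score a typed response; only lower-case answers count as correct."""
    return typed.strip() in ACCEPTED[target]


class UpDownStaircase:
    """n-down/1-up staircase on SNR, clamped to [-20, 10] dB."""

    def __init__(self, start_snr=0.0, step=2.0, n_down=3, lo=-20.0, hi=10.0):
        self.snr = start_snr
        self.step = step
        self.n_down = n_down
        self.lo, self.hi = lo, hi
        self.streak = 0
        self.history = []

    def update(self, correct):
        """Record one trial and return the SNR for the next trial."""
        self.history.append((self.snr, correct))
        if correct:
            self.streak += 1
            if self.streak == self.n_down:
                # Enough correct answers in a row: lower the SNR (harder).
                self.snr = max(self.lo, self.snr - self.step)
                self.streak = 0
        else:
            # Any error: raise the SNR (easier).
            self.snr = min(self.hi, self.snr + self.step)
            self.streak = 0
        return self.snr
```

One such object per staircase, each updated with `is_correct(target, typed)` after every trial, mirrors the eight interleaved staircases in the design.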
The study was designed to examine the influence of audio-aero-tactile integration on speech perception using an open-choice task. Each participant's perception was assessed using randomized presentation of 6 possible combinations of auditory-only, congruent and incongruent auditory and aero-tactile stimuli of the English syllables /pa/, /ba/, /ga/ and /ka/. Each participant heard 32 tokens of each syllable without air flow and 32 tokens of each syllable with air flow generated from the underlying sound file, for a total of 256 tokens. Each token took 6.5 seconds per participant on average. Once the initial protocol was complete, participants were seated in a sound-attenuated booth wearing sound-isolating headphones (Panasonic Stereo Headphones RP-HT265). They were presented with the auditory stimuli via the headphones at a comfortable loudness level through an experiment designed in PsychoPy software (Peirce, 2007, 2009). Tactile stimuli were delivered at the suprasternal notch via the air pump, aimed at the participant's neck at a pressure of ~7 cm H2O and fixed at ~2.2 cm from the skin surface. This location was chosen because it is one where participants typically receive no direct airflow during their own speech production. Participants' integrated perception was estimated by asking them to type the perceived syllable into the experiment control program, which indicated whether the answer was correct based on the software-provided expected outcome.
Participants were told that they might experience some noise and unexpected puffs of air along with syllables, each consisting of a consonant and a vowel, during the task. Participants were asked to type the syllables that they heard and press the enter key to record their responses. Since the experimental part of the study, which required active listening, lasted about 20 minutes of the total procedure, participants could take short listening breaks if they needed them. The researcher stayed inside the experiment room with the participant during the experiment to check that the placement was not disturbed and to ensure that the participant was comfortable.
4.5 Data Analysis
Of the forty-four participants who took part in the study, data from the thirty-four (34) who met the inclusion criteria were analyzed to answer the research question. Data were first entered and sorted in a Microsoft Excel 2016 spreadsheet. The 32 repetitions were extracted for each staircase, and statistics were run on the final SNR, which corresponds to the 80% accuracy level. Descriptive statistics were run first.
Box plots were used to visualize the variation of SNR with place of articulation (graphs).
The variation of SNR between the audio-only and audio-tactile conditions for each target stimulus was also plotted using a box plot (graph).
Generalized linear mixed-effects models (GLMMs), run in R (R Core Team, 2016), were fitted to the interaction between aspiration [aspirated (‘pa’ and ‘ka’) vs. unaspirated (‘ba’ and ‘ga’) stops], place [labial (‘pa’ and ‘ba’) vs. velar (‘ga’ and ‘ka’)], and artificial air puff (present vs. absent).
Model fitting was then performed in a stepwise backwards iterative fashion, and models were back-fit using the Akaike information criterion (AIC) as the measure of quality of fit. This technique isolates the statistical model that provides the best fit for the data, allowing elimination of interactions in a statistically appropriate manner. The final model was:
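As an aside on the selection criterion, the AIC comparison at the heart of backward fitting can be sketched in a few lines. This shows only the criterion (AIC = 2k - 2 ln L, lower is better), not the stepwise procedure or the R code used in the study; the function names and the example log-likelihoods are hypothetical.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def best_by_aic(candidates):
    """Pick the model name with the lowest AIC.

    candidates: dict mapping a model name to (log_likelihood, n_params).
    """
    return min(candidates, key=lambda name: aic(*candidates[name]))


# Hypothetical values: dropping four parameters costs little likelihood,
# so the reduced model is preferred.
models = {"full": (-100.0, 10), "reduced": (-101.0, 6)}
preferred = best_by_aic(models)
```

A term is dropped during back-fitting when the model without it has the lower AIC, as in the illustrative comparison above.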
SNR ~ place * manner + (1 + (place * manner) | participant)
In this model, the SNR at 80% accuracy was modelled as a function of the following terms: 1) place of articulation (labial vs. velar), 2) manner of articulation (voiced vs. voiceless), 3) the interaction of place and manner, and 4) the full-factorial random effect covering place and manner of articulation by participant.
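Written out, the model formula corresponds to the following equation for observation i from participant j. This is a sketch that assumes a Gaussian response with an identity link and dummy-coded (0/1) predictors, neither of which is stated in the text; the symbols are introduced here for illustration.

```latex
\mathrm{SNR}_{ij} =
    (\beta_0 + u_{0j})
  + (\beta_1 + u_{1j})\,\mathrm{place}_{ij}
  + (\beta_2 + u_{2j})\,\mathrm{manner}_{ij}
  + (\beta_3 + u_{3j})\,\mathrm{place}_{ij}\,\mathrm{manner}_{ij}
  + \varepsilon_{ij},
\qquad
(u_{0j}, u_{1j}, u_{2j}, u_{3j})^{\top} \sim \mathcal{N}(\mathbf{0}, \Sigma),
\quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^{2})
```

The \beta terms are the fixed effects shared by all participants, and the u_j terms are the by-participant random intercept and slopes implied by the random-effects part of the formula.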