Model of Speech Recognition Using MFCC Extraction

16th Dec 2019 Dissertation


Chapter 1.


1.1 Introduction

Speech is the most natural way for human beings to communicate. While this has been true since the dawn of civilization, the invention and widespread use of the telephone, television, radio and audio storage media has given even more importance to voice communication and voice processing [2]. Advances in digital signal processing technology have led to the use of speech processing in many different application areas, such as speech compression, enhancement, synthesis and recognition [4]. In this thesis, home automation is accomplished through speech processing, by developing an isolated-word speech recognition system based on the MFCC approach. The concept of a machine that can recognize the human voice has long been an accepted feature of science fiction. Bill Gates (co-founder of Microsoft Corp.) has described automatic speech recognition (ASR) as one of the most important innovations for future operating systems. From a technology perspective it is possible to distinguish between two broad types of ASR: direct voice input (DVI) and large-vocabulary continuous speech recognition (LVCSR). DVI devices are aimed mainly at voice command-and-control, while LVCSR systems are used for form filling or voice-based document creation. In both cases, the underlying technology is more or less the same. DVI systems are typically configured for small to medium sized vocabularies (up to several thousand words) and may employ word or phrase spotting techniques. DVI systems are also usually required to respond immediately to a voice command. LVCSR systems involve vocabularies of perhaps hundreds of thousands of words and are normally configured to transcribe continuous speech. From an application point of view, the benefits of using ASR derive from providing an additional communication channel in hands-busy, eyes-busy human-machine interaction (HMI), or simply from the fact that talking can be faster than typing.

1.2 Objectives

The objective of the project is to design a home automation system based on a speech recognition model that uses MFCC feature extraction.

1.3 Motivation

The motivation for home automation with speech recognition is simple: speech is man's principal means of communication and is, therefore, a convenient and accessible way of communicating with machines. Speech communication has evolved to be efficient and robust, and it is clear that the route to computer speech recognition is the modelling of the human system. Unfortunately, from a pattern recognition standpoint, humans recognize speech through a very complex interaction between several levels of processing, using syntactic and semantic information as well as very powerful low-level pattern classification and processing. Powerful classification algorithms and sophisticated front ends are, ultimately, not enough; many other forms of knowledge, for example linguistic, semantic and pragmatic, must be built into the recognizer. Nor, even at a lower level of sophistication, is it sufficient simply to generate “a good” representation of speech (i.e., a good set of features to be used in a pattern classifier); the classifier itself must have a considerable degree of sophistication. It is the case, however, that features which do not discriminate effectively between classes are of little use to any classifier and, in addition, the better the features, the easier the classification task. Automatic speech recognition is therefore an engineering compromise between the ideal, i.e. a complete model of the human being, and the practical, i.e. the tools that science and technology provide and that costs allow.

At the highest level, speaker recognition systems contain two main modules (see Fig. 1.1): feature extraction and feature matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching is the procedure for identifying an unknown speaker by comparing the features extracted from his or her voice input with those of a set of known speakers. We will discuss each module in detail in later sections.

Figure 1.1. Speaker identification training

Figure 1.2. Speaker identification testing

 1.4 Problem Statement

The main problem in the majority of home automation systems is authentication. Since all appliances can be controlled by many users, a problem arises when people trigger the appliances unknowingly during normal conversation. There is also the possibility of intruders or neighbours operating the electrical appliances.

1.5 Methodology

To overcome this authentication problem, a highly reliable speech recognition system must be applied, one that minimizes the risk of appliances being operated unknowingly. Speech analysis, also known as front-end analysis or feature extraction, is the first step in an automatic speech recognition system. This process aims to extract acoustic features from the speech waveform. The output of front-end analysis is a compact and efficient set of parameters that represent the acoustic properties observed in the input speech signals, for subsequent use in acoustic modelling. There are three main types of front-end processing techniques, namely linear predictive coding (LPC), mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP), of which the latter two are most commonly used in ASR systems. Here we use the best-known and most popular technique, mel-frequency cepstral coefficient (MFCC) extraction.


1.6 Outline of Thesis

This thesis is organized as follows.

In Chapter 1, the introduction and the objective of the project are given.

In Chapter 2, the literature survey is presented.

In Chapter 3, the hardware implementation of home automation with speech processing is described.

In Chapter 4, the algorithms used in this thesis are discussed.

In Chapter 5, the software implementation of home automation with speech processing is described.

In Chapter 6, the GUI of the method is given.

In Chapter 7, the results are discussed, together with the implementation of the project and the scope for future work.

Chapter 2

Literature Survey

2.1 Basic Acoustics and Speech Signal

As background to the field of speech recognition, this chapter discusses how the speech signal is produced and perceived by humans. This is an essential issue that must be considered before one can decide which approach to use for speech recognition.

2.1.1 The Voice signal

Human communication can be seen as a comprehensive scheme of speech production and speech perception between the speaker and the listener, see Figure 2.1.

Figure 2.1. Schematic diagram of the speech production/perception process

The scheme comprises five different elements: A. speech formulation, B. the human vocal mechanism, C. acoustic air, D. perception of the ear, and E. speech comprehension. The first element (A. speech formulation) is associated with the formulation of the speech signal in the speaker's mind. This formulation is used by the human vocal mechanism (B. human vocal mechanism) to produce the actual speech waveform. The waveform is transferred through the air (C. acoustic air) to the listener. During this transfer the acoustic wave can be affected by external sources, e.g. noise, resulting in a more complex waveform. When the wave reaches the listener's hearing system (the ears), the listener perceives the waveform (D. perception of the ear) and the listener's mind (E. speech comprehension) starts processing this waveform to comprehend its content, so that the listener understands what the speaker is trying to say. One issue in speech recognition is to “simulate” how the listener processes the speech produced by the speaker. Several actions take place in the listener's head and hearing system while speech signals are processed. The perception process can be seen as the inverse of the speech production process. The basic theoretical unit for describing how linguistic meaning is brought to the speech formed in the mind is called the phoneme. Phonemes can be grouped based on the properties of either their time waveform or their frequency characteristics, and classified into the different sounds produced by the human vocal tract.

Speech is:

•A time-varying signal,

•A well-structured communication process,

•Depends on known physical movements,

•Composed of known, distinct units (phonemes),

•It is different for each speaker,

• Can be fast, slow or variable speed,

•May have high pitch, low pitch, or be whispered,

•Has widely different types of environmental noise,

•May lack distinct boundaries between units (phonemes),

•Has an unlimited number of words.


2.1.2 Speech Production

To understand how speech is produced, one needs to know how the human vocal mechanism is constructed, see Figure 2.2. The most important parts of the human vocal mechanism are the vocal tract together with the nasal cavity, which begins at the velum. The velum is a trapdoor-like mechanism that is used to formulate nasal sounds when needed. When the velum is lowered, the nasal cavity is coupled with the vocal tract to formulate the desired speech signal. The cross-sectional area of the vocal tract is limited by the tongue, lips, jaw and velum, and varies from 0 to 20 cm2.

Figure 2.2. Human vocal mechanism

2.1.3 Properties of Human Voice

One of the most important parameters of a sound is its frequency. Sounds are distinguished from one another by their frequencies. When the frequency of a sound increases, the sound becomes sharper and more irritating. When the frequency of a sound decreases, the sound becomes deeper. Sound waves are produced by the vibration of materials. The highest frequency that a human being can produce is about 10 kHz, and the lowest is about 70 Hz. These are the maximum and minimum values, and this frequency range differs from person to person. The magnitude of a sound is expressed in decibels (dB). Normal human speech has a frequency range of 100 Hz to 3200 Hz and a magnitude in the range of 30 dB to 90 dB. The human ear can perceive sounds in the frequency range between 16 Hz and 20 kHz, and it is sensitive to a frequency change of 0.5%.

Characteristics of the speaker:

         Due to differences in vocal tract length, the speech of men, women and children is different.

         Regional accents are expressed as differences in resonant frequencies, durations and pitch.

         Individuals have resonant frequency patterns and duration patterns that are unique (which allows us to identify the speaker).

         Training on data from one type of speaker automatically “learns” the characteristics of that group or person, and makes the recognition of other types of speakers much worse.

 2.2 Automatic Speech Recognition (ASR)

2.2.1 Introduction

Speech processing is the study of speech signals and of the methods for processing these signals. The signals are usually processed in a digital representation, so speech processing can be seen as the intersection of digital signal processing and natural language processing. Natural language processing is a subfield of artificial intelligence and linguistics. It studies the problems of automated generation and understanding of natural human languages. Natural language generation systems convert information from computer databases into normal-sounding human language, and natural language understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate.

Speech coding

Speech coding is the compression of speech (into a code) for transmission, using speech codecs that rely on audio signal processing and speech processing techniques. The techniques used are similar to those of audio data compression and audio coding, where knowledge of psychoacoustics is used to transmit only the data that is relevant to the human auditory system. For example, in narrowband speech coding, only information in the frequency band from 400 Hz to 3500 Hz is transmitted, but the reconstructed signal is still adequate for intelligibility. However, speech coding differs from audio coding in that much more statistical information is available about the properties of speech. In addition, some auditory information that is relevant in audio coding may be unnecessary in the speech coding context. In speech coding, the most important criterion is the preservation of the intelligibility and “pleasantness” of the speech, with a limited amount of transmitted data. It should be noted that the intelligibility of speech includes, in addition to the literal content, speaker identity, emotions, intonation, timbre etc., all of which are important for perfect intelligibility. The more abstract concept of the pleasantness of degraded speech is a different property from intelligibility, since it is possible for degraded speech to be completely intelligible but subjectively annoying to the listener.

Speech synthesis

Speech synthesis is the artificial production of human speech. A text-to-speech (TTS) system converts normal text into speech; other systems render symbolic linguistic representations such as phonetic transcriptions into speech. Synthesized speech can also be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific domains, the storage of whole words or phrases allows high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely “synthetic” voice output. The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written works on a home computer. Many operating systems have included speech synthesizers since the 1980s.

Voice analysis

Voice problems that require voice analysis most commonly originate in the vocal cords, since they are the sound source and are thus the most prone to tiring. However, analysis of the vocal cords is physically difficult. The location of the vocal cords effectively prohibits direct measurement of their movement. Imaging methods such as X-rays or ultrasound do not work because the vocal cords are surrounded by cartilage, which distorts the image quality. Movements of the vocal cords are rapid, with fundamental frequencies usually between 80 and 300 Hz, which prevents the use of ordinary video. High-speed video provides an option, but to see the vocal cords the camera has to be placed in the throat, which makes speaking rather difficult. The most important indirect methods are inverse filtering of sound recordings and electroglottography (EGG). In inverse filtering methods, the speech sound is recorded outside the mouth and then filtered by a mathematical method to remove the effects of the vocal tract. This method produces an estimate of the waveform of the pressure pulse, which in turn indirectly indicates the movements of the vocal cords. The other kind of indirect indication is electroglottography, which works with electrodes attached to the subject's throat near the vocal cords. Changes in the conductivity of the throat indirectly indicate how large a portion of the vocal cords are touching each other. It thus yields one-dimensional information about the contact area. Neither inverse filtering nor EGG is therefore sufficient to describe the glottal movement completely; both provide only indirect evidence of that movement.

Speech recognition

Speech recognition is the process by which a computer (or another type of machine) identifies spoken words. Basically, it means talking to the computer and having it correctly recognize what is being said. This is the key to any speech-related application. As will be explained later, there are a number of ways to do this, but the basic principle is to somehow extract certain key features from the uttered speech and then treat those features as the key to recognizing the word when it is spoken again.

 2.2.2 Basics of Speech Recognition


An utterance is the vocalization (speaking) of a word or words that represent a single meaning to the computer. Utterances can be a single word, a few words, a sentence or even multiple sentences.

Figure 2.3. Utterance of “Hello”

Speaker dependence

Speaker-dependent systems are designed around a specific speaker. They are usually more accurate for the correct speaker, but much less accurate for other speakers. They assume that the speaker will speak with a consistent voice and tempo. Speaker-independent systems are designed for a variety of speakers. Adaptive systems generally start as speaker-independent systems and use training techniques to adapt to the speaker and so increase their recognition accuracy.


Vocabularies (or dictionaries) are lists of words or utterances that can be recognized by the ASR system. Smaller vocabularies are usually easier for a computer to recognize, while larger vocabularies are more difficult. Unlike normal dictionaries, each entry does not have to be a single word. It may be as long as a sentence or two. Smaller vocabularies may have only 1 or 2 recognized utterances (for example, “Wake Up”), while large vocabularies may have 100 thousand or more.


The ability of a recognizer can be examined by measuring its accuracy, that is, how well it recognizes utterances. This includes not only correctly identifying an utterance but also identifying when a spoken utterance is not in its vocabulary. Good ASR systems have an accuracy of 98% or more. The acceptable accuracy of a system really depends on the application.


Some speech recognizers have the ability to adapt to a speaker. When a system has this ability, it may allow training to take place. An ASR system is trained by having the speaker repeat standard or common phrases and adjusting its comparison algorithms to match that particular speaker. Training a recognizer usually improves its accuracy. Training can also be used by speakers who have difficulty speaking or pronouncing certain words. As long as the speaker can consistently repeat an utterance, ASR systems with training should be able to adapt.



2.2.3 Classification of ASR Units

A speech recognition system can operate under many different conditions, such as speaker-dependent or speaker-independent, isolated or continuous speech, and small or large vocabulary recognition. Speech recognition systems can be divided into several classes according to the kinds of utterances they are able to recognize. These classes are based on the fact that one of the difficulties of ASR is determining when a speaker begins and ends an utterance. Most packages can fit into more than one class, depending on the mode in use.

Isolated words

Isolated-word recognizers generally require each utterance to have quiet (a lack of audio signal) on both sides of the sample window. This does not mean that they accept only single words, but they do require a single utterance at a time. Often, these systems have “listen/not-listen” states, which require the speaker to wait between utterances (processing is generally done during the pauses). “Isolated utterance” would be a better name for this class.

Connected words

Connected-word systems (or, more correctly, “connected utterances”) are similar to isolated words, but they allow separate utterances to be “run together” with a minimal pause between them.

Continuous speech

Continuous recognition is the next step. Recognizers with continuous speech capabilities are some of the most difficult to create because they must use special methods to determine utterance boundaries. Continuous speech recognizers allow users to speak almost naturally, while the computer determines the content. Basically, it is computer dictation.

Spontaneous speech

There appears to be a variety of definitions of what spontaneous speech actually is. At a basic level, it can be thought of as speech that is natural-sounding and not rehearsed. An ASR system with spontaneous speech capability should be able to handle a variety of natural speech features such as words being run together, “ums” and “ahs”, and even slight stutters.

Speaker dependence

ASR engines are classified as speaker-dependent or speaker-independent. Speaker-dependent systems are trained with one speaker, and recognition is performed only for that speaker. Speaker-independent systems are trained with one set of speakers and must recognize speakers outside that set. This is obviously much more complex than speaker-dependent recognition. A problem of intermediate complexity would be to train with a group of speakers and recognize the speech of a speaker within that group. We could call this speaker-group-dependent recognition.


2.2.4 Why Is Automatic Speech Recognition Difficult?

There may be problems in speech recognition that have not yet been discovered. However, a number of problems have been identified over recent decades, most of which still remain unsolved. Some of the main problems in ASR are:

Determination of word boundaries

Speech is usually continuous in nature and word boundaries are not clearly defined. One of the common errors in continuous speech recognition is missing a minuscule gap between words. This happens when the speaker is speaking at a high speed.

Different accents

People from different parts of the world pronounce words differently. This leads to errors in ASR. However, this is a problem that is not limited to ASR but also plagues human listeners.

Large vocabularies

When the number of words in the database is large, similar-sounding words tend to cause a high error rate, i.e. there is a good chance that one word is recognized as another.

Change in the acoustics of the room

Noise is a major factor in ASR. In fact, it is in noisy conditions or in changing room acoustics that the limitations of present-day ASR engines become prominent.

Temporal variation

Different speakers speak at different speeds. Present-day ASR engines are simply not able to adapt to this.


2.2.5 Speech Analyzer

Speech analysis, also known as front-end analysis or feature extraction, is the first step in an automatic speech recognition system. This process aims to extract acoustic features from the speech waveform. The output of front-end analysis is a compact and efficient set of parameters that represent the acoustic properties observed in the input speech signals, for subsequent use in acoustic modelling. There are three main types of front-end processing techniques, namely linear predictive coding (LPC), mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP), of which the latter two are most commonly used in ASR systems.

Linear predictive coding

LPC starts with the assumption that a speech signal is produced by a buzzer at the end of a tube (voiced sounds), with occasional added hissing and popping sounds. Although apparently crude, this model is actually a close approximation of the reality of speech production. The glottis (the space between the vocal cords) produces the buzz, which is characterized by its intensity (loudness) and frequency (pitch). The vocal tract (mouth and throat) forms the tube, which is characterized by its resonances, which are called formants. Hisses and pops are generated by the action of the tongue, lips and throat during sibilants and plosives.

LPC analyzes the speech signal by estimating the formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining buzz. The process of removing the formants is called inverse filtering, and the signal that remains after the filtered, modelled signal has been subtracted is called the residue. The numbers that describe the intensity and frequency of the buzz, the formants and the residue signal can be stored or transmitted somewhere else. LPC synthesizes the speech signal by reversing the process: use the buzz parameters and the residue to create a source signal, use the formants to create a filter (which represents the tube), and run the source through the filter, which results in speech. Because speech signals vary over time, this process is performed on short pieces of the speech signal, which are called frames; usually 30 to 50 frames per second give intelligible speech with good compression.
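As a rough illustration of the analysis step, the predictor coefficients for one frame can be estimated with the autocorrelation method and the Levinson-Durbin recursion. This is a minimal sketch in plain Python and is not taken from this thesis; the function names `autocorrelation` and `lpc_coefficients` and the test frame are assumptions made for the example.

```python
def autocorrelation(frame, max_lag):
    """Return r[0..max_lag], where r[lag] = sum over t of frame[t] * frame[t + lag]."""
    n = len(frame)
    return [
        sum(frame[t] * frame[t + lag] for t in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def lpc_coefficients(frame, order):
    """Estimate a[1..order] so that frame[n] is approximated by
    sum(a[k] * frame[n - k]); also return the residual energy."""
    r = autocorrelation(frame, order)
    a = [0.0] * (order + 1)      # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        if error <= 0.0:         # silent or perfectly predicted frame
            break
        # Reflection coefficient for model order i.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:], error


# Hypothetical test frame: each sample is half the previous one, so a
# first-order predictor with a[1] = 0.5 fits it almost exactly.
frame = [0.5 ** n for n in range(50)]
coeffs, residual_energy = lpc_coefficients(frame, order=2)
```

In this sketch the coefficients play the role of the filter that models the tube, and the residual energy measures what is left for the source signal to explain.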

Mel frequency Cepstrum coefficients

These are derived from a type of cepstral representation of the audio clip (a “spectrum-of-a-spectrum”). The difference between the cepstrum and the mel-frequency cepstrum (MFC) is that in the MFC the frequency bands are positioned logarithmically (on the mel scale), which approximates the response of the human auditory system more closely than the linearly spaced frequency bands obtained directly from the FFT or DCT. This allows for better data processing, for example, in audio compression. However, unlike the sonogram, MFCCs lack an outer ear model and therefore may not represent perceived loudness accurately. MFCCs are commonly derived as follows:

1. Take the Fourier transform of (a windowed excerpt of) a signal.

2. Map the log amplitudes of the spectrum obtained above onto the mel scale, using triangular overlapping windows.

3. Take the discrete cosine transform of the list of mel log amplitudes, as if it were a signal.

4. The MFCCs are the amplitudes of the resulting spectrum.
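The last two steps of this derivation can be sketched as follows, assuming the mel filter energies for one frame have already been computed. The function names `dct_ii` and `mfcc_from_filter_energies`, the `floor` parameter and the sample energies are illustrative assumptions rather than part of the original work, and the DCT is left unnormalised for simplicity.

```python
import math


def dct_ii(values):
    """Unnormalised DCT-II: c[k] = sum_i values[i] * cos(pi * k * (i + 0.5) / n)."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n)
    ]


def mfcc_from_filter_energies(energies, n_coeffs, floor=1e-10):
    """Take the log of each mel filter energy, then keep the first n_coeffs DCT terms."""
    # The floor avoids log(0) for filters that received no energy.
    log_energies = [math.log(max(e, floor)) for e in energies]
    return dct_ii(log_energies)[:n_coeffs]


# Hypothetical energies from 8 mel filters for one frame.
energies = [4.0, 9.5, 12.1, 7.3, 3.2, 1.4, 0.6, 0.2]
coeffs = mfcc_from_filter_energies(energies, n_coeffs=4)
```

Keeping only the first few DCT terms retains the smooth shape of the log spectrum, which is the part that mostly reflects the vocal tract.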

2.2.6 Speech Classifier

The problem of ASR belongs to a much broader topic in science and engineering called pattern recognition. The goal of pattern recognition is to classify objects of interest into one of a number of categories or classes. The objects of interest are generically called patterns and in our case are sequences of acoustic vectors that are extracted from input speech using the techniques described in the previous section. The classes here refer to individual speakers. Since the classification procedure in our case is applied to extracted features, it can also be called feature matching. The state of the art in feature matching techniques for speaker recognition includes dynamic time warping (DTW), hidden Markov modelling (HMM) and vector quantization (VQ).

Chapter 3.

The Feature Extraction

3.1 Process

Deriving the acoustic features of the speech signal is called feature extraction. Feature extraction is used in both the training and testing phases. It consists of the following steps:

1. Frame blocking

2. Windowing

3. FFT (fast Fourier transform)

4. Mel-frequency warping

5. Cepstrum (mel-frequency cepstral coefficients)


This stage is known as the front-end processing of speech. The main objective of feature extraction is to simplify recognition by summarizing the vast amount of speech data without losing the acoustic properties that define the speech [12]. A schematic diagram of the steps is shown in Figure 3.1.

Figure 3.1 Feature Extraction process

3.1.1 Frame Blocking

Investigations show that the characteristics of the speech signal stay stationary over a sufficiently short time interval (the signal is said to be quasi-stationary). For this reason, speech signals are processed in short time intervals. The signal is divided into frames, with sizes usually between 30 and 100 milliseconds. Each frame overlaps the previous frame by a predefined amount. The overlapping scheme aims to smooth the transition from frame to frame [12].
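Frame blocking can be sketched in a few lines. This is a minimal illustration only: the 8000 Hz sample rate, the 30 ms frame length, the 50% overlap and the function name `frame_signal` are assumptions made for the example, not values taken from this project.

```python
def frame_signal(samples, frame_len, hop_len):
    """Split samples into overlapping frames of frame_len samples,
    advancing hop_len samples between frame starts.
    Trailing samples that do not fill a whole frame are dropped."""
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame_len and hop_len must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


sample_rate = 8000                        # assumed sampling rate in Hz
frame_len = sample_rate * 30 // 1000      # 30 ms -> 240 samples
hop_len = frame_len // 2                  # 50% overlap -> 120 samples
signal = list(range(1000))                # placeholder for real audio samples
frames = frame_signal(signal, frame_len, hop_len)
```

With these assumed values each frame shares its second half with the first half of the next frame, which is what smooths the frame-to-frame transition.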


3.1.2 Windowing

The second step is to window all frames. This is done to eliminate discontinuities at the edges of the frames. If the window function is defined as w(n), 0 ≤ n ≤ N-1, where N is the number of samples in each frame, the resulting signal is y(n) = x(n) w(n). Generally, Hamming windows are used [12].
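A minimal sketch of this step, using the standard Hamming window w(n) = 0.54 - 0.46 cos(2πn / (N-1)). The function names `hamming` and `apply_window` are chosen for the example and are not from the original implementation.

```python
import math


def hamming(n_samples):
    """Return the Hamming window w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if n_samples == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
        for n in range(n_samples)
    ]


def apply_window(frame, window):
    """Compute y(n) = x(n) * w(n) for one frame."""
    return [x * w for x, w in zip(frame, window)]


window = hamming(5)
windowed = apply_window([1.0, 1.0, 1.0, 1.0, 1.0], window)
```

The window tapers each frame towards 0.08 at both edges and leaves the centre almost unchanged, which reduces the edge discontinuities mentioned above.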

3.1.3 Fast Fourier transform

The next step is to take the fast Fourier transform of each frame. The FFT is a fast algorithm for computing the discrete Fourier transform, and it converts each frame from the time domain to the frequency domain [12].
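For illustration, a recursive radix-2 FFT and the resulting power spectrum can be written in plain Python as below. This is a sketch under the assumption that the frame length is a power of two; a practical system would normally call an optimised library routine instead. The names `fft` and `power_spectrum` and the 16-sample test frame are assumptions for the example.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def power_spectrum(frame):
    """Squared magnitude of the non-negative frequency bins of one frame."""
    spectrum = fft(frame)
    return [abs(c) ** 2 for c in spectrum[: len(frame) // 2 + 1]]


# Hypothetical frame: a cosine that completes exactly 2 cycles in 16 samples,
# so its energy should appear in frequency bin 2.
frame = [math.cos(2.0 * math.pi * 2 * n / 16) for n in range(16)]
spectrum = power_spectrum(frame)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
```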

 3.1.4 Mel frequency warping

The human ear perceives frequencies non-linearly. Research shows that the scale is linear up to 1 kHz and logarithmic above. The mel-scale (melody scale) filter bank characterizes the human ear's perception of frequency. It is used as a band-pass filter for this stage of identification. The signal of each frame is passed through the mel-scale band-pass filter to mimic the human ear [17] [12] [18]. As mentioned previously, psychophysical studies have shown that human perception of the frequency content of sounds for speech signals does not follow a linear scale. Thus for each tone with an actual frequency, f, measured in Hz, a subjective pitch is measured on a scale known as the “mel” scale. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. As a point of reference, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels. We can therefore use the following approximate formula to calculate the mels for a given frequency f in Hz:

mel(f) = 2595 * log10(1 + f / 700)
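A commonly used form of this approximation is mel(f) = 2595 log10(1 + f / 700); the sketch below assumes that form and also gives its inverse. The function names `hz_to_mel` and `mel_to_hz` are chosen for the example.

```python
import math


def hz_to_mel(freq_hz):
    """Approximate mel value for a frequency in Hz (assumed 2595/700 form)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel: frequency in Hz for a given mel value."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


reference = hz_to_mel(1000.0)             # close to the 1000 mel reference point
round_trip = mel_to_hz(hz_to_mel(440.0))  # should return roughly 440 Hz
```

The approximation maps 1000 Hz to very nearly 1000 mels, which agrees with the reference point given above, and it compresses higher frequencies progressively more.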

One approach to simulating the subjective spectrum is to use a filter bank, with one filter for each desired mel-frequency component. Each filter in the bank has a triangular band-pass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. The modified spectrum of S(ω) thus consists of the output power of these filters when S(ω) is the input. The number of mel spectrum coefficients, K, is usually chosen as 20. Note that this filter bank is applied in the frequency domain; it therefore simply amounts to applying the triangle-shaped windows of Figure 3.2 to the spectrum. A useful way of thinking about this mel-warped filter bank is to view each filter as a histogram bin (where bins overlap) in the frequency domain. A useful and efficient way to implement this is to consider these triangular filters on the mel scale, where they are in effect equally spaced filters.
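The following sketch builds such a bank of triangular filters whose centres are equally spaced on the mel scale, and applies it to a power spectrum. It is an illustration under stated assumptions rather than the implementation used in this project: the 2595/700 mel approximation, the bin-mapping rule, the 512-point FFT, the 8000 Hz sample rate and all function names are assumed for the example.

```python
import math


def hz_to_mel(freq_hz):
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Return n_filters triangular filters, each a list of n_fft // 2 + 1 weights."""
    n_bins = n_fft // 2 + 1
    max_mel = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 edge points, equally spaced on the mel scale.
    mel_points = [max_mel * i / (n_filters + 1) for i in range(n_filters + 2)]
    bin_points = [
        int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate))
        for m in mel_points
    ]
    bank = []
    for j in range(1, n_filters + 1):
        left, centre, right = bin_points[j - 1], bin_points[j], bin_points[j + 1]
        weights = [0.0] * n_bins
        for k in range(left, centre):      # rising edge of the triangle
            weights[k] = (k - left) / (centre - left)
        for k in range(centre, right):     # falling edge of the triangle
            weights[k] = (right - k) / (right - centre)
        bank.append(weights)
    return bank


def filterbank_energies(power, bank):
    """Weighted sum of the power spectrum under each triangular filter."""
    return [sum(w * p for w, p in zip(weights, power)) for weights in bank]


bank = mel_filterbank(n_filters=20, n_fft=512, sample_rate=8000)
energies = filterbank_energies([1.0] * 257, bank)  # flat placeholder spectrum
```

Each filter acts like one overlapping histogram bin: narrow at low frequencies and wider at high frequencies, because equal steps in mel correspond to growing steps in Hz.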

Figure 3.2. Filter Bank on Mel frequency scale

3.1.5 Cepstrum

The name cepstrum was derived from “spectrum” by reversing its first four letters. The cepstrum is the Fourier transform of the logarithm, with unwrapped phase, of the Fourier transform of the signal.

         Mathematically: cepstrum of signal = FT(log(FT(the signal)) + j2πm)

where m is the integer required to properly unwrap the angle or imaginary part of the complex logarithm function.

         Algorithmically: signal → FT → log → phase unwrapping → FT → cepstrum

The real cepstrum uses the logarithm function defined for real values, while the complex cepstrum uses the complex logarithm function defined for complex values. The real cepstrum uses only the magnitude information of the spectrum, whereas the complex cepstrum holds information about both the magnitude and the phase of the initial spectrum, which allows reconstruction of the signal. The cepstrum can be calculated in many ways. Some of them need a phase-warping algorithm, others do not. The following figure shows the pipeline from signal to cepstrum.

As discussed in the sections on frame blocking and windowing, the speech signal is composed of a quickly varying part, the excitation sequence e(n), convolved with a slowly varying part, the impulse response of the vocal system.

Once the quickly varying part and the slowly varying part have been convolved, it is difficult to separate them; the cepstrum is introduced to separate these two parts.
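The real cepstrum can be sketched as below. Because it uses only the magnitude of the spectrum, no phase unwrapping is needed. The sketch assumes the common convention of using an inverse transform for the second step (for a real, symmetric log magnitude this differs from a forward transform only by scaling), and it uses a slow direct DFT for clarity. The names `dft` and `real_cepstrum`, the `floor` parameter and the test frame are assumptions for the example.

```python
import cmath
import math


def dft(x, inverse=False):
    """Direct O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = sum(
            x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n)
        )
        out.append(acc / n if inverse else acc)
    return out


def real_cepstrum(frame, floor=1e-12):
    """Inverse DFT of the log magnitude of the DFT of the frame."""
    spectrum = dft(frame)
    # The floor avoids log(0) for bins with no energy.
    log_magnitude = [math.log(max(abs(c), floor)) for c in spectrum]
    return [c.real for c in dft(log_magnitude, inverse=True)]


# Hypothetical frame: a single impulse of height e has a flat spectrum of
# magnitude e, so its log magnitude is 1 in every bin.
cepstrum = real_cepstrum([math.e, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```

Taking the logarithm turns the product of the excitation spectrum and the vocal system spectrum into a sum, so the slowly varying part ends up in the low-index cepstral values and the quickly varying part in the high-index values.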

3.1.6 Mel Frequency Cepstrum Coefficient

In this project we are using Mel frequency Cepstral coefficient. Mel frequency Cepstral coefficients are coefficients which represent audio based on perception. This coefficient has a great success in speaker recognition applications. It is derived from the audioclip Fourier transformation. In this technique the frequency bands are placed logarithmic, while in the Fourier frequency bands placed not logarithmic. As logarithmic frequency bands are placed in MFCC, it approximates the response of the human system more closely than any other system. These coefficients enable better data processing. At the Mel frequency Cepstral coefficients compute Mel Cepstrum is the same that the real Cepstrum except Mel Cepstrum frequency scale is twisted to a corresponding Mel scale.

The Mel scale was proposed by Stevens, Volkmann and Newman in 1937. It is based on an observational study of pitch, that is, of frequency as perceived by human listeners. The scale is divided into units called mels. In the test, the listener started out hearing a frequency of 1000 Hz, which was labelled 1000 mel for reference. The listener was then asked to change the frequency until the pitch was perceived as twice that of the reference, and this frequency was labelled 2000 mel. The same procedure was repeated for half the perceived pitch, which was labelled 500 mel, and so on. On this basis, each physical frequency is mapped to a Mel frequency. The Mel scale is approximately linear below 1000 Hz and logarithmically spaced above 1000 Hz. The figure below shows an example of how physical frequency is mapped to Mel frequency.
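The mapping can be sketched numerically as follows. The source does not state which formula was used, so the widely used analytic fit mel = 2595 log10(1 + f/700) is assumed here, and the function names `hz_to_mel` and `mel_to_hz` are illustrative only.

```python
import math


def hz_to_mel(frequency_hz):
    """Convert Hz to mel with the common 2595*log10(1 + f/700) fit (an assumption)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Equal steps in Hz become progressively smaller steps in mel above ~1000 Hz.
for f in (250, 500, 1000, 2000, 4000):
    print(f, "Hz ->", round(hz_to_mel(f)), "mel")
```

With this fit, 1000 Hz maps to approximately 1000 mel, matching the reference point described above, while doubling the frequency from 2000 Hz to 4000 Hz adds far fewer mels than the first 2000 Hz did.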

Chapter 4

Hardware Description

4.1 Components Used

1. Arduino-UNO

2. Motor Driver – IC L293D

3. DC Motors

4. 9V Battery

5. LEDs

6. Battery Caps

7. Resistor


4.1.1 Arduino-Uno

Figure 4.1 Arduino UNO Pin Configuration

The Arduino UNO is a development board based on the ATmega328P microcontroller.

The technical specifications of the Arduino UNO are:

Parameter                      Value
Microcontroller                ATmega328P
Operating voltage              5 V
Input voltage (recommended)    7-12 V
Input voltage (limits)         6-20 V
Digital I/O pins               14 (of which 6 provide PWM output)
PWM digital I/O pins           6
Analog input pins              6
DC current per I/O pin         20 mA
DC current for 3.3 V pin       50 mA
Flash memory                   32 KB (ATmega328P), of which 0.5 KB is used by the boot loader
SRAM                           2 KB (ATmega328P)
EEPROM                         1 KB (ATmega328P)
Clock speed                    16 MHz

Table 4.1 Aspects of Arduino UNO



The Arduino Uno can be powered via the USB connection or with an external power supply. The power source is selected automatically.

External (non-USB) power can come from a battery or an AC-to-DC adapter. The adapter is connected by plugging a 2.1 mm center-positive plug into the board's power jack. Battery leads can be inserted into the GND and Vin pin headers of the power connector.

The board can operate on an external supply of 6 to 20 volts. If it is supplied with less than 7 V, however, the 5V pin may deliver less than 5 volts and the board may become unstable. If more than 12 V is used, the voltage regulator may overheat and damage the board. The recommended range is 7 to 12 volts.

Power supply pins are as follows:

Vin – The input voltage to the board when it is using an external power source (as opposed to 5 volts from the USB connection or another regulated power supply). You can supply voltage through this pin or, if the voltage is supplied via the power jack, access it through this pin.

5V – This pin outputs a regulated 5 V from the regulator on the board. The board can be powered from the DC power jack (7-12 V), the USB connector (5 V), or the Vin pin of the board (7-12 V). Supplying voltage via the 5V or 3.3V pins bypasses the regulator and can damage the board. It is not advised.

3V3 – A 3.3 volt supply generated by the on-board regulator. The maximum current draw is 50 mA.

GND – Ground pins.

IOREF – This pin provides the voltage reference with which the microcontroller operates. A properly configured shield can read the IOREF pin voltage and select the appropriate power source, or enable voltage translators on its outputs to work with 5 V or 3.3 V.

2. Memory

The ATmega328 has 32 KB of flash memory (of which 0.5 KB is occupied by the boot loader). It also has 2 KB of SRAM and 1 KB of EEPROM.

3. Input and Output

See the mapping between Arduino pins and ATmega328P ports. The mapping for the ATmega8, 168 and 328 is identical.

Each of the 14 digital pins on the Uno can be used as an input or output, using the pinMode(), digitalWrite() and digitalRead() functions. They operate at 5 volts. Each pin can provide or receive 20 mA as the recommended operating condition and has an internal pull-up resistor (disconnected by default) of 20-50 kΩ. A maximum of 40 mA must not be exceeded on any I/O pin, to avoid permanent damage to the microcontroller.

In addition, some pins have specialized functions:

Serial: 0 (RX) and 1 (TX). Used to receive (RX) and transmit (TX) TTL serial data. These pins are connected to the corresponding pins of the ATmega16U2 USB-to-TTL serial chip.

External interrupts: 2 and 3. These pins can be configured to trigger an interrupt on a low value, a rising or falling edge, or a change in value.

PWM: 3, 5, 6, 9, 10 and 11. Provide 8-bit PWM output with the analogWrite() function.

SPI: 10 (SS), 11 (MOSI), 12 (MISO), 13 (SCK). These pins support SPI communication.

LED: 13. There is a built-in LED driven by digital pin 13. When the pin is HIGH, the LED is on; when the pin is LOW, it is off.

TWI: A4 (SDA) and A5 (SCL).

The Uno has 6 analog inputs, labelled A0 to A5, each of which provides 10 bits of resolution (i.e. 1024 different values). By default they measure from ground to 5 volts, although it is possible to change the upper end of their range using the AREF pin and the analogReference() function.
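The 10-bit resolution can be made concrete with a small host-side calculation. This is a Python sketch rather than Arduino code; the function name `adc_to_volts` is hypothetical, and it assumes the usual convention of dividing the reference voltage into 1024 equal steps.

```python
def adc_to_volts(reading, reference_volts=5.0, resolution_bits=10):
    """Map a raw ADC count (0 .. 2**bits - 1) to volts.

    Assumes the reading * Vref / 2**bits convention; the default 5.0 V
    reference corresponds to the default range described in the text.
    """
    levels = 2 ** resolution_bits
    if not 0 <= reading < levels:
        raise ValueError("reading outside the ADC range")
    return reading * reference_volts / levels


# One count corresponds to roughly 4.9 mV with the default 5 V reference.
step_volts = adc_to_volts(1)
print("step size in volts:", step_volts)
print("mid-scale reading 512 in volts:", adc_to_volts(512))
```

Lowering the reference voltage shrinks the step size in proportion, which is why a smaller upper range gives finer voltage resolution.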
There are a few other pins on the board:

AREF. Voltage reference for the analog inputs. Used with analogReference().

Reset. Bring this line LOW to reset the microcontroller. Typically used to add a reset button to shields that block the one on the board.

4. Communication

The Uno has a number of facilities for communicating with a computer, another Uno board, or other microcontrollers. The ATmega328 provides UART TTL (5 V) serial communication, which is available on digital pins 0 (RX) and 1 (TX). An ATmega16U2 on the board channels this serial communication over USB and appears as a virtual COM port to software on the computer. The 16U2 firmware uses the standard USB COM drivers, and no external driver is needed. The Arduino Software (IDE) includes a serial monitor that allows simple textual data to be sent to and from the board. The RX and TX LEDs on the board flash when data is being transmitted via the USB-to-serial chip and the USB connection to the computer. A SoftwareSerial library also allows serial communication on any of the Uno's digital pins.

The ATmega328 also supports I2C (TWI) and SPI communication. The Arduino Software (IDE) includes a Wire library to simplify use of the I2C bus.

4.1.2 Motor Driver – IC L293D

The L293D is a dual H-bridge motor driver integrated circuit (IC). Motor drivers act as current amplifiers: they take a low-current control signal and provide a higher-current signal. This higher-current signal is used to drive the motors.

The L293D contains two built-in H-bridge driver circuits. In its common mode of operation, two DC motors can be driven simultaneously, in both the forward and the reverse direction. The two motors are controlled by the input logic at pins 2 and 7 (first motor) and pins 10 and 15 (second motor). Input logic 00 or 11 stops the corresponding motor. Logic 01 and 10 rotate it clockwise and counterclockwise respectively.

Enable pins 1 and 9 (corresponding to the two motors) must be high for the motors to operate. When an enable input is high, the associated driver is enabled; its outputs become active and work in phase with their inputs. When the enable input is low, that driver is disabled and its outputs are off and in the high-impedance state.
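The control logic described above can be summarised as a small truth table. The following Python sketch models one half of the L293D (one motor); the function name `l293d_channel` and the returned strings are illustrative assumptions, and the actual direction of rotation for 01 versus 10 depends on how the motor leads are wired to the outputs.

```python
def l293d_channel(enable, input_a, input_b):
    """Model one L293D motor channel from its enable pin and two logic inputs."""
    if not enable:
        # Enable low: the driver is off and its outputs float.
        return "disabled (outputs high-impedance)"
    if input_a == input_b:
        # 00 or 11: both motor terminals sit at the same level, so no current flows.
        return "stop"
    return "clockwise" if (input_a, input_b) == (0, 1) else "counterclockwise"


# Print the full truth table for one motor.
for enable in (0, 1):
    for a in (0, 1):
        for b in (0, 1):
            print(enable, a, b, "->", l293d_channel(enable, a, b))
```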


Figure 4.2 – Motor Driver IC L293D Pin Configuration

The pin description is as below:

Pin No.   Function                                          Name
1         Enable pin for Motor 1; active high               Enable 1,2
2         Input 1 for Motor 1                               Input 1
3         Output 1 for Motor 1                              Output 1
4         Ground (0V)                                       Ground
5         Ground (0V)                                       Ground
6         Output 2 for Motor 1                              Output 2
7         Input 2 for Motor 1                               Input 2
8         Supply voltage for motors; 9-12V (up to 36V)      Vcc 2
9         Enable pin for Motor 2; active high               Enable 3,4
10        Input 1 for Motor 2                               Input 3
11        Output 1 for Motor 2                              Output 3
12        Ground (0V)                                       Ground
13        Ground (0V)                                       Ground
14        Output 2 for Motor 2                              Output 4
15        Input 2 for Motor 2                               Input 4
16        Supply voltage; 5V (up to 36V)                    Vcc 1

Table 4.2 – Pin description of IC L293D


4.1.3 DC Motors


Figure 4.3 – External View of DC Motor

A conventional DC Motor consists of the following-

  1. Permanent magnets
  2. Armature coil
  3. Commutator rings


Figure 4.4 – Internal Structure of DC Motor

Current is supplied to the armature coil through the commutator rings from a DC voltage source. The direction of current flow is determined by the DC supply. By the Lorentz force law, a current-carrying conductor in a magnetic field experiences a force perpendicular to both the current vector and the magnetic field vector. The opposite sides of the coil, whose current vectors are perpendicular to the field of the permanent magnets, therefore experience forces in opposite directions.


Figure 4.5 – The Forces on the DC Motor Armature Coil

As a result, the coil experiences a torque and rotates. The coil is attached to a shaft, so the rotation can be observed when a DC source is connected. This shaft is connected to a gear or a wheel to obtain the desired result.

The rpm of the motor depends on the strength of the magnetic field (B), the supply current (i), and the DC voltage (V). To avoid damaging any of the components, the preferred voltage range is 7-12 V.

4.1.4 9V Batteries


Figure 4.6 – 9V Battery

Batteries have three parts: an anode (-), a cathode (+) and the electrolyte. The cathode and the anode (the positive and negative terminals at either end of a traditional battery) are connected to an electrical circuit.

Figure 4.7 – Working of Battery

Chemical reactions in the battery lead to a build-up of electrons at the anode. This results in an electrical difference between the anode and the cathode. You can think of this difference as an unstable accumulation of electrons. The electrons want to rearrange themselves to get rid of this difference, but they do so in a particular way: electrons repel each other and try to go to a place with fewer electrons.

In a battery, the only place to go is the cathode. The electrolyte prevents the electrons from going directly from the anode to the cathode inside the battery. When the circuit is closed (a wire connects the cathode and the anode), the electrons are able to reach the cathode. In the figure above, the electrons pass through the wire and light the bulb along the way. This is one way of describing how electrical potential causes electrons to flow through the circuit.

However, these electrochemical processes change the chemicals in the anode and the cathode until they stop supplying electrons. So there is a limited amount of energy available in a battery.

When you recharge a battery, you reverse the direction of the electron flow using another source of energy, such as solar panels. The electrochemical processes run in reverse, and the anode and cathode are restored to their original state and can provide power again.

4.1.5 LED


Figure 4.8 – LED

An LED has two terminals, the cathode and the anode. The anode is the longer terminal and the cathode is the shorter terminal.


Figure 4.9 – LED Working

LEDs are simply diodes that are designed to emit light. When a diode is forward-biased, electrons and holes move back and forth across the junction, constantly combining and annihilating each other. Sooner or later, after an electron moves from the n-type into the p-type silicon, it combines with a hole and disappears. That makes an atom complete and more stable, and it gives off a little burst of energy in the form of a tiny ‘packet’, or photon, of light.










Chapter 5

Software Description

5.1 Software Used

  1. Arduino

5.1.1 Arduino

This project uses Arduino 1.0.5 IDE for programming the Arduino Board.


Figure 5.1 – Arduino IDE

The upper part of the IDE consists of several symbols in the toolbar. Each symbol performs a specific task.

The ‘Tick’ symbol in the upper left corner is the compile button. The ‘Right arrow’ symbol next to the compile button is the upload button. It lets the user upload the code to the Arduino microcontroller.

The ‘Script’ symbol is the new script button. It lets the user open a new script for a new program. The ‘Arrow up’ symbol is the open script button. It opens a previously saved program, selected by browsing the files on the PC, so that saved Arduino programs can be opened directly. The ‘Arrow down’ symbol is the save button. It saves the program that the user has typed in. The file is saved in the ‘.ino’ format by default.

The following steps are to be followed to upload the program to the Arduino:

1. Open the Arduino IDE.

2. Press the open script button to open a previously saved program, or open one of the examples that ship with the software. To open an example program, select File, then Examples.

3. The example programs are listed there. To select the LED blinking program, go to Basics, then Blink.


Figure 5.2 – Selecting / Opening a Program in the Arduino IDE

4. Now that the program is open, select the COM port to which the Arduino board is connected.

5. Select Tools, then Serial Port, and choose the COM port assigned to the Arduino board.

6. Now select the Arduino Uno (used in this project) by going to Tools, Board, Arduino UNO.


Figure 5.3 – Selection of Board in the Arduino IDE

7. Now upload the program by pressing the upload button.











Chapter 6

Algorithm of MFCC approach:

A block diagram of the structure of an MFCC processor is shown in Figure 6.1. The speech input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency is chosen to minimize the effects of aliasing in the analog-to-digital conversion. Signals sampled at this rate can capture all frequencies up to 5 kHz, which cover most of the energy of the sounds generated by humans. The main purpose of the MFCC processor is to mimic the behaviour of the human ear. In addition, MFCCs are shown to be less susceptible to the variations mentioned above than the speech waveforms themselves.

Figure 6.1 Block diagram of MFCC
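The relation between the 10000 Hz sampling rate and the 5 kHz limit follows from the Nyquist criterion, and can be checked with a short Python sketch. The function names `max_representable_frequency` and `alias_frequency` are illustrative and not taken from the project code.

```python
def max_representable_frequency(sampling_rate_hz):
    """Nyquist limit: the highest frequency a sampled signal can represent."""
    return sampling_rate_hz / 2.0


def alias_frequency(tone_hz, sampling_rate_hz):
    """Apparent frequency of a pure tone after sampling without an anti-alias filter."""
    folded = tone_hz % sampling_rate_hz
    return min(folded, sampling_rate_hz - folded)


sampling_rate = 10000
print("Nyquist limit in Hz:", max_representable_frequency(sampling_rate))
# A 4 kHz tone is preserved, while a 7 kHz tone folds back into the band.
print("4000 Hz tone appears at:", alias_frequency(4000, sampling_rate))
print("7000 Hz tone appears at:", alias_frequency(7000, sampling_rate))
```

Any component above half the sampling rate reappears at a lower frequency, which is why the sampling rate has to be at least twice the highest frequency of interest.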

We first stored the speech signal as a vector of 10000 samples. From our experiments it was observed that the actual uttered speech, after eliminating the static portions, came to about 2500 samples, so silence detection was carried out with a simple threshold technique to extract the actual uttered speech. What we wanted to achieve was a voice-based biometric system capable of recognizing isolated words, and our experiments revealed that almost all the isolated words were uttered within 2500 samples. We then applied overlapping triangular windows directly in the frequency domain. We obtained the energy within each triangular window, followed by the DCT of the logarithms of these energies, to achieve good compaction within a small number of coefficients, as described by the MFCC approach.
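The steps in the paragraph above can be sketched in Python as follows. This is only an illustration, not the MATLAB code used in the project: all function names, the threshold value, the number of filters and the mel formula (2595 log10(1 + f/700)) are assumptions, and the power spectrum is taken as given.

```python
import math


def trim_silence(samples, threshold):
    """Keep the span between the first and last sample whose magnitude exceeds threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    return samples[loud[0]:loud[-1] + 1] if loud else []


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)  # assumed mel formula


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def triangular_filterbank(num_filters, num_bins, sampling_rate_hz):
    """Overlapping triangular windows with edges equally spaced on the mel scale.

    num_bins is the number of power-spectrum bins covering 0 .. sampling_rate / 2.
    """
    top_mel = hz_to_mel(sampling_rate_hz / 2.0)
    edges_hz = [mel_to_hz(top_mel * i / (num_filters + 1)) for i in range(num_filters + 2)]
    bin_hz = (sampling_rate_hz / 2.0) / (num_bins - 1)
    bank = []
    for j in range(1, num_filters + 1):
        low, centre, high = edges_hz[j - 1], edges_hz[j], edges_hz[j + 1]
        weights = []
        for k in range(num_bins):
            f = k * bin_hz
            if low < f <= centre:
                weights.append((f - low) / (centre - low))    # rising edge
            elif centre < f < high:
                weights.append((high - f) / (high - centre))  # falling edge
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def mfcc_from_power_spectrum(power_spectrum, bank, num_coefficients):
    """Log of the energy in each triangular window, followed by a DCT-II."""
    floor = 1e-12  # avoids log(0) for empty bands
    log_energies = [
        math.log(max(floor, sum(w * p for w, p in zip(weights, power_spectrum))))
        for weights in bank
    ]
    n = len(log_energies)
    return [
        sum(e * math.cos(math.pi * c * (i + 0.5) / n) for i, e in enumerate(log_energies))
        for c in range(num_coefficients)
    ]


# Hypothetical sizes: 20 filters over 129 spectrum bins at a 10 kHz sampling rate.
bank = triangular_filterbank(num_filters=20, num_bins=129, sampling_rate_hz=10000)
```

In a complete pipeline the power spectrum passed to `mfcc_from_power_spectrum` would come from the FFT of the trimmed, windowed speech; only the first few DCT coefficients are usually kept, which is the compaction referred to above.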

The simulation was carried out in MATLAB. The various stages of the simulation are represented in the plots shown below. The continuous input speech signal taken as an example for this project is the word “Hello”.

Figure 6.2 Word “Hello” taken for analysis

Figure 6.3 Word “Hello” after silence detection

Figure 6.4 Word “Hello” after windowing using Hamming window

Figure 6.5 Word “Hello” after FFT
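The windowing stage shown in Figure 6.4 can be sketched as below. This is a Python illustration rather than the MATLAB code behind the plots; the function names are hypothetical, and the standard Hamming coefficients 0.54 and 0.46 are assumed.

```python
import math


def hamming_window(length):
    """Hamming window w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1)) for n in range(length)]


def apply_window(frame):
    """Taper a frame of samples so that its edges fade towards zero before the FFT."""
    return [s * w for s, w in zip(frame, hamming_window(len(frame)))]


window = hamming_window(5)
tapered = apply_window([1.0] * 5)
print(window)
```

Tapering the frame edges reduces the spectral leakage that an abrupt cut would otherwise introduce in the FFT stage.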

6.1 MFCC Approach Flowchart

Chapter 7

Graphical User Interface

Figure 7.1 GUI that represents the initial window for users to record

The window above is the first window that appears after running the program. Here different users can store ordered samples by clicking on the respective fields.

Figure 7.2 GUI that represents different cases for User 1

Figure 7.3 GUI that represents different cases of User 2

Figure 7.4 GUI indicating to record the input signal for processing

In this window the user is asked to record the input signal to be processed. Once the input is given, the signal is processed using the MFCC approach and the mean square error is calculated against each stored case. If the mean square error is less than the threshold value, the corresponding case is shown in the graphical interface.

Figure 7.6 GUI that represents when the fan signal is detected

When the input signal does not match any of the stored signals, the message “not detected” is displayed in the GUI.
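The decision rule described above can be sketched in Python. The function names, the example feature vectors and the threshold value are all hypothetical, and choosing the stored case with the smallest error is an assumption; the text only states that a case is shown when its mean square error falls below the threshold.

```python
def mean_square_error(a, b):
    """Mean of the squared differences between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def match_command(features, templates, threshold):
    """Return the label of the closest stored template, or 'not detected'."""
    best_label, best_error = None, float("inf")
    for label, template in templates.items():
        error = mean_square_error(features, template)
        if error < best_error:
            best_label, best_error = label, error
    return best_label if best_error < threshold else "not detected"


# Made-up MFCC-like vectors for two stored cases.
stored = {"fan": [1.0, 2.0, 3.0], "light": [4.0, 5.0, 6.0]}
print(match_command([1.1, 2.0, 3.0], stored, threshold=0.1))
print(match_command([10.0, 10.0, 10.0], stored, threshold=0.1))
```

The threshold trades off false accepts against false rejects, so in practice it would be tuned on recorded samples from each user.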

Figure 7.5 GUI when no signal is detected
