Voice Recognition System: Characteristic of Signal of Speech

Info: 14503 words (58 pages) Dissertation
Published: 11th Dec 2019


ABSTRACT

Technology has advanced in every field and has steadily reduced the time and effort required of people. One application that reduces human effort is the automation of appliances using speech processing: basic electrical appliances in the home can be controlled by voice.

Voice recognition is the process of automatically recognizing certain words spoken by a particular speaker, based on individual information contained in the speech waves. This technique makes it possible to use the speaker’s voice to verify their identity and to control access to services such as voice-based biometrics, database access, voice dialing, voice mail and remote access to computers. The signal-processing front end, which extracts the feature set, is an important stage in any speech recognition system. The optimal feature set has not yet been determined despite the efforts of researchers. There are many types of features, which are derived in different ways and have a strong impact on the recognition rate. This project presents one technique for extracting features from a speech signal that can be used in speech recognition systems. The key is to convert the speech waveform into some kind of parametric representation (at a considerably lower information rate) for further analysis and processing. This is often referred to as the signal-processing front end. A wide range of parametric representations of the speech signal is available for the speaker recognition task. Mel-frequency cepstral coefficients (MFCCs) are perhaps the best known and most popular, and they are used in this project. MFCCs are based on the known variation of the critical bandwidths of human hearing with frequency: filters spaced linearly at low frequencies and logarithmically at high frequencies are used to capture the phonetically important characteristics of speech. Another key property of speech is its quasi-stationarity, i.e. it is stationary only over short intervals, which is why it is studied and analyzed using short-time frequency-domain analysis.

To achieve this, we first made a comparative study of the MFCC approach. The home voice automation system is based on the recognition of isolated words. In the training session, a particular speaker says the word once so that the features of the access word can be extracted and stored. Later, in the testing session, the user says the word corresponding to the appliance again and is recognized if there is a match. At this stage an intruder can also try the system by speaking the same word, to test its inherent security. Once there is a match, the corresponding appliance is operated through serial communication between MATLAB and Arduino.

CONTENTS

Abstract

List of Figures

List of Tables

Abbreviations

Chapter 1 INTRODUCTION

1.1 Introduction

1.2 Objectives

1.3 Motivation

1.4 Problem Statement

1.5 Proposed Methodology

1.6 Outline of Thesis

Chapter 2 LITERATURE SURVEY

2.1 Basic Acoustics and Speech Signal

2.1.1 The Voice signal

2.1.2 Speech Production

2.1.3 Properties of Human Voice

2.2 Automatic Speech Recognition (ASR)

2.2.1 Introduction

2.2.2 Basics of Speech Recognition

2.2.3 Classification of ASR Units

2.2.4 Why is the automatic speaker recognition difficult?

2.2.5 Expression Analyzer

2.2.6 Speech Classifier

Chapter 3 THE FEATURE EXTRACTION

3.1 Process

3.1.1 Frame Blocking

3.1.2 Windowing

3.1.3 Fast Fourier transform

3.1.4 Mel frequency warping

3.1.5 Cepstrum

3.1.6 Mel Frequency Cepstrum Coefficient

Chapter 4 HARDWARE DESCRIPTION

4.1 Components Used

4.1.1 Arduino-Uno

4.1.2 Motor Driver – IC L293D

4.1.3 DC Motors

4.1.4 9V Batteries

4.1.5 LED

Chapter 5 SOFTWARE DESCRIPTION

5.1 Software Used

5.1.1 Arduino

5.1.2 MATLAB

5.1.2.(a) Introduction to MATLAB

5.1.2.(b) The MATLAB system

5.1.2.(c) GRAPHICAL USER INTERFACE (GUI)

Chapter 6 ALGORITHM OF MFCC APPROACH

MFCC Approach and Flowchart

Chapter 7 GRAPHICAL USER INTERFACE

7.1 GUI representing different windows

Chapter 8 IMPLEMENTATION AND WORKING

8.1 Introduction

8.1.1 Matlab Working

8.1.2 Matlab- Arduino Interface

8.1.3 Arduino Working

Chapter 9 CONCLUSION AND FUTURE SCOPE

9.1 Conclusion

9.2 Applications

9.3 Future Enhancements

Annexure

References

List of Figures

Figure 1.1	Speaker identification training
Figure 1.2	Speaker identification tests
Figure 2.1	Schematic diagram of the process of speech production
Figure 2.2	Human Vocal mechanism
Figure 2.3	Utterance of “Hello”
Figure 3.1	Feature Extraction process
Figure 3.2	Filter Bank on Mel frequency scale
Figure 3.3	Cepstrum signal line
Figure 3.4	Frequency mapped into the Mel frequency
Figure 4.1	Arduino UNO Pin Configuration
Figure 4.2	Motor Driver-IC L293D Pin Configuration
Figure 4.3	DC Motor Externally
Figure 4.4	Internal structure of DC motor
Figure 4.5	The Forces on the DC Motor Armature Coil
Figure 4.6	9V Battery
Figure 4.7	Working of Battery
Figure 4.8	LED
Figure 4.9	LED Working
Figure 5.1.1	Arduino IDE
Figure 5.1.2	Opening program Arduino IDE
Figure 5.1.3	Selection of Board in the Arduino IDE
Figure 5.2.1	Basic guide window to create a GUI
Figure 5.2.2	Parts of GUI Implementation
Figure 6.1	MFCC block diagram
Figure 6.2	Word “Hello” taken for analysis
Figure 6.3	Word “Hello” after silence detection
Figure 6.4	Word “Hello” after windowing using Hamming window
Figure 6.5	Word “Hello” after FFT
Figure 6.6	MFCC Approach Block Diagram
Figure 7.1	GUI that represents the initial window for users to record
Figure 7.2	GUI that represents different cases for User 1
Figure 7.3	GUI that represents different cases for User 2
Figure 7.4	GUI indicating to record the input signal for processing
Figure 7.5	GUI that represents when the fan signal is detected
Figure 7.6	GUI when no signal is detected
Figure 8.1	Matlab Working
Figure 8.2	Arduino Working

List of Tables

Table 4.1	Aspects of Arduino UNO
Table 4.2	Pin description of IC L293D

Abbreviations

ASR	Automatic Speaker Recognition
DVI	Direct Voice Input
LVCSR	Large Vocabulary Continuous Speech Recognition
HMI	Human Machine Interaction
MFCC	Mel Frequency Cepstrum Coefficients
LPC	Linear Predictive Coding
PLP	Perceptual Linear Prediction
TTS	Text to Speech
DTW	Dynamic Time Warping
HMM	Hidden Markov Modeling
VQ	Vector Quantization
FFT	Fast Fourier Transform
LED	Light Emitting Diode
GUI	Graphical User Interface
DCT	Discrete Cosine Transform
UART	Universal Asynchronous Receiver and Transmitter

Chapter 1.

Introduction

1.1 Introduction

Speech is the most common way for human beings to interact. While this has been true since the dawn of civilization, the invention and widespread use of the telephone, television, radio and audio storage media have given even more importance to speech communication and speech processing [1]. Advances in digital signal processing technology have led to the use of speech processing in many different application areas, such as speech compression, enhancement, synthesis and recognition [2]. In this thesis, home automation through speech processing is accomplished by developing a voice recognition system for isolated words using the MFCC approach.

The concept of a machine that can recognize the human voice has long been an accepted feature of science fiction.

Bill Gates (co-founder of Microsoft Corp.) has described automatic speaker recognition (ASR) as one of the most important innovations for future operating systems. From a technology perspective it is possible to distinguish between two broad types of ASR: direct voice input (DVI) and large-vocabulary continuous speech recognition (LVCSR). DVI devices are aimed mainly at voice command and control, while LVCSR systems are used for form filling or voice-based document creation. In both cases the underlying technology is more or less the same. DVI systems are typically configured for small to medium-sized vocabularies (up to several thousand words) and may employ word- or phrase-spotting techniques. DVI systems are also usually required to respond immediately to a voice command. LVCSR systems involve vocabularies of perhaps hundreds of thousands of words and are normally configured to transcribe continuous speech.

From an application point of view, the benefits of ASR derive from providing an additional communication channel in hands-busy, eyes-busy human-machine interaction (HMI), or simply from the fact that talking can be faster than typing.

1.2 Objectives

The objective of the project is to design a home automation system based on a speech recognition model that uses MFCC feature extraction.

1.3 Motivation

The motivation for home automation with speech recognition is simple: speech is man’s principal means of communication and is therefore a convenient and accessible way of communicating with machines. Speech communication has evolved to be efficient and robust, and it is clear that the route to computer-based speech recognition is the modelling of the human system. Unfortunately, from a pattern recognition standpoint, humans recognize speech through a very complex interaction between many levels of processing, using syntactic and semantic information as well as very powerful low-level pattern classification and processing. Powerful classification algorithms and sophisticated front ends are, in the final analysis, not enough; many other forms of knowledge, e.g. linguistic, semantic and pragmatic, must be built into the recognizer. Nor, even at a lower level of sophistication, is it sufficient merely to generate “a good” representation of speech (i.e. a good set of features to be used in a pattern classifier); the classifier itself must have a considerable degree of sophistication. It remains the case, however, that poor features do not discriminate effectively between classes and, moreover, that the better the features, the easier the classification task. Automatic speech recognition is therefore an engineering compromise between the ideal, i.e. a complete model of the human, and the practical, i.e. the tools that science and technology provide and that costs allow.

At the highest level, speaker recognition systems contain two main modules (see Fig. 1.1): feature extraction and feature matching. Feature extraction is the process that extracts a small amount of data from the voice signal that can later be used to represent each speaker. Feature matching is the procedure that identifies the unknown speaker by comparing the features extracted from his or her voice input with those from a set of known speakers. We will discuss each module in detail in later sections.

Figure 1.1. Speaker recognition training

Figure 1.2. Speaker recognition tests

1.4 Problem Statement

The main problem in the majority of home automation systems is authentication. Since the appliances can be controlled by many users, problems arise when people trigger them unknowingly during normal conversation. There is also the possibility of intruders or neighbors operating the electrical appliances.

1.5 Proposed Methodology

To overcome this authentication problem, a highly reliable speech recognition system must be used, one that minimizes the risk of appliances being operated unknowingly. Speech analysis, also known as front-end analysis or feature extraction, is the first step in an automatic speech recognition system. This process aims to extract acoustic features from the speech waveform. The output of front-end analysis is a compact and efficient set of parameters that represent the acoustic properties observed in the input speech signals, for subsequent use in acoustic modelling. There are three main types of front-end processing techniques, namely linear predictive coding (LPC), mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP), of which the last two are most commonly used in ASR systems. Here we use the best-known and most popular feature extraction technique, mel-frequency cepstral coefficients (MFCC).

1.6 Outline of thesis

This thesis is organized as follows.

Chapter 1 gives the introduction and the objective of the project.

Chapter 2 presents the literature survey.

Chapter 3 describes feature extraction.

Chapter 4 gives the hardware description.

Chapter 5 gives the software description.

Chapter 6 discusses the algorithm used in this thesis.

Chapter 7 presents the GUI of the method used.

Chapter 8 discusses the implementation of the project, the results and the scope for future work.

Chapter 2.

Literature Survey

2.1 Basic Acoustics and Speech Signal

As background relevant to the field of speech recognition, this chapter discusses how the speech signal is produced and perceived by human beings. This is an essential issue that must be considered before deciding which approach to use for speech recognition.

2.1.1 The voice signal

Human communication can be seen as a comprehensive scheme covering the process from speech production to speech perception between the talker and the listener; see Figure 2.1.

Figure 2.1. Schematic diagram of the process of speech production/perception

The scheme has five different elements: A. speech formulation, B. human vocal mechanism, C. acoustic air, D. perception of the ear, and E. speech comprehension. The first element (A. speech formulation) is associated with the formulation of the speech signal in the talker’s mind. This formulation is used by the human vocal mechanism (B. human vocal mechanism) to produce the actual speech waveform. The waveform is transferred through the air (C. acoustic air) to the listener. During this transfer the acoustic wave can be affected by external sources, e.g. noise, resulting in a more complex waveform. When the wave reaches the listener’s hearing system (the ears), the listener perceives the waveform (D. perception of the ear) and the listener’s mind (E. speech comprehension) starts processing it to comprehend its content, so that the listener understands what the talker is trying to say. One issue in speech recognition is to “simulate” how the listener processes the speech produced by the talker. Several actions take place in the listener’s head and hearing system while speech signals are processed. The perception process can be seen as the inverse of the speech production process. The basic theoretical unit for describing how linguistic meaning is conveyed by the speech formed in the mind is called the phoneme. Phonemes can be grouped based on the properties of either the time waveform or the frequency characteristics, and classified into the different sounds produced by the human vocal tract.

Speech is:

•A time-varying signal,

•A well-structured communication process,

•Dependent on known physical movements,

•Composed of known, distinct units (phonemes),

•Different for every speaker,

•May be fast, slow, or varying in speed,

•May have high pitch, low pitch, or be whispered,

•Subject to widely varying types of environmental noise,

•May lack distinct boundaries between units (phonemes),

•Drawn from an unlimited number of words.

2.1.2 Speech Production

To understand how speech is produced, one needs to know how the human vocal mechanism is constructed; see Figure 2.2. The most important parts of the human vocal mechanism are the vocal tract together with the nasal cavity, which begins at the velum. The velum is a trapdoor-like mechanism that is used to formulate nasal sounds when needed. When the velum is lowered, the nasal cavity is coupled with the vocal tract to formulate the desired speech signal. The cross-sectional area of the vocal tract is limited by the tongue, lips, jaw and velum and varies from 0 to 20 cm².

Figure 2.2. Human Vocal mechanism

2.1.3 Properties of Human Voice

The most prominent feature of sound is its frequency. Frequency helps us to discriminate between sounds. When the frequency of a sound is high, the sound is sharp and irritating. When the frequency of a sound is low, the sound is deep. Sound waves are produced by the vibration of materials. The highest frequency a human being can produce is about 10 kHz, and the lowest is about 70 Hz. These are the maximum and minimum values; the actual range differs from person to person. The magnitude of a sound is measured in decibels (dB). Normal human speech has a frequency range of 100 Hz to 3200 Hz and a magnitude in the range of 30 dB to 90 dB. A human ear can perceive sounds in the frequency range between 16 Hz and 20 kHz, and is sensitive to a change in frequency of about 0.5%.

Characteristics of the speaker:

•Due to differences in vocal tract length, the speech of men, women and children differs.

•Regional accents are expressed as differences in resonant frequencies, durations and pitch.

•Individuals have resonant frequency patterns and duration patterns that are unique (which allows us to identify the speaker).

•Training on data from one type of speaker automatically “learns” the characteristics of that group or person, which makes recognition of other types of speakers much worse.

2.2 Automatic speech recognition (ASR)

2.2.1 Introduction

Speech processing is the study of speech signals and of the methods used to process them. The signals are usually processed in a digital representation, so speech processing can be regarded as the intersection of digital signal processing and natural language processing. Natural language processing is a subfield of artificial intelligence and linguistics. It studies the problems of automated generation and understanding of natural human languages. Natural language generation systems convert information from computer databases into normal-sounding human language, and natural language understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate.

Speech coding

Speech coding is the compression of speech (into a code) for transmission, using speech codecs that rely on audio signal processing and speech processing techniques. The techniques used are similar to those in audio data compression and audio coding, where knowledge of psychoacoustics is used to transmit only the data that is relevant to the human auditory system. For example, in narrowband speech coding, only information in the frequency band from 400 Hz to 3500 Hz is transmitted, but the reconstructed signal is still adequate for intelligibility. However, speech coding differs from audio coding in that much more statistical information is available about the properties of speech. In addition, some auditory information that is relevant in audio coding may be unnecessary in the speech coding context. In speech coding, the most important criterion is the preservation of the intelligibility and “pleasantness” of speech with a constrained amount of transmitted data. It should be noted that the intelligibility of speech includes, besides the literal content, also speaker identity, emotions, intonation, timbre and so on, all of which are important for perfect intelligibility. The more abstract concept of the pleasantness of degraded speech is a different property from intelligibility, since degraded speech can be completely intelligible yet subjectively annoying to the listener.

Speech Synthesis

Speech synthesis is the artificial production of human speech. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations such as phonetic transcriptions into speech. Synthesized speech can also be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units; a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely “synthetic” voice output. The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written works on a home computer. Many operating systems have included speech synthesizers since the 1980s.

Voice analysis

Voice problems that require voice analysis most commonly originate in the vocal cords, since they are the sound source and are therefore most prone to tiring. However, analysis of the vocal cords is physically difficult. The location of the vocal cords effectively prohibits direct measurement of their movement. Imaging methods such as X-rays or ultrasound do not work because the vocal cords are surrounded by cartilage, which distorts the image quality. Movements of the vocal cords are rapid (fundamental frequencies are usually between 80 and 300 Hz), which prevents the use of ordinary video. High-speed video provides an option, but in order to see the vocal cords the camera has to be positioned in the throat, which makes speaking rather difficult. The most important indirect methods are inverse filtering of sound recordings and electroglottography (EGG). In inverse filtering methods, the speech sound is recorded outside the mouth and then filtered by a mathematical method to remove the effects of the vocal tract. This method produces an estimate of the waveform of the pressure pulse, which in turn indicates inversely the movements of the vocal cords. The other kind of inverse indication is the electroglottograph, which operates with electrodes attached to the subject’s throat close to the vocal cords. Changes in the conductivity of the throat indicate inversely how large a portion of the vocal cords are touching each other. It thus yields one-dimensional information about the contact area. Neither inverse filtering nor EGG is therefore sufficient to describe the glottal movement completely; both provide only indirect evidence of that movement.

Voice Recognition

Voice recognition is the process by which a computer (or another type of machine) identifies spoken words. Basically, it means talking to the computer and having it correctly recognize what is being said. This is the key to any speech-related application. As will be explained later, there are a number of ways to do this, but the basic principle is to somehow extract certain key features from the uttered speech and then treat those features as the key to recognizing the word when it is uttered again.

2.2.2 Recognition Basics

Utterance

An utterance is the vocalization (speaking) of a word or words that represent a single meaning to the computer. Utterances can be a single word, a few words, a sentence or even multiple sentences.

Figure 2.3. Utterance of “Hello”

Speaker dependence

Speaker-dependent systems are designed around a specific speaker. They are generally more accurate for that speaker, but much less accurate for other speakers. They assume that the speaker will speak with a consistent voice and tempo. Speaker-independent systems are designed for a variety of speakers. Adaptive systems usually start as speaker-independent systems and use training techniques to adapt to the speaker and increase their recognition accuracy.

Vocabularies

Vocabularies (or dictionaries) are lists of words or utterances that can be recognized by the ASR system. Small vocabularies are generally easier for a computer to recognize, while larger vocabularies are more difficult. Unlike normal dictionaries, each entry does not have to be a single word; it can be as long as a sentence or two. Smaller vocabularies may contain only one or two recognized utterances (for example, “Wake Up”), while very large vocabularies may contain a hundred thousand or more.

Accuracy

The ability of a recognizer can be examined by measuring its accuracy, i.e. how well it recognizes utterances. This includes not only correctly identifying an utterance but also identifying when the spoken utterance is not in its vocabulary. Good ASR systems have an accuracy of 98% or more. The acceptable accuracy of a system really depends on the application.

Training

Some speech recognizers have the ability to adapt to a speaker. When the system has this ability, it may allow training to take place. An ASR system is trained by having the speaker repeat standard or common phrases and adjusting its comparison algorithms to match that particular speaker. Training a recognizer usually improves its accuracy. Training can also be used by speakers who have difficulty speaking or pronouncing certain words. As long as the speaker can consistently repeat an utterance, ASR systems with training should be able to adapt.

2.2.3 Classification of ASR system

A voice recognition system can operate under many different conditions, such as speaker-dependent or speaker-independent operation, isolated or continuous speech recognition, and small or large vocabularies. Voice recognition systems can be separated into several classes according to the types of utterances they are able to recognize. These classes are based on the fact that one of the difficulties of ASR is determining when a speaker starts and finishes an utterance. Most packages can fit into more than one class, depending on the mode in use.

Isolated words

Isolated word recognizers usually require each utterance to have quiet (lack of an audio signal) on both sides of the sample window. This does not mean that they accept only single words, but they do require a single utterance at a time. Often, these systems have “Listen/Not-Listen” states, in which they require the speaker to wait between utterances (usually doing the processing during the pauses). Isolated utterance might be a better name for this class.

Connected words

Connected word systems (or, more correctly, “connected utterances”) are similar to isolated words, but they allow separate utterances to be “run together” with a minimal pause between them.

Continuous speech

Continuous recognition is the next step. Recognizers with continuous speech capabilities are some of the most difficult to create because they must use special methods to determine utterance boundaries. Continuous speech recognizers allow users to speak almost naturally, while the computer determines the content. Basically, it is computer dictation.

Spontaneous speech

There appear to be a variety of definitions of what spontaneous speech actually is. At a basic level, it can be thought of as speech that is natural sounding and not rehearsed. An ASR system with spontaneous speech capability should be able to handle a variety of natural speech features such as words being run together, “ums” and “ahs”, and even slight stutters.

Speaker dependence

ASR engines are classified as speaker-dependent or speaker-independent. Speaker-dependent systems are trained with one speaker, and recognition is done only for that speaker. Speaker-independent systems are trained with a set of speakers. This is obviously much more complex than speaker-dependent recognition. A problem of intermediate complexity would be to train with a group of speakers and recognize the speech of a speaker within that group. We could call this speaker-group-dependent recognition.

2.2.4. Why is the automatic speaker recognition difficult?

There may be problems in voice recognition that have not yet been discovered. However, a number of problems have been identified over recent decades, most of which still remain unsolved. Some of the main problems in ASR are:

Determination of word boundaries

Speech is usually continuous in nature and word boundaries are not clearly defined. One of the common errors in continuous speech recognition is missing the minuscule gap between words. This happens when the speaker is speaking at a high speed.

Different accents

Every person has their own accent, and the pronunciation of words varies from person to person. This poses a serious problem for ASR. However, it is a problem that is not restricted to ASR; it plagues human listeners too.

Large vocabularies

When the number of words in the database is large, similar-sounding words tend to cause a high error rate, i.e. there is a good chance that one word is recognized as another.

Changing room acoustics

Noise is a major factor in ASR. In fact, it is in noisy conditions or in changing room acoustics that the limitations of present-day ASR engines become prominent.

Temporal variation

Different speakers speak at different speeds. Present-day ASR engines are simply not able to adapt to this.

2.2.5 Expression Analyzer

Speech analysis, also known as front-end analysis or feature extraction, is the first step in an automatic speech recognition system. This process aims to extract acoustic features from the speech waveform. The output of front-end analysis is a compact and efficient set of parameters that represent the acoustic properties observed in the input speech signals, for subsequent use in acoustic modelling. There are three main types of front-end processing techniques, namely linear predictive coding (LPC), mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP), of which the last two are most commonly used in ASR systems.

Linear predictive coding

LPC starts with the assumption that a speech signal is produced by a buzzer at the end of a tube (voiced sounds), with occasional added hissing and popping sounds. Although apparently crude, this model is actually a close approximation to the reality of speech production. The glottis (the space between the vocal cords) produces the buzz, which is characterized by its intensity (loudness) and frequency (pitch). The vocal tract (the throat and mouth) forms the tube, which is characterized by its resonances, which are called formants. Hisses and pops are generated by the action of the tongue, lips and throat during sibilants and plosives.

LPC analyzes the speech signal by estimating the formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining buzz. The process of removing the formants is called inverse filtering, and the signal that remains after subtraction of the filtered modelled signal is called the residue. The numbers that describe the intensity and frequency of the buzz, the formants and the residue signal can be stored or transmitted elsewhere. LPC synthesizes the speech signal by reversing the process: the buzz parameters and the residue are used to create a source signal, the formants are used to create a filter (which represents the tube), and the source is run through the filter, resulting in speech. Because speech signals vary with time, this process is performed on short chunks of the speech signal, which are called frames; generally 30 to 50 frames per second give intelligible speech with good compression.
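The analysis step described above can be illustrated with a small sketch. It is not the thesis implementation: it assumes the common autocorrelation method with the Levinson-Durbin recursion, and the function names (`autocorrelation`, `levinson_durbin`, `lpc_coefficients`, `residue`) are hypothetical names chosen for this example. The predictor coefficients stand in for the "tube" filter, and `residue` performs the inverse filtering.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame of samples."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations for predictor coefficients a[1..order],
    where x[t] is predicted as sum_k a[k] * x[t - k].
    Returns (coefficients, remaining prediction error energy)."""
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        if error == 0.0:          # silent frame: nothing left to predict
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error           # reflection coefficient for this order
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= (1.0 - k * k)
    return a[1:], error


def lpc_coefficients(frame, order):
    """Convenience wrapper: autocorrelation followed by Levinson-Durbin."""
    return levinson_durbin(autocorrelation(frame, order), order)


def residue(frame, coeffs):
    """Inverse filtering: subtract the predicted sample from each actual sample."""
    out = []
    for t, x in enumerate(frame):
        predicted = sum(c * frame[t - k]
                        for k, c in enumerate(coeffs, start=1) if t - k >= 0)
        out.append(x - predicted)
    return out
```

For a frame that decays as 0.5 to the power t, a first-order predictor recovers a coefficient close to 0.5 and the residue is essentially zero after the first sample, which is the sense in which the filter "explains" the signal.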

Mel frequency Cepstrum coefficients

MFCCs are derived from a type of cepstral representation of the audio clip (a “spectrum-of-a-spectrum”). The difference between the cepstrum and the mel-frequency cepstrum (MFC) is that in the MFC the frequency bands are positioned logarithmically (on the mel scale), which approximates the response of the human auditory system more closely than the linearly spaced frequency bands obtained directly from the FFT or DCT. This allows for better processing of data, for example in audio compression. However, unlike the sonogram, MFCCs lack an outer ear model and therefore cannot represent perceived loudness accurately. MFCCs are commonly derived as follows:

1. Take the Fourier transform of (a windowed excerpt of) a signal.

2. Map the log amplitudes of the spectrum obtained above onto the mel scale, using triangular overlapping windows.

3. Take the discrete cosine transform of the list of mel log-amplitudes, as if it were a signal.

4. The MFCCs are the amplitudes of the resulting spectrum.
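The derivation above can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the MATLAB code used in the project: all function names are hypothetical, the mel mapping uses the widely quoted formula 2595·log10(1 + f/700) (an assumption, since this section does not give one), a naive DFT stands in for the FFT, and the logarithm is taken after the triangular filters are applied, which is a common variant of step 2.

```python
import cmath
import math


def hz_to_mel(hz):
    """Assumed mel mapping: 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (a real system would use an FFT)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular overlapping filters whose edges are equally spaced in mel."""
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (num_filters + 1))
             for i in range(num_filters + 2)]
    bin_hz = sample_rate / n_fft
    bank = []
    for m in range(1, num_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        weights = []
        for k in range(n_fft // 2 + 1):
            f = k * bin_hz
            if lo < f <= mid:
                weights.append((f - lo) / (mid - lo))   # rising slope
            elif mid < f < hi:
                weights.append((hi - f) / (hi - mid))   # falling slope
            else:
                weights.append(0.0)
        bank.append(weights)
    return bank


def dct2(values):
    """Unnormalized DCT-II of a list of numbers."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                for i, v in enumerate(values))
            for k in range(n)]


def mfcc(frame, sample_rate, num_filters=20, num_coeffs=12):
    """Spectrum -> mel filter bank -> log -> DCT, keeping the first coefficients."""
    spectrum = magnitude_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # The small constant avoids log(0) when a filter captures no energy.
    log_energies = [math.log(sum(w * s for w, s in zip(weights, spectrum)) + 1e-10)
                    for weights in bank]
    return dct2(log_energies)[:num_coeffs]
```

As a usage example, `mfcc(frame, 8000)` on a 256-sample frame (32 ms at an assumed 8 kHz sampling rate) returns 12 coefficients; the filter count and coefficient count are illustrative defaults rather than values taken from the thesis.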

2.2.6 Speech classifier

The problem of ASR belongs to a much broader topic in science and engineering called pattern recognition. The goal of pattern recognition is to classify objects of interest into one of a number of categories or classes. The objects of interest are generically called patterns, and in our case they are sequences of acoustic vectors that are extracted from input speech using the techniques described in the previous section. The classes here refer to individual speakers. Since the classification procedure in our case is applied to extracted features, it can also be referred to as feature matching. The state of the art in feature matching techniques used in speaker recognition includes Dynamic Time Warping (DTW), Hidden Markov Modeling (HMM), and Vector Quantization (VQ).

Chapter 3

Feature Extraction

3.1 Processing

Obtaining the acoustic characteristics of the speech signal is referred to as feature extraction. Feature extraction is used in both the training and recognition phases. It comprises the following steps:

1. Frame Blocking

2. Windowing

3. FFT (Fast Fourier Transform)

4. Mel-Frequency Warping

5. Cepstrum (Mel Frequency Cepstral Coefficients)

Feature Extraction

This stage is often referred to as the speech processing front end. The main goal of feature extraction is to simplify recognition by summarizing the vast amount of speech data without losing the acoustic properties that define the speech [3]. The schematic diagram of the steps is shown in Figure 3.1.

Figure 3.1. Feature Extraction Steps

3.1.1 Frame Blocking

Investigations show that speech signal characteristics stay stationary over a sufficiently short time interval (this is called quasi-stationarity). For this reason, speech signals are processed in short time intervals. The signal is divided into frames with sizes generally between 30 and 100 milliseconds. Each frame overlaps its previous frame by a predefined size. The goal of the overlapping scheme is to smooth the transition from frame to frame [3].
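Frame blocking with overlap can be sketched as follows. The helper name `frame_signal`, the 10 kHz sampling rate and the 100-sample hop are assumptions made for illustration only; the text above fixes only the 30 to 100 ms frame size.

```python
def frame_signal(samples, frame_len, hop):
    """Split samples into frames of frame_len, advancing by hop each time.

    Consecutive frames overlap by frame_len - hop samples. A trailing
    piece shorter than frame_len is dropped.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]


# Example: at 10 kHz a 30 ms frame holds 300 samples; a hop of 100 samples
# makes each frame overlap its predecessor by 200 samples.
signal = [0.0] * 1000
frames = frame_signal(signal, frame_len=300, hop=100)
```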

3.1.2 Windowing

The second step is to window all frames. This is done in order to eliminate discontinuities at the edges of the frames. If the windowing function is defined as w(n), 0 ≤ n ≤ N-1, where N is the number of samples in each frame, the resulting signal will be y(n) = x(n) w(n). Generally, Hamming windows are used [3].
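As an illustration of y(n) = x(n) w(n), the sketch below builds a Hamming window using the standard definition w(n) = 0.54 - 0.46 cos(2πn / (N-1)) and applies it to one frame. The function names are hypothetical.

```python
import math


def hamming(num_samples):
    """Hamming window w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    if num_samples == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_samples - 1))
            for n in range(num_samples)]


def apply_window(frame):
    """Multiply a frame sample by sample with a Hamming window of equal length."""
    window = hamming(len(frame))
    return [x * w for x, w in zip(frame, window)]
```

The window tapers each frame towards 0.08 at both ends and leaves the middle nearly unchanged, which is what removes the edge discontinuities.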

3.1.3 Fast Fourier Transform

The next step is to take the Fast Fourier Transform of each frame. This transform is a fast implementation of the Discrete Fourier Transform, and it changes the domain from time to frequency [3].
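A minimal sketch of this step is given below, assuming a recursive radix-2 FFT (so the frame length must be a power of two) and an unnormalized power spectrum. The function names are hypothetical; a practical system would use a library FFT such as the one built into MATLAB.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("frame length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def power_spectrum(frame):
    """|X[k]|^2 for the non-negative frequency bins 0 .. N/2."""
    spectrum = fft(frame)
    return [abs(v) ** 2 for v in spectrum[:len(frame) // 2 + 1]]


# A cosine completing 4 cycles in a 32-sample frame peaks at bin 4.
tone = [math.cos(2.0 * math.pi * 4 * n / 32) for n in range(32)]
power = power_spectrum(tone)
```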

3.1.4 Mel Frequency Warping

The human ear perceives frequencies non-linearly. Research shows that the scaling is linear up to 1 kHz and logarithmic above that. The mel-scale (melody scale) filter bank characterizes the human ear's perception of frequency. It is used as a band-pass filter for this stage of identification. The signal for each frame is passed through the mel-scale band-pass filter to mimic the human ear [4] [3] [5]. As mentioned above, psychophysical studies have shown that human perception of the frequency content of sounds for speech signals does not follow a linear scale. Thus for each tone with an actual frequency, f, measured in Hz, a subjective pitch is measured on a scale called the “mel” scale. The mel-frequency scale is a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels. Therefore we can use the following approximate formula to compute the mels for a given frequency f in Hz:

$$\mathrm{mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

One approach to simulating the subjective spectrum is to use a filter bank, one filter for each desired mel-frequency component. This filter bank has a triangular band-pass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval. The modified spectrum of S(ω) thus consists of the output power of these filters when S(ω) is the input. The number of mel spectrum coefficients, K, is typically chosen as 20. Note that this filter bank is applied in the frequency domain; therefore it simply amounts to applying the triangle-shaped windows of Figure 3.2 to the spectrum. A useful way of thinking about this mel-warped filter bank is to view each filter as a histogram bin (where bins overlap) in the frequency domain. A useful and efficient way of implementing this is to consider the triangular filters on the mel scale, where they are in effect equally spaced filters.
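The filter bank described above can be sketched as follows. This is an illustration under stated assumptions and not the project's MATLAB code: it uses the common approximation mel(f) = 2595 log10(1 + f/700), places the filter edges at equal mel intervals between 0 Hz and half the sampling rate, and the function names, the 512-point FFT and the 10 kHz sampling rate in the example are hypothetical choices.

```python
import math


def hz_to_mel(freq_hz):
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(num_filters, fft_size, sample_rate):
    """Triangular filters, equally spaced on the mel scale.

    Returns num_filters rows, each holding one weight per FFT bin 0..fft_size/2.
    """
    num_bins = fft_size // 2 + 1
    top_mel = hz_to_mel(sample_rate / 2.0)
    # num_filters + 2 edge frequencies: each filter spans three neighbours.
    edges_hz = [mel_to_hz(top_mel * i / (num_filters + 1))
                for i in range(num_filters + 2)]
    bin_hz = [k * sample_rate / fft_size for k in range(num_bins)]
    bank = []
    for j in range(1, num_filters + 1):
        low, centre, high = edges_hz[j - 1], edges_hz[j], edges_hz[j + 1]
        row = []
        for f in bin_hz:
            if low < f <= centre:
                row.append((f - low) / (centre - low))    # rising edge
            elif centre < f < high:
                row.append((high - f) / (high - centre))  # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mel_energies(power, bank):
    """Output power of each filter for one frame's power spectrum."""
    return [sum(w * p for w, p in zip(row, power)) for row in bank]


bank = mel_filterbank(num_filters=20, fft_size=512, sample_rate=10000)
```

Each row acts as one overlapping "histogram bin": multiplying it with the power spectrum and summing gives the energy in that mel band.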

Figure 3.2. Filter bank on the mel-frequency scale

3.1.5 Cepstrum

The name cepstrum was derived from spectrum by reversing the first four letters of “spectrum”. We can say that the cepstrum is the Fourier transform of the log, with unwrapped phase, of the Fourier transform.

• Mathematically: cepstrum of signal = FT(log(FT(the signal)) + j2πm)

where m is the integer required to properly unwrap the angle, or imaginary part, of the complex log function.

• Algorithmically: signal → FT → log → phase unwrapping → FT → cepstrum

The real cepstrum uses the logarithm function defined for real values, while the complex cepstrum uses the complex logarithm function defined for complex values. The real cepstrum uses only the magnitude information of the spectrum, whereas the complex cepstrum holds information about both the magnitude and the phase of the initial spectrum, which allows the reconstruction of the signal. The cepstrum can be calculated in many ways. Some of them need a phase-unwrapping algorithm, others do not. The figure below shows the pipeline from signal to cepstrum.

Figure 3.3 Signal-to-cepstrum pipeline

As discussed in the Framing and Windowing sections, the speech signal is composed of a quickly varying excitation sequence e(n) convolved with the slowly varying impulse response of the vocal system.

Once the quickly varying part and the slowly varying part have been convolved, it is difficult to separate the two; the cepstrum is introduced to separate these two parts. The equation for the cepstrum is given below:
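In a standard formulation, writing the speech as x(n) = e(n) * h(n) (the symbol h(n) for the vocal system impulse response is an assumption here), the spectrum becomes the product X = E·H, so log|X| = log|E| + log|H| and the two parts become additive. The real cepstrum is then the inverse transform of log|X|. The sketch below is a minimal illustration of that definition with a naive DFT; the function names and the small floor that guards against log(0) are assumptions. It uses the inverse transform in the last step, which for the real cepstrum of a real signal differs from a forward transform only by a scale factor.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(N^2) discrete Fourier transform (forward or inverse)."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        total = sum(values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                    for t in range(n))
        out.append(total / n if inverse else total)
    return out


def real_cepstrum(frame, floor=1e-12):
    """Real cepstrum: inverse DFT of the log magnitude spectrum."""
    spectrum = dft(frame)
    log_magnitude = [math.log(max(abs(v), floor)) for v in spectrum]
    return [v.real for v in dft(log_magnitude, inverse=True)]
```

Low-index cepstral values describe the slowly varying spectral envelope (the vocal system), while the excitation shows up at higher indices, which is what makes the two parts separable.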

3.1.6 Mel Frequency Cepstrum Coefficient

In this project we use Mel Frequency Cepstral Coefficients (MFCC). MFCCs are coefficients that represent audio based on perception. These coefficients have had great success in speaker recognition applications. They are derived from the Fourier transform of the audio clip. In this technique the frequency bands are positioned logarithmically, whereas in the Fourier transform the frequency bands are not positioned logarithmically. Because the frequency bands are positioned logarithmically in MFCC, it approximates the response of the human auditory system more closely than any other system. These coefficients allow better processing of data. The calculation of the mel cepstrum is the same as for the real cepstrum, except that the frequency scale of the mel cepstrum is warped to correspond to the mel scale.
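The final step that turns mel filter-bank energies into cepstral coefficients can be sketched as below: take the logarithm of each energy and apply a discrete cosine transform. The unnormalized DCT-II, the choice of keeping 13 coefficients and the function names are assumptions made for illustration.

```python
import math


def dct_ii(values):
    """Unnormalized DCT-II: c[n] = sum_k x[k] * cos(pi * n * (k + 0.5) / K)."""
    count = len(values)
    return [sum(v * math.cos(math.pi * n * (k + 0.5) / count)
                for k, v in enumerate(values))
            for n in range(count)]


def mfcc_from_mel_energies(energies, num_coeffs=13, floor=1e-10):
    """Log of each mel-band energy followed by a DCT; keep the first coefficients."""
    log_energies = [math.log(max(e, floor)) for e in energies]
    return dct_ii(log_energies)[:num_coeffs]
```

A completely flat set of band energies yields a single non-zero coefficient (the first), which illustrates how the DCT compacts the spectral shape into a few numbers.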

The mel scale was proposed by Stevens, Volkmann and Newman in 1937. The mel scale is based on a study of the pitch, or frequency, perceived by humans. The scale is divided into units called mels. In this test the listener started out hearing a frequency of 1000 Hz, which was labelled 1000 mel for reference. The listeners were then asked to change the frequency until the pitch they perceived was twice that of the reference. This frequency was labelled 2000 mel. The same procedure was repeated for half the reference pitch, and that frequency was labelled 500 mel, and so on. On this basis the normal frequency is mapped into the mel frequency. The mel scale is normally a linear mapping below 1000 Hz and logarithmically spaced above 1000 Hz. The figure below shows an example of normal frequency mapped into the mel frequency.

Figure 3.4 Frequency mapped onto the mel frequency scale

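Equation (1) maps the normal frequency f (in Hz) into the mel frequency m, and equation (2) is the inverse, which recovers the normal frequency. A commonly used form of this pair is:

$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (1)$$

$$f = 700\left(10^{m/2595} - 1\right) \qquad (2)$$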

Chapter 4

Hardware Description

4.1 Components Used

1. Arduino Uno

2. Motor Driver – IC L293D

3. DC Motors

4. 9V Battery

5. LEDs

6. Battery Caps

7. Resistor

4.1.1 Arduino Uno

Figure 4.1 Arduino Uno Pin Configuration

The Arduino Uno is a development board based on the ATmega328P microcontroller.

The technical specifications of the Arduino Uno are –

Microcontroller	ATmega328P
Operating voltage	5 V
Recommended input voltage	7-12 V (limits 6-20 V)
Digital I/O pins	14 (of which 6 provide PWM output)
PWM digital I/O pins and analog input pins	6 and 6 respectively
DC current per I/O pin and for the 3.3V pin	20 mA and 50 mA respectively
Flash memory	32 KB (ATmega328P), of which 0.5 KB is used by the boot loader
SRAM and EEPROM	2 KB and 1 KB (ATmega328P) respectively
Clock speed	16 MHz

Table 4.1 Specifications of the Arduino Uno

Aspects of the Arduino Uno

1. Power

The Arduino Uno can be powered via the USB connection or with external power supply. The power source is automatically selected.

External (non-USB) power can come from a battery or an AC-to-DC adapter. The adapter can be connected by plugging a 2.1 mm center-positive plug into the board's power jack. Leads from a battery can be inserted in the GND and Vin pin headers of the power connector.

The board can operate on an external supply of 6 to 20 volts. If supplied with less than 7 V, however, the 5V pin may supply less than five volts and the board may become unstable. If more than 12 V is used, the voltage regulator may overheat and damage the board. The recommended range is 7 to 12 volts.

Power supply pins are as follows:

Vin – The input voltage to the board when it is using an external power source (as opposed to 5 volts from the USB connection or another regulated power source). You can supply voltage through this pin or, if supplying voltage via the power jack, access it through this pin.

5V – This pin outputs a regulated 5 V from the regulator on the board. The board can be supplied with power from the DC power jack (7-12 V), the USB connector (5 V), or the Vin pin of the board (7-12 V). Supplying voltage via the 5V or 3.3V pins bypasses the regulator and can damage the board. It is not advised.

3V3 – A 3.3 volt supply generated by the on-board regulator. Maximum current draw is 50 mA.

GND – Ground pins.

IOREF – This pin on the board provides the voltage reference with which the microcontroller operates. A properly configured shield can read the IOREF pin voltage and select the appropriate power source, or enable voltage translators on the outputs to work with 5 V or 3.3 V.

2. Memory

The ATmega328 has 32 KB of flash memory (with 0.5 KB occupied by the boot loader). It also has 2 KB of SRAM and 1 KB of EEPROM.

3. Input and Output

See the correspondence between Arduino pins and ATmega328P ports. The allocation for the Atmega8, 168, and 328 is identical.

Each of the 14 digital pins on the Uno can be used as an input or output, using the pinMode(), digitalWrite() and digitalRead() functions. They operate at 5 volts. Each pin can provide or receive 20 mA as the recommended operating condition and has an internal pull-up resistor (disconnected by default) of 20-50 kΩ. A maximum of 40 mA must not be exceeded on any I/O pin, to avoid permanent damage to the microcontroller.

In addition, some pins have specialized functions:

Serial: 0 (RX) and 1 (TX). Used to receive (RX) and transmit (TX) TTL serial data. These pins are connected to the corresponding pins of the ATmega8U2 USB-to-TTL serial chip.

External interrupts: 2 and 3. These pins can be configured to trigger an interrupt on a low value, a rising or falling edge, or a change in value.

PWM: 3, 5, 6, 9, 10 and 11. Provide 8-bit PWM output with the analogWrite() function.

SPI: 10 (SS), 11 (MOSI), 12 (MISO), 13 (SCK). These pins support SPI communication.

LED: 13. There is a built-in LED driven by digital pin 13. When the pin is HIGH, the LED is on; when the pin is LOW, it is off.

TWI: A4 (SDA) pin and A5 (SCL) pin.

The Uno has 6 analog inputs, labelled A0 to A5, each of which provides 10 bits of resolution (i.e. 1024 different values). By default they measure from ground to 5 volts, although it is possible to change the upper end of their range using the AREF pin and the analogReference() function.

There are a few other pins on the board:

AREF. Voltage reference for the analog inputs. Used with analogReference().

Reset. Bring this line LOW to reset the microcontroller. Typically used to add a reset button to shields that block the one on the board.
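The 10-bit resolution of the analog inputs means that, with the default 5 V reference, one step of a reading corresponds to roughly 4.9 mV. The short sketch below illustrates that arithmetic; it is written in Python for illustration only (on the board itself the reading would come from analogRead()), and the function name is hypothetical.

```python
def adc_to_volts(reading, reference_volts=5.0, bits=10):
    """Convert a raw ADC reading (0 .. 2**bits - 1) to volts."""
    levels = 2 ** bits  # 1024 different values for 10 bits
    if not 0 <= reading < levels:
        raise ValueError("reading outside the ADC range")
    return reading * reference_volts / levels


step_millivolts = adc_to_volts(1) * 1000.0  # about 4.88 mV per step
```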

4. Communication

The Uno has a number of facilities for communicating with a computer, another Uno board, or other microcontrollers. The ATmega328 provides UART TTL (5 V) serial communication, which is available on digital pins 0 (RX) and 1 (TX). An ATmega16U2 on the board channels this serial communication over USB and appears as a virtual COM port to software on the computer. The 16U2 firmware uses the standard USB COM drivers, and no external driver is needed. The Arduino Software (IDE) includes a serial monitor that allows simple textual data to be sent to and from the board. The RX and TX LEDs on the board flash when data is being transmitted via the USB-to-serial chip and the USB connection to the computer. A SoftwareSerial library allows serial communication on any of the Uno's digital pins.

The ATmega328 also supports I2C (TWI) and SPI communication. The Arduino Software (IDE) includes a Wire library to simplify use of the I2C bus.

4.1.2 Motor Driver – IC L293D

The L293D is a dual H-bridge motor driver integrated circuit (IC). Motor drivers act as current amplifiers, since they take a low-current control signal and provide a higher-current signal. This higher-current signal is used to drive the motors.

The L293D contains two built-in H-bridge driver circuits. In its common mode of operation, two DC motors can be driven simultaneously, both in the forward and in the reverse direction. The operation of the two motors is controlled by the input logic at pins 2 and 7 and at pins 10 and 15. Input logic 00 or 11 stops the corresponding motor. Logic 01 and 10 rotate it clockwise and counterclockwise respectively.

Enable pins 1 and 9 (corresponding to the two motors) must be high for the motors to operate. When an enable input is high, the associated driver is enabled. As a result, its outputs become active and work in phase with their inputs. When the enable input is low, that driver is disabled, and its outputs are off and in the high-impedance state.
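The input logic described above can be summarized as a small truth-table function. This is a sketch of the logic only, written in Python for illustration; the function name is hypothetical, and which physical direction counts as "clockwise" depends on how the motor is wired.

```python
def motor_action(enable, input_a, input_b):
    """Behaviour of one L293D channel pair for given logic levels (0 or 1).

    enable  : level on the enable pin (pin 1 or pin 9)
    input_a : level on the first input (pin 2 or pin 10)
    input_b : level on the second input (pin 7 or pin 15)
    """
    if not enable:
        return "disabled"          # outputs in the high-impedance state
    if bool(input_a) == bool(input_b):
        return "stop"              # logic 00 or 11
    if not input_a and input_b:
        return "clockwise"         # logic 01
    return "counterclockwise"      # logic 10
```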

Figure 4.2 – Motor Driver IC L293D Pin Configuration

The pin description is as below –

Pin No	Function	Name
1	Enable pin for Motor 1; active high	Enable 1,2
2	Input 1 for Motor 1	Input 1
3	Output 1 for Motor 1	Output 1
4	Ground (0V)	Ground
5	Ground (0V)	Ground
6	Output 2 for Motor 1	Output 2
7	Input 2 for Motor 1	Input 2
8	Supply voltage for motors; 9-12V (up to 36V)	Vcc2
9	Enable pin for Motor 2; active high	Enable 3,4
10	Input 1 for Motor 2	Input 3
11	Output 1 for Motor 2	Output 3
12	Ground (0V)	Ground
13	Ground (0V)	Ground
14	Output 2 for Motor 2	Output 4
15	Input 2 for Motor 2	Input 4
16	Supply voltage; 5V (up to 36V)	Vcc1

Table 4.2 – Pin description of IC L293D

4.1.3 DC Motors

Figure 4.3 – DC Motor Externally

A conventional DC motor consists of the following –

1. Permanent magnets

2. Armature coil

3. Commutator rings

Figure 4.4 – Internal structure of DC motor

Current is supplied to the armature coil through the commutator rings by a DC voltage source. The direction of current flow is determined by the DC supply. By the Lorentz force law, a current-carrying conductor in a magnetic field experiences a force perpendicular to both the current vector and the magnetic field vector, and this force acts on the coil. Opposite sides of the coil, whose current vectors are perpendicular to the magnetic field of the permanent magnets, experience forces in opposite directions.

Figure 4.5 – The Forces on the DC Motor Armature Coil

So the coil experiences a torque and rotates. The coil is attached to a shaft, so the rotation can be observed when a DC source is supplied. This shaft is connected to a gear or a wheel to get the desired result.

The rpm of the motor depends on the strength of the magnetic field (B), the current (i) from the DC supply, and the DC voltage (V). To keep all components undamaged, the preferred voltage range is 7-12 V.

4.1.4 9V Batteries

Figure 4.6 – 9V Battery

Batteries generally consist of three parts: an anode (-), a cathode (+) and the electrolyte. The cathode and the anode are connected to an electrical circuit.

Figure 4.7 – Working of Battery

Chemical reactions in the battery lead to a build-up of electrons at the anode. This results in an electrical difference between the anode and the cathode. You can think of this difference as an unstable build-up of electrons. The electrons want to rearrange themselves to get rid of this difference, but they do so in a certain way. Electrons repel each other and try to go to a place with fewer electrons.

In a battery, the only place to go is the cathode. The electrolyte prevents the electrons from going directly from the anode to the cathode within the battery. When the circuit is closed (a wire connects the cathode and the anode), the electrons are able to reach the cathode. In the picture above, the electrons pass through the wire, lighting the light bulb along the way. This is one way of describing how electrical potential causes electrons to flow through the circuit.

However, these electrochemical processes change the chemicals in the anode and the cathode so that they eventually stop supplying electrons. So there is a limited amount of power available in a battery.

When you recharge a battery, you change the direction of the flow of electrons using another power source, such as solar panels. The electrochemical processes run in reverse, and the anode and cathode are restored to their original state and can provide power again.

4.1.5 LED

Figure 4.8 – LED

Here the LED has two terminals, cathode and anode. The anode is the longer terminal and the cathode is the shorter terminal.

Figure 4.9 – LED Working

LEDs are simply diodes that are designed to emit light. When a diode is forward-biased, electrons and holes move back and forth across the junction and constantly combine and annihilate each other. Sooner or later, once an electron moves from the n-type into the p-type silicon, it combines with a hole and disappears. That makes an atom complete and more stable, and it gives off a little burst of energy in the form of a tiny ‘packet’, or photon, of light.

Chapter 5

Software Description

5.1 Software Used

1. Arduino

2. MATLAB

5.1.1 Arduino

This project uses Arduino 1.0.5 IDE for programming the Arduino Board.

Figure 5.1.1-Arduino IDE

The upper part of the IDE consists of several symbols in the toolbar. Each symbol carries out a specific task.

The ‘Tick’ symbol in the upper left corner is the compile button. The ‘Right arrow’ symbol next to the compile button is the upload button. This feature lets the user upload the code to the Arduino microcontroller.

The ‘Script’ symbol is the new script button. It lets the user open a new script for a new program. The ‘Arrow up’ symbol is the open script button. It opens a previously saved program selected by browsing the files on the PC, so the user can open saved Arduino programs directly. The ‘Arrow down’ symbol is the save button. It saves the program that the user has typed in. The file is saved in the ‘.ino’ format by default.

The following steps are to be followed to upload a program to the Arduino:

1. Open the Arduino IDE.

2. Press the open script button to open a previously saved program, or open one of the examples that come with the software. To open an example program, select File, then Examples.

3. The example programs are listed there. To select the LED blinking program, go to Basics, then Blink.

Figure 5.1.2 – Selecting / opening a program in the Arduino IDE

4. Now that the program is open, select the COM port to which the Arduino board is connected.

5. Select Tools, then Serial Port, and choose the COM port of the Arduino board.

6. Now select the Arduino Uno (used in this project) by going to Tools, Board, Arduino Uno.

Figure 5.1.3 – Selection of the board in the Arduino IDE

7. Now upload the program by pressing the upload button.

5.1.2 MATLAB

5.1.2. (a) Introduction to MATLAB:

MATLAB is a high-performance language for technical computing. It integrates computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation. Typical uses include:

• Math and computation

• Algorithm development

• Data acquisition

• Modeling, simulation, and prototyping

MATLAB is an interactive system whose basic data element is an array that does not require dimensioning. This allows you to solve many technical computing problems, especially those with matrix and vector formulations, in a fraction of the time it would take to write a program in a scalar non interactive language such as C or FORTRAN.

The name MATLAB stands for matrix laboratory. MATLAB was originally written to provide easy access to matrix software developed by the LINPACK and EISPACK projects. Today, MATLAB engines incorporate the LAPACK and BLAS libraries, embedding the state of the art in software for matrix computation.

MATLAB has evolved over a period of years with input from many users. In university environments, it is the standard instructional tool for introductory and advanced courses in mathematics, engineering, and science. In industry, MATLAB is the tool of choice for high-productivity research, development, and analysis.

MATLAB features a family of add-on application-specific solutions called toolboxes. Very important to most uses of MATLAB, toolboxes allow you to learn and apply specialized technology. Toolboxes are comprehensive collections of MATLAB functions (M – files) that extend the MATLAB environment to solve particular classes of problems. Areas in which toolboxes are available include signal processing, control systems, neural networks, fuzzy logic, wavelets, simulation, and many others.

5.1.2. (b) The MATLAB system:

The MATLAB system consists of five main parts:

• Development Environment:

This is the set of tools and facilities that help you use MATLAB functions and files. Many of these tools are graphical user interfaces. It includes the MATLAB desktop and command window, a command history, an editor and debugger, and browsers for viewing help, the workspace, files, and the search path.

• The MATLAB Mathematical Function Library:

This is a vast collection of computational algorithms ranging from elementary functions, like sum, sine, cosine, and complex arithmetic, to more sophisticated functions like matrix inverse, matrix eigenvalues, Bessel functions, and fast Fourier transforms.

• The MATLAB Language:

This is a high-level matrix/array language with control flow statements, functions, data structures, input/output, and object-oriented programming features. It allows both “programming in the small” to rapidly create quick and dirty throw-away programs, and “programming in the large” to create large and complex application programs.

• Graphics:

MATLAB has extensive facilities for displaying vectors and matrices as graphs, as well as annotating and printing these graphs. It includes high-level functions for two-dimensional and three-dimensional data visualization, image processing, animation, and presentation graphics. It also includes low-level functions that allow you to fully customize the appearance of graphics as well as to build complete graphical user interfaces on your MATLAB applications.

• The MATLAB Application Program Interface (API):

This is a library that allows you to write C and FORTRAN programs that interact with MATLAB. It includes facilities for calling routines from MATLAB (dynamic linking), calling MATLAB as a computational engine, and for reading and writing MAT-files.

MATLAB offers various toolboxes for recognition techniques; in this project we use the Image Processing Toolbox.

5.1.2. (c) GRAPHICAL USER INTERFACE (GUI):

MATLAB’s Graphical User Interface Development Environment (GUIDE) provides a rich set of tools for incorporating graphical user interfaces (GUIs) in M-functions. Using GUIDE, the processes of laying out a GUI (i.e., its buttons, pop-up menus, etc.) and programming the operation of the GUI are divided conveniently into two easily managed and relatively independent tasks. The resulting graphical M-function is composed of two identically named (ignoring extensions) files:

• A file with extension .fig, called a FIG-file, that contains a complete graphical description of all the function’s GUI objects or elements and their spatial arrangement. A FIG-file contains binary data that does not need to be parsed when the associated GUI-based M-function is executed.

• A file with extension .m, called a GUI M-file, which contains the code that controls the GUI operation. This file includes functions that are called when the GUI is launched and exited, and callback functions that are executed when a user interacts with GUI objects, for example when a button is pushed.

To launch GUIDE from the MATLAB command window, type guide filename, where filename is the name of an existing FIG-file on the current path. If filename is omitted, GUIDE opens a new (i.e., blank) window.

Figure 5.2.1 Basic guide window to create a GUI

A graphical user interface (GUI) is a graphical display in one or more windows containing controls, called components that enable a user to perform interactive tasks. The user of the GUI does not have to create a script or type commands at the command line to accomplish the tasks. Unlike coding programs to accomplish tasks, the user of a GUI need not understand the details of how the tasks are performed.

GUI components can include menus, toolbars, push buttons, radio buttons, list boxes, and sliders just to name a few. GUIs created using MATLAB tools can also perform any type of computation, read and write data files, communicate with other GUIs, and display data as tables or as plots.

The Implementation of a GUI

While it is possible to write an M-file that contains all the commands to lay out a GUI, it is easier to use GUIDE to lay out the components interactively and to generate two files that save and launch the GUI:

A FIG-file – contains a complete description of the GUI figure and all of its children (uicontrols and axes), as well as the values of all object properties.

An M-file – contains the functions that launch and control the GUI and the callbacks, which are defined as sub functions. This M-file is referred to as the application M-file in this documentation. Note that the application M-file does not contain the code that lays out the uicontrols; this information is saved in the FIG-file.

The following diagram illustrates the parts of a GUI implementation.

Figure 5.2.2 Parts of GUI Implementation

User Interface Controls

The Layout Editor Component palette contains the user interface controls that you can use in your GUI. These components are MATLAB uicontrol objects and are programmable via their Callback properties. This section provides information on these components.

• Push Buttons

• Toggle Buttons

• Frames

• Radio Buttons

• Listboxes

• Checkboxes

• Popup Menus

• Edit Text

• Axes

• Static Text

• Figures

Chapter 6

Algorithm of MFCC approach

6.1 MFCC Approach and Flowchart

A block diagram of the structure of an MFCC processor is shown in Figure 6.1. Speech input is typically recorded at a sampling rate above 10000 Hz. This sampling frequency was chosen to minimize the effects of aliasing in the analog-to-digital conversion. These sampled signals can capture all frequencies up to 5 kHz, which cover most of the energy of sounds generated by humans. The main purpose of the MFCC processor is to mimic the behavior of the human ear. In addition, MFCCs are shown to be less susceptible to the variations mentioned above than the speech waveforms themselves.

Figure 6.1 MFCC block diagram

First we store the speech signal as a vector of 10000 samples. In our experiments it was observed that the actual uttered speech, once the static portions are eliminated, comes to about 2500 samples; therefore silence detection with a simple threshold technique was carried out to extract the actual uttered speech. Clearly, what we wanted to achieve was a voice-based biometric system capable of recognizing isolated words. As our experiments revealed, almost all isolated words were uttered within 2500 samples. We then directly apply the overlapping triangular windows in the frequency domain. We obtain the energy within each triangular window, followed by the DCT of their logarithms, to achieve good compaction within a small number of coefficients, as described by the MFCC approach.
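The simple threshold technique for silence detection can be sketched as follows. The exact rule used in the project is not spelled out, so this is an assumption: it keeps everything between the first and the last sample whose magnitude exceeds the threshold. The function name and the threshold value in the example are hypothetical.

```python
def trim_silence(samples, threshold):
    """Keep the span from the first to the last sample above the threshold."""
    active = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not active:
        return []  # nothing but silence was recorded
    return samples[active[0]:active[-1] + 1]


recording = [0.0, 0.01, 0.5, -0.7, 0.02, 0.6, 0.0]
speech = trim_silence(recording, threshold=0.1)
```

Quiet samples inside the utterance (such as the 0.02 above) are kept, so short pauses within a word do not split it.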

The simulation was carried out in MATLAB. The various stages of the simulation are shown in the plots below. The continuous input speech signal used as an example for this project is the word “Hello”.

Figure 6.2 Word “Hello” taken for analysis

Figure 6.3 Word “Hello” after silence detection

Figure 6.4 Word “Hello” after windowing using Hamming window

Figure 6.5 Word “Hello” after FFT

Figure 6.6 MFCC Approach Block Diagram

Chapter 7

Graphical User Interface

7.1 GUI Representing Different Windows

In order to have an easier representation of the speech signals, a GUI is used in this project. The GUI used here contains various windows. The first window is the User Selection window. In this project the samples from two users are taken into consideration. The current user has to choose between user1 and user2, as shown in the figure below, and then has to store the samples.

Figure 7.1 GUI that represents the initial window for users to record

When a person chooses one of the two options, i.e. user1 or user2, a separate window pops up depending upon the choice. The windows corresponding to user1 and user2 are shown below.

Figure 7.2 GUI that represents different cases for User 1

Figure 7.3 GUI that represents different cases of User 2

After storing the samples from the two individuals, we need to test the user recognition process. To do that, one of the individuals has to give the input signal, which is compared with the previously stored samples.

The window below is the one a person uses to give the input signal. When the input is recorded here, the signal is processed using the MFCC approach and the mean square error is calculated for each case. Once the mean square error is less than the threshold value, the corresponding case of the signal is shown in the graphical interface.

Figure 7.4 GUI indicating to record the input signal for processing

After the input signal is recorded here, the signal is processed using MFCC approach and the mean square error is calculated for each case. If the mean square error between the input signal and the stored samples is less than the threshold value, the input given by the user is detected. An example from the project is given below.

Figure 7.5 GUI that represents when the fan signal is detected

Now similarly if the input signal does not correspond to any of the samples from the two users, we get the message that the signal is not detected as shown below.

Figure 7.6 GUI when no signal is detected

Chapter 8

Implementation and Working

8.1 Introduction

This chapter gives an idea of how the project is implemented and explains how it works. The working of the project is explained in three parts.

1. MATLAB Working

2. MATLAB-Arduino Interface

3. Arduino Working

8.1.1 MATLAB Working

Figure 8.1 MATLAB Working

The MATLAB working depends on the input given by the user through the GUI. If the chosen action is ‘Sample Record’, the recording is performed. The input is sampled and stored by MATLAB as a ‘.wav’ audio file on the computer.

If the chosen action is ‘Record Here’, the user’s voice input is recorded once. MATLAB first samples this input, then applies the MFCC approach and calculates the mean square error against the stored samples. If the mean square error is less than the threshold, the corresponding stored sample is taken as the recognized input.

Depending on the recognized input, a character is sent to the Arduino so that it produces the required output.

8.1.2 MATLAB-Arduino Interface

In our project the input is the speech signal and the output is the desired action on the electrical and mechanical appliances in the home. The first step is to establish a connection between the input and the microcontroller. This is achieved through MATLAB (the input receiver) and the Arduino (the microcontroller).

The interface between a microcontroller and a computer usually uses one of two communication interfaces: UART or USART. UART stands for Universal Asynchronous Receiver and Transmitter, and USART stands for Universal Synchronous Asynchronous Receiver and Transmitter. The interface between the Arduino UNO and the computer is a TTL-level UART.
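To make the asynchronous framing concrete, the sketch below builds the bit sequence a UART would put on the line for a single character. It assumes the common 8N1 format (one start bit, eight data bits sent least significant bit first, no parity, one stop bit). The text does not state the frame format used in the project, so this is an assumption, and the function name `uart_frame_8n1` is hypothetical.

```python
from typing import List


def uart_frame_8n1(char: str) -> List[int]:
    """Return the line levels for one character in 8N1 framing:
    start bit (0), eight data bits LSB first, stop bit (1)."""
    if len(char) != 1 or ord(char) > 0xFF:
        raise ValueError("expected a single 8-bit character")
    code = ord(char)
    data_bits = [(code >> i) & 1 for i in range(8)]  # least significant bit first
    return [0] + data_bits + [1]


# The character '1' has ASCII code 0x31 (binary 00110001).
print(uart_frame_8n1("1"))
```

Because no clock line is shared, the receiver relies on the start bit and an agreed baud rate to sample the data bits at the right moments. That is the "asynchronous" part of the name.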

With the help of MATLAB we build a GUI (Graphical User Interface) that lets the user give a speech signal as input in a systematic way. Initially both users record the samples for light, fan and door in their own tone and pitch. After the samples are saved, the input signal is recorded. The MFCC approach is applied to each sample and to the input signal, and the mean square error is computed between the input signal and each sample. The sample whose error is below the threshold is taken as the input given by the user. In the MATLAB program a character is assigned to each detected command for transmission: ‘1’ for FanOn, ‘2’ for FanOff, ‘3’ for LightOn, ‘4’ for LightOff, ‘5’ for DoorOn and ‘6’ for DoorOff.

The Arduino is connected to the laptop through USB, and the microcontroller is programmed to perform a particular action for each character received over USB.
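The command-to-character mapping described in this section can be written as a small lookup table. The sketch below is in Python rather than the project's MATLAB code. The command names and characters are taken from the text, while `COMMAND_CHARS`, `send_command` and the injected `write` callable are hypothetical names used only for illustration. The transport is passed in as a function so that the example runs without any hardware attached.

```python
from typing import Callable

# Mapping stated in the text: one character per detected command.
COMMAND_CHARS = {
    "FanOn": "1",
    "FanOff": "2",
    "LightOn": "3",
    "LightOff": "4",
    "DoorOn": "5",
    "DoorOff": "6",
}


def send_command(command: str, write: Callable[[bytes], object]) -> bytes:
    """Look up the character for a detected command and pass it to `write`.

    `write` stands in for the serial port's write routine (an assumption).
    """
    if command not in COMMAND_CHARS:
        raise ValueError(f"unknown command: {command!r}")
    payload = COMMAND_CHARS[command].encode("ascii")
    write(payload)
    return payload


# Stand-in transport: collect what would have been sent over USB.
sent = []
send_command("LightOn", sent.append)
print(sent)
```

In a real setup `write` would be the write method of an open serial connection to the Arduino. The port name and baud rate depend on the machine and on the Arduino program, and are not given in the text.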

8.1.3 Arduino Working

Figure 8.2 Arduino Working

The Arduino working is simple. The character received from MATLAB through the above procedure is compared with the six expected characters to identify it. Once it is identified, the Arduino drives the targeted output, such as the door motor or the tube light.
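The comparison step can be modelled as a dispatch table. The actual firmware would be an Arduino program, so the Python sketch below only models the logic. The device names, the `ACTIONS` table and `handle_character` are hypothetical, and reading "DoorOn" as "switch the door output on" is an assumption.

```python
from typing import Dict

# Each expected character maps to (device, desired state).
ACTIONS = {
    "1": ("fan", True),
    "2": ("fan", False),
    "3": ("light", True),
    "4": ("light", False),
    "5": ("door", True),
    "6": ("door", False),
}


def handle_character(char: str, outputs: Dict[str, bool]) -> bool:
    """Update `outputs` for a recognised character.

    Returns False and leaves `outputs` unchanged for any other character.
    """
    action = ACTIONS.get(char)
    if action is None:
        return False
    device, state = action
    outputs[device] = state
    return True


outputs = {"fan": False, "light": False, "door": False}
handle_character("5", outputs)  # character for DoorOn
print(outputs)
```

Ignoring unrecognised characters keeps stray bytes on the serial line from switching an appliance by accident.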

Chapter 9

Conclusion and Future Scope

9.1 Conclusion

The goal of this project was to create a speaker recognition system that can be used in home automation. The features of the sample signals from the users are extracted and then compared with the features of the input signal in order to find a match. Feature extraction is done using MFCC (Mel Frequency Cepstral Coefficients).

9.2 Applications

After nearly sixty years of research, speech recognition technology has reached a relatively high level. However, most state-of-the-art ASR systems run on desktop computers with powerful microprocessors, ample memory and a constant power supply. In recent years, with the rapid evolution of hardware and software technologies, ASR has become increasingly practical as an alternative human-to-machine interface that will be needed in the future.

9.3 Future Enhancements

This project focused on “Isolated Word Recognition”. We believe the idea can be extended to “Continuous Word Recognition” and ultimately to language-independent recognition. Statistical models such as HMMs and GMMs, or learning models such as neural networks and other related areas of artificial intelligence, could also be incorporated to improve the present project. This would make the system much more tolerant of variations such as accent and of extraneous conditions such as noise, and thus make it less error prone.

1. We can use a more accurate method like Neural Networks for speech processing.

2. Instead of just ON and OFF functions we can add a feature of variable output, to control the speed of the fan or brightness of the lights.

