The proposed approach is explained in Section 2.
Experiments and sample results are presented in Section 3.
Conclusions and future directions of the work are discussed in Section 4.
1. Framing: The recorded discrete signal s(n) always has a finite length NTOTAL, but it is usually not processed as a whole because it is only quasi-stationary. The signal is therefore split into frames of length N, with N much smaller than NTOTAL. Since the vocal tract cannot change its shape faster than about fifty times per second, the signal can be assumed stationary over periods of roughly 20 milliseconds. The frame length N is a compromise between time and frequency resolution, and consecutive frames are overlapped to increase the precision of the recognition process.
2. Windowing: Before further processing, the individual frames are windowed. We have used a standard Hamming window: sw(n) = s(n)·w(n), where sw(n) is the windowed signal, s(n) is the original signal N samples long, and w(n) is the window itself.
3. Pre-Emphasis: Pre-emphasis processes the input signal with a low-order digital FIR filter so as to spectrally flatten it in favor of the vocal tract parameters. It makes the signal less susceptible to later finite-precision effects. The filter is usually a first-order FIR filter defined as sp(n) = s(n) − a·s(n−1), where a is a pre-emphasis coefficient usually lying in the interval (0.9, 1), s(n) is the original signal, and sp(n) is the pre-emphasized signal. A MATLAB sketch of these three pre-processing steps follows this list.
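As an illustration only, the following MATLAB sketch implements the three steps in the order listed above (in practice, pre-emphasis is often applied before framing). The frame length N = 220 (about 20 ms at 11025 Hz), the 50% overlap, the coefficient a = 0.95, and the file name are assumptions for the example, not the exact values used in our experiments.

    % Pre-processing sketch: framing, windowing, and pre-emphasis.
    fs   = 11025;                    % sampling rate of our database (Hz)
    N    = 220;                      % frame length, roughly 20 ms (assumed)
    step = N/2;                      % 50% frame overlap (assumed)
    a    = 0.95;                     % pre-emphasis coefficient in (0.9, 1)
    s    = audioread('sample.wav');  % hypothetical input file
    w    = hamming(N);               % Hamming window w(n)
    numFrames = floor((length(s) - N)/step) + 1;
    frames = zeros(N, numFrames);
    for k = 1:numFrames
        idx = (k-1)*step + (1:N);            % sample indices of frame k
        sw  = s(idx) .* w;                   % windowing: sw(n) = s(n)*w(n)
        frames(:,k) = filter([1 -a], 1, sw); % pre-emphasis: sp(n) = sw(n) - a*sw(n-1)
    end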
We used a part of the CMA PDU database and then created a custom database of our own.
All files have a sampling rate of 11025 Hz and are stored in WAV format. They are single-channel, with 8 bits per sample.
This is the link to our database
After using an SVM to classify the test samples based on the extracted features, we obtained different accuracies for different feature sets. Using formants and pitch together, the accuracy was 85%; although formants were expected to improve the classification, we obtained a higher accuracy of 92% using pitch alone.
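A minimal MATLAB sketch of this classification step is shown below. It is not the exact experimental code: the pitch values are synthetic stand-ins for the extracted features, and the 75/25 hold-out split is an assumption.

    % SVM gender classification sketch on a single pitch feature.
    rng(0);                                  % reproducible synthetic data
    pitchMale   = 120 + 15*randn(40,1);      % synthetic male pitch (~120 Hz)
    pitchFemale = 210 + 20*randn(40,1);      % synthetic female pitch (~210 Hz)
    X = [pitchMale; pitchFemale];
    y = [repmat({'male'},40,1); repmat({'female'},40,1)];
    cv   = cvpartition(numel(y), 'HoldOut', 0.25);   % 75/25 train-test split
    mdl  = fitcsvm(X(training(cv)), y(training(cv)));% train the SVM
    pred = predict(mdl, X(test(cv)));                % classify test samples
    acc  = 100 * mean(strcmp(pred, y(test(cv))));    % accuracy in percent
    fprintf('Classification accuracy: %.1f%%\n', acc);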
Fig. 1, 2: Identification of voiced frames
Fig. 3, 4: Calculation of the frequencies of voiced speech samples
Fig. 5, 6: Identifying clusters, assigning them to male and female, and calculating the accuracy
Fig. 7, 8: Variables calculated by the classifying SVM and plot of the male vs. female feature clusters
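For illustration, the following MATLAB sketch outlines the voiced-frame detection and pitch calculation steps shown in Figs. 1-4. It assumes the frames matrix and fs from the pre-processing sketch above; the energy threshold and the 80-400 Hz pitch search range are assumptions, not our exact settings.

    % Voiced-frame detection and autocorrelation pitch estimation sketch.
    minLag = round(fs/400);              % upper pitch bound: 400 Hz
    maxLag = round(fs/80);               % lower pitch bound: 80 Hz
    energy = sum(frames.^2);             % short-time energy of each frame
    voiced = energy > 0.1*max(energy);   % simple energy threshold (assumed)
    pitch  = nan(1, size(frames,2));
    for k = find(voiced)
        r = xcorr(frames(:,k), maxLag, 'coeff'); % normalized autocorrelation
        r = r(maxLag+1:end);                     % keep lags 0..maxLag
        [~, i] = max(r(minLag+1:maxLag+1));      % strongest periodicity peak
        pitch(k) = fs/(minLag + i - 1);          % lag -> pitch frequency (Hz)
    end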
Considering the efficiency of the results obtained, we conclude that the algorithm implemented in MATLAB works successfully. Different utterances by the same speaker, spoken under near-identical conditions, produced the same pitch value, suggesting that with further work the system could also be used for speaker identification.
This project focused on gender classification using speech signals. Three main parameters, based on pitch, linear prediction cepstral coefficients (LPCC), and formant frequency extraction, were discussed. Each of the described algorithms has its advantages and drawbacks. In the experimental results, the pitch method achieved the highest accuracy on the data sets considered. A sketch of LPC-based formant estimation is given below.
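As a rough illustration of what formant frequency extraction involves, the MATLAB sketch below estimates formants from the poles of an LPC model of a single voiced frame. It assumes frames, voiced, and fs from the earlier sketches; the model order 12 is an assumed, typical choice, and no bandwidth-based pole filtering is done here.

    % LPC-based formant estimation sketch for one voiced frame.
    frame = frames(:, find(voiced, 1));  % pick a voiced frame (assumed available)
    a = lpc(frame, 12);                  % all-pole (LPC) model coefficients
    p = roots(a);                        % poles of the vocal-tract model
    p = p(imag(p) > 0);                  % keep one pole per conjugate pair
    f = sort(angle(p) * fs/(2*pi));      % pole angles -> frequencies (Hz)
    f = f(f > 90);                       % discard near-DC poles
    formants = f(1:3);                   % first three formants F1, F2, F3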
Our long-term goal is to implement a gender classifier that can automatically predict the gender of a speaker based on the above investigation. We consider an indoor (low-noise) environment and want to show that classification of sounds into such global categories can be performed with very little computational effort.
By identifying the gender and removing the gender-specific components, higher compression rates can be achieved for a speech signal, increasing the information content that can be transmitted and saving bandwidth. Our work on gender detection showed that the model can be successfully applied in speaker identification, separating male and female speakers to reduce the computation involved at a later stage. Further work is also needed on formant calculation by extracting the vowels from the speech; while working on formants, we concluded that including formants for gender detection would make the system text-dependent.