This project comes from an interview task. It was completed in 3 days, which included the literature review, corpus preparation and model training, so many things are not tuned as well as they could be. I will update it in the future. For now, this post records the thinking behind, and the solution to, this gender classification task.
The project is uploaded to: https://github.com/EyonJoshua91/genderRecognition
Due to the limited time, I have only implemented a model with MFCC features and GMM classifiers. There is no VAD, and the GMM hyper-parameters are deliberately set small, without any tuning. The training corpus is selected from the VoxForge dataset: 1200 audio files for male speakers and 1200 for female speakers. The final accuracy is 79.75% on 10% of the training corpus. Improvement can be expected from tuning the hyper-parameters and adding VAD.
For testing, you can use GenderDetect.py to perform the job. Usage can be found on the GitHub page or in its documentation.
The goal of this project is to build a system that tells whether the speaker in any audio clip is male or female.
After reviewing recent literature (2015-2016), the common recipe for gender classification can be divided into two steps. The first step is to choose effective features to represent the original speech signal; the second is to use a classifier to detect gender. I will discuss the options for each step and show what I have tried in this task.
Noisy or otherwise degraded speech is more likely to make a gender classification system fail. It is therefore better to design an audio filter for the specific acoustic environment. I was using the VoxForge corpus, so this step is skipped. The following steps are pre-emphasis, framing and windowing of the signal.
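As a rough illustration of the pre-emphasis, framing and windowing steps, here is a minimal sketch using only the Python standard library. The pre-emphasis coefficient 0.97, the 25 ms frame length, the 10 ms hop and the Hamming window are common defaults assumed here for illustration, and the function names are hypothetical; none of them are taken from the project code.

```python
import math

def pre_emphasis(signal, coeff=0.97):
    """Apply y[n] = x[n] - coeff * x[n-1]; the first sample is kept unchanged."""
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))
    ]

def frame_signal(signal, frame_len, hop_len):
    """Split the signal into overlapping frames; a trailing partial frame is dropped."""
    if len(signal) < frame_len:
        return []
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return [signal[i * hop_len : i * hop_len + frame_len] for i in range(n_frames)]

def hamming(frame_len):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if frame_len == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
        for n in range(frame_len)
    ]

def preprocess(signal, sample_rate, frame_ms=25, hop_ms=10, coeff=0.97):
    """Pre-emphasize, frame and window a mono signal given as a list of floats."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = hamming(frame_len)
    emphasized = pre_emphasis(signal, coeff)
    return [
        [sample * w for sample, w in zip(frame, window)]
        for frame in frame_signal(emphasized, frame_len, hop_len)
    ]
```

With these assumed settings, a 1-second clip sampled at 16 kHz yields 98 frames of 400 samples each.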
The goal of this step is to translate frame-level audio into corresponding features. The core idea is to design features that cover the spectral and prosodic properties of the signal; a single feature is rarely accurate enough across a large variety of speakers. Based on that, two kinds of features are popular in the literature. One is to compute summary statistics of speech features such as pitch: F0 mean, maximum and minimum, formant frequencies, spectral skewness, spectral spread and many other designed indices that describe the current time span [3,4,7]. However, the algorithms for calculating these indices are not always reliable, because they depend on the specific acoustic environment, so many researchers use MFCCs, which are widely used in the speech area, as the classifier input [1,2,5,6].
Once the feature set is determined and used as input, the task becomes a common pattern recognition problem: classification. Generally speaking, the problem is to derive a soft or hard decision boundary in feature space that classifies a feature as male or female. All of the algorithms construct mathematical models that take the designed features as input and output either the label or the probability of being female or male. I will list some candidate classifiers that suit this task.
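To make the first kind of feature more concrete, the sketch below estimates F0 for one frame and then summarizes a pitch contour by its mean, minimum and maximum. The autocorrelation peak picker is a simple stand-in chosen for brevity, not the method used in the cited papers, and the 50-400 Hz search range and the function names are assumptions made for illustration.

```python
import statistics

def estimate_f0(frame, sample_rate, fmin=50.0, fmax=400.0):
    """Rough F0 estimate in Hz from the autocorrelation peak of one frame.

    Returns None when no positive autocorrelation peak is found
    (for example, for a silent frame).
    """
    min_lag = max(int(sample_rate / fmax), 1)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation at this lag.
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else None

def f0_summary(f0_values):
    """Summary statistics over the frames for which an F0 estimate exists."""
    voiced = [f for f in f0_values if f is not None]
    if not voiced:
        return None
    return {
        "mean": statistics.mean(voiced),
        "min": min(voiced),
        "max": max(voiced),
    }
```

A summary like this produces one small feature vector per utterance, whereas MFCCs produce one vector per frame; that difference is why the two feature types are usually paired with different decision strategies.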
1 K-Nearest Neighbor(KNN):
This is a straightforward method to classify features. The literature reviewed reports that the larger k is set, the worse the result gets. Thus KNN is the easiest to implement, but the result is not guaranteed.
2 Support Vector Machine(SVM):
It can be treated as a two-step classification process. First, a kernel function, such as a polynomial or Gaussian kernel, performs a low-to-high dimensional feature transformation. Then we find a hyperplane that separates male from female and maximizes the margin. The literature reviewed reports that this method can reach an acceptable result, especially with a radial basis function (RBF) kernel.
3 Gaussian Mixture Model(GMM):
This is also widely used for modeling speech features. We can build male and female GMMs separately and, by comparing their output probabilities, assign a label to each feature. According to [2,5], the number of mixtures can be set to 128, 256, 512 or 1024 to expect acceptable performance.
4 Multi-Layer Perceptron(MLP):
This is the simplest neural network (NN) to perform the classification job. It may not produce a good result, but it can be treated as a baseline for NN-based methods.
5 Deep Belief Network(DBN):
This is the state-of-the-art NN in the speech area. Training is also divided into 2 steps: first, greedy layer-wise unsupervised training is applied to the stacked RBMs; then back-propagation is used to fine-tune the final weights.
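Of the classifiers listed above, the GMM approach is the easiest to sketch: score a feature vector under a male model and a female model, then pick the higher log-likelihood. The example below assumes diagonal covariances and uses tiny made-up two-component models on 2-dimensional features purely for illustration; real models would be trained with EM on MFCC vectors and would use far more mixtures. All names and parameter values are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x) = log sum_k w_k N(x; mean_k, var_k), computed with log-sum-exp."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def classify_frame(x, male_gmm, female_gmm):
    """Return (label, male_ll, female_ll) for one feature vector."""
    male_ll = gmm_log_likelihood(x, *male_gmm)
    female_ll = gmm_log_likelihood(x, *female_gmm)
    label = "male" if male_ll > female_ll else "female"
    return label, male_ll, female_ll

# Made-up toy models: (weights, means, variances). Not trained on any data.
MALE_GMM = ([0.5, 0.5], [[-1.0, 0.0], [-2.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
FEMALE_GMM = ([0.5, 0.5], [[1.0, 0.0], [2.0, -1.0]], [[1.0, 1.0], [1.0, 1.0]])
```

Working in the log domain with log-sum-exp avoids numerical underflow, which matters once the feature dimension and the number of mixtures grow.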
1 Unvoiced Features:
It is obvious that we can only tell the gender of a speaker from the voiced part of speech, so we need some mechanism to deal with the unvoiced part. There may be 2 ways to handle this problem. One is to use a VAD (voice activity detection) method to detect non-speech frames and discard them. The other is to permute the frames randomly and use a sliding window to average the permuted frames; the averaged result then becomes the new input to the classifier.
2 Classification results at frame level and sentence level:
The discussion above is all at the frame level. Given an audio clip, we can output the label or probability of each frame in it. We then have two ways to determine the label of the whole clip. One is simply to pick the label that appears most often; the other is to accumulate the probabilities of the different labels and choose the larger one as the final label.
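The two clip-level decision rules described above can be sketched as follows. Summing per-frame log-likelihoods is used here as one reading of "accumulating the probability"; that choice, the function names and the input format (a list of `(male_ll, female_ll)` pairs, one per frame) are assumptions for illustration.

```python
from collections import Counter

def majority_vote(frame_labels):
    """Pick the label that appears most often across the frames."""
    return Counter(frame_labels).most_common(1)[0][0]

def accumulate_log_likelihood(frame_scores):
    """Sum per-frame log-likelihoods per gender and pick the larger total.

    frame_scores is a list of (male_ll, female_ll) pairs, one per frame.
    """
    male_total = sum(male_ll for male_ll, _ in frame_scores)
    female_total = sum(female_ll for _, female_ll in frame_scores)
    return "male" if male_total > female_total else "female"

# Hypothetical scores for a 3-frame clip.
scores = [(-1.0, -1.2), (-1.0, -1.1), (-9.0, -2.0)]
labels = ["male" if m > f else "female" for m, f in scores]

vote_label = majority_vote(labels)               # "male": 2 of 3 frames
sum_label = accumulate_log_likelihood(scores)    # "female": -4.3 beats -11.0
```

The two rules can disagree, as in this made-up clip: the vote ignores how confident each frame is, while the accumulated score lets one very confident frame outweigh two marginal ones.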
Current project recipe:
The following pipeline illustrates the options for gender classification based on MFCCs.
Further Consideration(after coding):
1 Currently, the system is text-independent. If the text is restricted, features focused on spectral statistics, like F0, may be better for the system.
2 In the coding part, the numbers of female and male samples are equal on purpose. This is meaningful for research. For a product, however, an equal split may not be the right choice for the best system performance.
[1] Z. Qawaqneh, A. Abumallouh and B. D. Barkana, “Modifying deep neural network structure for improved learning rate in speakers’ age and gender classification,” 2016 Annual Connecticut Conference on Industrial Electronics, Technology & Automation (CT-IETA), Bridgeport, CT, USA, 2016, pp. 1-6.
[2] M. Kos, D. Vlaj and Z. Kačič, “Speaker’s gender classification and segmentation using spectral and cepstral feature averaging,” 2011 18th International Conference on Systems, Signals and Image Processing, Sarajevo, 2011, pp. 1-4.
[3] J. Přibil, A. Přibilová and J. Matoušek, “GMM-based speaker gender and age classification after voice conversion,” 2016 First International Workshop on Sensing, Processing and Learning for Intelligent Machines (SPLINE), Aalborg, 2016, pp. 1-5.
[4] J. Přibil, A. Přibilová and J. Matoušek, “Comparison of one and two-level architecture of the GMM-based speaker age classifier,” 2016 39th International Conference on Telecommunications and Signal Processing (TSP), Vienna, 2016, pp. 299-302.
[5] A. Abumallouh, Z. Qawaqneh and B. D. Barkana, “Deep neural network combined posteriors for speakers’ age and gender classification,” 2016 Annual Connecticut Conference on Industrial Electronics, Technology & Automation (CT-IETA), Bridgeport, CT, USA, 2016, pp. 1-5.
[6] J. Ahmad et al., “Gender Identification using MFCC for Telephone Applications – A Comparative Study,” International Journal of Computer Science and Electronics Engineering, 2015, pp. 351-355.
[7] S. Rahman, F. Kabir, F. Kabir and M. N. Huda, “Automatic gender identification system for Bengali speech,” 2015 2nd International Conference on Electrical Information and Communication Technologies (EICT), Khulna, 2015, pp. 549-553.