Spoken Language Identification via Neural Network

Language is important in almost every aspect of our lives. It allows people to communicate, share ideas, and express feelings and desires.

Currently, an estimated 6,500 unique languages are spoken in our world. However, due to the ever-accelerating advancement of technology, our world seems to be shrinking by the moment.

According to information from the US Census and the American Community Survey, 1 in every 5 US residents speaks a language other than English natively. This share has nearly doubled since 1980.

This points to an increase in the number of residents who require some form of language support to access public services.

The inspiration and goal of my project came from my diverse background and intellectual curiosity. As a proof of concept, I use machine learning to build a model that predicts the language spoken in an audio recording.

As a practical application, this model could be used by emergency response services to identify the language a caller is speaking.

The raw data for my project comes from Common Voice, a dataset by Mozilla which consists of single-word samples from 18 different languages.

Volunteers record themselves speaking in various languages and cross-validate recordings made by other volunteers.

In the exploratory data analysis, I noted a significant imbalance in the number of samples per language class.

Although there are viable methods to counter the effects of an imbalanced dataset, including oversampling, I elected to build my model on the 4 classes with the most samples.
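Selecting the most-represented classes can be sketched in a few lines. This is a minimal illustration with hypothetical labels; in the real project the language labels would come from the Common Voice metadata.

```python
from collections import Counter

# Hypothetical per-clip language labels standing in for Common Voice metadata.
labels = ["es", "es", "en", "en", "fr", "fr", "de", "de", "es", "en",
          "it", "pt", "nl", "es", "fr", "en", "de"]

# Count samples per language class and keep only the 4 most frequent.
counts = Counter(labels)
top4 = [lang for lang, _ in counts.most_common(4)]

# Filter the dataset down to the selected classes.
filtered = [lang for lang in labels if lang in top4]
```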

The top 4 languages are Spanish, English, French, and German.

For data preprocessing, each downloaded audio file was converted from its compressed format to a lossless, uncompressed form.

Each audio file also underwent normalization and silence trimming to produce a uniform sample for further processing.

Then, I used the Librosa package to extract features from each of these samples.

The feature extracted from each sample was the MFCC, which stands for Mel-Frequency Cepstral Coefficients. Simply put, this feature mimics the way humans distinguish between frequencies.

Each feature matrix was averaged over time to create a one-dimensional feature vector.

For the baseline model, I used a Random Forest classifier: it is less influenced by outliers and robust with respect to feature selection.
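A baseline of this shape can be sketched with scikit-learn. The random feature vectors and labels below are hypothetical stand-ins for the averaged MFCC vectors and the 4 language classes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for averaged MFCC vectors (13-dim) and 4 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 13))
y = rng.integers(0, 4, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit the Random Forest baseline and predict on the held-out split.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
```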

The results from the model on audio from a known speaker with known vocabulary yielded an accuracy of 43.8% and an F1 score of 0.43.

When the model is applied to audio from an unknown speaker with unknown vocabulary, accuracy drops to around 26% with an F1 score of 0.15.

At this point, the baseline is set. Now we move on to a more complex model: a neural network.

For the neural network architecture, I utilized PlaidML and TensorFlow Keras.

With these, I trained a sequential deep learning model on my embedded device, combining one-dimensional convolution with bidirectional LSTM (long short-term memory) layers.

The output is then passed to a fully connected dense layer prior to softmax classification over the languages.
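An architecture along these lines can be sketched in Keras. The input shape (time frames by 13 MFCCs), layer widths, and pooling step are assumptions for illustration, not the author's exact configuration; only the Conv1D-to-BiLSTM-to-dense-to-softmax ordering is taken from the text.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Conv1D, MaxPooling1D, Bidirectional,
                                     LSTM, Dense)

# Hypothetical input: 44 time frames of 13 MFCCs; 4 output language classes.
model = Sequential([
    Conv1D(64, kernel_size=3, activation="relu", input_shape=(44, 13)),
    MaxPooling1D(pool_size=2),
    Bidirectional(LSTM(64)),          # bidirectional long short-term memory
    Dense(64, activation="relu"),     # fully connected layer
    Dense(4, activation="softmax"),   # softmax over the 4 languages
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```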

The rationale behind this architecture stems from proven auditory classification techniques applied directly to audio signals.

The neural network yielded results in line with the baseline model.

With audio from a known speaker and known vocabulary, the accuracy is around 42%. On audio from an unknown speaker with unknown vocabulary, the accuracy is 25%.

In reviewing my results, I recognized the inherent complexity involved in analyzing audio data. The single-word samples in the dataset I used are not enough to distinguish the classes well.

Further, the oversampling technique I used is inferior to gathering better samples for the model to train on, not to mention the variation in accents and speaking styles.

In practical terms, I believe the model could be tremendously helpful, especially in the public services sector.

For future work, as mentioned, I will gather additional data to better train the models and conduct additional research to gain more domain expertise in spoken language analysis.

Further, in addition to model tuning and architecture changes, I will consider alternative models and techniques for feature extraction and analysis.

Thank you very much for your time and please feel free to reach out if you have any questions or comments.
