Machine Learning Projects

- Published on
- Duration: 3 months (ongoing)
- Role: Data Scientist

This is a collection of the machine learning projects I have worked on. I will update this page as I complete more projects.
Directory of Projects
Link to my GitHub Repo
Supervised Learning
Classification
- Gender Classification
In this tutorial, a dataset containing physical attributes is loaded and used to train a model that predicts a person's gender. This is purely an exercise in training a machine learning model in Python, not a push for any social or political agenda.

First, a static CSV dataset is loaded, then exploratory data analysis (EDA) is performed to understand the data. The dataset contains physical attributes such as forehead width, whether the subject has a wide nose, and the distance from the nose to the lips, which are used to predict the person's gender.
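A minimal sketch of this step, assuming a hypothetical file name `gender_classification.csv` (the actual path is in the linked repo):

```python
import pandas as pd

# Load the static CSV dataset (file name is illustrative)
df = pd.read_csv("gender_classification.csv")

# Basic EDA: dimensions, dtypes, summary statistics, and missing values
print(df.shape)
df.info()
print(df.describe())
print(df.isna().sum())
```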

Next, we prepare the data by first label-encoding the gender column, then MinMax-scaling the numerical columns.
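A sketch of the preprocessing, assuming the target column is named `gender` (the exact column names may differ in the repo):

```python
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Label-encode the target, e.g. "Female"/"Male" -> 0/1
le = LabelEncoder()
df["gender"] = le.fit_transform(df["gender"])

# MinMax-scale every numerical feature into the [0, 1] range
num_cols = df.columns.drop("gender")
scaler = MinMaxScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
```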
Before training the model, we determine the best features using SelectKBest, which keeps the top k features ranked by the ANOVA F-value between each feature and the label.
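A sketch of the feature selection, with `k=5` chosen purely for illustration:

```python
from sklearn.feature_selection import SelectKBest, f_classif

X = df.drop(columns=["gender"])
y = df["gender"]

# f_classif computes the ANOVA F-value between each feature and the label;
# SelectKBest keeps the k highest-scoring features
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)
print(X.columns[selector.get_support()])  # the features that were kept
```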

The first classification model we used was K-Nearest Neighbors (KNN), a simple and effective model for classification tasks. The model was trained on the training set and evaluated on the test set.
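A sketch of the training and evaluation, continuing from the selected features above (the split ratio and `n_neighbors=5` are common defaults, not necessarily what the repo uses):

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
```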
Stats are as follows:
Metric | Score
---|---
Accuracy (%) | 95.6
Precision | 0.95
Recall | 0.96
A confusion matrix was generated to visualize the model's performance, and it shows that the model predicted gender well.
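One way to produce such a plot with scikit-learn (a sketch using the fitted KNN model from above):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Compute and plot the confusion matrix on the held-out test set
ConfusionMatrixDisplay.from_estimator(knn, X_test, y_test)
plt.show()
```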

Next up, we used the Naïve Bayes model, a probabilistic classifier based on Bayes' theorem. The model was trained on the training set and evaluated on the test set.
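A sketch of this step, assuming the Gaussian variant since the features are continuous (the repo may use a different Naïve Bayes flavour):

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Reuse the same train/test split as for KNN
nb = GaussianNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)

print(accuracy_score(y_test, y_pred_nb))
print(precision_score(y_test, y_pred_nb))
print(recall_score(y_test, y_pred_nb))
```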
Stats are as follows:
Metric | Score
---|---
Accuracy (%) | 95.6
Precision | 0.95
Recall | 0.96
Conclusion
In conclusion, while the accuracy scores of the two models were identical, the KNN model had a slightly longer computation time, though the difference can be considered negligible.
That said, this does not make Naïve Bayes the better model for this dataset. Naïve Bayes can suffer from the zero-probability problem, which, according to Glen (2019), occurs when the conditional probability of a particular attribute equals zero. When that happens, the model fails to produce a valid prediction at all.
For this dataset, both models have been assessed as suitable thanks to their low computation time and highly accurate predictions.
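A rough way to reproduce the timing comparison (a sketch only; a proper benchmark would average repeated runs with `timeit`):

```python
import time
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Wall-clock time of fit + predict for each model on the same split
for name, model in [("KNN", KNeighborsClassifier()), ("Naive Bayes", GaussianNB())]:
    start = time.perf_counter()
    model.fit(X_train, y_train)
    model.predict(X_test)
    print(f"{name}: {time.perf_counter() - start:.4f}s")
```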
Regression
Unsupervised Learning
Clustering