Predicting Income and Gender with Census Data

October 20, 2020

In this project we try to predict income and gender using US Census Data. We will use classification models to try to predict both variables.
Before we try out the different classification models there was some data exploration and processing that had to occur first: 1 - Since income data was a numerical variable, we changed it to a categorical variable where the income was either <= 50k or >50k. 2 - We deal with missing and duplicate values 3 - Removal of outliers 4 - Correlation between features 5 - One-hot-encoding categorical variables 6 - Splitting data into training and test Data

After that is done we start to run and evaluate models with the following metrics:

1 - Accuracy

2 - Precision/Recall

3 - F-1 Score

The different models that are used in this project were:

1 - Support Vector Machines

2 - K-Nearest neighbors

3 - Random forest

4 - Gradient Boosting

5 - Logistic Regression

We were able to create prediction models for income and gender with over an 85% accuracy. Gradient boosting came out as the best working model for both cases.

Later in the project we looked over feature importance within the Census data and then did a GridSearch for each of the types of models to determine the best parameters before running the actual models. The complete workbook can be found within the Github link

Github Repository