The aim of this small project was to predict students' mid-career salary using data from different universities.
The notebook is datafrom_collegetuitiondiversityandpay_dataset.ipynb, and I collected the data from here: https://www.kaggle.com/datasets/jessemostipak/college-tuition-diversity-and-pay
After downloading the 5 tables and loading them with pandas, I merged the 2 tables (salary_potential and tuition_cost) that contained the information needed to predict the mid-career salary.
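As a rough illustration of the merge step, here is a minimal sketch using two tiny stand-in tables instead of the downloaded files. The join key `name`, the other column names, and the inner join are assumptions made for this example; the notebook may merge differently.

```python
import pandas as pd

# Tiny stand-in frames. In practice these would be loaded from the
# downloaded CSV files with pd.read_csv; the columns here are assumed.
salary_potential = pd.DataFrame({
    "name": ["College A", "College B", "College C"],
    "mid_career_pay": [90000, 75000, 82000],
})
tuition_cost = pd.DataFrame({
    "name": ["College A", "College B", "College D"],
    "type": ["Private", "Public", "Public"],
    "in_state_tuition": [40000, 12000, 9000],
})

# An inner join keeps only the schools that appear in both tables.
merged = salary_potential.merge(tuition_cost, on="name", how="inner")
print(merged)
```

An inner join drops schools that are missing from either table, so it is worth comparing the row counts before and after the merge.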
Once I did that, I selected the most important features by computing the correlation coefficients for all the features and sorting them in descending order.
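One way to rank features by correlation is sketched below. It assumes the coefficients are Pearson correlations against the target column, here called `mid_career_pay`; the data and the other column names are made up for the example.

```python
import pandas as pd

# Made-up numeric data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "mid_career_pay": [90000, 75000, 82000, 60000, 98000],
    "early_career_pay": [50000, 42000, 46000, 35000, 55000],
    "in_state_tuition": [40000, 12000, 30000, 9000, 45000],
    "stem_percent": [20, 5, 30, 10, 15],
})

# Pearson correlation of every numeric column with the target,
# excluding the target's trivial correlation with itself.
corr = df.select_dtypes("number").corr()["mid_career_pay"].drop("mid_career_pay")

# Sort in descending order so the most positively correlated features come first.
ranked = corr.sort_values(ascending=False)
print(ranked)
```

Sorting by the raw coefficient puts strongly negative correlations last; sorting by `corr.abs()` instead would rank by strength regardless of sign.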
I then encoded the categorical features, filled in the missing values, and scaled the input features.
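The three preprocessing steps can be combined in a single scikit-learn transformer. The sketch below is one possible setup, not necessarily what the notebook does: the choice of median and most-frequent imputation, one-hot encoding, and standard scaling are all assumptions, as are the column names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up inputs with one missing value per column.
X = pd.DataFrame({
    "type": ["Private", "Public", np.nan, "Public"],
    "in_state_tuition": [40000.0, 12000.0, 30000.0, np.nan],
})

# Numeric columns: fill gaps with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 forces a dense NumPy array as output.
preprocess = ColumnTransformer(
    [("num", numeric, ["in_state_tuition"]), ("cat", categorical, ["type"])],
    sparse_threshold=0,
)

X_prepared = preprocess.fit_transform(X)
print(X_prepared.shape)
```

Fitting the transformer on the training split only, and then calling `transform` on the test split, avoids leaking test-set statistics into the imputation and scaling.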
Next, I fit each model on the training set and predicted on the test set. The algorithm with the lowest RMSE was the DecisionTreeRegressor.
Finally, I plotted the accuracy in several ways to visualize the results in more detail:
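Comparing algorithms by test-set RMSE might look like the sketch below. It uses synthetic data, and `LinearRegression` is included only as an assumed baseline for comparison; the set of models actually tried in the notebook is not listed here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the prepared features and salaries.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = 60000 + 30000 * X[:, 0] + 10000 * np.sin(6 * X[:, 1]) + rng.normal(0, 2000, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Candidate models; the linear baseline is an assumption for this example.
models = {
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "LinearRegression": LinearRegression(),
}

# Fit on the training split and score each model by RMSE on the test split.
rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, predictions)))

best = min(rmse, key=rmse.get)
print(rmse, best)
```

Which model wins depends on the data, so the ranking on this synthetic set says nothing about the ranking on the college dataset.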
Accuracy Plots - training


