The aim of this investigation is to look at the relationship between the salaries people earn; across various industries and at different positions or rankings. The dataset provided was created from the survey responses of 514 American citizens about their annual income as well some other general information surrounding their person and occupation.
The first notebook was used to conduct a simple statistical analysis on the data; looking at the salaries in relation to gender, position and years worked. It also explored the distribution of the data as well as the correlation matrix between the salary and other features in the dataset. This notebook was my introduction to hypothesis testing and use of the p value to negate or accept a null hypothesis. The second notebook found in the repository is used to model the dataset using Linear Regression from the Scikit-learn package. The number of years worked is used to predict the average salary, due to the seemingly linear relationship observed. I initially used the statsmodel.api but once I learned of scikit-learn, opted for it instead due to its ease of use and clearly-defined methods.
The concept of observing a linear relationship and then finding a line that best describes this relationship is nothing new, considering that y = mx + c was introduced as early as Grade 6 in school. However, coding the algortithm from scratch really gave me an appreciation for this simple equation and helped me better understand that the 'line of best fit' is not some random dash across scatter points that look like they are headed in a particular direction.
Some key learnings from this investigation was the importance of scaling the data. I observed that the data distribution was initially bimodal alluding to the possibility of two very distinct groups of respondents; those whose salaries were dependent on the number of years worked and those whose salaries are dependent on the position they occupy in their resoective fields.
The model returned a poor score overall meaning that the selected model may not have been the best one for the dataset. This was pleasantly surprising considering that the visualization said otherwise. Various attempts were made to improve the model during evaluation, but no score improvement was observed. Linear regression was definitely not the best model for the dataset, despite the what the plot showed.