NBA Contract Value Database

Introduction

Every year when the NBA's Most Valuable Player award is handed out, there is always plenty of debate over who truly is the most valuable player in the league. This discourse has always intrigued me, so I decided to combine my love of computer science and data analysis to find the statistically most valuable player in the league for the 2019-2020 season. I knew from the start that this would not give an exact answer to who the "most valuable" player is, given the superstar-heavy nature and contract structure of the NBA, but I thought it would be interesting to see whether it could uncover underrated role players or other hidden insights. It also seemed like a good project to pair with my newfound interest in databases, so I used a database as the framework for solving this problem.

Data Collection and Curation

I collected all the data for this database from Basketball Reference's salary and stat pages. Before collecting player data, however, I needed to decide on statistical minimums for players to qualify for the study. After some research and discussion with others, I settled on 10 games played and 10 minutes played per game. I believe these minimums remove players who were heavily affected by injury and those whose stats may have been inflated by garbage-time minutes. With the list of qualifying players settled, I moved on to choosing the statistics to collect for each of them: salary, team, rookie scale contract status, VORP, and Win Shares. Salary is self-explanatory, as you cannot measure contract value without it. Team is used for the team-based value rankings that serve as a side project in the database, and it is also just good information to have. The first contentious data point was whether the player was on a rookie scale contract. For those who are unfamiliar, when a player is selected in the first round of the NBA Draft, their contract is set by a predetermined rookie scale based on the pick at which they were selected. I knew that if I did not separate these players, the top of the list would be dominated by young stars like Jayson Tatum, Luka Doncic, and Trae Young, who have not yet been able to negotiate their first contract, and the project was geared more toward finding smart contracts signed by teams than good draft picks.
However, they should still be included in the study in some capacity, so I gave each player a data point of either "Yes" or "No" marking whether they were on a rookie scale contract, which lets me toggle their inclusion in queries against the database. That brings us to the final statistics I collected for each player, VORP and Win Shares. Both are advanced metrics that measure different contributions players make to a team, and each has its pros and cons, so I took a two-pronged approach and included both for better coverage. VORP stands for Value Over Replacement Player. It measures the value a player provides over a replacement-level player and is calculated from Box Plus/Minus over the course of a season. While in my opinion VORP is the best advanced statistic for measuring NBA success, I feel that the Box Plus/Minus it is based on does not fully capture the winning plays and hustle that players like Marcus Smart bring, which is why I also wanted to include Win Shares. Win Shares estimates the number of wins a player contributed to their team that season, and therefore measures how much a player contributes to winning. The biggest drawback of this statistic, in my opinion, is that it overvalues bad or average players on good teams and overly punishes good players on bad teams. Overall, I feel that each statistic covers the weaknesses of the other. When looking at the data and how it would be used, I quickly realized that there would be two problems. The first is the number zero. I will go deeper into my algorithm in a later section, but at its core the formula is (advanced statistic)/(salary). Say, hypothetically, that Player A makes 5 million dollars with a VORP of 0 and Player B makes 30 million dollars with a VORP of 0.
Using the previous formula, the result would be 0 for both, yet the player making 5 million is quite a bit more valuable than the one making 30 million because A is on a cheaper contract. The second problem comes from advanced metric values that are negative. Since the results from two statistics are combined, there may be range issues when two negative numbers are combined and valuation issues when a positive and a negative number are added. This led me to create adjusted Win Shares and adjusted VORP by making the minimum value of each statistic 0.1 and increasing the rest accordingly. For example, Darius Garland had the lowest Win Shares in the league at -1.3. I therefore added 1.4 to -1.3 to get an adjusted Win Shares of 0.1, and applied the same shift to every other player, with the equivalent process for VORP. A player with 5.0 Win Shares, for instance, would have an adjusted Win Shares of 6.4. With all my statistics set in stone, I was now able to create my database and write my querying script.
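The adjustment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code in main.py: the function names `adjust` and `value_per_million` are hypothetical, and the rounding to one decimal simply matches how the statistics are reported.

```python
def adjust(values, floor=0.1):
    """Shift every value by the same amount so the smallest becomes `floor`."""
    shift = floor - min(values)
    # Round to one decimal, matching how Win Shares and VORP are reported.
    return [round(v + shift, 1) for v in values]


def value_per_million(stat, salary):
    """Core formula from the write-up: (advanced statistic) / (salary)."""
    return stat / (salary / 1_000_000)


# Win Shares example from the text: the league low of -1.3 becomes 0.1,
# so a player with 5.0 Win Shares becomes 6.4.
adjusted = adjust([-1.3, 5.0])

# The zero problem: with a raw stat of 0, a 5 million and a 30 million
# contract both score 0, so the cheaper deal gets no credit.
raw_a = value_per_million(0.0, 5_000_000)
raw_b = value_per_million(0.0, 30_000_000)

# With any positive adjusted stat (0.1 is used here purely for
# illustration), the cheaper contract scores higher.
adj_a = value_per_million(0.1, 5_000_000)
adj_b = value_per_million(0.1, 30_000_000)
```

Because the shift is a constant, it preserves the ordering of players within each statistic. It does change the ratios once the statistic is divided by salary, which matters for the results discussed later.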

Database Creation and Algorithms

For this project, I chose to use MongoDB as the database platform. The main reason is its ability to easily take a JSON document into the database as a collection, the tradeoff being that the user has to do more work to query the system. This was a tradeoff I was willing to make, so I took the Excel spreadsheet I had used to collect data, converted it into JSON, and loaded it into the local MongoDB database I had created. With the database finished, I started working on the method for producing my ranked results. I first finalized my algorithm for calculating a player's value score, which starts by calculating a Win Shares score and a VORP score for each player using the formulas Win Shares/Salary and VORP/Salary. Once these scores were calculated for each player, I used the NumPy package's interp function to place all the VORP and Win Shares scores, as separate lists, on a scale from 0 to 100. Once that was done, I summed the two scaled scores for each player, divided the total by two, and had my list of rankings. Before diving into the results, I decided to add some more functionality to the database. I quickly realized that the results vary a lot depending on the salary limits, whether rookie scale contracts are included, whether one advanced metric is used instead of both, and whether the raw metric is used instead of my adjusted version. Therefore, I wrote some code that lets the user change the parameters by which the ranking is done, which makes it much easier for me to test different variations of the rankings. Through this experimentation I also decided to create a team ranker that ranks teams by highest average value per player and by total value on the team. The team ranker has the same user settings as the player ranker.
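The scoring steps above can be sketched as follows. This is a minimal, standard-library illustration rather than the project's actual code: the field names (`salary`, `rookie_scale`, `win_shares`, `vorp`), the function names, and the sample players are all hypothetical, and the 0 to 100 scaling is written out by hand on the assumption that the lowest score maps to 0 and the highest to 100, which is what a linear interpolation between the minimum and maximum would give.

```python
def rescale_0_100(values):
    """Linearly map values so the minimum becomes 0 and the maximum 100."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [100.0 * (v - lo) / (hi - lo) for v in values]


def rank_players(players, include_rookies=True, min_salary=0):
    """Return (name, score) pairs sorted from most to least valuable."""
    # Apply the user-selectable filters before scoring.
    pool = [
        p for p in players
        if p["salary"] >= min_salary
        and (include_rookies or p["rookie_scale"] == "No")
    ]
    # Per-statistic value: statistic divided by salary, scaled to 0-100.
    ws = rescale_0_100([p["win_shares"] / p["salary"] for p in pool])
    vorp = rescale_0_100([p["vorp"] / p["salary"] for p in pool])
    # Final score: the average of the two scaled scores.
    scored = [(p["name"], (w + v) / 2) for p, w, v in zip(pool, ws, vorp)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical players, for illustration only.
players = [
    {"name": "A", "salary": 2_000_000, "rookie_scale": "No",
     "win_shares": 8.0, "vorp": 2.0},
    {"name": "B", "salary": 30_000_000, "rookie_scale": "No",
     "win_shares": 10.0, "vorp": 5.0},
    {"name": "C", "salary": 5_000_000, "rookie_scale": "Yes",
     "win_shares": 5.0, "vorp": 1.5},
]

ranking = rank_players(players)
veterans_only = rank_players(players, include_rookies=False)
five_million_plus = rank_players(players, min_salary=5_000_000)
```

Filtering before scaling means the 0 to 100 range is always relative to the players currently in the pool, so the same player can receive a different score under different settings.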

Analysis and Results

For this section I will be analyzing the results at a broad level and referencing the data in the Excel spreadsheet in the GitHub project, so feel free to open that up if you would like to follow along. The most evident finding is that my adjusted statistics hurt more parts of my results than they helped. For example, in the first two categories, which take all qualified league contracts into account, you will see that when the raw values are used, great players such as Jarrett Allen, Mitchell Robinson, and Duncan Robinson top the list. When my adjusted values are used, however, players like Johnathan Williams and PJ Dozier sail to the top because their two-way contracts are minuscule compared to the rest of the league. Another clear example is seen in columns E and F. When the raw advanced values are used in column F, league MVP Giannis Antetokounmpo is ranked as the 5th most valuable player, but when the adjusted advanced values are used, he drops all the way to 28th. My adjusted advanced values did succeed in one area, however: discerning between the least valuable players in the league. At the bottoms of columns E and F, you can see that the adjusted advanced values in E clearly differentiate between the least valuable players, while column F assigns the bottom 30 players a score of 0.0. So, while my adjusted advanced values are not an across-the-board improvement on the raw values, I believe a clear dichotomy has been established: the raw values are best for ranking the most valuable players, and the adjusted values are best for ranking the least valuable players.
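One way to see why very small contracts rise under the adjusted values is a short worked example. The sketch below reuses the 1.4 Win Shares shift from the data curation section. Everything else is a made-up assumption for illustration: the two players, their salaries, and their Win Shares are hypothetical and are not taken from the spreadsheet.

```python
SHIFT = 1.4  # Win Shares shift from the write-up (-1.3 becomes 0.1)


def per_million(stat, salary):
    """Statistic per million dollars of salary."""
    return stat / (salary / 1_000_000)


# Hypothetical players: a tiny two-way deal versus a productive role player.
two_way = {"name": "two-way player", "salary": 80_000, "win_shares": 0.1}
role = {"name": "role player", "salary": 2_000_000, "win_shares": 8.0}


def order(players, shift=0.0):
    """Names sorted by (win_shares + shift) per million, best first."""
    ranked = sorted(
        players,
        key=lambda p: per_million(p["win_shares"] + shift, p["salary"]),
        reverse=True,
    )
    return [p["name"] for p in ranked]


raw_order = order([two_way, role])                    # role player first
adjusted_order = order([two_way, role], shift=SHIFT)  # two-way player first
```

The constant shift adds the same 1.4 to both players, but dividing by a much smaller salary magnifies it far more for the two-way contract, which is enough to flip the order in this example.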
I should also mention that I was right in assuming the top of these rankings would not show the conventional or truly "best" players in the league, but instead solid role players. The all-contract rankings were topped by players such as Jarrett Allen, Mitchell Robinson, and Duncan Robinson, while the $5,000,000+ rankings were topped by role players like Montrezl Harrell, Daniel Theis, and Ivica Zubac. The team ranker results were in line with what I expected, but I was shocked at how low the NBA champion Lakers ranked in both lists, which I believe is due to their top-heavy roster. While I am disappointed that my adjusted advanced statistics did not turn out to be superior to the raw statistics, I am very pleased with how the rankings turned out.

Conclusion

Overall, I had an incredibly fun time working on this project. It was very cool to see how my passions for the NBA, data science, and database management could all be constructively funnelled into one project. I think the most interesting piece is that with the database set up, I can easily expand this project's scope, whether that be adding in salaries from previous seasons or comparing salary values between seasons.
