
League of Legends Snowball Effect

Our data science and analysis project for DSC 80 at UCSD

Introduction

In League of Legends, the primary objective is to destroy the enemy Nexus, which is the game's only win condition. Before a team can reach their opponent's Nexus, however, they must first destroy their opponent's turrets, or towers. Each team has eleven towers, spread across the three lanes. Not every tower needs to be destroyed to reach the Nexus, only those along a single lane (plus the turrets guarding the Nexus itself). Tower kills are also an indicator of how successful a team is in a given match. We wanted to answer the question: how does being the first team to take three turrets affect the game?

The dataset we used was provided by Oracle's Elixir, and we focused on the 2023 season statistics, as it was the most recent complete season. The dataset has 125,904 rows and 131 columns. Each row represents one player's performance in a given match, with five players on each team. After each team's five player rows, there is an additional row for the team's overall performance, which can be identified by a NaN value in the player name column. Within the dataset, the columns we are interested in are gameid, result, minionkills, monsterkills, totalgold, firsttower, firstmidtower, heralds, league, kills, dpm, and firsttothreetowers.

Data Cleaning and Exploratory Data Analysis

For our dataset, we didn't want to spend too much time imputing large amounts of data, so we kept only games whose datacompleteness is 'complete'. This still leaves us with plenty of observations to analyze and lets us keep the project moving. We also wanted to look at the overall performance of each team rather than do a player-by-player analysis, so we filtered further to keep only the summary row for each team in a game.
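A minimal sketch of this filtering step, assuming the Oracle's Elixir CSV with its datacompleteness and playername columns (the filename here is a placeholder):

```python
import pandas as pd

# Load the 2023 Oracle's Elixir match data (filename is illustrative).
games = pd.read_csv("2023_LoL_esports_match_data.csv")

# Keep only games that are fully recorded.
complete = games[games["datacompleteness"] == "complete"]

# Keep only the team summary rows, identified by a missing player name.
teams = complete[complete["playername"].isna()]
```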
For our univariate analysis, we decided to plot the distribution of minionkills as a histogram, and we saw that it was roughly normal:
[Figure: histogram of the distribution of minionkills]
For our bivariate analysis, we wanted to observe the relationship between being the first team to take three towers and winning the game, which we visualized with a pie chart:
[Figure: pie chart of game results by firsttothreetowers]
One interesting aggregation we found groups the dataframe by firsttothreetowers and takes the mean of each remaining column. Two statistics stand out: teams that were the first to take three towers won 79% of their games and took the first mid-lane tower 82% of the time. The table is shown here:

| firsttothreetowers | result | firstmidtower | totalgold | heralds | hextechs | team kpm | minionkills | firsttower | inhibitors | pentakills | playoffs | kills | monsterkills | dpm |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 0.209 | 0.177 | 53245.98 | 0.642 | 0.244 | 0.335 | 805.23 | 0.225 | 0.373 | 0.006 | 0.232 | 10.79 | 166.91 | 2076.79 |
| 1.0 | 0.791 | 0.823 | 60448.87 | 1.343 | 0.450 | 0.588 | 823.28 | 0.775 | 1.515 | 0.018 | 0.232 | 17.79 | 191.39 | 2466.15 |
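A sketch of this aggregation, assuming the filtered team-level dataframe teams from above (only a subset of the columns is shown):

```python
# Average each statistic within each group; result and firstmidtower are
# 0/1 indicators, so their means are win rate and first-mid-tower rate.
summary = (
    teams
    .groupby("firsttothreetowers")
    [["result", "firstmidtower", "totalgold", "heralds", "kills", "dpm"]]
    .mean()
)
print(summary)
```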

Assessment of Missingness

The heralds column in the dataset is NMAR. Looking into the data, its missingness has no dependency on any other column; it depends only on the column's own values: if a team took no heralds, the value in heralds is NaN. To learn more about this missingness, a column such as Herald_Taken could be added, which is True if a herald was taken and False if not; this new column could then be used to test the missingness for dependencies.
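A one-line sketch of that hypothetical Herald_Taken column, assuming the teams dataframe from the cleaning step:

```python
# Hypothetical indicator column: True when at least one herald was taken,
# i.e. whenever the heralds value is not missing.
teams = teams.assign(Herald_Taken=teams["heralds"].notna())
```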
We decided to try and see if the missingness of the split column depended on the league column in our dataset, since each league has multiple splits, and certain leagues may not have any splits. We performed a permutation test with the total variation distance as the test statistic on the following:
Null hypothesis: The distribution of league when split is missing is the same as the distribution of league when split is not missing.
Alternative hypothesis: The distribution of league when split is missing is not the same as the distribution of league when split is not missing.
This permutation test yielded a p-value of 0.0, which means we reject the null hypothesis at the standard 5% significance level and conclude that the missingness of split depends on league. The visualization is shown here:
[Figure: distribution of the TVD test statistic with the observed value marked]
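A minimal sketch of this kind of permutation test, assuming the teams dataframe from above and 1,000 shuffles:

```python
import numpy as np

def tvd(dist1, dist2):
    # Total variation distance between two categorical distributions.
    return np.abs(dist1 - dist2).sum() / 2

def league_tvd(frame):
    # Distribution of league, conditional on whether split is missing.
    dists = (
        frame
        .pivot_table(index="league", columns="split_missing",
                     aggfunc="size", fill_value=0)
        .apply(lambda col: col / col.sum())
    )
    return tvd(dists[True], dists[False])

df = teams.assign(split_missing=teams["split"].isna())
observed = league_tvd(df)

stats = []
for _ in range(1000):
    shuffled = df.assign(split_missing=np.random.permutation(df["split_missing"]))
    stats.append(league_tvd(shuffled))

p_value = np.mean(np.array(stats) >= observed)
```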
We tried to see if the missingness of split depended on the teamid in our dataset by performing a permutation test with the total variation distance as the test statistic on the following:
Null hypothesis: The distribution of teamid when split is missing is the same as the distribution of teamid when split is not missing.
Alternative hypothesis: The distribution of teamid when split is missing is not the same as the distribution of teamid when split is not missing.
This permutation test also yielded a p-value of 0.0, which means we reject the null hypothesis at the standard 5% significance level and conclude that the missingness of split depends on teamid as well; this is consistent with the league result, since each team belongs to a single league. The visualization is shown here:
[Figure: distribution of the TVD test statistic with the observed value marked]

Hypothesis Testing

For our permutation test, we set up an experiment with the following:
Null hypothesis: The result for teams with firsttothreetowers and teams without firsttothreetowers are drawn from the same distribution.
Alternative hypothesis: The result for teams with firsttothreetowers and teams without firsttothreetowers are not drawn from the same distribution.
To run this experiment, we shuffled the result column and used the difference of means as the test statistic. The p-value of our observed statistic was 0.0, which means we can reject the null hypothesis at the standard 5% significance level. The visualization of the test statistics and the observed value is shown below:
[Figure: distribution of the difference-in-means test statistic with the observed value marked]
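A sketch of this test, again assuming the teams dataframe (result is 1 for a win and 0 for a loss):

```python
import numpy as np

def diff_in_means(frame):
    # Mean result (win rate) of first-to-three-towers teams minus the rest.
    means = frame.groupby("firsttothreetowers")["result"].mean()
    return means[1.0] - means[0.0]

observed = diff_in_means(teams)

stats = []
for _ in range(1000):
    shuffled = teams.assign(result=np.random.permutation(teams["result"]))
    stats.append(diff_in_means(shuffled))

p_value = np.mean(np.array(stats) >= observed)
```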

Framing a Prediction Problem

Our prediction model aims to answer the question: can we predict whether a team will be the first to take three towers? In other words, given statistics that are indicators of team strength, can we predict whether or not a team will beat their opponent to destroying three towers?
This problem is a binary classification problem, as we are trying to predict the binary column firsttothreetowers. For our evaluation metric, we will use accuracy: since each game has exactly one team that takes three towers first, the classes are balanced, and we can easily do further analysis on the misclassified games. We want to use kills, minionkills, monsterkills, and dpm as our features, since they are indicators of team strength and we hypothesize that stronger teams will be the first to take three towers in a game.

Baseline Model

Our baseline model used kills, minionkills, monsterkills, and dpm as the columns of X_train and X_test, with y_train and y_test simply being the firsttothreetowers column. All of the features we used are quantitative, so no encoding was necessary for our baseline model. We used a decision tree classifier because we wanted a simple model to test our data on. The baseline model achieved a test accuracy of 77%, which we believe is quite good, but it can be improved on: the training accuracy is 100%, which suggests that our decision tree has overfit to the training data.
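A sketch of the baseline, assuming the teams dataframe; the train/test split parameters are illustrative:

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

features = ["kills", "minionkills", "monsterkills", "dpm"]
X = teams[features]
y = teams["firsttothreetowers"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

baseline = DecisionTreeClassifier()
baseline.fit(X_train, y_train)

print("train accuracy:", accuracy_score(y_train, baseline.predict(X_train)))
print("test accuracy:", accuracy_score(y_test, baseline.predict(X_test)))
```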

Final Model

For our final model, we added more features: firsttower, firstmidtower, and totalgold. firsttower and firstmidtower are binary nominal features that are already encoded as 0/1, while totalgold is quantitative and does not require encoding either. We believe features like firsttower and firstmidtower are good indicators of a team's map control in the game, which can be a good predictor of firsttothreetowers. Additionally, totalgold provides a glimpse into a team's economic strength in a given game, which further supports the idea that stronger teams will take three towers first.
We used a random forest instead of the single decision tree from our baseline model, as a random forest is by design less prone to overfitting than a single decision tree. We also incorporated preprocessing with a standard scaler so that no column is weighted more than the others, and we performed hyperparameter tuning and cross-validation using GridSearchCV to find the best parameters. We primarily experimented with the maximum depth of the trees (max_depth) and the number of trees (n_estimators) in the random forest, and found the best parameters to be a max_depth of 10 and an n_estimators of 19. The accuracy of our best estimator was 84%, which is an improvement over the baseline model's performance.
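A sketch of the final pipeline; the hyperparameter grid here is illustrative (the best fit reported above used max_depth=10 and n_estimators=19):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

features = ["kills", "minionkills", "monsterkills", "dpm",
            "firsttower", "firstmidtower", "totalgold"]
X_train, X_test, y_train, y_test = train_test_split(
    teams[features], teams["firsttothreetowers"], random_state=1
)

pipeline = Pipeline([
    ("scale", StandardScaler()),            # put every column on one scale
    ("forest", RandomForestClassifier()),
])

grid = GridSearchCV(
    pipeline,
    param_grid={
        "forest__max_depth": [5, 10, 15, 20],
        "forest__n_estimators": range(5, 30),
    },
    cv=5,
)
grid.fit(X_train, y_train)

print(grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))
```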
The confusion matrix of our best estimator is shown below:
[Figure: confusion matrix of the best estimator]

Fairness Analysis

For our fairness analysis, we wanted to see if our trained model performed better for teams with more kills. We set the threshold at 25 kills, and performed a permutation test on the following:
Null hypothesis: Our model is fair. Its accuracy for teams with fewer than 25 kills and for teams with 25 kills or more is roughly the same, and any differences are due to random chance.
Alternative hypothesis: Our model is unfair. Its accuracy for teams with fewer than 25 kills is lower than its accuracy for teams with 25 kills or more.
Our p-value for this permutation test was 0.31, which means we fail to reject the null hypothesis at the standard 5% significance level; we cannot conclude that our model is unfair with respect to the number of kills a team has. The visualization of our test results is shown below:
[Figure: distribution of the fairness test statistic with the observed value marked]
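A sketch of this fairness test, assuming the fitted grid and the held-out test set from the final-model sketch above:

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    "high_kills": X_test["kills"] >= 25,
    "correct": grid.best_estimator_.predict(X_test) == y_test,
})

def accuracy_gap(frame):
    # Accuracy for low-kill teams minus accuracy for high-kill teams.
    acc = frame.groupby("high_kills")["correct"].mean()
    return acc[False] - acc[True]

observed = accuracy_gap(test)

stats = []
for _ in range(1000):
    shuffled = test.assign(high_kills=np.random.permutation(test["high_kills"]))
    stats.append(accuracy_gap(shuffled))

# One-sided test: is low-kill accuracy lower than high-kill accuracy?
p_value = np.mean(np.array(stats) <= observed)
```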