Monday, December 28, 2015

Exploring random forest hyper-parameters using a League of Legends dataset

Over the last few blog posts, I have used random forests to investigate data from the game League of Legends. In this last post, I will explore model optimization. Specifically I will look at how hyper-parameters like forest size, and node size can influence classification accuracy, show that dimensionality reduction doesn't help random forests, and compare random forest performance to Naive Bayes. For details of this post, please look at this Jupyter notebook.

Background

As a refresher, League of Legends is an online multiplayer game where teams try to destroy each others' base. The aim of this project is to predict the eventual winner of a game at early timepoints, using game conditions; for example, I would like to be able to say the Blue Team with a 10,000 gold lead has an X% chance of winning, and determine X by machine learning. I gathered data from Riot Games's API, and created pandas dataframes containing information from 30,000 games at 5 minute intervals. For examples, you can see previous posts. In this post, as mentioned above, I will explore how I optimized my random forest model.

Forest size

The basic task my random forest is trying to do is predict the winner of a game of League of Legends, a classification task. Random forests work by generating a large number of decision trees, and averaging the results across trees. Thus one of the most obvious random forest parameters to optimize is the number of trees in the forest. To look at this, I ran the prediction algorithm many times, with different number of trees for each run.

As you increase the number of trees in the forest, the accuracy improves, until it plateaus around 25 trees. I am using around a dozen features in my dataset, which makes me hypothesize that your forest should have a number of trees equal to twice the number of features. It would be interesting to look at datasets with more features to see how they perform, and see how general this is.

Node size

The software package I am using for building random forests, scikit-learn, by default creates decision trees where each leaf is pure (that is, each group contains only wins or losses, and no mixture of wins and losses). This can lead to overfitting of individual trees, and by extension the random forest. However, you can set a parameter in the random forest to increase the minimum leaf size, or the minimum size of nodes for splitting. To see how this influenced prediction accuracy, I ran the prediction algorithm over a range of minimum node sizes.

The model accuracy goes up as the minimum sample size increases, plateauing around a 200 sample minimum. The model then maintains its accuracy for larger minimums, before decreasing once the minimum size is 5,000, or approximately 15% of the data. Once the minimum is that large, the trees are restricted in their depth, and cannot make enough splits to separate the data.

Rather than making graphs like this for every parameter, I ran a grid search over tree depth, leaf size, and node size. The best parameters for my dataset with 30,000 games, and a dozen features was a maximum depth of 10 levels, minimum leaf size of 10, and minimum split size of 1000. Optimizing these parameters yielded a 2-5% increase in accuracy over the default parameters (over a range of timepoints).

Dimensionality reduction

There is a lot of collinearity in my dataset, which can be a problem for certain machine learning algorithms; theoretically, random forests are resilient to collinearity. To see this for myself, I performed dimensionality reduction, specifically PCA, on my features, and then ran the random forest algorithm again. Here I am presenting the data slightly differently, in terms of prediction accuracy at specific times within the game.
The PCA model is actually a little worse! In the past I have used PCA on data before performing regression, and which improved regression performance by 5%. It's good to know that random forests work as advertised, and don't require dimentionality reduction beforehand.

Naïve Bayes

Part of the reason I did this project was to gain experience with random forests, and compare their performance with other algorithms. Naïve Bayes is a decent, fast default algorithm which is often used for benchmarking. I hadn't used it previously, as there is a little bit of manipulation necessary to combine categorical and continuous features in Naïve Bayes (like 5 lines of code!). As a final experiment, I wanted to see how Naïve Bayes compared to Random Forests.
Naïve Bayes performs almost as well! And runs a whole lot faster! What is probably happening here is that the dataset is just not deep enough for the strength of random forests to shine.