Pages

Monday, December 28, 2015

Exploring random forest hyper-parameters using a League of Legends dataset

Over the last few blog posts, I have used random forests to investigate data from the game League of Legends. In this last post, I will explore model optimization. Specifically I will look at how hyper-parameters like forest size, and node size can influence classification accuracy, show that dimensionality reduction doesn't help random forests, and compare random forest performance to Naive Bayes. For details of this post, please look at this Jupyter notebook.

Background

As a refresher, League of Legends is an online multiplayer game where teams try to destroy each others' base. The aim of this project is to predict the eventual winner of a game at early timepoints, using game conditions; for example, I would like to be able to say the Blue Team with a 10,000 gold lead has an X% chance of winning, and determine X by machine learning. I gathered data from Riot Games's API, and created pandas dataframes containing information from 30,000 games at 5 minute intervals. For examples, you can see previous posts. In this post, as mentioned above, I will explore how I optimized my random forest model.

Forest size

The basic task my random forest is trying to do is predict the winner of a game of League of Legends, a classification task. Random forests work by generating a large number of decision trees, and averaging the results across trees. Thus one of the most obvious random forest parameters to optimize is the number of trees in the forest. To look at this, I ran the prediction algorithm many times, with different number of trees for each run.

As you increase the number of trees in the forest, the accuracy improves, until it plateaus around 25 trees. I am using around a dozen features in my dataset, which makes me hypothesize that your forest should have a number of trees equal to twice the number of features. It would be interesting to look at datasets with more features to see how they perform, and see how general this is.

Node size

The software package I am using for building random forests, scikit-learn, by default creates decision trees where each leaf is pure (that is, each group contains only wins or losses, and no mixture of wins and losses). This can lead to overfitting of individual trees, and by extension the random forest. However, you can set a parameter in the random forest to increase the minimum leaf size, or the minimum size of nodes for splitting. To see how this influenced prediction accuracy, I ran the prediction algorithm over a range of minimum node sizes.

The model accuracy goes up as the minimum sample size increases, plateauing around a 200 sample minimum. The model then maintains its accuracy for larger minimums, before decreasing once the minimum size is 5,000, or approximately 15% of the data. Once the minimum is that large, the trees are restricted in their depth, and cannot make enough splits to separate the data.

Rather than making graphs like this for every parameter, I ran a grid search over tree depth, leaf size, and node size. The best parameters for my dataset with 30,000 games, and a dozen features was a maximum depth of 10 levels, minimum leaf size of 10, and minimum split size of 1000. Optimizing these parameters yielded a 2-5% increase in accuracy over the default parameters (over a range of timepoints).

Dimensionality reduction

There is a lot of collinearity in my dataset, which can be a problem for certain machine learning algorithms; theoretically, random forests are resilient to collinearity. To see this for myself, I performed dimensionality reduction, specifically PCA, on my features, and then ran the random forest algorithm again. Here I am presenting the data slightly differently, in terms of prediction accuracy at specific times within the game.
The PCA model is actually a little worse! In the past I have used PCA on data before performing regression, and which improved regression performance by 5%. It's good to know that random forests work as advertised, and don't require dimentionality reduction beforehand.

Naïve Bayes

Part of the reason I did this project was to gain experience with random forests, and compare their performance with other algorithms. Naïve Bayes is a decent, fast default algorithm which is often used for benchmarking. I hadn't used it previously, as there is a little bit of manipulation necessary to combine categorical and continuous features in Naïve Bayes (like 5 lines of code!). As a final experiment, I wanted to see how Naïve Bayes compared to Random Forests.
Naïve Bayes performs almost as well! And runs a whole lot faster! What is probably happening here is that the dataset is just not deep enough for the strength of random forests to shine.

Friday, November 20, 2015

Influence of region, skill and patch on predictability in League of Legends

Last month, I used random forests to predict the winner of League of Legends games. Since then I have downloaded datasets from different regions, different ELOs, and on different patches, and checked how these factors influence predictability. This post will summarize the results, but for details of the analyses, please see this notebook.

Differences between regions

Korea is famed for its great players, and an "open mid" philosophy (in a normal game of League of Legends, you cannot surrender until 20 minutes into the game; in Korea, players accelerate this process by letting the other team win the game quickly through an "open mid"). Are we able to see those attributes in our model? As a first way to look at this, we can plot the length of games in different regions.
The EUW and NA have similar distributions of game lengths, with around 8% of games being surrendered at 20 minutes. The Korean games, on the other hand, have a few percent more games ending before 20 minutes, due to open mids, and a 14% of games ending right after 20 minutes. Since so many games are ending earlier, can we see this in the model's predictions?

Indeed we can, with games being a few percentage points more predictable for games that last 20 minutes or less. However, once the games last 25 minutes, all the regions are similarly predictable. This means that the Koreans are surrendering winnable games.


Skill and predictability

We can run the same type of analysis for skill. In the notebook I compare Bronze V games to challenger solo queue, and show that low skill games are slower, and less predictable than high skill games. Here, I will compare high skill solo queue games to master tier team ranked. The team ranked dataset is quite a bit smaller than the solo queue data (1,000 games vs 30,000), but in my tuning, I've found there is not a big difference in accuracy when the dataset is over 1,000 games.

We can start again by plotting game length.

Team ranked games are about two minutes faster than solo queue games. We can then run the prediction algorithm again on the team ranked games.

Team ranked games are much more predictable than solo queue games! (This is in fact the largest difference in predictability I have seen between different datasets). What is likely happening here is that coordinated teams are more able to capitalize on their advantages, and press them around the map. With fewer mistakes to allow comebacks, the games are faster, and more predictable. It would be interesting to further look at professional games to see how they compare.

Patches and predictability

In recent patches, Riot has made significant changes to the game, which makes it feel like the games are faster, and comebacks more difficult. I scraped challenger and master ranked games from the most recent patch, and compared them to ranked games from Season 5. We can again start with game length.

If you thought games were faster, your intuition was right, as games are around three minutes faster in preseason 2016. What about predictability?

As you'd expect given the previous relationships between speed and predictability, preseason 2016 is a few percentage easier to predict. In a previous blog post, I investigated the importance of different features on predictability, and found that dragons were not important in helping teams win. Putting on my amateur designer hat, this means that dragon is even less important, since games are less likely to last long enough for the accumulated advantages to matter.

In my next post, I'm going to go more in depth on the modeling, investigating random forest hyper-parameters, and maybe trying a few different models.

Thursday, October 22, 2015

Visualizing champion difficulty in Jupyter

I've been playing around some more with League of Legends analysis and Jupyter. This time I scraped champion information from League of Graphs, and calculated which champions were the easiest / hardest to play, and which champions are best in the early / late game. Unfortunately, Blogger does not interact with Jupyter, so I have posted the notebook on nbviewer, if you are interested in taking a look!

Friday, October 16, 2015

Playing in random forests in League of Legends

I've wanted to learn more about machine learning, specifically python's scikit-learn module. I'm an avid League of Legends player (summoner names lemmingo and Umiy), and Riot Games provides a thorough API for querying game data, so I decided to explore machine learning using League of Legends (LoL). Specifically, I wanted to see if I could predict the eventual winner of a game long before the game finishes. In this post, I'm going to explore some features of the LoL dataset, describe what features are important in predicting the winner, show how the predictions change over time, and investigate how winnable surrendered games are. If you are interested in the details of how I did my analysis, go to the github page for this project, or this Jupyter notebook.

For those who don't know what League of Legends is, it's an online competitive game where teams of five players try to destroy each other's base. Some key things players do during the game are: 1. get gold; 2. destroy towers; 3. kill other players; and 4. kill boss monsters called dragons. Games usually last 25-40 minutes, and each game has a clear winner (no ties). Teams can surrender after twenty minutes if the the game is a rout.

Loading the data

The details of how I loaded the data are in the Jupyter notebook, but I will give a summary here. To get a list of players, I first queried the API for the list of featured games; from those feature games I extracted 50 player names.  Then I created a list of every ranked game those players were in during the year 2015, approximately 30,000 games in total. These should all be high skill-level games. I then queried the API for the first 6,000 of these games. This took a few hours because the API is limited to ~0.8 requests per second. A JSON with all of the data for 6,000 games takes up ~1GB (this is due to the JSON being text, and containing a lot of redundant information). Finally, I extracted features from the JSON. Now that I have my scripts more established, I can query games, extract their features, and save them directly to a (smaller) csv.

Exploring the features

As with any good data science project, the first thing I did was explore the features. I started by plotting a histogram of the game lengths:
As Riot says, the average game lasts between 25-35 minutes. Few games last less than 20 minutes, with a large spike of games ending right after 20 minutes, due to surrender. There are others spikes just after 30 and 40 minutes, which I don't have a clear explanation for. Perhaps large team fights break out around then.

The second thing I plotted was the average gold difference at 20 minutes.
This is about as normal a distribution as one could hope for, with a standard deviation of 6,300. Rather than plot each feature individually, we can plot how pairs of features are related to each other:

The features shown here are the gold difference between the teams, the kill difference, difference in number of towers destroyed, and the difference in dragons killed. Apologies for the ugly axes.
As you might expect, there is a large degree of collinearity between gold difference, kill difference, tower difference, and dragon difference. This would be a problem for linear regression, but I am going to use random forests, which are less sensitive to collinearity. If I were interested purely in prediction accuracy and generality, I would perform PCA on these components, but for now I'll just use them raw.

Feature selection

I was curious about which features are most predictive of the eventual game winner. One of the cool things about random forests is that they can identify which features are important based on how many trees they are found in. I created a random forest classifier and used it to predict the winner of the game at five minute intervals. Then for each timepoint, I extracted the feature importance, and plotted a heatmap:
Some features: Gold difference, kill difference, tower difference, dragon difference, and firsts of bunch of featues. Carry share is the max kills on a team divided by total team kills. 
As one might expect, gold difference is the most predictive feature, at every time point. Next most important are the kill and tower difference. Surprisingly, dragon difference is relatively unimportant. This is probably because killing dragons yields only a small advantage. If I were to put on my amateur designer hat, I'd recommend Riot decrease the number of dragons required to get the double dragon buff. Another note is that first blood, first dragon, and first tower are mostly uninformative.

To get a better look at how feature importances changes with time, let's take slices of feature importance at 20 and 35 minutes.
There are some significant changes in importance between these time points. Notably, barons and inhibitors gain importance at 35 minutes. To compensate for this, there is a loss of importance in the rest of the features. Once again, dragons are relatively unimportant, even at the late timepoints. Inhibitors may be ever so slightly more predictive than barons.

Accuracy over time

Given the above feature importance data, I reduced the number of features in the classifier to avoid overfitting the data, and only used the features for the differences, and the baron and inhibitor features. With these features, I then created a small random forest at each time point, and performed cross-validated predictions for the eventual winner of the game.
At the beginning of the game, the classifier isn't successful because there isn't enough information to work with. As the game goes on, the classifier improves, peaking at ~80% accuracy between 20-35 minutes. However, the accuracy drops for longest games. One possibility for this is that there are fewer games that last 40 minutes (less than 1000), which means there is less data to train the classifier. Another possibility is that games that last a long time are fairly close, and close games are harder to predict.

How winnable are surrendered games?

As one last fun thing to do, I wanted to investigate the games that were surrendered early. I separated the games into "early surrenders," and "close" games. To explore the surrendered games, I plotted the gold difference at twenty minutes. The gold difference is bimodal, with one team normally having a large gold lead:
I then created used a random forest to predict the winner of these games, and it did so with 99% accuracy. To see whether these surrendered games were really unwinnable, I trained a random forest on the close games, and then used that forest to predict the surrendered games. Here, it was only 94% accurate. My interpretation of this is that around 6% of surrendered games are winnable!

I have lots of ideas for how to expand on these initial results. I would like to train the model on low ELO games, and see how predictable those are in comparison to high ELO games. Riot has an API for professional games as well, and it would be fun to see if pro games are more (or less) predictable than amateur games. I could scrape games from different regions and compare them. One long term goal is to create a live predictor for LCS games as they happen. If you have any ideas, let me know in the comments!

Monday, September 21, 2015

Walk Along the Paper Trail: Garfield Gap

It's been three years since I did a Walk Along paper summary! Wow!

Recently in our journal club, we discussed a paper by Garfield et al from the Lowell lab. In discussion, some unusually interesting points were raised, and I'd like to think about them here.

Background

I've written about this before, but here is a quick refresher on hunger in the brain. Many types of neurons control metabolism, but the most famous are AgRP neurons in the arcuate nucleus of the hypothalamus. If you stimulate AgRP neurons with channelrhodopsin, or chemogenetically, you can make mice "voraciously feed," aka stuff food in their face. Notably, AgRP neurons are GABAergic, which means they are shutting down neurons in other areas.

While I've written about AgRP neurons before, I've generally ignored what AgRP itself is: a neuropeptide. AgRP does not bind to AgRP receptors, but is actually an endogenous antagonist for the melanocortin-4 receptor (MC4R), whose principle ligand is α-MSH. AgRP neurons cohabitate in the arcuate nucleus with POMC neurons that produce α-MSH; stimulation of POMC neurons can reduce feeding (albeit on a longer timescale than AgRP neurons).

AgRP neurons project to a lot of different brain areas. To see whether all of these projections can induce feeding, the Sternson lab stimulated axon terminals in a bunch of different brain regions, and found that stimulation of only some of these terminals (namely the PVH, aBNST, and LH) could induce feeding. One question Garfield et al wanted to answer was, "what is the molecular identity of these downstream targets?"

Is occlusion anything?

Since AgRP does not have any agonistic receptor, Garfield and colleagues investigated neurons expressing MC4R in various brain regions. They started by performing channelrhodopsin assisted circuit mapping (CRACM!) to see if AgRP neurons connect to MC4R neuons. To do this, they infected AgRPCre :: MC4RCre mice with channelrhodopsin in the AgRP neurons, and GFP in the MC4R neurons in the PVH. They sliced brains, patched onto PVHMC4R neurons, and photostimulated the AgRP axon terminals (see diagram below). They found that 25/30 MC4R neurons received IPSCs following photostimulation, showing they were connected to AgRP neurons (panel A). They also patched onto non-MC4R neurons in the PVH, and found that only 2/10 neurons received AgRP input, showing that the AgRP-> PVH connection was fairly specific for MC4R neurons (panel B).


Patch recordings of MC4R neurons downstream from AgRP. AgRP neurons express ChR2-mCherry (red), and MC4R neurons express GFP (green). A. When you photostimulate AgRP neuron terminals, most MC4R neurons receive GABAergic inputs. B. When you photostimulate AgRP input to Non-MC4R neurons, most neurons do not receive input.
A previous paper by Atasoy and Sternson had claimed that AgRP neurons project to oxytocin or SIM1 neurons in the PVH, so Garfield investigated a few other neuron types in the PVH as well, but found no connections to any of them.

After showing the connection in-vitro, they wanted to show it functionally in-vivo using behaviour. They performed an occlusion study where they infected both AgRP and PVHMC4R neurons with ChR2, then put an optic fibre over the PVH to stimulate the AgRP fibre terminals simultaneously alongside PVHMC4R cell bodies (panel g, below). When they did this, they found the food intake was lower than AgRP neuron stimulation alone (panel h).

g. Diagram of experiment. AgRP neurons express ChR2. In different experiments, MC4R or OXT neurons also express ChR2. The optic fibre is placed over the PVH to stimulate cell bodies and AgRP terminals. h. Stimulation of AgRP terminals increases feeding. This is reduced ("occluded") by simultaneous stimulation of MC4R neurons. It is NOT reduced by simultaneous stimulation of OXT neurons.

I originally liked the idea of behavioural occlusion, perhaps because it reminded me of LTP occlusion. However, I'm not sure that it's informative in circuit mapping. First, my intuition is that direct excitation beats indirect inhibition. So if you stimulate AgRP terminals and MC4R neurons at same time, and direct excitation wins, it doesn't really tell you anything. Second, if you know two brain regions oppositely modulate behaviour, stimulating both of them does not tell you whether they directly interact. For example, stimulation of PKC-delta neurons in the CeA reduces feeding, and PKC-delta neurons are not connected to AgRP neurons. If you simultaneously stimulate AgRP and PKC-delta neurons, and one "occludes" the other, it won't mean they are directly connected. It only means one is stronger than the other!

In fact, I think the term "occlussion" is misleading, and not used in the same way as it was in LTP. In LTP, two protocols "occlude" each other if they both induce LTP alone, but stimulating both of them does not yield additional LTP. They are said to occlude each other because they use the same signaling pathway. In behaviour, "occlusion" has referred to stimulation of opposing pathways where one wins out. This experimental paradigm seems to be catching on, but I'm not sure it actually means anything.

Do all neuropeptides synapse on receptors?

Garfield next started looking for other AgRP-MC4R connections in other brain regions. As before, they infected AgRP neurons with ChR2-mCherry, and patched onto GFP-infected MC4R neurons in the anterior BNST, and the lateral hypothalamus (LH). Unlike before, there were no connections between AgRP neurons and MC4R neurons in these other brain regions.

Whole-cell patch onto neurons MC4R downstream from AgRP neurons expressing ChR2. No neurons, either in the aBNST or LH, were connected to AgRP neurons. Note the n's are not connected.
This raises question to me, how often do peptide neurons synapse onto peptide receptor neurons? (AgRP is certainly a strange case insofar as it doesn't have a natural receptor). There are hundreds of peptide-receptor pairs, and from the brief literature I've looked at, people don't always verify these cells are connected. For example, in the recent Dietrich paper about NPY5R (NPY is another neurotransmitter for AgRP neurons), they stimulated AgRP neurons and applied NPY5R antagonists, but never actually patched onto neurons to see if they were connected.

All neuropeptide neurons express multiple neurotransmitters, including classic ones and neuropeptides, and it's possible each transmitter work at different temporal and spatial scales. Fast neurotransmitters like glutamate or GABA are reuptaken quickly, and so cannot diffuse; in contrast, neuropeptides can last for minutes or hours. This would allow, for example, AgRP neurons to form GABAergic synapses on specific targets, and let their neuropeptides diffuse and act as paracrine [?] signaling.

In any case, I hope that as people explore these systems, they verify that the neurons we assume are connected, in fact are.

Collateral

Once they identified PVHMC4R neurons as important for feeding, they wanted to know where they projected, and specifically if PVHMC4R projected to multiple targets or single ones. They identified PBN as a major target of PVHMC4R neurons, and used rabies virus to retrogradely label PVHMC4R that project to the PBN. When they investigated other brain regions that PVHMC4R projects to, they did not see axons, showing these neurons do not send collaterals.


(Top) Diagram of experiment. PVHMC4R neurons are labeled in red. PVHMC4R that project to the PBN are labeled in green. (Images) There are red and green cells in the PVH, and red and green terminals in the PBN. However, there are only red terminals in the vlPAG and NTS/DMV.
Previous research has shown that AgRP neurons do not have collaterals, and made me wonder whether this is a common feature of mid- or hind-brain neurons. However, a recent paper from Luqin Luo's lab mapped axons of locus coeruleus norepinephrine neurons, and found that those neurons do send collaterals widely, albeit biased towards specific areas. As we get more information about cell types, and perform more extensive tracing studies, we will get a better sense of what parts of the brain have discrete pathways versus overlapping projections.