**Calculating spike distance**

Previously, I argued that the odor codes evolves over multiple breaths. After presenting some example cells where the odor code shifted between the first and second breath, I turned to the population level, and showed this figure:

A. Schematic of population vector. B. Distance between breaths. Breath identity shown underneath. |

It's possible to calculate the "population spike distance" by just calculating the Euclidian distance between the population vector for each breath. Yet that method is fairly noisy, as most cells are uninformative. When I tried that simple method, the result was similar to that shown above, but not as clean. To make the differences more obvious, I performed PCA on the population vector, and then calculated the distances between the first five principal components (shown above). Here, each of the five components was informative, and the distances between the breaths were much clearer.

After looking at different breaths for the same odor, I next wanted to investigate the differences between odors. Rather than look at "spike distances," though, I used a prediction algorithm.

To do the prediction, I built population vectors as I did above, but instead of observing different breaths for the same odor, I observed different odors for the same breath. Once again, I transformed the data via PCA and took the first 5-10 principal components. I then created a sample population vector for an individual trial, and calculated the distance between the sample vector and the average vectors for each odor. The "predicted" odor is the one with the smallest distance from the trial. This was repeated for all trials to get the average prediction rate. When I did this, I got the prediction rates show in the top panel:

Here, the predictions are between three different odors, so the chance level is 33%. The first five breaths are pre-odor breaths, while breaths 6-10 are during the odor. As you can see, the odor breaths are >95% correct, which is great. However, many of the pre-odor breaths have prediction rates >50%, which is obviously bad (I have different control breaths for each odor; the pre-odor prediction chooses among the three control breaths).

In playing with the data, I then noticed something odd: as I increased the numbers of bins or principal components that I used to make predictions, the pre-odor predictions got higher (panel B above). With 20 bins and 20 PCs, I could get pre-odor predition rates of >70% for each odor! I wasn't making predictions, but was over-fitting my model to the data so that it could never be wrong!

And this is where I realized my mistake. When you do PCA, the algorithm tells you how much each component describes the variance in the data. The first few principal components (PC) account for most of the variance, while the later components account for less. When I was doing my prediction algorithm (and my distance calculations above), I was weighing each PC equally, and over-representing principal components which didn't have much meaning.

Once I realized this, it was a simple procedure to weigh each PC according to its variance, and re-run the prediction (panel C above). Following that, the pre-odor predictions are at chance; the positive control is finally working. The downside to this correction, however, is that the odor prediction during the odor was now between 60-80%. This lower predictive ability makes more sense, though, given the trial-to-trial noise in the signal, and the relatively low number of neurons in the population vector.

Having realized my error, I returned to my original analysis on the breath distance, and added the proper weightings. When I did this, the results were slightly different:

The control breaths are still quite distant from the odor breaths. However, the first breath is no longer so different from the subsequent breaths. Indeed, it appears that there is an evolution in the code over the first few breaths before the code stabilizes. The stark difference between the different breaths had blurred.

I'm guessing that this is a tyro analysis mistake. I only stumbled upon it because I figured a reviewer would want to see pre-odor prediction rates to compare to those during the odor. I know that when I read a paper, I rarely delve into the detailed methods of these more complicated analyses. And if I do, they aren't always informative. Given how often people perform analysis by themself, with custom code, it's easy to forget how many simple, subtle mistakes one can make. The only way to avoid them is to gain experience, and to constantly question whether what you're doing actually makes sense, and agrees with what you've already done.

Today I found an even BIGGER problem with my odor prediction. When I was creating my "average spike population" I was including the test trial in the population. And I was once again getting pre-odor prediction rates near 100%. Excluding the test-trial from the average population made everything MUCH more sensible.

**The problem discovered**

After looking at different breaths for the same odor, I next wanted to investigate the differences between odors. Rather than look at "spike distances," though, I used a prediction algorithm.

To do the prediction, I built population vectors as I did above, but instead of observing different breaths for the same odor, I observed different odors for the same breath. Once again, I transformed the data via PCA and took the first 5-10 principal components. I then created a sample population vector for an individual trial, and calculated the distance between the sample vector and the average vectors for each odor. The "predicted" odor is the one with the smallest distance from the trial. This was repeated for all trials to get the average prediction rate. When I did this, I got the prediction rates show in the top panel:

n= 105 neurons, and 6 odors, split between two sets of three odors. 10-12 trials. |

In playing with the data, I then noticed something odd: as I increased the numbers of bins or principal components that I used to make predictions, the pre-odor predictions got higher (panel B above). With 20 bins and 20 PCs, I could get pre-odor predition rates of >70% for each odor! I wasn't making predictions, but was over-fitting my model to the data so that it could never be wrong!

And this is where I realized my mistake. When you do PCA, the algorithm tells you how much each component describes the variance in the data. The first few principal components (PC) account for most of the variance, while the later components account for less. When I was doing my prediction algorithm (and my distance calculations above), I was weighing each PC equally, and over-representing principal components which didn't have much meaning.

Once I realized this, it was a simple procedure to weigh each PC according to its variance, and re-run the prediction (panel C above). Following that, the pre-odor predictions are at chance; the positive control is finally working. The downside to this correction, however, is that the odor prediction during the odor was now between 60-80%. This lower predictive ability makes more sense, though, given the trial-to-trial noise in the signal, and the relatively low number of neurons in the population vector.

**Back to breath distance**

Having realized my error, I returned to my original analysis on the breath distance, and added the proper weightings. When I did this, the results were slightly different:

n=11 odors from 5 experiments with >15 cells. |

I'm guessing that this is a tyro analysis mistake. I only stumbled upon it because I figured a reviewer would want to see pre-odor prediction rates to compare to those during the odor. I know that when I read a paper, I rarely delve into the detailed methods of these more complicated analyses. And if I do, they aren't always informative. Given how often people perform analysis by themself, with custom code, it's easy to forget how many simple, subtle mistakes one can make. The only way to avoid them is to gain experience, and to constantly question whether what you're doing actually makes sense, and agrees with what you've already done.

**Update:**

Today I found an even BIGGER problem with my odor prediction. When I was creating my "average spike population" I was including the test trial in the population. And I was once again getting pre-odor prediction rates near 100%. Excluding the test-trial from the average population made everything MUCH more sensible.