Monday, May 18, 2015

Playing with deconvolution and GCaMP6 imaging data

The Palmiter lab recently got an Inscopix microscope. We are still troubleshooting our surgeries and recordings right now, so we don't have any imaging data yet. Given that, I wanted to set up our analysis pipeline ahead of time. Specifically, I wanted to see how we can identify calcium events. In this post I will explore how well deconvolution works on calcium imaging data from the Svoboda lab.

Why deconvolution

Inscopix provides image analysis software, Mosaic, which is pretty good at motion correction, and identifying ROIs. For Mosaic's signal detection, it identifies events using the simple criterion:
ΔF / F > F0 + 3SD
This works adequately for low frequency events on a low background, but completely ignores the magnitude of events, or what happens when events occur in quick succession. While I've done imaging before, I've never done deconvolution, so I thought it would be fun to try it out.

To get sample GCaMP6 imaging data, I turned to the CRCNS, which has a bunch of neuroscience datasets. The most relevant dataset was from the original GCaMP6 paper wherein they recorded calcium fluorescence in parallel with loose seal electrophysiology, to allow comparisons between calcium and spikes. For this example, I loaded the processed data from the GCaMP6s imaging dataset. If you don't care about loading data in python, you can skip the next section and head straight to Deconvolution time!

Loading Svoboda lab data into python

The data is stored in .mat files using the Svoboda lab data format. Since understanding someone else's data structure is as difficult as understanding their musical taste, I'll walk through how to load the data in python. To load the data, you can simply use the function, with the following arguments:

import as scio

svoboda_data = scio.loadmat( 'data_20120521_cell5_003', squeeze_me = True, struct_as_record = False)

It is critical to use the squeeze_me and struct_as_record flags! If you don't, python will load a weird object that will crash your python instance. This took me an embarrassingly long time to figure out.

svoboda_data is a dictionary with four fields; the only one we care about is the 'obj' field, so we can extract it:

svoboda_obj = svoboda_data['obj']

svoboda_obj has a bunch of fields as well; the most important one is timeSeriesArrayHash.value, which contains the data. The value field is an array of five structs, number 0-4, which have:

0: average fluorescence of an ROI
1: average fluorescence of the surrounding neuropil
2: raw e-phys trace from the cell
3: high-pass filtered e-phys
4: detected spikes

The value struct has two fields we care about, .time (which has the... time) and .valueMatrix (which has the values of the fluorescence or voltage). Knowing this, we can load the fluorescence and spike data:

ca_data = svoboda_obj.timeSeriesArrayHash.value[0].valueMatrix

ca_time = svoboda_obj.timeSeriesArrayHash.value[0].time

spike_data = svoboda_obj.timeSeriesArrayHash.value[4].valueMatrix

spike_time = svoboda_obj.timeSeriesArrayHash.value[4].time

I wrote a small wrapper function in python to load full Svoboda lab experiments. You can then plot the calcium fluorescence and spike times on each other:
Fluorescence (blue trace, DF / F) and spike data (green) from the GCaMP6s data set, recorded May 15, 2012, cell 1.
For this cell, most spikes generated calcium events, and multiple spikes in a short time period generated larger calcium events. However, some spikes did not yield much (if any) calcium response. Calcium events, in general, had mostly exponential decays.

Deconvolution time!

Now that the data is loaded, we can deconvolve it. Thankfully, Alistair Muldal has implemented the fast non-negative deconvolution described in Vogelstein, 2010 (PyFNND. To be honest, I only partially understand the deconvolution algorithm. It starts by assuming that calcium can be modeled by spikes that return to baseline with exponential decay, but how they turn that generative model into a fit went beyond me.

In any case, we can run the PyFNND deconvolution on our calcium trace (I've calculated the ΔF / F separately), and get both a fit of the calcium trace, and an imputed spike train:

from pyfnnd import deconvolve, plotting 

n_best, c_best, LL, theta_best = deconvolve(
df_f.reshape, dt=0.0166, verbosity=1, learn_theta=(0, 1, 1, 1,0) )

plotting.plot_fit(df_f.reshape(-1, 1).T, n_best, c_best, theta_best, 0.0166)
Top: Original fluoresence (blue) and fit fluorescence (red), given the spike heuristics shown below.
Bottom: Imputed spike heuristics for the calcium trace shown above. Note that n_hat does not give a spike train, but rather a spike heuristic.
I played around with the theta parameters to see if I could get a better fit, but the best results I got were with the default ones above.

To get a spike train, you need to threshold the spike heuristic n_hat. Here I used a threshold of 0.1 (to avoid getting false positive spikes), and plotted the imputed spike train, real spike train, and calcium signal.

The imputed spike train (red) for a threshold of 0.1, and the real spikes (green).
Given how often the real spikes did not generate clean calcium events, I feel like the deconvolution did a pretty good job identifying spikes. The deconvolved spikes were often slightly delayed compared to the actual ones (although this could probably be fixed).

To get a sense for how accurately the deconvoluted spike train matched the real spike train, I calculated the Victor-Purpura distance (VP distance) between the two. The VP distance calculates how many times you would have to insert, delete, or move a spike to turn one spike train into another. VP was handily implemented by Nicolas Jimenez in the python module fit_neuron. (Note, if you want to use this module you should download the .tar.gz from Github, as the pip install has a bug at the moment.) I also wanted to use this metric to get a better sense of the optimal threshold for n_hat, so I ran the VP distance for thresholds of n_hat ranging from 0 to 1. I used a cost of q=5, so that spikes would be inserted or deleted only if there was not a nearby spike within 200 ms.

from fit_neuron.evaluate import spkd_lib as spkd

vp_cost = 5

spk_times = ca_data.spike_time[ca_data.spike_train>0.5]

vp_dist = spkd.victor_purpura_dist(ca_data.fluor_time[n_best >thresh], spk_times, vp_cost)

Similarity between the imputed spike train and real spike train. The normalized VP distance is the VP distance divided by the total number of real spikes. The threshold is the threshold for n_hat. 
A normalized VP distance of 0 means all the spikes were the same for each spike trains; and a distance of 2 means all the spikes from one train had to be deleted and re-inserted to recreate the other spike train. I think a distance of 0.8 means that just over half the real spikes were shared with the imputed spikes. I ran the deconvolution for all 9 neurons in the GCaMP6s dataset, and the minimum normalized distance ranged from 0.73-1. Some cells were undermined by bursts of firing, which caused a lot of false positives in the deconvolution. Others were difficult due to random non-spike related calcium fluctuations. I tried some filtering or baseline subtraction to make the results better but could not do significantly better than the default settings

A few concluding thoughts. Python is sweet, and people have implemented a lot of useful algorithms in it. This allowed me to try things out that would have been otherwise impossible. In the future I will try to clean up python posts by using iPython notebooks.

Deconvolution works pretty well at turning large calcium events into number of spikes. The timing won't be exact, but if you're doing calcium imaging, precise timing (<50ms) doesn't matter anyway.

[The following comments reflect my understanding after a few days of coding / playing. If I make any mistakes here, hopefully my commenters can correct me!] Finally, GCaMP6s was sold as having single spike resolution, but in my hands it does not, at least not for single trials. In Fig. 3F of the GCaMP6 paper they claim to detect 99% of single spikes in visual cortex with a 1% false positive rate. When I read that, I thought they had done something similar to what I just did: given a calcium fluorescence trace, and no prior knowledge, try to identify as many action potentials as you can, and compare to ground truth. What they actually did was answer the question: if you know what a single spike's fluorescent trace is, can you tell the difference between known spike events and random noise? In real imaging, of course, you don't know what spike calcium events look like (it's certainly not a simple exponential!). Also, that false positive rate, while sounding stringent, is quite permissive: if you image at 60Hz, with a 1% false positive rate, you will detect false spikes at a rate of 0.6Hz; if your cell fires at 0.5Hz, as their example cells do, that's a problem.

Having said that, using deconvolution with GCaMP6 works well depending on the type of data you're interested in. If you are recording from low frequency neurons where each action potential can cause spiking in downstream neurons, missing half the spikes is important. But if your neuron fires has a high baseline firing rate, and individual spikes aren't too important, the combination works well.


  1. Thank you for this post! Just downloaded some of their data and could not figure out how to actually get into it in Python; those record flags for loadmat were what I needed.

    -Gabe K. Ocker

    1. Glad I could help! And to see I'm not the only one who got tripped up by the flags.