Sunday, January 1, 2017

Using KDTrees in Apache Spark

OR: The best Coffee Shop in Hong Kong to catch Pokemon

I work with spatial data all the time, and one of the most common things I do with spatial data is find the nearest locations between two sets of objects. For example, in the context of Pokemon Go, you might ask, "what is the nearest Pokestop to a given Pokemon?" The standard way to do this is to use a data structure called a KDTree.

In the past six months, I have started using Apache Spark, and quickly grown to love it. However, I haven't found any good tutorials on how to use KDTrees in Spark. To fill the void, I have written a short tutorial on how to use scipy KDTrees in Spark. The tutorial covers how to load Pokemon location information from Hong Kong, why KDTrees are great, how to create a KDTree of coffee shops in Hong Kong, and the code to combine them using Spark. I wrote the tutorial as a Jupyter notebook, but haven't figured out how to embed those in Blogger, so head over to Github for a gander.

If you want a sneak preview, this is how I define the udf which does the query:
coffee_udf = F.udf( partial(query_kdtree, cur_tree = coffee_tree_broadcast),
                    T.ArrayType( T.IntegerType() ) )