Extracting Large Datasets from the Google App Engine Datastore

The Google App Engine (GAE) is a very interesting service for quickly deploying web apps or building systems that use Google’s infrastructure. I’ve recently been working on a project that uses the GAE to receive large amounts of data from a mobile app that is part of a feasibility trial with 20 or so participants. The two key things I wanted to accomplish were (a) getting a back-end that receives/processes/stores data up and running as quickly as possible, and (b) having a way to extract/download the data for offline analysis. The GAE was perfect for part (a), and the system was quickly set up: the data that rolled in was easily stored in the NDB datastore.

Now the headache: getting data out of the system. What ways could this be done?

  • Don’t even try: using a front-end request handler. These are limited to 60 seconds, hardly enough time to read through/format/process a large datastore.
  • In the old days, it was possible to back up data from the developer console. This limits data extraction to admins only, which is not ideal. This post describes how this was done, and it worked for me… until the console was upgraded and the ‘Backup’ functionality was removed (does anyone know where it is?).
  • The bulkloader. This post describes how to use it, and I got this working. Simple and command-line based. Hours and hours later, though, the bulkloader was still trudging along, and it would then fail, which seemed to happen whenever my machine momentarily lost its WiFi connection. I didn’t see any way to modify how the bulkloader throttles data download, so I’d only really suggest this method if your data is very small.
  • Using the remote API shell, and paging through the datastore entities. I got this working too, and it seemed to be faster than the bulkloader, but it would hit over-quota limits after some time (even though my project has billing enabled). Again, probably only suitable for small data, and admin access only.
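For reference, the cursor-based paging from that last bullet looks roughly like this (a minimal sketch: `fetch_all` and the page size of 500 are my own illustrative choices; inside `remote_api_shell.py`, `query` would be a real `ndb.Query` for your model, e.g. `MyModel.query()`):

```python
def fetch_all(query, page_size=500):
    """Collect every entity by repeatedly fetching one page at a time.

    `query.fetch_page(page_size, start_cursor=...)` returns a triple of
    (results, cursor, more); we loop until `more` is False.
    """
    results, cursor, more = query.fetch_page(page_size)
    out = list(results)
    while more:
        results, cursor, more = query.fetch_page(page_size, start_cursor=cursor)
        out.extend(results)
    return out
```

This is exactly the loop that eventually tripped the over-quota limits for me, so treat it as a small-data tool.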

What I did in the end was inspired by the post that wanted me to start the backup from the admin dashboard:

  1. I found this post that explains how to transfer data from the datastore to BigQuery. What it is actually doing is transferring data from the datastore to Cloud Storage, and then uploading it from Cloud Storage into BigQuery.
  2. I wrote a map-reduce job that does the same as the first part of that post. The problem I encountered was that the example and all the documentation point to using the “FileOutputWriter” in “mapreduce.output_writers”, which does not exist anymore (and therefore produces ImportErrors). By digging around in the code, I found that I could replace it with “GoogleCloudStorageConsistentOutputWriter”.
  3. I ran the map-reduce job by hitting its URL with a GET request. My code then redirects to the mapreduce status page. The whole thing took approximately 10 hours to complete.
  4. I downloaded the data using gsutil (the -m flag enables multithreaded downloads):
    gsutil -m cp -R gs://my-bucket/my-dir .
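The map-reduce job from step 2 ended up looking roughly like the sketch below. Hedge accordingly: `main.MyModel`, the bucket name, and the shard count are placeholders, and the exact `MapperPipeline` parameters can differ between versions of the GAE mapreduce library; the key point is swapping the output writer spec to `GoogleCloudStorageConsistentOutputWriter`.

```python
import json

def export_map(entity):
    """Mapper: called once per entity; each yielded string is written to
    the output file in Cloud Storage by the output writer."""
    yield json.dumps(entity.to_dict(), default=str) + "\n"

def start_export(bucket_name):
    """Pipeline wiring; runs only on App Engine, where the mapreduce
    library is importable. Kicked off from a request handler."""
    from mapreduce.mapper_pipeline import MapperPipeline
    p = MapperPipeline(
        "datastore-export",                      # job name shown on the status page
        handler_spec="main.export_map",          # dotted path to the mapper above
        input_reader_spec="mapreduce.input_readers.DatastoreInputReader",
        output_writer_spec=(
            "mapreduce.output_writers.GoogleCloudStorageConsistentOutputWriter"),
        params={
            "input_reader": {"entity_kind": "main.MyModel"},
            "output_writer": {"bucket_name": bucket_name},
        },
        shards=16)
    p.start()
    return p
```

A request handler can then redirect to `/mapreduce/pipeline/status?root=<pipeline id>` to watch progress, which is what step 3 above does.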

3 thoughts on “Extracting Large Datasets from the Google App Engine Datastore”

  1. Just out of curiosity, how big is the dataset that is taking 10 hours?

    When I was first experimenting with MapReduce and mass Datastore operations, it was taking me approximately 4 hours to read 1 million entities out of Datastore. Now, I’ve gotten most of my MapReduce jobs under 20 minutes for the same scale, depending on what kind of processing needs to be done.

    1. The data was, from the sound of it, approximately an order of magnitude larger than yours, so going from 4 to 10 hours would make some kind of sense. I guess the parameters would play a key role here; I’d be interested to see how you built/set up your 20-minute version. Thanks!

      1. Here is an excellent article that helped me with MapReduce: http://sookocheff.com/posts/appengine/mapreduce/mapreduce-yaml/

        If you define your MapReduce job in the mapreduce.yaml file at the top level, you can kick it off via the web interface, which makes it a little safer than sending a GET request, as access to the web interface can be limited to admin accounts only.
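        As a concrete illustration of that mapreduce.yaml suggestion (a sketch: the job name, `main.export_map`, and `main.MyModel` are placeholder names, matching the handler and entity kind you would otherwise wire up in code):

```yaml
mapreduce:
- name: Export entities to Cloud Storage
  mapper:
    handler: main.export_map
    input_reader: mapreduce.input_readers.DatastoreInputReader
    params:
    - name: entity_kind
      default: main.MyModel
```

        Jobs defined this way show up on the /mapreduce web interface, where an admin can start them with a click.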

        As for the speed, MapReduce jobs send shards into the ‘default’ push task queue. In the top-level queue.yaml file, you can dictate the speed at which tasks are consumed. Local development is single-instance only; when deployed, however, this will cause multiple instances of your app to spin up faster to help take down the shards. Here is my queue.yaml file for experiments, which uses the max rate:

        - name: default
          rate: 100/s
          bucket_size: 100
          max_concurrent_requests: 100

        I find it fascinating that GAE/MapReduce makes parallel programming so accessible. Instead of thinking in terms of complexity and the amount of time it will take, it’s possible to treat time as a fixed constant, with complexity and the number of instances as the variables. In other words: if it’s slow, throw more power at it.

        If you try my tips out, let me know how it goes!
