The Google App Engine (GAE) is a very interesting service for quickly deploying web apps or building systems on top of Google’s infrastructure. I’ve recently been working on a project that uses GAE to receive large amounts of data from a mobile app that is part of a feasibility trial with 20 or so participants. The two key things that I wanted to accomplish were (a) getting a back-end that receives/processes/stores data up and running as quickly as possible and (b) having a way to extract/download the data for offline analysis. GAE was perfect for part (a), and the system was quickly set up: the data that rolled in was easily stored in the NDB datastore.
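For context, here is a minimal sketch of the kind of ndb model involved; the Reading kind and its properties are hypothetical placeholders, not the project’s actual schema:

    from google.appengine.ext import ndb

    class Reading(ndb.Model):
        """Hypothetical kind: one data point uploaded by the mobile app."""
        user_id = ndb.StringProperty()
        sensor = ndb.StringProperty()    # e.g. 'accelerometer'
        value = ndb.FloatProperty()
        created = ndb.DateTimeProperty(auto_now_add=True)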

Now the headache: getting data out of the system. What ways could this be done?

  • Don’t even try: using a front-end request handler. These are limited to 60 seconds, hardly enough time to read through/format/process a large datastore.
  • In the old days, it was possible to back up data from the developer console. This limits data extraction to admins only, which is not ideal. This post describes how this was done, and it worked for me… until the console was upgraded and the ‘Backup’ functionality was removed (does anyone know where it is?).
  • The bulkloader. This post describes how to use it, and I got it working. Simple and command-line based. Hours and hours later, though, the bulkloader was still trudging along, and it would then fail; this seemed to happen whenever my machine momentarily lost its WiFi connection. I didn’t see any way to modify how the bulkloader throttles data download, so I’d only really suggest using this method if your data is very small.
  • Using the remote API shell and paging through the datastore entities (sketched below). I got this working too, and it seemed to be faster than the bulkloader… but it would hit over-quota limits after some time (even though my project has billing enabled). Again, probably only suitable for small data, and admin-access only.
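For reference, the paging approach in that last bullet looked roughly like the following. This is a sketch, assuming a hypothetical Reading model and that the snippet is pasted into a remote API shell session (remote_api_shell.py -s your-app.appspot.com):

    import json

    from models import Reading  # hypothetical module and model

    cursor = None
    more = True
    with open('readings.jsonl', 'w') as out:
        while more:
            # Page through the datastore in batches of 500 entities.
            batch, cursor, more = Reading.query().fetch_page(
                500, start_cursor=cursor)
            for entity in batch:
                out.write(json.dumps(entity.to_dict(), default=str) + '\n')

Each page is a separate set of datastore reads, which is presumably what eventually trips the quota limits mentioned above.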

What I did in the end was inspired by the post that wanted me to start the backup from the admin dashboard:

  1. I found this post that explains how to transfer data from the datastore to BigQuery. What it is actually doing is transferring data from the datastore to Cloud Storage, and then uploading it from Cloud Storage into BigQuery.
  2. I wrote a map-reduce job that does the same as the first part of that post (a sketch follows this list). The problem I encountered was that the example and all the documentation point to using the “FileOutputWriter” in “mapreduce.output_writers”, which does not exist anymore (and therefore produces ImportErrors). By digging around the code, I found that I could replace it with “GoogleCloudStorageConsistentOutputWriter”.
  3. I ran the map-reduce job by hitting its URL with a GET request. My code then redirects to the mapreduce status page. The whole thing took approximately 10 hours to complete.
  4. I downloaded the data using gsutil (the -m flag enables multithreaded downloads):
    gsutil -m cp -R gs://my-bucket/my-dir .
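For completeness, here is roughly the shape of the map-reduce job from step 2, with the replacement output writer. This is a sketch: the Reading kind, bucket name, handler path, and shard count are hypothetical, and the parameter names may differ between versions of the appengine-mapreduce library.

    import json

    from mapreduce import base_handler
    from mapreduce import mapreduce_pipeline

    def datastore_map(entity):
        # Serialise one datastore entity as a line of JSON.
        yield json.dumps(entity.to_dict(), default=str) + '\n'

    class ExportToCloudStoragePipeline(base_handler.PipelineBase):
        """Writes every entity of a kind to files in a Cloud Storage bucket."""

        def run(self, entity_kind, bucket_name):
            yield mapreduce_pipeline.MapperPipeline(
                'datastore-export',
                'export.datastore_map',  # hypothetical module path
                'mapreduce.input_readers.DatastoreInputReader',
                output_writer_spec=('mapreduce.output_writers.'
                                    'GoogleCloudStorageConsistentOutputWriter'),
                params={
                    'input_reader': {'entity_kind': entity_kind},
                    'output_writer': {'bucket_name': bucket_name},
                },
                shards=16)

A small request handler can then start the pipeline (step 3) with something like pipeline.start() followed by a redirect to pipeline.base_path + '/status?root=' + pipeline.pipeline_id; the exact status-page URL depends on how the mapreduce handlers are mounted.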

My last year in review ended with “I think that 2014 is going to be a year of prioritising and getting things finished.” I think I (mostly!) managed to do just that, with most of my work this year focusing on one project, Emotion Sense. My #YearInReview:

  • Most of the year was dedicated to analysing our growing Emotion Sense data and our results are now taking shape. Gillian presented our first poster. There is so much more yet to come from this; I hope that my 2015 year in review will have this as its first point.
  • We received a grant from the EPSRC’s Impact Acceleration Follow-On Fund, to help turn Emotion Sense into a commercial product for healthcare measurement, monitoring, and intervention. Emotion Sense has also been accepted as an i-Teams project which will kick off in early 2015.
  • We received funding from the Medical Research Council to test and further refine Q Sense: a smoking cessation intervention that (building from Emotion Sense) uses smartphone sensing to guide quitters on a personalised, data-driven journey away from tobacco. Felix presented a poster about the project as well.
  • A paper about group co-location in social networks – led by Chloë – was published in PLoS ONE (arxiv link).
  • I submitted/published a book chapter on “The Anatomy of Mobile Location-based Recommender Systems” (preprint) for the upcoming second edition of the Recommender Systems Handbook.
  • I (finally?) published some old work from my time at UCL on crowdsourcing public transit status updates via a smartphone app (pdf).
  • Advait published the results of his master’s project analysing shared bicycle systems around the world in an upcoming special issue of Transportation (Special Issue on Emerging, Passively Generated Datasets for Travel Behavior and Policy Analysis).
  • I gave talks in the SACHI Group (University of St. Andrews), Institute for Public Health and Dept. of Psychiatry (University of Cambridge), the E-Health Unit (University College London), and the Dept of Primary Care and Dept. of Education (University of Oxford), at Health 2.0 London, and at Data Science London’s Healthcare hackathon. Every talk was about smartphones (e.g., these slides).
  • I did less consulting work than last year, but what I am doing is more on track with my other work (… smartphones!).

I guess that the trend should continue: I hope that 2015 is going to have Emotion Sense written all over it.

I’m at Health 2.0 Europe: London, November 10-11. Brief, brief notes:

The “From Health to Wellness” non-auditorium session (where I presented about Emotion Sense).

A recent talk that I gave in Cambridge’s Dept of Psychiatry and Institute of Public Health, and UCL’s e-Health unit.

Since launching Emotion Sense, I have received a number of e-mails from people who would like to conduct studies or build apps that collect similar kinds of data:

  • Momentary Survey Responses: answers to questions about the moment that the person is in. How happy are you? How focused are you? Where are you? Or, indeed, any of the other kinds of questions that are typical of the experience sampling method.
  • Smartphone Sensor Data: it is increasingly well-known that smartphones have a variety of sensors that can be used to characterise a person’s behaviour and environment. Here is a paper (pdf) that I wrote about that very topic. What are a participant’s GPS coordinates? What are their calling and texting patterns like? And so on.

The main challenge, however, is building this kind of software. This takes time, money, and a lot of programming/design effort: precious resources that are not widely available. Moreover, it is impossible to ‘use’ the Emotion Sense app for anything other than what it was built to do: collect data about positive/negative mood (via the very specific questions that we have chosen) while taking users who download the app on what we have called a ‘journey of discovery,’ unlocking feedback screens as they use the app.

In essence, Emotion Sense isn’t useful for others who are interested in doing similar kinds of research. There are two directions that I can take this: (a) building new, specific apps on a case-by-case basis, or (b) building a new, generic app that tries to suit the broad needs behind doing this kind of research. While I am pursuing both of these paths, this post is about the second, (b).

I have spent a few months working on a new app: Easy M. The app essentially takes all of the pieces of the Emotion Sense puzzle (indeed, it also uses the same open-source code) but has reshuffled everything into an app that can suit others’ needs. The idea behind this was to build an app much like the fantastic MyExperience toolkit, which unfortunately is only available for ‘old’ Windows Mobile devices.

The Easy M app is now online on the app store, and we are soon going to launch the first studies that use this tool. The way this works is (hopefully) simple: your users download the app, put in a code for your study, and review what it is they are joining (in terms of data collection); once they give their consent, the app automatically reconfigures itself to do everything you need: ask the questions that you want, and collect the sensor data that you would like to focus on. Once your study has run its course, the app logs your users out of the study and makes sure that any pending data gets sent up to the servers.

The online documentation includes most of the details about what this app can and cannot do. In particular, there is currently no way for you to use the tool without getting in touch with me, so that I can set up the appropriate configuration files for your needs; the data also comes to our servers to then be transferred to you. As I develop the tool, I hope to add many of the missing bits — for now, I’m interested in hearing from you if this is the kind of tool that you would like to use in your research.

Slides from a talk I’ve given at a few different places recently, covering the design and deployment of Emotion Sense and the initial design of Easy M.


2013 was a long, long year for me; here’s my attempt to summarise it as succinctly as possible.

  • I released Emotion Sense. To date, it has been downloaded approximately 30,000 times. In the days after the release, the press coverage was very intense and exciting. Following the release, there has been a lot of admin to keep the app going, working, updated, and to keep our servers in shape. The research is on its way, too!
  • I continued my data science consulting work. I worked with banks, hotels, data science groups, and a charity. It is like getting a breath of fresh non-academic air. It’s always an eye-opener to see what data-related problems pop up in the so-called ‘real’ world.
  • I worked with Kiran Rachuri on the open-source Android sensing library that we have now released (along with a data manager and trigger library). A bunch of students at Birkbeck College tested the library as part of their mobile dev course; we wrote about it in a workshop paper.
  • I was a guest lecturer on the Coursera Introduction to Recommender Systems; I got to talk with Michael about the temporal issues that emerge in online recommender systems that I studied during my PhD.
  • I wrote a chapter for the upcoming Handbook of Human Computation. The other chapters look very interesting, I’m looking forward to reading them.
  • We published a study (at ACM Ubicomp) that we conducted in late 2012, which used the libraries above in an app that asks people about their feelings and context (a precursor to Emotion Sense). The paper has the enigmatic title “Contextual Dissonance…”; in the final days before submission, it was being edited from 3 different time zones (which turned out to be surprisingly efficient).
  • I was Ubicomp’s social media chair. That meant running their Twitter, Facebook, and Google+ accounts, which was a bit of fun. I also co-organised the conference’s Workshop on Pervasive Urban Applications.
  • I worked with Jagadeesh Gorla and colleagues to publish some work on group recommender systems at WWW. I didn’t get to go to Brazil, but I hear the work contributed to the founding of a Cambridge-based recommender system company.
  • What will smartphone-based behaviour change interventions look like in the future? We wrote a paper about that for IEEE Pervasive Computing’s Special Issue on Understanding and Changing Behaviour (pdf).
  • Some early work on a smoking cessation app received funding from the MRC. I’m looking forward to that kicking off in 2014 too.
  • I built the Android Easy M app. It’s a bit like Emotion Sense under the hood, but it’s for other researchers who want to run sensor-collecting experience sampling studies with their own setup. The first few studies will be kicking off in the beginning of 2014.
  • I gave talks at Birkbeck College, Ghent University, Sussex University, the Ubhave Conference, DrinkAware’s Workshop, Ubicomp (and its workshops) and at a Government Social Research Workshop. All black slides ftw.
  • I reviewed papers for a range of different conferences (ICWSM, SIGIR, CIKM) and workshops. I realised that I have reached the stage where I probably read more unpublished work than published papers. I certainly spend more time reading papers that I have to review than those that I don’t!

With all this manic stuff going on, I think that 2014 is going to be a year of prioritising and getting things finished.

Recently, we launched the Emotion Sense app for Android on the Google Play Market. This app combines experience sampling with the passive data that modern smartphones can collect, and gives people feedback about how their reported mood compares to this data. Of course, this app was designed and built to support our research into how daily mood relates to the behavioural signals that phones can capture.

To support the launch, this press release was published, which has since been picked up by a number of newspapers and blogs (a sample of them are listed here). Overall, the response has been overwhelming and we have learned a number of lessons which I should dedicate a separate blog post to.

Naturally, since this app collects data from a wide variety of sensors, there have been a number of concerns about privacy. This issue has been, and continues to be, very important to us. However, a number of comments that have been made are misleading, and so this post aims to clarify how and why we collect sensor data.

Why do we collect sensor data? What happens to this data?

The aim of collecting sensor data is to support academic research into daily moods and smartphone sensors. We are academic researchers, not marketers: we are ethically bound and fully committed never to sell or share the data the app collects. We are not interested in advertising or making money off your data: we are interested in progressing the state of the art in Computer Science (sensing and data mining) and Psychology (studying daily life). If any of us leave the University to move on to other projects, we will leave the data behind. We will also never make the data available to anyone except the person who generated it. If you want your data, you can simply get in touch.

Our research is not about prying into individuals’ lives. We are looking for broad patterns, which emerge from many people doing the same thing (e.g., using an app). We really have no interest in looking at anything other than aggregate patterns in the data.

How is the data anonymised?

The app does a fair bit of data collection on your phone, but then anonymises it: we do not receive your raw data. In particular:

  • The app does not send us conversations, or any audio recordings at all. All it does is measure the ambient volume, which is a number (e.g., “23”). We do not and cannot track the web sites you visit, your eye movement, or how you touch your phone screen (in fact, other researchers have shown that this is impossible on a number of Android devices!).
  • The app does not record any text message content or clear-text phone numbers. In fact, it uses a one-way hash function to convert a phone number into an indecipherable string. So, for example, we will see that a phone texted another phone, identified as “abdjasdfkjqwercsdsdsaqt2”, and sent 3 words (a minimal sketch of this idea is below).
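To illustrate the idea (this is a sketch, not our exact implementation; the salt and the truncation length are made up for the example), a one-way hash of a phone number might be computed like this:

    import hashlib

    SALT = 'a-long-random-secret'  # hypothetical salt

    def anonymise_number(phone_number):
        """Map a phone number to a stable but indecipherable identifier."""
        digest = hashlib.sha256((SALT + phone_number).encode('utf-8')).hexdigest()
        return digest[:24]

The same number always maps to the same string, so texting patterns between (anonymised) phones can still be counted, but there is no practical way to go from the string back to the number (assuming the salt is kept secret).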

How is this research funded?

We have not paid anyone to write/blog about our work, and our project has no commercial partners. Our work is funded by the Engineering and Physical Sciences Research Council: details of the project are available here.

We are aware that people may still have some questions about how the app works. If you or any of your readers has any questions, please feel free to contact me: neal.lathia@cl.cam.ac.uk

Or visit our FAQs page.

Update: Potential Reasons for this Confusion

As pointed out to me, some of the confusion about what we are doing with data may be due to this earlier research paper that was published in 2010 and used the same Emotion Sense name. I would like to point out that the app used in this paper is very different from the app that we released to the general public.

The publicly available app does not perform any speaker recognition or emotion detection. This is for a number of reasons:

  • Privacy. Recording audio should require informed consent, and not everyone around your phone may have given it, so we chose not to store any recorded audio beyond (as above) the ambient volume.
  • Technical Challenges. While audio processing research is progressing, an earlier trial of the techniques we used in previous research was inconclusive and overly cumbersome: the sounds that surround people when their phones are in their bags, pockets, etc. go well beyond what was tested in the controlled lab trial.
  • Moving Beyond the Microphone. There is, naturally, more to a smartphone than its microphone. The design of the publicly available app therefore seeks to learn about daily life using a more holistic approach (i.e., combining the data from different sensors).

I was recently pointed to this excellent blog post that argues how the ideas of ‘big data’ and ‘quantified self’ do not fit well together. The title here comes directly from that post: “Big Data and Quantified Self, just like chocolate and champagne, do not pair together well.” In the true spirit of online blogging, I thought I’d reply here instead of via e-mail.

The key idea is that ‘big data’ tends to focus on the ‘average’ person: the aggregate of many noisy data points that, when put together, give an indication of behaviour that is the sum of everyone, but manifested by no one. Self-tracking, or quantified-self, data comes from a self-selecting sample of the population and therefore is not representative of everyone: “self-trackers are different from other people with regard to mentality, psychological traits, lifestyles, behaviors, etc. So even if we derive a certain pattern based on a data from a hundred, thousand or even five thousand self-trackers with diabetes, that pattern won’t necessarily hold for all other people with diabetes.”

I mostly agree with this: my thoughts only differ in terms of the conclusions.

First, this problem is increasingly emerging/actively discussed in all ‘big data’ research. Studying how people move around cities based on foursquare check-ins only looks at people who like foursquare, researching how twitter predicts elections only looks at the sample of people who use twitter, and 96% of brain research has been conducted on westerners. Psychologists agree that they have been mostly studying people who are WEIRDos (Western, Educated, Industrialized, Rich, and Democratic). While something certainly has to be done to address this, I would posit that throwing away everything we have learned is not one of those things: there are many domains (take, for example, medicine) where ‘small’ tests have led to methods that have successfully scaled to all. Instead, we need to increase our awareness about how much of a sub/self-selecting-sample we are dealing with when making our conclusions.

Because it is full of people, ‘big data’ also has one key advantage: it can help overcome the data sparsity that any single self-tracker will face, and finding links between people’s behaviours is the only way to do that. While tracking my mood, I know that I cannot accurately record it every minute, since I am otherwise engaged. However, your actions and mood may have something to teach me.

Mathematically speaking (see the other blog post), I’m saying that while Y_me = f_me(X_me) and Y_you = f_you(X_you), since we are all human there are bound to be some people in the world for whom f_me ~ f_you: and we can learn from one another. So one of the goals of the quantified self movement should be to facilitate this process: putting people together in a room where they each talk about their lessons learned is a first step in this direction.
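Restating that in symbols (just a sketch of the same point, with Y_i as person i’s mood and X_i their sensed behaviour):

    \[
      Y_i = f_i(X_i), \qquad
      \mathcal{N}(\mathrm{me}) = \{\, j : f_j \approx f_{\mathrm{me}} \,\},
    \]
    \[
      \hat{f}_{\mathrm{me}} \ \text{can be fit on}\
      \bigcup_{j \,\in\, \mathcal{N}(\mathrm{me}) \cup \{\mathrm{me}\}} \{ (X_j, Y_j) \}.
    \]

In other words, my sparse self-tracking data can be supplemented with data from people whose ‘mood function’ looks like mine.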

The only difference I see between QS and big data? By looking at your own data, QS seems to encode the ideas of mindfulness (beyond just self-experimentation). When I look at my QS data, I stop and think about my life. When I’m running my ‘big data’ experiments, I don’t!

A couple of weeks ago I was invited to participate in a workshop at NYU’s CUSP, or Center for Urban Science and Progress. As they describe themselves:

The Center for Urban Science + Progress (CUSP) is a unique public-private research center that uses New York as its laboratory and classroom to help cities around the world become more productive, livable, equitable, and resilient. CUSP observes, analyzes, and models cities to optimize outcomes, prototype new solutions, formalize new tools and processes, and develop new expertise/experts. These activities will make CUSP the world’s leading authority in the emerging field of “Urban Informatics.”

The theme of the workshop was ‘mobile sensing’ – with, of course, a particular focus on how it may support urban science.

Talks. The invited speakers were from diverse backgrounds and institutions, making for a very interesting line up. I did not take any notes, so my summary here is vastly unfair to each talk:

  • Rob van Kranenburg (@robvank) naturally spoke about the Internet of Things, and how it fits into the broader ecosystem of cities.
  • Mischa Dohler (@mischadohler)’s talk covered urban sensors, and gave rise to a big debate about where the boundary between crowd-sourcing and urban sensing should lie.
  • Jacqueline Lu (NYC Parks and Recreation) spoke about how data is supporting efforts to maintain and promote green spaces in the city. This was particularly interesting since it made me realise how a seemingly ‘trivial’ problem (maintaining trees) is actually vastly complex when placed into urban settings.
  • Vivek Singh (MIT) spoke about ongoing mobile sensing experiments that investigate how behaviours can be promoted between social groups.
  • Margaret Martonosi (Princeton) spoke about her work using Call Detail Record data (e.g., see this paper). The data gives fantastic geographical coverage and potential to study many facets of mobility, while presenting very difficult challenges with regards to inference and privacy.
  • Weisi Guo (Warwick University), the workshop organiser, spoke about his research about understanding cities through mobile sensors.
  • Jarlath O’Neil-Dunne (University of Vermont, @jarlathond) gave a talk about geographic analysis using satellite data – I learned about how LiDAR data (e.g., this blog post) can be used to, for example, find trees that would otherwise be hidden by shade: a very challenging data feature extraction task.
  • Raz Schwartz (Rutgers, @razsc) is the co-creator of the Livehoods project, which is a great example of attempts to uncover the structure of the city via social media analysis.
  • Eiman Kanjo (King Saud University) discussed her work with mobility and affective sensing using smartphones (see her publications here).
  • Andrew Eckford (York University – Canada not UK!) gave a very interesting talk about molecular communication for harsh environments (say, flooded subway tunnels!) – something I had never heard of.
  • Graham Cormode (AT&T) discussed his work on distributed data monitoring and mining. See his personal page here.
  • Lin Zhang (Tsinghua University) talked about his work with sensors on Beijing’s taxi cabs for pollution monitoring (MobiSys paper here). The dataset is available on request.

Finally, I briefly talked (with very sparse slides) about open challenges in mobile sensing – ranging from energy efficiency to data inference and behaviour change measurement.

Open Ideas. The fact that this broad range of researchers all agreed to come to a workshop on ‘urban sensing’ shows how this field is still in its infancy; I very much enjoyed the fact that everyone spoke about very different things. In fact, we even differed on the basics:

  • What is urban? It is one of those words that could mean tunnels in a subway, parking sensors on a road, or community-driven tree maintenance. The ‘where’ of all the research above certainly agreed on cities: but within this context, there is a hierarchy that ranges from metropolitan-scale analysis down to individual citizens’ sensors.
  • What is sensing? It seems that ‘sensor’ is quickly becoming a term that means ‘a source of data’; while this usage is becoming common, my gut tells me that historically this would not have been the case. Tweets, accelerometers, and satellites are all sensors, albeit very different in nature: and there is ample space for research both within the scope of individual sensors and in finding links/building systems that bridge between them.

