Open Problems in Shared Bicycle Research

I recently attended and presented at a workshop on shared bicycle systems in Paris, France. The workshop, called “Spatio-temporal Data Mining for a Better Understanding of People’s Mobility: The Bicycle Sharing System (BSS) Case Study”, was organised by Latifa Oukhellou from IFSTTAR; the program and slides are online here (and the presentation that I gave on my recent paper is embedded below).

I personally think that it was an excellent opportunity to bring together a variety of people who have been working with data from shared bicycle systems, particularly since the work spanning this ‘nascent’ field comes from people with very diverse academic backgrounds. The day uncovered surprising similarities in the techniques that people are using to analyse different cities’ data (e.g., clustering stations), and, more broadly, in the problems that the few (but growing number of) researchers in this field have been trying to solve.

The day also really served to expose a number of key problems:

  1. Data Acquisition. There is a blatant tension between researchers who have been studying (and want to continue studying) these systems and the practitioners who run shared bicycle system web sites. The vast majority of researchers in the group have obtained their data by regularly crawling a shared bicycle map, like this one of London (a minimal sketch of that kind of polling script follows this list). This has allowed researchers to collect time-varying station capacity data, which is useful for training algorithms that seek to predict how many bikes a station will have. However, this is clearly not an ideal way to collect data, and I hope to see closer collaboration between transport operators and researchers in this domain in the future.
  2. Data Quality. The data that is collected by scraping web sites is prone to inaccuracies and noise, and this can lead to errors in our analysis. For example, Ollie (who has made all the great online maps of shared bike systems) pointed out that one of the differences that I uncovered in my recent work was not due to a change in activity, but to the fact that the station that I thought had changed patterns had simply been moved closer to a train station. While, in hindsight, I don’t think this completely undermines the work I did (!), I wonder how much of the broader research is affected by similar hidden changes (or, at least, changes that a web scraper would not be looking for).
  3. Data Granularity. More importantly, web-scraped data does not capture important features of the system, such as origin-destination pairs or the actual habits of the systems’ users. As researchers, we know that the value of data is often proportional to its granularity. For example, all of the recent work that I have done using Oyster card data would not have been possible if all I had was station gate counts, which is the rough equivalent of the data that most shared bicycle researchers have. How are people responding to incentives? What is the variety of behaviours that the system users are exhibiting? All these questions are currently beyond our (data’s) reach.
  4. Limits of the Data. A very important point was raised during the day: any data that the transport authority holds will inherently only capture the “satisfied” part of travel demand. Public transport operators currently have no means of gauging how many passengers they have failed to transport, whether that be because a person made the (healthy) choice to walk, or (in the shared bicycle case) because they found an empty station when they sought a bicycle, or a station where all the bikes’ tyres had been punctured.
  5. Motivation for Mining Shared Bike Data. As researchers, I don’t think that we have fully uncovered the entire family of problems that data from shared bicycle systems can address; I felt that some propositions were lacking in a grounded motivation. There are a wide range of problems that could be addressed, if the right data were at hand. For example, can the data be used to discover bicycles that are broken? Can real-time data mining guide a load balancing truck to best suit current and predicted travel demand? This is where perhaps a closer relationship with transport operators may again be helpful.
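
As an aside on the data acquisition point above: most of us end up writing some variant of the same polling script. Below is a minimal sketch of that approach, assuming a hypothetical JSON feed; the URL, field names, and polling interval are all placeholders, and real feeds vary per operator (and come with their own terms of use).

```python
# A minimal sketch of a station-status poller. The feed URL and JSON field
# names are hypothetical placeholders; real feeds differ per operator.
import csv
import time
from datetime import datetime, timezone

import requests

FEED_URL = "https://example.org/bss/station_status.json"  # placeholder
POLL_INTERVAL_SECONDS = 120


def poll_once(writer):
    """Fetch the current station status and append one row per station."""
    snapshot_time = datetime.now(timezone.utc).isoformat()
    response = requests.get(FEED_URL, timeout=10)
    response.raise_for_status()
    for station in response.json()["stations"]:
        writer.writerow([
            snapshot_time,
            station["id"],
            station["bikes_available"],
            station["docks_available"],
        ])


if __name__ == "__main__":
    with open("station_snapshots.csv", "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            try:
                poll_once(writer)
                f.flush()
            except requests.RequestException as exc:
                # Failed polls become gaps in the data (see the data quality point).
                print("poll failed:", exc)
            time.sleep(POLL_INTERVAL_SECONDS)
```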

Overall, I think it was a great workshop, and I encourage you to look at the presentations that are online. If you have an interest in this area, I would also encourage you to join the Google Group that I set up for researchers in this domain to share their findings.

The A/B City

One of the key ingredients of many successful online companies has been rapid iteration and improvement of their services via A/B testing. In essence, you split your users into two (or more) groups, serve them variants of your service (e.g., different algorithms, user interfaces), and then sit back and measure how each group behaves. Once you are operating at web scale, the sheer number of visitors and the potential for rich data collection can really inform those companies about how they are performing and which ideas work better than others. Size matters most: (we hope that) once we are dealing with a large enough sample, randomization will wash out all the other confounding factors that may play a role in what we are trying to measure. In other words, the web turns the world into a living laboratory.
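
For concreteness, here is a minimal sketch of the mechanics (not any particular company's setup): users are hashed into two buckets, a per-user metric is collected, and the group means are compared. The metric, the bucketing salt, and the use of a Welch t-test are all illustrative assumptions.

```python
# A minimal sketch of A/B test mechanics: hash users into two buckets, collect
# a per-user metric, and compare the group means. The metric and the choice of
# significance test are illustrative, not a prescription.
import hashlib
import random
from statistics import mean

from scipy import stats


def assign_variant(user_id, salt="experiment-1"):
    """Deterministically assign a user to bucket 'A' or 'B'."""
    digest = hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"


def compare(metric_by_user):
    """metric_by_user: dict of user_id -> observed metric (e.g. conversions)."""
    a = [m for u, m in metric_by_user.items() if assign_variant(u) == "A"]
    b = [m for u, m in metric_by_user.items() if assign_variant(u) == "B"]
    _, p_value = stats.ttest_ind(a, b, equal_var=False)
    return mean(a), mean(b), p_value


if __name__ == "__main__":
    # Fake data: with no real difference between groups, p should usually be large.
    fake = {f"user{i}": random.gauss(10, 2) for i in range(10_000)}
    print(compare(fake))
```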

Unfortunately, this cool technique is not readily available for the physical world. Imagine, for example, trying to evaluate a policy that aims to reduce the number of children being killed by cars. How do you split your users (and ensure randomization)? How do you cope with the fact that people already behave differently in different geographic areas? (Moreover, how do you come to terms with the ethical questions?) The sad conclusion seems to be that, when we intervene on the physical world and observe a change, it remains difficult to speak of anything more than a correlation between what we did and the behaviour we (hope to have) caused.

Luckily, alternative approaches exist. The equivalent of an A/B test when you can’t randomize your sample is a quasi-experiment. In a recent paper, we adopted this perspective in order to examine the impact of changing the user access policy of London’s (shared) Boris bikes. By splitting our data around the time of the change, carefully cleaning it, and maintaining a large (temporal and spatial) scale of data, we examined how sensor readings from bicycle stations can be used to observe how the policy change propagated across time and the city. Interestingly, the data showed us that this change in policy resonated differently across stations: some stations that people travelled to in the morning and left from in the evening flipped their pattern completely.
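
To make the idea concrete (and emphatically not as a description of the paper's actual analysis), here is a minimal sketch of the kind of before/after split involved: compute a simple morning-versus-evening occupancy balance per station in each period, and flag the stations whose balance changed sign. The file name, column names, split date, and balance measure are all assumptions for illustration.

```python
# A minimal sketch of a before/after comparison in the spirit of a
# quasi-experiment: split station snapshots around the policy change and
# compare each station's morning vs. evening occupancy in the two periods.
# The CSV columns, the date, and the balance measure are assumptions.
import pandas as pd

POLICY_CHANGE = pd.Timestamp("2010-12-03")  # hypothetical split date


def morning_evening_balance(df):
    """Mean occupancy in the morning peak minus the evening peak, per station."""
    df = df.assign(hour=df["timestamp"].dt.hour)
    morning = df[df["hour"].between(7, 9)].groupby("station_id")["occupancy"].mean()
    evening = df[df["hour"].between(17, 19)].groupby("station_id")["occupancy"].mean()
    return (morning - evening).rename("am_pm_balance")


snapshots = pd.read_csv("boris_bike_snapshots.csv", parse_dates=["timestamp"])
before = morning_evening_balance(snapshots[snapshots["timestamp"] < POLICY_CHANGE])
after = morning_evening_balance(snapshots[snapshots["timestamp"] >= POLICY_CHANGE])

# Stations whose balance changed sign are the ones that 'flipped' their pattern.
flipped = (before * after) < 0
print(flipped[flipped].index.tolist())
```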

The rest of the details are in the paper (reference below). However, since this is the year that politicians are taking on coding, maybe they should also take a few hints from web companies and start running their own A/B tests.

N. Lathia, S. Ahmed, L. Capra. Measuring the Impact of Opening the London Shared Bicycle Scheme to Casual Users. To appear in Transportation Research Part C (Elsevier), accepted December 2011.

A Round Table on Transport Big Data

One evening last week, I attended a very interesting round-table discussion about big data in the UK transport scene. I was invited to kick-start the evening by saying a few words, and did so in front of a small but diverse group of people hailing from many corners of the UK transport sector. Here’s a quick round-up of how the evening went.

First, what did I talk about? I put down five points in my notes:

  1. The Big Data Trend. The term “big data” is spreading like wildfire; it is heard everywhere nowadays. But, of course, today’s big data is tomorrow’s small data, and yesterday’s big data is hardly data at all. A rough analogy I devised to show what really matters: if you have a lot of gold, you are rich. But having a lot of data is more like having a lot of time to live your life: the real value is in what you do with it (not just in having it in the first place).
  2. The Two Worlds. A notable trend in data is the divergence between how web-based companies use it (it is their lifeblood) and how “physical-world” companies have been tackling it (where it seems to be more of a “challenge,” a “problem,” or an “untapped opportunity”). There are therefore valuable lessons to be learned from how web companies view and reason about their data that could be transferred to the offline domain.
  3. Brief words on stuff I’ve been doing. Seek elsewhere (or in other blog posts) for this info.
  4. Some lessons: what should you be using “big data” for? Well, this comes down to the basics of being a data scientist (a term that we discussed in and of itself as well).
    1. Challenging assumptions. Every time that I have an idea about how people use a system (e.g., the tube), data analysis shows that there are a variety of behaviours and uses that I had no idea would ever emerge. But they do. Lots of data teaches you to be a bit humble before throwing around your views of how the world should work.
    2. Discovery. Often, I didn’t know what problem(s) needed solving until I’d put my hands on some data. This has been my experience with TfL ticket purchasing data.
    3. Measuring and deciding. Data is not only useful for observing how people behave, but also as a means for measuring the outcome of interventions on their behaviours.
    4. Building and engaging. A powerful aspect of big data is its cross-applicability. Just as Google uses clicks to improve search, transport authorities could use fare data to build personalised services, or evaluate transport policy.
  5. Some discussion points. The travel information “market” is very much open. I claimed that, if transport operators and authorities don’t learn to quickly capitalise on their own data, others will. Moreover, the (e.g. smart card) data they hold is hardly unique: mobile phone operators and social media feeds are quickly beginning to contain the same (or richer) data about how people navigate cities. What are the challenges to adopting big data? Why aren’t we there already?

The ensuing conversation covered a lot of ground; I wasn’t taking notes throughout and had to wait until I got back home to jot down some of the themes that emerged. But here are some highlights:

  1. Transportation vs. customer experience. The transportation industry’s core business is getting people from A to B. Progress in the public transport domain has been marred by the notorious lack of competition (you either take our bus or find your own way…). As such, transport authorities have been (are?) less concerned with providing a “customer experience” in favour of maintaining and improving infrastructure. Can this be changed by a fresh perspective on data?
  2. Data modeling, application developing. Should transport operators be in the business of making mobile phone applications? Again, that steers them away from their “core business.” However, they already do data modelling in house, so should they also take the next step? Or should they not be doing data science in-house at all (and instead seek aid externally from those who may do it better)?
  3. Open, valuable, private data. There is a continuous cry for transport operators to release their data. However, in this case I argued that there are two camps of data that they hold:
    1. Open-able data. Includes train time tables, fares, aggregate usage statistics, etc. A daydream for the frenzied travel-app maker or data journalist.
    2. Private data. The implicit profiles that they can build of all their passengers. Where people go, when, etc. I argued that, if they are ever to find new value in their data, it would be in this category (but return to theme #1 above…).

Overall, a great evening. Other tiny highlights: I recommended Predictably Irrational as well as The Wisdom of Crowds, told everyone what Foursquare is, and received a book on Managing Business Risk.

Addressing Overspending on Public Transport Using Recommendations

Update: find the research paper & more details on my website.

One of the continuing themes in my research is designing and evaluating systems that public transport authorities could build (or allow others to build) if they delved into the treasure trove of data that their fare collection systems create. There are many similarities between what your Oyster card collects about you and the data that many web sites collect while you browse the web (rate things, click on links, add friends, …). The key difference is that these web sites often use your data to give you something back: Amazon recommends items based on what you have purchased, Facebook personalises your news feed based on who you interact with, and Google personalises your search results. Why can’t TfL do the same thing? Why can’t they, based on your trip history, recommend the best travel card for you to buy?

Why care about this question? Have you ever bought a travel card that you didn’t use? Have you ever topped up so many times in one week that you realised you should have bought a travel card instead? Moreover, how do you know that you aren’t wasting your money by buying the wrong transport fares all the time? It is difficult to be 100% sure that you are always getting value for your money.

In some of my recent research, I examined the same (fully anonymised) data set that I had used for a previous study. The data contains a 5% sample of Oyster card records from two 83-day long periods (who touched in/out where & when), along with all the top ups and travel cards that these travellers bought during the sampled time span. This time, I asked: how much money are these travellers ‘wasting’ by buying the incorrect fares?

In this case, “incorrect” simply means that they could have bought a cheaper ticket for the trips that they ended up making. For example, if a person buys a 7-day travel card, but then only travels on two days, it would have been cheaper for that person to use pay as you go. Overall, the program I wrote found that this 5% sample of travellers spent just under GBP 2.5 million more than they needed to. Two points to note are: (a) using this sample of people to approximate the entire population means multiplying these figures by 20, and (b) the datasets each cover an 83-day period. In other words, this translates into an estimate that travellers cumulatively spend approximately GBP 200 million per year more than they need to, simply by buying the incorrect tickets.
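
For transparency, here is that back-of-the-envelope multiplication spelled out, reading the GBP 2.5 million figure as applying to one 83-day period (the paper itself states the annual estimate as “up to approximately GBP 200 million”):

```python
# Back-of-the-envelope extrapolation from the sampled figure to an annual,
# population-wide estimate. Assumes the GBP 2.5 million applies to a single
# 83-day period for the 5% sample.
overspend_in_sample = 2.5e6   # GBP, 5% sample, one 83-day period
sample_fraction = 0.05
days_in_period = 83

annual_estimate = overspend_in_sample / sample_fraction * (365 / days_in_period)
# Prints a figure in the region of the paper's "up to approximately GBP 200 million".
print(f"~GBP {annual_estimate / 1e6:.0f} million per year")
```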

To understand why this may be going on, consider the graph below.
[Graph: trips per day/week/month (Zones 1 and 2) vs. the cheapest fare]
It shows the relationship between how much you travel, in terms of trips per day/week/month/year (only in Zones 1 and 2), and the cheapest fare for you to use. For example, you may start on pay as you go; if you travel more than 3 times in one day, a peak travel card is cheaper (and you should already get it via daily capping); if you travel more than 3.58 days in a week, a 7-day card is cheaper (and so on). Remember: this is (1) only for Zones 1–2, and (2) there are also relationships between the best price and non-contiguous travel. So, maybe you don’t travel enough in one week for a 7-day travel card to be cheaper, but your trips over the whole month mean that a 30-day travel card would have been best!
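
The same break-even logic can be written down directly. The sketch below uses placeholder prices (chosen so that the cap kicks in at three trips and the 7-day break-even lands near 3.6 capped days, in the spirit of the graph); it is not the actual TfL tariff.

```python
# The break-even logic behind the graph, with placeholder prices (not the
# actual TfL Zone 1-2 tariff).
PAYG_FARE = 2.40        # per trip, placeholder
DAILY_CAP = 7.20        # placeholder
SEVEN_DAY_CARD = 25.80  # placeholder


def daily_cost(trips_in_day):
    """Pay-as-you-go cost for one day, with the daily cap applied."""
    return min(trips_in_day * PAYG_FARE, DAILY_CAP)


def cheapest_weekly_option(trips_per_day):
    """Compare capped pay-as-you-go over a week against a 7-day travelcard."""
    payg_week = sum(daily_cost(t) for t in trips_per_day)
    if SEVEN_DAY_CARD < payg_week:
        return "7-day travelcard", SEVEN_DAY_CARD
    return "pay as you go", payg_week


# e.g. two trips a day on four days of the week:
print(cheapest_weekly_option([2, 2, 2, 2, 0, 0, 0]))  # pay as you go wins here
```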

I then tested a variety of machine learning/data mining algorithms that tried to match travellers to transport tickets based on their travel patterns. One of the most powerful classifiers that we tested is known as a decision tree: in our experiments, it was able to match travellers to the cheapest fare with over 98% accuracy, which translates into substantial savings for everybody. Wonderful!
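
For readers who want to try something similar, here is a minimal scikit-learn sketch of fitting a decision tree to this kind of problem. The input file and feature columns are hypothetical; the feature set and evaluation protocol that we actually used are described in the paper.

```python
# A minimal sketch of fitting a decision tree to match travellers to fares.
# The CSV file and feature columns are illustrative placeholders.
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# One row per traveller per period: simple usage summaries plus the fare that
# would have been cheapest for them (the label we want to predict).
data = pd.read_csv("traveller_periods.csv")  # hypothetical file
features = data[["trips_per_week", "distinct_days", "peak_trip_fraction", "zones_spanned"]]
labels = data["cheapest_fare"]  # e.g. 'payg', '7day', '30day'

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, tree.predict(X_test)))
```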

I’d like to point out that this research is not bad news for Transport for London. They are not to blame for people’s incorrect purchases (if you really want to, you can point at them for having a set of prices that are possibly too complex to digest). Instead, this is an opportunity. The data that they have should be used to make their services better and to encourage Londoners to use TfL as much as possible. In fact, the same data that I’ve been looking at here to help travellers save money could also be used to help TfL make more money, at the same time. Oh, the future of data…

The paper I have written about this has been accepted at the 2011 ACM Conference on Knowledge Discovery and Data Mining, and is now available on my publications page. If you have any questions, please contact me or tweet @ me. The paper’s full abstract is below.

Abstract. As the public transport infrastructure of large cities expands, transport operators are diversifying the range and prices of tickets that can be purchased for travel. However, selecting the best fare for each individual traveller’s needs is a complex process that is left almost completely unaided. By examining the relation between urban mobility and fare purchasing habits in large datasets from London, England’s public transport network, we estimate that travellers in the city cumulatively spend, per year, up to approximately GBP 200 million more than they need to, as a result of purchasing the incorrect fares. We propose to address these incorrect purchases by leveraging the huge volumes of data that travellers create as they move about the city, by providing, to each of them, personalised ticket recommendations based on their estimated future travel patterns. In this work, we explore the viability of building a fare-recommendation system for public transport networks by (a) formalising the problem as two separate prediction problems and (b) evaluating a number of algorithms that aim to match travellers to the best fare. We find that applying data mining techniques to public transport data has the potential to provide travellers with substantial savings.

TfL’s Bus Passenger Survey

After running my last online survey (read about it here) on Londoners’ trip and ticket purchasing habits, I was pointed to the fact that TfL are, themselves, doing a bit of research. They have been handing out short surveys for passengers to fill in on a number of bus routes, asking people about their origins, destinations, and reasons for travel.

It seems that they turned to surveys for many of the same reasons that I did: when you collect Oyster card data, you can see where people are travelling from (and, if they take the tube, where they are going to), but you never learn why they are making those trips. These surveys will give TfL a small glimpse into why people were on those buses at that time of day, and where they were going.

Thanks for the pointer (and the survey), Tamas.

Survey: Londoners! How do you get around town?

Following on from some recent research about how individuals move about town, we are calling out to all Londoners to participate in a survey that we recently put together. The survey asks about two things: your travel habits and how you fund those travel habits. But a good place to start is: why should anyone care about these things?

At face value, topping up your Oyster card with credit or buying a travel card seems simple and mundane. However, we all know that the cost of travel in London is not only always growing – it also depends on who you are (which determines which discounts you are eligible for), where you travel to and from (i.e., which zones), when you travel (e.g., rush hour or daytime), and how frequently you move between places over time periods that span from single days to an entire year (anyone out there ever bought an annual travel card? Not me!).

In other words, there isn’t really a transparent link between how you travel and the cheapest fare for you to be paying. Yes, I know about the daily capping on pay as you go – but if you are going to be travelling every day for seven days in a row, there is no “cap” on your weekly spend – so maybe you should have bought that 7-day pass! How do Londoners make these decisions?

There are some fascinating numbers relating to money and the tube. Over £40,000 was refunded to travellers between January and August 2010 as a result of complaints regarding overcharging. TfL itself estimates that over £300,000 is wasted per day by passengers buying paper tickets instead of opting for the electronic equivalent (see here), and other investigations have revealed that approximately £30 million of travel credit is sitting in the system, idle and unused. These vast sums of wasted money all point to the fact that the decision at the point of purchase is not only uninformed and lacking in transparency, but also incredibly difficult for travellers to reason about when trying to buy the cheapest fare for themselves.

The survey has three parts:

  1. Questions about your travel habits! Where do you start/end your days? How often do you travel? What times do you travel? How consistent are your commutes?
  2. Questions about your topping up/travel card purchase habits! How much do you top-up by? Why and when do you use pay as you go? What travel cards do you buy? Why do you buy them?
  3. An opportunity for you to really help our research and enter a prize draw for a new Apple iPad! All you have to do is give us your Oyster card number and allow us to get your 8-week travel history from Transport for London. How will we use this? Your travel history will give us a direct insight into how groups of Londoners navigate our city. Keep in mind that we don’t want or ask for your name, telephone number, age, gender, or occupation. You are, to that extent, very anonymous (we ask for your email address only for the prize draw). We just care about your Oyster card number and what kind of Oyster card it is – your travel history data will be stored safely and anonymously and will only be used for this research project. If you have any concerns or need clarification, get in touch with me (email or twitter).

So, have I linked to the survey enough already? Please help us and fill it out!

Personalised Public Transport

I’m just on my way back from beautiful Sydney, where I presented a paper called “Mining Public Transport Usage for Personalised Intelligent Transport Systems” (by me, Jon Froehlich, and Licia Capra) at the IEEE 2010 International Conference on Data Mining. The abstract of the paper reads as follows:

Traveller information, route planning, and service updates have become essential components of public transport systems: they help people navigate built environments by providing access to information regarding delays and service disruptions. However, one aspect that these systems invariably lack is a way of tailoring the information they offer in order to provide personalised trip time estimates and relevant notifications to each traveller. Mining each user’s travel history, collected by automated ticketing systems, has the potential to address this gap. In this work, we analyse one such dataset of travel history on the London underground. We then propose and evaluate methods to (a) predict personalised trip times for the system users and (b) rank stations based on future mobility patterns, in order to identify the subset of stations that are of greatest interest to each traveller and thus provide useful travel updates.

This roughly translates to:

Public transport in a large city like London can be chaotic; the information services that were built to support it do not take into consideration who you are when they spit out updates. At the same time, most Londoners now use Oyster cards, which record detailed traces of each person’s movements around the city. The research question we address in the paper is: can Oyster card records be leveraged to build personalised travel information services? Much like the way Amazon says “recommended especially for you” – can we do similar things with travel data? Short answer: yes. Long answer: read the paper. Medium answer: look at the slides below.
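
For a flavour of what “similar things with travel data” might look like in code (and emphatically not the paper's actual methods), here is a minimal pandas sketch of two naive baselines: predicting a trip time as a traveller's own historical mean for that origin-destination pair, and ranking stations by how often each traveller has used them. The column names and the example user are hypothetical.

```python
# Two naive baselines in the spirit of the paper's tasks (not its actual
# methods). The CSV columns and the example user are hypothetical.
import pandas as pd

trips = pd.read_csv("oyster_trips.csv", parse_dates=["entry_time", "exit_time"])
trips["duration_min"] = (trips["exit_time"] - trips["entry_time"]).dt.total_seconds() / 60

# (a) personalised trip-time baseline: per-user mean duration per O-D pair
personal_trip_time = (
    trips.groupby(["user_id", "origin", "destination"])["duration_min"]
    .mean()
    .rename("predicted_minutes")
)

# (b) stations ranked per traveller by how often they appear in that
# traveller's trips (as origin or destination)
visits = (
    trips.melt(id_vars="user_id", value_vars=["origin", "destination"], value_name="station")
    .groupby(["user_id", "station"])
    .size()
    .rename("visits")
    .reset_index()
    .sort_values(["user_id", "visits"], ascending=[True, False])
)

# e.g. the most-visited stations for one (hypothetical) traveller:
print(visits[visits["user_id"] == "user_42"].head())
```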