Update: find the research paper & more details on my website.
One continuing theme in my research is designing and evaluating systems that public transport authorities could build (or allow others to build) if they delved into the treasure trove of data that their fare collection systems create. In this case, there are many similarities between what your Oyster card collects about you and the data that many web sites collect while you browse the web (rating things, clicking on links, adding friends, …). The key difference is that these web sites often use your data to give you something back: Amazon recommends stuff based on what you have purchased, Facebook personalises your news feed based on who you interact with, and Google personalises your search results. Why can’t TfL do the same thing? Why can’t they, based on your trip history, recommend the best travel card for you to buy?
Why care about this question? Have you ever bought a travel card that you didn’t use? Have you ever topped up so many times in one week that you realise you should have bought a travel card? Moreover, how do you know that you aren’t wasting your money buying the wrong transport fares all the time? It is difficult to be 100% sure that you are always getting value for your money.
In some of my recent research, I examined the same (fully anonymised) data set that I had used for a previous study. The data contains a 5% sample of Oyster card records from two 83-day long periods (who touched in/out where & when), along with all the top ups and travel cards that these travellers bought during the sampled time span. This time, I asked: how much money are these travellers ‘wasting’ by buying the incorrect fares?
In this case, “incorrect” simply means that they could have bought a cheaper ticket for the trips that they ended up making. For example, if a person buys a 7-day travel card, but then only travels during two days, it would have been cheaper for that person to use pay as you go. Overall, the program I wrote found that the travellers in this 5% sample are spending just under GBP 2.5 million more than they should. Two points to note are: (a) using this sample of people to approximate the entire population would mean multiplying these figures by 20 and (b) the datasets each cover an 83-day period. In other words, this translates into an estimate that travellers cumulatively spend approximately GBP 200 million per year more than they need to, simply by buying the incorrect tickets.
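The extrapolation can be sketched in a few lines. This is only back-of-the-envelope arithmetic, and it rests on an assumption that the text above does not spell out: that the "just under GBP 2.5 million" figure applies to a single 83-day period. The function and constant names are made up for illustration.

```python
# Back-of-the-envelope scaling from the sample to a yearly, city-wide figure.
# ASSUMPTION: the GBP 2.5 million figure refers to ONE 83-day period.
SAMPLE_OVERSPEND_GBP = 2.5e6  # "just under" this value, so treat it as an upper bound
SAMPLE_FRACTION = 0.05        # the data is a 5% sample of Oyster card records
PERIOD_DAYS = 83              # length of each sampled period


def yearly_overspend(sample_overspend, sample_fraction, period_days, days_per_year=365):
    """Scale a sample's overspend up to the whole population and a full year."""
    population_overspend = sample_overspend / sample_fraction  # same as multiplying by 20
    return population_overspend * days_per_year / period_days


estimate = yearly_overspend(SAMPLE_OVERSPEND_GBP, SAMPLE_FRACTION, PERIOD_DAYS)
print(f"Upper bound: about GBP {estimate / 1e6:.0f} million per year")
```

Because the sample figure is "just under" GBP 2.5 million, the result is best read as an upper bound, which is in line with the rounded estimate of approximately GBP 200 million.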
To understand why this may be going on, consider the graph below.
It shows the relationship between travel, in terms of trips per day/week/month/year (only in Zones 1 and 2), and what the cheapest fare will be for you to use. For example, you may start on pay as you go; if you travel more than 3 times in one day, a peak travel card is cheaper (and you should already get it by being capped); if you travel on more than 3.58 days in a week, a 7-day card is cheaper (and so on). Remember: (1) this is only for Zones 1-2, and (2) there are also relationships between the best price and non-contiguous travel. So, maybe you don’t travel enough in 1 week for a 7-day travel card to be cheaper, but your trips over the whole month mean that a 30-day travel card would have been best!
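A break-even point like the 3.58 days above is just the ratio between the price of the longer ticket and the price of one capped day. Here is a minimal sketch of that calculation. The two prices are placeholders that I picked so that the ratio comes out at 3.58; they are not quoted anywhere in this post, and the sketch assumes that every day you travel hits the daily cap.

```python
# Placeholder Zone 1-2 prices in GBP. These are ASSUMPTIONS for illustration only.
DAILY_CAP = 7.20        # hypothetical cost of one capped day on pay as you go
SEVEN_DAY_CARD = 25.80  # hypothetical cost of a 7-day travel card


def break_even_days(period_price, daily_price):
    """Number of capped days at which the period ticket costs the same as pay as you go."""
    return period_price / daily_price


def cheapest_weekly_option(days_travelled, daily_price=DAILY_CAP, weekly_price=SEVEN_DAY_CARD):
    """Return (fare name, cost) for a week, assuming every travel day reaches the cap."""
    pay_as_you_go = days_travelled * daily_price
    if weekly_price < pay_as_you_go:
        return "7-day travel card", weekly_price
    return "pay as you go", pay_as_you_go


print(round(break_even_days(SEVEN_DAY_CARD, DAILY_CAP), 2))
print(cheapest_weekly_option(2))
print(cheapest_weekly_option(4))
```

With these placeholder prices, two days of travel are cheaper on pay as you go, while four days tip the balance towards the 7-day card. The same ratio works for any pair of tickets, for example a 30-day card against a 7-day card.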
I then tested a variety of machine learning/data mining algorithms that tried to classify users to transport tickets based on their travel patterns. One of the most powerful classifiers that we tested is known as a decision tree: in our experiments, it was able to match travellers to the cheapest fare with over 98% accuracy, which translates into substantial savings for everybody. Wonderful!
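To give a flavour of what a decision tree does, here is a toy sketch of the simplest possible one: a single split (often called a decision stump) on a single feature. This is not the classifier from the paper. Real decision trees split recursively over many features and usually pick splits with measures such as Gini impurity or entropy, whereas this sketch just counts training errors. The data is synthetic and the labels are invented for the example.

```python
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(xs, ys):
    """Find the threshold on one feature that gives the fewest training errors."""
    values = sorted(set(xs))
    best = None
    # Candidate thresholds are the midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_label, right_label = majority(left), majority(right)
        errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]  # (threshold, label at or below it, label above it)


def predict(stump, x):
    """Apply the learned split to a new feature value."""
    threshold, left_label, right_label = stump
    return left_label if x <= threshold else right_label


# Synthetic toy data: days travelled in a week -> cheapest fare (made up).
days = [1, 2, 2, 3, 3, 4, 5, 5, 6, 7]
fares = ["pay as you go"] * 5 + ["7-day travel card"] * 5

stump = fit_stump(days, fares)
print(stump)
print(predict(stump, 2), "|", predict(stump, 6))
```

On this toy data the learned split falls between 3 and 4 days of travel per week, so the stump recovers a rule of the same shape as the break-even thresholds described earlier.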
I’d like to point out that this research is not bad news for Transport for London. They are not to blame for people’s incorrect purchases (if you really want to, you can point at them for having a set of prices that are possibly too complex to digest). Instead, this is an opportunity. The data that they have should be used to make their services better and to encourage Londoners to use TfL as much as possible. In fact, the same data that I’ve been looking at here to help travellers save money could also be used to help TfL make more money, at the same time. Oh, the future of data…
The paper I have written about this has just been accepted at the 2011 ACM Conference on Knowledge Discovery and Data Mining, and is now available on my publications page. If you have any questions, please contact me or tweet @ me. The paper’s full abstract is below.
Abstract. As the public transport infrastructure of large cities expands, transport operators are diversifying the range and prices of tickets that can be purchased for travel. However, selecting the best fare for each individual traveller’s needs is a complex process that is left almost completely unaided. By examining the relation between urban mobility and fare purchasing habits in large datasets from London, England’s public transport network, we estimate that travellers in the city cumulatively spend, per year, up to approximately GBP 200 million more than they need to, as a result of purchasing the incorrect fares. We propose to address these incorrect purchases by leveraging the huge volumes of data that travellers create as they move about the city, by providing, to each of them, personalised ticket recommendations based on their estimated future travel patterns. In this work, we explore the viability of building a fare-recommendation system for public transport networks by (a) formalising the problem as two separate prediction problems and (b) evaluating a number of algorithms that aim to match travellers to the best fare. We find that applying data mining techniques to public transport data has the potential to provide travellers with substantial savings.