
One of the key ingredients of many successful online companies has been rapid iteration and improvement of their services via A/B testing. In essence, you split your users into two (or more) groups, serve them variants of your service (e.g., different algorithms or user interfaces) and then sit back and measure how each group behaves. Once you are operating at web-scale, the sheer number of visitors and the potential for rich data collection can really inform those companies about how they are performing and which ideas work better than others. Size matters the most: (we hope that) once we are dealing with a large enough sample, random assignment will even out all the other confounding factors that may play a role in what we are trying to measure. In other words, the web turns the world into a living laboratory.
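To make the "measure how each group behaves" step concrete, here is a minimal sketch of one common way to compare two variants: a two-proportion z-test on conversion counts. The function name and the counts in the usage note are made up for illustration, and this is only one of several tests a company might use.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of variants A and B.

    conv_a, conv_b: number of users who converted in each group.
    n_a, n_b: number of users assigned to each group.
    Returns (z, two_sided_p_value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, `two_proportion_z_test(200, 10000, 260, 10000)` compares a hypothetical 2.0% conversion rate against 2.6%. With groups this large, a difference of that size is unlikely to be chance alone, which is the sense in which size matters.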

Unfortunately, this cool technique is not readily available for the physical world. Imagine, for example, trying to evaluate a policy that aims to reduce the number of children being killed by cars. How do you split your users (and ensure randomization)? How do you cope with the fact that people will already behave differently in different geographic areas? (Moreover, how do you come to terms with the ethical questions?) The sad conclusion seems to be that, when we intervene in the physical world and observe a change, it remains difficult to speak of anything more than a correlation between what we did and the behavior we (hope to have) caused.

Luckily, alternative approaches exist. The equivalent of an A/B test when you can’t randomize your sample is a quasi-experiment. In a recent paper, we adopted this perspective in order to examine the impact of changing the user access policy to London’s (shared) Boris bikes. By splitting our data around the time of the change, carefully cleaning it, and ensuring that we maintain a large (temporal and spatial) scale of data, we examined how sensor readings from bicycle stations can be used to observe how the policy change propagated across time and the city. Interestingly, the data showed us that this change in policy played out differently at different stations: some stations that people used to travel to in the morning and leave from in the evening flipped their pattern completely.
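The "split the data around the time of the change" step can be sketched roughly as follows. This is not the code from the paper: the reading format (a timestamp plus a count of available bikes), the function names, and the change date in the usage note are all assumptions made for illustration. A real analysis would also need the cleaning and scale checks described above.

```python
from datetime import datetime
from statistics import mean

def split_around(readings, change_time):
    """Split (timestamp, bikes_available) readings into before/after lists."""
    before = [r for r in readings if r[0] < change_time]
    after = [r for r in readings if r[0] >= change_time]
    return before, after

def mean_by_hour(readings):
    """Average number of available bikes for each hour of the day."""
    by_hour = {}
    for timestamp, bikes in readings:
        by_hour.setdefault(timestamp.hour, []).append(bikes)
    return {hour: mean(values) for hour, values in by_hour.items()}

def profile_shift(readings, change_time):
    """Per-hour change in mean availability (after minus before),
    for the hours that appear on both sides of the change."""
    before, after = split_around(readings, change_time)
    prof_before, prof_after = mean_by_hour(before), mean_by_hour(after)
    return {h: prof_after[h] - prof_before[h]
            for h in prof_before if h in prof_after}
```

Usage, with made-up readings for a single station and a placeholder change date: calling `profile_shift(readings, datetime(2011, 1, 1))` returns a dictionary keyed by hour. A station whose pattern "flipped" would show a large negative shift at one end of the day and a large positive one at the other.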

The rest of the details are in the paper (reference below). However, since this is the year that politicians are taking on coding, maybe they should also start taking a few hints from web companies and start running their own A/B tests as well.

N. Lathia, S. Ahmed, L. Capra. Measuring the Impact of Opening the London Shared Bicycle Scheme to Casual Users. (To appear) In Elsevier Transportation Research Part C, accepted December 2011.

The data mining community has a soft spot for competitions.

At the ACM KDD 2011 conference (San Diego, August 21-24 2011), the results of a competition on predicting music ratings were announced and a panel called “Lessons Learned from Contests in Data Mining” was held. The panel featured familiar faces in the data mining competition world, including Jeremy from Kaggle, whom I met at ICDM 2010 in Sydney. If you haven’t heard of Kaggle, they are an exciting new start-up that offers a platform to easily run data mining competitions, and they already have a hot list of ongoing/past competitions. Organising a panel is no easy task; I have seen some that are, essentially, a set of advertisements for each panelist (or their employer). This one worked out quite well: so well, in fact, that I don’t really want to write about the panel itself, but about some of the thoughts on data competitions that I came away with.

Competitions fascinate me. It looks like they are a great untapped resource of human potential. They are exciting, not only in and of themselves, but for how people react to them and behave while participating in them. But what is their place in the grander scheme of things? Here are some points for each side of the table:

Competitions: a key piece of the future of data mining

Competitions are open; they are democratic. They allow a PhD student in glaciology to outperform state-of-the-art algorithms for detecting dark matter. The wisdom of the crowds only appears when the crowd is diversified; competitions are the portal to diversity. The winner may get the money, but everybody learns a lot about data mining, prediction, and algorithms. Not to mention that the data set attracts remarkable attention to the field that it comes from (says the guy who started reading about recommender systems when the Netflix prize was first announced).

Competitions allow you to flex your algorithmic muscle in front of people who care. Honestly, I wouldn’t mind never having to sit through another paper’s presentation that concludes with “so we got a 5.7% improvement over the baseline.” Put it in a competition! Show that your 5.7% is worthwhile! I also remember a workshop paper (Future of IR Evaluation @ SIGIR 2009) that showed that this kind of research has actually led to little to no improvement over time (since researchers seem to simply pick a baseline that suits their needs). Competitions mean a quicker turnaround time. They get rid of the (flawed) peer review system: everybody has the same data, the same benchmark, the same metric of progress, the one leaderboard. The state-of-the-art doesn’t inch forward over the course of 3 to 6 month periods; it moves forward in a matter of days.

The benefit to science extends beyond the aspect of quick iteration and performance improvement. Competitions inherently separate problems with an engineering core (build the best bridge) from those that are fundamental science (tell me why the bridge stands). I’m looking forward to the day that scientists write papers about why their classifier outperforms the state of the art, rather than just telling me that it does.

Competitions: overlooking what is really important in data mining

Perhaps the biggest (unwitting) criticism of data mining competitions came from Peter Norvig’s keynote at the conference. While he talked about big data research at Google, I remember his saying something along the lines of “once you have enough data… you are just counting.” Improved statistical methods do not matter as soon as data is abundant enough; in fact, algorithms that seem poor with small data sets often outperform the “better” ones when you feed in more data. How does this hurt competitions? Well, for now, the data sets that these competitions release are small enough to allow people to participate without owning a warehouse of servers. Is all that effort dedicated to designing better algorithms put to good use then?

Competitions have also recently been their own worst enemy. The outcome of each competition seems to be the same: we won with an ensemble of hundreds of predictors. The winners are rewarded for being members of huge teams that can build a massive blender and spit out a delicious smoothie of predictions. The most interesting insights about the data itself do not seem to be secret keys to victory.

The KDD Cup this year was also a great example of a hotly contested criticism of competitions: due to privacy concerns (perhaps relating to the history of the forgone Netflix Prize Part II?), the actual artists in the data set were not identified. They were just numerical identifiers. What is the point of designing an algorithm if you don’t know what the data means?

Looking back

Looking over my ramblings thus far, it seems that my biggest worry about competitions is that they do not reward insight or understanding, at least not yet. The nature of a competition (particularly those that do not allow researchers to walk away with the data) will prevent researchers from asking new questions that differ from the competition’s goal. Competitions are also difficult to design and conduct: who knows if the metric is even the right one, or the target is appropriate?

Some say that, with data, “you may be spot on about a problem, but the solution doesn’t appear magically out of the data.” I see where this is going, but am not yet convinced; the (harder?) challenge of identifying the problem, finding what data should be used to solve it, and deciding how the solution should be measured, is not solved.

There is a great saying that I’ve heard a few times from Italian friends (of course, relating to football/soccer) that is very appropriate here: “in football, it’s not the team that plays better that wins, it’s the team that scores the most goals.” Is the future of data mining, and more broadly, data science, going to be all about scoring an RMSE-goal, or is there a slice of the pie that competitions do not cover?

I’ve just finished a poster for my paper (“Mining Mobility Data to Minimise Travellers’ Spending on Public Transport”) that will shortly appear at the upcoming ACM SIGKDD Conference on Knowledge Discovery and Data Mining. You can find it below. I’m looking forward to going to San Diego for the conference next week.

ACM KDD 2011 Poster

“Big data” is a great catch phrase online. Every time I hear it used, it seems that what used to be big data is now mediocre-sized data, and what used to be medium-sized is laughably small. There is certainly a lot of exciting research going on in this space. One of the things that bugs me, though, is that a lot of effort is being pushed into what can be done with this data, with (seemingly) less thought being given to how this data came about in the first place.

I’m very much to blame for this too. The last two papers I wrote using data from Transport for London’s (TfL) Oyster cards (IEEE ICDM 2010 blog post, and ACM KDD 2011 blog post) focused on what systems could be built to help travellers get around town and save money. So, when designing a web survey about travel habits, I thought it would be nice to step away from this frame of mind and look instead at the how and why questions, rather than the what next question.

If you are familiar with TfL’s system, you’ll know that you can go and request an 8-week history of your data at any time (after 8 weeks, they anonymize your data). 85 of the respondents to my survey answered all my questions as well as giving me their Oyster card numbers and permission to ask TfL for their history; TfL was very helpful in getting the data to me.

The result was 3 datasets: (1) subjective answers to a range of travel questions, (2) transit records for the same individuals, and (3) a large-scale dataset of anonymised travel records that I have used before. Lots of fun to be had here; the resulting paper has just been accepted to ACM Ubicomp (it’s on my publications page). Here is a short pitch of what it contains:

Data as a Mirror: Our Self-Image
First, I focused on datasets (1) and (2). How do they compare to each other? What do travellers think they do, and what do their Oyster cards say that they have been doing? Some of the results include:

  • Estimating travel needs: We are bad at estimating how often we will use public transport. However, we are better at estimating how many trips we will tend to take on weekends, even though we tend to rate our weekend trips as highly irregular. We are better at estimating what time we will be travelling (rather than how often we will do so).
  • Hidden trips: Over half of the respondents made a claim like “I never use the bus during weekdays” but actually did. Moreover, nearly 65% of respondents’ trips were to/from stations that they did not list as typical.
  • Hidden savings: We asked how much people typically top up by (each time they go to a machine to buy credit) and then checked out how much they actually do. About 20% of the respondents over-estimated their typical transaction.

Data as a Mirror: the Effectiveness of Public Policy
Discovering that there are differences between how we perceive and actually use systems is becoming more commonplace. The second half of this work, instead, turned to design vs. usage: we looked at the behaviours that transport operators are trying to elicit from their passengers, and whether they actually appear in the data:

  • Do peak-time fares encourage people to avoid peak-hour travel? It seems not; instead, it guides their purchasing behaviour.
  • Do travel cards and free travel encourage greater usage? It seems like this one works! When people have travel cards (i.e., they don’t have to pay on a per-trip basis) or have free travel (by reaching a daily payment limit), they end up using buses more often. Is this an incentive for people to be lazy?
  • Do students buy discounted fares? Unfortunately, most of them don’t. One respondent explained to us that this is because only the long-term travel cards are discounted, and students perceive their travel to be highly irregular. Perhaps the financial incentives are not well placed?

Next step: will people change once the mirror of data is put up to their face?

Reference:
[pdf] N. Lathia, L. Capra. How Smart is Your Smartcard? Measuring Travel Behaviours, Perceptions, and Incentives. In 13th ACM International Conference on Ubiquitous Computing. Beijing, China. September 17-21, 2011

The most exciting mobile apps right now are the location-based ones. They aim to connect us to each other in a way that the web never will: graying the borders between virtual and physical interaction. And it’s become a hotly contested arena: Foursquare (who announced that they have hit 10 million users), Gowalla, Google Places/Hotpot, Facebook Places... everyone wants to know where you are.

One of the nice things about these systems is that there are plenty of reasons why you may be interested in using them. In fact, at an upcoming ACM RecSys workshop that I’m organizing, we’ve invited Janne Lindqvist to talk to us about his research about exactly why people check-in through these services (see a pdf of recent ACM CHI paper here). If you are a user of these systems yourself, you know that there are strong elements of social signalling, connecting with friends, playing games and competition going on (I even held the mayorship of UCL’s Computer Science department on foursquare for a while); as a researcher, you may be interested to know about how these check-ins reflect the geo-spatial structure of our cities (pdf).

One of the key technologies that fits directly into all of these platforms is a recommender system. Some kind of way for our mobile phones to take all the data we are giving them and not only connect us to each other, but help us discover places around us (if you think this is scary instead of exciting, read what I have to say about that). Google Hotpot (blog with video post) and Foursquare (detailed blog post) are already breaking ground in this area. However, I still think that none of the mobile recommender systems are getting it quite right. Why not? Here’s an unordered list of unsolved problems that I’ve encountered. Can you think of any more?

  1. The Radius of Recommendation. Most systems seem to be filtering results based on how far away they are from you (in some cases, this is a setting that you can manually tune). Let’s assume that you are not going to play with this parameter and leave it set at 2km. When you are looking for live music, a band playing 1.9km away will be recommended, but your favorite band, which is playing 2.1km away, will be filtered out. In other words, distance (seems to) matter only until preference is high enough. Similarly, distance-filtering means that I can’t look for tomorrow’s lunch recommendations while I’m at home, since I won’t be near home when I eat lunch. Should these systems also take into account your regular travel habits?
  2. The Elusive Concept of Context. Figuring out what people are looking to do is difficult. To that end, none of the current mobile systems fit the true model of a recommender system (as a set of “query-less results”); they still need people to say what they are looking for. Is the time of day (and historical habits) not being leveraged enough?
  3. The Ontology of Location. I just checked the Explore tab on Foursquare. Near UCL, there are three trending train stations: Euston, St. Pancras, and King’s Cross. In this case, the social signalling aspect of these systems (“I’m here!”) is bleeding into the recommendations (“Where should I go?”).
  4. Discovery vs. Re-finding (“you’ve been here”). There are some cases where re-finding great places is awesome (see this tool). There are other cases where it borders on banal and obvious. Are these systems reasoning on users’ familiarity with the locations they frequent?
  5. Venues vs. Events. Venues are the stage for events. I often will not want to go somewhere unless there is something special happening there. However, check-ins and ratings reflect what is happening now, and not what I can plan for. How can these systems discover what is relevant and interesting before it happens?
  6. Venues vs. Friends. Similarly to above, a venue that is trending will not be as appealing as the one where my best friend is having a pint. Unless it’s trending for such a good reason that I’ll be convinced to drag my friend away from his pint.
  7. Data Noise. Near my flat, I once saw a recommendation for another person’s bed. Nuff said.
  8. Collecting Data. There is a battle going on as to what the right way to collect data is. Should it be binary/check-in only? As above, this taints our data with social-signalling noise. Should it be on a 5* scale? For many venues that does not make sense. Should it be passive/GPS only? Then we won’t know where people actually are (not to mention the privacy concerns). Should it be based on prompts (“check in at?”)? That quickly becomes annoying.
  9. Online vs. Mobile Information Needs. It looks like Google is leveraging my search history when it recommends places. This does make sense (somewhat): I will often google a place before I go there. But my online information needs often do not match what I need when I’m on the go. For example, I will soon be going to San Diego for ACM KDD 2011. When I checked my Places recommendations, they asked me to rate the San Diego airport. Along those lines, if you are familiar with my research, you won’t be surprised to hear that I’ve sometimes seen London tube stations there as well. The relevance of places that I search for online doesn’t seem to match places that I would like to discover when I’m out in the real world.
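To illustrate the first problem in the list (the radius of recommendation), here is a small sketch contrasting a hard distance cut-off with a score that trades distance off against preference. None of this reflects how Foursquare or Google actually rank venues: the exponential decay, the `decay_km` parameter, and the preference values are assumptions chosen only to show the idea.

```python
import math

def hard_filter(venues, radius_km):
    """Keep only venues within a fixed radius, ignoring preference."""
    return [v for v in venues if v["distance_km"] <= radius_km]

def soft_rank(venues, decay_km):
    """Rank venues by preference discounted smoothly by distance."""
    def score(venue):
        return venue["preference"] * math.exp(-venue["distance_km"] / decay_km)
    return sorted(venues, key=score, reverse=True)

# Hypothetical venues: preference is a made-up value in [0, 1].
venues = [
    {"name": "local band", "distance_km": 1.9, "preference": 0.4},
    {"name": "favourite band", "distance_km": 2.1, "preference": 0.9},
]
```

With a 2km cut-off, `hard_filter(venues, 2.0)` drops the favourite band entirely, while `soft_rank(venues, 2.0)` puts it first because the extra 200m costs less than the difference in preference.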

Recommender systems are very much at home online, where everything is a few clicks away. It seems that the best techniques for mobile recommendation will not be simply lifted from the web; they will be able to balance between distance and preference, understand context, in order to help uncover the gems that my city is hiding. Exciting!

On June 12, I attended the First Workshop on Pervasive and Urban Applications (PURBA) @ Pervasive 2011 (San Francisco). The two keywords that this workshop was designed around (pervasive & urban) explain why I was there: for anyone researching and building technology that they hope to see used in the wild someday, there are few things more exciting than cities and the people who move around them.

Papers & Presentations

Here is a list of the papers that were presented, along with a few notes. My notes do no justice to the papers themselves, which I suggest you read for full details. I’ve grouped the papers by theme.

Applications

  • “Identifying and Understanding Urban Sport Areas” [pdf]. This paper focused on sport tracking applications (in particular: mysportpals.com). The data they have can be used to visualize where and how people go for runs within the confines of urban areas. One of the attendees asked an interesting question: how could this tool be used to help people discover places to work out where there are no crowds? Of course, a user of this kind could simply go running where the map says he should not go. However, more could be done: what about the relationship between the design of the built environment and its usage? The app could, for example, report crowding levels as well as locations (e.g., parks) that were built for urban sport but are not used.
  • “A Centralized Real-time Advanced Driver Assistance System Based on Smartphones” Andrea, from Milan’s Politecnico, presented his work that is part of the EC SAFE SPOT project. The core idea is that by tracking drivers with their mobile phones, accidents can be avoided with timely alerts.
  • “Driving Innovation in Urban Computing With a Community Testbed.” [pdf] The presentation began with an interesting comparison: physicists have the Large Hadron Collider for their experiments, so why don’t ubicomp researchers have a similar testbed for all their experiments? This is exactly what they are building in Oulu, Finland: a unified testbed for urban computing. While this is a great effort, I wonder whether the analogy of a hadron collider is appropriate: physicists are looking for universal laws, but shouldn’t we also cater for the cultural, geographical, and design differences between urban areas?
  • “Fu Chi: A Mobile Communication System for Philadelphia’s Chinatown” [more details] The name Fu Chi stands for “Future Chinatown.” In this case, the Chinatown is the one in Philadelphia, where the authors are testing a tool to help residents overcome communication barriers between the locals and the community around them.

Analyzing Mobile Phone Data. I don’t know if I heard any phrase more often than “data from mobile phones” throughout the day. The following papers inferred land usage (for example: work and recreational areas) from mobile phone records. Researchers also clustered users’ call patterns, and then looked at the locations of a particular cluster of users (inferred to be students) for clues that confirm who these people are (e.g., are there schools near the towers?).

  • “Robust Land Use Characterization of Urban Landscapes using Cell Phone Data” [pdf]. Presented by Enrique (from Telefonica I+D Madrid)
  • “Clustering Mobile Call Detail Records to Find Usage Groups.” [pdf] from folks from AT&T.
  • “Exploring the Relationship between Land Use and Mobile Phone Usage: Analysis of a Dataset for the Amsterdam Metropolitan Region.” A collaboration between MIT’s Senseable, University of Salzburg, Vrije University, and the CurrentCity foundation.
  • “Spatial structure and Dynamics of Urban Communities” by Fergal from the National Centre of Geocomputation.

Taxi/Travel Data. Data from taxis sounds like buckets of fun. Not only can you use it to explore how people move, but there is interesting potential for applications, such as helping drivers find their next ride and passengers find their next taxi. Can taxi traces reveal the different roles that taxis play in urban mobility? Do trip duration and income follow any well-known distribution? Can we define better strategies from taxi traces, resulting in higher mobility? Two papers at the workshop were looking into these ideas:

  • “Predicting Urban Human Mobility Using Large-Scale Taxi Traces”
  • “Exploratory Study of Urban Flow using Taxi Traces”

And, of course, the rest of the papers:

  • “Sensing The Urban: Using location-based social network data in urban analysis” [best student paper pdf] Anil, from UCL CASA, presented his work analyzing and comparing the Foursquare check-in data that he has crawled from three cities (New York, London, Paris).
  • “Digital Archiving of People Flow using Person Trip Data of Developing Cities” A practical method for re-constructing people’s flow from fragmented mobility data.
  • “Bridging the Social and Physical Sensing Worlds: Detecting Coverage Gaps and Improving Sensor Networks” Can we use social data to understand the city? Most urban data is based on physical sensors; most online data is highly rich and contextual. Can the two be used to complement one another?
  • “Autopoiesic Content: A Conceptual Model for Enabling Situated Self-generative Content for Public Displays” Self-regenerating content for LCD panel systems.
  • “Enabling Real-time City Sensing with Kernel Stream Oracles and MapReduce” The authors talked about using the power of Map-Reduce to process large-scale sensor data. Check out the i2maps project.
  • “Influence of User Choice on Perception of Wireless Connection Genuineness and Security” Researchers are investigating issues surrounding trust in pervasive computing. A great example was the link to the way the Internet works in the hotel: how do we know to trust a piece of paper put on a table?

Wordle

Below is a wordle (made by Martin Wirz) of the workshop proceedings. Click on the thumbnail for the large version:

Wordle: PURBA Workshop Proceedings Wordle

I’m remotely following the #chi2011 hashtag on twitter (tweets from the ACM CHI Conference on Human Factors in Computing Systems). Here are some of the things that have caught my eye:

Workshops

  • The 2nd International Workshop on Persuasion, Influence, Nudge & Coercion through mobile devices (link)
  • Workshop on Crowd sourcing and Human Computation (link)
    • Including this paper called “Why I Hate Mechanical Turk Research (And Workshops)”
  • Workshop: Data Collection By the People, For the People (link)
  • Feminism and Interaction Design Workshop (link)

Research, Papers, Fun

  • Motivating Reductions in Domestic Energy Consumption Using Social Networks (link)
  • The Trouble With Social Computing Systems Research (pdf)
  • Make music with your brain & heart beat (video)
  • How do you feel anger, joy, sadness, love? (site)
  • Analytics of CHI 2011 tweets (link)
  • The Social Media Classroom (link)
  • CHI Sustainability Wiki (link)
  • The Tiramisu-Transit App Field trial is presented. This is very close to what I’m building right now!
  • We’ve done all this research… so now what? (slideshare)
  • Playing with cats & technology (youtube)

Twitter – don’t we all love it?

  • The program has sessions on:
    • “Twitter Systems” (Monday 11AM, 4 papers)
    • “Microblogging Behavior” (Tuesday 11AM, 4 papers)
  • Here is a paper by Haewoon Kwak (who I have remotely collaborated with on a SIGIR 2009 paper) looking at unfollow dynamics… and another one that also looks at unfollowing, from the sociological perspective.

Other Companies

  • Microsoft at CHI (link)
  • Google at CHI (link)
  • (Interesting comment on) Apple at CHI (tweet)
  • CHI 2011 also clashes with Google I/O.

Post-Conference

  • A Google document has been set up for people to add links to notes/posts/slides.
  • A blog post summarising the RepliCHI panel.
  • The “extended-dance remix” of the final keynote (by @ethanZ) is available here.

This post will be updated as the tweets roll in. The conference closed with a keynote by Ethan Zuckerman: a detailed blog post of his talk is here. And here is the link to CHI 2012!

Update: find the research paper & more details on my website.

One of the continuing themes in my research is about designing and evaluating systems that public transport authorities could build (or allow others to build) if they delved into the treasure of data that their fare collection systems create. In this case, there are many similarities between what your Oyster card collects about you and the data that many web sites collect while you browse the web (rate things, click on links, add friends,…). The key difference is that these web sites often use your data to give you something back: Amazon recommends stuff based on what you have purchased, Facebook personalises your news feed based on who you interact with, and Google personalises your search results. Why can’t TfL do the same thing? Why can’t they, based on your trip history, recommend to you what the best travel card is to buy? 

Why care about this question? Have you ever bought a travel card that you didn’t use? Have you ever topped up so many times in one week that you realise you should have bought a travel card? Moreover, how do you know that you aren’t wasting your money buying the wrong transport fares all the time? It is difficult to be 100% sure that you are always getting value for your money.

In some of my recent research, I examined the same (fully anonymised) data set that I had used for a previous study. The data contains a 5% sample of Oyster card records from two 83-day long periods (who touched in/out where & when), along with all the top ups and travel cards that these travellers bought during the sampled time span. This time, I asked: how much money are these travellers ‘wasting’ by buying the incorrect fares?

In this case, “incorrect” simply means that they could have bought a cheaper ticket for the trips that they ended up making. For example, if a person buys a 7-day travel card, but then only travels during two days, it would have been cheaper for that person to use pay as you go. Overall, the program I wrote found that this 5% sample of travellers is spending just under GBP 2.5 million more than they should. Two points to note are: (a) using this sample of people to approximate the entire population would mean multiplying these figures by 20 and (b) the datasets each cover an 83-day period. In other words, this translates into an estimate that travellers cumulatively spend approximately GBP 200 million per year more than they need to, by simply buying the incorrect tickets.
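The extrapolation from the sample to a yearly, city-wide figure can be checked with a few lines of arithmetic. This sketch assumes that the GBP 2.5 million figure refers to a single 83-day period (the text above does not say so explicitly) and that overspending scales linearly with both population and time. The function name is mine.

```python
def annual_overspend(sample_overspend_gbp, sample_fraction, period_days):
    """Scale a sample's overspend to the whole population and a full year."""
    # A 5% sample means multiplying by 1 / 0.05 = 20.
    population_overspend = sample_overspend_gbp / sample_fraction
    # Scale from the observed period to 365 days.
    return population_overspend * 365 / period_days

estimate = annual_overspend(2.5e6, 0.05, 83)
print(f"Estimated yearly overspend: GBP {estimate / 1e6:.0f} million")
```

Using the upper bound of GBP 2.5 million gives a little over GBP 200 million per year. Since the sample figure is "just under" 2.5 million, the result is consistent with the approximate estimate quoted above.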

To understand why this may be going on, consider the graph below.
It shows the relationship between travel, in terms of trips per day/week/month/year (only in Zones 1 and 2) and what the cheapest fare will be for you to use. For example, you may start on pay as you go; if you travel more than 3 times in one day, a peak travel card is cheaper (and you should already get it by being capped); if you travel more than 3.58 days in a week, a 7-day card is cheaper (and so on). Remember: (1) this is only for Zones 1-2, and (2) there are also relationships between the best price and non-contiguous travel. So, maybe you don’t travel enough in 1 week for a 7-day travel card to be cheaper, but your trips over the whole month mean that a 30-day travel card would have been best!
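The break-even logic described above can be sketched for a single week as follows. The prices are placeholders in pence, not figures taken from TfL: they were chosen only so that the break-even points match the ones quoted in the text (the daily cap is reached on the fourth trip, and the 7-day card costs as much as 3.58 capped days). Function names are mine.

```python
# Hypothetical prices in pence (placeholders, not official fares).
SINGLE_FARE = 180
DAILY_CAP = 720
WEEKLY_CARD = 2580

def payg_cost(trips_per_day, single_fare=SINGLE_FARE, daily_cap=DAILY_CAP):
    """Pay-as-you-go cost for a week, with each day's spend capped."""
    return sum(min(trips * single_fare, daily_cap) for trips in trips_per_day)

def cheapest_week(trips_per_day, weekly_card=WEEKLY_CARD):
    """Return (fare_name, cost_in_pence) for the cheaper of the two options."""
    payg = payg_cost(trips_per_day)
    if weekly_card < payg:
        return ("7-day travel card", weekly_card)
    return ("pay as you go", payg)

# Break-even in capped days per week: 2580 / 720 is roughly 3.58.
break_even_days = WEEKLY_CARD / DAILY_CAP
```

For instance, `cheapest_week([4, 4, 4, 0, 0, 0, 0])` stays on pay as you go, while adding a fourth capped day tips the balance towards the 7-day card. The sketch ignores zones beyond 1-2, off-peak pricing, and the non-contiguous (monthly) case mentioned above.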

I then tested a variety of machine learning/data mining algorithms that tried to match users to transport tickets based on their travel patterns. One of the most powerful classifiers that I tested is known as a decision tree: in my experiments, it was able to match travellers to the cheapest fare with over 98% accuracy, which translates into substantial savings for everybody. Wonderful!
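For readers who have not met decision trees, here is a toy, from-scratch sketch of the idea: the tree repeatedly picks the feature threshold that best separates the labels. It is not the implementation, feature set, or data used in the paper. The two features (days travelled per week, trips per travel day), the synthetic travellers, and the placeholder prices used to label them are all assumptions for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature, threshold) that most reduces weighted Gini,
    or None when no split improves on the unsplit node."""
    best, best_score = None, gini(labels)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (feature, threshold), score
    return best

def build_tree(rows, labels):
    """Grow a tree recursively; leaves are labels, inner nodes are tuples."""
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    feature, threshold = split
    left = [(r, y) for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[feature] > threshold]
    return (feature, threshold,
            build_tree([r for r, _ in left], [y for _, y in left]),
            build_tree([r for r, _ in right], [y for _, y in right]))

def predict(tree, row):
    """Follow the thresholds down to a leaf label."""
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if row[feature] <= threshold else right
    return tree

# Synthetic travellers labelled with placeholder prices in pence (hypothetical).
SINGLE, CAP, WEEKLY = 180, 720, 2580

def cheapest_label(days, trips):
    payg = days * min(trips * SINGLE, CAP)
    return "7-day card" if WEEKLY < payg else "pay as you go"

rows = [(days, trips) for days in range(1, 8) for trips in range(1, 7)]
labels = [cheapest_label(days, trips) for days, trips in rows]
tree = build_tree(rows, labels)
```

On this synthetic grid, `predict(tree, (5, 4))` returns the 7-day card and `predict(tree, (2, 6))` returns pay as you go. Fitting such clean, noise-free data perfectly says nothing about the accuracy reported above, which came from real travel histories where future trips had to be estimated.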

I’d like to point out that this research is not bad news for Transport for London. They are not to blame for people’s incorrect purchases (if you really want to, you can point at them for having a set of prices that are possibly too complex to digest). Instead, this is an opportunity. The data that they have should be used to make their services better and to encourage Londoners to use TfL as much as possible. In fact, the same data that I’ve been looking at here to help travellers save money could also be used to help TfL make more money, at the same time. Oh, the future of data…

The paper I have written about this has just been accepted at the 2011 ACM Conference on Knowledge Discovery and Data Mining, and is now available on my publications page. If you have any questions, please contact me or tweet @ me. The paper’s full abstract is below.

Abstract. As the public transport infrastructure of large cities expands, transport operators are diversifying the range and prices of tickets that can be purchased for travel. However, selecting the best fare for each individual traveller’s needs is a complex process that is left almost completely unaided. By examining the relation between urban mobility and fare purchasing habits in large datasets from London, England’s public transport network, we estimate that travellers in the city cumulatively spend, per year, up to approximately GBP 200 million more than they need to, as a result of purchasing the incorrect fares. We propose to address these incorrect purchases by leveraging the huge volumes of data that travellers create as they move about the city, by providing, to each of them, personalised ticket recommendations based on their estimated future travel patterns. In this work, we explore the viability of building a fare-recommendation system for public transport networks by (a) formalising the problem as two separate prediction problems and (b) evaluating a number of algorithms that aim to match travellers to the best fare. We find that applying data mining techniques to public transport data has the potential to provide travellers with substantial savings.

After a number of emails in the past few days, I have set up a Google group called “Bike Sharing Research and Practice” for everyone who is interested in shared bicycle systems. If that includes you, please join!

The (pre-group) list was also recently sharing a number of research papers related to bike sharing systems around the world. I’ve been collecting the links to a number of them; here they are (some are missing reference details):

Papers:

  • [pdf] J. Froehlich, J. Neumann, N. Oliver. Sensing and Predicting the Pulse of the City through Shared Bicycling. In 21st International Joint Conference on Artificial Intelligence (IJCAI), July 2009.
  • [pdf] J. Froehlich, J. Neumann, N. Oliver. Measuring the Pulse of the City through Shared Bicycle Programs. In International Workshop on Urban, Community and Social Applications of Networked Sensing Systems (UrbanSense), November 2008.
  • [arxiv] A. Kaltenbrunner, R. Meza, J. Grivolla, J. Codina, R. Banchs. Bicycle Cycles and Mobility Patterns: Exploring and Characterizing Data from A Community Bicycle Program.
  • [pdf] A. Kaltenbrunner, R. Meza, J. Grivolla, J. Codina, R. Banchs. Urban Cycles and Mobility Patterns: Exploring and Predicting Trends in a Bicycle-based Public Transport System. In Pervasive and Mobile Computing, Vol. 6 Issue 4, August 2010.
  • R. Nair, E. Miller-Hooks, R. C. Hampshire, A. Busic. Large-Scale Bicycle Sharing Systems: Analysis of Velib. (submitted for publication; there are further papers in review/under submission on Rahul Nair’s website).
  • [pdf] M. Benchimol, P. Benchimol, B. Chappert, A. De la Taille, F. Laroche, F. Meunier, L. Robinet. Balancing the Stations of a Self-Service “Bike Hire” System.
  • [link] P. Borgnat, C. Robardet, J.-B. Rouquier, E. Fleury, P. Abry, P. Flandrin. Shared Bicycles in a City: A Signal Processing and Data Analysis Perspective. Accepted to Advances in Complex Systems, December 2010.
  • [link] P. Borgnat, E. Fleury, C. Robardet, A. Scherrer. Spatial Analysis of Dynamic Movements of Vélo’v, Lyon’s Shared Bicycle Program. In European Conference on Complex Systems (ECCS’09), Warwick University (UK), 21-25 September 2009.
  • [link] P. Borgnat, P. Abry, P. Flandrin, J.-B. Rouquier. Studying Lyon’s Vélo’V: A Statistical Cyclic Model. In European Conference on Complex Systems (ECCS’09), Warwick University (UK), 21-25 September 2009.
  • [pdf] D. Chemla, F. Meunier, R. W. Calvo. Balancing a Bike-Sharing System with Multiple Vehicles. 2009.
  • [pdf] P. deMaio. Bike-sharing: History, Impacts, Models of Provision, and Future. Journal of Public Transportation 12 41–56.
  • R. C. Hampshire, A. Busic. A Stochastic Model and Optimization for Bike Sharing Programs. (working paper, 2011)
  • P. Jensen, J. Rouquier, N. Ovtracht, C. Robardet. Characterizing the Speed and Paths of Shared Bicycles in Lyon. 2010. Transportation Research Part D: Transport and Environment 15.
  • G. R. Krykewycz, J. Rocks, B. Bonnette, F. Jaskiewicz. Defining a Primary Market and Estimating Demand for Major Bicycle-Sharing Program in Philadelphia, Pennsylvania. Transportation Research Record: Journal of the Transportation Research Board 2143 117–124. 2011.
  • J. Lin, T. Yang. Strategic Design of Public Bicycle Sharing Systems with Service Level Constraints. Transportation Research Part E: Logistics and Transportation Review 47(2) 284 – 294. 2011.
  • OBIS (project). Analysis of Existing Bike Sharing Systems in European Cities. 2011.
  • T. Raviv, M. Tzur, I. A. Forma. Static Repositioning in a Bike-Sharing System: Models and Solution Approaches. 2010.
  • S. Shaheen, S. Guzman, H. Zhang. Bikesharing in Europe, the Americas, and Asia: Past, Present, and Future. Transportation Research Record: Journal of the Transportation Research Board 2143 159–167. 2010.
  • [pdf] I. Wang, S. Jia, C. Chow, T. Mabel. Bicycle-sharing system: Deployment, Utilization and the Value of Re-Distribution.
  • N. Lathia, S. Ahmed, L. Capra. Measuring the Impact of Opening the London Shared Bicycle Scheme to Casual Users. In Transportation Research Part C, accepted December 2011.

Feasibility Studies:

Visualisation:

  • Bike Share visualisations/maps by Oliver O’Brien (site)

TfL Survey

After running my last online survey (read about it here) on Londoners’ trip and ticket purchasing habits, I was pointed to the fact that TfL are, themselves, doing a bit of research. They have been passing out short surveys for passengers to fill in on a number of bus routes, asking people for their origins, destinations, and reasons for travel.

It seems that they turned to surveys for many of the same reasons that I did: when you collect Oyster card data, you can see where people are travelling from (and, if they take the tube, where they are going to), but never understand why they are doing it. These surveys will help them get a small glimpse into why people were on buses at that time of day, and where they were going.

Thanks for the pointer (and the survey), Tamas.
