On the Unfortunate Naming of ‘Data Science’

ah, so you’re a data scientist!

At a workshop that I recently attended, I was talking to one of the participants about my day-to-day research. We had a very exciting conversation about how some of the techniques and methods that I use could be applied to his (company’s) domain. But, at one point in the conversation, he said to me “ah, so you’re a data scientist!”

Well, I guess I am. I have tagged myself with that word on LinkedIn, so it must be true. And I have been very excited about the growing recognition of the power of data in everyday life; I have been following the buzz from people discussing what data science may be and what it could mean to them, and participating in panels that talk about this new hype.

However, when I first came across ‘data science,’ I was faced with the great contradiction that the word presents:

all science is data science

I really can’t think of any scientific enquiry that is not based on data (can you?), or any science that is not based around the age-old scientific method of supporting or rejecting hypotheses based on structured observations of the world (i.e., data). From the historical experiments about gravity and the speed of sound, to the state-of-the-art atom smashing machines; from psychologists running mice through a maze to computer scientists building the next generation of networks, it has always been (and always will be) about collecting data that provides empirical insight into the explanations that we seek about how stuff works.

Let’s put aside the tangential debate about ‘big data.’ Is ‘data science’ renaming methodologies that have been used throughout the ages? If so, what is so special about it? As I’ve observed the myriad of people jumping onto the ‘data’ bandwagon, it has been clear to me time and again that

having data does not entail that you are doing science

Many of the best examples that come to mind here boil down to people looking at individual instances of data and seeing a causal effect where none need be in place; people projecting their views of the world into data in order to confirm the opinions they already held (where much simpler explanations may have sufficed); people attributing some kind of ‘success’ to the systems they built that use data without an ounce of measurement. In other words, the kind of schoolboy errors that even the (mostly flawed) academic peer-review system is primed to catch. Perhaps this is where we can find the value of ‘data science:’

data science propagates scientific reasoning

Data science seems to be resurfacing age-old methodologies and pushing them out to where they historically haven’t been, including business and government (let’s talk some other time about whether it’s creepy to do this for elections). I’m talking about driving the progress of development and innovation by validating hypotheses with actual measurement, rather than putting a thumb to the air.

Are you doing data science? Well, if you find yourself asking questions like ‘what do I expect to happen in this experiment?’ and ‘how do I measure this? are these metrics capturing what I want to measure?’ then yes. And the Kaggle guys are right: there are loads of people out there who are already doing this. They just don’t call themselves data scientists (yet?).

As a side note, some people claim that ‘data engineering’ (that is, all the stuff around tools, algorithms, infrastructure) is an essential part of data science. I both agree and disagree with this… how central is using a microscope to doing lab work? Essential to get things done, but the microscope itself is not the science!

There is one point that has come up recently among the ‘we-don’t-call-ourselves-data-scientists-but-that-is-what-we-do’ crowd. That is,

data now exists as a precursor to science

It’s the natural consequence of all the computerised systems the world has today, both on the web and out there in the physical world. So, whereas the formation of research hypotheses was originally a precursor to any collection of data for scientific study, we now have data piling up before any hint of a hypothesis is thrown at it. In many cases, that just makes life easier. Want to study how people move about cities? How people form their social networks? [Insert your own research question here] There are worlds of data out there that scale beyond anything you could collect yourself…

What is the problem then? I have heard a few. The most basic one is the idea of being data ‘rich’ or ‘poor:’ access to data shouldn’t be elitist (or bounded by company walls). This is a problem of balancing mixed interests which goes way beyond what I want to write about in this blog post.

Other scientists claim that asking questions on a pre-existing dataset is about fitting hypotheses to data, rather than the other way around, and that the ‘real’ scientific method is about ad-hoc, controlled data collection (in general, this seems quite silly to me – what do you think?). Others question the representativeness of data, or of research that correlates data from different streams of daily life (say, tweets and geographic sentiment, check-ins and urban mobility, travel patterns and social deprivation), claiming that such data are subject to bias and, by not being direct measurements, are inaccurate. Again, in general, I’m a bit tired of the ‘your data is biased’ criticism. Of course it is: all data is biased. More importantly, has this been taken into account? As for the accuracy issue, this one is still open: what do you think?

One thought on “On the Unfortunate Naming of ‘Data Science’”

  1. I think the difference is the way the focus has shifted. As you mentioned, we now have some data, and the organisation that has collected it is looking to extract something valuable out of it. I agree that this way of thinking can lead to false conclusions, as the hypothesis is designed to fit the data; on the other hand, there are so many ways to abuse the ad-hoc, controlled data collection way, too.
    However, the positive side of this way of thinking is that the way the data is generated might restrict the set of hypotheses that can be formulated, probably towards asking more practical questions. Partly because the drive behind this quest is to extract value, partly because the origin of the data might force scientists to explain/predict how the data is generated.
