ah, so you’re a data scientist!
At a workshop that I recently attended, I was talking to one of the participants about my day-to-day research. We had a very exciting conversation about how some of the techniques and methods that I use could be applied to his (company’s) domain. But, at one point in the conversation, he said to me “ah, so you’re a data scientist!”
Well, I guess I am. I have tagged myself with that word on LinkedIn, so it must be true. And I have been very excited about the growing recognition of the power of data in everyday life; I have been following the buzz from people discussing what data science may be and what it could mean to them, and participating in panels that talk about this new hype.
However, when I first came across ‘data science,’ I was faced with the great contradiction that the word presents:
all science is data science
I really can’t think of any scientific enquiry that is not based on data (can you?), or any science that is not based around the age-old scientific method of supporting or rejecting hypotheses based on structured observations of the world (i.e., data). From the historical experiments about gravity and the speed of sound, to the state-of-the-art atom smashing machines; from psychologists running mice through a maze to computer scientists building the next generation of networks, it has always been (and always will be) about collecting data that provides empirical insight into the explanations we seek for how stuff works.
Let’s put aside the tangential debate about ‘big data.’ Is ‘data science’ renaming methodologies that have been used throughout the ages? If so, what is so special about it? As I’ve observed the myriad of people jumping onto the ‘data’ bandwagon, it has been clear to me time and again that
having data does not entail that you are doing science
Many of the best examples that come to mind here boil down to people looking at individual instances of data and seeing a causal effect where none need be in place; people projecting their views of the world into data in order to confirm the opinions they already held (where much simpler explanations may have sufficed); people attributing some kind of ‘success’ to the systems they built that use data without an ounce of measurement. In other words, the kind of schoolboy errors that even the (mostly flawed) academic peer-review system is primed to catch. Perhaps this is where we can find the value of ‘data science:’
data science propagates scientific reasoning
Data science seems to be resurfacing age-old methodologies and pushing them out to where they historically haven’t been, including business and government (let’s talk some other time about whether it’s creepy to do this for elections). I’m talking about driving the progress of development and innovation by validating hypotheses with actual measurement, rather than putting a thumb to the air.
Are you doing data science? Well, if you find yourself asking questions like ‘what do I expect to happen in this experiment?’ and ‘how do I measure this? are these metrics capturing what I want to measure?’ then yes. And the Kaggle guys are right: there are loads of people out there who are already doing this. They just don’t call themselves data scientists (yet?).
As a side note, some people claim that ‘data engineering’ (that is, all the stuff around tools, algorithms, infrastructure) is an essential part of data science. I both agree and disagree with this… how central is using a microscope to doing lab work? Essential to get things done, but the microscope itself is not the science!
There is one point that has come up recently among the ‘we-don’t-call-ourselves-data-scientists-but-that-is-what-we-do’ crowd. That is,
data now exists as a precursor to science
It’s the natural consequence of all the computerised systems the world has today, both on the web and out there in the physical world. So, whereas the formation of research hypotheses was originally a precursor to any collection of data for scientific study, we now have data piling up before any hint of a hypothesis is thrown at it. In many cases, that just makes life easier. Want to study how people move about cities? How people form their social networks? [Insert your own research question here] There are worlds of data out there that scale beyond anything you could collect yourself…
What is the problem then? I have heard a few. The most basic one is the idea of being data ‘rich’ or ‘poor:’ access to data shouldn’t be elitist (or bounded by company walls). This is a problem of balancing mixed interests, which goes way beyond what I want to write about in this blog post.
Other scientists claim that asking questions on a pre-existing dataset is about fitting hypotheses to data, rather than the other way around, and that the ‘real’ scientific method is about ad-hoc, controlled data collection (in general, this seems quite silly to me – what do you think?). Others question the representativeness of the data, or of research that correlates data from different streams of daily life (say, tweets and geographic sentiment, check-ins and urban mobility, travel patterns and social deprivation), claiming that these sources are subject to bias and, not being direct measurements, are inaccurate. Again, in general, I’m a bit tired of the ‘your data is biased’ criticism. Of course it is: all data is biased. The more important question is: has this been taken into account? As for the accuracy issue, this one is still open: what do you think?