Here’s the big data fantasy project in a nutshell:
“We’re pulling in every tweet, every post, every outward link, every update… It is a huge challenge – the platforms make it difficult for us, they keep changing how we can get hold of it… The APIs don’t give us enough – especially when things trend – but there are ways around it. We find a way around it… We’ve got big data.”
There is a new kind of social scientist: the big data social scientist. Top of their Desert Island Discs choices might be Queen’s “I Want It All” (you know the one – “I want it all, I want it all, I want it all, and I want it now!”).
There is something in the “I want it all” aspiration that reminds me of travelling by train and seeing the man at the end of the platform, notebook and SLR camera around his neck, sandwiches and flask of coffee packed carefully in his knapsack. He has in his hand a book listing the number of every item of rolling stock currently in service. He has photographed and recorded 50% of them, and he knows he has another 50% to go. He also wants it all. So that’s him, the new generation of big data social scientists, and Freddie Mercury: they all want it all.
When I recently started to capture (or harvest, or, as some say, ‘scrape’) tweets about the police and crime commissioner elections, I found myself with a spreadsheet of 100,000 rows and ten fields of metadata – that’s 1m data points. For a first-timer in this world, I had myself big data. It was exciting. I had it all.
Then I started to learn more about the mechanism I was using to pull in these tweets. Blogs and websites were warning me that using Twitter’s API to do this sometimes gives you as little as 1% of the actual tweets. The limit of 1,500 tweets an hour means that you can’t get everything. For people who like to collect tweets about Obama or Occupy, there are times of the day when you could easily end up with just a tiny sample of a huge volume of tweets. But there are solutions, these people tell you – pay us a few hundred dollars and we will get it all for you. Yes, all of the tweets. No restrictions. You can have it all.
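To see how quickly a cap turns “everything” into a sample, here is a rough sketch in Python. It assumes the limit really is a flat 1,500 tweets an hour (real limits are messier and change over time), and the hourly volumes are made up purely for illustration.

```python
# Cap taken from the figure quoted above; treating it as a flat hourly limit is an assumption.
HOURLY_CAP = 1500


def captured_fraction(actual_tweets: int, cap: int = HOURLY_CAP) -> float:
    """Fraction of an hour's matching tweets that a capped collector can keep."""
    if actual_tweets <= 0:
        return 1.0  # nothing was posted, so nothing was missed
    return min(actual_tweets, cap) / actual_tweets


# Hypothetical hourly volumes: a quiet topic, a busy one, and a trending one.
for volume in (800, 15_000, 150_000):
    print(f"{volume:>7} tweets/hour -> {captured_fraction(volume):.0%} captured")
```

On these invented numbers, the quiet topic is captured in full, while the trending one comes out at the 1% figure the blogs warn about.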
Meanwhile, imagine the big data social scientist, iPod strapped to his arm, out for a 5K run to relieve some pressure, singing along, mulling the proposition over…
“Not a man for compromise and where’s and why’s and living lies
So I’m living it all, yes I’m living it all,
And I’m giving it all, and I’m giving it all,
It ain’t much I’m asking, if you want the truth,
Here’s to the future, hear the cry of youth,
I want it all, I want it all, I want it all, and I want it now.”
And you can have it now, my friend. Just enter your card details and you can have it all.
Looking back at my spreadsheet of big data, it doesn’t seem as big any more. I’ve just got an unknown sample of tweets. And to add to that, I don’t have the same people’s Facebook activity, or their LinkedIn, or what they wrote under a Guardian article or on the BBC News site, or their blog posts. I really don’t have very much. I really only have a little bit.
The fantasy of having “it all” seems like a possibility because we have the technology, or at least we have come close to it. The new generation of big data social scientists will tell you it was easier a couple of years ago – the platforms were less protective, whereas now they are becoming risk averse, or waking up to how they can monetize and exploit their big data. But they battle on. “If you don’t know how to hack or code, and don’t have the means to pay,” they will tell you, “then you need to think carefully before getting involved with the world of big data. You might be better suited to just regular ‘data’.”
But hang on. Let’s look at that spreadsheet again – the one with the 1m data points. There’s quite a bit in there. We should stop judging our data by what we don’t have and instead think about what we can learn from what we’ve got. It is a simple point, but the quality of your data depends on the questions you are asking and the claims you want to make. There are as many unanswered questions in this spreadsheet as there are tweets. The key is not to try to answer them all, nor to be led completely by the availability of data; rather, we need to be creative with our questions and exploit what we have.
It’s time to be happy with our lot – time to change the playlist – what’s that tune by Bobby McFerrin?