Over the past few days, a couple of pieces have come out about Big Data, or rather about how economists and other social scientists are incorporating the extremely large datasets that are being collected on every one of us at every minute. Justin Wolfers, at Big Think, says “whatever question you are interested in answering, the data to analyze it exists on someone’s hard drive, somewhere.” Expanding on Wolfers, Brett Keller speculates as to whether economists will “win” the quant race and “become more empirical.” Marc Bellemare thinks (in a piece that’s older, but still relevant) that the social sciences will start to converge in their methods, with more qualitative fields adopting mathematical formalism to take advantage of how much we know about people’s lives. In a related piece at Bloomberg, Wolfers and Betsey Stevenson go on about the boon that big data is for economics.
Notwithstanding the significant hurdles to storing and using large datasets over time (ask a data librarian today about information that’s on floppy disks or best read by a Windows XP machine; heck, look at your own files from the past ten years: can you get all the data you want from them? What would it take to get it all in a place and format where you could formally analyze it?), I find the focus on data a little shortsighted. And don’t get me wrong; I love data.
Wolfers and Stevenson think that the mere existence of data should change our models, that the purpose of theory nowadays should be “to make sense of the vast, sprawling and unstructured terabytes on our hard drives.” We do have the capability to leverage big data to gain a more accurate picture of the world in which we live, but there is also the very real possibility of getting bogged down in the minutiae that come from knowing every decision a person ever makes and extrapolating its effect on the rest of their life. It’s the butterfly-flaps-its-wings effect, for every bite of cereal you take, for every cross word your mother said to you, for every time you considered buying those purple suede shoes and stopped yourself, or didn’t. I’m being a bit melodramatic, of course, but it’s very easy, as an economist, as a graduate student, as a pre-tenure professor short on time, to let the data drive the questions you ask. It’s also often useful; I’m not saying that finding answerable questions in existing data is universally bad, by any means. But if we have tons of information on minutiae, we’ll probably ask tons of questions about minutiae, which I don’t think brings us any closer to understanding much of anything about human behavior.
On the convergence side, I worry about losing things like ethnography. It may not be my strong point, but it’s useful, its methods and output informed my own work, and if convergence and big data mean anthropologists start relying solely on econometrics, statistics, and formal mathematics, we’ll lose a lot of richness in our scholarship. I’m all for interdisciplinary work, for applying an economic lens to all facets of human interaction and decision-making, but I don’t think our way of thinking should supplant another field’s. Rather, it should complement it.
Finally, incorporating big data into models that already exist will mitigate some problems (unobserved heterogeneity that can now be observed, for example), but not all. Controlling linearly for newly observable characteristics in a regression model has plenty of downsides, which I won’t enumerate here but which can be found in any basic explanation of econometrics or simple linear regression.
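To make one of those downsides concrete, here is a minimal simulation sketch (my own toy example, not from any of the pieces above): if an observed confounder affects the outcome nonlinearly, adding it as a linear control still leaves the coefficient of interest badly biased.

```python
# Toy example: controlling *linearly* for an observed confounder z does not
# remove bias when z enters the outcome nonlinearly (here, as z**2).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
z = rng.normal(size=n)                    # observed confounder
x = z**2 + rng.normal(size=n)             # "treatment" driven by z**2
y = 0.0 * x + z**2 + rng.normal(size=n)   # true effect of x on y is zero

def ols_coefs(y, X):
    """OLS coefficients of y on X, with an intercept prepended."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Linear control for z: coefficient on x is far from the true value of 0,
# because the z**2 component of y is not captured by a linear term in z.
b_linear = ols_coefs(y, np.column_stack([x, z]))[1]

# Controlling for the correct functional form (z**2) removes the bias.
b_correct = ols_coefs(y, np.column_stack([x, z, z**2]))[1]

print(f"linear control:    {b_linear:.3f}")   # substantially biased
print(f"quadratic control: {b_correct:.3f}")  # close to the true 0
```

Of course, with real data we rarely know the correct functional form, which is exactly the point: more observables don’t automatically mean better estimates.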
Similarly, our tools for causal identification keep getting knocked down. At one time, regression discontinuity design was hot; then it got smacked down. Propensity score matching was genius, and then, not so much. Instrumental variables still has the rather pesky problem that we can’t actually prove one of its key assumptions, the exclusion restriction. That’s not to say these tools don’t have value. When implemented correctly, they can indeed point us to novel and interesting insights about human behavior. And we certainly should continue to use the tools we have and find better ways to implement them, but the existence of big data shouldn’t mean we throw more data at these same models, which we know to be flawed, and hope that we can figure out the world. If we’re indeed moving toward more empirical economics (which is truthfully the part I practice and am most familiar with), we still need better tools. The models, the theory, the strategies for identification have to keep evolving.
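The instrumental-variables point can be seen in a short sketch (again my own illustration): the instrument below is valid purely by construction, and no amount of data would let us verify that in the wild.

```python
# Toy IV example: the instrument w is excludable *by construction* here.
# With real data, the claim that w affects y only through x is an
# assumption we can argue for but never test.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
c = rng.normal(size=n)                # unobserved confounder
w = rng.normal(size=n)                # instrument (exclusion holds by design)
x = w + c + rng.normal(size=n)        # treatment
y = 1.5 * x + c + rng.normal(size=n)  # true effect of x on y is 1.5

# Naive OLS slope: biased upward, since c moves both x and y.
b_ols = np.cov(x, y)[0, 1] / np.var(x)

# IV (Wald) estimate with a single instrument: cov(w, y) / cov(w, x).
# It recovers ~1.5 -- but only because we built w to be valid.
b_iv = np.cov(w, y)[0, 1] / np.cov(w, x)[0, 1]

print(f"OLS: {b_ols:.3f}")  # biased away from 1.5
print(f"IV:  {b_iv:.3f}")   # close to 1.5
```

More observations shrink the standard errors on both estimates; they do nothing to tell us which identifying assumption is actually true.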
Big data is part of the solution, but it can’t be the only solution.