Big data and what it means for economists

Over the past few days, a couple of pieces have come out about Big Data, or rather how economists and other social scientists are incorporating the extremely large datasets that are being collected on every one of us at every minute. Justin Wolfers, at the Big Think, says “whatever question you are interested in answering, the data to analyze it exists on someone’s hard drive, somewhere.” Expanding on Wolfers, Brett Keller speculates as to whether economists will “win” the quant race and “become more empirical.” Marc Bellemare thinks (in a piece that’s older, but still relevant) that the social sciences will start to converge in their methods, with more qualitative fields adopting mathematical formalism to take advantage of how much we know about people’s lives. Justin Wolfers and Betsey Stevenson go on in a related piece at Bloomberg about the boon that big data is for economics.

Notwithstanding the significant hurdles to storing and using large datasets over time, I find the focus on data a little shortsighted. (Ask a data librarian today about information that’s on floppy disks or best read by a Windows XP machine. Heck, look at your own files over the past ten years: can you get all the data you want from them? What would it take to get it all in a place and format where you could formally analyze it?) And don’t get me wrong; I love data.

Wolfers and Stevenson think that the mere existence of data should change our models, that the purpose of theory nowadays should be “to make sense of the vast, sprawling and unstructured terabytes on our hard drives.” We do have the capability to leverage big data to gain a more accurate picture of the world in which we live, but there is also the very real possibility of getting bogged down in the minutiae that come from knowing every decision a person ever makes and extrapolating its effect on the rest of their lives. It’s the butterfly effect, applied to every bite of cereal you take, every cross word your mother said to you, every time you considered buying those purple suede shoes and stopped yourself–or didn’t. I’m being a bit melodramatic, of course, but it’s very easy, as an economist, as a graduate student, as a pre-tenure professor short on time, to let the data drive the questions you ask. It’s also often useful; I’m not saying that finding answerable questions using existing data is universally bad, by any means. But if we have tons of information on minutiae, we’ll probably ask tons of questions on minutiae, which I don’t think brings us any closer to understanding much of anything about human behavior.

On the convergence side, I worry about losing things like the ethnography. It may not be my strong point, but it’s useful, its methods and output have informed my own work, and if convergence and big data mean anthropologists start relying solely on econometrics and statistics and formal mathematics, we’ll lose a lot of richness in our history and academics. I’m all for interdisciplinary work, for applying an economic lens to all facets of human interaction and decisions, but I don’t think our way of thinking should supplant another field’s. Rather, it should complement it.

Finally, incorporating big data into models that already exist will mitigate some problems (unobserved heterogeneity that can now be observed, for example), but not all. Controlling linearly for now-observable characteristics in a regression model has plenty of downsides, which I won’t enumerate but which can be found in any basic explanation of econometrics or simple linear regression.
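To make one of those downsides concrete, here is a minimal simulated sketch, not drawn from any of the pieces above. It assumes a made-up trait `z` that we can now observe, and a treatment `x` and outcome `y` that both depend on `z` squared rather than on `z` itself. Every name and number in it is hypothetical. The regression is run with the Frisch-Waugh-Lovell residual trick so that it only needs the Python standard library.

```python
import random
import statistics


def residualize(v, control):
    """Residuals from a simple OLS regression of v on control, with an intercept."""
    mean_v = statistics.fmean(v)
    mean_c = statistics.fmean(control)
    cov = sum((a - mean_v) * (b - mean_c) for a, b in zip(v, control))
    var = sum((b - mean_c) ** 2 for b in control)
    slope = cov / var
    return [(a - mean_v) - slope * (b - mean_c) for a, b in zip(v, control)]


def partial_slope(y, x, control):
    """Coefficient on x when y is regressed on x and control (Frisch-Waugh-Lovell)."""
    res_y = residualize(y, control)
    res_x = residualize(x, control)
    return sum(a * b for a, b in zip(res_y, res_x)) / sum(b * b for b in res_x)


def simulate_linear_control(n=20000, true_effect=1.0, seed=0):
    """Return (estimate controlling for z, estimate controlling for z squared)."""
    rng = random.Random(seed)
    # Hypothetical trait that used to be unobserved and is now in the data.
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Assumption: the treatment depends on z squared, not on z linearly.
    x = [zi ** 2 + rng.gauss(0.0, 1.0) for zi in z]
    # Assumption: the outcome also depends on z squared.
    y = [true_effect * xi + 2.0 * zi ** 2 + rng.gauss(0.0, 1.0)
         for xi, zi in zip(x, z)]
    z_squared = [zi ** 2 for zi in z]
    return partial_slope(y, x, z), partial_slope(y, x, z_squared)


if __name__ == "__main__":
    linear, flexible = simulate_linear_control()
    print(f"true effect: 1.0, linear control: {linear:.2f}, "
          f"z**2 control: {flexible:.2f}")
```

Under these assumptions, adding `z` as a linear control leaves the estimate well above the true effect of 1.0, because `z` is uncorrelated with its own square and so soaks up none of the confounding. Controlling for `z` squared recovers something close to 1.0. Observing the variable is not the same as controlling for it correctly.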

Similarly, our tools for causal identification keep getting knocked down. At one time, regression discontinuity design was hot, and then it was smacked down. Propensity score matching was genius and then, not so much. Instrumental variables estimation still has this rather pesky problem that we can’t actually prove one of its key assumptions, the exclusion restriction. That’s not to say these tools don’t have value. When implemented correctly, they can indeed point us to novel and interesting insights about human behavior. And we certainly should continue to use the tools we have and find better ways to implement them, but the existence of big data shouldn’t mean we throw more data at these same models, which we know to be flawed, and hope that we can figure out the world. If we’re indeed moving towards more empirical economics (which is truthfully the part I practice and am most familiar with), we still need better tools. The models, the theory, the strategies for identification have to keep evolving.

Big data is part of the solution, but it can’t be the only solution.

MLK Day and Race

Today is Martin Luther King, Jr. Day, as I’m sure you know. MLK Day was the only federal holiday we got off at Duke, or at least the only one that fell during the semester. It was always marked with a big celebration and my dance group often performed. I always liked that celebration.

But I’ve gotten totally off-topic. An article this week in the NYT highlighted the issue of choosing a race, particularly on census forms, for Latinos in the US. Latinos, who are incredibly diverse in physiognomy and heritage, are, according to the article, choosing to mark ‘other’ instead of one or more of the categories provided.

The issue is of particular importance to economists because in most microeconomic work, we control for race. The implication of this, of course, is that by including someone’s race in a regression, we are separating out some aspect that is predictive of whatever behavior or outcome we’d like to measure. And not only are we separating it out, we’re separating it out in a measured, specific way such that we think it applies to all respondents.

For example, we might see a regression that says, all other things equal, the average black person receives one more year of education than a white person. (I saw a statistic like this the other day, saying that black people of similar wealth and socio-economic status get more education than their white peers. I wish I could remember where it came from.) Though the statement is necessarily couched in “on average”, if a number of people are choosing ‘other’ instead of white or black or some combination of these, we’re not actually seeing the true average. This is called measurement error, and it can have pretty significant effects on estimation.
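A rough simulated sketch of how that can play out, under assumptions that are entirely mine: two groups with a true gap of one year of education, 30 percent of one group marking ‘other’, and an analyst who lumps ‘other’ in with the reference category. None of the numbers come from real data, and lumping is only one of several ways ‘other’ might be handled.

```python
import random
import statistics


def simulate_gap(n=40000, true_gap=1.0, share_other=0.3, seed=0):
    """Return (gap using true group, gap using recorded group)."""
    rng = random.Random(seed)
    true_group, recorded_group, outcome = [], [], []
    for _ in range(n):
        g = 1 if rng.random() < 0.5 else 0
        # Hypothetical outcome: years of education, with a one-year true gap.
        y = 12.0 + true_gap * g + rng.gauss(0.0, 2.0)
        # Assumption: some members of group 1 mark 'other', and the analyst
        # folds 'other' into the reference category (coded 0).
        r = g
        if g == 1 and rng.random() < share_other:
            r = 0
        true_group.append(g)
        recorded_group.append(r)
        outcome.append(y)

    def gap(groups):
        # Difference in mean outcomes between groups coded 1 and 0.
        ones = [y for y, g in zip(outcome, groups) if g == 1]
        zeros = [y for y, g in zip(outcome, groups) if g == 0]
        return statistics.fmean(ones) - statistics.fmean(zeros)

    return gap(true_group), gap(recorded_group)


if __name__ == "__main__":
    true_gap, recorded_gap = simulate_gap()
    print(f"gap with true groups: {true_gap:.2f}, "
          f"gap with recorded groups: {recorded_gap:.2f}")
```

In this setup the recorded gap shrinks toward zero: the reference category now contains people who belong to the other group, so its average is pulled up. With these particular shares the expected recorded gap is 1/1.3, or about 0.77 of a year instead of one. If opting for ‘other’ were also related to education itself, the distortion could go in either direction.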

In my own work, for instance, black mothers and white mothers in the Fragile Families data display different characteristics and decisions regarding investments in children when controlling for whether they’ve received a promise of financial support. But if I were able to capture more of the group that self-identifies its race as ‘other’, this effect might be reduced or even disappear.

The question of whether to even ask about race, or ethnicity, is a sticky one. It may give us information that gives different groups more “clout,” as the NYT article argues, or it may reinforce stereotypes and fan the flames. Regardless, if research continues the way it currently goes, having a large group of people opt out because they don’t find something that fits them is problematic.

There’s still lots of thinking to be done about it, and perhaps today is a good day to mull it over a bit. I hope you enjoy your MLK Day!

–“The arc of the moral universe is long, but it bends towards justice.” -MLK, Jr.