NextGen: How Small Are Our Big Data: Turning the 2016 Surprise into a 2020 Vision

Xiao-Li Meng

Harvard, Whipple V. N. Jones Professor of Statistics
Founding Editor-in-Chief of Harvard Data Science Review


The phrase “Big Data” has greatly raised expectations of what we can learn about ourselves and the world in which we live or will live. It also appears to have boosted general trust in empirical findings, because it seems to be common sense that the more data we have, the more reliable our results are. Unfortunately, this commonsense conception can be falsified mathematically even for methods as time-honored as ordinary least squares regression (Meng and Xie, 2014, Econometric Reviews 33: 218-250). Furthermore, whereas the size of data is a common indicator of the amount of information, what matters far more is the quality of the data. A 5-element Euler-formula-like identity reveals that trading quality for quantity in population statistical inference is a mathematically and demonstrably doomed game (Meng, 2018, Annals of Applied Statistics 12: 685-726). Without considering data quality, Big Data can do more harm than good because of drastically inflated precision assessments, and hence gross overconfidence, setting us up to be caught by surprise when reality unfolds, as we all experienced during the 2016 US presidential election. Data from the Cooperative Congressional Election Study (CCES, conducted by Stephen Ansolabehere, Douglas Rivers and others, and analyzed by Shiro Kuriwaki) are used to assess the data quality of the 2016 US election polls, with the aim of gaining a clearer vision for the 2020 election and beyond.
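
As a rough illustration of why data quality can outweigh data size, the sketch below checks numerically the form of the identity given in Meng (2018): the error of a sample mean equals the product of a data defect correlation (between the response indicator R and the outcome G), a data quantity term sqrt((1-f)/f) with f = n/N, and the population standard deviation of G. The population, the response rates, and the function name `decompose_error` are all hypothetical choices made for this example, not taken from the talk or from the CCES analysis.

```python
import math
import random


def decompose_error(G, R):
    """Split the sample-mean error into (error, rho, quantity, sigma).

    G: outcomes for every unit in a finite population.
    R: 0/1 indicators, 1 if the unit's outcome is recorded in the sample.
    All moments are finite-population moments (divide by N).
    """
    N = len(G)
    n = sum(R)
    f = n / N  # sampling fraction
    pop_mean = sum(G) / N
    sample_mean = sum(g for g, r in zip(G, R) if r) / n
    sigma = math.sqrt(sum((g - pop_mean) ** 2 for g in G) / N)
    cov = sum((r - f) * (g - pop_mean) for g, r in zip(G, R)) / N
    rho = cov / (math.sqrt(f * (1 - f)) * sigma)  # data defect correlation
    quantity = math.sqrt((1 - f) / f)  # data quantity term
    return sample_mean - pop_mean, rho, quantity, sigma


# Hypothetical population: a binary outcome held by roughly half of N units.
random.seed(0)
N = 100_000
G = [1 if random.random() < 0.5 else 0 for _ in range(N)]

# Hypothetical self-selection: units with G = 1 respond slightly more often
# (assumed rates of 1.1% versus 0.9%), so R and G are weakly correlated.
R = [1 if random.random() < (0.011 if g else 0.009) else 0 for g in G]

error, rho, quantity, sigma = decompose_error(G, R)
product = rho * quantity * sigma

print(f"n = {sum(R)}, actual error = {error:.6f}")
print(f"rho = {rho:.6f}, quantity term = {quantity:.3f}, sigma = {sigma:.4f}")
print(f"rho * quantity * sigma = {product:.6f}")
```

The two printed error values agree up to floating-point rounding because the decomposition is an algebraic identity rather than an approximation. Under these assumed response rates, even a small correlation between responding and the outcome is multiplied by the large quantity term, which is the sense in which more data cannot compensate for poor quality.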


Both articles are available at; the first one is inside Xiao-Li’s CV.