“We have not yet gone about structuring the field as a whole in an understandable and effective way. We have large tasks before us, both in developing initial structure and in using this structure to organize what others have done and to see what still others might do. Those of us who recognize the importance of more effective data analysis MUST feel the urgency of transforming it somewhat more nearly into an organized body of knowledge.”
— Colin Mallows and John Tukey, 1982.
What’s Next In Data Science?
2019 marks the 150th anniversary of Mendeleev's periodic table. This iconic discovery, which is based on the ingenious observation that the properties of the elements are periodic functions of their atomic numbers, always amuses me. I often wonder: can we one day develop such an organized, connected framework for statistics, where seemingly unrelated algorithms peacefully coexist under a single unified umbrella, an “Algorithm of Algorithms”? It is the possibility of achieving this dream that drives my research program.
Nonparametric Data Science
How can we develop a consistent and unified framework of data analysis (the foundation of data science) that would reveal the interconnectedness among different branches of statistics? This question is the driving force behind my research program. I have been developing one such candidate theory to lay the groundwork for a progressive unification of fundamental statistical learning tools. Our theory has given birth to a new and exciting discipline for 21st-century statistics, called “Nonparametric Data Science,” which is rapidly gaining ground.
To realize this vision, we focus on one important field of statistics at a time, with the goal of simplifying, unifying, and generalizing it using our “Nonparametric Data Science” theory and tools. Under this new framework, a significant number of statistical problems have been tackled to date, including: generalized empirical Bayes (Mukhopadhyay and Fletcher, 2019), large-scale inference (Mukhopadhyay, 2016, 2018, 2020), statistical spectral analysis of graphs (Mukhopadhyay, 2020), universal copula modeling (Mukhopadhyay and Parzen, 2020), high-dimensional data modeling (Mukhopadhyay and Wang, 2018), density estimation (Mukhopadhyay, 2017a), dependence modeling (Parzen and Mukhopadhyay, 2013b), non-linear time series modeling (Mukhopadhyay and Parzen, 2017; Mukhopadhyay and Nandi, 2017), and nonparametric distributed learning (Bruce et al., 2016). All of these results show how our general theory acts as an organizing principle for a variety of data analysis endeavors, thereby allowing us to connect different sub-fields of statistics using one universal language.
The Age of ‘Unified Algorithms’ Is Here
A coherent way of designing and understanding data analysis is the ultimate goal of the “Theory of Data Science.” But does such a theory exist at all? The advancements made so far have convinced me that a theory of ‘United Statistical Algorithms’ is within reach. The field is still nascent and urgently needs bold new ideas.
“Many useful techniques are developed in application areas, and often more than one. For me theoretical analysis that connects them, explains them better, sheds light on their performance is always very gratifying.” — Trevor Hastie (2018).
From a practical standpoint, there is a dire need to put some order into the current inventory of algorithms, which is growing at a staggering rate, so that we can better understand its statistical core. A theory of ‘United Statistical Algorithms’ provides us with a modern language of data analysis that can assemble different “mini-algorithms” into a coherent master algorithm, simplifying theory, computation, and practice. There is little doubt that this field will keep evolving and expanding rapidly in the coming years, given how widely it is needed across disciplines including statistics, data science, and AI.