“We have not yet gone about structuring the field as a whole in an understandable and effective way. We have large tasks before us, both in developing initial structure and in using this structure to organize what others have done and to see what still others might do. Those of us who recognize the importance of more effective data analysis MUST feel the urgency of transforming it somewhat more nearly into an organized body of knowledge.”
— Colin Mallows and John Tukey, 1982.
What’s Next In Data Science?
2019 marks the 150th anniversary of the Mendeleev periodic table. This iconic discovery, which rests on the ingenious observation that the properties of the elements are periodic functions of their atomic weights (atomic numbers, in the modern formulation), always amazes me. I often wonder: can we one day develop such an organized, connected framework for statistics, one in which a range of diverse algorithms and perspectives can peacefully coexist under a single unified umbrella, an “Algorithm of Algorithms”? I’m developing new fundamental principles to achieve this goal.
Nonparametric Data Science
Are there any general principles for designing statistical algorithms? By general principles I mean a theoretical framework that is [beautiful] logically coherent, [useful] able to systematically synthesize a large number of techniques that have proved useful in data analysis, and [adaptable] able to generalize them beyond the classical regime. I have been developing one such candidate theory to lay the groundwork for a progressive unification of fundamental statistical learning tools. Our theory has given birth to a new and exciting discipline for 21st-century statistics, called “Nonparametric Data Science,” which is rapidly gaining ground.
To realize this vision, we focus on one important field of statistics at a time, with the goal of simplifying, unifying, and generalizing it using our “Nonparametric Data Science” theory and tools. Under this new framework, a significant number of statistical problems have been tackled to date, including: generalized empirical Bayes (Mukhopadhyay and Fletcher, 2019), large-scale inference (Mukhopadhyay, 2016, 2018, 2020), statistical spectral analysis of graphs (Mukhopadhyay, 2020), universal copula modeling (Mukhopadhyay and Parzen, 2020), high-dimensional data modeling (Mukhopadhyay and Wang, 2018), density estimation (Mukhopadhyay, 2017a), dependence modeling (Parzen and Mukhopadhyay, 2013b), non-linear time series modeling (Mukhopadhyay and Parzen, 2017; Mukhopadhyay and Nandi, 2017), and nonparametric distributed learning (Bruce et al., 2016). All of these results show how our general theory acts as an organizing principle for a variety of data analysis endeavors, thereby allowing us to connect different sub-fields of statistics using one universal language.
The Age of ‘Unified Algorithms’ Is Here
A coherent way of designing and understanding data analysis is the ultimate goal of the “Theory of Data Science.” But does such a theory exist at all? The advancements made so far have convinced me that a theory of ‘United Statistical Algorithms’ is within reach. The field is still nascent and needs bold new ideas.
“Many useful techniques are developed in application areas, and often more than one. For me theoretical analysis that connects them, explains them better, sheds light on their performance is always very gratifying.” — Trevor Hastie (2018).
From a practical standpoint, there is a dire need to put some order into the current inventory of algorithms, which is mushrooming at a staggering rate, in order to better understand its statistical core. A theory of ‘United Statistical Algorithms’ provides us with a modern language of data analysis that can assemble different “mini-algorithms” into a coherent master algorithm, simplifying theory, computation, and practice. There is little doubt that this field will keep evolving and expanding rapidly in the coming years, given how widely it is needed across disciplines including statistics, data science, and AI.