Statistics Educational Challenge in the 21st Century
Education is one of the top priorities of the statistics profession in developing a 21st-century data-capable workforce. There is a growing need to reform the teaching of statistics to address the 'Data Science Talent Gap.' Developing a comprehensive training curriculum that covers the fundamental methods of statistical learning is challenging and requires a radically new approach, as highlighted in my recent pedagogical article, "Statistics Educational Challenge in the 21st Century."
Key Challenge: Too Many Topics, Too Little Time
The gap between the half-century-old statistics curriculum and contemporary statistical practice continues to widen. How can we reduce this gap? Interestingly, while there is a great deal of debate over the question 'what should we teach?', there is general consensus on what we must avoid. Discussing the challenges and opportunities for statistics education over the next 25 years, Kettenring et al. (2015) noted that "There is a need to train students to use deep, broad, and creative statistical thinking instead of just training them in algorithms." A similar sentiment was echoed by the ASA 2014 Curriculum Guidelines for Undergraduate Programs in Statistical Science: "Students need to see that the discipline of statistics is more than a collection of unrelated tools." I summarize this as the following (which I call 'The Exclusion Principle'):
Maxim 1. We must not design a Data Science training curriculum that reads like a long manual of specialized methods and a series of cookbook algorithms. Otherwise, we are in danger of producing DataRobots instead of Data Scientists.
The first action plan on what we should teach was proposed by Bill Cleveland (2001) (see also John Chambers's (1993) call for 'greater statistics: learning from data'), who argued that only 20% of the total curriculum should be allotted to the theoretical foundations of data science, with the rest devoted to computing, collaboration, and software tool development. A similar sentiment was echoed by David Donoho in his recent essay "50 Years of Data Science," where he proposed a comprehensive curriculum for Data Science (called 'Greater Data Science' [GDS]) composed of six categories of activities: Data Exploration and Preparation; Data Representation and Transformation; Computing with Data; Data Modeling; Data Visualization and Presentation; and Science about Data Science. But the question remains:
Open Question: How can we cover each of the six branches of GDS within the time allotted for a training program? How can we resolve this conundrum?
As Donoho (2016) noted, 'programs in Data Science cover only a fraction of the GDS territory.' The reason is clear: to accommodate additional topics (such as data pre-processing techniques, advanced computing, and interdisciplinary real-data investigations), we run into the problem of 'too many topics, too little time.' To make way for more computer science–type material, something must be sacrificed. But what should it be? This question was recently discussed by an expert panel (a mix of data scientists and statisticians, including David Hand, Chris Wiggins, and Zoubin Ghahramani) at the Royal Statistical Society. In the end, no consensus was reached on which tools and topics should be included in (or excluded from) the curriculum. In contrast to these divergent and extreme approaches, I recommend an alternative philosophy to reconcile the two sides ['The Inclusion Principle']:
Maxim 2. Teach methods for simple data in ways that continue to work for complex, high-dimensional data (much as finite-dimensional math can be taught in notation that extends to infinite-dimensional Hilbert spaces).
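To make the parenthetical concrete, here is a minimal illustration (my own example, not drawn from any specific curriculum): written in inner-product notation, the least-squares principle taught for ordinary regression in $\mathbb{R}^n$ carries over verbatim to function spaces such as $L^2$.

% The same inner-product symbol covers the finite- and infinite-dimensional cases
\[
\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i \ \ (\text{vectors in } \mathbb{R}^n),
\qquad
\langle f, g \rangle = \int f(t)\, g(t)\, dt \ \ (\text{functions in } L^2).
\]
% Least squares as orthogonal projection onto a closed subspace S:
% one statement, valid in any Hilbert space
\[
\hat{y} \;=\; \operatorname*{arg\,min}_{s \in S} \, \lVert y - s \rVert^2
\quad\Longleftrightarrow\quad
\langle\, y - \hat{y},\; s \,\rangle = 0 \ \ \text{for all } s \in S.
\]

A student who first meets regression through this projection picture can later read smoothing splines or functional data analysis as the same statement in a different Hilbert space, which is precisely the economy Maxim 2 is after.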
This bypasses the 'depth vs. breadth' problem and at the same time gives students a glimpse of the frontiers of statistical theory and methods. To further accelerate and enhance students' learning, I advocate an additional principle:
Maxim 3. Prefer an education program that covers the curriculum using a minimum number of fundamental tools, concepts, and notations.
Can we develop a new curriculum based on these three maxims? Yes, I believe we can. The reason for my optimism lies in the power of the 'united statistical learning' framework, which seeks to provide a 'concise, comprehensive, and connected' view of the subject (I call these the three C's of teaching), thereby strengthening the statistical core of data science for students and applied researchers. My research-driven pedagogical efforts have had some tangible success in developing such a 'core curriculum' with Maxims 1–3 in mind.