The Origin Story of Data Science

Published in Coder stories

May 14, 2019

8 min

Author
Pierre Mary

Data scientist @ PrestaShop

“In God we trust; all others bring data”—William Edwards Deming.

Today, data science is defined as a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data. It emerged thanks to the convergence of a wide range of factors: New ideas among academic statisticians, the spread of computer science across various fields, and a favorable economic context.

As the falling cost of hard drives allowed companies and governments to store more and more data, the need for new ways to extract value from it arose. This boosted the development of new systems, algorithms, and computing paradigms. Since data science was particularly well suited to those wanting to learn from big data, and thanks to the emergence of cloud computing, it spread quickly across various fields.

It ought to be noted, though, that while the rising popularity of big data was a factor in the rapid growth of data science, it shouldn’t be inferred that data science only applies to big data.

Along the way to becoming the field that we know now, data science received a lot of criticism from academics and journalists who saw no distinction between it and statistics, especially during the period 2010–2015. The difference may not have been obvious to them without a statistician’s background. Here, we examine the origins of this field to get a better understanding of why it is a distinct academic discipline. And since it’s a story better understood when looked at through the individuals involved in creating it, let’s meet the four people who pushed the boundaries of statistics: John Tukey, John Chambers, Leo Breiman, and Bill Cleveland.

John Tukey: The epicenter of the earthquake

The influence of John Tukey on both the mathematical and statistical worlds is huge. He coined the term “bit” and is directly and indirectly responsible for much of the richness of graphic methods available today with his book Exploratory Data Analysis and paper Mathematics and the Picturing of Data. We also have him to thank for the box plot, and he is renowned for his contributions to the fast Fourier transform (FFT) algorithm.
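As a small, purely illustrative sketch of the kind of exploratory graphic Tukey championed, the following Python snippet draws box plots with Matplotlib over synthetic data (the groups and values are invented for the example):

```python
# A minimal exploratory-data-analysis sketch: summarizing numeric variables
# with Tukey's box plot. The three groups below are synthetic, invented data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)
# Three hypothetical groups with different centers and spreads.
groups = [rng.normal(loc=mu, scale=sigma, size=200)
          for mu, sigma in [(10, 2), (12, 4), (9, 1)]]

fig, ax = plt.subplots()
ax.boxplot(groups)                     # median, quartiles, whiskers, outliers
ax.set_xticklabels(["A", "B", "C"])
ax.set_title("Box plots: a quick look at center, spread, and outliers")
ax.set_ylabel("value")
plt.show()
```

A box plot compresses each distribution into its median, quartiles, whiskers, and outliers, which is exactly the kind of quick, comparative look that exploratory data analysis calls for before any confirmatory modeling.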

Tukey was teaching at Princeton University while developing statistical methods for computers at Bell Labs when he wrote The Future of Data Analysis (1962). In it, he outlined a new science about learning from data, urging academic statisticians to reduce their focus on statistical theory and engage with the entire data-analysis process. Articulating the importance of the distinction between exploratory data analysis and confirmatory data analysis was the first step in establishing the field of data science. Toward the end of the paper, he summarized what he believed to be the necessary attitudes to adopt concerning the future of statistics. It’s worth repeating some of them for today’s data scientists:

  • “We need to face up to more realistic problems. The fact that normal theory, for instance, may offer the only framework in which some problem can be tackled simply or algebraically may be a very good reason for starting with the normal case, but never can be a good reason for stopping there.”
  • “We need to face up to the necessarily approximate nature of useful results in data analysis.”
  • “We need to face up to the need for collecting the results of actual experience with specific data-analytic techniques.”
  • “We need to face up to the need for iterative procedures in data analysis.”
  • “We need to face up to the need for both indication and conclusion in the same analysis.”
  • “We need to give up the vain hope that data analysis can be founded upon a logico-deductive system like Euclidean plane geometry (or some form of the propositional calculus) and to face up to the fact that data analysis is intrinsically an empirical science.”

John Chambers: Statisticians at a crossroads

Like Tukey, John Chambers worked at Bell Labs. He is the creator of the S programming language, which evolved into R, a language widely used among data scientists. In 1998, he received the ACM Software System Award, one of the most prestigious prizes in software, with the citation, “For the S system, which has forever altered how people analyze, visualize, and manipulate data.”

Chambers’s influence on the field can be traced back to his paper Greater or lesser statistics: A choice for future research (1993), in which he developed the idea of dividing statistics into two groups:

  • Greater statistics: “Everything related to learning from data, from the first planning or collection to the last presentation or report”
  • Lesser statistics: “The body of specifically statistical methodology that has evolved within the profession—roughly, statistics as defined by texts, journals, and doctoral dissertations”

To add some context, at the time, statisticians were participating only marginally in new research areas where their expertise and interest were relevant, such as expert software, scientific visualization, chaos theory, and neural networks. As Chambers wrote, “If statisticians remain aloof, others will act. Statistics will lose.” Guess what happened.

Leo Breiman: A cultural shift

After seven years as an academic renowned for his work on probability theory, the distinguished statistician Leo Breiman became an independent consultant for 13 years before joining UC Berkeley’s Department of Statistics. Returning to the university opened his eyes. With both his academic background and his consulting experience, he was able to observe that Tukey’s message and Chambers’s warning hadn’t been heeded. Academic statisticians were continuing to focus on theory and weren’t engaging with the entire data-analysis process. Meanwhile, others had acted.

This provided him with the subject matter for his famous paper Statistical Modeling: The Two Cultures (2001). Like Chambers, he divided statistics into two groups: The data modeling culture (Chambers’s lesser statistics) and the algorithmic modeling culture (Chambers’s greater statistics). He took it one step further, stating that 98% of statisticians were from the former, while only 2% were from the latter. At the same time, the algorithmic modeling culture was the norm in many other fields.

According to Breiman, this focus on data models had led to irrelevant theory and questionable scientific conclusions, while keeping statisticians from using more suitable algorithmic models. He also felt it was preventing statisticians from working on exciting new problems that could drive a new generation toward potential breakthroughs.
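To make the contrast concrete, here is a minimal, hypothetical sketch in Python (using scikit-learn and synthetic data, neither of which appears in Breiman’s paper): a linear regression stands in for the data modeling culture, while a random forest, a family of algorithmic models Breiman himself pioneered, stands in for the algorithmic modeling culture.

```python
# A hypothetical illustration of Breiman's two cultures on synthetic data:
# a parametric linear model (data modeling culture) versus a random forest
# (algorithmic modeling culture). Not taken from Breiman's paper.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(1000, 2))
# A nonlinear, noisy ground truth, unknown to both models.
y = np.sin(X[:, 0]) * X[:, 1] + rng.normal(scale=0.3, size=1000)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Data modeling culture: assume a simple stochastic model (linear here),
# estimate its parameters, and interpret them.
linear = LinearRegression().fit(X_train, y_train)

# Algorithmic modeling culture: treat the data-generating mechanism as
# unknown and judge the algorithm by predictive accuracy on held-out data.
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

print("linear model R^2: ", r2_score(y_test, linear.predict(X_test)))
print("random forest R^2:", r2_score(y_test, forest.predict(X_test)))
```

On data generated by a nonlinear mechanism like this one, the forest typically achieves a much higher held-out R², illustrating Breiman’s point that predictive accuracy on fresh data, rather than fidelity to an assumed stochastic model, is the natural yardstick of the algorithmic culture.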

In A Conversation with Leo Breiman (2001), he was even more explicit when asked to give advice to students studying statistics:

“I’m torn in a way because what I might even tell them [young students] is, ‘Don’t go into statistics.’ My feeling is, to some extent, that academic statistics may have lost its way[…]

I knew what was going on out in industry and government in terms of uses of statistics, but what was going on in academic research seemed light-years away. It was proceeding as though it were some branch of abstract mathematics. One of our senior faculty members said a while back, ‘We have to keep alive the spirit of Wald.’ But before the good old days of Wald and the divorce of statistics from data, there were the good old days of Fisher, who believed that statistics existed for the purposes of prediction and explanation and working with data[…]

In the past five or six years, I’ve become close to the people in the machine learning and neural nets areas because they are doing important applied work on big, tough prediction problems. They’re data oriented and what they are doing corresponds exactly to Webster’s definition of statistics, but almost none of them are statisticians by training.

So I think if I were advising a young person today, I would have some reservations about advising him or her to go into statistics, but probably, in the end, I would say, ‘Take statistics, but remember that the great adventure of statistics is in gathering and using data to solve interesting and important real world problems.’”

All the elements of data science were now in the air.

Bill Cleveland: Beyond statistics

Bill Cleveland is a statistician and computer scientist, Professor of Statistics and Courtesy Professor of Computer Science at Purdue University, Indiana. He is best known for his work on data visualization and on nonparametric regression, in particular local regression, which he first described in his paper Robust Locally Weighted Regression and Smoothing Scatterplots (1979), then developed and enriched in Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting (1988). Cleveland also worked in the Statistics Research Department at Bell Labs, becoming Department Head.
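To give a feel for the method, here is a minimal sketch of the core idea behind locally weighted regression: for each point, fit a straight line to its nearest neighbors, weighted by a tricube function of distance. It omits the robustness iterations of Cleveland’s 1979 paper, the function name and data are invented for illustration, and in practice one would reach for an existing implementation (for example, lowess in the statsmodels Python package).

```python
# A simplified sketch of locally weighted regression (LOWESS-style),
# without Cleveland's robustness iterations. Illustrative only.
import numpy as np

def loess_1d(x, y, frac=0.3):
    """Smooth y as a function of x with local, tricube-weighted linear fits."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    k = max(2, int(np.ceil(frac * n)))        # neighborhood size
    fitted = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        idx = np.argsort(d)[:k]               # k nearest neighbors of x[i]
        h = max(d[idx].max(), 1e-12)          # bandwidth = farthest neighbor
        w = (1 - (d[idx] / h) ** 3) ** 3      # tricube weights in [0, 1]
        # Weighted least-squares fit of a local line a + b * x.
        A = np.column_stack([np.ones(k), x[idx]])
        W = np.diag(w)
        beta, *_ = np.linalg.lstsq(A.T @ W @ A, A.T @ W @ y[idx], rcond=None)
        fitted[i] = beta[0] + beta[1] * x[i]
    return fitted

# Hypothetical usage on noisy synthetic data.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=200)
smoothed = loess_1d(x, y, frac=0.3)
```

The frac parameter plays the role of Cleveland’s smoothing parameter f: a larger fraction of neighbors gives a smoother, less local fit.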

In 2001, he published a paper called Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics. In it, he proposed that universities should create new research and teaching departments by enlarging 6 technical areas of work in the field of statistics. He called the altered field “data science.” He suggested allocating departmental resources among the 6 areas as follows:

  • Multidisciplinary Investigations (25%): Data analysis collaborations in a collection of subject matter areas.
  • Models and Methods for Data (20%): Statistical models; methods of model building; methods of estimation and distribution based on probabilistic inference.
  • Computing with Data (15%): Hardware systems; software systems; computational algorithms.
  • Pedagogy (15%): Curriculum planning and approaches to teaching for elementary school, secondary school, college, graduate school, continuing education, and corporate training.
  • Tool Evaluation (5%): Surveys of tools in use in practice, surveys of perceived needs for new tools, and studies of the processes for developing new tools.
  • Theory (20%): Foundations of data science; general approaches to models and methods, to computing with data, to teaching, and to tool evaluation; mathematical investigations of models and methods, of computing with data, of teaching, and of evaluation.

This plan was also intended to be adopted by research labs and corporate research organizations.

Conclusion

The need for data science thus grew out of the intuition that solving future complex problems would require analysis of large, multivariate data sets, not just theory and logic. It arose from a long evolution of statistical practice, and from a vision of what that practice could have become. More than 50 years after Tukey first put forward his ideas, they have finally become mainstream, though it wasn’t statisticians who drove this development.

Scientific methodology is inextricably bound with data science, since we can’t rely just on theory. As Tukey said, every time we apply data science to a new problem, we are breaking ground from a data-analysis point of view. Therefore we need to experiment a lot. This is where scientific methodology comes in handy.

Common to the four researchers discussed here is the fact that they were all involved with the application of statistics in a different field. The same applies to today’s data scientists. From its very roots, this field has been made up of people from a variety of other disciplines. Most of them started out using computers in their work or studies and eventually switched to data science from their original field. Many disciplines have long employed their own versions of data science. Just look at the diversity of the terms referring to a predictor: Feature, input variable, independent variable, or, from a database perspective, a field.

Ask a data scientist what they did before working in this field and you are likely to get a different answer every time. In September 2018, the job site Indeed analyzed tens of thousands of data scientists’ résumés in its possession. The results showed that, on average, they are highly educated, with about 20% of them holding a PhD and 75% a bachelor’s or master’s degree. The diversity of their fields of study is striking: Computer science and business/economics each accounted for about 22%, followed by math/stats (15%), natural sciences (10%), and data science (9%). Such a large share of data-science majors could be considered surprising, given how new this field is, but it demonstrates how quickly universities have succeeded in offering new programs. In contrast, the social sciences are underrepresented (2%).

The story doesn’t end here, though. The explosion of data we’re seeing is just the beginning and will bring new challenges. Along with the growth of the Internet of Things (IoT), it will broaden the areas to which we can apply data science. In addition, the increasing amount of available training data will lead to more effective models. As a science, the field is still relatively young, and the development of machine-learning software is expected to keep growing. Companies such as Google, Facebook, Uber, and many others are already building software-research teams in data science, anticipating an economy in which increasing the precision and accuracy of their machine-learning models may be the best way to grow their businesses.

This article is part of Behind the Code, the media for developers, by developers. Discover more articles and videos by visiting Behind the Code!



Illustration by Victoria Roussel
