Coauthor and core developer of the machine-learning library scikit-learn, Alexandre Gramfort is also a researcher at Inria in the Parietal team, where he works on statistical machine learning, signal and image processing, optimization, scientific computing, and software engineering, with primary applications in brain-function imaging. For Behind the Code, he discusses how it all started with scikit-learn, how visualization is key in data science, and the importance of providing the right tools for avoiding statistical mistakes.
Reading good code to learn programming
I learnt a bit by looking at other people’s code while I was doing my PhD, but I was mostly programming on my own. I don’t think you can really learn programming from books. If you want to learn how to program, you should read good code or program with people who are good programmers. These days you have websites like GitHub. At the time I was doing my PhD, SourceForge was more the standard, but you could still access open-source code that was pretty good. Really, for me, the way to learn was to program with more-experienced people.
How scikit-learn started
It started around the end of 2009 or early 2010. The team I belong to at Inria works on neuroimaging applications, looking at how the brain works using images and signals. At the time we were starting out with scikit-learn, there were a bunch of people around us using support-vector machines or similar machine-learning techniques to look at the brain, and the tools to do this were not that easy to use. We needed tools that people doing cognitive neuroscience or clinical neuroscience could use. There was a need for a software package that would simplify access to machine learning for people working in these fields. Now, it also allows us to do things much faster than we could before.
Not just another machine-learning package in Python
Scikit-learn was not the first machine-learning package in Python. Before it, you had something called MDP, as well as PyMVPA, and there were other initiatives with Python machine-learning packages, but scikit-learn was launched with a clear scope and a much more rigorous development process. Maybe something that was specific to scikit-learn is that it was not meant to be application specific: it was not meant to be for brain signals, brain images, chemistry, or physics. The input objects in scikit-learn are NumPy arrays, which is the minimum that you need to work with numerical data in Python. There are a few things related to text processing and a few functions for images, but it was really meant to be a core set of algorithms that people could build on for other disciplines. There was no application-specific code in scikit-learn, and that’s still the case.
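As a minimal sketch of what "NumPy arrays in, nothing application specific" looks like in practice, the snippet below fits a support-vector classifier on a small hand-made array. The toy values and variable names are assumptions chosen for illustration. They are not taken from the interview.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (illustrative values only): 6 samples with 2 features each.
# Nothing here says whether the numbers come from brain signals,
# chemistry, or anything else; the estimator only sees arrays.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
              [1.0, 0.9], [0.8, 1.1], [1.2, 1.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Fit a linear support-vector classifier on the arrays.
clf = SVC(kernel="linear")
clf.fit(X, y)

# Predict labels for two new samples, again passed as a NumPy array.
predictions = clf.predict(np.array([[0.1, 0.1], [1.0, 1.0]]))
print(predictions)
```

The same `fit` and `predict` calls apply whatever the domain, as long as the data can be laid out as a samples-by-features array.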
A huge push for Python in data science
I think there has been a huge push for Python in data science. I see it at multiple levels. I was teaching at Télécom Paris for 5 years and before I started working on scikit-learn, everything was in MATLAB or C. And when you start doing machine learning and you see all these nice examples online… It’s easy to make examples, it’s easy to make it hands-on, it’s easy to do live demos, and to have students start getting things done quickly. So Python was really pushed in education, but also, I would say, in research. There are a lot of scientists around that use other languages, but when their work moves into machine learning, they switch to scikit-learn and Python, or other libraries these days.
Using Cython to improve performance
We want people to contribute to scikit-learn, which is really a scientific-computing library. The people who contribute to scikit-learn typically have a math background; they’re not CS programmers. So you take the people who know how to do software programming and the people who know how to do the math of optimization and machine learning, and when you look at the overlap, it’s close to zero. The benefit of Cython, for us, was that it looks very much like Python, so the barrier to getting started is very small. If you look at Cython code, it feels much closer to Python than anything else. So it was easy for us to say, “OK, just add a few types to your Python code, here are a few tricks you need to know. You can still read the code and it will be much faster, at least some of it will be.” So we have Cython in a few areas of scikit-learn, typically the trees, random forests, and all the coordinate-descent techniques in linear models, and our nearest neighbors are also heavily using Cython.
Meritocracy in open source
So something I really like about open source is the meritocracy, in the sense that the people who get the power, who get the respect of the community, are the people who invest their time and have proven they’ve done valuable work for the community.
Documentation best practices
Scikit-learn is really well documented, and it’s documented by examples. It’s not technical documentation like, “Here is class A and here is class B, and class B inherits from class A,” which would be a technical description. Rather, the documentation of scikit-learn says, “Here is data, here is how to use it, here is what you should be looking at,” which puts people right in the use case. This is done with a tool that we call sphinx-gallery, which is actually a spin-off from the scikit-learn source code. So we have this process of people writing Python code as examples and these get automatically added to the documentation. They write a Python script and we have a system that then converts this into web pages and examples where you have the images, the plots… And really, the documentation has been there from day one. It’s not like, “OK, let’s write code for 3 years,” and then “OK, how do we document this?” Code isn’t accepted if there’s no documentation. If there’s no documentation, then just write the documentation and think of where this documentation should be added so people can find it, and then this will get in.
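For readers who have not seen one, here is a rough sketch of what such an example script can look like. It follows the general sphinx-gallery layout: a module docstring with a reST title, then code split into cells by `# %%` comments. The title, dataset, and classifier are assumptions chosen for illustration, not an actual script from the scikit-learn gallery.

```python
"""
Nearest neighbors on the iris data
==================================

A short example in the sphinx-gallery style: this docstring becomes the
page introduction, and each ``# %%`` comment below starts a new cell
whose comment text is rendered as prose on the generated web page.
"""

# %%
# Load the data and hold out a test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# %%
# Fit a classifier and report its accuracy on the held-out data.
clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Because the file is plain Python, a contributor can run it directly before any documentation build picks it up.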
Prove that your contribution is a better way to solve a problem
We have a frequently asked questions section that tries to explain why some contributions can’t be accepted. Contributions need to have been cited and used enough in the academic world, or they need to be an improvement on an algorithm we already have. So imagine you want to solve a support-vector machine problem and you come up with a faster way to solve it, meaning it’s not a new machine-learning model, it is just a better way of solving it. This can be accepted, but we will need you to prove on various datasets that it is actually faster. So we ask you to do your homework and make sure that we’re really making a step forward.
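The kind of evidence described here could be gathered with a small timing harness like the one below. This is only a sketch under stated assumptions: the helper `time_fit`, the synthetic datasets, and the choice of `SVC(kernel="linear")` and `LinearSVC` as stand-ins for an "existing" and a "proposed" solver are all hypothetical. The two estimators do not optimize exactly the same objective, and this is not the project's actual benchmarking procedure.

```python
import time

from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC


def time_fit(estimator, X, y, n_repeats=3):
    """Return the best wall-clock fit time (seconds) over n_repeats runs."""
    best = float("inf")
    for _ in range(n_repeats):
        start = time.perf_counter()
        estimator.fit(X, y)
        best = min(best, time.perf_counter() - start)
    return best


# Hypothetical benchmark suite: synthetic datasets of increasing size.
datasets = {
    f"n={n}": make_classification(n_samples=n, n_features=20, random_state=0)
    for n in (200, 500, 1000)
}

# Stand-ins for the "existing" and the "proposed" solver (an assumption).
solvers = {
    "SVC(kernel='linear')": SVC(kernel="linear"),
    "LinearSVC": LinearSVC(max_iter=10000),
}

# Time every solver on every dataset.
results = {
    (data_name, solver_name): time_fit(solver, X, y)
    for data_name, (X, y) in datasets.items()
    for solver_name, solver in solvers.items()
}

for (data_name, solver_name), seconds in results.items():
    print(f"{data_name:>7}  {solver_name:<22} {seconds:.4f} s")
```

Timings alone are not enough. A convincing comparison would also check that both solvers reach comparable accuracy or objective values on each dataset.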
Preventing data scientists from making mistakes
You should think of scikit-learn as a library that allows people to do science easily. But one of the risks of making it easy is that you have more people participating and more chances of big mistakes being made. So the way scikit-learn is designed, in terms of API and cross-validation, is that we have built-in mechanisms to prevent people from making statistical mistakes. And these mistakes can be avoided if you also come up with educational tools, that is to say software, algorithms, tricks, functions, or anything that will allow people to quickly see that they’re making a mistake with their data. So when you have issues with the input data and the model isn’t working, it’s easier to get a diagnosis of why something isn’t working or why your input data went awry, and that gives people better access to the toolbox of these techniques. And I think a lot of effort and money is being invested these days in having tools to do this, at a conceptual or algorithmic level. But I think we also need tools like scikit-learn so people can apply this type of thing to their real problems.
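One commonly cited example of such a mechanism is combining a pipeline with cross-validation, so that preprocessing is never fitted on held-out data. The interview does not name specific functions, so treat the sketch below as an assumption about what is meant. The dataset and model choices are for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The scaler lives inside the pipeline, so in each fold it is fitted on the
# training part only and then applied to the held-out part. Scaling the
# full dataset before splitting would leak test information into training.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation: one accuracy score per held-out fold.
scores = cross_val_score(model, X, y, cv=5)
print(f"Accuracy per fold: {scores.round(2)}")
print(f"Mean accuracy: {scores.mean():.2f}")
```

Reporting the per-fold scores alongside the mean also makes it easier to spot an unstable model than a single train/test split would.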
Visualization before predictive models
Having a good way of looking at your data can be extremely insightful. You’re not gonna solve an online predictive system with visualization, but I would say data science is greatly influenced by good visualization. It’s hard to explain, it’s hard to teach. You need experience for this, but it’s quite important that, rather than diving deep into predictive models, you should first look at your data and visualize what’s going on quickly and efficiently. And then you can do a predictive model. I think there are a lot of things that can be understood with visualization.
The future challenges of scikit-learn
I would say sustainability has been an issue for many years. But a scikit-learn foundation was recently established in France with 7 corporate partners, and it is funding the project through donations, which will allow us to hire 3 engineers who will work full time on scikit-learn. This is something we’ve been working toward for the past year and a half to two years, and it suddenly became a reality last September. So we now have this core group of experienced, long-standing core contributors to scikit-learn who can be paid to maintain the project and keep it running. And of course, if you want your library to survive, it needs to be useful. If we were to allow the state of the library to get stale in 2019 and nothing to evolve… Probably some people will still be using logistic regression in 10 years, but if there are packages that do it 10 times faster than we do, people will probably stop using scikit-learn.
This article is part of Behind the Code, the media for developers, by developers.
Illustration by WTTJ