IPython: A unified environment for interactive data analysis
It has roots in academic scientific computing, but has features that appeal to many data scientists.
As I noted in a recent post on reproducing data projects, notebooks have become popular tools for maintaining, sharing, and replicating long data science workflows. Much of that is due to the popularity of IPython (1). In development since 2001, IPython grew out of the scientific computing community and has slowly added features that appeal to data scientists.
Roots in academic scientific computing
As IPython creator Fernando Perez noted in his “historical retrospective”, exploratory analysis in a scientific setting requires a solid interactive environment. After years of development, IPython has become a great tool for interacting with data. IPython also addresses two other pain points for scientists, reproducibility and collaboration, which are equally important to data scientists working in industry.
IPython is more than just Python
With an interactive widget architecture that’s 100% language-agnostic, these days IPython is used by many other programming language communities (2), including Julia, Haskell, F#, Ruby, Go, and Scala. If you’re a data scientist who likes to mix and match languages, you can create, maintain, and share multi-language data projects in IPython.
IPython is routinely used with tools for data wrangling, advanced analytics, and large-scale computing
I’ve written about popular tools in the PyData ecosystem including Pandas, PyTables, scikit-learn, and Statsmodels. IPython users can also leverage Fortran libraries and IPython.parallel, a low-latency parallel computing system for controlling many IPython kernels running on a large cluster. IBM recently used IPython notebooks and IPython.parallel to migrate Watson to a domain-independent codebase: researchers took 8,000 lines of Java, JavaScript, and HTML5 (2 min/query) and ported them to 200 lines of Python (2 sec/query).
Data visualization options continue to improve
Users have long used Python visualization tools (e.g., matplotlib and MayaVi) within IPython. More recently, I’ve seen impressive IPython notebooks with embedded interactive visualizations that were built with tools like plotly and bokeh.
Sharing and collaboration are easy
IPython notebooks are gaining rapid adoption for instructional purposes, and it’s no surprise that they’re already being used in many programming and data science courses (e.g., CS109 at Harvard, Python for Data Science at UC Berkeley, and Software Carpentry). More recently, entire books have been written with them as well: each chapter of a recent (3) Signal Processing textbook was originally an IPython notebook. Sharing is facilitated by simple, built-in tools for exporting IPython notebooks into different formats including slides, HTML, LaTeX, and JSON.
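To make the JSON point concrete, here is a minimal sketch of what a notebook looks like on disk. It assumes the nbformat 4 layout (a top-level `cells` list plus `metadata`, `nbformat`, and `nbformat_minor`), which is newer than the format in use when the tools above were first released. The helper name `make_notebook` and the cell contents are hypothetical, chosen only for illustration.

```python
import json


def make_notebook(cells):
    """Build a minimal notebook dict, assuming the nbformat 4 layout.

    `cells` is a list of (cell_type, source) pairs; this helper is a
    hypothetical illustration, not part of IPython itself.
    """
    nb_cells = []
    for cell_type, source in cells:
        cell = {"cell_type": cell_type, "metadata": {}, "source": source}
        if cell_type == "code":
            # Code cells additionally carry an execution counter and outputs.
            cell["execution_count"] = None
            cell["outputs"] = []
        nb_cells.append(cell)
    return {
        "cells": nb_cells,
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 4,
    }


notebook = make_notebook([
    ("markdown", "# Demo analysis"),
    ("code", "print(1 + 1)"),
])

# Serialize to the text that would be stored in a .ipynb file, then read it back.
text = json.dumps(notebook, indent=1)
restored = json.loads(text)

# Because the file is plain JSON, ordinary tools can inspect it.
code_sources = [c["source"] for c in restored["cells"] if c["cell_type"] == "code"]
print(code_sources)
```

Since a notebook is just structured JSON, it can be parsed, diffed, and converted by ordinary tooling, which is part of what makes exporting to other formats straightforward.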
Community, ecosystem, and funding
IPython has an active community of developers (4) and users who come from academia, the public sector, and industry. Check out the many interesting notebooks listed in this community gallery – a recent favorite is this notebook that replicated XKCD sketches!
Development is funded by individual contributors (tax-deductible donations are handled by NumFOCUS), the Alfred P. Sloan Foundation, the NSF, Microsoft, and the Simons Foundation. In addition, companies (including Enthought, Microsoft, Continuum.io, GraphLab, Dataiku) and institutions (e.g., MIT’s StarCluster) continue to build tools that integrate IPython.
(0) This post is based on an extended conversation with IPython’s creator, Fernando Perez.
(1) Fernando Perez has a nice summary of the evolution of IPython.
(2) IPython kernels for R and MATLAB are in the planning stages. Existing IPython language kernels include IJulia, IHaskell, IFSharp, IRuby, and IScala.
(3) Another textbook worth citing: Probabilistic Programming & Bayesian Methods for Hackers. O’Reilly’s Mining the Social Web book is accompanied by an extensive collection of IPython notebooks.
(4) IPython has 300 committers.