Machine learning requires a fundamentally different deployment approach
As organizations embrace machine learning, the need for new deployment tools and strategies grows.
The biggest issue facing machine learning (ML) isn’t whether we will discover better algorithms (we probably will), whether we’ll create a general AI (we probably won’t), or whether we’ll be able to deal with a flood of smart fakes (that’s a long-term, escalating battle). The biggest issue is how we’ll put ML systems into production. Getting an experiment to work on a laptop, even an experiment that runs “in the cloud,” is one thing. Putting that experiment into production is another matter. Production has to deal with reality, and reality doesn’t live on our laptops.
Most of our understanding of “production” has come from the web world and learning how to run ecommerce and social media applications at scale. The latest advances in web operations—containerization and container orchestration—make it easier to package applications that can be deployed reliably and maintained consistently. It’s still not easy, but the tools are there. That’s a good start.
ML applications differ from traditional software in two important ways. First, they’re not deterministic. Second, the application’s behavior isn’t determined by the code, but by the data used for training. These two differences are closely related.
A traditional application implements a specification that describes what the program is supposed to do. If someone clicks “buy,” an item is added to a shopping cart. If someone deposits a check, a transaction is generated and accounts are debited and credited. Although specifications can be ambiguous, at a fundamental level, the program’s behavior isn’t. If you buy an item and that item doesn’t end up in your shopping cart, that’s a bug.
Machine learning is fundamentally different because it is never 100% accurate. It doesn’t matter whether it’s identifying faces, decoding handwriting, or understanding speech; errors will always occur, and the only real question is whether the error rate is acceptable. Therefore, the performance of an ML system can’t be evaluated against a strict specification. It’s always evaluated against metrics: low-level metrics, like false negatives or false positives, and business-level metrics, like sales or user retention. Metrics are always application specific; in a podcast interview, Pete Skomoroch discussed the metrics LinkedIn used for increasing retention. Error rates are also application specific: you can tolerate a much higher error rate on a recommendation system than you can on an autonomous vehicle. If you can’t tolerate error—if it’s unacceptable for a customer to deposit money that doesn’t land in their bank account—you shouldn’t be using AI.
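To make that concrete, here's a minimal sketch, in Python with scikit-learn, of evaluating a binary classifier against metrics rather than a specification; the labels, predictions, and error budget are illustrative, and the acceptable threshold would be set per application.

```python
# A minimal sketch of metric-based evaluation with scikit-learn; the labels,
# predictions, and error budget below are illustrative toy values.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # ground-truth labels (toy data)
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # model predictions (toy data)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
false_positive_rate = fp / (fp + tn)
false_negative_rate = fn / (fn + tp)

# There is no specification to pass or fail; the question is whether the
# error rate is acceptable for this particular application.
ERROR_BUDGET = 0.15  # hypothetical tolerance; a recommender could be far looser
if max(false_positive_rate, false_negative_rate) > ERROR_BUDGET:
    print("Error rate exceeds what this application can tolerate")
```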
The behavior of a traditional application is completely determined by the code. The code certainly includes programming libraries and, if you want to be pedantic, the source code for databases, web servers, and cloud providers. Regardless of where you draw the line, there’s a code base that determines everything the application can do.
In ML systems, the code is less important. The system's behavior is determined by a model that is trained and tested against a data set collected by developers. The model raises a number of problems. As Ihab Ilyas and Ben Lorica write, "AI starts with 'good' data"; but not all data is equally good. Results from a recent O'Reilly survey showed that "lack of data or data quality issues" is holding back AI technologies. Training data may be inaccurate, and it frequently reflects biases and prejudices that lead to unfair applications. Even if the data is accurate when the model is created, models go stale over time and need to be retrained. People change their behavior, perhaps in response to the system itself. And the use of training data (and the protection of personal information) is increasingly subject to regulation, such as GDPR. Collecting data, cleaning the data, maintaining the pipelines that collect the data, retraining the model, and deploying the new model are tasks that never go away.
How do we deploy and manage such systems? We need to go back to basics, starting with the most fundamental ideas from software development:
- Version control for everything
- Automate every process that can be automated
- Test everything that can be tested
Software developers have been using version control systems to manage changes to source code for years. That's an important start, but we have to realize that machine learning presents a much larger problem. You can't just manage the source code. You have to manage the models, the data used to train and test the models, and the metadata that describes the data (its origins, the terms under which it can be used, and so on). That's beyond the scope of traditional version control systems like git, but we're starting to see tools like MLflow that are designed to manage the development process, including tasks like versioning training data and tracking data lineage and provenance.
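As a rough illustration, tracking a training run with MLflow might look like the sketch below; the experiment name, tags, values, and file paths are hypothetical, and the tag scheme for recording data provenance is one possible convention, not something MLflow prescribes.

```python
# A minimal sketch of experiment tracking with MLflow; names, values, and
# paths are hypothetical, and the provenance tags are one possible convention.
import mlflow

mlflow.set_experiment("churn-model")

with mlflow.start_run():
    # Record which data and code produced this model.
    mlflow.set_tag("data.source", "s3://example-bucket/training/2019-06.csv")  # hypothetical location
    mlflow.set_tag("data.sha256", "9f2c1a...")                                 # hash of the training set
    mlflow.set_tag("git.commit", "abc1234")                                    # source-code revision

    # Hyperparameters and evaluation metrics live alongside the model.
    mlflow.log_param("regularization", 0.01)
    mlflow.log_metric("auc", 0.87)

    # Store the serialized model (or any other file) as a run artifact.
    mlflow.log_artifact("model.pkl")
```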
Currently available tools, such as Docker and Kubernetes, can automate ML deployments. The problem isn't a lack of tools so much as attitude and expectations. An ML application isn't complete when it works on the developer's laptop and is pushed to the cloud. That manual process leaves you with hand-crafted, boutique deployment solutions that are different for every application. If every Docker container is a unique work of art, neither containers nor container orchestration will buy you much. Those solutions will fail as soon as the original developer leaves the company or is just plain unavailable. Remember, too, that an ML system isn't just the app: it must include the pipelines for acquiring data, cleaning data, training models, and testing them. Deploying complex software reliably is a discipline that relies on standardization and automation. Machine learning engineering is now a distinct specialty, and ML engineers are developing tools and practices that are better suited for deploying ML systems.
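A step in that direction is making the training pipeline itself a script that runs end to end with no manual steps, so the same artifact can be built and executed in a container or a CI job rather than by hand on a laptop. The sketch below assumes a labeled CSV file and scikit-learn; the paths and column names are hypothetical.

```python
# A minimal sketch of a scripted, repeatable training pipeline; every step
# runs non-interactively so it can execute inside a container or CI job.
# The data path and column names are hypothetical.
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def run_pipeline(data_path="training_data.csv", model_path="model.pkl"):
    # Acquire and clean the data.
    df = pd.read_csv(data_path).dropna()
    X, y = df.drop(columns=["label"]), df["label"]

    # Train the model and evaluate it on held-out data.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))

    # Persist the trained model so a serving container can load it.
    joblib.dump(model, model_path)
    return accuracy

if __name__ == "__main__":
    print(f"held-out accuracy: {run_pipeline():.3f}")
```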
Testing, monitoring, and the extension of monitoring to observability are cornerstone practices for modern operations and site reliability engineering (SRE). They're equally important for machine learning, but ML changes the game. ML systems need to be monitored against performance metrics, not specifications; we need tools that can detect whether models have become stale and need to be retrained, and those tools might even initiate automatic retraining. Our needs go well beyond frameworks for unit testing and network monitoring: we need a clear picture of whether the system is producing results that are accurate and fair, and whether it is meeting our performance metrics.
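One piece of that picture is staleness detection. The sketch below assumes that ground-truth labels for recent production traffic eventually become available; it compares live accuracy against the training-time baseline and triggers retraining when the gap grows too large. The baseline and threshold values are illustrative.

```python
# A minimal sketch of metric-based staleness monitoring; the baseline and
# allowed drop are illustrative values, not recommendations.
from sklearn.metrics import accuracy_score

BASELINE_ACCURACY = 0.90   # measured when the model was last trained
ALLOWED_DROP = 0.05        # degradation we tolerate before retraining

def check_for_staleness(recent_labels, recent_predictions, retrain):
    """Compare live accuracy to the training-time baseline and, if the model
    has drifted too far, kick off retraining (here, any callable)."""
    live_accuracy = accuracy_score(recent_labels, recent_predictions)
    if BASELINE_ACCURACY - live_accuracy > ALLOWED_DROP:
        retrain()  # could enqueue a pipeline run instead of calling it directly
    return live_accuracy
```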
In the past decade, we've learned a lot about deploying large web applications. We're now learning how to deploy ML applications. The constant refrain is "Just wait! The tools we need are coming!" We don't yet have the tools we need to take machine learning from the laptop to production efficiently and correctly, but we know what to build. Existing tools from the DevOps and SRE communities show what we need; they're proofs of concept demonstrating that the problems of deployment and maintenance at scale are solvable. — Mike Loukides
Data points: Recent O’Reilly research and analysis
At Radar, our insights come from many sources: our own reading of the industry tea leaves, our many contacts in the industry, our analysis of usage on the O’Reilly online learning platform, and data we assemble on technology trends.
Every month we plan on sharing notable, useful, or just plain weird results we find in the data. Below you’ll find nuggets from our recent research.
In “How companies adopt and apply cloud native infrastructure,” we surveyed tech leaders and practitioners to evaluate the state of cloud native in the enterprise.
- Nearly 50% of respondents cited lack of skills as the top challenge their organizations face in adopting cloud native infrastructure. Given that the technology is both new and rapidly evolving, engineers struggle to keep up to date on new tools and technologies.
- 40% of respondents use a hybrid cloud architecture. The hybrid approach can accommodate applications where at least some of the data can’t be stored on a public cloud, and can serve as an interim architecture for organizations migrating legacy applications to a cloud native architecture.
- Among respondents whose organizations have adopted cloud native infrastructure, 88% use containers and 69% use orchestration tools like Kubernetes. These signals align with the Next Architecture’s hypothesis that cloud native infrastructure best meets the demands put on an organization’s digital properties.
In “The topics to watch in software architecture,” we evaluated speaker proposals for the O’Reilly Software Architecture Conference. These go-to experts and practitioners work on the front lines of technology, and they understand that business and software architecture need to operate in harmony to support overall organizational success.
- Microservices was the No. 1 term in the proposals. This topic remains a bedrock concept in the software architecture space.
- A big year-over-year jump in serverless, the No. 7 term, up 89 slots, suggests increased interest, exploration, and experimentation around this nascent and evolving topic for software architects.
- Machine learning (No. 20) and AI (No. 45) each ranked well among the most frequently referenced topics; combined, they would rise to No. 7 overall. The increase of AI/ML in proposals is likely tied to the need for more skills development in the software architecture space as well as AI/ML's role in monitoring and reliability.
Finally, in “What’s driving open source software in 2019,” we analyzed proposal data for the O’Reilly Open Source Software Conference (OSCON). Virtually every impactful socio-technical transformation of the last 20 years is encoded in the record of OSCON speaker proposals. This record doesn’t merely reflect the salience of these and other trends: it anticipates this salience, sometimes by several years.
- In the 2019 OSCON proposals, we see cloud native gaining traction for open source developers to help promote resilience, scaling, availability, and improved responsiveness. The shift to a cloud native paradigm brings new challenges, new tools, and new practices for developers to master.
- Results from our ranking of proposal phrases show the centrality of data to the open source community: "data" (the No. 5 term) outpaces "code" (the No. 14 term), AI/ML topics are on the rise, and in the nascent cloud native paradigm, monitoring and analytics assume critical importance, highlighting the demand for skills in analytics, data acquisition, and related areas.
- AI and ML posted big year-over-year jumps in the OSCON proposals, with the focus shifting from exploration to operationalizing the technology—driving the need for AI/ML skills as well as expertise in a constellation of adjacent technologies, such as automation, monitoring, data preparation, and integration.