It's clear that AI can and will have a big influence on how we develop software.
By Mike Loukides and Ben Lorica. [A version of this post appears on the O'Reilly Radar.]
Roughly a year ago, we wrote "What machine learning means for software development." In that article, we talked about Andrej Karpathy's concept of Software 2.0. Karpathy argues that we're at the beginning of a profound change in the way software is developed. Up until now, we've built systems by carefully and painstakingly telling them exactly what to do, instruction by instruction. The process is slow, tedious, and error-prone; most of us have spent days staring at a program that should work, but doesn't. And most of us have been surprised when a program that has been reliable for some time suddenly screws up on some slightly unexpected input. The last bug is always the one you find next; if someone hasn't already said that, someone should have.
Karpathy suggests something radically different: with machine learning, we can stop thinking of programming as writing a set of instructions in a programming language like C or Java or Python. Instead, we can program by example. We can collect many examples of what we want the program to do and what not to do (examples of correct and incorrect behavior), label them appropriately, and train a model to perform correctly on new inputs. In short, we can use machine learning to automate software development itself.
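To make "programming by example" concrete, here's a minimal sketch using scikit-learn (the toy spam-filter framing and data are our own illustration, not Karpathy's): instead of writing rules that define spam, we hand the system labeled examples of correct and incorrect behavior and train a model that generalizes to new inputs.

```python
# A minimal sketch of programming by example: instead of writing rules
# that define spam, we supply labeled examples and train a model.
# The data here is a toy stand-in; a real system would use thousands
# of labeled messages.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

examples = [
    "WIN a FREE prize, click now!!!",   # examples of incorrect (spam) behavior
    "Limited offer, act today",
    "Meeting moved to 3pm tomorrow",    # examples of correct behavior
    "Here are the notes from standup",
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

# The "program" is a trained model, not a list of hand-written rules.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(examples, labels)

# The model now handles inputs it has never seen.
print(model.predict(["Click here for your free offer"]))  # likely [1]
```

The "program" here is the trained model; changing its behavior means changing the training data, not rewriting the code.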
It's time to evaluate what has happened in the year since we wrote that article. Are we seeing the first steps toward the adoption of Software 2.0? Yes, but so far, they're only small steps. Most companies don't have the AI expertise to implement Karpathy's vision. Traditional programming is well understood. Training models isn't well understood yet, at least not within companies that haven't already invested significantly in technology (in general) or AI (in particular). Nor are building data pipelines and deploying ML systems well understood. The companies that are systematizing how they develop ML and AI applications are companies that already have advanced AI practices.
That doesn't mean we aren't seeing tools to automate various aspects of software engineering and data science. Those tools are starting to appear, particularly for building deep learning models. We're seeing continued adoption of tools like AWS's SageMaker and Google's AutoML. AutoML Vision allows you to build models without having to code; we're also seeing code-free model building from startups like MLJAR and Lobe, and tools focused on computer vision, such as Platform.ai and Matroid. Another sign that companies are scaling up their use of ML and AI is the rise of data platforms aimed at accelerating ML development and deployment within companies that are growing dedicated machine learning and AI teams. Several leaders in AI have described platforms they've built internally (such as Uber's Michelangelo, Facebook's FBLearner, Twitter's Cortex, and Apple's Overton); these companies are influencing other companies that are starting to build their own tools. Companies like Databricks are building software as a service (SaaS) or on-premises tools for companies that aren't ready to build their own platform.
We've also seen (and featured at O'Reilly's AI Conference) Snorkel, an ML-driven tool for automated data labeling and synthetic data generation. HoloClean, another tool developed by researchers from Stanford, Waterloo, and Wisconsin, undertakes automatic error detection and repair. As Chris Ré said at our conference, we've made a lot of progress in automating data collection and model generation, but labeling and cleaning data have stubbornly resisted automation. At O'Reilly's AI Conference in Beijing, Tim Kraska of MIT discussed how machine learning models have outperformed standard, well-known algorithms for database optimization, disk storage optimization, basic data structures, and even process scheduling. The hand-crafted algorithms you learned in school may cease to be relevant, because AI can do better. Rather than learning about sorting and indexing, the next generation of programmers may learn how to apply machine learning to these problems.
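To give a flavor of the results Kraska described, here's a toy sketch of a learned index (our own illustration, not the MIT implementation): a simple regression model predicts where a key lives in a sorted array, and a bounded local search corrects any prediction error.

```python
# Toy sketch of a learned index: a model predicts the position of a key
# in a sorted array, replacing (part of) a classical index structure.
import numpy as np

rng = np.random.default_rng(0)
keys = np.sort(rng.uniform(0, 1_000_000, size=100_000))
positions = np.arange(len(keys))

# "Train" the index: fit a linear model mapping key -> position.
slope, intercept = np.polyfit(keys, positions, deg=1)

# Bound the worst-case prediction error over the stored keys, so a
# lookup only needs to search a small window around the prediction.
guesses = (slope * keys + intercept).astype(int)
max_err = int(np.max(np.abs(guesses - positions)))

def lookup(key):
    """Predict the key's position, then search only the error window."""
    guess = int(slope * key + intercept)
    lo = max(0, guess - max_err)
    hi = min(len(keys), guess + max_err + 1)
    return lo + int(np.searchsorted(keys[lo:hi], key))

# Every stored key is found at its true position.
i = lookup(keys[12_345])
assert keys[i] == keys[12_345]
```

For uniformly distributed keys the error window is tiny, so a lookup touches only a handful of elements; the real systems use hierarchies of models to handle less friendly key distributions.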
One of the most suggestive projects we've seen is RISE Lab's AutoPandas. Given a set of inputs and the outputs those inputs should produce, AutoPandas synthesizes a pandas program that maps those inputs to those outputs. This "programming by example" is an exciting step toward Software 2.0.
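A full synthesizer like AutoPandas searches an enormous, compositional space of programs, but the core idea fits in a few lines. Here's a deliberately tiny sketch (ours, not AutoPandas's algorithm): enumerate candidate pandas programs and keep the one that turns the example input into the example output.

```python
# Toy illustration of programming by example: search a (tiny) space of
# candidate pandas programs for one consistent with the given
# input/output pair. A real synthesizer searches a far larger space.
import pandas as pd

inp = pd.DataFrame({"name": ["b", "a", "c"], "score": [2, 3, 1]})
out = pd.DataFrame({"name": ["a", "b", "c"], "score": [3, 2, 1]})

candidates = {
    "sort by name": lambda df: df.sort_values("name").reset_index(drop=True),
    "sort by score": lambda df: df.sort_values("score").reset_index(drop=True),
    "drop duplicates": lambda df: df.drop_duplicates().reset_index(drop=True),
}

for name, program in candidates.items():
    if program(inp).equals(out):
        print("found program:", name)  # -> found program: sort by name
        break
```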
What are the biggest obstacles to adoption? The same set of problems that AI and ML are facing everywhere else (and that, honestly, every new technology faces): lack of skilled people, trouble finding the right use cases, and the difficulty of finding data. That's one reason Software 2.0 is having the greatest influence on data science: that's where the skilled people are. Those are the same people who know how to collect and preprocess data, and who know how to define problems that can realistically be solved by ML systems. With AutoPandas, and automated tools for optimizing database queries, we're just starting to see AI tools that are aimed at software developers.
Machine learning also comes with certain risks, and many businesses may not be willing to accept those risks. Traditional programming is by no means risk-free, but at least those risks are familiar. Machine learning raises the question of explainability. You may not be able to explain why your software does what it does, and there are many application domains (for example, medicine and law) where explainability is essential. Reliability is also a problem: it's not possible to build a machine learning system that is 100% accurate. If you train a system to manage inventory, how many of that system's decisions will be incorrect? It might make fewer errors than a human, but we're more comfortable with the kinds of errors humans make. We're only starting to understand the security implications of machine learning, and wherever data is involved, privacy questions are almost certain to follow. Understanding and addressing the risks of ML and AI will require cross-functional teams; these teams need to encompass not only people with different kinds of expertise (security, privacy, compliance, ethics, design, and domain expertise), but also people from different social and cultural backgrounds. Risks that one socio-cultural group accepts without thinking twice are often completely unacceptable to those with different backgrounds; think, for example, of what the use of face identification means to people in Hong Kong.
These problems, though, are solvable. Model governance, model operations, data provenance, and data lineage are becoming hot topics for people and organizations that are implementing AI solutions. Understanding where your data comes from and how it has been modified, along with understanding how your models are evolving over time, is a critical step in addressing safety. Governance and provenance will become even more important as data use becomes subject to regulation; and we're starting to see data-driven businesses follow the lead of companies in highly regulated industries, such as banking and health care.
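Even simple provenance goes a long way. As a minimal sketch (our illustration; production teams typically use dedicated lineage and tracking tools), recording a content hash of the training data alongside each model version makes every model traceable back to the exact data it was trained on:

```python
# Minimal data provenance sketch: log which data a model was trained on
# by hashing the training file and recording it with the model version.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def fingerprint(path: str) -> str:
    """Content hash of a data file; changes whenever the data changes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def log_training_run(data_path: str, model_version: str, params: dict):
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "data_path": data_path,
        "data_sha256": fingerprint(data_path),
        "model_version": model_version,
        "params": params,
    }
    # Append-only log; any model can later be traced back to its data.
    with open("lineage_log.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

# Hypothetical usage (the paths and names are placeholders):
# log_training_run("data/train.csv", "inventory-model-v7", {"lr": 0.01})
```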
We are at the edge of a revolution in how we build software. How far will that revolution extend? We don't know; it's hard to imagine AI systems designing good user interfaces for humans (though once designed, it's easy to imagine AI building those interfaces). Nor is it easy to imagine AI systems designing good APIs for programmatic access to applications. But it's clear that AI can and will have a big influence on how we develop software. Perhaps the biggest change won't be a reduction in the need for programmers; it will be freeing programmers to think more about what we're doing, and why. What are the right problems to solve? How do we create software that's useful to everyone? That's ultimately a more important problem than building yet another online shopping app. And if Software 2.0 lets us pay more attention to those questions, it will be a revolution that's truly worthwhile.