In The Future of Software, I wrote that to build a platform that enables anyone to create custom software applications, we’re not taking the mainstream approach of just using an LLM (large language model) end-to-end. But LLMs have demonstrated amazing performance across many different tasks, including the generation of code. So why take a different approach? Recall that our mission at Durable is to empower anyone, regardless of their ability to read or write code, to build custom software applications just by interactively describing what they need. The AI that powers such an experience needs to generate software applications that work without users having to verify or correct the code. It also needs to precisely understand what users want by asking the right questions and validating its assumptions. LLMs currently cannot reliably produce error-free outputs. They also cannot reliably and actively clarify intent through Q&A. And we don’t think these shortcomings are temporary.
LLMs are a modern twist on an old concept. They are distant descendants of the N-gram-based language models of the 1950s. These N-gram models were trained to predict the next word (or token) of a sentence given everything before it (in practice, text-prediction models only see a limited history of the preceding text). In comparison to those old models, today’s LLMs are gargantuan. But they’re trained for the same objective: to predict the next word (or token) given the preceding text. Nearly all modern LLMs are based on the transformer architecture, first published in the seminal 2017 paper from Google. Trained on massive corpora of internet text with this simple next-token-prediction objective, transformer-based LLMs have succeeded on a remarkably broad array of tasks, which has been one of the biggest surprises of the past few years. LLM-generated content is now so prevalent on the internet that it’s interfering with the training of new models.
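To make that training objective concrete, here is a toy sketch of next-word prediction with a bigram model, the simplest member of the N-gram family. The corpus and names are invented purely for illustration; production LLMs replace the count table with a transformer over billions of parameters, but the objective, predicting the next token from the preceding context, is the same.

```python
# Toy bigram model: predict the next word from counts of what followed it before.
from collections import Counter, defaultdict

corpus = "the user builds an app the user reviews a book the user builds software".split()

# Count how often each word follows each preceding word.
bigram_counts: dict[str, Counter] = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the continuation most frequently seen after `word`."""
    followers = bigram_counts[word]
    return followers.most_common(1)[0][0] if followers else "<unk>"

print(predict_next("user"))  # -> "builds" (seen twice, vs. "reviews" once)
```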
But LLMs also have peculiar shortcomings: they hallucinate facts, concoct events that didn’t happen, unpredictably ignore parts of their inputs, make hidden assumptions, and are sensitive to the structure of the input language. When the output of an LLM is used for entertainment or brainstorming, or if the user can spot errors and use trial and error to correct them, these failure modes are inconsequential. But if the output is a software application, the bar is much higher. It must be functional and demonstrably correct without requiring users to read and correct the code line by line.
Consider how modern compilers take high-level programming languages as input and generate machine code. There was a time when human programmers wrote this machine code directly (or punched it into cards). But modern compilers elevated programming to higher-level languages: no one verifies the machine code generated by compilers anymore. This is possible because high-level programming languages are formal languages. Natural language is not. It is ambiguous, equivocal, and infinitely flexible. For example, there are many ways to interpret and satisfy the request: “build a web app that lets users review and discuss books they’ve read”. To empower anyone to build software applications simply by describing them in natural language, we must first concretize their intent. This means taking high-level and possibly vague natural language inputs and turning them into concrete, feasible requirements interactively with the user in the loop. The AI that enables this must ask questions about missing information or potential conflicts and issues. It must make reasonable assumptions but explicitly communicate them. And finally, it must demonstrate to users that their intent is correctly understood and that they can trust the generated software application.
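To give a feel for what a concretized intent might look like, here is a hedged sketch in Python. The fields and example content are our own invention for this post, not Durable’s actual data model; the point is simply that requirements, assumptions, and open questions all become explicit and visible to the user.

```python
# Illustrative only: a vague request broken into explicit requirements,
# stated assumptions, and open questions to resolve with the user.
from dataclasses import dataclass, field

@dataclass
class ConcretizedIntent:
    request: str                                            # the user's original words
    requirements: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)    # made explicit, never hidden
    open_questions: list[str] = field(default_factory=list)

    def is_buildable(self) -> bool:
        """Only generate an application once nothing essential is unresolved."""
        return not self.open_questions

intent = ConcretizedIntent(
    request="build a web app that lets users review and discuss books they've read",
    requirements=["user accounts", "a book catalog", "reviews attached to books"],
    assumptions=["a review is text plus a 1-5 star rating"],
    open_questions=["should discussion be threaded comments or a separate forum?"],
)
print(intent.is_buildable())  # False: a question must still go back to the user
```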
Many argue that today’s LLMs will exhibit these capabilities and much more, if provided plenty of high-quality data and compute for training. But we think this fixation on end-to-end LLMs is misguided, just as it was once erroneously thought futile to work on anything but n-grams. And in looking beyond LLMs, we’re not alone. We believe that the AI needed to power our product will be custom-built to concretize user intent and generate working software applications. It will be made of an ensemble of components, some of which will use custom LLMs to process language or generate code. Other components will be specialized to ensure that the assembled code is functional and correct when run within an ecosystem of users, APIs, and services. Rather than generating code token by token, its output will be the result of planning over a learned world model, considering many possible paths to a solution simultaneously. Such an ensemble of components will need an explicit representation of what is known and what isn’t, and the ability to formulate good questions and know when and how to ask them.
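The contrast between token-by-token generation and planning can also be sketched in a few lines. The best-first search below is a stand-in, with a placeholder scoring function playing the role of the learned world model; it is not a description of our planner, only of the shape of the idea: keep many candidate paths alive and expand the most promising one.

```python
# Hedged sketch: plan by expanding the most promising of many candidate paths,
# instead of committing to one token at a time.
import heapq

def plan(start, goal_check, expand, score, budget: int = 100):
    """Best-first search; `score` stands in for a learned world model's
    estimate of how promising a partial plan is."""
    frontier = [(-score([start]), [start])]   # max-heap via negated scores
    for _ in range(budget):
        if not frontier:
            break
        _, path = heapq.heappop(frontier)
        if goal_check(path):
            return path
        for step in expand(path):             # many possible continuations at once
            heapq.heappush(frontier, (-score(path + [step]), path + [step]))
    return None

# Toy usage: assemble a build plan that ends with a deployed app.
steps = ["schema", "api", "ui", "deploy"]
result = plan(
    start="spec",
    goal_check=lambda p: p[-1] == "deploy",
    expand=lambda p: [s for s in steps if s not in p],
    score=len,                                # toy world model: prefer more complete plans
)
print(result)  # -> a path ending in "deploy", e.g. ['spec', 'api', 'deploy']
```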
Consequently, our approach to Durable’s AI is fundamentally different from the status quo. Durable’s AI is built solely to interactively concretize, and then satisfy, a user’s requirements by building working software applications. It does this by planning over a learned world model in the joint space of natural language and code. It generates an application only when it has concretized the user’s intent against an explicit representation of what it knows. Otherwise, it asks questions to learn new skills or acquire missing knowledge. It’s built using neuro-symbolic AI: combining the strengths of custom LLMs in dealing with the flexibility and ambiguity of natural language with the reasoning power and explainability of purpose-built symbolic AI.
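Put together, the interaction loop we have in mind looks roughly like the sketch below, reusing the ConcretizedIntent sketch from earlier. The helper functions are hypothetical stand-ins, not real Durable APIs; what matters is the control flow: ask while gaps remain, generate only once intent is fully concretized.

```python
# Simplified, hypothetical control flow: ask until no explicit knowledge gaps
# remain, then plan and generate. ask_user and generate_app are stubs.

def ask_user(question: str) -> str:
    """Stand-in for the interactive Q&A channel with the user."""
    return input(f"{question} ")

def generate_app(intent: ConcretizedIntent) -> str:
    """Stand-in for planning over the world model and emitting a working app."""
    return f"application satisfying: {intent.requirements}"

def build_with_user(intent: ConcretizedIntent) -> str:
    while intent.open_questions:              # explicit representation of what's unknown
        question = intent.open_questions.pop(0)
        answer = ask_user(question)
        # Fold the answer back into what the system explicitly knows.
        intent.requirements.append(f"{question} -> {answer}")
    return generate_app(intent)               # only now is generation safe
```

In this framing, the neural components live inside steps like turning a user’s free-form answer into a requirement, while the symbolic components own the explicit bookkeeping of what is known, what is assumed, and what is still open.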
This is a technically difficult approach. Unlike LLMs, where the primary bottleneck is scaling training data and compute, the challenges for neuro-symbolic approaches also include building planning engines and more complex model architectures. But as the means to deliver our product vision, where custom, flexible, and durable software is democratized and accessible to everyone, we consider it the only path forward.