Code is less forgiving than natural language. The syntax is strict, and only certain operators, names, and keywords are allowed. Durable’s platform builds software for users who, we assume, don’t know how to code, so it must generate syntactically and semantically correct code every time. I’ve written before about why we don’t just use a large language model to do this: we need to ask questions when necessary, state assumptions everywhere else, and produce functional and correct code. These are all challenging tasks for today’s LLMs, but one of their characteristics in particular makes generating correct code especially difficult: LLMs can make things up.

Making things up, or hallucinations (some would argue confabulation is the right term, but we’ll use the more common term here), is especially problematic when the output is code. If the LLM hallucinates the name of a function, attribute, or library, the code won’t run, leaving our users with no recourse to fix it. Incidentally, this is also why most LLM-for-code applications are targeted at developers. Until we solve the hallucination problem, text-to-code will only be accessible to developers who can spot and fix these errors. So why do hallucinations occur? And will future LLMs be rid of them?
Hallucinations are a byproduct of how LLMs are trained: next-token prediction. Given a sequence of tokens (representing words, parts of words, or characters) in the training data, the model is trained to estimate a probability for each possible token that could follow, and to maximize the probability it assigns to the correct token. Whether the token represents a truthful or false narrative is not part of the training objective. The only objective is narrative continuity. As Bottou and Schölkopf described in their wonderfully intuitive paper Borges and AI:
At any instant, our imagined apparatus [the LLM] is about to generate a story constrained by the narrative demands of what is already printed on the tape [the context]. Some words were typed by the user, some result from the past random picks of the language model. Neither truth nor intention matters to the operation of the machine, only narrative necessity. The ability to recognize the demands of a narrative is a flavour of knowledge distinct from the truth. Although the machine must know what makes sense in the world of the developing story, what is true in the world of the story need not be true in our world. Is Juliet a teenage heroine or your cat-loving neighbour? Does Sherlock Holmes live on Baker Street? As new words are printed on the tape, the story takes new turns, borrowing facts from the training data (not always true) and filling the gaps with plausible inventions (not always false). What the language model specialists sometimes call hallucinations are just confabulations.
This explains why LLMs are more correct when prompted to “take a deep breath”. In that narrative, where the LLM is playing the part of a thoughtful interlocutor, narrative continuity requires the LLM to be more correct. But objective truth is beholden to no narrative. For example, no matter how many times “the earth is flat” appears on the internet, it doesn’t change the fact that it isn’t. Assuming equally weighted data, an LLM trained on a dataset where “the earth is flat” appears more frequently than “the earth is a sphere” will ascribe a larger probability to the incorrect result when prompted to complete “the earth is ...”. And while there is a narrative in which everything the LLM says is correct and true, prompt-engineering it into that narrative (“everything you say from here will be true and correct”) doesn’t appear to work.
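To make the frequency argument concrete, here is a minimal, hypothetical sketch (not our system; the toy corpus is invented) of what the next-token objective rewards. A model trained with cross-entropy on this data converges toward the empirical continuation frequencies, regardless of which continuation is true:

    # Toy illustration: the next-token objective rewards matching the
    # empirical conditional distribution of the training data, not reality.
    from collections import Counter

    # Hypothetical corpus where the false continuation is more frequent.
    corpus = [
        "the earth is flat",
        "the earth is flat",
        "the earth is flat",
        "the earth is a sphere",
        "the earth is a sphere",
    ]

    prefix = "the earth is"
    continuations = Counter(
        sentence[len(prefix):].strip().split()[0]
        for sentence in corpus
        if sentence.startswith(prefix)
    )

    total = sum(continuations.values())
    for token, count in continuations.most_common():
        print(f"P({token!r} | {prefix!r}) = {count / total:.2f}")

    # Output:
    # P('flat' | 'the earth is') = 0.60
    # P('a' | 'the earth is') = 0.40

A real LLM smooths and generalizes across contexts rather than counting literal strings, but the objective it optimizes points in the same direction: reproduce what the data says, not what is true.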
Is this a fundamental problem with the auto-regressive LLM training paradigm? Or a problem with the data? Or with the domain itself (i.e. language)? As of early 2024, we don’t yet know the answer. Since there is evidence that data quality can make a significant difference in LLM performance, perhaps we could heavily curate the training data to include only what is true. But truth is context-dependent. For example, a sentence claiming the earth is flat might be perfectly true in the context of an article on humanity’s ancient beliefs, and false in the context of a modern scientific article. So while text is a projection of the world, it is a projection through an unreliable narrator (the writer), and through the lens of the context of the narration. To not hallucinate, an LLM would have to disentangle what is true and what isn’t in every context it encounters, all without access to the raw physical evidence used to establish those truths. In our case, even that’s not enough: for our AI to generate code that accomplishes what the user wants, it needs to understand the cause and effect of executing code.
Causal inference is necessary when using evidence to decide whether a hypothesis is true. For example, although the altitude of a city and its average temperature are correlated, the causal arrow only points in one direction: a city’s altitude explains (causes) some part of its average temperature, and not the other way around. Although it’s possible to ascertain this purely from observational data, it’s difficult, and requires specialized causal-inference methods. And even then, the data is often not textual, but a measurement of some physical quantity. LLMs perform well on text-based causal inference benchmarks, but the mere fact of hallucinations calls into question whether this performance reflects causal inference from evidence, or just narrative continuity.
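As a toy illustration of why observation alone underdetermines the causal direction, here is a small simulation in the spirit of the altitude/temperature example (all numbers are invented). The correlation is symmetric, but only interventions on the generating process reveal which way the arrow points:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000

    # Structural model (invented numbers): altitude causes temperature.
    def sample_temperature(altitude):
        return 25 - 0.006 * altitude + rng.normal(0, 2, n)   # degrees C

    altitude = rng.uniform(0, 3_000, n)                       # metres
    temperature = sample_temperature(altitude)

    # Observation alone: the correlation is symmetric and silent about direction.
    print("corr:", round(float(np.corrcoef(altitude, temperature)[0, 1]), 3))

    # Intervening on the cause changes the effect...
    temperature_raised = sample_temperature(altitude + 1_000)
    print("mean temperature shift after +1,000 m:",
          round(float(temperature_raised.mean() - temperature.mean()), 2))

    # ...but intervening on the effect changes nothing upstream: in this
    # generating process nothing points from temperature back to altitude,
    # so forcing temperature += 5 leaves every altitude exactly as it was.

A model that only ever sees the paired observations (or text describing them) has no way, without further assumptions or interventions, to tell this generating process apart from one where temperature causes altitude.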
Although we use custom-trained LLMs as part of our AI stack, these challenges are why we don’t just use a single LLM to generate code. We’re also not alone in looking beyond LLMs. Yann LeCun, one of the three Turing Award winners behind the most recent wave of deep learning, has taken a vocal stance against auto-regressive LLMs and the prospects of solving hallucinations. LeCun argues that language is inadequate to reach human-level intelligence, and that planning over world models in a fully differentiable system is the answer, but the representation of the world model, how it’s obtained, and the connection with language are all open problems. There is evidence that LLMs build world models, induced through compression of the training data, and it’s possible to use them as world models for planning. But it’s unclear whether these world models are causal and correct, and whether the LLM can access them generally or only under very specific and sensitive prompting.
So our approach is to use custom-trained LLMs as a language interface to the user, a task where they shine. We then complement them with an AI system designed from the ground up to guide users to describe objectives that the system then satisfies by generating working code. The AI’s goal isn’t auto-regressive narrative continuity. Rather, it uses planning to search for a sequence of code actions that meet the user’s objective, while at the same time using symbolic AI to ensure that it can compile these actions into a functioning program. It performs this search over a learned causal world model, which encodes actions and their effects in the joint space of code and natural language. Critically, this world model is external to the LLMs that power the natural language interface to our users. During each step of the plan, the AI accesses the parts of the world model relevant to the current sub-problem, and progressively breaks down the user objectives. This process continues until one of three things happens: a) the AI successfully generates working software, b) it deems it necessary to ask the user clarifying questions, or c) it reaches an impasse (e.g. due to impossible user objectives) and requires the user to modify specific parts of the objectives.
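To make the shape of that loop concrete, here is a deliberately simplified, hypothetical sketch. Every class and method name (world_model.actions_for, validator.check, and so on) is invented for illustration; this is not Durable’s implementation, only the general plan/validate/ask pattern described above:

    # Hypothetical sketch of a plan loop over a world model with symbolic
    # validation. All interfaces here are invented for illustration.
    from dataclasses import dataclass, field

    @dataclass
    class PlanState:
        objectives: list[str]                              # user objectives, progressively decomposed
        actions: list[str] = field(default_factory=list)   # validated code actions so far

    def plan(objectives, world_model, validator, max_steps=50):
        state = PlanState(objectives=list(objectives))
        for _ in range(max_steps):
            if not state.objectives:
                # Every objective satisfied: compile the validated actions.
                return {"status": "success", "program": validator.compile(state.actions)}

            goal = state.objectives.pop(0)
            # Consult only the slice of the world model relevant to this sub-problem.
            candidates = world_model.actions_for(goal)

            if not candidates:
                # Impasse: ask the user to revise this part of the objective.
                return {"status": "impasse", "objective": goal}
            if world_model.is_ambiguous(goal, candidates):
                # Under-specified: surface a clarifying question instead of guessing.
                return {"status": "question", "question": world_model.question_for(goal)}

            # Keep the first candidate whose effects the symbolic layer can verify,
            # and push any new sub-goals its preconditions introduce.
            for action in candidates:
                if validator.check(state.actions + [action]):
                    state.actions.append(action)
                    state.objectives = world_model.subgoals(action, goal) + state.objectives
                    break
            else:
                return {"status": "impasse", "objective": goal}
        return {"status": "impasse", "objective": None}

The point of the sketch is the control flow: the system either makes verified progress, asks, or stops, and at no step does it fall back to free-form generation whose output cannot be checked.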
Since our AI doesn’t optimize for narrative continuity, it doesn’t hallucinate. But its function depends on a validated and causal world model, and it only scales if we build this world model without human supervision. Its function also depends on a planning engine that infers when clarifying questions should be posed to users, and how their answers should be used. These are non-trivial capabilities, but we believe this approach is the only way to unlock text-to-product for everyone. I’ll share my thoughts on our approach to them in future articles.
Thanks to Fernando Nobre, Liam McInroy, and Chris Fruci for reading drafts of this article.