Generative AI products today are often little more than direct interfaces to the underlying foundation models that power them. For example, ChatGPT’s answers are the direct output of the GPT-X language model, with some front-end rendering. Similarly, images produced by MidJourney and StableDiffusion are the direct output of the text-to-image models that power them. The results these models produce are impressive and widely celebrated, but generative AI interfaces are still in their infancy. More capable interfaces, powered by specialized AI systems, have the potential to harness the power of generative AI for a much broader set of applications. In this article, I’ll take a look at the predominant interface to generative AI, where it works well, and where better alternatives are needed.
Most current generative AI products follow a similar interface playbook: a user prompts an AI to generate some output (for example, a textual answer or an image), then iteratively corrects errors in it by adjusting the initial prompt or adding new clarifying inputs. I’ll refer to this as a trial-and-error interface. Such an interface succeeds in applications where users can quickly identify errors in the AI’s output, and where they also know how to modify the prompt to steer the AI towards the correct answer. For example, when playing around with the amazing GPT-4 Code Interpreter, I asked it to plot some numbers on a line for me:
I could see the error in the output, and knew how to re-prompt ChatGPT to get what I wanted: a number-line plot where each dot’s position on the x axis corresponded to its value, rather than its position in the list. With that additional prompt, ChatGPT gave me the output I wanted, and I was quickly able to verify that it was correct.
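To make the difference concrete, here is a minimal sketch of the corrected plot (my own illustration, not ChatGPT’s output), assuming matplotlib and some made-up example values. The only change from the flawed version is whether each dot’s x coordinate is its list index or its value:

```python
# A minimal sketch of the corrected number-line plot, assuming matplotlib.
# The values are hypothetical example data, not from the original conversation.
import matplotlib.pyplot as plt

values = [3, 7, 1, 12, 8]

fig, ax = plt.subplots(figsize=(8, 1.5))

# Flawed version (what the first prompt produced): x = position in the list
# ax.scatter(range(len(values)), [0] * len(values))

# Corrected version: x = the value itself
ax.scatter(values, [0] * len(values))
ax.get_yaxis().set_visible(False)
ax.set_xlabel("value")
plt.show()
```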
But if the user doesn’t know how to adjust the prompt to correct an error, or the error isn’t obvious, then a trial-and-error interface no longer works. Consider this next example, which I created as a way to explore ChatGPT’s world-modeling capabilities with respect to temporal constraints. I asked GPT-4 to write me a script that each day would output yesterday’s temperature, with the constraint that it could only get today’s temperature via an API. Although this is a slightly artificial example to prove a point, it resembles the kinds of constraints we often come across in our users’ use cases at Durable.
The result looks so good at first glance that I almost missed a subtle bug in the logic of the program. Before emitting the code, ChatGPT correctly broke down its solution into steps: in order to retrieve yesterday’s temperature when all we have is today’s temperature, we can write today’s temperature to storage and then retrieve it the following day. But the last five lines of the program implement a subtly different logic: today’s temperature is written to storage, then immediately retrieved from storage and shown to the user. As a result, the user is always presented with today’s temperature. As a developer, I’m able to spot this bug if I read the code carefully, line by line. But if this script were written for a non-technical user, the bug wouldn’t be spotted until the script had run for at least a day, and then only if the user knew the right answer. For such a user and use case, the trial-and-error interface doesn’t work.
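To make the bug concrete, here is a condensed sketch of the two behaviors. This is my own reconstruction, not the generated script itself; a stubbed API call and simple file-based storage stand in for the real ones:

```python
# A sketch of the buggy logic versus the logic ChatGPT described in its plan.
# The API helper and storage file are hypothetical stand-ins.
import json
from pathlib import Path

STORE = Path("temperature_store.json")

def get_todays_temperature() -> float:
    # Stand-in for the weather API, which only ever returns *today's* value.
    return 21.5

def buggy_yesterdays_temperature() -> float:
    # What the generated code actually did: write today's temperature,
    # then immediately read it back -- so the user always sees today's value.
    STORE.write_text(json.dumps(get_todays_temperature()))
    return json.loads(STORE.read_text())

def correct_yesterdays_temperature() -> float | None:
    # What the plan described: read the value stored by the *previous* run
    # before overwriting it with today's temperature.
    yesterday = json.loads(STORE.read_text()) if STORE.exists() else None
    STORE.write_text(json.dumps(get_todays_temperature()))
    return yesterday
```

The two functions differ only in the order of the read and the write, which is exactly why the bug is so easy to miss.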
Before discussing an alternative interface that would solve the problem we just saw, it’s worth focusing first on the issue of correctness. Some of the biggest successes in generative AI are in applications where there isn’t a strictly correct answer. For example, virtual companionship services like character.ai and image generation products like MidJourney produce outputs that aren’t easily classified as correct or incorrect. This has led prominent voices in generative AI to assert that correctness is overrated. But even though correctness is sometimes optional for today’s popular applications of generative AI, that’s not the case for many of its most promising future applications. For example, with AI we could drastically increase the efficiency and reliability of software development, or supercharge our ability to glean insights from complex and high-volume data, or accelerate the path to new scientific discoveries. But these advances depend on AI-generated outputs whose correctness can be verified. Correctness matters here.
Correctness is similarly critical for our product at Durable. Our mission is to give non-technical users the superpower to build general-purpose custom software. Many of our target use cases involve long-running programs with complex logic that perform critical functions. A trial-and-error interface simply doesn’t work in this setting. Instead, we’re building a different kind of interface: one that guides users through a two-step process to establish trust in the correctness of the generated output.
The first step ensures that the AI understands the user’s intent to a sufficient level of detail. Our interface presents users with clarifying questions and guides them to validate assumptions made by the AI. The AI itself is specifically architected to enable this, rather than hallucinating answers. But understanding user intent isn’t sufficient. Even if the AI produces the correct output, our non-technical users can’t verify its correctness by reading the code. The second step of the process addresses this by elevating battle-hardened verification practices from software development to the level of non-technical users. One such practice is unit testing, which verifies software correctness by comparing generated outputs to known correct answers in a controlled setting. With purpose-built AI and a custom interface, high-level unit tests can be built that help non-technical users verify the software we generate.
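As a rough sketch of the idea (illustrative only, not our actual tooling), a high-level unit test for the temperature example would state the expected behavior in the user’s own terms and check the generated logic against it in a controlled setting:

```python
# Illustrative sketch only -- not Durable's product code. The test expresses
# the behavior a non-technical user cares about: "on the second day, the
# script reports the first day's temperature."

def report_yesterdays_temperature(todays_temp: float, store: dict) -> float | None:
    """Placeholder for the AI-generated logic, using an in-memory store."""
    yesterday = store.get("temperature")
    store["temperature"] = todays_temp
    return yesterday

def test_reports_yesterdays_temperature():
    store = {}
    # Day 1: the API reports 18 degrees; nothing is stored yet, so no answer.
    assert report_yesterdays_temperature(18.0, store) is None
    # Day 2: the API reports 25 degrees; the script should report day 1's value.
    assert report_yesterdays_temperature(25.0, store) == 18.0

test_reports_yesterdays_temperature()
```

The buggy version from earlier would fail this test on the second simulated day, which is precisely the kind of failure a non-technical user could never catch by rereading the prompt.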