What I Tell Every CTO Before They Touch Claude Code or the Anthropic API | AI News & Strategy Daily | Nate B Jones Transcript

Polished transcript · AI News & Strategy Daily | Nate B Jones · 16 Dec 2025 · 20m · @maverick

Nate B Jones argues that defining "correctness" is the foundational problem in AI system design

A solo presentation by Nate B Jones on why the inability to define quality is the root cause of most AI failures.

Summary

Nate B Jones delivers a solo presentation arguing that the single most important — and most neglected — step in building AI systems is defining what "correct" or "good" looks like before any architecture decisions are made. He contends that most AI project failures are not model failures but human failures: organizations remain deliberately vague about quality criteria, shift their definitions mid-project, and then blame the AI for being unreliable. He draws on examples ranging from enterprise agentic systems and Microsoft Copilot's adoption struggles to reinforcement learning reward hacking and hallucination behavior, arguing that all of these problems trace back to the same root cause. He closes by extending the argument beyond enterprise AI to individual prompting, asserting that clearly defining expected output is the single most impactful habit any AI user can develop.

Key Takeaways

  • Correctness is upstream of every architecture decision. Questions about whether to use RAG, how many agents to deploy, or which model to choose are all secondary. Until you can define what a correct output looks like, those decisions are built on a shifting, unnamed target.
  • Humans routinely change their definition of "correct" mid-project without acknowledging it, then blame the AI system for inconsistency. This is one of the most common and least-discussed failure modes in enterprise AI builds, and it causes cascading architecture changes that exhaust engineering teams.
  • Hallucinations are often a correctness definition problem, not a model problem. When a system prompt demands a confident answer at all times and never permits "I don't know," the model is being instructed to guess. The hallucination is the system reflecting back the ambiguity humans gave it.
  • Reward hacking in reinforcement learning illustrates how proxy metrics corrupt outcomes. Models optimize for whatever they are rewarded for — including single-turn response quality over multi-turn conversational depth — which has downstream consequences for how humans relate to and depend on AI systems.
  • Microsoft Copilot's adoption failure is a case study in vague correctness. Copilot is being layered on top of dirty SharePoint data with no training on quality definition. Users try it once, get bad results, and abandon it — not because the model is bad, but because no one defined what good looked like before deployment.
  • Vagueness is a human social strategy that AI systems expose and punish. Humans use imprecision to maintain social cohesion and avoid conflict. AI systems force organizations to confront the trade-offs they have been hiding — on boldness versus precision, confidence versus refusal, speed versus accuracy.
  • Good evaluations are not busywork — they are the mechanism for defining correctness. Building evals that test against explicit quality criteria, including calibrated uncertainty, provenance, explicit failure modes, and refusal behavior, is how correctness gets operationalized at scale.
  • Prompting is also a correctness definition exercise. Even in personal use, telling a model what a good output looks like — every time — is the single most impactful prompting habit. He notes that observers watching him prompt consistently identify this as the key differentiator in his approach.
Full Transcript

    Why defining "correct" matters more than any architecture decision

    Nate B Jones: Most of us can't define what good quality work looks like for our AI systems, and it's really hurting us. I don't just mean for corporate AI systems. I'm going to talk a lot in this video about how you define agentic systems and how you build large-scale systems at businesses that measure good quality work. But this goes beyond that. We are talking about good quality work that you can define in a prompt. In other words, the ability to define what good looks like turns out to be one of the most powerful insights in AI. And it's something that cuts at the heart of the vagueness that we like to operate with in business and in our personal lives.

    Humans, I have to say, usually optimize for go along, get along. We optimize for social cohesion and we don't optimize for correctness. And that has worked for us for about half a million years. It does not work anymore when you work with AI systems. And so this is something that you may hear and say, "What's the implication for me? I don't define AI systems." This is for all of us. We all need to think harder about what good looks like if we want to be good prompters. So with that, let's dive in.

    Correctness is upstream of everything. Most AI projects don't fail because the model is dumb. They fail because nobody can answer a brutally simple question: what would correct even mean here? If you can't define correctness, then you can't measure it. If you can't measure it, you can't improve it. Everything downstream — the decisions we make about retrieval augmented generation systems, or agents, or orchestration, or context engineering, or model choice — those all become elaborate ways that we build on top of an unnamed, shifting target if we can't define correctness.

    And the part that's awkward to admit is that we don't just lack a definition of correctness. As humans, we often change our definition mid-stream. We may quietly, socially, without writing it down, change what we mean by good — and then blame the system for being unreliable. I've seen this happen a lot. If you want a good example: how many times have you seen priorities for a product team change mid-stream during the quarter, after quarterly goals and OKRs were set? I've seen it a lot. I've worked in product for two decades. I would say it happens more often than not, because reality continues to push us to change our definitions and change our priorities.

    What I'm suggesting is not that we're going to magically get to a world where we can just freeze correctness and it won't ever change. That would be unrealistic. What I'm suggesting is that we need to be honest about the importance of correctness and answering what good looks like when we build AI systems. And we need to build our systems in such a way that correctness and quality are at the heart of how we think about architecture. And we need to be able to change those answers in predictable ways that flow through the system, so that when our own definitions of good and quality change, we can update our responses and the process the AI goes through to get answers.

    Why AI correctness is not binary like traditional software

    Nate B Jones: In normal software, we pretend correct is obvious because the program either passes tests or it doesn't. It's kind of binary. You can have functional requirements when you launch software, and it either passes those tests and you launch it, or it fails and you go back and do QA again. In AI, because this is a probabilistic system, correct is rarely binary. It's a bundle of competing requirements that we often don't honestly debate upfront when we should. Requirements around truthfulness, requirements around completeness, requirements around tone, requirements around policy compliance, requirements around speed or cost or refusal behavior. And if you're in the enterprise, you have requirements around auditability.

    So when people ask me about an architecture for an agentic system — they might ask, "Where do we put our context layer?" or "Do we need three agents or two agents for this situation?" or "Do we need an agent at all? Can we just put this in a chatbot?" — I always ask them to rewind the tape and start at the beginning. Those are second-order decisions. The first-order decision is: what is the output here, and how do we know what good looks like? What is correct? Can we name it? Can we define it? What are the kinds of uncertainty that we would allow in a definition of correctness? What is the kind of uncertainty or inaccuracy that we wouldn't allow — that would be a fatal error?

    OpenAI's own guidance on evaluations basically says this out loud. You need evaluations that test outputs against the quality criteria that you specify, especially as you change your models or prompts. Reliability starts from understanding what to measure.
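
    As an illustration that is not from the talk, here is a minimal Python sketch of what testing outputs against the quality criteria you specify can look like. The `Criterion` type, the `evaluate` function, and the three example criteria are hypothetical; real criteria would come from the people who own the definition of correct.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    """One named, explicit quality check applied to a model output."""
    name: str
    check: Callable[[str], bool]


def evaluate(output: str, criteria: list[Criterion]) -> dict[str, bool]:
    """Return a pass/fail result per criterion for a single output."""
    return {c.name: c.check(output) for c in criteria}


# Hypothetical criteria for a short finance summary.
CRITERIA = [
    Criterion("cites_source", lambda text: "source:" in text.lower()),
    Criterion("under_50_words", lambda text: len(text.split()) <= 50),
    Criterion("no_vague_quantifier", lambda text: "a lot" not in text.lower()),
]

good = "Q3 revenue was 4.2M USD, up 8% from Q2. Source: finance ledger."
vague = "Revenue grew a lot this quarter."

print(evaluate(good, CRITERIA))
print(evaluate(vague, CRITERIA))
```

    Because each criterion is named, a failing run says which part of the definition was missed, and re-running the same list after a model or prompt change shows whether quality moved.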

    This especially shows up when you're doing complex agentic systems that combine structured data and unstructured data. Unstructured data often can sound really good when you retrieve it, but it can also be incorrect. Structured data can be correct, but can be unusable when you're combining it with unstructured data. So when you combine these items for a board deck or a compliance workflow, your definition of correctness has to remain useful over both unstructured and structured data. "Pretty close" is not going to be good enough if you're taking these systems seriously. A single digit off is a problem in a board deck, because the value of the system is in trust.
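
    To make the "single digit off" point concrete, here is a hedged Python sketch, not something described in the talk. It checks that every number in a generated narrative exactly matches a value in a structured source. The function names, the `LEDGER` fields, and the regex heuristic are assumptions for illustration; a production check would tie each number to a specific field rather than to any ledger value.

```python
import re
from decimal import Decimal

# Matches numbers such as 4,200,000 or 2.1, but skips digits glued to letters (e.g. "Q3").
NUMBER_PATTERN = re.compile(r"(?<![\w.])\d[\d,]*(?:\.\d+)?")


def extract_numbers(text: str) -> list[Decimal]:
    """Pull numeric literals out of prose, ignoring thousands separators."""
    return [Decimal(m.replace(",", "")) for m in NUMBER_PATTERN.findall(text)]


def unsupported_numbers(narrative: str, ledger: dict[str, Decimal]) -> list[Decimal]:
    """Return every number in the narrative that is not exactly present in the ledger."""
    allowed = set(ledger.values())
    return [n for n in extract_numbers(narrative) if n not in allowed]


# Hypothetical structured source of truth.
LEDGER = {"q3_revenue": Decimal("4200000"), "q3_churn_pct": Decimal("2.1")}

ok_text = "Q3 revenue was 4,200,000 with churn of 2.1 percent."
bad_text = "Q3 revenue was 4,200,001 with churn of 2.1 percent."

print(unsupported_numbers(ok_text, LEDGER))
print(unsupported_numbers(bad_text, LEDGER))
```

    The comparison uses `Decimal` rather than floats so that "pretty close" never passes by accident.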

    And this is becoming more and more relevant because our agentic systems are getting closer and closer to systems of record. We're now talking openly about how our systems of record need to be updated and changed so that agents can modify them directly. If that is the case, correct architecture is dependent on your ability at scale to define what good quality responses look like in a way that you can measure.

    The hidden failure mode: discovering correctness while you build

    Nate B Jones: I think there's an important hidden failure mode. I talked about this idea that we as humans tend to move the goalposts in the middle of the quarter. This happens all the time. It happens between stakeholders. In week one we may say, "Correct means the answer just has to sound plausible and save time." But by week three we may be saying, "Actually, correct means it matches the finance numbers." We end up conducting correctness discovery as humans while we build these systems, and those are not small changes. If you say it has to match the finance numbers, that's a change in the definition of the system.

    So what I find — the reason why I insist that we start with a quality conversation and a correctness conversation — is that it saves us so much of this back and forth. If you end up discovering correctness over the course of the agentic build, you're going to end up discovering lots and lots of architecture changes, and your engineers, your AI architects aren't going to know what you really want. They're just going to go back and forth because you keep saying, "Well, correctness means it should answer confidently and quickly with no caveats," versus "Correctness means it can answer very slowly, it must match the finance numbers, it must include narrative context every time it answers." Well, which is it? Do you need it to answer quickly? Do you need it to answer confidently in a bold tone? Or do you need it to answer with absolute precision on finance numbers?

    And that's not as easy a choice as you might think. The world of agentic architecture is full of choices like that, and they are actually very difficult. I'll give you another example. Suppose the system conducts an agentic search and determines, based on a pattern of contact, that a particular prospect is likely not going to close a sale. Is it more correct for the agent to proactively update the pipeline probability estimate on that contact record? Or is it more correct to rely on what the human, the salesperson who owns that prospect, thinks about that prospect?

    That's a real question. You could say, "Our prospects are on the phone with our agent in ways that are not well captured by our existing system of record, and so we trust the humans more." Or you could say, "Actually, our humans don't have a good track record of forecasting here. We need to trust our agentic systems more," and then you have a downstream human conversation about change management. These are really fraught issues, and you multiply that by ten times or more when you want to build an overall system.

    Hallucinations as a correctness definition problem

    Nate B Jones: Once you understand this, a lot of AI weirdness becomes predictable. Hallucinations, for example. If the scoreboard rewards the system for guessing — because you never defined correctness — systems learn to guess. OpenAI has published a paper arguing that common evaluation setups often reward confident answers over honest uncertainty, and that this pressure will keep hallucinations alive unless you change what correctness looks like.

    Are you willing to reward a model for telling you, "I don't know — this is what I know, and this is what I need to ask you"? Is that an acceptable answer? Or do you insist that acceptability only means a confident statement of fact? This isn't really a model problem. This is an us problem. This is a correctness definition problem. The system is optimizing for what we as humans are actually rewarding. And we end up blaming the model for hallucinations when it's just reflecting back to us the uncertainty that we are giving the system in terms of the goals it should have.
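
    A small worked example, added here as an illustration and not taken from the talk or the paper, shows why the scoring rule matters. Assume a right answer earns +1, a wrong answer costs `wrong_penalty`, and "I don't know" earns `abstain_score`. All three values are assumptions chosen for the sketch.

```python
def expected_score_if_guessing(p_correct: float, wrong_penalty: float) -> float:
    """Expected score for answering: +1 when right, -wrong_penalty when wrong."""
    return p_correct - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float, abstain_score: float = 0.0) -> bool:
    """Answer only when guessing beats the score for saying "I don't know"."""
    return expected_score_if_guessing(p_correct, wrong_penalty) > abstain_score


# A model that is only 30% sure of its answer.
p = 0.3
print(should_answer(p, wrong_penalty=0.0))  # accuracy-only grading: guessing wins
print(should_answer(p, wrong_penalty=1.0))  # wrong answers are penalized: abstaining wins
```

    Under accuracy-only grading, any nonzero chance of being right makes guessing the better strategy. With a penalty of 1 and an abstain score of 0, answering only pays off above 50% confidence, so the scoreboard itself decides whether "I don't know" is ever the rational response.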

    Goodhart's Law, reward hacking, and the multi-turn conversation problem

    Nate B Jones: Now, once you admit that correctness is upstream of everything, you immediately hit the next landmine: measurement distorts behavior. This goes back to Goodhart's Law in software. Goodhart's Law gets quoted because it's annoyingly true. When a measure becomes a target, it stops being a good measure. In AI, that becomes: if you pick a proxy metric for correctness, the system will learn to win the proxy, even if that proxy is different from the actual value you're looking to measure.

    This gets a little nerdy, but if you get into reinforcement learning and how aligned systems work, this can show up as reward hacking — the model will satisfy the literal objective while missing the intent.
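
    A toy illustration of winning the proxy, added as a sketch and not drawn from the talk: suppose the proxy metric counts how often an answer mentions a keyword, while the real goal is a correct answer. The candidate answers and the `proxy_score` function are made up for the example.

```python
# Two hypothetical candidate answers; only the first is actually correct.
candidates = [
    {"text": "Q3 revenue was 4.2M USD.", "is_correct": True},
    {"text": "Revenue, revenue, revenue: strong revenue growth.", "is_correct": False},
]


def proxy_score(text: str, keyword: str = "revenue") -> int:
    """Proxy metric: how many times the answer mentions the keyword."""
    return text.lower().count(keyword)


# Selecting by the proxy rewards keyword stuffing over correctness.
best_by_proxy = max(candidates, key=lambda c: proxy_score(c["text"]))

print(best_by_proxy["text"])
print(best_by_proxy["is_correct"])
```

    The selected answer satisfies the literal objective and misses the intent, which is the pattern that reward hacking describes at training scale.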

    Let me give you an example that's very tangible. Gemini 3 is not nearly as good at multi-turn conversations as you might want it to be. It is extremely optimized for the single turn where you give it a good prompt and then you get a response. That is a fingerprint behavior of Gemini 3 that is also somewhat characteristic of other models. Almost every model I know does better at the first response than it does at the nth response. What has happened is that in reinforcement learning, we have very few examples of multi-turn conversations where the model gets rewarded, because the priority is to go through a wide range of scenarios and provide the model with rewards. And in those situations, the people who are having conversations are having single-turn conversations. And so what the model learns over time is how to handle single-turn conversations. The model doesn't get much experience with multi-turn conversational dynamics.

    I personally think that this is one of the reasons why the long-running conversations that characterize emotional relationships between humans and AI are underexplored by model makers. This is a situation where the models themselves were never built for multi-turn conversations, and one of the emergent effects of the multi-turn conversation turns out to be that humans form emotional attachment to models in some cases. And now here we are in a world where someone is getting married to an AI. This is all downstream of how you define rewards and correctness. It has a lot of implications.

    And so when we define our systems, we need to define what we mean by correctness very precisely. We need to define what our true objective is very carefully. Now I'm not advocating that we all get into reinforcement learning and start training our models. It's just that reward hacking provides an example of how optimizing a proxy can lead you astray when you're trying to measure the real thing.

    Another example: the idea of answering correctly and confidently every time. So often when we tell a model in our system prompt that it must give an answer, we're inadvertently reminding the model that it cannot give no answer, and that if it is uncertain it must answer anyway. That is the kind of system prompt that, if not carefully managed, leads to hallucinations — because the model has been told it needs to answer.

    Building a culture of correctness that resists gaming

    Nate B Jones: The game here is not to pick a metric. The game is to build a culture of correctness that resists gaming. I would encourage you to think about multiple criteria that define correctness. I would encourage you to think about explicit failure modes that you can give your model so it understands what to do when it's failing. I would encourage you to think about calibrated uncertainty and when to tell your model it can just not answer. I would encourage you to think about provenance — how you can help the model tell you which part of the claim came from where. And I would encourage you to think about laddering that up into testing, both at the unit level for individual agents and at the overall orchestration layer system level.

    This is why good evals are not busywork. Good evals help you think through what correctness looks like.
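
    As a hedged sketch of what a unit-level check for one agent might look like (it is not from the talk), the code below treats provenance and refusal as explicit parts of the answer format. The `Claim` and `AgentAnswer` types and the `check_answer` rules are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Claim:
    text: str
    source: Optional[str] = None  # where the claim came from; None means unsupported


@dataclass
class AgentAnswer:
    claims: list[Claim] = field(default_factory=list)
    refused: bool = False
    refusal_reason: str = ""


def check_answer(answer: AgentAnswer) -> list[str]:
    """Return a list of violations; an empty list means the answer passes."""
    problems: list[str] = []
    if answer.refused:
        # Refusing is an allowed outcome, but it must say why.
        if not answer.refusal_reason:
            problems.append("refusal without a stated reason")
        return problems
    if not answer.claims:
        problems.append("no claims and no refusal")
    for claim in answer.claims:
        if not claim.source:
            problems.append(f"claim lacks provenance: {claim.text!r}")
    return problems


sourced = AgentAnswer(claims=[Claim("Inventory is 120 units", source="warehouse_db")])
unsourced = AgentAnswer(claims=[Claim("Inventory is 120 units")])
declined = AgentAnswer(refused=True, refusal_reason="no inventory record found")

print(check_answer(sourced))
print(check_answer(unsourced))
print(check_answer(declined))
```

    The same function can run against a single agent's output or against the final output of an orchestration layer, which is one way to ladder the definition up from unit tests to system tests.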

    And I want to give you a note here. Humans like to stay vague about correctness. Part of why I'm having to have this conversation is that it's a people problem. Humans use vagueness effectively as a way to keep social conversations going. Vagueness keeps our options open. Vagueness avoids conflict. Vagueness lets stakeholders agree in the meeting and disagree in production. We called these weasel words at Amazon: words like "actually" or "a lot," or anything that wasn't a number and a specific claim. People reach for them because they want to go along and get along.

    AI systems expose that kind of thinking and that kind of business culture. They force the organization to confront a lot of the trade-offs that we've often been hiding behind social conformity. Do you really want boldness in your answers? Do you really want precision? Do you really want perfect coverage? Do you want an audit trail? Do you want to refuse when unsure? Are you actually sure about that? If the CEO says, "I want an answer," what are you going to say if you told the model it could refuse when it's unsure?

    When you don't decide and you leave those questions conveniently vague, for most of human history that was fine, because we were the ones who had to live with that and we decided we could. Now you can't do that. The system will decide for you. The LLM will decide for you. And the outcome looks like a lack of quality, a lack of correctness, AI unreliability, the board saying, "Where is our AI product? Why is it bad?" This is usually human indecision reflected back at you.

    Microsoft Copilot as a case study in vague correctness

    Nate B Jones: I keep thinking about this because of the widespread reports that Microsoft has been unable to get their AI Copilot adopted in organizations. It's not that they haven't sold it as a bundle — it's been aggressively sold as a bundle. But Microsoft themselves are realizing what I have heard on the ground from teams for the last year, which is that Microsoft can sell Copilot all they want, but mostly people don't use it very much when it's sold that way.

    This comes back to the idea that most AI system problems we have end up being reflections of people problems in our cultures. In this case, Copilot is layered on top of dirty data in a SharePoint, and no one is given training on how to ensure quality and correctness in Copilot. All of our vague, go-along-get-along assumptions about quality end up being operative with these AI systems. And so we ask Copilot for an answer and we've never answered what good looks like. And Copilot does its best with the dirty data it has. No wonder it's not adopted. No wonder the salesperson will try once or twice to get pipeline data out, roll their eyes at the incorrect data, and never bother to think that maybe there's some issue with the Salesforce system of record and what the AI agent can get.

    These kinds of details don't get sold when you sell an LLM. They get confronted by the organization months later. And this is the problem right now with AI: we are selling the system and we are taking on human debt. We're taking on debt in AI fluency. And we are taking on debt in how we define correctness and quality.

    How to operationalize correctness: claims, evidence, and penalties

    Nate B Jones: Here's how I want to bring this home and make it real for you. Think of correctness and quality not as something that you can bat around as a human and be vague about, nor as a single measure for your AI system. Instead, think of it as a set of claims that your system is allowed to make. Think of it as the evidence required for each claim, and the penalties for being wrong versus staying silent. And that last clause matters.

    As we've talked about, in many cases, if you can't define what correctness looks like in those terms, you haven't broken down the problem enough. You haven't broken it down to a level where you can define the system. You've just left it at a human state where it's very vague. So my first challenge to you is: if you think about it and say, "I couldn't tell you what the set of claims are in the first place" — that's on you as the human to define the system in a more granular way so the AI can come along and be helpful. If you are trying to measure correctness before you can measure the claims of the system, you're just making it up.

    If you can measure the claims of the system and say, "These are the claims the system is allowed to make — it can declare inventory, it can declare how many customer calls were received" — that's great. But now you need to get into what that looks like and how you measure it, what evidence is required, where it gets it, and so on.
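
    One way to write down claims, evidence, and penalties is sketched below in Python. This is an illustration under stated assumptions, not a method given in the talk. The two claim names echo the examples above; the evidence labels, the penalty values, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimSpec:
    name: str
    required_evidence: frozenset[str]
    penalty_if_wrong: float   # cost of asserting something false
    penalty_if_silent: float  # cost of declining to answer


# Hypothetical registry of the claims the system is allowed to make.
SPECS = {
    "inventory_count": ClaimSpec(
        "inventory_count", frozenset({"warehouse_db_row"}), 10.0, 1.0
    ),
    "customer_calls_received": ClaimSpec(
        "customer_calls_received", frozenset({"call_log_export"}), 3.0, 1.0
    ),
}


def may_assert(claim_name: str, evidence: set[str]) -> bool:
    """A claim is allowed only if it is registered and all required evidence is present."""
    spec = SPECS.get(claim_name)
    if spec is None:
        return False
    return spec.required_evidence <= evidence


def min_confidence_to_speak(spec: ClaimSpec) -> float:
    """Confidence above which asserting costs less, in expectation, than staying silent.

    Asserting costs (1 - p) * penalty_if_wrong; silence costs penalty_if_silent.
    """
    return max(0.0, 1.0 - spec.penalty_if_silent / spec.penalty_if_wrong)


print(may_assert("inventory_count", {"warehouse_db_row"}))
print(may_assert("revenue_forecast", {"warehouse_db_row"}))
print(min_confidence_to_speak(SPECS["inventory_count"]))
```

    Writing the penalties as numbers forces the decision the talk describes: with these made-up values, an inventory claim needs roughly 90% confidence before speaking beats silence, while a call-count claim needs about 67%.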

    If this sounds like a lot of work, guess what? This is part of why I think that humans have lots of jobs in the age of AI. It is not easy to design these systems. Yes, there's going to be lots of disruption ahead for all of us, but designing these systems and doing so effectively takes a tremendous amount of mental discipline. It takes, frankly, the discipline of a senior engineer who's used to having to define deterministic workflows from vague business requirements. We're all in a similar space now.

    Prompting as a personal correctness definition exercise

    Nate B Jones: And if you think, "I don't design agentic systems, so I don't need to hear this," you're wrong. The reason you're wrong is because prompting is kicking off a workflow. Prompting is telling a model what good looks like. Prompting imposes a quality bar on a model. And so you either are going to say, "This is what good looks like," in a way that's useful — or not.

    I have had people look over my shoulder when I prompt. And one of the things they tell me they notice that I do differently than other people is that I always give the model a very clear sense of what an expected output should be — what good looks like — every single time. Even on a very short prompt, I'll make sure I have that. Because otherwise, how are you going to know? How are you going to know what the model did and whether it looks good?
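
    The habit can be sketched as a tiny prompt builder that refuses to produce a prompt without a description of the expected output. This is an illustrative assumption about how one might enforce the habit, not the speaker's own tooling; `build_prompt` and its wording are hypothetical, and the code only assembles text without calling any model.

```python
def build_prompt(task: str, good_looks_like: list[str]) -> str:
    """Assemble a prompt that always states what a good output looks like."""
    if not good_looks_like:
        raise ValueError("state at least one criterion for a good output")
    criteria = "\n".join(f"- {item}" for item in good_looks_like)
    return f"Task: {task}\n\nA good answer:\n{criteria}"


prompt = build_prompt(
    "Summarize the attached meeting notes.",
    [
        "Five bullet points or fewer",
        "Each bullet names an owner and a due date",
        "Says 'not stated' where the notes are silent",
    ],
)
print(prompt)
```

    The last criterion also gives the model an explicit alternative to guessing, which ties the prompting habit back to the earlier point about permitting "I don't know."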

    And so my closing thought for you is that this is a fractal insight. Yes, I spent most of my time talking about systems and agentic design, because a lot of the conversation we have — either as individuals or as designers — ends up in a corporate context where we have to define these systems, users will use them, they need responses, and what does quality look like? But it's true in our personal lives too. It's true in our personal instances of ChatGPT. Do we know what good looks like? Do we know what quality looks like? That is a prompting hint. You can get better at prompting just by answering that question.

    And so my question to you is: when you're giving your model a prompt, or when you're designing a system, do you know what good really looks like?


    Polished transcript of AI News & Strategy Daily | Nate B Jones. All views are those of the original speakers.
    Published by @maverick