Nate B Jones explains DSPy, a framework for building self-optimizing AI prompts, for beginners through to enterprise teams
Solo presentation by Nate B Jones of AI News & Strategy Daily on using DSPy to automate prompt optimization.
Summary
Nate B Jones introduces DSPy, a Python library that treats prompts as programmable code rather than static text, allowing AI to optimize its own prompts systematically rather than relying on individual human expertise. He structures the presentation across three levels: a no-code beginner approach using a single ChatGPT prompt, a technical explanation for engineers and builders, and a scaling framework for team leaders managing production pipelines. The core argument is that traditional prompt engineering is brittle, hard to measure, and difficult to scale — and that DSPy solves this by defining input/output pairs, scoring rubrics, and automated optimization loops. Nate provides a concrete, pasteable beginner prompt that replicates DSPy's logic inside ChatGPT without requiring any terminal or Python knowledge.
FULL TRANSCRIPT
Introduction: The Problem DSPy Solves
Nate B Jones: One of the most common concerns I get from people is that they do not know how to optimize their prompts. They want to, but they don't feel they have the expertise. I've written a lot about how to develop that expertise, but I also recognize it's not for everyone. This method that I'm about to show you is actually a way to make AI optimize your prompts for you, and it's based on a very famous Python framework that engineers are currently using for production prompting.
If you've ever wondered how people get their prompts to look so polished, well, this is part of how. What I'm going to do is walk through and explain the concepts in this video, and then I'm going to have a whole post that lays out how you actually get started, with specific prompts and examples.
I'm going to divide that post into three parts. Part one is for beginners — people who have never done this. You should be able to apply these lessons as someone who doesn't want to touch Python code, doesn't want to touch the terminal, doesn't want to see code at all, and you should still be able to get benefits. That is not something people have generally offered. People typically say, "If you want to optimize your prompts like this, well, best of luck — off you go and use the terminal." I don't think that's acceptable. Instead, I want to give you a five-minute quick start that lets you take the same principles engineers are using for production code and apply them yourself in the chat, so that you can get some of those benefits too.
But we're not done yet. If you're an engineer or a builder, if you're not scared of the terminal, I want to give you a reasonably technical explanation of how DSPy works, the principles behind it, and then also in the article a get-started handbook so you can get there.
And we're still not done, because in part three I want to talk about how you scale this across teams. It's a different kind of challenge. If you're a solo builder, you don't need that part. But if you're managing a team and you have production prompting pipelines, understanding how the system scales is actually really important, and I want to get into that and cover some of the key principles.
So stay with me. We're going to do a little bit of visuals on this one — I've seen requests for more visuals in these videos, and we're going to get to that here. We'll walk through for beginners, for builders, and for teams, and then there's going to be lots more good stuff in the post for those who want to go further. Let's get to it.
What Is DSPy?
All right, here we are. I love my graphics. Fair credit — this is Gamma helping me organize my thinking. A nice little AI tool, using AI to optimize AI for prompting. And the framework really does scale from beginner to enterprise.
So with that in mind, what are we talking about? What is this framework? This is called DSPy. It's a library for the Python language that enables you to work with large language models by treating prompts as programmable code rather than static text. It's not really a fork — it's a library. The framework enables systematic prompt engineering so that you can actually scale LLM applications in ways that go beyond just writing "use chain of thought" or some other adjective to make things better. It enables you to be structured and systematic with your prompting so you're much less dependent on individual expertise, which has tons of benefits as we'll see.
But don't worry — we're going to start with beginners first.
Part One: The Beginner Approach
The first thing to do, if you're not sure what I'm talking about, is just to get these concepts under your belt. Then on the next slide we're going to have an actual full beginner prompt that you can paste right into ChatGPT.
DSPy essentially provides a bridge. What you're doing is saying: here's where I want to go. Part one, you're defining your task. Part two, you're saying here are some examples of what the finished product looks like. One example of this is: I want you to write a customer service email — here are some good examples of customer service emails. Part three, you want the prompt optimizer, the DSPy library, to automatically refine its prompt structure to optimize toward those outputs.
So basically, you want to say: here's my goal, here's what good looks like, and here is an input for that good output. You'll notice it says input/output pairs, and that's definitely key. You're basically telling the DSPy program: here's an input and an output that looks like this, and I'm only going to give you the input next time. It's pattern matching. If A maps to B and C maps to D, then you want it to work out that E maps to F. If I give you notes on a customer call and I give you what a good email looks like three or four times, you should be able to take notes on a customer call and produce a good email. That's the core idea.
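The pattern-matching idea above can be sketched in plain Python. This is a minimal illustration, not DSPy itself: the `Example` class, the `build_few_shot_prompt` function, and the sample call notes are all hypothetical names and data made up for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One input/output pair: what goes in, and what a good result looks like."""
    call_notes: str
    email: str


def build_few_shot_prompt(task: str, examples: list[Example], new_input: str) -> str:
    """Lay out the known pairs, then end with a new input whose output is left blank."""
    parts = [f"Task: {task}", ""]
    for i, ex in enumerate(examples, start=1):
        parts += [f"Example {i}", f"Input: {ex.call_notes}", f"Output: {ex.email}", ""]
    parts += ["Now do the same for this input.", f"Input: {new_input}", "Output:"]
    return "\n".join(parts)


# Hypothetical sample data, invented for illustration only.
examples = [
    Example("Customer reported a late delivery; refund issued.",
            "Hi, we're sorry your order arrived late. Your refund is on its way."),
    Example("Customer could not log in; password reset sent.",
            "Hi, we've sent you a password reset link. Let us know if it works."),
    Example("Customer asked about invoice date; explained billing cycle.",
            "Hi, your invoice is issued on the first of each month."),
]

prompt = build_few_shot_prompt(
    "Write a customer service email from call notes.",
    examples,
    "Customer wants to upgrade their plan.",
)
print(prompt)
```

The prompt ends with an open `Output:` line, so the model only has to continue the pattern the three pairs establish.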
And yes, you don't have to run DSPy to get that kind of result. I'm going to show you how, if you don't want to touch the terminal.
What DSPy does is it optimizes and iterates. Once it is able to reliably produce a good email, you can actually integrate it into your production pipeline for AI so that you know you have an optimal prompt — and it wasn't just based on best effort. That in turn increases the overall quality of all your prompting, because you're actually allowing AI to optimize for AI. You're allowing AI to bridge the gap between your input and the output you want, and construct the prompt that links them. That's really the key idea I want to get across.
The Beginner Prompt You Can Use Right Now
Let's get into what beginners can learn. This is a real prompt — you can grab this prompt. It's not technically DSPy because it's not the Python programming language, but it is a prompt that works like DSPy and works in an LLM like ChatGPT.
It's very simple. It says: I need to create a self-optimizing prompt system. This is my task — write an email, summarize meeting notes, whatever it is. These are my examples: here are at least three pairs, an input and an output. Input. Output. Input. Output. Make the outputs really good and make the inputs really consistent. If you're going to give it inputs that are all wildly different, you're not helping it. If you're not going to grade your outputs consistently, you're not helping it.
Now, please create a scoring system with specific criteria. You have functionality, format, completeness — you can adjust what those criteria are. This is an example. If you don't value format as much, you can drop it and put something else in. But you want to, as clearly as you can, specify how the system should score success when it is practicing.
You are then going to tell the system — and ChatGPT will just do this in one shot — please write multiple prompts that could handle my task. In this case I say three; you could do more. Please test every single prompt on the examples I gave you and score the results. It's basically going to test each of the three inputs, see how closely it can mimic the output you gave it, and give itself a score based on the rubric you gave it.
Step four: please take the best one and improve it by fixing whatever element scored the lowest from your rubric of functionality, format, or completeness — or whatever you chose. And step five: give me the final improved prompt with a scoring system.
That is all one prompt in ChatGPT. And that is as close as you can get as a beginner to what it's like to work with DSPy. You don't have to do the terminal. You can literally do this anytime. And that is the whole concept that we are working with for more complex production pipelines.
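For anyone who would rather assemble that one-shot prompt programmatically, here is a minimal sketch. The wording of each step is a paraphrase of the description above, not the exact prompt from the post, and `build_self_optimizing_prompt` and the sample pairs are hypothetical names and data chosen for illustration.

```python
def build_self_optimizing_prompt(task, pairs, criteria, n_candidates=3):
    """Assemble the single chat prompt: task, example pairs, rubric, and the five steps."""
    if len(pairs) < 3:
        raise ValueError("Provide at least three input/output pairs.")
    lines = [
        "I need to create a self-optimizing prompt system.",
        "",
        f"My task: {task}",
        "",
        "My examples:",
    ]
    for i, (example_input, example_output) in enumerate(pairs, start=1):
        lines += [f"Input {i}: {example_input}", f"Output {i}: {example_output}"]
    lines += [
        "",
        f"Step 1: Create a scoring system with specific criteria: {', '.join(criteria)}.",
        f"Step 2: Write {n_candidates} different prompts that could handle my task.",
        "Step 3: Test every prompt on the examples above and score the results.",
        "Step 4: Take the best prompt and improve it by fixing whichever criterion scored lowest.",
        "Step 5: Give me the final improved prompt along with the scoring system.",
    ]
    return "\n".join(lines)


# Hypothetical sample pairs, invented for illustration only.
pairs = [
    ("Notes: shipping delay, customer upset", "Hi, we apologize for the delay..."),
    ("Notes: billing question, resolved", "Hi, thanks for reaching out about your bill..."),
    ("Notes: feature request logged", "Hi, thank you for the suggestion..."),
]
prompt_text = build_self_optimizing_prompt(
    "Write a follow-up email from customer call notes.",
    pairs,
    ["functionality", "format", "completeness"],
)
print(prompt_text)
```

The three-pair minimum mirrors the "at least three pairs" guidance; swapping the `criteria` list is how you would drop format and substitute something you care about more.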
Part Two: For Engineers and Builders
But let's say you are an engineer and you want to understand a little bit more about what is going on here. This is where we get to part two.
For engineers and builders, DSPy turns prompt engineering from an area of personal expertise into an area of programmable discipline. It basically reduces the ambiguity in the space and turns prompting into a more deterministic science, which in turn makes it much easier to provide clarity and control for systems engineering.
You can define LLM behavior with signatures. Signatures are really just inputs and outputs — you're treating prompts like structured code and delivering signatures that enable the Python library to reliably develop a prompt that maps inputs and outputs from what you're giving it.
It is easy to have modular architectures with DSPy because you can swap out different components. For example, you can easily swap out the language model that DSPy is calling upon to build these prompts — it's like one line in DSPy. And that in turn makes it easier to maintain, easier to upgrade, and so on.
You also have the ability to continue to optimize prompts for specific tasks because you can automatically refine as input/output pair systems grow. There's a lot of different elements here, and we're going to get into it more, but I want you to get an idea of what we're doing.
Fundamentally, if you have programmable prompts, if you have a modular architecture, and if you have some kind of automated optimization loop, you are going to be able to actually build precise LLM applications and not depend on the skills of your best prompter.
Why Traditional Prompt Engineering Falls Short
Traditional prompt engineering had defects. I think we all know there's not a systematic way to improve. It's difficult to measure progress objectively. It's really hard to scale. It's brittle. It is often model-specific, or it claims to be model-specific. I saw someone joking that prompt engineering is like throwing darts at a dartboard blindfolded — you're not sure if the darts land or not, but you're making big claims about it.
Traditional prompt engineering does work if you don't have better options, if you have a skilled prompter, and if that skilled prompter is able to evaluate their work honestly. That is sometimes true, and very skilled prompters will sometimes still write prompts that are better than DSPy will write. But DSPy scales consistently in a way no human can. That is why engineers have been preferring it — it is much easier to scale as a software system.
The Core Philosophy of DSPy
So let's get into the core philosophy. If you're treating your prompt as a program, as code — which I've been advocating for a while — you're going to insist on clean inputs and outputs. You're going to insist on modularity throughout the architecture. You're going to insist that you don't treat prompts as strings; prompts should be treated as code instead. And you should enable a metric-driven feedback loop.
When I talked about automatic optimization a couple of slides ago, the way you do that is by defining quantifiable metrics that DSPy can optimize against. So when I gave beginners a measurement system in the ChatGPT prompt just now, that is the beginning of a quantifiable metric. In production pipelines, you go a whole lot further — you dive much deeper into what you define as acceptable. And that helps DSPy write reliable prompts.
The Key Components: Signatures, Modules, Optimizers, Metrics
So what are the key components? I talked about signatures — I want to actually get into what they are so it's not confusing.
Signatures are input/output contracts that specify what your module should do but do not dictate the how. For example, if the context is question and answer, or email draft and feedback to improved email — those are pairs. You're specifying: this is good and this is good. The question is good and the answer is good. The email draft and feedback is good and the improved email is good. But you're not explaining how anything happened in between. You're asking DSPy to essentially write a prompt as an optimization function in between, to bridge that gap, so that in future you can provide email draft and feedback only, it will apply the bridge, and it will get to improved email.
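To make the "contract" idea concrete, here is a minimal stdlib-only sketch. It is not DSPy's API (DSPy has its own signature class, and as far as I know also accepts a shorthand string in the same `inputs -> outputs` shape); the `Signature` dataclass and its methods below are hypothetical stand-ins that only show what a signature does and does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    """Names the inputs and outputs of a task; says nothing about how to get there."""
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]

    def check_inputs(self, values: dict[str, str]) -> None:
        """Raise if a declared input field was not supplied."""
        missing = [name for name in self.inputs if name not in values]
        if missing:
            raise ValueError(f"Missing input fields: {missing}")

    def describe(self) -> str:
        """Render the contract as 'input, input -> output'."""
        return f"{', '.join(self.inputs)} -> {', '.join(self.outputs)}"


# The email example from the talk: draft plus feedback in, improved email out.
improve_email = Signature(inputs=("email_draft", "feedback"), outputs=("improved_email",))
print(improve_email.describe())
```

Note that nothing in the contract mentions prompt wording. That gap is exactly what the optimizer is asked to fill.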
Modules are another key component. These are composable building blocks that combine signatures with specific reasoning strategies like ReAct or Chain of Thought. You can actually chain modules together to create more complicated workflows in DSPy. That's important because you don't always need an explicit reasoning step; not every module requires chain of thought. It gives you flexibility. It's like Lego bricks.
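The Lego-brick idea can be sketched without any LLM at all. Everything below is an assumption made for illustration: `SimpleModule`, `ReasoningModule`, `Pipeline`, and `fake_lm` are hypothetical names, and `fake_lm` is a stand-in so the sketch runs offline. It is not how DSPy implements modules.

```python
from typing import Callable

# Any function that maps a prompt string to a completion string.
LM = Callable[[str], str]


def fake_lm(prompt: str) -> str:
    """Stand-in for a real model: echoes the first line of the prompt."""
    return f"[completion for: {prompt.splitlines()[0]}]"


class SimpleModule:
    """Simplest building block: one instruction, one call to the language model."""

    def __init__(self, instruction: str, lm: LM):
        self.instruction = instruction
        self.lm = lm

    def __call__(self, text: str) -> str:
        return self.lm(f"{self.instruction}\n\n{text}")


class ReasoningModule(SimpleModule):
    """Same interface, but asks the model to reason before answering."""

    def __call__(self, text: str) -> str:
        return self.lm(f"{self.instruction} Think step by step first.\n\n{text}")


class Pipeline:
    """Chains modules so each one's output becomes the next one's input."""

    def __init__(self, *modules: SimpleModule):
        self.modules = modules

    def __call__(self, text: str) -> str:
        for module in self.modules:
            text = module(text)
        return text


summarize = SimpleModule("Summarize these call notes.", fake_lm)
draft = ReasoningModule("Draft a customer email from this summary.", fake_lm)
pipeline = Pipeline(summarize, draft)
result = pipeline("Customer reported a late delivery.")
print(result)
```

Because each module only holds a reference to a callable, swapping the model is a one-argument change, which is the same modularity benefit described earlier for DSPy.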
Optimizers are automatic prompt optimization algorithms. An example would be BootstrapFewShot, which improves your modules based on training data and defined metrics without any manual intervention. Once you start it, it does the iterating for you.
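As a rough intuition for what "optimizing against a metric" means, here is a deliberately simple search loop. To be clear, this is not the BootstrapFewShot algorithm; it is a much simpler candidate-selection sketch in the same spirit. The function `pick_best_prompt`, the toy `toy_run` executor, and the toy data are all hypothetical.

```python
from typing import Callable, Sequence


def pick_best_prompt(
    candidates: Sequence[str],
    trainset: Sequence[tuple[str, str]],
    run: Callable[[str, str], str],
    metric: Callable[[str, str], float],
) -> tuple[str, float]:
    """Score every candidate prompt on every training pair; return the best and its average."""
    if not candidates or not trainset:
        raise ValueError("Need at least one candidate and one training pair.")
    best_prompt, best_score = candidates[0], float("-inf")
    for prompt in candidates:
        scores = [metric(run(prompt, x), expected) for x, expected in trainset]
        average = sum(scores) / len(scores)
        if average > best_score:
            best_prompt, best_score = prompt, average
    return best_prompt, best_score


def toy_run(prompt: str, text: str) -> str:
    """Toy stand-in for calling a model: uppercases only if the prompt asks for it."""
    return text.upper() if "uppercase" in prompt.lower() else text


def exact_match(predicted: str, expected: str) -> float:
    """Metric: 1.0 when the output matches the expected output exactly, else 0.0."""
    return 1.0 if predicted == expected else 0.0


candidates = ["Repeat the text.", "Repeat the text in uppercase."]
trainset = [("hi", "HI"), ("ok", "OK")]
best, score = pick_best_prompt(candidates, trainset, toy_run, exact_match)
print(best, score)
```

In a real setup `run` would call a language model and `metric` would be your rubric. The loop structure (generate candidates, score on examples, keep the winner) is the part that carries over.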
And last but not least, the metrics piece. You want to have eval functions that can measure accuracy, relevance, format compliance, and custom business metrics, because these help you decide what is good. These guide the optimization process and give you feedback that enables the optimizer to work.
DSPy in Action: The Workflow
So if we look at this in action, what you're doing is you're going to define your task, start with signatures, and then make sure that you have enough examples of input/output pairs that DSPy can learn from. In the ChatGPT-lite example we did for beginners, we had three. In real production, we're going to have many more: ten, thirty, forty, fifty. And DSPy is going to learn from these examples to generate effective prompts.
You're then going to specify how to measure quality (accuracy percentages, what format looks like), and you're going to do so in a much higher degree of detail than I gave in the beginner's prompt. It's going to be not three loose criteria for what good looks like, but quantified measures across six, seven, or eight dimensions of quality. Maybe it's a number of tokens, maybe it's a reading level, maybe it's format compliance. There are a lot of ways to do it and it's going to be dependent on the output you're looking for, but you need to define the output as specifically as you can.
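A multi-dimensional metric of the kind described here might look like the sketch below. The dimensions, thresholds, and weights are assumptions picked for illustration: word count stands in for token count, average sentence length is a crude proxy for reading level, and the greeting/sign-off check is a toy version of format compliance.

```python
import re


def length_score(text: str, max_words: int = 120) -> float:
    """1.0 if the text fits the word budget (a rough proxy for a token limit)."""
    return 1.0 if len(text.split()) <= max_words else 0.0


def readability_score(text: str, max_avg_sentence_words: float = 20.0) -> float:
    """Crude reading-level proxy: average words per sentence under a threshold."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    average = len(text.split()) / max(len(sentences), 1)
    return 1.0 if average <= max_avg_sentence_words else 0.0


def format_score(text: str) -> float:
    """Toy format check: half credit each for a greeting and a sign-off."""
    stripped = text.strip()
    has_greeting = stripped.lower().startswith(("hi", "hello", "dear"))
    tail = " ".join(stripped.splitlines()[-2:]).lower()
    has_signoff = any(word in tail for word in ("regards", "thanks", "sincerely"))
    return (has_greeting + has_signoff) / 2


# Hypothetical weights; they should sum to 1 so the total stays in [0, 1].
WEIGHTS = {"length": 0.3, "readability": 0.3, "format": 0.4}


def quality(text: str) -> float:
    """Weighted sum of the per-dimension scores."""
    scores = {
        "length": length_score(text),
        "readability": readability_score(text),
        "format": format_score(text),
    }
    return sum(WEIGHTS[name] * value for name, value in scores.items())


good_email = (
    "Hi Sam,\n"
    "Sorry your order arrived late. We have issued a full refund.\n"
    "Best regards,\n"
    "Support Team"
)
bad_email = "order late refund"
print(quality(good_email), quality(bad_email))
```

Each dimension returns a number, which is what lets an optimizer compare two candidate prompts instead of relying on someone's impression of which output "feels" better.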
Then you're going to choose an optimizer — like BootstrapFewShot for quick results, or MIPRO for complex reasoning tasks. You're going to pick the one that works for you. And then finally, you're going to deploy it and keep an eye on performance, and allow the DSPy module to adapt to new data as you feed it new training examples. It becomes its own self-improving prompt system.
Part Three: Scaling DSPy Across Teams
To scale DSPy across teams is a separate challenge. If you start with personal workflows, you can get significant improvements — you can automate email responses, content generation, data analysis. There's lots of good stuff you can do. Individual engineers are using this already, and teams are starting to as well and doing so successfully.
But it requires sharing optimized modules across teams through centralized registries, so you actually have scalable architectures and you're not all working off different optimizers. It requires quality gates and cost control, so you are determining the acceptable cost you will pay for quality at a given scale across a range of tasks. And it requires infrastructure for governance and for automated model selection.
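One way to picture a centralized registry with a quality gate and a cost cap is the sketch below. It is an assumption-heavy illustration, not a DSPy feature: `ModuleRecord`, `Registry`, `GateError`, and every threshold value are hypothetical.

```python
from dataclasses import dataclass, field


class GateError(ValueError):
    """Raised when a module fails a quality, cost, or versioning gate."""


@dataclass(frozen=True)
class ModuleRecord:
    name: str
    version: int
    prompt: str
    eval_score: float          # average metric score on a shared evaluation set
    cost_per_call_usd: float   # estimated cost of one call


@dataclass
class Registry:
    """Shared store of optimized modules; publishing must pass the team's gates."""
    min_score: float
    max_cost_per_call_usd: float
    modules: dict[str, ModuleRecord] = field(default_factory=dict)

    def publish(self, record: ModuleRecord) -> None:
        if record.eval_score < self.min_score:
            raise GateError(f"{record.name}: score {record.eval_score} is below the gate")
        if record.cost_per_call_usd > self.max_cost_per_call_usd:
            raise GateError(f"{record.name}: cost {record.cost_per_call_usd} is over budget")
        current = self.modules.get(record.name)
        if current is not None and record.version <= current.version:
            raise GateError(f"{record.name}: version must increase")
        self.modules[record.name] = record

    def get(self, name: str) -> ModuleRecord:
        return self.modules[name]


# Hypothetical thresholds, chosen only for the example.
registry = Registry(min_score=0.8, max_cost_per_call_usd=0.01)
registry.publish(ModuleRecord("support_email", 1, "Write a reply...", 0.86, 0.004))
print(registry.get("support_email").version)
```

The point of the gates is that everyone pulls the same vetted module by name, rather than each engineer maintaining a private optimizer on a best-effort basis.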
If you don't do these things, you end up with a complex library of optimizers that individuals are maintaining on a best-effort basis. Costs run out of control and you have great difficulty actually building a consistent pipeline for prompting. As much as individual engineers may want to roll their eyes at this, if you're a team leader, you have to be thinking about this as you start to scale your production pipelines.
Closing Thoughts: Getting Started Is Easier Than You Think
All right, I hope this has been helpful. I want to call out that it's actually not that scary to get started. To get into BootstrapFewShot and start to optimize right away — as long as you have signatures and input/output pairs — it is totally doable, and you can get to applying it to real work quickly. Week three to four is a reasonable timeline, but I know people who've done it much faster. I know people who have gotten into this in just a few days and gotten to actual workflows in the business. It's totally possible.
The key thing is it removes one of the biggest human dependencies in the prompt equation. You now get consistent scaling of prompt engineering expertise by having AI write the prompts, and that's pretty cool.
So there you have it. That's an introduction to DSPy, that's why I'm excited about it, and I hope it gives you a sense of where the state of the art is going as far as using AI to optimize prompts. It's a wild, exciting world. I've written a whole post on how to actually get into it, whether you're a beginner, an engineer, or a team leader managing entire production pipelines where prompt optimization actually runs.