OpenAI's Codex team explains how non-engineers are now shipping real code using AI agents
Nate B Jones interviews two members of the Codex engineering team at OpenAI about how the tool is changing workflows across technical and non-technical roles.
Summary
Tibo, an engineering lead on Codex, and Ed, a design engineer on the same team, join host Nate B Jones to discuss how Codex is reshaping how people work at OpenAI — not just engineers, but designers, go-to-market staff, and new graduates. The episode covers how non-technical employees are now submitting pull requests, how ambient code review has become one of the most loved internal features, and how Codex bootstrapped its own multi-agent behavior without being explicitly designed to do so. The two guests argue that code generation is largely a solved problem and that the next frontier is deployment, maintenance, and safe agentic action in the real world. They also discuss how hiring, career progression, and team structure are all being reshaped by the speed at which model capabilities are improving.
FULL TRANSCRIPT
Introductions and how Tibo and Ed came to OpenAI
Nate B Jones: A couple of days ago I had the privilege of sitting down with two members of Codex's engineering team. I got to talk with Tibo, who's pretty well known as an engineering lead at Codex, and also with Ed, a design engineer. Our focus really isn't the code. So if you're not a developer, this is still going to be super interesting for you. Instead, what we focused on is how does Codex change how OpenAI works? And in particular, when you're talking to someone from a non-technical background like Ed and a technical background like Tibo, how do our workflows shift? How does what we build change when you have Codex as effectively a teammate? What does that look like in practice?
I think we often talk about AI-native organizations, but I wanted to take this chance to sit down with a truly AI-native organization — OpenAI — and actually learn how they use Codex day-to-day and how it's changing everybody's workflows, not just the technical team. So jump in. This is going to be a fun one.
Maybe first I'd love to hear a little bit from you guys about who you are and how you came to OpenAI. I know that everyone has their own story here.
Ed: I'm a designer on Codex. I've been at OpenAI just over a year and on Codex for about six months. Before that I worked on the research team. I've always worked at the intersection of design, design engineering, and research. I worked on robotics before at Google and a few other things before that.
Tibo: I worked at Google as well — I didn't know we shared that piece of history. Always figuring things out. So at Google very briefly, then moved into DeepMind, worked there for many many years, and then decided to make the big jump and come to the US and work for OpenAI. That was about a year and a half ago — pre-reasoning. That was pre—
Ed: Yes.
Tibo: And so in typical OpenAI fashion, joined just before that happened. I was part of the o1 sprint, more like trying to be useful in any way possible. After that I kind of stuck around, built some tooling for research, and became super obsessed late last year around the idea that models are going to continue to improve and their capabilities are going to continue to impress us. Maybe we should think more about the products and the infrastructure around them to really benefit from those models. And then I started prototyping — you were working on similar things.
Ed: Yeah. We were not working together initially and then we joined efforts earlier this year.
Tibo: Yeah. And that was sort of how Codex got started.
Nate B Jones: So you guys were there from the beginning with Codex.
Ed: Yeah. I mean it's had a long history. The name Codex is a throwback to a model which was pre-GPT, I believe. So coding agents have been around, but Codex as it is now—
Tibo: Codex is the product that was released in April this year.
How OpenAI uses Codex day-to-day
Nate B Jones: So one of the things I'll just ask you — what I get asked about Codex, because we get to chat and find out — this is the number one question I get asked: how do engineers at OpenAI use Codex day-to-day?
Tibo: Again, two different patterns. One is everyone just doesn't have a choice on code review — the code is reviewed by Codex no matter whether you want it reviewed or not. It's just been so useful at catching issues. And then there's a lot of casual usage by even non-technical staff. And then at the complete end of the spectrum are really power users of Codex who deploy a lot of compute — a lot more than we saw even a couple of months ago — and this continues to increase and increase, with increasingly complex workflows, some of them multi-agent, running for many many hours. So it's a highly personal thing and still feels very evolving.
Ed: Yeah. So as I say, I'm a designer on the team, so I work very closely with engineers, but I'm very much in the codebase a lot myself. The cool thing about Codex and these recent models over the past few months is they really have been a step change. What you've seen, even since we launched our most recent product suite, is basically everyone at OpenAI using it. There's one engineer I know who uses it for everything — for note-taking — it's basically his primary interface to his computer. As a designer, I'm seeing more and more in our work-in-progress channel on Slack, people posting these demos. I DM'd someone and I was like, "I didn't know you could code," and he's like, "I couldn't till a few months ago."
So you've got design engineers like myself hopping in more, submitting more PRs and getting closer to the details. And then even new people — non-technical people, go-to-market folks — people are really hopping in and it's just this kind of force multiplier.
Nate B Jones: That's exactly where I kind of wanted to chat, because I think for a lot of organizations that remains the dream. But maybe it's something about the command line, the terminal, the scariness that comes with that — for whatever reason people find themselves hard-limiting a lot of these technical tools to engineering teams. Sometimes that is literally at the level of IT policy. I've been in organizations where the IT policy only allows engineers to use tools like this, and if they catch you doing this as a non-technical person, that's a violation of policy. And I think some of these older ways of working and thinking are having to evolve.
Tibo: Yeah. I think the lines are blurring. Ed, I mean, you're sort of everywhere: ideating about the future, but then also very much using Codex every day and, you know, pulling up PRs and little fixes. That's evolved very quickly.
Ed: Totally. Yeah. And to your point, there's kind of one half of it which is how do you bring organizations along — and some of these large organizations might have some more institutional challenges — but once you get access, it's getting as easy as possible. You mentioned it might feel like a bit of a step up for people to get into the terminal. The cool thing with some of our products recently is we've shipped an IDE extension, so we're not just in the terminal. We have a CLI product which we've had for a little bit, but we meet people where they code — they might be in VS Code, they might be in Cursor, these other IDEs. And we also have a web product. So once you connect all the enterprise puzzle pieces, you can just go into a web product, type in a prompt, and create a fix. Say you want to change some UX copy — you're a copywriter, you don't even need to look at the code, you just want to change some strings — you can just do that yourself. So yeah, the number of surfaces that people are working across just makes it easier and easier to get involved.
Codex as ambient intelligence — code review and beyond
Nate B Jones: Yeah, I think that ease-of-access piece you guys have done a nice job solving for over the last few months. The other piece I heard as you were talking is that there's a little bit of a healthy constraint in something like having Codex review every PR — it doesn't matter, it's getting reviewed, you have to engage with it. And I think that's also going to be new for a lot of organizations I talk with.
Tibo: The thing we've been very careful about is also optimizing for signal-to-noise ratio and making sure that the hit rate is very good, so that people don't actually complain and want to turn it off. Overall as an organization we're getting way more value out of it than the occasional misses. And we keep improving the system and the model over time so that it's capable of finding more and more gnarly and subtle issues. People are generally impressed. I hear it all the time — "This thing is superhuman. It's doing reviews I would never have done because I don't have the time to dig four layers deep into the stack." And just having that always on, you don't have to think about it — you have that safety net that's just there.
Ed: Super interesting. As a designer I was thinking through the user experience — oh no, is everyone just going to get loads of emails? It turns out it's one of the most loved features that I think we've shipped. One of the things that changed for me was seeing some of our top contributors across OpenAI — not just our team — commenting in our Slack saying, as you say, "superhuman." I look forward to those notifications now. It just adds so much value.
Tibo: I think there are two things that are emerging. This ambient intelligence — code review is one example of that, where it just happens, you don't have to trigger it, you don't have to think about it, and you just benefit from that intelligence being deployed. Then the other thing is people starting to use it as a little assistant on their computer. It's not really about code — it does sysadmin tasks, pulls context, maybe gets the latest news for you, crafts new designs and new ideas. For that, the current way we're doing it in the CLI and the extension is instant. So the current interfaces are maybe holding things back a little bit.
Ed: It's kind of both ways really. In some ways there's this throwback to the terminal that people are getting nostalgic about. From a design perspective there are two counts — one is this kind of parlor trick, this transitory form factor, and actually there are perhaps some new interaction paradigms that we're pushing towards but aren't there yet. On the other hand, the constraint of the prompt box, the terminal — it's kind of perfect as well for now. It meets you where you are. And it's very cool to see the workflows that people have built around that. You can literally spin up your terminal and write notes, do all of these different things from just such a simple form factor.
The surprising extensibility of a code-focused model
Nate B Jones: Yeah. One of the things that has surprised me — if I go back to that idea of a non-technical use case for Codex — I find that Codex is an extraordinarily logical model. When I'm using it for a non-technical use case, there's a sharpness and a conciseness about how it evaluates a particular set of inputs. I can see where it came from — of course you would get this from a model designed for code — but it turns out there's an extensibility to that emergent property that helps with a lot of other things. I did a business case analysis and it wasn't technical at all — you're analyzing business inputs like revenue, sales figures, etc. But it applies that same rigor, and you get a response that's really coherent, really clear, cogent, easy to read, and as a result very useful. I love the idea that these models end up having extensible properties that perhaps spin off of what they were originally designed for.
Tibo: There's that element of: this model is trained to be precise and correct about things, diligent, double-check — maybe triple-check — its work sometimes, and not do all the math in its head. Maybe write a little Python script to help itself out. I use it for data analysis all the time. And it's not about the code anymore — it's about the result and trusting the steps. As you put it, it's a very cogent, legible explanation. You can see step by step why it's doing things. And then there's a question of: at what point, for these kinds of tasks, do you still need to look at the code? Is the code just a tool that you don't really care about? You're using that as a stepping stone. So then you have a coding agent that's maybe evolving into a more general kind of assistant.
Ed: Yeah. In terms of use — you mentioned design showing up in different places. On the one hand there's fixing the paper cuts, because we're in these tools all day every day, literally eight hours a day. Any small paper cut that you see, you can fix it. Obviously you're submitting a PR, you need to look at the code you're generating, and we go through the review process. But if I'm in a very different mindset — a design, ideating mindset — maybe I can just make my terminal really small, don't worry too much about the code, just have a localhost open, and basically narrow this gap from thought to product. Really just focus on the interactions — you can move things, think about responsiveness — and it becomes more like a canvas. Similar to if you're writing, it's a very different use case but also a very different way of designing.
Blurring lines between design, engineering, and product
Nate B Jones: For so long design has been effectively disintermediated from engineering. Coming from a product perspective, so much of the role traditionally was to translate design into something that has requirements that engineers can build against. There was always this tension between PMs and engineers and designers when I was coming up — everyone has different incentives — and really it's all just a function of disintermediation. If you take away the gap and give everyone access to the code, it's a different world.
Tibo: Often it's like, "Hey Ed, you're just an engineer on the team, right? Writing PRs and fixing things. You don't need to go and talk to anyone. You just do it."
Ed: Yeah. And I think some of these boundaries, as you say, are kind of slightly artificial. They've grown up — first we had the terminal, and then with the Mac we started to think about the GUI, and these new disciplines emerged. They're kind of just converging and diverging over time.
Tibo: You don't like being called an engineer?
Ed: Oh no, I mean — I don't know what I call myself now. I think that's the cool thing.
Nate B Jones: There's an identity crisis where it's just like, what am I?
Tibo: Yeah. I think we're getting into a world where job titles matter less and skill sets matter more. It's exciting to see what happens when people can wear those hats lightly and just focus on what problems they can solve.
Ed: Yeah. It brings a lot of clarity. It's all about the problems — figuring out what problems to solve, figuring out what questions to ask yourself. So much more is possible and it's much cheaper to ideate and build, and then you find yourself thinking, "Wow, I really need to be crisp about what I want to go and do."
Tibo: It's exciting, but it's also nerve-wracking at the same time.
Ed: Good ideas matter more.
Nate B Jones: They do. And correctly aimed ideas matter more, I think.
Tibo: Yeah. The speed and velocity — which direction you're going in, how fast you learn. The most successful teams we see emerge at OpenAI are really small teams that set themselves up to learn and iterate super fast. There's a general sense of "we're building towards this," but changing things is cheaper.
Ed: Yeah. There's the phrase that's gone around in engineering for a long time — code wins. You can write as many PRDs as you like, but until you have the product in your hands. And I think that's the very cool thing I've seen from the product team broadly defined — whether it's designers or engineers — if there's a hackathon, at the end of it you'll have a fully working product. You won't just have a throwaway React demo like you had before, which I think is super exciting. And then the hard decision is what do you build — which direction do you narrow down?
Tibo: Often you come up with a new idea, a feature, an entire product, and I have to do a double take because it's not a static thing. I'm like, this thing is fully functional — it's almost shippable. How did you cook this?
Ed: Yeah. The cool thing about this — and this is another angle that probably hasn't been explored that much — is in the old world you're a designer, or a product manager for that matter, you work in a document, you work in a file, and it's this throwaway piece. Then you throw it to the engineer and the engineer productionizes it. But now some of the demos that Tibo was mentioning — I'll just create a fork of the repo and it's not just a demo. It's a fully functioning thing. Obviously to move fast I've taken some shortcuts and it's going to be a little rough around the edges, but the fidelity that you can get to is amazing.
How Codex got started and the experience of co-evolving with the model
Nate B Jones: There's one of the pieces you just used as a throwaway line that I want to dig into a little bit. You talked about this idea that you come up with an idea and then a team forms around the idea to get that idea to fruition. I feel like that's a little bit the story of Codex, but I think it's also the story of a new way of working. I'd be curious for you guys to share a little bit more about what that feels like on the inside.
Tibo: It feels like we are co-evolving Codex and the way of working. Codex is evolving as fast as we have to adapt to the new possibilities that it creates. It's quite a challenge, but fortunately one thing that's become very clear is that humans are still the very best at adapting quickly to things. And isn't it quite insane to think that earlier this year none of this really existed? Now it's very rare to find someone who still codes without a little agent by their side. Small, nimble teams tend to produce incredible results, and I think that's going to continue to be true.
Ed: No, I'd agree. The really interesting observation I've seen is just how fast people get used to these new step changes. To Tibo's point, I also joined just before our reasoning models. I remember at the time we were talking about shipping this reasoning model — it obviously sits on top of all of this research and had been this huge research project for the company — but again it was this low-key research preview and it was such a step change in so many areas. If you think about where we are now, I just look at how fast the different teams I've worked in over just the past six months have moved. Every few weeks or every model release, we push this frontier even more, and then a week later you'll be in some agent loop and you'll get a bug and you'll be like, "Ugh, this model" — getting frustrated — and then you forget that this is insane. You could apply the same to image generation or video, right? We shipped Sora and it's mind-blowing, and then you see this tiny fragment and you're kind of like, "Oh, you know." But you forget — you zoom out and you just become used to these things so fast.
The cool thing is it's just super empowering for small teams. Even some of the junior engineers that we work with who are only a few years out of university — the breadth of work that they can do and the big swings that they can take has really accelerated even within the past few years.
Tibo: We've got Ahmed on the team. He joined as a new grad, didn't know Rust, learned Rust super quickly. I've never really seen someone pick up a new language as fast as that and get productive. And the way that he manages to accept the technology and the potential and discover the true potential of agents is faster than most people on the team. There's this superpower of how quickly you're willing to try and adopt new ways of working. I've seen veterans — ten years in the industry — kind of stick to their more traditional ways of developing, and it's tough. I'm not sure which one is more effective, but it's pretty clear to me that three to six months from now it's going to be very clear which one is.
Junior engineers, adaptability, and what matters most in the AI age
Nate B Jones: I think honestly it'll be a surprise to a lot of folks to hear that there are junior engineers at OpenAI, simply because there's been a widespread perception over the last twelve months that these tools can dramatically accelerate people who know enough of the business context and have the experience to utilize them — and then juniors coming in, I hear from them directly saying they can't for the life of them get a role anywhere because they don't have experience, and now you also need to have the ability to leverage that experience with AI, and it's even harder. But you guys have juniors and they're apparently doing very well. What's that been like?
Tibo: It's been awesome. It brings such a joy to work as well, and a fresh perspective, and keeps us grounded. I've been delightfully surprised about how well that's been working for us. It's changed my perception of what is important: this adaptability. And a lot of it was also that Ahmed, to take him as an example, almost grew up with this. It's not quite true yet, but at some point it will be true that life before coding agents, before background ambient intelligence, before having a little assistant in your terminal, was simply not a thing. So it's just super natural to them. Whereas for me and others, sometimes I'm just like, "Oh, I'm going to go back to Vim," and I'm slowing myself down in a way. And then you look at the way they are using AI today and you get inspired. It's been really interesting how they've been able to level up the rest of the team who are on paper more senior.
A combination that I've seen work very well is where we do spend a lot of time on the general architecture of a codebase. The principles of software engineering still remain. Once you have the right scaffolding, you can just go and run and be extremely fast and proficient, because the agents just respect the general scaffolding and the boundaries that you've set.
Nate B Jones: Reading between the lines a little bit, it sounds like the quality of character that is most important — that you guys are seeing on the ground as you work with these models and co-evolve teamwork with them — is it around openness to experience and learning new things and the ability to adapt quickly? Whether you're junior, senior, technical, or non-technical, is that what you've got to have in the AI age, or is there something else?
Ed: It's interesting. I interview a lot of designers. These are definitely qualities I look for when hiring. We're going through a step change technology-wise and being open to those new ideas and new tools is definitely helpful. If I think back — I'm a child of the internet, I grew up pre-internet and post-internet — I kind of feel like we're at the same point now for software engineers, creatives, and designers: pre-AI, post-AI. I'm seeing more and more people who are maybe skeptical, or learning about it a little, as Tibo says perhaps set in their ways, dipping their toes in and just seeing the crazy benefits and moving forward from there.
Tibo: Curiosity and willingness to engage is the single most important thing right now. It's clear that we're only at the very beginning of what's going to continue to evolve. Model capabilities are going to continue to increase; we are not seeing any sign of a slowdown. The 5.2 that just came out is a very strong model, but it's also one of many more to come. We have a very clear research roadmap that a lot of the team and the rest of OpenAI are excited about. This will continue to revolutionize how we do software engineering. If you're not willing to accept that, it's going to be tough. People who are curious, focused on solving problems, and out there in the world asking how they can improve people's lives are the ones having a great time right now.
Nate B Jones: That's really been true. The stories I know that are positive, hopeful, and exciting tend to be correlated really closely to people who have a nose for interesting problems and a curiosity to solve them, and who just look at AI as this really cool super-tool they can use to solve those problems. I know someone who started out as a music major in college and is now a technical founder because he felt like being one. And now you can do that. He just went and solved problems for customers. One of the things I find most interesting about the trajectory we're on is that those stories become more and more plausible.
Ed: I think it's maybe one of the underrated parts of it. There are a lot of valid concerns that people have, but the thing people don't focus on as much is that it is an equalizer. When I got into design as a teenager, I was doing a lot of animation, hand-drawing things, making movies with my friends in the garage. You had to build a green screen, you had to buy the camera which was expensive. Now with a $20 subscription you can as a creative make basically anything. You get access to Codex, you get access to all these other things. In many ways it's an equalizer, but it does require leaping in — having that curiosity and really throwing yourself in and learning all about it.
Tibo: We still have complaints that usage limits and rate limits are too low, but when you think about it — $20 a month for a prolific software engineer that can help you get stuff done. It's crazy. And this equalizer thing — there are so many problems that were left unsolved before this and will now get solved. That's what gets me excited.
Choosing what to focus on when everything is possible
Nate B Jones: That brings me to another question. You referenced earlier this idea that it's about choosing what you're going to focus on in this world because the tools are so powerful. And then you just mentioned another big piece I've seen, which is that there's a whole host of problems that for lack of a better term are tier-three and tier-four type problems that are now accessible and legible and solvable because we have a tool that can do them. On the one hand you have more volume you can attack that's perhaps lower urgency, and on the other hand you have a lot more value on picking your overall direction correctly. In practice for you guys, what is that balance like?
Tibo: Two things matter to us. One is general conviction based on information around the idea that our models are going to continue to improve along this set of capabilities — let's build ahead of that so that we continue to scale and bring more benefit to our users. The second part is what are people asking for. I was on Twitter the other day and I started a thread — "Hey, what should we build? What's holding you back? What's not delightful on Codex right now?" — and got somewhere around 250—
Ed: Yeah, I saw that thread. It was a good thread.
Tibo: It was around 600 unique ideas. But Codex helped me sift through it all, bring it back, section it based on my own priorities and my own notes, and then I was able to discuss it with the team. So conviction and feedback are two good ways we go about it.
Ed: Yeah. To mention a few other areas where we've been building: we have the CLI product, the web product, and the IDE extension. We also have some cool integrations. You can add Codex in Slack and you can add Codex in Linear. One really cool trend I've been seeing involves the small tickets that the team might struggle to get around to by the end of the quarter. They're always there, coming up at the end of meetings. Now, after you've triaged a bunch of these things, maybe there are a bunch of small ones you could just put into one of these integrations and say, "Codex, fix this," or you can literally assign it in Linear and other products. We're starting to get to these really end-to-end workflows: you track a small problem by literally writing it down in some short descriptive way, and then you have a PR that you can review and choose to merge or not. Getting a lot of that low-level work almost automated frees up resources and capacity so that the team can really focus on the big issues.
Tibo: It also moves bottlenecks around. As we're almost solving code generation and you can implement any feature faster and faster, suddenly you're left with deploying and maintaining the services. Whenever your hardware breaks or networking has an issue or whatever million things can happen — now suddenly you get paged a little bit more. You're building ahead of the automation we're able to deploy, and the intelligence is not yet capable of doing all these things. We're not able to yet have Codex deploy the service and be on call. This is an area where we're currently feeling that load from having almost solved code generation.
Nate B Jones: Yeah, I was going to go there, so I'm glad you did. Because to me it's like you've 100x'd code generation — or whatever multiple you want to use — but now you've just shifted all of that down the pipe.
Ed: Yeah. It opens up some cool interesting interface possibilities. If you think about ChatGPT, you're conversing back and forth with a model, asking for some piece of information, and it presents something back to you. With a coding agent, it's taking some action in the world and coming back — most often in a codebase — and the artifact and result of that is some code that you have to review if you want to do something useful with it. We're in this kind of transition period where the meme is that a lot of software engineering is reviewing agent code. As an interface and as a problem to solve, that's a really interesting one to think through. It's one we're thinking through and one that many people in the industry are thinking through — how do you not shift the burden from writing code to basically reviewing code, and how can you make that as smooth as possible? We're doing some cool stuff in that space with the code review agents and other things, but that's one emerging problem that we'll have to solve soon.
Tibo: One of the things that's special about code generation is you can make it safe. You're going to have all the code generated in a sandbox with no side effects. And because all the context is there and it's textual — you have Git, you have a lot of the automation already existing, a lot of the tooling already existing — it was solved first, I think primarily due to a combination of those reasons.
You can make it safe. A lot of the work that we do is viewed through the lens of safety and alignment. Alignment is not a solved problem, which means that once you go into the world of deployment and being on call, where an agent's actions have real-world consequences, that is a whole other game. You cannot yet guarantee that the agent will not go and delete your service or snoop on user logs. There's a whole security aspect: either you figure out how to restrict the set of actions to a safe space, or you have to solve the alignment problem, whichever comes first. We're inching towards that and finding more and more creative ways so that our agents can act upon the world safely and you're able to steer and supervise that. I think that's the next frontier of what we're going to unlock in 2026. Code generation: considered mostly solved. Code review: we've been investing a lot in. And then where are the bottlenecks now?
Staying fluent in code as generation becomes automated
Nate B Jones: That's kind of where my head goes as I look at the next year as well. One of the things that people tend to get curious about — and I think will come up more and more as a conversation in 2026 — is how do engineers stay fluent and able to read code structures in ways that are meaningful, in a world where code generation is mostly a solved problem? How do we keep the fingertip skills that are relevant so that you can understand what you're deploying?
Tibo: There's this part we didn't discuss — code understanding and planning. How quickly can you figure out how your system functions today, and then use that knowledge to plan your changes, and then after you have your changes, how do you actually deploy them and have an effect upon the world? But also, I am more productive, you're more productive, everyone on the team is more productive, and keeping up with all of that is a challenge. There are new features minted every day. The world is changing so quickly around you. Just in teams, even small teams, keeping up with it all is a challenge.
Ed: And saying that — I just want to be clear, everyone's going to be very discouraged to hear you say that, because we're all trying to keep up. You want to have fast ways to understand what's going on in the codebase, synthesize things. Is text the right way to do that? Do you want a little report every day? How fast should your agent be in order to help you understand the state of the code?
And to your point about staying on top of programming as well: not delegating everything and still deeply understanding things. I've seen some cool examples of people internally occasionally turning off their internet and basically doing old-school coding, with no tab complete, no agent, no Codex next to them. And human curiosity doesn't go away. People still need to learn. The engineers on the team still read engineering books. I still read engineering books. I don't think curiosity is going away, and I don't think coding will become this thing that you hand off entirely and lose all the knowledge yourself. And as you say, models can help you stay up to date as well. If I'm trying to get to know a codebase, I can talk to the model about it: ask it how the backend integrates here, where this component comes from, whether it can explain the dependencies. The model itself is also an amazing resource.
Emergent model behaviors that surprised the team
Nate B Jones: When was the last time you were really surprised by an emergent property in a model?
Tibo: This morning. I just saw someone build scaffolding around a model to enable it to work on a problem that I thought was out of reach of the current capabilities of models — and solve it successfully. I was really surprised. I thought we would need to train the model specifically to be able to do this. But it turns out it generalized fairly well and worked for almost thirteen hours on this one, just by being more creative on the tools and the setup around it. I hadn't seen this done before. That was really surprising to me.
Ed: Yeah. Most days there's something. There's one that actually stuck out to me, which we released: if you go in the web product and ask the model a question, it can send you some front-end back. It takes a photo of the result and sends that back with its answer. When I first saw that, I thought that was magical. It's using a bunch of tools, but there is something very interesting about thinking about coding agents being able to code, but also being able to see, being able to generate these assets. At a conceptual level I just found that really interesting as a creative, that this model can do so much more than I thought.
Nate B Jones: One of my biggest takeaways as I reflect on 2025 is that as much as I was excited about tool use for models — and I was — I don't think I realized the combinatorial power that gets unlocked when you start to give a model a good set of tools. And there is something there about what a good set of tools is.
Tibo: The approach we've taken with Codex is just give it access to your computer through good old Unix tools — give it a shell — and let's see how far it can get. And then in order to do this safely, have it run in a sandbox. What emerges from there is to us surprising, because we don't necessarily care about how the model is going to be able to achieve its task. We don't necessarily have a specific bias there, other than you should probably use the shell a bunch of times. Other than that, it's a very general tool. That's something we've done consciously because we believe it's one of the more scalable ways of doing things — it scales with model capabilities and it's super general.
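As a rough illustration of the "one general tool" idea described above, here is a minimal Python sketch of a single shell tool that an agent loop could call. The function name `run_shell` and the scratch directory are assumptions made for illustration; this is not how Codex is implemented, and a temporary directory provides no isolation, so a real setup would still need an actual sandbox around it.

```python
import subprocess
import tempfile


def run_shell(command: str, workdir: str, timeout: float = 10.0) -> tuple[int, str]:
    """Run one shell command in workdir; return (exit code, combined output)."""
    proc = subprocess.run(
        ["sh", "-c", command],
        cwd=workdir,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout so the caller sees one stream
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout


# Scratch directory only: this confines nothing and is NOT a real sandbox.
scratch = tempfile.mkdtemp()
code, output = run_shell("echo hello > note.txt && cat note.txt", scratch)
```

The point of the design is that the tool itself knows nothing about the task: anything the shell can do, the caller can do through this one entry point.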
Ed: Yeah. And the surprising thing on the creative side as well: you mentioned someone who uses it to write documents and things like that. It turns out you don't need to give it a document-writing tool. It can just use regex and bash commands to edit documents, write, do anything. That was perhaps not surprising, but it's a pretty amazing capability.
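To make the regex point concrete, here is a small Python sketch of editing a document with nothing but regular expressions, the same kind of change an agent might make with `sed` in a shell. The helper `replace_in_file`, the file name, and the sample text are all hypothetical.

```python
import re
import tempfile
from pathlib import Path


def replace_in_file(path: Path, pattern: str, replacement: str) -> int:
    """Apply a regex substitution to a text file in place; return the number of replacements."""
    text = path.read_text(encoding="utf-8")
    new_text, count = re.subn(pattern, replacement, text, flags=re.MULTILINE)
    path.write_text(new_text, encoding="utf-8")
    return count


# Hypothetical document, created in a temp directory so the sketch is self-contained.
workdir = Path(tempfile.mkdtemp())
doc = workdir / "report.md"
doc.write_text("# Q3 plan\nOwner: TBD\nStatus: draft\n", encoding="utf-8")

# Fill in the owner and update the status without any document-specific tool.
changed = replace_in_file(doc, r"^Owner: TBD$", "Owner: Alex")
changed += replace_in_file(doc, r"^Status: \w+$", "Status: final")
result = doc.read_text(encoding="utf-8")
```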
Tibo: The other day I was just playing with something. We had a Codex SDK and I just told Codex about it, and it was able to write code, use the SDK, write a bunch of TypeScript, and then basically invoke itself in order to achieve more. We don't have native multi-agent in Codex, but this is a form of it that just completely emerged, because Codex read the documentation, in effect thought, "I can probably get this tool to do something for me," wrote that code, invoked it, and it just worked. Codex is very good at figuring out ways to solve its problems.
Nate B Jones: So Codex essentially read the SDK docs, instantiated another Codex instance, and used that as a tool to get a job done?
Tibo: That's right. A bunch of them actually.
Ed: Yeah.
Nate B Jones: Effectively it bootstrapped multi-agent.
Tibo: Yes. Without us thinking about it. And so there's this thing of throwaway code — code as a tool. It's obviously extremely powerful. But maybe there's just this whole category of things where the agent is just writing code as — it's not a piece of code that you should ever review as a human or necessarily care about. It's just a very general tool.
Nate B Jones: It's code as a means, not code as an output.
Tibo: Yes.
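The pattern described in this exchange, an agent writing throwaway code that launches more copies of itself, can be sketched roughly as follows. This is only an illustration of the shape: `run_agent` is a hypothetical stand-in that echoes its input, not the Codex SDK's API, and a real version would call whatever SDK or CLI actually starts an agent.

```python
from concurrent.futures import ThreadPoolExecutor


def run_agent(task: str) -> str:
    """Hypothetical stand-in for starting a fresh agent instance on one task.

    Here it just echoes the task so the sketch runs on its own.
    """
    return f"done: {task}"


def fan_out(tasks: list[str], max_workers: int = 4) -> dict[str, str]:
    """Run one agent per task in parallel and collect results keyed by task."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input tasks.
        results = list(pool.map(run_agent, tasks))
    return dict(zip(tasks, results))


subtasks = ["update the parser", "add tests", "write the changelog"]
report = fan_out(subtasks)
```

Nothing in this script needs human review: it exists only to get the subtasks done, which is the "code as a means" idea from the conversation.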
Memory management in long-running agentic tasks
Nate B Jones: Riffing off the tool piece — I've also seen models with fewer but more powerful tools that are more general in nature doing better overall. I'm going to the other side now, looking at the memory side of things and how long-running agentic tasks handle memory problems — both stateful memory you have outside the system and also in-context memory management approaches. How do you guys think about that for, say, a twenty-hour task? How does memory work?
Tibo: Memory is still an open research topic. It's clear something will emerge that is better than whatever short-term approaches we're taking right now. As a form of memory, you can have the model write to a file and keep track of a lot of its state through just markdown files, for example.
Another thing we're doing is for very long-running sessions where the model goes beyond its context window — the model is forced to summarize what it's achieved so far and then reboot itself through a process we call compaction. At the end it's just: let's erase all the content of the context window, summarize it, reboot, restart. You can do this many many times, and essentially you're able to have the agent work forever. If the task required it to work forever, it would work forever.
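A minimal sketch of the compaction loop as described, under loud assumptions: `count_tokens` is a crude word count rather than a real tokenizer, `summarize` is a hypothetical placeholder for asking the model to summarize its own progress, and the budget is an arbitrary number.

```python
def count_tokens(text: str) -> int:
    """Crude token estimate (whitespace-separated words); an assumption for the sketch."""
    return len(text.split())


def summarize(messages: list[str]) -> str:
    """Hypothetical stand-in for asking the model to summarize its progress so far."""
    return f"summary of {len(messages)} earlier messages"


def append_with_compaction(
    context: list[str], message: str, budget: int
) -> tuple[list[str], bool]:
    """Add a message; if the context exceeds the budget, replace it with a summary."""
    context = context + [message]
    if sum(count_tokens(m) for m in context) > budget:
        # Erase the window, keep only the summary, and continue from there.
        return [summarize(context)], True
    return context, False


context: list[str] = []
compactions = 0
for step in range(10):
    context, compacted = append_with_compaction(
        context, f"step {step}: ran tests and edited files", budget=20
    )
    compactions += compacted
```

Because each compaction resets the context to a short summary, the loop can keep going for as many steps as the task needs while the context stays under the budget.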
In addition to that, because it has access to grep and is able to search through things, it can also dump additional context that it doesn't really need to always have in its context window just to files — that's a form of memory. With skills as well — you might have skills in a file somewhere, and it's a form of memory that is shared between the user and the agent. You're co-evolving some common knowledge there, which allows you to have an agent that performs hopefully better over time. There is a problem with staleness — it's sort of a poor and hacky version of memory and it does feel like this will get disrupted at some point. But that's mostly how we've seen it tackled, and it's a very simple way of achieving it.
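The file-based memory described above can be sketched in a few lines of Python. The file name `NOTES.md`, the helpers `remember` and `recall`, and the sample facts are hypothetical; the lookup is a plain substring search standing in for `grep`.

```python
import tempfile
from pathlib import Path


def remember(notes: Path, fact: str) -> None:
    """Append one fact as a markdown bullet to the notes file."""
    with notes.open("a", encoding="utf-8") as f:
        f.write(f"- {fact}\n")


def recall(notes: Path, keyword: str) -> list[str]:
    """Return the noted facts that mention the keyword (a grep-like lookup)."""
    if not notes.exists():
        return []
    lines = notes.read_text(encoding="utf-8").splitlines()
    # Strip the leading "- " bullet marker from each matching line.
    return [line[2:] for line in lines if keyword.lower() in line.lower()]


notes = Path(tempfile.mkdtemp()) / "NOTES.md"
remember(notes, "Tests run with `make test`")
remember(notes, "Backend lives in services/api")
remember(notes, "Flaky test: test_upload, retry once")
hits = recall(notes, "test")
```

Since the notes are an ordinary file, a user can read and edit them too, which is what makes this a memory shared between the user and the agent. Nothing here addresses the staleness problem mentioned above: old notes stay in the file until someone removes them.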
Nate B Jones: That's one of the themes that's emerging as we chat — you are seeing surprisingly simple primitives be surprisingly successful at solving larger generalized problems.
Tibo: Yeah. I think that's something a lot of people in the field have learned a long time ago and it's sort of being internalized. It's not necessarily common general knowledge that keeping things simple with models that are evolving in capabilities month after month is probably the right thing to do, because otherwise you end up with a pile of complexity that you have to continue to adapt to the ever-evolving capabilities. That's why we decided to keep things very simple as well.
Career progression and identity in an AI-native world
Nate B Jones: The other big question I want to get to — this gets back to the whole idea of technical and non-technical folks using Codex. I get asked a lot: how do I think about my career? How do I think about career progression in a world where job titles are increasingly optional hats that you can take off and put on, and it's about the problems you solve? What is the career conversation inside OpenAI, and how does co-evolving with the model shape that?
Ed: Yeah, it's a good question. One emerging trend I'm seeing among designers and to an extent some engineers as well — which I personally think is a positive direction — is there's less of a focus on credentials and going through certain routes to ascend certain peaks of credentials, and more of a focus on what you've done and what you can show. Particularly in the design community, what I'm seeing is a lot of people building really exciting things, putting them out there, and from a career perspective building up profiles through what they've done and what they show. No one cares where they went to school and all of those other things perhaps of the past. So there is definitely a learning-through-doing and proving-through-what-you've-done trend, which I think is exciting. A lot of the creator economy and the rise of podcasts and personal media is similar — anyone can just do things and show their skills through that direction. That's one trend I've seen personally.
Tibo: "You can just do things" is very much a mantra. The second mantra we have is "I am also curious about this." Those are sort of the two things that people exhibit at OpenAI. Career progression is less of a challenge at OpenAI: we look at impact and that's how you progress. It's been more of a challenge on interviewing and finding the right people, because the traits you're looking for and the ways you can succeed have broadened. Before it was: do you program well? Let's give you a series of really hard programming tasks and pick the best talent there. But now it's not that easy anymore. You can actually be very successful even if you wouldn't traditionally be a top performer at hard programming tasks. Being able to find the talent and be more creative here has been a challenge for us, and we're evolving our thinking there.
Nate B Jones: Do you guys have a silver bullet for the persistent issue we see with interviewing where people will have ChatGPT up on the side and will just be reading responses back in the interview?
Tibo: We bring people on-site for a lot of the interviews. And it's also just a reality that in the job you're going to use AI all the time. So the interview itself needs to evolve and maybe not limit the tools that people can use.
Ed: Yeah. One way of thinking is: how are you using AI to get around some constraint? And another is thinking of it in an empowering way. You need to hit certain baselines, you need to have certain skills in place, but also — are you open to using tools and how can you understand how those can give you leverage?
Nate B Jones: Yeah. It reminds me a little bit of one of Jeff Bezos's favorite interview questions, where he asks people how you solve a problem that has two obvious solutions — like, I want to have a higher quality car that goes faster. How do you do that? You have to pick which one to optimize for. And the trick is you're supposed to invent your way to both. You're supposed to think outside the box, think around the constraint, push on both. That part of it is just measuring the willingness to break the mental box.
Ed: And is the point there that if you were using AI on the side you wouldn't come up with a creative—
Nate B Jones: I don't know that it's necessarily that. The best way I've seen to push people off script — let's assume a remote interview, because I think on-sites are critical but we'll take that as read — if it's a remote interview, the most effective tool I've seen is to push off the standard behavioral interview script pretty fast and push people into a really honest conversation that demands some higher-level trade-off thinking. They don't really have time to feed it into a model and get a response in real time. And you see pretty quickly what their thinking tool set is in their head, and that gives you a sense of what they're going to bring as a partner with AI when they start to work.
Tibo: We have this problem now — not sure if you're reading the questions off the screen on the right.
Nate B Jones: I actually am not. I have no questions up. I'm staring at myself and you guys.
Tibo: So are we.
Ed: Delightfully simple.
Nate B Jones: Yeah, I think it's more fun that way. We get to take the conversation where we want to go. I did prep with ChatGPT — I absolutely prepped with questions — but then I was like, they're okay, and I'm just going to riff it.
Communicating model improvements to users who see the same UI
Nate B Jones: Another question — I know we only have a few minutes left. One of the things we haven't dug into yet that I hear a lot — and maybe this is half a design question, half an engineering question, so perfect for the two of you together — I hear a lot of talk, especially when I put out videos talking about new models. People will say, "What's the difference? I see the same chatbot. I see the same terminal. I see a different label." How do I know that this is actually better? And we've even had comments from Sam and others that chat is essentially a saturated use case. How do you convey the step change that you get in capabilities over a six-month or year period to someone who is seeing the same UI?
Ed: It depends on the use case. As a designer or design engineer, I work with different models and for some tasks I like one, for others I like another. If I'm in ChatGPT and I'm asking a simple code question, I'll just leave it on auto or some low reasoning model, and if I really want it to think, maybe I'll use Pro or something like that. You test it and it depends on the situation. We have a bunch of research evals, so there are certain barometers you can use. We've talked throughout this interview about the different capabilities that different model steps unlock, and we've consistently seen that models today, at least for coding, are substantively different from where they were when I joined this team.
Maybe one good mental model that people don't think about is that when thinking about these things, they think with a snapshot of where we are currently with how you interact with current models. In five years' time it's going to be very different. If you think about what these new capabilities can unlock, different products will have very different experiences. Is chat always the best interface? You might not be interacting with a model at all but it's still doing work for you in the background, and in that case model quality and the model you're using is very different.
Tibo: To come back to that, code review is again an example of where it just happens in the background. The model improves and, you know, either it got faster or it's able to spot more things. You just got an upgrade for free, you don't have to think about it, you just benefit from it every day. Codex itself is a different product from chat. Agents still benefit from a lot of improvements; we definitely see that whenever we improve on reliability, frontier intelligence, how long it can go. But then it feels like we will need another product at some point as well, where you're not going to run Codex for three days in your terminal. Maybe people will, but at what point do we have agents that just run forever, and you interact with them? Maybe you text it from time to time, maybe you call it. We have yet to invent the right product around this. The models are not there yet but they will be. And then it's going to be like, "Oh wow." It'll be obvious. In the meantime it sort of feels incremental at times, but then when you look back six months you're like, none of this was possible.
Nate B Jones: Exactly. I did a video this morning about this idea that straight-line extrapolation is surprisingly hard to experience in real time. You're sitting there and as you said, people get used to the products so fast and get disappointed and frustrated by them. I vividly remember how excited I was working with ChatGPT 5 thinking and how it felt like a step change, and then immediately within two days I found a bunch of things I didn't like about it and wanted to fix. That's just how it goes because that's how humans scale our taste. One of the things I've been thinking about is how do we take our human defaults — where we seem to assume a static world — and start to transition to a human default where we assume a dynamic world, where we're living on the slopey part of an S-curve and have to think about rapid gains in capability as a default base case.
Closing — how Nate uses AI models, and the benchmarks of the future
Nate B Jones: For our last couple of minutes, do you have a question for me or the audience? I know you guys are always hungry for voice of customer. Is there something that's been bugging you that you'd want to ask?
Tibo: What are you — I'm always curious how people use coding agents outside of coding. You mentioned you use ChatGPT to get ready for the interview, but how do you use it in your day-to-day?
Nate B Jones: I love that. I am a relentless omnivore when it comes to AI models and I tend to jump around really quickly, but I have settled task groups: once I decide where I want them to live, I tend to keep them there. Right now I'm using ChatGPT (it was 5.1, now it's going to be 5.2) for a lot of the structuring, brainstorming, researching, and thinking that I do as I start to think about pieces that I write and what the stories are and how they come together.
I'll use Codex when I want what I call hard thinking mode. ChatGPT 5.2 Pro or 5.1 Pro: people talk about it and I've tried it out and I like it, but I think it is sometimes overexpressive for what I need. That's why I mentioned conciseness as a value I appreciate in Codex: it comes back and doesn't give me a thousand tokens. It just comes back with a really concise answer. I love the legibility of that and I kind of get addicted to it. When I need something that is a very clear, concise analysis (financial analysis, project analysis, M&A analysis, doc analysis, a response to something really complicated that I need to think through and want to draft out), Codex is great for all of that because it boils things down really cleanly. You get exactly what it boils down to and it's really dependable that way.
I get a lot of mileage right now out of the document creation tools from Claude. I know they're the other guys, but they are definitely shipping well there. Opus 4.5 really does well on that. When I want a PowerPoint, when I want an Excel, it does well. I've been both amazed and frustrated by using NotebookLM and Gemini to ship PowerPoints, because I don't get editability and I hate that, but I get all of these lovely graphics and that's fantastic. So I think we're in this place where people are tool omnivores because we're desperate to get the best thing in the moment for the particular task. We all tend to have that list of paper cuts. I think Claude Opus 4.5 has good tool use, but the ability to create decorated PowerPoints is not really there. I think you guys have a ways to go in PowerPoint creation right now. But on completeness: I'm doing some early work on ChatGPT 5.2 and it spits out a very complete document, a complete answer that thoroughly answers the question. One of the things I preach a lot is you must get really fingertippy with your models to actually be able to use them to solve problems in a way that's differentiated from just typing something into a chatbot. So it's a long way of saying I have about half a dozen models and I use them all every single day.
Tibo: And then you have this mental model of what is this model good for, what is that model good for.
Nate B Jones: That's right.
Tibo: And there isn't yet this perfect model that answers all your needs.
Nate B Jones: And your needs keep changing.
Tibo: Yeah.
Nate B Jones: Because they keep changing. And that's the thing you've emphasized that I really strongly agree with: if I were to try and boil this into a table, it would look incorrect, because I have an evolving sense-map of where these models are good and not good. It's very fine-grained. I have learned which models read handwritten tally marks well and which ones don't.
Tibo: When you refresh that knowledge — when you allow yourself to say, "Let me try this other one"—
Nate B Jones: All the time. And that's one of the things people lean into with my channel: I am extremely open to new models and new experiences changing my priors, because I think you have to be in order to be at all helpful to people in this evolving landscape. You have to assume that any given new model release from any major model maker could upend a key part of your workflow because it's just better. You should not sit there and say, "Well, ChatGPT before didn't do this, so I'm not going to pay attention to this tool use capability." You should assume that the model is fully capable of surprising you and test it carefully.
One of the things I've been thinking about is what we mean when we talk about useful work and what useful work looks like. We've talked a ton about Codex, and obviously with Codex useful work is PR reviews, coding, work you can define in terms of the bits and bytes the model outputs. It gets a little more complicated with other knowledge work. I'd be curious how you guys think about that — maybe beyond Codex, maybe looking at GPT 5.2.
Tibo: It's a similarly difficult question even just for coding. There are certain benchmarks like SWE-bench that are super saturated by now — do they really measure how much use you're getting out of the model in day-to-day use? We've talked about how we're going beyond just code generation — helping you understand things, helping you review, deploy, do sysadmin tasks, and more and more things. Helping you build design prototypes. It's all about the economic value that you're able to create. OpenAI really worked hard on GDPVal for 5.2. I think it's interesting to go from really hyper-specific saturated evals to a better understanding of how this is impacting the real world. No eval is perfect, but whenever a model puts a new score on something that measures economic value, it's worth taking a look at it.
Nate B Jones: I appreciate you calling that out because it gets at this idea that people get suspicious of benchmarks when they get reported on and you get close to 100%. It's like, okay, but what's another 2%? Whereas I think there's something around an implicit measure of generalizability that you get when you get to some of these measures of economic impact. GDPVal I think is a good one. Isn't there one around vending machines? That's another one in that vein.
Tibo: Yeah, it's a fun one.
Nate B Jones: Is it Vending Bench or Vendor? Yeah. But with GDPVal it's clearly not saturated yet.
Tibo: And that's usually the cycle you see with evals — an eval gets published, gains traction, gets saturated. It did measure something useful at some point, and then maybe after a couple of months or years it's not really measuring anything very meaningful anymore because every model is performing more or less the same on it. And then you have a new one. GDPVal is measuring something more interesting again, and given it's not saturated, it's always interesting to pay attention to it. Vending Bench is also a fun one.
Nate B Jones: It sort of underlines the story we've been telling through this whole hour together — progress just keeps happening relentlessly with these models and there isn't a wall. Contrary to popular reports, there isn't a wall. We continue to see progress, and it's something that allows us to continue to publish new benchmarks because we keep knocking down the old ones.
Tibo: An interesting thing there is: what are the benchmarks of the future? It would be pointless to have them now because every model would just score zero. Like, being able to be the CEO of a multi-billion dollar company — that would be a useful benchmark. Do we allow models yet to run multi-billion dollar corporations? Not quite. But I'm pretty sure at some point we're going to have these kinds of benchmarks. They seem crazy now, but they're not going to be crazy in a couple of years.
Nate B Jones: That's a really interesting brain teaser — what are the evals of 2026 and 2027 that we're going to turn to as measures of value?
Tibo: Thanks for having us, Nate.
Nate B Jones: Yeah, thank you. That was a good one to end on. This has been lots of fun, guys.
Ed: Thanks so much.