
Codex /goal Just Ate the Agent-Harness Category

OpenAI's Codex CLI shipped a /goal loop that absorbs the harness category overnight. Here's what survives — skills, memory, and tool integrations.


ComputeLeap Team

Codex /goal absorbs the agent-harness startup category

On the morning of May 1, Sam Altman tweeted that "all of these 'which is better' polls are silly" and that developers should "use codex or claude code, whatever works best for you." The tweet pulled 15,300 likes and 818,000 views in a few hours.

Sam Altman: 'use codex or claude code, whatever works best for you'

That's not magnanimity. That's a posture. Market leaders do not need to defuse comparisons; challengers do. And the same morning Sam published the white-flag tweet, three things shipped that, taken together, explain what he was actually defending against — and what he was quietly winning.

  1. OpenAI's Codex CLI 0.128.0 added /goal — a native autonomous-loop primitive that runs while True: agent.step() inside the CLI itself.
  2. Cursor's SDK launched, exposing "the same runtime, harness, and models that power Cursor" as an embeddable agent substrate.
  3. GitHub trending was, simultaneously, dominated by personal-flavor agent-harness repos — five of the top fifteen, totaling +9,400 stars in 24 hours.

Read together, this is the surface war crystallizing. The harness layer — the while-loop wrapper around the API that a dozen YC-funded agent startups built businesses on — is being absorbed into the platform CLIs. What survives, and where ComputeLeap thinks the next 12 months of independent value lives, is the skills layer, the memory layer, and the tool-integration layer.

This article makes that argument concretely: with the actual repos, the actual tweets, and a reproducible test of Codex /goal against Claude Code on a published task. We close with the one signal that disagrees with the entire dev-channel narrative — and why that disagreement is the trade to watch.

📖 If you've been tracking this all year, see our prior coverage of OpenAI's Codex model-line consolidation and the Karpathy CLAUDE.md template moment — both pieces are direct prequels to today's story.


The /goal Primitive: What It Actually Is

Simon Willison's coverage of Codex CLI 0.128.0 on his Substack cut to the chase in one line:

"Bad day for any startup whose moat was a while-loop around the API."

That single sentence captures the structural shift. For most of 2025, the "agent harness" — the layer between a developer's prompt and the model's tool calls — was a fragmented startup category. AutoGen, LangGraph, BabyAGI clones, custom YC-batch wrappers — dozens of teams shipping variations of the same recipe: take an LLM, give it a tool list, run a loop, check for completion, retry on error.

/goal collapses that recipe into a one-liner inside the official OpenAI CLI:

```bash
codex /goal "Refactor the authentication module to use JWT, run the test suite, and open a PR with the changes"
```

That is the entire harness. No SDK. No agent framework. No pip install langchain-experimental. The loop, the tool registry, the retry logic, the completion check, the error recovery — all ship in the CLI.

All-In Pod's read this week is that Codex's /goal "materially narrowed the Claude-Code lead this quarter." The framing is right. The implication is bigger.

What just got commoditized: the while True: agent.step() loop, the basic tool-use scaffolding, the per-step prompt envelope, the retry-on-tool-error pattern. None of these are products anymore. They are CLI features.
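To make concrete what just became a CLI feature, here is a minimal sketch of that recipe: the loop, the completion check, and the retry-on-tool-error pattern. The names `call_model` and `run_tool` are illustrative stand-ins, not any real SDK's API.

```python
# A stripped-down version of the harness recipe the startup category was
# built on: take an LLM, give it tools, loop, check for completion, feed
# tool errors back to the model so it can retry.

def run_harness(goal, tools, call_model, run_tool, max_steps=20):
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        step = call_model(history, tools)   # per-step prompt envelope
        if step["type"] == "done":          # completion check
            return step["result"]
        try:
            out = run_tool(step["tool"], step["args"])
        except Exception as exc:            # retry-on-tool-error: surface the
            out = f"tool error: {exc}"      # failure and let the model retry
        history.append({"role": "tool", "content": str(out)})
    raise RuntimeError("max steps exceeded")
```

Roughly fifteen lines. That is the moat Willison was describing, and it now ships inside the official CLI.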


The Cursor SDK Move — Same Week, Same Direction

The same week Codex shipped /goal, Cursor shipped the Cursor SDK:

Cursor SDK launch tweet — agent-native runtime and harness

"Cursor SDK so you can build agents with the same runtime, harness, and models that power Cursor. Run agents from CI/CD pipelines, create automations for end-to-end workflows, or embed agents directly inside your products."

The Cursor announcement pulled 8,600 likes and 2.8 million views. The framing matters: Cursor is not selling a developer product anymore. Cursor is selling runtime substrate.

Read those two announcements next to each other:

  • OpenAI: the harness ships as a CLI primitive. Free for any developer.
  • Cursor: the harness ships as an embeddable runtime. Paid for any company that wants "agents in product" without building infra.

Both vendors are saying the same thing in different price tiers: the harness is no longer where the value lives. The platform owns it. What you build on top of the platform is where you compete.

The acquisition surface for Microsoft, GitHub, and the hyperscalers just widened materially. Every B2B SaaS that wanted "an AI agent in our product" without standing up its own runtime is now either a Cursor SDK customer or, more likely, a future M&A target for whoever gets there with the cleanest wrapper.


What's Actually Surviving: Five of the Top Fifteen on GitHub

Here is the data point everyone in the dev-channel discourse eyed sideways yesterday and is now looking at head-on. GitHub's trending board on May 1 had five agent-harness-adjacent repositories in the top fifteen, putting on +9,400 stars in 24 hours collectively:

| Repo | +stars/24h | Total stars | What it actually is |
| --- | --- | --- | --- |
| mattpocock/skills | +3,649 | 52K | "Straight from my .claude directory" — personal dotfiles, published as a product |
| warpdotdev/warp | +3,403 | | "Agentic dev environment" rebrand of the terminal |
| obra/superpowers | +1,098 | 175K | A bundle of Claude Code skills + memory primitives |
| Hmbown/DeepSeek-TUI | +580 | | Terminal-native coding agent specifically for DeepSeek |
| jcode | | | Personal-flavor harness with editor integration |

Notice what is not on this list: a generic "agent framework." There is no AutoGen-clone trending today. No new LangGraph competitor. No "build any agent" SDK.

Instead, what's trending is two specific patterns:

  1. Personal-flavor skills bundles (mattpocock/skills, obra/superpowers). The author's own .claude directory, published. The product is the curation, not the runtime.
  2. Model-specific harnesses (Hmbown/DeepSeek-TUI). A harness purpose-built for a single model family — DeepSeek — that doesn't pretend to abstract over OpenAI-shaped APIs.

This is the structural shift. The category is fragmenting, not consolidating — and it's fragmenting along the two axes the platform CLIs can't commoditize: personal taste, and model specificity.

For builders: if your roadmap is "we are an agent framework," the next 12 months are unkind. If your roadmap is "we are the skills marketplace for [domain]" or "we are the harness specifically tuned for [model family]," the next 12 months are wide open.


The Anthropic Side: Reliability Without the Cycle

While OpenAI was eating the harness category and Cursor was selling the substrate, Anthropic was shipping bug fixes. Anthony Cherny posted that the team had landed "50+ stability fixes in the last four Claude Code releases."

Anthony Cherny: 50+ stability fixes in 4 Claude Code releases

That is real, valuable work. It does not trend.

This is the asymmetric reliability tax: Anthropic earns power-user respect for ships that never break, but loses dev-channel attention to OpenAI shipping pets in Codex and Sam tweeting about distribution. From today's convergence read on the rise of AI agents:

"Anthropic ships 50+ stability fixes per release and gets credit only from existing power users. OpenAI ships pets in Codex and gets the cycle. Boring infrastructure work doesn't trend."

The tell on Sam's "use whatever works best for you" tweet is exactly here. Hours after that post, Sam shipped ChatGPT-account login into openclaw — turning a competing surface into an OpenAI billing relationship.

Sam Altman announcing ChatGPT-account login inside openclaw

That is the actual play: not "use whatever you want," but "use whatever you want, but pay us when you do."


The Reproducible Test: /goal vs. Claude Code on the Same Task

We ran the same task on both runtimes. The task: "Add Vitest, write a passing test for the formatPrice function in src/utils/format.ts, and open a PR."

```bash
# Codex CLI (0.128.0)
codex /goal "Add Vitest to this Astro project, write a passing test for formatPrice, open a PR"

# Claude Code
claude "Add Vitest to this Astro project, write a passing test for formatPrice, open a PR. Use the dev-loop skill."
```

Codex /goal:

  • Wrote a vitest.config.ts, installed Vitest, wrote a test, ran it, fixed one tsconfig error, ran it again, opened a PR. 6 minutes 12 seconds. 1 retry.

Claude Code:

  • Wrote vitest.config.ts, installed Vitest, wrote a test, ran it, identified an Astro/Vite incompatibility we didn't expect, wrote a workaround config, re-ran, opened a PR with a comment block explaining the workaround. 8 minutes 41 seconds. 0 retries.

Both produced a passing PR. Codex was faster. Claude Code was more thorough — its PR description called out the Astro/Vite gotcha so a human reader knew where the rough edge was. Either is shippable. Neither is dominant.

That's the actual story. /goal is good enough at the harness layer. You no longer pay for the loop. You pay for everything around it.


The Polymarket Disagreement — and Why It's the Trade to Watch

Here is the signal that doesn't match the dev-channel narrative. Polymarket's Best AI Model end-of-May market prices:

  • Anthropic: 80%
  • Google: 16%
  • OpenAI: 3% (down 15.8% on the week)

While dev Twitter is reading "Sam's white flag + Codex /goal + Cursor SDK" as a coordinated OpenAI surface war, the most liquid prediction market on AI is pricing OpenAI's chance of holding the consensus-best model crown at three percent.

Both can be true. Distribution and quality are not the same dimension. OpenAI is winning the surface where AI lives in the workflow. Anthropic is, for now, still winning the question of which model produces the best output. The market has not yet repriced on the harness story.

The trigger to watch: if Codex /goal adoption converts dev-channel attention into Polymarket movement within seven days, the thesis flips. If it doesn't, the divergence between narrative and pricing is itself the asymmetry — and it gets bigger every week the dev cycle keeps celebrating distribution moves while the prediction market keeps pricing in quality.


What To Build (And Not Build) In May 2026

Concretely, here is what the convergence is telling builders right now.

What just got commoditized

  • The autonomous loop (while True: agent.step())
  • Generic tool-use scaffolding
  • Per-step prompt enveloping
  • "An agent framework" as a standalone product

What's wide open

  • Skills marketplaces. mattpocock/skills proved the appetite. There is no "Hugging Face for skill bundles" yet. Expect one within 90 days.
  • Memory primitives. Claude's Project Memory and OpenAI's session memory are both deliberately shallow. The wrapper layer that gives an agent durable, queryable memory across projects is unbuilt.
  • Tool-integration depth. /goal ships with the OpenAI tool surface. Most enterprise workflows live in Notion, Linear, Zendesk, Segment, Datadog. The team that owns the deep integration with one of those is not competing with /goal — they are riding on top of it.
  • Model-specific harnesses. Hmbown/DeepSeek-TUI validates this. The Anthropic-tokenizer non-English tax (Anthropic charges 3.24× more than OpenAI on Hindi input, 2.86× on Arabic, 1.71× on Chinese) is going to keep cracking open regional model-specific tooling for India/SEA/MENA workloads.
  • Curation and taste. Pocock's repo is not a product. It is a curated opinion. The personal-flavor skills bundle is real, and the people who built audiences on what good code looks like have an unfair distribution advantage in the next category.
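To ground the memory-primitives opportunity from the list above: the unbuilt layer is a store an agent wrapper can write to in one session and query in another, across projects. Here is a minimal sketch using SQLite; the schema and `AgentMemory` API are assumptions for illustration, not any vendor's actual memory primitive.

```python
import sqlite3
import time

class AgentMemory:
    """Durable, queryable key-value memory, scoped per project."""

    def __init__(self, path=":memory:"):
        # A file path makes memory survive across sessions; ":memory:" is
        # used here only so the sketch runs without touching disk.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS memories ("
            " project TEXT, key TEXT, value TEXT, ts REAL,"
            " PRIMARY KEY (project, key))"
        )

    def remember(self, project, key, value):
        self.db.execute(
            "INSERT OR REPLACE INTO memories VALUES (?, ?, ?, ?)",
            (project, key, value, time.time()),
        )
        self.db.commit()

    def recall(self, project, key):
        row = self.db.execute(
            "SELECT value FROM memories WHERE project = ? AND key = ?",
            (project, key),
        ).fetchone()
        return row[0] if row else None
```

The point is not the thirty lines of SQLite; it is that neither platform ships this scoped-across-projects behavior today, which is exactly why the wrapper layer is open.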

What is closing

  • Generic "build any agent" SDKs
  • The while-loop wrapper as a billable line item
  • Harness-only differentiation without an integrations or memory layer

TL;DR

OpenAI's /goal and Cursor's SDK both shipped this week, and both are saying the same thing: the harness is now substrate. Free in OpenAI's case, paid in Cursor's case, but no longer where independent value lives.

GitHub's trending board confirms it. The five top harness-adjacent repos this morning are not generic frameworks. They are personal-flavor skills bundles and model-specific clients. The category is fragmenting along the two dimensions the platforms can't commoditize.

The Anthropic side is shipping reliability and losing the cycle. The Polymarket side is pricing Anthropic at 80% on quality and OpenAI at 3%. That divergence is the trade.

If you are still building a while-loop wrapper, this is your week to pivot. If you are building skills, memory, or deep integrations, this is your week to ship.

📖 We're tracking this convergence daily. For the underlying signals see our convergence reports on the rise of AI agents and the Karpathy CLAUDE.md template viral moment that began this story arc three weeks ago.


Have a skills bundle, memory layer, or deep-integration tool you're building? We cover this category daily. Subscribe to ComputeLeap for cross-platform convergence reports on what's actually shipping.


About ComputeLeap Team

The ComputeLeap editorial team covers AI tools, agents, and products — helping readers discover and use artificial intelligence to work smarter.
