Codex & Opus 4.7
Two releases in twenty-four hours. What knowledge workers should actually do about them.
Two drops. Twenty-four hours.
OpenAI shipped a major Codex update. Anthropic shipped Opus 4.7. These are not related. They don't cleanly map onto each other. And the implications for knowledge workers are different in each case.
The plan: take them one at a time. What's in each, what to actually do with it, and then — at the end — a step back on what this week tells us about the philosophy split between these two companies on what the agentic workspace should look like.
No forced through-line. No neat synthesis. Three separate parts with different implications and different conclusions.
What's new in Codex
- Computer Use on Mac. Codex can see, click, and type across any app with its own cursor. Multiple agents in parallel, in the background, without interfering with your work. This is the one everyone is talking about.
- In-app browser with comment mode. Load a page inside Codex, click directly on elements to give the agent context — screenshot plus the DOM element. Kills the "here's a screenshot, but actually the button two rows down" back-and-forth.
- Image generation. gpt-image-1.5 is now inside Codex. Mockups, variants, edits — all inside the same thread.
- 90+ new plugins. Slack, Gmail, Notion, HubSpot, Box, SharePoint, Jira, Microsoft Suite, Atlassian Rovo, CircleCI, GitLab. For most knowledge workers, your actual work surfaces are now covered.
- Automations that resume existing threads. The quiet but important one. Automations don't trigger a fresh prompt — they wake up the same thread with all its context intact.
- Projectless threads. No repo required. "It's the new notes app" — Jason Liu. Flavio Adamo had been using a project called "trashcan" for every random thought. This is the low-stakes capture surface.
- Memory preview. Preferences, corrections, reusable context across threads.
- Rich file previews, artifacts beyond code, remote SSH, GitHub review handling, multi-terminal tabs.
Mac only for Computer Use right now. Windows is coming. If your audience skews Windows, flag that upfront.
The monothread pattern
Most people will miss this because it doesn't look like a feature. It looks like a workflow.
The mental model shift
The old mental model of AI assistants: start fresh for every task. Every question is a new chat. Every project is a new conversation. This was forced on us — long threads used to degrade. Context got muddy. You were better off starting over.
The compaction improvements the Codex team shipped weaken that assumption. Nick's most useful thread right now is one he's been running for three weeks. Every hour it checks his Slack, Gmail, and PRs. It knows which messages he usually ignores. Which drafts he typically edits. Which sources actually matter.
You can't get that context from a fresh chat.
The chief-of-staff recipe
Based on Jason Liu's chief-of-staff gist. About fifteen minutes to build.
- Create the vault. Just a local folder — ~/vault — with projects/ and notes/ subfolders. The durable memory layer. Open it in Codex as the working folder.
- Add an AGENTS.md. Tells Codex how the vault works. Prefer updating existing notes. Use absolute dates. Keep facts separate from guesses. Don't turn it into a log of everything.
- Have Codex interview you. The magic step. One question at a time, conversationally. What are you responsible for? Who matters? What are you worried about missing?
- Install the plugins that matter. Think in capabilities: replies → Slack + Gmail. Memory → Drive + Docs. Meetings → Calendar. Execution → GitHub + Linear.
- Create project notes. One per workstream. What is this. Current status. Owners. Important links. Open loops. Last updated.
- The 15-minute heartbeat. Scans Slack, Gmail, calendar, docs. Looks for pending asks, blockers, decisions. Notices priority shifts. Keeps interviewing you over time. Updates notes quietly. Interrupts only when it matters. Sketched just after this list.
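To make the recipe concrete, here's a minimal sketch of the vault scaffold plus the heartbeat loop in Python. The folder layout and the AGENTS.md rules come straight from the steps above; the source scanning is stubbed, since the real plugin calls aren't something you can script against from a plain Python file. Treat it as a shape, not an implementation.

```python
# Minimal sketch of the vault scaffold plus a 15-minute heartbeat loop.
# The layout follows the recipe above; anything that would talk to
# Slack/Gmail/Calendar is stubbed out, because those plugin calls are
# hypothetical here, not a documented API.
import time
from pathlib import Path

VAULT = Path.home() / "vault"

def scaffold_vault() -> None:
    """Create the durable memory layer: a plain local folder."""
    for sub in ("projects", "notes"):
        (VAULT / sub).mkdir(parents=True, exist_ok=True)
    agents_md = VAULT / "AGENTS.md"
    if not agents_md.exists():
        agents_md.write_text(
            "# How this vault works\n"
            "- Prefer updating existing notes over creating new ones.\n"
            "- Use absolute dates.\n"
            "- Keep facts separate from guesses.\n"
            "- Don't turn this into a log of everything.\n"
        )

def scan_sources() -> list[dict]:
    """Stub: in a real setup the Slack/Gmail/Calendar plugins would
    surface pending asks, blockers, decisions, and priority shifts."""
    return []

def update_note(item: dict) -> None:
    """Quietly append to the relevant project note (one per workstream)."""
    note = VAULT / "projects" / f"{item['project']}.md"
    with note.open("a") as f:
        f.write(f"- {item['date']}: {item['summary']}\n")

def heartbeat() -> None:
    for item in scan_sources():
        update_note(item)  # quiet by default; interrupting is the exception

if __name__ == "__main__":
    scaffold_vault()
    while True:
        heartbeat()
        time.sleep(15 * 60)  # the 15-minute heartbeat
```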
The self-advocacy contract
Codex doesn't assume you know what it can do. If it notices a repeated pattern, it proposes a better workflow: "You keep asking for status → let me build a tracker." The recommendation is always concrete: "I can do X if you connect Y."
The notification policy
Only interrupt for: something blocking you, someone waiting on you, material status changes, decisions you should know about, opportunities you'd miss, or a new capability that clearly saves time. Everything else goes to the vault quietly.
Less "something changed," more "this changes what you should do." The line to land
Match the container to the work
Not everything should become a monothread. The skill is picking the right container.
The old model was binary — fresh chat or project. People forced everything into project-shape, even random one-offs. (Flavio's was literally called "trashcan.") The new model is a spectrum:
The things that should be monothreads — the recurring workstreams where context compounds — are the ones most people are currently treating as one-off conversations. That's where the biggest leverage is being left on the table.
Other use cases worth trying
Organized by work type so you can zero in on what matches your day.
- Morning brief. 7am heartbeat pulls Slack DMs, unread email, Notion updates, calendar. One written brief every day. The value compounds after two weeks. (Sketched after this list.)
- Weekly customer health. Gong + Intercom + Slack customer channels + NPS. Friday email of accounts that need attention.
- Monthly board pack. Stripe + HubSpot + Statsig + Slack #wins. Sheet and deck as artifacts.
- Hiring pipeline, compliance monitoring, project status rollups.
- Vendor contract review at scale. Drop 20 contracts from Box/SharePoint. Comparison spreadsheet of pricing, auto-renewals, termination terms, liability caps.
- Redlines against standard templates. Full memo of every deviation and whether it's material.
- Due diligence binders, RFP responses, policy drafting.
- Automatic meeting prep. 30-min pre-meeting heartbeat pulls attendee emails, shared docs, last meeting notes. One-page brief before you walk in. Worth the subscription alone.
- Post-meeting extraction. Transcript → action items → owners → project-note updates.
- 1:1 continuity. One thread per direct report. Tracks their PRs, Slack activity, shipped work. Pre-1:1 brief. Post-1:1 notes fed back in.
- Inbox triage with drafted replies. Codex drafts, you send. Over weeks the edit rate drops.
- Slack reply drafting for mentions. Same pattern. Especially useful for execs.
- "What happened while I was out" digests.
- Full DCF construction. Operating model + assumptions + sensitivity tables + board deck. One prompt.
- Variance analysis, comps, scenario planning, investment memos, spend analysis.
- Legacy system data entry. The old vendor portal. The ancient ERP. Accounting software from 2015. Computer Use drives it now.
- Moving data between systems that don't integrate. Granola → Obsidian is the canonical demo. Generalize: invoices to QuickBooks, leads to CRM, research to Notion.
- Screenshot-driven debugging. Specialized practice/case management software.
- Marketing: blog → LinkedIn + X + newsletter + image variants. Competitive monitoring. Testimonials. Image generation integrated.
- Support: ticket pattern analysis, KB gap detection, escalation routing, onboarding funnel monitoring.
- Coding: parallel agents on multiple tickets, GitHub review comments, full-stack loops with in-app browser, remote devboxes, long-running refactors, PR watching.
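Here's the morning-brief pattern from the top of this list as a runnable sketch. The fetch_* helpers are hypothetical stand-ins for the Slack, Gmail, and Calendar plugins; only the brief-writing part is real code.

```python
# Sketch of the 7am morning-brief heartbeat. The fetch_* functions are
# hypothetical stand-ins for plugin calls; only the file-writing is real.
import datetime
from pathlib import Path

def fetch_slack_dms() -> list[str]:
    return []  # stub: unread DMs via the Slack plugin

def fetch_unread_email() -> list[str]:
    return []  # stub: unread mail via the Gmail plugin

def fetch_calendar_today() -> list[str]:
    return []  # stub: today's events via the Calendar plugin

def write_brief(vault: Path) -> Path:
    today = datetime.date.today().isoformat()
    lines = [f"# Morning brief: {today}", ""]
    for title, items in [
        ("Slack DMs", fetch_slack_dms()),
        ("Unread email", fetch_unread_email()),
        ("Today's calendar", fetch_calendar_today()),
    ]:
        lines.append(f"## {title}")
        lines += [f"- {i}" for i in items] or ["- (nothing new)"]
        lines.append("")
    brief = vault / "notes" / f"brief-{today}.md"
    brief.parent.mkdir(parents=True, exist_ok=True)
    brief.write_text("\n".join(lines))
    return brief

write_brief(Path.home() / "vault")
```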
What Opus 4.7 actually is
The honest assessment
Opus 4.7 is not a giant capability leap. It's an execution-grade upgrade pointed at the seams where agentic workflows used to break. If you were expecting Anthropic to leapfrog the frontier on raw intelligence — that's not what this is. Mythos Preview is the more capable model and Anthropic is being deliberately careful about its release, particularly on cyber.
What Opus 4.7 is: the model that makes the kind of delegation Codex is built around actually work reliably. Less babysitting, more real delegation.
The effort-level ladder
Every tier of 4.7 steps up one notch from 4.6. Even though the new tokenizer uses up to 35% more tokens per input, overall token use is still down by up to 50% at equivalent quality levels because reasoning efficiency improved so much.
There's also a new xhigh effort tier between high and max. Claude Code now defaults to it. More on the prompting implications in two slides.
It tells a coherent story. You're paying the same list price, getting better results at every tier, and using fewer tokens to get there. That's a clean thesis. Individual benchmark numbers fight each other for attention and don't stick.
From the companies shipping with it
The benchmark numbers are one thing. The real-world data from companies running Claude in production is better.
Partner quotes: "long-horizon autonomy" ... "in PowerPoint."
The vending machine benchmark
The cleanest "this is execution, not chat" demo Anthropic published. Model is handed $500 and told to run a vending machine business for a simulated year.
On a separate 220-task benchmark spanning 44 occupations, Opus 4.7 beats the leading frontier model about 61% of the time.
Some regressions too?
MRCR v2 at 1M tokens — a widely-cited long-context retrieval benchmark — dropped from 78.3% (4.6) to 32.2% (4.7). That's a massive regression, and plenty of people pointed at it.
Anthropic's response, from Boris Cherny: MRCR is being phased out because it overweights "distractor-stacking tricks" and doesn't reflect real applied reasoning. Graphwalks is the preferred long-context metric going forward — and on that, Opus 4.7 went 38.7% → 58.6%.
The community is split on whether this is a legitimate benchmark philosophy shift or convenient retrofitting. Worth naming either way.
Several people noticed Opus 4.7 uses a different tokenizer than 4.6, which led to debate:
- A distilled version of Mythos?
- A new base model with a tokenizer swap?
- A capability-shaped sibling of Mythos where Anthropic deliberately held back cyber capabilities?
Anthropic's system card does mention "differentially reducing" cyber capabilities during training. No clean answer here — but if your audience is technical, worth flagging that Opus 4.7 is probably a capability-managed derivative of something stronger, not the true frontier.
LlamaIndex's ParseBench-style comparison showed the gains aren't uniform:
Opus 4.7 runs ~7¢/page for OCR-like use. LlamaIndex's agentic mode is ~1.25¢/page; cost-effective mode is ~0.4¢/page. For high-volume document extraction pipelines, specialized stacks still win on cost/performance. Useful reality check against the "universal upgrade" narrative.
New models, new prompts
Cat Wu's three points
Cat Wu leads the Claude Code team at Anthropic. Her launch-day guidance is the cleanest operational distillation of what's actually changed.
- Delegate, don't micromanage. Treat the model like a capable engineer you're handing a task to, not a pair programmer you're guiding line by line. The style of prompting that worked on 4.6 — progressive clarification across multiple turns — actually reduces quality on 4.7.
- Put the full goal, constraints, and acceptance criteria up front. Every user turn adds reasoning overhead now. Give the model everything it needs — intent, constraints, acceptance criteria, file locations, example of the voice or format you want — in turn one.
- Tell the model how to verify changes. Encode testing workflows in claude.md or skills. Opus 4.7 is better at self-verification than any prior Claude model — but only when you tell it how to verify. Build the verification loop in. The model will actually do it now.
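Those three points collapse into one prompt shape: the whole brief in turn one, verification included. A minimal sketch of that shape as a template; every field value below is illustrative, not a prescribed format.

```python
# Sketch of the "everything in turn one" prompt shape from Cat Wu's
# guidance. The template fields and values are illustrative, not official.
DELEGATION_PROMPT = """\
Goal: {goal}

Constraints:
{constraints}

Acceptance criteria:
{acceptance}

Relevant files: {files}

Voice/format example:
{example}

How to verify your work before reporting back:
{verification}
"""

prompt = DELEGATION_PROMPT.format(
    goal="Migrate the billing cron job from node-cron to a queued worker.",
    constraints="- No new dependencies beyond bullmq\n- Keep retry semantics",
    acceptance="- All existing billing tests pass\n- Job survives a restart",
    files="src/billing/cron.ts, src/queue/",
    example="Terse PR description, imperative mood.",
    verification="Run the billing test suite and paste the summary.",
)
print(prompt)
```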
Behavioral changes worth knowing
- Response length now calibrated to task complexity. Shorter on simple lookups, longer on open-ended. State your length/style preferences explicitly. Positive examples beat negative instructions.
- Calls tools less often, reasons more. Usually better. Spell out when you want aggressive tool use.
- Spawns fewer subagents. If you want parallel fanning across files or items, say so.
- Instructions are more literal. Prompts that worked because the model inferred your intent may now do exactly what you wrote.
Picking an effort tier
- low/medium: cost/latency-sensitive, tightly scoped. Still beats 4.6 at the same tier.
- high: balances intelligence and cost. Concurrent sessions.
- xhigh (new default): strong autonomy without runaway tokens. The recommended setting.
- max: diminishing returns, prone to overthinking. Use deliberately.
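If you're choosing tiers programmatically, the ladder reduces to a small decision function. A sketch under one assumption: you map task traits to tier names yourself. How the tier actually gets passed to the API is omitted, because any parameter name here would be a guess.

```python
# Sketch: mapping task shape to an effort tier, per the ladder above.
# Tier names come from the list; passing the tier to the API is omitted
# because the parameter name would be a guess.
def pick_effort(latency_sensitive: bool, concurrent: bool,
                wants_max: bool) -> str:
    if latency_sensitive:
        return "low"    # or "medium": tightly scoped, cost-sensitive work
    if concurrent:
        return "high"   # balances intelligence and cost across sessions
    if wants_max:
        return "max"    # diminishing returns; reach for it deliberately
    return "xhigh"      # the new default: autonomy without runaway tokens

assert pick_effort(False, False, False) == "xhigh"
```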
The tokenizer uses 1.0–1.35× as many tokens, and the model thinks more at higher effort on later turns. Two mitigations: Anthropic raised subscriber limits on Pro/Max to offset, and at equivalent quality total token use is often down by up to 50% — the per-prompt count is higher, but you do fewer turns to get the same result.
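The arithmetic behind "per-prompt up, total down," with made-up turn counts just to show the shape; only the 1.35× factor comes from the tokenizer note above.

```python
# Back-of-envelope: per-prompt tokens rise, total falls. The 1.35x factor
# is the stated upper bound; turn counts and per-turn tokens are
# illustrative only.
old_turns, new_turns = 10, 4              # fewer turns at equivalent quality
per_turn_old = 1_000
per_turn_new = int(per_turn_old * 1.35)   # tokenizer overhead

old_total = old_turns * per_turn_old      # 10,000
new_total = new_turns * per_turn_new      # 5,400
print(f"total: {old_total} -> {new_total} "
      f"({1 - new_total / old_total:.0%} fewer tokens)")  # 46% fewer
```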
What to try — vision
The vision improvements are the most immediately noticeable thing for non-coding work.
- Whiteboard photos from meetings "Extract the strategic framework from this whiteboard, identify the action items, and draft a summary for the team." — The 3× resolution means messy handwriting and dense diagrams actually work now.
- Dense dashboard screenshots "What's the story this Looker dashboard is telling? What should I be worried about? What's missing?"
- Photographed receipts for expense reports "Here are 40 receipts from the business trip. Categorize them, total by category, produce an expense report in this template."
- Handwritten notes, legal pads, workshop artifacts Transcribe and structure. Not just OCR — actual interpretation.
- Chart images pulled from PDFs / 10-Ks / research reports Extract the underlying data, explain what it shows, produce a clean version.
- UI/UX competitor screenshots "Here's our onboarding flow. Here's the competitor's. What are they doing better?"
- Photo-based inventory and identification Warehouse photos → catalog matching → flag what doesn't match.
Someone caught Opus 4.7 failing an Ishihara colorblind test — it recognized the plate for what it was but read 26 instead of 74. For highly specific perceptual tasks where the model can be confidently wrong, verify. And on high-volume OCR pipelines — see the LlamaIndex reality check from the last slide.
What to try — longer, harder tasks
Opus 4.6 was where you stayed close. Opus 4.7 is where you front-load context and walk away.
- End-to-end research projects Not "summarize this article." Instead: "Research the state of [topic]. Here are 15 URLs, our internal notes, and the Slack threads. Produce a 10-page strategic memo with a point of view, competitive landscape map, and three potential next moves."
- Full deliverable production Not "help me draft an intro paragraph." Instead: "Produce the full board update: deck, exec summary memo, and talking points for the Q&A."
- Multi-step analysis with verification Opus 4.7 now validates its own outputs. Lean into it: "Reconcile the revenue numbers across these four sources. Flag discrepancies. Then verify your reconciliation by re-deriving totals from first principles."
- Extended reasoning tasks Legal argument construction, investment thesis development, strategic option analysis. Things you used to break into pieces because the model would lose the thread.
- Complex data cleaning "Here's a messy CRM export. Clean, dedupe, standardize, enrich with this second file. Produce a clean CSV plus a report on what you changed and why."
- Cross-functional synthesis "Notes from engineering, sales, and customer success on top product issues. Where do they agree? Where do they conflict? What's the real priority?"
Finance, multi-session, languages
- Complete DCF models "Build a DCF for [company] using the last three 10-Ks. Full operating model, assumptions, sensitivity on WACC and terminal growth, board-ready PowerPoint."
- Comps, three-statement models, variance analysis The kind of work that used to require a lot of hand-holding. Try it with one prompt.
- Unit economics, debt schedules, fundraising models Plus investment memo drafting with appendix exhibits.
- Week-long research that builds on itself Monday scope → Tuesday sub-topic A → Wednesday sub-topic B → Thursday synthesize → Friday deliver. Context carries. No resetting.
- Evolving strategic plans, long writing projects Book chapters, dissertations, articles. Each session picks up with full context.
- Customer account management over time One file per important customer. Updated as new info comes in. Everything's there when you need it for renewal six months later.
Big jumps in lower-resource languages. Yoruba 71% → 83%. Igbo 70% → 81%. Chichewa 71% → 85%. For global teams, this is the release where non-English workflows become reliable.
- Multilingual customer support "Read these tickets in Portuguese, Turkish, Thai, Vietnamese. Cluster top issues. Translate themes to English. Draft responses in each original language."
- Global research, localization, cross-border deal work Not just translation — actual cultural adaptation.
Two bets on the agentic workspace
Anthropic: modal, organized by work type.
Chat · Cowork · Code. Three discrete modes. Each with its own sidebar, primitives, mental model.
You switch modes based on what kind of work you're doing. Each workflow gets its own home.
The bet: these workflows are different enough that collapsing them into one interface creates compromise. It's the native-apps thesis — you don't write documents in your email client.
OpenAI: unified, the agent routes. One input. Organized by project, not work type. Chats, automations, plugins, artifacts all live in the same surface.
You describe the work, and Codex decides whether this is code, a doc, a deck, a spreadsheet, an image, or an automation.
The bet: the agent is smart enough that the interface should disappear. Switching modes is friction. It's the "one text box, infinite capability" thesis.
Two tells worth pulling out
The Code tab in Claude Desktop has usage analytics. Streaks. Favorite models. A contribution-graph heatmap. "You've used about as many tokens as War and Peace." That's a craft tool aesthetic — signaling this is a discipline you build a practice around.
Codex has none of that. Codex looks like a task inbox. That's a tell about who each product thinks its user is.
A usage stat from the launch is direct evidence on whether OpenAI's unified-interface bet is paying off: developers opened Codex to code, then started using it for everything else because the interface didn't force them to leave.
Claude Desktop is a smaller, more opinionated power-user product. The UIs reflect the audiences. But the Codex number is also evidence that the unified bet is winning the expansion game.
Which is right?
Neither is obviously correct. This is Apple vs. Google.
Modal wins if...
The work types really are different enough that a unified UI forces awkward compromises. If Cowork's scheduling model is genuinely different from chat in ways that would clutter a single interface — separating them is right.
Also wins if users want to feel in control of what mode they're in for trust reasons.
Unified wins if...
The model is capable enough to correctly infer intent. Switching modes is friction, and real work crosses boundaries anyway — a "Cowork task" often starts as a "Chat question" that escalates.
Codex's projectless threads exist because that boundary was annoying people.
Developers with deep workflow muscle memory probably get more from modal. Generalists juggling many kinds of work probably get more from unified. If you're deciding where to invest your learning and muscle memory right now — you're not just picking a product. You're picking a philosophy about how your work will be organized for the next few years.