AI Daily Brief / How to use Opus 4.7 and the new Codex app

Codex & Opus 4.7

Two releases in twenty-four hours. What knowledge workers should actually do about it.

Part One
Codex — the update
Computer use, plugins, automations that stay alive. The monothread pattern is the big unlock.
Part Two
Opus 4.7 — the model
Not a capability leap. An execution-grade upgrade. And a reason to rewrite your prompts.
Part Three
The philosophy split
Anthropic is betting on modes. OpenAI is betting on one interface. Neither is obviously right.
Cold Open

Two drops. Twenty-four hours.

OpenAI shipped a major Codex update. Anthropic shipped Opus 4.7. These are not related. They don't cleanly map onto each other. And the implications for knowledge workers are different in each case.

The plan: take them one at a time. What's in each, what to actually do with it, and then — at the end — a step back on what this week tells us about the philosophy split between these two companies on what the agentic workspace should look like.

Framing note

No forced through-line. No neat synthesis. Three separate parts with different implications and different conclusions.

Part One

What's new in Codex

  • Computer Use on Mac. Codex can see, click, and type across any app with its own cursor. Multiple agents in parallel, in the background, without interfering with your work. This is the one everyone is talking about.
  • In-app browser with comment mode. Load a page inside Codex, click directly on elements to give the agent context — screenshot plus the DOM element. Kills the "here's a screenshot, but actually the button two rows down" back-and-forth.
  • Image generation. gpt-image-1.5 is now inside Codex. Mockups, variants, edits — all inside the same thread.
  • 90+ new plugins. Slack, Gmail, Notion, HubSpot, Box, SharePoint, Jira, Microsoft Suite, Atlassian Rovo, CircleCI, GitLab. For most knowledge workers, your actual work surfaces are now covered.
  • Automations that resume existing threads. The quiet but important one. Automations don't trigger a fresh prompt — they wake up the same thread with all its context intact.
  • Projectless threads. No repo required. "It's the new notes app" (Jason Liu). Flavio Adamo had been using a project called "trashcan" for every random thought. This is the low-stakes capture surface.
  • Memory preview. Preferences, corrections, reusable context across threads.
  • Rich file previews, artifacts beyond code, remote SSH, GitHub review handling, multi-terminal tabs.
Caveat

Mac only for Computer Use right now. Windows is coming. If your audience skews Windows, flag that upfront.

Part One

The monothread pattern

Most people will miss this because it doesn't look like a feature. It looks like a workflow.

My Codex threads are alive. I have become monothread-pilled. Nick Baumann · Codex team, OpenAI

The mental model shift

The old mental model of AI assistants: start fresh for every task. Every question is a new chat. Every project is a new conversation. This was forced on us — long threads used to degrade. Context got muddy. You were better off starting over.

The compaction improvements the Codex team shipped weaken that assumption. Nick's most useful thread right now is one he's been running for three weeks. Every hour it checks his Slack, Gmail, and PRs. It knows which messages he usually ignores. Which drafts he typically edits. Which sources actually matter.

You can't get that context from a fresh chat.

With good context compaction, a thread's value increases over time. Nick Baumann
Part One

The chief-of-staff recipe

Based on Jason Liu's chief-of-staff gist. About fifteen minutes to build.

  1. Create the vault. Just a local folder — ~/vault with projects/ and notes/ subfolders. The durable memory layer. Open it in Codex as the working folder.
  2. Add an AGENTS.md. Tells Codex how the vault works. Prefer updating existing notes. Use absolute dates. Keep facts separate from guesses. Don't turn it into a log of everything. (A scaffold sketch for steps 1 and 2 follows the list.)
  3. Have Codex interview you. The magic step. One question at a time, conversationally. What are you responsible for? Who matters? What are you worried about missing?
  4. Install the plugins that matter. Think in capabilities: replies → Slack + Gmail. Memory → Drive + Docs. Meetings → Calendar. Execution → GitHub + Linear.
  5. Create project notes. One per workstream. What is this. Current status. Owners. Important links. Open loops. Last updated.
  6. The 15-minute heartbeat. Scans Slack, Gmail, calendar, docs. Looks for pending asks, blockers, decisions. Notices priority shifts. Keeps interviewing you over time. Updates notes quietly. Interrupts only when it matters.
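A minimal sketch of steps 1 and 2, assuming the layout from the gist (~/vault with projects/ and notes/). The AGENTS.md starter text paraphrases the guidance in this section, including the notification policy described below; treat it as a template to edit, not canonical contents.

```python
from pathlib import Path

# Step 1: the vault is just a plain local folder that Codex opens as its
# working folder.
vault = Path.home() / "vault"
for sub in ("projects", "notes"):
    (vault / sub).mkdir(parents=True, exist_ok=True)

# Step 2: AGENTS.md tells Codex how the vault works. This starter text
# paraphrases the guidance in this section; edit it to fit your workstreams.
AGENTS_MD = """\
# How this vault works

- Prefer updating existing notes over creating new ones.
- Use absolute dates (2026-02-14), never "yesterday" or "last week".
- Keep confirmed facts separate from guesses, and label the guesses.
- This is curated memory, not a log of everything.

## Notification policy

Interrupt me only for: blockers, people waiting on me, material status
changes, decisions I should know about, or opportunities I'd miss.
Everything else goes into the vault quietly.
"""
(vault / "AGENTS.md").write_text(AGENTS_MD, encoding="utf-8")

print(f"Vault scaffolded at {vault}")
```

Steps 3 through 6 happen inside Codex itself: open the folder, let it interview you, then put the resulting thread on the heartbeat.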

The self-advocacy contract

Codex doesn't assume you know what it can do. If it notices a repeated pattern, it proposes a better workflow: "You keep asking for status → let me build a tracker." The recommendation is always concrete: "I can do X if you connect Y."

The notification policy

Only interrupt for: something blocking you, someone waiting on you, material status changes, decisions you should know about, opportunities you'd miss, or a new capability that clearly saves time. Everything else goes to the vault quietly.

Don't report messages. Connect dots.
Less "something changed," more "this changes what you should do." The line to land
Part One

Match the container to the work

Not everything should become a monothread. The skill is picking the right container.

The old model was binary — fresh chat or project. People forced everything into project-shape, even random one-offs. (Flavio's was literally called "trashcan.") The new model is a spectrum:

Low durability · Projectless thread
The notes app. Ephemeral. Random thought, one-off question, quick lookup you'll never return to.
Mid durability · Project
Traditional container. A body of work with related threads, defined scope, files and context that belong together.
High durability · Monothread
The chief of staff. Outlives individual tasks. Accumulates context over weeks or months around a recurring workstream.
The real upgrade

The things that should be monothreads — the recurring workstreams where context compounds — are the ones most people are currently treating as one-off conversations. That's where the biggest leverage is being left on the table.

Part One

Other use cases worth trying

Organized by work type so you can zero in on what matches your day.

  • Morning brief. 7am heartbeat pulls Slack DMs, unread email, Notion updates, calendar. One written brief every day. The value compounds after two weeks.
  • Weekly customer health. Gong + Intercom + Slack customer channels + NPS. Friday email of accounts that need attention.
  • Monthly board pack. Stripe + HubSpot + Statsig + Slack #wins. Sheet and deck as artifacts.
  • Hiring pipeline, compliance monitoring, project status rollups.
  • Vendor contract review at scale. Drop 20 contracts from Box/SharePoint. Comparison spreadsheet of pricing, auto-renewals, termination terms, liability caps.
  • Redlines against standard templates. Full memo of every deviation and whether it's material.
  • Due diligence binders, RFP responses, policy drafting.
  • Automatic meeting prep. 30-min pre-meeting heartbeat pulls attendee emails, shared docs, last meeting notes. One-page brief before you walk in. Worth the subscription alone.
  • Post-meeting extraction. Transcript → action items → owners → project-note updates.
  • 1:1 continuity. One thread per direct report. Tracks their PRs, Slack activity, shipped work. Pre-1:1 brief. Post-1:1 notes fed back in.
  • Inbox triage with drafted replies. Codex drafts, you send. Over weeks the edit rate drops.
  • Slack reply drafting for mentions. Same pattern. Especially useful for execs.
  • "What happened while I was out" digests.
  • Full DCF construction. Operating model + assumptions + sensitivity tables + board deck. One prompt.
  • Variance analysis, comps, scenario planning, investment memos, spend analysis.
  • Legacy system data entry. The old vendor portal. The ancient ERP. Accounting software from 2015. Computer Use drives it now.
  • Moving data between systems that don't integrate. Granola → Obsidian is the canonical demo. Generalize: invoices to QuickBooks, leads to CRM, research to Notion.
  • Screenshot-driven debugging. Specialized practice/case management software.
  • Marketing: blog → LinkedIn + X + newsletter + image variants. Competitive monitoring. Testimonials. Image generation integrated.
  • Support: ticket pattern analysis, KB gap detection, escalation routing, onboarding funnel monitoring.
  • Coding: parallel agents on multiple tickets, GitHub review comments, full-stack loops with in-app browser, remote devboxes, long-running refactors, PR watching.
Codex is now the place where the work happens, not a tool you consult. Part One conclusion
Part Two

What Opus 4.7 actually is

The honest assessment

Opus 4.7 is not a giant capability leap. It's an execution-grade upgrade pointed at the seams where agentic workflows used to break. If you were expecting Anthropic to leapfrog the frontier on raw intelligence, that's not what this is. Mythos Preview is the more capable model, and Anthropic is being deliberately careful about its release, particularly on cyber.

What Opus 4.7 is: the model that makes the kind of delegation Codex is built around actually work reliably. Less babysitting, more real delegation.

Users report being able to hand off their hardest work — the kind that previously needed close supervision — to Opus 4.7 with confidence. Anthropic launch announcement
This is not just chat. This is execution. The line that resonates
Part Two

The effort-level ladder

4.7-low is strictly better than 4.6-medium. 4.7-medium is strictly better than 4.6-high. 4.7-high is now better than 4.6-max. AINews framing of Anthropic's benchmark chart

Every tier of 4.7 steps up one notch from 4.6. Even though the new tokenizer uses up to 35% more tokens per input, overall token use is still down by up to 50% at equivalent quality levels because reasoning efficiency improved so much.

There's also a new xhigh effort tier between high and max. Claude Code now defaults to it. More on the prompting implications in two slides.

Why this beats any single benchmark

It tells a coherent story. You're paying the same list price, getting better results at every tier, and using fewer tokens to get there. That's a clean thesis. Individual benchmark numbers fight for attention with each other and don't stick.

OfficeQA
73.5% → 86.3%
Actual office tasks. Big jump.
OfficeQA Pro
57.1% → 80.6%
Even bigger jump on the harder tier.
Finance Agent
60.1% → 64.4%
Now state-of-the-art. Beats GPT-5.4 Pro and Gemini 3.1 Pro.
GDPval-AA
#1 · 1753 Elo
Third-party economically valuable knowledge work. ~60% win rate vs GPT-5.4.
Vals Index
67.7% → 71.4%
New #1. Also #1 on Vibe Code Bench, Vals Multimodal, Finance Agent, Mortgage Tax, SAGE.
ScreenSpot-Pro (with tools)
83.1% → 87.6%
UI element identification. Matters for computer-use agents.
CharXiv Reasoning
84.7% → 91.0%
Chart reasoning. Pairs with the vision improvements.
Vision resolution
Up to 2,576 px on long edge (~3.75 MP). More than 3× prior Claude models.
Part Two

From the companies shipping with it

The benchmark numbers are one thing. The real-world data from companies running Claude in production is better.

Notion · Internal evals
+14%
Lift on internal evals with one-third the tool errors. — Mike Krieger
Cursor · Internal benchmark
58% → 70%
Plus, across 500 teams, developers are tackling 68% more high-complexity tasks year over year.
Devin (Cognition)
"Optimized for
long-horizon autonomy"
"Unlocking investigations they couldn't reliably run before."
Rogo · Finance agent harness
Strong gains in PowerPoint
Artifact generation specifically. Now in their production harness.

The vending machine benchmark

The cleanest "this is execution, not chat" demo Anthropic published. Model is handed $500 and told to run a vending machine business for a simulated year.

Opus 4.6 · Final balance
$8,018
Opus 4.7 · Final balance
$10,937
The stat worth memorizing

On a separate 220-task benchmark spanning 44 occupations, Opus 4.7 beats the leading frontier model about 61% of the time.

Part Two

Some regressions too?

MRCR v2 at 1M tokens — a widely-cited long-context retrieval benchmark — dropped from 78.3% (4.6) to 32.2% (4.7). That's a massive regression, and plenty of people pointed at it.

Anthropic's response, from Boris Cherny: MRCR is being phased out because it overweights "distractor-stacking tricks" and doesn't reflect real applied reasoning. Graphwalks is the preferred long-context metric going forward — and on that, Opus 4.7 went 38.7% → 58.6%.

The community is split on whether this is a legitimate benchmark philosophy shift or convenient retrofitting. Worth naming either way.

Several people noticed Opus 4.7 uses a different tokenizer than 4.6, which led to debate:

  • A distilled version of Mythos?
  • A new base model with a tokenizer swap?
  • A capability-shaped sibling of Mythos where Anthropic deliberately held back cyber capabilities?

Anthropic's system card does mention "differentially reducing" cyber capabilities during training. No clean answer here — but if your audience is technical, worth flagging that Opus 4.7 is probably a capability-managed derivative of something stronger, not the true frontier.

LlamaIndex's ParseBench-style comparison showed the gains aren't uniform:

Charts
13.5% → 55.8%
Massive improvement.
Formatting
64.2% → 69.4%
Slight.
Tables
86.5% → 87.2%
Barely changed.
Layout
16.5% → 14.0%
Actually regressed.
Jerry Liu's pricing note

Opus 4.7 runs ~7¢/page for OCR-like use. LlamaIndex's agentic mode is ~1.25¢/page; cost-effective mode is ~0.4¢/page. For high-volume document extraction pipelines, specialized stacks still win on cost/performance. Useful reality check against the "universal upgrade" narrative.

Part Two

New models, new prompts

New models, new prompts. Drew Breunig

Cat Wu's three points

Cat Wu leads the Claude Code team at Anthropic. Her launch-day guidance is the cleanest operational distillation of what's actually changed.

  1. Delegate, don't micromanage. Treat the model like a capable engineer you're handing a task to, not a pair programmer you're guiding line by line. The style of prompting that worked on 4.6 — progressive clarification across multiple turns — actually reduces quality on 4.7.
  2. Put the full goal, constraints, and acceptance criteria up front. Every user turn adds reasoning overhead now. Give the model everything it needs (intent, constraints, acceptance criteria, file locations, an example of the voice or format you want) in turn one. A sample turn-one prompt follows the list.
  3. Tell the model how to verify changes. Encode testing workflows in claude.md or skills. Opus 4.7 is better at self-verification than any prior Claude model — but only when you tell it how to verify. Build the verification loop in. The model will actually do it now.
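Here is what a turn-one prompt that follows all three points might look like. It's a hypothetical sketch: the task, paths, and make target are invented placeholders, not anything from Anthropic's guidance.

```python
# Hypothetical turn-one delegation prompt: full goal, constraints, acceptance
# criteria, and an explicit verification loop, all stated before the model
# starts. Every specific (paths, the make target) is an invented placeholder.
TURN_ONE_PROMPT = """\
Goal: migrate the nightly report job from pandas to Polars.

Constraints:
- Only touch files under jobs/nightly_report/.
- Keep public function signatures unchanged.
- Match the existing style of that package.

Acceptance criteria:
- Outputs are byte-identical on the fixtures in tests/fixtures/.
- End-to-end runtime under 5 minutes on the sample data.

How to verify:
- Run `make test-nightly` after each change and report the summary.
- If the same test fails twice, stop and describe what you tried.
"""
```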

Behavioral changes worth knowing

  • low/medium: cost/latency-sensitive, tightly scoped. Still beats 4.6 at same tier.
  • high: balances intelligence and cost. Concurrent sessions.
  • xhigh (new default): strong autonomy without runaway tokens. The recommended setting.
  • max: diminishing returns, prone to overthinking. Use deliberately. (An API sketch for selecting a tier follows.)
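For API users, selecting a tier per request might look like the sketch below. Everything model-specific here is an assumption: the model id is a guess, and whether the effort parameter accepts "xhigh" over the API (rather than only inside Claude Code) is worth checking against Anthropic's current docs before you rely on it.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Assumptions: the model id is a guess, and "xhigh" as an API-accepted effort
# value is unverified; confirm both against Anthropic's current documentation.
response = client.messages.create(
    model="claude-opus-4-7",  # hypothetical model id
    max_tokens=4096,
    effort="xhigh",           # the tier Claude Code now defaults to
    messages=[{"role": "user", "content": "Review the plan below and argue against it."}],
)
print(response.content[0].text)
```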
The cost reality

The tokenizer produces 1.0–1.35× as many tokens as before, and the model thinks more at higher effort on later turns. Two mitigations: Anthropic raised subscriber limits on Pro and Max to offset it, and at equivalent quality total token use is often down by up to 50%. The per-prompt count is higher, but you need fewer turns to get the same result.

Part Two

What to try — vision

The vision improvements are the most immediately noticeable thing for non-coding work.

What not to rely on

Someone caught Opus 4.7 failing an Ishihara colorblind test: it identified the plate but read 26 instead of 74. For highly specific perceptual tasks where the model can be confidently wrong, verify. And on high-volume OCR pipelines, see the LlamaIndex reality check a few slides back.

Part Two

What to try — longer, harder tasks

Opus 4.6 was where you stayed close. Opus 4.7 is where you front-load context and walk away.

Less babysitting. More real delegation. Artem · @at56_
Part Two

Finance, multi-session, languages

  • Complete DCF models. "Build a DCF for [company] using the last three 10-Ks. Full operating model, assumptions, sensitivity on WACC and terminal growth, board-ready PowerPoint." (A toy sketch of the sensitivity math follows this list.)
  • Comps, three-statement models, variance analysis. The kind of work that used to require a lot of hand-holding. Try it with one prompt.
  • Unit economics, debt schedules, fundraising models. Plus investment memo drafting with appendix exhibits.
  • Week-long research that builds on itself. Monday scope → Tuesday sub-topic A → Wednesday sub-topic B → Thursday synthesize → Friday deliver. Context carries. No resetting.
  • Evolving strategic plans, long writing projects. Book chapters, dissertations, articles. Each session picks up with full context.
  • Customer account management over time. One file per important customer. Updated as new info comes in. Everything's there when you need it for renewal six months later.
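To make the "sensitivity on WACC and terminal growth" ask concrete, here is the toy version of the math you're delegating. The cash flows and ranges below are made up for illustration; the real prompt hands the model the 10-Ks and asks for the full operating model around it.

```python
# Toy DCF sensitivity grid: enterprise value across WACC and terminal-growth
# assumptions. The cash flows and ranges are made up for illustration.
fcf = [120.0, 135.0, 150.0, 168.0, 185.0]  # projected free cash flow, $M

def enterprise_value(fcf: list[float], wacc: float, g: float) -> float:
    # Present value of the explicit forecast period.
    pv = sum(cf / (1 + wacc) ** t for t, cf in enumerate(fcf, start=1))
    # Gordon-growth terminal value, discounted back from the final year.
    tv = fcf[-1] * (1 + g) / (wacc - g)
    return pv + tv / (1 + wacc) ** len(fcf)

waccs = [0.08, 0.09, 0.10, 0.11]
growths = [0.020, 0.025, 0.030]

print("WACC \\ g" + "".join(f"{g:>10.1%}" for g in growths))
for w in waccs:
    cells = "".join(f"{enterprise_value(fcf, w, g):>10,.0f}" for g in growths)
    print(f"{w:>8.1%}{cells}")
```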

Big jumps in lower-resource languages. Yoruba 71% → 83%. Igbo 70% → 81%. Chichewa 71% → 85%. For global teams, this is the release where non-English workflows become reliable.

  • Multilingual customer support. "Read these tickets in Portuguese, Turkish, Thai, Vietnamese. Cluster top issues. Translate themes to English. Draft responses in each original language."
  • Global research, localization, cross-border deal work. Not just translation: actual cultural adaptation.
Part Three

Two bets on the agentic workspace

Anthropic · Claude Desktop
Modal separation by work type.

Chat · Cowork · Code. Three discrete modes. Each with its own sidebar, primitives, mental model.

You switch modes based on what kind of work you're doing. Each workflow gets its own home.

The bet: these workflows are different enough that collapsing them into one interface creates compromise. It's the native-apps thesis — you don't write documents in your email client.

OpenAI · Codex
Unified interface. Agent routes.

One input. Organized by project, not work type. Chats, automations, plugins, artifacts all live in the same surface.

You describe the work, and Codex decides whether this is code, a doc, a deck, a spreadsheet, an image, or an automation.

The bet: the agent is smart enough that the interface should disappear. Switching modes is friction. It's the "one text box, infinite capability" thesis.

Two tells worth pulling out

The Code tab in Claude Desktop has usage analytics. Streaks. Favorite models. A contribution-graph heatmap. "You've used about as many tokens as War and Peace." That's a craft tool aesthetic — signaling this is a discipline you build a practice around.

Codex has none of that. Codex looks like a task inbox. That's a tell about who each product thinks its user is.

Codex · Weekly users
3M+
Of which · Non-coding usage
~50%

That stat came out of the launch. It's direct evidence on whether OpenAI's unified-interface bet is paying off: developers opened Codex to code, then started using it for everything else because the interface didn't force them to leave.

Claude Desktop is a smaller, more opinionated power-user product. The UIs reflect the audiences. But the Codex number is also evidence that the unified bet is winning the expansion game.

Part Three

Which is right?

Neither is obviously correct. This is Apple vs. Google.

Modal wins if...

The work types really are different enough that a unified UI forces awkward compromises. If Cowork's scheduling model is genuinely different from chat in ways that would clutter a single interface — separating them is right.

Also wins if users want to feel in control of what mode they're in for trust reasons.

Unified wins if...

The model is capable enough to correctly infer intent. Switching modes is friction, and real work crosses boundaries anyway — a "Cowork task" often starts as a "Chat question" that escalates.

Codex's projectless threads exist because that boundary was annoying people.

Three best-in-class tools under one roof vs. one tool that tries to be everything. The framing
The answer probably differs by user type

Developers with deep workflow muscle memory probably get more from modal. Generalists juggling many kinds of work probably get more from unified. If you're deciding where to invest your learning and muscle memory right now — you're not just picking a product. You're picking a philosophy about how your work will be organized for the next few years.

Close

Eleven things to try

Codex
Build one monothread and put it on a heartbeat.
Pick your noisiest recurring workstream. Follow the jxnl gist — vault, AGENTS.md, interview, heartbeat. Connect Slack and Gmail at minimum. Let it run for a week without starting over. Most people quit before the thread has enough context to be useful. Don't.
Codex
Give Codex your inbox for a day.
Gmail plugin on, Computer Use on, tell it to draft replies to everything that needs one while you work on other things. At end of day, look at what it drafted. That tells you in one day what a month of speculation can't.
Codex
Take the thing you hate dealing with and hand it to Codex.
The expense system. The timesheet. The CRM data entry. The thing you put off every Friday. Computer Use exists for exactly this — and everyone will default to using it on cool demos instead of the actual annoying parts of their job.
Codex
Run the weekly meeting you dread preparing for.
Pick the recurring meeting where you spend an hour doing prep. Spin up a thread, connect the relevant tools, let it prep for you for three weeks running. Decide if the meeting prep work has permanently left your plate.
Opus 4.7
Rebuild last quarter's biggest deliverable from source data.
The spreadsheet or deck you spent the most time on. Not "help me with it" — produce the whole thing end to end, given the same inputs you had. Compare. This is the honest test of whether real delegation works for your work.
Opus 4.7
Run a six-month investigation you never got to.
Some market, some technology, some competitor. Give Opus 4.7 a projectless thread, dump what you already know, and have it run a multi-day investigation you return to each day. See what it's like to have a research assistant that actually remembers.
Opus 4.7
Hand it something you'd hire a contractor for.
Not a toy task. A real piece of work you'd pay someone $500–2000 to do. See whether the ceiling on delegation has actually moved for you. This is the aspirational stretch — and the one that tells you whether the "real delegation" thesis is true for your work specifically.
Opus 4.7
Photograph a week's worth of whiteboards.
Whiteboards, notebook pages, receipts, business cards. Make Opus 4.7 turn them into structured data you actually use. The vision upgrade is the thing people will under-use because no one has a habit for it yet.
Opus 4.7
Run an adversarial session against your current plan.
Give Opus 4.7 your current strategic plan, roadmap, or big decision. Tell it to argue against it with maximum intellectual honesty. The instruction-following and self-verification upgrades make the "disagree with me seriously" prompt land differently than it did on 4.6 — this is the model that will push back instead of agreeing.
Opus 4.7
Point it at your past work and have it evaluate you.
Last year's OKRs vs. what you actually shipped. The strategy memo from 18 months ago vs. how things played out. The investment thesis vs. the outcome. Almost no one does this, and it's meaningful because the model is now good enough to be honest about it.
Philosophy
Audit Codex's own first week of working for you.
Build something, let it run, and at the end have it run a retrospective: what did it get right, what did it get wrong, what should it remember, and what should the AGENTS.md say next week. The agent that improves itself is the real unlock, and no one is using it that way yet.