Codex & Opus 4.7
Two releases in twenty-four hours. What knowledge workers should actually do about them.
Two drops. Twenty-four hours.
OpenAI shipped a major Codex update. Anthropic shipped Opus 4.7. These are not related. They don't cleanly map onto each other. And the implications for knowledge workers are different in each case.
The plan: take them one at a time. What's in each, what to actually do with it, and then — at the end — a step back on what this week tells us about the philosophy split between these two companies on what the agentic workspace should look like.
No forced through-line. No neat synthesis. Three separate parts with different implications and different conclusions.
What's new in Codex
- Computer Use on Mac. Codex can see, click, and type across any app with its own cursor. Multiple agents in parallel, in the background, without interfering with your work. This is the one everyone is talking about.
- In-app browser with comment mode. Load a page inside Codex, click directly on elements to give the agent context — screenshot plus the DOM element. Kills the "here's a screenshot, but actually the button two rows down" back-and-forth.
- Image generation. gpt-image-1.5 is now inside Codex. Mockups, variants, edits — all inside the same thread.
- 90+ new plugins. Slack, Gmail, Notion, HubSpot, Box, SharePoint, Jira, Microsoft Suite, Atlassian Rovo, CircleCI, GitLab. For most knowledge workers, your actual work surfaces are now covered.
- Automations that resume existing threads. The quiet but important one. Automations don't trigger a fresh prompt — they wake up the same thread with all its context intact.
- Projectless threads. No repo required. "It's the new notes app" — Jason Liu. Flavio Adamo had been using a project called "trashcan" for every random thought. This is the low-stakes capture surface.
- Memory preview. Preferences, corrections, reusable context across threads.
- Rich file previews, artifacts beyond code, remote SSH, GitHub review handling, multi-terminal tabs.
Mac only for Computer Use right now. Windows is coming. If your audience skews Windows, flag that upfront.
The monothread pattern
Most people will miss this because it doesn't look like a feature. It looks like a workflow.
The mental model shift
The old mental model of AI assistants: start fresh for every task. Every question is a new chat. Every project is a new conversation. This was forced on us — long threads used to degrade. Context got muddy. You were better off starting over.
The compaction improvements the Codex team shipped weaken that assumption. Nick's most useful thread right now is one he's been running for three weeks. Every hour it checks his Slack, Gmail, and PRs. It knows which messages he usually ignores. Which drafts he typically edits. Which sources actually matter.
You can't get that context from a fresh chat.
The chief-of-staff recipe
Based on Jason Liu's chief-of-staff gist. About fifteen minutes to build.
- Create the vault. Just a local folder — ~/vault — with projects/ and notes/ subfolders. The durable memory layer. Open it in Codex as the working folder.
- Add an AGENTS.md. Tells Codex how the vault works. Prefer updating existing notes. Use absolute dates. Keep facts separate from guesses. Don't turn it into a log of everything.
- Have Codex interview you. The magic step. One question at a time, conversationally. What are you responsible for? Who matters? What are you worried about missing?
- Install the plugins that matter. Think in capabilities: replies → Slack + Gmail. Memory → Drive + Docs. Meetings → Calendar. Execution → GitHub + Linear.
- Create project notes. One per workstream. What is this. Current status. Owners. Important links. Open loops. Last updated.
- The 15-minute heartbeat. Scans Slack, Gmail, calendar, docs. Looks for pending asks, blockers, decisions. Notices priority shifts. Keeps interviewing you over time. Updates notes quietly. Interrupts only when it matters. Sketched just after this list.
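To make the recipe concrete, here's a minimal sketch of the vault scaffold plus the heartbeat loop in Python. The folder layout and the AGENTS.md rules come straight from the steps above; the source scanning is stubbed, since the real plugin calls aren't something you can script against from a plain Python file. Treat it as a shape, not an implementation.

```python
# Minimal sketch of the vault scaffold plus a 15-minute heartbeat loop.
# The layout follows the recipe above; anything that would talk to
# Slack/Gmail/Calendar is stubbed out, because those plugin calls are
# hypothetical here, not a documented API.
import time
from pathlib import Path

VAULT = Path.home() / "vault"

def scaffold_vault() -> None:
    """Create the durable memory layer: a plain local folder."""
    for sub in ("projects", "notes"):
        (VAULT / sub).mkdir(parents=True, exist_ok=True)
    agents_md = VAULT / "AGENTS.md"
    if not agents_md.exists():
        agents_md.write_text(
            "# How this vault works\n"
            "- Prefer updating existing notes over creating new ones.\n"
            "- Use absolute dates.\n"
            "- Keep facts separate from guesses.\n"
            "- Don't turn this into a log of everything.\n"
        )

def scan_sources() -> list[dict]:
    """Stub: in a real setup the Slack/Gmail/Calendar plugins would
    surface pending asks, blockers, decisions, and priority shifts."""
    return []

def update_note(item: dict) -> None:
    """Quietly append to the relevant project note (one per workstream)."""
    note = VAULT / "projects" / f"{item['project']}.md"
    with note.open("a") as f:
        f.write(f"- {item['date']}: {item['summary']}\n")

def heartbeat() -> None:
    for item in scan_sources():
        update_note(item)  # quiet by default; interrupting is the exception

if __name__ == "__main__":
    scaffold_vault()
    while True:
        heartbeat()
        time.sleep(15 * 60)  # the 15-minute heartbeat
```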
The self-advocacy contract
Codex doesn't assume you know what it can do. If it notices a repeated pattern, it proposes a better workflow: "You keep asking for status → let me build a tracker." The recommendation is always concrete: "I can do X if you connect Y."
The notification policy
Only interrupt for: something blocking you, someone waiting on you, material status changes, decisions you should know about, opportunities you'd miss, or a new capability that clearly saves time. Everything else goes to the vault quietly.
Less "something changed," more "this changes what you should do." The line to land
Match the container to the work
Not everything should become a monothread. The skill is picking the right container.
The old model was binary — fresh chat or project. People forced everything into project-shape, even random one-offs. (Flavio's was literally called "trashcan.") The new model is a spectrum:
The things that should be monothreads — the recurring workstreams where context compounds — are the ones most people are currently treating as one-off conversations. That's where the biggest leverage is being left on the table.
Other use cases worth trying
Organized by work type so you can zero in on what matches your day.
- Morning brief. 7am heartbeat pulls Slack DMs, unread email, Notion updates, calendar. One written brief every day. The value compounds after two weeks. (Sketched after this list.)
- Weekly customer health. Gong + Intercom + Slack customer channels + NPS. Friday email of accounts that need attention.
- Monthly board pack. Stripe + HubSpot + Statsig + Slack #wins. Sheet and deck as artifacts.
- Hiring pipeline, compliance monitoring, project status rollups.
- Vendor contract review at scale. Drop 20 contracts from Box/SharePoint. Comparison spreadsheet of pricing, auto-renewals, termination terms, liability caps.
- Redlines against standard templates. Full memo of every deviation and whether it's material.
- Due diligence binders, RFP responses, policy drafting.
- Automatic meeting prep. 30-min pre-meeting heartbeat pulls attendee emails, shared docs, last meeting notes. One-page brief before you walk in. Worth the subscription alone.
- Post-meeting extraction. Transcript → action items → owners → project-note updates.
- 1:1 continuity. One thread per direct report. Tracks their PRs, Slack activity, shipped work. Pre-1:1 brief. Post-1:1 notes fed back in.
- Inbox triage with drafted replies. Codex drafts, you send. Over weeks the edit rate drops.
- Slack reply drafting for mentions. Same pattern. Especially useful for execs.
- "What happened while I was out" digests.
- Full DCF construction. Operating model + assumptions + sensitivity tables + board deck. One prompt.
- Variance analysis, comps, scenario planning, investment memos, spend analysis.
- Legacy system data entry. The old vendor portal. The ancient ERP. Accounting software from 2015. Computer Use drives it now.
- Moving data between systems that don't integrate. Granola → Obsidian is the canonical demo. Generalize: invoices to QuickBooks, leads to CRM, research to Notion.
- Screenshot-driven debugging. Specialized practice/case management software.
- Marketing: blog → LinkedIn + X + newsletter + image variants. Competitive monitoring. Testimonials. Image generation integrated.
- Support: ticket pattern analysis, KB gap detection, escalation routing, onboarding funnel monitoring.
- Coding: parallel agents on multiple tickets, GitHub review comments, full-stack loops with in-app browser, remote devboxes, long-running refactors, PR watching.
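Here's the morning-brief pattern from the top of this list as a runnable sketch. The fetch_* helpers are hypothetical stand-ins for the Slack, Gmail, and Calendar plugins; only the brief-writing part is real code.

```python
# Sketch of the 7am morning-brief heartbeat. The fetch_* functions are
# hypothetical stand-ins for plugin calls; only the file-writing is real.
import datetime
from pathlib import Path

def fetch_slack_dms() -> list[str]:
    return []  # stub: unread DMs via the Slack plugin

def fetch_unread_email() -> list[str]:
    return []  # stub: unread mail via the Gmail plugin

def fetch_calendar_today() -> list[str]:
    return []  # stub: today's events via the Calendar plugin

def write_brief(vault: Path) -> Path:
    today = datetime.date.today().isoformat()
    lines = [f"# Morning brief: {today}", ""]
    for title, items in [
        ("Slack DMs", fetch_slack_dms()),
        ("Unread email", fetch_unread_email()),
        ("Today's calendar", fetch_calendar_today()),
    ]:
        lines.append(f"## {title}")
        lines += [f"- {i}" for i in items] or ["- (nothing new)"]
        lines.append("")
    brief = vault / "notes" / f"brief-{today}.md"
    brief.parent.mkdir(parents=True, exist_ok=True)
    brief.write_text("\n".join(lines))
    return brief

write_brief(Path.home() / "vault")
```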
What Opus 4.7 actually is
The honest assessment
Opus 4.7 is not a giant capability leap. It's an execution-grade upgrade pointed at the seams where agentic workflows used to break. If you were expecting Anthropic to leapfrog the frontier on raw intelligence — that's not what this is. Mythos Preview is the more capable model and Anthropic is being deliberately careful about its release, particularly on cyber.
What Opus 4.7 is: the model that makes the kind of delegation Codex is built around actually work reliably. Less babysitting, more real delegation.
The effort-level ladder
Every tier of 4.7 steps up one notch from 4.6. Even though the new tokenizer uses up to 35% more tokens per input, overall token use is still down by up to 50% at equivalent quality levels because reasoning efficiency improved so much.
There's also a new xhigh effort tier between high and max. Claude Code now defaults to it. More on the prompting implications in two slides.
It tells a coherent story. You're paying the same list price, getting better results at every tier, and using fewer tokens to get there. That's a clean thesis. Individual benchmark numbers fight each other for attention and don't stick.
From the companies shipping with it
The benchmark numbers are one thing. The real-world data from companies running Claude in production is better.
Partner quotes: "long-horizon autonomy" ... "in PowerPoint."
The vending machine benchmark
The cleanest "this is execution, not chat" demo Anthropic published. Model is handed $500 and told to run a vending machine business for a simulated year.
On a separate 220-task benchmark spanning 44 occupations, Opus 4.7 beats the leading frontier model about 61% of the time.
Some regressions too?
MRCR v2 at 1M tokens — a widely-cited long-context retrieval benchmark — dropped from 78.3% (4.6) to 32.2% (4.7). That's a massive regression, and plenty of people pointed at it.
Anthropic's response, from Boris Cherny: MRCR is being phased out because it overweights "distractor-stacking tricks" and doesn't reflect real applied reasoning. Graphwalks is the preferred long-context metric going forward — and on that, Opus 4.7 went 38.7% → 58.6%.
The community is split on whether this is a legitimate benchmark philosophy shift or convenient retrofitting. Worth naming either way.
Several people noticed Opus 4.7 uses a different tokenizer than 4.6, which led to debate:
- A distilled version of Mythos?
- A new base model with a tokenizer swap?
- A capability-shaped sibling of Mythos where Anthropic deliberately held back cyber capabilities?
Anthropic's system card does mention "differentially reducing" cyber capabilities during training. No clean answer here — but if your audience is technical, worth flagging that Opus 4.7 is probably a capability-managed derivative of something stronger, not the true frontier.
LlamaIndex's ParseBench-style comparison showed the gains aren't uniform:
Opus 4.7 runs ~7¢/page for OCR-like use. LlamaIndex's agentic mode is ~1.25¢/page; cost-effective mode is ~0.4¢/page. For high-volume document extraction pipelines, specialized stacks still win on cost/performance. Useful reality check against the "universal upgrade" narrative.
New models, new prompts
Cat Wu's three points
Cat Wu leads the Claude Code team at Anthropic. Her launch-day guidance is the cleanest operational distillation of what's actually changed.
- Delegate, don't micromanage. Treat the model like a capable engineer you're handing a task to, not a pair programmer you're guiding line by line. The style of prompting that worked on 4.6 — progressive clarification across multiple turns — actually reduces quality on 4.7.
- Put the full goal, constraints, and acceptance criteria up front. Every user turn adds reasoning overhead now. Give the model everything it needs — intent, constraints, acceptance criteria, file locations, example of the voice or format you want — in turn one.
- Tell the model how to verify changes. Encode testing workflows in claude.md or skills. Opus 4.7 is better at self-verification than any prior Claude model — but only when you tell it how to verify. Build the verification loop in. The model will actually do it now.
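Those three points collapse into one prompt shape: the whole brief in turn one, verification included. A minimal sketch of that shape as a template; every field value below is illustrative, not a prescribed format.

```python
# Sketch of the "everything in turn one" prompt shape from Cat Wu's
# guidance. The template fields and values are illustrative, not official.
DELEGATION_PROMPT = """\
Goal: {goal}

Constraints:
{constraints}

Acceptance criteria:
{acceptance}

Relevant files: {files}

Voice/format example:
{example}

How to verify your work before reporting back:
{verification}
"""

prompt = DELEGATION_PROMPT.format(
    goal="Migrate the billing cron job from node-cron to a queued worker.",
    constraints="- No new dependencies beyond bullmq\n- Keep retry semantics",
    acceptance="- All existing billing tests pass\n- Job survives a restart",
    files="src/billing/cron.ts, src/queue/",
    example="Terse PR description, imperative mood.",
    verification="Run the billing test suite and paste the summary.",
)
print(prompt)
```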
Behavioral changes worth knowing
- Response length now calibrated to task complexity. Shorter on simple lookups, longer on open-ended. State your length/style preferences explicitly. Positive examples beat negative instructions.
- Calls tools less often, reasons more. Usually better. Spell out when you want aggressive tool use.
- Spawns fewer subagents. If you want parallel fanning across files or items, say so.
- Instructions are more literal. Prompts that worked because the model inferred your intent may now do exactly what you wrote.
Picking an effort tier
- low/medium: cost/latency-sensitive, tightly scoped. Still beats 4.6 at the same tier.
- high: balances intelligence and cost. Concurrent sessions.
- xhigh (new default): strong autonomy without runaway tokens. The recommended setting.
- max: diminishing returns, prone to overthinking. Use deliberately.
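If you're choosing tiers programmatically, the ladder reduces to a small decision function. A sketch under one assumption: you map task traits to tier names yourself. How the tier actually gets passed to the API is omitted, because any parameter name here would be a guess.

```python
# Sketch: mapping task shape to an effort tier, per the ladder above.
# Tier names come from the list; passing the tier to the API is omitted
# because the parameter name would be a guess.
def pick_effort(latency_sensitive: bool, concurrent: bool,
                wants_max: bool) -> str:
    if latency_sensitive:
        return "low"    # or "medium": tightly scoped, cost-sensitive work
    if concurrent:
        return "high"   # balances intelligence and cost across sessions
    if wants_max:
        return "max"    # diminishing returns; reach for it deliberately
    return "xhigh"      # the new default: autonomy without runaway tokens

assert pick_effort(False, False, False) == "xhigh"
```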
The tokenizer uses 1.0–1.35× as many tokens, and the model thinks more at higher effort on later turns. Two mitigations: Anthropic raised subscriber limits on Pro/Max to offset, and at equivalent quality total token use is often down by up to 50% — the per-prompt count is higher, but you do fewer turns to get the same result.
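The arithmetic behind "per-prompt up, total down," with made-up turn counts just to show the shape; only the 1.35× factor comes from the tokenizer note above.

```python
# Back-of-envelope: per-prompt tokens rise, total falls. The 1.35x factor
# is the stated upper bound; turn counts and per-turn tokens are
# illustrative only.
old_turns, new_turns = 10, 4              # fewer turns at equivalent quality
per_turn_old = 1_000
per_turn_new = int(per_turn_old * 1.35)   # tokenizer overhead

old_total = old_turns * per_turn_old      # 10,000
new_total = new_turns * per_turn_new      # 5,400
print(f"total: {old_total} -> {new_total} "
      f"({1 - new_total / old_total:.0%} fewer tokens)")  # 46% fewer
```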
What to try — vision
The vision improvements are the most immediately noticeable thing for non-coding work.
- Whiteboard photos from meetings "Extract the strategic framework from this whiteboard, identify the action items, and draft a summary for the team." — The 3× resolution means messy handwriting and dense diagrams actually work now.
- Dense dashboard screenshots "What's the story this Looker dashboard is telling? What should I be worried about? What's missing?"
- Photographed receipts for expense reports "Here are 40 receipts from the business trip. Categorize them, total by category, produce an expense report in this template."
- Handwritten notes, legal pads, workshop artifacts Transcribe and structure. Not just OCR — actual interpretation.
- Chart images pulled from PDFs / 10-Ks / research reports Extract the underlying data, explain what it shows, produce a clean version.
- UI/UX competitor screenshots "Here's our onboarding flow. Here's the competitor's. What are they doing better?"
- Photo-based inventory and identification Warehouse photos → catalog matching → flag what doesn't match.
Someone caught Opus 4.7 failing an Ishihara colorblind test — it recognized the plate for what it was but read 26 instead of 74. For highly specific perceptual tasks where the model can be confidently wrong, verify. And on high-volume OCR pipelines — see the LlamaIndex reality check from the last slide.
What to try — longer, harder tasks
Opus 4.6 was where you stayed close. Opus 4.7 is where you front-load context and walk away.
- End-to-end research projects Not "summarize this article." Instead: "Research the state of [topic]. Here are 15 URLs, our internal notes, and the Slack threads. Produce a 10-page strategic memo with a point of view, competitive landscape map, and three potential next moves."
- Full deliverable production Not "help me draft an intro paragraph." Instead: "Produce the full board update: deck, exec summary memo, and talking points for the Q&A."
- Multi-step analysis with verification Opus 4.7 now validates its own outputs. Lean into it: "Reconcile the revenue numbers across these four sources. Flag discrepancies. Then verify your reconciliation by re-deriving totals from first principles."
- Extended reasoning tasks Legal argument construction, investment thesis development, strategic option analysis. Things you used to break into pieces because the model would lose the thread.
- Complex data cleaning "Here's a messy CRM export. Clean, dedupe, standardize, enrich with this second file. Produce a clean CSV plus a report on what you changed and why."
- Cross-functional synthesis "Notes from engineering, sales, and customer success on top product issues. Where do they agree? Where do they conflict? What's the real priority?"
Finance, multi-session, languages
- Complete DCF models "Build a DCF for [company] using the last three 10-Ks. Full operating model, assumptions, sensitivity on WACC and terminal growth, board-ready PowerPoint."
- Comps, three-statement models, variance analysis The kind of work that used to require a lot of hand-holding. Try it with one prompt.
- Unit economics, debt schedules, fundraising models Plus investment memo drafting with appendix exhibits.
- Week-long research that builds on itself Monday scope → Tuesday sub-topic A → Wednesday sub-topic B → Thursday synthesize → Friday deliver. Context carries. No resetting.
- Evolving strategic plans, long writing projects Book chapters, dissertations, articles. Each session picks up with full context.
- Customer account management over time One file per important customer. Updated as new info comes in. Everything's there when you need it for renewal six months later.
Big jumps in lower-resource languages. Yoruba 71% → 83%. Igbo 70% → 81%. Chichewa 71% → 85%. For global teams, this is the release where non-English workflows become reliable.
- Multilingual customer support "Read these tickets in Portuguese, Turkish, Thai, Vietnamese. Cluster top issues. Translate themes to English. Draft responses in each original language."
- Global research, localization, cross-border deal work Not just translation — actual cultural adaptation.
Two bets on the agentic workspace
Anthropic: modal, organized by work type.
Chat · Cowork · Code. Three discrete modes. Each with its own sidebar, primitives, mental model.
You switch modes based on what kind of work you're doing. Each workflow gets its own home.
The bet: these workflows are different enough that collapsing them into one interface creates compromise. It's the native-apps thesis — you don't write documents in your email client.
OpenAI: unified, the agent routes. One input. Organized by project, not work type. Chats, automations, plugins, artifacts all live in the same surface.
You describe the work, and Codex decides whether this is code, a doc, a deck, a spreadsheet, an image, or an automation.
The bet: the agent is smart enough that the interface should disappear. Switching modes is friction. It's the "one text box, infinite capability" thesis.
Two tells worth pulling out
The Code tab in Claude Desktop has usage analytics. Streaks. Favorite models. A contribution-graph heatmap. "You've used about as many tokens as War and Peace." That's a craft tool aesthetic — signaling this is a discipline you build a practice around.
Codex has none of that. Codex looks like a task inbox. That's a tell about who each product thinks its user is.
A usage stat from the launch is direct evidence on whether OpenAI's unified-interface bet is paying off: developers opened Codex to code, then started using it for everything else because the interface didn't force them to leave.
Claude Desktop is a smaller, more opinionated power-user product. The UIs reflect the audiences. But the Codex number is also evidence that the unified bet is winning the expansion game.
Which is right?
Neither is obviously correct. This is Apple vs. Google.
Modal wins if...
The work types really are different enough that a unified UI forces awkward compromises. If Cowork's scheduling model is genuinely different from chat in ways that would clutter a single interface — separating them is right.
Also wins if users want to feel in control of what mode they're in for trust reasons.
Unified wins if...
The model is capable enough to correctly infer intent. Switching modes is friction, and real work crosses boundaries anyway — a "Cowork task" often starts as a "Chat question" that escalates.
Codex's projectless threads exist because that boundary was annoying people.
Developers with deep workflow muscle memory probably get more from modal. Generalists juggling many kinds of work probably get more from unified. If you're deciding where to invest your learning and muscle memory right now — you're not just picking a product. You're picking a philosophy about how your work will be organized for the next few years.