Subagents in Codex: Planning vs Execution
May 6, 2026
Most people use agents like a better chat box. They open Codex, paste a task, wait, and hope the model keeps the whole project in its head.
That works for small things. It breaks as soon as the work starts looking like real work.
The problem is not that the model is dumb. The problem is that one agent doing planning, architecture, implementation, review, testing, docs, repo hygiene, and GTM is bad org design. You would not hire one person and ask them to be CEO, CTO, engineer, reviewer, QA, growth operator, and trainer at the same time. But that is how most AI workflows are structured.
So I stopped treating Codex like one worker.
I set it up like a small company.

The actual setup
I checked my whole ~/.codex directory before writing this. It is not just a prompt file anymore. It is basically the operating system around how I work.
At the top level I have config, global instructions, personal agents, skills, automations, sessions, logs, browser config, generated images, caches, archived sessions, worktrees, and memory. The current folder has 5,183 files and 2,282 directories. The biggest thing by far is session history, around 2GB, which makes sense because every serious Codex run leaves a trail.
The custom subagents live here:
~/.codex/agents
+-- ceo-orchestrator.toml
+-- cmo-gtm.toml
+-- code-builder.toml
+-- code-reviewer.toml
+-- cto-architect.toml
+-- docs-researcher.toml
+-- gym-coach.toml
+-- qa-tester.toml
+-- workflow-operator.toml
That list is the point. I did not make nine agents because I wanted more names in a folder. I made nine because different kinds of work need different permissions, different failure modes, and different expectations.
Planning agents should not edit files. Review agents should not be tempted to fix the thing they are reviewing. Execution agents should not wander around the codebase inventing scope. Workflow agents should have power, but only when the task actually needs that power.
That is the split.
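The split above can be sketched as a role-to-permission table. The role and sandbox names mirror my agent files, but the Python here is purely illustrative, not Codex internals:

```python
# Illustrative sketch: roles mapped to sandbox modes and write rights.
# Role names mirror my ~/.codex/agents files; the dataclass is hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class Role:
    name: str
    sandbox: str          # "read-only", "workspace-write", "danger-full-access"
    may_edit_source: bool

ROLES = [
    Role("ceo_orchestrator", "read-only", False),   # plans, never mutates
    Role("cto_architect", "read-only", False),      # maps systems, never mutates
    Role("code_builder", "workspace-write", True),  # scoped edits only
    Role("code_reviewer", "read-only", False),      # critiques, cannot "fix"
    Role("qa_tester", "workspace-write", False),    # temp artifacts, not source
    Role("workflow_operator", "danger-full-access", True),  # narrow, explicit tasks
]

def writable_roles() -> list[str]:
    """Only two roles ever get to change source."""
    return [r.name for r in ROLES if r.may_edit_source]
```

Most of the table is read-only on purpose; the point is that write access is the exception, not the default.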
My parent Codex session is powerful on purpose
The parent session is allowed to do real work.
model = "gpt-5.5"
model_reasoning_effort = "medium"
approval_policy = "never"
sandbox_mode = "danger-full-access"
[agents]
max_threads = 8
max_depth = 1
This is aggressive. I know.
But the power is not the dangerous part. Unscoped power is the dangerous part.
My main Codex session can inspect repos, run builds, edit files, and operate across my machine because that is the actual job I need it to do. I do not want an assistant that stops every five seconds asking whether it can read a config file. I want it to execute.
The safety comes from how the work is routed. The parent session keeps the full context and makes the final decision. The subagents are specialized and constrained. They get a narrow task, they return a result, and the parent integrates it.
max_depth = 1 matters here. I do not want agents spawning agents spawning agents. That sounds powerful until you have a tree of workers making assumptions nobody owns. One layer is enough. The parent stays accountable.
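The effect of max_depth = 1 can be sketched as a guard in a hypothetical spawn function: the parent may delegate one layer down, and children are refused further spawns. This is an illustration of the idea, not how Codex actually implements it:

```python
# Illustrative sketch of depth-limited delegation (not Codex internals).
MAX_DEPTH = 1  # mirrors max_depth = 1 in my config

class Agent:
    def __init__(self, name: str, depth: int = 0):
        self.name = name
        self.depth = depth

    def spawn(self, child_name: str) -> "Agent":
        # Only the parent (depth 0) may delegate; subagents may not.
        if self.depth >= MAX_DEPTH:
            raise PermissionError(f"{self.name} may not spawn sub-subagents")
        return Agent(child_name, self.depth + 1)

parent = Agent("parent_session")
reviewer = parent.spawn("code_reviewer")  # allowed: one layer down
# reviewer.spawn("helper")  # would raise PermissionError: no second layer
```

One layer means every result flows back to an agent that holds the full context, so nobody's assumptions go unowned.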

The 10x part is real
This setup is not just cleaner. It made me ship faster than I ever have.
I do not mean a small speed boost. I mean the whole shape of output changed. In How I Actually Vibecode, I wrote that Ryva, egeuysal.com, brain, and ibx were all generated by AI and then reviewed by me. That is already a different mode of building. The code is real. The products work. The review still stays human.
Then the loop got sharper. In Six Hours to Never Lose Context Again, I shipped ibx, www, and brain into a production system in six hours. That post has the simplest version of the operating model: I capture intent, ibx structures it, agents act, www publishes it, brain feeds context back in. Nothing gets lost and nothing starts from zero.
The best number in that post is the outreach loop. Finding ICP leads and preparing context used to take about thirty minutes per run. After brain, it took four. That is 7.5x before even counting the extra leverage from subagents. Once the work is split across research, planning, execution, review, and QA, the bottleneck moves again. It stops being me manually carrying every task between tools.
The diary is the evidence that this was not a one-off. On March 21, I shipped brain as Ryva’s knowledge base, built an Apple Shortcut for article capture, ran brain through Ryva for priorities, and still published distribution posts. On March 24, I published a blog post, pushed LinkedIn, sent eight specific ICP replies, and set up assisted outreach workflows. On April 5, I published a blog on egeuysal.com, a Ryva blog, and a CyberMinds case study while continuing the second-run GTM loop. On May 3, even with meetings and school pressure, the system still helped push a Slack integration and keep project context moving.
That is what 10x means to me.
Not typing ten times faster. Not prompting harder. Not pretending every output is perfect.
It means the work stops coming back to me in one giant pile. The docs researcher can verify an API while the builder owns a patch. The QA agent can run checks while I read the diff. The reviewer can attack the change after the builder is done. The GTM agent can keep Ryva context grounded instead of making up generic startup advice. The parent session can make the final call instead of being the only brain in the room.
Before this, I was still fast, but speed depended on me holding everything. After this, the system holds more of it.
That is the founder unlock. Not AI as a writer. Not AI as autocomplete. AI as a tiny operating team that lets one person move like four without pretending there are no tradeoffs.
Planning is read-only
The biggest rule in my setup is simple: planning does not mutate.
My planning roles are read-only. The ceo_orchestrator is for ambiguous goals and routing. The cto_architect is for technical shape, data flow, APIs, auth boundaries, and implementation strategy.
The CEO agent looks like this:
name = "ceo_orchestrator"
description = "Read-only strategy and routing agent that decomposes ambiguous goals and recommends which specialized agents to spawn."
model = "gpt-5.5"
model_reasoning_effort = "medium"
sandbox_mode = "read-only"
The CTO agent is also read-only, but it thinks harder:
name = "cto_architect"
description = "Read-only technical architect for mapping systems, APIs, data flow, security risks, and implementation approach."
model = "gpt-5.5"
model_reasoning_effort = "high"
sandbox_mode = "read-only"
That distinction matters.
The CEO agent answers: what are we actually trying to do, what workstreams exist, what should be parallelized, and which agents should touch it.
The CTO agent answers: what are the real code paths, where does data move, what breaks if this changes, where are the auth boundaries, and what is the smallest implementation that does not create future pain.
Both are forbidden from editing files. That is intentional.
Planning agents are most useful when they cannot hide behind implementation. They have to think clearly. They have to return concrete guidance. They have to say where the risk is instead of papering over it with a patch.
Execution is scoped
The code_builder is the opposite. It exists to write code.
name = "code_builder"
description = "Implementation-focused engineering agent for scoped production code changes."
model = "gpt-5.5"
model_reasoning_effort = "medium"
sandbox_mode = "workspace-write"
But even the builder is not just told "go build the feature."
The instruction I care about most is this:
Own only the files and responsibility explicitly assigned by the parent agent.
You are not alone in the codebase: do not revert or overwrite changes made by others.
Make the smallest maintainable change that satisfies the request.
That is what makes subagents useful instead of chaotic.
If I assign the builder a file, it owns that file. If I assign it a behavior, it fixes that behavior. It does not get to refactor half the repo because it saw something ugly. It does not get to undo another agent’s change because its local context is behind. It does not get to clean up code that is unrelated to the task.
This is the same rule I would use with a human engineer. Scope is not bureaucracy. Scope is how you keep velocity from becoming damage.
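That ownership rule can be expressed as a simple guard: the builder gets an explicit allow-list of files from the parent, and anything outside it is rejected. The class and paths below are hypothetical Python for illustration, not the actual enforcement mechanism:

```python
# Illustrative sketch: a builder that may only touch assigned files.
from pathlib import PurePosixPath

class ScopedBuilder:
    """Hypothetical guard mirroring the 'own only assigned files' instruction."""
    def __init__(self, owned_files: list[str]):
        self.owned = {PurePosixPath(p) for p in owned_files}

    def can_edit(self, path: str) -> bool:
        return PurePosixPath(path) in self.owned

# Example paths are made up for illustration.
builder = ScopedBuilder(["src/auth/session.ts"])
builder.can_edit("src/auth/session.ts")   # True: explicitly assigned
builder.can_edit("src/utils/strings.ts")  # False: ugly or not, out of scope
```

The check is trivial; the discipline it encodes is the point.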
Review is separate from building
The reviewer is read-only and high reasoning.
name = "code_reviewer"
description = "Read-only reviewer focused on correctness, security, behavior regressions, maintainability, and missing tests."
model = "gpt-5.5"
model_reasoning_effort = "high"
sandbox_mode = "read-only"
That is non-negotiable for me.
If the same agent writes and reviews the code, the review is weaker. It already believes the patch makes sense because it just made it. It knows what it meant to do, which is exactly the wrong perspective. Review should come from outside the implementation frame.
The reviewer is told to lead with concrete findings ordered by severity. Correctness bugs, security risks, data loss, authorization mistakes, regressions, race conditions, missing tests. Not style takes. Not "this could be cleaner" unless the cleanliness hides a real maintenance risk.
That is the kind of review I want from AI. I do not need another formatter. I need something that asks whether this can leak private data, whether this route trusts client input, and what happens when this promise fails halfway through.
QA is allowed to make temporary mess
Testing is its own role because verification is not the same as review.
The qa_tester can run checks, create temporary artifacts, inspect logs, and reproduce behavior. It has workspace-write because tests and builds often need cache files, screenshots, generated output, or local artifacts.
name = "qa_tester"
description = "Quality agent for running tests, reproductions, and checks; may create temporary artifacts but should not edit source unless asked."
model = "gpt-5.5"
model_reasoning_effort = "medium"
sandbox_mode = "workspace-write"
That last part is the important part: it may create temporary artifacts, but it should not edit source unless explicitly assigned.
The QA agent is not there to be clever. It is there to verify reality. Run the command. Open the browser. Capture the screenshot. Reproduce the bug. Tell me exactly what passed, what failed, and what assumption is still untested.
In AI workflows, verification is where people lie to themselves the most. A green-looking UI is not proof that the backend write happened. A successful patch is not proof that auth still works. A passing unit test is not proof that the production path is covered.
I want a separate agent whose whole job is evidence.

Workflow gets the most power and the narrowest trust
The most dangerous agent in the setup is workflow_operator.
name = "workflow_operator"
description = "Trusted local workflow operator for repo hygiene, scripts, tooling, and operational automation."
model = "gpt-5.5"
model_reasoning_effort = "medium"
sandbox_mode = "danger-full-access"
This one can touch more of the machine. That is useful for repo hygiene, scripts, tooling setup, automations, and local maintenance.
It is also exactly why the instructions are stricter:
Never run destructive git or filesystem commands such as git reset --hard,
git checkout --, rm -rf, mass deletion, or credential changes unless the user
explicitly requested that exact operation.
Do not expose, copy, or persist secrets.
I do not treat full access like a badge. I treat it like a loaded operational surface.
The workflow agent is useful when the task is local and mechanical: inspect repo state, prepare scripts, wire up tooling, check logs, clean workflow noise. It is not the default. It is not the agent I send into product decisions. Power should match the task.
Docs and GTM are their own lanes
Two agents exist because not all work is code.
docs_researcher is for API uncertainty. It is read-only and uses a smaller model because the job is focused: verify current docs, local package versions, source behavior, and primary references.
name = "docs_researcher"
description = "Read-only documentation and API researcher that verifies current framework, library, and platform behavior."
model = "gpt-5.4-mini"
sandbox_mode = "read-only"
This prevents a common failure mode: the main agent guessing how a library works from memory. That is bad engineering. If the thing might have changed recently, go check.
cmo_gtm is for Ryva outreach and growth work. It is also read-only by default because outbound work can create real external consequences.
name = "cmo_gtm"
description = "GTM and outreach operator for Ryva-related execution, follow-ups, messaging, and growth workflows."
model = "gpt-5.5"
sandbox_mode = "read-only"
The GTM agent is forced into my actual Ryva workflow. Latest context first. Second-run loops. Delta-based messaging. Qualify or drop threads. No fabricating metrics, replies, or prior runs.
That matters because GTM agents are very good at sounding plausible. Plausible is not useful. I would rather get a few qualified leads than ten padded weak ones.
The routing rule lives in AGENTS.md
The agents existing is not enough. Codex needs to know when to use them.
So the routing preference lives in ~/.codex/AGENTS.md:
Subagent routing preference:
- For code review, spawn `code_reviewer` automatically and keep it read-only.
- For architecture or implementation planning, spawn `cto_architect` or
`ceo_orchestrator` when the task is broad or ambiguous.
- For scoped implementation, spawn `code_builder` only after the target files
or ownership are clear; use `qa_tester` for independent verification.
- For docs/API uncertainty, spawn `docs_researcher`.
- For repo hygiene, scripts, tooling, or local workflow operations, spawn
`workflow_operator` only with narrow scope and avoid destructive commands.
- For outreach, Ryva, or GTM work, use the `ryva-execution` skill and spawn
`cmo_gtm` when delegation is useful.
That is not magic auto-management. The parent agent still has to decide. But it turns the default from "one model does everything" into "route the work to the right role when it actually helps."
The last sentence in that file is probably the most important one:
Do not spawn agents just to appear thorough; use them when their result can
run in parallel or materially improves correctness, security, speed, or
verification.
That is the whole philosophy.
Subagents are not decoration. They are a way to reduce context collision.
How a real task moves through the system
For a broad feature, the flow looks like this:
parent Codex session
-> ceo_orchestrator if the goal is still fuzzy
-> cto_architect if the implementation shape is risky
-> code_builder once ownership is clear
-> qa_tester for independent verification
-> code_reviewer before merge if the change is meaningful
Not every task needs every step.
If I ask for a one-line copy change, just edit the file. If I ask for a production auth change, I want architecture first, implementation second, QA third, review fourth. The workflow expands with risk.
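That expands-with-risk flow can be sketched as choosing a pipeline from a risk level. The agent names are mine; the routing function and risk labels are a hypothetical illustration, not how Codex routes anything:

```python
# Illustrative routing: the pipeline grows with risk (not real Codex routing).
def pipeline(risk: str) -> list[str]:
    if risk == "trivial":          # e.g. a one-line copy change
        return ["parent_session"]  # just edit the file
    steps = ["parent_session"]
    if risk in ("ambiguous", "high"):
        steps.append("ceo_orchestrator")   # clarify the goal first
    if risk == "high":                     # e.g. a production auth change
        steps.append("cto_architect")      # architecture before code
    steps += ["code_builder", "qa_tester"]
    if risk == "high":
        steps.append("code_reviewer")      # independent review before merge
    return steps

pipeline("trivial")  # just the parent session, nothing else
```

A trivial task gets one actor; a high-risk one gets the full chain ending in independent review.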
That is where most people get agents wrong. They either over-delegate everything and create noise, or they under-delegate everything and leave speed on the table.
The right question is: what part of this task benefits from an independent mind?
Planning benefits from independence when the scope is unclear. Review benefits from independence because the reviewer should not be attached to the patch. QA benefits from independence because verification should not depend on the implementer’s confidence. Docs research benefits from independence because the main thread should not stall on API archaeology.
Execution only benefits from delegation when the target is clear.
That is the rule.
Why this works better than one giant prompt
You can put all of this into one giant system prompt. I basically used to do that.
But giant prompts blur responsibilities. The model has to remember that it is a planner, builder, reviewer, tester, security engineer, and operator all at once. It can do that for a while, but the roles compete. The builder wants to move. The reviewer wants to slow down. The workflow operator wants to run commands. The architect wants to map dependencies. The GTM agent wants to ship messaging. Those are different modes.
Separate agents make the mode explicit.
The other benefit is permissions.
Read-only is underrated. A read-only architect can think without accidentally changing files. A read-only reviewer can criticize without quietly fixing the evidence. A read-only GTM agent can draft without sending. That constraint makes the output cleaner.
Execution agents get write access because they need it. But they get it with scope.
This is not about making AI safer in an abstract way. It is about making the system harder to misuse when I am moving fast.
The lesson
The future of using AI to build is not write better prompts.
It is org design.
Who plans. Who executes. Who reviews. Who verifies. Who is allowed to touch files. Who is allowed to touch the whole machine. Who has to stay read-only. Who gets spawned only when the task is worth it.
That is the difference between an agent setup that feels cool and an agent setup that actually compounds.
I do not have employees. But I do have a CEO router, CTO architect, builder, reviewer, QA tester, workflow operator, GTM operator, docs researcher, and gym coach sitting inside Codex. Most of them cannot edit anything. A few can. One can do real local operations. The parent session owns the final call.
That is the structure that works for me right now, and I think this is where most solo builders are going next whether they notice it or not.
The people who win with AI will not be the people with the longest prompt. They will be the people who can design work. They will know what should be planned, what should be executed, what should be checked, what should be ignored, and what should never get write access in the first place.
That sounds boring. It is not. It is the difference between playing with agents and building an actual machine around yourself.
Small team. Clear roles. Sharp permissions. Evidence before trust. That is the whole game.