The Worker Bench Is Not The CEO

GLM 5.2 repaired a live database problem for me last night.

That is the short version. The more useful version is this: a cheaper model handled a bounded infrastructure job well enough that it can start taking work off the expensive main model, but it is not yet good enough to run the whole system.

That distinction matters.

What happened

I have an agent layer I call Atlas. It sits above the rest of my little business system: content agents, outreach jobs, database checks, social jobs, approval packets, all of it.

Atlas is still running on Codex 5.5 as the main brain.

I have been testing Z.AI’s GLM models as workers underneath it. The goal is not to swap the CEO brain because a model looked good in a benchmark. The goal is more boring: find out which jobs can be moved to cheaper models without making the system less honest.

Last night GLM 5.2 got a real job.

A SQLite database that backs parts of the dashboard and event history was malformed. The symptoms looked like index corruption at first, but the actual problem was deeper: the page count recorded in the database header did not match the number of pages in the file, and one interior page was missing.
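A header-versus-file mismatch like that can be checked without opening the database through SQLite at all. The sketch below is only an illustration, not the check that was actually run: it reads the page size (bytes 16-17) and the in-header page count (bytes 28-31) from the documented SQLite file header and compares them with the file size. The function name is made up for this example. Very old SQLite versions may leave the in-header count at zero, and a database in WAL mode can legitimately lag, so a mismatch is a reason to look closer, not proof of corruption.

```python
import os
import struct

def header_page_mismatch(path):
    """Compare the page count stored in the SQLite header with the file's real size.

    Returns (header_pages, actual_pages, mismatch).
    """
    with open(path, "rb") as f:
        header = f.read(100)  # the SQLite file header is the first 100 bytes
    if len(header) < 100 or header[:16] != b"SQLite format 3\x00":
        raise ValueError("not a SQLite 3 database file")

    # Page size: 2-byte big-endian integer at offset 16; the value 1 means 65536.
    (page_size,) = struct.unpack(">H", header[16:18])
    if page_size == 1:
        page_size = 65536

    # In-header database size in pages: 4-byte big-endian integer at offset 28.
    (header_pages,) = struct.unpack(">I", header[28:32])

    # What the file on disk actually holds, counted in whole pages.
    actual_pages = os.path.getsize(path) // page_size
    return header_pages, actual_pages, header_pages != actual_pages
```

Calling `header_page_mismatch("events.db")` (a placeholder filename) returns both counts, so the size of the gap is visible rather than just a yes or no.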

The repair path had to be careful.

Back up the database. Inspect it. Try the small fix. Recognize when the small fix cannot repair a table tree. Dump and rebuild. Salvage what can be salvaged. Swap only after integrity checks pass. Then verify the read paths that were broken before.
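That sequence can be sketched with Python's standard sqlite3 module. This is a minimal illustration under stated assumptions, not the procedure GLM 5.2 actually ran: the "small fix" is assumed here to be REINDEX, the function names and the `.bak` / `.rebuilt` suffixes are invented, nothing else is assumed to be writing to the file, and the rebuild is a plain table-by-table copy that stops reading a table at the first error. It does not handle virtual tables or WAL sidecar files.

```python
import os
import shutil
import sqlite3

def is_healthy(path):
    """True only if quick_check and integrity_check each return a single 'ok' row."""
    con = sqlite3.connect(path)
    try:
        for pragma in ("quick_check", "integrity_check"):
            if con.execute(f"PRAGMA {pragma}").fetchall() != [("ok",)]:
                return False
        return True
    except sqlite3.DatabaseError:
        return False
    finally:
        con.close()

def try_reindex(path):
    """The assumed 'small fix': rebuild indexes in place. Errors are swallowed."""
    con = sqlite3.connect(path)
    try:
        con.execute("REINDEX")
        con.commit()
    except sqlite3.DatabaseError:
        pass
    finally:
        con.close()

def rebuild(src_path, dst_path):
    """Copy schema and readable rows into a fresh file.

    Returns the names of tables whose copy was cut short by a read or insert error.
    """
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    truncated = []
    try:
        objects = src.execute(
            "SELECT type, name, sql FROM sqlite_master "
            "WHERE sql IS NOT NULL AND name NOT LIKE 'sqlite_%'"
        ).fetchall()
        tables = [(name, sql) for kind, name, sql in objects if kind == "table"]
        others = [sql for kind, name, sql in objects if kind != "table"]
        for name, sql in tables:
            dst.execute(sql)
            try:
                for row in src.execute(f'SELECT * FROM "{name}"'):
                    marks = ",".join("?" * len(row))
                    dst.execute(f'INSERT INTO "{name}" VALUES ({marks})', row)
            except sqlite3.DatabaseError:
                truncated.append(name)  # report the loss instead of hiding it
        # Indexes, views, and triggers go in after the data, so triggers do not fire.
        for sql in others:
            dst.execute(sql)
        dst.commit()
    finally:
        src.close()
        dst.close()
    return truncated

def repair(db_path):
    backup, rebuilt = db_path + ".bak", db_path + ".rebuilt"
    shutil.copy2(db_path, backup)              # 1. back up before touching anything
    if is_healthy(db_path):                    # 2. inspect
        return {"action": "none", "backup": backup, "truncated": []}
    try_reindex(db_path)                       # 3. try the small fix
    if is_healthy(db_path):
        return {"action": "reindexed", "backup": backup, "truncated": []}
    if os.path.exists(rebuilt):
        os.remove(rebuilt)
    truncated = rebuild(db_path, rebuilt)      # 4. dump and rebuild, salvaging rows
    if not is_healthy(rebuilt):                # 5. swap only after checks pass
        raise RuntimeError("rebuilt database failed checks; original left in place")
    os.replace(rebuilt, db_path)
    return {"action": "rebuilt", "backup": backup, "truncated": truncated}
```

The design choice that matters is the order: the original file is never replaced until the rebuilt copy passes both checks, and anything lost shows up in the returned `truncated` list. Verifying the read paths that were broken is a separate step that depends on the application, so it is left out here.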

GLM 5.2 did that work. It reported one lost event row instead of hiding it. The final database passed both PRAGMA quick_check and PRAGMA integrity_check. The dashboard and event-history reads that had been failing came back.

That is a good worker result.

What it did not prove

It did not prove GLM 5.2 should run Atlas.

The main Atlas brain has a different job. It has to read my intent in Telegram, decide when to act, decide when to ask, keep approval boundaries straight, separate stale state from current state, and know when a public action is too risky.

That is not the same as “fix this database with these guardrails.”

The first GLM 5.2 primary-brain shadow sample was decent, but it also repeated one stale note from the evaluation doc. Small issue in a shadow run. Bigger issue if the model is making live system decisions.

So the current routing is simple:

  • Codex 5.5 stays the main Atlas brain.
  • GLM 5.2 gets hard bounded worker jobs, like infra repair and deterministic audits.
  • GLM 5.1 gets ordinary worker canaries, like status summaries and content QA.
  • The 4.x GLM models are on hold for now.
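Written down as configuration, the routing above might look something like this. It is a sketch, not Atlas's real config: the model identifier strings, the job-kind names, and the rule that unlisted jobs fall back to the main brain are all assumptions made for the example.

```python
# Hypothetical identifiers; the real model names in any given API may differ.
MAIN_BRAIN = "codex-5.5"

# Bounded worker jobs mapped to the cheaper model that currently handles them.
WORKER_ROUTES = {
    "infra_repair": "glm-5.2",         # hard bounded work
    "deterministic_audit": "glm-5.2",
    "status_summary": "glm-5.1",       # ordinary worker canaries
    "content_qa": "glm-5.1",
}

def route(job_kind: str) -> str:
    """Return the model for a job kind; anything unlisted stays with the main brain."""
    return WORKER_ROUTES.get(job_kind, MAIN_BRAIN)
```

Defaulting to the main brain means a new or misnamed job kind never lands on a worker model by accident. The 4.x models are on hold, so they do not appear in the table.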

That is the part I care about: not which model wins, but which model belongs where.

The useful pattern

A lot of AI agent talk treats model choice like a single throne.

Pick the best model. Make it the agent. Let it run.

I think that is the wrong frame.

The better frame is an operating stack.

The top model handles judgment. The cheaper models handle bounded work. The scripts handle anything deterministic. The database and ledgers keep receipts. The approval gates stop public damage.

If a cheaper model can take 30 percent of the bounded work without lying about verification, that is a win. It does not need to become the boss.

The mistake would be promoting a model from “good at a repair job” to “good at running the company” just because the repair was impressive.

What I am watching next

I am going to keep shadowing GLM 5.2 against Atlas decisions.

The bar is not vibes. It is samples that answer these questions:

  • Does it catch stale context?
  • Does it avoid fake verification?
  • Does it respect approval boundaries?
  • Does it keep the same tone and judgment in live Telegram context?
  • Does it know when not to act?
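One way to turn those questions into something countable is a per-sample scorecard. The sketch below is hypothetical: the field names are just the five questions restated, and the `min_samples` default of 50 is a placeholder, not a threshold taken from any real evaluation.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class ShadowSample:
    """One shadow decision, graded against the five questions."""
    caught_stale_context: bool
    avoided_fake_verification: bool
    respected_approval_boundaries: bool
    matched_tone_and_judgment: bool
    knew_when_not_to_act: bool

def failure_counts(samples):
    """Count, per question, how many samples failed it."""
    counts = {f.name: 0 for f in fields(ShadowSample)}
    for sample in samples:
        for name in counts:
            if not getattr(sample, name):
                counts[name] += 1
    return counts

def ready_for_promotion(samples, min_samples=50):
    """True only with enough samples and zero failures on every question."""
    if len(samples) < min_samples:
        return False
    return not any(failure_counts(samples).values())
```

Under this scheme, a sample that repeats a stale note would be recorded with `caught_stale_context=False`, and that single failure is enough to keep `ready_for_promotion` false no matter how many clean samples surround it.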

Until that is proven, Codex 5.5 stays in the CEO seat.

GLM gets a bench role. A useful one.

That might be the more practical version of model routing anyway. Not one model to run everything. A main brain, a worker bench, and enough boring verification to keep them honest.