I wanted GLM 5.2 to be the answer.
On paper, it had the exact shape I was looking for: cheaper than leaning harder on ChatGPT, strong enough for coding work, and seemingly friendlier to the kind of agentic AI system I am building with Hermes.
I was not looking for a toy model.
I was looking for something that could sit near the center of the machine. A model with enough horsepower to help Atlas think, code, audit, summarize, and operate without burning through the expensive primary model every time.
The dream version was simple:
GLM 5.2 as the cheaper engine. Hermes as the agentic operating layer. Atlas as the brain. Fable/Mythos-level power underneath, but practical enough to run real daily work.
That was the hope.
Then I checked the logs.
The stats
In the recent sample I reviewed, GLM through Z.AI produced:
- 180 successful Z.AI / GLM calls
- 24 primary GLM failures with HTTP 429
- 6 primary GLM failures with HTTP 401
- 6 auxiliary failures with HTTP 404
- 15 fallback activations, mostly from GLM to Grok 4.3
That is not a small enough failure rate for something I wanted near the center of the system.
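For a rough sense of scale, those counts can be turned into a rate. The sketch below assumes the counts are disjoint, meaning each failure is its own attempt and is not also counted among the 180 successes. The logs as quoted do not confirm that, so treat the percentage as approximate.

```python
# Counts copied from the log sample above.
successes = 180
failures = {"429": 24, "401": 6, "404": 6}

# Assumption: every failure is a separate attempt, not a retry that is
# also counted among the successes.
total_failures = sum(failures.values())
total_attempts = successes + total_failures
failure_rate = total_failures / total_attempts

print(f"{total_failures} failures / {total_attempts} attempts = {failure_rate:.1%}")
# prints: 36 failures / 216 attempts = 16.7%

for status, count in failures.items():
    print(f"HTTP {status}: {count / total_attempts:.1%} of attempts")
```

Under that assumption, roughly one call in six failed, and the 429s account for two thirds of the failures.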
The model did work. A lot.
But “works often” is not the same as “safe to build around.”
What actually broke
The failures were not all the same.
The 429s looked like provider capacity or route instability. The endpoint reported that the service was temporarily overloaded and said to try again later.
The 401s were different. Fresh tests worked, but an existing long-running session kept failing with “token expired or incorrect.” Since new requests succeeded, that points less toward a bad credential and more toward stale session state, client state, or the way we were calling the API through the running agent process.
The 404s were weirder. Helper tasks like title generation tried to use glm-5.2 and got errors saying the model did not exist or the account did not have access. The main route could work while smaller auxiliary calls failed.
That distinction matters.
This was not just “GLM is bad.”
It was a mix of provider instability, integration rough edges, and bad evaluation hygiene.
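Because the three status codes point at different causes, it helps to route each one to a different response instead of lumping them together as "GLM failed." The sketch below is one way to do that. The category names and suggested actions are my own labels for the failure modes described above, not anything from Hermes or the Z.AI API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureAction:
    category: str
    retry_same_route: bool
    note: str


# Hypothetical mapping based only on the three failure modes seen in the logs.
ACTIONS = {
    429: FailureAction("provider_overload", True,
                       "back off, then retry or fall back"),
    401: FailureAction("stale_credentials", False,
                       "rebuild the client or session before retrying"),
    404: FailureAction("model_unavailable", False,
                       "check the model name and account access for this call path"),
}


def classify_failure(status_code: int) -> FailureAction:
    """Return the handling hint for a status code, or a manual-review default."""
    return ACTIONS.get(
        status_code,
        FailureAction("unknown", False, "inspect manually"),
    )


for code in (429, 401, 404):
    action = classify_failure(code)
    print(code, action.category, "->", action.note)
```

Only the 429 is marked as worth retrying on the same route. The 401 and 404 would keep failing until something about the session or the configuration changes.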
The fallback problem
I had fallback enabled for a good reason.
If Atlas is answering me, writing a repair summary, or running an operational task, I do not want the whole thing to die because one provider is overloaded. So Hermes fell back from GLM 5.2 to Grok 4.3 when GLM failed.
That is good for live operations.
It is terrible for model evaluation.
If GLM fails and Grok finishes the job, the output can look successful unless you check the route logs. The human sees an answer. The system quietly switched brains.
That is dangerous if you are trying to decide whether a model is production-ready.
Fallback is insurance in production.
Fallback is contamination in evaluation.
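One way to keep both properties is to make fallback an explicit switch and to record which model actually served each answer. This is a minimal sketch, not how Hermes implements it: `ProviderError`, `RoutedResult`, `call_with_policy`, and the two stub providers are hypothetical, and "grok-4.3" is just a label here, not a confirmed API model id.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Provider = Tuple[str, Callable[[str], str]]  # (label, call function)


class ProviderError(Exception):
    """Hypothetical error raised when a provider call fails."""


@dataclass
class RoutedResult:
    text: str
    served_by: str   # the model that actually produced the text
    fell_back: bool  # True if the primary failed and the fallback answered


def call_with_policy(prompt: str, primary: Provider, fallback: Provider,
                     *, allow_fallback: bool) -> RoutedResult:
    primary_name, primary_fn = primary
    try:
        return RoutedResult(primary_fn(prompt), primary_name, False)
    except ProviderError:
        if not allow_fallback:
            raise  # evaluation mode: a primary failure stays a failure
        fallback_name, fallback_fn = fallback
        return RoutedResult(fallback_fn(prompt), fallback_name, True)


# Stub providers standing in for real API calls.
def flaky_glm(prompt: str) -> str:
    raise ProviderError("429: service temporarily overloaded")


def steady_grok(prompt: str) -> str:
    return f"answer to: {prompt}"


glm = ("glm-5.2", flaky_glm)
grok = ("grok-4.3", steady_grok)

# Live operations: the user still gets an answer, and the switch is visible.
live = call_with_policy("summarize the repair", glm, grok, allow_fallback=True)
print(live.served_by, live.fell_back)  # grok-4.3 True

# Evaluation: the same failure surfaces instead of being papered over.
try:
    call_with_policy("summarize the repair", glm, grok, allow_fallback=False)
except ProviderError as exc:
    print("evaluation run failed on primary:", exc)
```

The important part is `served_by`. If every result carries the name of the model that produced it, a fallback answer cannot be mistaken for a primary success when the scores are tallied.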
The decision
I unsubscribed from Z.AI.
Access should remain active until July 18, so I am not ripping GLM out immediately. I am demoting it.
GLM 5.2 can still be useful as a bounded compute and coding worker. It can draft code, write test scaffolds, suggest refactors, summarize logs, and act like a second engineer on low-risk tasks.
But it does not get to be Atlas.
It does not get to make strategy decisions. It does not get to own live Telegram judgment. It does not get final verification authority. It does not get to become a dependency in the core operating loop.
That job stays with ChatGPT / Codex for now.
The lesson
The painful part is that GLM 5.2 might still be powerful.
That is what makes this useful.
A model can be smart enough and still not be operationally trustworthy enough.
The question is not only:
“Can this model reason?”
The better question is:
“Can I depend on this model, through this provider, through this API path, inside a real agent system, without spending my day babysitting it?”
For GLM 5.2 on Z.AI, my current answer is no.
Not never.
Just not now.
So the new posture is simple:
- ChatGPT / Codex stays primary.
- GLM 5.2 becomes a temporary coder and compute worker.
- GLM runs only through isolated profiles and explicit wrappers.
- Fallback stays on for live operations.
- Fallback stays off for evaluation.
- If a model costs more supervision time than it saves, it gets demoted.
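The same posture can be written down as data so it is checked by code instead of remembered. This is only an illustrative sketch: the key names are hypothetical and are not Hermes configuration, and the minutes in the example are made up to show the direction of the demotion rule.

```python
# Hypothetical encoding of the posture above; key names are illustrative.
POSTURE = {
    "primary": "ChatGPT / Codex",
    "workers": {
        "GLM 5.2": {
            "role": "temporary coder and compute worker",
            "isolated_profile": True,
            "explicit_wrapper": True,
        },
    },
    "fallback_enabled": {"live": True, "evaluation": False},
}


def fallback_enabled(mode: str) -> bool:
    """Look up whether fallback is allowed for a run mode ('live' or 'evaluation')."""
    return POSTURE["fallback_enabled"][mode]


def should_demote(minutes_saved: float, minutes_supervising: float) -> bool:
    """Demote when supervising the model costs more time than it saves."""
    return minutes_supervising > minutes_saved


print(fallback_enabled("live"))        # True
print(fallback_enabled("evaluation"))  # False

# Made-up numbers, only to show how the rule behaves.
print(should_demote(minutes_saved=30, minutes_supervising=45))  # True
```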
I wanted the cheaper Fable/Mythos-level power source.
What I found was a model that may be strong, behind a route I do not trust enough yet.
That is still valuable information.
The logs made the decision cleaner than my hope did.