Week 12: The Same Agent Failed for the Second Day in a Row. My System Filed a Report.

Yesterday the strategic-orchestrator ran its content-review job and hit a missing API endpoint. It logged the failure to the event bus. That was the right thing to do.

Today it ran the same job. It hit the same endpoint. It logged the same failure. That is when logging stops being enough.

What Got Built

46 outreach drafts composed in two batches. 11 employer initial-contact drafts and 35 partner-channel drafts targeting CPAs and independent brokers. The queue keeps growing. The send window is still pending.
3 content pieces drafted across two verticals. A CFO-focused ROI guide for the WIMPER partner channel, a FICA savings breakdown for CFOs, and a life insurance article targeting new parents. All three are staged in the content queue.
10 new prospects added, 1 disqualified. A mix of brokers, CPAs, and employers across multiple states. Ryan Ferris was flagged as disqualified: he had left the company his contact record pointed to.
25 prospects moved up the pipeline. 5 advanced from researched to qualified. 20 moved from qualified to outreach-ready. Two full pipeline cycles ran without manual intervention.
Morning briefing delivered. Numbers as of this morning: 1,307 total prospects in the database, 648 outreach-ready, 44 drafts awaiting review.

What Broke (And How I Fixed It)

Same failure. Second day. No escalation. Not fixed.

The strategic-orchestrator has a job called content-review. It is supposed to read drafts out of the content queue, evaluate them, and flag anything that needs attention.

Yesterday I wrote about this: the dashboard API bridge exposes an endpoint that counts records in the content queue, but no endpoint that reads individual records. The orchestrator knows drafts exist. It just cannot read them.

The underlying problem has not been fixed yet. The read endpoint still needs to be added to the bridge.

But today surfaced a second, separate gap: nothing escalated. Same job. Same missing infrastructure. Same event bus log. No Telegram message. No alert of any kind.

When an agent fails on an infrastructure gap and emits an event tagged “content.review.blocked,” that is useful data. When that same agent fails again the next day with the same error and emits the same event, that is a different signal. It is not random. It is structural. It is not resolving itself. It needs a human.

The rule I need to add: when the same event type fires on the bus two or more consecutive days with “infrastructure_missing” as the cause, hermes-brain or pipeline-health should convert it into a Telegram message instead of another log entry.
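A minimal sketch of that rule, assuming each bus event carries an event type, a cause, and the calendar day it fired. The BusEvent class, its field names, and both function names are hypothetical; the real bus schema is not shown in this post.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class BusEvent:
    """One event-bus record. Field names are assumed for illustration."""
    event_type: str  # e.g. "content.review.blocked"
    cause: str       # e.g. "infrastructure_missing"
    day: date        # calendar day the event fired


def consecutive_days(events, event_type, cause, today):
    """Count the days in a row, ending today, on which the event fired with this cause."""
    days_hit = {
        e.day for e in events
        if e.event_type == event_type and e.cause == cause
    }
    streak = 0
    current = today
    while current in days_hit:
        streak += 1
        current -= timedelta(days=1)
    return streak


def should_escalate(events, event_type, today,
                    cause="infrastructure_missing", min_days=2):
    """True when the same event has fired on min_days or more consecutive days."""
    return consecutive_days(events, event_type, cause, today) >= min_days
```

A streak that ends yesterday returns zero, so the rule stops firing on the first day the job runs clean.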

Two other jobs had three transient failures between them today (health-bridge twice, daily-research once). All three retried within the same window and succeeded. Those do not need this rule. A transient failure that self-resolves is normal system behavior. An identical structural failure two days running is not.
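One way to separate the two cases, sketched under the assumption that every run in a window is recorded with a job name, a finish time, and a success flag. JobRun and unresolved_failures are hypothetical names, not part of the existing system.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class JobRun:
    """One job execution. Field names are assumed for illustration."""
    job: str
    finished_at: datetime
    ok: bool


def unresolved_failures(runs):
    """Return the jobs whose most recent run in the window was a failure.

    A job that failed and then succeeded on retry is treated as transient
    and left out of the result.
    """
    latest = {}
    for run in sorted(runs, key=lambda r: r.finished_at):
        latest[run.job] = run  # later runs overwrite earlier ones
    return sorted(job for job, run in latest.items() if not run.ok)
```

Only jobs that end the window in a failed state would be passed on to the escalation check. A retry that succeeds never reaches it.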

The Lesson

There is a difference between an event log and an escalation. Build the escalation path.

An event bus is good at recording what happened. It is not good at deciding whether a human needs to know. Those are two different jobs.

When I designed the event bus, I assumed something upstream would watch for patterns and act on them. That assumption was wrong. The bus logs faithfully. Nothing upstream watches for repeating structural failures.

Here is what I would tell someone building agents: define escalation conditions before you need them. Not just “log this event” but “if this event fires twice in 48 hours with the same cause, send me a message.” That rule is simple. Writing it after the failure is annoying but fast. Discovering the gap after 41 days of silence, as happened with the work-logger last week, is expensive.
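Written down as data instead of intent, the "twice in 48 hours" condition might look like the sketch below. EscalationRule and its fields are hypothetical, and the tuple layout of an occurrence is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class EscalationRule:
    """Declarative escalation condition. All names here are illustrative."""
    event_type: str
    cause: str
    min_count: int = 2
    window: timedelta = timedelta(hours=48)

    def matches(self, occurrences, now):
        """occurrences: iterable of (event_type, cause, fired_at) tuples."""
        cutoff = now - self.window
        hits = [
            fired_at
            for event_type, cause, fired_at in occurrences
            if event_type == self.event_type
            and cause == self.cause
            and cutoff <= fired_at <= now
        ]
        return len(hits) >= self.min_count
```

Keeping the threshold and window as fields means a new condition is one more rule object, not another branch of logic.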

An agent that reports a blocked status is doing its job. An infrastructure layer that ignores the report is not.

The strategic-orchestrator did everything correctly when it ran content-review. It tried. It hit a wall. It logged the failure. That is exactly what an agent should do.

The gap is one layer up. Nothing was configured to read those logs and ask: is this the second time? Is this the same cause? Should I tell Matt?

Here is what I would tell someone designing multi-agent systems: the event bus is a shared ledger, not a notification system. If you want a human to know about something, you need a monitor that reads the ledger and decides when to escalate. That monitor has to be built explicitly. It does not emerge on its own.
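A rough sketch of such a monitor, with the ledger reader and the message sender passed in as plain callables so that no real bus or Telegram API is assumed. The name run_monitor_pass and the tuple layout are hypothetical. The sketch also assumes read_ledger returns only a recent window of events, and it counts distinct days, not strictly consecutive ones.

```python
from collections import defaultdict


def run_monitor_pass(read_ledger, notify, already_sent, min_days=2):
    """One pass over the ledger: group failures and escalate each repeat once.

    read_ledger  -- callable returning (event_type, cause, day) tuples
    notify       -- callable taking a message string (e.g. a Telegram sender)
    already_sent -- set of (event_type, cause) keys escalated on earlier passes
    """
    days_seen = defaultdict(set)
    for event_type, cause, day in read_ledger():
        days_seen[(event_type, cause)].add(day)

    for key, days in sorted(days_seen.items()):
        if len(days) >= min_days and key not in already_sent:
            event_type, cause = key
            notify(f"{event_type} fired on {len(days)} separate days (cause: {cause})")
            already_sent.add(key)  # avoid re-sending the same alert every pass
```

The already_sent set is what keeps a standing failure from turning into a message on every pass. It would need to be persisted between runs in a real deployment.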

The Numbers

Commits: 85 total (85 agent, 0 Matt)
Agent jobs run: 85
Prospects added: 10
Emails sent: 0
Social posts: 2
Content published: 0

85 commits, zero from me. The system ran every scheduled job independently today. The number I am focused on is not the commit count. It is the two consecutive “content.review.blocked” events that sat on the bus without triggering a single Telegram message.

The pipeline added 10 prospects and composed 46 drafts. The send queue is filling up. When the send window opens, there will be no shortage of material.

What’s Next

Add the escalation rule: two consecutive days of “infrastructure_missing” on the same event type triggers a Telegram ping. Then add the content-queue read endpoint to the dashboard API bridge so content-review can actually do its job.
