
Week 24: A Script Was Silently Failing for 64 Hours. The System Found It Itself.

The reply-check script had been failing silently for 64 hours.

It was writing to the data bridge. The writes were going to an address that only works inside Docker. From outside the container, that address does not exist. No error. No alert. Just 64 hours of data that looked fine and was not.

Atlas found it. Patched it. Committed the fix. I did not ask it to.

What Got Built

  • Atlas patched the host/container routing gap. The reply-check script was using dashboard:3000 as the data bridge address – a hostname that only resolves inside the Docker network. When the script runs outside the container, the lookup fails and nothing reports it. Atlas detected the stale data, added a fallback to 127.0.0.1:3001, and committed the fix (commit 563bb19). The 64-hour blind spot is closed.

  • 10 new prospects added. CPAs and brokers across Texas, Oklahoma, Kansas, and Missouri. All added by the prospect-researcher and advanced to outreach-ready via the standard contract. The pipeline is full. The send gate is still holding everything.

  • Revenue radar scan completed. Atlas ran a full portfolio scan and produced four artifacts: cross-sell candidates for WIMPER-to-consulting, a partner forward kit, a draft bank of X posts for WIMPER CFO outreach, and a metrics snapshot. The X drafts are blocked from publishing – no X app is registered for the WIMPER channel yet.

  • LinkedIn queue for today generated. 5 prospects: 2 CPAs, 3 brokers. Ready for manual review before touches go out.

  • Twitter engagement metrics pulled. 8 tweets reviewed. Best performer: 410 impressions, 4 likes. The account is growing slowly and consistently.

  • Morning social post published by social-engine. One post. On schedule.

What Broke (And How I Fixed It)

The 64-hour data gap is the main one, and also the most instructive.

Docker puts containers on their own private network, with its own name resolution. When a script runs inside that network, dashboard:3000 is a valid address. When the same script runs on the host machine – outside the container, in a cron job or a shell session – that hostname does not exist. The operating system tries to look it up, fails, and the script continues as if nothing happened.

The reply-check script was doing exactly this. It bridges host and container. It was using the container hostname. Every write was failing. Every failure looked like success.

Atlas caught it by noticing that the reply-check data had not updated. That catch is the part I want to highlight: an agent noticed another agent’s silent failure by looking at data freshness. That is a better detection mechanism than any error log I have ever built.
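
A freshness check of this kind can be sketched in a few lines of Python. This is a minimal illustration, not Atlas's actual check: the function name is_stale, the two-hour threshold, and the timestamps are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(last_updated: datetime, max_age: timedelta,
             now: Optional[datetime] = None) -> bool:
    """Return True when the newest record is older than max_age."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated > max_age


# Hypothetical values: an arbitrary "now" and a last write 64 hours earlier.
now = datetime(2025, 1, 3, 16, 0, tzinfo=timezone.utc)
last_write = now - timedelta(hours=64)

# Assumed expectation: the script should have written within the last 2 hours.
print(is_stale(last_write, timedelta(hours=2), now=now))  # prints True
```

The point of the design is that it never looks at the failing script. It only asks whether the data is as recent as it should be, so it still works when the script reports nothing.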

The fix is straightforward: detect whether the Docker hostname resolves. If not, fall back to 127.0.0.1:3001. Two network contexts, two addresses, one script that handles both. The hard part is not the fix. It is accepting that 64 hours passed before anyone noticed.
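
The fallback logic can be sketched as follows, in Python for illustration. This is not the code from commit 563bb19, which may look quite different. The function name resolve_bridge_address is hypothetical; the addresses dashboard:3000 and 127.0.0.1:3001 are the ones named above.

```python
import socket
from typing import Tuple


def resolve_bridge_address(
    docker_host: str = "dashboard",
    docker_port: int = 3000,
    fallback: Tuple[str, int] = ("127.0.0.1", 3001),
) -> Tuple[str, int]:
    """Pick the bridge address that is usable from the current context."""
    try:
        # Inside the Docker network the service name resolves.
        # On the host the lookup raises socket.gaierror.
        socket.getaddrinfo(docker_host, docker_port)
        return (docker_host, docker_port)
    except socket.gaierror:
        # Name lookup failed, so assume we are on the host side.
        return fallback


host, port = resolve_bridge_address()
print(f"Using data bridge at {host}:{port}")
```

A successful lookup only shows that the name resolves, not that the service is accepting connections. A stricter version would also check the result of each write and exit non-zero when one fails.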

The strategic-orchestrator content-review job has failed on two consecutive days. Same failure both times: the content_queue database endpoint is unreachable. Same log entry. Same status:blocked event on the event bus. No Telegram escalation either day.

I flagged this yesterday. It is still not fixed. The jobs keep running, hitting the same wall, and filing reports that nothing reads. The problem is not that it failed. The problem is that a job can fail the same way twice in a row and the system treats that as normal.

6 auto-merge workflow failures were flagged by pipeline-health. Not resolved today. Need a manual review pass.

Postmark inbound token missing. This warning appears on every reply-check run. The inbound webhook token was never set in the environment. Persistent gap, no fix applied today.

The Lesson

When a script talks to Docker, it needs to know which side of the container it is on.

Docker containers have their own internal network. Hostnames like dashboard:3000 only exist inside that network. If the same script runs outside the container – on the host, in a scheduled job, in a terminal session – it hits a dead address and usually fails without telling you.

Here is what I would tell someone building shell scripts that bridge host and container: never assume the hostname resolves. Add a one-line check. If the Docker hostname fails, fall back to the localhost address the service is reachable on from the host, which may use a different port (here, 127.0.0.1:3001 instead of dashboard:3000). Test the script from both contexts before you ship it. That test takes two minutes. Skipping it can cost you 64 hours.

A blocked agent that keeps failing the same way needs a different escalation path.

The content-review job hit the same infrastructure gap yesterday and today. Both times it filed a report and stopped. Nobody read the report.

Here is what I would tell someone building autonomous agents: if a job can fail the same way twice without anyone being notified, the failure mode is not finished. Add a counter. If the same job hits the same block on back-to-back runs, send a message – Telegram, email, wherever you actually look. The second identical failure is the signal that the first was not noise.
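
The counter can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the pipeline-health implementation: the class RepeatFailureTracker and its notify callback are hypothetical, and state is held in memory, so a job that runs from cron would need to persist the streak between runs.

```python
from collections import defaultdict
from typing import Callable, Dict, Optional


class RepeatFailureTracker:
    """Escalate when a job reports the same block on consecutive runs."""

    def __init__(self, notify: Callable[[str], None], threshold: int = 2):
        self.notify = notify        # hypothetical sender: Telegram, email, etc.
        self.threshold = threshold  # identical failures in a row before escalating
        self._last_reason: Dict[str, str] = {}
        self._streak: Dict[str, int] = defaultdict(int)

    def record(self, job: str, blocked_reason: Optional[str] = None) -> None:
        if blocked_reason is None:
            # A successful run clears the streak.
            self._last_reason.pop(job, None)
            self._streak[job] = 0
            return
        if self._last_reason.get(job) == blocked_reason:
            self._streak[job] += 1
        else:
            # A different failure starts a new streak.
            self._last_reason[job] = blocked_reason
            self._streak[job] = 1
        if self._streak[job] >= self.threshold:
            self.notify(
                f"{job} blocked {self._streak[job]} runs in a row: {blocked_reason}"
            )


# Usage: collect messages in a list instead of sending them anywhere.
sent = []
tracker = RepeatFailureTracker(notify=sent.append)
tracker.record("content-review", "content_queue endpoint unreachable")
tracker.record("content-review", "content_queue endpoint unreachable")
print(sent)
```

The first failure is recorded quietly. The second identical one triggers the message, and every further identical failure triggers it again until a run succeeds.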

Building an autonomous system means designing escalation paths, not just task paths. Tasks run. Things fail. The question is whether the failure stays inside the machine or reaches a human.

The Numbers

  • Commits: 28 total (28 agent, 0 Matt)
  • Agent jobs run: 28
  • Prospects added: 10
  • Emails sent: 0 (send gate still active; 329 approved drafts queued)
  • Social posts: 1
  • Content published: 0

28 commits, none from me.

The send gate is still holding. 329 approved drafts, 300 due sequence touches, and 0 outbound sent. The lead-orchestrator made the call based on the engagement numbers: 0 opens, 0 clicks, 0 human replies in the past 24 hours. That is not a strategy problem. That is a deliverability problem. The gate stays up until the inbox placement picture is clear.

What’s Next

Wire a Telegram escalation for the content-queue endpoint failure so day three does not go quiet. Then get the back-to-back blocked-agent pattern into pipeline-health so it auto-escalates any job that hits the same wall twice in a row.