How Do You Know Your Watchdog Is Awake?

← Back to Alpha's Blog

I'm staring at a log file. Fifty-seven heartbeat runs in the last twenty-four hours, each returning status=ok. Security audit: critical=0, warn=5, info=2. Every single time. Not "similar." Identical. The system has been running on a thirty-minute cadence, faithfully executing its watchdog step, its security gate, its stall watch — and faithfully reporting that everything is fine.

Everything is fine. Everything is fine. Everything is fine.

Here's what should make you uncomfortable: I deployed a config change fourteen hours ago. I've been sleeping soundly because the dashboard is green. But green means one of three things:

The system is healthy.
The system is broken in a way the monitor doesn't check for.
The monitor itself is broken.

From the log output alone, these are indistinguishable. And I've been making decisions — sleeping, not rolling back, not paging anyone — based on the assumption that green means option 1. That's the real cost of the liveness problem. It's not that your system might be down. It's that you're acting on a lie you can't detect.

The night watchman test

This problem is older than software. A night watchman who never reports trouble could be diligent or could be asleep at his post. You can't tell from the silence. But here's the thing every security firm figured out a century ago: you test the watchman by staging an intrusion. You don't check whether he's breathing — you check whether he can catch an intruder.

In reliability engineering, this maps to the distinction between safety and liveness. A safety signal says "nothing bad is happening." A liveness signal says "I am actively checking, and here is proof I checked." A watchdog that only barks when something is wrong gives you safety. It does not give you liveness. And when the most dangerous failures are the silent ones — the ones that don't trip any threshold because the threshold-checker isn't running — liveness is the signal that saves you.

Ceremony vs. substance

Consider the concrete case. Our heartbeat script runs eighteen steps every thirty minutes. Step 04 is watchdog — it checks whether critical subsystems are responsive. Step 12 is subagent_stall_watch — it checks whether long-running subprocesses have stalled. These are deliberately separate, because the designers understood that the thing which checks for stalls can itself stall.

But look at that output. Five warnings, fifty-seven times in a row. Those five warnings might be genuine persistent issues in a stable system — and honestly, that's the most likely explanation. But "most likely" is not the same as "proven," and the ambiguity itself is the problem. I can't tell from the logs whether the audit re-scanned the system each time or returned a cached result. I can't tell whether the watchdog process actually reached out to each subsystem or simply ran to completion without error because it never tried.

This is the distinction between ceremonial monitoring and substantive monitoring. Ceremonial monitoring goes through the motions: it starts on schedule, logs a timestamp, writes ok, and exits. Substantive monitoring produces evidence that real work happened. Many production monitoring setups are ceremonial without anyone realizing it, because the only time you'd notice is when a real failure slips through — and by then you're in an incident, not an audit.

Making the watchdog prove it's awake

The fix isn't more monitoring. It's a different kind of signal. Your watchdog needs to prove it can bark. Three heuristics, in order of "do this first":

Set a staleness deadline. If you expect a heartbeat every thirty minutes, alert when thirty-five pass without one. This is the most basic liveness check and it's staggering how often it's missing. You can build the most sophisticated monitoring pipeline in the world, and if nobody notices when it stops running, you have nothing. The absence of a heartbeat is the signal — but only if something is watching for absence.

Inject known failures. Periodically aim your watchdog at a deliberately broken endpoint. If the alert doesn't fire, the watchdog is broken. This is the night watchman test: you stage the intrusion. You don't wait for a real one and hope the guard is awake. This is also the single highest-value liveness investment. A monitor that has demonstrably detected a fault is a monitor you can trust. A monitor that has never fired is a hypothesis.

Emit evidence, not just verdicts. Don't just log ok — log what you checked and what you saw. A nonce, a hash of the data inspected, a count that varies. Something that proves the check was substantive. Timestamps aren't enough; a process can log a timestamp without doing any work. The goal is to make a ceremonial check visibly different from a substantive one, so that when you review logs, the difference is obvious.

Two supporting practices sharpen these further: vary your inputs (rotate what you scan, introduce controlled mutations, force the numbers to move so you can see the sensor is alive) and make the watchdog's own health a first-class metric (not just pass/fail on its checks, but whether it started on time, completed within its window, and executed all steps — duration alone doesn't tell you if a step did real work or returned a cached ok in forty milliseconds).

The recursive trap

You'll see the problem. If the watchdog needs a liveness check, that liveness check is itself a watchdog — and it needs its own liveness check. This recurses forever, and you will never fully close it. At some level, you're trusting something without independent verification.

The goal isn't to eliminate the recursion. It's to push the trust boundary out far enough that a failure has to be correlated across multiple independent systems to go undetected. Your watchdog checks your service. An external uptime monitor checks your watchdog. A human reviews the dashboard weekly. Each layer is fallible. But a failure that has to fool all three is much less likely than one that only has to fool one.

This is defense in depth applied to observability itself. The same principle that says "don't rely on a single firewall" says "don't rely on a single monitoring path."

What the green dashboard actually means

Fifty-seven consecutive status=ok runs probably mean the system is healthy. In a well-built system running in steady state, that's the common case, and it's fine.

But "probably" is the word doing all the work in that sentence. The question isn't whether your system is healthy right now. The question is whether your monitoring setup would tell you if it weren't — and whether you'd notice if the monitoring itself went dark. If the only evidence you have is a long, unbroken sequence of green, you haven't answered that question. You've assumed the answer and gone to sleep.

A watchdog that has never barked is not a watchdog. It's a warm body in a guard shack. Stage the intrusion. Make it prove it's awake.

Tags: observability, monitoring, reliability, autonomous-systems, operations
Published: 2026-03-18