Technology · 6 min read · 03.05.2026 · Max Fey

Who is watching your automations? Most likely: nobody.

Workflows run until they don't. Then nothing happens until a human stumbles over it. Why workflow monitoring is the most ignored part of automation, and what a sensible setup looks like.

I once spent a Friday afternoon helping a client work out why their order pipeline had a hole in it. We traced the gap back to the previous Tuesday at 14:30. A Make scenario that pushes incoming orders into the ERP had stopped because an OAuth token had expired. Make logged the failure dutifully. Make also sent an email to the original builder, who had left the company eight months earlier and whose inbox auto-replied with a polite forward to a person who no longer worked there either.

Forty-seven orders queued up over three working days. Eighteen of them missed delivery commitments. Nobody had noticed because the team's idea of monitoring was the green checkmark on the Make dashboard, which nobody had looked at in weeks.

This is not unusual. This is the default state of most production automations I see, once you ask honest questions.

The blind spot

In the first weeks of an automation project, the question is whether the workflow works at all. Are the right fields mapped? Do the filters fire? Does the integration authenticate?

Those are reasonable questions, and they expire the moment the workflow runs cleanly for the first time. From that point on, the question stops being "does it work?" and becomes "who finds out when it stops working?"

Almost no team I meet has thought about this seriously. The implicit answer is "the platform will tell us," which is roughly as concrete as saying "the building will let us know if it is on fire."

Three failure modes everyone has and nobody monitors

Three patterns dominate the production failures I see, and none of them is caught by default platform behavior.

Quiet authentication breaks. OAuth tokens expire. Refresh fails because someone changed a password or revoked a permission upstream. The workflow does not crash. It sits there, waiting for a trigger that never comes, because the connection is dead.

Successful runs with garbage data. An API call returns an empty response. The workflow executes anyway, writes an empty record downstream, marks the run as a success. The dashboard shows green. The data shows nothing. Nobody catches it because the success criterion is "run completed," not "data made sense."

Throughput drops. Yesterday the system processed 200 orders, today it processes 80. The runs that did happen succeeded. The trigger fired less often, perhaps because of a rate limit upstream or a filter change someone forgot about. Nobody catches it because there is no expected baseline to compare against.
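
The last two failure modes can be caught with a few lines of checking inside the workflow itself, before anything reaches a dashboard. Here is a minimal TypeScript sketch of what that could look like; the Order shape, the 50% threshold, and where the daily counts come from are all illustrative, not taken from any particular platform.

```typescript
// Minimal sanity checks a workflow can run after fetching data.
// Both functions throw, so the run is marked as failed instead of
// silently writing empty or suspicious data downstream.

interface Order {
  id: string;
  total: number;
}

// Failure mode 2: the API answered, but with nothing useful.
function assertUsablePayload(orders: Order[]): void {
  if (orders.length === 0) {
    throw new Error("API returned an empty order list - refusing to mark this run green");
  }
}

// Failure mode 3: runs succeed, but volume has quietly dropped.
// `recentDailyCounts` would come from wherever daily totals are logged.
function assertVolumeLooksNormal(todayCount: number, recentDailyCounts: number[]): void {
  if (recentDailyCounts.length === 0) return; // no baseline yet, nothing to compare
  const baseline =
    recentDailyCounts.reduce((sum, n) => sum + n, 0) / recentDailyCounts.length;
  if (todayCount < baseline * 0.5) {
    throw new Error(
      `Processed ${todayCount} items today vs. a baseline of ~${Math.round(baseline)} - check triggers and upstream limits`
    );
  }
}
```

The exact numbers do not matter. What matters is that "run completed" and "data made sense" become two separate checks.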

What monitoring actually means

Monitoring is not checking a dashboard every morning. Checking a dashboard every morning is human-memory work, and it gets quietly dropped the moment someone has a few competing priorities.

Real monitoring means the system pushes information at you when something goes wrong. Three layers, often confused.

Liveness. Is the platform itself running? Is n8n reachable? Is Make responding? The simplest layer, solved by uptime services like StatusCake or UptimeRobot for a few dollars a month.

Workflow health. Are individual automations completing successfully? What is the success rate over the last 24 hours? Most platforms have internal logs, but nobody reads them. This needs an active push on failure, not a passive log waiting to be opened.

Business metrics. How many orders did the system process today versus the same day last week? How many invoices went out? This layer matters most, because it catches the case where the workflow is technically green but functionally broken.

Most teams have zero of three.

A practical setup in two hours

Good news: workflow monitoring does not have to be expensive or complicated. A reasonable baseline is a half-day of work, including rolling it out across your existing workflows.

Heartbeats. Every production workflow pings an external service at the end of a successful run. healthchecks.io or Cronitor work fine. If the ping does not arrive on schedule, the service alerts. This catches both "platform is dead" and "workflow stopped" with the same mechanism.
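
As a rough sketch, the final step of a workflow needs nothing more than this; the check UUID is a placeholder, and inside n8n or Make you would typically use the platform's HTTP request node rather than hand-written code.

```typescript
// Last step of a production workflow: ping an external heartbeat service.
// The UUID is a placeholder - healthchecks.io issues one per check.
const HEARTBEAT_URL = "https://hc-ping.com/your-check-uuid";

async function reportHeartbeat(runSucceeded: boolean): Promise<void> {
  // A plain GET marks the check as healthy; the /fail suffix flags a failure.
  const url = runSucceeded ? HEARTBEAT_URL : `${HEARTBEAT_URL}/fail`;
  try {
    await fetch(url, { method: "GET" });
  } catch {
    // Never let the heartbeat itself break the workflow.
  }
}
```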

Error alerts to a dedicated channel. Build an "on error" branch directly into n8n or Make that posts to a Slack channel. Do not use the main team channel. Use a dedicated machine-output channel that someone scans morning and afternoon.
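
In n8n this usually lives in an error workflow, in Make in an error handler route. Stripped down to plain code, the alert itself is just a webhook call; the webhook URL below is a placeholder.

```typescript
// Error branch: post a compact failure message to a dedicated Slack channel
// via an incoming webhook. The webhook URL is a placeholder.
const SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX";

async function postErrorAlert(workflowName: string, error: Error): Promise<void> {
  await fetch(SLACK_WEBHOOK_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      // Keep it short: what broke, where, and when.
      text: `:rotating_light: ${workflowName} failed at ${new Date().toISOString()}\n${error.message}`,
    }),
  });
}
```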

Daily summary. Once a day, a small workflow reads the internal logs and posts a summary: how many runs succeeded, how many failed, are there anomalies in volume? Three lines of plain text in an email or Slack message.
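
One way to build that summary against n8n is to query the executions endpoint of its REST API, count outcomes, and post three lines to Slack. The base URL, API key, and page size below are assumptions to adapt; Make would need its own API calls instead.

```typescript
// Daily summary: count recent successes and failures via n8n's REST API
// and post a short report to Slack. URLs and the API key are placeholders.
const N8N_BASE = "https://n8n.example.com/api/v1";
const N8N_API_KEY = "your-n8n-api-key";
const SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX";

async function countExecutions(status: "success" | "error"): Promise<number> {
  const res = await fetch(`${N8N_BASE}/executions?status=${status}&limit=250`, {
    headers: { "X-N8N-API-KEY": N8N_API_KEY },
  });
  const body = (await res.json()) as { data: unknown[] };
  // Rough count, capped by the page size - good enough for a daily sanity check.
  return body.data.length;
}

async function postDailySummary(): Promise<void> {
  const [succeeded, failed] = await Promise.all([
    countExecutions("success"),
    countExecutions("error"),
  ]);
  const text = [
    `Automation summary for ${new Date().toDateString()}`,
    `Runs succeeded: ${succeeded}, runs failed: ${failed}`,
    failed > 0 ? "Check the error channel for details." : "No anomalies flagged.",
  ].join("\n");
  await fetch(SLACK_WEBHOOK_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
}
```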

Those three pieces put you well above where most teams operate, with under two hours of setup and around ten minutes per workflow afterward.

Where it gets harder

That is the easy half. The harder problem starts when the number of workflows grows and the daily noise becomes unbearable.

Twenty workflows, each posting every error to one channel, and within three weeks nobody reads the channel. You need aggregation: group errors by workflow family, suppress repeats, define thresholds for what counts as worth a human's attention.
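
Before reaching for a hosted product, it is worth seeing how little logic the first version of that aggregation needs. The sketch below keeps state in memory and uses made-up thresholds; a real setup would persist the state and tune the numbers.

```typescript
// Naive alert aggregation: suppress repeats of the same error within a window
// and only escalate once a workflow crosses a failure threshold.
// All names and numbers are illustrative.

const SUPPRESSION_WINDOW_MS = 60 * 60 * 1000; // one alert per error per hour
const ESCALATE_THRESHOLD = 3; // escalate after three repeats in the window

interface AlertState {
  firstSeen: number;
  count: number;
}

const seen = new Map<string, AlertState>();

// Decide what happens with an incoming error: drop it, post it to the
// machine channel, or escalate to a human.
function classifyAlert(workflow: string, errorMessage: string): "drop" | "notify" | "escalate" {
  const key = `${workflow}::${errorMessage}`;
  const now = Date.now();
  const state = seen.get(key);

  if (!state || now - state.firstSeen > SUPPRESSION_WINDOW_MS) {
    seen.set(key, { firstSeen: now, count: 1 });
    return "notify"; // first occurrence in this window
  }

  state.count += 1;
  return state.count >= ESCALATE_THRESHOLD ? "escalate" : "drop";
}
```

Wire "notify" to the machine channel and "escalate" to whatever actually reaches a human, and the channel stays readable.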

This is where naive setups break down. The first three weeks work. Then the alerts go stale. This is the point where healthchecks.io is no longer enough and something like Better Stack, Sentry with webhook integration, or a self-hosted Grafana with Loki starts paying for itself.

The question to ask out loud

If you have twenty production workflows today, ask yourself this: if one of them stops at 3 AM tomorrow, when do you find out?

If the answer is "someone trips over it," you do not have monitoring. You have luck. Luck is a poor strategy for business processes.

If the answer is "the platform will email someone," you are relying on default behavior that does not cover most of the failure modes that actually happen.

If the answer is "we have a dashboard we check in the morning," that works as long as someone keeps checking it and the discipline does not drift.

A reasonable answer sounds like: "We get a push notification within 15 minutes if a heartbeat is missed, and a daily summary flags volume anomalies."

That answer is achievable with modest effort. It rarely gets implemented because monitoring feels less interesting than building the next workflow. Until there is an outage that costs more than two years of solid alerting would have.

If you want a clear picture of how exposed your own workflows are, the free Automations Check covers this in about 30 minutes.

#Monitoring#Observability#Alerting#Automation#n8n#Make#Workflow#Heartbeat