Self-hosting n8n in production: what the docs leave out
Self-hosting promises sovereignty and predictable costs. The truth shows up in year two, in backups that never restore, updates that block, and monitoring you forgot to build. What we learned running n8n in production for two years.
The Stuttgart call
A client in Stuttgart brought us in last month. Sixty production workflows on a self-hosted n8n instance, around 150 executions per hour, three internal teams using the platform. The question on the table: why had the instance been unreachable for seven minutes every Thursday at 10 PM for the past two weeks?
It took us three hours to find the cause. It was not an n8n bug. It was the backup job on the underlying server, putting enough load on the Postgres instance that the worker containers lost their health checks. The behavior is documented somewhere, in a forum thread from 2024. Nobody reads that before going to production.
Stories like this are not outliers. They show up in nearly every self-hosting setup that was not built with an operations mindset from day one. Self-hosted n8n promises three things: data sovereignty, predictable costs, full control. In the first weeks it delivers on all three. In year two, the question reframes itself.
We have been self-hosting n8n for two years, for our own operations and for a meaningful portion of our clients. In that time we have learned what the official documentation softens and what the marketing pages leave out completely. This article distills what we tell clients before they commit to self-hosting.
Why teams choose to self-host
We hear three reasons most often, and all three are legitimate.
The first is data sovereignty. Teams in healthcare, finance, or public administration often have explicit mandates that certain data cannot leave their own infrastructure boundary. A SaaS platform whose data flows through US-based data centers does not qualify, even when it offers European-region hosting.
The second is cost. n8n Cloud charges per workflow execution with tiered plans. Above roughly 10,000 executions per month, the cloud bill grows noticeably faster than a self-hosted server, which gets set up once and then runs at near-zero marginal cost. On paper.
The third is extensibility. Teams that want to build custom nodes, talk to internal systems without public APIs, or modify the default behavior get much further with self-hosting than with the cloud. We use this regularly for clients with specialized requirements.
So much for the theory. Now what happens in practice.
Reality one: backups are not a button
The official documentation covers backups in one section. The gist: run pg_dump, write the output to S3, done. Technically correct, operationally naive.
In practice, four questions emerge that the docs do not answer.
When do you back up? If the instance runs 24/7, there is no natural maintenance window. Backups during active executions cause either inconsistent snapshots or a noticeable performance dip. We use pg_basebackup during off-hours plus continuous WAL archiving. That enables point-in-time recovery but costs storage and forces a thoughtful retention policy.
Where do backups go? S3 is the obvious answer, but teams that chose self-hosting because of US cloud concerns cannot suddenly land their backups at AWS. We use Hetzner Object Storage, Wasabi, or an internal MinIO cluster depending on the client. Each has its own cost, latency, and quirks during restore.
How do you test restores? A backup never restored is not a backup. It is a hope. We restore once per quarter per client into an isolated environment and verify that workflows run afterward. This is not trivial because credentials are encrypted, and a restore with the wrong encryption key succeeds formally but fails functionally.
What do you back up besides the database? Workflow definitions, obviously. But also credentials whose encryption key lives separately. Custom nodes mounted as files. Webhook endpoints with their generated URLs. A large instance has four or five backup sources that must be restored in the correct order.
We have seen clients run for half a year before someone attempted a restore. The restore failed. Nobody knew why. Workflows came back online a week later, restored from an old export file.
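The wrong-key failure described above can be caught before a restore starts. The following is a minimal sketch, not n8n tooling: it assumes a fingerprint of the encryption key is written into a small manifest at backup time and compared against the key present in the restore environment. The manifest layout, the component labels, and the restore order are hypothetical, and the fingerprint is only safe to store if the key is long and random.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical component labels and restore order, chosen for illustration.
RESTORE_ORDER = ("database", "custom_nodes")


def key_fingerprint(encryption_key: str) -> str:
    """Short SHA-256 fingerprint of the key; assumes a long, random key."""
    return hashlib.sha256(encryption_key.encode("utf-8")).hexdigest()[:16]


@dataclass(frozen=True)
class BackupManifest:
    created_at: str        # ISO timestamp recorded when the backup was taken
    key_fingerprint: str   # fingerprint of the key that encrypted the credentials
    components: tuple      # labels of the backup sources contained in this backup


def restore_problems(manifest: BackupManifest, restore_key: str) -> list:
    """Return human-readable problems; an empty list means the restore can start."""
    problems = []
    if key_fingerprint(restore_key) != manifest.key_fingerprint:
        problems.append("encryption key differs from the key used at backup time")
    for component in RESTORE_ORDER:
        if component not in manifest.components:
            problems.append(f"backup is missing component: {component}")
    return problems
```

A check like this does not replace the quarterly restore. It only turns the "succeeds formally but fails functionally" case into an explicit error at the start.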
Reality two: updates are not npm install
n8n releases new versions every one to two weeks. Bug fixes, new nodes, occasional breaking changes in the node API. The temptation to auto-update is real. We never do.
Three things have made updates hard for us, repeatedly.
Database migrations. Some releases ship schema changes. On small instances, migrations finish in seconds. On instances with millions of execution records, an update can block for ten minutes or more. Teams that do not prune execution history sit through this every time.
Breaking changes in custom nodes. Custom nodes written against a specific n8n API risk breaking with every major update. It is not frequent, but when it happens, diagnosis takes days because workflows often keep running and silently return wrong values.
Node deprecations. n8n regularly retires deprecated nodes or shifts their signatures. Production workflows then fail with error messages that do not point to the actual cause. We watched one workflow produce silent errors for ten days after an update because a node output had switched from string to number.
Our practice: update a staging instance first, run a smoke-test set of workflows, read the release notes carefully, check custom nodes for compatibility, and only then pull the update into production. Each release costs about two hours of effort. The cost is worth it.
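A silent type change like the string-to-number case above is the kind of thing a smoke test can catch mechanically. The sketch below assumes that the smoke-test workflows produce JSON-like output that can be captured before and after the update on staging; how the output is captured is left open, and the function names are illustrative, not part of n8n.

```python
def type_signature(value):
    """Describe the shape of a JSON-like value using type names only."""
    if isinstance(value, dict):
        return {key: type_signature(item) for key, item in sorted(value.items())}
    if isinstance(value, list):
        # A list is described by the signature of its first element, if any.
        return [type_signature(value[0])] if value else []
    return type(value).__name__


def signature_changes(baseline, candidate, path="$"):
    """Yield (path, old, new) for every place where two signatures differ."""
    if isinstance(baseline, dict) and isinstance(candidate, dict):
        for key in sorted(set(baseline) | set(candidate)):
            yield from signature_changes(
                baseline.get(key, "<missing>"),
                candidate.get(key, "<missing>"),
                f"{path}.{key}",
            )
    elif isinstance(baseline, list) and isinstance(candidate, list):
        if baseline and candidate:
            yield from signature_changes(baseline[0], candidate[0], f"{path}[]")
        elif baseline != candidate:
            yield (path, baseline, candidate)
    elif baseline != candidate:
        yield (path, baseline, candidate)
```

Comparing type signatures instead of values keeps the check stable when the data itself changes between runs. It only looks at the first element of each list, so mixed-type lists would need a stricter variant.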
Reality three: scaling means queue mode
By default, n8n runs as a single container that handles web UI, workers, and scheduling. This works up to a limit somewhere between 10 and 50 concurrent executions, depending on workflow complexity and host resources.
Beyond that, you switch to queue mode. One container for web and webhooks, one or more worker containers for execution, a Redis instance as the job queue. The architecture is standard, but the implementation has a few traps.
First trap: webhook responses. In single mode, n8n responds immediately to an incoming webhook. In queue mode, the webhook is received, a job is queued in Redis, and the worker executes it before the response is sent. This works for most cases but can cause timeouts when workers are fully busy or when the workflow runs long.
Second trap: sticky sessions. The editor UI needs consistency across requests for certain actions. Teams that put a load balancer in front must configure sticky sessions or watch the editor display inconsistent states.
Third trap: worker scaling. More workers mean more throughput but also more database connections. We watched an instance run into the connection pool limit because each worker opened its own pool, and Postgres simply stopped accepting new connections at some point. The workflows then failed with confusing error messages.
Fourth trap: Redis persistence. By default Redis stores queue data in memory. If Redis crashes, every unprocessed job is gone. Configuring AOF persistence trades performance for safety. Using managed Redis adds cost and another external dependency.
We use queue mode for clients with more than 1,000 executions per hour. Below that, single mode is the cheaper operational tradeoff, and the performance gain from queue mode is marginal.
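The connection-pool trap comes down to arithmetic that can be done before adding workers. A minimal sketch, assuming every n8n process (main and each worker) opens one pool of the same size, and that some connections are held back for backups, monitoring, and admin access. The parameter names and the reserve of 10 are illustrative. Stock PostgreSQL ships with max_connections set to 100.

```python
def max_workers(pg_max_connections: int, pool_size_per_process: int,
                main_processes: int = 1, reserved: int = 10) -> int:
    """Largest worker count whose pools still fit into the Postgres limit."""
    if pool_size_per_process <= 0:
        raise ValueError("pool_size_per_process must be positive")
    # Connections left after the reserve and the main process pools are taken.
    available = pg_max_connections - reserved - main_processes * pool_size_per_process
    return max(0, available // pool_size_per_process)
```

With a limit of 100, a pool size of 10, one main process, and 10 reserved connections, this gives room for 8 workers. Beyond that, either the pool size has to shrink or a connection pooler has to sit in front of Postgres.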
Reality four: monitoring is not an add-on
n8n has no built-in platform-level alerting. There is a workflow overview and execution logs, nothing else. Teams that want to know whether the platform itself is alive, whether a workflow has been silent for three days, or whether the database connection is starting to thrash, need their own monitoring stack.
We set up four layers per production instance.
First layer: container liveness. We check every two minutes that the web server and all workers respond. Tools: Prometheus with Blackbox Exporter, Uptime Kuma, or Healthchecks.io for an external heartbeat.
Second layer: database health. Postgres connections, query latency, replication lag, free disk. Standard Postgres exporter plus Grafana dashboards.
Third layer: workflow success rates. n8n writes every execution to the database with status. A simple query returns error rates per workflow. We alert when a workflow produces more failures than successes in an hour.
Fourth layer: external heartbeats for critical workflows. Workflows that should run once a day send a ping to an external monitor when they finish. If the ping is missing, an alarm fires. This catches the case where n8n is technically running but the workflow did not start.
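The third layer's rule, more failures than successes within an hour, can be sketched as follows. The SQL string is an assumption: the table name execution_entity, the column names, and the status values should be verified against the schema of the n8n version in use. The Python function only needs (workflow_id, status) pairs and is independent of how they are fetched.

```python
from collections import Counter

# Assumed query shape; verify table and column names against your n8n schema.
ASSUMED_QUERY = """
SELECT "workflowId", status
FROM execution_entity
WHERE "startedAt" >= now() - interval '1 hour'
"""

# Assumed status values that count as a failure.
FAILURE_STATUSES = frozenset({"error", "crashed"})


def workflows_to_alert(rows):
    """Return sorted workflow ids with more failed than successful executions.

    `rows` is an iterable of (workflow_id, status) pairs for the last hour.
    Executions in any other state (for example still running) are ignored.
    """
    failures, successes = Counter(), Counter()
    for workflow_id, status in rows:
        if status == "success":
            successes[workflow_id] += 1
        elif status in FAILURE_STATUSES:
            failures[workflow_id] += 1
    return sorted(wf for wf in failures if failures[wf] > successes[wf])
```

A workflow with one failure and one success stays quiet under this rule. Teams with low-volume critical workflows may prefer to alert on any failure instead.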
Setting up these four layers cleanly takes about two days. Skipping them is flying blind. We had a client whose daily data exports failed silently for two weeks. The CFO noticed when monthly close did not have the numbers. Cause: an expired API token. A simple heartbeat alarm would have shown it immediately.
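The heartbeat alarm is a dead man's switch: the alert fires on the absence of a signal. Hosted monitors implement this for you. The sketch below only shows the decision such a monitor makes, assuming it stores the time of the last ping per workflow. The function name and the grace period are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_overdue(last_ping: Optional[datetime], expected_every: timedelta,
               grace: timedelta, now: datetime) -> bool:
    """True when the expected ping has not arrived within interval plus grace."""
    if last_ping is None:
        return True  # never pinged: treat as overdue
    return now - last_ping > expected_every + grace


# Example: a daily export that last reported at 02:00 UTC, with one hour of grace.
last = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
checked_at = datetime(2024, 1, 2, 3, 30, tzinfo=timezone.utc)
alarm = is_overdue(last, timedelta(days=1), timedelta(hours=1), checked_at)
```

For the check to catch cases like the expired API token, the ping has to be the last step of the workflow and must only run on success. A ping sent at the start of the run would have stayed green for those two weeks.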
Reality five: security needs architecture, not a checklist
Self-hosting means you own the platform's security. Not just configuration, but network, OS, TLS certificates, access control, audit logs.
Three areas where we regularly find gaps during audits.
Webhook endpoints. Every n8n webhook is publicly reachable by default. Without configured authentication, anyone on the internet can trigger it. We have seen instances whose webhooks appeared regularly in bot scan logs. Some of these webhook triggers send emails, create CRM records, post to Slack. An attacker only needs the URL.
Credentials encryption. n8n encrypts credentials with an encryption key that lives in a container environment variable by default. Teams that misconfigure container permissions or push images to shared registries risk leaking the key in container layers or logs. We separate the encryption key into an external secret manager and mount it fresh per container.
Database access. Anyone with network access to the Postgres database can read workflows and export credentials in their encrypted form. With the encryption key in hand, decryption is trivial. We restrict the database to internal networking with no public interface.
These three points are standard in any web security checklist. In self-hosting they reappear because the installation often happened via quickstart and hardening was deferred. Deferred hardening rarely happens.
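For the webhook gap, the simplest fix is the authentication the webhook trigger itself can be configured with. Where the caller can sign its requests, a shared-secret signature check in front of the webhook is another option. The sketch below is an illustration of that general pattern, not n8n functionality. It assumes the caller sends a hex-encoded HMAC-SHA256 of the request body, and where the check runs (reverse proxy, gateway, or the first step of a workflow) is left open.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_authentic(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Compare the received signature with the expected one in constant time."""
    expected = sign(secret, body)
    # Comparing bytes avoids a TypeError on non-ASCII input from the caller.
    return hmac.compare_digest(expected.encode("ascii"),
                               received_signature.encode("utf-8"))
```

With a check like this, knowing the URL is no longer enough to trigger the workflow, and a tampered body fails verification as well.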
The cost calculation the marketing pages do not show
On the marketing page, self-hosting looks nearly free. A VPS for 30 euros a month, a Postgres instance, done. The real cost picture is different.
We model client costs roughly like this: infrastructure is about a quarter of total cost. Backup infrastructure and object storage are another quarter. Monitoring stack is another quarter. Operations effort, meaning updates, restores, incident response, is the final quarter.
A small setup at around 5,000 executions per month, set up once and rarely touched, costs realistically between 200 and 300 euros per month. Maybe 60 euros of that is infrastructure. The rest is proportional operational effort that exists either way, whether it gets paid for or absorbed silently.
A large setup at 50,000 executions per month, with queue mode, backups, and proper monitoring, lands between 800 and 1,500 euros per month. Of that, 200 to 400 euros is pure infrastructure. The rest is operations.
For comparison, n8n Cloud's Pro plan charges around 50 euros for 10,000 executions, and Business is around 600 euros for 100,000 executions. For many clients, the cloud bill ends up no more expensive than honestly priced self-hosting once operations are factored in.
Self-hosting wins financially when at least one of three conditions holds: you already run the supporting stack for other workloads, you have execution volume above 50,000 per month, or sovereignty is non-negotiable.
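The quarter model from this section is easy to turn into a back-of-the-envelope estimate. A minimal sketch, assuming the four shares are exactly equal, which the text only claims roughly. The function name and dictionary keys are illustrative.

```python
def quarter_model(infrastructure_eur: float) -> dict:
    """Estimate monthly cost when infrastructure is one of four equal shares."""
    share = float(infrastructure_eur)
    return {
        "infrastructure": share,
        "backup_and_storage": share,
        "monitoring": share,
        "operations": share,
        "total": 4 * share,
    }
```

For the small setup, quarter_model(60)["total"] gives 240 euros per month, which sits inside the 200 to 300 euro range quoted above. The useful habit is to start from the infrastructure bill and multiply, rather than to treat the infrastructure bill as the total.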
When we recommend self-hosting
For clients under 5,000 workflow executions per month without hard sovereignty requirements, we recommend n8n Cloud. The bill is lower, the weekend stays free.
For clients with hard regulatory constraints in healthcare, finance, or public administration, we recommend self-hosting. Price is not the question. Control is.
For clients with in-house IT teams and existing operational maturity, where monitoring stacks, backup systems, and patch management already exist, self-hosting is a good lever to reduce platform costs while building technical depth.
For clients without IT operations experience, we almost always advise against it. Self-hosting is not a platform. It is a stack, and a stack needs someone to maintain it. Teams without that capacity pay more later, either in consulting or in outages.
Three questions before going self-hosted
Before we install anything with a client, we settle three questions.
First: who is on the hook when the instance goes down at 2 AM? If the answer is "we will look in the morning", that is a deliberate SLA decision. If the answer is "we do not know", self-hosting is the wrong path.
Second: which workflows genuinely need this platform's uptime? Self-hosting is justified when the cost of a one-hour outage is real. When workflows can wait an hour without consequence, cloud is usually the cheaper choice.
Third: who tests the restores? A backup without a successful test is superstition. Without quarterly restores, there is no disaster recovery plan. There is a hope.
When all three answers are clear, we move to architecture. When they are not, we move back to advisory.
What we think after two years
Self-hosting n8n is not hard. It is persistently demanding. The installation is a matter of hours. The operation is a matter of years, with continuous maintenance, regular updates, occasional incidents, and a monitoring stack that grows alongside the platform.
We keep running it because we have the experience, infrastructure, and routine. We recommend it only when the client brings the same maturity or is ready to build it.
The honest view of self-hosting is not an ideological battle between open source and vendor lock-in. It is a question of cost, accountability, and operational readiness. Teams that internalize that can decide well. Teams that pick self-hosting because the cloud bill stings should run the numbers twice.
If you are uncertain whether self-hosting fits your stack, or whether your current self-hosting setup is operationally viable, our free Automations Check is a good starting point. We review your setup against these five realities and tell you honestly whether you are at the operational point where self-hosting actually pays off.