TL;DR: A reliable n8n setup has three layers. Per-node settings (Retry On Fail and the On Error selector) handle transient failures in place. One global error workflow, triggered by the Error Trigger node, catches everything that still breaks and sends you a useful alert. The Stop And Error node lets you fail on purpose when your data is wrong. Build all three. If your only safety net is staring at the executions list, you don't have error handling, you have hope.
The mistake almost everyone makes first
Most n8n workflows ship with zero error handling, and the author finds out a payment sync has been silently dead for three days when a customer complains. The default behavior of a node failure is to stop the execution and mark it failed. Nobody is watching the executions list at 3am. That is the entire problem to solve.
The fix is not one setting. Transient errors (a rate-limited API, a momentary timeout) want a retry. Permanent errors (a 401, a malformed payload) want an alert, not five pointless retries. And errors you can predict (an empty array where you expected data) want to fail loudly and on purpose. A production-grade strategy uses a different tool for each, layered from the node outward.
Layer 1: per-node Retry On Fail for transient errors
Open any node, go to the Settings tab, and turn on Retry On Fail. This is the cheapest reliability win in n8n and the one people skip. Two fields appear:
- Max Tries: how many times n8n re-runs the node before giving up. The field caps at 5.
- Wait Between Tries (ms): the delay before each retry, capped at 5000 ms.
Use it on anything that talks to a flaky network: HTTP Request nodes, third-party API integrations, database calls. A common-sense baseline is 3 tries with a 2000-3000 ms wait. That alone absorbs the overwhelming majority of "it failed once and worked on rerun" incidents.
What Retry On Fail is not good for: authentication failures, validation errors, and most other 4xx responses (a 429 rate limit is the usual exception, since it clears on its own). Retrying a 401 five times just delays the alert and burns API quota. For those, you want the failure to surface immediately.
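One way to make the retry-or-alert split concrete is a small decision helper, for example in a Code node that inspects a failed response. This is a minimal sketch: the status list and the names `isTransient` and `nextAction` are assumptions for illustration, not something n8n provides.

```javascript
// Hypothetical helper, not part of n8n: status codes that usually clear up on their own.
// 408 and 429 are 4xx codes, but they describe timing problems rather than a bad request.
const TRANSIENT_STATUS = new Set([408, 429, 502, 503, 504]);

function isTransient(statusCode) {
  return TRANSIENT_STATUS.has(statusCode);
}

// Decide what a failure deserves: another attempt, or an immediate alert.
function nextAction(statusCode, attempt, maxTries = 3) {
  if (!isTransient(statusCode)) return "alert"; // e.g. 401 or 422: retrying cannot help
  return attempt < maxTries ? "retry" : "alert"; // transient, but out of attempts
}
```

The point of the sketch is the ordering: the permanence check comes first, so a 401 never consumes a retry.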
There is one quirk worth knowing before you trust it. If you enable Retry On Fail and set the On Error option (below) to one of the Continue choices, the Max Tries and Wait Between Tries values are ignored - the node continues on the first failure instead of retrying (n8n issue #10763). Pick one behavior per node: retry, or continue. Don't expect both from the same node.
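If you keep workflow exports in version control, this conflict can be checked mechanically. The sketch below assumes the exported JSON stores these settings on each node as `retryOnFail` and `onError`, with `continueRegularOutput` and `continueErrorOutput` as the two Continue values; verify those names against an export from your own instance before relying on it.

```javascript
// Assumed field names and values, based on n8n's exported workflow JSON.
const CONTINUE_MODES = new Set(["continueRegularOutput", "continueErrorOutput"]);

// Return the names of nodes that ask for both retrying and continuing.
function findRetryContinueConflicts(workflow) {
  return (workflow.nodes ?? [])
    .filter((node) => node.retryOnFail === true && CONTINUE_MODES.has(node.onError))
    .map((node) => node.name);
}

// Made-up sample export with one conflicting node.
const sampleWorkflow = {
  name: "Stripe to CRM sync",
  nodes: [
    { name: "Fetch invoices", retryOnFail: true, maxTries: 3, onError: "continueErrorOutput" },
    { name: "Update CRM", retryOnFail: true, maxTries: 3 },
    { name: "Log row", onError: "continueRegularOutput" },
  ],
};

console.log(findRetryContinueConflicts(sampleWorkflow));
```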
Layer 2: the On Error selector
Right below Retry On Fail in the Settings tab is On Error, the dropdown that decides what happens when a node fails for good. Three options:
- Stop Workflow (the default): the execution halts and is marked failed. This is what triggers your error workflow, so it is the right default for most nodes.
- Continue: the workflow moves on as if nothing happened, passing the input through. Use this rarely; a silent skip is how data goes missing.
- Continue (using error output): the node grows a second output (a red error branch). Successful items leave the normal output, failed items leave the error output. This is the good one.
"Continue (using error output)" is the pattern you reach for when partial failure is acceptable and you want to handle the bad items separately. Process a batch of 200 leads, send the 195 that succeed onward, and route the 5 that errored into a Set node that logs them to a sheet. The workflow finishes, you keep the good data, and the failures are captured instead of crashing the whole run.
One reported gotcha: in some node versions and community nodes, error output has been observed leaking into the success branch (n8n issue #11202). If you depend on this branch for correctness, add an explicit check downstream rather than assuming items on the success output actually succeeded. Our standing advice with n8n: never assume, verify the shape of the data. The Code node is the cleanest place to do that.
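A downstream check of that kind might look like the following sketch. It assumes items arrive in the usual `{ json: ... }` shape and that a leaked failure carries an `error` key on `json`; that marker is an assumption, so adjust it to whatever your failing node actually emits.

```javascript
// Split items into ones that look valid and ones that look like leaked failures.
function partitionByError(items) {
  const ok = [];
  const failed = [];
  for (const item of items) {
    const json = item?.json;
    // Valid means: json is an object and carries no `error` key (assumed failure marker).
    const looksValid = json !== null && typeof json === "object" && !("error" in json);
    (looksValid ? ok : failed).push(item);
  }
  return { ok, failed };
}
```

In a Code node set to run once for all items, this could be used as `return partitionByError($input.all()).ok;`, with the `failed` list routed to your log instead of being dropped.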
Layer 3: the Error Trigger and a single global error workflow
This is the piece that turns a pile of workflows into something you can run a business on. Build one workflow whose only trigger is the Error Trigger node, then assign it as the error workflow for everything else.
The Error Trigger fires whenever a workflow it is attached to fails (a node hits Stop Workflow and there is no error output to catch it). It receives a structured payload describing the failure. The shape, when the execution was saved to the database, looks roughly like this:
{
  "execution": {
    "id": "231",
    "url": "https://your-instance.app.n8n.cloud/workflow/abc/executions/231",
    "retryOf": null,
    "error": {
      "message": "Request failed with status code 401",
      "stack": "..."
    },
    "lastNodeExecuted": "HTTP Request",
    "mode": "trigger"
  },
  "workflow": {
    "id": "abc",
    "name": "Stripe to CRM sync"
  }
}
The fields that matter for an alert are workflow.name (which automation broke), execution.lastNodeExecuted (where it broke), execution.error.message (why), and execution.url (a direct link to the failed run for debugging). Note that execution.id and execution.url only exist if the execution was saved - and execution.url is absent when the error happens in the trigger node of the main workflow itself. If your alert template assumes those fields always exist, it will throw inside your error workflow, which is a uniquely annoying way to lose an alert.
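One way to avoid that failure mode is to build the alert text in a Code node with a fallback for every optional field. A minimal sketch, assuming the payload shape shown above; the function name `formatAlert` and the fallback strings are placeholders:

```javascript
// Build alert text from an Error Trigger payload without assuming optional fields exist.
function formatAlert(payload) {
  const execution = payload?.execution ?? {};
  const lines = [
    `Workflow failed: ${payload?.workflow?.name ?? "unknown workflow"}`,
    `Node: ${execution.lastNodeExecuted ?? "unknown node"}`,
    `Error: ${execution.error?.message ?? "no error message in payload"}`,
  ];
  // execution.url is only present when the execution was saved, so add it conditionally.
  if (execution.url) lines.push(`Run: ${execution.url}`);
  return lines.join("\n");
}
```

A payload with no `execution` block at all still produces a three-line alert instead of throwing.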
Assigning the error workflow
You do not connect workflows together for this. Open the workflow you want to protect, go to the three-dot menu, open Settings, and set the Error workflow field to your error-handling workflow. Do this for every important workflow. There is no global default in the UI, so a workflow with the field left blank fails silently. That blank field is the single most common reason teams think their alerting works when it doesn't.
A small but real time-saver: set the Error workflow field as part of your build checklist, not as an afterthought. A workflow without an assigned error workflow should be treated as unfinished.
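That checklist item can be automated over a set of workflow exports. The sketch below assumes each export stores the assignment under `settings.errorWorkflow`; treat that key as an assumption and confirm it against one of your own exports.

```javascript
// Return the names of workflows that have no error workflow assigned.
// Assumes the exported JSON keeps the assignment in settings.errorWorkflow.
function workflowsWithoutErrorHandler(workflows) {
  return workflows
    .filter((wf) => !wf.settings?.errorWorkflow)
    .map((wf) => wf.name);
}
```

Run it against the exports in your repository and treat a non-empty result as a failed build.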
Wiring the alert: Slack or email
Inside the error workflow, the Error Trigger feeds whatever notification node you prefer. Keep the message dense and link straight to the failed execution so you can debug in one click. A Slack message body using expressions:
Workflow failed: {{ $json.workflow.name }}
Node: {{ $json.execution.lastNodeExecuted }}
Error: {{ $json.execution.error.message }}
Run: {{ $json.execution.url }}
For email, the same fields map to a Send Email or Gmail node. Slack is the better default for failures you need to react to within minutes; email is fine for lower-urgency batch jobs where a daily glance is enough.
If you want to avoid alert fatigue, add an IF node after the Error Trigger that filters on execution.error.message. Route known-transient errors to a quiet log and only ping the channel for the ones that need a human. Don't over-engineer this on day one - a single channel that fires on every failure beats a clever filter you never finish building.
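If the filter outgrows a single IF condition, the same routing can live in a Code node. A minimal sketch; the patterns below are guesses at what known-transient messages look like and should be replaced with strings from your own execution logs:

```javascript
// Hypothetical patterns for errors that tend to resolve themselves.
const QUIET_PATTERNS = [/time(d )?out/i, /ECONNRESET/, /status code 429/, /status code 50[234]/];

// Return "log" for known-transient messages and "notify" for everything else.
function routeAlert(message) {
  const text = String(message ?? "");
  return QUIET_PATTERNS.some((pattern) => pattern.test(text)) ? "log" : "notify";
}
```

Unknown or missing messages fall through to "notify", so anything the filter does not recognize still reaches a human.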
One caution: if your alert depends on an external service (Slack's API, your SMTP host), that service can be down at the exact moment your workflow fails. For anything truly critical, send to two independent channels - for example Slack plus email - so a single outage doesn't swallow the notification. No silent fallbacks; you want both to fire.
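In code, "both fire" means sending to every channel independently and recording which ones failed, rather than trying the second only when the first breaks. A sketch of that pattern, where `senders` is a hypothetical map from channel name to an async send function (inside n8n you would more likely wire two notification nodes in parallel):

```javascript
// Send the same alert through every channel at once and report which ones failed.
async function notifyAll(text, senders) {
  const names = Object.keys(senders);
  // The async wrapper turns synchronous throws into rejections, so one bad sender
  // cannot stop the others from being called.
  const results = await Promise.allSettled(names.map(async (name) => senders[name](text)));
  const failed = names.filter((_, index) => results[index].status === "rejected");
  return { delivered: names.length - failed.length, failed };
}
```

Because `Promise.allSettled` waits for every attempt, a Slack outage shows up in `failed` while the email still goes out.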
Failing on purpose: the Stop And Error node
The Stop And Error node throws an error deliberately, which is exactly what you want when a downstream step received data it should never have received. It stops the execution and, because it is a real failure, it triggers your error workflow and alert just like any other node failure.
It offers two Error Type options:
- Error Message: throw a plain string you write, like "No invoice line items found for order."
- Error Object: throw a JSON object, useful when your error workflow parses structured fields to decide routing or severity.
The classic use: right after a node that fetches records, add an IF node checking whether the result is empty, and on the empty branch drop a Stop And Error with a clear message. Instead of the workflow quietly continuing with zero items and "succeeding" while doing nothing, it fails with a message that tells you precisely what was missing. This is the n8n equivalent of an assertion. Use it anywhere a downstream success would be a lie.
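The same assertion can also be written in a Code node, on the assumption that an uncaught error thrown there fails the node like any other node failure. A minimal sketch; the helper name `assertNonEmpty` and the message wording are illustrative:

```javascript
// Throw when a fetch step returned nothing, instead of "succeeding" with zero items.
function assertNonEmpty(items, what) {
  if (!Array.isArray(items) || items.length === 0) {
    throw new Error(`No ${what} found; refusing to continue with an empty result`);
  }
  return items; // pass the data through unchanged when the check holds
}
```

The IF plus Stop And Error version keeps the check visible on the canvas; the Code node version is more compact when you already have a Code node at that point in the workflow.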
What a production-grade strategy actually looks like
Putting the layers together, here is the stance worth defaulting to:
Enable Retry On Fail (3 tries, ~2500 ms) on every node that hits the network. Leave On Error on Stop Workflow for nodes where any failure means the whole run is invalid; switch to Continue (using error output) only where partial success is genuinely acceptable, and always route that error branch somewhere. Build exactly one error workflow with an Error Trigger and a Slack alert, and assign it in Settings on every workflow you care about. Drop Stop And Error nodes at the points where bad or empty data should never pass.
That combination means transient blips self-heal, predictable bad data fails loudly, and anything you didn't anticipate still reaches you with a name, a node, a message, and a link. The goal is not zero failures - it is zero silent failures.
It is worth separating error handling from the noisier connectivity problems that masquerade as workflow errors. If your executions are dying with websocket or proxy symptoms rather than genuine node failures, that is an infrastructure issue, not an error-handling one - see our guide on the n8n connection lost error before you start adding retries to paper over it.
FAQ
What is an n8n error workflow?
It is an ordinary workflow whose trigger is the Error Trigger node. You assign it to other workflows via Settings -> Error workflow. When an assigned workflow fails, n8n runs the error workflow and passes it a payload describing the failure, which you typically forward to Slack or email.
How do I set up automatic retries in n8n?
Open the node, go to the Settings tab, enable Retry On Fail, then set Max Tries (up to 5) and Wait Between Tries in milliseconds (up to 5000). Apply it to network-facing nodes like HTTP Request, and avoid it for authentication or validation errors where retrying is pointless.
What is the difference between Continue and Continue (using error output)?
Continue silently passes the input through and moves on, which can hide missing data. Continue (using error output) adds a separate error branch so failed items are routed away from successful ones, letting you log or handle them explicitly while the rest of the batch proceeds.
Why isn't my error workflow firing?
Almost always because the failing workflow has no Error workflow assigned in its Settings, since there is no global default. Other causes: the failure is caught by a node's error output (so it never counts as a workflow failure), or the error happens in the trigger node before execution data exists.
Does Retry On Fail work together with the On Error setting?
Not reliably. If you enable Retry On Fail and also set On Error to a Continue option, the retry settings are ignored and the node continues on the first failure. Choose one behavior per node: retry, or continue.
When should I use the Stop And Error node?
When you want to fail on purpose because data is invalid or missing - for example, after detecting an empty result set. It throws either an Error Message string or an Error Object, halts the run, and triggers your error workflow, so a bad state surfaces as a real failure instead of a fake success.