When the Workflow Learns to Fail — Error Handling and Observability in n8n

Article 7 · Series: Getting Started with n8n

The classifier from Article 6 categorizes tickets with a local language model, routes critical cases to an alert, and runs without a cloud dependency. What it lacks is operational robustness. What happens when the model server doesn’t answer, when a run aborts mid-processing, or when a new workflow version replaces an old one while executions are in flight? This article makes the workflow production-ready: a global error workflow, an observability stack of Prometheus, Loki and Grafana, and finally a failure case that is especially treacherous because nobody notices it.

The code for this article is on Codeberg, tag v0.7: codeberg.org/rotecodefraktion/n8n-einstieg.

Three layers n8n already brings

Error handling in n8n is not a single feature; it is assembled from several layers. Three of them ship out of the box:

Node retry. Every node has a Retry On Fail under Settings, with Max Tries and Wait Between Tries. It acts within an execution with a fixed delay, before the workflow even counts as failed. Useful for HTTP and chat model nodes running against transient network errors.
Error workflow. A dedicated workflow that starts automatically when another one fails. This is the layer this article rests on: central classification, logging and alerting in one place, for any number of production workflows.
REST-driven re-run. POST /executions/{id}/retry restarts a failed execution, the same API the UI uses behind its retry button.
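
The node-retry layer can be pictured as a loop with a fixed pause between attempts. The following is a minimal sketch of that behavior, not n8n’s implementation: the function name retryOnFail, the options maxTries and waitBetweenTriesMs, and the injectable sleep are hypothetical names chosen to mirror the UI settings.

```javascript
// Illustration only: fixed-delay retry in the spirit of "Retry On Fail".
// All names are hypothetical; n8n's internal implementation may differ.
async function retryOnFail(fn, { maxTries = 3, waitBetweenTriesMs = 1000, sleep } = {}) {
  // Default pause uses a timer; callers can inject a no-op to avoid waiting.
  const pause = sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  let lastError;
  for (let attempt = 1; attempt <= maxTries; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      // Same delay after every failed attempt, no backoff.
      if (attempt < maxTries) await pause(waitBetweenTriesMs);
    }
  }
  // Only after the last attempt does the error propagate to the caller.
  throw lastError;
}
```

The point of the sketch is the ordering: retries are exhausted inside the node first, and only the final error propagates and makes the execution count as failed, which is when the error workflow comes into play.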

A fourth, production layer is deliberately absent here: a persistent queue with workers and automatic backoff. That comes with queue mode and Redis, and gets its own article (Article 11). A Postgres dead-letter table with a schedule trigger and home-grown backoff would be buildable here, but it duplicates functionality queue mode brings along. Article 7 stays with the three layers n8n offers without extra infrastructure.

The global error workflow

A workflow whose first node is an Error Trigger can be set as the error workflow of any other workflow, in that workflow’s settings. It then fires automatically on a failure. One error workflow can serve many workflows at once.

The v0.7-error-handler has three nodes in a row: Error Trigger, a code node for classification, a code node for the structured log. On a workflow failure the Error Trigger receives an object with execution and workflow. On a trigger activation error there is no execution, but a trigger.error instead. The classification node handles both cases and derives a severity from the error name:

// Mode: Run Once for Each Item
const input = $input.item.json;
const isActivationError = !input.execution;
const errorObj = isActivationError
  ? (input.trigger?.error ?? {})
  : (input.execution?.error ?? {});
const errorName = errorObj.name ?? 'UnknownError';

let severity = 'info';
if (errorName === 'NodeApiError') severity = 'critical';
else if (errorName === 'NodeOperationError') severity = 'warning';
else if (errorName === 'WorkflowOperationError') severity = 'critical';
else if (isActivationError) severity = 'critical';

return { json: {
  marker: 'n8n-error-workflow',
  severity, errorName,
  errorMessage: errorObj.message ?? '',
  workflowName: input.workflow?.name,
  executionId: input.execution?.id ?? null,
  timestamp: new Date().toISOString(),
} };

The mode Run Once for Each Item is mandatory here. $input.item is only defined in this mode; in Run Once for All Items the code runs against a non-existent item.
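
To try the classification outside of n8n, the same logic can be wrapped in a plain function and fed with sample payloads. This is a sketch under assumptions: the function name classify is made up, and the two payloads are hypothetical, reduced to the fields the classifier reads. Real Error Trigger data carries more fields.

```javascript
// Same classification logic as the code node, wrapped in a plain function
// so it runs under Node.js without n8n's $input helper.
function classify(input) {
  const isActivationError = !input.execution;
  const errorObj = isActivationError
    ? (input.trigger?.error ?? {})
    : (input.execution?.error ?? {});
  const errorName = errorObj.name ?? 'UnknownError';

  let severity = 'info';
  if (errorName === 'NodeApiError') severity = 'critical';
  else if (errorName === 'NodeOperationError') severity = 'warning';
  else if (errorName === 'WorkflowOperationError') severity = 'critical';
  else if (isActivationError) severity = 'critical';

  return { severity, errorName, executionId: input.execution?.id ?? null };
}

// Hypothetical sample payloads, invented for illustration.
const executionFailure = {
  execution: { id: '231', error: { name: 'NodeApiError', message: 'ECONNREFUSED' } },
  workflow: { id: '1', name: 'Ticket Classifier' },
};
const activationFailure = {
  trigger: { error: { name: 'WorkflowActivationError', message: 'webhook conflict' } },
  workflow: { id: '1', name: 'Ticket Classifier' },
};

console.log(classify(executionFailure));
console.log(classify(activationFailure));
```

Both samples come out as critical, the second one with executionId null. Note that the order of the branches matters: an activation error that happens to carry the name NodeOperationError is matched by the earlier branch and ends up as warning.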

The second code node writes the result as a single JSON line to standard output:

const payload = $input.item.json;
console.log(JSON.stringify(payload));
return { json: payload };

Here lurks the first trap. Since n8n 1.15.x, console.log from a code node no longer reaches the container’s standard output by default. The environment variable CODE_ENABLE_STDOUT=true re-enables it, but only for production executions. Manual runs from the UI go exclusively to the browser console. The marker we are about to search for in Loki therefore needs a real production trigger, not a step-execute in the editor.

Severity routing and a Telegram alert

On critical errors we want more than a log entry, we want an alert. Between classification and log sits a Switch node that routes on severity: a rule {{ $json.severity }} is equal to critical leads into the alert branch, a fallback output catches info and warning. Both branches reconverge at the log node.

In the critical branch sits a Telegram node. Telegram is pragmatic as an alert channel: free, no contract, quick to set up in a self-hosting environment. The bot token comes from @BotFather, the chat ID from a one-time getUpdates call. Important: the token belongs in an n8n credential, never in the node or the exported workflow. The message text is built from the payload as an expression, without a parse mode. Markdown or HTML would provoke a 400 Bad Request on special characters in an error message; plain text is more robust.
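
What the Telegram node ends up sending can be sketched as a plain request body for the Bot API’s sendMessage method. The function name buildTelegramAlert and the message layout are assumptions made for illustration, and the bot token is deliberately not part of the sketch because it lives in the credential.

```javascript
// Builds the body for Telegram's sendMessage method without sending it.
// Function name and message layout are illustrative assumptions.
function buildTelegramAlert(payload, chatId) {
  const text = [
    `[${payload.severity}] ${payload.workflowName ?? 'unknown workflow'}`,
    `${payload.errorName}: ${payload.errorMessage}`,
    `execution: ${payload.executionId ?? 'n/a'}`,
    payload.timestamp,
  ].join('\n');

  // No parse_mode key: Telegram then treats the text as plain text,
  // so characters such as _ * [ < stay literal instead of breaking the parse.
  return { method: 'sendMessage', body: { chat_id: chatId, text } };
}

// Hypothetical payload with characters that would trip a Markdown parser.
const alert = buildTelegramAlert({
  severity: 'critical',
  workflowName: 'Ticket Classifier',
  errorName: 'NodeApiError',
  errorMessage: 'connect ECONNREFUSED [::1]:8080 (model_server*)',
  executionId: '231',
  timestamp: '2026-01-01T00:00:00.000Z',
}, 123456789);
console.log(alert.body.text);
```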

Three pitfalls came up along the way, all three worth keeping in mind for any n8n 2.0 setup:

The error workflow must be published. In another workflow’s “Error Workflow” dropdown, a merely saved draft appears greyed out. n8n 2.0 separates Save from Publish, and production executions always run against the published version. Without a publish, the error workflow is not selectable.
Write node references exactly. A code node accessing another node via $('Classification') needs the name character-for-character. If it’s wrong, n8n reports Referenced node doesn't exist, and treacherously so: the error aborts the handler after the Telegram message is already out. The alert arrives, the log is missing, and the workflow counts as failed.
Don’t type a leading = yourself. n8n marks a field as an expression internally with a =. Type it yourself and it ends up as a visible character in the message.

Save and publish

The save-vs-publish model from n8n 2.0 is more than a detail. Save stores a draft version, Publish promotes a version to production. Production executions always run against the most recently published version. This addresses a real risk: editing live on an active workflow risks inconsistent runs.

Two consequences follow for operations, and both are easily overlooked. First, a workflow must be published to run in production at all, and to be selectable as an error workflow. Second, and this cost us a debugging detour: a change to a node’s code is not re-published automatically. Pure settings changes are; a code change needs an explicit publish (Shift + P). On an error workflow this surfaces especially late, because it only runs on failure. A saved but unpublished fix means production keeps running the old, broken version.

The execution log and its limits

n8n stores every run under Executions in Postgres. For simple debugging that suffices: status, duration, the failed node, its input data. Failed executions are only persisted if Save Failed Executions is enabled; otherwise execution.id and execution.url are absent in the error workflow.

What the built-in log does not provide: long-term trend analysis, correlation with other system events, alerting on anomalies. That is exactly where the external stack begins.

Metrics with Prometheus and Grafana

The observability stack comes as a separate compose override (docker-compose.observability.yml) that extends the n8n service with N8N_METRICS=true plus label toggles and adds four services: Prometheus, Loki, Grafana and a log collector.

Setting up the observability stack with Docker Compose

The stack extends the setup from Article 2 rather than replacing it: a second compose file plus configurations under docker/, all in the repo at tag v0.7:

docker/docker-compose.observability.yml   # n8n + N8N_METRICS, four new services
docker/prometheus/prometheus.yml          # scrape config for /metrics
docker/loki/loki-config.yml               # Loki, 7-day retention
docker/alloy/config.alloy                 # Alloy: container stdout → Loki
docker/grafana/provisioning/              # datasources + dashboard preconfigured
docker/grafana/dashboards/n8n.json        # the dashboard

The override adds N8N_METRICS=true (plus CODE_ENABLE_STDOUT=true) to the n8n service and brings up Prometheus, Loki, Grafana and Grafana Alloy as the log collector. A block for grafana.localhost goes into the Caddyfile. Both compose files are brought up together:

cd docker
docker compose -f docker-compose.yml -f docker-compose.observability.yml up -d

n8n and Postgres are recreated because the override changes their environment. The named volumes survive that, so no data is lost. Only Grafana is exposed externally through Caddy; Prometheus, Loki and Alloy stay internal.

Verify everything is running:

curl -sk https://localhost/metrics | head -5            # n8n metrics
curl -sk https://grafana.localhost/api/health           # {"database":"ok",...}
docker exec docker-loki-1 wget -qO- \
  'http://localhost:3100/loki/api/v1/label/service/values'   # {"data":["n8n","postgres"]}

Grafana opens at https://grafana.localhost, first login admin/admin with a password change. The dashboard sits in the n8n folder. Tear down with docker compose -f docker-compose.yml -f docker-compose.observability.yml down; down -v also removes the metric and dashboard volumes.

Once up, n8n exposes GET /metrics in Prometheus format:

curl -sk https://localhost/metrics | grep '^n8n_workflow_'

The workflow counters that matter for operations are there out of the box:

Metric	Meaning
`n8n_workflow_started_total`	runs started, per `workflow_id`
`n8n_workflow_success_total`	successful runs
`n8n_workflow_failed_total`	failed runs
`n8n_workflow_execution_duration_seconds`	run duration histogram

The workflow_id label comes from N8N_METRICS_INCLUDE_WORKFLOW_ID_LABEL=true. The provisioned Grafana dashboard “n8n Self-Hosted Overview” shows version, active workflows, execution rate, latency percentiles and duration metrics. Failed and success sit there deliberately in two separate panels with their own scale. A combined panel marginalizes the smaller series: with many successful and few failed runs, the failure line vanishes at the zero axis, exactly the signal you want to see.
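
The scale problem is easy to see with numbers. The following sketch sums the two counters from a /metrics response in Prometheus text format and computes the share of failed runs. The helper names sumCounter and failureRatio are made up, and the sample text with its workflow IDs and values is hypothetical.

```javascript
// Sums a counter across all label sets in Prometheus text exposition format.
function sumCounter(metricsText, metricName) {
  let total = 0;
  for (const line of metricsText.split('\n')) {
    if (line.startsWith('#') || line.trim() === '') continue; // HELP/TYPE lines
    // Matches "name value" or "name{labels} value".
    const match = line.match(/^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)/);
    if (match && match[1] === metricName) total += Number(match[3]);
  }
  return total;
}

// Share of failed runs among all finished runs (0 when nothing has run yet).
function failureRatio(metricsText) {
  const failed = sumCounter(metricsText, 'n8n_workflow_failed_total');
  const success = sumCounter(metricsText, 'n8n_workflow_success_total');
  const finished = failed + success;
  return finished === 0 ? 0 : failed / finished;
}

// Hypothetical sample; IDs and values are invented for illustration.
const sampleMetrics = `
# TYPE n8n_workflow_success_total counter
n8n_workflow_success_total{workflow_id="a1"} 198
n8n_workflow_success_total{workflow_id="b2"} 0
# TYPE n8n_workflow_failed_total counter
n8n_workflow_failed_total{workflow_id="a1"} 2
`;
console.log(failureRatio(sampleMetrics)); // 0.01
```

With 198 successes and 2 failures, the failed series sits at one percent of the shared axis, which is why the dashboard gives it its own panel.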

Logs with Loki and Alloy

n8n has no native Loki output. The built-in log streaming knows only webhook and sentry as destinations. The obvious solution, a Docker log-driver plugin, can block the Docker daemon when Loki is down; even the Grafana maintainer advises against it. The classic route, Promtail, has been end-of-life since March 2, 2026. Its successor is Grafana Alloy with loki.source.docker, which tails the container’s standard output over the Docker socket and ships it to Loki.

This makes the JSON line from the error workflow searchable:

{service="n8n"} |= "n8n-error-workflow"

Here CODE_ENABLE_STDOUT=true pays off: without that variable the code node writes nothing to standard output, and Alloy would have nothing to read.
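
The same query can also be issued against Loki’s HTTP API instead of through Grafana. The sketch below only builds the query_range URL. The function name buildLokiQueryUrl and its options are hypothetical, and the base URL is an assumption: Loki is not exposed externally in this setup, so the hostname depends on where the call runs.

```javascript
// Builds a query_range URL for Loki's HTTP API; it does not send the request.
// The base URL is an assumption, since Loki is only reachable internally here.
function buildLokiQueryUrl(baseUrl, marker, options = {}) {
  const { sinceMs = 60 * 60 * 1000, now = Date.now(), limit = 100 } = options;
  // Loki accepts start and end as Unix timestamps in nanoseconds.
  const toNanos = (ms) => (BigInt(ms) * 1000000n).toString();
  const params = new URLSearchParams({
    query: `{service="n8n"} |= "${marker}"`,
    start: toNanos(now - sinceMs),
    end: toNanos(now),
    limit: String(limit),
  });
  return `${baseUrl}/loki/api/v1/query_range?${params}`;
}

// 'http://loki:3100' is a hypothetical in-network address.
console.log(buildLokiQueryUrl('http://loki:3100', 'n8n-error-workflow'));
```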

The silent model failure

Now to the most treacherous failure case, and it does not belong in the global error workflow. The AI classifier from Article 6 has its Basic LLM Chain’s On Error set to Continue (using error output). When the model fails, say because the gateway doesn’t answer, the workflow doesn’t run into the error but into a fallback branch that files the ticket as sonstiges (German for “other”) and passes it on. This is intentional: the ticket flow doesn’t break just because a model is briefly gone.

The price of this graceful degradation is a blind spot. The workflow doesn’t fail, so the global error workflow doesn’t fire, so no alert goes out. On a gateway outage, all tickets simply land in sonstiges, and nobody finds out. We reproduced this in testing: gateway stopped, ticket sent, the webhook answers HTTP 200 with category: sonstiges. Cleanly processed, factually wrong.

The fix is to make the degradation observable without breaking the flow. Into the fallback branch goes a code node that writes its own marker to standard output:

// Mode: Run Once for Each Item
console.log(JSON.stringify({
  marker: 'n8n-ai-fallback',
  backend: 'mlx',
  ticketId: $('Normalize Input').item.json.id || null,
  workflowName: $workflow.name,
  reason: 'model unreachable or schema-invalid',
  timestamp: new Date().toISOString(),
}));
return $input.item;

We pull the ticket ID from Normalize Input on purpose, not from the current item: the fallback set node discards the unset fields, so the ID would be gone at this point. And not from the webhook node, because that one doesn’t run in the chat-trigger path, where the reference would break.

Via Alloy this marker also lands in Loki and gets two dashboard panels of its own: a fallback rate as bars (sum(count_over_time({service="n8n"} |= "n8n-ai-fallback" [5m]))) and an event log with the ticket ID. The silent failure is now visible, the ticket flow stays untouched. Real failover, where a second backend takes over, is the next step and a topic in its own right (Article 9).

Healthcheck

For external monitoring systems, GET /healthz is always on and returns {"status":"ok"}. An uptime check shows at a glance whether the instance is reachable. GET /healthz/readiness is only relevant in queue mode.

Verification: the failure smoke test

The error workflow only fires on real production failures; a manual run in the editor does not trigger it. To test, use a small workflow with a webhook trigger that fails on purpose, with v0.7-error-handler as its error workflow. Two variants show both severities:

# info: code node with throw new Error('smoke test') → UnknownError → severity info
curl -sk https://localhost/webhook/error-smoketest

# critical: HTTP request to a dead port → NodeApiError → severity critical → Telegram
curl -sk https://localhost/webhook/critical-smoketest

# the marker on stdout and the incremented counter
docker logs --since 1m docker-n8n-1 | grep n8n-error-workflow
curl -sk https://localhost/metrics | grep 'n8n_workflow_failed_total'

A bare throw new Error() yields UnknownError and thus severity: info. The critical path with the Telegram alert needs an error name that the classification node maps to critical: a NodeApiError, for instance from an HTTP request to an unreachable address, or a WorkflowOperationError. A NodeOperationError is classified as warning and only reaches the log. The error workflow itself reports success in the process, because it handled the error successfully. What gets counted as failed is the triggering workflow.
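
For an automated check, the grep step can be replaced by a small parser that pulls the marker events out of the raw log text and exposes their fields. This is a sketch under assumptions: the function name extractEvents is made up, the sample log lines are hypothetical, and it assumes each event is a single line whose JSON object runs to the end of the line.

```javascript
// Extracts marker events from raw container log text.
// Assumes one JSON object per line; any prefix in front of it is skipped.
function extractEvents(logText, marker) {
  const events = [];
  for (const line of logText.split('\n')) {
    if (!line.includes(marker)) continue;
    const start = line.indexOf('{');
    if (start === -1) continue;
    try {
      const event = JSON.parse(line.slice(start));
      if (event.marker === marker) events.push(event);
    } catch {
      // Not valid JSON after all (for example a truncated line); skip it.
    }
  }
  return events;
}

// Hypothetical log excerpt, invented for illustration.
const sampleLog = [
  'Editor is now accessible',
  '{"marker":"n8n-error-workflow","severity":"info","errorName":"UnknownError"}',
  '{"marker":"n8n-error-workflow","severity":"critical","errorName":"NodeApiError"}',
  '{"marker":"n8n-error-workflow","severity":"crit',
].join('\n');

console.log(extractEvents(sampleLog, 'n8n-error-workflow').map((e) => e.severity));
```

Run against the two smoke tests, a check like this would expect one info and one critical event within the last minute.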

What comes next

The workflow now fails in a controlled way, alerts on critical errors, and is observable through metrics and logs, including the silent fallback. What are currently two separate paths, the rule-based entrance and the AI classification, the next article merges into one pipeline (Article 8). The fourth retry layer, a persistent queue with automatic backoff, follows with queue mode in Article 11.