Troubleshooting

Work top-down: is the event arriving → is it passing the chain → is the handler firing → is the provider accepting it? Three tools answer almost everything: pod logs, /readyz, and the metrics.

kubectl logs -n tekton-events-relay deploy/tekton-events-relay -f
kubectl exec -n tekton-events-relay deploy/tekton-events-relay -- wget -qO- localhost:8080/readyz

Nothing happens at all

| Check | How | Fix |
|---|---|---|
| Tekton is sending events | events_received_total increasing? Any cloudevent_request_started logs? | Set default-cloud-events-sink in the config-defaults ConfigMap (tekton-pipelines ns) to the relay Service URL; check NetworkPolicies between namespaces. |
| Events arrive but are dropped early | Log "no decoder registered for event type" + events_unsupported_type_total | Unknown CloudEvent type (often after a Tekton upgrade): open an issue with the type. |
| Resource type filtered | events_filtered_total | Enable the type under filter: (allow_taskrun: true, …). |
| Missing annotations | Log "missing annotation tekton.dev/tekton-events-relay.scm.provider" | Annotate the PipelineRun in your TriggerTemplate. |
| Wrong provider name | Dispatcher log "no handlers processed event" | scm.provider must equal a configured instance name exactly. |
| Handler exists but when never matches | Enable logging.level: debug | Test the CEL expression; remember states are lowercase. |
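
The first fix in the table can be sketched as a merge patch against Tekton's config-defaults ConfigMap. The function names set_relay_sink and show_relay_sink and the example Service URL in the comment are illustrative assumptions, not part of the relay; substitute the URL your relay Service actually exposes.

```shell
# Hypothetical helper: point Tekton's default CloudEvents sink at the relay.
# Assumes Tekton Pipelines runs in the tekton-pipelines namespace, as above.
set_relay_sink() {
  # e.g. http://tekton-events-relay.tekton-events-relay.svc:8080 (assumed URL)
  sink_url="$1"
  kubectl patch configmap config-defaults -n tekton-pipelines \
    --type merge \
    -p "{\"data\":{\"default-cloud-events-sink\":\"${sink_url}\"}}"
}

# Hypothetical helper: print the configured sink to confirm the change.
show_relay_sink() {
  kubectl get configmap config-defaults -n tekton-pipelines \
    -o jsonpath='{.data.default-cloud-events-sink}'
}
```

Call set_relay_sink with the relay URL, then show_relay_sink to confirm the value before triggering a test run.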

Provider rejects the call

/readyz shows the last error per handler — start there.

| Symptom | Cause | Fix |
|---|---|---|
| 401 Bad credentials / 403 | Expired or under-scoped token | Rotate the Secret (hot reload picks it up; the webhook/grafana/sentry/jira notifiers re-read the mounted secret per request, so the new value applies immediately). For webhook/jira you can instead use OAuth2 client credentials (auth.oauth2 + token_url) so the relay auto-refreshes the token before expiry. Check scopes on the provider page. |
| 404 on status/comment | Wrong owner/name/project annotations, or token can't see the repo | Compare annotations against the repo URL; remember GitLab prefers repo-id. |
| 422 / validation error | Field limits (context/description/label length) | Shorten the template; limits are provider-specific. |
| Frequent 429 | Provider rate limiting | Watch notifier_rate_limit_hits_total{host}; the retry policy honors Retry-After. If sustained, reduce event volume per action with when/filters, or use a GitHub App (higher limits). |
| Self-signed TLS errors | Private CA | Mount the CA and configure the client; avoid insecure_skip_verify. |

When the DLQ is enabled, permanent failures are preserved there. After fixing credentials, replay them with POST /api/v1/dlq/replay.
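
A minimal sketch of triggering the replay from inside the pod, reusing the wget pattern from the /readyz check. It assumes the replay endpoint is served on the same port as /readyz (8080), needs no extra authentication, and that the image's wget supports --post-data (BusyBox and GNU wget both do). The function name replay_dlq is hypothetical.

```shell
# Hypothetical helper: ask the relay to replay its dead-letter queue.
# Assumption: the API listens on localhost:8080 inside the pod, like /readyz.
replay_dlq() {
  # An empty --post-data turns the request into a POST with no body.
  kubectl exec -n tekton-events-relay deploy/tekton-events-relay -- \
    wget -qO- --post-data='' localhost:8080/api/v1/dlq/replay
}
```

If the endpoint is exposed elsewhere in your deployment, adjust the host and port accordingly.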

Duplicate or missing notifications

| Symptom | Cause | Fix |
|---|---|---|
| Duplicate comments, replicaCount > 1 | Per-pod memory store: retransmissions land on another replica | Use a shared store (valkey/olric), or 1 replica. mode: upsert also neutralizes duplicates for comments. |
| Duplicates after pod restart | Memory store lost | Same as above. |
| store_errors_total rising + occasional duplicates | Store backend down; relay fails open | Fix Valkey/Olric connectivity; events were delivered, only dedup degraded. |
| deduper_evictions_total climbing | Cache smaller than event volume | Raise dedupe_size (memory) / rely on TTL-based remote backends. |
| Events silently missing under load | Back-pressure | Check events_backpressure_total: these return 503 and Tekton retransmits; check what's slow via notifier_latency_seconds. |
| One slow provider delays everything | No | It can't: handler_timeout (default 10s) bounds each handler; see handler_timeouts_total. |
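
Several rows above come down to "is this counter climbing?". The sketch below sums all series of one metric from Prometheus text-format input, so two samples taken a minute apart can be compared. The function name sum_metric is hypothetical, and the /metrics path on port 8080 mentioned in the comment is an assumption; use whatever endpoint your deployment scrapes.

```shell
# Hypothetical helper: sum every series of one metric read from stdin
# (Prometheus text exposition format, sample lines without timestamps).
# Example, assuming metrics are served at localhost:8080/metrics in the pod:
#   kubectl exec -n tekton-events-relay deploy/tekton-events-relay -- \
#     wget -qO- localhost:8080/metrics | sum_metric events_backpressure_total
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                        # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/\{.*/, "", metric)            # drop the label set, keep the bare name
      if (metric == name) total += $NF   # the sample value is the last field
    }
    END { printf "%d\n", total }
  '
}
```

Run it twice against the same pod and compare the two numbers; a growing difference for store_errors_total, deduper_evictions_total, or events_backpressure_total points at the matching row.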

Config & deploy issues

| Symptom | Fix |
|---|---|
| Pod CrashLoops at start | Run tekton-events-relay --validate --config … against the rendered ConfigMap; the error message names the bad key. helm install already schema-validates values. |
| Edited the ConfigMap, nothing changed | Hot reload only applies valid configs: check config_reloads_total{result="failure"} and the "config reload:" log line. server/store/dlq/logging/tracing changes need a restart. |
| Error "verbose options require logging.level to be 'debug'" | Exactly that: set the level or drop the verbose flags. |
| Olric pods don't form a cluster | Pod-to-pod 3320/tcp + 3322/tcp+udp must be open (the chart's NetworkPolicy handles it when backend: olric; check other policies/CNI). |

Debugging one event end-to-end

  1. logging.level: debug (+ verbose.payloads: true if needed — secrets are redacted).
  2. Trigger the run; grep the logs for the CloudEvent ID (ce_id).
  3. You’ll see: received → decoder → each chain step → per-handler success/failure with the provider’s response.
  4. With tracing.endpoint set, the same journey is one trace with per-handler spans.
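
Step 2 can be sketched as a small wrapper around kubectl logs. The function name trace_event and the one-hour --since window are illustrative choices, not relay features.

```shell
# Hypothetical helper: print every relay log line mentioning one CloudEvent ID.
trace_event() {
  ce_id="$1"
  # -F matches the ID literally; -- guards against IDs that start with a dash.
  kubectl logs -n tekton-events-relay deploy/tekton-events-relay --since=1h \
    | grep -F -- "$ce_id"
}
```

Note that kubectl logs deploy/… reads from a single pod; with more than one replica, repeat the search per pod or select pods by label.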

Still stuck? Open an issue with the relay version, the ce_id log excerpt and your (redacted) config.