Troubleshooting

Work top-down: is the event arriving → is it passing the chain → is the handler firing → is the provider accepting it? Three tools answer almost everything: pod logs, /readyz, and the metrics.

kubectl logs -n tekton-events-relay deploy/tekton-events-relay -f
kubectl exec -n tekton-events-relay deploy/tekton-events-relay -- wget -qO- localhost:8080/readyz
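
For the third tool, the metrics, it can help to cut the output down to the counters this guide refers to. The helper below is a minimal sketch: `relay_counters` is a hypothetical name, and the `/metrics` path in the usage comment is an assumption (only `/readyz` on port 8080 is confirmed above), so adjust it to wherever your deployment exposes metrics.

```shell
# Keep only the relay counters mentioned in this guide from a Prometheus
# text exposition read on stdin. Comment lines (# HELP / # TYPE) are dropped.
relay_counters() {
  grep -E '^(events_received_total|events_unsupported_type_total|events_filtered_total|events_backpressure_total|store_errors_total|deduper_evictions_total|handler_timeouts_total|notifier_rate_limit_hits_total)[{ ]'
}

# Usage (assumption: metrics are served at /metrics on the same port as /readyz):
#   kubectl exec -n tekton-events-relay deploy/tekton-events-relay -- \
#     wget -qO- localhost:8080/metrics | relay_counters
```

Running it twice a few seconds apart shows which counters are moving, which is usually enough to tell where in the top-down chain an event stops.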

Nothing happens at all

Check	How	Fix
Tekton is sending events	`events_received_total` increasing? Any `cloudevent_request_started` logs?	Set `default-cloud-events-sink` in the `config-defaults` ConfigMap (`tekton-pipelines` ns) to the relay Service URL; check NetworkPolicies between namespaces.
Events arrive but are dropped early	Log `no decoder registered for event type` + `events_unsupported_type_total`	Unknown CloudEvent type (often after a Tekton upgrade) — open an issue with the type.
Resource type filtered	`events_filtered_total`	Enable the type under `filter:` (`allow_taskrun: true`, …).
Missing annotations	Log `missing annotation tekton.dev/tekton-events-relay.scm.provider`	Annotate the PipelineRun in your TriggerTemplate.
Wrong provider name	Dispatcher log `no handlers processed event`	`scm.provider` must equal a configured instance `name` exactly.
Handler exists but `when` never matches	Set `logging.level: debug`	Test the CEL expression; remember states are lowercase.
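
The log-based checks in the table above can be run in one pass. The following is a sketch only: `diagnose_drop` is a hypothetical helper, the message strings are copied from the table, and nothing is assumed about the log layout around them beyond the message appearing verbatim.

```shell
# Read relay logs on stdin and print a hint for each known drop reason found.
# Returns 0 if at least one reason matched, 1 otherwise.
diagnose_drop() {
  dd_logs=$(cat)
  dd_rc=1
  while IFS='|' read -r dd_pattern dd_hint; do
    if printf '%s\n' "$dd_logs" | grep -qF -- "$dd_pattern"; then
      printf '%s: %s\n' "$dd_pattern" "$dd_hint"
      dd_rc=0
    fi
  done <<'EOF'
no decoder registered for event type|unknown CloudEvent type, open an issue with the type
missing annotation tekton.dev/tekton-events-relay.scm.provider|annotate the PipelineRun in your TriggerTemplate
no handlers processed event|scm.provider must equal a configured instance name
EOF
  return "$dd_rc"
}

# Usage:
#   kubectl logs -n tekton-events-relay deploy/tekton-events-relay | diagnose_drop
```

If nothing matches and `events_received_total` is not increasing either, the event is most likely not reaching the relay at all, which points back to the first row.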

Provider rejects the call

/readyz shows the last error per handler — start there.

Symptom	Cause	Fix
`401 Bad credentials` / `403`	expired or under-scoped token	Rotate the Secret (hot reload picks it up; the webhook/grafana/sentry/jira notifiers re-read the mounted secret per request, so the new value applies immediately). For webhook/jira you can instead use OAuth2 client credentials (`auth.oauth2` + `token_url`) so the relay auto-refreshes the token before expiry. Check scopes on the provider page.
`404` on status/comment	wrong owner/name/project annotations, or token can’t see the repo	Compare annotations against the repo URL; remember GitLab prefers `repo-id`.
`422` / validation error	field limits (context/description/label length)	Shorten the template; limits are provider-specific.
Frequent `429`	provider rate limiting	Watch `notifier_rate_limit_hits_total{host}`; the retry policy honors `Retry-After` — if sustained, reduce event volume per action with `when`/filters, or use a GitHub App (higher limits).
Self-signed TLS errors	private CA	Mount the CA and configure the client; avoid `insecure_skip_verify`.

Permanent failures are preserved in the DLQ when enabled — after fixing credentials, POST /api/v1/dlq/replay.
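
A replay request might look like the sketch below. The endpoint path comes from the sentence above; `dlq_replay`, the `DRY_RUN` switch and the base URL are illustrative assumptions, and the sketch does not cover any authentication your deployment may require on that endpoint.

```shell
# POST to the DLQ replay endpoint.
# $1 is the base URL where the relay's HTTP port is reachable from your shell
# (assumption: for example through a kubectl port-forward).
# Set DRY_RUN=1 to print the request instead of sending it.
dlq_replay() {
  dlq_base=${1:?usage: dlq_replay <base-url>}
  dlq_url="${dlq_base%/}/api/v1/dlq/replay"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf 'curl -fsS -X POST %s\n' "$dlq_url"
  else
    curl -fsS -X POST "$dlq_url"
  fi
}

# Usage:
#   DRY_RUN=1 dlq_replay http://localhost:8080   # inspect first
#   dlq_replay http://localhost:8080             # then send
```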

Duplicate or missing notifications

Symptom	Cause	Fix
Duplicate comments, `replicaCount > 1`	per-pod `memory` store: retransmissions land on another replica	Use a shared store (valkey/olric) — or 1 replica. `mode: upsert` also neutralizes duplicates for comments.
Duplicates after pod restart	memory store lost	Same as above.
`store_errors_total` rising + occasional duplicates	store backend down — relay fails open	Fix Valkey/Olric connectivity; events were delivered, only dedup degraded.
`deduper_evictions_total` climbing	cache smaller than event volume	Raise `dedupe_size` (memory store), or use a remote backend, which expires entries by TTL.
Events silently missing under load	back-pressure	Watch `events_backpressure_total`: overloaded requests get a 503 and Tekton retransmits them; find the slow provider via `notifier_latency_seconds`.
One slow provider delays everything	not possible by design	`handler_timeout` (default 10s) bounds each handler; timeouts are counted in `handler_timeouts_total`.

Config & deploy issues

Symptom	Fix
Pod CrashLoops at start	Run `tekton-events-relay --validate --config …` against the rendered ConfigMap; the error message names the bad key. `helm install` already schema-validates values.
Edited the ConfigMap, nothing changed	Hot reload only applies valid configs — check `config_reloads_total{result="failure"}` and the `config reload:` log line. `server`/`store`/`dlq`/`logging`/`tracing` changes need a restart.
`verbose options require logging.level to be 'debug'`	Exactly that — set the level or drop the verbose flags.
Olric pods don’t form a cluster	Pod-to-pod 3320/tcp + 3322/tcp+udp must be open (the chart’s NetworkPolicy handles it when `backend: olric`; check other policies/CNI).
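
To check the reload counter from the table above without a dashboard, the failures can be summed straight from the metrics text. This is a rough sketch: `reload_failures` is a hypothetical helper, and it assumes the sample value is the last field on the line (no trailing timestamp, no spaces inside label values).

```shell
# Sum config_reloads_total{result="failure"} across all series in a
# Prometheus text exposition read from stdin, and print the total.
reload_failures() {
  awk '
    index($0, "config_reloads_total{") == 1 && index($0, "result=\"failure\"") > 0 {
      sum += $NF   # last field is the sample value
    }
    END { printf "%d\n", sum }
  '
}
```

A total that grows after each ConfigMap edit means the new config is being rejected and the relay is still running the previous valid one.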

Debugging one event end-to-end

Set logging.level: debug (plus verbose.payloads: true if needed; secrets are redacted).
Trigger the run; grep the logs for the CloudEvent ID (ce_id).
You’ll see: received → decoder → each chain step → per-handler success/failure with the provider’s response.
With tracing.endpoint set, the same journey is one trace with per-handler spans.
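
The grep step above can be wrapped in a small helper. A minimal sketch, assuming the CloudEvent ID appears verbatim in every related log line; `trace_event` is a hypothetical name, and `--since=10m` in the usage comment is just an example window.

```shell
# Print every log line on stdin that mentions the given CloudEvent ID.
# Fixed-string matching (-F) avoids surprises from regex characters in the ID.
trace_event() {
  te_id=${1:?usage: trace_event <ce_id>}
  grep -F -- "$te_id"
}

# Usage:
#   kubectl logs -n tekton-events-relay deploy/tekton-events-relay --since=10m \
#     | trace_event <ce_id>
```

With more than one replica, `deploy/…` picks a single pod, so repeat the command per pod if the event does not show up.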

Still stuck? Open an issue with the relay version, the ce_id log excerpt and your (redacted) config.