Troubleshooting
Work top-down: is the event arriving → is it passing the chain → is the handler firing → is the provider accepting it? Three tools answer almost everything: pod logs, /readyz, and the metrics.
kubectl logs -n tekton-events-relay deploy/tekton-events-relay -f
kubectl exec -n tekton-events-relay deploy/tekton-events-relay -- wget -qO- localhost:8080/readyz
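The two commands above cover logs and /readyz; for the metrics, a minimal sketch of reading one counter follows. It assumes the relay serves Prometheus text-format metrics at /metrics on the same port 8080, which this page does not state, and `metric_sum` is a hypothetical helper, not part of the relay.

```shell
# metric_sum NAME: sum every series of one metric (all label sets) from
# Prometheus text-format metrics on stdin. Hypothetical helper.
metric_sum() {
  awk -v name="$1" '
    /^#/ { next }                  # skip HELP/TYPE comment lines
    {
      line = $0
      sub(/\{.*\}/, "", line)      # drop the {label="..."} part, if any
      split(line, f, " ")          # f[1] = metric name, f[2] = value
      if (f[1] == name) total += f[2]
    }
    END { print total + 0 }        # prints 0 when the metric is absent
  '
}

# Usage (ASSUMPTION: metrics are exposed at /metrics on port 8080):
# kubectl exec -n tekton-events-relay deploy/tekton-events-relay -- \
#   wget -qO- localhost:8080/metrics | metric_sum events_received_total
```

Running it twice a minute apart shows whether a counter is actually moving.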
Nothing happens at all
| Check | How | Fix |
|---|---|---|
| Tekton is sending events | events_received_total increasing? Any cloudevent_request_started logs? | Set default-cloud-events-sink in the config-defaults ConfigMap (tekton-pipelines ns) to the relay Service URL; check NetworkPolicies between namespaces. |
| Events arrive but are dropped early | Log no decoder registered for event type + events_unsupported_type_total | Unknown CloudEvent type (often after a Tekton upgrade) — open an issue with the type. |
| Resource type filtered | events_filtered_total | Enable the type under filter: (allow_taskrun: true, …). |
| Missing annotations | Log missing annotation tekton.dev/tekton-events-relay.scm.provider | Annotate the PipelineRun in your TriggerTemplate. |
| Wrong provider name | Dispatcher log no handlers processed event | scm.provider must equal a configured instance name exactly. |
| Handler exists but when never matches | Enable logging.level: debug | Test the CEL expression; remember states are lowercase. |
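To see quickly which row applies, the log messages named in the table can be counted in one pass. This is a minimal sketch: `triage_logs` is a hypothetical helper, and it assumes the relay logs these messages verbatim as quoted above.

```shell
# triage_logs: read relay logs on stdin and count the messages named in
# the table above. Hypothetical helper; matches fixed substrings only.
triage_logs() (
  logs=$(cat)
  for msg in \
    "cloudevent_request_started" \
    "no decoder registered for event type" \
    "missing annotation" \
    "no handlers processed event"
  do
    # grep -c prints 0 and exits non-zero when nothing matches; ignore that.
    count=$(printf '%s\n' "$logs" | grep -cF -- "$msg" || true)
    printf '%s  %s\n' "$count" "$msg"
  done
)

# Usage:
# kubectl logs -n tekton-events-relay deploy/tekton-events-relay --since=10m | triage_logs
```

A zero next to cloudevent_request_started points at the first row (Tekton is not sending); a non-zero count on one of the other messages points at the matching row.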
Provider rejects the call
/readyz shows the last error per handler — start there.
| Symptom | Cause | Fix |
|---|---|---|
| 401 Bad credentials / 403 | expired or under-scoped token | Rotate the Secret (hot reload picks it up; the webhook/grafana/sentry/jira notifiers re-read the mounted secret per request, so the new value applies immediately). For webhook/jira you can instead use OAuth2 client credentials (auth.oauth2 + token_url) so the relay auto-refreshes the token before expiry. Check scopes on the provider page. |
| 404 on status/comment | wrong owner/name/project annotations, or token can’t see the repo | Compare annotations against the repo URL; remember GitLab prefers repo-id. |
| 422 / validation error | field limits (context/description/label length) | Shorten the template; limits are provider-specific. |
| Frequent 429 | provider rate limiting | Watch notifier_rate_limit_hits_total{host}; the retry policy honors Retry-After. If sustained, reduce event volume per action with when/filters, or use a GitHub App (higher limits). |
| Self-signed TLS errors | private CA | Mount the CA and configure the client; avoid insecure_skip_verify. |
Permanent failures are preserved in the DLQ when enabled — after fixing credentials, POST /api/v1/dlq/replay.
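A minimal sketch of the replay call follows. It assumes the DLQ API is served on the same port 8080 as /readyz and needs no extra authentication; the source gives only the path, so check both for your deployment. `dlq_replay` is a hypothetical helper.

```shell
# dlq_replay [BASE_URL]: ask the relay to replay dead-lettered events.
# Hypothetical helper. The default base URL ASSUMES a local port-forward
# to port 8080, which this page does not confirm for the DLQ endpoint.
dlq_replay() {
  base="${1:-http://localhost:8080}"
  # -f: non-zero exit on HTTP >= 400; -sS: quiet, but still print errors.
  curl -fsS -X POST "${base}/api/v1/dlq/replay"
}

# Usage, from a workstation:
# kubectl port-forward -n tekton-events-relay deploy/tekton-events-relay 8080:8080 &
# dlq_replay
```

Because of `-f`, a rejected replay surfaces as a non-zero exit code rather than a silent error page.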
Duplicate or missing notifications
| Symptom | Cause | Fix |
|---|---|---|
| Duplicate comments, replicaCount > 1 | per-pod memory store: retransmissions land on another replica | Use a shared store (valkey/olric), or run 1 replica. mode: upsert also neutralizes duplicates for comments. |
| Duplicates after pod restart | memory store lost | Same as above. |
| store_errors_total rising + occasional duplicates | store backend down; the relay fails open | Fix Valkey/Olric connectivity; events were delivered, only dedup degraded. |
| deduper_evictions_total climbing | cache smaller than event volume | Raise dedupe_size (memory) / rely on TTL-based remote backends. |
| Events silently missing under load | back-pressure | events_backpressure_total — these return 503 and Tekton retransmits; check what’s slow via notifier_latency_seconds. |
| One slow provider delays everything | not possible by design | handler_timeout (default 10s) bounds each handler, so a slow provider cannot block the others; check handler_timeouts_total to see which handlers hit the limit. |
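Several rows above depend on whether a counter is rising. One way to check is to diff two saved metric snapshots; the sketch below prints every series whose value grew. `rising_counters` is a hypothetical helper, and the usage assumes Prometheus text-format metrics at /metrics on port 8080, which this page does not state.

```shell
# rising_counters BEFORE AFTER: print every series whose value grew between
# two saved metric snapshots (Prometheus text format). Hypothetical helper.
# Assumes sample lines end with the value (no trailing timestamp) and that
# BEFORE is not empty.
rising_counters() {
  awk '
    /^#/ || NF < 2 { next }               # skip comments and blank lines
    {
      value = $NF
      series = $0
      sub(/[ \t]+[^ \t]+$/, "", series)   # series = line minus the value
    }
    FNR == NR { before[series] = value; next }   # first file: remember
    (value + 0) > (before[series] + 0) {         # second file: compare
      printf "%s +%s\n", series, value - before[series]
    }
  ' "$1" "$2"
}

# Usage (ASSUMPTION: metrics at /metrics on port 8080):
# snap() { kubectl exec -n tekton-events-relay deploy/tekton-events-relay -- \
#            wget -qO- localhost:8080/metrics; }
# snap > before.txt; sleep 60; snap > after.txt
# rising_counters before.txt after.txt
```

Series that appear only in the second snapshot are reported with their full value as the increase.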
Config & deploy issues
| Symptom | Fix |
|---|---|
| Pod CrashLoops at start | tekton-events-relay --validate --config … against the rendered ConfigMap; the error message names the bad key. helm install already schema-validates values. |
| Edited the ConfigMap, nothing changed | Hot reload only applies valid configs — check config_reloads_total{result="failure"} and the config reload: log line. server/store/dlq/logging/tracing changes need a restart. |
| verbose options require logging.level to be 'debug' | Exactly that: set the level or drop the verbose flags. |
| Olric pods don’t form a cluster | Pod-to-pod 3320/tcp + 3322/tcp+udp must be open (the chart’s NetworkPolicy handles it when backend: olric; check other policies/CNI). |
Debugging one event end-to-end
- logging.level: debug (+ verbose.payloads: true if needed; secrets are redacted).
- Trigger the run; grep the logs for the CloudEvent ID (ce_id).
- You’ll see: received → decoder → each chain step → per-handler success/failure with the provider’s response.
- With tracing.endpoint set, the same journey is one trace with per-handler spans.
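The grep step can be sketched as a small filter. `event_journey` is a hypothetical helper; it matches the ID as a fixed string and assumes only that the ID appears on every relevant log line.

```shell
# event_journey CE_ID: keep only the log lines that mention one CloudEvent,
# in arrival order. Hypothetical helper; makes no assumption about the log
# format beyond the ID appearing on each relevant line.
event_journey() {
  grep -F -- "$1"
}

# Usage (replace <ce_id> with the ID of the event you are tracing):
# kubectl logs -n tekton-events-relay deploy/tekton-events-relay --since=15m \
#   | event_journey "<ce_id>"
```

`kubectl logs deploy/…` reads from a single pod, so with several replicas repeat the command per pod until the ID shows up.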
Still stuck? Open an issue with the relay version, the ce_id log excerpt and your (redacted) config.