Observability
Endpoints
| Path | Purpose |
|---|---|
| / | CloudEvents receiver |
| /metrics | Prometheus metrics (also on server.metrics_addr if set; the chart ships a ServiceMonitor) |
| /healthz | Liveness: always 200 while the process is serving |
| /readyz | Readiness + per-handler diagnostics (JSON) |
| /api/v1/dlq · /api/v1/dlq/replay | Dead letter queue |
/readyz
Returns 503 until handlers and decoders are registered, then 200 with live per-handler state. Check it first when notifications stop arriving:
{
  "status": "ok",
  "handlers": {
    "github": {
      "last_event_at": "2026-06-11T12:01:33Z",
      "succeeded": 1041,
      "failed": 3,
      "last_error": "github API returned 401: Bad credentials",
      "last_error_at": "2026-06-11T11:58:02Z"
    },
    "slack": { "last_event_at": "2026-06-11T12:01:33Z", "succeeded": 87, "failed": 0 }
  }
}
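To triage a response like the one above from a script, the per-handler state can be reduced to the handlers that report failures. This is a minimal sketch that assumes the field names shown in the sample (handlers, failed, last_error); the function name unhealthy_handlers is hypothetical, and fetching the body from /readyz is left to the caller.

```python
import json

# Sample body copied from the /readyz example in this section.
SAMPLE_READYZ = """
{
  "status": "ok",
  "handlers": {
    "github": {
      "last_event_at": "2026-06-11T12:01:33Z",
      "succeeded": 1041,
      "failed": 3,
      "last_error": "github API returned 401: Bad credentials",
      "last_error_at": "2026-06-11T11:58:02Z"
    },
    "slack": {"last_event_at": "2026-06-11T12:01:33Z", "succeeded": 87, "failed": 0}
  }
}
"""


def unhealthy_handlers(readyz_body: str) -> dict[str, str]:
    """Map handler name -> last_error for every handler that reports failures.

    Hypothetical helper: it relies only on the fields visible in the sample.
    """
    report = json.loads(readyz_body)
    problems = {}
    for name, state in report.get("handlers", {}).items():
        if state.get("failed", 0) > 0 or state.get("last_error"):
            problems[name] = state.get("last_error", "unknown error")
    return problems


if __name__ == "__main__":
    for handler, error in unhealthy_handlers(SAMPLE_READYZ).items():
        print(f"{handler}: {error}")
```

With the sample body, only github is reported, because slack has failed set to 0 and no last_error.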
Prometheus metrics
All relay metrics carry the tekton_events_relay_ prefix.
Event flow
| Metric | Labels | Meaning |
|---|---|---|
| events_received_total | type, source | CloudEvents accepted by the receiver |
| events_processed_total | handler, status | Chain-step / handler outcomes (success/error) |
| events_filtered_total | reason | Dropped by the resource-type filter |
| events_unsupported_type_total | type | CloudEvent types with no decoder (watch after Tekton upgrades) |
| events_backpressure_total | none | Events answered with 503 (Tekton will retransmit) |
| errors_permanent_total | reason | Non-retryable chain failures (these go to the DLQ when enabled) |
| pipeline_errors_total | stage | Internal chain errors per stage |
Latency
| Metric | Labels |
|---|---|
| chain_duration_seconds | result |
| handler_duration_seconds | handler |
| notifier_latency_seconds | handler, action |
| handler_timeouts_total | handler |
Deduplication & state
| Metric | Labels | Meaning |
|---|---|---|
| deduper_hits_total | none | Duplicates dropped |
| dedupe_cache_size | none | Current entries (memory backend) |
| deduper_evictions_total | none | LRU evictions; sustained growth means dedupe_size is too small |
| store_errors_total | backend, op | State-backend failures (relay failed open) |
Outbound reliability
| Metric | Labels | Meaning |
|---|---|---|
| notifier_retries_total | host, reason | Retries (rate_limit, server_error, timeout, network_error) |
| notifier_rate_limit_hits_total | host | HTTP 429 received per destination |
| dlq_size / dlq_enqueued_total | none | Dead letter queue depth / inflow |
Operations
| Metric | Labels | Meaning |
|---|---|---|
| config_reloads_total | result | Hot reload attempts (success/failure) |
| handlers_registered | none | Handlers built from config |
Standard HTTP server metrics (http_request_duration_seconds, http_requests_total, http_requests_in_flight) and Go runtime/process collectors are also exported.
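For a quick look without a Prometheus server, the text served on /metrics can be filtered down to the relay's own series. The sketch below is a simplified parser for the Prometheus text exposition format, not a full implementation; the sample scrape, its HELP text and the helper names (relay_samples, errors_by_handler) are made up for illustration. Only the prefix, metric names, labels and status values come from the tables above.

```python
import re

PREFIX = "tekton_events_relay_"

# name, optional {labels}, then the sample value (a trailing timestamp is ignored)
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Made-up scrape output used only to exercise the parser.
SAMPLE_METRICS = """\
# HELP tekton_events_relay_dlq_size illustrative help text
# TYPE tekton_events_relay_dlq_size gauge
tekton_events_relay_dlq_size 2
tekton_events_relay_events_processed_total{handler="github",status="error"} 3
tekton_events_relay_events_processed_total{handler="github",status="success"} 1041
go_goroutines 12
"""


def relay_samples(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Return (short name, labels, value) for every relay-prefixed sample."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match or not match.group(1).startswith(PREFIX):
            continue  # not a relay metric (e.g. Go runtime collectors)
        labels = dict(LABEL_RE.findall(match.group(2) or ""))
        samples.append((match.group(1)[len(PREFIX):], labels, float(match.group(3))))
    return samples


def errors_by_handler(text: str) -> dict[str, float]:
    """Sum events_processed_total samples with status="error" per handler."""
    totals: dict[str, float] = {}
    for name, labels, value in relay_samples(text):
        if name == "events_processed_total" and labels.get("status") == "error":
            handler = labels.get("handler", "")
            totals[handler] = totals.get(handler, 0.0) + value
    return totals


if __name__ == "__main__":
    print(errors_by_handler(SAMPLE_METRICS))
```

For anything beyond ad hoc checks, a real exposition-format parser or a Prometheus query is the safer choice; this version does not handle every escaping rule.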
Alerting starting points
# permanent failures appearing
increase(tekton_events_relay_errors_permanent_total[10m]) > 0
# DLQ filling up — broken credential or config
tekton_events_relay_dlq_size > 0
# being rate-limited by a provider
increase(tekton_events_relay_notifier_rate_limit_hits_total[5m]) > 10
# dedupe degraded (store down, failing open)
increase(tekton_events_relay_store_errors_total[5m]) > 0
# config rollout broke the config
increase(tekton_events_relay_config_reloads_total{result="failure"}[15m]) > 0
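Before turning these expressions into alert rules, it can help to run them by hand against the Prometheus HTTP API (instant queries are served at /api/v1/query). The sketch below only builds the request URL and interprets a response body; the base URL, the job label and the helper names are assumptions for illustration. Because a comparison such as > 0 filters the result, any series returned means the condition currently holds.

```python
import json
from urllib.parse import urlencode

DLQ_EXPR = "tekton_events_relay_dlq_size > 0"


def query_url(base_url: str, expr: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def matching_series(response_body: str) -> list[dict[str, str]]:
    """Return the label sets of all series that satisfy the queried expression."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    return [item["metric"] for item in payload["data"]["result"]]


# Illustrative response in the shape the API returns for an instant vector.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"job": "relay"}, "value": [1718107293, "4"]}],
    },
})

if __name__ == "__main__":
    # "http://prometheus:9090" is a placeholder address.
    print(query_url("http://prometheus:9090", DLQ_EXPR))
    print(matching_series(SAMPLE_RESPONSE))
```

Fetching the URL (for example with urllib.request.urlopen) and passing the body to matching_series completes the check.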
Logging
Structured JSON via zap. Setting logging.level: debug unlocks the verbose switches (caller, http_calls, payloads); known secret keys are redacted from payloads. Every request gets an X-Request-ID and trace/span IDs for correlation.
Log lines worth alerting/searching on: permanent error in pipeline chain, event preserved in DLQ for replay, dedupe store unavailable, processing event without deduplication, config reload: …, no decoder registered for event type.
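Since the logs are JSON, the messages listed above can be picked out of a log stream with a small filter. This sketch assumes the message is stored under the key msg (zap's production default; adjust the key if the relay configures its encoder differently). The sample lines, their extra fields and the helper name are invented for illustration. Matching is by substring, so the truncated "config reload: …" entry is matched on its fixed prefix only.

```python
import json

# Fragments taken from the list of log lines worth searching on.
WATCHED_FRAGMENTS = (
    "permanent error in pipeline chain",
    "event preserved in DLQ for replay",
    "dedupe store unavailable",
    "processing event without deduplication",
    "config reload:",
    "no decoder registered for event type",
)

# Invented log lines; only the watched message texts come from the documentation.
SAMPLE_LOGS = [
    '{"level":"error","msg":"permanent error in pipeline chain","handler":"github"}',
    '{"level":"info","msg":"event processed"}',
    "not json at all",
    '{"level":"warn","msg":"no decoder registered for event type"}',
]


def interesting_entries(lines, message_key="msg"):
    """Return parsed log entries whose message contains a watched fragment."""
    hits = []
    for raw in lines:
        try:
            entry = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        message = str(entry.get(message_key, ""))
        if any(fragment in message for fragment in WATCHED_FRAGMENTS):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    for entry in interesting_entries(SAMPLE_LOGS):
        print(entry["level"], entry["msg"])
```

The same fragment list can be reused as search terms in a log aggregation tool.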
Tracing
Set tracing.endpoint to an OTLP HTTP collector (e.g. otel-collector:4318) and each event produces a trace: receiver span → chain → one handler.execute span per handler (with handler.name/handler.type attributes and recorded errors). Use it to answer “which provider made this event slow”.
Exporting events elsewhere (DevLake, data lakes…)
For engineering-metrics platforms, don’t scrape — relay the events themselves with the generic webhook notifier and its gojq transform, shaping the payload to whatever schema the destination expects (e.g. Apache DevLake’s deployments webhook). See Examples.