Observability

Endpoints

| Path | Purpose |
| --- | --- |
| / | CloudEvents receiver |
| /metrics | Prometheus metrics (also on server.metrics_addr if set; the chart ships a ServiceMonitor) |
| /healthz | Liveness: always 200 while the process serves |
| /readyz | Readiness + per-handler diagnostics (JSON) |
| /api/v1/dlq · /api/v1/dlq/replay | Dead letter queue |

/readyz

Returns 503 until handlers and decoders are registered, then 200 with live per-handler state — your first stop when “notifications stopped”:

{
  "status": "ok",
  "handlers": {
    "github": {
      "last_event_at": "2026-06-11T12:01:33Z",
      "succeeded": 1041,
      "failed": 3,
      "last_error": "github API returned 401: Bad credentials",
      "last_error_at": "2026-06-11T11:58:02Z"
    },
    "slack": { "last_event_at": "2026-06-11T12:01:33Z", "succeeded": 87, "failed": 0 }
  }
}
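
When triaging, it can help to reduce this payload to the handlers that need attention. The following is a minimal sketch, assuming the response shape shown above; the function name `failing_handlers` and the `SAMPLE` constant are hypothetical and not part of the relay.

```python
import json


def failing_handlers(readyz_body: str) -> dict:
    """Return {handler: last_error} for handlers that report failures.

    Assumes the /readyz JSON shape shown above. Fields that are absent
    (for example, no last_error recorded yet) are treated as empty.
    """
    payload = json.loads(readyz_body)
    report = {}
    for name, state in payload.get("handlers", {}).items():
        if state.get("failed", 0) > 0:
            report[name] = state.get("last_error", "")
    return report


# The example response from above, used here as test input.
SAMPLE = """
{
  "status": "ok",
  "handlers": {
    "github": {
      "last_event_at": "2026-06-11T12:01:33Z",
      "succeeded": 1041,
      "failed": 3,
      "last_error": "github API returned 401: Bad credentials",
      "last_error_at": "2026-06-11T11:58:02Z"
    },
    "slack": {"last_event_at": "2026-06-11T12:01:33Z", "succeeded": 87, "failed": 0}
  }
}
"""

print(failing_handlers(SAMPLE))
```

Against a live relay, the body would come from an HTTP GET on /readyz. Note that `urllib.request.urlopen` raises `HTTPError` on the 503 returned before registration completes, so a real script would need to handle that case.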

Prometheus metrics

All relay metrics carry the tekton_events_relay_ prefix.

Event flow

| Metric | Labels | Meaning |
| --- | --- | --- |
| events_received_total | type, source | CloudEvents accepted by the receiver |
| events_processed_total | handler, status | Chain-step / handler outcomes (success/error) |
| events_filtered_total | reason | Dropped by the resource-type filter |
| events_unsupported_type_total | type | CloudEvent types with no decoder (watch after Tekton upgrades) |
| events_backpressure_total | | Events answered with 503 (Tekton will retransmit) |
| errors_permanent_total | reason | Non-retryable chain failures (these go to the DLQ when enabled) |
| pipeline_errors_total | stage | Internal chain errors per stage |
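
For a quick check without a Prometheus server, the /metrics output can be read directly. The sketch below computes a per-handler error ratio from events_processed_total. It assumes the standard Prometheus text exposition format and the handler/status labels listed above; the helper name `handler_error_ratio` is hypothetical, and the parser is deliberately simple (it does not handle escaped quotes inside label values).

```python
import re

PREFIX = "tekton_events_relay_"
# name{labels} value [timestamp]; the labels and timestamp are optional.
SAMPLE_LINE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def handler_error_ratio(exposition: str) -> dict:
    """Return {handler: error / (all outcomes)} from events_processed_total."""
    totals = {}  # handler -> {status: count}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if not match or match.group(1) != PREFIX + "events_processed_total":
            continue
        labels = dict(LABEL.findall(match.group(2) or ""))
        counts = totals.setdefault(labels.get("handler", ""), {})
        status = labels.get("status", "")
        counts[status] = counts.get(status, 0.0) + float(match.group(3))

    ratios = {}
    for handler, counts in totals.items():
        total = sum(counts.values())
        ratios[handler] = counts.get("error", 0.0) / total if total else 0.0
    return ratios


# Made-up sample values, for illustration only.
EXPOSITION = """
# TYPE tekton_events_relay_events_processed_total counter
tekton_events_relay_events_processed_total{handler="github",status="success"} 1041
tekton_events_relay_events_processed_total{handler="github",status="error"} 3
tekton_events_relay_events_processed_total{handler="slack",status="success"} 87
"""

print(handler_error_ratio(EXPOSITION))
```

Counters are cumulative since process start, so this ratio describes the relay's whole lifetime; for a windowed view, a rate-based query in Prometheus is the better tool.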

Latency

| Metric | Labels |
| --- | --- |
| chain_duration_seconds | result |
| handler_duration_seconds | handler |
| notifier_latency_seconds | handler, action |
| handler_timeouts_total | handler |

Deduplication & state

| Metric | Labels | Meaning |
| --- | --- | --- |
| deduper_hits_total | | Duplicates dropped |
| dedupe_cache_size | | Current entries (memory backend) |
| deduper_evictions_total | | LRU evictions; sustained growth means dedupe_size is too small |
| store_errors_total | backend, op | State-backend failures (relay failed open) |

Outbound reliability

| Metric | Labels | Meaning |
| --- | --- | --- |
| notifier_retries_total | host, reason | Retries (rate_limit, server_error, timeout, network_error) |
| notifier_rate_limit_hits_total | host | HTTP 429 received per destination |
| dlq_size / dlq_enqueued_total | | Dead letter queue depth / inflow |

Operations

| Metric | Labels | Meaning |
| --- | --- | --- |
| config_reloads_total | result | Hot reload attempts (success/failure) |
| handlers_registered | | Handlers built from config |

Standard HTTP server metrics (http_request_duration_seconds, http_requests_total, http_requests_in_flight) and Go runtime/process collectors are also exported.

Alerting starting points

# permanent failures appearing
increase(tekton_events_relay_errors_permanent_total[10m]) > 0

# DLQ filling up — broken credential or config
tekton_events_relay_dlq_size > 0

# being rate-limited by a provider
increase(tekton_events_relay_notifier_rate_limit_hits_total[5m]) > 10

# dedupe degraded (store down, failing open)
increase(tekton_events_relay_store_errors_total[5m]) > 0

# config rollout broke the config
increase(tekton_events_relay_config_reloads_total{result="failure"}[15m]) > 0
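
To load these expressions into Prometheus, they need to be wrapped in a rule group. The sketch below builds that structure with the standard rule-file fields (groups, name, rules, alert, expr, labels). The alert names, the group name, and the severity label are illustrative assumptions, not names shipped with the relay.

```python
import json

# (alert name, PromQL expression). The names are hypothetical; the
# expressions are the starting points listed above.
STARTING_POINTS = [
    ("RelayPermanentErrors",
     "increase(tekton_events_relay_errors_permanent_total[10m]) > 0"),
    ("RelayDLQNotEmpty",
     "tekton_events_relay_dlq_size > 0"),
    ("RelayRateLimited",
     "increase(tekton_events_relay_notifier_rate_limit_hits_total[5m]) > 10"),
    ("RelayStoreErrors",
     "increase(tekton_events_relay_store_errors_total[5m]) > 0"),
    ("RelayConfigReloadFailed",
     'increase(tekton_events_relay_config_reloads_total{result="failure"}[15m]) > 0'),
]


def build_rule_group(name: str, severity: str = "warning") -> dict:
    """Wrap the starting-point expressions in a Prometheus rule-file structure."""
    return {
        "groups": [
            {
                "name": name,
                "rules": [
                    {"alert": alert, "expr": expr, "labels": {"severity": severity}}
                    for alert, expr in STARTING_POINTS
                ],
            }
        ]
    }


# Serialized as JSON, which YAML parsers generally accept as flow-style YAML.
RULES_JSON = json.dumps(build_rule_group("tekton-events-relay"), indent=2)
print(RULES_JSON)
```

Prometheus rule files are YAML, so the JSON output should load as-is, but it is worth validating the result with `promtool check rules` before deploying it.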

Logging

Structured JSON via zap. Setting logging.level: debug unlocks the verbose switches (caller, http_calls, payloads); known secret keys are redacted from payloads. Every request gets an X-Request-ID and trace/span IDs for correlation.

Log lines worth alerting/searching on: permanent error in pipeline chain, event preserved in DLQ for replay, dedupe store unavailable, processing event without deduplication, config reload: …, no decoder registered for event type.
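
Because the logs are JSON, these messages can be picked out with a few lines of code. The following is a minimal sketch, assuming the message is stored under the key `msg` (the default in zap's production encoder config; the relay may configure a different key). The function name `watched_lines` is hypothetical.

```python
import json

# Message fragments taken from the list above. "config reload:" is
# matched as a prefix because the rest of that message varies.
WATCHED = (
    "permanent error in pipeline chain",
    "event preserved in DLQ for replay",
    "dedupe store unavailable",
    "processing event without deduplication",
    "config reload:",
    "no decoder registered for event type",
)


def watched_lines(lines, message_key="msg"):
    """Yield parsed log records whose message contains a watched fragment."""
    for raw in lines:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict):
            continue  # skip JSON values that are not log objects
        message = str(record.get(message_key, ""))
        if any(fragment in message for fragment in WATCHED):
            yield record
```

A typical use would be to pipe container logs into a script that calls `watched_lines(sys.stdin)` and prints each matching record along with its request or trace ID.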

Tracing

Set tracing.endpoint to an OTLP HTTP collector (e.g. otel-collector:4318) and each event produces a trace: receiver span → chain → one handler.execute span per handler (with handler.name/handler.type attributes and recorded errors). Use it to answer “which provider made this event slow”.

Exporting events elsewhere (DevLake, data lakes…)

For engineering-metrics platforms, don’t scrape — relay the events themselves with the generic webhook notifier and its gojq transform, shaping the payload to whatever schema the destination expects (e.g. Apache DevLake’s deployments webhook). See Examples.