Observability

Endpoints

Path	Purpose
`/`	CloudEvents receiver
`/metrics`	Prometheus metrics (also on `server.metrics_addr` if set; the chart ships a ServiceMonitor)
`/healthz`	Liveness — always 200 while the process serves
`/readyz`	Readiness + per-handler diagnostics (JSON)
`/api/v1/dlq` · `/api/v1/dlq/replay`	Dead letter queue

`/readyz`

Returns 503 until handlers and decoders are registered, then 200 with live per-handler state — your first stop when “notifications stopped”:

{
  "status": "ok",
  "handlers": {
    "github": {
      "last_event_at": "2026-06-11T12:01:33Z",
      "succeeded": 1041,
      "failed": 3,
      "last_error": "github API returned 401: Bad credentials",
      "last_error_at": "2026-06-11T11:58:02Z"
    },
    "slack": { "last_event_at": "2026-06-11T12:01:33Z", "succeeded": 87, "failed": 0 }
  }
}

Prometheus metrics

All relay metrics carry the `tekton_events_relay_` prefix; the tables below list names without it.

Event flow

Metric	Labels	Meaning
`events_received_total`	`type`, `source`	CloudEvents accepted by the receiver
`events_processed_total`	`handler`, `status`	Chain-step / handler outcomes (`success`/`error`)
`events_filtered_total`	`reason`	Dropped by the resource-type filter
`events_unsupported_type_total`	`type`	CloudEvent types with no decoder (watch after Tekton upgrades)
`events_backpressure_total`	—	Events answered with 503 (Tekton will retransmit)
`errors_permanent_total`	`reason`	Non-retryable chain failures (these go to the DLQ when enabled)
`pipeline_errors_total`	`stage`	Internal chain errors per stage

Latency

Metric	Labels
`chain_duration_seconds`	`result`
`handler_duration_seconds`	`handler`
`notifier_latency_seconds`	`handler`, `action`
`handler_timeouts_total`	`handler`

Deduplication & state

Metric	Labels	Meaning
`deduper_hits_total`	—	Duplicates dropped
`dedupe_cache_size`	—	Current entries (memory backend)
`deduper_evictions_total`	—	LRU evictions — sustained growth means `dedupe_size` is too small
`store_errors_total`	`backend`, `op`	State-backend failures (relay failed open)

Outbound reliability

Metric	Labels	Meaning
`notifier_retries_total`	`host`, `reason`	Retries (`rate_limit`, `server_error`, `timeout`, `network_error`)
`notifier_rate_limit_hits_total`	`host`	HTTP 429 received per destination
`dlq_size` / `dlq_enqueued_total`	—	Dead letter queue depth / inflow

Operations

Metric	Labels	Meaning
`config_reloads_total`	`result`	Hot reload attempts (`success`/`failure`)
`handlers_registered`	—	Handlers built from config

Standard HTTP server metrics (`http_request_duration_seconds`, `http_requests_total`, `http_requests_in_flight`) and Go runtime/process collectors are also exported.

Alerting starting points

# permanent failures appearing
increase(tekton_events_relay_errors_permanent_total[10m]) > 0

# DLQ filling up — broken credential or config
tekton_events_relay_dlq_size > 0

# being rate-limited by a provider
increase(tekton_events_relay_notifier_rate_limit_hits_total[5m]) > 10

# dedupe degraded (store down, failing open)
increase(tekton_events_relay_store_errors_total[5m]) > 0

# hot reload rejected a new config
increase(tekton_events_relay_config_reloads_total{result="failure"}[15m]) > 0

Logging

Structured JSON via zap. Setting `logging.level: debug` unlocks the verbose switches (`caller`, `http_calls`, `payloads`; known secret keys are redacted from payloads). Every request gets an `X-Request-ID` and trace/span IDs for correlation.

Log lines worth alerting/searching on: permanent error in pipeline chain, event preserved in DLQ for replay, dedupe store unavailable, processing event without deduplication, config reload: …, no decoder registered for event type.

Tracing

Set `tracing.endpoint` to an OTLP HTTP collector (e.g. `otel-collector:4318`) and each event produces a trace: receiver span → chain → one `handler.execute` span per handler (with `handler.name`/`handler.type` attributes and recorded errors). Use it to answer “which provider made this event slow?”

Exporting events elsewhere (DevLake, data lakes…)

For engineering-metrics platforms, don’t scrape — relay the events themselves with the generic webhook notifier and its gojq transform, shaping the payload to whatever schema the destination expects (e.g. Apache DevLake’s deployments webhook). See Examples.