Operations

Everything you need to run the relay reliably in production: scaling, shared state, the DLQ, hot reload, secret rotation, security hardening and shutdown behavior.

State backends

Two components hold state: the deduper (drops Tekton retransmissions by CloudEvent ID) and the accumulator (buffers TaskRuns per PipelineRun for summary comments). The store config selects where that state lives:

	`memory` (default)	`valkey`	`olric`
Extra infrastructure	none	1 small Valkey/RESP server	none (embedded in relay pods)
Correct with N replicas	❌ per-pod	✅	✅
Survives pod restart	❌	✅	partial (rolling updates yes, full restart no)
Network requirements	none	egress to Valkey	pod-to-pod ports 3320/tcp + 3322/tcp+udp (chart wires NetworkPolicy + headless Service)

Why this matters: with memory, each replica has its own dedupe cache. Tekton retransmits events (the relay even asks it to, via 503 back-pressure), and a retransmission can land on a different replica, slip past dedup, and produce a duplicate comment. The accumulator likewise fragments summaries across pods. Run one replica with memory, or pick a shared backend before scaling.

store:
  backend: valkey
  ttl: 1h
  valkey:
    address: valkey.tekton-events-relay.svc:6379

A 64Mi Valkey without persistence is enough — losing the cache only risks a rare duplicate notification. Any RESP-compatible server works (Valkey, KeyDB, …).

olric trades the external server for an embedded gossip cluster between the relay pods themselves (heavier dependency, state lost if all pods restart simultaneously).

All backends fail open: if the store is unreachable, events are processed without deduplication instead of being dropped, and tekton_events_relay_store_errors_total{backend,op} counts the failures.

Comments stay duplicate-free even then if you use mode: upsert, because idempotency then lives in the PR itself.

Dead letter queue

Permanent failures (expired token, deleted repo, a non-retryable 4xx from the provider) are not retried by Tekton, because the relay acks them with 200. Without the DLQ they’d be lost; with it they’re preserved for inspection and replay:

dlq:
  enabled: true

API (behind server.auth when enabled):

# inspect (oldest first; ?limit=N)
curl -s http://relay:8080/api/v1/dlq | jq

# replay everything: re-runs the chain; successes are removed,
# still-failing events stay with retry_count bumped
curl -s -X POST http://relay:8080/api/v1/dlq/replay

Entries carry the full envelope, failure cause, timestamp and replay count. Storage is a size-bounded JSONL file (dlq.max_size_bytes, oldest dropped first) on an emptyDir the chart mounts. Watch tekton_events_relay_dlq_size — a growing DLQ means a broken credential or config.

Typical flow: token expires at 02:00 → statuses fail permanently → events accumulate in the DLQ → you rotate the secret at 09:00 → POST /api/v1/dlq/replay → all of the statuses that failed overnight are delivered.

Configuration hot reload

The relay reloads its config without restart when the file changes (Kubernetes ConfigMap updates are detected, including the atomic symlink swap) or on SIGHUP:

kubectl exec deploy/tekton-events-relay -- kill -HUP 1

The new config is validated first; an invalid config is rejected and the current one stays active (counted in tekton_events_relay_config_reloads_total{result="failure"}). Handlers and the chain are rebuilt and swapped atomically; in-flight events finish on the old set; the dedupe store is preserved across the swap. Secret files are re-read too, so mounted Secret rotation propagates without restart.
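The validate-then-swap pattern described above can be sketched with an atomic pointer. This is an illustration under assumptions, not the relay's code: `Config`, `Reloader`, and the single `HandlerTimeoutSec` field are hypothetical, and the `failures` counter stands in for the reload failure metric.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// Config is a hypothetical, heavily reduced configuration.
type Config struct {
	HandlerTimeoutSec int
}

// Validate rejects configs that must never become active.
func (c *Config) Validate() error {
	if c.HandlerTimeoutSec <= 0 {
		return errors.New("handler timeout must be positive")
	}
	return nil
}

// Reloader holds the active config behind an atomic pointer, so readers
// see either the old or the new config, never a mix of both.
type Reloader struct {
	active   atomic.Pointer[Config]
	failures atomic.Int64 // stand-in for the reload failure metric
}

func NewReloader(initial *Config) (*Reloader, error) {
	if err := initial.Validate(); err != nil {
		return nil, err
	}
	r := &Reloader{}
	r.active.Store(initial)
	return r, nil
}

// Reload validates next and swaps it in; on failure the current config stays.
func (r *Reloader) Reload(next *Config) error {
	if err := next.Validate(); err != nil {
		r.failures.Add(1)
		return fmt.Errorf("reload rejected: %w", err)
	}
	r.active.Store(next)
	return nil
}

// Current returns a snapshot; callers keep using it for the whole event.
func (r *Reloader) Current() *Config { return r.active.Load() }

func main() {
	r, err := NewReloader(&Config{HandlerTimeoutSec: 10})
	if err != nil {
		panic(err)
	}
	inFlight := r.Current() // an event that started before the reload

	fmt.Println(r.Reload(&Config{HandlerTimeoutSec: 0}))  // rejected
	fmt.Println(r.Reload(&Config{HandlerTimeoutSec: 30})) // accepted
	fmt.Println(inFlight.HandlerTimeoutSec, r.Current().HandlerTimeoutSec)
}
```

Taking one snapshot per event is what lets in-flight work finish on the old set while new events pick up the new one.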

The webhook, grafana, sentry and jira notifiers go further: with a static credential they re-read the mounted secret file on every request, so a rotated Secret takes effect immediately — no config change or SIGHUP required. The webhook and Jira notifiers can also use OAuth2 client credentials (auth.oauth2 + token_url, grant_type client_credentials or refresh_token), where the access token is fetched and auto-refreshed before expiry, so a long-running pod never serves a stale token. (Grafana/Sentry have no OAuth2 client-credentials path — their APIs use a service-account / auth token — so they rely on the secret re-read.)

Sections that still require a restart (a warning is logged if they change): server, store, dlq, logging, tracing.

Secret rotation without downtime

The relay is designed so that credential rotation never forces a pod restart. Three mechanisms work together:

1. FileTokenSource re-reads on every request

All token-based notifiers (webhook, grafana, sentry, jira) and the shared SCM BaseClient store a secrets.FileTokenSource rather than a resolved string. Every call to Token(ctx) re-reads the mounted file from disk:

// internal/secrets/file_token.go
func (f *FileTokenSource) Token(_ context.Context) (string, error) {
    return ResolveWithReader(f.path, f.reader, nil)
}

The read is cheap — Kubernetes mounts secrets on an in-memory tmpfs, not a real disk. This means a rotated Secret takes effect on the next request after the kubelet updates the mounted file, with no config reload or SIGHUP required.

2. OAuth2 auto-refresh

Providers configured with auth.oauth2 (GitLab, Gitea, Bitbucket, Jira, generic webhook) use an x/oauth2 TokenSource that caches the access token and re-fetches it from the token endpoint before expiry:

// internal/notifier/scm/oauth2/client.go — Token() delegates to the
// x/oauth2 TokenSource, which refreshes automatically.
func (c *Client) Token(_ context.Context) (string, error) {
    tok, err := c.ts.Token()
    // ...
    return tok.AccessToken, nil
}

A long-running pod never serves a stale OAuth2 access token. If you rotate the client_secret, the next refresh cycle picks it up (subject to the kubelet propagation delay below).

3. Kubernetes volume propagation

The relay’s Helm chart mounts secrets as projected volumes (the default secretRef mechanism). The kubelet watches the Secret object and refreshes the mounted files through an atomic symlink swap, so a reader never sees a partially written file:

Mount type	Propagates rotation?	Delay
Projected / volume (default)	✅	kubelet sync period, typically ≤60 s
`subPath` mount	❌	Never — the file is a one-time copy; the pod must restart

Rule of thumb: never use subPath for secrets the relay needs to read at runtime. The chart’s secretRef pattern avoids this automatically.

Step-by-step: rotate a secret

This procedure works for any credential the relay uses (GitHub PAT, Slack webhook URL, Grafana API token, etc.). Replace <provider> and <instance> with your config names.

1. Update the Kubernetes Secret:

# Option A: imperative update
kubectl create secret generic <provider>-<instance> \
  --namespace tekton-events-relay \
  --from-literal=token="ghp_NEW_TOKEN_HERE" \
  --dry-run=client -o yaml | kubectl apply -f -

# Option B: edit directly
kubectl edit secret <provider>-<instance> -n tekton-events-relay

2. Wait for kubelet propagation (≤60 s):

No action needed. The kubelet detects the Secret change and updates the mounted file. You can verify:

# Check the mounted file's timestamp inside the pod
# (-L follows the symlink that the kubelet swaps on update)
kubectl exec deploy/tekton-events-relay -n tekton-events-relay -- \
  stat -L -c '%Y %y' /etc/secrets/<provider>/<instance>/token

3. Verify the relay is using the new credential:

# Check for auth errors in the last minute
kubectl logs deploy/tekton-events-relay -n tekton-events-relay --since=1m | \
  grep -i 'unauthorized\|401\|403\|token'

# Or check the per-handler diagnostics on the readiness endpoint
curl -s http://relay:8080/readyz | jq '.handlers'

4. (Optional) Replay events that failed during rotation:

If there’s a brief window where the old token expired but the new one hasn’t propagated:

# Inspect the DLQ
curl -s http://relay:8080/api/v1/dlq | jq '.[] | select(.cause | test("auth|token|401"))'

# Replay all failed events
curl -s -X POST http://relay:8080/api/v1/dlq/replay

What about OAuth2 client_secret rotation?

The same procedure applies. Update the Secret that holds client_secret, wait ≤60 s for propagation. The next time the x/oauth2 TokenSource refreshes the access token, it reads the new client_secret from the mounted file (via FileTokenSource). No restart, no config reload.

What about config reload (`SIGHUP`)?

You don’t need SIGHUP for secret rotation. The hot reload path is for changes to the config YAML itself (adding/removing providers, changing CEL expressions, etc.). Secret values are resolved per-request, not at config load time.

Security hardening

Inbound auth: server.auth with hmac-sha256 (validates X-Hub-Signature-256 over the body) or bearer. HMAC requires validate_timestamp: true for replay protection (X-Webhook-Timestamp within timestamp_tolerance, default 5m); the timestamp check does not apply to bearer auth.
Native TLS: server.tls.cert_file/key_file to serve HTTPS directly instead of relying on ingress termination.
Outbound TLS: prefer mounting a custom CA over insecure_skip_verify for self-hosted SCMs.
NetworkPolicy: enabled by default in the chart; it opens Valkey egress / Olric pod-to-pod ports only when the matching backend is selected.
Supply chain: images and charts are Cosign-signed (keyless); verify with the commands in the repository README.

Scaling & resources

The relay is I/O-bound on provider APIs; CPU/memory needs are small (the chart’s defaults fit most clusters; GOMEMLIMIT is derived from the memory limit).
max_concurrency bounds parallel handler executions per event; handler_timeout (default 10s) keeps one slow provider from stalling dispatch (tekton_events_relay_handler_timeouts_total).
Outbound retry policy honors provider rate limits (429 + Retry-After); watch tekton_events_relay_notifier_rate_limit_hits_total per host.
Before replicaCount > 1 or HPA: configure a shared state backend.

Lifecycle & shutdown

Back-pressure protocol: retryable trouble → HTTP 503 → Tekton retransmits later. Permanent failure → 200 + DLQ. So brief outages self-heal: events queued by Tekton are simply re-delivered.
Graceful shutdown: pod termination → preStop sleep → SIGTERM → server drain within shutdown_timeout_sec → store closed (Kubernetes runs the preStop hook before it sends SIGTERM). Rolling updates don’t lose events: anything in flight is either completed or re-sent by Tekton to the next pod.
Probes: /healthz (liveness), /readyz (readiness, with per-handler diagnostics — see Observability).