Observability, Logging, Monitoring, and Operations
1. Logging Model
Structured logging fields include:
- `timestamp`, `level`, `logger`, `message`
- `request_id` from `RequestContextFilter`
- `region_id` from `RequestContextFilter`
- `order_id` from service-level MDC context
- `trace_id` from the tracing context
`X-Request-Id` is accepted or generated per request and returned to clients, enabling incident correlation.
2. Metrics Catalog (Current Implementation)
HTTP and API metrics
- `http.server.request.count`
- `http.server.request.latency` (p95, p99)
- `http.server.request.errors`
- `http.server.requests.by.region`
- `api.errors.unexpected`: increments on each unhandled exception mapped to HTTP 500 by `GlobalExceptionHandler` (paired with a structured `ERROR` log and a safe client message)
Business and service metrics
- `orders.service.request.count`
- `orders.created.count`
- `orders.idempotency.hit.count`
- `orders.operation.failure.count`
- `orders.operation.duration`
- `orders.query.request.count`
- `orders.query.error.count`
- `orders.query.duration`
Cache metrics
- `cache.hit.count`
- `cache.miss.count`
- `cache.error.count`
- `cache.degraded.mode.count`
- `redis.connection.failures` (tagged by component)
- `redis.command.latency` (tagged by command/component)
Rate limiting metrics
- `rate_limit.allowed.count`
- `rate_limit.blocked.count`
- `rate_limit.dynamic.adjustments`
- `rate_limit.rejections.by.policy`
Adaptive retry and backpressure metrics
- `retry.delay.ms`
- `retry.classification.count` (`TRANSIENT`, `SEMI_TRANSIENT`, `PERMANENT`)
- `backpressure.level`
- `backpressure.outbox.backlog`
- `backpressure.kafka.lag.ms`
- `backpressure.db.saturation.percent`
- `db.query.duration` (histogram + exemplars when trace context exists)
Outbox metrics
- `outbox.pending.count`
- `outbox.failure.count`
- `outbox.publish.latency`
- `outbox.batch.size`
- `outbox.publish.rate`
- `outbox.lag`
- `outbox.retry.count`
Kafka and schema metrics
- `kafka.consumer.errors`
- `kafka.consumer.lag.ms`
- `kafka.consumer.processed.count`
- `kafka.consumer.retry.count`
- `kafka.consumer.dlq.count`
- `kafka.schema.validation.errors`
- `kafka.event.version.distribution`
Regional resilience metrics
- `failover.events.count`
- `region.health.unhealthy.count`
- `region.health.dependency.failure.count`
- `region.conflict.rejected.count`
3. Tracing
Configured through Micrometer bridge and OTLP endpoint:
- `management.tracing.sampling.probability`
- `management.otlp.tracing.endpoint`
Use traces together with `request_id` for end-to-end debugging.
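For reference, a minimal configuration sketch wiring the two properties above (the sampling rate and the collector address are illustrative placeholders, not the deployed values):

```yaml
management:
  tracing:
    sampling:
      probability: 0.1   # sample 10% of requests; raise temporarily while debugging
  otlp:
    tracing:
      endpoint: http://otel-collector:4318/v1/traces   # placeholder collector address
```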
4. Reliability and Retry Behavior
Outbox pipeline (`OutboxPublisher` + `OutboxFetcher` + `OutboxProcessor` + `OutboxRetryHandler`)
- Scheduler dispatches owned partitions with in-flight semaphore backpressure.
- Fetcher claims rows in a transaction, records `outbox.batch.size`, and preserves deterministic aggregate ordering.
- Processor performs the async publish, tracks publish latency/rate/lag, and marks rows `SENT` on completion.
- Retry handler updates the retry count, classifies the failure type, computes an adaptive delay (retry count + system pressure + jitter + cap), and parks terminal/permanent failures.
- Kafka publisher is async and single-attempt; retry/backoff policy is intentionally owned by outbox to keep retry control deterministic.
- Cleanup archives/deletes old `SENT` rows on a retention schedule.
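The adaptive delay recipe above (retry count + system pressure + jitter + cap) can be sketched in plain Java; the class, constants, and pressure scale below are illustrative assumptions, not the service's actual retry handler:

```java
import java.util.concurrent.ThreadLocalRandom;

public class AdaptiveDelaySketch {
    static final long BASE_DELAY_MS = 500;     // illustrative base backoff
    static final long MAX_DELAY_MS = 60_000;   // illustrative hard cap

    /**
     * Combines exponential backoff (by retry count), a system-pressure
     * multiplier, random jitter, and a hard cap.
     *
     * @param retryCount failed attempts so far (>= 0)
     * @param pressure   normalized backpressure level in [0.0, 1.0]
     */
    static long computeDelayMs(int retryCount, double pressure) {
        long exponential = BASE_DELAY_MS << Math.min(retryCount, 10);      // 2^n growth, bounded shift
        double pressureScaled = exponential * (1.0 + pressure);            // slow down further under load
        long jitter = ThreadLocalRandom.current().nextLong(BASE_DELAY_MS); // spread retries apart
        return Math.min((long) pressureScaled + jitter, MAX_DELAY_MS);     // never exceed the cap
    }
}
```

Jitter is what keeps a parked batch from retrying in lockstep; the cap is what keeps semi-transient failures from backing off into uselessness.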
Kafka consumer
- Manual ack mode (`manual_immediate`) with ack after transactional processing success.
- Dedupe check + domain update + processed-marker insertion executed within a transaction template boundary.
- Retry topics with exponential backoff and max attempts; DLT handler logs payload context and headers.
- Versioned schema parsing with fallback compatibility logic.
- `read_committed` consumption prevents seeing aborted transactional producer records.
- Kafka lag measurements feed `BackpressureManager` for global admission/throttling decisions.
5.1 Feedback Loops and Signal Propagation
```mermaid
flowchart LR
    KAFKA_LAG[kafka.consumer.lag.ms] --> BPM[BackpressureManager]
    OUTBOX[outbox.pending.count + outbox.failure.count] --> BPM
    DBSAT[DB pool saturation] --> BPM
    BPM --> WRITES[OrderService write admission]
    BPM --> RLP[RateLimitPolicyProvider adaptive policy]
    BPM --> RETRY[AdaptiveRetryPolicyStrategy]
```
Interpretation details:
- Lag and backlog signals are normalized into a single pressure level.
- Pressure level drives three downstream controls: write admission, rate-limit policy, and retry delay.
- Controls are intentionally asymmetric:
- rate limits tighten first (protects availability),
- retry delays increase second (reduces downstream thrash),
- write rejection activates last (protects correctness under critical saturation).
- This staged behavior prevents abrupt mode flapping and keeps eventual drain possible.
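The normalization and staged engagement can be pictured with a sketch like the following; the thresholds, scaling constants, and class name are illustrative assumptions, not the actual `BackpressureManager`:

```java
public class BackpressureSketch {
    /** Normalizes the three raw signals into one pressure score in [0, 1]; the worst signal dominates. */
    static double pressureScore(double kafkaLagMs, long outboxBacklog, double dbSaturationPct) {
        double lag = Math.min(kafkaLagMs / 30_000.0, 1.0);        // assume 30s lag == fully pressured
        double backlog = Math.min(outboxBacklog / 10_000.0, 1.0); // assume 10k pending rows == fully pressured
        double db = Math.min(dbSaturationPct / 100.0, 1.0);
        return Math.max(lag, Math.max(backlog, db));
    }

    // Controls engage asymmetrically, matching the staged behavior described above.
    static boolean tightenRateLimits(double score)  { return score >= 0.4; } // first: protect availability
    static boolean stretchRetryDelays(double score) { return score >= 0.6; } // second: reduce downstream thrash
    static boolean rejectWrites(double score)       { return score >= 0.9; } // last: protect correctness
}
```

Because each control has its own threshold, pressure has to keep rising before the next (more disruptive) control engages, which is what damps mode flapping.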
5.2 Exemplar-driven Debugging (Prometheus -> Trace)
The service now emits exemplar-friendly histograms for:
- `http.server.requests`
- `db.query.duration`
With tracing enabled, Micrometer attaches the active `trace_id` as exemplar metadata to latency buckets.
Operational workflow:
- SRE spots spike in Prometheus latency histogram.
- SRE opens bucket exemplar and extracts trace id.
- Trace id is queried in Jaeger/Tempo.
- Full distributed path is inspected (controller -> query service -> repository/cache fallback).
This removes guesswork when separating saturation from path-specific regressions.
5.3 HLC Conflict Telemetry Notes
Regional conflict evaluation now uses Hybrid Logical Clock ordering:
- physical clock in milliseconds (`lastUpdatedTimestamp` / incoming `occurredAt`)
- logical counter (`version`)
When the HLC comparison rejects an update, `region.conflict.rejected.count` increments.
This metric should be interpreted alongside clock-skew and replication-delay dashboards.
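The comparison rule itself is small; a minimal sketch (the class and method names are illustrative, the field semantics follow the description above):

```java
public class HlcSketch {
    /**
     * Hybrid Logical Clock ordering: compare the physical millisecond
     * component first, then the logical counter as a tiebreaker.
     *
     * @return negative if (ms1, v1) is older, positive if newer, 0 if equal
     */
    static int compare(long physicalMs1, long version1, long physicalMs2, long version2) {
        int byPhysical = Long.compare(physicalMs1, physicalMs2);
        return byPhysical != 0 ? byPhysical : Long.compare(version1, version2);
    }

    /** An incoming update is rejected unless it strictly advances the clock. */
    static boolean reject(long currentMs, long currentVersion, long incomingMs, long incomingVersion) {
        return compare(incomingMs, incomingVersion, currentMs, currentVersion) <= 0;
    }
}
```

The logical counter is why two writes landing in the same millisecond in different regions still order deterministically instead of tying on wall-clock time.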
6. Operational Alerts and Interpretation
High outbox backlog
- Signals: rising `outbox.pending.count` and `outbox.failure.count`
- Likely causes: broker issues, schema failures, retry pressure
- Actions: validate broker health, inspect retry logs, verify schema errors
Elevated cache degradation
- Signals: rising `cache.error.count` / `cache.degraded.mode.count`
- Likely causes: Redis connectivity/latency issues
- Actions: check Redis health, verify fallback DB latency capacity
Retry storm / DLT growth
- Signals: rising `kafka.consumer.retry.count`, `kafka.consumer.dlq.count`
- Likely causes: malformed payloads, persistent downstream dependency issues
- Actions: inspect DLT payload contexts, classify and replay only safe records
7. DLQ Operations Guidance
When `kafka.consumer.dlq.count` rises:
- Inspect DLQ logs with `eventId`, `orderId`, and topic/partition/offset.
- Classify failures:
- payload/schema issues
- missing order records
- persistent downstream failures
- Decide replay/ignore policy.
- Apply fix and replay if safe.
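The classify-then-decide step can be captured as a small triage table; the dispositions below are one plausible policy sketch, not the team's actual replay rules:

```java
public class DlqTriageSketch {
    enum Disposition { REPLAY, FIX_THEN_REPLAY, DISCARD }

    /** Maps the failure classes listed above to a replay decision (policy is illustrative). */
    static Disposition triage(boolean schemaIssue, boolean missingOrder, boolean downstreamFailure) {
        if (schemaIssue) return Disposition.DISCARD;           // malformed payload can never succeed
        if (missingOrder) return Disposition.FIX_THEN_REPLAY;  // restore the order record first
        if (downstreamFailure) return Disposition.REPLAY;      // safe once the dependency recovers
        return Disposition.REPLAY;
    }
}
```

Whatever the concrete policy, encoding it explicitly keeps on-call replay decisions consistent instead of ad hoc.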
8. Real-world Monitoring Scenarios
Scenario: Kafka outage
Symptoms:
- rising `outbox.pending.count`
- rising `outbox.failure.count`
- stable API latency (writes still commit)
Action:
- restore Kafka
- confirm outbox drain by falling pending/failure gauges
Scenario: Redis instability
Symptoms:
- rising `cache.error.count`
- rising `cache.degraded.mode.count`
- possibly rising DB read load
Action:
- restore Redis
- verify cache hits recover
Scenario: abusive traffic or bot spikes
Symptoms:
- rising `rate_limit.blocked.count`
- rising `rate_limit.dynamic.adjustments` during pressure events
Action:
- tune token bucket limits/window
- add edge/WAF controls if needed
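Tuning the limits is easier with the token-bucket mechanics in mind; a minimal sketch (capacity and refill rate are illustrative, not the service's configured values):

```java
public class TokenBucketSketch {
    private final long capacity;
    private final double refillPerMs;
    private double tokens;
    private long lastRefillMs;

    TokenBucketSketch(long capacity, double refillPerSecond, long nowMs) {
        this.capacity = capacity;
        this.refillPerMs = refillPerSecond / 1000.0;
        this.tokens = capacity;        // start full
        this.lastRefillMs = nowMs;
    }

    /** Returns true if admitted; a false result corresponds to rate_limit.blocked.count. */
    synchronized boolean tryAcquire(long nowMs) {
        double refilled = (nowMs - lastRefillMs) * refillPerMs; // lazy refill on access
        tokens = Math.min(capacity, tokens + refilled);         // never exceed capacity
        lastRefillMs = nowMs;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

Capacity governs how large a burst is absorbed; the refill rate governs the sustained throughput, so the two knobs should be tuned against different symptoms.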
Scenario: consumer retry storm
Symptoms:
- rising `kafka.consumer.retry.count`
- rising consumer lag
- potential DLQ increase
Action:
- inspect root cause class of retries
- verify retry bounds and delay settings
- scale consumers / fix downstream dependency
Scenario: regional failover event
Symptoms:
- `failover.events.count` increments
- increased 503s on write endpoints in the passive node
- request logs include the impacted `region_id`
Action:
- Validate root cause (DB/Redis/Kafka health in region).
- Confirm global traffic router moved writes to healthy region.
- Track recovery and verify node returns active state.
Scenario: active-active conflict suppression
Symptoms:
- rising `region.conflict.rejected.count`
- elevated concurrent writes from multiple regions
Action:
- verify the conflict policy mode (`last-write-wins` vs `version-based`)
- inspect conflicting event/order timestamps and versions
- ensure upstream routing consistency for hot aggregates
9. Runbook Checkpoints
- Verify Kafka connectivity and topic health.
- Verify Redis availability and latency.
- Verify Prometheus scraping of `/actuator/prometheus`.
- Verify the trace export endpoint and sampling settings.
- Verify outbox backlog and DLQ trends on deployments.
- Verify regional health and failover mode in `/actuator/health` (multiRegion).
10. P2 Chaos and Load Validation Matrix
| Scenario | Injected Fault | Expected Signal | Pass Criteria |
|---|---|---|---|
| Slow broker acks | broker latency and partial throttling | `outbox.publish.latency` p95 rises | in-flight publishes remain capped (`app.outbox.publisher.max-in-flight`) |
| Lease reclaim race | worker delayed beyond lease expiry | fenced update rejection logs and counters | no stale `IN_FLIGHT` -> `SENT` overwrite |
| Redis degradation | Redis unavailable during rate-limit checks | `redis.connection.failures` increases | API remains available (fail-open) |
| Consumer retry pressure | transient DB failures | `kafka.consumer.retry.count` increases | no listener thread sleep stalls |