Observability Roadmap

This document outlines the evolution of our observability strategy across V1, V2, and V3. Each version adds capabilities based on actual needs rather than theoretical completeness.

Philosophy

Start simple, add complexity only when needed.

Observability exists to help developers debug issues and understand system behavior. The best observability strategy is one that developers actually use. Therefore:

  • ✅ Prefer automatic instrumentation over manual code
  • ✅ Keep developer experience simple (1-2 lines of code)
  • ✅ Add capabilities only when pain points emerge
  • ✅ Measure the cost/benefit of each observability layer

V1: Simplicity First (Current)

Goal: Provide excellent observability with minimal developer effort.

What We Have

Automatic (Zero Code Required)

  • HTTP Request Tracing - Every request automatically traced with timing, status, route
  • Database Query Tracing - All TypeORM operations automatically traced with query type, table, duration
  • Log Correlation - All logs automatically include requestId and traceId for flow reconstruction
  • Error Tracking - HTTP layer captures all unhandled exceptions with full context
  • External HTTP Calls - Instrumented HTTP clients trace outbound requests
  • RabbitMQ Messages - Instrumented message handlers trace message processing
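
For orientation, this kind of zero-code instrumentation is typically wired up once at application bootstrap rather than in feature code. A minimal sketch using the OpenTelemetry Node SDK (assuming an OpenTelemetry-based setup; the package names and exporter endpoint below are illustrative, not our exact configuration):

// tracing.ts - started before the application bootstraps
// (illustrative sketch; actual packages and exporter config may differ)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: 'http://tempo:4318/v1/traces' }), // hypothetical endpoint
  instrumentations: [getNodeAutoInstrumentations()], // covers HTTP, database drivers, amqplib, etc.
});

sdk.start();

Everything in the list above then flows from that single bootstrap step; feature code never touches the tracer directly.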

Manual (Simple 1-Line Patterns)

  • Structured Logging - logger.logWithContext(userLogs.readOneStarting, context)
  • Template Variables - userLogs.userNotFound.replace({ id: userId })
  • Request Context - ...this.contextService.getLoggingContext() adds requestId/traceId automatically
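
The log templates behind these one-liners can stay very small. The sketch below shows one plausible shape for userLogs and its replace() helper; the real module may differ, so treat the names and structure as illustrative only:

// user.logs.ts - illustrative sketch of log templates with a replace() helper
const template = (message: string) => ({
  message,
  replace: (vars: Record<string, string | number>) =>
    Object.entries(vars).reduce(
      (msg, [key, value]) => msg.replace(`{${key}}`, String(value)),
      message,
    ),
});

export const userLogs = {
  readOneStarting: template('Reading user'),
  readOneComplete: template('User read complete'),
  userNotFound: template('User with id {id} not found'),
};

// Usage: userLogs.userNotFound.replace({ id: userId })
// => "User with id 123 not found"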

What We Deliberately Don't Have

  • Service-Level Tracing - No traceAsync wrappers around service methods
  • Manual Span Creation - No span.setAttributes() or span.addEvent() in business logic
  • Complex Instrumentation - No custom metric collectors or profilers

Why This Works for V1

95% of debugging needs are satisfied by:

  1. Flow Reconstruction - Filter logs by requestId to see entire request flow
  2. HTTP Timing - See which requests are slow (automatic HTTP trace)
  3. Database Timing - See which queries are slow (automatic DB trace)
  4. Error Context - See full context when errors occur (logs + trace)
  5. Console Readability - Logs are immediately actionable with inline variables

Example debugging workflow:

1. User reports slow page load
2. Find requestId from frontend logs
3. Filter backend logs by requestId
4. See all operations in order with timestamps
5. Identify slow database query from DB trace
6. Optimize query

No service-level tracing needed for this workflow.
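
For illustration, two correlated log entries from such a flow might look like the following (field names are assumptions based on the patterns above, not our exact log schema):

{"level":"info","message":"Reading user","module":"UserService","method":"readOne","requestId":"req-7f3a2c","traceId":"4bf92f3577b34da6"}
{"level":"info","message":"User read complete","module":"UserService","method":"readOne","requestId":"req-7f3a2c","traceId":"4bf92f3577b34da6"}

Filtering on requestId returns every backend log line for that request in order, and the traceId links those lines to the automatic HTTP and database spans.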

When V1 Isn't Enough

You'll know it's time for V2 when you encounter:

  • Complex operations where you can't tell which step is slow from logs alone
  • Distributed operations where one request triggers multiple background jobs
  • Performance bottlenecks where HTTP trace shows slowness but you need granular timing within a service method

V2: Targeted Service-Level Tracing

Goal: Add service-level tracing only for complex operations that need it.

What Changes

New Capability: Opt-In Service Tracing

Services can wrap complex operations in traceAsync only when needed:

// Most methods stay simple (no tracing):
async readOne(values: UserReadOneInput): Promise<User> {
  const context = { module: 'UserService', method: 'readOne', ...this.contextService.getLoggingContext() };
  this.logger.logWithContext(userLogs.readOneStarting, context);

  const user = await this.repository.findOne({ where: { id: values.id } });
  if (!user) {
    throw new NotFoundException(userLogs.userNotFound.replace({ id: values.id }));
  }

  this.logger.logWithContext(userLogs.readOneComplete, context);
  return user;
}

// Only complex operations get tracing:
async processOrder(values: OrderInput): Promise<Order> {
  return await this.tracingService.traceAsync('OrderService.processOrder', async (span) => {
    span.setAttributes({ 'order.id': values.id });

    // Step 1: Validate inventory (traced as child span)
    await this.tracingService.traceAsync('validateInventory', async (inventorySpan) => {
      // Complex validation logic with multiple database calls
    });

    // Step 2: Calculate pricing (traced as child span)
    await this.tracingService.traceAsync('calculatePricing', async (pricingSpan) => {
      // Complex pricing logic with external API calls
    });

    // Step 3: Create and persist the order
    const order = this.repository.create(values);
    return await this.repository.save(order);
  });
}

Key principle: Only use traceAsync when you specifically need timing breakdown of complex operations.
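
For context, a traceAsync helper of this kind is usually a thin wrapper around the tracer's active-span API. The sketch below shows one plausible shape using the OpenTelemetry JS API; it is not necessarily how CustomTracingService is actually implemented:

// Illustrative sketch only - not necessarily the actual CustomTracingService implementation
import { trace, SpanStatusCode, Span } from '@opentelemetry/api';

export class CustomTracingService {
  private readonly tracer = trace.getTracer('backend');

  async traceAsync<T>(name: string, fn: (span: Span) => Promise<T>): Promise<T> {
    // startActiveSpan makes the new span the parent of anything created inside fn,
    // so nested traceAsync calls become child spans automatically
    return this.tracer.startActiveSpan(name, async (span) => {
      try {
        return await fn(span);
      } catch (error) {
        span.recordException(error as Error);
        span.setStatus({ code: SpanStatusCode.ERROR });
        throw error;
      } finally {
        span.end();
      }
    });
  }
}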

When to Use Service-Level Tracing in V2

Add traceAsync only for operations that meet these criteria:

  1. Complex Multi-Step Operations

    • 5+ sequential steps within one service method
    • Need to identify which step is the bottleneck
    • Example: Order processing with validation, pricing, inventory, payment
  2. Performance Bottlenecks

    • HTTP trace shows endpoint is slow (e.g., 2 seconds)
    • Logs show it completed all steps
    • Need granular timing to find which internal step is slow
    • Example: Report generation with multiple aggregations
  3. Distributed Operations

    • One request triggers multiple background jobs or events
    • Need to trace the entire workflow across async boundaries (see the propagation sketch after this list)
    • Example: User signup triggers email, analytics, and CRM sync
  4. External Integrations

    • Service makes multiple external API calls
    • Need to attribute latency to specific external services
    • Example: Payment processing with multiple provider fallbacks
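
For the distributed-operations case above, the key mechanic is carrying trace context across the async boundary so that handler spans become children of the original request. A hedged sketch using the OpenTelemetry propagation API (the publish/handle functions and message shape are hypothetical):

// Illustrative only: carrying trace context across a message-queue boundary.
// The queue client and message shape are hypothetical; only the propagation calls are the point.
import { context, propagation } from '@opentelemetry/api';

// Publisher side: inject the current trace context into the message headers
function publishWithTraceContext(publish: (payload: object, headers: Record<string, string>) => void, payload: object) {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers); // adds traceparent/tracestate keys
  publish(payload, headers);
}

// Consumer side: extract the context so spans created in the handler
// become children of the request that published the message
function handleWithTraceContext(headers: Record<string, string>, handler: () => Promise<void>) {
  const parentContext = propagation.extract(context.active(), headers);
  return context.with(parentContext, handler);
}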

What Still Doesn't Need Tracing in V2

  • ✅ Simple CRUD operations - Logs + HTTP trace are sufficient
  • ✅ Single database calls - Automatic DB tracing handles this
  • ✅ Straightforward business logic - Logs show the flow
  • ✅ Controller methods - HTTP tracing handles request/response

V2 Implementation Checklist

When ready to implement V2:

  • Review CustomTracingService implementation
  • Create documentation for when to use traceAsync
  • Add examples of traced vs. untraced operations
  • Update code review guidelines to prevent over-tracing
  • Add metrics to measure tracing overhead
  • Create training materials for team

V3: Distributed Systems & Advanced Observability

Goal: Support microservices architecture and advanced performance analysis.

What Changes

New Capabilities

  1. Cross-Service Tracing

    • Trace propagation across multiple backend services
    • Service mesh integration (Istio, Linkerd)
    • Distributed trace visualization showing service dependencies
  2. Advanced Metrics

    • Request rate, error rate, duration (RED metrics) per service
    • Percentile latencies (p50, p95, p99)
    • Histograms for distribution analysis
    • Custom business metrics (orders/sec, revenue/min); see the metrics sketch after this list
  3. Performance Profiling

    • CPU flame graphs for hot path identification
    • Memory profiling for leak detection
    • Continuous profiling in production
  4. Real-Time Alerting

    • Trace-based alerts (e.g., "payment flow exceeded 5s")
    • Anomaly detection on trace patterns
    • Automatic incident creation from trace data
  5. Event-Driven Tracing

    • Trace across event bus (RabbitMQ, Kafka, SQS)
    • Async job tracing with parent-child relationships
    • Scheduled task tracing
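
To make the Advanced Metrics item concrete: with an OpenTelemetry-style metrics API (an assumption; instrument names and attributes below are not an agreed naming scheme), RED metrics and percentile-friendly histograms could be recorded roughly like this:

// Illustrative sketch of recording RED-style metrics with the OpenTelemetry metrics API
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('backend');

// Duration histogram: the backend (e.g. Prometheus) derives p50/p95/p99 from its buckets
const requestDuration = meter.createHistogram('http.server.duration', {
  unit: 'ms',
  description: 'HTTP request duration',
});

// Counter for request rate and error rate
const requestCount = meter.createCounter('http.server.requests');

export function recordRequest(route: string, statusCode: number, durationMs: number) {
  const attributes = { route, status_code: statusCode };
  requestDuration.record(durationMs, attributes);
  requestCount.add(1, attributes);
}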

When V3 Is Needed

You'll know it's time for V3 when:

  • Microservices Architecture - Multiple services need coordinated observability
  • Scale Challenges - Performance issues that require percentile analysis
  • Distributed Debugging - Issues span multiple services and require unified view
  • Compliance/SLA - Need to prove performance characteristics for contracts

V3 Implementation Considerations

Architecture Changes

  • Service Mesh: Consider Istio or Linkerd for automatic trace propagation
  • Centralized Tracing: Ensure Grafana Tempo can handle distributed traces
  • Metrics Backend: Add Prometheus for metrics collection
  • Alert Manager: Set up Alertmanager for trace-based alerts

Team Changes

  • SRE Role: May need dedicated SRE for observability infrastructure
  • On-Call Runbooks: Create runbooks using trace data for common issues
  • Performance Budget: Define performance SLAs with trace-based monitoring

Cost Considerations

  • Trace Volume: Distributed tracing generates significantly more data
  • Sampling Strategy: Implement intelligent sampling (trace all errors, sample successes); see the sketch below
  • Storage Costs: Tempo storage costs scale with trace retention
  • Network Overhead: Trace propagation adds latency
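
On the sampling bullet above: head-based probabilistic sampling can be configured directly in the SDK, while "trace all errors" generally requires tail-based sampling (for example in an OpenTelemetry Collector), because errors are not yet known when the root span makes its sampling decision. A sketch of the head-sampling half only (the 10% ratio is illustrative):

// Illustrative head-based sampling: keep ~10% of traces, respect the parent's decision.
// "Trace all errors" needs tail-based sampling in a collector and is not shown here.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1), // sample 10% of new traces (illustrative ratio)
  }),
});

sdk.start();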

Decision Matrix: Which Version Do I Need?

Scenario                            V1   V2   V3
Single monolithic backend           ✅    -    -
Simple CRUD operations              ✅    -    -
Need to debug slow requests         ✅    -    -
Complex multi-step operations       -    ✅    -
Performance bottleneck analysis     -    ✅    -
Background job orchestration        -    ✅    -
Microservices architecture          -    -    ✅
Cross-service debugging             -    -    ✅
SLA/compliance requirements         -    -    ✅
Need percentile metrics             -    -    ✅

Migration Path

V1 → V2

Trigger: Team reports difficulty debugging complex operations with logs alone

Steps:

  1. Identify 2-3 most complex operations that need tracing
  2. Add traceAsync to those specific methods
  3. Measure improvement in debugging time
  4. Document the pattern for other developers
  5. Gradually add tracing to other complex operations as needed

Effort: Low - Infrastructure already exists, just need to use CustomTracingService

V2 → V3

Trigger: Moving to microservices or scaling challenges emerge

Steps:

  1. Evaluate service mesh options (Istio vs. Linkerd)
  2. Set up Prometheus for metrics collection
  3. Configure Tempo for distributed traces
  4. Implement sampling strategy
  5. Create distributed trace dashboards
  6. Train team on distributed debugging

Effort: High - Significant infrastructure changes required


Current Status

We are on V1 and will remain on V1 until we encounter specific pain points that V1 cannot solve.

Indicators we're ready for V2:

  • Developers frequently say "I can see it's slow but can't tell which step"
  • Complex operations (5+ steps) are difficult to debug
  • Background job orchestration needs better visibility

Indicators we're ready for V3:

  • Planning microservices migration
  • Need cross-service request tracing
  • Compliance requires proving SLA adherence

Current focus: Make V1 excellent before considering V2. Ensure all services follow logging standards and use template variables consistently.