Observability Roadmap
This document outlines the evolution of our observability strategy across V1, V2, and V3. Each version adds capabilities based on actual needs rather than theoretical completeness.
Philosophy
Start simple, add complexity only when needed.
Observability exists to help developers debug issues and understand system behavior. The best observability strategy is one that developers actually use. Therefore:
- ✅ Prefer automatic instrumentation over manual code
- ✅ Keep developer experience simple (1-2 lines of code)
- ✅ Add capabilities only when pain points emerge
- ✅ Measure the cost/benefit of each observability layer
V1: Simplicity First (Current)
Goal: Provide excellent observability with minimal developer effort.
What We Have
Automatic (Zero Code Required)
- HTTP Request Tracing - Every request automatically traced with timing, status, route
- Database Query Tracing - All TypeORM operations automatically traced with query type, table, duration
- Log Correlation - All logs automatically include `requestId` and `traceId` for flow reconstruction
- Error Tracking - HTTP layer captures all unhandled exceptions with full context
- External HTTP Calls - Instrumented HTTP clients trace outbound requests
- RabbitMQ Messages - Instrumented message handlers trace message processing
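The roadmap does not name the tracing SDK behind this automatic instrumentation; given the Grafana Tempo and span APIs referenced later, OpenTelemetry is a reasonable assumption. A minimal auto-instrumentation bootstrap might look like the sketch below (package choices, endpoint, and service name are assumptions, not the project's actual setup):

```typescript
// tracing.ts - a minimal sketch, assuming OpenTelemetry is the tracing stack.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'backend', // hypothetical service name
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces', // assumed collector/Tempo OTLP endpoint
  }),
  // Auto-instrumentations cover inbound HTTP and outbound HTTP clients out of
  // the box; TypeORM and RabbitMQ coverage depends on the instrumentation
  // packages actually installed.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```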
Manual (Simple 1-Line Patterns)
- Structured Logging - `logger.logWithContext(userLogs.readOneStarting, context)`
- Template Variables - `userLogs.userNotFound.replace({ id: userId })`
- Request Context - `...this.contextService.getLoggingContext()` adds requestId/traceId automatically
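The `userLogs` template objects referenced above are not defined in this roadmap. Purely as an illustration, one hypothetical shape that supports both `logWithContext` and `.replace()` is sketched here (the names, messages, and placeholder syntax are all assumptions):

```typescript
// Hypothetical sketch of a log template with a replace() helper -
// the real userLogs shape is not documented in this roadmap.
type LogTemplate = {
  message: string;
  replace: (vars: Record<string, string | number>) => string;
};

const makeTemplate = (message: string): LogTemplate => ({
  message,
  replace: (vars) =>
    // Substitutes {id}-style placeholders with the provided values.
    message.replace(/\{(\w+)\}/g, (_, key) => String(vars[key] ?? `{${key}}`)),
});

export const userLogs = {
  readOneStarting: makeTemplate('Reading user'),
  userNotFound: makeTemplate('User {id} not found'),
};

// Usage: userLogs.userNotFound.replace({ id: userId })
```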
What We Deliberately Don't Have
- ❌ Service-Level Tracing - No `traceAsync` wrappers around service methods
- ❌ Manual Span Creation - No `span.setAttributes()` or `span.addEvent()` in business logic
- ❌ Complex Instrumentation - No custom metric collectors or profilers
Why This Works for V1
95% of debugging needs are satisfied by:
- Flow Reconstruction - Filter logs by `requestId` to see the entire request flow
- HTTP Timing - See which requests are slow (automatic HTTP trace)
- Database Timing - See which queries are slow (automatic DB trace)
- Error Context - See full context when errors occur (logs + trace)
- Console Readability - Logs are immediately actionable with inline variables
Example debugging workflow:
1. User reports slow page load
2. Find requestId from frontend logs
3. Filter backend logs by requestId
4. See all operations in order with timestamps
5. Identify slow database query from DB trace
6. Optimize query
No service-level tracing needed for this workflow.
When V1 Isn't Enough
You'll know it's time for V2 when you encounter:
- Complex operations where you can't tell which step is slow from logs alone
- Distributed operations where one request triggers multiple background jobs
- Performance bottlenecks where HTTP trace shows slowness but you need granular timing within a service method
V2: Targeted Service-Level Tracing
Goal: Add service-level tracing only for complex operations that need it.
What Changes
New Capability: Opt-In Service Tracing
Services can wrap complex operations in `traceAsync` only when needed:

```typescript
// Most methods stay simple (no tracing):
async readOne(values: UserReadOneInput): Promise<User> {
  const context = { module: 'UserService', method: 'readOne', ...this.contextService.getLoggingContext() };
  this.logger.logWithContext(userLogs.readOneStarting, context);
  const user = await this.repository.findOne({ where: { id: values.id } });
  this.logger.logWithContext(userLogs.readOneComplete, context);
  return user;
}

// Only complex operations get tracing:
async processOrder(values: OrderInput): Promise<Order> {
  return await this.tracingService.traceAsync('OrderService.processOrder', async (span) => {
    span.setAttributes({ 'order.id': values.id });

    // Step 1: Validate inventory (traced as child span)
    await this.tracingService.traceAsync('validateInventory', async (inventorySpan) => {
      // Complex validation logic with multiple database calls
    });

    // Step 2: Calculate pricing (traced as child span)
    await this.tracingService.traceAsync('calculatePricing', async (pricingSpan) => {
      // Complex pricing logic with external API calls
    });

    // Step 3: Create and persist the order
    const order = this.repository.create({ ...values });
    return await this.repository.save(order);
  });
}
```
Key principle: Only use `traceAsync` when you specifically need a timing breakdown of complex operations.
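The roadmap references `CustomTracingService` but does not show its implementation. Assuming it wraps the OpenTelemetry API (an assumption, not a documented fact), `traceAsync` could be a thin helper along these lines:

```typescript
import { trace, Span, SpanStatusCode } from '@opentelemetry/api';

// Sketch only - the real CustomTracingService may differ.
export class CustomTracingService {
  private readonly tracer = trace.getTracer('backend'); // tracer name is an assumption

  async traceAsync<T>(name: string, fn: (span: Span) => Promise<T>): Promise<T> {
    // startActiveSpan makes the new span current, so nested traceAsync calls
    // and auto-instrumented DB/HTTP spans become child spans automatically.
    return this.tracer.startActiveSpan(name, async (span) => {
      try {
        return await fn(span);
      } catch (err) {
        span.recordException(err as Error);
        span.setStatus({ code: SpanStatusCode.ERROR });
        throw err;
      } finally {
        span.end();
      }
    });
  }
}
```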
When to Use Service-Level Tracing in V2
Add `traceAsync` only for operations that meet these criteria:
- Complex Multi-Step Operations
  - 5+ sequential steps within one service method
  - Need to identify which step is the bottleneck
  - Example: Order processing with validation, pricing, inventory, payment
- Performance Bottlenecks
  - HTTP trace shows the endpoint is slow (e.g., 2 seconds)
  - Logs show it completed all steps
  - Need granular timing to find which internal step is slow
  - Example: Report generation with multiple aggregations
- Distributed Operations (see the propagation sketch after this list)
  - One request triggers multiple background jobs or events
  - Need to trace the entire workflow across async boundaries
  - Example: User signup triggers email, analytics, and CRM sync
- External Integrations
  - Service makes multiple external API calls
  - Need to attribute latency to specific external services
  - Example: Payment processing with multiple provider fallbacks
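For the Distributed Operations case, trace context has to cross the message broker so consumer spans join the originating request's trace. The sketch below uses the OpenTelemetry propagation API; the amqplib calls and message shape are illustrative assumptions, and if the RabbitMQ instrumentation from V1 already propagates context, this manual step is unnecessary:

```typescript
import { context, propagation } from '@opentelemetry/api';

// Publisher side: serialize the current trace context into message headers.
function publishWithTraceContext(channel: any, queue: string, payload: object): void {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers);
  channel.sendToQueue(queue, Buffer.from(JSON.stringify(payload)), { headers });
}

// Consumer side: restore the propagated context before doing traced work,
// so spans created here become children of the publisher's trace.
async function handleMessage(msg: {
  content: Buffer;
  properties: { headers?: Record<string, string> };
}): Promise<void> {
  const extracted = propagation.extract(context.active(), msg.properties.headers ?? {});
  await context.with(extracted, async () => {
    // ...process the message; any traceAsync or auto-instrumented spans here
    // are linked to the originating request's trace.
  });
}
```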
What Still Doesn't Need Tracing in V2
- ✅ Simple CRUD operations - Logs + HTTP trace are sufficient
- ✅ Single database calls - Automatic DB tracing handles this
- ✅ Straightforward business logic - Logs show the flow
- ✅ Controller methods - HTTP tracing handles request/response
V2 Implementation Checklist
When ready to implement V2:
- Review `CustomTracingService` implementation
- Create documentation for when to use `traceAsync`
- Add examples of traced vs. untraced operations
- Update code review guidelines to prevent over-tracing
- Add metrics to measure tracing overhead
- Create training materials for the team
V3: Distributed Systems & Advanced Observability
Goal: Support microservices architecture and advanced performance analysis.
What Changes
New Capabilities
- Cross-Service Tracing
  - Trace propagation across multiple backend services
  - Service mesh integration (Istio, Linkerd)
  - Distributed trace visualization showing service dependencies
- Advanced Metrics (see the metrics sketch after this list)
  - Request rate, error rate, duration (RED metrics) per service
  - Percentile latencies (p50, p95, p99)
  - Histograms for distribution analysis
  - Custom business metrics (orders/sec, revenue/min)
- Performance Profiling
  - CPU flame graphs for hot path identification
  - Memory profiling for leak detection
  - Continuous profiling in production
- Real-Time Alerting
  - Trace-based alerts (e.g., "payment flow exceeded 5s")
  - Anomaly detection on trace patterns
  - Automatic incident creation from trace data
- Event-Driven Tracing
  - Trace across event bus (RabbitMQ, Kafka, SQS)
  - Async job tracing with parent-child relationships
  - Scheduled task tracing
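Prometheus appears later in this section as the metrics backend; as one illustration of the Advanced Metrics item, a minimal RED-metrics sketch using prom-client (an assumed client library, not a documented choice) could look like this:

```typescript
import { Counter, Histogram, register } from 'prom-client';

// RED metrics: Rate and Errors via a counter, Duration via a histogram.
// Histogram buckets let Prometheus compute p50/p95/p99 with histogram_quantile.
const httpRequests = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2, 5], // bucket boundaries are an assumption
});

// Call this from HTTP middleware once a response finishes.
export function recordRequest(method: string, route: string, status: number, seconds: number): void {
  const labels = { method, route, status: String(status) };
  httpRequests.inc(labels);
  httpDuration.observe(labels, seconds);
}

// Expose register.metrics() on a /metrics endpoint for Prometheus to scrape.
```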
When V3 Is Needed
You'll know it's time for V3 when:
- Microservices Architecture - Multiple services need coordinated observability
- Scale Challenges - Performance issues that require percentile analysis
- Distributed Debugging - Issues span multiple services and require unified view
- Compliance/SLA - Need to prove performance characteristics for contracts
V3 Implementation Considerations
Architecture Changes
- Service Mesh: Consider Istio or Linkerd for automatic trace propagation
- Centralized Tracing: Ensure Grafana Tempo can handle distributed traces
- Metrics Backend: Add Prometheus for metrics collection
- Alert Manager: Set up Alertmanager for trace-based alerts
Team Changes
- SRE Role: May need dedicated SRE for observability infrastructure
- On-Call Runbooks: Create runbooks using trace data for common issues
- Performance Budget: Define performance SLAs with trace-based monitoring
Cost Considerations
- Trace Volume: Distributed tracing generates significantly more data
- Sampling Strategy: Implement intelligent sampling (trace all errors, sample successes); see the sketch after this list
- Storage Costs: Tempo storage costs scale with trace retention
- Network Overhead: Trace propagation adds latency
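As a sketch of the Sampling Strategy bullet above, assuming the OpenTelemetry Node SDK (the 10% ratio is illustrative, not a recommendation):

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

// Head sampling: keep ~10% of root traces and follow the parent's sampling
// decision in downstream services, so a trace is never half-sampled.
const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
});
sdk.start();

// Note: "trace all errors, sample successes" cannot be decided when a trace
// starts; it is usually implemented as tail-based sampling in the
// OpenTelemetry Collector (tail_sampling processor), not in the application SDK.
```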
Decision Matrix: Which Version Do I Need?
| Scenario | V1 | V2 | V3 |
|---|---|---|---|
| Single monolithic backend | ✅ | - | - |
| Simple CRUD operations | ✅ | - | - |
| Need to debug slow requests | ✅ | - | - |
| Complex multi-step operations | - | ✅ | - |
| Performance bottleneck analysis | - | ✅ | - |
| Background job orchestration | - | ✅ | - |
| Microservices architecture | - | - | ✅ |
| Cross-service debugging | - | - | ✅ |
| SLA/compliance requirements | - | - | ✅ |
| Need percentile metrics | - | - | ✅ |
Migration Path
V1 → V2
Trigger: Team reports difficulty debugging complex operations with logs alone
Steps:
- Identify the 2-3 most complex operations that need tracing
- Add `traceAsync` to those specific methods
- Measure the improvement in debugging time
- Document the pattern for other developers
- Gradually add tracing to other complex operations as needed
Effort: Low - The infrastructure already exists; developers just need to use `CustomTracingService`
V2 → V3
Trigger: Moving to microservices or scaling challenges emerge
Steps:
- Evaluate service mesh options (Istio vs. Linkerd)
- Set up Prometheus for metrics collection
- Configure Tempo for distributed traces
- Implement sampling strategy
- Create distributed trace dashboards
- Train team on distributed debugging
Effort: High - Significant infrastructure changes required
Current Status
We are on V1 and will remain on V1 until we encounter specific pain points that V1 cannot solve.
Indicators we're ready for V2:
- Developers frequently say "I can see it's slow but can't tell which step"
- Complex operations (5+ steps) are difficult to debug
- Background job orchestration needs better visibility
Indicators we're ready for V3:
- Planning microservices migration
- Need cross-service request tracing
- Compliance requires proving SLA adherence
Current focus: Make V1 excellent before considering V2. Ensure all services follow logging standards and use template variables consistently.