Observability Roadmap
This document outlines the evolution of our observability strategy across V1, V2, and V3. Each version adds capabilities based on actual needs rather than theoretical completeness.
Philosophy
Start simple, add complexity only when needed.
Observability exists to help developers debug issues and understand system behavior. The best observability strategy is one that developers actually use. Therefore:
- ✅ Prefer automatic instrumentation over manual code
- ✅ Keep developer experience simple (1-2 lines of code)
- ✅ Add capabilities only when pain points emerge
- ✅ Measure the cost/benefit of each observability layer
V1: Simplicity First (Current)
Goal: Provide excellent observability with minimal developer effort.
What We Have
Automatic (Zero Code Required)
- HTTP Request Tracing - Every request automatically traced with timing, status, route
- Database Query Tracing - All TypeORM operations automatically traced with query type, table, duration
- Log Correlation - All logs automatically include `requestId` and `traceId` for flow reconstruction
- Error Tracking - HTTP layer captures all unhandled exceptions with full context
- External HTTP Calls - Instrumented HTTP clients trace outbound requests
- RabbitMQ Messages - Instrumented message handlers trace message processing
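The roadmap does not name the tracing SDK behind this automatic instrumentation; given the Grafana Tempo and span APIs referenced later, OpenTelemetry is a reasonable assumption. A minimal auto-instrumentation bootstrap might look like the sketch below (package choices, endpoint, and service name are assumptions, not the project's actual setup):

```typescript
// tracing.ts - a minimal sketch, assuming OpenTelemetry is the tracing stack.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'backend', // hypothetical service name
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces', // assumed collector/Tempo OTLP endpoint
  }),
  // Auto-instrumentations cover inbound HTTP and outbound HTTP clients out of
  // the box; TypeORM and RabbitMQ coverage depends on the instrumentation
  // packages actually installed.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```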
Manual (Simple 1-Line Patterns)
- Structured Logging - `logger.logWithContext(userLogs.readOneStarting, context)`
- Template Variables - `userLogs.userNotFound.replace({ id: userId })`
- Request Context - `...this.contextService.getLoggingContext()` adds requestId/traceId automatically
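The `userLogs` template objects referenced above are not defined in this roadmap. Purely as an illustration, one hypothetical shape that supports both `logWithContext` and `.replace()` is sketched here (the names, messages, and placeholder syntax are all assumptions):

```typescript
// Hypothetical sketch of a log template with a replace() helper -
// the real userLogs shape is not documented in this roadmap.
type LogTemplate = {
  message: string;
  replace: (vars: Record<string, string | number>) => string;
};

const makeTemplate = (message: string): LogTemplate => ({
  message,
  replace: (vars) =>
    // Substitutes {id}-style placeholders with the provided values.
    message.replace(/\{(\w+)\}/g, (_, key) => String(vars[key] ?? `{${key}}`)),
});

export const userLogs = {
  readOneStarting: makeTemplate('Reading user'),
  userNotFound: makeTemplate('User {id} not found'),
};

// Usage: userLogs.userNotFound.replace({ id: userId })
```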
What We Deliberately Don't Have
- ❌ Service-Level Tracing - No `traceAsync` wrappers around service methods
- ❌ Manual Span Creation - No `span.setAttributes()` or `span.addEvent()` in business logic
- ❌ Complex Instrumentation - No custom metric collectors or profilers
Why This Works for V1
95% of debugging needs are satisfied by:
- Flow Reconstruction - Filter logs by `requestId` to see the entire request flow
- HTTP Timing - See which requests are slow (automatic HTTP trace)
- Database Timing - See which queries are slow (automatic DB trace)
- Error Context - See full context when errors occur (logs + trace)
- Console Readability - Logs are immediately actionable with inline variables
Example debugging workflow:
1. User reports slow page load
2. Find requestId from frontend logs
3. Filter backend logs by requestId
4. See all operations in order with timestamps
5. Identify slow database query from DB trace
6. Optimize query
No service-level tracing needed for this workflow.
When V1 Isn't Enough
You'll know it's time for V2 when you encounter:
- Complex operations where you can't tell which step is slow from logs alone
- Distributed operations where one request triggers multiple background jobs
- Performance bottlenecks where HTTP trace shows slowness but you need granular timing within a service method
V2: Targeted Service-Level Tracing
Goal: Add service-level tracing only for complex operations that need it.
What Changes
New Capability: Opt-In Service Tracing
Services can wrap complex operations in `traceAsync` only when needed:

```typescript
// Most methods stay simple (no tracing):
async readOne(values: UserReadOneInput): Promise<User> {
  const context = { module: 'UserService', method: 'readOne', ...this.contextService.getLoggingContext() };
  this.logger.logWithContext(userLogs.readOneStarting, context);
  const user = await this.repository.findOne({ where: { id: values.id } });
  this.logger.logWithContext(userLogs.readOneComplete, context);
  return user;
}

// Only complex operations get tracing:
async processOrder(values: OrderInput): Promise<Order> {
  return await this.tracingService.traceAsync('OrderService.processOrder', async (span) => {
    span.setAttributes({ 'order.id': values.id });

    // Step 1: Validate inventory (traced as child span)
    await this.tracingService.traceAsync('validateInventory', async (inventorySpan) => {
      // Complex validation logic with multiple database calls
    });

    // Step 2: Calculate pricing (traced as child span)
    await this.tracingService.traceAsync('calculatePricing', async (pricingSpan) => {
      // Complex pricing logic with external API calls
    });

    // Step 3: Create and persist the order
    const order = this.repository.create({ ...values });
    return await this.repository.save(order);
  });
}
```
Key principle: Only use `traceAsync` when you specifically need a timing breakdown of complex operations.
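The roadmap references `CustomTracingService` but does not show its implementation. Assuming it wraps the OpenTelemetry API (an assumption, not a documented fact), `traceAsync` could be a thin helper along these lines:

```typescript
import { trace, Span, SpanStatusCode } from '@opentelemetry/api';

// Sketch only - the real CustomTracingService may differ.
export class CustomTracingService {
  private readonly tracer = trace.getTracer('backend'); // tracer name is an assumption

  async traceAsync<T>(name: string, fn: (span: Span) => Promise<T>): Promise<T> {
    // startActiveSpan makes the new span current, so nested traceAsync calls
    // and auto-instrumented DB/HTTP spans become child spans automatically.
    return this.tracer.startActiveSpan(name, async (span) => {
      try {
        return await fn(span);
      } catch (err) {
        span.recordException(err as Error);
        span.setStatus({ code: SpanStatusCode.ERROR });
        throw err;
      } finally {
        span.end();
      }
    });
  }
}
```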
When to Use Service-Level Tracing in V2
Add `traceAsync` only for operations that meet these criteria:
- Complex Multi-Step Operations
  - 5+ sequential steps within one service method
  - Need to identify which step is the bottleneck
  - Example: Order processing with validation, pricing, inventory, payment
- Performance Bottlenecks
  - HTTP trace shows the endpoint is slow (e.g., 2 seconds)
  - Logs show it completed all steps
  - Need granular timing to find which internal step is slow
  - Example: Report generation with multiple aggregations
- Distributed Operations (see the propagation sketch after this list)
  - One request triggers multiple background jobs or events
  - Need to trace the entire workflow across async boundaries
  - Example: User signup triggers email, analytics, and CRM sync
- External Integrations
  - Service makes multiple external API calls
  - Need to attribute latency to specific external services
  - Example: Payment processing with multiple provider fallbacks
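For the Distributed Operations case, trace context has to cross the message broker so consumer spans join the originating request's trace. The sketch below uses the OpenTelemetry propagation API; the amqplib calls and message shape are illustrative assumptions, and if the RabbitMQ instrumentation from V1 already propagates context, this manual step is unnecessary:

```typescript
import { context, propagation } from '@opentelemetry/api';

// Publisher side: serialize the current trace context into message headers.
function publishWithTraceContext(channel: any, queue: string, payload: object): void {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers);
  channel.sendToQueue(queue, Buffer.from(JSON.stringify(payload)), { headers });
}

// Consumer side: restore the propagated context before doing traced work,
// so spans created here become children of the publisher's trace.
async function handleMessage(msg: {
  content: Buffer;
  properties: { headers?: Record<string, string> };
}): Promise<void> {
  const extracted = propagation.extract(context.active(), msg.properties.headers ?? {});
  await context.with(extracted, async () => {
    // ...process the message; any traceAsync or auto-instrumented spans here
    // are linked to the originating request's trace.
  });
}
```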
What Still Doesn't Need Tracing in V2
- ✅ Simple CRUD operations - Logs + HTTP trace are sufficient
- ✅ Single database calls - Automatic DB tracing handles this
- ✅ Straightforward business logic - Logs show the flow
- ✅ Controller methods - HTTP tracing handles request/response
V2 Implementation Checklist
When ready to implement V2:
- Review `CustomTracingService` implementation
- Create documentation for when to use `traceAsync`
- Add examples of traced vs. untraced operations
- Update code review guidelines to prevent over-tracing
- Add metrics to measure tracing overhead
- Create training materials for the team
V3: Distributed Systems & Advanced Observability
Goal: Support microservices architecture and advanced performance analysis.
What Changes
New Capabilities
- Cross-Service Tracing
  - Trace propagation across multiple backend services
  - Service mesh integration (Istio, Linkerd)
  - Distributed trace visualization showing service dependencies
- Advanced Metrics (see the metrics sketch after this list)
  - Request rate, error rate, duration (RED metrics) per service
  - Percentile latencies (p50, p95, p99)
  - Histograms for distribution analysis
  - Custom business metrics (orders/sec, revenue/min)
- Performance Profiling
  - CPU flame graphs for hot path identification
  - Memory profiling for leak detection
  - Continuous profiling in production
- Real-Time Alerting
  - Trace-based alerts (e.g., "payment flow exceeded 5s")
  - Anomaly detection on trace patterns
  - Automatic incident creation from trace data
- Event-Driven Tracing
  - Trace across event bus (RabbitMQ, Kafka, SQS)
  - Async job tracing with parent-child relationships
  - Scheduled task tracing
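Prometheus appears later in this section as the metrics backend; as one illustration of the Advanced Metrics item, a minimal RED-metrics sketch using prom-client (an assumed client library, not a documented choice) could look like this:

```typescript
import { Counter, Histogram, register } from 'prom-client';

// RED metrics: Rate and Errors via a counter, Duration via a histogram.
// Histogram buckets let Prometheus compute p50/p95/p99 with histogram_quantile.
const httpRequests = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.05, 0.1, 0.25, 0.5, 1, 2, 5], // bucket boundaries are an assumption
});

// Call this from HTTP middleware once a response finishes.
export function recordRequest(method: string, route: string, status: number, seconds: number): void {
  const labels = { method, route, status: String(status) };
  httpRequests.inc(labels);
  httpDuration.observe(labels, seconds);
}

// Expose register.metrics() on a /metrics endpoint for Prometheus to scrape.
```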
When V3 Is Needed
You'll know it's time for V3 when:
- Microservices Architecture - Multiple services need coordinated observability
- Scale Challenges - Performance issues that require percentile analysis
- Distributed Debugging - Issues span multiple services and require unified view
- Compliance/SLA - Need to prove performance characteristics for contracts
V3 Implementation Considerations
Architecture Changes
- Service Mesh: Consider Istio or Linkerd for automatic trace propagation
- Centralized Tracing: Ensure Grafana Tempo can handle distributed traces
- Metrics Backend: Add Prometheus for metrics collection
- Alert Manager: Set up Alertmanager for trace-based alerts
Team Changes
- SRE Role: May need dedicated SRE for observability infrastructure
- On-Call Runbooks: Create runbooks using trace data for common issues
- Performance Budget: Define performance SLAs with trace-based monitoring
Cost Considerations
- Trace Volume: Distributed tracing generates significantly more data
- Sampling Strategy: Implement intelligent sampling (trace all errors, sample successes); see the sketch after this list
- Storage Costs: Tempo storage costs scale with trace retention
- Network Overhead: Trace propagation adds latency
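As a sketch of the Sampling Strategy bullet above, assuming the OpenTelemetry Node SDK (the 10% ratio is illustrative, not a recommendation):

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

// Head sampling: keep ~10% of root traces and follow the parent's sampling
// decision in downstream services, so a trace is never half-sampled.
const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
});
sdk.start();

// Note: "trace all errors, sample successes" cannot be decided when a trace
// starts; it is usually implemented as tail-based sampling in the
// OpenTelemetry Collector (tail_sampling processor), not in the application SDK.
```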
Decision Matrix: Which Version Do I Need?
| Scenario | V1 | V2 | V3 |
|---|---|---|---|
| Single monolithic backend | ✅ | - | - |
| Simple CRUD operations | ✅ | - | - |
| Need to debug slow requests | ✅ | - | - |
| Complex multi-step operations | - | ✅ | - |
| Performance bottleneck analysis | - | ✅ | - |
| Background job orchestration | - | ✅ | - |
| Microservices architecture | - | - | ✅ |
| Cross-service debugging | - | - | ✅ |
| SLA/compliance requirements | - | - | ✅ |
| Need percentile metrics | - | - | ✅ |
Migration Path
V1 → V2
Trigger: Team reports difficulty debugging complex operations with logs alone
Steps:
- Identify the 2-3 most complex operations that need tracing
- Add `traceAsync` to those specific methods
- Measure the improvement in debugging time
- Document the pattern for other developers
- Gradually add tracing to other complex operations as needed
Effort: Low - The infrastructure already exists; developers just need to use `CustomTracingService`
V2 → V3
Trigger: Moving to microservices or scaling challenges emerge
Steps:
- Evaluate service mesh options (Istio vs. Linkerd)
- Set up Prometheus for metrics collection
- Configure Tempo for distributed traces
- Implement sampling strategy
- Create distributed trace dashboards
- Train team on distributed debugging
Effort: High - Significant infrastructure changes required
Current Status
We are on V1 and will remain on V1 until we encounter specific pain points that V1 cannot solve.
Indicators we're ready for V2:
- Developers frequently say "I can see it's slow but can't tell which step"
- Complex operations (5+ steps) are difficult to debug
- Background job orchestration needs better visibility
Indicators we're ready for V3:
- Planning microservices migration
- Need cross-service request tracing
- Compliance requires proving SLA adherence
Current focus: Make V1 excellent before considering V2. Ensure all services follow logging standards and use template variables consistently.