Architecture
Welcome to Yew Search! This document provides a high-level overview of how the system works. If you're new to the project, start here before diving into specific component documentation.
There is a lot in this document, so feel free to skip around.
What is Yew Search?
Yew Search is a privacy-focused personal search engine that aggregates data from your external services (Gmail, Slack, Google Drive, etc.) and makes it all searchable in one place. Think of it as "Google search for your personal data" - but you own the data and control who can access it.
Key Features:
- Search across all your connected services simultaneously
- Privacy-first: self-hosted, you control your data
- Plugin architecture: easily add new data sources
- Real-time synchronization with external services
- Session-based authentication for immediate security control
Core Architectural Principles
Yew Search is built on several key principles that guide all technical decisions:
1. Privacy and Security First
Users trust us with their most sensitive data. Every architectural decision prioritizes security:
- Cookie-based sessions (not JWT) for immediate logout capability
- OAuth tokens encrypted at rest
- No third-party services - everything is self-hosted
- Remote logout: instantly invalidate sessions when needed
2. Plugin Architecture
Integrations (Gmail, Slack, etc.) are self-contained plugins:
- No integration knows about the core application
- Each integration is independent and can be developed separately
- Adding a new integration doesn't require changing core code
- Integrations communicate through well-defined interfaces
3. Monolithic Architecture
We're a monolith, not microservices:
- Single NestJS backend application
- Single PostgreSQL database
- Simpler to develop, deploy, and reason about
- Performance overhead is negligible for our use case
- Session-based auth works perfectly for monoliths
4. Consistency Through Constraints
Small teams need systems that enforce order automatically:
- PostgreSQL enforces data consistency (foreign keys, constraints, schemas)
- TypeScript provides compile-time type safety
- Strict coding standards reduce cognitive load
- Everything has one clear way to do it
5. Self-Hosted First
Must run on minimal hardware (Raspberry Pi 4+):
- No mandatory cloud dependencies
- Docker containers for easy deployment
- Resource-efficient design
- Works offline (except for syncing new data)
System Architecture
High-Level Components
┌─────────────────────────────────────────────────────────────────┐
│                         User's Browser                          │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │               Next.js Frontend (Port 3001)                │  │
│  │  • Search Interface      • OAuth Authorization UI         │  │
│  │  • Login/Register        • Settings & Profile             │  │
│  └───────────────────────────────────────────────────────────┘  │
└────────────────────┬──────────────────────────────────────────┬─┘
                     │                                          │
                     │ HTTP/REST API                            │ Cookie
                     │                                          │ (yew_session)
                     │                                          │
┌────────────────────▼──────────────────────────────────────────▼─┐
│                   NestJS Backend (Port 3000)                    │
│                                                                 │
│  ┌─────────────┐  ┌──────────────┐  ┌────────────────────────┐  │
│  │ Auth        │  │ User API     │  │ Search API             │  │
│  │ • Login     │  │ • Profile    │  │ • Query Parser         │  │
│  │ • Logout    │  │ • Sessions   │  │ • Full-Text Search     │  │
│  │ • Register  │  └──────────────┘  │ • Result Ranking       │  │
│  └─────────────┘                    └────────────────────────┘  │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                    Integration System                     │  │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐           │  │
│  │  │ Gmail      │  │ Slack      │  │ FTP        │  ...      │  │
│  │  │ Integration│  │ Integration│  │ Integration│           │  │
│  │  └────────────┘  └────────────┘  └────────────┘           │  │
│  │                                                           │  │
│  │  ┌──────────────────────────────────────────────┐         │  │
│  │  │         Polling System (Background)          │         │  │
│  │  │  • Task Queue (Bull/RabbitMQ)                │         │  │
│  │  │  • Priority-based scheduling                 │         │  │
│  │  │  • Idempotency checks                        │         │  │
│  │  └──────────────────────────────────────────────┘         │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                       Observability                       │  │
│  │  • Logging (structured JSON)                              │  │
│  │  • Tracing (OpenTelemetry)                                │  │
│  │  • Metrics (request duration, errors, etc.)               │  │
│  └───────────────────────────────────────────────────────────┘  │
└───────────────┬───────────────────────────────────────┬─────────┘
                │                                       │
                │                                       │
                ▼                                       ▼
┌───────────────────────────┐     ┌────────────────────────────┐
│    PostgreSQL Database    │     │   Redis (Session Store)    │
│                           │     │                            │
│  • user                   │     │  • Session IDs → User IDs  │
│  • user_session           │     │  • TTL-based expiration    │
│  • user_integration       │     │  • Fast lookups            │
│  • user_integration_      │     └────────────────────────────┘
│    content                │
│  • (OAuth encrypted)      │
└───────────────────────────┘
Data Flow
1. User Authentication Flow
User → Frontend → Backend → Redis
                     ↓
          PostgreSQL (user table)
                     ↓
                Set Cookie → Frontend → User
- User submits credentials
- Backend validates against PostgreSQL
- Backend creates session in Redis
- Backend sets secure cookie
- All subsequent requests include cookie
- Backend validates session on each request
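The create/validate/delete cycle above can be sketched as follows. This is a minimal illustration, not the project's code: an in-memory Map stands in for Redis, and the names `SessionStore`, `createSession`, `validateSession`, and `deleteSession`, as well as the injected clock, are assumptions made for the example.

```typescript
// Hypothetical sketch: an in-memory Map stands in for Redis.
interface SessionRecord {
  userId: string;
  expiresAt: number; // epoch milliseconds
}

class SessionStore {
  private sessions = new Map<string, SessionRecord>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Called after credentials have been validated against PostgreSQL.
  // The caller supplies a random, unguessable session ID.
  createSession(sessionId: string, userId: string): void {
    this.sessions.set(sessionId, { userId, expiresAt: this.now() + this.ttlMs });
  }

  // Called on every request with the value of the yew_session cookie.
  validateSession(sessionId: string): string | null {
    const record = this.sessions.get(sessionId);
    if (!record) return null;
    if (record.expiresAt <= this.now()) {
      this.sessions.delete(sessionId); // expired: mimics a Redis TTL
      return null;
    }
    return record.userId;
  }

  // Logout (local or remote) is a delete, effective on the next request.
  deleteSession(sessionId: string): void {
    this.sessions.delete(sessionId);
  }
}
```

Because every request goes through `validateSession`, deleting the entry is enough to lock a session out immediately, which is the property the cookie-based design is after.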
2. Integration Authorization Flow (OAuth)
User → "Connect Gmail" → Backend
                            ↓
                   Generate OAuth URL
                            ↓
                   Redirect to Google
                            ↓
                   User grants access
                            ↓
                  Google redirects back
                            ↓
                Exchange code for tokens
                            ↓
                 Encrypt & store tokens
                            ↓
             Store in user_integration table
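As a rough illustration of the "Generate OAuth URL" step, the sketch below builds a Google authorization URL by hand. The function name, the parameter object, and the use of a `state` value are assumptions for the example; the query parameters themselves are the standard OAuth 2.0 authorization-code parameters that Google accepts.

```typescript
// Hypothetical helper: builds the URL the user is redirected to.
interface OAuthUrlParams {
  clientId: string;    // e.g. the value of GMAIL_OAUTH_CLIENT_ID
  redirectUri: string; // backend callback that receives ?code=...&state=...
  scopes: string[];
  state: string;       // random value tied to the user's session (assumption)
}

function buildGoogleAuthUrl(p: OAuthUrlParams): string {
  const query: Array<[string, string]> = [
    ["client_id", p.clientId],
    ["redirect_uri", p.redirectUri],
    ["response_type", "code"],    // authorization code flow
    ["scope", p.scopes.join(" ")],
    ["state", p.state],
    ["access_type", "offline"],   // ask Google for a refresh token
  ];
  const qs = query
    .map(([k, v]) => `${encodeURIComponent(k)}=${encodeURIComponent(v)}`)
    .join("&");
  return `https://accounts.google.com/o/oauth2/v2/auth?${qs}`;
}
```

When Google redirects back, the backend would compare the returned `state` with the one it issued before exchanging the code for tokens.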
3. Data Synchronization Flow
Cron (every 1 min) → Create "start" task
                              ↓
               Polling System (Bull/RabbitMQ)
                              ↓
                   Load Integration Plugin
                              ↓
                Run Task (e.g., fetch emails)
                              ↓
              ┌───────────────┴────────────────┐
              ▼                                ▼
   New Tasks (pagination)             Output (email data)
              │                                │
              └────→ Queue            Save to PostgreSQL
                                  (user_integration_content)
Tasks spawn more tasks recursively until all data is synced.
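The recursion can be pictured as a loop that drains a queue while handlers keep adding to it. The sketch below is illustrative only: `Task`, `TaskResult`, `TaskHandler`, and `runUntilSynced` are hypothetical names, a plain array stands in for the Bull queue, and handlers are synchronous for brevity.

```typescript
// Hypothetical shapes: the real task and result types may differ.
interface Task {
  type: "start" | "list" | "download";
  payload: Record<string, string>;
}

interface TaskResult {
  newTasks: Task[]; // follow-up work, e.g. the next page or per-item downloads
  output: string[]; // content to persist (user_integration_content)
}

type TaskHandler = (task: Task) => TaskResult;

// Drains the queue: each task may enqueue more tasks until nothing is left.
function runUntilSynced(first: Task, handle: TaskHandler): string[] {
  const queue: Task[] = [first];
  const saved: string[] = [];
  while (queue.length > 0) {
    const task = queue.shift();
    if (!task) break;
    const result = handle(task);
    saved.push(...result.output);   // stand-in for the PostgreSQL write
    queue.push(...result.newTasks); // stand-in for enqueueing in Bull
  }
  return saved;
}
```

A handler that returns a "list" task for page 1 from "start", and download tasks plus the next "list" task from each page, will walk an entire mailbox without the loop knowing anything about pagination.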
4. Search Flow
User → Search Query → Backend
                         ↓
              Parse & sanitize query
                         ↓
            PostgreSQL full-text search
       (on user_integration_content.content)
                         ↓
               Rank & filter results
                         ↓
             Return to Frontend → User
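One way the sanitize and search steps could fit together is sketched below. The SQL uses standard PostgreSQL full-text functions (`to_tsvector`, `plainto_tsquery`, `ts_rank`), but the `user_id` and `id` columns, the 200-character cap, and the function name are assumptions, not the project's actual schema or query.

```typescript
// Hypothetical sketch: the table name comes from this document; the user_id
// and id columns and the exact query are assumptions.
interface SearchQuery {
  text: string;
  values: [string, string, number];
}

function buildSearchQuery(userId: string, rawQuery: string, limit = 20): SearchQuery {
  // Sanitize: trim, collapse whitespace, and cap the length.
  const cleaned = rawQuery.trim().replace(/\s+/g, " ").slice(0, 200);
  if (cleaned.length === 0) {
    throw new Error("Search query must not be empty");
  }
  // User input only travels as a bound parameter ($2), never as SQL text.
  const text = `
    SELECT id, content,
           ts_rank(to_tsvector('english', content::text),
                   plainto_tsquery('english', $2)) AS rank
    FROM user_integration_content
    WHERE user_id = $1
      AND to_tsvector('english', content::text) @@ plainto_tsquery('english', $2)
    ORDER BY rank DESC
    LIMIT $3`;
  return { text, values: [userId, cleaned, limit] };
}
```

Scoping the query by the authenticated user's ID in the WHERE clause is what keeps results from leaking across users.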
Major Components
Backend (NestJS)
Location: backend/
The backend is a NestJS monolith that handles all API requests, authentication, integration management, and background data synchronization.
Key Modules:
Session Module (src/v1/user-session/)
- Session management (create, validate, delete)
- Cookie-based sessions
User Module (src/v1/user/)
- User registration
- Password hashing with Argon2
- User profile management
- User settings
- Active session management
Integration Module (src/integration/)
- Dynamic integration loading
- OAuth endpoint (unified for all integrations)
- Integration instance management
- See: Writing Integrations
Polling Module (src/integration/polling/)
- Background task queue (Bull/RabbitMQ)
- Priority-based task scheduling
- Integration task execution
- Idempotency checking
- See: Integration Idempotency
Search Module (src/search/)
- Query parsing and validation
- Full-text search using PostgreSQL
- Result ranking and filtering
- (Future: Elasticsearch integration)
Frontend (Next.js)
Location: frontend/
Server-side rendered React application for the user interface.
Key Pages:
- /login - Authentication
- /register - New user signup
- /search - Main search interface
- /integrations - Manage connected services (OAuth)
- /settings - User preferences and active sessions
Authentication:
- Cookie-based (automatically sent with every request)
- Redirects to /login on 401 responses
- No client-side token management needed
Database (PostgreSQL)
Core Tables:
user
- User accounts
- Email, name, hashed password
- Created/updated timestamps
user_session
- Used to look up sessions by user, location, etc.
- Think of this more as a log of sessions
- Session ID, user ID
- Expiration, last accessed timestamp
- IP address, user agent
user_integration
- User's connected services (Gmail, Slack, etc.)
- OAuth credentials (encrypted)
- Integration domain, status
- Sync state (last history ID, etc.)
user_integration_content
- Actual synced data (emails, messages, files)
- JSONB content field (integration-specific structure)
- External ID (idempotency key)
- External timestamp (when created in source system)
See: Entity Standards
Session Store (Redis)
Sessions are stored in both Redis and PostgreSQL, but only Redis is used for authentication.
Redis is fast at looking up a session by session ID, but it is a poor fit for filtering sessions by user ID.
For that we use PostgreSQL, which lets us show users all of their sessions so they can manage them from one place.
See: Authorization
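The split between the two stores can be sketched as below. This is an assumption-laden illustration: a Map stands in for Redis, an array stands in for the user_session table, and `DualSessionIndex` and its methods are hypothetical names. Whether a revoked row is deleted or only marked in PostgreSQL is also an assumption.

```typescript
// Hypothetical sketch: two in-memory structures stand in for Redis and the
// user_session table in PostgreSQL.
interface SessionRow {
  sessionId: string;
  userId: string;
  userAgent: string;
}

class DualSessionIndex {
  private redisLike = new Map<string, string>(); // sessionId -> userId (auth path)
  private postgresLike: SessionRow[] = [];       // queryable rows (management path)

  add(row: SessionRow): void {
    this.redisLike.set(row.sessionId, row.userId);
    this.postgresLike.push(row);
  }

  // Hot path: one key lookup per request.
  authenticate(sessionId: string): string | undefined {
    return this.redisLike.get(sessionId);
  }

  // Management path: list every session that belongs to a user.
  listForUser(userId: string): SessionRow[] {
    return this.postgresLike.filter((r) => r.userId === userId);
  }

  // Remote logout: remove from both so the session stops authenticating at once.
  revoke(sessionId: string): void {
    this.redisLike.delete(sessionId);
    this.postgresLike = this.postgresLike.filter((r) => r.sessionId !== sessionId);
  }
}
```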
Task Queue (Bull/RabbitMQ)
Current: Bull (Redis-backed)
Future: RabbitMQ (when scaling is needed)
Background task processing for integration data synchronization:
- Priority queue (new data fetched before backfill)
- Retry logic for transient failures
- Dead-letter queue for permanent failures
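The three behaviours above can be illustrated with a toy queue. Bull provides its own implementations of priorities and retries, so this sketch only shows the semantics; `SketchQueue`, the `Job` shape, the "lower number runs first" convention, and the attempt limit are assumptions for the example.

```typescript
// Hypothetical sketch of priority ordering, retries, and dead-lettering.
interface Job {
  id: string;
  priority: number; // lower number runs first (assumption)
  attempts: number;
}

class SketchQueue {
  private pending: Job[] = [];
  readonly deadLetter: Job[] = [];
  private maxAttempts: number;

  constructor(maxAttempts: number) {
    this.maxAttempts = maxAttempts;
  }

  add(id: string, priority: number): void {
    this.pending.push({ id, priority, attempts: 0 });
  }

  // Runs jobs in priority order; failed jobs are retried, then dead-lettered.
  drain(work: (job: Job) => boolean): string[] {
    const done: string[] = [];
    while (this.pending.length > 0) {
      this.pending.sort((a, b) => a.priority - b.priority);
      const job = this.pending.shift();
      if (!job) break;
      job.attempts += 1;
      if (work(job)) {
        done.push(job.id);
      } else if (job.attempts < this.maxAttempts) {
        this.pending.push(job);    // transient failure: try again
      } else {
        this.deadLetter.push(job); // permanent failure: park for inspection
      }
    }
    return done;
  }
}
```

In this sketch a job for newly arrived mail with priority 1 runs before a backfill job with priority 10, regardless of the order in which they were added.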
Integrations (Plugins)
Location: backend/src/integration/*/
Each integration is a self-contained plugin:
Structure:
backend/src/integration/gmail/
├── src/
│   └── main.ts        # Integration implementation
├── manifest.json      # Metadata (domain, name, version)
└── package.json       # Dependencies
Responsibilities:
- Authenticate with external service (OAuth or credentials)
- Define task types (start, list, download, etc.)
- Execute tasks and return output + new tasks
- Handle pagination, error recovery, rate limiting
Key Concept: Integrations know nothing about the core application. They receive credentials and return data. The core application handles storage, queueing, and orchestration.
See: Writing Integrations
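To make the "credentials in, data out" boundary concrete, here is a hedged sketch of what a plugin contract could look like. All names and fields (`Integration`, `IntegrationTask`, `IntegrationResult`, `echoIntegration`) are hypothetical, and the method is synchronous for brevity; the real interface is described in Writing Integrations.

```typescript
// Hypothetical contract: names and fields are illustrative, not the real API.
interface Credentials {
  accessToken: string;
}

interface IntegrationTask {
  type: string; // "start", "list", "download", ...
  params: Record<string, string>;
}

interface IntegrationOutput {
  externalId: string; // idempotency key from the source system
  content: unknown;   // integration-specific structure
}

interface IntegrationResult {
  output: IntegrationOutput[];
  newTasks: IntegrationTask[];
}

// The only surface the core calls. Note what is absent: no database handle,
// no queue client, no user model.
interface Integration {
  domain: string;
  run(task: IntegrationTask, credentials: Credentials): IntegrationResult;
}

// A toy plugin that satisfies the contract without touching the core.
const echoIntegration: Integration = {
  domain: "echo",
  run(task, credentials) {
    if (task.type === "start") {
      return { output: [], newTasks: [{ type: "download", params: { id: "1" } }] };
    }
    const id = task.params["id"] ?? "unknown";
    return {
      output: [{ externalId: id, content: { tokenLength: credentials.accessToken.length } }],
      newTasks: [],
    };
  },
};
```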
Key Design Decisions
Why PostgreSQL Over MongoDB?
Decision: Use PostgreSQL for all data storage.
Rationale:
- Enforces consistency (foreign keys, constraints, schemas)
- Prevents "bad data practices" common in NoSQL
- Small teams need databases that enforce order automatically
- TypeORM provides excellent PostgreSQL support
See: Why Postgres
Why Cookie Sessions Over JWT?
Decision: Use cookie-based sessions, never JWT.
Rationale:
- Remote logout is critical - Can terminate sessions instantly
- A JWT cannot be revoked before it expires without adding server-side state (e.g., a 24-hour token leaves a 24-hour vulnerability window)
- Privacy focus requires immediate access control
- Central session store provides audit trail
See: Authorization
Why Monolith Over Microservices?
Decision: Single NestJS application, not microservices.
Rationale:
- Simpler to develop and deploy
- Small team doesn't need microservice complexity
- Session-based auth works perfectly for monoliths
- Monoliths are easy to deploy on local systems
- Can always split later if needed (YAGNI)
Why Task-Based Integration Architecture?
Decision: Integrations use recursive task spawning, not batch jobs.
Rationale:
- Handles pagination naturally (each page spawns next page task)
- Priority queue ensures new data fetched quickly during backfill
- Granular error handling (one failed email doesn't block others)
- Idempotency prevents duplicate data
See: Writing Integrations
Why Self-Hosted?
Decision: Must run on user's own hardware (Raspberry Pi 4+).
Rationale:
- Privacy: user controls their data
- No SaaS vendor lock-in
- No recurring cloud costs
- Compliance with data sovereignty requirements
Future Plans:
Yew Search will run in three different ways, similar to Mattermost, n8n, or GitLab.
- Local homelab / Raspberry Pi deployment (for individuals)
  - We like homelabs and want this community to like us
  - Homelabs give users a way to try the software before buying, with minimal work from us
  - Not all features will be available in the self-hosted option, and some cannot be
- Business (for companies)
  - Most companies don't want to self-host, which makes sense
  - Self-hosting is usually more expensive for a company than buying a subscription
  - This also allows more advanced features, such as SOC 2 compliance and others that companies rely on
  - This is hosted on our servers, and all companies share a pool of infrastructure
  - This is similar to how most hosted solutions work
- Enterprise (for large companies)
  - When a company outgrows the Business plan, it will move to the Enterprise plan
  - This comes with dedicated hardware and more support from our team
Data Privacy & Security
Authentication Security
- Argon2 password hashing (not bcrypt)
- HttpOnly cookies (JavaScript cannot access)
- Secure flag (HTTPS only in production)
- SameSite strict (CSRF protection)
- Rate limiting on login endpoint
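The cookie attributes listed above end up in a single Set-Cookie header. The helper below builds that header by hand purely to make the attributes visible; the function name, the `Path` and `Max-Age` choices, and the production switch are assumptions, and a real NestJS app would normally pass these as cookie options instead.

```typescript
// Hypothetical helper: assembles the Set-Cookie header value for yew_session.
function buildSessionCookie(
  sessionId: string,
  maxAgeSeconds: number,
  production: boolean,
): string {
  const parts = [
    `yew_session=${encodeURIComponent(sessionId)}`,
    "Path=/",
    `Max-Age=${maxAgeSeconds}`,
    "HttpOnly",        // not readable from JavaScript
    "SameSite=Strict", // not sent on cross-site requests
  ];
  if (production) parts.push("Secure"); // HTTPS only
  return parts.join("; ");
}
```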
Integration Security
- OAuth tokens encrypted at rest in database
- Tokens only decrypted when making API calls
- Integration plugins cannot access other users' data
- Each user's integrations isolated by user ID
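Encryption at rest for OAuth tokens could look like the sketch below, which uses AES-256-GCM from Node's built-in crypto module. This document does not say which cipher, key source, or storage format the project uses, so the algorithm choice, the dot-separated base64 format, and the function names are all assumptions.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Hypothetical sketch: encrypts a token with AES-256-GCM (32-byte key).
function encryptToken(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // fresh nonce per encryption
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  // Store nonce, auth tag, and ciphertext together in one column.
  return [iv, tag, ciphertext].map((b) => b.toString("base64")).join(".");
}

// Decrypts only at the moment an API call needs the token.
function decryptToken(stored: string, key: Buffer): string {
  const [iv, tag, ciphertext] = stored.split(".").map((s) => Buffer.from(s, "base64"));
  if (!iv || !tag || !ciphertext) throw new Error("Malformed encrypted token");
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag); // final() throws if the data or key is wrong
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

An authenticated mode such as GCM means a tampered or wrongly keyed value fails loudly on decryption instead of yielding garbage.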
Search Security
- Search queries scoped to authenticated user
- No cross-user data leakage
- All queries logged for audit trail
- Content never exposed in URLs (POST, not GET)
Development Workflow
Local Development Setup
TBD
Environment Variables:
- DATABASE_URL - PostgreSQL connection string
- REDIS_URL - Redis connection string
- SESSION_SECRET - Random secret for session encryption
- GMAIL_OAUTH_CLIENT_ID, GMAIL_OAUTH_CLIENT_SECRET - OAuth credentials for Gmail
- (See .env.example for full list)
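A startup check along the following lines can catch a missing variable before the first request does. It is a sketch only: `loadConfig`, `REQUIRED_ENV`, and `AppConfig` are hypothetical names, and the list covers just the variables named here rather than the full set in .env.example.

```typescript
// Hypothetical startup check: fail fast when a required variable is missing.
const REQUIRED_ENV = [
  "DATABASE_URL",
  "REDIS_URL",
  "SESSION_SECRET",
  "GMAIL_OAUTH_CLIENT_ID",
  "GMAIL_OAUTH_CLIENT_SECRET",
] as const;

type EnvName = (typeof REQUIRED_ENV)[number];
type AppConfig = Record<EnvName, string>;

function loadConfig(env: Record<string, string | undefined>): AppConfig {
  // Treat unset and empty values the same way.
  const missing = REQUIRED_ENV.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  const config = {} as AppConfig;
  for (const name of REQUIRED_ENV) {
    config[name] = env[name] as string;
  }
  return config;
}
```

In a Node process this would typically be called once at boot as `loadConfig(process.env)`.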
Raspberry Pi Deployment
Yew Search is designed to run on Raspberry Pi 4 (4GB+ RAM):
- Flash Raspberry Pi OS (64-bit)
- Install Docker & Docker Compose
- Clone repository
- Run docker-compose up -d
- Access via http://raspberrypi.local:3001
Performance Notes:
- PostgreSQL uses minimal indexes for low-memory usage
- Background tasks throttled to avoid CPU spikes
- Redis configured for low memory footprint
Observability
Logging
All logs are structured JSON with consistent fields:
- timestamp - ISO 8601
- level - debug, info, warn, error
- message - Human-readable description
- module - Which module emitted the log
- method - Which method emitted the log
- requestId - Request correlation ID
- userId - User ID (if authenticated)
- traceId - Distributed tracing ID
- spanId - Current span ID
See: Logging Standards
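As an illustration of the log shape, the helper below serializes one entry with the fields listed above. The field names come from this document; the function name, its signature, and the injectable timestamp are assumptions, and the project's real logger is defined in Logging Standards.

```typescript
// Hypothetical helper: produces one structured JSON log line.
type LogLevel = "debug" | "info" | "warn" | "error";

interface LogContext {
  module: string;
  method: string;
  requestId?: string;
  userId?: string;
  traceId?: string;
  spanId?: string;
}

function formatLog(
  level: LogLevel,
  message: string,
  ctx: LogContext,
  now: Date = new Date(),
): string {
  return JSON.stringify({
    timestamp: now.toISOString(), // ISO 8601
    level,
    message,
    ...ctx, // optional fields that were not provided are simply absent
  });
}
```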
Tracing
OpenTelemetry distributed tracing for:
- HTTP requests (automatic)
- Database queries (automatic)
- Business operations (manual instrumentation)
Traces can be exported to Jaeger, Zipkin, or other OTLP-compatible backends.
See: Tracing Standards
Metrics
Automatic metrics collection:
- HTTP request rate, duration, errors
- Active sessions count
- Integration sync success/failure rate
- Search query performance
See: Metrics Standards
Testing Strategy
Unit Tests
- Test individual functions and classes
- Mock external dependencies
- Fast execution (< 1 second per test)
Integration Tests
- Test module interactions
- Use test database (separate from dev)
- Test API endpoints end-to-end
E2E Tests
- Test full user workflows
- Browser automation (Playwright)
- Test OAuth flows, search, etc.
Run tests:
# Unit tests
npm test
# E2E tests
npm run test:e2e
# Coverage
npm run test:cov
Performance Considerations
Database Performance
- Indexes: Minimal indexes for low-memory environments
- JSONB: Content stored as JSONB for flexibility
- Full-text search: PostgreSQL GIN indexes on content fields
- Connection pooling: TypeORM manages connection pool
Search Performance
V1 (PostgreSQL):
- Full-text search on JSONB content field
- Good for < 100k documents
- No additional infrastructure needed
V2 (Elasticsearch):
- Dedicated search engine
- Good for > 100k documents
- Requires additional container
Background Tasks
- Priority queue: New data fetched before backfill
- Throttling: Rate limiting to avoid API quota exhaustion
- Batching: Group operations where possible
- Idempotency: Skip already-downloaded content
Scalability
Current Scale (V1)
Designed for:
- Single user or small family (< 10 users)
- Raspberry Pi 4 hardware (4GB+ RAM)
- < 100k documents per user
- < 10 concurrent searches
Future Scale (V2+)
Can scale to:
- Small teams (< 100 users)
- Cloud VPS or bare metal server
- Millions of documents
- Horizontal scaling with Redis + multiple backend instances
Scaling Strategy:
- Add Elasticsearch for better search performance
- Add RabbitMQ for distributed task queue
- Add multiple backend instances behind load balancer
- Keep PostgreSQL as single source of truth
Common Patterns
NestJS Patterns
Controllers: Thin orchestrators
- Validate input (DTOs)
- Call service methods
- Handle errors (convert to HTTP exceptions)
- Return DTOs
Services: Business logic
- No HTTP knowledge (throw domain errors, not HTTP exceptions)
- Return domain objects (not entities)
- Validate business rules
Entities: Database schema
- Map to PostgreSQL tables
- Use TypeORM decorators
- Include soft delete support
See: Controller Standards, Service Standards, Entity Standards
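The controller/service split can be shown without any framework code. In the sketch below, plain functions and classes stand in for NestJS controllers and providers, and all names (`UserNotFoundError`, `UserService`, `getProfileHandler`) are hypothetical; in the real code the controller would throw NestJS HTTP exceptions rather than return a status object.

```typescript
// Framework-free sketch of the layering described above.
class UserNotFoundError extends Error {
  constructor(userId: string) {
    super(`User ${userId} not found`);
    this.name = "UserNotFoundError";
  }
}

interface UserProfileDto {
  id: string;
  name: string;
}

// Service: business logic only, no HTTP knowledge.
class UserService {
  private users = new Map<string, string>([["u1", "Ada"]]);

  getProfile(userId: string): UserProfileDto {
    const name = this.users.get(userId);
    if (name === undefined) throw new UserNotFoundError(userId);
    return { id: userId, name };
  }
}

// Controller: calls the service and translates domain errors to HTTP.
function getProfileHandler(
  service: UserService,
  userId: string,
): { status: number; body: unknown } {
  try {
    return { status: 200, body: service.getProfile(userId) };
  } catch (err) {
    if (err instanceof UserNotFoundError) {
      return { status: 404, body: { message: err.message } };
    }
    throw err; // unknown errors bubble up to a global handler
  }
}
```

Keeping the status-code mapping in the controller means the service can be reused from background tasks that have no HTTP request at all.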
Integration Patterns
Task Types:
- start - Entry point (cron triggers this)
- list - Fetch IDs/metadata (e.g., list emails)
- download - Fetch full content (e.g., download email)
Task Spawning:
- List tasks spawn download tasks (one per item)
- Pagination handled by spawning new list tasks
- Priority degrades with each generation
Idempotency:
- Check if content exists before downloading
- Use external ID as deduplication key
- Skip already-downloaded items
See: Writing Integrations
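The idempotency rule boils down to filtering a listing against what is already stored. In this sketch a Set stands in for a lookup against user_integration_content by external ID; `planDownloads` and `ListedItem` are hypothetical names.

```typescript
// Hypothetical sketch: decide which listed items still need a download task.
interface ListedItem {
  externalId: string;
}

function planDownloads(listed: ListedItem[], alreadyStored: Set<string>): string[] {
  const toDownload: string[] = [];
  const seen = new Set<string>(alreadyStored);
  for (const item of listed) {
    if (seen.has(item.externalId)) continue; // already synced, or repeated in this listing
    seen.add(item.externalId);
    toDownload.push(item.externalId);
  }
  return toDownload;
}
```

Running the same list task twice therefore spawns download tasks only for items that are still missing.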
Where to Go Next
Now that you understand the high-level architecture, dive deeper into specific areas:
Backend Development:
Integration Development:
Authentication & Security:
Observability:
Coding Standards:
Project Planning:
Getting Help
Documentation: You're reading it! Check the sidebar for specific topics.
Code Examples: Look at existing modules (e.g., src/integration/gmail/) for working examples.
Questions: Ask the team in Slack/Discord/your communication channel.
Welcome to Yew Search! 🎉