Architecture

Welcome to Yew Search! This document provides a high-level overview of how the system works. If you're new to the project, start here before diving into specific component documentation.

There is a lot in this document, so feel free to skip around.

Yew Search is a privacy-focused personal search engine that aggregates data from your external services (Gmail, Slack, Google Drive, etc.) and makes it all searchable in one place. Think of it as "Google search for your personal data" - but you own the data and control who can access it.

Key Features:

  • Search across all your connected services simultaneously
  • Privacy-first: self-hosted, you control your data
  • Plugin architecture: easily add new data sources
  • Real-time synchronization with external services
  • Session-based authentication for immediate security control

Core Architectural Principles

Yew Search is built on several key principles that guide all technical decisions:

1. Privacy and Security First

Users trust us with their most sensitive data. Every architectural decision prioritizes security:

  • Cookie-based sessions (not JWT) for immediate logout capability
  • OAuth tokens encrypted at rest
  • No third-party services - everything is self-hosted
  • Remote logout: instantly invalidate sessions when needed

2. Plugin Architecture

Integrations (Gmail, Slack, etc.) are self-contained plugins:

  • No integration knows about the core application
  • Each integration is independent and can be developed separately
  • Adding a new integration doesn't require changing core code
  • Integrations communicate through well-defined interfaces

3. Monolithic Architecture

We're a monolith, not microservices:

  • Single NestJS backend application
  • Single PostgreSQL database
  • Simpler to develop, deploy, and reason about
  • Performance overhead is negligible for our use case
  • Session-based auth works perfectly for monoliths

4. Consistency Through Constraints

Small teams need systems that enforce order automatically:

  • PostgreSQL enforces data consistency (foreign keys, constraints, schemas)
  • TypeScript provides compile-time type safety
  • Strict coding standards reduce cognitive load
  • Everything has one clear way to do it

5. Self-Hosted First

Must run on minimal hardware (Raspberry Pi 4+):

  • No mandatory cloud dependencies
  • Docker containers for easy deployment
  • Resource-efficient design
  • Works offline (except for syncing new data)

System Architecture

High-Level Components

┌─────────────────────────────────────────────────────────────────┐
│                         User's Browser                          │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │               Next.js Frontend (Port 3001)                │  │
│  │  • Search Interface        • OAuth Authorization UI       │  │
│  │  • Login/Register          • Settings & Profile           │  │
│  └───────────────────────────────────────────────────────────┘  │
└────────────────────┬──────────────────────────────────────────┬─┘
                     │                                          │
                     │ HTTP/REST API                            │ Cookie
                     │                                          │ (yew_session)
                     │                                          │
┌────────────────────▼──────────────────────────────────────────▼─┐
│                   NestJS Backend (Port 3000)                    │
│                                                                 │
│  ┌─────────────┐  ┌──────────────┐  ┌────────────────────────┐  │
│  │ Auth        │  │ User API     │  │ Search API             │  │
│  │ • Login     │  │ • Profile    │  │ • Query Parser         │  │
│  │ • Logout    │  │ • Sessions   │  │ • Full-Text Search     │  │
│  │ • Register  │  └──────────────┘  │ • Result Ranking       │  │
│  └─────────────┘                    └────────────────────────┘  │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                    Integration System                     │  │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐           │  │
│  │  │ Gmail      │  │ Slack      │  │ FTP        │  ...      │  │
│  │  │ Integration│  │ Integration│  │ Integration│           │  │
│  │  └────────────┘  └────────────┘  └────────────┘           │  │
│  │                                                           │  │
│  │  ┌──────────────────────────────────────────────┐         │  │
│  │  │         Polling System (Background)          │         │  │
│  │  │ • Task Queue (Bull/RabbitMQ)                 │         │  │
│  │  │ • Priority-based scheduling                  │         │  │
│  │  │ • Idempotency checks                         │         │  │
│  │  └──────────────────────────────────────────────┘         │  │
│  └───────────────────────────────────────────────────────────┘  │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │                       Observability                       │  │
│  │  • Logging (structured JSON)                              │  │
│  │  • Tracing (OpenTelemetry)                                │  │
│  │  • Metrics (request duration, errors, etc.)               │  │
│  └───────────────────────────────────────────────────────────┘  │
└───────────────┬───────────────────────────────────────┬─────────┘
                │                                       │
                │                                       │
                ▼                                       ▼
┌───────────────────────────┐        ┌────────────────────────────┐
│    PostgreSQL Database    │        │   Redis (Session Store)    │
│                           │        │                            │
│ • user                    │        │ • Session IDs → User IDs   │
│ • user_session            │        │ • TTL-based expiration     │
│ • user_integration        │        │ • Fast lookups             │
│ • user_integration_       │        └────────────────────────────┘
│   content                 │
│ • (OAuth encrypted)       │
└───────────────────────────┘

Data Flow

1. User Authentication Flow

User → Frontend → Backend → Redis
                     ↓
          PostgreSQL (user table)
                     ↓
          Set Cookie → Frontend → User
  1. User submits credentials
  2. Backend validates against PostgreSQL
  3. Backend creates session in Redis
  4. Backend sets secure cookie
  5. All subsequent requests include cookie
  6. Backend validates session on each request
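
The session steps above can be sketched with an in-memory `Map` standing in for Redis. The names `SessionStore`, `create`, `validate`, and `revoke`, as well as the 24-hour TTL, are assumptions made for illustration and are not taken from the codebase.

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical session record; in the real system this would live in Redis.
interface Session {
  userId: string;
  expiresAt: number; // epoch milliseconds
}

// In-memory stand-in for the Redis session store (session ID -> session).
class SessionStore {
  private readonly sessions = new Map<string, Session>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Step 3: create a session once the credentials have been validated.
  create(userId: string, now: number = Date.now()): string {
    const sessionId = randomBytes(32).toString("hex");
    this.sessions.set(sessionId, { userId, expiresAt: now + this.ttlMs });
    return sessionId;
  }

  // Step 6: resolve the cookie value to a user ID, or null if unknown or expired.
  validate(sessionId: string, now: number = Date.now()): string | null {
    const session = this.sessions.get(sessionId);
    if (!session) return null;
    if (session.expiresAt <= now) {
      this.sessions.delete(sessionId);
      return null;
    }
    return session.userId;
  }

  // Remote logout: deleting the record invalidates the cookie immediately.
  revoke(sessionId: string): void {
    this.sessions.delete(sessionId);
  }
}

const store = new SessionStore(24 * 60 * 60 * 1000); // assumed 24-hour TTL
const sessionId = store.create("user-1");
console.log(store.validate(sessionId)); // user-1
store.revoke(sessionId);
console.log(store.validate(sessionId)); // null
```

Because every request goes through `validate`, removing the record is all it takes to log a session out remotely.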

2. Integration Authorization Flow (OAuth)

User → "Connect Gmail" → Backend
    ↓
Generate OAuth URL
    ↓
Redirect to Google
    ↓
User grants access
    ↓
Google redirects back
    ↓
Exchange code for tokens
    ↓
Encrypt & store tokens
    ↓
Store in user_integration table
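
As a rough illustration of the "Generate OAuth URL" step, the sketch below builds a Google authorization URL with a random `state` value. The `OAuthConfig` shape, the `buildAuthorizationUrl` helper, the callback path, and the chosen scope are assumptions; only the Google endpoint and the standard OAuth 2.0 query parameters are real.

```typescript
import { randomBytes } from "node:crypto";

// Google's OAuth 2.0 authorization endpoint.
const GOOGLE_AUTH_ENDPOINT = "https://accounts.google.com/o/oauth2/v2/auth";

// Hypothetical configuration shape for one integration.
interface OAuthConfig {
  clientId: string; // e.g. the value of GMAIL_OAUTH_CLIENT_ID
  redirectUri: string; // assumed callback URL on the backend
  scopes: string[];
}

// Builds the URL the user is redirected to. The `state` value should be kept
// server-side and compared when Google redirects back, to reject forged callbacks.
function buildAuthorizationUrl(config: OAuthConfig, state: string): string {
  const url = new URL(GOOGLE_AUTH_ENDPOINT);
  url.searchParams.set("client_id", config.clientId);
  url.searchParams.set("redirect_uri", config.redirectUri);
  url.searchParams.set("response_type", "code");
  url.searchParams.set("scope", config.scopes.join(" "));
  url.searchParams.set("access_type", "offline"); // ask for a refresh token
  url.searchParams.set("state", state);
  return url.toString();
}

const state = randomBytes(16).toString("hex");
const authUrl = buildAuthorizationUrl(
  {
    clientId: "example-client-id",
    redirectUri: "http://localhost:3000/oauth/callback", // hypothetical path
    scopes: ["https://www.googleapis.com/auth/gmail.readonly"],
  },
  state,
);
console.log(authUrl);
```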

3. Data Synchronization Flow

Cron (every 1 min) → Create "start" task
                          ↓
           Polling System (Bull/RabbitMQ)
                          ↓
               Load Integration Plugin
                          ↓
            Run Task (e.g., fetch emails)
                          │
          ┌───────────────┴────────────────┐
          ▼                                ▼
New Tasks (pagination)            Output (email data)
          │                                │
          └────→ Queue            Save to PostgreSQL
                              (user_integration_content)

Tasks spawn more tasks recursively until all data is synced.
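
A minimal sketch of recursive task spawning, assuming hypothetical `Task` and `TaskResult` shapes and a fake two-page mailbox in place of a real API. The real plugin interface and queue (Bull) will differ; the point is only that each task returns output plus follow-up tasks until the queue drains.

```typescript
// Hypothetical task and result shapes; the real plugin interface may differ.
interface Task {
  type: "start" | "list" | "download";
  payload: { pageToken?: string; itemId?: string };
}

interface TaskResult {
  newTasks: Task[]; // follow-up work, pushed back onto the queue
  output: string[]; // content to persist (user_integration_content)
}

// Fake two-page mailbox standing in for an external API.
const PAGES: Record<string, { ids: string[]; next?: string }> = {
  first: { ids: ["m1", "m2"], next: "second" },
  second: { ids: ["m3"] },
};

function runTask(task: Task): TaskResult {
  switch (task.type) {
    case "start":
      return { newTasks: [{ type: "list", payload: { pageToken: "first" } }], output: [] };
    case "list": {
      const page = PAGES[task.payload.pageToken ?? "first"];
      // One download task per listed item, plus a list task for the next page.
      const downloads: Task[] = page.ids.map((id) => ({ type: "download", payload: { itemId: id } }));
      const nextPage: Task[] = page.next ? [{ type: "list", payload: { pageToken: page.next } }] : [];
      return { newTasks: [...downloads, ...nextPage], output: [] };
    }
    case "download":
      return { newTasks: [], output: [`content of ${task.payload.itemId}`] };
  }
}

// Drains the queue: each task may enqueue more tasks until nothing is left.
function sync(): string[] {
  const queue: Task[] = [{ type: "start", payload: {} }];
  const saved: string[] = [];
  while (queue.length > 0) {
    const task = queue.shift()!;
    const result = runTask(task);
    queue.push(...result.newTasks);
    saved.push(...result.output);
  }
  return saved;
}

console.log(sync()); // content of m1, m2 and m3, in that order
```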

4. Search Flow

User → Search Query → Backend
    ↓
Parse & sanitize query
    ↓
PostgreSQL full-text search
(on user_integration_content.content)
    ↓
Rank & filter results
    ↓
Return to Frontend → User
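
The search flow could look roughly like the query builder below. It is a sketch under stated assumptions: the `user_id` column, the `'english'` text search configuration, the 200-character cap, and the helper names are all hypothetical. The SQL uses standard PostgreSQL full-text functions (`to_tsvector`, `plainto_tsquery`, `ts_rank`) and binds user input as parameters rather than interpolating it.

```typescript
// Parameterized query plus bound values, the shape most PostgreSQL clients accept.
interface SqlQuery {
  text: string;
  values: (string | number)[];
}

const MAX_QUERY_LENGTH = 200; // assumed limit

// Collapses whitespace, trims, and caps the length of the raw user input.
function sanitizeQuery(raw: string): string {
  return raw.replace(/\s+/g, " ").trim().slice(0, MAX_QUERY_LENGTH);
}

// Builds a full-text search scoped to one user, ranked by relevance.
function buildSearchQuery(userId: string, rawQuery: string, limit: number = 20): SqlQuery {
  const query = sanitizeQuery(rawQuery);
  if (query.length === 0) {
    throw new Error("Search query must not be empty");
  }
  const text = `
    SELECT id, ts_rank(to_tsvector('english', content), q) AS rank
    FROM user_integration_content, plainto_tsquery('english', $2) AS q
    WHERE user_id = $1
      AND to_tsvector('english', content) @@ q
    ORDER BY rank DESC
    LIMIT $3`;
  return { text, values: [userId, query, limit] };
}

const searchQuery = buildSearchQuery("user-1", "  quarterly   report ");
console.log(searchQuery.values); // [ 'user-1', 'quarterly report', 20 ]
```

Scoping by `user_id` in the `WHERE` clause is what keeps results limited to the authenticated user.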

Major Components

Backend (NestJS)

Location: backend/

The backend is a NestJS monolith that handles all API requests, authentication, integration management, and background data synchronization.

Key Modules:

Session Module (src/v1/user-session/)

  • Session management (create, validate, delete)
  • Cookie-based sessions

User Module (src/v1/user/)

  • User registration
  • Password hashing with Argon2
  • User profile management
  • User settings
  • Active session management

Integration Module (src/integration/)

  • Dynamic integration loading
  • OAuth endpoint (unified for all integrations)
  • Integration instance management
  • See: Writing Integrations

Polling Module (src/integration/polling/)

  • Background task queue (Bull/RabbitMQ)
  • Priority-based task scheduling
  • Integration task execution
  • Idempotency checking
  • See: Integration Idempotency

Search Module (src/search/)

  • Query parsing and validation
  • Full-text search using PostgreSQL
  • Result ranking and filtering
  • (Future: Elasticsearch integration)

Frontend (Next.js)

Location: frontend/

Server-side rendered React application for the user interface.

Key Pages:

  • /login - Authentication
  • /register - New user signup
  • /search - Main search interface
  • /integrations - Manage connected services (OAuth)
  • /settings - User preferences and active sessions

Authentication:

  • Cookie-based (automatically sent with every request)
  • Redirects to /login on 401 responses
  • No client-side token management needed

Database (PostgreSQL)

Core Tables:

user

  • User accounts
  • Email, name, hashed password
  • Created/updated timestamps

user_session

  • Used to look up sessions by user, location, and similar attributes
  • Think of this as a log of sessions; authentication itself is checked against Redis
  • Session ID, user ID
  • Expiration, last accessed timestamp
  • IP address, user agent

user_integration

  • User's connected services (Gmail, Slack, etc.)
  • OAuth credentials (encrypted)
  • Integration domain, status
  • Sync state (last history ID, etc.)

user_integration_content

  • Actual synced data (emails, messages, files)
  • JSONB content field (integration-specific structure)
  • External ID (idempotency key)
  • External timestamp (when created in source system)

See: Entity Standards

Session Store (Redis)

Sessions are stored in both Redis and PostgreSQL, but only Redis is used for authentication.

Redis is fast at looking up a session by session ID, but it is not well suited to filtering sessions by user ID.

For that we use PostgreSQL, which lets us show users all of their sessions so they can manage them from one place.

See: Authorization

Task Queue (Bull/RabbitMQ)

  • Current: Bull (Redis-backed)
  • Future: RabbitMQ (when scaling is needed)

Background task processing for integration data synchronization:

  • Priority queue (new data fetched before backfill)
  • Retry logic for transient failures
  • Dead-letter queue for permanent failures
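
To make the three behaviours above concrete, here is a small in-memory sketch of priority ordering, retries, and a dead-letter list. It does not use Bull's API. The `Job` shape, `processAll`, and the retry budget of 3 are assumptions for illustration.

```typescript
interface Job {
  id: string;
  priority: number; // lower number runs first (new data before backfill)
  attempts: number;
}

const MAX_ATTEMPTS = 3; // assumed retry budget

// Processes jobs in priority order. A failed job is retried until the budget
// is spent, then moved to the dead-letter list.
function processAll(
  jobs: Job[],
  handler: (job: Job) => void,
): { done: string[]; deadLetter: string[] } {
  const queue = jobs.map((job) => ({ ...job }));
  const done: string[] = [];
  const deadLetter: string[] = [];
  while (queue.length > 0) {
    queue.sort((a, b) => a.priority - b.priority);
    const job = queue.shift()!;
    try {
      handler(job);
      done.push(job.id);
    } catch {
      job.attempts += 1;
      if (job.attempts >= MAX_ATTEMPTS) {
        deadLetter.push(job.id);
      } else {
        queue.push(job); // transient failure: try again later
      }
    }
  }
  return { done, deadLetter };
}

let flakyFailures = 1;
const result = processAll(
  [
    { id: "backfill-2019", priority: 10, attempts: 0 },
    { id: "new-mail", priority: 1, attempts: 0 },
    { id: "flaky", priority: 5, attempts: 0 },
    { id: "broken", priority: 5, attempts: 0 },
  ],
  (job) => {
    if (job.id === "broken") throw new Error("permanent failure");
    if (job.id === "flaky" && flakyFailures-- > 0) throw new Error("transient failure");
  },
);
console.log(result.done); // [ 'new-mail', 'flaky', 'backfill-2019' ]
console.log(result.deadLetter); // [ 'broken' ]
```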

Integrations (Plugins)

Location: backend/src/integration/*/

Each integration is a self-contained plugin:

Structure:

backend/src/integration/gmail/
├── src/
│   └── main.ts        # Integration implementation
├── manifest.json      # Metadata (domain, name, version)
└── package.json       # Dependencies

Responsibilities:

  • Authenticate with external service (OAuth or credentials)
  • Define task types (start, list, download, etc.)
  • Execute tasks and return output + new tasks
  • Handle pagination, error recovery, rate limiting

Key Concept: Integrations know nothing about the core application. They receive credentials and return data. The core application handles storage, queueing, and orchestration.

See: Writing Integrations

Key Design Decisions

Why PostgreSQL Over MongoDB?

Decision: Use PostgreSQL for all data storage.

Rationale:

  • Enforces consistency (foreign keys, constraints, schemas)
  • Prevents "bad data practices" common in NoSQL
  • Small teams need databases that enforce order automatically
  • TypeORM provides excellent PostgreSQL support

See: Why Postgres

Why Cookie-Based Sessions Over JWT?

Decision: Use cookie-based sessions, never JWT.

Rationale:

  • Remote logout is critical - Can terminate sessions instantly
  • A stateless JWT cannot be revoked before it expires unless a server-side denylist is added (with a 24-hour token lifetime, that is up to a 24-hour vulnerability window)
  • Privacy focus requires immediate access control
  • Central session store provides audit trail

See: Authorization

Why Monolith Over Microservices?

Decision: Single NestJS application, not microservices.

Rationale:

  • Simpler to develop and deploy
  • Small team doesn't need microservice complexity
  • Session-based auth works perfectly for monoliths
  • Monoliths are easy to deploy on local systems
  • Can always split later if needed (YAGNI)

Why Task-Based Integration Architecture?

Decision: Integrations use recursive task spawning, not batch jobs.

Rationale:

  • Handles pagination naturally (each page spawns next page task)
  • Priority queue ensures new data fetched quickly during backfill
  • Granular error handling (one failed email doesn't block others)
  • Idempotency prevents duplicate data

See: Writing Integrations

Why Self-Hosted?

Decision: Must run on user's own hardware (Raspberry Pi 4+).

Rationale:

  • Privacy: user controls their data
  • No SaaS vendor lock-in
  • No recurring cloud costs
  • Compliance with data sovereignty requirements

Future Plans:

Yew Search will be offered in three ways, similar to Mattermost, n8n, or GitLab.

  1. Local homelab / Raspberry Pi deployment (for individuals)
    • We like homelabs, and we want this community to like us
    • Homelabs give users a way to try the software before buying, with minimal work from us
    • Not all features will be available in the self-hosted option, and some cannot be
  2. Business (for companies)
    • Most companies don't want to self-host, and this makes sense
    • Self-hosting for a company is usually more expensive than buying a subscription
    • This also allows more advanced features, such as SOC 2 compliance and other features companies rely on
    • This is managed on our servers, and all companies will share one pool
    • This is similar to how most hosted solutions work
  3. Enterprise (for large companies)
    • When a company outgrows the Business plan, they will move to the Enterprise plan
    • This comes with dedicated hardware and more support from our team

Data Privacy & Security

Authentication Security

  • Argon2 password hashing (not bcrypt)
  • HttpOnly cookies (JavaScript cannot access)
  • Secure flag (HTTPS only in production)
  • SameSite strict (CSRF protection)
  • Rate limiting on login endpoint
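
The cookie attributes listed above map onto a `Set-Cookie` header along these lines. The `buildSessionCookie` helper and `CookieOptions` shape are hypothetical, and a real NestJS application would normally set these through its response or cookie middleware rather than by hand.

```typescript
interface CookieOptions {
  maxAgeSeconds: number;
  secure: boolean; // true in production (HTTPS only)
}

// Serializes the session cookie with the attributes listed above.
function buildSessionCookie(sessionId: string, options: CookieOptions): string {
  const parts = [
    `yew_session=${encodeURIComponent(sessionId)}`,
    "Path=/",
    `Max-Age=${options.maxAgeSeconds}`,
    "HttpOnly", // not readable from JavaScript
    "SameSite=Strict", // not sent on cross-site requests
  ];
  if (options.secure) parts.push("Secure");
  return parts.join("; ");
}

console.log(buildSessionCookie("abc123", { maxAgeSeconds: 86400, secure: true }));
// yew_session=abc123; Path=/; Max-Age=86400; HttpOnly; SameSite=Strict; Secure
```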

Integration Security

  • OAuth tokens encrypted at rest in database
  • Tokens only decrypted when making API calls
  • Integration plugins cannot access other users' data
  • Each user's integrations isolated by user ID
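
One common way to encrypt tokens at rest is AES-256-GCM from Node's built-in `crypto` module, sketched below. This is an assumption about the approach, not a description of the project's actual scheme; the `encryptToken` and `decryptToken` helpers and the `IV | tag | ciphertext` layout are hypothetical.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Encrypts with AES-256-GCM.
// Output layout (base64): 12-byte IV | 16-byte auth tag | ciphertext.
function encryptToken(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // fresh IV for every encryption
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  return Buffer.concat([iv, tag, ciphertext]).toString("base64");
}

// Decrypts and authenticates; throws if the data or key is wrong.
function decryptToken(encoded: string, key: Buffer): string {
  const data = Buffer.from(encoded, "base64");
  const iv = data.subarray(0, 12);
  const tag = data.subarray(12, 28);
  const ciphertext = data.subarray(28);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}

// A 32-byte key; in practice it would come from configuration, not be generated per run.
const key = randomBytes(32);
const stored = encryptToken("example-refresh-token", key);
console.log(decryptToken(stored, key)); // example-refresh-token
```

Decrypting only at the moment an API call is made keeps the plaintext token out of the database and out of long-lived memory.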

Search Security

  • Search queries scoped to authenticated user
  • No cross-user data leakage
  • All queries logged for audit trail
  • Content never exposed in URLs (POST, not GET)

Development Workflow

Local Development Setup

TBD

Environment Variables:

  • DATABASE_URL - PostgreSQL connection string
  • REDIS_URL - Redis connection string
  • SESSION_SECRET - Random secret for session encryption
  • GMAIL_OAUTH_CLIENT_ID - OAuth credentials for Gmail
  • GMAIL_OAUTH_CLIENT_SECRET
  • (See .env.example for full list)

Raspberry Pi Deployment

Yew Search is designed to run on Raspberry Pi 4 (4GB+ RAM):

  1. Flash Raspberry Pi OS (64-bit)
  2. Install Docker & Docker Compose
  3. Clone repository
  4. Run docker-compose up -d
  5. Access via http://raspberrypi.local:3001

Performance Notes:

  • PostgreSQL uses minimal indexes for low-memory usage
  • Background tasks throttled to avoid CPU spikes
  • Redis configured for low memory footprint

Observability

Logging

All logs are structured JSON with consistent fields:

  • timestamp - ISO 8601
  • level - debug, info, warn, error
  • message - Human-readable description
  • module - Which module emitted the log
  • method - Which method emitted the log
  • requestId - Request correlation ID
  • userId - User ID (if authenticated)
  • traceId - Distributed tracing ID
  • spanId - Current span ID
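
A log line with these fields might be produced by a helper like the one below. `formatLog` and `LogContext` are illustrative names only; the project's logger may be structured differently.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

// Context fields from the list above; optional ones are omitted when unset.
interface LogContext {
  module: string;
  method: string;
  requestId?: string;
  userId?: string;
  traceId?: string;
  spanId?: string;
}

// Serializes one log entry as a single JSON line.
function formatLog(
  level: LogLevel,
  message: string,
  context: LogContext,
  now: Date = new Date(),
): string {
  return JSON.stringify({ timestamp: now.toISOString(), level, message, ...context });
}

console.log(
  formatLog("info", "Search executed", {
    module: "SearchService",
    method: "search",
    requestId: "req-1",
    userId: "user-1",
  }),
);
```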

See: Logging Standards

Tracing

OpenTelemetry distributed tracing for:

  • HTTP requests (automatic)
  • Database queries (automatic)
  • Business operations (manual instrumentation)

Traces can be exported to Jaeger, Zipkin, or other OTLP-compatible backends.

See: Tracing Standards

Metrics

Automatic metrics collection:

  • HTTP request rate, duration, errors
  • Active sessions count
  • Integration sync success/failure rate
  • Search query performance

See: Metrics Standards

Testing Strategy

Unit Tests

  • Test individual functions and classes
  • Mock external dependencies
  • Fast execution (< 1 second per test)

Integration Tests

  • Test module interactions
  • Use test database (separate from dev)
  • Test API endpoints end-to-end

E2E Tests

  • Test full user workflows
  • Browser automation (Playwright)
  • Test OAuth flows, search, etc.

Run tests:

# Unit tests
npm test

# E2E tests
npm run test:e2e

# Coverage
npm run test:cov

Performance Considerations

Database Performance

  • Indexes: Minimal indexes for low-memory environments
  • JSONB: Content stored as JSONB for flexibility
  • Full-text search: PostgreSQL GIN indexes on content fields
  • Connection pooling: TypeORM manages connection pool

Search Performance

V1 (PostgreSQL):

  • Full-text search on JSONB content field
  • Good for < 100k documents
  • No additional infrastructure needed

V2 (Elasticsearch):

  • Dedicated search engine
  • Good for > 100k documents
  • Requires additional container

Background Tasks

  • Priority queue: New data fetched before backfill
  • Throttling: Rate limiting to avoid API quota exhaustion
  • Batching: Group operations where possible
  • Idempotency: Skip already-downloaded content

Scalability

Current Scale (V1)

Designed for:

  • Single user or small family (< 10 users)
  • Raspberry Pi 4 hardware (4GB+ RAM)
  • < 100k documents per user
  • < 10 concurrent searches

Future Scale (V2+)

Can scale to:

  • Small teams (< 100 users)
  • Cloud VPS or bare metal server
  • Millions of documents
  • Horizontal scaling with Redis + multiple backend instances

Scaling Strategy:

  • Add Elasticsearch for better search performance
  • Add RabbitMQ for distributed task queue
  • Add multiple backend instances behind load balancer
  • Keep PostgreSQL as single source of truth

Common Patterns

NestJS Patterns

Controllers: Thin orchestrators

  • Validate input (DTOs)
  • Call service methods
  • Handle errors (convert to HTTP exceptions)
  • Return DTOs

Services: Business logic

  • No HTTP knowledge (throw domain errors, not HTTP exceptions)
  • Return domain objects (not entities)
  • Validate business rules

Entities: Database schema

  • Map to PostgreSQL tables
  • Use TypeORM decorators
  • Include soft delete support

See: Controller Standards, Service Standards, Entity Standards
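
The controller/service split can be illustrated without any framework code. In the sketch below, plain classes stand in for NestJS controllers and services, and `UserNotFoundError`, `UserService`, `UserController`, and `HttpResponse` are hypothetical names. In a real NestJS controller the domain error would typically be converted into one of the framework's HTTP exceptions.

```typescript
// Domain error: carries no HTTP knowledge.
class UserNotFoundError extends Error {
  constructor(userId: string) {
    super(`User ${userId} not found`);
    this.name = "UserNotFoundError";
    // Keeps instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, UserNotFoundError.prototype);
  }
}

interface UserProfile {
  id: string;
  name: string;
}

interface HttpResponse {
  status: number;
  body: unknown;
}

// Service: business logic only; throws domain errors.
class UserService {
  private readonly users = new Map<string, UserProfile>([["u1", { id: "u1", name: "Ada" }]]);

  getProfile(userId: string): UserProfile {
    const user = this.users.get(userId);
    if (!user) throw new UserNotFoundError(userId);
    return user;
  }
}

// Controller: thin orchestrator that translates domain errors into HTTP responses.
class UserController {
  private readonly service: UserService;

  constructor(service: UserService) {
    this.service = service;
  }

  getProfile(userId: string): HttpResponse {
    try {
      return { status: 200, body: this.service.getProfile(userId) };
    } catch (error) {
      if (error instanceof UserNotFoundError) {
        return { status: 404, body: { message: error.message } };
      }
      throw error; // unknown errors bubble up to a global handler
    }
  }
}

const controller = new UserController(new UserService());
console.log(controller.getProfile("u1").status); // 200
console.log(controller.getProfile("missing").status); // 404
```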

Integration Patterns

Task Types:

  • start - Entry point (cron triggers this)
  • list - Fetch IDs/metadata (e.g., list emails)
  • download - Fetch full content (e.g., download email)

Task Spawning:

  • List tasks spawn download tasks (one per item)
  • Pagination handled by spawning new list tasks
  • Priority degrades with each generation

Idempotency:

  • Check if content exists before downloading
  • Use external ID as deduplication key
  • Skip already-downloaded items
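
The idempotency check can be sketched with an in-memory store keyed on integration and external ID. `ContentStore`, `planDownloads`, and the field names are assumptions; in PostgreSQL the same guarantee would more likely come from a unique constraint on the equivalent columns.

```typescript
interface ContentRow {
  userIntegrationId: string;
  externalId: string; // ID in the source system, e.g. a Gmail message ID
  content: unknown;
}

// In-memory stand-in for user_integration_content, unique per
// (userIntegrationId, externalId).
class ContentStore {
  private readonly rows = new Map<string, ContentRow>();

  private static key(userIntegrationId: string, externalId: string): string {
    return JSON.stringify([userIntegrationId, externalId]);
  }

  has(userIntegrationId: string, externalId: string): boolean {
    return this.rows.has(ContentStore.key(userIntegrationId, externalId));
  }

  // Returns true if the row was inserted, false if it was a duplicate.
  insertIfAbsent(row: ContentRow): boolean {
    const key = ContentStore.key(row.userIntegrationId, row.externalId);
    if (this.rows.has(key)) return false;
    this.rows.set(key, row);
    return true;
  }
}

// Filters a list result down to the items that still need a download task.
function planDownloads(store: ContentStore, userIntegrationId: string, listedIds: string[]): string[] {
  return listedIds.filter((id) => !store.has(userIntegrationId, id));
}

const contentStore = new ContentStore();
contentStore.insertIfAbsent({ userIntegrationId: "gmail-1", externalId: "m1", content: { subject: "Hi" } });
console.log(planDownloads(contentStore, "gmail-1", ["m1", "m2", "m3"])); // [ 'm2', 'm3' ]
```

Checking in `planDownloads` avoids spawning unnecessary download tasks, while `insertIfAbsent` still guards against duplicates if the same task runs twice.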

See: Writing Integrations

Where to Go Next

Now that you understand the high-level architecture, dive deeper into specific areas:

Backend Development:

Integration Development:

Authentication & Security:

Observability:

Coding Standards:

Project Planning:

Getting Help

Documentation: You're reading it! Check the sidebar for specific topics.

Code Examples: Look at existing modules (e.g., src/integration/gmail/) for working examples.

Questions: Ask the team in Slack/Discord/your communication channel.

Welcome to Yew Search! 🎉