Architecture

Welcome to Yew Search! This document provides a high-level overview of how the system works. If you're new to the project, start here before diving into specific component documentation.

There is a lot in this document, so feel free to skip around.

What is Yew Search?

Yew Search is a privacy-focused personal search engine that aggregates data from your external services (Gmail, Slack, Google Drive, etc.) and makes it all searchable in one place. Think of it as "Google search for your personal data" - but you own the data and control who can access it.

Key Features:

Search across all your connected services simultaneously
Privacy-first: self-hosted, you control your data
Plugin architecture: easily add new data sources
Real-time synchronization with external services
Session-based authentication for immediate security control

Core Architectural Principles

Yew Search is built on several key principles that guide all technical decisions:

1. Privacy and Security First

Users trust us with their most sensitive data. Every architectural decision prioritizes security:

Cookie-based sessions (not JWT) for immediate logout capability
OAuth tokens encrypted at rest
No third-party services - everything is self-hosted
Remote logout: instantly invalidate sessions when needed

2. Plugin Architecture

Integrations (Gmail, Slack, etc.) are self-contained plugins:

No integration knows about the core application
Each integration is independent and can be developed separately
Adding a new integration doesn't require changing core code
Integrations communicate through well-defined interfaces

3. Monolithic Architecture

We're a monolith, not microservices:

Single NestJS backend application
Single PostgreSQL database
Simpler to develop, deploy, and reason about
Performance overhead is negligible for our use case
Session-based auth works perfectly for monoliths

4. Consistency Through Constraints

Small teams need systems that enforce order automatically:

PostgreSQL enforces data consistency (foreign keys, constraints, schemas)
TypeScript provides compile-time type safety
Strict coding standards reduce cognitive load
Everything has one clear way to do it

5. Self-Hosted First

Must run on minimal hardware (Raspberry Pi 4+):

No mandatory cloud dependencies
Docker containers for easy deployment
Resource-efficient design
Works offline (except for syncing new data)

System Architecture

High-Level Components

┌─────────────────────────────────────────────────────────────────┐
│                         User's Browser                          │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │              Next.js Frontend (Port 3001)                 │  │
│  │  • Search Interface    • OAuth Authorization UI           │  │
│  │  • Login/Register      • Settings & Profile               │  │
│  └───────────────────────────────────────────────────────────┘  │
└────────────────────┬──────────────────────────────────────────┬─┘
                     │                                          │
                     │ HTTP/REST API                            │ Cookie
                     │                                          │ (yew_session)
                     │                                          │
┌────────────────────▼──────────────────────────────────────────▼─┐
│              NestJS Backend (Port 3000)                         │
│                                                                 │
│  ┌─────────────┐  ┌──────────────┐  ┌────────────────────────┐  │
│  │   Auth      │  │  User API    │  │  Search API            │  │
│  │  • Login    │  │  • Profile   │  │  • Query Parser        │  │
│  │  • Logout   │  │  • Sessions  │  │  • Full-Text Search    │  │
│  │  • Register │  └──────────────┘  │  • Result Ranking      │  │
│  └─────────────┘                    └────────────────────────┘  │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │              Integration System                          │   │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐          │   │
│  │  │   Gmail    │  │   Slack    │  │    FTP     │  ...     │   │
│  │  │ Integration│  │ Integration│  │ Integration│          │   │
│  │  └────────────┘  └────────────┘  └────────────┘          │   │
│  │                                                          │   │
│  │  ┌──────────────────────────────────────────────┐        │   │
│  │  │         Polling System (Background)          │        │   │
│  │  │  • Task Queue (Bull/RabbitMQ)                │        │   │
│  │  │  • Priority-based scheduling                 │        │   │
│  │  │  • Idempotency checks                        │        │   │
│  │  └──────────────────────────────────────────────┘        │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │              Observability                               │   │
│  │  • Logging (structured JSON)                             │   │
│  │  • Tracing (OpenTelemetry)                               │   │
│  │  • Metrics (request duration, errors, etc.)              │   │
│  └──────────────────────────────────────────────────────────┘   │
└───────────────┬───────────────────────────────────────┬─────────┘
                │                                       │
                │                                       │
                ▼                                       ▼
┌───────────────────────────┐          ┌────────────────────────────┐
│   PostgreSQL Database     │          │  Redis (Session Store)     │
│                           │          │                            │
│  • user                   │          │  • Session IDs → User IDs  │
│  • user_session           │          │  • TTL-based expiration    │
│  • user_integration       │          │  • Fast lookups            │
│  • user_integration_      │          └────────────────────────────┘
│    content                │
│  • (OAuth encrypted)      │
└───────────────────────────┘

Data Flow

1. User Authentication Flow

User → Frontend → Backend → Redis
                    ↓
                PostgreSQL (user table)
                    ↓
                Set Cookie → Frontend → User

User submits credentials
Backend validates against PostgreSQL
Backend creates session in Redis
Backend sets secure cookie
All subsequent requests include cookie
Backend validates session on each request
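
The session steps above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `InMemorySessionStore` is a hypothetical in-memory stand-in for Redis, and the method names and TTL handling are assumptions.

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical record shape; the real store maps session IDs to user IDs in Redis.
interface SessionRecord {
  userId: string;
  expiresAt: number; // epoch milliseconds
}

class InMemorySessionStore {
  private readonly sessions = new Map<string, SessionRecord>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Create a session with an unguessable ID and return that ID (the cookie value).
  create(userId: string, now: number = Date.now()): string {
    const sessionId = randomBytes(32).toString("hex");
    this.sessions.set(sessionId, { userId, expiresAt: now + this.ttlMs });
    return sessionId;
  }

  // Return the user ID for a live session, or null if it is missing or expired.
  validate(sessionId: string, now: number = Date.now()): string | null {
    const record = this.sessions.get(sessionId);
    if (!record) return null;
    if (record.expiresAt <= now) {
      this.sessions.delete(sessionId);
      return null;
    }
    return record.userId;
  }

  // Deleting the record ends the session immediately.
  destroy(sessionId: string): void {
    this.sessions.delete(sessionId);
  }
}
```

Because every request goes through `validate`, deleting a record takes effect on the very next request.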

2. Integration Authorization Flow (OAuth)

User → "Connect Gmail" → Backend
                            ↓
                    Generate OAuth URL
                            ↓
                    Redirect to Google
                            ↓
                    User grants access
                            ↓
                    Google redirects back
                            ↓
                    Exchange code for tokens
                            ↓
                    Encrypt & store tokens
                            ↓
                    Store in user_integration table
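
As an illustration of the "Generate OAuth URL" step, the sketch below builds a Google authorization URL. The function name and option names are hypothetical, and the `state` parameter (a random value checked when Google redirects back) is an assumption about how the callback would be protected, not something this document specifies.

```typescript
import { randomBytes } from "node:crypto";

// Hypothetical option names, used only for this sketch.
interface OAuthUrlOptions {
  clientId: string;
  redirectUri: string;
  scopes: string[];
}

// Build the URL the user is redirected to, plus the state value to verify on callback.
function buildGoogleAuthUrl(options: OAuthUrlOptions): { url: string; state: string } {
  const state = randomBytes(16).toString("hex");
  const url = new URL("https://accounts.google.com/o/oauth2/v2/auth");
  url.searchParams.set("client_id", options.clientId);
  url.searchParams.set("redirect_uri", options.redirectUri);
  url.searchParams.set("response_type", "code"); // authorization code flow
  url.searchParams.set("scope", options.scopes.join(" "));
  url.searchParams.set("access_type", "offline"); // ask for a refresh token
  url.searchParams.set("state", state);
  return { url: url.toString(), state };
}
```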

3. Data Synchronization Flow

Cron (every 1 min) → Create "start" task
                            ↓
                    Polling System (Bull/RabbitMQ)
                            ↓
                    Load Integration Plugin
                            ↓
                    Run Task (e.g., fetch emails)
                            ↓
            ┌───────────────┴────────────────┐
            ▼                                ▼
    New Tasks (pagination)          Output (email data)
            │                                │
            └────→ Queue              Save to PostgreSQL
                                    (user_integration_content)

Tasks spawn more tasks recursively until all data is synced.
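
A rough sketch of this recursive spawning, using a plain array as a stand-in for the Bull queue. The `Task` and `TaskResult` shapes, the `drainQueue` helper, and the fake two-page mailbox are all hypothetical and exist only to show the loop.

```typescript
interface Task {
  type: "start" | "list" | "download";
  priority: number; // lower number runs first
  payload: Record<string, string>;
}

interface TaskResult {
  newTasks: Task[];
  output: Array<{ externalId: string; content: string }>;
}

type TaskHandler = (task: Task) => TaskResult;

// Run tasks until the queue is empty, collecting everything that would be saved.
function drainQueue(initial: Task, handler: TaskHandler): TaskResult["output"] {
  const queue: Task[] = [initial];
  const saved: TaskResult["output"] = [];
  while (queue.length > 0) {
    queue.sort((a, b) => a.priority - b.priority);
    const task = queue.shift() as Task;
    const result = handler(task);
    saved.push(...result.output); // would be written to user_integration_content
    queue.push(...result.newTasks); // spawned tasks go back on the queue
  }
  return saved;
}

// Fake two-page mailbox used only to exercise the loop.
const pages: Record<string, { ids: string[]; next?: string }> = {
  p1: { ids: ["a", "b"], next: "p2" },
  p2: { ids: ["c"] },
};

const fakeHandler: TaskHandler = (task) => {
  const childPriority = task.priority + 1; // each generation runs at lower priority
  if (task.type === "start") {
    return { newTasks: [{ type: "list", priority: childPriority, payload: { page: "p1" } }], output: [] };
  }
  if (task.type === "list") {
    const page = pages[task.payload.page];
    const newTasks: Task[] = page.ids.map((id): Task => ({ type: "download", priority: childPriority, payload: { id } }));
    if (page.next) {
      newTasks.push({ type: "list", priority: childPriority, payload: { page: page.next } });
    }
    return { newTasks, output: [] };
  }
  return { newTasks: [], output: [{ externalId: task.payload.id, content: `email ${task.payload.id}` }] };
};
```

Calling `drainQueue({ type: "start", priority: 0, payload: {} }, fakeHandler)` walks both pages and returns one output item per email.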

4. Search Flow

User → Search Query → Backend
                        ↓
                Parse & sanitize query
                        ↓
        PostgreSQL full-text search
        (on user_integration_content.content)
                        ↓
            Rank & filter results
                        ↓
            Return to Frontend → User
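
The query step might look something like the sketch below, which builds a parameterized PostgreSQL full-text query scoped to one user. The column names (`user_integration_id`, `user_id`, `external_id`), the `english` text search configuration, and the 200-character cap are assumptions; `websearch_to_tsquery` requires PostgreSQL 11 or later.

```typescript
interface SqlQuery {
  text: string;
  values: Array<string | number>;
}

// Normalize the raw query and return SQL plus bound parameters (no string interpolation of user input).
function buildSearchQuery(userId: string, rawQuery: string, limit: number = 20): SqlQuery {
  const query = rawQuery.trim().replace(/\s+/g, " ").slice(0, 200);
  if (query.length === 0) {
    throw new Error("Search query must not be empty");
  }
  const text = `
    SELECT c.id, c.external_id,
           ts_rank(to_tsvector('english', c.content),
                   websearch_to_tsquery('english', $2)) AS rank
    FROM user_integration_content c
    JOIN user_integration i ON i.id = c.user_integration_id
    WHERE i.user_id = $1
      AND to_tsvector('english', c.content) @@ websearch_to_tsquery('english', $2)
    ORDER BY rank DESC
    LIMIT $3`;
  return { text, values: [userId, query, limit] };
}
```

Passing the user ID as a bound parameter in the `WHERE` clause is what keeps results scoped to the authenticated user.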

Major Components

Backend (NestJS)

Location: backend/

The backend is a NestJS monolith that handles all API requests, authentication, integration management, and background data synchronization.

Key Modules:

Session Module (src/v1/user-session/)

Session management (create, validate, delete)
Cookie-based sessions

User Module (src/v1/user/)

User registration
Password hashing with Argon2
User profile management
User settings
Active session management

Integration Module (src/integration/)

Dynamic integration loading
OAuth endpoint (unified for all integrations)
Integration instance management
See: Writing Integrations

Polling Module (src/integration/polling/)

Background task queue (Bull/RabbitMQ)
Priority-based task scheduling
Integration task execution
Idempotency checking
See: Integration Idempotency

Search Module (src/search/)

Query parsing and validation
Full-text search using PostgreSQL
Result ranking and filtering
(Future: Elasticsearch integration)

Frontend (Next.js)

Location: frontend/

Server-side rendered React application for the user interface.

Key Pages:

/login - Authentication
/register - New user signup
/search - Main search interface
/integrations - Manage connected services (OAuth)
/settings - User preferences and active sessions

Authentication:

Cookie-based (automatically sent with every request)
Redirects to /login on 401 responses
No client-side token management needed

Database (PostgreSQL)

Core Tables:

user

User accounts
Email, name, hashed password
Created/updated timestamps

user_session

Used to look up sessions by user, location, and so on
Think of this as a log of sessions rather than the store used for authentication
Session ID, user ID
Expiration, last accessed timestamp
IP address, user agent

user_integration

User's connected services (Gmail, Slack, etc.)
OAuth credentials (encrypted)
Integration domain, status
Sync state (last history ID, etc.)

user_integration_content

Actual synced data (emails, messages, files)
JSONB content field (integration-specific structure)
External ID (idempotency key)
External timestamp (when created in source system)

See: Entity Standards

Session Store (Redis)

Sessions are stored in both Redis and PostgreSQL, but only Redis is used for authentication.

Redis is fast at looking up a session by session ID, but it is a poor fit for filtering sessions by user ID.

PostgreSQL covers that case. It lets us show all of a user's sessions so they can manage them from one place.
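
The split between the two stores can be sketched as follows. Both stores are in-memory stand-ins here (a `Map` for Redis, an array for the `user_session` table), and the `revoked` flag and method names are assumptions made for the example.

```typescript
interface SessionRow {
  sessionId: string;
  userId: string;
  userAgent: string;
  revoked: boolean;
}

function createSessionStores() {
  const authStore = new Map<string, string>(); // stand-in for Redis: session ID -> user ID
  const sessionLog: SessionRow[] = []; // stand-in for the user_session table

  return {
    // Write to both stores when a session starts.
    open(sessionId: string, userId: string, userAgent: string): void {
      authStore.set(sessionId, userId);
      sessionLog.push({ sessionId, userId, userAgent, revoked: false });
    },
    // Authentication only ever consults the key-value store.
    authenticate(sessionId: string): string | null {
      return authStore.get(sessionId) ?? null;
    },
    // Listing by user is the query the relational store is good at.
    listActive(userId: string): SessionRow[] {
      return sessionLog.filter((row) => row.userId === userId && !row.revoked);
    },
    // Remote logout: remove the auth entry and mark the log row.
    revoke(sessionId: string): void {
      authStore.delete(sessionId);
      for (const row of sessionLog) {
        if (row.sessionId === sessionId) row.revoked = true;
      }
    },
  };
}
```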

See: Authorization

Task Queue (Bull/RabbitMQ)

Current: Bull (Redis-backed)
Future: RabbitMQ (when scaling is needed)

Background task processing for integration data synchronization:

Priority queue (new data fetched before backfill)
Retry logic for transient failures
Dead-letter queue for permanent failures
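
The retry and dead-letter behaviour can be illustrated with a small sketch. It does not use Bull's API; the `Job` and `Outcome` types and the `processJob` helper are hypothetical, and a real queue would normally wait (with backoff) between attempts instead of retrying immediately.

```typescript
type Outcome = "ok" | "transient" | "permanent";

interface Job {
  id: string;
  attempts: number;
}

// Retry transient failures up to maxAttempts; anything else lands in the dead-letter list.
function processJob(
  job: Job,
  work: (job: Job) => Outcome,
  maxAttempts: number,
  deadLetters: Job[],
): "completed" | "dead-lettered" {
  while (job.attempts < maxAttempts) {
    job.attempts += 1;
    const outcome = work(job);
    if (outcome === "ok") return "completed";
    if (outcome === "permanent") break; // retrying will not help
    // transient failure: loop and try again
  }
  deadLetters.push(job);
  return "dead-lettered";
}
```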

Integrations (Plugins)

Location: backend/src/integration/*/

Each integration is a self-contained plugin:

Structure:

backend/src/integration/gmail/
├── src/
│   └── main.ts           # Integration implementation
├── manifest.json         # Metadata (domain, name, version)
└── package.json          # Dependencies

Responsibilities:

Authenticate with external service (OAuth or credentials)
Define task types (start, list, download, etc.)
Execute tasks and return output + new tasks
Handle pagination, error recovery, rate limiting

Key Concept: Integrations know nothing about the core application. They receive credentials and return data. The core application handles storage, queueing, and orchestration.
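
One way to picture that boundary is as an interface the core calls into. The sketch below is an assumption about its shape, not the project's actual contract: all type names are hypothetical, and `runTask` is synchronous only to keep the example short (a real integration would make network calls and return a promise).

```typescript
// Mirrors the manifest.json fields listed above.
interface Manifest {
  domain: string;
  name: string;
  version: string;
}

interface IntegrationTask {
  type: string; // e.g. "start", "list", "download"
  payload: Record<string, string>;
}

interface IntegrationOutput {
  externalId: string;
  externalTimestamp: string;
  content: Record<string, unknown>;
}

interface IntegrationResult {
  output: IntegrationOutput[];
  newTasks: IntegrationTask[];
}

// The only surface the core sees: credentials and a task in, data and new tasks out.
interface Integration {
  manifest: Manifest;
  runTask(credentials: Record<string, string>, task: IntegrationTask): IntegrationResult;
}

// Toy integration with no knowledge of storage, queues, or users.
const exampleIntegration: Integration = {
  manifest: { domain: "example", name: "Example", version: "0.0.1" },
  runTask(credentials, task) {
    if (!credentials.accessToken) {
      throw new Error("Missing access token");
    }
    if (task.type === "start") {
      return { output: [], newTasks: [{ type: "list", payload: {} }] };
    }
    return { output: [], newTasks: [] };
  },
};
```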

See: Writing Integrations

Key Design Decisions

Why PostgreSQL Over MongoDB?

Decision: Use PostgreSQL for all data storage.

Rationale:

Enforces consistency (foreign keys, constraints, schemas)
Prevents "bad data practices" common in NoSQL
Small teams need databases that enforce order automatically
TypeORM provides excellent PostgreSQL support

See: Why Postgres

Why Cookie Sessions Over JWT?

Decision: Use cookie-based sessions, never JWT.

Rationale:

Remote logout is critical: sessions can be terminated instantly
A stateless JWT cannot be revoked before it expires, so a stolen token stays usable for its full lifetime (for example, a 24-hour vulnerability window)
Privacy focus requires immediate access control
Central session store provides audit trail

See: Authorization

Why Monolith Over Microservices?

Decision: Single NestJS application, not microservices.

Rationale:

Simpler to develop and deploy
Small team doesn't need microservice complexity
Session-based auth works perfectly for monoliths
Monoliths are easy to deploy on local systems
Can always split later if needed (YAGNI)

Why Task-Based Integration Architecture?

Decision: Integrations use recursive task spawning, not batch jobs.

Rationale:

Handles pagination naturally (each page spawns next page task)
Priority queue ensures new data is fetched quickly during backfill
Granular error handling (one failed email doesn't block others)
Idempotency prevents duplicate data

See: Writing Integrations

Why Self-Hosted?

Decision: Must run on user's own hardware (Raspberry Pi 4+).

Rationale:

Privacy: user controls their data
No SaaS vendor lock-in
No recurring cloud costs
Compliance with data sovereignty requirements

Future Plans:

Yew Search will run in three different ways, similar to Mattermost, n8n, or GitLab.

Local homelab / Raspberry Pi deployment (for individuals)
- We like homelabs, and we want this community to like us
- Homelabs give users a way to try the software before buying, with minimal work from us
- Not all features will be available in the self-hosted option, and not all of them can be
Business (for companies)
- Most companies don't want to self-host, and this makes sense
- Self-hosting is usually more expensive for a company than buying a subscription
- This also allows more advanced features, such as SOC 2 compliance, that companies will use
- This is managed on our servers, and all companies will be in a shared pool
- This is similar to how most hosted solutions work
Enterprise (for large companies)
- When a company outgrows the Business plan, they will move to the Enterprise plan
- This comes with dedicated hardware and more support from our team

Data Privacy & Security

Authentication Security

Argon2 password hashing (not bcrypt)
HttpOnly cookies (JavaScript cannot access)
Secure flag (HTTPS only in production)
SameSite strict (CSRF protection)
Rate limiting on login endpoint
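
The cookie attributes above translate into a `Set-Cookie` header along these lines. This is a minimal sketch: the helper name, the `Path=/` attribute, and the `Max-Age` handling are assumptions, while `yew_session` is the cookie name shown in the component diagram.

```typescript
// Build the Set-Cookie header value for a session cookie.
function buildSessionCookie(sessionId: string, maxAgeSeconds: number, production: boolean): string {
  const parts = [
    `yew_session=${encodeURIComponent(sessionId)}`,
    "Path=/",
    `Max-Age=${maxAgeSeconds}`,
    "HttpOnly", // not readable from JavaScript
    "SameSite=Strict", // not sent on cross-site requests
  ];
  if (production) {
    parts.push("Secure"); // HTTPS only
  }
  return parts.join("; ");
}
```

For example, `buildSessionCookie("abc", 3600, true)` returns `yew_session=abc; Path=/; Max-Age=3600; HttpOnly; SameSite=Strict; Secure`.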

Integration Security

OAuth tokens encrypted at rest in database
Tokens only decrypted when making API calls
Integration plugins cannot access other users' data
Each user's integrations isolated by user ID
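
As an illustration of encrypting tokens at rest, the sketch below uses AES-256-GCM from Node's built-in `crypto` module. The choice of algorithm, the dot-separated storage format, and the function names are assumptions; this document does not say which cipher the project uses.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

// Encrypt a token with a 32-byte key; returns "iv.tag.ciphertext" in base64.
function encryptToken(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // fresh nonce for every encryption
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  return [iv, tag, ciphertext].map((part) => part.toString("base64")).join(".");
}

// Decrypt a value produced by encryptToken; throws if the data was tampered with.
function decryptToken(encoded: string, key: Buffer): string {
  const [iv, tag, ciphertext] = encoded.split(".").map((part) => Buffer.from(part, "base64"));
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

Decrypting only at the moment an API call is made keeps the plaintext token out of the database and out of long-lived memory.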

Search Security

Search queries scoped to authenticated user
No cross-user data leakage
All queries logged for audit trail
Content never exposed in URLs (POST, not GET)

Development Workflow

Local Development Setup

TBD

Environment Variables:

DATABASE_URL - PostgreSQL connection string
REDIS_URL - Redis connection string
SESSION_SECRET - Random secret for session encryption
GMAIL_OAUTH_CLIENT_ID - OAuth credentials for Gmail
GMAIL_OAUTH_CLIENT_SECRET
(See .env.example for full list)
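
A startup check for these variables could look like the following sketch. The `AppConfig` shape and `loadConfig` helper are hypothetical; only the variable names come from the list above.

```typescript
interface AppConfig {
  databaseUrl: string;
  redisUrl: string;
  sessionSecret: string;
  gmailOauthClientId: string;
  gmailOauthClientSecret: string;
}

// Read required variables and fail fast with one error naming everything that is missing.
function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const missing: string[] = [];
  const read = (name: string): string => {
    const value = env[name];
    if (value === undefined || value === "") {
      missing.push(name);
      return "";
    }
    return value;
  };
  const config: AppConfig = {
    databaseUrl: read("DATABASE_URL"),
    redisUrl: read("REDIS_URL"),
    sessionSecret: read("SESSION_SECRET"),
    gmailOauthClientId: read("GMAIL_OAUTH_CLIENT_ID"),
    gmailOauthClientSecret: read("GMAIL_OAUTH_CLIENT_SECRET"),
  };
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return config;
}
```

In a Node process this would be called once at startup as `loadConfig(process.env)`.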

Raspberry Pi Deployment

Yew Search is designed to run on Raspberry Pi 4 (4GB+ RAM):

Flash Raspberry Pi OS (64-bit)
Install Docker & Docker Compose
Clone repository
Run docker-compose up -d
Access via http://raspberrypi.local:3001

Performance Notes:

PostgreSQL uses minimal indexes for low-memory usage
Background tasks throttled to avoid CPU spikes
Redis configured for low memory footprint

Observability

Logging

All logs are structured JSON with consistent fields:

timestamp - ISO 8601
level - debug, info, warn, error
message - Human-readable description
module - Which module emitted the log
method - Which method emitted the log
requestId - Request correlation ID
userId - User ID (if authenticated)
traceId - Distributed tracing ID
spanId - Current span ID
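
A log line with these fields can be produced with a small helper like the one sketched here. The `formatLog` name and `LogContext` type are hypothetical, and treating the correlation fields as optional is an assumption.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

interface LogContext {
  module: string;
  method: string;
  requestId?: string;
  userId?: string;
  traceId?: string;
  spanId?: string;
}

// Serialize one log entry as a single line of JSON.
function formatLog(level: LogLevel, message: string, context: LogContext, now: Date = new Date()): string {
  return JSON.stringify({
    timestamp: now.toISOString(), // ISO 8601
    level,
    message,
    ...context,
  });
}
```

For example, `formatLog("info", "Search executed", { module: "search", method: "query", requestId: "req-1" })` yields one JSON object per line, which most log collectors can parse directly.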

See: Logging Standards

Tracing

OpenTelemetry distributed tracing for:

HTTP requests (automatic)
Database queries (automatic)
Business operations (manual instrumentation)

Traces can be exported to Jaeger, Zipkin, or other OTLP-compatible backends.

See: Tracing Standards

Metrics

Automatic metrics collection:

HTTP request rate, duration, errors
Active sessions count
Integration sync success/failure rate
Search query performance

See: Metrics Standards

Testing Strategy

Unit Tests

Test individual functions and classes
Mock external dependencies
Fast execution (< 1 second per test)

Integration Tests

Test module interactions
Use test database (separate from dev)
Test API endpoints end-to-end

E2E Tests

Test full user workflows
Browser automation (Playwright)
Test OAuth flows, search, etc.

Run tests:

# Unit tests
npm test

# E2E tests
npm run test:e2e

# Coverage
npm run test:cov

Performance Considerations

Database Performance

Indexes: Minimal indexes for low-memory environments
JSONB: Content stored as JSONB for flexibility
Full-text search: PostgreSQL GIN indexes on content fields
Connection pooling: TypeORM manages connection pool

Search Performance

V1 (PostgreSQL):

Full-text search on JSONB content field
Good for < 100k documents
No additional infrastructure needed

V2 (Elasticsearch):

Dedicated search engine
Good for > 100k documents
Requires additional container

Background Tasks

Priority queue: New data fetched before backfill
Throttling: Rate limiting to avoid API quota exhaustion
Batching: Group operations where possible
Idempotency: Skip already-downloaded content

Scalability

Current Scale (V1)

Designed for:

Single user or small family (< 10 users)
Raspberry Pi 4 hardware (4GB+ RAM)
< 100k documents per user
< 10 concurrent searches

Future Scale (V2+)

Can scale to:

Small teams (< 100 users)
Cloud VPS or bare metal server
Millions of documents
Horizontal scaling with Redis + multiple backend instances

Scaling Strategy:

Add Elasticsearch for better search performance
Add RabbitMQ for distributed task queue
Add multiple backend instances behind load balancer
Keep PostgreSQL as single source of truth

Common Patterns

NestJS Patterns

Controllers: Thin orchestrators

Validate input (DTOs)
Call service methods
Handle errors (convert to HTTP exceptions)
Return DTOs

Services: Business logic

No HTTP knowledge (throw domain errors, not HTTP exceptions)
Return domain objects (not entities)
Validate business rules

Entities: Database schema

Map to PostgreSQL tables
Use TypeORM decorators
Include soft delete support

See: Controller Standards, Service Standards, Entity Standards
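
The controller/service split can be sketched without any framework code. In the example below the service throws a domain error and the controller translates it into an HTTP status. All names (`DomainError`, `findProfile`, `getProfileController`) are hypothetical, and plain functions stand in for NestJS classes and decorators.

```typescript
type DomainErrorCode = "NOT_FOUND" | "CONFLICT";

// Domain error: carries a code but knows nothing about HTTP.
class DomainError extends Error {
  readonly code: DomainErrorCode;
  constructor(code: DomainErrorCode, message: string) {
    super(message);
    this.code = code;
  }
}

function isDomainError(error: unknown): error is DomainError {
  return error instanceof Error && "code" in error;
}

interface UserProfile {
  id: string;
  name: string;
}

interface HttpResponse {
  status: number;
  body: unknown;
}

const STATUS_BY_CODE: Record<DomainErrorCode, number> = { NOT_FOUND: 404, CONFLICT: 409 };

// Service: business logic only.
function findProfile(users: Map<string, UserProfile>, id: string): UserProfile {
  const user = users.get(id);
  if (!user) {
    throw new DomainError("NOT_FOUND", `User ${id} not found`);
  }
  return user;
}

// Controller: validate input, call the service, convert domain errors to HTTP responses.
function getProfileController(users: Map<string, UserProfile>, id: string): HttpResponse {
  if (id.trim() === "") {
    return { status: 400, body: { message: "id is required" } };
  }
  try {
    return { status: 200, body: findProfile(users, id) };
  } catch (error) {
    if (isDomainError(error)) {
      return { status: STATUS_BY_CODE[error.code], body: { message: error.message } };
    }
    throw error; // unexpected errors propagate to a global handler
  }
}
```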

Integration Patterns

Task Types:

start - Entry point (cron triggers this)
list - Fetch IDs/metadata (e.g., list emails)
download - Fetch full content (e.g., download email)

Task Spawning:

List tasks spawn download tasks (one per item)
Pagination handled by spawning new list tasks
Priority degrades with each generation

Idempotency:

Check if content exists before downloading
Use external ID as deduplication key
Skip already-downloaded items
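
The idempotency check amounts to deduplicating on the external ID before saving. In this sketch a `Map` stands in for the `user_integration_content` table; the helper name and item shape are hypothetical, and a real key would presumably also include the integration the item belongs to.

```typescript
interface ContentItem {
  externalId: string;
  content: string;
}

// Save only items whose external ID has not been seen; return how many were saved.
function saveNewContent(existing: Map<string, ContentItem>, incoming: ContentItem[]): number {
  let saved = 0;
  for (const item of incoming) {
    if (existing.has(item.externalId)) {
      continue; // already downloaded, skip
    }
    existing.set(item.externalId, item);
    saved += 1;
  }
  return saved;
}
```

Running the same batch twice saves nothing the second time, which is what makes re-running a sync safe.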

See: Writing Integrations

Where to Go Next

Now that you understand the high-level architecture, dive deeper into specific areas:

Backend Development:

Integration Development:

Authentication & Security:

Authorization

Observability:

Coding Standards:

Project Planning:

Roadmap

Getting Help

Documentation: You're reading it! Check the sidebar for specific topics.

Code Examples: Look at existing modules (e.g., src/integration/gmail/) for working examples.

Questions: Ask the team in Slack/Discord/your communication channel.

Welcome to Yew Search! 🎉
