Concept Brief · Living Database Engine · May 2026 · Confidential
ALIVE
Idea Summary · v1.0

The Living Database

A domain-agnostic engine that treats any dataset as a living organism: it continuously breathes, grows, evolves, and prunes itself, using signals from the open internet.

Domain Types Supported
0 Paid APIs Required
24/7 Self-Evolving Engine
Equivalent Exists Today
01

Every structured dataset — investors, companies, drugs, space missions, geopolitical entities — begins decaying the moment it is created. The world moves. The data doesn't.

Existing solutions (Crunchbase, PitchBook, Apollo) are domain-locked, expensive, and closed. They work for one slice of the world and charge accordingly. No open, self-hostable, domain-agnostic equivalent exists.

"Crunchbase only keeps investor data. What if we go for space tech advancements around the world? No single source owns that data. It lives scattered across arXiv, NASA releases, patent filings, university labs, and forums. Nobody is maintaining a living, structured database of that."
📉
Static Datasets Decay
People change roles. Funds close. Missions launch. Discoveries happen. Any snapshot is obsolete within weeks.
🔒
Existing Solutions Are Locked
Crunchbase, PitchBook, and Apollo are proprietary, expensive, and cover only their chosen domain. You cannot self-host, extend, or redirect them.
🌐
The Knowledge Lives Everywhere
For most domains, the freshest data is scattered across the open internet — unstructured, unindexed, and unowned.
02

The Living Database treats any dataset not as a static table, but as a living organism with four biological behaviours running continuously:

🫁
Breathes
Continuously pulls fresh signals from the internet — search results, crawled pages, RSS feeds, public filings.
🌱
Grows
Discovers and adds new records that belong in the dataset — entities the original snapshot missed entirely.
🔄
Evolves
Updates existing records as the world changes — role shifts, new activity, revised facts, updated status.
✂️
Prunes
Marks or removes records whose real-world entities no longer exist: closed funds, defunct companies, discontinued projects.
The key insight: the dataset defines its own shape. You give the engine any CSV or database table and it figures out what each record is, what's worth enriching, where to look, and how to write back — without being pre-programmed for a specific domain.
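To make the "dataset defines its own shape" idea concrete, here is a minimal sketch of a first pass over an arbitrary CSV. It is a heuristic stand-in, not the engine's method: the brief assigns this job to an LLM-backed agent, whereas this sketch only guesses coarse column types and sorts fields into enrichable, stable, or unknown by column-name tokens. The token lists and the function names `guess_type` and `profile_columns` are assumptions made for illustration.

```python
import csv
import io
import re

# Hypothetical token lists; a real analyst step would ask the LLM instead.
VOLATILE_TOKENS = {"title", "role", "status", "employer", "stage", "website"}
STABLE_TOKENS = {"id", "name", "founded", "country"}


def _is_number(text):
    """True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def guess_type(values):
    """Return a coarse type label based on the non-empty sample values."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return "empty"
    if all(_is_number(v) for v in non_empty):
        return "number"
    if all(v.startswith(("http://", "https://")) for v in non_empty):
        return "url"
    return "text"


def profile_columns(csv_text, sample_size=50):
    """Profile each column of a CSV from its first sample_size rows."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [row for _, row in zip(range(sample_size), reader)]
    profile = {}
    for column in reader.fieldnames or []:
        values = [row.get(column) or "" for row in rows]
        # Split "current_role" into {"current", "role"} so matching is per token.
        tokens = set(re.split(r"[^a-z0-9]+", column.lower()))
        if tokens & VOLATILE_TOKENS:
            volatility = "enrichable"
        elif tokens & STABLE_TOKENS:
            volatility = "stable"
        else:
            volatility = "unknown"
        profile[column] = {
            "type": guess_type(values),
            "volatility": volatility,
            "filled": len([v for v in values if v.strip()]),
        }
    return profile
```

Fields that come back as "unknown" are the ones where a language model would plausibly add the most value, since name-based rules cannot tell whether a column such as `fund_size_usd` changes over time.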
03

The engine runs as a scheduled pipeline. Every cycle, it processes a batch of records through five stages, using a self-hosted LLM for all model work and free or open internet sources for data.

01
Schema Analyst Agent
Reads sample rows. Infers entity type (person, company, place, mission, drug…). Identifies which fields are enrichable vs stable. Generates a search strategy per field.
02
Search & Fetch Agent
Builds search queries per record. Hits Serper/Exa/DuckDuckGo. Crawls top results via Crawl4AI (fully local, handles JS-rendered pages). Pulls RSS and public filing sources.
03
Extraction Agent (Gemma 4 Local)
Local LLM reads raw crawled content. Extracts only fields relevant to this schema. Confidence-scores each extracted value. Flags ambiguous or conflicting data.
04
Merge Agent
Compares new vs existing values. Applies update rules: overwrite / append / flag for review. Logs full provenance — what changed, from where, when.
05
Growth Agent
While fetching data for existing records, identifies new entities that belong in the dataset. Proposes and stages new records for addition. The dataset grows beyond its original seed.
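The Merge Agent's three update rules (overwrite, append, flag for review) and its provenance log can be sketched as below. This is an illustrative sketch under stated assumptions, not the engine's actual logic: the `Candidate` and `MergeResult` types, the `APPEND_FIELDS` set, and the 0.8 confidence threshold are all hypothetical choices.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Candidate:
    """One extracted value proposed for a field, with source and confidence."""
    field_name: str
    value: str
    confidence: float  # 0.0 to 1.0, as scored by the extraction step
    source_url: str


@dataclass
class MergeResult:
    record: dict
    provenance: list = field(default_factory=list)
    review_queue: list = field(default_factory=list)


# Hypothetical policy values; a real deployment would derive these per dataset.
APPEND_FIELDS = {"investments", "publications"}
OVERWRITE_THRESHOLD = 0.8


def merge(record, candidates, now=None):
    """Apply candidates to a copy of record, logging what changed and why."""
    now = now or datetime.now(timezone.utc)
    result = MergeResult(record=dict(record))
    for c in candidates:
        old = result.record.get(c.field_name)
        if c.field_name in APPEND_FIELDS:
            items = list(old or [])  # copy so the caller's list is untouched
            if c.value in items:
                continue  # already known, nothing to log
            items.append(c.value)
            new, action = items, "append"
        elif old == c.value:
            continue  # no change
        elif c.confidence >= OVERWRITE_THRESHOLD:
            new, action = c.value, "overwrite"
        else:
            result.review_queue.append(c)  # too uncertain to write directly
            continue
        result.record[c.field_name] = new
        result.provenance.append({
            "field": c.field_name,
            "old": old,
            "new": new,
            "action": action,
            "source": c.source_url,
            "at": now.isoformat(),
        })
    return result
```

Keeping low-confidence values in a review queue instead of discarding them means a human, or a later cycle with corroborating sources, can still promote them.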
04
Layer | Tool | Cost | Role
LLM | Gemma 4 (local) | Free | Schema inference, extraction, merge decisions
Crawling | Crawl4AI | Free / OSS | JS-rendered page scraping, returns clean markdown
Search | Serper.dev / Exa / DDG | Free tier | Live web queries per record, per field
Public Data | SEC EDGAR, OpenCorporates, arXiv, RSS | Free | Structured sources for specific domain types
Orchestration | LangGraph | OSS | Agent pipeline, state management, retry logic
Storage | Supabase | Free tier | Dataset storage, change logs, provenance tracking
Scheduler | Cron / Supabase Edge | Free | Trigger daily refresh cycles per record tier
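The scheduler row mentions refresh cycles per record tier. One way a daily cron job might pick its batch is sketched below. The tier names, the intervals, and the `tier` and `last_refreshed` keys are assumptions for illustration; the brief does not define them.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical tiers and intervals; the brief only says refreshes run per tier.
REFRESH_INTERVALS = {
    "hot": timedelta(days=1),
    "warm": timedelta(days=7),
    "cold": timedelta(days=30),
}


def due_records(records, now, batch_size=100):
    """Return up to batch_size records that are due, most overdue first."""
    overdue = []
    for record in records:
        interval = REFRESH_INTERVALS[record["tier"]]
        last = record.get("last_refreshed")
        if last is None:
            # Never refreshed: treat as maximally overdue.
            overdue.append((timedelta.max, record))
        elif now - last >= interval:
            overdue.append((now - last - interval, record))
    overdue.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in overdue[:batch_size]]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        {"id": 1, "tier": "hot", "last_refreshed": now - timedelta(days=2)},
        {"id": 2, "tier": "warm", "last_refreshed": now - timedelta(days=3)},
    ]
    print([r["id"] for r in due_records(sample, now)])
```

Capping the batch size keeps each cycle within free-tier search quotas, and sorting by how overdue a record is prevents cold records from being starved indefinitely.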
05

The value isn't in re-doing investor data. It's in every domain where no Crunchbase exists — vast, valuable, structurally scattered knowledge that the world hasn't indexed.

🚀
Space Tech
arXiv · NASA · SpaceNews
💊
Drug Discovery
PubMed · ClinicalTrials · FDA
🌍
Climate Tech
IRENA · Gov reports · Patents
⚗️
Semiconductor R&D
IEEE · Patent filings · Fab news
🏛️
Policy & Regulation
Gov gazettes · EUR-Lex · Congress
🧬
Biotech Research
PubMed · WHO · Lab pages
⚔️
Geopolitics
Reuters · UN · Think tanks
🏺
Archaeology
Journals · Museum databases
Each of these domains has the same problem: massive, real, valuable knowledge scattered across the open internet with no single structured, living source. That's the gap.
06
This Doesn't Exist Yet.
Crunchbase is a billion-dollar company built on keeping one domain alive with humans and scrapers. The Living Database is the open, AI-native version of that infrastructure: for any domain, self-hosted, at near-zero marginal cost.

The longer the engine runs on a domain, the richer and more accurate that dataset becomes. That compounding knowledge is the moat. Nobody can replicate years of continuous enrichment overnight.
🏗️
Infrastructure Play
Sell the engine as self-hosted infrastructure. Any team with a dataset and a domain can run their own living database.
📦
Data Product Play
Pick one high-value unstructured domain. Run the engine for 6 months. Sell access to the living dataset itself.
🔌
API Layer Play
Expose living datasets as queryable APIs. Developers and analysts pay per query or per seat to access always-fresh structured knowledge.