The Living Database — Concept Summary

01

The Problem

Knowledge Decays. Nobody Fixes It.

Every structured dataset — investors, companies, drugs, space missions, geopolitical entities — begins decaying the moment it is created. The world moves. The data doesn't.

Existing solutions (Crunchbase, PitchBook, Apollo) are domain-locked, expensive, and closed. They work for one slice of the world and charge accordingly. No open, self-hostable, domain-agnostic equivalent exists.

"Crunchbase only keeps investor data. What if we go for space tech advancements around the world? No single source owns that data. It lives scattered across arXiv, NASA releases, patent filings, university labs, and forums. Nobody is maintaining a living, structured database of that."

📉

Static Datasets Decay

People change roles. Funds close. Missions launch. Discoveries happen. A snapshot can start going stale within weeks.

🔒

Existing Solutions Are Locked

Crunchbase, PitchBook, and Apollo are proprietary, expensive, and cover only their chosen domain. You cannot self-host, extend, or redirect them.

🌐

The Knowledge Lives Everywhere

For most domains, the freshest data is scattered across the open internet — unstructured, unindexed, and unowned.

02

The Idea

A Dataset That Behaves Like an Organism

The Living Database treats any dataset not as a static table, but as a living organism with four biological behaviours running continuously:

🫁

Breathes

Continuously pulls fresh signals from the internet — search results, crawled pages, RSS feeds, public filings.

🌱

Grows

Discovers and adds new records that belong in the dataset — entities the original snapshot missed entirely.

🔄

Evolves

Updates existing records as the world changes — role shifts, new activity, revised facts, updated status.

✂️

Prunes

Marks or removes records that no longer exist — closed funds, defunct companies, discontinued projects.

The key insight: the dataset defines its own shape. You give the engine any CSV or database table and it figures out what each record is, what's worth enriching, where to look, and how to write back — without being pre-programmed for a specific domain.

03

How It Works

The Engine Architecture

The engine runs as a scheduled pipeline. Every cycle, it processes a batch of records through five stages. Inference and crawling run locally on a self-hosted LLM and crawler; search results and public data come from free or open internet sources.
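As a rough sketch of what one cycle could look like, the block below chains five placeholder stages over a batch of records. The names `CycleState`, `make_stub`, `STAGE_NAMES`, and `run_cycle` are hypothetical and do not come from the concept itself. In the proposed stack an orchestrator such as LangGraph would own this loop; plain functions are used here so nothing is assumed about that library's API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CycleState:
    """Everything one cycle carries between stages (hypothetical shape)."""
    records: list
    staged: list = field(default_factory=list)  # new records proposed by the growth stage
    trace: list = field(default_factory=list)   # names of stages that have run, in order


Stage = Callable[[CycleState], CycleState]


def make_stub(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""
    def stage(state: CycleState) -> CycleState:
        state.trace.append(name)
        return state
    return stage


# Stage names mirror the five agents in this section; the bodies are stubs.
STAGE_NAMES = ["schema_analyst", "search_fetch", "extraction", "merge", "growth"]


def run_cycle(records: list, stages: list) -> CycleState:
    """Pass one batch of records through every stage in order."""
    state = CycleState(records=list(records))
    for stage in stages:
        state = stage(state)
    return state


state = run_cycle([{"name": "Acme Capital"}], [make_stub(n) for n in STAGE_NAMES])
print(state.trace)
```

Each stub would be replaced by a real agent. Keeping every stage as a function from state to state means stages can be retried or swapped independently.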

01

Schema Analyst Agent

Reads sample rows. Infers entity type (person, company, place, mission, drug…). Identifies which fields are enrichable vs stable. Generates a search strategy per field.
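To make the enrichable-versus-stable split concrete, here is a minimal sketch that classifies columns from a few sample rows. It is a stand-in, not the agent itself: the concept has an LLM make this call, while the sketch uses a name-based heuristic. `STABLE_HINTS` and `analyse_schema` are hypothetical names, and entity-type inference is left out entirely.

```python
# Column-name tokens treated as "stable" (an illustrative assumption, not a rule from the concept).
STABLE_HINTS = ("id", "name", "founded", "born")


def analyse_schema(sample_rows: list) -> dict:
    """Label each column as stable or enrichable and report how often it is filled."""
    fields = {}
    for column in sample_rows[0].keys():
        values = [row.get(column, "") for row in sample_rows]
        filled = sum(1 for value in values if str(value).strip())
        tokens = column.lower().split("_")
        is_stable = any(hint in tokens for hint in STABLE_HINTS)
        fields[column] = {
            "kind": "stable" if is_stable else "enrichable",
            "fill_rate": filled / len(values),
        }
    return fields


rows = [
    {"id": "1", "name": "Acme Capital", "current_role": "", "status": "active"},
    {"id": "2", "name": "Beta Fund", "current_role": "Partner", "status": ""},
]
schema = analyse_schema(rows)
print(schema["current_role"])
```

A low fill rate on an enrichable column is a reasonable signal for where the search strategy should spend its budget first.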

02

Search & Fetch Agent

Builds search queries per record. Hits Serper/Exa/DuckDuckGo. Crawls top results via Crawl4AI (fully local, handles JS-rendered pages). Pulls RSS and public filing sources.
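Query building per record and per field could be as simple as the sketch below. `build_queries` is a hypothetical helper and the query template is an assumption. The search and crawl calls themselves (Serper, Exa, DuckDuckGo, Crawl4AI) are deliberately omitted, since their exact APIs are not specified here.

```python
def build_queries(record: dict, key_field: str, enrichable_fields: list) -> dict:
    """Return one search query per enrichable field for a single record."""
    entity = str(record[key_field]).strip()
    queries = {}
    for field_name in enrichable_fields:
        label = field_name.replace("_", " ")
        # Quote the entity so search engines treat it as an exact phrase.
        queries[field_name] = f'"{entity}" {label}'
    return queries


record = {"name": "Acme Capital", "current_role": "", "status": "active"}
print(build_queries(record, "name", ["current_role", "status"]))
```

In practice the Schema Analyst's per-field search strategy would replace the fixed template.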

03

Extraction Agent (Gemma 4 Local)

Local LLM reads raw crawled content. Extracts only fields relevant to this schema. Confidence-scores each extracted value. Flags ambiguous or conflicting data.
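One simple way to think about confidence scoring and conflict flagging is agreement across sources. The sketch below assumes the LLM has already pulled candidate values for one field out of several pages. It then scores the most common value by the share of sources that agree. This voting scheme and the name `score_candidates` are illustrative assumptions, not the scoring method the concept prescribes.

```python
from collections import Counter


def score_candidates(candidates: list):
    """Pick the most common value for one field and score it by source agreement.

    candidates is a list of (value, source_url) pairs extracted from crawled pages.
    """
    if not candidates:
        return None
    normalised = [(value.strip().lower(), url) for value, url in candidates]
    counts = Counter(value for value, _ in normalised)
    top_value, votes = counts.most_common(1)[0]
    return {
        "value": top_value,
        "confidence": votes / len(normalised),  # share of sources that agree
        "conflict": len(counts) > 1,            # more than one distinct value seen
        "sources": [url for value, url in normalised if value == top_value],
    }


result = score_candidates([
    ("Partner", "https://example.com/a"),
    ("partner", "https://example.com/b"),
    ("Principal", "https://example.com/c"),
])
print(result)
```

A value with a conflict flag or low confidence is a natural candidate for the Merge Agent's "flag for review" rule.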

04

Merge Agent

Compares new vs existing values. Applies update rules: overwrite / append / flag for review. Logs full provenance — what changed, from where, when.
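The three update rules and the provenance log can be sketched as a single function. The rule names follow the text (overwrite, append, flag for review). The function name `merge_field`, the log entry shape, and the `now` parameter are assumptions made for illustration.

```python
from datetime import datetime, timezone


def merge_field(record: dict, field_name: str, new_value, source: str,
                rule: str, log: list, now=None) -> str:
    """Apply one update rule to one field and log what changed, from where, and when."""
    old_value = record.get(field_name)
    if rule == "append":
        current = list(old_value or [])
        if new_value in current:
            return "unchanged"
        record[field_name] = current + [new_value]
        action = "appended"
    elif new_value == old_value:
        return "unchanged"
    elif rule == "overwrite":
        record[field_name] = new_value
        action = "overwritten"
    else:
        # Any other rule leaves the record untouched and queues the change for a human.
        action = "flagged"
    log.append({
        "field": field_name,
        "old": old_value,
        "new": new_value,
        "source": source,
        "action": action,
        "at": (now or datetime.now(timezone.utc)).isoformat(),
    })
    return action


record = {"status": "active", "aliases": ["Acme"], "current_role": "Partner"}
log = []
merge_field(record, "status", "closed", "https://example.com/a", "overwrite", log)
merge_field(record, "aliases", "Acme Capital", "https://example.com/b", "append", log)
merge_field(record, "current_role", "Principal", "https://example.com/c", "review", log)
print(record)
print(len(log))
```

Flagged changes are logged but not applied, so the log doubles as the review queue.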

05

Growth Agent

While fetching data for existing records, identifies new entities that belong in the dataset. Proposes and stages new records for addition. The dataset grows beyond its original seed.
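Staging new records mostly comes down to not proposing entities the dataset already holds. The sketch below deduplicates on a normalised name, which is an assumption; real entity resolution would need more than string matching. `normalise` and `stage_new_entities` are hypothetical names.

```python
import re


def normalise(name: str) -> str:
    """Lower-case a name and collapse punctuation and whitespace for comparison."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def stage_new_entities(mentions: list, existing_names: list, staged: list) -> list:
    """Add entities seen during fetching to the staging list, skipping known ones.

    mentions is a list of (name, source_url) pairs spotted in crawled content.
    """
    known = {normalise(name) for name in existing_names}
    known |= {normalise(item["name"]) for item in staged}
    for name, source in mentions:
        key = normalise(name)
        if not key or key in known:
            continue
        known.add(key)
        staged.append({"name": name.strip(), "source": source, "status": "proposed"})
    return staged


staged = stage_new_entities(
    [("Acme Capital", "https://example.com/a"),
     ("Nova Orbital, Inc.", "https://example.com/b"),
     ("nova orbital inc", "https://example.com/c")],
    existing_names=["ACME  Capital"],
    staged=[],
)
print(staged)
```

Staged records carry a "proposed" status so they can be reviewed or enriched before joining the main dataset.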

04

Tech Stack

What You Need to Build This

Layer	Tool	Cost	Role
LLM	Gemma 4 (local)	Free	Schema inference, extraction, merge decisions
Crawling	Crawl4AI	Free / OSS	JS-rendered page scraping, returns clean markdown
Search	Serper.dev / Exa / DDG	Free tier	Live web queries per record, per field
Public Data	SEC EDGAR, OpenCorporates, arXiv, RSS	Free	Structured sources for specific domain types
Orchestration	LangGraph	OSS	Agent pipeline, state management, retry logic
Storage	Supabase	Free tier	Dataset storage, change logs, provenance tracking
Scheduler	Cron / Supabase Edge Functions	Free	Trigger daily refresh cycles per record tier
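The table mentions refresh cycles per record tier without defining the tiers. As one possible reading, the sketch below assumes three hypothetical tiers ("hot", "warm", "cold") with different refresh intervals and picks the records that are due when the daily job fires. The tier names, the intervals, and `due_for_refresh` are all assumptions.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical tiers: how long a record may go without a refresh.
REFRESH_INTERVALS = {
    "hot": timedelta(days=1),
    "warm": timedelta(days=7),
    "cold": timedelta(days=30),
}


def due_for_refresh(records: list, now: datetime) -> list:
    """Return the records whose tier interval has elapsed, or that were never refreshed."""
    due = []
    for record in records:
        last = record.get("last_refreshed")
        interval = REFRESH_INTERVALS[record["tier"]]
        if last is None or now - last >= interval:
            due.append(record)
    return due


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
records = [
    {"name": "A", "tier": "hot", "last_refreshed": now - timedelta(days=2)},
    {"name": "B", "tier": "warm", "last_refreshed": now - timedelta(days=2)},
    {"name": "C", "tier": "cold", "last_refreshed": None},
]
print([record["name"] for record in due_for_refresh(records, now)])
```

A cron entry or scheduled function would call this once a day and hand the result to the pipeline as that cycle's batch.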

05

The Opportunity

Domains Nobody Has Structured Yet

The value isn't in re-doing investor data. It's in every domain where no Crunchbase exists — vast, valuable, structurally scattered knowledge that the world hasn't indexed.

🚀

Space Tech

arXiv · NASA · SpaceNews

💊

Drug Discovery

PubMed · ClinicalTrials · FDA

🌍

Climate Tech

IRENA · Gov reports · Patents

⚗️

Semiconductor R&D

IEEE · Patent filings · Fab news

🏛️

Policy & Regulation

Gov gazettes · EUR-Lex · Congress

🧬

Biotech Research

PubMed · WHO · Lab pages

⚔️

Geopolitics

Reuters · UN · Think tanks

🏺

Archaeology

Journals · Museum databases

Each of these domains has the same problem: massive, real, valuable knowledge scattered across the open internet with no single structured, living source. That's the gap.

06

Why This Matters

The Dataset Is the Product. The Engine Is the Moat.

This Doesn't Exist Yet.

Crunchbase is a billion-dollar company built on keeping one domain alive with humans and scrapers. You're building the open, AI-native version of that infrastructure — for any domain, self-hosted, at near-zero marginal cost.

The longer the engine runs on a domain, the richer and more accurate that dataset becomes. That compounding knowledge is the moat. Nobody can replicate years of continuous enrichment overnight.

🏗️

Infrastructure Play

Sell the engine as self-hosted infrastructure. Any team with a dataset and a domain can run their own living database.

📦

Data Product Play

Pick one high-value unstructured domain. Run the engine for 6 months. Sell access to the living dataset itself.

🔌

API Layer Play

Expose living datasets as queryable APIs. Developers and analysts pay per query or per seat to access always-fresh structured knowledge.
