About TB Research
tbresearch.org is the open, machine-readable archive of the world's tuberculosis knowledge. We pull every published TB paper from the open biomedical literature, normalize it to Markdown, embed it, and serve hybrid keyword + semantic search with cited RAG answers. The full vision is in docs/02_prd.md.
Current corpus
- 154,661 documents · 638,246 embedded chunks · publication years 1944–2026
- Sources: PubMed (E-utilities), Europe PMC, ClinicalTrials.gov, OpenAlex citation graph
- Embeddings: nomic-embed-text (768d) via self-hosted Ollama
- Answer model: Claude Sonnet 4.6 with strict per-claim citation requirements (PRD §10.3)
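As a rough illustration of the embedding step, the sketch below builds a request for Ollama's `/api/embeddings` endpoint and checks that the returned vector has 768 dimensions. The URL is Ollama's default local address and is an assumption, since the project's actual host is not given here. The helper names `build_embedding_request` and `parse_embedding` are hypothetical.

```python
import json
import urllib.request

# Assumption: Ollama's default local address, not the project's real host.
OLLAMA_URL = "http://localhost:11434/api/embeddings"
EMBED_MODEL = "nomic-embed-text"
EMBED_DIM = 768


def build_embedding_request(text: str, url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request asking Ollama to embed one chunk."""
    payload = json.dumps({"model": EMBED_MODEL, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embedding(response_body: bytes) -> list:
    """Extract the vector from Ollama's JSON reply and check its dimensionality."""
    vector = json.loads(response_body)["embedding"]
    if len(vector) != EMBED_DIM:
        raise ValueError(f"expected {EMBED_DIM} dimensions, got {len(vector)}")
    return vector
```

To send the request, pass the result of `build_embedding_request(...)` to `urllib.request.urlopen` and hand the response bytes to `parse_embedding`.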
What you can do today
- Search the corpus with hybrid BM25 + cosine similarity, fused via reciprocal rank fusion. Filter by document type, TB subtype, year, country, or language.
- Ask TB Research a research question and get a synthesized answer with inline [tbrsch_id] citations to specific documents. The model is instructed to say “I don't know” rather than fabricate, and to refuse clinical decision support for specific patients.
- Browse a document at /doc/[tbrsch_id] with full metadata, related documents, and citation graph.
- Subscribe to a saved search via RSS for new-paper alerts.
- Use the public API at /v1/* for programmatic access: search, full document records, citation graph, and authenticated bulk export.
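The fusion step in hybrid search can be sketched as follows. This is a minimal reciprocal rank fusion over two ranked ID lists, not the site's actual hybridSearch code. The constant `k = 60` is the value commonly used for RRF and is an assumption here, and the document IDs are placeholders.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: each document scores sum(1 / (k + rank)), ranks starting at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical IDs standing in for tbrsch_id values.
bm25_hits = ["doc_a", "doc_b", "doc_c"]    # keyword (BM25) ranking
vector_hits = ["doc_b", "doc_d", "doc_a"]  # cosine-similarity ranking
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print([doc_id for doc_id, _ in fused])  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Because RRF uses only ranks, the BM25 and cosine scores never need to be put on a common scale. Documents that appear in both lists (`doc_a`, `doc_b`) rise above those found by one retriever alone.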
What we don't do yet
- Grey literature, NTP guidelines, conference abstracts, survivor narratives
- Translation pipeline for non-English documents
- Patient narrative consent & right-to-withdraw workflow
- Fine-tuned TB Research SLM (Phase 2)
- Clinician / patient / advocate UI surfaces (Phase 2 per PRD §11.2)
Architecture
[ PubMed · Europe PMC · ClinicalTrials.gov · OpenAlex ]
                           ↓
                [ ingest + normalize ]
                           ↓
          [ Postgres + pgvector + tsvector ]
                           ↓
                   [ hybridSearch ]
                           ↓
               ┌───────────┴───────────┐
               ↓                       ↓
        Search results       Ask TB Research (RAG)
                                       ↓
                               Claude Sonnet 4.6
                              + strict citations
Stack
- Next.js 16 on Vercel
- Supabase Postgres + pgvector + tsvector (IVFFlat index)
- Self-hosted Ollama on Fly.io for embeddings
- Anthropic Claude (Sonnet 4.6 answer, Haiku 4.5 rewrite/rerank)
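One way the strict per-claim citation requirement on the answer model could be checked after generation is sketched below. It makes several assumptions: IDs look like `tbrsch_` followed by letters or digits (the real format is not specified here), each citation sits before its sentence's closing punctuation, and the function name `check_citations` is hypothetical.

```python
import re

# Assumption: IDs look like "tbrsch_" plus letters/digits; the real format may differ.
CITATION_PATTERN = re.compile(r"\[(tbrsch_[A-Za-z0-9]+)\]")


def check_citations(answer, retrieved_ids):
    """Report sentences with no citation and citations to documents that were not retrieved."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited = [s for s in sentences if not CITATION_PATTERN.search(s)]
    cited_ids = CITATION_PATTERN.findall(answer)
    unknown = sorted({cid for cid in cited_ids if cid not in retrieved_ids})
    return {"uncited_sentences": uncited, "unknown_ids": unknown}


# Placeholder answer text and IDs, for illustration only.
report = check_citations(
    "Claim one [tbrsch_001]. Claim two [tbrsch_999]. Claim three.",
    {"tbrsch_001", "tbrsch_002"},
)
print(report)
```

An answer would pass only when both lists in the report are empty. A real check would also need to confirm that each cited document actually supports its claim, which this sketch does not attempt.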
Licensing
Code: Apache 2.0. Corpus metadata: CC0. Document content preserves its source license. Bulk exports surface license information via the X-TBRsch-License header so downstream consumers can honor it.
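A downstream consumer might honor the X-TBRsch-License header along these lines. This is a sketch under stated assumptions: the header's value format is not documented here, so the license strings are placeholders, and both function names are hypothetical.

```python
from typing import Mapping, Optional, Set

LICENSE_HEADER = "X-TBRsch-License"


def license_from_headers(headers: Mapping[str, str]) -> Optional[str]:
    """Return the license value from response headers, matching the name case-insensitively."""
    for name, value in headers.items():
        if name.lower() == LICENSE_HEADER.lower():
            return value.strip() or None
    return None


def may_redistribute(license_value: Optional[str], allowed: Set[str]) -> bool:
    """Conservative default: a missing or unrecognized license is not redistributable."""
    return license_value is not None and license_value in allowed


# Placeholder license identifiers; the real header values may be formatted differently.
allowed_licenses = {"CC0-1.0", "CC-BY-4.0"}
headers = {"x-tbrsch-license": " CC-BY-4.0 "}
print(may_redistribute(license_from_headers(headers), allowed_licenses))  # True
```

Treating a missing header as "do not redistribute" errs on the side of the source license, in keeping with the statement that document content preserves its source license.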