tinyweb/.gitignore at 6119ed3aeff9bdacdd8adbf7171ce3a2ffde2e0d - lichenblankie/tinyweb

lichenblankie/tinyweb

Derick Phan 395fc17092

Add hybrid semantic search with optional cross-encoder reranking

Implements a three-stage search pipeline:
1. BM25 keyword search via FTS5 with column weights
2. Semantic search via Snowflake arctic-embed-s bi-encoder + HNSW index
3. Optional cross-encoder reranking (on by default, toggleable in settings)

The top 20 fused results are reranked for precision; the next 10 are
appended in RRF order for coverage, giving 30 results across 3 pages.
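
The fusion and paging step described above can be sketched roughly as follows. This is a minimal illustration, not the code in embeddings.py: the function names `rrf_fuse` and `assemble_results`, the constant `k=60` (a common default for reciprocal rank fusion), and the `rerank_score` callback are all assumptions made for the example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank).

    `rankings` is a list of ranked lists of document ids, best first.
    k=60 is an assumed default, not a value taken from the commit.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


def assemble_results(fused, rerank_score=None, rerank_top=20, extra=10):
    """Rerank the head of the fused list and append the next slice unchanged.

    `rerank_score` stands in for a cross-encoder: a callable mapping a
    document id to a relevance score (higher is better). Passing None
    models the "reranking off" setting and keeps plain RRF order.
    """
    head = fused[:rerank_top]
    tail = fused[rerank_top:rerank_top + extra]
    if rerank_score is not None:
        head = sorted(head, key=rerank_score, reverse=True)
    return head + tail
```

With the assumed defaults, `assemble_results` returns at most 30 ids: up to 20 in reranked order followed by up to 10 in RRF order.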

- New embeddings.py with ONNX Runtime inference, text chunking, HNSW
  index management, RRF fusion, and cross-encoder reranking
- Meta description extraction for authentic page snippets with centroid
  extractive fallback
- Stopword filtering in FTS5 queries to avoid overly strict matching
- /reindex page for batch embedding of existing pages
- Semantic embedding of remote pages during subscription sync
- ~125MB dependency footprint (onnxruntime, tokenizers, hnswlib, numpy)
- Models: 34MB bi-encoder + 22MB cross-encoder (downloaded on first use)
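
As a rough illustration of the stopword filtering and column-weighted BM25 items above, the sketch below builds an FTS5 MATCH string and ranks with FTS5's built-in `bm25()` function. The table name `pages`, its columns (`title`, `description`, `body`), the weights, the stopword list, and the helper names are all hypothetical; the commit does not show the actual schema or list.

```python
import re
import sqlite3

# Small illustrative stopword list; the real list is not shown in the commit.
STOPWORDS = {"a", "an", "and", "be", "how", "in", "is", "not", "of", "or",
             "the", "to", "what"}


def build_fts_query(query):
    """Turn free text into an FTS5 MATCH string with stopwords removed.

    FTS5 treats space-separated terms as an implicit AND, so leaving
    stopwords in would require every one of them to appear in a page.
    If the query consists only of stopwords, keep them rather than
    searching for nothing.
    """
    tokens = re.findall(r"\w+", query.lower())
    kept = [t for t in tokens if t not in STOPWORDS] or tokens
    # Quote each term so it is treated as a literal string, not FTS5 syntax.
    return " ".join(f'"{t}"' for t in kept)


def keyword_search(conn: sqlite3.Connection, query, limit=30):
    """Rank matches with bm25(); lower (more negative) values are better.

    Assumes a hypothetical table created as:
      CREATE VIRTUAL TABLE pages USING fts5(title, description, body)
    The weights 10.0 / 5.0 / 1.0 favour title and description hits.
    """
    match = build_fts_query(query)
    if not match:
        return []
    sql = (
        "SELECT rowid FROM pages WHERE pages MATCH ? "
        "ORDER BY bm25(pages, 10.0, 5.0, 1.0) LIMIT ?"
    )
    return [row[0] for row in conn.execute(sql, (match, limit))]
```

Falling back to the unfiltered tokens is one possible design choice for all-stopword queries; the commit message does not say how that case is handled.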

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-27 03:24:41 -07:00

7 lines

84 B

Text


__pycache__/
tinyweb_identity
index.db
index.db-shm
index.db-wal
models/
index.hnsw