Commit graph

15 commits

Author SHA1 Message Date
lichenblankie
b112ee3660 added reticulum hash option to add page 2026-06-05 05:29:36 +00:00
lichenblankie
a1358c1f3d added manual URL entry 2026-06-05 05:29:35 +00:00
lichenblankie
9bc5abd32f made semantic search optional, use meta snippets
- Add semantic_search setting to toggle AI-powered search on/off
- Skip embedding generation, hybrid search, and model preloading when disabled
- Use site owner's meta description as snippet instead of heuristic extraction
- Remove _generate_summary() and snippet() - no more generated snippets
- Show reranker/reindex controls grayed out when semantic search is off
- AI dependencies (onnxruntime, hnswlib, etc.) are now fully optional
2026-06-05 05:29:35 +00:00
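A minimal sketch of how the optional-dependency gate described in this commit could look. The function name and its signature are hypothetical; only the semantic_search setting and the package names come from the commit message.

```python
import importlib.util

# Package names taken from the commit messages; the tuple itself is an assumption.
AI_MODULES = ("onnxruntime", "hnswlib", "tokenizers", "numpy")


def semantic_search_available(setting_enabled, modules=AI_MODULES):
    """Return True only if the setting is on and every AI module is importable."""
    if not setting_enabled:
        # Setting is off: skip embeddings, hybrid search, and model preloading.
        return False
    # find_spec() returns None for a missing top-level module without importing it.
    return all(importlib.util.find_spec(name) is not None for name in modules)
```

Checking with find_spec() avoids importing the heavy packages just to learn whether they are installed.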
lichenblankie
e72afbb22e improved snippet extraction (heuristic)
- Case-insensitive meta description extraction (fixes sites like Lemmy
  with capitalized "Description" meta name)
- Strip aside and noscript tags for cleaner body text
- Extract paragraph text separately for better sentence quality
- Prefer sentences mentioning the site name, then first quality
  paragraph, then title as fallback
- Skip meta descriptions under 20 chars (e.g. just "Lemmy")
- Remove embedding/centroid dependency from summary generation
2026-06-05 05:29:35 +00:00
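A rough sketch of the case-insensitive meta description rule with the 20-character minimum, using only the standard library. The class and function names are hypothetical, and the project's real extraction code may be structured differently.

```python
from html.parser import HTMLParser

MIN_DESCRIPTION_CHARS = 20  # threshold stated in the commit message


class MetaDescriptionParser(HTMLParser):
    """Records the first <meta name="description"> content, ignoring case."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.description is not None:
            return
        attr_map = {key.lower(): (value or "") for key, value in attrs}
        # Compare the name attribute case-insensitively ("Description" on Lemmy).
        if attr_map.get("name", "").strip().lower() == "description":
            self.description = attr_map.get("content", "").strip()


def extract_meta_description(html):
    """Return the meta description, or None if it is missing or too short."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    text = parser.description
    if text is None or len(text) < MIN_DESCRIPTION_CHARS:
        return None
    return text
```

A caller would fall back to paragraph text or the title when this returns None.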
lichenblankie
e8915fa381 stripped noscript tags from pages
Lemmy and other JS-heavy sites include noscript fallback text like
"Javascript is disabled" that pollutes the stored body text and
generated snippets/summaries.
2026-06-05 05:29:35 +00:00
lichenblankie
5ded9f1339 added hybrid semantic search with reranking
Implements a three-stage search pipeline:
1. BM25 keyword search via FTS5 with column weights
2. Semantic search via Snowflake arctic-embed-s bi-encoder + HNSW index
3. Optional cross-encoder reranking (on by default, toggleable in settings)

The top 20 fused results are reranked for precision and the next 10 are
appended in RRF order for coverage, giving 30 results across 3 pages.

- New embeddings.py with ONNX Runtime inference, text chunking, HNSW
  index management, RRF fusion, and cross-encoder reranking
- Meta description extraction for authentic page snippets with centroid
  extractive fallback
- Stopword filtering in FTS5 queries to avoid overly strict matching
- /reindex page for batch embedding of existing pages
- Semantic embedding of remote pages during subscription sync
- ~125MB dependency footprint (onnxruntime, tokenizers, hnswlib, numpy)
- Models: 34MB bi-encoder + 22MB cross-encoder (downloaded on first use)
2026-06-05 05:29:35 +00:00
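The fusion and assembly steps above can be sketched as follows. This is an illustration, not the code in embeddings.py: the function names are hypothetical, and k=60 is a commonly used RRF constant assumed here because the commit does not state one.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: each list adds 1 / (k + rank) to a document's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def assemble_results(fused_ids, rerank, rerank_top=20, tail=10):
    """Rerank the head of the fused list, then append the next `tail` ids as-is."""
    head = rerank(fused_ids[:rerank_top])
    return head + fused_ids[rerank_top:rerank_top + tail]


bm25_ids = [1, 2, 3]      # keyword ranking (made-up ids)
semantic_ids = [3, 1, 4]  # embedding ranking (made-up ids)
fused = rrf_fuse([bm25_ids, semantic_ids])
```

Documents that appear in both rankings accumulate two terms, so they rise above documents found by only one stage.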
lichenblankie
67084bbaed enabled WAL mode, pooling, pagination
WAL + pooling:
- Enable WAL journal mode for concurrent read/write support
- Add connection pool (size 4) with return_db() to reuse connections
  instead of opening/closing on every request

Pagination:
- Search results, /pages, and /tags/<name> now paginate at 50 per page
- Prev/next navigation links appear when results exceed one page

Delta sync:
- Pages table gains last_modified timestamp, set on insert/update
- /api/sites accepts ?since= param to return only changed pages
- Subscription sync uses last_sync timestamp for incremental fetches
- Remote pages upserted instead of delete-all/re-insert
- Full sync includes all_urls list for detecting remote deletions
2026-06-05 05:29:35 +00:00
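A minimal sketch of the WAL plus pooling idea, assuming a single database file and a queue-based pool. get_db() and return_db() are named in the commits, but their signatures and bodies here are guesses, and demo() exists only for illustration.

```python
import os
import queue
import sqlite3
import tempfile

POOL_SIZE = 4  # pool size from the commit message
_pool = queue.Queue(maxsize=POOL_SIZE)


def _connect(path):
    conn = sqlite3.connect(path, check_same_thread=False)
    # WAL lets readers proceed while a writer is active.
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def get_db(path):
    """Reuse a pooled connection if one is idle, otherwise open a new one."""
    try:
        return _pool.get_nowait()
    except queue.Empty:
        return _connect(path)


def return_db(conn):
    """Hand a connection back; close it if the pool is already full."""
    try:
        _pool.put_nowait(conn)
    except queue.Full:
        conn.close()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        conn = get_db(path)
        mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
        return_db(conn)
        reused = get_db(path) is conn  # the pooled connection comes back
        conn.close()
        return mode, reused
```

WAL needs a file-backed database; an in-memory database reports "memory" as its journal mode instead.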
lichenblankie
b574c4b7f5 normalized URLs to prevent dupes
clean_url() now canonicalizes: http→https, strips www., removes
trailing slashes, drops default ports, and sorts query params.
Prevents the same page from being indexed multiple times under
different URL variations.
2026-06-05 05:29:35 +00:00
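The canonicalization rules listed above can be sketched with urllib.parse. This is not the project's clean_url(); the function name is hypothetical, and dropping userinfo is a simplification of this sketch.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = (None, 80, 443)


def canonicalize_url(url):
    """Apply the normalization steps named in the commit message."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # .hostname is already lowercased
    if host.startswith("www."):
        host = host[len("www."):]
    # Keep the port only when it is not a default one.
    netloc = host if parts.port in DEFAULT_PORTS else f"{host}:{parts.port}"
    path = parts.path.rstrip("/")
    # Sort query parameters so ordering differences do not create duplicates.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit(("https", netloc, path, query, parts.fragment))
```

With this sketch, http://www.example.com/ and https://example.com both map to https://example.com.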
lichenblankie
6d649616ca fixed index_url page_id mismatch
lastrowid does not point at the existing row when ON CONFLICT DO UPDATE
fires (it was returning 0), causing links to not be cleaned up or
associated correctly on re-index. Now fetches the actual row ID with a
SELECT after upsert. Also adds try/finally for connection safety.
2026-06-05 05:29:35 +00:00
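A small sketch of the fix pattern: upsert, then look the row up by its unique key instead of trusting lastrowid. The table layout and function names are assumptions made for the example, and the upsert syntax needs SQLite 3.24 or newer.

```python
import sqlite3


def upsert_page(conn, url, title):
    """Insert or update a page and return its real row ID."""
    conn.execute(
        "INSERT INTO pages (url, title) VALUES (?, ?) "
        "ON CONFLICT(url) DO UPDATE SET title = excluded.title",
        (url, title),
    )
    # lastrowid is not reliable when the UPDATE branch runs, so query the ID.
    row = conn.execute("SELECT id FROM pages WHERE url = ?", (url,)).fetchone()
    return row[0]


def demo():
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT UNIQUE, title TEXT)"
        )
        first = upsert_page(conn, "https://example.com", "Old title")
        second = upsert_page(conn, "https://example.com", "New title")
        title = conn.execute(
            "SELECT title FROM pages WHERE id = ?", (first,)
        ).fetchone()[0]
        return first, second, title
    finally:
        conn.close()  # mirrors the try/finally mentioned in the commit
```

Both calls return the same ID, so link cleanup on re-index targets the right page.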
lichenblankie
449174b0ca fixed SSRF bypass, tightened error handling
- SSRF: disable automatic redirects, manually follow up to 5 hops with
  IP re-validation at each step to prevent redirect-to-localhost bypass
- Identity file: enforce 0600 permissions on tinyweb_identity at load
  and creation to prevent other users from reading the private key
- Error messages: replace raw exception strings with generic messages
  to avoid leaking internal paths/hostnames to the UI
- DB connections: wrap all get_db() usage in try/finally to guarantee
  close() even when handlers raise mid-operation
2026-06-05 05:29:35 +00:00
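A sketch of manual redirect following with a safety check at every hop. To stay self-contained, the HTTP call and the IP check are passed in as callables; fetch, is_allowed, and UnsafeRedirect are hypothetical names, not the project's API.

```python
from urllib.parse import urljoin

MAX_REDIRECTS = 5  # hop limit from the commit message
REDIRECT_CODES = (301, 302, 303, 307, 308)


class UnsafeRedirect(Exception):
    """Raised when a hop is blocked or the hop limit is exceeded."""


def fetch_following_redirects(url, fetch, is_allowed, max_hops=MAX_REDIRECTS):
    """fetch(url) -> (status, headers, body); is_allowed(url) -> bool."""
    for _ in range(max_hops + 1):  # first request plus up to max_hops redirects
        # Re-validate every destination, not just the first one.
        if not is_allowed(url):
            raise UnsafeRedirect(f"blocked destination: {url}")
        status, headers, body = fetch(url)
        location = headers.get("Location")
        if status not in REDIRECT_CODES or not location:
            return url, status, body
        url = urljoin(url, location)  # resolve relative Location values
    raise UnsafeRedirect("too many redirects")
```

In a real client, fetch would issue the request with automatic redirects disabled, and is_allowed would resolve the host and reject private addresses.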
lichenblankie
4899819597 added bookmark auth, CSP, per-session CSRF
- Bookmark endpoint now requires a secret token (stored in settings)
- Style reset moved from GET to POST with CSRF protection
- Open redirect prevention in _redirect() helper
- Import capped at 100 URLs to prevent abuse
- page_tags cleaned up on delete + PRAGMA foreign_keys enabled
- CSP, X-Frame-Options, X-Content-Type-Options on all responses
- CSRF tokens now per-session via double-submit cookie pattern
- Tag names URL-decoded for special characters
- Gateway forwards cookies in request data
2026-06-05 05:29:35 +00:00
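The double-submit cookie check can be sketched in a few lines. The function names are hypothetical, and how the project stores or rotates tokens is not shown in the commit.

```python
import hmac
import secrets


def new_csrf_token():
    """Generate a per-session token to set as a cookie and embed in forms."""
    return secrets.token_urlsafe(32)


def csrf_ok(cookie_token, form_token):
    """Accept a POST only when the cookie and the form field carry the same token."""
    if not cookie_token or not form_token:
        return False
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(cookie_token.encode(), form_token.encode())
```

The pattern works because another origin can make the browser send the cookie but cannot read its value to copy it into the form field.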
lichenblankie
0981c2e0a9 hardened CSRF, SSRF, FTS5
- CSRF: Generate a random token at startup, include it as a hidden field
  in all 11 POST forms, and validate it at the top of POST dispatch
  (returns 403 on failure)
- SSRF: Block private/internal IP ranges (127/8, 10/8, 172.16/12,
  192.168/16, 169.254/16, ::1, fc00::/7) by resolving hostname before
  fetch. Remove verify=False from requests.get().
- DELETE: Change /delete/<id> from GET (instant delete) to GET
  (confirmation page) + POST (actual delete) to prevent accidental
  deletion from prefetchers/crawlers.
- FTS5: Wrap search input in double quotes to neutralize FTS5
  operators (AND, OR, NOT, *, column:). Add try/except fallback.
2026-06-05 05:29:35 +00:00
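A sketch of the resolve-then-check idea using the ranges listed in this commit. The function names are hypothetical, and the unwrapping of IPv4-mapped IPv6 addresses is an extra precaution added in this sketch, not something the commit mentions.

```python
import ipaddress
import socket

# Ranges copied from the commit message.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "127.0.0.0/8", "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
        "169.254.0.0/16", "::1/128", "fc00::/7",
    )
]


def is_blocked_ip(address):
    """Return True if the address falls inside a private or internal range."""
    ip = ipaddress.ip_address(address)
    # Assumption of this sketch: treat ::ffff:a.b.c.d as the embedded IPv4 address.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return any(ip in network for network in BLOCKED_NETWORKS)


def host_is_safe(hostname):
    """Resolve the hostname and require every resulting address to be public."""
    infos = socket.getaddrinfo(hostname, None)
    # Strip any IPv6 scope id such as "%eth0" before parsing.
    addresses = {info[4][0].split("%")[0] for info in infos}
    return bool(addresses) and not any(is_blocked_ip(a) for a in addresses)
```

Checking every resolved address matters because a hostname can have several records and only one of them needs to be internal.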
lichenblankie
acfa9f6d4f stripped tracking params, added tags
URLs are cleaned of tracking parameters (utm_*, fbclid, gclid, etc.)
before indexing. Tags can be added when saving or editing pages,
browsed at /tags, and are included in search results. Tags are shared
via /api/sites and preserved when syncing/importing from subscriptions.
2026-06-05 05:29:35 +00:00
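A minimal sketch of tracking-parameter removal. Only utm_*, fbclid, and gclid are named in the commit; the "etc." is left unfilled, and the function and constant names are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_EXACT = {"fbclid", "gclid"}  # names given in the commit message
TRACKING_PREFIXES = ("utm_",)


def strip_tracking(url):
    """Drop known tracking parameters and keep everything else in order."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_EXACT
        and not key.lower().startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

Filtering before indexing means two links that differ only in campaign parameters resolve to the same stored URL.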
lichenblankie
7ccaf93404 wired up mesh subscriptions + search
- Subscriptions now use Reticulum destination hashes instead of HTTP URLs
- All subscription syncing happens over encrypted RNS links (rns_client.py)
- Add remote_pages table for synced content from subscriptions
- Search results now include pages from synced subscriptions, grouped by source
- Remove HTTP dependency from subscription handlers
2026-06-05 05:29:35 +00:00
lichenblankie
4b4e7e8081 ported everything to Reticulum mesh
Replace HTTP server with Reticulum-native architecture. The server
now speaks only Reticulum, with a client-side gateway providing
browser access by translating HTTP to/from RNS requests.

- Extract db layer (db.py), templates (templates.py), handlers (handlers.py)
- app.py is now the RNS server with persistent identity and destination
- gateway.py bridges HTTP on localhost:8080 to RNS link requests
- Add rns dependency, add .gitignore
2026-06-05 05:29:35 +00:00