Strip noscript tags when parsing pages to remove JS-disabled messages

Lemmy and other JS-heavy sites include noscript fallback text like
"Javascript is disabled" that pollutes the stored body text and
generated snippets/summaries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Derick Phan 2026-03-27 14:18:54 -07:00
parent fd20454fa4
commit 570d876b8e

db.py

@@ -328,7 +328,7 @@ def fetch_page(url):
     if og_tag and og_tag.get("content"):
         meta_desc = og_tag["content"].strip()
-    for tag in soup(["script", "style", "nav", "footer", "header"]):
+    for tag in soup(["script", "style", "nav", "footer", "header", "noscript"]):
         tag.decompose()
     title = soup.title.string.strip() if soup.title and soup.title.string else url
     body = soup.get_text(separator=" ", strip=True)
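The effect of the change can be reproduced in isolation. The following is a minimal sketch, not the actual `fetch_page` code: the sample HTML, the `STRIPPED_TAGS` constant, and the `extract_body_text` helper are hypothetical names introduced for illustration, and it assumes Beautiful Soup with the built-in `html.parser` backend (the parser used in `db.py` is not shown in this diff).

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; the markup is invented for illustration only.
SAMPLE_HTML = """
<html>
  <head><title>Example post</title></head>
  <body>
    <noscript>Javascript is disabled. Actions will not work.</noscript>
    <nav>Home | Communities</nav>
    <main><p>Actual post content.</p></main>
    <footer>Powered by Example</footer>
  </body>
</html>
"""

# Tag list as it stands after this commit.
STRIPPED_TAGS = ["script", "style", "nav", "footer", "header", "noscript"]


def extract_body_text(html, stripped_tags=STRIPPED_TAGS):
    """Return the page text after removing the listed tags (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # soup(...) is shorthand for soup.find_all(...); decompose() removes
    # each matching tag and everything inside it from the tree.
    for tag in soup(stripped_tags):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)


if __name__ == "__main__":
    print(extract_body_text(SAMPLE_HTML))
```

Passing the pre-commit list (without `"noscript"`) as `stripped_tags` leaves the fallback message in the extracted text under this parser, which is the pollution the commit message describes.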