Content Extraction

The extension uses content scripts to extract page content before sending it to the LoomBrain API.

Extraction process

Type detection: URL patterns determine the content type (tweet, video, repo, or article).
DOM extraction: a content script runs on the page and extracts the relevant HTML.
Sanitization: the HTML is sanitized in an offscreen document to remove scripts and tracking elements.
Size check: if the sanitized content exceeds 5 MB, the extension falls back to URL-only capture.
API submission: the content is posted to the captures API along with its metadata.
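The type detection step can be sketched as a small pattern table. This is only an illustration: the regular expressions and the names `ContentType`, `TYPE_PATTERNS`, and `detectContentType` are assumptions, and the extension's real URL rules may differ.

```typescript
type ContentType = "tweet" | "video" | "repo" | "article";

// Hypothetical URL patterns, checked in order. Not the extension's actual rules.
const TYPE_PATTERNS: Array<[ContentType, RegExp]> = [
  ["tweet", /^https?:\/\/(www\.)?(twitter|x)\.com\/[^/]+\/status\/\d+/i],
  ["video", /^https?:\/\/((www\.|m\.)?youtube\.com\/watch\?|youtu\.be\/)/i],
  // Matches only the repository root, e.g. https://github.com/owner/repo
  ["repo", /^https?:\/\/github\.com\/[^/]+\/[^/?#]+\/?([?#].*)?$/i],
];

// Returns the first matching type, or "article" when nothing matches.
function detectContentType(url: string): ContentType {
  for (const [kind, pattern] of TYPE_PATTERNS) {
    if (pattern.test(url)) {
      return kind;
    }
  }
  return "article";
}
```

Keeping the patterns in an ordered table makes "article" the natural default: any URL that matches nothing falls through to it.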

Extraction by content type

Tweets

The content script targets the [data-testid="tweet"] element on Twitter/X. If the element isn’t found, the extension falls back to URL-only capture.

Videos (YouTube)

No HTML extraction. The URL is sent to the server, which fetches the transcript and video metadata directly from YouTube.

GitHub Repositories

The content script extracts the README element ([data-testid="readme"] or #readme article) along with the repository description and topics metadata. If the README isn’t found, the extension falls back to URL-only capture.
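Tweet and README extraction follow the same pattern: try the known selectors, and fall back to URL-only capture if none match. A minimal sketch of that pattern follows. The selectors are the ones quoted above, while `QueryRoot`, `ExtractionResult`, and `extractFirstMatch` are hypothetical names. Repository description and topics are left out because their selectors are not documented here.

```typescript
// Minimal structural shape that both Document and Element satisfy,
// so the sketch can be exercised without a browser.
interface QueryRoot {
  querySelector(selector: string): { outerHTML: string } | null;
}

type ExtractionResult =
  | { mode: "html"; url: string; html: string }
  | { mode: "url-only"; url: string };

// Selectors quoted from the text above.
const TWEET_SELECTORS = ['[data-testid="tweet"]'];
const README_SELECTORS = ['[data-testid="readme"]', "#readme article"];

// Returns the first matching element's HTML, or a URL-only result if none match.
function extractFirstMatch(
  root: QueryRoot,
  selectors: string[],
  url: string
): ExtractionResult {
  for (const selector of selectors) {
    const element = root.querySelector(selector);
    if (element) {
      return { mode: "html", url, html: element.outerHTML };
    }
  }
  return { mode: "url-only", url };
}
```

In a content script this might be called as `extractFirstMatch(document, README_SELECTORS, location.href)`, or with `TWEET_SELECTORS` on a tweet page.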

Articles (default)

Captures the full document.documentElement.outerHTML. This is the fallback for any URL that doesn’t match tweet, video, or repo patterns.

Sanitization

All extracted HTML passes through sanitization in a secure offscreen document context. This removes:

Script tags and event handlers
Common tracking and analytics elements
Other potentially dangerous content

The sanitized HTML is what gets sent to the API. The original page is never modified.
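As a rough illustration of this step, the sketch below parses the HTML into a detached document with the standard `DOMParser`, removes blocked elements, and strips inline event handlers. The tag and selector lists, and the names `sanitizeHtml` and `isEventHandlerAttribute`, are assumptions for illustration. The extension's actual sanitizer is not shown here and may use a dedicated library.

```typescript
// Tags removed outright. This list is an illustrative assumption.
const BLOCKED_TAGS = ["script", "noscript", "iframe", "object", "embed"];

// Hypothetical selectors for tracking elements; not the extension's real list.
const TRACKING_SELECTORS = ['img[width="1"][height="1"]', "[data-analytics]"];

// Inline event handlers are attributes such as onclick or onload.
function isEventHandlerAttribute(attributeName: string): boolean {
  return /^on[a-z]+$/i.test(attributeName);
}

// Works on a parsed copy, so the live page is never modified.
// Requires a DOM environment such as an offscreen document.
function sanitizeHtml(html: string): string {
  const parsed = new DOMParser().parseFromString(html, "text/html");
  const blocked = [...BLOCKED_TAGS, ...TRACKING_SELECTORS].join(",");
  parsed.querySelectorAll(blocked).forEach((element) => element.remove());
  parsed.querySelectorAll("*").forEach((element) => {
    for (const attributeName of element.getAttributeNames()) {
      if (isEventHandlerAttribute(attributeName)) {
        element.removeAttribute(attributeName);
      }
    }
  });
  return parsed.documentElement.outerHTML;
}
```

Because `DOMParser` is not available in an extension service worker, running this kind of code in an offscreen document is one plausible reason for the design described above.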

Size limits

Limit	Value
Max content size	5 MB
Extraction timeout	10 seconds

If either limit is exceeded, the extension sends only the URL. The server then fetches and processes the content independently.
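The fallback logic might look like the sketch below. It assumes that "5 MB" means 5 * 1024 * 1024 bytes measured as UTF-8, and `captureWithLimits`, `withTimeout`, and the `extract` callback are hypothetical names standing in for the extension's own extraction and sanitization code.

```typescript
// Limits from the table above. Treating "5 MB" as 5 * 1024 * 1024 bytes is an assumption.
const MAX_CONTENT_BYTES = 5 * 1024 * 1024;
const EXTRACTION_TIMEOUT_MS = 10000;

interface CapturePayload {
  url: string;
  html?: string; // omitted for URL-only captures
}

// UTF-8 byte length; this can exceed string.length for non-ASCII text.
function utf8ByteLength(text: string): number {
  return new TextEncoder().encode(text).length;
}

// Resolves to null if `work` has not settled within `timeoutMs`.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T | null> {
  return new Promise<T | null>((resolve, reject) => {
    const timer = setTimeout(() => resolve(null), timeoutMs);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); }
    );
  });
}

// `extract` is a hypothetical stand-in for extraction plus sanitization.
async function captureWithLimits(
  url: string,
  extract: () => Promise<string>,
  maxBytes: number = MAX_CONTENT_BYTES,
  timeoutMs: number = EXTRACTION_TIMEOUT_MS
): Promise<CapturePayload> {
  const html = await withTimeout(extract(), timeoutMs);
  if (html === null || utf8ByteLength(html) > maxBytes) {
    return { url }; // URL-only: the server fetches the content itself
  }
  return { url, html };
}
```

Measuring bytes rather than characters matters for pages with a lot of non-ASCII text, where the encoded size can be much larger than the string length.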

Injection detection

The API scans capture fields (title, why, selected_text, URL) for prompt injection attempts. If an attempt is detected, the capture is flagged for review but still processed.
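The flag-but-still-process behavior can be illustrated with a simple scanner. Everything in this sketch is hypothetical: the detection method used by the API is not documented here, and the patterns, `CaptureFields`, and `scanForInjection` are made up to show the shape of the result.

```typescript
interface CaptureFields {
  title?: string;
  why?: string;
  selected_text?: string;
  url: string;
}

// Hypothetical phrases for illustration only; the real detector is not documented.
const SUSPICIOUS_PATTERNS = [
  /ignore (all )?(previous|prior) instructions/i,
  /disregard (the )?system prompt/i,
];

function looksSuspicious(text: string): boolean {
  return SUSPICIOUS_PATTERNS.some((pattern) => pattern.test(text));
}

// Reports which fields matched. It never rejects the capture:
// the caller records the flag and continues processing.
function scanForInjection(fields: CaptureFields): { flagged: boolean; matchedFields: string[] } {
  const candidates: Array<[string, string | undefined]> = [
    ["title", fields.title],
    ["why", fields.why],
    ["selected_text", fields.selected_text],
    ["url", fields.url],
  ];
  const matchedFields: string[] = [];
  for (const [fieldName, value] of candidates) {
    if (value !== undefined && looksSuspicious(value)) {
      matchedFields.push(fieldName);
    }
  }
  return { flagged: matchedFields.length > 0, matchedFields };
}
```

Returning a flag instead of throwing mirrors the documented behavior: a flagged capture is marked for review, not dropped.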