Content Extraction
The extension uses content scripts to extract page content before sending it to the LoomBrain API.
Extraction process
- Type detection: URL patterns determine the content type (tweet, video, repo, or article).
- DOM extraction: A content script runs on the page and extracts the relevant HTML.
- Sanitization: The HTML is sanitized in an offscreen document to remove scripts and tracking elements.
- Size check: If the sanitized content exceeds 5 MB, the extension falls back to URL-only capture.
- API submission: The content is posted to the captures API along with its metadata.
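The type-detection step can be sketched as a pure function over the page URL. The hostnames and path rules below are assumptions made for illustration; the extension's actual patterns are not listed in this document, and `detectContentType` is a hypothetical name.

```typescript
type ContentType = "tweet" | "video" | "repo" | "article";

// Hypothetical URL rules; the extension's real patterns may differ.
function detectContentType(rawUrl: string): ContentType {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return "article"; // unparseable input falls through to the default type
  }
  // Normalize common subdomains such as www., m. and mobile.
  const host = url.hostname.replace(/^(www|m|mobile)\./, "");
  const segments = url.pathname.split("/").filter(Boolean);

  // Tweet permalinks look like /<user>/status/<id> on twitter.com or x.com.
  if (
    (host === "twitter.com" || host === "x.com") &&
    segments.length >= 3 &&
    segments[1] === "status"
  ) {
    return "tweet";
  }
  // YouTube watch pages and youtu.be short links.
  if (
    (host === "youtube.com" && url.pathname === "/watch" && url.searchParams.has("v")) ||
    (host === "youtu.be" && segments.length === 1)
  ) {
    return "video";
  }
  // Simplification: treat any two-segment github.com path as /<owner>/<repo>.
  if (host === "github.com" && segments.length === 2) {
    return "repo";
  }
  // Everything else is captured as an article.
  return "article";
}
```

Keeping detection free of DOM access means it can run before any content script is injected, which suits the video case where no HTML is extracted at all.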
Extraction by content type
Tweets
Targets the [data-testid="tweet"] element on Twitter/X. If the element isn’t found, the extension falls back to URL-only capture.
Videos (YouTube)
No HTML is extracted. The URL is sent to the server, which fetches the transcript and video metadata directly from YouTube.
GitHub Repositories
Extracts the README element ([data-testid="readme"] or #readme article) along with the repository description and topics metadata. If no README is found, the extension falls back to URL-only capture.
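The selector-then-fallback behavior for repositories might look like the following sketch. The two selectors come from the description above; the `Extraction` type and the `extractReadme` name are hypothetical, and extraction of the description and topics is omitted because their selectors are not documented here.

```typescript
// Minimal structural types so the sketch does not depend on DOM typings;
// a browser `document` should satisfy QueryRoot.
interface ElementLike {
  outerHTML: string;
}
interface QueryRoot {
  querySelector(selector: string): ElementLike | null;
}

// Selectors from the text above, tried in order.
const README_SELECTORS = ['[data-testid="readme"]', "#readme article"];

// Hypothetical result shape: either extracted HTML or a URL-only marker.
type Extraction = { mode: "html"; html: string } | { mode: "url-only" };

function extractReadme(root: QueryRoot): Extraction {
  for (const selector of README_SELECTORS) {
    const element = root.querySelector(selector);
    if (element) {
      return { mode: "html", html: element.outerHTML };
    }
  }
  // No README element matched: signal the URL-only fallback.
  return { mode: "url-only" };
}
```

In a content script this would be called as `extractReadme(document)`. The tweet case follows the same shape with a single selector.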
Articles (default)
Captures the full document.documentElement.outerHTML. This is the fallback for any URL that doesn’t match the tweet, video, or repo patterns.
Sanitization
All extracted HTML passes through sanitization in a secure offscreen document context. This removes:
- Script tags and event handlers
- Common tracking and analytics elements
- Other potentially dangerous content
The sanitized HTML is what gets sent to the API. The original page is never modified.
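As a rough illustration of the first rule in the list above, a sanitizer that walks a parsed DOM tree could rely on two predicates like these. The tag and attribute lists are assumptions, not the extension's actual rules, and tracking-element removal is not covered. An allowlist-based sanitizer is generally safer than a blocklist like this one.

```typescript
// Hypothetical blocklist; the extension's actual rules are not documented here.
const BLOCKED_TAGS = new Set(["script", "iframe", "object", "embed"]);

// Attributes whose value is interpreted as a URL (assumed list).
const URL_ATTRIBUTES = new Set(["href", "src", "action", "formaction"]);

// True if the whole element should be dropped from the tree.
function shouldRemoveElement(tagName: string): boolean {
  return BLOCKED_TAGS.has(tagName.toLowerCase());
}

// True if a single attribute should be stripped from an element that is kept.
function shouldRemoveAttribute(name: string, value: string): boolean {
  const attr = name.toLowerCase();
  // Inline event handlers: onclick, onload, onerror, and so on.
  if (attr.startsWith("on")) {
    return true;
  }
  if (URL_ATTRIBUTES.has(attr)) {
    // Drop whitespace and control characters before checking the scheme,
    // since browsers tolerate them inside "javascript:" URLs.
    const compact = value.replace(/[\u0000-\u0020]+/g, "").toLowerCase();
    return compact.startsWith("javascript:");
  }
  return false;
}
```

Because the predicates only read the tree, the walker can run against a parsed copy of the HTML, which is consistent with the original page never being modified.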
Size limits
| Limit | Value |
|---|---|
| Max content size | 5 MB |
| Extraction timeout | 10 seconds |
If either limit is exceeded, the extension sends only the URL. The server then fetches and processes the content independently.
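A minimal sketch of how both limits could lead to the same URL-only fallback. Reading "5 MB" as 5 × 1024 × 1024 bytes of UTF-8 is an assumption, and `buildPayload`, `withTimeout`, and the `CapturePayload` shape are hypothetical names rather than the extension's actual code.

```typescript
// Values from the table above; the exact byte interpretation is assumed.
const MAX_CONTENT_BYTES = 5 * 1024 * 1024;
const EXTRACTION_TIMEOUT_MS = 10_000;

// Hypothetical payload shape: html is omitted for URL-only captures.
interface CapturePayload {
  url: string;
  html?: string;
}

// Returns a URL-only payload when extraction produced nothing or the
// sanitized HTML exceeds the size limit.
function buildPayload(url: string, sanitizedHtml: string | null): CapturePayload {
  if (sanitizedHtml === null) {
    return { url };
  }
  // Measure encoded bytes, not string length, so multi-byte characters count fully.
  const byteLength = new TextEncoder().encode(sanitizedHtml).length;
  return byteLength > MAX_CONTENT_BYTES ? { url } : { url, html: sanitizedHtml };
}

// Resolves to null if the work fails or does not finish within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number = EXTRACTION_TIMEOUT_MS): Promise<T | null> {
  return new Promise((resolve) => {
    const timer = setTimeout(() => resolve(null), ms);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      () => {
        clearTimeout(timer);
        resolve(null);
      },
    );
  });
}
```

With these two helpers, a caller could write `buildPayload(url, await withTimeout(extractAndSanitize()))`, where `extractAndSanitize` stands in for the extraction and sanitization steps. A timeout and an oversized result then take the same path.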
Injection detection
The API scans capture fields (title, why, selected_text, URL) for prompt injection attempts. If an attempt is detected, the capture is flagged for review but still processed.
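The flag-but-still-process behavior can be illustrated with a small server-side sketch. The patterns below are purely illustrative and the `scanCapture` name is hypothetical; the API's real detection rules are not described in this document.

```typescript
// Field names follow the list above; `url` stands for the capture URL.
interface CaptureFields {
  title?: string;
  why?: string;
  selected_text?: string;
  url?: string;
}

// Purely illustrative patterns, not the API's actual rules.
const SUSPICIOUS_PATTERNS: RegExp[] = [
  /ignore (all )?(previous|prior) instructions/i,
  /disregard (the )?system prompt/i,
];

interface ScanResult {
  flagged: boolean;
  fields: string[]; // names of the fields that matched a pattern
}

function scanCapture(capture: CaptureFields): ScanResult {
  const entries: [string, string | undefined][] = [
    ["title", capture.title],
    ["why", capture.why],
    ["selected_text", capture.selected_text],
    ["url", capture.url],
  ];
  const fields: string[] = [];
  for (const [name, value] of entries) {
    if (value && SUSPICIOUS_PATTERNS.some((pattern) => pattern.test(value))) {
      fields.push(name);
    }
  }
  // The result only marks the capture for review; it does not reject it.
  return { flagged: fields.length > 0, fields };
}
```

The scan returns a result instead of throwing, so the caller can record the flag and continue processing the capture as usual.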