Markdown beats raw HTML
For training data you want the content, not the chrome. SupaCrawlX returns clean Markdown by default — headings, lists, and code blocks preserved, navigation and ads stripped. That alone removes most of the cleaning work people dread.
Step 1: Map the domain
Start by discovering every URL worth scraping:
from supacrawlx import Client
client = Client(api_key="your_api_key")
site = client.web.map(url="https://docs.example.com")
urls = site["urls"]  # up to 5,000 discovered links
Step 2: Scrape each page to Markdown
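Before scraping, it can help to trim the mapped URL list so the same page isn't fetched twice under different anchors and off-site links don't leak in. The following is a minimal sketch using only the standard library; `filter_urls` and the sample links are hypothetical names for illustration, not part of SupaCrawlX.

```python
from urllib.parse import urldefrag, urlsplit

def filter_urls(urls, allowed_host):
    """Drop fragments, non-HTTP links, off-host links, and duplicates, keeping order."""
    seen, kept = set(), []
    for raw in urls:
        # "#section" anchors point at the same page, so strip them first.
        url, _fragment = urldefrag(raw.strip())
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parts.hostname != allowed_host:
            continue  # skip links that leave the target site
        if url in seen:
            continue
        seen.add(url)
        kept.append(url)
    return kept

# Illustrative input; in the tutorial this would be the list returned by the map step.
sample = [
    "https://docs.example.com/intro",
    "https://docs.example.com/intro#install",
    "https://blog.example.com/news",
    "mailto:team@example.com",
    "https://docs.example.com/api",
]
filtered = filter_urls(sample, "docs.example.com")
print(filtered)
```

Passing the mapped `urls` list through a filter like this keeps the scrape loop from spending requests on pages that would be dropped as duplicates later anyway.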
def fetch(url):
    page = client.web.scrape(url=url)
    return {
        "url": url,
        "title": page.get("title"),
        "text": page["content"],  # clean Markdown
    }

docs = [fetch(u) for u in urls]
Step 3: Clean and deduplicate
Two cheap passes dramatically improve dataset quality:
- Drop pages under ~200 characters — they're usually redirects or stubs.
- Hash the normalized text and drop exact duplicates (boilerplate pages repeat constantly).
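How much normalization to apply before hashing is a judgment call. Stripping leading and trailing whitespace catches only byte-identical pages; if copies that differ in spacing or letter case should also collapse, a slightly stronger pass could run first. A minimal sketch, where `normalize` and `fingerprint` are hypothetical helper names and the choice to lowercase is an assumption that may be too aggressive for code-heavy pages:

```python
import hashlib
import re
import unicodedata

def normalize(text):
    """Canonicalize text so trivially different copies hash the same."""
    text = unicodedata.normalize("NFKC", text)  # unify compatibility forms such as non-breaking spaces
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace and newlines
    return text.strip().lower()

def fingerprint(text):
    """SHA-256 of the normalized text, used as a deduplication key."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

a = "Getting  Started\n\nInstall the SDK."
b = "getting started install the sdk.  "
print(fingerprint(a) == fingerprint(b))
```

The fingerprint is only a lookup key; the stored `text` should remain the original Markdown so formatting survives into the dataset.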
import hashlib

seen, clean = set(), []
for d in docs:
    text = d["text"].strip()
    if len(text) < 200:
        continue  # too short: likely a redirect or stub
    h = hashlib.sha256(text.encode()).hexdigest()
    if h in seen:
        continue  # exact duplicate of a page already kept
    seen.add(h)
    clean.append(d)
Step 4: Export to JSONL
import json

with open("dataset.jsonl", "w") as f:
    for d in clean:
        f.write(json.dumps({"text": d["text"]}) + "\n")
When to crawl instead
For large sites, skip the manual loop and use the asynchronous crawl endpoint. It traverses the site for you, returns a jobId, and can deliver results to a webhook when finished:
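If a webhook isn't available, polling is usually wrapped in a small helper with a delay and a timeout so it doesn't spin or hang forever. The sketch below is generic and makes assumptions the text above doesn't confirm: that the job record carries a `status` field, and that its terminal values are `"completed"` and `"failed"`. `wait_for_job` is a hypothetical name, and `fake_get` stands in for the real client call so the example runs offline.

```python
import time

def wait_for_job(get_job, job_id, interval=2.0, timeout=600.0, sleep=time.sleep):
    """Poll get_job(job_id) until it reports a terminal status or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        job = get_job(job_id)
        status = job.get("status")  # assumed field name, not confirmed by the source
        if status == "completed":   # assumed terminal value
            return job
        if status == "failed":      # assumed terminal value
            raise RuntimeError(f"crawl job {job_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"crawl job {job_id} still {status!r} after {timeout}s")
        sleep(interval)

# Stand-in for the real job lookup: two "running" responses, then completion.
_responses = iter([
    {"status": "running"},
    {"status": "running"},
    {"status": "completed", "pages": 3},
])

def fake_get(job_id):
    return next(_responses)

result = wait_for_job(fake_get, "job-123", interval=0, sleep=lambda seconds: None)
print(result["status"])
```

With a real client, the same helper would take `client.jobs.get` as `get_job` and the returned `jobId` as `job_id`, provided the response shape matches the assumptions above.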
job = client.web.crawl(url="https://docs.example.com", limit=500)
# poll job["jobId"] or receive a webhook on completion
result = client.jobs.get(job["jobId"])
Respect robots.txt and licensing. A clean dataset you're allowed to use beats a huge one you aren't.
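One way to act on the robots.txt part is to filter URLs through Python's standard `urllib.robotparser` before scraping. This is a sketch under stated assumptions: the user agent string `my-dataset-bot`, the helper name `allowed_urls`, and the inline robots.txt rules are all illustrative, and whether SupaCrawlX applies robots.txt on its own is not stated here.

```python
from urllib.robotparser import RobotFileParser

def allowed_urls(urls, robots_lines, user_agent="my-dataset-bot"):
    """Keep only the URLs that the given robots.txt rules permit for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules that were already fetched as text lines
    return [u for u in urls if parser.can_fetch(user_agent, u)]

# Illustrative rules, inlined so the example runs without network access.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

candidates = [
    "https://docs.example.com/guide",
    "https://docs.example.com/private/keys",
]
permitted = allowed_urls(candidates, robots_txt.splitlines())
print(permitted)
```

Against a live site, `RobotFileParser.set_url(...)` followed by `read()` fetches the rules instead of parsing an inline string. Licensing has no equivalent automated check and still needs a manual review of the site's terms.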