Building AI Training Datasets from Web Content with the Scraping API

A fine-tune is only as good as its corpus. The web is the richest corpus there is — if you can turn messy HTML into clean, structured text. Here is a repeatable workflow that does exactly that.

Markdown beats raw HTML

For training data you want the content, not the chrome. SupaCrawlX returns clean Markdown by default — headings, lists, and code blocks preserved, navigation and ads stripped. That alone removes most of the cleaning work people dread.

Step 1: Map the domain

Start by discovering every URL worth scraping:

from supacrawlx import Client

client = Client(api_key="your_api_key")

site = client.web.map(url="https://docs.example.com")
urls = site["urls"]  # up to 5,000 discovered links
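
A discovered URL list often contains fragment variants of the same page, off-site links, and non-HTTP schemes. The sketch below is one way to tidy the list with the standard library before scraping. The helper name `filter_urls` and the same-host rule are assumptions made for illustration; they are not part of the SupaCrawlX API.

```python
from urllib.parse import urldefrag, urlsplit


def filter_urls(urls, root):
    """Keep same-host http(s) URLs, drop #fragments, and remove duplicates."""
    host = urlsplit(root).netloc.lower()
    seen, kept = set(), []
    for raw in urls:
        # Strip the fragment so /page and /page#section count as one URL.
        url, _fragment = urldefrag(raw.strip())
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar links
        if parts.netloc.lower() != host:
            continue  # skip links that leave the mapped domain
        if url in seen:
            continue
        seen.add(url)
        kept.append(url)
    return kept
```

With the names from the step above, this could be applied as `urls = filter_urls(site["urls"], "https://docs.example.com")`.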

Step 2: Scrape each page to Markdown

def fetch(url):
    page = client.web.scrape(url=url)
    return {
        "url": url,
        "title": page.get("title"),
        "text": page["content"],  # clean Markdown
    }

docs = [fetch(u) for u in urls]
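
With thousands of URLs, a single failed request would abort the list comprehension above and lose everything fetched so far. A minimal sketch of a more forgiving loop follows. It assumes the client raises an exception on a failed request, which the source does not confirm, and the name `fetch_all` and its parameters are hypothetical.

```python
import time


def fetch_all(urls, fetch, retries=2, delay=1.0):
    """Call fetch(url) for each URL, retrying failures; return (docs, failed)."""
    docs, failed = [], []
    for url in urls:
        for attempt in range(retries + 1):
            try:
                docs.append(fetch(url))
                break
            except Exception:  # assumption: the client raises on failed requests
                if attempt == retries:
                    failed.append(url)  # give up on this URL, keep going
                else:
                    time.sleep(delay * (attempt + 1))  # simple linear backoff
    return docs, failed
```

Calling `docs, failed = fetch_all(urls, fetch)` would then leave the unreachable URLs in `failed` for a later retry or manual inspection.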

Step 3: Clean and deduplicate

Two cheap passes dramatically improve dataset quality:

Drop pages under ~200 characters — they're usually redirects or stubs.
Hash the normalized text and drop exact duplicates (boilerplate pages repeat constantly).

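import hashlib

seen, clean = set(), []
for d in docs:
    text = d["text"].strip()
    if len(text) < 200:  # too short to be a real page
        continue
    # Normalize before hashing: collapse whitespace runs so pages that
    # differ only in spacing produce the same digest.
    normalized = " ".join(text.split())
    h = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    if h in seen:
        continue
    seen.add(h)
    clean.append(d)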

Step 4: Export to JSONL

import json

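with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for d in clean:
        # ensure_ascii=False keeps non-ASCII text readable instead of \u-escaped
        f.write(json.dumps({"text": d["text"]}, ensure_ascii=False) + "\n")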
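
Before handing the file to a training job, it can be worth reading it back to confirm that every line parses. The following is a small sanity check under the assumption that each record is an object with a string `text` field, as in the export above. The function name `check_jsonl` is made up for this example.

```python
import json


def check_jsonl(path):
    """Return the record count, raising ValueError on any malformed line."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            # json.JSONDecodeError is a subclass of ValueError.
            record = json.loads(line)
            if not isinstance(record, dict) or not isinstance(record.get("text"), str):
                raise ValueError(
                    f"line {lineno}: expected an object with a string 'text' field"
                )
            count += 1
    return count
```

Running `check_jsonl("dataset.jsonl")` should return the same number as `len(clean)`.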

When to crawl instead

For large sites, skip the manual loop and use the asynchronous crawl endpoint. It traverses the site for you, returns a jobId, and can deliver results to a webhook when finished:

job = client.web.crawl(url="https://docs.example.com", limit=500)
# The crawl runs asynchronously: poll job["jobId"] until the job has finished,
# or receive a webhook on completion instead of polling.
result = client.jobs.get(job["jobId"])
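
Because the crawl is asynchronous, a single call to `jobs.get` right after submission will usually return an unfinished job. One way to poll is a generic helper like the sketch below. It takes the lookup function and a completion test as arguments, so it makes no claim about the shape of the job response. The name `wait_for_job` and its parameters are hypothetical.

```python
import time


def wait_for_job(get_job, job_id, is_done, interval=5.0, timeout=600.0):
    """Poll get_job(job_id) until is_done(result) is true or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = get_job(job_id)
        if is_done(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} not finished after {timeout} seconds")
        time.sleep(interval)
```

A call might look like `wait_for_job(client.jobs.get, job["jobId"], lambda r: r.get("status") == "completed")`. The `status` field and the `"completed"` value are assumptions here; check the API reference for the actual names.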

Respect robots.txt and licensing. A clean dataset you're allowed to use beats a huge one you aren't.
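
The robots.txt half of that advice can be checked in code. The sketch below uses Python's standard `urllib.robotparser` to filter a URL list against the text of a robots.txt file. The helper name `allowed_urls` is an assumption for illustration. It covers crawl rules only; licensing still needs a human read of the site's terms.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(urls, robots_txt, user_agent="*"):
    """Return the URLs that the given robots.txt text permits for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]
```

To load the rules from a live site instead of a string, `RobotFileParser` also offers `set_url(...)` followed by `read()`.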