DocsIntegrationsPlaygroundPricing
Get API Key
All Articles
GuideApr 28, 2026·10 min read

Building AI Training Datasets from Web Content with the Scraping API

A fine-tune is only as good as its corpus. The web is the richest corpus there is — if you can turn messy HTML into clean, structured text. Here is a repeatable workflow that does exactly that.

JR

James Rivera

Data Engineer · Published Apr 28, 2026

TL;DR

Use SupaCrawlX map to enumerate a domain, scrape each page to Markdown, dedupe and clean, then export to JSONL. Crawl handles the whole site asynchronously when you need scale.

Markdown beats raw HTML

For training data you want the content, not the chrome. SupaCrawlX returns clean Markdown by default — headings, lists, and code blocks preserved, navigation and ads stripped. That alone removes most of the cleaning work people dread.

Step 1: Map the domain

Start by discovering every URL worth scraping:

from supacrawlx import Client

client = Client(api_key="your_api_key")

site = client.web.map(url="https://docs.example.com")
urls = site["urls"]  # up to 5,000 discovered links

Step 2: Scrape each page to Markdown

def fetch(url):
    page = client.web.scrape(url=url)
    return {
        "url": url,
        "title": page.get("title"),
        "text": page["content"],  # clean Markdown
    }

docs = [fetch(u) for u in urls]

Step 3: Clean and deduplicate

Two cheap passes dramatically improve dataset quality:

  • Drop pages under ~200 characters — they're usually redirects or stubs.
  • Hash the normalized text and drop exact duplicates (boilerplate pages repeat constantly).
import hashlib

seen, clean = set(), []
for d in docs:
    text = d["text"].strip()
    if len(text) < 200:
        continue
    h = hashlib.sha256(text.encode()).hexdigest()
    if h in seen:
        continue
    seen.add(h)
    clean.append(d)

Step 4: Export to JSONL

import json

with open("dataset.jsonl", "w") as f:
    for d in clean:
        f.write(json.dumps({"text": d["text"]}) + "\n")

When to crawl instead

For large sites, skip the manual loop and use the asynchronous crawl endpoint. It traverses the site for you, returns a jobId, and can deliver results to a webhook when finished:

job = client.web.crawl(url="https://docs.example.com", limit=500)
# poll job["jobId"] or receive a webhook on completion
result = client.jobs.get(job["jobId"])
Respect robots.txt and licensing. A clean dataset you're allowed to use beats a huge one you aren't.

Ready to Build Something Extraordinary?

Start with 100 free requests. No credit card. No setup fee. Ship your first AI-powered feature today.

Start Building Free View Pricing