Web

Crawl Entire Websites Into Clean Structured Data

Child Links Only · Up to 100 Pages Default · Async Job-Based · Paginated Results

# pip install supacrawlx
from supacrawlx import Client
import time

client = Client("YOUR_API_KEY")

# Start crawl job
job = client.web.crawl(
    url="https://docs.example.com",
    limit=100   # optional: max pages to crawl (default 100)
)
print(f"Started crawl job: {job.job_id}")

# Poll for results
while True:
    result = client.web.get_crawl_results(job_id=job.job_id)
    if result.status == "completed":
        print(f"Total pages crawled: {len(result.pages)}")
        for page in result.pages:
            print(page.url, page.name)
            print(page.content[:100])
        break
    elif result.status in ("failed", "cancelled"):
        # Both are terminal states; polling further would loop forever
        print(f"Crawl {result.status}")
        break
    else:
        print(f"Status: {result.status}")  # scraping
        time.sleep(3)

# For large crawls, follow result.next until no cursor is returned
while result.next:
    result = client.web.get_crawl_results(job_id=job.job_id, cursor=result.next)
    for page in result.pages:
        print(page.url, page.name)

Features

What Makes This API Special

Child Links Only

The crawler follows only child links from the seed URL. Crawling https://example.com/blog will follow /blog/post-1 but not /about. To crawl an entire site, provide the root domain.
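
The scoping rule can be pictured with a small local check. This is a minimal sketch, assuming the crawler compares host and path prefix; the exact matching logic is not documented here, and is_child_link is a hypothetical helper, not part of the SDK.

```python
from urllib.parse import urlparse


def is_child_link(seed_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url sits under the seed URL's path on the same host.

    Assumption: "child link" means same host plus path-prefix match.
    """
    seed, cand = urlparse(seed_url), urlparse(candidate_url)
    if seed.netloc.lower() != cand.netloc.lower():
        return False
    # Add a trailing slash to both paths so "/blog" does not match "/blogger".
    seed_path = seed.path.rstrip("/") + "/"
    cand_path = cand.path.rstrip("/") + "/"
    return cand_path.startswith(seed_path)


if __name__ == "__main__":
    seed = "https://example.com/blog"
    for link in ("https://example.com/blog/post-1", "https://example.com/about"):
        print(link, is_child_link(seed, link))
```

With a root seed such as https://example.com, every same-host path passes the check, which matches the advice to provide the root domain when you want the whole site.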

Async Job with jobId

Submit the crawl and receive a jobId instantly. Call GET /v1/web/crawl/{jobId} to check status. Status values: scraping, completed, failed, or cancelled.

Paginated Results

Large crawls return results in batches. When more results exist, the response includes a next field. Pass its value as the cursor on the next call to the same endpoint to fetch the following batch.
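
The cursor loop can be wrapped in a generator so callers see one flat stream of pages. This is a sketch under assumptions: each batch exposes pages and next as in the sample at the top of this page, and iter_all_pages and fetch_batch are hypothetical names introduced for illustration.

```python
from typing import Any, Callable, Iterator, Optional


def iter_all_pages(fetch_batch: Callable[[Optional[str]], Any]) -> Iterator[Any]:
    """Yield every page across all batches by following the next cursor.

    fetch_batch(cursor) is a caller-supplied function (hypothetical) that
    returns an object with a pages list and an optional next cursor.
    """
    cursor: Optional[str] = None
    while True:
        batch = fetch_batch(cursor)
        yield from batch.pages
        # A missing or empty next field means there are no more batches.
        cursor = getattr(batch, "next", None)
        if not cursor:
            break
```

In practice fetch_batch would wrap client.web.get_crawl_results for a given job_id. Whether that method accepts cursor=None on the first call is an assumption, so you may need to omit the argument for the first request.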

Clean Markdown per Page

Every crawled page is returned as clean markdown content with nav, ads, and boilerplate stripped. Each page includes url, name, description, and content.
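
To make the page shape concrete, here is a minimal sketch that models the four documented fields and serializes pages as JSON Lines, a common format for text corpora. The CrawledPage class and to_jsonl helper are illustrative assumptions, not types shipped by the SDK.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable


@dataclass
class CrawledPage:
    """Local model of one crawled page (hypothetical; field names from the docs)."""
    url: str          # page URL
    name: str         # page title
    description: str  # meta description
    content: str      # clean markdown


def to_jsonl(pages: Iterable[CrawledPage]) -> str:
    """Serialize pages as JSON Lines: one JSON object per line."""
    # json.dumps escapes newlines inside content, so each record stays on one line.
    return "\n".join(json.dumps(asdict(p), ensure_ascii=False) for p in pages)


if __name__ == "__main__":
    page = CrawledPage(
        url="https://example.com/blog/post-1",
        name="Post 1",
        description="An example post",
        content="# Post 1\n\nHello.",
    )
    print(to_jsonl([page]))
```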

Limit Parameter

Control how many pages to crawl with the limit parameter. Defaults to 100. Set it lower to cap credit usage or scope the crawl to the most important pages.

Job Status Tracking

Poll GET /v1/web/crawl/{jobId} at any interval to get the current status. scraping means the job is still running; completed means every page has been fetched and the results are ready. failed and cancelled are also final, so stop polling when you see either.
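
A polling loop is easier to reuse when it stops on every final status and gives up after a deadline. The sketch below assumes the four status values listed on this page; wait_for_crawl and its parameters are hypothetical names, and the status lookup is passed in as a function so the helper does not depend on any particular client.

```python
import time
from typing import Callable

# Statuses after which the job will not change again (taken from this page).
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_crawl(
    fetch_status: Callable[[], str],
    interval: float = 3.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() until a terminal status is seen or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"crawl still '{status}' after {timeout} seconds")
        sleep(interval)
```

With the client from the sample above, a call could look like wait_for_crawl(lambda: client.web.get_crawl_results(job_id=job.job_id).status). The 3-second interval and 600-second timeout are arbitrary defaults, not documented recommendations.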

Use Cases

Platform-Specific Workflows

AI Training Data Collection

Crawl documentation sites, blogs, and knowledge bases to build high-quality text corpora for LLM fine-tuning or RAG pipelines.

LLM · Fine-Tuning · RAG

Documentation Indexing

Index your entire docs site into a vector database for semantic search or a chatbot that can answer questions about your product.
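
Before pages go into a vector database they are usually split into smaller chunks. The following is a rough sketch, assuming heading-based splitting with a character cap is good enough for your content; chunk_markdown is a hypothetical helper, and embedding and storage are left to whichever tools you use.

```python
import re
from typing import List


def chunk_markdown(content: str, max_chars: int = 1000) -> List[str]:
    """Split markdown into chunks at headings, then cap each chunk at max_chars."""
    # Split before every line that starts with a markdown heading ("# " to "###### ").
    # Caveat: lines beginning with "# " inside code blocks are treated as headings too.
    sections = re.split(r"(?m)^(?=#{1,6} )", content)
    chunks: List[str] = []
    for section in sections:
        section = section.strip()
        # Hard-split oversized sections so no chunk exceeds max_chars.
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks


if __name__ == "__main__":
    for chunk in chunk_markdown("# Intro\nWelcome.\n## Setup\nInstall it."):
        print(repr(chunk))
```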

Search · Chatbot · Docs

Competitive Site Audit

Crawl a competitor's website to analyze their content strategy, extract pricing information, or map their feature set.

Research · Analytics · Strategy

FAQs

Web Crawling API Questions

Which links does the crawler follow?

Only child links from the seed URL. If you crawl https://example.com/blog, the crawler will follow https://example.com/blog/post-1 but not https://example.com/about. To crawl an entire site, pass the root URL as the starting point.

What are the possible job status values?

scraping (job is running), completed (all pages fetched, results ready), failed (crawl encountered an error), cancelled (job was stopped manually).

How does pagination work in crawl results?

When a crawl has many pages, the status response includes a next field. Call GET /v1/web/crawl/{jobId} again with that cursor value to retrieve the next page of results. Keep going until next is absent.

What data does each crawled page include?

Each page object includes: url (page URL), content (clean markdown), name (page title), and description (meta description).

What does the limit parameter do?

It caps the number of pages the crawler will visit. The default is 100. Setting limit to 10 will stop after 10 pages regardless of how many links exist.

What does each crawl cost?

1 crawl request = 1 credit. Each page crawled = 1 additional credit. A 100-page crawl costs 101 credits total.
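
The pricing rule reduces to simple arithmetic. A minimal sketch, assuming credits are counted per page actually crawled; estimate_crawl_credits is a hypothetical helper, and how failed or cancelled jobs are billed is not covered here.

```python
def estimate_crawl_credits(pages_crawled: int) -> int:
    """Return total credits: 1 for the crawl request plus 1 per page crawled."""
    if pages_crawled < 0:
        raise ValueError("pages_crawled must be non-negative")
    return 1 + pages_crawled


if __name__ == "__main__":
    # With the default limit of 100, the most a single crawl can cost is:
    print(estimate_crawl_credits(100))  # 101
```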

Ready to Build Something Extraordinary?

Start with 100 free requests. No credit card. No setup fee. Ship your first AI-powered feature today.

