Crawl Entire Websites Into Clean, Structured Data
What Makes This API Special
Child Links Only
The crawler follows only child links of the seed URL. Crawling https://example.com/blog will follow /blog/post-1 but not /about. To crawl an entire site, use the site's root URL as the seed.
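The child-link rule above can be pictured as a simple path-prefix check. This is a minimal sketch of one plausible interpretation, not the crawler's actual matching logic: the function name is_child_link is hypothetical, and the assumption is that a child link shares the seed's scheme and host and has a path beneath the seed's path.

```python
from urllib.parse import urlsplit


def is_child_link(seed_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url is the seed or sits beneath its path.

    Assumption: "child link" means same scheme and host, with a path
    nested under the seed's path. The real crawler may differ.
    """
    seed = urlsplit(seed_url)
    cand = urlsplit(candidate_url)
    if (seed.scheme, seed.netloc) != (cand.scheme, cand.netloc):
        return False
    seed_path = seed.path.rstrip("/")
    cand_path = cand.path.rstrip("/")
    # Require a "/" boundary so "/blog" does not match "/blogroll".
    return cand_path == seed_path or cand.path.startswith(seed_path + "/")


print(is_child_link("https://example.com/blog", "https://example.com/blog/post-1"))
print(is_child_link("https://example.com/blog", "https://example.com/about"))
```

With a root seed such as https://example.com, the prefix is empty, so every path on that host counts as in scope.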
Async Job with jobId
Submit the crawl and receive a jobId instantly. Call GET /v1/web/crawl/{jobId} to check status. Status values: scraping, completed, failed, or cancelled.
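A polling loop for the job might look like the sketch below. The HTTP call is left to a caller-supplied fetch_status function, because the base URL and authentication scheme are not described here. The status field name in the response, the 5-second interval, and the poll cap are all assumptions for illustration.

```python
import time
from typing import Callable

# Statuses after which the job will not change again (from the list above).
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_crawl(
    fetch_status: Callable[[str], dict],
    job_id: str,
    interval_s: float = 5.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job leaves the "scraping" state.

    fetch_status is a hypothetical helper that performs
    GET /v1/web/crawl/{jobId} and returns the parsed JSON body.
    Assumption: the body exposes the state under a "status" key.
    """
    for _ in range(max_polls):
        response = fetch_status(job_id)
        if response.get("status") in TERMINAL_STATUSES:
            return response
        sleep(interval_s)
    raise TimeoutError(f"crawl {job_id} still running after {max_polls} polls")
```

Passing the sleep function as a parameter keeps the loop easy to test without real delays.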
Paginated Results
Large crawls return results in pages. When more results exist, the response includes a next field. Pass its value to the same endpoint to fetch the next batch, and repeat until next is no longer present.
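Following the next field can be sketched as a generator. The network call is delegated to a caller-supplied fetch_page function. Two details are assumptions here: that results arrive under a data key, and that the cursor is handed back as a second argument. How the cursor is actually sent to the endpoint is not specified in this text.

```python
from typing import Callable, Iterator, Optional


def iter_crawl_results(
    fetch_page: Callable[[str, Optional[str]], dict],
    job_id: str,
) -> Iterator[dict]:
    """Yield every crawled page, following "next" until it is absent.

    fetch_page(job_id, cursor) is a hypothetical helper that calls
    GET /v1/web/crawl/{jobId}, passing the cursor when one is given.
    Assumption: each response holds its results under a "data" key.
    """
    cursor: Optional[str] = None
    while True:
        response = fetch_page(job_id, cursor)
        yield from response.get("data", [])
        cursor = response.get("next")
        if not cursor:
            break
```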
Clean Markdown per Page
Every crawled page is returned as clean markdown with navigation, ads, and boilerplate stripped. Each page includes url, name (the page title), description (the meta description), and content (the markdown body).
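On the client side, the four fields could be modeled as a small typed record. The class name CrawledPage is hypothetical. Treating name, description, and content as possibly missing is a defensive assumption, not documented behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawledPage:
    url: str          # page URL
    name: str         # page title
    description: str  # meta description
    content: str      # clean markdown body

    @classmethod
    def from_dict(cls, raw: dict) -> "CrawledPage":
        # Assumption: only url is guaranteed; other fields fall back to "".
        return cls(
            url=raw["url"],
            name=raw.get("name") or "",
            description=raw.get("description") or "",
            content=raw.get("content") or "",
        )


page = CrawledPage.from_dict(
    {"url": "https://example.com/blog/post-1", "name": "Post 1", "content": "# Post 1"}
)
print(page.name, len(page.content))
```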
Limit Parameter
Control how many pages to crawl with the limit parameter. It defaults to 100. Set it lower to cap credit usage or to keep the crawl small.
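A request body carrying the limit might be assembled as in the sketch below. Only limit and its default of 100 come from the text above. The url field name and the helper build_crawl_request are assumptions for illustration, and the submit endpoint itself is not shown because it is not named here.

```python
import json

DEFAULT_LIMIT = 100  # default stated above


def build_crawl_request(seed_url: str, limit: int = DEFAULT_LIMIT) -> str:
    """Serialize a crawl request body as JSON.

    Assumption: the seed is sent under a "url" key. "limit" is the
    parameter described above.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return json.dumps({"url": seed_url, "limit": limit})


print(build_crawl_request("https://example.com/blog", limit=10))
```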
Job Status Tracking
Poll GET /v1/web/crawl/{jobId} at any interval to get live status. The scraping status means the job is actively running; completed means all pages have been fetched and the results are ready.
Platform-Specific Workflows
AI Training Data Collection
Crawl documentation sites, blogs, and knowledge bases to build high-quality text corpora for LLM fine-tuning or RAG pipelines.
Documentation Indexing
Index your entire docs site into a vector database for semantic search or a chatbot that can answer questions about your product.
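Before pages go into a vector database, their markdown is usually split into smaller chunks. The sketch below shows one simple way to do that by packing paragraphs together. The function name, the 1000-character target, and the chunk record shape are illustrative assumptions, and the embedding and storage steps are left out because they depend on the database you choose.

```python
def chunk_markdown(url: str, content: str, max_chars: int = 1000) -> list[dict]:
    """Pack paragraphs into chunks of roughly max_chars, tagged with the source URL.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    chunks: list[dict] = []
    current = ""
    for paragraph in content.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) > max_chars and current:
            # Adding this paragraph would overflow, so close the current chunk.
            chunks.append({"url": url, "text": current})
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append({"url": url, "text": current})
    return chunks
```

Keeping the url on every chunk lets a search result or chatbot answer link back to the page it came from.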
Competitive Site Audit
Crawl a competitor's website to analyze their content strategy, extract pricing information, or map their feature set.
Web Crawling API Questions
Which links does the crawler follow?
Only child links from the seed URL. If you crawl https://example.com/blog, the crawler will follow https://example.com/blog/post-1 but not https://example.com/about. To crawl an entire site, pass the root URL as the starting point.
What are the possible job status values?
There are four: scraping (the job is running), completed (all pages fetched and results ready), failed (the crawl encountered an error), and cancelled (the job was stopped manually).
How does pagination work in crawl results?
When a crawl produces more results than fit in one response, the status response includes a next field. Call GET /v1/web/crawl/{jobId} again with that cursor value to retrieve the next batch of results. Keep going until next is absent.
What data does each crawled page include?
Each page object includes: url (page URL), content (clean markdown), name (page title), and description (meta description).
What does the limit parameter do?
It caps the number of pages the crawler will visit. The default is 100. Setting limit to 10 will stop after 10 pages regardless of how many links exist.
What does each crawl cost?
Each crawl request costs 1 credit, and each page crawled costs 1 additional credit. A 100-page crawl therefore costs 101 credits in total.
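The pricing rule reduces to a one-line formula, shown here as a small sketch. The function names are hypothetical. The worst-case helper assumes a crawl never visits more pages than its limit, as the limit answer above describes.

```python
REQUEST_CREDITS = 1   # charged once per crawl request
CREDITS_PER_PAGE = 1  # charged for each page crawled


def crawl_cost(pages_crawled: int) -> int:
    """Credits consumed by one crawl that fetched pages_crawled pages."""
    if pages_crawled < 0:
        raise ValueError("pages_crawled cannot be negative")
    return REQUEST_CREDITS + CREDITS_PER_PAGE * pages_crawled


def worst_case_cost(limit: int = 100) -> int:
    """Upper bound on credits for a crawl capped at limit pages."""
    return crawl_cost(limit)


print(crawl_cost(100))      # 101
print(worst_case_cost(10))  # 11
```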