Single-page mode

Not every URL is a list. An article, a company profile, a one-off dashboard page — these are one record, not a repeating item. datahelm:scrap:generate normally requires a repeating pattern to detect (it errors with "Could not detect a repeating item list" on a page that has none); --single-page skips that requirement entirely:

```bash
php artisan datahelm:scrap:generate "https://example.com/about-us" \
  --single-page --robot-name=AboutUs
```

This treats the whole page as a single item: field detectors (title, price, image, description, …) run directly against <body> instead of a detected list-item sample, and the resulting blueprint sets the item selector to body and turns pagination off:

```json
{
  "item_selector": "body",
  "pagination": { "strategy": "none", "css": "" }
}
```

No engine changes are involved: CrawlEngine already treats an item_selector matching exactly one node as a one-item crawl, so --single-page is purely a generation-time shortcut. --search-filters still works alongside it — each filter URL becomes its own single-page item, useful for scraping the same kind of one-off page (e.g. a profile) across several known URLs.
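The "one matching node means one item" behaviour described above can be illustrated with a small sketch. This is not the package's CrawlEngine (which is PHP); it is a hypothetical Python stand-in that assumes well-formed markup and a selector that is a bare tag name, and the `crawl` function and its `title` / `text` fields are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified blueprint holding only the two keys shown in the JSON above.
BLUEPRINT = {"item_selector": "body", "pagination": {"strategy": "none", "css": ""}}

PAGE = """<html><head><title>About us</title></head>
<body><h1>About Example Corp</h1><p>We build things.</p></body></html>"""


def crawl(page_html, blueprint):
    """Return one dict per node matched by item_selector."""
    root = ET.fromstring(page_html)
    # Assumption: the selector is a bare tag name; a real engine would use a CSS engine.
    nodes = list(root.iter(blueprint["item_selector"]))
    items = []
    for node in nodes:
        heading = node.find(".//h1")
        items.append({
            "title": heading.text if heading is not None else None,
            # Join every non-empty text fragment inside the matched node.
            "text": " ".join(t.strip() for t in node.itertext() if t.strip()),
        })
    return items


items = crawl(PAGE, BLUEPRINT)
print(len(items), items[0]["title"])
```

The same loop handles list pages and single pages: with `item_selector` set to `body` it simply runs once, which is why no separate code path is needed in this sketch.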

--main-content — skip the site chrome

By default, single-page detection sees the whole <body> — including the nav bar, footer and sidebars, whose links and text can leak into the detected fields. --main-content (the equivalent of Firecrawl's onlyMainContent) scopes detection to the page's primary content region:

```bash
php artisan datahelm:scrap:generate "https://en.wikipedia.org/wiki/Web_scraping" \
  --single-page --main-content --robot-name=WikiArticle
```

The detector looks for <main> / [role=main] / #content-style containers and bakes the region in as the item selector — on Wikipedia this produces item_selector: "main#content". Two safety rules:

  • When no region is confidently found, it falls back to <body> and says so in the generation notes.
  • When the region's CSS selector isn't unique on the page (it would split one page into several items), it also falls back to <body>.
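The two safety rules can be sketched as a small selection function. This is a hedged illustration rather than the package's detector: the candidate list, its order, the note wording, and the `pick_item_selector` name are all assumptions, and the sketch returns a plain label such as `main` instead of a combined selector like `main#content`. It uses Python's standard library and expects well-formed markup.

```python
import xml.etree.ElementTree as ET

# Candidate regions in priority order: (selector label, ElementTree path).
# Assumption: this list is illustrative, not the package's actual candidate set.
CANDIDATES = [
    ("main", ".//main"),
    ("[role=main]", ".//*[@role='main']"),
    ("#content", ".//*[@id='content']"),
]


def pick_item_selector(page_html):
    """Return (item_selector, notes), falling back to body when unsure."""
    root = ET.fromstring(page_html)
    notes = []
    for label, path in CANDIDATES:
        matches = root.findall(path)
        if len(matches) == 1:
            # Exactly one region: safe to use as the item selector.
            return label, notes
        if len(matches) > 1:
            # Rule 2: a non-unique selector would split one page into several items.
            notes.append(f"{label} matched {len(matches)} nodes; falling back to body")
            return "body", notes
    # Rule 1: no region found at all.
    notes.append("no main-content region found; falling back to body")
    return "body", notes


ARTICLE = "<html><body><nav>Menu</nav><main><h1>Title</h1></main></body></html>"
print(pick_item_selector(ARTICLE))  # ('main', [])
```

Both fallbacks return `body` together with a note, mirroring the behaviour of reporting the fallback in the generation notes.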

Pairs well with Markdown output

Single-page mode is the natural companion of the markdown field type and output format: point --single-page --main-content at an article, set a markdown field for the content region, and export an LLM-ready document from one URL — the package's equivalent of Firecrawl's /scrape endpoint.

How other tools handle this

Like this package's engine, Scrapy has no dedicated single-page concept: you write a parse() callback that reads fields directly off the response and yields one item, instead of looping over a selector list (the same idea as item_selector: "body"). Firecrawl draws the line explicitly with two endpoints, /scrape (one URL → one result) and /crawl (many results); --single-page is this package's /scrape.


Released under the MIT License.