Scraper Templates
Status: complete
John
For some sites, we would like to scrape many pages that share the same layout, extracting information from all of them with the same selectors and configuration.
Currently, we'd have to either replicate the Scraper for each unique URL, update each copy's URL, and run all the Scrapers, or use a single Scraper and set its URL, run it, wait, update the URL, and repeat. This becomes particularly problematic if we want to scrape these pages regularly and the site later changes, or if we want to extract additional information from the pages, as we'd have to either update every Scraper or re-duplicate from a new "base" Scraper for every URL we want to scrape.
It would be helpful if we could create the Scraper once and then provide the specific page URL whenever we want to run it (via the UI and/or API).
Alternatively, if we could define the selectors/config in a "parent Scraper" and then define the URL in a "child Scraper", that would also work.
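For illustration, the per-URL workaround looks something like this. This is just a rough sketch: the base URL, auth header, and the PATCH endpoint for updating a Scraper's URL are placeholders I made up, not the actual API.

```python
import time
import requests

# Placeholder base URL and token; not the real API details.
API = "https://example.com/api/v1"
HEADERS = {"Authorization": "Bearer <token>"}

SCRAPER_ID = "123"
PAGES = [
    "https://site.example/products/1",
    "https://site.example/products/2",
]

for url in PAGES:
    # Point the single Scraper at the next page (assumed endpoint).
    requests.patch(
        f"{API}/scrapers/{SCRAPER_ID}",
        json={"url": url},
        headers=HEADERS,
    ).raise_for_status()

    # Run it, then wait so runs don't step on each other.
    requests.post(
        f"{API}/scrapers/{SCRAPER_ID}/run",
        headers=HEADERS,
    ).raise_for_status()
    time.sleep(30)  # crude stand-in for "wait for the run to finish"
```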
Cahyo marked this post as complete
Cahyo
Please check the latest changelog post for information related to this request. Thanks!
Cahyo marked this post as planned
Cahyo
I've been wanting to tackle this one. Currently, it's not flexible enough.
Maybe the URL should stay inside the Scraper but behave as a "default URL", used only when you don't provide a new one when scraping through the UI or API.
Sharing Extractors also sounds good.
I'll think about this tomorrow. Thanks, John!
John
Another (perhaps simpler) way to accomplish this would be to allow Extractors to be shared across Scrapers. In that scenario, though, we would ideally also want the ability to create a new Scraper via the API using an existing Extractor.
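Roughly what I have in mind (sketch only; the POST /scrapers endpoint and the extractor_id field are assumptions on my part, not the existing API):

```python
import requests

API = "https://example.com/api/v1"           # placeholder base URL
HEADERS = {"Authorization": "Bearer <token>"}

# Create a new Scraper that reuses an existing, shared Extractor
# instead of duplicating its selectors (endpoint and fields assumed).
resp = requests.post(
    f"{API}/scrapers",
    json={
        "name": "Products page 2",
        "url": "https://site.example/products/2",
        "extractor_id": "ext_abc123",  # hypothetical shared Extractor ID
    },
    headers=HEADERS,
)
resp.raise_for_status()
print(resp.json())
```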
Cahyo
John: I have been doing some tests regarding this feature, and this is what I think we could do:
- The Scraper's URL field stays the same, but it becomes a "default URL".
- We add a new action (button) to Scrapers: advanced/targeted scraping. By providing a single URL or an array of URLs, you can override the default and scrape one or more specified pages.
- The API endpoint /scrapers/{scraper}/run gains an optional parameter, urls (string or array), which does the same as the action described above (see the sketch below).
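For example, runs with and without the override could look like this. Only the /scrapers/{scraper}/run path and the urls parameter come from the proposal above; the base URL and auth header are placeholders.

```python
import requests

API = "https://example.com/api/v1"           # placeholder base URL
HEADERS = {"Authorization": "Bearer <token>"}
SCRAPER_ID = "123"

# Without "urls": the Scraper runs against its default URL.
requests.post(f"{API}/scrapers/{SCRAPER_ID}/run", headers=HEADERS)

# With "urls": override the default and scrape these pages instead,
# reusing the Scraper's existing selectors/config.
requests.post(
    f"{API}/scrapers/{SCRAPER_ID}/run",
    json={"urls": [
        "https://site.example/products/1",
        "https://site.example/products/2",
    ]},
    headers=HEADERS,
)
```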
Is this approach going to solve the issues you are currently facing?
Thanks!