Testing web scraper
Under maintenance

Pricing

$10.00 / 1,000 results

rezaczu/testing-web-scraper

Developed by

Zuzana Řezáčová

Maintained by Community

The scraper of Web

0.0 (0)

0

Monthly users: 2

Runs succeeded: >99%

Last modified: 10 days ago

Start URLs

startUrls array Required

A static list of URLs to scrape. To be able to add new URLs on the fly, enable the Use request queue option.

For details, see Start URLs in README.
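As a rough illustration, the input for this option might look like the object below. The `{ url }` item shape, the example URLs, and the `isValidUrl` helper are assumptions made for this sketch; the README is the authority on the exact format.

```javascript
// Hypothetical input fragment; the { url } item shape is an assumption.
const input = {
    startUrls: [
        { url: 'https://www.example.com/' },
        { url: 'https://www.example.com/about' },
    ],
    // Leave the request queue off when the list is static.
    useRequestQueue: false,
};

// Hypothetical helper: true when the value parses as an absolute URL.
function isValidUrl(value) {
    try {
        new URL(value);
        return true;
    } catch {
        return false;
    }
}

// Collect entries that would not parse, so they can be fixed before a run.
const invalid = input.startUrls.filter(({ url }) => !isValidUrl(url));
console.log(`Invalid start URLs: ${invalid.length}`);
```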

Use request queue

useRequestQueue boolean Optional

If enabled, the scraper will support adding new URLs to scrape on the fly, either using the Link selector and Pseudo-URLs options or by calling context.enqueueRequest() inside Page function. Use of the request queue has some overheads, so only enable this option if you need to add URLs dynamically.

Default value of this property is true
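A minimal sketch of adding a URL from inside Page function follows. `context.enqueueRequest()` is named in the description above, but the `{ url }` argument shape and the stand-in `fakeContext` are assumptions used only so the snippet runs outside the scraper.

```javascript
// Sketch of a Page function that adds a URL on the fly.
// The { url } argument shape is an assumption.
async function pageFunction(context) {
    await context.enqueueRequest({ url: 'https://www.example.com/next' });
    return { enqueued: true };
}

// Stand-in for the scraper's context, for local experimentation only.
const queued = [];
const fakeContext = {
    enqueueRequest: async (request) => {
        queued.push(request.url);
    },
};

const done = pageFunction(fakeContext).then((result) => {
    console.log(result, queued);
    return result;
});
```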

URL #fragments identify unique pages

keepUrlFragments boolean Optional

Indicates that URL fragments (e.g. http://example.com#fragment) should be included when checking whether a URL has already been visited or not. Typically, URL fragments are used for page navigation only and therefore they should be ignored, as they don't identify separate pages. However, some single-page websites use URL fragments to display different pages; in such a case, this option should be enabled.

Default value of this property is false
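The effect of this option can be pictured with the sketch below, which derives a "visited" key with and without the fragment. The `visitedKey` helper is hypothetical and is not the scraper's actual deduplication code.

```javascript
// Illustration only: how a "visited" key could be derived.
function visitedKey(url, keepUrlFragments) {
    const parsed = new URL(url);
    // Dropping the fragment makes URLs that differ only after "#" collide.
    if (!keepUrlFragments) parsed.hash = '';
    return parsed.href;
}

const products = 'http://example.com/#/products';
const contact = 'http://example.com/#/contact';

// Option disabled: both URLs count as the same page.
const sameWhenIgnored = visitedKey(products, false) === visitedKey(contact, false);
// Option enabled: they count as two separate pages.
const sameWhenKept = visitedKey(products, true) === visitedKey(contact, true);

console.log({ sameWhenIgnored, sameWhenKept });
```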

Link selector

linkSelector string Optional

A CSS selector saying which links on the page (<a> elements with href attribute) shall be followed and added to the request queue. This setting only applies if Use request queue is enabled. To filter the links added to the queue, use the Pseudo-URLs setting.

If Link selector is empty, the page links are ignored.

For details, see Link selector in README.

Pseudo-URLs

pseudoUrls array Optional

Specifies what kind of URLs found by Link selector should be added to the request queue. A pseudo-URL is a URL with regular expressions enclosed in [] brackets, e.g. http://www.example.com/[.*]. This setting only applies if the Use request queue option is enabled.

If Pseudo-URLs are omitted, the actor enqueues all links matched by the Link selector.

For details, see Pseudo-URLs in README.

Default value of this property is []
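To make the bracket syntax concrete, here is a simplified sketch that turns a pseudo-URL into a regular expression and uses it to filter links. Both helper functions are hypothetical, nested brackets are not handled, and the scraper's own matching may differ in details such as case sensitivity.

```javascript
// Escape characters that have a special meaning in regular expressions.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Simplified conversion: text outside [] is matched literally,
// text inside [] is used as a regular expression.
function pseudoUrlToRegExp(pseudoUrl) {
    let pattern = '';
    let rest = pseudoUrl;
    while (rest.length > 0) {
        const open = rest.indexOf('[');
        const close = rest.indexOf(']', open + 1);
        if (open === -1 || close === -1) {
            // No complete bracket pair left: the remainder is literal.
            pattern += escapeRegExp(rest);
            break;
        }
        pattern += escapeRegExp(rest.slice(0, open));
        pattern += `(${rest.slice(open + 1, close)})`;
        rest = rest.slice(close + 1);
    }
    return new RegExp(`^${pattern}$`);
}

const matcher = pseudoUrlToRegExp('http://www.example.com/[.*]');
const links = [
    'http://www.example.com/products/1',
    'http://blog.example.com/post',
];
const toEnqueue = links.filter((link) => matcher.test(link));
console.log(toEnqueue);
```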

Page function

pageFunction string Required

JavaScript (ES6) function that is executed in the context of every page loaded in the Chrome browser. Use it to scrape data from the page, perform actions or add new URLs to the request queue.

For details, see Page function in README.
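A minimal Page function might look like the sketch below. It assumes that the context exposes the current request as `context.request` and that the returned object becomes one result record; neither detail is stated on this page, so treat both as assumptions. The stand-in context exists only so the function can be tried outside the scraper.

```javascript
// Sketch of a Page function. In the scraper it runs inside the loaded page.
// context.request.url is an assumption not stated on this page.
async function pageFunction(context) {
    const { request } = context;
    // The returned object is assumed to be saved as one result record.
    return {
        url: request.url,
        scrapedAt: new Date().toISOString(),
    };
}

// Stand-in context for local experimentation only.
const fakeContext = { request: { url: 'https://www.example.com/' } };

const record = pageFunction(fakeContext).then((result) => {
    console.log(result);
    return result;
});
```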

Inject jQuery

injectJQuery boolean Optional

If enabled, the scraper will inject the jQuery library into every web page loaded, before Page function is invoked. Note that the jQuery object ($) will not be registered in the global namespace, in order to avoid conflicts with libraries used by the web page. It can only be accessed through context.jQuery in Page function.

Default value of this property is true
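Because `$` is not global, a Page function has to take jQuery from the context, roughly as sketched here. The `fakeJQuery` stand-in is hypothetical and implements only the single call used, so that the snippet runs outside a browser.

```javascript
async function pageFunction(context) {
    // jQuery is not registered globally, so take it from the context.
    const $ = context.jQuery;
    return { title: $('title').text() };
}

// Tiny stand-in for jQuery, just enough for this sketch to run locally.
const fakeJQuery = (selector) => ({
    text: () => (selector === 'title' ? 'Example Domain' : ''),
});

const scraped = pageFunction({ jQuery: fakeJQuery }).then((result) => {
    console.log(result);
    return result;
});
```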

Inject Underscore.js

injectUnderscore boolean Optional

If enabled, the scraper will inject the Underscore.js library into every web page loaded, before Page function is invoked. Note that the Underscore.js object (_) will not be registered in the global namespace, in order to avoid conflicts with libraries used by the web page. It can only be accessed through context.underscoreJs in Page function.

Default value of this property is false

Proxy configuration

proxyConfiguration object Optional

Specifies proxy servers that will be used by the scraper in order to hide its origin.

For details, see Proxy configuration in README.

Default value of this property is {}
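The fragments below sketch what a proxy setting could look like. The field names `useApifyProxy` and `proxyUrls`, and the placeholder proxy address, are assumptions that do not come from this page; the README describes the accepted shape.

```javascript
// Hypothetical input fragments; field names are assumptions.
const viaPlatformProxy = {
    proxyConfiguration: { useApifyProxy: true },
};

const viaCustomProxies = {
    proxyConfiguration: {
        useApifyProxy: false,
        // Placeholder address, not a real proxy server.
        proxyUrls: ['http://user:password@proxy.example.com:8000'],
    },
};

// The documented default: an empty object.
const defaultProxy = { proxyConfiguration: {} };

console.log(JSON.stringify(viaCustomProxies, null, 2));
```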

Initial cookies

initialCookies array Optional

A JSON array with cookies that will be set in every Chrome browser tab before the page is loaded, in the format accepted by Puppeteer's Page.setCookie() function. This option is useful for transferring a logged-in session from an external web browser. For details on how to do this, read this help article.

Default value of this property is []
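For orientation, a cookie array in the shape Puppeteer's Page.setCookie() accepts might look like this. The cookie name, value, and domain are placeholders, and the `incomplete` check is a hypothetical sanity test rather than anything the scraper does.

```javascript
// Hypothetical input fragment with placeholder cookie data.
const input = {
    initialCookies: [
        {
            name: 'sessionid',
            value: 'PLACEHOLDER_VALUE',
            domain: '.example.com',
            path: '/',
            httpOnly: true,
            secure: true,
        },
    ],
};

// Sanity check: every cookie needs at least a name and a value.
const incomplete = input.initialCookies.filter(
    (cookie) => typeof cookie.name !== 'string' || typeof cookie.value !== 'string',
);
console.log(`Incomplete cookies: ${incomplete.length}`);
```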

Use Chrome

useChrome boolean Optional

If enabled, the scraper will use a real Chrome browser instead of Chromium bundled with Puppeteer. This option may help bypass certain anti-scraping protections, but might make the scraper unstable. Use at your own risk 🙂

Default value of this property is false

Use stealth mode

useStealth boolean Optional

If enabled, the scraper will apply various browser emulation techniques to match a real user's browser as closely as possible, in order to bypass certain anti-scraping protections. This feature works best in conjunction with the Use Chrome option, but it also carries a risk of making the scraper unstable.

Default value of this property is false

Ignore SSL errors

ignoreSslErrors boolean Optional

If enabled, the scraper will ignore SSL/TLS certificate errors. Use at your own risk.

Default value of this property is false

Ignore CORS and CSP

ignoreCorsAndCsp boolean Optional

If enabled, the scraper will ignore Content Security Policy (CSP) and Cross-Origin Resource Sharing (CORS) settings of visited pages and requested domains. This enables you to freely use XHR/Fetch to make HTTP requests from Page function.

Default value of this property is false

Download media files

downloadMedia boolean Optional

If enabled, the scraper will download media such as images, fonts, videos and sound files, as usual. Disabling this option might speed up the scrape, but certain websites could stop working correctly.

Default value of this property is true

Download CSS files

downloadCss boolean Optional

If enabled, the scraper will download CSS files with stylesheets, as usual. Disabling this option may speed up the scrape, but certain websites could stop working correctly, and the live view will not look as cool.

Default value of this property is true

Max page retries

maxRequestRetries integer Optional

The maximum number of times the scraper will retry loading each web page after a page load error or an exception thrown by Page function.

If set to 0, the page will be considered failed right after the first error.

Default value of this property is 3
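The retry count can be read as "one initial attempt plus up to N retries". The sketch below illustrates that reading with a hypothetical `loadWithRetries` helper and a deliberately flaky loader; it is not the scraper's own retry code.

```javascript
// Illustration of the retry semantics, not the scraper's implementation.
async function loadWithRetries(loadPage, maxRequestRetries) {
    let lastError;
    // One initial attempt plus up to maxRequestRetries retries.
    for (let attempt = 0; attempt <= maxRequestRetries; attempt++) {
        try {
            return await loadPage(attempt);
        } catch (error) {
            lastError = error;
        }
    }
    // All attempts failed: the page is considered failed.
    throw lastError;
}

// A loader that fails twice and then succeeds.
let calls = 0;
const flakyLoader = async () => {
    calls += 1;
    if (calls < 3) throw new Error('page load failed');
    return 'ok';
};

const outcome = loadWithRetries(flakyLoader, 3).then((value) => {
    console.log(value, `after ${calls} attempts`);
    return value;
});
```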

Max pages per run

maxPagesPerCrawl integer Optional

The maximum number of pages that the scraper will load. The scraper will stop when this limit is reached. It's always a good idea to set this limit in order to prevent excess platform usage for misconfigured scrapers. Note that the actual number of pages loaded might be slightly higher than this value.

If set to 0, there is no limit.

Default value of this property is 0

Max result records

maxResultsPerCrawl integer Optional

The maximum number of records that will be saved to the resulting dataset. The scraper will stop when this limit is reached.

If set to 0, there is no limit.

Default value of this property is 0

Max crawling depth

maxCrawlingDepth integer Optional

Specifies how many links away from Start URLs the scraper will descend. This value is a safeguard against infinite crawling depths for misconfigured scrapers. Note that pages added using context.enqueueRequest() in Page function are not subject to the maximum depth constraint.

If set to 0, there is no limit.

Default value of this property is 0
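One way to picture "links away from Start URLs" is a breadth-first walk over a small in-memory link graph, as sketched below. The `crawlOrder` function and the sample graph are hypothetical, and the scraper's real traversal order may differ.

```javascript
// Illustration only: breadth-first traversal that stops descending
// once maxCrawlingDepth links from the start are reached (0 = no limit).
function crawlOrder(graph, start, maxCrawlingDepth) {
    const visited = new Set([start]);
    const queue = [{ url: start, depth: 0 }];
    const order = [];
    while (queue.length > 0) {
        const { url, depth } = queue.shift();
        order.push(url);
        // Do not follow links from pages that are already at the limit.
        if (maxCrawlingDepth !== 0 && depth >= maxCrawlingDepth) continue;
        for (const next of graph[url] || []) {
            if (!visited.has(next)) {
                visited.add(next);
                queue.push({ url: next, depth: depth + 1 });
            }
        }
    }
    return order;
}

// Each key links to the pages in its array.
const graph = { '/': ['/a'], '/a': ['/a/b'], '/a/b': ['/a/b/c'] };

const limited = crawlOrder(graph, '/', 1);   // start page and its direct links
const unlimited = crawlOrder(graph, '/', 0); // every reachable page
console.log(limited, unlimited);
```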

Max concurrency

maxConcurrency integer Optional

Specifies the maximum number of pages that can be processed by the scraper in parallel. The scraper automatically increases and decreases concurrency based on available system resources. This option enables you to set an upper limit, for example to reduce the load on a target website.

Default value of this property is 50

Page load timeout

pageLoadTimeoutSecs integer Optional

The maximum amount of time the scraper will wait for a web page to load, in seconds. If the web page does not load within this time, it is considered to have failed and will be retried (subject to Max page retries), as with other page load errors.

Default value of this property is 60

Page function timeout

pageFunctionTimeoutSecs integer Optional

The maximum amount of time the scraper will wait for Page function to execute, in seconds. It's a good idea to set this limit to ensure that unexpected behavior in Page function does not get the scraper stuck.

Default value of this property is 60

Navigation waits until

waitUntil array Optional

A JSON array with the names of page events to wait for before considering a web page fully loaded. The scraper will wait until all of the events are triggered in the web page before executing Page function. Available events are domcontentloaded, load, networkidle2 and networkidle0.

For details, see waitUntil option in Puppeteer's Page.goto() function documentation.

Default value of this property is ["networkidle2"]
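Since only four event names are listed, a value for this option can be checked before a run, for instance with the hypothetical `validateWaitUntil` helper sketched here. The helper is an illustration and not part of the scraper.

```javascript
// Event names listed in the option description.
const ALLOWED_EVENTS = ['domcontentloaded', 'load', 'networkidle2', 'networkidle0'];

// Hypothetical helper: rejects event names outside the documented list.
function validateWaitUntil(events) {
    const unknown = events.filter((event) => !ALLOWED_EVENTS.includes(event));
    if (unknown.length > 0) {
        throw new Error(`Unknown waitUntil events: ${unknown.join(', ')}`);
    }
    return events;
}

// Wait for both events before Page function runs.
const input = {
    waitUntil: validateWaitUntil(['domcontentloaded', 'networkidle2']),
};
console.log(input);
```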

Enable debug log

debugLog boolean Optional

If enabled, the actor log will include debug messages. Beware that this can be quite verbose. Use context.log.debug('message') to log your own debug messages from Page function.

Default value of this property is false
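The call named above could be used as in this sketch. The `createFakeContext` stand-in is hypothetical; it only mimics the idea that debug messages reach the log when the option is enabled.

```javascript
async function pageFunction(context) {
    // Shows up in the actor log only when debug logging is enabled.
    context.log.debug('Page function started');
    return {};
}

// Stand-in context whose logger honors a debugLog switch.
function createFakeContext(debugLog) {
    const lines = [];
    return {
        lines,
        log: {
            debug: (message) => {
                if (debugLog) lines.push(`DEBUG ${message}`);
            },
        },
    };
}

const verbose = createFakeContext(true);
const quiet = createFakeContext(false);
const finished = Promise.all([pageFunction(verbose), pageFunction(quiet)]).then(() => {
    console.log(verbose.lines, quiet.lines);
});
```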

Enable browser log

browserLog boolean Optional

If enabled, the actor log will include console messages produced by JavaScript executed by the web pages (e.g. using console.log()). Beware that this may result in the log being flooded by error messages, warnings and other messages of little value, especially with high concurrency.

Default value of this property is false

Custom data

customData object Optional

A custom JSON object that is passed to Page function as context.customData. This setting is useful when invoking the scraper via API, in order to pass some arbitrary parameters to your code.

Default value of this property is {}
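A sketch of passing parameters through this option follows. The `category` and `maxPrice` keys are made-up examples, and the stand-in context simply hands the object to the function the way the description says `context.customData` would.

```javascript
// Hypothetical input fragment; the keys are arbitrary examples.
const input = {
    customData: { category: 'books', maxPrice: 20 },
};

async function pageFunction(context) {
    // Read the caller-supplied parameters from the context.
    const { category, maxPrice } = context.customData;
    return { category, maxPrice };
}

// Stand-in context for local experimentation only.
const echoed = pageFunction({ customData: input.customData }).then((result) => {
    console.log(result);
    return result;
});
```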

Pricing

Pricing model

Pay per result 

This Actor is paid per result. You are not charged for Apify platform usage; you pay only a fixed price per 1,000 items in the Actor's output dataset.

Price per 1,000 items

$10.00
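For a quick estimate, the price above can be turned into a per-run cost as sketched below. The `estimateCostUsd` helper is hypothetical and assumes the charge scales linearly with the number of items, which this page does not spell out.

```javascript
// Price stated on this page.
const PRICE_PER_1000_ITEMS_USD = 10.0;

// Hypothetical estimate; assumes linear, pro-rata charging per item.
function estimateCostUsd(itemCount) {
    return (itemCount / 1000) * PRICE_PER_1000_ITEMS_USD;
}

console.log(estimateCostUsd(2500)); // 25
```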