chore: update @crawlee to v4 and its usage & set proxy before discoverValidSitemaps #216
Conversation
Hi @barjin, I've bumped this actor to crawlee@4.0.0-beta.25 to use the new sitemap httpClient. The @apify/scraper-tools typings break (RequestQueueV2/KeyValueStore types differ), so for now I've dropped the `implements` and added a few `as any` casts.
Is there a plan to update @apify/scraper-tools to v4 typings, so we can avoid these casts?
"@apify/scraper-tools": "^1.1.4",
"@crawlee/http": "^3.14.1",
"@crawlee/utils": "^3.15.4-beta.44",
"apify": "^3.2.6",
AFAIK you also need the v4 beta for the SDK, at least that was my feeling when I was testing it a few months ago (and that's why I even bothered with SDK v4 back then).
"crawlee": "^4.0.0-beta.25",
"impit": "^0.7.5"
},
"overrides": {
IMO you should use overrides to force crawlee v4 inside scraper tools (plus use SDK v4 as suggested above), otherwise you end up with two copies of crawlee. Hopefully things will be compatible; if not, we'd have to update scraper tools first.
Also, don't import the full crawlee package; keep using the @crawlee/* packages that you actually need. And bump impit while you're at it.
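As a sketch of the suggestion above, the root package.json could force a single crawlee v4 copy via npm overrides. The exact keys are assumptions here; scraper-tools may depend on individual @crawlee/* packages rather than the crawlee metapackage:

```json
{
    "overrides": {
        "@apify/scraper-tools": {
            "crawlee": "^4.0.0-beta.25"
        }
    }
}
```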
B4nan
left a comment
You still have crawlee v3 installed for some reason. Use `npm why` to see why.
actor-scraper/package-lock.json
Line 457 in e3f84e7
(I will leave the final review to @barjin, just wanted to share my 2 cents based on what I saw)
dataset!: Dataset;
pagesOutputted!: number;
proxyConfiguration?: ProxyConfiguration;
private sitemapHttpClient = new GotScrapingHttpClient();
I thought we wanted to use impit for the sitemap scraper?
"@crawlee/got-scraping-client": "^4.0.0-beta.25",
"@crawlee/http": "^4.0.0-beta.25",
"@crawlee/utils": "^4.0.0-beta.25",
Let's pin those versions, since every new beta can contain some breaking change.
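Pinning here just means dropping the caret so npm can't silently pull a newer beta, e.g.:

```json
{
    "@crawlee/got-scraping-client": "4.0.0-beta.25",
    "@crawlee/http": "4.0.0-beta.25",
    "@crawlee/utils": "4.0.0-beta.25"
}
```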
Note: bumping Crawlee in the root means all actors will need code updates (e.g. the v4 CrawlingContext, Response, and enqueueLinks options/types changed), otherwise builds will break.
Shall I proceed with a full v4 migration? If yes, should that be a separate PR?
Oh, I see, then it's all fine. We don't want to switch those just yet.
In Apify Console it fails with: https://console.apify.com/admin/users/ZscMwFR5H7eCtWtyh/actors/rGeTNESChDZ65EbYh/source Is there a way to force install (legacy-peer-deps) on Console, or could we downgrade apify back to ^3.2.6 for now?
It should install the same as you have in the lockfile; the problem is likely that you don't have a lockfile in the scraper, only in the project root here. You are in control of the Dockerfile, try using …
barjin
left a comment
Thank you @nikitachapovskii-dev !
The code is fine by me - as long as it actually fixes the problems mentioned in the issue. Here are some low-priority ideas, mostly regarding types:
const status =
    (response as any)?.status ?? (response as any)?.statusCode;
The new CrawlingContext.response is now compatible with the Response interface; the `any` here is IMO unnecessary.
private async _handleResult(
    request: Request,
    response?: IncomingMessage,
    response?: any,
response?: any,
response?: Response,
see above
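To illustrate the review suggestion, here is a minimal sketch of reading the status code without `any`. The `ResponseLike` interface below is a local stand-in, not the real crawlee v4 `Response` type, so treat both field names as assumptions:

```typescript
// Minimal stand-in for the v4 Response shape discussed above; the real
// interface comes from crawlee, so these fields are assumptions.
interface ResponseLike {
    status?: number;      // fetch-style Response
    statusCode?: number;  // legacy http.IncomingMessage
}

// Prefer the fetch-style `status`, falling back to the legacy `statusCode`.
function getStatus(response?: ResponseLike): number | undefined {
    return response?.status ?? response?.statusCode;
}
```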
import type { IncomingMessage } from 'node:http';
import { URL } from 'node:url';

import type { CrawlingContext } from '@crawlee/core';
@crawlee/core is now an undeclared dependency (it's not in package.json's dependencies). This type might exist in @crawlee/types, too.
(also, note to self - the fact that this is required in failedRequestHandler might be actually a bug, I'll investigate)
When running with proxyConfiguration enabled, runtime fails in HttpCrawler with: `Expected argument to be of type string but received type Object`. It looks like Crawlee v4 calls `proxyConfiguration.newProxyInfo({ request })`, while the current Apify beta still validates the first argument as `sessionId: string`. I could add a small compatibility wrapper in sitemap-scraper (around `Actor.createProxyConfiguration`) that maps object-first calls to the legacy `(sessionId, options)` signature. If that's not preferred, what alternative would you suggest for this branch?
The current version successfully runs in Apify Console. The exact fixes applied:
- ProxyConfiguration incompatibility
- Sitemap parsing path adjusted for stability in v4

cc @barjin
Regarding the …
barjin
left a comment
Thank you @nikitachapovskii-dev !
If this works as expected and helps with the issues from #214, it's a go from me 👍 I created the issues in the dependency repos, so we can clean up some of the patches here once we fix them there.
Cheers!
Switch sitemap-scraper to crawlee@4.0.0-beta.25 to use the new sitemap httpClient (GotScrapingHttpClient) for discovery/parsing.
Updated imports to crawlee v4, moved proxy init earlier, and adjusted response handling for the v4 Response.
Partially fixes #214 (1. and 2.)
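The "set proxy before discoverValidSitemaps" part of the change can be sketched as an ordering constraint. The function bodies below are local stubs standing in for the real `Actor.createProxyConfiguration` and sitemap discovery calls in sitemap-scraper:

```typescript
// Records the order of the two stubbed steps, to make the constraint visible.
const calls: string[] = [];

// Stub for Actor.createProxyConfiguration (assumption, not the real API).
async function createProxyConfiguration(): Promise<{ stub: true }> {
    calls.push('proxy');
    return { stub: true };
}

// Stub for sitemap discovery; in the real actor it can now route through the proxy.
async function discoverValidSitemaps(proxy: { stub: true }): Promise<void> {
    calls.push('sitemaps');
}

async function run(): Promise<string[]> {
    // Proxy init first, so discovery already has a proxy to use.
    const proxyConfiguration = await createProxyConfiguration();
    await discoverValidSitemaps(proxyConfiguration);
    return calls;
}
```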