chore: update @crawlee to v4 and its usage & set proxy before discoverValidSitemaps #216

Merged
nikitachapovskii-dev merged 9 commits into master from chore/update-crawlee-to-v4 on Feb 20, 2026

Conversation

@nikitachapovskii-dev
Contributor

Switches sitemap-scraper to crawlee@4.0.0-beta.25 to use the new sitemap httpClient (GotScrapingHttpClient) for discovery/parsing.
Updates imports to crawlee v4, moves proxy initialization earlier (before discoverValidSitemaps), and adjusts response handling for the v4 Response.

Partially fixes #214 (1. and 2.)
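
A rough sketch of the new ordering; discoverValidSitemaps stands in for this repo's own helper, and its call shape here is assumed, not taken from the diff:

import { Actor } from 'apify';
import { GotScrapingHttpClient } from '@crawlee/got-scraping-client';

// Hypothetical signature for this repo's discovery helper.
declare function discoverValidSitemaps(options: {
    urls: string[];
    httpClient: GotScrapingHttpClient;
    proxyUrl?: string;
}): Promise<string[]>;

export async function runDiscovery(startUrls: string[]) {
    // Proxy is initialized first, so discovery requests can already use it.
    const proxyConfiguration = await Actor.createProxyConfiguration();
    return discoverValidSitemaps({
        urls: startUrls,
        httpClient: new GotScrapingHttpClient(),
        proxyUrl: await proxyConfiguration?.newUrl(),
    });
}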

@nikitachapovskii-dev
Contributor Author

Hi @barjin, I've bumped this actor to crawlee@4.0.0-beta.25 to use the new sitemap httpClient.
@apify/scraper-tools typings are still v3, so

export class CrawlerSetup implements CrawlerSetupOptions {

breaks (the RequestQueueV2/KeyValueStore types differ).
For now I've dropped the implements and added a few as any casts.

Is there a plan to update @apify/scraper-tools to v4 typings?
If not, would you prefer we add a local v4 interface (or another approach), something like

type CrawlerSetupOptionsV4 = Omit<CrawlerSetupOptions, 'requestQueue' | 'keyValueStore'> & {
  requestQueue: RequestQueueV2;
  keyValueStore: KeyValueStore;
};

to avoid these casts?

"@apify/scraper-tools": "^1.1.4",
"@crawlee/http": "^3.14.1",
"@crawlee/utils": "^3.15.4-beta.44",
"apify": "^3.2.6",
Member

AFAIK you also need the v4 beta for the SDK; at least that was my feeling when I was testing it a few months ago (and that's why I even bothered with SDK v4 back then).

"crawlee": "^4.0.0-beta.25",
"impit": "^0.7.5"
},
"overrides": {
Member

IMO you should use overrides to force crawlee v4 inside scraper-tools (plus use SDK v4 as suggested above); otherwise you end up with two copies of crawlee. Hopefully things will be compatible; if not, we'd have to update scraper-tools first.

Also, don't import the full crawlee package; keep using the @crawlee/ packages that you actually need. And bump impit while you're at it.
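
A sketch of what that override could look like in package.json (scoping it to scraper-tools is one option; the version is the one used in this PR):

{
  "overrides": {
    "@apify/scraper-tools": {
      "crawlee": "^4.0.0-beta.25"
    }
  }
}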

@B4nan B4nan left a comment

You still have crawlee v3 installed for some reason. Use npm why to see why.

"node_modules/@crawlee/basic": {

(I will leave the final review to @barjin, just wanted to share my 2 cents based on what I saw)

dataset!: Dataset;
pagesOutputted!: number;
proxyConfiguration?: ProxyConfiguration;
private sitemapHttpClient = new GotScrapingHttpClient();
Member

I thought we wanted to use impit for the sitemap scraper?

Comment on lines 9 to 11
"@crawlee/got-scraping-client": "^4.0.0-beta.25",
"@crawlee/http": "^4.0.0-beta.25",
"@crawlee/utils": "^4.0.0-beta.25",
Member

Let's pin those versions, since every new beta can contain breaking changes.

@nikitachapovskii-dev
Contributor Author

nikitachapovskii-dev commented Feb 9, 2026

Note: bumping Crawlee in the root means all actors will need code updates (e.g. the v4 CrawlingContext, Response, and enqueueLinks options/types changed), otherwise builds will break.

You still have crawlee v3 installed for some reason. Use npm why to see why.

"node_modules/@crawlee/basic": {

Shall I proceed with a full v4 migration? If yes, should that be a separate PR?

@B4nan
Member

B4nan commented Feb 9, 2026

Oh, I see, then it's all fine. We don't want to switch those just yet.

@nikitachapovskii-dev
Contributor Author

In the Apify Console it fails with:
@apify/scraper-tools requires apify ^3.1.8, while we have apify 4.0.0-beta.12.
Locally it passes (because node_modules and the lockfile already exist?), but the server build breaks with ERESOLVE.

https://console.apify.com/admin/users/ZscMwFR5H7eCtWtyh/actors/rGeTNESChDZ65EbYh/source

Is there a way to force the install (legacy-peer-deps) on the Console, or could we downgrade apify back to ^3.2.6 for now?

@B4nan
Member

B4nan commented Feb 10, 2026

It should install the same versions you have in the lockfile; the problem is likely that you do not have a lockfile in the scraper itself, only in the project root.

You are in control of the Dockerfile, so try using npm i --force first. That should get you around the error, but it could potentially fail at runtime if there are some incompatibilities.
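
A sketch of that Dockerfile tweak; the base image and surrounding steps are illustrative, not taken from this repo:

FROM apify/actor-node:20

COPY package*.json ./

# --force skips the ERESOLVE peer conflict between @apify/scraper-tools
# (which wants apify ^3.1.8) and apify 4.0.0-beta.12; the build then passes,
# but genuine incompatibilities would only surface at runtime.
RUN npm install --force --omit=dev

COPY . ./

CMD npm start --silent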

@barjin barjin left a comment

Thank you @nikitachapovskii-dev !

The code is fine by me - as long as it actually fixes the problems mentioned in the issue. Here are some low-priority ideas, mostly regarding types:

Comment on lines +353 to +354
const status =
(response as any)?.status ?? (response as any)?.statusCode;
Member

The new CrawlingContext.response is now compatible with the Response interface; the any here is IMO unnecessary.
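
Assuming the v4 response type exposes status directly, the cast-free access would be simply:

const status = response?.status;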

private async _handleResult(
    request: Request,
-   response?: IncomingMessage,
+   response?: any,
Member

Suggested change
response?: any,
response?: Response,

see above

import type { IncomingMessage } from 'node:http';
import { URL } from 'node:url';

import type { CrawlingContext } from '@crawlee/core';
Member

@crawlee/core is now an undeclared dependency (it's not in package.json's .dependencies). This type might exist in @crawlee/types, too

(also, note to self - the fact that this is required in failedRequestHandler might be actually a bug, I'll investigate)


@nikitachapovskii-dev
Contributor Author

nikitachapovskii-dev commented Feb 20, 2026

When running with proxyConfiguration enabled, the run fails at runtime in HttpCrawler with:

Expected argument to be of type string but received type Object
(from ProxyConfiguration.newProxyInfo).

It looks like Crawlee v4 calls proxyConfiguration.newProxyInfo({ request }), while the current Apify beta still validates the first argument as sessionId: string.

I could add a small compatibility wrapper in sitemap-scraper (around Actor.createProxyConfiguration) that maps object-first calls to the legacy (sessionId, options) signature, along the lines of the sketch below. Would that be acceptable?
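
A minimal sketch of that wrapper, assuming the SDK's newProxyInfo/newUrl still accept the positional (sessionId, options) form; the helper name is illustrative:

import { Actor, type ProxyConfiguration, type ProxyConfigurationOptions } from 'apify';

export async function createCompatProxyConfiguration(
    options?: ProxyConfigurationOptions,
): Promise<ProxyConfiguration | undefined> {
    const proxyConfiguration = await Actor.createProxyConfiguration(options);
    if (!proxyConfiguration) return undefined;

    for (const method of ['newProxyInfo', 'newUrl'] as const) {
        const original = (proxyConfiguration[method] as Function).bind(proxyConfiguration);
        (proxyConfiguration as any)[method] = (first?: unknown, second?: unknown) => {
            if (first !== null && typeof first === 'object') {
                // Object-first call from Crawlee v4: split out sessionId and
                // pass the rest (e.g. { request }) as the legacy options arg.
                const { sessionId, ...opts } = first as { sessionId?: string };
                return original(sessionId, opts);
            }
            return original(first, second);
        };
    }
    return proxyConfiguration;
}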

If not preferred, what alternative would you suggest for this branch?
cc @barjin

@nikitachapovskii-dev
Contributor Author

nikitachapovskii-dev commented Feb 20, 2026

The current version successfully runs in the Apify Console. The exact fixes applied:

1. ProxyConfiguration incompatibility
Crawlee v4 calls proxyConfiguration.newProxyInfo({ request }), while this SDK beta still validates the first argument as a string session id, causing:
Expected argument to be of type string but received type Object.
Fix: added a small compatibility wrapper around Actor.createProxyConfiguration that adapts object-first calls to the legacy (sessionId, options) signature for newProxyInfo/newUrl.

2. ‼️ Crash when using skipNavigation on sitemap requests
With skipNavigation: true, the failed-request/reclaim path in Crawlee v4 crashes with:
The request.loadedUrl property is not available - skipNavigation was used.
Fix: for now, removed skipNavigation for sitemap requests, so crawler navigation is performed normally and loadedUrl is available. Please suggest if there is another solution available.

3. Sitemap parsing path adjusted for stability in v4
parseSitemap with type: 'url' can trigger an extra fetch, which led to unstable behavior in this setup.
Fix: use the crawler-fetched body and parse via type: 'raw', while keeping the Impit HTTP client wired consistently (see the sketch below).
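
A minimal sketch of that raw-parsing path, assuming parseSitemap from @crawlee/utils keeps the v3-style { type: 'raw', content } source shape in v4:

import { parseSitemap } from '@crawlee/utils';

// `body` is the sitemap XML the crawler itself already fetched (through the
// Impit HTTP client), so type: 'raw' avoids the parser's own fetch path.
async function extractSitemapUrls(body: string): Promise<string[]> {
    const urls: string[] = [];
    for await (const item of parseSitemap([{ type: 'raw', content: body }])) {
        urls.push(item.loc);
    }
    return urls;
}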

Result:
No intentional product-level behavior change; these are compatibility fixes required to make the v4 migration run without runtime crashes.

cc @barjin

@barjin
Member

barjin commented Feb 20, 2026

Regarding the ProxyConfiguration.newProxyInfo type, I'm fine with the patch here for now (so this PR is not blocked), but we should definitely align SDK v4 and Crawlee v4 types. I'll make an issue for this.

@barjin barjin left a comment

Thank you @nikitachapovskii-dev !

If this works as expected and helps with the issues from #214 , it's a go from me 👍 I created the issues in the dependency repos, so we can clean up some of the patches here once we fix them there.

Cheers!

@nikitachapovskii-dev nikitachapovskii-dev merged commit 5c98d8d into master Feb 20, 2026
4 checks passed

Development

Successfully merging this pull request may close these issues.

Sitemap Scraper issues