chore: update @crawlee to v4 and its usage & set proxy before discoverValidSitemaps #216
Conversation
Hi @barjin, I've bumped this actor to crawlee@4.0.0-beta.25 to use the new sitemap httpClient. The @apify/scraper-tools typings break (RequestQueueV2/KeyValueStore types differ), so for now I've dropped the `implements` and added a few `as any` casts.
Is there a plan to update @apify/scraper-tools to v4 typings, so we can avoid these casts?
"@apify/scraper-tools": "^1.1.4",
"@crawlee/http": "^3.14.1",
"@crawlee/utils": "^3.15.4-beta.44",
"apify": "^3.2.6",
AFAIK you also need the v4 beta for the SDK, at least that was my feeling when I was testing it a few months ago (and that's why I even bothered with SDK v4 back then).
"crawlee": "^4.0.0-beta.25",
"impit": "^0.7.5"
},
"overrides": {
IMO you should use overrides to force crawlee v4 inside scraper tools (plus use SDK v4 as suggested above), otherwise you end up with two copies of crawlee. Hopefully things will be compatible; if not, we'd have to update scraper tools first.
Also, don't import the full crawlee package; keep using the @crawlee/* packages that you actually need. And bump impit while you're at it.
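As a sketch of the suggestion above, the root package.json could force a single crawlee v4 copy via npm overrides. The exact keys are assumptions here; scraper-tools may depend on individual @crawlee/* packages rather than the crawlee metapackage:

```json
{
    "overrides": {
        "@apify/scraper-tools": {
            "crawlee": "^4.0.0-beta.25"
        }
    }
}
```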
B4nan
left a comment
You still have crawlee v3 installed for some reason. Use `npm why` to see why.
actor-scraper/package-lock.json
Line 457 in e3f84e7
(I will leave the final review to @barjin, just wanted to share my 2 cents based on what I saw)
dataset!: Dataset;
pagesOutputted!: number;
proxyConfiguration?: ProxyConfiguration;
private sitemapHttpClient = new GotScrapingHttpClient();
I thought we wanted to use impit for the sitemap scraper?
"@crawlee/got-scraping-client": "^4.0.0-beta.25",
"@crawlee/http": "^4.0.0-beta.25",
"@crawlee/utils": "^4.0.0-beta.25",
Let's pin those versions, since every new beta can contain some breaking change.
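Pinning here just means dropping the caret so npm can't silently pull a newer beta, e.g.:

```json
{
    "@crawlee/got-scraping-client": "4.0.0-beta.25",
    "@crawlee/http": "4.0.0-beta.25",
    "@crawlee/utils": "4.0.0-beta.25"
}
```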
Note: bumping Crawlee in the root means all actors will need code updates (e.g. the v4 CrawlingContext, Response, and enqueueLinks options/types changed), otherwise builds will break.
Shall I proceed with a full v4 migration? If yes, should that be a separate PR?
Oh, I see, then it's all fine. We don't want to switch those just yet.
In Apify Console it fails with: https://console.apify.com/admin/users/ZscMwFR5H7eCtWtyh/actors/rGeTNESChDZ65EbYh/source Is there a way to force install (legacy-peer-deps) on Console, or could we downgrade apify back to ^3.2.6 for now?
It should install the same as you have in the lockfile; the problem is likely that you don't have a lockfile in the scraper, only in the project root here. You are in control of the Dockerfile, try using …
barjin
left a comment
Thank you @nikitachapovskii-dev !
The code is fine by me - as long as it actually fixes the problems mentioned in the issue. Here are some low-priority ideas, mostly regarding types:
const status =
    (response as any)?.status ?? (response as any)?.statusCode;
The new CrawlingContext.response is now compatible with the Response interface; the `any` here is IMO unnecessary.
private async _handleResult(
    request: Request,
    response?: IncomingMessage,
    response?: any,
response?: any,
response?: Response,
see above
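To illustrate the review suggestion, here is a minimal sketch of reading the status code without `any`. The `ResponseLike` interface below is a local stand-in, not the real crawlee v4 `Response` type, so treat both field names as assumptions:

```typescript
// Minimal stand-in for the v4 Response shape discussed above; the real
// interface comes from crawlee, so these fields are assumptions.
interface ResponseLike {
    status?: number;      // fetch-style Response
    statusCode?: number;  // legacy http.IncomingMessage
}

// Prefer the fetch-style `status`, falling back to the legacy `statusCode`.
function getStatus(response?: ResponseLike): number | undefined {
    return response?.status ?? response?.statusCode;
}
```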
import type { IncomingMessage } from 'node:http';
import { URL } from 'node:url';

import type { CrawlingContext } from '@crawlee/core';
@crawlee/core is now an undeclared dependency (it's not in package.json's dependencies). This type might exist in @crawlee/types, too.
(also, note to self - the fact that this is required in failedRequestHandler might be actually a bug, I'll investigate)
When running with proxyConfiguration enabled, runtime fails in HttpCrawler with: `Expected argument to be of type string but received type Object`. It looks like Crawlee v4 calls `proxyConfiguration.newProxyInfo({ request })`, while the current Apify beta still validates the first argument as `sessionId: string`. I could add a small compatibility wrapper in sitemap-scraper (around `Actor.createProxyConfiguration`) that maps object-first calls to the legacy `(sessionId, options)` signature. If that's not preferred, what alternative would you suggest for this branch?
The current version successfully runs in Apify Console. The exact fixes applied:
- ProxyConfiguration incompatibility
- Sitemap parsing path adjusted for stability in v4

cc @barjin
Regarding the …
barjin
left a comment
Thank you @nikitachapovskii-dev !
If this works as expected and helps with the issues from #214, it's a go from me 👍 I created the issues in the dependency repos, so we can clean up some of the patches here once we fix them there.
Cheers!
Switch sitemap-scraper to crawlee@4.0.0-beta.25 to use the new sitemap httpClient (GotScrapingHttpClient) for discovery/parsing.
Updated imports to crawlee v4, moved proxy init earlier, and adjusted response handling for the v4 Response.
Partially fixes #214 (1. and 2.)
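The "set proxy before discoverValidSitemaps" part of the change can be sketched as an ordering constraint. The function bodies below are local stubs standing in for the real `Actor.createProxyConfiguration` and sitemap discovery calls in sitemap-scraper:

```typescript
// Records the order of the two stubbed steps, to make the constraint visible.
const calls: string[] = [];

// Stub for Actor.createProxyConfiguration (assumption, not the real API).
async function createProxyConfiguration(): Promise<{ stub: true }> {
    calls.push('proxy');
    return { stub: true };
}

// Stub for sitemap discovery; in the real actor it can now route through the proxy.
async function discoverValidSitemaps(proxy: { stub: true }): Promise<void> {
    calls.push('sitemaps');
}

async function run(): Promise<string[]> {
    // Proxy init first, so discovery already has a proxy to use.
    const proxyConfiguration = await createProxyConfiguration();
    await discoverValidSitemaps(proxyConfiguration);
    return calls;
}
```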