feat: add support parse_url#3563
Conversation
|
Thanks @rafafrdz! Kicking off CI. |
|
This looks good so far @rafafrdz. CI shows all three On the compatibility side, I noticed that the PR doesn't distinguish between ANSI and legacy mode for invalid URL handling. DataFusion's |
|
Thanks for the suggestion @andygrove What I did:
I also registered |
7c8a8d9 to
9d7720f
Compare
| -- under the License. | ||
|
|
||
| -- ConfigMatrix: parquet.enable.dictionary=false,true | ||
| -- MinSparkVersion: 3.5 |
There was a problem hiding this comment.
Does Spark 3.4 not support parse_url?
| SELECT parse_url(url, 'AUTHORITY') FROM test_parse_url | ||
|
|
||
| query | ||
| SELECT parse_url(url, 'USERINFO') FROM test_parse_url |
There was a problem hiding this comment.
Could you also add a literal arguments query to the SQL file, since literal and column inputs can use different code paths? Something like:
SELECT parse_url('http://spark.apache.org/path?q=1', 'HOST')
| } | ||
|
|
||
| test("parse_url with invalid URL in legacy mode") { | ||
| assume(isSpark40Plus) |
There was a problem hiding this comment.
Could you add a comment explaining why this test is specific to Spark 4?
There was a problem hiding this comment.
my mistake, drop assumed part
| } | ||
|
|
||
| test("parse_url in ANSI mode (Spark 3.5)") { | ||
| assume(!isSpark40Plus) |
There was a problem hiding this comment.
Why does this test exclude Spark 4?
|
Moving this to draft until feedback is addressed |
- Remove try_parse_url SQL queries: try_parse_url is a Comet-internal DataFusion function name used when serializing parse_url with failOnError=false. It is not a registered Spark SQL function and calling it directly via SQL raises UNRESOLVED_ROUTINE on any Spark version. The NULL-on-invalid-URL behaviour is covered by the 'parse_url with invalid URL in legacy mode' Scala test. - Replace ftp port 21 with 2121 in test URLs: the Rust url crate omits well-known default ports (ftp=21) when serialising authority(), while Java URI.getRawAuthority() preserves them verbatim. Using a non-default port avoids this pre-existing semantic gap and keeps the AUTHORITY test case meaningful on both Spark 3.5 and 4.0.
Use Expression directly instead of a generic type parameter T <: Expression. This eliminates the asInstanceOf cast in convertFromInvokeUnchecked and the unchecked method itself, since QueryPlanSerde can now call convertFromInvoke directly without existential type issues.
3b07a93 to
54fc48a
Compare
Remove unused import that causes scalafix lint failure in CI.
Summary
parse_urlby mappingParseUrlto the native scalar functionparse_urlin expression serde.parse_urlas supported.Why
This closes one of the missing DataFusion 50 migration functions from issue #2443
Part of #2443