added support for huggingface dataset repositories#1007
added support for huggingface dataset repositories#1007arnab2001 wants to merge 8 commits intokitops-ml:mainfrom
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR extends the HuggingFace import functionality to support both models and datasets by adding URL parsing logic and routing to appropriate API endpoints.
- Adds dataset support alongside existing model support through HuggingFace API
- Introduces URL path parsing to differentiate between models and datasets
- Updates API URL constants and function signatures to handle both resource types
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/lib/hf/list.go | Splits tree URL format into model and dataset variants; adds kind parameter to route to correct API endpoint |
| pkg/lib/hf/download.go | Splits resolve URL format into model and dataset variants; adds kind parameter for downloads |
| pkg/cmd/kitimport/util_test.go | Extends test cases to verify dataset URL parsing and returns expected kind alongside repository |
| pkg/cmd/kitimport/util.go | Updates extractRepoFromURL to parse dataset URLs and return both normalized path and kind |
| pkg/cmd/kitimport/hfimport.go | Threads kind parameter through ListFiles and DownloadFiles calls |
| pkg/cmd/kitimport/cmd.go | Updates call to extractRepoFromURL to handle new return signature with kind |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
pkg/cmd/kitimport/util.go
Outdated
| if segments[0] == "datasets" { | ||
| return "", "", fmt.Errorf("invalid dataset URL format: expected datasets/org/repo, got '%s'", path) | ||
| } |
There was a problem hiding this comment.
I don't know if this check is appropriate for all uses of extractRepoFromURL -- it makes sense for huggingface URLs but kit import supports any git repository as well; what if I try to import from my-gitlab.org/datasets/my-dataset?
This is mixing huggingface-specific logic into this function without making it clear; this function should handle both git repositories (arbitrary names) and huggingface ones.
We should either handle huggingface specifically or rename the function to reflect its actual purpose and update all usages to be appropriate.
There was a problem hiding this comment.
extractRepoFromURL() is generic now and works for any git repository (GitLab, GitHub, etc.). my-gitlab.org/datasets/my-dataset would be handled correctly by extractRepoFromURL() as a generic org/repo pair. HuggingFace-specific parsing (including "datasets" prefix handling) is isolated in parseHuggingFaceRepo(). Usage in cmd.go (lines 134-142) tries HuggingFace parsing first, then falls back to generic parsing.
pkg/cmd/kitimport/util.go
Outdated
| if len(segments) != 2 { | ||
| return "", fmt.Errorf("could not extract organization and repository from '%s'", path) | ||
|
|
||
| var kind string |
There was a problem hiding this comment.
We should probably use a typed enum for kind to make things more explicit
type hfRepoType int
const (
hfModelRepoType hfRepoType = iota
hfDatasetRepoType
)There was a problem hiding this comment.
Implemented. Added hfRepoType enum in hfimport.go (lines 178-183) and RepositoryType in pkg/lib/hf/repo.go. All comparisons use typed constants (RepoTypeDataset, RepoTypeModel) instead of string literals.
pkg/cmd/kitimport/util.go
Outdated
| // extractRepoFromURL attempts to normalize a string or URL into a repository name as is used on GitHub and Huggingface. | ||
| // Returns an error we cannot automatically handle the input URL/string. | ||
| // - https://example.com/segment1/segment2 --> segment1/segment2 | ||
| // - 'organization/repository' --> 'organization/repository' | ||
| func extractRepoFromURL(rawUrl string) (string, error) { | ||
| // Returns the normalized repository path (org/repo), the kind ("model" or "dataset"), and an error if parsing fails. | ||
| // - https://huggingface.co/org/repo --> (org/repo, "model", nil) | ||
| // - https://huggingface.co/datasets/org/repo --> (org/repo, "dataset", nil) | ||
| // - 'organization/repository' --> (organization/repository, "model", nil) | ||
| // - 'datasets/organization/repository' --> (organization/repository, "dataset", nil) | ||
| func extractRepoFromURL(rawUrl string) (string, string, error) { |
There was a problem hiding this comment.
These changes change the purpose of this function, making it specific for handling huggingface repositories.
// extractRepoFromURL attempts to normalize a string or URL into a repository name as is used on GitHub and Huggingface.vs
Returns the normalized repository path (org/repo), the kind ("model" or "dataset"),
-- the 'kind' here only makes sense in a huggingface context, not GitHub (or any Git repository, for that matter).
There was a problem hiding this comment.
extractRepoFromURL() remains generic and only returns the normalized repo path (org/repo) , no kind/type. HuggingFace-specific logic (including dataset detection) is handled by parseHuggingFaceRepo() in hfimport.go.
|
Thanks for the review @amisevsk, Its my short sight that i missed these things. I am working on resolving these comments |
@amisevsk please check and let me know if theres any other changes are required, thank you for your time. |
|
PR checks are failing |
|
@gorkem working on it |
Signed-off-by: Arnab Chatterjee <arnabchat2001@gmail.com>
…tion, enums use int with iota Signed-off-by: Arnab Chatterjee <arnabchat2001@gmail.com>
Signed-off-by: Arnab Chatterjee <arnabchat2001@gmail.com>
Added the standard Apache License 2.0 header to pkg/lib/hf/repo.go for compliance and clarity regarding licensing. Signed-off-by: Arnab Chatterjee <arnabchat2001@gmail.com>
Signed-off-by: Arnab Chatterjee <arnabchat2001@gmail.com>
5fcf2b2 to
eb35533
Compare
|
@gorkem , CI checks should pass now |
Signed-off-by: Arnab Chatterjee <arnabchat2001@gmail.com>
amisevsk
left a comment
There was a problem hiding this comment.
Thanks for the fixes @arnab2001. Looking pretty good, just a few comments before we can merge.
pkg/cmd/kitimport/hfimport.go
Outdated
| func parseHuggingFaceRepo(rawURL string) (string, hfRepoType, error) { | ||
| u, err := url.Parse(rawURL) | ||
| if err != nil { | ||
| return "", 0, fmt.Errorf("failed to parse url: %w", err) | ||
| } | ||
|
|
||
| if u.Host != "" && !strings.Contains(u.Host, "huggingface.co") { | ||
| return "", 0, fmt.Errorf("not a Hugging Face repository") | ||
| } | ||
|
|
||
| path := strings.Trim(u.Path, "/") | ||
| segments := strings.FieldsFunc(path, func(r rune) bool { return r == '/' }) | ||
|
|
||
| if len(segments) >= 3 && segments[0] == "datasets" { | ||
| return strings.Join(segments[1:3], "/"), hfRepoTypeDataset, nil | ||
| } | ||
| if len(segments) == 2 && segments[0] == "datasets" { | ||
| return "", 0, fmt.Errorf("could not extract repository from path '%s'", path) | ||
| } | ||
|
|
||
| if len(segments) >= 2 { | ||
| return strings.Join(segments[len(segments)-2:], "/"), hfRepoTypeModel, nil | ||
| } | ||
|
|
||
| return "", 0, fmt.Errorf("could not extract repository from path '%s'", path) | ||
| } |
There was a problem hiding this comment.
We should move this function into lib/hf and have it return the lib/hf-defined repo types (e.g. it could go into repo.go. This would allow us to avoid the duplicate hfRepoType enum and the mapRepoType function.
pkg/cmd/kitimport/hfimport.go
Outdated
| func parseHuggingFaceRepo(rawURL string) (string, hfRepoType, error) { | ||
| u, err := url.Parse(rawURL) | ||
| if err != nil { | ||
| return "", 0, fmt.Errorf("failed to parse url: %w", err) |
There was a problem hiding this comment.
Consider defining an explicit repoTypeUnknown in the enum (placing it first in the list) and handling that in the lib/hf functions to improve zero-values for the error case.
pkg/cmd/kitimport/hfimport.go
Outdated
| path := strings.Trim(u.Path, "/") | ||
| segments := strings.FieldsFunc(path, func(r rune) bool { return r == '/' }) |
There was a problem hiding this comment.
Since you're trimming / from the path, FieldsFunc will be equivalent to strings.Split(path, "/") except harder to read.
Signed-off-by: Arnab Chatterjee <arnabchat2001@gmail.com>
|
@amisevsk Done. Moved ParseHuggingFaceRepo to pkg/lib/hf/repo.go, added RepoTypeUnknown as the zero value, and simplified FieldsFunc to strings.Split. Removed the duplicate hfRepoType enum and mapRepoType function from hfimport.go. |
pkg/lib/hf/repo.go
Outdated
| return "", RepoTypeUnknown, fmt.Errorf("failed to parse url: %w", err) | ||
| } | ||
|
|
||
| if u.Host != "" && !strings.Contains(u.Host, "huggingface.co") { |
There was a problem hiding this comment.
In this check, inputs like https://huggingface.co.evil.com/org/repo pass the “is HF” check and proceed as if they were Hugging Face. Can you tighten the parsing and validation of the host?
|
Working on it |
…P URLs, improve host validation, and handle trailing slashes. Signed-off-by: Arnab Chatterjee <arnabchat2001@gmail.com>
|
Fixed two critical issues in ParseHuggingFaceRepo(): hardened host validation to reject malicious URLs like https://huggingface.co.evil.com using strict exact-match checks, and added scheme-less URL handling so huggingface.co/datasets/org/repo correctly parses as a dataset instead of failing. Added 3 new test cases covering security edge cases, trailing slashes, and HTTP schemes - all 12 tests passing. |
Description
add Hugging Face dataset support to kit import by detecting dataset URLs and hitting the correct api/datasets/resolve endpoints, so both models and datasets can be packaged
Testing
go test ./...KITOPS_HOME=$(mktemp -d) ./kit import https://huggingface.co/datasets/nvidia/ProfBench --ref main --tag test/profbench:latest --tool hf/kit unpack test/profbench:latest --dir "$(mktemp -d)"Linked issues
closes #1004
AI-Assisted Code