fix: validate GitHub URL and check repo reachability in /scan-url #10
@lakshay122007 Let the checks complete for now while I review it.
@lakshay122007 Look into the CI failures.
@lakshay122007 The PR looks good to me, but there are a few things to fix before we can merge:
Must fix
Should fix
Nice to have follow-ups
Kindly address these changes!
Oh, the code needs to be formatted with ruff; that's what is causing the failure. Let me fix it.
@lakshay122007 Yup. Also look into the other changes I've asked you to make.
@lakshay122007 Kindly look into the other changes I've asked you to make.
httpx was already in requirements.txt, so I guess it does not need adding, or did you mean something else? I have fixed the rest and am about to push. As for the follow-ups, it would be nice if a follow-up issue were raised for the 5th one.
That LGTM as well. You may raise follow-ups and I'll assign them.
Linked issue
Closes #5
What this PR does
The /scan-url endpoint was silently accepting invalid or private GitHub URLs, causing the server to hang or return a generic 500 with no useful message. This PR adds URL format validation, a reachability check before cloning, and a 30-second timeout on the download step. The frontend now also parses the error response correctly, so users see a readable message instead of raw JSON.
Type of change
ML tier (if applicable)
Changes
Backend
- github_zip_url: returns 422 immediately if the URL format is invalid.
- check_repo_reachable(): sends a HEAD request to https://github.com/{owner}/{repo} with a 5-second timeout before attempting to download; returns 422 if the repo is not found or is private.
- download_to_path is wrapped in asyncio.wait_for with a 30-second timeout; returns 504 "Repository clone timed out" if exceeded.
Frontend
- scanRepoUrl in api.ts now parses error responses as JSON and surfaces the detail field, so the actual error message is shown to the user instead of a raw JSON string.
Testing
How did you test this?
- Invalid URL (not-a-url) → got 422 immediately with a clear message
- Nonexistent repo (https://github.com/someone/this-does-not-exist) → got 422 "Repository not found or is private" within 5 seconds
Checklist
- No console.error or unhandled Python exceptions introduced
- requirements.txt / package.json updated if new dependencies added
- .pkl, .pt, etc. are gitignored, not committed
Anything reviewers should focus on
The check_repo_reachable function uses a HEAD request, which works for public repos. Private repos return 404 to unauthenticated requests on github.com, which is the behaviour we want: we surface that as "Repository not found or is private" to avoid leaking information.
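
For reviewers who want the backend flow in one place, here is a minimal sketch of the three checks described under Changes. It is not the code in this PR. The regex, the ScanError class, parse_github_url, download_with_timeout, and the injected head/download callables are all hypothetical names chosen for illustration, and treating any status of 400 or above as "not found or private" is an assumption. In the real service the head callable would presumably be a thin wrapper around an httpx HEAD request with its own 5-second timeout.

```python
import asyncio
import re
from typing import Awaitable, Callable, Tuple

# Assumed URL shape: https://github.com/{owner}/{repo}, optional ".git" or trailing slash.
_GITHUB_REPO_RE = re.compile(
    r"^https://github\.com/(?P<owner>[A-Za-z0-9-]+)/(?P<repo>[A-Za-z0-9._-]+?)(?:\.git)?/?$"
)


class ScanError(Exception):
    """Hypothetical error carrying an HTTP status code and a user-facing detail."""

    def __init__(self, status: int, detail: str) -> None:
        super().__init__(detail)
        self.status = status
        self.detail = detail


def parse_github_url(url: str) -> Tuple[str, str]:
    """Return (owner, repo), or raise a 422 before any network call is made."""
    match = _GITHUB_REPO_RE.match(url.strip())
    if match is None:
        raise ScanError(422, "Invalid GitHub repository URL")
    return match["owner"], match["repo"]


async def check_repo_reachable(
    owner: str, repo: str, head: Callable[[str], Awaitable[int]]
) -> None:
    """Ask `head` for the status of the repo page; 4xx/5xx is reported as a 422."""
    status = await head(f"https://github.com/{owner}/{repo}")
    if status >= 400:  # assumption: private and missing repos both look like 404
        raise ScanError(422, "Repository not found or is private")


async def download_with_timeout(
    download: Callable[[], Awaitable[None]], timeout: float = 30.0
) -> None:
    """Run the download coroutine, mapping a timeout to a 504."""
    try:
        await asyncio.wait_for(download(), timeout=timeout)
    except asyncio.TimeoutError:
        raise ScanError(504, "Repository clone timed out") from None
```

Passing the HEAD and download steps in as callables keeps the sketch free of network access, so each failure path (422 on format, 422 on reachability, 504 on timeout) can be exercised with a stub.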