Skip to content

fix: validate GitHub URL and check repo reachability in /scan-url#10

Merged
ionfwsrijan merged 3 commits into
ionfwsrijan:mainfrom
lakshay122007:fix/scan-url-validation
Jun 1, 2026
Merged

fix: validate GitHub URL and check repo reachability in /scan-url#10
ionfwsrijan merged 3 commits into
ionfwsrijan:mainfrom
lakshay122007:fix/scan-url-validation

Conversation

@lakshay122007
Copy link
Copy Markdown
Contributor

Before opening: make sure there is an issue tracking this work, and link it below. PRs without a linked issue may be closed without review.

Linked issue

Closes #5

What this PR does

The /scan-url endpoint was silently accepting invalid or private GitHub URLs, causing the server to hang or return a generic 500 with no useful message. This PR adds URL format validation, a reachability check before cloning, and a 30 second timeout on the download step. The frontend now also parses the error response correctly so users see a readable message instead of raw JSON.

Type of change

  • Bug fix
  • New feature
  • ML model / training pipeline
  • Refactor (no behaviour change)
  • Documentation
  • Tests only

ML tier (if applicable)

  • Tier 1 — Triage
  • Tier 2 — Predictive
  • Tier 3 — Autonomous
  • Not ML-related

Changes

Backend

  • Added regex validation on the URL format in github_zip_url — returns 422 immediately if the format is invalid
  • Added check_repo_reachable() — sends a HEAD request to https://github.com/{owner}/{repo} with a 5 second timeout before attempting to download, returns 422 if repo is not found or private
  • Wrapped download_to_path in asyncio.wait_for with a 30 second timeout — returns 504 "Repository clone timed out" if exceeded.

Frontend

  • scanRepoUrl in api.ts now parses error responses as JSON and surfaces the detail field so the actual error message is shown to the user instead of a raw JSON string.

Testing

How did you test this?

  • Submitted a completely invalid URL (e.g. not-a-url) → got 422 immediately with clear message
  • Submitted a nonexistent repo (e.g. https://github.com/someone/this-does-not-exist) → got 422 "Repository not found or is private" within 5 seconds
  • Submitted a valid public repo → scan proceeded normally.

Checklist

  • Tested locally end-to-end (upload ZIP or GitHub URL → scan → findings returned correctly)
  • New ML model falls back gracefully when model file is absent
  • No new console.error or unhandled Python exceptions introduced
  • Added or updated tests where applicable
  • requirements.txt / package.json updated if new dependencies added
  • New model files (.pkl, .pt, etc.) are gitignored, not committed

Anything reviewers should focus on

The check_repo_reachable function uses a HEAD request which works for public repos. Private repos return 404 from GitHub's public API which is the correct behaviour — we surface that as "Repository not found or is private" to avoid leaking information.

@ionfwsrijan
Copy link
Copy Markdown
Owner

@lakshay122007 Let the checks complete for now while i review it.

@ionfwsrijan
Copy link
Copy Markdown
Owner

@lakshay122007 Look into the CI failures

@ionfwsrijan
Copy link
Copy Markdown
Owner

@lakshay122007 The PR looks good to me but a few things to fix before we can merge:

Must fix

  1. httpx missing from requirements.txt — it's imported in main.py but the diff shows no change to requirements.txt. Fresh installs will crash with an ImportError on startup.

  2. Job directory not cleaned up on reachability failurejob_dir is created before check_repo_reachable() is called, but if it raises a 422, the directory is never deleted. The timeout path cleans up correctly but this path doesn't. Same orphaned-directory bug we have in issue Uploading a non-ZIP file to /scan causes an unhandled 500 instead of a validation error #6.

Should fix

  1. Wrong error message on HEAD request timeout — when httpx.TimeoutException is caught, it returns "Repository not found or is private" which is misleading. A timeout is a network issue, not a missing repo. Should say something like "Could not reach GitHub — check your network connection."

  2. HTTPException raised inside a utility functiongithub_zip_url is a plain helper but now raises FastAPI-specific exceptions directly. Better to raise a ValueError there and convert it to HTTPException in the route handler. Keeps things testable and separated.

Nice to have follow-ups

  1. No tests added — at minimum, mock httpx for the 404, timeout, and 200 cases. Happy to track this as a follow-up issue if you'd prefer not to add them now.

  2. Branch assumption is still fragile — the ZIP URL uses refs/heads/{ref} so repos with a default branch other than main (e.g. master, trunk) will 404 after passing the reachability check. Pre-existing issue, but worth a follow-up.

Kindly address these changes!

@lakshay122007
Copy link
Copy Markdown
Contributor Author

lakshay122007 commented May 31, 2026

ohh, the code need to be formatted using ruff - thats what creating the issue. lemme fix it.

@ionfwsrijan
Copy link
Copy Markdown
Owner

@lakshay122007 Yup. Also look into the other changes ive asked you to make.

@ionfwsrijan
Copy link
Copy Markdown
Owner

@lakshay122007 Kindly look into the other changes ive asked you to make.

@lakshay122007
Copy link
Copy Markdown
Contributor Author

lakshay122007 commented May 31, 2026

@lakshay122007 Kindly look into the other changes ive asked you to make.

The httpx was already in the requirements.txt - so that does not need adding i guess - or you were trying to say something else?

for rest i have fixed going to push but for follow ups - 5th one would be nice if follow up issue if raised.
For 6th, since ref is already a form parameter, users can pass master or trunk explicitly.

@ionfwsrijan
Copy link
Copy Markdown
Owner

@lakshay122007 Kindly look into the other changes ive asked you to make.

The httpx was already in the requirements.txt - so that does not need adding i guess - or you were trying to say something else?

for rest i have fixed going to push but for follow ups - 5th one would be nice if follow up issue if raised. For 6th, since ref is already a form parameter, users can pass master or trunk explicitly.

That LGTM as well. You may raise followups and ill assign.

@ionfwsrijan ionfwsrijan merged commit 26b8523 into ionfwsrijan:main Jun 1, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

/scan-url silently accepts invalid or private GitHub URLs with no user-facing error

2 participants