This document outlines a refined API structure for the CrossRoad analysis pipeline, prioritizing performance for large tabular data using dedicated endpoints and Apache Arrow, while incorporating job queuing and status reporting.
- Improve data transfer speed for plotting tables compared to zipping text files.
- Provide specific endpoints for frontend components to fetch only necessary plot data.
- Use an efficient binary format (Apache Arrow) for tabular data transfer.
- Implement job queuing to limit concurrent analyses.
- Allow clients to query job status (queued, running, completed, failed).
- Retain an option to download the complete analysis results as a zip file.
- Backend Framework: FastAPI
- Data Serialization (Tabular): Apache Arrow (IPC Stream Format) via `pyarrow`
- Data Serialization (Control/Status): JSON
- Asynchronous Tasks/Queuing: FastAPI `BackgroundTasks` (simple) or a dedicated library like Celery/RQ (more robust).
- Concurrency/Queue Management: Requires a mechanism (e.g., global counter/lock, Redis, task queue features) to track active jobs and manage the queue.
- Endpoint: `POST /analyze_ssr/`
- Request: `multipart/form-data` containing input files (FASTA, optional Category TSV, optional Gene BED) and parameters (`reference_id`, `perf_params`, `flanks`).
- Processing:
  - Generate unique `job_id`.
  - Save input files to `jobOut/{job_id}/input/`.
  - Check current number of active jobs against `MAX_CONCURRENT_JOBS` limit (e.g., 2).
  - If below limit:
    - Mark job status as `running`.
    - Start the analysis pipeline (M2, GC2, etc.) asynchronously.
    - Return `202 Accepted` response with status `running`.
  - If at limit:
    - Add job details (input paths, params, `job_id`) to a waiting queue.
    - Mark job status as `queued`.
    - Return `202 Accepted` response with status `queued`.
- Response (`202 Accepted`): `application/json`

      {
        "job_id": "job_abc789",
        "status": "queued", // or "running"
        "status_url": "/api/job/job_abc789/status",
        "results_base_url": "/api/job/job_abc789/plot_data/", // Base URL for plot data endpoints
        "download_all_url": "/api/job/job_abc789/download_zip" // URL for full download
      }

- Response (Error): `400 Bad Request` (invalid params), `422 Unprocessable Entity` (validation error), `500 Internal Server Error`.
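
The admission logic above can be sketched without committing to a framework. This is a minimal in-memory sketch, assuming a single server process; the `JobManager` class and its method names are hypothetical, and a real deployment might back the same logic with Redis or a task queue instead.

```python
import threading
import uuid
from collections import deque

MAX_CONCURRENT_JOBS = 2  # limit taken from the plan's example


class JobManager:
    """In-memory job tracker (hypothetical; not part of the plan itself)."""

    def __init__(self, limit=MAX_CONCURRENT_JOBS):
        self.limit = limit
        self.lock = threading.Lock()
        self.status = {}        # job_id -> "queued" | "running" | "completed" | "failed"
        self.waiting = deque()  # FIFO of (job_id, params) not yet started

    def submit(self, params):
        """Admit a job and return the body of the 202 Accepted response."""
        job_id = f"job_{uuid.uuid4().hex[:8]}"
        with self.lock:
            running = sum(1 for s in self.status.values() if s == "running")
            if running < self.limit:
                # The route handler would launch the pipeline asynchronously here.
                self.status[job_id] = "running"
            else:
                self.status[job_id] = "queued"
                self.waiting.append((job_id, params))
            state = self.status[job_id]
        return {
            "job_id": job_id,
            "status": state,
            "status_url": f"/api/job/{job_id}/status",
            "results_base_url": f"/api/job/{job_id}/plot_data/",
            "download_all_url": f"/api/job/{job_id}/download_zip",
        }


manager = JobManager()
responses = [manager.submit({"reference_id": "ref1"}) for _ in range(3)]
print([r["status"] for r in responses])  # ['running', 'running', 'queued']
```

Counting and status assignment happen under one lock so that two simultaneous submissions cannot both claim the last free slot.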
- Endpoint: `GET /api/job/{job_id}/status`
- Path Parameter: `job_id` (string).
- Processing: Retrieve the current status and any associated messages/progress for the specified `job_id` from the job tracking mechanism.
- Response (`200 OK`): `application/json`

      {
        "job_id": "job_abc789",
        "status": "running", // "queued", "completed", "failed"
        "message": "Processing GC2 module...", // Optional: Current step/message
        "progress": 0.65, // Optional: Overall progress (0.0-1.0)
        "error_details": null // Populated if status is "failed"
      }

- Response (Error): `404 Not Found` (invalid `job_id`).
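
As a rough illustration, the lookup can be a pure function over whatever store tracks jobs. The sketch below assumes a plain dict keyed by `job_id`; the function name and the shape of the 404 body are assumptions, not part of the plan.

```python
def job_status_response(jobs, job_id):
    """Return (http_status, body) for GET /api/job/{job_id}/status."""
    record = jobs.get(job_id)
    if record is None:
        # Body shape for the error case is an assumption.
        return 404, {"detail": f"Unknown job_id: {job_id}"}
    return 200, {
        "job_id": job_id,
        "status": record["status"],
        "message": record.get("message"),              # optional
        "progress": record.get("progress"),            # optional, 0.0-1.0
        "error_details": record.get("error_details"),  # set when failed
    }


jobs = {
    "job_abc789": {
        "status": "running",
        "message": "Processing GC2 module...",
        "progress": 0.65,
    }
}
code, body = job_status_response(jobs, "job_abc789")
print(code, body["status"], body["error_details"])  # 200 running None
```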
- Endpoint: `GET /api/job/{job_id}/plot_data/{plot_key}`
- Path Parameters:
  - `job_id` (string).
  - `plot_key` (string): Identifier for the specific plot dataset needed (e.g., `hotspot`, `ssr_gc`, `category_sankey`, `hssr_data`, `ssr_gene_intersect`, `plot_source`). The `plot_source` key should handle checking for `mergedOut.tsv` first, then `reformatted.tsv`.
- Processing:
  - Verify `job_id` exists.
  - Check job status. If not `completed`, return `409 Conflict` (or `404` if preferred).
  - Determine the required source file path(s) based on `plot_key` (e.g., `plot_key='hotspot'` -> `jobOut/{job_id}/output/main/mutational_hotspot.csv`).
  - If file(s) exist, read into a Pandas DataFrame. Handle potential errors (file not found, read errors).
  - Convert the DataFrame to Apache Arrow IPC Stream format using `pyarrow`.
- Response (Success - `200 OK`): Raw binary Arrow data.
  - `Content-Type: application/vnd.apache.arrow.stream` (Preferred) or `application/octet-stream`.
- Response (Error): `404 Not Found` (job or file missing), `409 Conflict` (job not completed), `500 Internal Server Error` (processing error).
- Endpoint: `GET /api/job/{job_id}/download_zip`
- Path Parameter: `job_id` (string).
- Processing:
  - Verify `job_id` exists.
  - Check job status. If not `completed`, return `409 Conflict` (or `404`).
  - Locate the `jobOut/{job_id}/output/` directory.
  - Create a zip archive of this directory (if it doesn't already exist).
- Response (Success - `200 OK`): The zip file.
  - `Content-Type: application/zip`
  - `Content-Disposition: attachment; filename="ssr_analysis_{job_id}_full.zip"`
- Response (Error): `404 Not Found`, `409 Conflict`, `500 Internal Server Error`.
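
The "create if it doesn't already exist" step can be sketched with the standard library alone. The function name is hypothetical, and storing the archive next to `output/` (rather than inside it) is an assumption made so the archive never includes itself.

```python
import shutil
from pathlib import Path


def ensure_zip(job_root, job_id):
    """Create the full-results archive once and return its path."""
    job_dir = Path(job_root) / job_id
    output_dir = job_dir / "output"
    if not output_dir.is_dir():
        # The endpoint would map this to 404 Not Found.
        raise FileNotFoundError(output_dir)
    base = job_dir / f"ssr_analysis_{job_id}_full"
    zip_path = Path(f"{base}.zip")
    if not zip_path.exists():
        # make_archive appends ".zip" to the base name it is given.
        shutil.make_archive(str(base), "zip", root_dir=str(output_dir))
    return zip_path
```

A second request for the same job reuses the existing file instead of re-compressing the output directory.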
- Picks up a job (`job_id`, inputs, params) when started or when dequeued.
- Updates job status (`running`, `failed`, `completed`) in the shared tracking mechanism.
- Optionally updates progress/messages during execution.
- Runs the core analysis steps (M2, GC2, SSR processing).
- Saves output files to the correct `jobOut/{job_id}/output/main/` or `intrim/` directories.
- Upon completion (success or failure), triggers the queue manager to check if a waiting job can be started.
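
One way to make sure the queue manager is always notified, even when the pipeline raises, is a `try`/`except`/`finally` wrapper around the analysis call. In this sketch the `jobs` dict, the `pipeline` callable and its `report` hook, and the `on_finished` callback are all hypothetical stand-ins for the shared tracking mechanism, the M2/GC2/SSR steps, and the queue manager.

```python
import traceback


def run_job(job_id, jobs, pipeline, on_finished):
    """Run one analysis and record its outcome in the shared `jobs` dict."""
    jobs[job_id].update(status="running", error_details=None)

    def report(message, progress):
        # Optional progress hook the pipeline can call between steps.
        jobs[job_id].update(message=message, progress=progress)

    try:
        pipeline(job_id, jobs[job_id]["params"], report)
        jobs[job_id].update(status="completed", progress=1.0)
    except Exception:
        jobs[job_id].update(status="failed", error_details=traceback.format_exc())
    finally:
        # Success or failure, let the queue manager start a waiting job.
        on_finished(job_id)
```

With FastAPI `BackgroundTasks`, a function like this would be the task that gets scheduled; with Celery or RQ, the library's own completion hooks could replace `on_finished`.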
- User submits analysis via UI -> `POST /analyze_ssr/`.
- Frontend receives `job_id`, `status`, and URLs. Stores `job_id`.
- If status is `queued` or `running`, frontend periodically polls `GET /api/job/{job_id}/status`.
- UI updates based on status (e.g., "Queued", "Running X%", "Failed: Error message").
- When status becomes `completed`:
  - Frontend enables UI elements for viewing plots/results.
  - When a specific plot is needed, frontend calls `GET /api/job/{job_id}/plot_data/{plot_key}`.
  - Receives binary Arrow data, parses using Arrow JS, renders plot/table.
  - Frontend provides a button linked to `GET /api/job/{job_id}/download_zip` for full download.
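
The polling step of this workflow reduces to a small loop. It is sketched here in Python to match the backend examples, although the browser frontend would implement the same loop in JavaScript. The `fetch_status` callable is a hypothetical stand-in for the HTTP call to the status endpoint, and the two-second interval is an arbitrary assumption.

```python
import time


def wait_for_job(fetch_status, job_id, interval=2.0, sleep=time.sleep):
    """Poll the status endpoint until the job completes or fails."""
    while True:
        body = fetch_status(job_id)  # e.g. GET /api/job/{job_id}/status
        if body["status"] in ("completed", "failed"):
            return body
        # Still queued or running: wait before asking again.
        sleep(interval)
```

Passing `sleep` as a parameter keeps the loop testable without real delays.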
This plan addresses the requirements above: dedicated Arrow endpoints for plot data, a bounded job queue with status reporting, and an optional full zip download.