Feature Request
I would like to be able to seamlessly run jobs on a k8s cluster from my local machine using Kanon.
To achieve this, we need to avoid rebuilding the image every time a first-party task changes.
This would require many changes to the Kanon library, so I would like to discuss it.
Motivation
Machine learning jobs require a lot of data and computation, which makes it impractical to run them on a local machine.
However, it is easy to run them on a k8s cluster because its resources are flexible.
As you may know, running jobs on a k8s cluster requires a Docker image.
The current implementation of Kanon, as far as I know, requires an image containing all the tasks to run.
This is inconvenient because the image has to be rebuilt every time ML engineers try out a change to their logic.
Proposal
To achieve this, there are several issues that need to be addressed.
1. Send first-party packages to the job container when applying the job.
2. Add CLI options to specify a task as the entry point, as gokart does.
3. Allow for manual rerunning of a task even if it has already succeeded.
1. Send first-party packages to the job container when applying the job
As previously mentioned, we need to avoid building images every time a change is made to the code.
Since Python is an interpreted language, we don't need to compile code when building a Docker image.
This means that the image only requires the Python runtime and third-party dependencies.
Therefore, we can send the first-party packages to the job container when applying the job.
2. Add CLI options to specify a task as the entry point
Currently, Kanon requires specifying a root task in Python code to resolve the order of tasks to run.
While this is sufficient for production or integration environments, it's not enough for local development.
When ML engineers are testing their logic, they may want to run a single task.
In this case, we need to specify a task as the entry point.
3. Allow for manual rerunning of a task even if it has already succeeded
When ML engineers are testing their logic, they may want to rerun a task even if it has already succeeded, in order to check the results of their modifications.
As you may know, gokart caches the output of a task as a pickle file, and won't rerun a task if the pickle file exists and no parameters have changed.
Therefore, we need an option that forces a task to rerun, such as the --rerun option in gokart.
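The caching behavior described above can be illustrated with a minimal Python sketch. This is not gokart's implementation: `run_cached`, its `rerun` flag, and the cache path are hypothetical names used only to show how a forced rerun bypasses an existing pickle file.

```python
import pickle
from pathlib import Path


def run_cached(compute, cache_path, rerun=False):
    """Return the cached result if it exists; recompute when missing or when rerun=True."""
    cache_file = Path(cache_path)
    if cache_file.exists() and not rerun:
        # A previous run already succeeded, so reuse its pickled output.
        with cache_file.open("rb") as f:
            return pickle.load(f)
    # No cache yet, or the caller forced a rerun: compute and overwrite the cache.
    result = compute()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as f:
        pickle.dump(result, f)
    return result
```

Without `rerun=True`, a changed `compute` function would still return the stale pickled value, which is the situation a force-rerun option is meant to avoid.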
Draft Design
Here is a draft of the design I came up with. I'm not sure if it is the best design, but I'd like to discuss it.
Send first-party packages to the job container as a tarball (for Proposal 1.)
We can package the first-party code as a tarball and send it to the job container.
I think the implementation of skaffold to run kaniko is a good reference.
Skaffold is a tool to create a pipeline to deploy applications to a k8s cluster.
It has the option to use kaniko to build a docker image on a k8s cluster instead of locally.
Since the kaniko image does not contain the build context (the files needed to build the application), Skaffold needs to send it to the kaniko pod.
The following code sends the build context to a kaniko pod as a tarball.
https://github.com/GoogleContainerTools/skaffold/blob/908c36a893faa3729d121e273855a3749f2335b5/pkg/skaffold/build/cluster/kaniko.go#L146-L179
We can use the same approach to send first-party packages to the job container as a tarball.
Of course, we must add some options to Kanon to specify the location of the first-party folders, and extract them into a directory on PYTHONPATH when the job container starts.
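As a rough illustration of the packing and unpacking halves of this idea, here is a minimal Python sketch using only the standard library. The function names `pack_packages` and `unpack_packages` are hypothetical and not part of Kanon. How the bytes travel to the container (for example by streaming them into the pod, as Skaffold does for kaniko) is deliberately left out.

```python
import io
import sys
import tarfile
from pathlib import Path


def pack_packages(package_dirs):
    """Pack first-party package directories into one gzip-compressed tarball in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for package_dir in package_dirs:
            path = Path(package_dir)
            # Store each package under its own top-level name so that it is
            # importable once the extraction directory is on sys.path.
            tar.add(str(path), arcname=path.name)
    return buffer.getvalue()


def unpack_packages(tarball, target_dir):
    """Extract the tarball into target_dir and make its packages importable."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter where this Python version supports it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(fileobj=io.BytesIO(tarball), mode="r:gz") as tar:
        tar.extractall(target, **kwargs)
    # Equivalent to adding target_dir to PYTHONPATH for the current process.
    if str(target) not in sys.path:
        sys.path.insert(0, str(target))
```

On the local side, `pack_packages` would run when the job is applied. In the job container, `unpack_packages` would run before the task is imported.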
Expose CLI options of gokart (for Proposal 2., 3.)
I'm not sure if it can be implemented, so this is an idea rather than a design.
Since gokart has a lot of options that are useful during development, like --rerun and --modification_time_check, it would be good to rely on them.
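One way such forwarding might work is to split the command line at a `--` separator: everything before it belongs to kanon, and everything after it is handed to gokart untouched. The sketch below is an assumption-laden illustration, not an existing Kanon API. The `--namespace` and `--task` options and both function names are hypothetical.

```python
import argparse


def split_args(argv):
    """Split argv at the first `--` into (kanon_args, gokart_args)."""
    if "--" in argv:
        index = argv.index("--")
        return argv[:index], argv[index + 1:]
    return list(argv), []


def parse_run_args(argv):
    """Parse the hypothetical kanon options and return the rest for gokart."""
    kanon_argv, gokart_argv = split_args(argv)
    parser = argparse.ArgumentParser(prog="kanon run")
    parser.add_argument("--namespace", default="default")  # hypothetical option
    parser.add_argument("--task", required=True)  # hypothetical entry-point task
    options = parser.parse_args(kanon_argv)
    # gokart_argv is not interpreted here; it would be passed on to gokart as-is.
    return options, gokart_argv
```

Splitting manually before calling `argparse` keeps the gokart options and task parameters opaque to kanon, so new gokart flags would not require changes on the kanon side.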
Please consider introducing a CLI like the following:
    # Options before `--` are for kanon; `--` separates kanon and gokart options.
    # Options after `--` are for gokart; --task-a-param is a parameter of TaskA.
    kanon run \
      --namespace default --task TaskA \
      -- \
      --rerun --modification_time_check \
      --task-a-param 1

Thanks for reading this long proposal, and for the great library! 🙏
I'm looking forward to your feedback.