[Feature Proposal] Proposal for Running Jobs on Kubernetes Cluster from Local Machine with Kanon

# Feature Request

I would like to be able to seamlessly run jobs on a k8s cluster from my local machine using Kanon. 
To achieve this, we need to avoid creating images every time to reflect changes in first-party tasks.

This would require many changes to the Kanon library, so I would like to discuss it.

## Motivation

Machine learning jobs require a lot of data and computation, which makes it impractical to run them on a local machine. 
However, it is easy to run them on a k8s cluster because of its flexible resources.

As you may know, running jobs on a k8s cluster requires a Docker image. 
The current implementation of Kanon, as far as I know, requires an image containing all the tasks to run. 
This is inconvenient because the image must be rebuilt every time ML engineers try out a change to their logic.

## Proposal

To achieve this, three issues need to be addressed:

1. Send first-party packages to the job container when applying the job.
2. Add CLI options to specify a task as the entry point, like gokart.
3. Allow for manual rerunning of a task even if it has already succeeded.

### 1. Send first-party packages to the job container when applying the job

As previously mentioned, we need to avoid building images every time a change is made to the code.
Since Python is an interpreted language, we don't need to compile code when building a Docker image.
This means that the image only requires the Python runtime and third-party dependencies.
Therefore, we can send the first-party packages to the job container when applying the job.

### 2. Add CLI options to specify a task as the entry point

Currently, Kanon requires specifying a root task in Python code to resolve the order of tasks to run.
While this is sufficient for production or integration environments, it's not enough for local development.
When ML engineers are testing their logic, they may want to run a single task.
In this case, we need to specify a task as the entry point.

### 3. Manually rerun a task even if it succeeded

When ML engineers are testing their logic, they may want to rerun a task even if it has already succeeded, in order to check the results of their modifications.
As you may know, gokart caches the output of a task as a pickle file, and won't rerun a task if the pickle file exists and no parameters have changed.
Therefore, we need an option that forces a task to rerun, such as the `--rerun` option in [gokart](https://github.com/m3dev/gokart/blob/fd304e52f39eb1451973dcc71ce65f0a4d4902dd/gokart/task.py#L39).
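To make the intended semantics concrete, here is a minimal sketch of "skip if cached, unless a rerun is forced". It is not gokart's actual implementation; `run_cached`, `compute`, and `cache_path` are hypothetical names used only for illustration.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def run_cached(compute: Callable[[], Any], cache_path: Path, rerun: bool = False) -> Any:
    """Return the cached result if present, unless `rerun` forces recomputation.

    Hypothetical helper: it only illustrates the behaviour a `--rerun`-style
    flag is expected to have, not how gokart implements it.
    """
    if cache_path.exists() and not rerun:
        # The task already succeeded: reuse the pickled output.
        with cache_path.open("rb") as f:
            return pickle.load(f)

    # Either there is no cached output yet, or a rerun was requested.
    result = compute()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

With `rerun=False`, a second call returns the pickled value without calling `compute` again; with `rerun=True`, the task body runs again and the cache is overwritten.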

## Draft Design

Here is a draft of the design I came up with. I'm not sure if it is the best design, but I'd like to discuss it.

### Send first-party packages to the job container by tarball (for Proposal 1.)

We can send first-party packages to the job container by tarball.
I think the implementation of [skaffold](https://github.com/GoogleContainerTools/skaffold) to run [kaniko](https://github.com/GoogleContainerTools/kaniko) is a good reference.
Skaffold is a tool to create a pipeline to deploy applications to a k8s cluster.
It has the option to use kaniko to build a Docker image on a k8s cluster instead of locally.

Since the kaniko image does not contain dependencies (build context) to build applications, Skaffold needs to send them to a kaniko pod. 

The following code is used to send the build context to a kaniko pod by tarball.
https://github.com/GoogleContainerTools/skaffold/blob/908c36a893faa3729d121e273855a3749f2335b5/pkg/skaffold/build/cluster/kaniko.go#L146-L179

We can use the same approach to send first-party packages to the job container by tarball.
Of course, we must add some options to Kanon to specify the location of the first-party folders, and extract them into a directory on `PYTHONPATH` when the job container starts.
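The packing and unpacking steps could look roughly like the sketch below. It uses only the Python standard library and leaves out the transport to the pod entirely. `pack_packages` and `unpack_packages` are hypothetical names, not part of Kanon, and the sketch extends `sys.path` in-process, whereas a real job container might set `PYTHONPATH` before starting the interpreter.

```python
import io
import sys
import tarfile
from pathlib import Path


def pack_packages(package_dirs: list[Path]) -> bytes:
    """Pack first-party package directories into one gzip-compressed tarball.

    Hypothetical helper for the local (client) side.
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for package_dir in package_dirs:
            # Store each package under its own top-level name so that
            # `import <name>` works once the extraction directory is importable.
            archive.add(package_dir, arcname=package_dir.name)
    return buffer.getvalue()


def unpack_packages(tarball: bytes, target_dir: Path) -> None:
    """Extract the tarball and make the extracted packages importable.

    Hypothetical helper for the job container side. The tarball is assumed
    to be trusted, since it was produced by `pack_packages`.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(fileobj=io.BytesIO(tarball), mode="r:gz") as archive:
        archive.extractall(target_dir)
    # Equivalent in effect to putting `target_dir` on PYTHONPATH.
    sys.path.insert(0, str(target_dir))
```

Keeping the tarball in memory as `bytes` means it can be streamed to the container over whatever channel is chosen, without writing a temporary file on the local machine.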

### Expose CLI options of gokart (for Proposal 2., 3.)

I'm not sure if it can be implemented, so this is an idea rather than a design.
Since gokart already has a lot of useful options for development purposes, like `--rerun` and `--modification_time_check`, it makes sense to pass them through rather than reimplement them.

Please consider introducing a CLI like the following:

```console
# Options before `--` are for kanon; `--` separates kanon and gokart options.
# Options after `--` are passed to gokart.
# `--task-a-param` is a parameter of TaskA.
kanon run \
    --namespace default --task TaskA \
    -- \
    --rerun --modification_time_check \
    --task-a-param 1
```
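As a rough sketch of how such a command line could be split, assuming the proposed `--` separator: everything before the first `--` goes to Kanon and the rest is forwarded untouched. `split_args` and `parse_kanon_args` are hypothetical names, and `--namespace` and `--task` are just the options from the example above, not an existing Kanon CLI.

```python
import argparse


def split_args(argv: list[str]) -> tuple[list[str], list[str]]:
    """Split argv at the first `--` into (kanon options, gokart options)."""
    if "--" not in argv:
        return list(argv), []
    index = argv.index("--")
    return argv[:index], argv[index + 1:]


def parse_kanon_args(kanon_argv: list[str]) -> argparse.Namespace:
    """Parse the kanon-side options (hypothetical flags from the proposal)."""
    parser = argparse.ArgumentParser(prog="kanon run")
    parser.add_argument("--namespace", default="default")
    parser.add_argument("--task", required=True)
    return parser.parse_args(kanon_argv)
```

Splitting before calling `argparse` keeps the gokart options opaque to Kanon, so new gokart flags and task parameters would not require any change on the Kanon side.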

Thanks for reading this long proposal and great library! :pray:
I'm looking forward to your feedback.
