Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KEP-672: Implement the DependsOn API #740

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

andreyvelich
Copy link
Contributor

@andreyvelich andreyvelich commented Dec 27, 2024

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

I added implementation for the DependsOn API. The majority of validations are implemented using CEL.

TODO list:

  • Add unit tests for the depends_on.go
  • Add controller integration tests to verify CEL validation.
  • Add controller integration tests to verify Job creation.

/cc @ahg-g @kannon92 @danielvegamyhre @tenzen-y @vsoch

Which issue(s) this PR fixes:

Fixes: #672

Does this PR introduce a user-facing change?

Add the DependsOn API to support serial execution of ReplicatedJobs.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 27, 2024
@k8s-ci-robot k8s-ci-robot requested a review from ahg-g December 27, 2024 13:17
@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API labels Dec 27, 2024
@k8s-ci-robot
Copy link
Contributor

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: vsoch.

Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it:

I added implementation for the DependsOn API. The majority of validations are implemented using CEL.

TODO list:

  • Add unit tests for the depends_on.go
  • Add controller integration tests to verify CEL validation.
  • Add controller integration tests to verify Job creation.

/cc @ahg-g @kannon92 @danielvegamyhre @tenzen-y @vsoch

Which issue(s) this PR fixes:

Fixes: #672

Does this PR introduce a user-facing change?

Add the DependsOn API to support serial execution of ReplicatedJobs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Dec 27, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @andreyvelich. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Dec 27, 2024
Copy link

netlify bot commented Dec 27, 2024

Deploy Preview for kubernetes-sigs-jobset ready!

Name Link
🔨 Latest commit 0df9651
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-jobset/deploys/67800cefca730a0008b978dc
😎 Deploy Preview https://deploy-preview-740--kubernetes-sigs-jobset.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.


// Complete status means the Succeeded counter equals the number of child Jobs.
// .spec.replicatedJobs["name==<JOB_NAME>"].replicas == .status.replicatedJobsStatus.name["name==<JOB_NAME>"].succeeded
CompleteStatus DependsOnStatus = "Complete"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we think about Complete status here, given the discussion in: #723 ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to match the Kubernetes Job and use Complete, so keep as is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I think that is fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for consistency w/ Job API

@kannon92
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 27, 2024
@andreyvelich
Copy link
Contributor Author

I've added integration and unit tests. Let me know if that looks good to you!
/assign @ahg-g @kannon92 @danielvegamyhre @tenzen-y @vsoch @shravan-achar @akshaychitneni

@k8s-ci-robot
Copy link
Contributor

@andreyvelich: GitHub didn't allow me to assign the following users: vsoch, shravan-achar, akshaychitneni.

Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

I've added integration and unit tests. Let me know if that looks good to you!
/assign @ahg-g @kannon92 @danielvegamyhre @tenzen-y @vsoch @shravan-achar @akshaychitneni

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@andreyvelich andreyvelich changed the title [WIP] KEP-672: Implement the DependsOn API KEP-672: Implement the DependsOn API Jan 3, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 3, 2025

// Complete status means the Succeeded counter equals the number of child Jobs.
// .spec.replicatedJobs["name==<JOB_NAME>"].replicas == .status.replicatedJobsStatus.name["name==<JOB_NAME>"].succeeded
CompleteStatus DependsOnStatus = "Complete"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to match the Kubernetes Job and use Complete, so keep as is.

@@ -417,5 +417,126 @@ var _ = ginkgo.Describe("jobset webhook defaulting", func() {
},
updateShouldFail: true,
}),
ginkgo.Entry("DependsOn and StartupPolicy can't be set together", &testCase{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a test case for StartupPolicy set to AnyOrder and allowing DependsOn

// only after the referenced ReplicatedJobs reach their desired state.
// The Order of ReplicatedJobs is defined by their enumeration in the slice.
// Note, that the first ReplicatedJob in the slice cannot use the DependsOn API.
// TODO (andreyvelich): Currently, only a single item is supported in the DependsOn list.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should have TODO in user facing APIs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will remove it.

rJobsReplicas[replicatedJob.Name] = replicatedJob.Replicas

// For depends on, the Job is created only after the previous replicatedJob reached the status.
if replicatedJob.DependsOn != nil && !isDependsOnJobReachedStatus(replicatedJob.DependsOn[0], rJobsReplicas[replicatedJob.DependsOn[0].Name], replicatedJobStatuses) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a bug. You probably want the idx - 1 rather than 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich there are some comments in this PR confused about the hardcoded index 0, it surprised me as well at first until I remembered the design discussion about single vs multi-dependency and future proofing.

I think hard coding the index of a specific dependency makes the code a bit surprising/confusing as compared to simply looping through the slice checking all dependencies, and using the CEL validation you've added to enforce the single dependency only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, @danielvegamyhre do you suggest that in this PR we implement the dependencyReachedStatus function in a way that it can accept array and verify that all dependent ReplicatedJobs reach status ?

Copy link
Contributor

@kannon92 kannon92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given how important this is for kubeflow, I am wondering if a e2e test would be useful.

@@ -1487,10 +1487,152 @@ func TestValidateCreate(t *testing.T) {
},
}

dependsOnTests := []validationTestCase{
{
name: "dependsOn is valid",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a case where job-2 waits for job-1 and job-3 waits for job-2?

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: andreyvelich, vsoch
Once this PR has been reviewed and has the lgtm label, please ask for approval from ahg-g. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@andreyvelich andreyvelich force-pushed the issue-672-implement-depends-on branch from 76d53e5 to bf743bb Compare January 6, 2025 15:27
@andreyvelich
Copy link
Contributor Author

Given how important this is for kubeflow, I am wondering if a e2e test would be useful.

Yeah, maybe we could add one e2e test as part of JobSet: https://github.com/kubernetes-sigs/jobset/blob/main/test/e2e/e2e_test.go.
Will add it soon.

@andreyvelich andreyvelich force-pushed the issue-672-implement-depends-on branch from c0b8760 to d76b286 Compare January 6, 2025 22:39
@andreyvelich
Copy link
Contributor Author

@kannon92 @tenzen-y I've added the simple E2E test for DependsOn API.

@kannon92
Copy link
Contributor

kannon92 commented Jan 8, 2025

Can we make sure to add some examples and maybe call this out on our site?

@andreyvelich
Copy link
Contributor Author

Can we make sure to add some examples and maybe call this out on our site?

Sure, I can add simple example with sleep.
Actually, once we have implementation of MPI Runtime in Kubeflow Training V2, I am planning to add example of MPI job that uses DependsOn API. We can link those examples in the JobSet docs as well.
kubeflow/training-operator#2217

@kannon92
Copy link
Contributor

kannon92 commented Jan 8, 2025

Sure, I can add simple example with sleep.
Actually, once we have implementation of MPI Runtime in Kubeflow Training V2, I am planning to add example of MPI job that uses DependsOn API. We can link those examples in the JobSet docs as well.

We can have sleep for now. I'd prefer to have standalone examples without requiring Kubeflow in our user facing docs.


// Complete status means the Succeeded counter equals the number of child Jobs.
// .spec.replicatedJobs["name==<JOB_NAME>"].replicas == .status.replicatedJobsStatus.name["name==<JOB_NAME>"].succeeded
CompleteStatus DependsOnStatus = "Complete"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for consistency w/ Job API

rJobsReplicas[replicatedJob.Name] = replicatedJob.Replicas

// For depends on, the Job is created only after the previous replicatedJob reached the status.
if replicatedJob.DependsOn != nil && !isDependsOnJobReachedStatus(replicatedJob.DependsOn[0], rJobsReplicas[replicatedJob.DependsOn[0].Name], replicatedJobStatuses) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreyvelich there are some comments in this PR confused about the hardcoded index 0, it surprised me as well at first until I remembered the design discussion about single vs multi-dependency and future proofing.

I think hard coding the index of a specific dependency makes the code a bit surprising/confusing as compared to simply looping through the slice checking all dependencies, and using the CEL validation you've added to enforce the single dependency only.


const (
// Ready status means the Ready + Succeeded + Failed counter equals the number of child Jobs.
ReadyStatus DependsOnStatus = "Ready"
Copy link
Contributor

@danielvegamyhre danielvegamyhre Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should rename these status variables something like DependencyReady and DependencyComplete (or ReplicatedJobReady and ReplicatedJobComplete) since they are specific to a replicatedJob dependency and not the status of the JobSet itself.

With the current naming, in the other places we're importing the api as jobset and using these variables, they sound like they are referring to the status of the JobSet itself (e.g. jobset.ReadyStatus, jobset.CompletedStatus, etc), which is confusing/misleading.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that is good point!

)

// isDependsOnJobReachedStatus checks if depends on ReplicatedJob reaches Ready or Complete status.
func isDependsOnJobReachedStatus(dependsOnJob jobset.DependsOn, dependsOnJobReplicas int32, rJobsStatuses []jobset.ReplicatedJobStatus) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function name is a bit confusing to me, maybe we can name it something like dependencyReachedStatus or replicatedJobReachedStatus?

@andreyvelich
Copy link
Contributor Author

@danielvegamyhre @kannon92 What do you think about example that I added ?

@andreyvelich andreyvelich force-pushed the issue-672-implement-depends-on branch from 386fad0 to 13e133c Compare January 9, 2025 17:12
- /bin/sh
- -c
- "echo 'initializer runs for 10 seconds' && sleep 10"
- name: launcher
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the launcher maybe we should have some kind of readiness probe that can explain how this would be used for ready?

Right now it would go to active/running very quick due to no readiness probe.

Copy link
Contributor Author

@andreyvelich andreyvelich Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I changed it to startupProbe for launcher.

Right now it would go to active/running very quick due to no readiness probe.

Actually, the Job will be Ready only after all InitContainers are complete: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#:~:text=Ready%3A%20the%20Pod%20is%20able%20to%20serve%20requests%20and%20should%20be%20added%20to%20the%20load%20balancing%20pools%20of%20all%20matching%20Services.

@andreyvelich andreyvelich force-pushed the issue-672-implement-depends-on branch from ffd4715 to 0df9651 Compare January 9, 2025 17:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/feature Categorizes issue or PR as related to a new feature. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for the execution policy API in JobSet
7 participants