-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-672: Implement the DependsOn API #740
base: main
Are you sure you want to change the base?
KEP-672: Implement the DependsOn API #740
Conversation
@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: vsoch. Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Hi @andreyvelich. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
✅ Deploy Preview for kubernetes-sigs-jobset ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
api/jobset/v1alpha2/jobset_types.go
Outdated
|
||
// Complete status means the Succeeded counter equals the number of child Jobs. | ||
// .spec.replicatedJobs["name==<JOB_NAME>"].replicas == .status.replicatedJobsStatus.name["name==<JOB_NAME>"].succeeded | ||
CompleteStatus DependsOnStatus = "Complete" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do we think about Complete
status here, given the discussion in: #723 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to match the Kubernetes Job and use Complete, so keep as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I think that is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for consistency w/ Job API
/ok-to-test |
I've added integration and unit tests. Let me know if that looks good to you! |
@andreyvelich: GitHub didn't allow me to assign the following users: vsoch, shravan-achar, akshaychitneni. Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
api/jobset/v1alpha2/jobset_types.go
Outdated
|
||
// Complete status means the Succeeded counter equals the number of child Jobs. | ||
// .spec.replicatedJobs["name==<JOB_NAME>"].replicas == .status.replicatedJobsStatus.name["name==<JOB_NAME>"].succeeded | ||
CompleteStatus DependsOnStatus = "Complete" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to match the Kubernetes Job and use Complete, so keep as is.
@@ -417,5 +417,126 @@ var _ = ginkgo.Describe("jobset webhook defaulting", func() { | |||
}, | |||
updateShouldFail: true, | |||
}), | |||
ginkgo.Entry("DependsOn and StartupPolicy can't be set together", &testCase{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a test case for StartupPolicy set to AnyOrder and allowing DependsOn
api/jobset/v1alpha2/jobset_types.go
Outdated
// only after the referenced ReplicatedJobs reach their desired state. | ||
// The Order of ReplicatedJobs is defined by their enumeration in the slice. | ||
// Note, that the first ReplicatedJob in the slice cannot use the DependsOn API. | ||
// TODO (andreyvelich): Currently, only a single item is supported in the DependsOn list. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should have TODO in user facing APIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I will remove it.
pkg/controllers/jobset_controller.go
Outdated
rJobsReplicas[replicatedJob.Name] = replicatedJob.Replicas | ||
|
||
// For depends on, the Job is created only after the previous replicatedJob reached the status. | ||
if replicatedJob.DependsOn != nil && !isDependsOnJobReachedStatus(replicatedJob.DependsOn[0], rJobsReplicas[replicatedJob.DependsOn[0].Name], replicatedJobStatuses) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a bug. You probably want the idx - 1 rather than 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich there are some comments in this PR confused about the hardcoded index 0, it surprised me as well at first until I remembered the design discussion about single vs multi-dependency and future proofing.
I think hard coding the index of a specific dependency makes the code a bit surprising/confusing as compared to simply looping through the slice checking all dependencies, and using the CEL validation you've added to enforce the single dependency only.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, @danielvegamyhre do you suggest that in this PR we implement the dependencyReachedStatus
function in a way that it can accept array and verify that all dependent ReplicatedJobs reach status ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given how important this is for kubeflow, I am wondering if a e2e test would be useful.
pkg/webhooks/jobset_webhook_test.go
Outdated
@@ -1487,10 +1487,152 @@ func TestValidateCreate(t *testing.T) { | |||
}, | |||
} | |||
|
|||
dependsOnTests := []validationTestCase{ | |||
{ | |||
name: "dependsOn is valid", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a case where job-2 waits for job-1 and job-3 waits for job-2?
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: andreyvelich, vsoch The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
76d53e5
to
bf743bb
Compare
Yeah, maybe we could add one e2e test as part of JobSet: https://github.com/kubernetes-sigs/jobset/blob/main/test/e2e/e2e_test.go. |
c0b8760
to
d76b286
Compare
Can we make sure to add some examples and maybe call this out on our site? |
Sure, I can add simple example with sleep. |
We can have sleep for now. I'd prefer to have standalone examples without requiring Kubeflow in our user facing docs. |
api/jobset/v1alpha2/jobset_types.go
Outdated
|
||
// Complete status means the Succeeded counter equals the number of child Jobs. | ||
// .spec.replicatedJobs["name==<JOB_NAME>"].replicas == .status.replicatedJobsStatus.name["name==<JOB_NAME>"].succeeded | ||
CompleteStatus DependsOnStatus = "Complete" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for consistency w/ Job API
pkg/controllers/jobset_controller.go
Outdated
rJobsReplicas[replicatedJob.Name] = replicatedJob.Replicas | ||
|
||
// For depends on, the Job is created only after the previous replicatedJob reached the status. | ||
if replicatedJob.DependsOn != nil && !isDependsOnJobReachedStatus(replicatedJob.DependsOn[0], rJobsReplicas[replicatedJob.DependsOn[0].Name], replicatedJobStatuses) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreyvelich there are some comments in this PR confused about the hardcoded index 0, it surprised me as well at first until I remembered the design discussion about single vs multi-dependency and future proofing.
I think hard coding the index of a specific dependency makes the code a bit surprising/confusing as compared to simply looping through the slice checking all dependencies, and using the CEL validation you've added to enforce the single dependency only.
api/jobset/v1alpha2/jobset_types.go
Outdated
|
||
const ( | ||
// Ready status means the Ready + Succeeded + Failed counter equals the number of child Jobs. | ||
ReadyStatus DependsOnStatus = "Ready" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should rename these status variables something like DependencyReady
and DependencyComplete
(or ReplicatedJobReady
and ReplicatedJobComplete
) since they are specific to a replicatedJob dependency and not the status of the JobSet itself.
With the current naming, in the other places we're importing the api as jobset
and using these variables, they sound like they are referring to the status of the JobSet itself (e.g. jobset.ReadyStatus
, jobset.CompletedStatus
, etc), which is confusing/misleading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that is good point!
pkg/controllers/depends_on.go
Outdated
) | ||
|
||
// isDependsOnJobReachedStatus checks if depends on ReplicatedJob reaches Ready or Complete status. | ||
func isDependsOnJobReachedStatus(dependsOnJob jobset.DependsOn, dependsOnJobReplicas int32, rJobsStatuses []jobset.ReplicatedJobStatus) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function name is a bit confusing to me, maybe we can name it something like dependencyReachedStatus
or replicatedJobReachedStatus
?
@danielvegamyhre @kannon92 What do you think about example that I added ? |
Improve API docs
Rename func to dependencyReachedStatus
Signed-off-by: Andrey Velichkevich <[email protected]>
386fad0
to
13e133c
Compare
- /bin/sh | ||
- -c | ||
- "echo 'initializer runs for 10 seconds' && sleep 10" | ||
- name: launcher |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the launcher maybe we should have some kind of readiness probe that can explain how this would be used for ready?
Right now it would go to active/running very quick due to no readiness probe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I changed it to startupProbe for launcher.
Right now it would go to active/running very quick due to no readiness probe.
Actually, the Job will be Ready only after all InitContainers are complete: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#:~:text=Ready%3A%20the%20Pod%20is%20able%20to%20serve%20requests%20and%20should%20be%20added%20to%20the%20load%20balancing%20pools%20of%20all%20matching%20Services.
ffd4715
to
0df9651
Compare
What type of PR is this?
/kind feature
/kind api-change
What this PR does / why we need it:
I added implementation for the DependsOn API. The majority of validations are implemented using CEL.
TODO list:
depends_on.go
/cc @ahg-g @kannon92 @danielvegamyhre @tenzen-y @vsoch
Which issue(s) this PR fixes:
Fixes: #672
Does this PR introduce a user-facing change?