RSDK-9591 - Kill all lingering module processes before exiting #4657
base: main
Conversation
@@ -280,6 +283,9 @@ func (s *robotServer) serveWeb(ctx context.Context, cfg *config.Config) (err err
case <-doneServing:
While we're here, I'd recommend removing this whole `case <-doneServing` stuff (and incidentally the `select` statement) and moving straight to the killing/logging.
Why is that? I thought the justification above for the code as it is makes sense.
Ug -- it does. But it's a self-inflicted mess. I'll make a change after this goes in.
web/server/entrypoint.go
Outdated
@@ -280,6 +283,9 @@ func (s *robotServer) serveWeb(ctx context.Context, cfg *config.Config) (err err
case <-doneServing:
	return true
default:
	if myRobot != nil {
If `myRobot` can be nil here -- this is a data race.
Would that be an issue? I can see that `myRobot` could have started some processes but not yet returned, but I don't know if we can protect against that completely.
Same remark as the other cases
added some locking around this
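As a hedged illustration of what "some locking" around a shared `myRobot` could look like (the mutex and stub types below are made up for the sketch, not the PR's actual code):

```go
package main

import "sync"

// LocalRobot is a stand-in for robot.LocalRobot; only Kill matters here.
type LocalRobot interface{ Kill() }

type stubRobot struct{}

func (stubRobot) Kill() {}

var (
	myRobotMu sync.Mutex
	myRobot   LocalRobot
)

func main() {
	go func() {
		// Writer: publish the robot under the lock once construction succeeds.
		myRobotMu.Lock()
		myRobot = stubRobot{}
		myRobotMu.Unlock()
	}()

	// Reader (force-shutdown path): nil-check and Kill under the same lock.
	myRobotMu.Lock()
	if myRobot != nil {
		myRobot.Kill()
	}
	myRobotMu.Unlock()
}
```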
robot/impl/local_robot.go
Outdated
// Kill will attempt to kill any processes on the system started by the robot as quickly as possible.
// This operation is not clean and will not wait for completion.
func (r *localRobot) Kill() {
	if r.manager != nil {
Can we justify this isn't a data race?
`r.manager` could be nil if startup fails/hangs, but yes, it could also be a data race.
We talked offline -- I agree that a mutex doesn't fix the "logical" race where we may observe that the manager is nil a moment before it gets assigned.
But TSAN/Go's data race detection will notice this. Strictly speaking, if one has two threads reading and writing a variable/memory address at the same time:

X initialized to 0

| Writer       | Reader |
|--------------+--------|
| Write(X = 1) | Read X |

While our program may be OK with the reader seeing 0 or the reader seeing 1, it's not necessarily the case that on all architectures the reader can only see 0 or 1.
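To illustrate, here is a minimal standalone sketch (not RDK code; names are made up) of the write/read pattern in the table above. The program usually behaves, but `go run -race` can flag it, which is the detector point being made:

```go
package main

import "fmt"

type server struct {
	manager *int // stand-in for a lazily assigned member like r.manager
}

func main() {
	s := &server{}

	done := make(chan struct{})
	go func() {
		v := 1
		s.manager = &v // writer: assigns the field during "startup"
		close(done)
	}()

	// Reader: nil-checks the field concurrently, as a Kill() guard would.
	if s.manager != nil {
		fmt.Println("manager is set")
	}
	<-done
}
```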
Looking at this again, this isn't a data race - there's no chance for `Kill()` to be called before `r.manager` is assigned (`Kill()` can only be called if the robot exists, and the robot only exists once `r.manager` is assigned).
Great -- much prefer to be able to assume members are non-nil.
robot/impl/resource_manager.go
Outdated
// TODO: Kill processes in processManager as well.

// moduleManager may be nil in tests
if manager.moduleManager != nil {
I suspect this can be a data race as well?
yep! added some locking around this
@@ -261,6 +262,8 @@ func (s *robotServer) serveWeb(ctx context.Context, cfg *config.Config) (err err
forceShutdown := make(chan struct{})
defer func() { <-forceShutdown }()

var myRobot robot.LocalRobot

utils.PanicCapturingGo(func() {
It would be awesome if this force-shutdown goroutine were its own method/function, as we've been doing for our various unnamed async lambdas.
I agree, but would like to defer that work to https://viam.atlassian.net/browse/RSDK-9708. There's a bit of refactoring that has to be done (it'd look pretty ugly unless we add some vars to the server object), and I'd rather do it separately
// Kill will attempt to kill any processes on the system started by the robot as quickly as possible.
// This operation is not clean and will not wait for completion.
func (r *localRobot) Kill() {
	r.manager.Kill()
It feels a little awkward that `localRobot.Kill` only calls kill on the resource manager.
And the resource manager only calls kill on the mod manager.
And `localRobot` already has a handle on the modmanager. So why doesn't it call kill directly? Or just have the part that's about to do the `log.Fatal`/`os.Exit` call kill on the modmanager?
Keeping the abstractions is, I think, better in the long run - the resource manager will eventually call kill on the process manager.
The part that's about to do `log.Fatal` doesn't have a handle to the modmanager, since it only has access to the `LocalRobot` interface.
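As a hedged sketch of the layering being defended (types flattened and illustrative; the real RDK interfaces differ): each layer's Kill reaches only one level down, so a future processManager kill can slot into the resource manager without localRobot changing.

```go
package main

import "fmt"

type modManager struct{}

func (m *modManager) Kill() { fmt.Println("killing module processes") }

type resourceManager struct {
	moduleManager *modManager
	// A future processManager would also get killed here (per the TODO above).
}

func (m *resourceManager) Kill() { m.moduleManager.Kill() }

type localRobot struct {
	manager *resourceManager
}

// Kill delegates one layer down; localRobot never touches the modmanager directly.
func (r *localRobot) Kill() { r.manager.Kill() }

func main() {
	r := &localRobot{manager: &resourceManager{moduleManager: &modManager{}}}
	r.Kill()
}
```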
> keeping the abstractions I think is better in the long run

If I'm adding some plumbing from local robot -> modmanager, when should I use `localRobot.modmanager` directly and when should I add additional plumbing through `localRobot.manager`?
FWIW, I'm fine with leaving this as-is. The above is me just coming to the realization that there's some abstraction that I didn't know existed.
This is part two of two PRs that will hopefully help with shutting down all module processes before viam-server exits. Part one is here
This is still a draft as I'm looking for thoughts and ideas around making this better.
Before doing this, I looked into assigning module processes to the same process group as the viam-server and just killing the process group. However, we already have each module and process assigned to a unique process group, and we use that property to kill each module and process separately when necessary. Changing that behavior would be risky, so I did not pursue that path further.
We could kill each process in the mod manager directly using the exposed unixpid, but I figured we could just do it within each managed process; that way we get support on Windows as well. It does mean I added Kill() to a few interfaces, but it will hopefully be extensible in case anything else ever needs killing.
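As a hedged sketch of killing within each managed process (an illustrative wrapper, not pexec's actual API): Go's `os.Process.Kill` sends SIGKILL on Unix and uses TerminateProcess on Windows, which is where the cross-platform benefit comes from.

```go
package main

import (
	"log"
	"os/exec"
)

// managedProcess is an illustrative stand-in for a managed-process wrapper.
type managedProcess struct {
	cmd *exec.Cmd
}

// Kill terminates the underlying OS process immediately, without waiting
// for it to exit cleanly. os.Process.Kill is portable across Unix/Windows.
func (p *managedProcess) Kill() {
	if p.cmd == nil || p.cmd.Process == nil {
		return // never started, or already torn down
	}
	if err := p.cmd.Process.Kill(); err != nil {
		// Best effort: the process may already have exited.
		log.Printf("kill failed: %v", err)
	}
}

func main() {
	cmd := exec.Command("sleep", "60") // Unix demo command
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	(&managedProcess{cmd: cmd}).Kill()
	_ = cmd.Wait() // reap the killed child
}
```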
The idea behind this is for a Kill() call to propagate from the viam-server at the end of the 90s shutdown window, and we should not block on anything if possible. Kill() does not care about the resource graph, only that we kill the processes/module processes spawned by the server. I did not do the killing in parallel, since the calls should not block. I can see things racing with Close(), but I think the mitigation is to make kill/close idempotent so they will not panic if they overlap. This Kill() call currently happens in the same goroutine that eventually calls log.Fatal - is that good enough for now, or should we create a separate goroutine so that we can guarantee the viam-server exits by the 90s mark?
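The flow described above, as a hedged sketch (the real entrypoint code differs; the 90-second figure and the kill-then-exit ordering come from this description):

```go
package main

import (
	"log"
	"os"
	"time"
)

// forceShutdown waits for graceful shutdown; if it hasn't finished after
// 90s, it kills all robot-spawned processes and exits. Because kill() must
// not block, this goroutine is guaranteed to reach os.Exit promptly.
func forceShutdown(doneServing <-chan struct{}, kill func()) {
	select {
	case <-doneServing:
		// Graceful shutdown finished in time; nothing to force.
	case <-time.After(90 * time.Second):
		kill() // best-effort, non-blocking process killing
		log.Println("timed out waiting for graceful shutdown; exiting")
		os.Exit(1)
	}
}

func main() {
	doneServing := make(chan struct{})
	go forceShutdown(doneServing, func() { /* e.g. myRobot.Kill() */ })

	// ... serve until shutdown completes ...
	close(doneServing)
}
```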
Ideas for testing? I've tested with a Python module and observed that the module process does get killed; it would be good to test on setups where this is actually happening.