Investigate when nodes die but not enough for DR #265

Open
DomAyre opened this issue Dec 13, 2024 · 1 comment

Comments

@DomAyre
Collaborator

DomAyre commented Dec 13, 2024

I investigated what happens when I manually restart a CCF node container:

  • The restarted node attempts to join the network as if it had never existed
  • The CCF network denies the join because it believes it already has a node with the same URL

So a restart doesn't just work. I spoke to Gaurav from the az cleanroom team, and he said the extension is capable of detecting dead nodes but doesn't automatically re-provision them; orchestration is deferred to a higher-level process.

I see a few possible solutions to this:

  1. We build a simple "orchestrator", which could just be a cron-job-type process that inspects the network health and calls the scaling function if any nodes die, so that the desired number of nodes is always maintained
  2. We rework the az-cleanroom containers so that they do some work to re-identify as the dead node
  3. We rework the az-cleanroom containers so that they join as a new node
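
As a rough illustration of option 1, a single reconciliation pass might look like the sketch below. Everything in it is an assumption for discussion: `NodeStatus`, `list_nodes` and `scale_to` are hypothetical names, not part of the az cleanroom extension or CCF, and the real health check and scaling function may have quite different shapes.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class NodeStatus:
    # Hypothetical view of one node as reported by some health check.
    node_id: str
    healthy: bool


def reconcile(
    desired_count: int,
    list_nodes: Callable[[], Iterable[NodeStatus]],  # assumed health query
    scale_to: Callable[[int], None],                 # assumed scaling function
) -> int:
    """Run one pass and return how many healthy nodes were missing."""
    healthy = sum(1 for node in list_nodes() if node.healthy)
    missing = max(0, desired_count - healthy)
    if missing:
        # Ask the (assumed) scaling function to restore the desired size.
        scale_to(desired_count)
    return missing
```

A cron-style job could call `reconcile` once per run, which keeps the process stateless: each pass only compares the observed healthy count with the desired count.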
@DomAyre
Collaborator Author

DomAyre commented Jan 6, 2025

Amaury confirms that we will want a separate process which monitors network health and re-provisions nodes as required.
