Investigate when nodes die but not enough for DR #265

DomAyre · 2024-12-13T09:17:05Z

I investigated by manually restarting a CCF node container:

The node attempts to join as if it never existed
The CCF network denies it because it believes it already has a node with the same URL

So this doesn't just work. I spoke to Gaurav from the az cleanroom team, and he said the extension can detect dead nodes but doesn't automatically re-provision them. Orchestration is deferred to a higher-level process.

I see a few possible solutions to this:

We build a simple "orchestrator", which could just be a cron-job-style process that inspects the network health and calls the scaling function whenever nodes die, so that the desired number of nodes is always maintained
We rework the az-cleanroom containers so that a restarted container does some work to re-identify itself as the dead node
We rework the az-cleanroom containers so that a restarted container joins as a new node
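
A minimal sketch of what one pass of the first option could look like. Everything here is an assumption for illustration: `NodeStatus`, `reconcile`, `list_nodes` and `scale_to` are hypothetical names, not part of the az cleanroom extension or CCF. The real health check and scaling function would be passed in as callables. It also assumes the scaling function takes a target node count and deals with the dead nodes itself, which may not be true.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class NodeStatus:
    """Hypothetical view of one CCF node as reported by a health check."""
    name: str
    healthy: bool


def reconcile(
    desired_count: int,
    list_nodes: Callable[[], List[NodeStatus]],
    scale_to: Callable[[int], None],
) -> int:
    """Run one health check and return how many nodes are missing.

    list_nodes and scale_to are stand-ins (assumptions) for whatever the
    real network health query and scaling function turn out to be.
    """
    nodes = list_nodes()
    healthy = sum(1 for node in nodes if node.healthy)
    missing = max(desired_count - healthy, 0)
    if missing > 0:
        # Ask the (hypothetical) scaling function to restore the target size.
        scale_to(desired_count)
    return missing
```

A cron job or a simple loop with a sleep could call `reconcile` on a fixed interval. Keeping the health check and scaling function injectable means the loop can be tested without a live network.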

DomAyre · 2025-01-06T14:30:41Z

Amaury confirms that we will want a separate process which monitors network health and re-provisions nodes as required.

DomAyre mentioned this issue Jan 9, 2025

Add an orchestrator for az cleanroom CCF networks #286

Draft
