Add additional error handling to CosmosHealthCheck #4781

Open
wants to merge 8 commits into
base: main

Contributor

@mikaelweave mikaelweave commented Jan 15, 2025

Description

  • Adds Cosmos 503 exception to retriable exceptions in CosmosHealthCheck.
  • Adds additional diagnostic logging when a Cosmos 503 exception is encountered.
  • Also catches Cosmos timeout exceptions and returns a degraded health status.

Related issues

AB#137391

Testing

Unit testing.

FHIR Team Checklist

  • Update the title of the PR to be succinct and less than 65 characters
  • Add a milestone to the PR for the sprint that it is merged (i.e. add S47)
  • Tag the PR with the type of update: Bug, Build, Dependencies, Enhancement, New-Feature or Documentation
  • Tag the PR with Open source, Azure API for FHIR (CosmosDB or common code) or Azure Healthcare APIs (SQL or common code) to specify where this change is intended to be released.
  • Tag the PR with Schema Version backward compatible or Schema Version backward incompatible or Schema Version unchanged if this adds or updates Sql script which is/is not backward compatible with the code.
  • CI is green before merge
  • Review squash-merge requirements

Semver Change (docs)

Patch|Skip|Feature|Breaking (reason)

@mikaelweave mikaelweave requested a review from a team as a code owner January 15, 2025 20:08
@mikaelweave mikaelweave added this to the 2Wk07 milestone Jan 15, 2025
@mikaelweave mikaelweave added the Bug and Azure API for FHIR labels Jan 15, 2025
@mikaelweave mikaelweave changed the title from "Personal/mikaelw/cosmos healthcheck additional checks" to "Add additional error handling to CosmosHealthCheck" Jan 15, 2025
@mikaelweave mikaelweave added Bug and removed Bug labels Jan 15, 2025
ex,
"Failed to connect to the data store. Request has timed out.");

return HealthCheckResult.Degraded(
Contributor Author

Should timed-out requests to CosmosDB result in Degraded or ServiceUnavailable? A 408 status code can mean the database is overloaded by client requests.

Contributor Author

Discussed with @fhibf - degraded is the proper behavior here.
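
For readers following the thread, here is a minimal sketch of what "timeout maps to Degraded" could look like. It is not the PR's actual code: the class and method names are hypothetical, the description string is borrowed from the log message in the diff, and the `"Error"` / `"Error408"` pair only mirrors what the unit test in this PR asserts (assuming `FhirHealthErrorCode.Error408.ToString()` yields `"Error408"`). It assumes the `Microsoft.Extensions.Diagnostics.HealthChecks` abstractions package is referenced.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper for illustration only; not part of the PR.
public static class TimeoutHealthResultSketch
{
    public static HealthCheckResult FromTimeout(Exception ex)
    {
        // Assumption: the error code is surfaced under the "Error" key,
        // matching the key the PR's test helper looks up.
        var data = new Dictionary<string, object>
        {
            ["Error"] = "Error408",
        };

        // Degraded (rather than Unhealthy) signals that the service is
        // reachable but slow, which is the behavior agreed on in this thread.
        return HealthCheckResult.Degraded(
            description: "Failed to connect to the data store. Request has timed out.",
            exception: ex,
            data: data);
    }
}
```

Returning `Degraded` keeps the instance in rotation while still surfacing the timeout to whoever consumes the health report.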

VerifyErrorInResult(result.Data, "Error", FhirHealthErrorCode.Error408.ToString());
}

private void VerifyErrorInResult(IReadOnlyDictionary<string, object> dictionary, string key, string expectedMessage)
Contributor Author

Added this helper due to code scanning errors.
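
The body of the helper is not visible in this excerpt. As a rough sketch of what such a helper might do, the version below uses only the base class library and throws on mismatch; a real test class would more likely call its test framework's assertions. The containing class name is hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the test helper; only the method signature comes from the diff.
public static class HealthResultAssertions
{
    public static void VerifyErrorInResult(
        IReadOnlyDictionary<string, object> dictionary,
        string key,
        string expectedMessage)
    {
        // Fail fast if the key is missing instead of indexing directly,
        // which would throw a less descriptive KeyNotFoundException.
        if (!dictionary.TryGetValue(key, out var actual))
        {
            throw new InvalidOperationException($"Expected key '{key}' was not found in the result data.");
        }

        // Compare on the string form, since the data values are typed as object.
        if (!string.Equals(expectedMessage, actual?.ToString(), StringComparison.Ordinal))
        {
            throw new InvalidOperationException($"Expected '{expectedMessage}' for key '{key}' but found '{actual}'.");
        }
    }
}
```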

LTA-Thinking previously approved these changes Jan 17, 2025
// Reference: https://learn.microsoft.com/azure/cosmos-db/nosql/conceptual-resilient-sdk-applications#should-my-application-retry-on-errors
static bool IsRetryableException(Exception ex) =>
ex is CosmosOperationCanceledException ||
(ex is CosmosException cex && (cex.StatusCode == HttpStatusCode.ServiceUnavailable || cex.StatusCode == (HttpStatusCode)449));
Contributor

Should we include HttpStatusCode 408 as well?

Contributor

Reading the code now I see that HTTP408 is treated differently. That's fine.
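
To make the distinction concrete, here is a small sketch of how a predicate like `IsRetryableException` can drive a retry loop while anything it rejects (such as a 408) propagates to the caller for separate handling. This is an illustration under assumptions, not the health check's actual retry mechanism: `RetrySketch`, `ExecuteWithRetriesAsync`, and all parameter names are hypothetical, and the predicate is passed in so the example needs no Cosmos SDK reference.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical retry helper for illustration only.
public static class RetrySketch
{
    public static async Task<T> ExecuteWithRetriesAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        Func<Exception, bool> isRetryable,
        int maxAttempts,
        TimeSpan delayBetweenAttempts,
        CancellationToken cancellationToken = default)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        }

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await operation(cancellationToken);
            }
            catch (Exception ex) when (attempt < maxAttempts && isRetryable(ex))
            {
                // Retryable failure with attempts remaining: wait, then loop again.
                // Non-retryable failures, and the final attempt, skip this filter
                // and propagate to the caller unchanged.
                await Task.Delay(delayBetweenAttempts, cancellationToken);
            }
        }
    }
}
```

With this shape, a caller could pass the `IsRetryableException` predicate from the diff and handle a timeout in its own `catch` block after the retries, which matches the "treated differently" observation above.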


void LogAdditionalRetryableExceptionDetails(Exception exception)
{
if (exception is CosmosException cosmosException && cosmosException.StatusCode == HttpStatusCode.ServiceUnavailable)
Contributor

Should we also add the logging for HTTP449 and HTTP408?
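
One possible shape for that broader logging is sketched below. It is an assumption, not what the PR does: the diff only shows the 503 condition, and the method body, the log message template, the class name, and the `ShouldLogDiagnostics` helper are all hypothetical. It assumes the `Microsoft.Azure.Cosmos` and `Microsoft.Extensions.Logging.Abstractions` packages are referenced.

```csharp
using System;
using System.Net;
using Microsoft.Azure.Cosmos;
using Microsoft.Extensions.Logging;

// Hypothetical sketch; only the 503 check mirrors the diff under review.
public static class CosmosDiagnosticsLoggingSketch
{
    // Assumption: 408 and 449 are included here purely to illustrate the reviewer's question.
    private static bool ShouldLogDiagnostics(HttpStatusCode statusCode) =>
        statusCode == HttpStatusCode.ServiceUnavailable ||
        statusCode == HttpStatusCode.RequestTimeout ||
        statusCode == (HttpStatusCode)449;

    public static void LogAdditionalRetryableExceptionDetails(ILogger logger, Exception exception)
    {
        if (exception is CosmosException cosmosException && ShouldLogDiagnostics(cosmosException.StatusCode))
        {
            // The SDK diagnostics string carries per-request timing and routing detail,
            // which is usually what is needed to investigate availability errors.
            logger.LogWarning(
                cosmosException,
                "Cosmos DB request failed. StatusCode: {StatusCode}, SubStatusCode: {SubStatusCode}, ActivityId: {ActivityId}, Diagnostics: {Diagnostics}",
                (int)cosmosException.StatusCode,
                cosmosException.SubStatusCode,
                cosmosException.ActivityId,
                cosmosException.Diagnostics?.ToString());
        }
    }
}
```

Diagnostics strings can be large, so whether to log them for every 408 or 449 is a trade-off between log volume and debuggability.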

fhibf previously approved these changes Jan 17, 2025
@mikaelweave mikaelweave dismissed stale reviews from fhibf and LTA-Thinking via 7653e90 January 17, 2025 22:53
@mikaelweave
Contributor Author

/azp run

Azure Pipelines successfully started running 1 pipeline(s).

Labels
Azure API for FHIR, Bug
3 participants