AVRO-4026: [c++] Add new methods to CustomAttributes to allow non-string custom attributes in schemas #3266

Open: wants to merge 5 commits into base: main

@jhump jhump commented Dec 16, 2024

This adds new methods to CustomAttributes to allow setting non-string values. These other methods work with JSON-encoded strings.

Unlike #3064 and #3069, this change attempts to be backwards compatible. However, from reading more comments in pull requests, it looks like the "fix" I added (to escape the keys and values in custom attributes when printing to JSON) may actually be a compatibility issue since it seems that users were expected to have to escape string values if they contained any characters that would be escaped in JSON (including quotes). That seems like a really terrible API, and it also meant that the values would not round-trip correctly: reading a data file would not create custom attributes with these strings properly escaped, so later writing out data with the same schema would generate an invalid schema JSON document.
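
To make the escaping problem concrete, here is a minimal sketch. The helpers `naiveQuote` and `escapeJsonString` are hypothetical names used only for illustration and are not part of the Avro C++ API; they mimic the two behaviors described above (wrapping in quotes with no escaping versus escaping before printing).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: wraps the value in double quotes with no escaping,
// mimicking the legacy string-only behavior described above.
std::string naiveQuote(const std::string &value) {
    return "\"" + value + "\"";
}

// Hypothetical helper: escapes the characters that JSON requires to be
// escaped inside a string literal, then wraps the result in double quotes.
std::string escapeJsonString(const std::string &value) {
    static const char *hex = "0123456789abcdef";
    std::string out = "\"";
    for (unsigned char c : value) {
        switch (c) {
            case '"': out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n"; break;
            case '\r': out += "\\r"; break;
            case '\t': out += "\\t"; break;
            default:
                if (c < 0x20) {
                    // Remaining control characters use the \u00XX form.
                    out += "\\u00";
                    out += hex[c >> 4];
                    out += hex[c & 0x0f];
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    out += '"';
    return out;
}

// Usage: a value containing quotes yields invalid JSON when merely wrapped.
void quotingDemo() {
    const std::string value = "say \"hi\"";
    assert(naiveQuote(value) == "\"say \"hi\"\"");           // not valid JSON
    assert(escapeJsonString(value) == "\"say \\\"hi\\\"\""); // valid JSON
}
```

With the naive form, a caller has to pre-escape the value, and a value read back from a data file is no longer pre-escaped, which is the round-trip failure described above.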

In any event, this uses strings as the values even though it would be ideal if we could pass some sort of structured data as the value type. The ideal types (json::Entity and its accompanying json::Object and json::Array types) are defined in json/JsonDom.hh. But that header file is not part of the Avro include files distribution, which means we cannot #include it from CustomAttributes.hh, so it's a no-go. From a little history spelunking, I see that they indeed used to use a structured form which was simplified to strings in #1821, purely because these JSON header files aren't available to users in the Avro distribution.

Alternatives to using JSON-encoded strings that I considered:

  • Include the JSON types in the Avro distribution. I think this would be as simple as moving the header files out of lang/c++/impl and into lang/c++/include/avro. But then we expand the public API too much. This approach was already tried and rejected in AVRO-3601: C++ API header contains breaking include #1820.
  • Use a wrapper like std::any as the value type. This can be #included in CustomAttributes.hh but eliminates type safety in the signature. The only concrete accepted value would likely be json::Entity -- though we could make it more sophisticated and also allow the various value types sans wrapper: std::string, bool, int64_t, double, json::Array (aka std::vector<json::Entity>), and json::Object (aka std::map<std::string, json::Entity>). But this isn't really usable by external/user code, at least not for any composite values, since they aren't able to include the JSON headers and then produce valid values of json::Entity.
  • Convert from json::Entity to some other structured type that is defined in CustomAttributes.hh. This could possibly be a concrete std::variant that allows the various options (and could use std::monostate to represent JSON null values). This introduces non-trivial conversion code. From a performance perspective, it could likely be better than converting to/from strings, but it's a non-trivial amount of net-new code to maintain, which didn't feel right.
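
For reference, the std::variant alternative from the last bullet might look roughly like the sketch below. Every name here (`AttrValue`, `AttrArray`, `AttrObject`, `describe`) is hypothetical and nothing like it exists in the PR; it only shows the shape of such a type, with std::monostate standing in for JSON null. The conversion to and from json::Entity, which is the non-trivial part, is not shown.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <variant>
#include <vector>

// All names below are hypothetical and exist only for this illustration.
struct AttrValue;
using AttrArray = std::vector<AttrValue>;
using AttrObject = std::map<std::string, AttrValue>;

struct AttrValue {
    // std::monostate represents JSON null. Composite values are held through
    // shared_ptr so the variant never needs a complete recursive type.
    std::variant<std::monostate, bool, int64_t, double, std::string,
                 std::shared_ptr<AttrArray>, std::shared_ptr<AttrObject>>
        data;
};

// Returns a type name for whichever alternative the value currently holds.
std::string describe(const AttrValue &v) {
    struct Visitor {
        std::string operator()(std::monostate) const { return "null"; }
        std::string operator()(bool) const { return "boolean"; }
        std::string operator()(int64_t) const { return "long"; }
        std::string operator()(double) const { return "double"; }
        std::string operator()(const std::string &) const { return "string"; }
        std::string operator()(const std::shared_ptr<AttrArray> &) const { return "array"; }
        std::string operator()(const std::shared_ptr<AttrObject> &) const { return "object"; }
    };
    return std::visit(Visitor{}, v.data);
}

// Usage: build a small composite value and inspect it.
void variantDemo() {
    auto arr = std::make_shared<AttrArray>();
    arr->push_back(AttrValue{int64_t{1}});
    arr->push_back(AttrValue{}); // JSON null
    assert(describe(AttrValue{arr}) == "array");
    assert(describe(arr->at(1)) == "null");
}
```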

What is the purpose of the change

Fixes exception from C++ library when compiling schemas with non-string custom attributes.

Verifying this change

This change added tests and can be verified as follows:

  • Updates existing unit test cases to use the new method and use non-string custom attributes.

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? not documented

@github-actions github-actions bot added the C++ Pull Requests for C++ binding label Dec 16, 2024
…deprecated versions (except one case, testing the deprecated ones); add test cases specifically for deprecated methods
@martin-g
Member

@wgtmac Do you have an opinion about these changes ?
There are two more PRs about the same issue (#3064 and #3069). We are trying to find the best of them.

@wgtmac
Member

wgtmac commented Dec 19, 2024

I'm not fully familiar with the context (and the code) yet. I agree that keeping backward compatibility is the bottom line. However, from the JIRA issue it seems that non-string attributes were supported once but broke after other issues were fixed. Shouldn't we just fix it to revert the unexpected behavior? Apart from the alternatives listed above, is it possible to add an option to CustomAttributes to control how it deals with non-string values?

@jhump
Author

jhump commented Dec 19, 2024

Apart from the alternatives listed above, is it possible to add an option to CustomAttributes to control how it deals with non-string values?

Most certainly. I hadn't considered it, but it would be easy to change to that. Would you like me to apply that suggestion in this PR? For back-compat, the "default" mode would have to be the awkward (arguably broken) string-only mode, that just wraps the input in double-quotes (with no other escaping). But a new constructor could be added to enable a mode where the API expects well-formed JSON-encoded data (so string values must be quoted and escaped).

@jhump
Author

jhump commented Jan 6, 2025

@wgtmac, I've updated this branch so that instead of deprecating the old methods and adding new ones, there's now a constructor flag to indicate whether values are strings or arbitrary JSON values (in which case string values must be quoted).

In updating tests, I realized that my code wasn't correctly validating that string values were correct -- if it had a botched/unescaped quote, it wouldn't be flagged as an invalid value. So I added some stuff in the json folder so that loadEntity makes sure that it exhausts all input and fails if there appears to be a subsequent value in the stream. This also caught an existing issue in a schema string literal in SchemaTests.cc.
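
The "exhaust all input" check can be illustrated for the string case with a small sketch. `isSingleJsonString` is a hypothetical helper and is not Avro's loadEntity; it only shows why a value with a botched quote has to be rejected once the parser sees leftover input after the first complete value.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not Avro's loadEntity): returns true only if `text`
// is exactly one JSON string literal, optionally surrounded by whitespace.
bool isSingleJsonString(const std::string &text) {
    std::size_t i = 0;
    const std::size_t n = text.size();
    auto skipWs = [&] {
        while (i < n && (text[i] == ' ' || text[i] == '\t' ||
                         text[i] == '\n' || text[i] == '\r')) {
            ++i;
        }
    };
    skipWs();
    if (i >= n || text[i] != '"') return false;
    ++i;
    bool closed = false;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        if (c == '"') { closed = true; ++i; break; }
        if (c < 0x20) return false; // raw control characters are not allowed
        if (c == '\\') {
            ++i;
            if (i >= n) return false;
            const char e = text[i];
            if (e == 'u') {
                // \u must be followed by exactly four hex digits.
                if (i + 4 >= n) return false;
                for (int k = 1; k <= 4; ++k) {
                    if (!std::isxdigit(static_cast<unsigned char>(text[i + k]))) return false;
                }
                i += 4;
            } else if (std::string("\"\\/bfnrt").find(e) == std::string::npos) {
                return false; // unknown escape sequence
            }
        }
        ++i;
    }
    if (!closed) return false;
    skipWs();
    // The key step: anything left after the first value means invalid input.
    return i == n;
}

// Usage: an unescaped inner quote leaves trailing input and is rejected.
void trailingInputDemo() {
    assert(isSingleJsonString("\"foo\""));
    assert(!isSingleJsonString("\"foo\"bar\""));
}
```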

@jhump
Author

jhump commented Jan 6, 2025

@wgtmac, if you don't like the look of this API -- using a separate constructor with a bool flag -- let me know. I'm happy to roll back that commit or continue to iterate on this.

@wgtmac
Member

wgtmac commented Jan 7, 2025

@jhump I'm not a maintainer so my opinion might not be thorough.

@martin-g @thiru-mg WDYT?

@martin-g
Member

martin-g commented Jan 7, 2025

I am not a C++ dev/user, so my opinion is even more irrelevant :-)
My 2c:

  • the PR is getting bigger, and since not many C++ users are reviewing or commenting on it, its chances of being merged get smaller :-/
  • in Rust and Java I prefer to use a simple enum instead of boolean parameter. 1) it is more clear to read xyz(Enum.Variant1) instead of xyz(false); 2) it is easy to add more logic by adding more variants.
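
In C++ terms, the enum suggestion could look roughly like the sketch below. The names `ValueMode` and `Attributes` are hypothetical, since the identifiers actually used in the PR are not shown in this conversation; the point is only that `Attributes(ValueMode::Json)` reads more clearly at the call site than a bare `true`, and that further modes can be added later.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical names: the identifiers used in the PR are not shown here.
enum class ValueMode {
    String, // legacy mode: values are raw strings, wrapped in quotes on output
    Json    // values are already JSON-encoded (strings must be quoted and escaped)
};

class Attributes {
public:
    // The default stays on the legacy mode for backward compatibility.
    explicit Attributes(ValueMode mode = ValueMode::String) : mode_(mode) {}

    void add(const std::string &name, const std::string &value) {
        attrs_[name] = value;
    }

    // Renders one attribute as `"name": value` according to the mode.
    // Throws std::out_of_range if the attribute does not exist.
    std::string print(const std::string &name) const {
        const std::string &value = attrs_.at(name);
        const std::string rendered =
            mode_ == ValueMode::String ? "\"" + value + "\"" : value;
        return "\"" + name + "\": " + rendered;
    }

private:
    ValueMode mode_;
    std::map<std::string, std::string> attrs_;
};

// Usage: the mode is visible at the call site.
void modeDemo() {
    Attributes json(ValueMode::Json);
    json.add("max", "10");
    assert(json.print("max") == "\"max\": 10");
}
```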

@jhump
Author

jhump commented Jan 7, 2025

I am not a C++ dev/user, so my opinion is even more irrelevant :-)

So who is the right person that we should tag, to review and approve this?

I prefer to use a simple enum instead of boolean parameter.

Sure, I can do that.

@martin-g
Member

martin-g commented Jan 7, 2025

So who is the right person that we should tag, to review and approve this?

No one! The Avro team is notified anyway.

If no one from the Avro team merges the PR within a reasonable time, I could help with the merge, but only after at least two approvals from the Avro community.

@wgtmac
Member

wgtmac commented Jan 8, 2025

I can help review it :)

@jhump
Author

jhump commented Jan 8, 2025

@wgtmac, what do you think of the current approach? If I just replace the bool constructor parameter with an enum (as mentioned above), would that be pretty close to an acceptable patch? Any other concerns or feedback?

@wgtmac
Member

wgtmac commented Jan 9, 2025

@jhump Yes, I think so.

@jhump
Author

jhump commented Jan 9, 2025

@wgtmac, I've updated the PR with that suggestion. Please take a look. Thanks in advance!

Labels
C++ Pull Requests for C++ binding

3 participants