Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CORE-3186] schema registry json schema: array compatibility checks #20137

Conversation

andijcr
Copy link
Contributor

@andijcr andijcr commented Jun 25, 2024

Implements is_array_superset, the function that performs compatibility checks for "type": "array" json schemas.

Note1: "type": "array" can express proper arrays or tuples. Tuples have a validation logic similar to "type": "object" schemas, so the compatibility check, too, has a similar logic.

Note 2: for tuples, the syntax changed between Draft 4 and later drafts. To prevent erroneous validation, "prefixItems" keyword is added to the list of not implemented features.

Last commit fixed an edge case for "min__" properties. these have a default value of 0, so these two schemas
{"type": "string"} and {"type": "string", "minLength": 0 } should be considered compatible (equivalent, in fact)

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.1.x
  • v23.3.x
  • v23.2.x

Release Notes

  • none

"type": "array" can model tuples, and in doing so in Draft4 it uses
"additionalItems" the same way as "additionalProperties".

After Draft4, "items" (in tuple mode) works in the same way as
"additionalProperties".
this commit adds an enum that it's used to parametrize the field name
@andijcr andijcr force-pushed the feat/core-3186/schema_registry_array branch from ff21fa8 to e455b96 Compare June 27, 2024 16:26
the "uniqueItems" keyword makes most sense for array type, but it's
legal also for tuples, so it's checked in the common section of
is_array_superset
find if a schema is modelling an array by checking if "items" is a
schema or an array of schemas.

implement superset check if both are array schemas
tuple schema check is more involved:

each element in newer["items"] must be compatible with the element at
the same index of older["items"].

additionally, if newer["items"] has more elements than older["items"],
the excess elements must be compatible with older["additionalItems"]

it's a similar process as "type": "object"

adds "prefixItems" to the list of uninplemented keywords: this is from
drafts after Draft4, and tuple schema is implemented differntly so this
checks protects against that
the check would not work for numeric properties with a default value, in
the edge case where on one side the property is explicitly set with the
default value, and on the other side the property is left out.

for example
{"type": "string", "minLength": 0} and {"type": "string"} should be
considered equivalent

adds a parameter to set the default value if the property is not set
@andijcr andijcr force-pushed the feat/core-3186/schema_registry_array branch from e455b96 to 3c9596f Compare June 27, 2024 20:46
@andijcr andijcr changed the title [CORE-3186] Feat/core 3186/schema registry array [CORE-3186] schema registry json schema: array compatibility checks Jun 27, 2024
@andijcr andijcr marked this pull request as ready for review June 27, 2024 20:55
@andijcr andijcr requested review from BenPope and a team June 27, 2024 20:58
@andijcr
Copy link
Contributor Author

andijcr commented Jun 28, 2024

Failure is #20315

Copy link
Member

@BenPope BenPope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

// get value or default_value
auto get_value = [&](json::Value const& v) -> std::optional<double> {
auto it = v.FindMember(
json::Value{prop_name.data(), rapidjson::SizeType(prop_name.size())});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth creating a helper from std::string_view to StringRef to highlight the lifetime and hide the cast.

.reader_schema = R"({"type": "array", "minItems": 2, "maxItems": 10})",
.writer_schema = R"({"type": "array", "maxItems": 11})",
.reader_is_compatible_with_writer = false,
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is it worth splitting this into 2 tests, one that changes minItems and another that changes maxItems to show that neither change is allowed (independently of one another)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, i'll follow up

VPred&& value_predicate,
std::optional<double> default_value = std::nullopt) {
// get value or default_value
auto get_value = [&](json::Value const& v) -> std::optional<double> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this only fixes the bug for the numeric properties but the same bug exists for other calls to extract_property_and_gate_check. If I understand it correctly, the bug is that extract_property_and_gate_check doesn't handle the default value of the field.
For example, this test case would fail - I believe incorrectly:

{
    .reader_schema
    = R"({"type": "number", "minimum": 11, "exclusiveMinimum": false})",
    .writer_schema = R"({"type": "number", "minimum": 11})",
    .reader_is_compatible_with_writer = true,
  },

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, but extract_property_and_gate_check cannot handle this case cleanly.
"exclusiveMinimum" (or _maximum) is kind of tricky because it's a boolean with a default value in draft4 but a number without a default value from draft6.
Currently, is_numeric_superset doesn't know the schema dialect, so it cannot assign a meaningful default to the property.
We have json schema normalization on the roadmap that should fix this problem at the root, by explicitly setting all the implicit default values based on the schema dialect in use.
So if you are ok with this, i'd move this fix to a follow-up or to the json schema normalization PR

.reader_schema = R"({"type": "array", "uniqueItems": true})",
.writer_schema = R"({"type": "array"})",
.reader_is_compatible_with_writer = false,
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is it worth adding a test here for the default value handling?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, i'll followup

@BenPope BenPope added the area/schema-registry Schema Registry service within Redpanda label Jun 28, 2024
@michael-redpanda michael-redpanda merged commit 14c8aa4 into redpanda-data:dev Jun 28, 2024
17 of 21 checks passed
@andijcr andijcr deleted the feat/core-3186/schema_registry_array branch June 28, 2024 17:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/redpanda area/schema-registry Schema Registry service within Redpanda
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants