[Bug] Inconsistent data types from extract_properties #234

Open
baitsguy opened this issue Feb 7, 2024 · 1 comment
Comments

@baitsguy
Contributor

baitsguy commented Feb 7, 2024

Describe the bug
When using gpt-3.5 in the extract_properties transform, the resulting 'entity' object doesn't always have the same schema: the types of the fields within 'entity' can differ across documents. Indexing into OpenSearch then fails, because it expects each field to have a consistent datatype. The same inconsistency occurs in extract_schema, but it causes no issues there, since extract_schema is a single task whose results are applied to all documents.
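
To make the mismatch easier to spot before indexing, here is a minimal sketch that compares field types across a batch of 'entity' objects. The helper name `find_type_conflicts` and the sample entities are hypothetical and not part of sycamore; the sketch assumes each entity is available as a plain Python dict. It flags every Python-level type difference, some of which OpenSearch may tolerate.

```python
from collections import defaultdict


def find_type_conflicts(entities):
    """Return fields that appear with more than one Python type across entities."""
    seen = defaultdict(set)
    for entity in entities:
        for field, value in entity.items():
            seen[field].add(type(value).__name__)
    # Keep only the fields observed with at least two distinct types.
    return {field: sorted(types) for field, types in seen.items() if len(types) > 1}


# Hypothetical 'entity' objects, as two LLM calls might return them.
entities = [
    {"title": "Paper A", "authors": ["Ann", "Bob"], "year": 2023},
    {"title": "Paper B", "authors": "Cy", "year": "2024"},
]

conflicts = find_type_conflicts(entities)
print(conflicts)
```

Running a check like this on the extracted records would show which fields differ between documents, which may help narrow down the failing index request.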

To Reproduce
Steps to reproduce the behavior:

  1. Run the metadata-extraction notebook at https://github.com/aryn-ai/sycamore/blob/a39eced00c884e0f50f33eefa7b009b5f9923249/notebooks/metadata-extraction.ipynb a few times
  2. Some of the runs fail transiently

Expected behavior
extract_properties should produce an 'entity' object in each record, with every entity object having the same set of fields and the same types.
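
One possible way to get that behavior, sketched here as a post-processing step rather than as a fix inside extract_properties itself: coerce every entity to a fixed schema before indexing. The `coerce_entity` helper and `SCHEMA` mapping are hypothetical names, and the choice to fill missing or unconvertible values with None is an assumption.

```python
def coerce_entity(entity, schema):
    """Return a dict with exactly the fields in `schema`, each cast to its target type."""
    coerced = {}
    for field, target in schema.items():
        value = entity.get(field)
        if value is None:
            # Missing fields are kept as None so every entity has the same keys.
            coerced[field] = None
        elif target is list:
            # Wrap scalars so the field is always a list.
            coerced[field] = value if isinstance(value, list) else [value]
        else:
            try:
                coerced[field] = target(value)
            except (TypeError, ValueError):
                # Assumption: unconvertible values are dropped rather than raised.
                coerced[field] = None
    return coerced


# Hypothetical target schema for the 'entity' object.
SCHEMA = {"title": str, "authors": list, "year": int}

raw = {"title": "Paper B", "authors": "Cy", "year": "2024", "extra": 1}
clean = coerce_entity(raw, SCHEMA)
print(clean)
```

Fields outside the schema are discarded here, so every record ends up with the same keys; whether dropping them is acceptable depends on the use case.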

Screenshots
n/a

Desktop (please complete the following information):

  • Unrelated

Smartphone (please complete the following information):

  • Unrelated

Additional context
Stack trace attached
datatype_error.txt

@HenryL27
Collaborator

HenryL27 commented Jun 7, 2024

@baitsguy have we fixed this?
