feat: update evaluation flow sample for abstractive summarization with g-eval method to enable GPT-4-Turbo #3317
+639
−143
Description
This PR updates an evaluation flow example that was introduced by #2037. The example previously supported only GPT-4, because GPT-4-Turbo performed poorly with the previous approach. This update adds GPT-4-Turbo, meta-evaluates it, and changes the implementation from a sampling-based approach to a weighted average over probabilities. According to the meta-evaluation results, the new implementation outperforms the previous one. The new approach also reduces the estimated cost of evaluation from $6.19 to $1.32 per 100 documents.
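To illustrate the difference between the two scoring strategies, here is a minimal sketch. It is not the code from this PR: the function names and the input shape (a dict mapping candidate tokens to log-probabilities, as might be read from a model's top log-probabilities for the score token) are assumptions made for illustration, as is the 1 to 5 score range.

```python
import math


def weighted_score(top_logprobs, min_score=1, max_score=5):
    """Probability-weighted average over candidate score tokens.

    `top_logprobs` is a hypothetical dict of token -> log-probability
    for the position where the evaluator emits its score.
    """
    valid = {str(s) for s in range(min_score, max_score + 1)}
    probs = {}
    for token, logprob in top_logprobs.items():
        token = token.strip()
        if token in valid:
            # Accumulate probability mass per score (e.g. "5" and " 5").
            probs[int(token)] = probs.get(int(token), 0.0) + math.exp(logprob)
    total = sum(probs.values())
    if total == 0:
        raise ValueError("no valid score token among the candidates")
    # Renormalize over the valid score tokens only.
    return sum(score * p for score, p in probs.items()) / total


def sampled_score(samples):
    """Sampling-based estimate: plain mean of scores from repeated calls."""
    return sum(samples) / len(samples)


if __name__ == "__main__":
    logprobs = {"4": math.log(0.6), "5": math.log(0.3), "3": math.log(0.1)}
    print(weighted_score(logprobs))
    print(sampled_score([4, 4, 5, 3]))
```

Under these assumptions, the weighted variant needs one model call per document and dimension, while the sampling variant needs many, which would be consistent with the lower estimated cost reported above.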
The previous approach is still kept under the sampling_based directory to provide backward compatibility with the GPT-4 evaluator and a reference for meta-evaluation.
All Promptflow Contribution checklist:
General Guidelines and Best Practices
Testing Guidelines