Using different models in evaluating model-graded eval and in generating the completion #1393

Open
LoryPack opened this issue Nov 3, 2023 · 5 comments

Comments

@LoryPack
Contributor

LoryPack commented Nov 3, 2023

Describe the feature or improvement you're requesting

build_eval.md says:

In general, the evaluation model and the model being evaluated don't have to be the same, though we will assume that they are here for ease of explanation.

However, I can't find anywhere how to do this. Is this currently implemented?

Additional context

No response

@LRudL

LRudL commented Nov 27, 2023

I recently struggled to get this to work too so I can share what I found.

This is currently implemented in the GitHub version of this repo, but not in the one on PyPI that you get by installing the library through a package manager: the PyPI releases are many months out of date and have gpt-3.5-turbo hard-coded as the grader.

Lines 29-32 in evals/elsuite/modelgraded/classify.py show you how this feature is implemented: the last completion_fn given is treated as the evaluation function.

Completion functions in turn can be specified in a comma-separated string. The logic for this is at evals/cli/oaieval.py lines 142-145.

Concretely, passing a string like "gpt-4,gpt-3.5-turbo" as the completion function works for me: gpt-4 acts as the completer and gpt-3.5-turbo grades the responses.
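To make the comma-separated form concrete, here is a minimal sketch that builds the command line for such a run. It assumes the usual `oaieval <completion_fn> <eval_name>` argument order; the helper `build_oaieval_command` and the eval name `my-modelgraded-eval` are hypothetical and only serve the illustration. Nothing is executed, the command is only printed.

```python
import shlex
from typing import List


def build_oaieval_command(completer: str, grader: str, eval_name: str) -> List[str]:
    """Return an argv list that asks oaieval to complete with one model and grade with another.

    Hypothetical helper: it only joins the two names with a comma, completer
    first and grader last, matching the ordering described in this thread.
    """
    for name in (completer, grader):
        # A comma inside a single name would be read as a separator.
        if not name.strip() or "," in name:
            raise ValueError(f"invalid completion function name: {name!r}")
    completion_fn_arg = f"{completer},{grader}"
    return ["oaieval", completion_fn_arg, eval_name]


# "my-modelgraded-eval" is a placeholder, not a real registered eval.
cmd = build_oaieval_command("gpt-4", "gpt-3.5-turbo", "my-modelgraded-eval")
print(shlex.join(cmd))
```

The order matters: the grader has to come last, since the last entry is the one picked up as the evaluating model.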

However, be warned that there seems to be a slight bug where modelgraded eval execution can hang for a long time in a way that other evals don't, and the hang seems unrelated to rate limits.

@LoryPack
Contributor Author

I opened a PR last week (#1418) that addresses this issue, but forgot to mention it here.

@LRudL

LRudL commented Nov 27, 2023

Regarding #1418: A new PR is not necessary for setting the evaluating model (though the feature really should be documented), since the full relevant lines are:

        # treat last completion_fn as eval_completion_fn
        self.eval_completion_fn = self.completion_fns[-1]
        if len(self.completion_fns) > 1:
            self.completion_fns = self.completion_fns[:-1]

If you pass several completion functions (as a comma-separated list) into completion_fns, the last one is treated as the evaluating model.
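The effect of those lines can be reproduced in isolation. The sketch below is not the library's code: `split_completion_fns` is a hypothetical stand-in that applies the same "last one grades" rule to a comma-separated string of names, and the whitespace stripping is an added assumption.

```python
from typing import List, Tuple


def split_completion_fns(spec: str) -> Tuple[List[str], str]:
    """Split a comma-separated spec into (completer names, grader name).

    Mirrors the quoted logic: the last entry is the grader, and it is
    dropped from the completers only when more than one entry was given.
    """
    names = [part.strip() for part in spec.split(",") if part.strip()]
    if not names:
        raise ValueError("expected at least one completion function name")
    grader = names[-1]
    completers = names[:-1] if len(names) > 1 else names
    return completers, grader


# Two names: the first completes, the second grades.
print(split_completion_fns("gpt-4,gpt-3.5-turbo"))
# One name: the same model both completes and grades.
print(split_completion_fns("gpt-4"))
```

With a single name the list is left untouched, so one model ends up both generating and grading, which matches the simplifying assumption quoted from build_eval.md.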

@LoryPack
Contributor Author

LoryPack commented Nov 27, 2023 via email

@LRudL

LRudL commented Nov 27, 2023

If you want to run the eval with modelA, and run the grading with modelB, then you can pass in the string "modelA,modelB" as the name of the completer.
