
Eval-running often hangs on last sample #1384

Open
sjadler2004 opened this issue Oct 17, 2023 · 4 comments
Labels
bug Something isn't working

Comments

@sjadler2004
Collaborator

Describe the bug

Relatively often, my eval run will reach, say, sample 199/200 and then hang on the last one for a very long time. It isn't clear to me why this occurs, but the hang sometimes persists for an hour or more, at which point I generally terminate the command from my CLI and try again.

To Reproduce

Unfortunately, I'm not sure how to make this happen every time. It does seem more likely to happen on larger sampling runs than on small ones, though.

Code snippets

No response

OS

macOS

Python version

Python v3.11

Library version

latest

sjadler2004 added the bug label Oct 17, 2023
@sjadler2004
Collaborator Author

Strangely, even after a KeyboardInterrupt, it often takes a while for my terminal to regain the ability to run normal commands after this error occurs. Not sure if that helps pin down the problem.

@LRudL commented Nov 28, 2023

I also have this issue. It is not about rate limits: it happens even on datasets that are definitely below the tokens-per-minute and requests-per-minute limits. However, it only seems to show up for large datasets.

An example of the traceback when I press Ctrl+C twice to exit after it has been stuck for a long time:

Traceback (most recent call last):
  File "/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/multiprocessing/pool.py", line 856, in next
    item = self._items.popleft()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
[...]
  File "/home/lrudl/[...]/evals/evals/cli/oaieval.py", line 223, in run
    result = eval.run(recorder)
  File "/home/lrudl/[...]/evals/evals/elsuite/modelgraded/classify.py", line 107, in run
    self.eval_all_samples(recorder, samples)
  File "/home/lrudl/[...]/evals/evals/eval.py", line 146, in eval_all_samples
    idx_and_result = list(tqdm(iter, total=len(work_items), disable=not show_progress))
  File "/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/site-packages/tqdm/std.py", line 1182, in __iter__
    for obj in iterable:
  File "/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/multiprocessing/pool.py", line 861, in next
    self._cond.wait(timeout)
  File "/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/threading.py", line 320, in wait
    waiter.acquire()
KeyboardInterrupt


^CException ignored in: <module 'threading' from '/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/threading.py'>
Traceback (most recent call last):
  File "/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/threading.py", line 1537, in _shutdown
    atexit_call()
  File "/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/concurrent/futures/thread.py", line 31, in _python_exit
    t.join()
  File "/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/threading.py", line 1096, in join
    self._wait_for_tstate_lock()
  File "/home/lrudl/miniconda3/envs/evalg2/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
KeyboardInterrupt: 

Often all I need to do is try again a few times for it to eventually run all the way to completion, but:
(1) This massively increases the token cost.
(2) It makes it difficult to run many evals in sequence with a script, because you have to supervise the run manually and get it unstuck many times, which is a major time cost for big eval projects.

@katsuya commented Nov 29, 2023

It seems that this issue is influenced by a bug in tqdm, as discussed at tqdm/tqdm#627. Applying the following patch significantly improved the situation.

diff -urN a/.venv/lib/python3.11/site-packages/evals/eval.py b/.venv/lib/python3.11/site-packages/evals/eval.py
--- a/.venv/lib/python3.11/site-packages/evals/eval.py  2023-11-29 12:55:58.214648049 +0900
+++ b/.venv/lib/python3.11/site-packages/evals/eval.py  2023-11-29 12:56:05.630671841 +0900
@@ -143,7 +143,8 @@
             else:
                 logger.info(f"Running in threaded mode with {threads} threads!")
                 iter = pool.imap_unordered(eval_sample, work_items)
-            idx_and_result = list(tqdm(iter, total=len(work_items), disable=not show_progress))
+            # idx_and_result = list(tqdm(iter, total=len(work_items), disable=not show_progress))
+            idx_and_result = list(iter)
         return [r for _, r in sorted(idx_and_result)]

     def get_samples(self):
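
If you would rather keep the progress bar than drop it, an alternative is to drive tqdm manually instead of wrapping the pool iterator in it, which is the pattern tqdm/tqdm#627 flags. This is only a sketch under that assumption, not the library's code; run_all here just mirrors the names in eval.py:

from multiprocessing.pool import ThreadPool

from tqdm import tqdm


def run_all(eval_sample, work_items, threads=10, show_progress=True):
    # Collect (idx, result) pairs while updating the bar by hand, instead
    # of wrapping pool.imap_unordered in tqdm as eval.py originally does.
    idx_and_result = []
    with ThreadPool(threads) as pool:
        with tqdm(total=len(work_items), disable=not show_progress) as pbar:
            for item in pool.imap_unordered(eval_sample, work_items):
                idx_and_result.append(item)
                pbar.update(1)
    return [r for _, r in sorted(idx_and_result)]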

@isc-Shiva-Gupta

I also had this issue. A workaround I found is to set the EVALS_THREADS_TIMEOUT environment variable when running the command. It specifies the time allowed for each input to the model to run. It can be used as follows:

EVALS_THREADS_TIMEOUT=20 oaieval completion_fn eval_name
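
The underlying pattern is just a bounded wait on each model call. Here is a minimal concurrent.futures sketch of that pattern, not the evals implementation; call_model and TIMEOUT_SECONDS are placeholders:

from concurrent.futures import ThreadPoolExecutor, TimeoutError

TIMEOUT_SECONDS = 20  # mirrors EVALS_THREADS_TIMEOUT=20 above


def call_model(prompt):
    ...  # placeholder for the actual API request


def run_with_timeout(prompt):
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(call_model, prompt)
    try:
        return future.result(timeout=TIMEOUT_SECONDS)
    except TimeoutError:
        return None  # give up on this sample instead of hanging the whole run
    finally:
        # wait=False so a stuck request doesn't block shutdown; the worker
        # thread itself still lingers until call_model returns
        executor.shutdown(wait=False)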

etr2460 pushed a commit that referenced this issue Mar 25, 2024
As has been brought up before (#1384, #1292,
#270), evals suffer from a hanging
issue, where an evaluation run will hang for a very long time (if not
indefinitely) at the end of a run (say, on the 99th sample out of 100).

This PR addresses the issue by replacing a seemingly redundant
single-threaded thread creation that happened on every request,
nested inside the already multi-threaded eval loop. My impression is
that this nested multithreading was causing overhead that resulted in
the observed hanging.

I had also noticed this hanging issue in `EVALS_SEQUENTIAL=1` mode
(where it no longer occurs at the end, but instead randomly in the
middle of the run).

I was able to identify the source of this issue through debugging print
statements that ultimately pointed to the `request_with_timeout`
function as the culprit.

We have tested the new `request_with_timeout` code on a fork where we
have run multiple new and pre-existing evals, including with 3rd party
solvers, and found no change in behaviour or errors, and a clear
improvement on the hanging issue.
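
A simplified sketch of the before/after shape that commit describes (my reading of it, not the actual PR diff; make_request is a placeholder):

import threading


def request_with_timeout_old(make_request, timeout):
    # Old shape: spawn a dedicated thread per request, inside an eval loop
    # that is already multi-threaded, and wait on it with a timeout.
    result = {}

    def worker():
        result["value"] = make_request()

    t = threading.Thread(target=worker)
    t.start()
    t.join(timeout)  # if the request outlives the timeout, the thread dangles
    return result.get("value")


def request_with_timeout_new(make_request, timeout):
    # New shape: no nested per-request thread; make the request directly
    # and let the HTTP client enforce the timeout.
    return make_request(timeout=timeout)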