
While storing vectors into pgvector: "struct.error: 'h' format requires -32768 <= number <= 32767" #988

Closed
ArneJanning opened this issue Feb 10, 2024 · 2 comments

ArneJanning commented Feb 10, 2024

Describe the bug
I'm using pgvector as the data store because Chroma doesn't work either (same problem as in #986). I'm ingesting PDFs via memgpt load directory..., chunking them, getting embeddings from Azure OpenAI, and storing them in pgvector.

It works fine with small PDFs, but bigger PDFs fail every time with this error message, where the code breaks down in pg8000:

 File "/home/arne/src/MemGPT/memgpt/cli/cli_load.py", line 222, in load_directory
    store_docs(str(name), docs, user_id)
  File "/home/arne/src/MemGPT/memgpt/cli/cli_load.py", line 139, in store_docs
    insert_passages_into_source(passages, name, user_id, config)
  File "/home/arne/src/MemGPT/memgpt/cli/cli_load.py", line 58, in insert_passages_into_source
    storage.insert_many(passages)
  File "/home/arne/src/MemGPT/memgpt/agent_store/db.py", line 478, in insert_many
    conn.execute(upsert_stmt)
  File "/home/arne/.cache/pypoetry/virtualenvs/pymemgpt-K6xMi7ln-py3.11/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1416, in execute
    return meth(
           ^^^^^
  File "/home/arne/.cache/pypoetry/virtualenvs/pymemgpt-K6xMi7ln-py3.11/lib/python3.11/site-packages/sqlalchemy/sql/elements.py", line 517, in _execute_on_connection
    return connection._execute_clauseelement(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arne/.cache/pypoetry/virtualenvs/pymemgpt-K6xMi7ln-py3.11/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1639, in _execute_clauseelement
    ret = self._execute_context(
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arne/.cache/pypoetry/virtualenvs/pymemgpt-K6xMi7ln-py3.11/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1848, in _execute_context
    return self._exec_single_context(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arne/.cache/pypoetry/virtualenvs/pymemgpt-K6xMi7ln-py3.11/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1988, in _exec_single_context
    self._handle_dbapi_exception(
  File "/home/arne/.cache/pypoetry/virtualenvs/pymemgpt-K6xMi7ln-py3.11/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 2347, in _handle_dbapi_exception
    raise exc_info[1].with_traceback(exc_info[2])
  File "/home/arne/.cache/pypoetry/virtualenvs/pymemgpt-K6xMi7ln-py3.11/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1969, in _exec_single_context
    self.dialect.do_execute(
  File "/home/arne/.cache/pypoetry/virtualenvs/pymemgpt-K6xMi7ln-py3.11/lib/python3.11/site-packages/sqlalchemy/engine/default.py", line 922, in do_execute
    cursor.execute(statement, parameters)
  File "/home/arne/.cache/pypoetry/virtualenvs/pymemgpt-K6xMi7ln-py3.11/lib/python3.11/site-packages/pg8000/legacy.py", line 254, in execute
    self._context = self._c.execute_unnamed(
                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/arne/.cache/pypoetry/virtualenvs/pymemgpt-K6xMi7ln-py3.11/lib/python3.11/site-packages/pg8000/core.py", line 701, in execute_unnamed
    self.send_BIND(NULL_BYTE, params)
  File "/home/arne/.cache/pypoetry/virtualenvs/pymemgpt-K6xMi7ln-py3.11/lib/python3.11/site-packages/pg8000/core.py", line 765, in send_BIND
    NULL_BYTE + statement_name_bin + h_pack(0) + h_pack(len(params))
                                                 ^^^^^^^^^^^^^^^^^^^
struct.error: 'h' format requires -32768 <= number <= 32767

Digging deeper, the problem seems to be that the insert_many method in ./memgpt/agent_store/db.py generates SQL like this:

INSERT INTO <table> (<columns> ...) VALUES (%s, %s, %s, ...)

But there is a limit on the number of %s placeholders per statement: as the traceback shows, pg8000's send_BIND packs the parameter count with h_pack, i.e. as a signed 16-bit integer, so at most 32,767 parameters are allowed. When a statement with 9 columns (id, user_id, text, doc_id, agent_id, data_source, embedding, embedding_dim, embedding_model, metadata_) is executed with a chunk size of maybe 20,000 rows, that becomes 180,000 parameters, which is more than 32,767, and pg8000 raises the error.
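The overflow can be reproduced in isolation. A minimal sketch, independent of pg8000 and MemGPT, using only the struct format 'h' from the traceback (the 9 * 20,000 figure is the example batch size from above):

```python
# Minimal reproduction of the failure mode: the Bind message packs the
# parameter count with struct format "h" (signed 16-bit integer),
# so any count above 32767 overflows.
import struct

ok = struct.pack("!h", 32767)      # at the upper bound: packs fine
print(ok)                          # b'\x7f\xff'

try:
    struct.pack("!h", 9 * 20_000)  # 180,000 parameters, as in the failing batch
except struct.error as exc:
    print(exc)                     # 'h' format requires -32768 <= number <= 32767
```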

So when generating the statements, we need to bring down the number of %s placeholders per statement, i.e. split the rows into batches.
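One way to batch is sketched below. This is illustrative, not the actual fix that landed in MemGPT; the `batches` helper and `rows` are hypothetical names. The idea is to compute the largest row count whose total placeholder count stays under pg8000's limit:

```python
# Sketch of the batching idea (illustrative, not the actual MemGPT code):
# cap each INSERT so rows * columns stays within pg8000's signed 16-bit limit.
PG8000_MAX_PARAMS = 32_767

def batches(rows, num_columns, max_params=PG8000_MAX_PARAMS):
    """Yield slices of `rows` whose total bind-parameter count fits the limit."""
    rows_per_batch = max(1, max_params // num_columns)
    for start in range(0, len(rows), rows_per_batch):
        yield rows[start:start + rows_per_batch]

# With 9 columns, each batch holds at most 32767 // 9 = 3640 rows
# (3640 * 9 = 32760 parameters), so 20,000 rows become 6 INSERT statements.
for batch in batches(list(range(20_000)), num_columns=9):
    pass  # execute one INSERT per batch instead of one giant statement
```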

Please describe your setup

  • How did you install memgpt?
    • git clone, poetry install
  • Describe your setup
    • What's your OS? Linux/WSL2
    • How are you running memgpt? Terminal/ZSH
Collaborator

sarahwooders commented Feb 12, 2024

@ArneJanning thanks for reporting this - could you please try the fix in #994 to see if it resolves your issue? You can also wait for the nightly package tomorrow which should include it.

If you get a chance, could you also please let me know how large the PDF file was, and if it was a folder of files or a single file? Then I can try to reproduce the error as well.

Author

ArneJanning commented Feb 14, 2024

@sarahwooders Thank you very much for your quick fix! I made my own little fix in #1004, which calculates and uses the optimal chunk size for pg8000 instead of hard-coding 1,000, giving us more performance.

I was loading scientific PDFs of about 1,000 pages each from a folder of files; it works without problems now.
