Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve LLM model tool annotation #659

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

dghirardo
Copy link
Contributor

@dghirardo dghirardo commented Jun 5, 2024

Issue #637

This pull request implements the tool_tailor gem (https://github.com/kieranklaassen/tool_tailor) to automate the generation of tool annotations.

Changes:

  • All previously used JSON files for tool annotations have been removed.
  • Tool-specific directories have been eliminated, resulting in a cleaner and more straightforward directory structure.
  • The :ANNOTATIONS_PATH constant has been replaced with :FUNCTIONS, an array of symbols corresponding to the tool's functions.
  • Methods for validating constants (:NAME, :FUNCTIONS) have been implemented to ensure consistency and correctness.

@dghirardo dghirardo changed the title Feature/auto tool annotations Improve LLM model tool annotation Jun 5, 2024
self.class.const_get(:ANNOTATIONS_PATH)
)
)
functions.map do |function|
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about just using all of the public_instance_methods?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that could be a good idea, I had already considered it. However, I noticed the execute_search method in google_search.rb is not private, even though it's not meant to be used directly (it wasn't present in the old JSON annotations file).

This error could easily be repeated, so I thought explicitly defining the methods would make it difficult to introduce errors and provide a clear overview of available tools. What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dghirardo In the case of google_search.rb specifically, I'm not even convinced that we need the def self.execute_search method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, after looking at the code, it seems we don't need it. So, should we use public_instance_methods or keep the FUNCTIONS array?

@andreibondarev
Copy link
Collaborator

andreibondarev commented Jun 9, 2024

Before:

irb(main):003> Langchain::Tool::NewsRetriever.new(api_key: ENV["NEWS_API_KEY"]).to_openai_tools
=>
[{"type"=>"function",
  "function"=>
   {"name"=>"news_retriever__get_everything",
    "description"=>"News Retriever: Search through millions of articles from over 150,000 large and small news sources and blogs.",
    "parameters"=>
     {"type"=>"object",
      "properties"=>
       {"q"=>
         {"type"=>"string",
          "description"=>
           "Keywords or phrases to search for in the article title and body. Surround phrases with quotes (\") for exact match. Alternatively you can use the AND / OR / NOT keywords, and optionally group these with parenthesis. Must be URL-encoded."},
        "search_in"=>{"type"=>"string", "description"=>"The fields to restrict your q search to.", "enum"=>["title", "description", "content"]},
        "sources"=>
         {"type"=>"string",
          "description"=>"A comma-seperated string of identifiers (maximum 20) for the news sources or blogs you want headlines from. Use the /sources endpoint to locate these programmatically or look at the sources index."},
        "domains"=>{"type"=>"string", "description"=>"A comma-seperated string of domains (eg bbc.co.uk, techcrunch.com, engadget.com) to restrict the search to."},
        "exclude_domains"=>{"type"=>"string", "description"=>"A comma-seperated string of domains (eg bbc.co.uk, techcrunch.com, engadget.com) to remove from the results."},
        "from"=>{"type"=>"string", "description"=>"A date and optional time for the oldest article allowed. This should be in ISO 8601 format."},
        "to"=>{"type"=>"string", "description"=>"A date and optional time for the newest article allowed. This should be in ISO 8601 format."},
        "language"=>{"type"=>"string", "description"=>"The 2-letter ISO-639-1 code of the language you want to get headlines for.", "enum"=>["ar", "de", "en", "es", "fr", "he", "it", "nl", "no", "pt", "ru", "sv", "ud", "zh"]},
        "sort_by"=>{"type"=>"string", "description"=>"The order to sort the articles in.", "enum"=>["relevancy", "popularity", "publishedAt"]},
        "page_size"=>{"type"=>"integer", "description"=>"The number of results to return per page (request). 5 is the default, 100 is the maximum."},
        "page"=>{"type"=>"integer", "description"=>"Use this to page through the results if the total results found is greater than the page size."}}}}},

After:

irb(main):013> Langchain::Tool::NewsRetriever.new(api_key: ENV["NEWS_API_KEY"]).to_openai_tools
=>
[{"type"=>"function",
  "function"=>
   {"name"=>"news_retriever__get_everything",
    "description"=>"Retrieve all news",
    "parameters"=>
     {"type"=>"object",
      "properties"=>
       {"q"=>{"type"=>"string", "description"=>"Keywords or phrases to search for in the article title and body."},
        "search_in"=>{"type"=>"string", "description"=>"The fields to restrict your q search to. The possible options are: title, description, content."},
        "sources"=>
         {"type"=>"string",
          "description"=>"A comma-seperated string of identifiers (maximum 20) for the news sources or blogs you want headlines from. Use the /sources endpoint to locate these programmatically or look at the sources index."},
        "domains"=>{"type"=>"string", "description"=>"A comma-seperated string of domains (eg bbc.co.uk, techcrunch.com, engadget.com) to restrict the search to."},
        "exclude_domains"=>{"type"=>"string", "description"=>"A comma-seperated string of domains (eg bbc.co.uk, techcrunch.com, engadget.com) to remove from the results."},
        "from"=>{"type"=>"string", "description"=>"A date and optional time for the oldest article allowed. This should be in ISO 8601 format."},
        "to"=>{"type"=>"string", "description"=>"A date and optional time for the newest article allowed. This should be in ISO 8601 format."},
        "language"=>{"type"=>"string", "description"=>"The 2-letter ISO-639-1 code of the language you want to get headlines for. Possible options: ar, de, en, es, fr, he, it, nl, no, pt, ru, se, ud, zh."},
        "sort_by"=>{"type"=>"string", "description"=>"The order to sort the articles in. Possible options: relevancy, popularity, publishedAt."},
        "page_size"=>{"type"=>"integer", "description"=>"The number of results to return per page. 20 is the API's default, 100 is the maximum. Our default is 5."},
        "page"=>{"type"=>"integer", "description"=>"Use this to page through the results."}},
      "required"=>[]}}},

Upon a quick glance -- the enum: param is missing. I wonder if we can annotate enum options in YARD docs 🤔

@andreibondarev
Copy link
Collaborator

@dghirardo Btw -- @palladius told me that you guys met at the RubyDay! 🫶

@dghirardo
Copy link
Contributor Author

@andreibondarev I am currently working on a pull request for the tool_tailor gem to add support for the enum parameter. What do you think about the proposed solution?

@dghirardo
Copy link
Contributor Author

Hi @andreibondarev, tool_tailor has merged my pull request, so now the enum property is supported, updating the gem. However, I noticed another issue. When you have a parameter that is a collection (like an array or a hash), you can't describe its internal structure.

Example (Tavily tool):

# @param exclude_domains [Array<String>] A list of domains to specifically exclude from the search results. Default is None, which doesn't exclude any domains.

While [Array] is supported, [Array<String>] is not.
So, it’s not possible to:

  • Define that the array items must be Strings.
  • Describe extra properties for the String values, such as enum.

You can only use the description field.

Even if we find a solution, I think nested parameters with multiple levels are not easy to describe with YARD comments.

@andreibondarev
Copy link
Collaborator

andreibondarev commented Jun 28, 2024

Hi @andreibondarev, tool_tailor has merged my pull request, so now the enum property is supported, updating the gem. However, I noticed another issue. When you have a parameter that is a collection (like an array or a hash), you can't describe its internal structure.

Example (Tavily tool):

# @param exclude_domains [Array<String>] A list of domains to specifically exclude from the search results. Default is None, which doesn't exclude any domains.

While [Array] is supported, [Array<String>] is not. So, it’s not possible to:

  • Define that the array items must be Strings.
  • Describe extra properties for the String values, such as enum.

You can only use the description field.

Even if we find a solution, I think nested parameters with multiple levels are not easy to describe with YARD comments.

Hmm... I think we may want to consider some sort of a decorator pattern, a la:

llm_callable :get_everything,
  description: "News Retriever: Search through millions of articles from over 150,000 large and small news sources and blogs",
  properties: {
    q: {
      type: :string,
      description: "Keywords or phrases to search for in the article title and body. Surround phrases with quotes (\") for exact match. Alternatively you can use the AND / OR / NOT keywords, and optionally group these with parenthesis. Must be URL-encoded."
  }

def get_everything(q:, ...)
  ...

The really tricky thing here is figuring out a developer-friendly interface.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants