[FEATURE] [💎 $400 bounty] Integrate SearchApi as a WebSearchEngine and as a Tool #1132

Open
mark-searchapi opened this issue May 20, 2024 · 21 comments · May be fixed by #1215 or #1216
Labels
enhancement New feature or request P3 Medium priority

Comments

@mark-searchapi

Feature Overview

Integrate SearchApi as a WebSearchEngine and as a Tool for function calling.

Requirements

Adhere to langchain4j contribution guidelines

Related work

Related issues:

Existing SearchApi integrations

Design considerations

SearchApi supports not only Google Search but also 30+ other APIs, such as YouTube Search, Transcripts, and Bing Search (with similar JSON response keys). In a statically typed language, it is probably better to have separate tools for separate APIs, but the implementation could be flexible enough to extend SearchApi and easily adapt to other engines. All engines use the same HTTP GET request; the only differences are the parameters they accept and how you want the response to be parsed.
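
To illustrate the "same GET request, different parameters" point, here is a minimal sketch in plain Java. The class name, the method name, and the base URL are assumptions made for illustration and are not part of langchain4j; the API key is deliberately left out, and the exact parameter names for each engine should be checked against the SearchApi docs.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of langchain4j or any SearchApi SDK.
final class SearchApiRequestSketch {

    // Assumed base URL; verify it against the SearchApi documentation.
    static final String BASE_URL = "https://www.searchapi.io/api/v1/search";

    // Builds the GET URI for any engine: only the parameter map differs.
    static URI buildUri(String engine, Map<String, String> engineParams) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("engine", engine);
        params.putAll(engineParams);
        String query = params.entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining("&"));
        return URI.create(BASE_URL + "?" + query);
    }

    // Form-encodes a single key or value (spaces become '+').
    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }
}
```

With this shape, switching engines means passing a different engine name and parameter map; how the response is parsed would still be decided per engine.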

Other notes

  • Ping us if you need more search credits when implementing.
  • Someone from the SearchApi Team will help review the functionality.

Bounty

There is a $400 bounty on this awarded by SearchApi to the community.

In the event of multiple attempts, the one merged by the maintainers of the repository will receive the bounty.

@mark-searchapi mark-searchapi added the enhancement New feature or request label May 20, 2024
@algora-pbc

💎 SearchApi is offering a $400 bounty for this issue
👉 Got a pull request resolving this? Claim the bounty by adding @algora-pbc /claim #1132 in the PR body and joining algora.io

@abhishek818

@mark-searchapi can I get this assigned?

@mark-searchapi
Author

Given the speed at which we're moving, we don't assign issues or "give" issues to anyone.

@abhishek818 In the event of multiple attempts, the one merged by the maintainers of the repository will receive the bounty.

@mark-searchapi mark-searchapi changed the title [FEATURE] Integrate SearchApi as a WebSearchEngine and as a Tool [FEATURE] [💎 $400 bounty] Integrate SearchApi as a WebSearchEngine and as a Tool May 23, 2024
@jemiluv8

Hello @mark-searchapi, I just started work on the SearchApi integration as a search engine and tool.
If anyone else (and @abhishek818) has already started work on this, please let me know as soon as you can.

@langchain4j
Owner

Please adhere to these two implementations:
https://github.com/langchain4j/langchain4j/tree/main/web-search-engines

@jemiluv8

Yes of course.

@ayewo

ayewo commented May 24, 2024

If anyone else ( and ...) has already started work on this please let me know as soon as you can.

@jemiluv8
I have, but I've not publicly added an "attempt" yet. Since you've also started on the issue, what difference does it make whether you know or not?

@jemiluv8

@ayewo. Just wanted to know so I may halt my attempt since I just started.
I'm aborting my attempt now so as to not duplicate your efforts.
Happy hunting ...

@Ahmadkhan02

Hi, was setting this up. Just confused as to how I am supposed to set up env like OPENAI_API_KEY.

@Ahmadkhan02 Ahmadkhan02 mentioned this issue May 26, 2024
@Ahmadkhan02

Hi, was setting this up. Just confused as to how I am supposed to set up env like OPENAI_API_KEY.

I have raised a draft PR with my current work. I need help with this to continue. Also, please assign this to me if it looks satisfactory.

@ayewo

ayewo commented May 26, 2024

@Ahmadkhan02

The comment by @mark-searchapi above clearly says:

Given the speed at which we're moving, we don't assign issues or "give" issues to anyone.

@langchain4j
Owner

@abhishek818 why do you need an OpenAI key? In any case, you can register with OpenAI and get one there: https://github.com/langchain4j/langchain4j?tab=readme-ov-file#how-to-get-an-api-key

@ayewo

ayewo commented May 31, 2024

Hi @mark-searchapi
Looking at the docs for the Google Search API, it seems the API will return JSON to match the type of search query. In other words, the endpoint returns differently shaped JSON depending on whether the search is for a well-known fact, a local business, a product search etc.

The docs themselves list about 34 API examples, which I pulled out into the list of JSON elements below:

  • "knowledge_graph"#"knowledge_graph_type": "Kp3 verticals"
  • "knowledge_graph"#"knowledge_graph_type": "Local nav"
  • "knowledge_graph"#"knowledge_graph_type": "Shopping entity"
  • "knowledge_graph"#"knowledge_graph_type": "Company"
  • "answer_box"#"type": "organic_result"
  • "sports_results"#"type": "team_sport"
  • "answer_box"#"type": "open_hours"
  • "answer_box"#"type": "population_graph"
  • "answer_box"#"type": "finance_results"
  • "answer_box"#"type": "currency_converter"
  • "answer_box"#"type": "local_time"
  • "weather_result"
  • "local_map"
  • "job_results"
  • "salary_estimates"
  • "shopping_ads"
  • "ads"
  • "local_ads"
  • "inline_shopping"
  • "from_sources_across_the_web"
  • "discussions_and_forums"
  • "related_questions"
  • "questions_and_answers"
  • "explore_brands"
  • "about_this_result"
  • "things_to_do"
  • "events"
  • "scholarly_articles"
  • "inline_videos"
  • "inline_tweets"
  • "inline_images"
  • "inline_recipes"
  • "top_stories"
  • "courses"

Since the task here is to "Integrate SearchApi as a WebSearchEngine and as a Tool for function calling", the common format between the two classes is the WebSearchResults object.

The WebSearchResults object in turn is hydrated from 3 things from a search:

"list of organic search results, information about the search, and pagination information"

Can you clarify which of the JSON elements are in scope and which ones are to be skipped for this issue?

@zambrinf
Contributor

zambrinf commented Jun 1, 2024

Hey @mark-searchapi, it looks like the num parameter for pagination is not working properly: the endpoint always returns 4 results when num is set to 5. Could you take a look? This happens when using the query "LangChain4j", which is being tested in WebSearchEngineIT.

@Ahmadkhan02

How many tests should one write for this?
Also, is validation of parameters based on the selected 'engine' required?

@ayewo ayewo linked a pull request Jun 2, 2024 that will close this issue
zambrinf added a commit to zambrinf/langchain4j that referenced this issue Jun 2, 2024
@zambrinf zambrinf linked a pull request Jun 2, 2024 that will close this issue
@mark-searchapi
Author

@ayewo I have reviewed the Langchain4j codebase and can provide the following insights:

WebSearchEngine as ContentRetriever Integration (for RAG use case)

After PR #642, the core implementation seems to provide:

  • Swappable providers (Google Search, SearchApi, Bing and others)
  • Automatic integration with ContentRetriever
  • Best for RAG use cases

Cons:

  • Very basic integration as a tool (bare minimum MVP)

I think the most important part of WebSearchResults is WebSearchOrganicResult. To construct it, we iterate through the organic_results and take title, snippet and link. In this particular use case, none of the other fields seem to matter.

SearchApiWebSearchEngine should allow configuring the engine to easily swap between Google, Bing, Google News, Baidu (we will release this week) and later other search engines. The JSON structure won't change here and will always stay the same on SearchApi side as:

{
  "organic_results": [
    {
      "title": "String",
      "link": "String",
      "snippet": "String",
      ...
    },
    ...
  ]  
}

If you need error handling, rely on the error JSON key presence in the response.
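
As a rough sketch of the mapping and error check described above, the following assumes the response body has already been parsed by a JSON library into nested `Map`/`List` objects. The class and the `OrganicResult` record are hypothetical stand-ins (not langchain4j's `WebSearchOrganicResult`), and the code needs Java 16 or later for records and pattern matching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical mapping helper; the record is a stand-in, not langchain4j's WebSearchOrganicResult.
final class OrganicResultsSketch {

    record OrganicResult(String title, String link, String snippet) {}

    // Expects the response already parsed into nested Maps and Lists.
    static List<OrganicResult> toOrganicResults(Map<String, Object> response) {
        // Per the thread: the presence of an "error" key signals a failed request.
        if (response.containsKey("error")) {
            throw new IllegalStateException("SearchApi error: " + response.get("error"));
        }
        List<OrganicResult> results = new ArrayList<>();
        Object raw = response.get("organic_results");
        if (raw instanceof List<?> items) {
            for (Object item : items) {
                if (item instanceof Map<?, ?> entry) {
                    // Only title, link and snippet are used; other fields are ignored.
                    results.add(new OrganicResult(
                            asString(entry.get("title")),
                            asString(entry.get("link")),
                            asString(entry.get("snippet"))));
                }
            }
        }
        return results;
    }

    // Null-safe conversion so missing fields stay null instead of throwing.
    private static String asString(Object value) {
        return value == null ? null : value.toString();
    }
}
```

A response without an `organic_results` key yields an empty list here; whether that should instead be treated as an error is a design choice left open.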

After implementing SearchApi as a core WebSearchEngine, the basic tool should work out of the box:

// SearchApiWebSearchEngine is the class proposed in this issue
WebSearchEngine googleSearchEngine = SearchApiWebSearchEngine.builder()
                .apiKey(System.getenv("SEARCHAPI_API_KEY"))
                .engine("google")
                .build();
WebSearchTool webSearchTool = WebSearchTool.from(googleSearchEngine);
...

Integration as a Tool (Function calling)

WebSearchTool is built into the langchain4j core and, as mentioned, should work out of the box.

However, it does not seem to be very useful when used as a function since it constructs the String for LLMs only based on organic_results as seen here.

I think we should create independent tool definitions and start with a couple of engines for this issue:

  • GoogleSearchTool
  • GoogleNewsTool
  • YoutubeTranscriptsTool
  • Your selected engine that might be useful with LLM

This way we can control the final string that is being built. For instance, in the youtube_transcripts engine, we are interested only in the transcript text. You can also check the sample Google Search string construction and the Google News example in our other recent integration. You can also add extra fields to the tools that you think might be important for LLMs.
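
As a hedged illustration of "controlling the final string" for a transcript-style tool, the sketch below joins only the text of each transcript segment. It assumes the segments have already been parsed into maps and that each one carries a `text` field; that field name and the class itself are assumptions for illustration, so check the actual youtube_transcripts response before relying on them.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical formatter for a transcript-style tool; not an existing langchain4j class.
final class TranscriptTextSketch {

    // Assumes each segment is a parsed JSON object with a "text" field; other fields are ignored.
    static String toPlainText(List<Map<String, Object>> segments) {
        return segments.stream()
                .map(segment -> segment.get("text"))
                .filter(text -> text != null)
                .map(Object::toString)
                .map(String::trim)
                .filter(text -> !text.isEmpty())
                .collect(Collectors.joining(" "));
    }
}
```

A tool method could return this string to the LLM directly, or pass it through a splitter first for long videos.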

Documentation and Examples

Throwing some ideas:

  • RAG use case with Search: How to use SearchApiWebSearchEngine + ContentRetriever + LLM.
  • Summarization: SearchApi YoutubeTranscripts tool + TokenSplitter + LLM. Great for YouTube podcasts.
  • Blog post writing: SearchApi Google Search Tool + Content retriever + LLM.

Most of the stuff (apart from custom tools) is already implemented in the langchain4j Google Custom Search directory. The tests contain hints on how to implement examples for the documentation (how to use the web search engine, content retriever, tool, etc.).

Hope the above information helps!

@zambrinf
Contributor

zambrinf commented Jun 2, 2024

@mark-searchapi I opened a PR with the idea of adding new search api engines in the future using an interface that handles the requests and responses

@mark-searchapi
Author

Hey @mark-searchapi, it looks like the num parameter for pagination is not working properly: the endpoint always returns 4 results when num is set to 5. Could you take a look? This happens when using the query "LangChain4j", which is being tested in WebSearchEngineIT.

@zambrinf, you can try using the default num parameter and verify that the total amount of organic results is greater than 0. This issue happens with the "LangChain4j" query because Google counts different elements toward the num parameter. WebSearchEngine results only consider organic results, so elements like inline videos are not included as organic results.

@ayewo

ayewo commented Jun 3, 2024

Hey @mark-searchapi, it looks like the num parameter for pagination is not working properly: the endpoint always returns 4 results when num is set to 5. Could you take a look? This happens when using the query "LangChain4j", which is being tested in WebSearchEngineIT.

@zambrinf, you can try using the default num parameter and verify that the total amount of organic results is greater than 0. This issue happens with the "LangChain4j" query because Google counts different elements toward the num parameter. WebSearchEngine results only consider organic results, so elements like inline videos are not included as organic results.

@mark-searchapi I've already implemented it as you described in my PR. Please have a look at #1215.

@langchain4j langchain4j added the P3 Medium priority label Jun 3, 2024
@czelabueno
Contributor

Hi @mark-searchapi and everybody,
It looks like an interesting challenge, and I think SearchApi would be helpful for a lot of users.
I was the one who defined v1 of WebSearchEngine, so I would like to clarify some premises to take into account in your PRs and add to what has already been explained here.

WebSearchEngine v1 was designed to retrieve organic web content (search API and images API results), either directly from the web search engine (Google, Bing, Yahoo) or through a wrapper solution that searches for the same thing. It does not include news, shopping, videos, maps, books, etc., since they are a different type of object. However, these could be relevant to some users and could be included in a v2 that dynamically includes an object structure for other content categories.

Speaking only of the organic content: if the integration supports URL scraping and is capable of retrieving the complete content of the website as text, it should populate 'WebSearchOrganicResult.content'. If the integration can only extract a snippet, it should populate the corresponding WebSearchOrganicResult.snippet field.

For now, neither the Tool nor the WebSearchContentRetriever includes the metadata from WebSearchContentRetriever.metadata or WebSearchOrganicResult.metadata (we have not seen it as necessary until now). However, including it is planned for v2, where each implementation can decide which metadata is fetched.

You can check in this post the flow included in the v1 https://x.com/c_zela/status/1785522559791808650

I hope it gave a little more context and I welcome your ideas to include in v2

I'll put some comments on your PRs :)

Thank you!

@langchain4j
Owner

@czelabueno I will then assign both PRs to you? 🙏

9 participants