
Question Generator


Why generate questions?

In a knowledge-based question answering retrieval system, if you search the vector database directly with the user's question, you are retrieving answers by their similarity to a question. This only works if questions and answers share a consistent embedding space, which requires a limited or specially trained embedding model on question-answer pairs.

If we add a step in the middle, the situation improves:
Use the input doc chunk to generate relevant question-answer pairs, then save the embeddings of these questions, together with the corresponding answers, in the vector database. At retrieval time, we use the user's question to search against the stored questions and return the corresponding answers. In this way, we search for questions with questions, which works significantly better than the direct approach.
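For instance, here is a minimal, self-contained sketch of this question-with-question retrieval flow. The embed stand-in and the in-memory index are illustrative assumptions only; a real system would use an embedding model and a vector database.

import numpy as np

def embed(text: str) -> np.ndarray:
    # Stand-in for a real embedding model; hashes character trigrams
    # into a fixed-size vector just so this sketch runs end to end.
    vec = np.zeros(64)
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % 64] += 1.0
    return vec

# Offline: QA pairs generated from doc chunks; index the QUESTION embeddings.
qa_pairs = [
    ('What is React?', 'The library for web and native user interfaces'),
    ('What are components in React?', 'Individual pieces that make up user interfaces in React.'),
]
question_vectors = np.stack([embed(q) for q, _ in qa_pairs])

# Online: embed the user's question and search question-against-question.
def retrieve(user_question: str) -> str:
    v = embed(user_question)
    sims = question_vectors @ v / (
        np.linalg.norm(question_vectors, axis=1) * np.linalg.norm(v) + 1e-9)
    return qa_pairs[int(np.argmax(sims))][1]  # return the stored answer

print(retrieve('what is react'))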

Why not generate questions?

However, this method consumes a lot of resources to generate question-answer pairs for documents, such as a large number of OpenAI API tokens. So you have to decide whether to add the question-generation step according to your actual resources and situation.

How does it work?

There are many tricks when generating QA pairs.

If you directly ask the LLM to generate document-related questions, you are likely to receive many invalid questions, or questions that cannot be answered from the doc chunk. So it is very important to ensure that only valid, high-quality questions are generated, and that their answers can be found in the doc.

Prompt design

We can let the LLM generate questions in two steps: the first step generates questions, and the second step finds the corresponding answer to each question in the doc. If no answer can be found according to the doc information, a fixed answer {no_answer_str} is returned.

'''The first step is to generate some meaningful questions according to the following doc chunk.
The second step is to answer each question from the first step according to the content of the doc chunk.
Note if the corresponding answer cannot be found in the doc chunk, the answer is a str: "{no_answer_str}".
'''
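A sketch of wiring this prompt to an LLM with LangChain; the model choice, temperature, and the doc_chunk variable here are illustrative assumptions, not part of the original pipeline:

from langchain.chat_models import ChatOpenAI
from langchain.schema import HumanMessage, SystemMessage

no_answer_str = 'NO ANSWER'

system_prompt = f'''The first step is to generate some meaningful questions according to the following doc chunk.
The second step is to answer each question from the first step according to the content of the doc chunk.
Note if the corresponding answer cannot be found in the doc chunk, the answer is a str: "{no_answer_str}".
'''

llm = ChatOpenAI(temperature=0)  # deterministic output helps keep the format stable
doc_chunk = 'React is the library for web and native user interfaces. ...'
reply = llm([SystemMessage(content=system_prompt), HumanMessage(content=doc_chunk)])
print(reply.content)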

Fix output format

In the LLM's response, the format of the generated questions is sometimes arbitrary: labels such as "1, 2, 3" may or may not be added, and the LLM may return extra text such as "the generated questions are as follows". Therefore, we can use LangChain's StructuredOutputParser to fix the returned format.

from langchain.output_parsers import StructuredOutputParser, ResponseSchema

# Keys and the no-answer label used in the JSON output (matching the example below).
question_list_str = 'question_list'
answer_list_str = 'answer_list'
no_answer_str = 'NO ANSWER'

response_schemas = [
    ResponseSchema(name=question_list_str,
                   description='List[str] of questions generated in the first step.'),
    ResponseSchema(name=answer_list_str,
                   description=f'''List[str] of answers for the second step, corresponding to the questions generated in the first step.
If the corresponding answer cannot be found in the doc chunk, the answer is a str: "{no_answer_str}".''')
]
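As a brief usage sketch, the parser built from these schemas can also produce format instructions to append to the prompt, so the LLM knows exactly which JSON keys to emit:

output_parser = StructuredOutputParser.from_response_schemas(response_schemas)

# Append these instructions to the prompt so the LLM returns parseable JSON.
format_instructions = output_parser.get_format_instructions()
print(format_instructions)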

With a fixed format schema, we can constrain the LLM's return to JSON, which may look like this:

{
  "question_list": [
    "What is React?",
    "What are components in React?",
    "What is the Video.js function?",
    "What is the My video description?",
    "How does React allow for seamless combination of components?"
  ],
  "answer_list": [
    "The library for web and native user interfaces",
    "Individual pieces that make up user interfaces in React.",
    "A function for rendering a video component in React.",
    "NO ANSWER",
    "React is designed to let you seamlessly combine components written by independent people, teams, and organizations."
  ]
}

The question_list and answer_list above correspond one-to-one. Note that "NO ANSWER" is the {no_answer_str} label we set earlier for questions whose answers could not be found. We filter out failures such as JSON parsing errors and questions with no answer; the rest are high-quality questions.
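Putting it together, here is a sketch of this filtering step, assuming the output_parser and constants defined above; llm_output stands for the raw LLM reply:

try:
    parsed = output_parser.parse(llm_output)
    qa_pairs = [
        (q, a)
        for q, a in zip(parsed[question_list_str], parsed[answer_list_str])
        if a != no_answer_str  # drop questions the doc chunk cannot answer
    ]
except Exception:
    qa_pairs = []  # e.g. JSON parsing failed; discard this chunk's output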