What's the best practice for fetching a summary from a long HTML string or file? #880

Open
xjtroddy opened this issue May 24, 2024 · 0 comments
Labels: good first issue, help wanted

I am currently working with LlamaIndex TS to summarize large HTML files or strings. Below is the code I am using:

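import {
  serviceContextFromDefaults,
  SimpleNodeParser,
  VectorStoreIndex,
} from "llamaindex";

// `document`, `embedModel`, `llmModel`, and `embeddingPrompt` are defined elsewhere.

// Split the document into 800-token chunks with a 400-token overlap.
const serviceContext = serviceContextFromDefaults({
  nodeParser: new SimpleNodeParser({
    chunkSize: 800,
    chunkOverlap: 400,
    splitLongSentences: true,
  }),
  embedModel,
  llm: llmModel,
});

// Embed every chunk and store it in a vector index.
const index = await VectorStoreIndex.fromDocuments([document], {
  serviceContext,
});

// Retrieve chunks by similarity to the query, using default retriever settings.
const retriever = index.asRetriever();

const queryEngine = index.asQueryEngine({
  retriever,
});

// Ask for the summary; only the retrieved chunks reach the LLM.
const queryResponse = await queryEngine.query({
  query: embeddingPrompt,
});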

Issues Encountered

  1. Incomplete Summarization: The summarization output seems to be incomplete. At times, I don't get any content at all.
  2. Handling Large HTML Files: My HTML files are quite large, exceeding the maximum token limit of 8192. Despite using SimpleNodeParser with chunking, only the first 800 tokens of the document appear to be processed.

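To make the chunking arithmetic behind issue 2 concrete, here is a minimal sketch of sliding-window chunking with the same chunkSize and chunkOverlap values. The helper `chunkWindows` is hypothetical and only illustrates the arithmetic; it is not how SimpleNodeParser is implemented, which also respects sentence boundaries.

```typescript
// Hypothetical helper: returns the [start, end) token ranges produced by a
// sliding window of `chunkSize` tokens that advances by
// `chunkSize - chunkOverlap` tokens each step.
function chunkWindows(
  totalTokens: number,
  chunkSize: number,
  chunkOverlap: number,
): Array<[number, number]> {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("require 0 <= chunkOverlap < chunkSize");
  }
  const stride = chunkSize - chunkOverlap;
  const windows: Array<[number, number]> = [];
  for (let start = 0; start < totalTokens; start += stride) {
    const end = Math.min(start + chunkSize, totalTokens);
    windows.push([start, end]);
    // Stop once a window reaches the end of the document.
    if (end === totalTokens) break;
  }
  return windows;
}

// An 8192-token document with the settings from the snippet above.
const exampleWindows = chunkWindows(8192, 800, 400);
console.log(exampleWindows.length, exampleWindows[exampleWindows.length - 1]);
```

With these numbers the windows span the whole document in 20 chunks, so the splitting step alone would not explain output that covers only the first 800 tokens. If the retriever passes only a few top-scoring chunks to the LLM (an assumption about the default settings that is worth verifying), most chunks would never be summarized.
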
Questions

  1. What are the best practices for summarizing large HTML files or strings using LlamaIndex TS?
  2. How can I ensure that the summarization captures the entire content of the document, given the token limitations?
  3. Are there any specific settings or configurations in SimpleNodeParser or other components that I should adjust to improve the summarization results?

Thank you for your assistance!
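
For question 2, one general approach that covers every chunk is to summarize each chunk separately and then summarize the summaries until a single one remains. The sketch below is plain TypeScript, not a LlamaIndex TS API: `summarizeAll`, the `Summarize` type, and the `fanIn` parameter are hypothetical names, and `summarize` stands in for whatever LLM call is actually used.

```typescript
// Hypothetical stand-in for an LLM call that returns a summary of `text`.
type Summarize = (text: string) => Promise<string>;

// Summarize every chunk, then repeatedly merge the partial summaries in
// groups of `fanIn` until a single summary remains.
async function summarizeAll(
  chunks: string[],
  summarize: Summarize,
  fanIn: number = 4,
): Promise<string> {
  if (fanIn < 2) throw new Error("fanIn must be at least 2");
  if (chunks.length === 0) return "";

  // Map step: one summary per chunk, so no chunk is skipped.
  let level = await Promise.all(chunks.map((chunk) => summarize(chunk)));

  // Reduce step: each pass shrinks the list by roughly a factor of fanIn.
  while (level.length > 1) {
    const groups: string[] = [];
    for (let i = 0; i < level.length; i += fanIn) {
      groups.push(level.slice(i, i + fanIn).join("\n"));
    }
    level = await Promise.all(groups.map((group) => summarize(group)));
  }
  return level[0];
}

// Placeholder summarizer for demonstration only: keeps the first 40 characters.
const placeholderSummarize: Summarize = async (text) => text.slice(0, 40);

summarizeAll(["first chunk", "second chunk", "third chunk"], placeholderSummarize)
  .then((summary) => console.log(summary));
```

A smaller `fanIn` keeps each merge request short at the cost of more LLM calls; the right value depends on the chunk size and the model's context window.
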

himself65 added the "good first issue" and "help wanted" labels on Jun 11, 2024