What's the best practice to fetch a summary from a long HTML string or file? #880

xjtroddy · 2024-05-24T08:30:15Z

I am currently working with LlamaIndex TS to summarize large HTML files or strings. Below is the code I am using:

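import {
  serviceContextFromDefaults,
  SimpleNodeParser,
  VectorStoreIndex,
} from "llamaindex";

// embedModel, llmModel, document, and embeddingPrompt are defined elsewhere.

// Split the document into 800-token chunks that overlap by 400 tokens.
const serviceContext = serviceContextFromDefaults({
  nodeParser: new SimpleNodeParser({
    chunkSize: 800,
    chunkOverlap: 400,
    splitLongSentences: true,
  }),
  embedModel,
  llm: llmModel,
});

// Embed every chunk and store it in a vector index.
const index = await VectorStoreIndex.fromDocuments([document], {
  serviceContext,
});

const retriever = index.asRetriever();

const queryEngine = index.asQueryEngine({
  retriever,
});

// Ask for the summary; the prompt is matched against the indexed chunks.
const queryResponse = await queryEngine.query({
  query: embeddingPrompt,
});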

Issues Encountered
1. Incomplete Summarization: The summarization output seems to be incomplete. At times, I don't get any content at all.
2. Handling Large HTML Files: My HTML files are quite large, exceeding the maximum token limit of 8192. Despite using SimpleNodeParser with chunking, it appears to only process the first 800 tokens of the document.
Questions

What are the best practices for summarizing large HTML files or strings using LlamaIndex TS?
How can I ensure that the summarization captures the entire content of the document, given the token limitations?
Are there any specific settings or configurations in SimpleNodeParser or other components that I should adjust to improve the summarization results?
Thank you for your assistance!
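
One possible way to make sure every chunk contributes to the summary is a map-reduce pass: summarize each chunk on its own, then summarize the combined partial summaries. The sketch below is plain TypeScript and does not use LlamaIndex. `chunkWords`, `summarizeAll`, and the `Summarize` callback are hypothetical names, and words stand in for tokens, so the sizes only approximate the 800/400 settings above. It also rests on the assumption that a vector-index query engine feeds the LLM only the few chunks most similar to the prompt, which would explain the partial output described in the issues.

```typescript
// Hypothetical callback: wraps whatever LLM call produces a summary of `text`.
type Summarize = (text: string) => Promise<string>;

// Split text into overlapping windows of `chunkSize` words.
// Words are a rough stand-in for tokens (an assumption of this sketch).
function chunkWords(text: string, chunkSize: number, overlap: number): string[] {
  if (overlap >= chunkSize) {
    throw new Error("overlap must be smaller than chunkSize");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const step = chunkSize - overlap;
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    // Stop once a window has reached the end of the text.
    if (start + chunkSize >= words.length) break;
  }
  return chunks;
}

// Map step: summarize every chunk, so no part of the document is skipped.
// Reduce step: summarize the joined partial summaries into one result.
async function summarizeAll(
  text: string,
  summarize: Summarize,
  chunkSize = 800,
  overlap = 400,
): Promise<string> {
  const chunks = chunkWords(text, chunkSize, overlap);
  if (chunks.length === 0) return "";
  const partials: string[] = [];
  for (const chunk of chunks) {
    partials.push(await summarize(chunk));
  }
  return partials.length === 1 ? partials[0] : summarize(partials.join("\n"));
}
```

With this shape, `summarize` could wrap any LLM call. If the joined partial summaries still exceed the model's context window, the reduce step would need to be applied repeatedly.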

himself65 added the "good first issue" and "help wanted" labels on Jun 11, 2024
