How to use DocumentByParagraphSplitter? #1194

WuJingLearn · 2024-05-29T07:48:00Z

WuJingLearn
May 29, 2024

DocumentByParagraphSplitter splitter = new DocumentByParagraphSplitter(1024, 0);

I am using DocumentByParagraphSplitter for text segmentation, with no more than 1024 tokens per paragraph in the document. In the end, it merged multiple paragraphs together.

@dosu · 2024-05-29T07:52:05Z

dosubot[bot]
bot May 29, 2024

To use DocumentByParagraphSplitter for text segmentation, ensuring no more than 1024 tokens per paragraph, and then merge multiple paragraphs together, follow these steps:

1. Create an instance of Tokenizer to handle token-based segmentation.
2. Instantiate a DocumentByParagraphSplitter with the desired maximum segment size in tokens (1024 tokens in this case).
3. Use the split method to segment the text into paragraphs.
4. Use the joinDelimiter method to merge the paragraphs back together.

Here is a sample code snippet demonstrating this:

import dev.langchain4j.data.document.splitter.DocumentByParagraphSplitter;
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.Metadata;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.model.Tokenizer;
import dev.langchain4j.model.openai.OpenAiTokenizer;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        // Create a tokenizer instance
        Tokenizer tokenizer = new OpenAiTokenizer();

        // Create a DocumentSplitter with a max segment size of 1024 tokens
        DocumentByParagraphSplitter splitter = new DocumentByParagraphSplitter(1024, 0, tokenizer);

        // Example text to be split
        String text = "Your text here...";

        // Create a Document instance
        Document document = Document.from(text, Metadata.from("document", "0"));

        // Split the text into paragraphs
        List<TextSegment> segments = splitter.split(document);

        // Merge the paragraphs back together
        String mergedText = String.join(splitter.joinDelimiter(), segments.stream().map(TextSegment::text).toArray(String[]::new));

        // Output the merged text
        System.out.println(mergedText);
    }
}

Replace "Your text here..." with the actual text you want to process. This code splits the text into segments of no more than 1024 tokens each and then merges them back together using double newline characters as delimiters ^[1]^[2]^[3].

0 replies

langchain4j · 2024-05-29T08:05:30Z

langchain4j
May 29, 2024
Maintainer

@WuJingLearn please clarify your question

0 replies

WuJingLearn · 2024-05-29T08:29:01Z

WuJingLearn
May 29, 2024
Author

Now I just want to split by paragraph, so that each paragraph ends up in its own TextSegment.

1 reply

langchain4j May 29, 2024
Maintainer

Ah, I see. In this case you can either do that manually or use the DocumentByParagraphSplitter.split(String) method.
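
For the manual route, a minimal sketch might look like the following. It assumes paragraphs are separated by one or more blank lines and uses only the Java standard library; the class and method names (ParagraphSplitSketch, splitIntoParagraphs) are hypothetical and not part of langchain4j. Each resulting string could then be wrapped with TextSegment.from(...) to get one segment per paragraph.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a langchain4j class.
public class ParagraphSplitSketch {

    // Splits text wherever two line breaks are separated only by whitespace,
    // i.e. on one or more blank lines. Assumption: that is what "paragraph" means here.
    static List<String> splitIntoParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        for (String part : text.split("\\R\\s*\\R")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                // Skip empty pieces caused by leading or trailing blank lines
                paragraphs.add(trimmed);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String text = "First paragraph.\n\nSecond paragraph,\nstill second.\n\n\nThird paragraph.";
        List<String> paragraphs = splitIntoParagraphs(text);
        for (int i = 0; i < paragraphs.size(); i++) {
            System.out.println(i + ": " + paragraphs.get(i));
        }
    }
}
```

Note that this sketch applies no size limit, so a very long paragraph stays in one piece. That is the trade-off against the Document-based split, which, as far as the Javadoc describes it, tries to fit as many paragraphs as possible into each segment up to the maximum size.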

WuJingLearn · 2024-05-29T08:52:43Z

WuJingLearn
May 29, 2024
Author

Got it, thanks.

0 replies
