How to use DocumentByParagraphSplitter? #1194
-
To use DocumentByParagraphSplitter, construct it with a maximum segment size, an overlap size, and a tokenizer, then call split() on a Document.
Here is a sample code snippet demonstrating this:

    import dev.langchain4j.data.document.splitter.DocumentByParagraphSplitter;
    import dev.langchain4j.data.document.Document;
    import dev.langchain4j.data.document.Metadata;
    import dev.langchain4j.data.segment.TextSegment;
    import dev.langchain4j.model.Tokenizer;
    import dev.langchain4j.model.openai.OpenAiTokenizer;
    import java.util.List;

    public class Main {
        public static void main(String[] args) {
            // Create a tokenizer instance
            Tokenizer tokenizer = new OpenAiTokenizer();
            // Create a splitter with a max segment size of 1024 tokens and no overlap
            DocumentByParagraphSplitter splitter = new DocumentByParagraphSplitter(1024, 0, tokenizer);
            // Example text to be split
            String text = "Your text here...";
            // Create a Document instance
            Document document = Document.from(text, Metadata.from("document", "0"));
            // Split the document into segments; consecutive paragraphs that fit
            // within the 1024-token limit end up in the same segment
            List<TextSegment> segments = splitter.split(document);
            // Merge the segments back together
            String mergedText = String.join(
                    splitter.joinDelimiter(),
                    segments.stream().map(TextSegment::text).toArray(String[]::new));
            // Output the merged text
            System.out.println(mergedText);
        }
    }

Replace "Your text here..." with your own text.
-
@WuJingLearn, please clarify your question.
-
I just want to split by paragraph, so that each paragraph ends up in its own TextSegment.
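To make the request concrete, here is a minimal plain-Java sketch of "one paragraph per segment". It does not use langchain4j: the class name ParagraphPerSegmentSketch and the method splitIntoParagraphs are hypothetical, and the blank-line regex is an assumption about what counts as a paragraph break.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of langchain4j.
final class ParagraphPerSegmentSketch {

    // Splits the text wherever two line breaks are separated only by
    // whitespace (one or more blank lines) and drops empty pieces.
    static List<String> splitIntoParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        for (String part : text.split("\\R\\s*\\R")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                paragraphs.add(trimmed);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String text = "First paragraph.\n\nSecond paragraph,\nstill second.\n\n\nThird paragraph.";
        List<String> paragraphs = splitIntoParagraphs(text);
        // Print each paragraph with its index.
        for (int i = 0; i < paragraphs.size(); i++) {
            System.out.println(i + ": " + paragraphs.get(i));
        }
    }
}
```

Each resulting string could then be wrapped with TextSegment.from(paragraph), so that no two paragraphs share a segment regardless of their size.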
-
Got it, thanks.
-
DocumentByParagraphSplitter splitter = new DocumentByParagraphSplitter(1024, 0);
I use DocumentByParagraphSplitter for text segmentation, with no more than 1024 tokens per paragraph in the document. In the end, it merged multiple paragraphs together.
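The merging described here is what you would expect from a splitter that treats the size argument as an upper bound per segment and packs as many consecutive paragraphs as fit under it. The sketch below illustrates that greedy packing in plain Java. It is not langchain4j's implementation: the class GreedyPackingSketch and the method pack are hypothetical, and sizes are counted in characters for simplicity (as far as I can tell, the two-argument constructor shown above also measures characters rather than tokens).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of greedy packing; not langchain4j's actual code.
final class GreedyPackingSketch {

    // Joins consecutive paragraphs with a blank line until adding the next
    // one would exceed maxChars, then starts a new segment.
    // A single paragraph longer than maxChars is kept whole in this sketch.
    static List<String> pack(List<String> paragraphs, int maxChars) {
        List<String> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String paragraph : paragraphs) {
            // Length of the current segment if this paragraph were appended.
            int candidate = current.length() == 0
                    ? paragraph.length()
                    : current.length() + 2 + paragraph.length();
            if (current.length() > 0 && candidate > maxChars) {
                segments.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append("\n\n");
            }
            current.append(paragraph);
        }
        if (current.length() > 0) {
            segments.add(current.toString());
        }
        return segments;
    }

    public static void main(String[] args) {
        List<String> paragraphs = List.of("aaaa", "bbbb", "cccc");
        // With a generous limit, all paragraphs land in one segment.
        System.out.println(pack(paragraphs, 1024).size());
        // With a limit equal to one paragraph's length, each gets its own segment.
        System.out.println(pack(paragraphs, 4).size());
    }
}
```

Under this model, a limit of 1024 only guarantees that no segment exceeds 1024 units; it does not keep short paragraphs apart.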