Text Splitter

A text splitter breaks a document or long text into smaller chunks. Dividing lengthy texts into manageable units makes them easier to analyze or process.

CharacterTextSplitter

The CharacterTextSplitter splits a long text into smaller chunks on a single specified separator. It splits the text at each occurrence of that separator and then merges adjacent pieces up to the chunk size, so that related text stays together and the chunks remain semantically meaningful.

Parameters:

  • Documents: Input documents to split.

  • chunk_size: Maximum length of each chunk, in characters. Defaults to 1000.

  • chunk_overlap: Number of characters shared between consecutive chunks, that is, how much of the end of the previous chunk is repeated at the start of the next one. Defaults to 200.

    For example, if chunk_overlap is set to 20 and chunk_size is set to 100, the splitter creates chunks of up to 100 characters each, and up to the last 20 characters of each chunk are repeated at the start of the next chunk. This overlap preserves context that would otherwise be cut off at a chunk boundary.

  • separator: Specifies the character that will be used to split the text into chunks. Defaults to ".".
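
To make the chunk_size and chunk_overlap arithmetic concrete, here is a minimal Python sketch. It is not the CharacterTextSplitter implementation: it ignores the separator entirely and cuts at fixed character offsets, and the function name sliding_chunks is a hypothetical helper introduced only for this illustration. The real splitter places chunk boundaries at separators, so its chunks are usually shorter than chunk_size.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 100, chunk_overlap: int = 20) -> List[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters.

    Hypothetical helper for illustration only; it does not look at separators.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    # Each new chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "".join(str(i % 10) for i in range(260))
    parts = sliding_chunks(sample, chunk_size=100, chunk_overlap=20)
    for part in parts:
        print(len(part), part[:20], "...", part[-20:])
```

With chunk_size 100 and chunk_overlap 20, each window starts 80 characters after the previous one, so the last 20 characters of one chunk are the first 20 characters of the next.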

RecursiveCharacterTextSplitter

The RecursiveCharacterTextSplitter splits the text similarly to the CharacterTextSplitter. However, it works through a list of separators: when a chunk produced by one separator is still larger than chunk_size, it recursively splits that chunk using the next separator in the list.

Parameters:

  • Documents: Input documents to split.

  • chunk_size: Maximum length of each chunk, in characters. Defaults to 1000.

  • chunk_overlap: Determines the number of characters that overlap between consecutive chunks when splitting text. It specifies how much of the previous chunk should be included in the next chunk.

  • separators: The characters used to split the text into chunks, tried in order. The splitter first splits on the first separator in the list. If any chunks are still too large, it moves on to the next separator in the list and continues splitting. Defaults to ["\n\n", "\n", " ", ""].
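
The fallback through the separators list can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the library's code: the function name recursive_split is hypothetical, chunk_overlap is left out to keep the recursion easy to follow, and an empty-string separator is treated as "cut at fixed character offsets".

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Hypothetical, simplified helper: no overlap between chunks.
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    # Last resort: no usable separator left, so cut at fixed offsets.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        # Try to merge the piece into the chunk being built.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        # The piece does not fit: close the chunk built so far.
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too large: retry this piece with the next separator.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    doc = (
        "First paragraph here.\n\n"
        "Second paragraph is a bit longer than the first one.\n\n"
        "Third."
    )
    for chunk in recursive_split(doc, chunk_size=30):
        print(repr(chunk))
```

In this example the first and third paragraphs fit within 30 characters and are kept whole, while the second paragraph is too long and is split again on spaces, so words stay intact.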
