Data Refinement Tuning

This page lets you configure the text processing parameters that determine how documents are chunked, indexed, and stored for retrieval in your knowledge base.

To configure a knowledge base, complete the following steps in sequence:

  • Select a RAG definition – Choose between a vector store or knowledge graph as the underlying retrieval method.

  • Configure chunk settings – Define how source content is segmented for efficient indexing and retrieval.

  • Set retrieval parameters – Specify query behavior such as search type, top-k results, and score thresholds.

  • Choose an embedding model – Select the model that converts text into vector representations for semantic search and captures contextual relationships in the input data.

Select the RAG definition

In this step, configure the data retrieval model that determines how your knowledge base will store, index, and retrieve content to generate accurate, context-aware responses to user queries.

ZBrain supports two RAG storage models:

  • Vector store – default, chunk-and-embedding index for semantic similarity search

  • Knowledge graph – entity–relationship graph stored in ZBrain’s graph database

Select the one that best matches your data and query needs.

Vector store

  • How it works: Splits each document into chunks, converts the chunks into high-dimensional embeddings, and saves them in a vector database. At query time, ZBrain performs a semantic-similarity search and supplies the matched chunks to the LLM.

  • When to choose it: Unstructured text; rapid prototyping.

Knowledge graph

  • How it works: Extracts entities and relationships from every chunk, stores them as nodes and edges, and also embeds the chunk text so you can fall back on vector similarity. The query engine can traverse the graph, run vector search, or do both.

  • When to choose it: Information with critical inter-entity relationships, such as product-component hierarchies, chronological timelines, or complex organizational structures.
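To make the vector store flow concrete, here is a minimal sketch of a similarity search with a top-k limit and a score threshold. It is an illustration only, not ZBrain's implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the names `search`, `top_k`, and `score_threshold` are hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query: str, chunks: list[str], top_k: int = 2,
           score_threshold: float = 0.1) -> list[tuple[float, str]]:
    """Score every chunk, drop low scores, return the best top_k."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    kept = [s for s in scored if s[0] >= score_threshold]
    kept.sort(key=lambda s: s[0], reverse=True)
    return kept[:top_k]

chunks = [
    "The pump assembly contains a motor and a valve.",
    "Invoices are issued on the first day of each month.",
    "The motor is rated for continuous operation.",
]
results = search("motor rating", chunks)
for score, chunk in results:
    print(f"{score:.3f}  {chunk}")
```

In this toy example the invoice chunk shares no words with the query, so it falls below the threshold and is never passed on to the LLM.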

Chunk settings

ZBrain provides two chunking approaches, available for both the vector store and the knowledge graph:

  • Automatic: This option is recommended for users unfamiliar with the process. ZBrain will automatically set chunk and preprocessing rules based on best practices.

  • Custom: This option lets advanced users fine-tune how their data is broken down (chunked) before the AI uses it. It includes options for segmenting text, setting chunk lengths, and preprocessing.

Segment identifier

  • What it is: A character or sequence that defines where a new chunk starts. Instead of breaking text at a fixed length, you can break it at logical points like paragraphs or tabs.

  • Examples:

    • \n = newline character (used to separate paragraphs or lines)

    • \t = tab character (used for indentation or bullet points)

  • Why it matters: Using segment identifiers helps keep chunks semantically meaningful (e.g., breaking at the end of a sentence or section).

  • Tip: Use \n to chunk based on paragraphs or lines. Use \t if your data is tabular or structured with tabs.
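As a rough illustration of how a segment identifier works, the sketch below splits text at each occurrence of the identifier. The function name `split_on_identifier` is hypothetical, and dropping empty segments is an assumption rather than documented ZBrain behavior.

```python
def split_on_identifier(text: str, segment_identifier: str = "\n") -> list[str]:
    """Split text at each identifier, trimming and dropping empty segments."""
    return [seg.strip() for seg in text.split(segment_identifier) if seg.strip()]

doc = "Intro paragraph.\nSecond paragraph.\n\nThird paragraph."
segments = split_on_identifier(doc)        # break at newlines
row = split_on_identifier("id\tname\tprice", "\t")  # break at tabs
print(segments)
print(row)
```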

Maximum chunk length

  • What it is: The maximum number of characters (including spaces) in a single chunk.

  • Default: 500 characters.

  • Why it matters: LLMs (Large Language Models) can only handle a limited context window, so chunk length trades context against efficiency.

    • Higher chunk length: Better context and accuracy, but slower processing and higher memory use.

    • Lower chunk length: Faster and more efficient, but may reduce accuracy due to limited context.
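One plausible way to combine a segment identifier with a maximum chunk length is to pack consecutive segments into a chunk until the limit would be exceeded. The sketch below assumes that strategy; `pack_segments` and its parameters are hypothetical names, and ZBrain's actual packing logic may differ.

```python
def pack_segments(segments: list[str], max_len: int = 500,
                  joiner: str = "\n") -> list[str]:
    """Greedily pack segments into chunks of at most max_len characters."""
    chunks: list[str] = []
    current = ""
    for seg in segments:
        if not seg:
            continue
        # Hard-split any single segment that is itself longer than the limit.
        pieces = [seg[i:i + max_len] for i in range(0, len(seg), max_len)]
        for piece in pieces:
            candidate = piece if not current else current + joiner + piece
            if len(candidate) <= max_len:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks

packed = pack_segments(["a" * 300, "b" * 300, "c" * 100], max_len=500)
print([len(c) for c in packed])
```

Here the first two segments cannot share a chunk (300 + 1 + 300 characters exceeds 500), while the second and third fit together.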

Chunk overlap

  • What it is: The number of overlapping characters between two consecutive chunks.

  • Why it is useful: Prevents loss of context at chunk boundaries. For example, if important content is at the end of one chunk, a small overlap ensures it appears at the start of the next chunk.

  • Default: 0 (no overlap).
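The effect of overlap can be sketched with a simple sliding window over characters. This is a minimal illustration under the assumption of fixed-length chunking; `chunk_with_overlap` is a hypothetical helper, not a ZBrain function.

```python
def chunk_with_overlap(text: str, max_len: int = 500, overlap: int = 0) -> list[str]:
    """Cut text into windows of max_len characters that share `overlap` characters."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be >= 0 and smaller than max_len")
    step = max_len - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks

no_overlap = chunk_with_overlap("abcdefghij", max_len=4, overlap=0)
with_overlap = chunk_with_overlap("abcdefghij", max_len=4, overlap=2)
print(no_overlap)
print(with_overlap)
```

With an overlap of 2, the last two characters of each chunk reappear at the start of the next one, at the cost of producing more chunks for the same text.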

Text preprocessing rules

  • Replace consecutive spaces, newlines, and tabs – Cleans up formatting inconsistencies.

  • Delete all URLs and email addresses – Removes sensitive or irrelevant contact information.

Once you have made your changes, click the 'Confirm & Preview' button to review the results.
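The two preprocessing rules above can be approximated with regular expressions. The sketch below is an assumption about what such rules might do, not ZBrain's actual implementation: the patterns are deliberately simple, and collapsing repeated whitespace to a single space or newline is one possible interpretation of the first rule.

```python
import re

# Simplified, hypothetical patterns; real-world URL and email matching is stricter.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def remove_urls_and_emails(text: str) -> str:
    """Delete URLs first, then email addresses."""
    text = URL_RE.sub("", text)
    return EMAIL_RE.sub("", text)

def collapse_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs to one space and runs of newlines to one newline."""
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)  # trim spaces around line breaks
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip()

raw = "Contact  us at\t\tsupport@example.com\n\n\nor visit https://example.com/docs   today."
clean = collapse_whitespace(remove_urls_and_emails(raw))
print(repr(clean))
```

Running the deletion step before the whitespace step means any gaps left behind by removed URLs or addresses are cleaned up in the same pass.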

To complete the configuration of your knowledge base, follow the step-by-step guide in the documentation for your selected RAG definition.
