Data refinement tuning

Quick ‘How-to’ video with steps to tune data refinement settings

This page allows you to configure text processing parameters that determine how documents are processed and stored for optimal retrieval and knowledge base creation.

To configure a knowledge base, complete the following steps in sequence:

  • Select a RAG definition – Choose between a vector store or a knowledge graph as the underlying retrieval method.

  • Configure chunk settings – Define how source content is segmented for efficient indexing and retrieval.

  • Select a vector or graph store – Depending on your RAG selection, choose a vector store or graph store from the available options.

  • Use the default file store – ZBrain’s built-in file storage securely manages source documents or structured data used for retrieval, without any customized configuration.

  • Set retrieval parameters – Specify query behavior such as search type, top-k results, and score thresholds.

  • Choose an embedding model – Select the model to be used to convert text into vector representations for semantic search and to capture contextual relationships of the input data.

  • Choose LLM for knowledge graph – If you have selected the knowledge graph option, select the appropriate LLM from the available options.

  • Agentic retrieval (optional step) – Enable this option to use agentic, RAG-based retrieval logic, leveraging LLMs for intelligent sub-query planning and delivering more accurate, context-rich responses.
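As a mental model, the retrieval parameters mentioned in the steps above can be pictured as a small configuration object. The key names below are illustrative only and do not reflect ZBrain's actual schema:

```python
# Illustrative retrieval settings; keys are hypothetical,
# not ZBrain's actual configuration schema.
retrieval_config = {
    "search_type": "semantic",  # e.g. semantic, keyword, or hybrid
    "top_k": 5,                 # return the 5 highest-scoring chunks
    "score_threshold": 0.7,     # discard matches scoring below 0.7
}
```

Raising `top_k` widens the context passed to the LLM, while a higher `score_threshold` trades recall for precision.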

Select the RAG definition

In this step, configure the data retrieval model that determines how your knowledge base will store, index, and retrieve content to generate accurate, context-aware responses to user queries.

ZBrain supports two RAG storage models:

  • Vector store – default, chunk-and-embedding index for semantic similarity search

  • Knowledge graph – entity–relationship graph stored in ZBrain’s graph database

Select the one that best matches your data and query needs.

| Option | When to choose it |
| --- | --- |
| Vector store | Unstructured text; rapid prototyping |
| Knowledge graph | Information with critical inter-entity relationships, such as product-component hierarchies, chronological timelines, or complex organizational structures |

Chunk settings

ZBrain provides three distinct chunking approaches for both options:

  • Automatic: This option is recommended for users unfamiliar with the process. ZBrain will automatically set chunk and preprocessing rules based on best practices.

  • DocType Chunking

This option splits documents into meaningful sections based on their type and structure, instead of using arbitrary text length. For example, policies are divided by sections, contracts by clauses, and tables by rows or cells. OCR is also supported, so scanned and image-based documents can be processed and made searchable.

To configure DocType chunking:

  1. From the Chunking Method drop-down, select the option that aligns with your document type. Each method supports specific file formats and follows a distinct processing logic.

  2. Review the supported file formats for the selected type before uploading files.

  3. If you select ‘General’, you can optionally enable ‘Apply Vision Model (GPT-4o) for general chunking method’ for better OCR or image-text comprehension. To switch to a different LLM, click ‘Change.’

  4. Click ‘Update’ to apply your changes.

The available chunking methods and their corresponding supported file types are listed below:

| Chunking method | Supported file formats | Description |
| --- | --- | --- |
| General | PDF, DOCX, XLSX, TXT, JSON | General-purpose chunking suitable for most unstructured documents. Optionally enable ‘Apply Vision model (GPT-4o) for general chunking method’ for enhanced visual-text understanding. |
| Q & A (Question & Answer) | XLSX, CSV/TXT | XLSX/XLS (97–2003): two columns without headers, with questions in the first column and answers in the second. Multiple sheets are supported if the columns are correctly structured. CSV/TXT: must be UTF-8 encoded, with TAB as the delimiter between questions and answers. |
| Manual | PDF, DOCX | Ideal for structured instructional or procedural manuals. |
| Book | PDF, DOCX, TXT | Designed for large text files such as books. Automatically detects chapters and sections for logical segmentation. |
| Paper | PDF | Tailored for academic, research, or related papers. |
| Table | XLSX, CSV/TXT | Optimized for tabular data. The first row must contain column headers. For CSV/TXT files, TAB must be used as the column delimiter. |
| Laws | PDF, DOCX, TXT | Segments legal documents for precise legal information retrieval. |
| Presentation | PDF, PPTX | Processes slides, preserving layout and text hierarchy. |
| One (Single Chunk) | DOCX, XLSX, PDF, TXT | Treats the entire document as a single chunk. Useful when maintaining document context as a whole is critical. |

  • Custom:

This section allows advanced users to fine-tune how their data is broken down (chunked) before the AI uses it. It includes options for segmenting text, setting chunk lengths, and preprocessing.

Segment identifier

  • What it is: A character or sequence that defines where a new chunk starts. Instead of breaking text at a fixed length, you can break it at logical points like paragraphs or tabs.

  • Examples:

    • \n = newline character (used to separate paragraphs or lines)

    • \t = tab character (used for indentation or bullet points)

A tooltip on this field explains special characters such as \n (newline) and \t (tab), so you can see how content will be segmented.

  • Why it matters: Using segment identifiers helps keep chunks semantically meaningful (e.g., breaking at the end of a sentence or section).

  • Tip: Use \n to chunk based on paragraphs or lines. Use \t if your data is tabular or structured with tabs.
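To illustrate identifier-based segmentation, here is a minimal Python sketch; the function name is hypothetical and is not part of ZBrain:

```python
def split_on_identifier(text: str, identifier: str = "\n") -> list[str]:
    """Break text into segments wherever the identifier occurs,
    dropping empty segments left by consecutive identifiers."""
    return [seg for seg in text.split(identifier) if seg.strip()]

doc = "Intro paragraph.\nSecond paragraph.\n\nThird paragraph."
print(split_on_identifier(doc))
# ['Intro paragraph.', 'Second paragraph.', 'Third paragraph.']
```

Splitting at logical boundaries like these keeps each chunk a complete thought, which is what makes the resulting embeddings useful for retrieval.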

Maximum chunk length

  • What it is: The maximum number of characters (including spaces) in a single chunk.

  • Default: 500 characters.

  • Why it matters: LLMs (Large Language Models) can only handle a limited context window.

    • Higher chunk length: Better context and accuracy, but slower processing and higher memory use.

    • Lower chunk length: Faster and more efficient, but may reduce accuracy due to limited context.

  • A tooltip explains how high or low values affect processing.

Chunk overlap

  • What it is: The number of overlapping characters between two consecutive chunks.

  • Why it is useful: Prevents loss of context at chunk boundaries. For example, if important content is at the end of one chunk, a small overlap ensures it appears at the start of the next chunk.

  • Default: 0 (no overlap).
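A minimal sketch of how maximum chunk length and chunk overlap interact (illustrative only; ZBrain's internal chunker is not exposed):

```python
def chunk_text(text: str, max_len: int = 500, overlap: int = 0) -> list[str]:
    """Slice text into chunks of at most max_len characters, repeating
    the last `overlap` characters of each chunk at the start of the
    next so context at chunk boundaries is not lost."""
    if overlap >= max_len:
        raise ValueError("overlap must be smaller than max_len")
    chunks, start, step = [], 0, max_len - overlap
    while start < len(text):
        chunks.append(text[start:start + max_len])
        start += step
    return chunks

# With overlap, each boundary is shared between neighboring chunks:
print(chunk_text("abcdefghij", max_len=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Note that overlap increases the total number of chunks stored and indexed, so it trades storage and indexing cost for boundary context.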

Text preprocessing rules

  • Replace consecutive spaces, newlines, and tabs to clean up formatting inconsistencies.

  • Delete all URLs and email addresses to remove sensitive or irrelevant contact information.

  • Once you have made your changes, click the 'Confirm & Preview' button to review the results.
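The two cleanup rules above can be sketched in Python with regular expressions (an illustrative approximation, not ZBrain's implementation):

```python
import re

def preprocess(text: str) -> str:
    """Apply the two cleanup rules: delete URLs and email addresses,
    then collapse consecutive spaces, newlines, and tabs."""
    text = re.sub(r"https?://\S+|www\.\S+", "", text)        # URLs
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "", text)  # emails
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Contact  us at\nhelp@example.com or visit https://example.com today"))
# Contact us at or visit today
```

Removing URLs and addresses before collapsing whitespace keeps the leftover gaps from surviving as double spaces in the cleaned text.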

Agentic retrieval

  • Whether you select a vector store or a knowledge graph as the RAG definition, the Agentic Retrieval toggle is disabled by default. For more context-rich results, enable the toggle.

  • Once agentic retrieval is enabled, select the LLM to be used for sub-query planning from the list of LLMs available in the Agentic Retrieval Model drop-down.
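Conceptually, sub-query planning asks the selected LLM to break a complex question into simpler ones that are retrieved and answered independently. A hedged sketch, with a stubbed `llm` callable standing in for the chosen model (ZBrain's internal planner is not exposed):

```python
def plan_sub_queries(question: str, llm) -> list[str]:
    """Ask an LLM to decompose a question into simpler sub-queries,
    one per line. Hypothetical interface: `llm` is any callable
    that maps a prompt string to a text response."""
    prompt = ("Break the following question into independent "
              "sub-queries, one per line:\n" + question)
    return [q.strip() for q in llm(prompt).splitlines() if q.strip()]

# Stub LLM standing in for the model chosen in the drop-down:
fake_llm = lambda prompt: "Who founded Acme?\nWhen was Acme founded?"
print(plan_sub_queries("Who founded Acme and when?", fake_llm))
# ['Who founded Acme?', 'When was Acme founded?']
```

Each sub-query then runs through the normal retrieval path, and the partial results are combined into one context-rich answer.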

To complete the configuration of your knowledge base, refer to the appropriate documentation based on your selected RAG definition. Follow the step-by-step guide provided to ensure accurate and efficient setup.
