How to create a knowledge base using a knowledge graph?

Knowledge graph selection

Depending on your requirements, ZBrain lets you create a knowledge graph (KG) as an alternative to a traditional vector store. If your use case involves uncovering relationships between concepts, like how policies, products, or people are connected, a Knowledge Graph can provide deeper insights and more structured answers.

  • To build a Knowledge Graph, select 'Knowledge Graph' as the RAG definition in the Data Refinement Tuning step.

Chunk settings

ZBrain provides two distinct chunking approaches for both RAG definitions (vector store and knowledge graph):

  • Automatic: This option is recommended for users unfamiliar with the process. ZBrain will automatically set chunk and preprocessing rules based on best practices.

  • Custom: This option allows advanced users to fine-tune how their data is broken down (chunked) before the AI uses it. It includes options for segmenting text, setting chunk lengths, and preprocessing.

Segment identifier

  • What it is: A character or sequence that defines where a new chunk starts. Instead of breaking text blindly at a fixed length, you can break it at logical points like paragraphs or tabs.

  • Examples:

    • \n = newline character (used to separate paragraphs or lines)

    • \t = tab character (used for indentation or bullet points)

  • Why it matters: Using segment identifiers helps keep chunks semantically meaningful (e.g., breaking at the end of a sentence or section).

  • Tip: Use \n to chunk based on paragraphs or lines. Use \t if your data is tabular or structured with tabs.

📏 Maximum chunk length

  • What it is: The maximum number of characters (including spaces) in a single chunk.

  • Default shown: 500 characters.

  • Why it matters: LLMs (Large Language Models) can only handle a limited context window. If a chunk is too large, it may be cut off; too small, and it may lose coherence or context.

  • Best Practice: Start with 500–800 characters. Adjust according to your document style (e.g., short FAQ vs. long policy documents).

Chunk overlap

  • What it is: The number of overlapping characters between two consecutive chunks.

  • Why it's useful: Prevents loss of context at chunk boundaries. For example, if important content is at the end of one chunk, a small overlap ensures it appears at the start of the next chunk.

  • Default shown: 0 (no overlap). Try 50–100 if the content is narrative or flowing.
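The three settings above work together. As a minimal sketch (not ZBrain's actual implementation), chunking can be thought of as splitting at the segment identifier first, then sliding a window of at most the maximum chunk length over each segment, stepping back by the overlap each time:

```python
def chunk_text(text, separator="\n", max_len=500, overlap=0):
    """Illustrative chunker: split at the segment identifier, then
    window each segment to max_len characters with `overlap` characters
    shared between consecutive chunks (assumes overlap < max_len)."""
    # Split at the segment identifier so breaks land at logical points.
    segments = [s for s in text.split(separator) if s.strip()]
    chunks = []
    for seg in segments:
        start = 0
        while start < len(seg):
            chunks.append(seg[start:start + max_len])
            if start + max_len >= len(seg):
                break
            # Step forward, keeping `overlap` characters from the
            # end of the previous chunk at the start of the next.
            start += max_len - overlap
    return chunks
```

With the defaults (separator `\n`, 500 characters, no overlap), each paragraph shorter than 500 characters becomes one chunk; longer paragraphs are windowed, and a non-zero overlap repeats the tail of each chunk at the head of the next so boundary context is not lost.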

Text Preprocessing Rules (Optional)

  • Replace consecutive spaces, newlines, and tabs to clean up formatting inconsistencies.

  • Delete all URLs and email addresses to remove sensitive or irrelevant contact information.

  • Once you have made your changes, click the 'Confirm & Preview' button to review the results.
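The two preprocessing rules map naturally onto simple text substitutions. The sketch below is illustrative only (the URL and email patterns are deliberately simplified, not ZBrain's exact rules):

```python
import re

def preprocess(text):
    # Rule 1: collapse consecutive spaces, newlines, and tabs
    # into a single occurrence of each.
    text = re.sub(r"[ ]{2,}", " ", text)
    text = re.sub(r"\n{2,}", "\n", text)
    text = re.sub(r"\t{2,}", "\t", text)
    # Rule 2: delete URLs and email addresses (simplified patterns).
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "", text)
    return text
```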

Graph store

Below is the available graph store option for the knowledge graph:

  • Economical: This option utilizes ZBrain's built-in vector store with cost-effective vector engines and keyword indexes for efficient data handling.

File store selection

  • ZBrain S3 storage: This option utilizes ZBrain's secure and scalable S3 storage for data management. It offers enhanced data management features and precise retrieval results without incurring additional token costs.

Knowledge Graph LLM (for knowledge graph selection)

  • Choose the LLM that will perform reasoning over the knowledge graph (default: gpt-4o). The chosen model powers query rewriting, path finding, and answer synthesis.

Adding instructions for knowledge graph generation

  • You can enter custom instructions to define exactly how a Knowledge Graph should be built in the Knowledge Graph Instructions box.

  • Enter custom instructions: ZBrain allows advanced users to edit the instructions sent to the LLM during knowledge graph creation. Click 'Edit' to customize or modify the default prompt and type your instructions, or click 'Generate' to let ZBrain draft a prompt template, so the system knows exactly how to extract entities and relationships.

Note: Adding custom instructions is intended for advanced users only.

Customize the prompt for the knowledge graph

If the default instructions do not fit your data, the prompt output is not in the expected format, or knowledge graph creation is likely to fail, you can customize the prompt. This step is optional and intended for users who have a detailed understanding of prompt formatting.

You can choose to:

  • Use the default prompt (recommended for most users)

  • Replace only the placeholder values

  • Remove all default instructions and write your own

Default prompt

In this step, a structured prompt is sent to the model by default. It contains placeholders that the system automatically fills in with preset values. These include:

  • {language} — the selected language for output. Default: English

  • {entity_types} — the entity types chosen by the user. Default: "organization", "person", "geo", "event", "category"

  • {tuple_delimiter} — the symbol for separating elements within a tuple. Default: <|>

  • {record_delimiter} — the symbol for separating entries in the output list. Default: ##

  • {completion_delimiter} — the final output marker. Default: <|COMPLETE|>

  • {examples} — examples to guide the LLM. Default: a predefined list of example outputs

  • {input_text} — the input document content. Automatically populated at runtime with data from the knowledge source; it should not be removed or modified.

Do not delete these placeholders unless replacing them intentionally.

Output format

If you create a custom prompt, it must return output in a specific format. This includes:

  • Entities: ("entity"{tuple_delimiter}entity_name{tuple_delimiter}entity_type{tuple_delimiter}entity_description)

  • Relationships: ("relationship"{tuple_delimiter}source_entity{tuple_delimiter}target_entity{tuple_delimiter}relationship_description{tuple_delimiter}relationship_keywords{tuple_delimiter}relationship_strength)

  • Content keywords: ("content_keywords"{tuple_delimiter}high_level_keywords)

  • Output must be a flat list, separated by {record_delimiter}, and end with {completion_delimiter}.
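The format above is machine-parsed, which is why deviations can break graph creation. As an illustration of what a downstream parser might do with this output (a sketch, not ZBrain's actual parser), using the default delimiter values:

```python
TUPLE_DELIM = "<|>"
RECORD_DELIM = "##"
COMPLETION = "<|COMPLETE|>"

def parse_kg_output(raw):
    """Parse the flat record list produced by the extraction prompt."""
    entities, relationships, keywords = [], [], []
    body = raw.split(COMPLETION)[0]
    for record in body.split(RECORD_DELIM):
        record = record.strip().strip("()")
        if not record:
            continue
        fields = [f.strip().strip('"') for f in record.split(TUPLE_DELIM)]
        if fields[0] == "entity" and len(fields) == 4:
            entities.append({"name": fields[1], "type": fields[2],
                             "description": fields[3]})
        elif fields[0] == "relationship" and len(fields) == 6:
            relationships.append({"source": fields[1], "target": fields[2],
                                  "description": fields[3],
                                  "keywords": fields[4],
                                  "strength": fields[5]})
        elif fields[0] == "content_keywords" and len(fields) == 2:
            keywords.append(fields[1])
    return entities, relationships, keywords
```

A record that is missing a field or uses the wrong delimiter simply fails every branch and is dropped, which is how a malformed custom prompt silently degrades the resulting graph.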

The final prompt

This is the prompt that goes to the LLM after placeholders are replaced in the backend.

Watch the tooltip: If your instructions are incomplete or ambiguous, an inline warning appears to flag the risk of skewed results.

Once the prompt reflects your requirements, proceed to generate the Knowledge Graph; the platform will apply your refined instructions to the uploaded content.

Retrieval settings

For knowledge graph selection

  • Retrieval type: You can choose among five search modes:

  • Naive Mode: Falls back to basic vector similarity on text chunks (no KG traversal).

    • Best suited for: Quick POCs; content without rich relationships.

  • Local Mode: This search looks up context-dependent facts about a single entity using low-level keywords.

    • Best suited for: Q&A about a particular policy, product feature, or isolated technical detail.

  • Global Mode: Emphasizes relationship-based knowledge, traversing edges to reveal broader connections between concepts.

    • Best suited for: Holistic questions that require networked insights, e.g., “How do X, Y, and Z relate?”

  • Hybrid Mode: Combines both local and global retrieval, then merges the results.

    • Best suited for: Complex business questions that need both entity facts and contextual relationships.

  • Mix Mode: Executes both vector (semantic) and graph retrieval in parallel, drawing from unstructured and structured data, including time metadata.

    • Best suited for: Multi-layered queries that span different data types or dimensions, such as timelines, comparisons, or multifaceted evaluations.

Top K: This setting determines the number of most relevant results returned for a user's search query. You can specify the desired number of results (default is 50).

Score threshold: This setting defines the minimum score a result needs to achieve to be included in the search results. You can specify a score between 0.01 and 1 (default is 0.2).
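Top K and the score threshold act as two successive filters on the retrieved candidates. A minimal sketch of the combined behavior (illustrative, not ZBrain's internal code), using the defaults above:

```python
def select_results(scored_results, top_k=50, score_threshold=0.2):
    """Keep results scoring at or above the threshold, then return
    the top_k highest-scoring ones. Defaults mirror the settings above."""
    eligible = [r for r in scored_results if r["score"] >= score_threshold]
    eligible.sort(key=lambda r: r["score"], reverse=True)
    return eligible[:top_k]
```

Raising the threshold trades recall for precision; raising Top K gives the LLM more context at the cost of longer prompts.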

Embedding model

  • Choose the embedding type that best suits your use case to optimize text representation and improve performance.

The following embedding models are available when a knowledge graph is chosen in the RAG definition:

  • ZBrain will then display the proposed document and the estimated number of chunks for your review.

  • Once you have confirmed your selections, click the ‘Next’ button.

Execute and finish

On this screen, review all the details of the knowledge base you provided earlier. If everything appears accurate, click the ‘Manage Knowledge Base’ button to complete the creation process. You can monitor the embedding progress of the knowledge base in real time using the progress slider, whether it has completed or is still in progress.

Your newly created knowledge base is now accessible for use within your ZBrain solutions. You can create additional knowledge bases by clicking on the ‘Add’ button or delete existing ones using the ‘Delete’ button.

Note: If a knowledge base is initially created using a knowledge graph, the vector store option is hidden for all subsequent document uploads under that knowledge base and vice versa.
