How to create a knowledge base?
This guide details the steps involved in creating a knowledge base within your ZBrain account.
Getting started
To begin, log into your ZBrain account.
Once you have successfully logged in, navigate to the dashboard.
Click the 'Create Knowledge Base' button to initiate the process of setting up a new knowledge base.

Uploading a document
You will be directed to a screen where you can either upload a document from your device or import it from a data source.

Data Source Configuration
This page allows users to configure the foundational elements of a new knowledge base.
After uploading or importing the data, you will be prompted to provide a name and description for the chosen file/data.
To upload additional documents to the knowledge base, click the ‘Add More’ button located below the uploaded documents. Select the documents from your device to add them.
Additional features
Document summarization
Enable document summarization by toggling the dedicated switch.
Select an appropriate large language model to perform the summarization process.
This feature creates concise overviews of lengthy documents for easier comprehension.
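For intuition, the sketch below shows one common way such summarization can work: a map-reduce pass over the document. It is a conceptual illustration, not ZBrain's implementation; `call_llm` is a hypothetical stand-in for whichever model you select.
```python
# Conceptual map-reduce summarization sketch (illustrative, not ZBrain's internals).
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Hypothetical stand-in for the selected LLM")

def summarize_document(text: str, chunk_size: int = 4000) -> str:
    # Map: summarize each chunk so long documents fit the model's context window.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    partial_summaries = [call_llm(f"Summarize concisely:\n\n{c}") for c in chunks]
    # Reduce: merge the partial summaries into a single concise overview.
    return call_llm("Combine these summaries into one overview:\n\n" + "\n".join(partial_summaries))
```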
Automated reasoning policy
Create an automated reasoning policy by activating the feature toggle.
An automated reasoning policy consists of predefined rules, conditions, and variables that guide the system's reasoning process when responding to queries.
It extracts structured data from the knowledge base, applies logical reasoning, and ensures responses are accurate and consistent.
This policy governs how the system interprets information, processes queries, and delivers answers based on established knowledge and logic.
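To make the idea concrete, here is a minimal sketch of rule-based checking; the rule schema and variables are invented for illustration and do not reflect ZBrain's actual policy format.
```python
# Illustrative rule-based policy check: predefined rules and conditions are
# evaluated against variables extracted from the knowledge base.
from typing import Callable

Rule = tuple[str, Callable[[dict], bool]]  # (rule name, condition over variables)

rules: list[Rule] = [
    ("refund window", lambda v: v["days_since_purchase"] <= 30),    # hypothetical rule
    ("item condition", lambda v: v["item_condition"] == "unopened"),
]

def violations(variables: dict) -> list[str]:
    """Return the names of rules the extracted variables fail to satisfy."""
    return [name for name, condition in rules if not condition(variables)]

print(violations({"days_since_purchase": 45, "item_condition": "unopened"}))
# ['refund window'] -> a consistent response must respect this rule
```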

Improve efficiency using Flow
Enable the ‘Improve Efficiency Using Flow’ option to streamline and enhance the process of transforming documents into refined knowledge bases.
This feature leverages predefined or custom flows from the Flow Library to automate data extraction and analysis, converting raw documents into structured, actionable insights.
Use it when you want to apply standardized data processing techniques, such as text extraction, image analysis, and language-model-based summarization, while building a knowledge base.
Flows also let the knowledge base handle a wider range of data formats, including text, images, and structured content.
When you activate the toggle to enable this feature, a button labeled ‘Add a Flow from the Flow Library’ will appear. Clicking this button will open the ‘Add a Flow’ panel.

Types of flows
There are two types of flows available:
ZBrain Flows
ZBrain Flows offer predefined automation solutions for common data processing tasks. Users can choose from the following options:
OCR (Optical Character Recognition)
Purpose: Recognizes and extracts text content from images or documents. This is particularly useful for digitizing physical documents or documents containing non-editable text (e.g., scanned PDFs).
Functionality:
Extracts text from images or scanned documents.
Enables further processing or analysis, such as searching, summarization, or automated reasoning.
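As a point of reference, the snippet below shows the same kind of extraction step using the open-source Tesseract engine; it illustrates the technique only, not how ZBrain's OCR flow is implemented.
```python
# Illustrative OCR extraction with the open-source Tesseract engine.
from PIL import Image   # pip install pillow
import pytesseract      # pip install pytesseract (requires the Tesseract binary)

text = pytesseract.image_to_string(Image.open("scanned_page.png"))
print(text)  # extracted text, now available for search, summarization, or reasoning
```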
Analyze each page as an image using an LLM
Purpose: Treats each document page as an image and processes both the visual and textual content for detailed analysis.
Functionality:
Converts document pages into a digital format using OCR.
Analyzes the extracted text with a large language model (LLM) to:
Derive insights.
Generate summaries.
Classify content based on predefined criteria.
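A rough outside-ZBrain sketch of this pipeline: render each page to an image, then pass it to a vision-capable model. `analyze_page` is a hypothetical placeholder, and the file name is illustrative.
```python
# Sketch of the page-as-image pipeline (illustrative, not ZBrain's flow code).
from pdf2image import convert_from_path  # pip install pdf2image (requires poppler)

def analyze_page(image) -> str:
    raise NotImplementedError("Hypothetical stand-in for a vision-capable LLM call")

pages = convert_from_path("report.pdf")            # one PIL image per page
results = [analyze_page(page) for page in pages]   # insights, summaries, or labels per page
```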
Extract images from the document and evaluate them using an LLM
Purpose: Designed for documents containing images that need to be analyzed.
Functionality:
Extracts images from the document.
Applies an LLM to analyze the images for:
Content recognition.
Pattern or object identification.
Useful for image-heavy documents requiring deeper content understanding.

Custom flows
The custom flows option allows users to create a flow specifically for data extraction, enabling tailored and advanced automation based on unique workflows and processing needs. Users can click on this option to add a custom flow.

Complete all the required fields and click the ‘Next’ button to proceed to the text data refinement page.
Text data refinement
This page allows configuration of text processing parameters that determine how documents are processed and stored for optimal retrieval and knowledge base creation.
RAG definition
In this step, configure the data retrieval model that determines how your knowledge base will store, index, and retrieve content to generate accurate, context-aware responses to user queries.
ZBrain supports two RAG storage models:
Vector store – default, chunk-and-embedding index for semantic similarity search
Knowledge graph – entity–relationship graph stored in ZBrain’s graph database

Select the one that best matches your data and query needs.
Vector store
Splits each document into chunks, converts the chunks into high-dimensional embeddings, and saves them in a vector database. At query time, ZBrain performs a semantic-similarity search and supplies the matched chunks to the LLM.
Best suited for:
Unstructured text
Rapid prototyping
Cases where relationship reasoning is not critical
Knowledge graph
Extracts entities and relationships from every chunk, stores them as nodes and edges, and also embeds the chunk text so you can fall back on vector similarity. The query engine can traverse the graph, run vector search, or perform both operations.
Best suited for: Information with critical inter-entity relationships, such as product-component hierarchies, chronological timelines, or complex organizational structures.
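To see the difference in miniature, the toy sketch below contrasts the two retrieval styles with made-up data: embedding similarity for the vector store versus edge traversal for the knowledge graph.
```python
# Toy contrast of the two RAG storage models (made-up data, illustrative only).
import numpy as np

# Vector store: rank chunks by cosine similarity between embeddings.
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

chunks = {"chunk_a": np.array([0.9, 0.1]), "chunk_b": np.array([0.2, 0.8])}
query = np.array([0.85, 0.2])
print(max(chunks, key=lambda c: cosine(chunks[c], query)))  # chunk_a

# Knowledge graph: answer relationship questions by traversing explicit edges.
graph = {
    "Product X": [("contains", "Component Y")],
    "Component Y": [("supplied by", "Vendor Z")],
}
for rel1, component in graph["Product X"]:
    for rel2, vendor in graph.get(component, []):
        print(f"Product X {rel1} {component}, which is {rel2} {vendor}")
```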
Option 1: Vector store selection (Default option)
If you have selected the vector store option in your RAG definition, you can choose from the vector store options listed below:
Pinecone: This option leverages the scalability of Pinecone, a third-party vector indexing service, directly within ZBrain.
Economical: This option utilizes ZBrain's built-in vector store with cost-effective vector engines and keyword indexes for efficient data handling.
Chroma: This option utilizes ChromaDB, a high-performance open-source vector database optimized for applications leveraging large language models. It offers robust support for embeddings, vector search, document storage, full-text search, metadata filtering, and multi-modal capabilities.

Add new connection: To use your own vector store, provide the necessary API key and credentials. You can choose either Pinecone Hosted or Qdrant for vector storage. Enter a connection name and the API key to establish the connection.
To get an API key from Pinecone Hosted, follow these steps:
Open the Pinecone console and log in to your account.
Select your project from the list of projects.
Navigate to the API Keys tab.
Click ‘Create API Key.’
Enter a name for your API key.
Choose the permissions you want to assign to the API key.
Click ‘Create Key.’
Copy and securely store the generated API key, as you cannot view it again once you close the dialog.
To get an API key from the Qdrant vector database, follow these steps:
Log in to the Qdrant Cloud dashboard.
Go to the cluster detail page.
Navigate to the API keys section.
Click ‘Create’ to generate a new API key.
Configure the permissions for the key if granular access control is enabled.
Click ‘OK’ and copy your API key.
Once you have the API key, enter the environment and index name.
After filling in all the required details, click ‘Add’ to complete the process.
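If you want to verify the credentials before (or after) adding the connection, the official Python clients can run a quick check; the key, URL, and names below are placeholders.
```python
# Optional sanity check of vector store credentials using the official clients.
from pinecone import Pinecone            # pip install pinecone
from qdrant_client import QdrantClient   # pip install qdrant-client

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
print(pc.list_indexes())                 # should list the indexes visible to this key

qdrant = QdrantClient(url="https://YOUR-CLUSTER.qdrant.io", api_key="YOUR_QDRANT_API_KEY")
print(qdrant.get_collections())          # should list the collections in your cluster
```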

Option 2: Knowledge graph selection
Depending on your requirements, ZBrain lets you create a knowledge graph (KG) as an alternative to a traditional vector store. Below is the available graph store option for the knowledge graph:
Economical: This option utilizes ZBrain's built-in vector store with cost-effective vector engines and keyword indexes for efficient data handling.
Customize the prompt for the knowledge graph
ZBrain allows advanced users to edit the instructions sent to the LLM during knowledge graph creation. This step is optional and intended for users who have a detailed understanding of prompt formatting.

You can choose to:
Use the default prompt (recommended for most users)
Replace only the placeholder values
Remove all default instructions and write your own
Default prompt
---Goal---
Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.
Use {language} as output language.
---Steps---
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, use same language as input text. If English, capitalized the name.
- entity_type: One of the following types: [{entity_types}]
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity"{tuple_delimiter}<entity_name>{tuple_delimiter}<entity_type>{tuple_delimiter}<entity_description>)
2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
- relationship_keywords: one or more high-level key words that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details
Format each relationship as ("relationship"{tuple_delimiter}<source_entity>{tuple_delimiter}<target_entity>{tuple_delimiter}<relationship_description>{tuple_delimiter}<relationship_keywords>{tuple_delimiter}<relationship_strength>)
3. Identify high-level key words that summarize the main concepts, themes, or topics of the entire text. These should capture the overarching ideas present in the document.
Format the content-level key words as ("content_keywords"{tuple_delimiter}<high_level_keywords>)
4. Return output in {language} as a single list of all the entities and relationships identified in steps 1 and 2. Use **{record_delimiter}** as the list delimiter.
5. When finished, output {completion_delimiter}
######################
---Examples---
######################
{examples}
#############################
---Real Data---
######################
Entity_types: [{entity_types}]
Text:
{input_text}
######################
Output:
In this step, a structured prompt is sent to the model by default. It contains placeholders that the system automatically fills in with preset values. These include:
{language}: the selected language for output. Default: English
{entity_types}: the entity types chosen by the user. Default: "organization", "person", "geo", "event", "category"
{tuple_delimiter}: the symbol separating elements within a tuple. Default: <|>
{record_delimiter}: the symbol separating entries in the output list. Default: ##
{completion_delimiter}: the final output marker. Default: <|COMPLETE|>
{examples}: examples that guide the LLM. Default: a predefined list of example outputs
{input_text}: the input document content. Automatically populated at runtime with data from the knowledge source; it should not be removed or modified.
Do not delete these placeholders unless replacing them intentionally.
Output format
If you create a custom prompt, it must return output in a specific format. This includes:
Entities:
("entity"{tuple_delimiter}entity_name{tuple_delimiter}entity_type{tuple_delimiter}entity_description)
Relationships:
("relationship"{tuple_delimiter}source_entity{tuple_delimiter}target_entity{tuple_delimiter}relationship_description{tuple_delimiter}relationship_keywords{tuple_delimiter}relationship_strength)
Content keywords:
("content_keywords"{tuple_delimiter}high_level_keywords)
Output must be a flat list, separated by {record_delimiter}, and must end with {completion_delimiter}.
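Because the format is fixed, the output can be parsed mechanically. The sketch below is one possible parser, written against the default delimiters documented above; it is not part of ZBrain.
```python
# Sketch of a parser for the extraction output, using the default delimiters:
# <|> inside tuples, ## between records, <|COMPLETE|> as the final marker.
TUPLE_DELIM, RECORD_DELIM, COMPLETION = "<|>", "##", "<|COMPLETE|>"

def parse_output(raw: str):
    entities, relationships, keywords = [], [], []
    for record in raw.replace(COMPLETION, "").split(RECORD_DELIM):
        fields = [f.strip().strip('"') for f in record.strip().strip("()").split(TUPLE_DELIM)]
        if fields[0] == "entity":
            entities.append({"name": fields[1], "type": fields[2], "description": fields[3]})
        elif fields[0] == "relationship":
            relationships.append({
                "source": fields[1], "target": fields[2], "description": fields[3],
                "keywords": fields[4], "strength": float(fields[5]),
            })
        elif fields[0] == "content_keywords":
            keywords = [k.strip() for k in fields[1].split(",")]
    return entities, relationships, keywords
```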
The final prompt
---Goal---
Given a text document that is potentially relevant to this activity and a list of entity types, identify all entities of those types from the text and all relationships among the identified entities.
Use English as output language.
---Steps---
1. Identify all entities. For each identified entity, extract the following information:
- entity_name: Name of the entity, use same language as input text. If English, capitalized the name.
- entity_type: One of the following types: [organization, person, geo, event, category]
- entity_description: Comprehensive description of the entity's attributes and activities
Format each entity as ("entity"<|>entity_name<|>entity_type<|>entity_description)
2. From the entities identified in step 1, identify all pairs of (source_entity, target_entity) that are *clearly related* to each other.
For each pair of related entities, extract the following information:
- source_entity: name of the source entity, as identified in step 1
- target_entity: name of the target entity, as identified in step 1
- relationship_description: explanation as to why you think the source entity and the target entity are related to each other
- relationship_strength: a numeric score indicating strength of the relationship between the source entity and target entity
- relationship_keywords: one or more high-level key words that summarize the overarching nature of the relationship, focusing on concepts or themes rather than specific details
Format each relationship as ("relationship"<|>source_entity<|>target_entity<|>relationship_description<|>relationship_keywords<|>relationship_strength)
3. Identify high-level key words that summarize the main concepts, themes, or topics of the entire text. These should capture the overarching ideas present in the document.
Format the content-level key words as ("content_keywords"<|>high_level_keywords)
4. Return output in English as a single list of all the entities and relationships identified in steps 1 and 2. Use **##** as the list delimiter.
5. When finished, output <|COMPLETE|>
######################
---Examples---
######################
Example 1:
Entity_types: [person, technology, mission, organization, location]
Text:
```
while Alex clenched his jaw, the buzz of frustration dull against the backdrop of Taylor's authoritarian certainty. It was this competitive undercurrent that kept him alert, the sense that his and Jordan's shared commitment to discovery was an unspoken rebellion against Cruz's narrowing vision of control and order.
Then Taylor did something unexpected. They paused beside Jordan and, for a moment, observed the device with something akin to reverence. "If this tech can be understood..." Taylor said, their voice quieter, "It could change the game for us. For all of us."
The underlying dismissal earlier seemed to falter, replaced by a glimpse of reluctant respect for the gravity of what lay in their hands. Jordan looked up, and for a fleeting heartbeat, their eyes locked with Taylor's, a wordless clash of wills softening into an uneasy truce.
It was a small transformation, barely perceptible, but one that Alex noted with an inward nod. They had all been brought here by different paths
```
Output:
("entity"<|>"Alex"<|>"person"<|>"Alex is a character who experiences frustration and is observant of the dynamics among other characters.")##("entity"<|>"Taylor"<|>"person"<|>"Taylor is portrayed with authoritarian certainty and shows a moment of reverence towards a device, indicating a change in perspective.")##("entity"<|>"Jordan"<|>"person"<|>"Jordan shares a commitment to discovery and has a significant interaction with Taylor regarding a device.")##("entity"<|>"Cruz"<|>"person"<|>"Cruz is associated with a vision of control and order, influencing the dynamics among other characters.")##("entity"<|>"The Device"<|>"technology"<|>"The Device is central to the story, with potential game-changing implications, and is revered by Taylor.")##("relationship"<|>"Alex"<|>"Taylor"<|>"Alex is affected by Taylor's authoritarian certainty and observes changes in Taylor's attitude towards the device."<|>"power dynamics, perspective shift"<|>7)##("relationship"<|>"Alex"<|>"Jordan"<|>"Alex and Jordan share a commitment to discovery, which contrasts with Cruz's vision."<|>"shared goals, rebellion"<|>6)##("relationship"<|>"Taylor"<|>"Jordan"<|>"Taylor and Jordan interact directly regarding the device, leading to a moment of mutual respect and an uneasy truce."<|>"conflict resolution, mutual respect"<|>8)##("relationship"<|>"Jordan"<|>"Cruz"<|>"Jordan's commitment to discovery is in rebellion against Cruz's vision of control and order."<|>"ideological conflict, rebellion"<|>5)##("relationship"<|>"Taylor"<|>"The Device"<|>"Taylor shows reverence towards the device, indicating its importance and potential impact."<|>"reverence, technological significance"<|>9)##("content_keywords"<|>"power dynamics, ideological conflict, discovery, rebellion")<|COMPLETE|>
#############################
######################
Example 2:
Entity_types: [company, index, commodity, market_trend, economic_policy, biological]
Text:
```
Stock markets faced a sharp downturn today as tech giants saw significant declines, with the Global Tech Index dropping by 3.4% in midday trading. Analysts attribute the selloff to investor concerns over rising interest rates and regulatory uncertainty.
Among the hardest hit, Nexon Technologies saw its stock plummet by 7.8% after reporting lower-than-expected quarterly earnings. In contrast, Omega Energy posted a modest 2.1% gain, driven by rising oil prices.
Meanwhile, commodity markets reflected a mixed sentiment. Gold futures rose by 1.5%, reaching $2,080 per ounce, as investors sought safe-haven assets. Crude oil prices continued their rally, climbing to $87.60 per barrel, supported by supply constraints and strong demand.
Financial experts are closely watching the Federal Reserve's next move, as speculation grows over potential rate hikes. The upcoming policy announcement is expected to influence investor confidence and overall market stability.
```
Output:
("entity"<|>"Global Tech Index"<|>"index"<|>"The Global Tech Index tracks the performance of major technology stocks and experienced a 3.4% decline today.")##("entity"<|>"Nexon Technologies"<|>"company"<|>"Nexon Technologies is a tech company that saw its stock decline by 7.8% after disappointing earnings.")##("entity"<|>"Omega Energy"<|>"company"<|>"Omega Energy is an energy company that gained 2.1% in stock value due to rising oil prices.")##("entity"<|>"Gold Futures"<|>"commodity"<|>"Gold futures rose by 1.5%, indicating increased investor interest in safe-haven assets.")##("entity"<|>"Crude Oil"<|>"commodity"<|>"Crude oil prices rose to $87.60 per barrel due to supply constraints and strong demand.")##("entity"<|>"Market Selloff"<|>"market_trend"<|>"Market selloff refers to the significant decline in stock values due to investor concerns over interest rates and regulations.")##("entity"<|>"Federal Reserve Policy Announcement"<|>"economic_policy"<|>"The Federal Reserve's upcoming policy announcement is expected to impact investor confidence and market stability.")##("relationship"<|>"Global Tech Index"<|>"Market Selloff"<|>"The decline in the Global Tech Index is part of the broader market selloff driven by investor concerns."<|>"market performance, investor sentiment"<|>9)##("relationship"<|>"Nexon Technologies"<|>"Global Tech Index"<|>"Nexon Technologies' stock decline contributed to the overall drop in the Global Tech Index."<|>"company impact, index movement"<|>8)##("relationship"<|>"Gold Futures"<|>"Market Selloff"<|>"Gold prices rose as investors sought safe-haven assets during the market selloff."<|>"market reaction, safe-haven investment"<|>10)##("relationship"<|>"Federal Reserve Policy Announcement"<|>"Market Selloff"<|>"Speculation over Federal Reserve policy changes contributed to market volatility and investor selloff."<|>"interest rate impact, financial regulation"<|>7)##("content_keywords"<|>"market downturn, investor sentiment, commodities, Federal Reserve, stock performance")<|COMPLETE|>
#############################
######################
Example 3:
Entity_types: [economic_policy, athlete, event, location, record, organization, equipment]
Text:
```
At the World Athletics Championship in Tokyo, Noah Carter broke the 100m sprint record using cutting-edge carbon-fiber spikes.
```
Output:
("entity"<|>"World Athletics Championship"<|>"event"<|>"The World Athletics Championship is a global sports competition featuring top athletes in track and field.")##("entity"<|>"Tokyo"<|>"location"<|>"Tokyo is the host city of the World Athletics Championship.")##("entity"<|>"Noah Carter"<|>"athlete"<|>"Noah Carter is a sprinter who set a new record in the 100m sprint at the World Athletics Championship.")##("entity"<|>"100m Sprint Record"<|>"record"<|>"The 100m sprint record is a benchmark in athletics, recently broken by Noah Carter.")##("entity"<|>"Carbon-Fiber Spikes"<|>"equipment"<|>"Carbon-fiber spikes are advanced sprinting shoes that provide enhanced speed and traction.")##("entity"<|>"World Athletics Federation"<|>"organization"<|>"The World Athletics Federation is the governing body overseeing the World Athletics Championship and record validations.")##("relationship"<|>"World Athletics Championship"<|>"Tokyo"<|>"The World Athletics Championship is being hosted in Tokyo."<|>"event location, international competition"<|>8)##("relationship"<|>"Noah Carter"<|>"100m Sprint Record"<|>"Noah Carter set a new 100m sprint record at the championship."<|>"athlete achievement, record-breaking"<|>10)##("relationship"<|>"Noah Carter"<|>"Carbon-Fiber Spikes"<|>"Noah Carter used carbon-fiber spikes to enhance performance during the race."<|>"athletic equipment, performance boost"<|>7)##("relationship"<|>"World Athletics Federation"<|>"100m Sprint Record"<|>"The World Athletics Federation is responsible for validating and recognizing new sprint records."<|>"sports regulation, record certification"<|>9)##("content_keywords"<|>"athletics, sprinting, record-breaking, sports technology, competition")<|COMPLETE|>
#############################
#############################
---Real Data---
######################
Entity_types: [organization, person, geo, event, category]
Text:
{input_text}
######################
Output:
This is the prompt that goes to the LLM after placeholders are replaced in the backend.
File store selection
ZBrain S3 storage: This option utilizes ZBrain's secure and scalable S3 storage for data management. It offers enhanced data management features and precise retrieval results without incurring additional token costs.
Chunk settings
ZBrain provides two distinct chunking approaches, available for both RAG storage models:
Automatic: This option is recommended for users unfamiliar with the process. ZBrain will automatically set chunk and preprocessing rules based on best practices.

Custom: This option enables experienced users to customize configurations, including defining end-of-segment characters, setting chunking rules and lengths, and applying text preprocessing rules.
These rules include replacing consecutive spaces, newlines, and tabs, and removing all URLs and email addresses.
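As a rough sketch of what those custom rules amount to (the separator and length values below are illustrative, not ZBrain defaults):
```python
# Rough sketch of custom chunking with text preprocessing (illustrative values).
import re

def preprocess(text: str) -> str:
    """Apply the preprocessing rules: strip URLs/emails, collapse whitespace."""
    text = re.sub(r"https?://\S+|\S+@\S+\.\S+", "", text)  # remove URLs and email addresses
    return re.sub(r"\s+", " ", text).strip()               # collapse spaces, newlines, tabs

def chunk(text: str, separator: str = "\n\n", max_len: int = 500) -> list[str]:
    """Split on the end-of-segment marker, packing segments up to max_len characters."""
    chunks, current = [], ""
    for segment in text.split(separator):
        if current and len(current) + len(segment) > max_len:
            chunks.append(current)
            current = ""
        current += segment + " "
    if current.strip():
        chunks.append(current)
    return [preprocess(c) for c in chunks]
```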
Once you have made your changes, click the 'Confirm & Preview' button to review the results.

Retrieval settings
ZBrain offers various retrieval settings to define how users can search and retrieve information from a knowledge base. Here's an overview of the available settings:
For vector store selection
Search type: You can choose from three search types:
Vector search: This method converts the query and text chunks into vector embeddings and retrieves the chunks that are most semantically similar to the query.
Full-text search: This method indexes all terms within your documents, allowing users to search and retrieve documents based on keywords. ZBrain utilizes an inverted index structure to map terms to the relevant text chunks.
Hybrid search: This option combines vector search and full-text search. ZBrain performs both searches simultaneously and then reranks the results to prioritize the most relevant documents for the user's query. To utilize hybrid search, you will need to configure a Rerank model API.
Top K: This setting determines the number of most relevant results returned for a user's search query. You can specify the desired number of results (default is 50).
Score threshold: This setting defines the minimum score a result needs to achieve to be included in the search results. You can specify a score between 0.01 and 1 (the default is 0.2).
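In effect, Top K and the score threshold act as two filters on the ranked results; here is a minimal sketch with made-up scores:
```python
# Minimal sketch of Top K + score threshold filtering (scores are made up;
# the actual ranking comes from the selected search type).
results = [("chunk_1", 0.91), ("chunk_2", 0.64), ("chunk_3", 0.18), ("chunk_4", 0.07)]

TOP_K = 50             # default number of results to return
SCORE_THRESHOLD = 0.2  # default minimum relevance score (allowed range 0.01-1)

ranked = sorted(results, key=lambda r: r[1], reverse=True)
filtered = [(cid, score) for cid, score in ranked if score >= SCORE_THRESHOLD][:TOP_K]
print(filtered)  # [('chunk_1', 0.91), ('chunk_2', 0.64)]
```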

For knowledge graph selection
Search type: You can choose from five search types:

Naive Mode: Falls back to basic vector similarity on text chunks (no KG traversal).
Best suited for: Quick POCs; content without rich relationships.
Local Mode: This search looks up context-dependent facts about a single entity using low-level keywords.
Best suited for: Q&A about a particular policy, product feature, or isolated technical detail.
Global Mode: Emphasizes relationship-based knowledge, traversing edges to reveal broader connections between concepts.
Best suited for: Holistic questions that require networked insights, e.g., “How do X, Y, and Z relate?”
Hybrid Mode: Combines both local and global retrieval, then merges the results.
Best suited for: Complex business questions that need both entity facts and contextual relationships.
Mix Mode: Executes both vector (semantic) and graph retrieval in parallel, drawing from unstructured and structured data, including time metadata.
Best suited for: Multi-layered queries that span different data types or dimensions, such as timelines, comparisons, or multifaceted evaluations.
Top K: This setting determines the number of most relevant results returned for a user's search query. You can specify the desired number of results (default is 50).
Score threshold: This setting defines the minimum score a result needs to achieve to be included in the search results. You can specify a score between 0.01 and 1 (default is 0.2).
Embedding model
Choose the embedding type that best suits your use case to optimize text representation and improve performance.
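Whatever model you select, its job is to map text to fixed-length vectors so that similar meanings land close together. The snippet below illustrates this with an open-source sentence-transformers model; the model name is an example, not necessarily one of ZBrain's options.
```python
# Illustration of what an embedding model does (example open-source model).
from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode([
    "Reset your password from the login page",
    "How do I change my password?",
])
print(vectors.shape)                                # (2, 384): one vector per text
print(float(util.cos_sim(vectors[0], vectors[1])))  # high score -> semantically close
```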
Upon selecting a vector store in the RAG definition, the following embedding models are available for use:

The following embedding models are available when a knowledge graph is chosen in the RAG definition:

Knowledge Graph LLM (for knowledge graph selection)
Choose the LLM that will perform reasoning over the knowledge graph (default: gpt-4o). The chosen model powers query rewriting, path finding, and answer synthesis.

ZBrain will then display the processed document and the estimated number of chunks for your review.
Once you have confirmed your selections, click the ‘Next’ button.

Execute and finish
On this screen, review all the details of the knowledge base you provided earlier. If everything appears accurate, click the ‘Manage Knowledge Base’ button to complete the creation process. You can monitor the embedding progress of the knowledge base in real time; the progress indicator shows whether embedding has completed or is still in progress.


Your newly created knowledge base is now accessible for use within your ZBrain solutions. You can create additional knowledge bases by clicking on the ‘Add’ button or delete existing ones using the ‘Delete’ button.

Note: If a KB is initially created using a knowledge graph, the vector store option is hidden for all subsequent document uploads under that KB and vice versa.
