
Data Sources

Data sources are where you give the system access to your data. When you load any text information into the system, it reads the text, splits it into chunks, and uploads them to special storage (currently, we use a vector database). When a user query arrives, we look in the datastore for the best-matching chunks and pass them to the ChatGPT engine, which finds an answer and builds a human-like response.
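
The pipeline above can be sketched in a few lines. This is a minimal, illustrative sketch only: the bag-of-words "embedding" and cosine ranking stand in for the real embedding model and vector database, whose internals are not documented here.

```python
from collections import Counter
from math import sqrt

def chunk_text(text: str, size: int = 200) -> list[str]:
    """Split text into fixed-size chunks (a real system would split on sentence boundaries)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a production system uses a neural embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the chunks most similar to the query, as a vector database lookup would."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]
```

The retrieved chunks are then passed to the language model as context for building the answer.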

Add a data source

As soon as a project is created, you may add a data source. A data source is a piece of information to be indexed and searched by our engine. Read more on the data source documentation page.

Click the "Add a new source" button and then select the corresponding data source type:

image

💡

At the moment, you can add data sources of three types: a PDF file, plain text, or a website.

Let's review all these data source types.

Plain text

After you create a new text data source, you will be redirected to its page:

Plain text data source

To save the data source under a specific title, enter the title and click the Save changes button.

The plain text data source can contain multiple pages of text. To add a new page, click the Add text content button. In the modal window, enter the title and the content:

image

Then click the Save & Index button to save the changes and immediately index the content in our database, to be used later by the chatbot.

PDF files

When you create a file-based data source, just click Select or drag file(s), choose your file, then click the Upload button.

image

The process of uploading followed by indexing will start. You will see something like this while it's in progress:

image

When it's done, you can click a page to review and edit its text content:

image

💡

If you accidentally select a non-PDF file, it will be ignored.

💡

Currently, only one file is allowed to be uploaded per data source.

Website

You may enter the full URL, a subfolder, or even specify pages manually. Another option is to use a sitemap file. Let's review all these options:

image

💡

Note that you can enter a URL without the "https://" protocol. If you enter a bare URL, we add "https://" by default.
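
The default-protocol behavior amounts to a simple normalization step; the sketch below is illustrative, and the function name is hypothetical rather than part of the product:

```python
def normalize_url(url: str) -> str:
    """Prepend "https://" when a bare URL without a protocol is entered."""
    if not url.startswith(("http://", "https://")):
        return "https://" + url
    return url
```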

💡

If you know your website has many pages but the crawler finds only one or two, we have probably hit a problem we don't currently solve: the content on your website is generated dynamically (that is, by JavaScript code).

💡

We can read your website even if it's protected by Cloudflare.

Let's review all the options for finding web pages.

All pages within a domain (automatically)

When this option is selected, the crawler looks for any links, puts them into a special list, and then loads those pages to find new links. The crawler is optimized to avoid hitting the same pages multiple times. As soon as a page is loaded, the crawler saves its content in a special in-memory database, where the data is stored for 7 days. So, if you'd like to re-index the content of your website for some reason (without re-loading the pages), it will take much less time.
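
The crawl described above is essentially a breadth-first traversal with a visited set and a content cache. The sketch below is a simplified model, assuming hypothetical `fetch(url)` and `extract_links(content)` helpers in place of the real HTTP client and HTML parser; the `cache` dictionary models the 7-day in-memory store.

```python
from collections import deque

def crawl(start_url, fetch, extract_links, max_pages=100):
    """Breadth-first crawl: queue discovered links, skip pages already visited,
    and cache each page's content so re-indexing can run without re-loading."""
    queue = deque([start_url])
    visited = set()
    cache = {}  # url -> content; re-indexing can read from here
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:  # avoid hitting the same page multiple times
            continue
        visited.add(url)
        content = fetch(url)
        cache[url] = content
        for link in extract_links(content):
            if link not in visited:
                queue.append(link)
    return cache
```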

Only within a specific path (like inside of a folder), automatically

Enter a path; all the pages under it will be indexed:

image

Read pages from the site map

You can enter the exact URL of your website's sitemap file, or even just your website's URL, and our system will find the sitemap on its own:

image

💡

Note, we can find sitemap files automatically only if the file name is one of these common names:

  • sitemap.xml,
  • sitemap.xml.gz,
  • sitemap_index.xml,
  • sitemap-index.xml,
  • sitemap_index.xml.gz,
  • sitemap-index.xml.gz,
  • .sitemap.xml,
  • sitemap,
  • admin/config/search/xmlsitemap,
  • sitemap/sitemap-index.xml.

If your sitemap name is unique, please enter it manually.

Manually entered pages

Also, you can enter URLs manually and then click the plus button to add them to the list:

image

You can do this at any moment before you start indexing.

Indexing webpages

As soon as you find or add the pages to be indexed, select them and click the Start indexing button. The indexing stats will update automatically. All indexed pages will be unchecked.

💡

If you select many pages (thousands, for example), the process could take significant time. You may close the window and open it later to check the progress. Refresh the page to see the changes.

You can add as many pages as you want within your limits. After the indexing is done you will see the updated numbers and icons:

image

Indexing parameters

image

image

💡

Change the settings before indexing. If you have already indexed your data, re-index it to see whether the parameter affects the chatbot's quality.

Density of data

You can specify whether you want data to be stored densely or sparsely. Sometimes, when parsing a website, our system leaves a lot of empty space between text data (which looks like empty lines). Usually, this doesn't affect the chatbot's quality, but sometimes removing these gaps can improve it significantly.
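
The dense option essentially collapses those runs of empty lines. A minimal sketch of the idea (the function name is hypothetical):

```python
import re

def densify(text: str) -> str:
    """Collapse runs of blank (or whitespace-only) lines into a single newline."""
    return re.sub(r"\n\s*\n+", "\n", text).strip()
```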

Include links

This option allows you to include links in your extracted text alongside the other text. For example, if your data source has the HTML code ...in our <a href="https://mywebsite.com/documentation.html">documentation</a>, the full link will be included as "...in our https://mywebsite.com/documentation.html documentation" when the option is selected. Even though it may provide more valuable information, in some cases it may worsen the results, so always test this option: index just a couple of pages, then test the chatbot on the Interaction page to make sure its quality is appropriate.
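
The transformation described above can be modeled as rewriting each anchor tag before stripping the remaining markup. This is an illustrative sketch, assuming a simple regex approach; a production extractor would use a proper HTML parser.

```python
import re

# Matches <a ... href="URL" ...>text</a>, capturing the URL and the link text.
A_TAG = re.compile(r'<a\s+[^>]*href="([^"]+)"[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)

def extract_text(html: str, include_links: bool) -> str:
    """Replace each anchor with "url text" (option on) or just its text (option off),
    then strip any remaining tags."""
    if include_links:
        html = A_TAG.sub(r"\1 \2", html)
    else:
        html = A_TAG.sub(r"\2", html)
    return re.sub(r"<[^>]+>", "", html)
```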

Re-indexing website

You may completely re-index your website, crawling it from scratch and then selecting and re-indexing pages, or you can change and re-index an individual page, as with any other type of data source.

To re-index the whole website, just select the pages to be indexed and click the Start indexing button. To re-index an individual page, find it and click the buttons to edit (optional) and index:

image

Ignored pages

You may want to mark some pages as ignored. This can be done at any moment. Ignored pages will not be indexed or crawled. To mark a page as ignored, just click the button in the corresponding column:

image

Deleting data

You can completely delete the data obtained from specific pages from the index database. To do so, just select the pages, then click the Remove page data button. You will have to confirm your intent to delete this data.

image

💡

If you accidentally deleted page data or changed your mind, you can always re-index the pages. To avoid indexing these pages in the future, mark them as ignored.

To remove this flag, click on the same button again.

💡

After you mark or unmark a page, start crawling or indexing to save the changes.