Data Sources
Data sources are how you give the system access to your data. When you load text into the system, it reads the text, splits it into chunks, and uploads them to a special storage (currently, we use a vector database). When a user query arrives, we search the datastore for the best-matching chunks and pass them to the ChatGPT engine, which looks for an answer and builds a human-like response.
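The chunk-and-retrieve flow described above can be sketched in a few lines. This is only an illustration, not our actual implementation: the function names (`split_into_chunks`, `embed`, `top_chunks`) are hypothetical, the chunk size is arbitrary, and a simple bag-of-words vector stands in for a real embedding model and vector database.

```python
import math
import re
from collections import Counter


def split_into_chunks(text: str, max_words: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query, best match first."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```

In a real pipeline, the chunks returned by `top_chunks` would be placed into the prompt sent to the language model together with the user's question.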
Add a data source
As soon as a project is created, you can add a data source. A data source is a piece of information to be indexed and searched by our engine. Read more on the data source documentation page.
Click the "Add a new source" button and then select the type of data source you need.
At the moment, you can add data sources of three types: PDF file, plain text, or a web site.
Let's review each of these data source types.
Plain text
After you create a new text data source, you will be redirected to its page:
To save the data source under a specific title, enter the title and click the Save changes button.
The plain text data source can contain multiple pages of text. To add a new page of text, click the Add text content button. In the modal window, enter the title and the content of a data source:
Then click the Save & Index button to save the changes and immediately index the content in our database so that the chatbot can use it later.
PDF files
To create a file-based data source, click Select or drag file(s), choose your file, and then click the Upload button.
Uploading will start, followed by indexing. You will see something like this while it's in progress:
When it's done you can click on a page to review and edit the text content:
If you accidentally select a non-PDF file, it will be ignored.
Currently, only one file is allowed to be uploaded per data source.
Web site
You can enter the full URL of the site, a subfolder, or specific pages manually. Another option is to use the sitemap file.
Note that you can enter a URL without the "https://" protocol. If you enter a bare URL, we will add the "https" protocol by default.
If you know your website has many pages but the crawler finds only one or two, you have probably hit a limitation we don't currently handle: content that is generated dynamically on the page (that is, by JavaScript code).
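The default-protocol rule can be illustrated with a short sketch. The function name `add_default_scheme` is hypothetical, and the check for `://` is an assumption about how a bare URL might be detected, not a description of our actual code.

```python
def add_default_scheme(url: str, default_scheme: str = "https") -> str:
    """Prefix a bare URL with the default scheme; leave full URLs untouched."""
    url = url.strip()
    if "://" in url:
        # A protocol is already present (for example http:// or https://).
        return url
    # Drop leading slashes so that "//example.com" is handled as well.
    return f"{default_scheme}://{url.lstrip('/')}"
```

For example, `add_default_scheme("example.com/docs")` yields `https://example.com/docs`, while a URL that already starts with `http://` is returned unchanged.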
We can read your website even if it's protected by Cloudflare.
Let's review all the options for finding web pages.
All pages within a domain (automatically)
When this option is selected, the crawler looks for links, puts them into a special list, and then loads those pages to find new links. The crawler is optimized to avoid hitting the same page more than once. As soon as a page is loaded, the crawler saves its content in a special in-memory database. This data is stored there for 7 days. So, if you'd like to re-index the content of your website for some reason (without re-loading the pages), it will take much less time.
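A minimal sketch of this kind of crawl, under stated assumptions: a breadth-first traversal restricted to the starting host, a `seen` set so that no page is loaded twice, and a plain dictionary standing in for the in-memory database with a 7-day lifetime. The names `crawl`, `fetch`, and `cache` are hypothetical; `fetch` is passed in by the caller so that the example does not depend on any particular HTTP library.

```python
import time
from collections import deque
from html.parser import HTMLParser
from typing import Callable
from urllib.parse import urldefrag, urljoin, urlsplit

CACHE_TTL_SECONDS = 7 * 24 * 60 * 60  # 7 days, as described above


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(
    start_url: str,
    fetch: Callable[[str], str],
    cache: dict[str, tuple[float, str]],
    now: Callable[[], float] = time.time,
) -> list[str]:
    """Visit every same-host page reachable from start_url exactly once."""
    host = urlsplit(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}  # prevents loading the same page repeatedly
    visited: list[str] = []
    while queue:
        url = queue.popleft()
        entry = cache.get(url)
        if entry is not None and now() - entry[0] < CACHE_TTL_SECONDS:
            html = entry[1]  # fresh cached copy, no reload needed
        else:
            html = fetch(url)
            cache[url] = (now(), html)
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link = urldefrag(urljoin(url, href)).url
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Running `crawl` a second time with the same `cache` within the 7-day window reuses the stored page content instead of calling `fetch` again, which is why a repeated run finishes much faster.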
Only within a specific path (like inside of a folder), automatically
Enter the path under which you want all pages to be indexed:
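One way to picture the path restriction is as a prefix check on the URL path. The sketch below is an assumption about how such a filter could work, with a hypothetical function name `is_under_path`; it compares whole path segments so that `/docs` does not accidentally match `/docs-old`.

```python
from urllib.parse import urlsplit


def is_under_path(url: str, base_url: str) -> bool:
    """Return True if url is on the same host and at or below base_url's path."""
    base = urlsplit(base_url)
    target = urlsplit(url)
    if target.netloc != base.netloc:
        return False
    base_path = base.path.rstrip("/")
    # Match the folder itself or anything inside it.
    return target.path.rstrip("/") == base_path or target.path.startswith(base_path + "/")
```

With `https://example.com/docs` as the base, `https://example.com/docs/intro` is accepted, while `https://example.com/blog/` and `https://example.com/docs-old/x` are not.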
Read pages from the site map
You can enter the exact URL of your website's sitemap file, or just your website's URL, and our system will find the sitemap on its own:
Note that we can find sitemap files automatically only if they have one of these common names:
- sitemap.xml,
- sitemap.xml.gz,
- sitemap_index.xml,
- sitemap-index.xml,
- sitemap_index.xml.gz,
- sitemap-index.xml.gz,
- .sitemap.xml,
- sitemap,
- admin/config/search/xmlsitemap,
- sitemap/sitemap-index.xml.
If your sitemap has a different name, please enter its URL manually.
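Automatic sitemap discovery can be thought of as trying each of the names listed above against the site root. The sketch below is illustrative only: `sitemap_candidates` and `find_sitemap` are hypothetical names, the site URL is assumed to be the root of the website, and the `exists` callable (for example, a function that checks for an HTTP 200 response) is supplied by the caller.

```python
from typing import Callable, Optional

# The common sitemap names listed above, in the same order.
COMMON_SITEMAP_NAMES = [
    "sitemap.xml",
    "sitemap.xml.gz",
    "sitemap_index.xml",
    "sitemap-index.xml",
    "sitemap_index.xml.gz",
    "sitemap-index.xml.gz",
    ".sitemap.xml",
    "sitemap",
    "admin/config/search/xmlsitemap",
    "sitemap/sitemap-index.xml",
]


def sitemap_candidates(site_url: str) -> list[str]:
    """Build the candidate sitemap URLs for a site root URL."""
    base = site_url.rstrip("/")
    return [f"{base}/{name}" for name in COMMON_SITEMAP_NAMES]


def find_sitemap(site_url: str, exists: Callable[[str], bool]) -> Optional[str]:
    """Return the first candidate URL for which exists(url) is true, else None."""
    for candidate in sitemap_candidates(site_url):
        if exists(candidate):
            return candidate
    return None
```

If `find_sitemap` returns `None`, none of the common names matched, which corresponds to the case where the sitemap URL has to be entered by hand.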
Manually entered pages
You can also enter URLs manually and click the plus button to add them to the list:
You can do this at any moment before you start indexing.
Indexing webpages
As soon as you have found or added the pages to be indexed, select them and click the Start indexing button. The indexing stats will be updated automatically. All indexed pages will be unchecked.
If you select many pages (thousands, for example), the process can take a significant amount of time. You may close the window and open it later to check the progress. Refresh the page to see the changes.
You can add as many pages as you want within your limits. After the indexing is done you will see the updated numbers and icons:
Re-indexing website
You can completely re-index your website by crawling it from scratch and then selecting and re-indexing the pages, or you can change and re-index an individual page, as with any other type of data source.
To re-index the whole website, select the pages to be indexed and click the Start indexing button. To re-index an individual page, find it, click the button to edit it (editing is optional), and index it:
Ignored pages
You may want to mark some pages as ignored. This can be done at any moment. Ignored pages will not be indexed or crawled. To mark a page as ignored, click the button in the corresponding column:
To remove this flag, click on the same button again.
After you mark or unmark a page, start crawling or indexing to save the changes.