Vectara-ingest: Data Ingestion made easy
A collection of crawlers for the Vectara community, making crawling and indexing documents quick and easy
Introduction
Vectara provides an easy-to-use, managed platform for building LLM-powered conversational search applications with your data.
Developing such an application typically includes the following steps:
- Retrieving the data or content from its source (a website, Notion, Jira, etc.)
- Using Vectara’s indexing API to ingest that content into a Vectara corpus
- Building a search user interface that calls Vectara’s Search API with a user query and displays the results to the user.
Indexing documents is very straightforward, and thanks to our Instant Index capability indexed data is available to query within seconds.
But how does one extract all this content from those sources in the first place?
For a website, you need to understand the latest in web crawling and HTML extraction; for sources with an API, like Jira, Notion, or Discourse, you need to learn the detailed nuances of each of those APIs. This can quickly become overwhelming.
This is why I’m excited to announce the release of vectara-ingest, an open source project that includes a set of reusable code for crawling data sources and indexing the extracted content into Vectara corpora.
Getting Started with vectara-ingest
So how do you index a data source with vectara-ingest?
Let’s start with an example: we will crawl and index the content of sf.gov – the website for the city and county of San Francisco.
First, we set up our environment (instructions shown for Mac; for other environments see here):
- Install Docker if it’s not yet installed
- Install Python 3.8 or above if it’s not yet installed
- Install the yq command if it’s not yet installed (brew install yq)
- Clone the repo with git clone https://github.com/vectara/vectara-ingest and then cd vectara-ingest
Next let’s open the Vectara console and set up a corpus for this crawl job.
Figure 1: Create new corpus
We call our corpus “sf” and provide a simple description. We can then click the “Create” button and the corpus is ready.
In the new corpus view, we generate an API key for indexing and searching:
Figure 2: Create an API key for the corpus
Now that the corpus is ready, we can create a new YAML configuration file config/sf.yaml for our crawl job:
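A minimal sketch of what such a config might contain, based on the parameters reviewed below (the customer_id and corpus_id values are placeholders, and exact key names may differ from the current vectara-ingest release):

```yaml
vectara:
  customer_id: 1234567890   # placeholder - your Vectara customer ID
  corpus_id: 3              # placeholder - shown on the corpus page in the console

crawling:
  crawler_type: website     # use the website crawler

website_crawler:
  website_homepage: https://www.sf.gov
  pages_source: sitemap     # discover pages via the site's sitemap
  extraction: pdf           # render each page to PDF before indexing
  delay: 1                  # seconds to wait between URL extractions
```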
Let’s review the contents of this configuration file:
- The vectara section provides the information about our Vectara account and corpus – in this case the corpus_id (found in the top-right of this specific corpus page in the console) and the customer_id.
- The crawling section has a single parameter, crawler_type. In our case we select the “website” crawler type (see here for the list of all available crawlers).
- The website_crawler section provides specific parameters for this crawl job:
  - We specify the target website URL with the website_homepage parameter.
  - In the pages_source parameter we choose the sitemap crawling technique.
  - We choose the PDF method for rendering website content by setting pdf in the extraction parameter.
  - We specify a 1-second delay between URL extractions to make sure we don’t overload the sf.gov website.
Vectara-ingest uses a secrets.toml file to hold secrets that are not part of the code base, such as API keys. In this case we add a specific profile called “sf”, and store under this profile the Vectara auth_url and the API key we created earlier.
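A sketch of what that profile might look like, with placeholder values (key names follow the description above; the exact format is documented in the project README):

```toml
[sf]
auth_url = "<VECTARA_AUTH_URL>"   # placeholder - your Vectara authentication URL
api_key = "<VECTARA_API_KEY>"     # placeholder - the API key created in the console
```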
To run the crawl job we use the run.sh script:
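With the config and secrets in place, the invocation might look like this (the second argument selects the secrets profile; the argument order is an assumption, so check the project README):

```sh
bash run.sh config/sf.yaml sf
```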
This creates the Docker image and a Docker container (called vingest), then runs that container with the sf.yaml file we provided and kicks off the crawl job. If you want to track progress, you can follow the log messages from the running Docker container with:
docker logs -f vingest
Once the job is finished, we can use the Vectara console to explore the results by trying out some search queries:
Figure 3: Searching sf.gov from the Vectara console
How does a vectara-ingest crawler work?
Now that we’ve seen how to run a crawl job using vectara-ingest, let’s look at an example crawler (the RSS crawler) to better understand how it works internally.
The RSS crawler retrieves a list of URLs from an RSS feed and ingests the documents pointed to by these URLs into a Vectara corpus.
This crawler has the following parameters:
- source: the name of the RSS source.
- rss_pages: a list of one or more RSS feed URLs.
- days_past: specifies a filtering condition; URLs from the RSS feed will be included only if they have been published in the last N days.
- delay: the number of seconds to wait between indexing operations (to avoid overloading servers).
- extraction: determines the way we want to extract content from the URL (valid values are pdf or html).
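Put together, an RSS crawl job configuration might include a section along these lines (the rss_crawler section name and the feed URL are illustrative assumptions; consult the project documentation for the exact schema):

```yaml
crawling:
  crawler_type: rss

rss_crawler:                  # section name assumed for illustration
  source: bbc                 # name of the RSS source
  rss_pages:
    - "https://feeds.bbci.co.uk/news/rss.xml"   # example feed URL
  days_past: 90               # only include items published in the last 90 days
  delay: 1                    # seconds to wait between indexing operations
  extraction: pdf             # pdf or html
```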
Every crawler in vectara-ingest is a subclass of the Crawler base class and has to implement the crawl() method, and RSSCrawler is no exception:
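In outline, the class looks roughly like this (a simplified sketch; the import path is an assumption, and the real implementation in the repository is more complete):

```python
from core.crawler import Crawler   # assumption: module path of the Crawler base class

class RSSCrawler(Crawler):
    def crawl(self):
        # Step 1: collect the URLs listed in the configured RSS feeds,
        #         filtered by the days_past window
        # Step 2: render each URL to a file and index it into the Vectara corpus
        ...
```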
Take a look at the implementation of this method and notice there are two major steps:
In the first step, we collect a list of all URLs from the RSS feeds that are within the time period specified by days_past:
```python
feed = feedparser.parse(rss_page)
for entry in feed.entries:
    # Entries without a publication date are kept, with the date recorded as unknown
    if "published_parsed" not in entry:
        urls.append([entry.link, entry.title, None])
        continue
    # Keep only entries published within the days_past window
    entry_date = datetime.fromtimestamp(mktime(entry.published_parsed))
    if entry_date >= days_ago and entry_date <= today:
        urls.append([entry.link, entry.title, entry_date])
```
In the second step, for each URL we call the url_to_file() helper method to render the content into a PDF file, and the index_file() method (part of the Indexer object) to index that content into the Vectara corpus:
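A simplified sketch of this second step (the helper signatures and the delay attribute are assumptions made for illustration; see the repository for the actual code):

```python
import time

for url, title, pub_date in urls:
    # Render the page behind the URL into a local PDF file
    filename = self.url_to_file(url, title=title)   # signature assumed for illustration
    # Ingest the rendered file into the Vectara corpus
    self.indexer.index_file(filename, uri=url)      # signature assumed for illustration
    # Pause between operations so we don't overload the source server
    time.sleep(self.delay)                          # delay attribute assumed
```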
Make your own crawler!
We saw how to use vectara-ingest to run a website crawl job, and then looked at the code of the RSSCrawler for a detailed example of how the internals of a crawler work.
The vectara-ingest project has many other crawlers implemented that might come in handy:
- Mediawiki: crawl a website powered by MediaWiki, such as Wikipedia
- Notion: crawl content from your company’s Notion instance
- Jira: crawl your company’s Jira instance indexing the issues and comments
- Docusaurus: crawl a documentation site powered by Docusaurus
- Discourse: crawl a public forum powered by Discourse
- S3: crawl files on an S3 bucket
- Folder: crawl all files in a certain local folder
- PMC: crawl scientific papers from PubMed Central
- GitHub: crawl a GitHub repository, indexing all issues and comments
- Hacker News: crawl top stories from Hacker News
- Edgar: crawl 10-K annual reports from the SEC Edgar website
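If you decide to make your own crawler, the skeleton is small; here is a hypothetical example (the class name, module path, and steps are illustrative only, and the existing crawlers in the repository are the authoritative pattern):

```python
from core.crawler import Crawler   # assumption: module path of the Crawler base class

class MyCrawler(Crawler):          # hypothetical crawler for illustration
    def crawl(self):
        # 1. Enumerate the documents in your data source
        # 2. Extract the content of each document (e.g. render to PDF or HTML)
        # 3. Index each document into the Vectara corpus via the shared Indexer
        ...
```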
I invite you to contribute to this project – whether it’s to improve an existing crawler implementation, contribute a new crawler type, or even add a small improvement to the project documentation – every contribution is appreciated.
Please see the contribution guidelines for additional information, and submit a PR.
Summary
Vectara-ingest provides code samples that make data source crawling and indexing easier, and a framework to easily run “crawl” jobs to ingest data into Vectara.
Vectara community members are using this codebase to build their LLM-powered applications, making the whole process simpler.
I am excited to see this project continue to evolve, support additional types of data ingestion flows, and power new and innovative LLM-powered applications.