Train on a website

A website source is the most common way to train a chatbot. Hilal Chatbot crawls the URL you provide, extracts text, and indexes it. The bot then cites that content when answering.

In this guide:

  • Add a website source
  • Set crawl depth and limits
  • What gets indexed (and what doesn’t)
  • Re-crawl when content changes
  • Troubleshoot common crawl issues

Step 1: Add the source

On the chatbot detail page, go to Knowledge base → Add knowledge source → Website.

Paste the URL you want indexed. Public HTML pages work best. Sites that require login, render entirely from JavaScript without server-side HTML, or block bots in robots.txt may extract poorly; for those, use Documents or Snippets instead.
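Before adding a URL, it can help to check whether robots.txt would block a crawler. The sketch below uses Python's standard `urllib.robotparser` on a robots.txt body you have already downloaded. The user-agent name `HilalBot` is a placeholder for illustration only; this guide does not state the crawler's actual user-agent string.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file as a list of lines, so no network call is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "HilalBot" is a hypothetical user-agent, not a documented name.
SAMPLE_ROBOTS = """\
User-agent: HilalBot
Disallow: /private/

User-agent: *
Disallow:
"""

print(is_allowed(SAMPLE_ROBOTS, "HilalBot", "https://example.com/docs/intro"))
print(is_allowed(SAMPLE_ROBOTS, "HilalBot", "https://example.com/private/x"))
```

With this sample file the first URL is allowed and the second is blocked, because the `Disallow: /private/` rule applies to the placeholder agent.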

Screenshot: A website source crawling, with progress percentage.

Step 2: Configure crawl

Option | Default | When to change
Max depth | 3 | Raise to 5–6 for deeply nested help centers; lower to 1 to index a single page.
Max pages | 500 | Raise on Enterprise plans for large sites.
Include subdomains | off | Turn on if your help center lives on help.example.com and you want to also index docs.example.com.
Excluded paths | (none) | Glob patterns of pages to skip. Good for excluding /blog/, /legal/, /login/.
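To make the interaction between these options concrete, here is a minimal sketch of how a crawler might decide whether to fetch a discovered URL. It is an illustration, not Hilal Chatbot's implementation: the `CrawlConfig` and `should_crawl` names are invented, the glob matching uses Python's `fnmatchcase` (the product's exact glob dialect is not documented here), and the "same site" test is a naive last-two-labels comparison. The seed page counts as depth 1, in line with the note that a max depth of 1 indexes a single page.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase
from urllib.parse import urlparse


@dataclass
class CrawlConfig:
    # Defaults mirror the options table; the field names are hypothetical.
    max_depth: int = 3
    max_pages: int = 500
    include_subdomains: bool = False
    excluded_paths: list[str] = field(default_factory=list)


def _site(host: str) -> str:
    # Naive "registrable domain": the last two labels. This is wrong for
    # suffixes such as co.uk; a real crawler would use the Public Suffix List.
    return ".".join(host.split(".")[-2:])


def should_crawl(url: str, depth: int, pages_so_far: int,
                 seed_url: str, config: CrawlConfig) -> bool:
    """Decide whether a discovered URL falls inside the configured limits."""
    if depth > config.max_depth or pages_so_far >= config.max_pages:
        return False

    host = urlparse(url).hostname or ""
    seed_host = urlparse(seed_url).hostname or ""
    if host != seed_host:
        # Other hosts are only followed when subdomains are enabled
        # and the host belongs to the same site as the seed.
        if not (config.include_subdomains and _site(host) == _site(seed_host)):
            return False

    path = urlparse(url).path or "/"
    return not any(fnmatchcase(path, p) for p in config.excluded_paths)


seed = "https://help.example.com/"
cfg = CrawlConfig(excluded_paths=["/blog/*", "/login*"])
print(should_crawl("https://help.example.com/articles/a", 2, 10, seed, cfg))
print(should_crawl("https://help.example.com/blog/post", 2, 10, seed, cfg))
```

In this sketch a pattern such as `/blog/*` matches everything under `/blog/`, including nested paths, because `*` in `fnmatchcase` also matches slashes.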

Click Add.

Step 3: Watch the crawl

Status sequence:

  1. pending — queued for a crawl worker.
  2. crawling — fetching pages. Progress is approximate because the real page total is only known once discovery finishes.
  3. processing — extracting and chunking content.
  4. training — building the vector index.
  5. trained — live and answering.

A 50-page site finishes in 1–3 minutes. A 5,000-page site can take 30+ minutes.
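If you display these statuses in your own tooling, the sequence above can be modeled as an ordered enum. This is only a sketch: the class and function names are invented for illustration, and it covers the five statuses listed here, not any failure states the product may report.

```python
from enum import Enum


class SourceStatus(Enum):
    PENDING = "pending"
    CRAWLING = "crawling"
    PROCESSING = "processing"
    TRAINING = "training"
    TRAINED = "trained"


# Enum members iterate in definition order, which matches the status sequence.
ORDER = list(SourceStatus)


def progress_label(status: str) -> str:
    """Render a status string as 'step N of 5: name'."""
    step = ORDER.index(SourceStatus(status)) + 1
    return f"step {step} of {len(ORDER)}: {status}"


def is_live(status: str) -> bool:
    """Only a trained source is live and answering."""
    return SourceStatus(status) is SourceStatus.TRAINED


print(progress_label("processing"))  # step 3 of 5: processing
print(is_live("training"))           # False
```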

What gets indexed

The crawler extracts the main content of each page — paragraphs, lists, tables, headings. It skips:

  • <nav>, <header>, <footer>, <aside> elements.
  • Cookie banners and modals (best-effort).
  • Pure-image content (no OCR for crawled pages).
  • Pages that return a non-200 status.
  • Pages excluded by robots.txt (we respect Disallow for our user-agent).

If your content lives only in JavaScript-rendered components, the crawler may see an empty page. Either provide a server-side rendered version, or export the content and upload it under Documents.
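The element-skipping rule can be illustrated with Python's standard `html.parser`. The sketch below drops any text nested inside `<nav>`, `<header>`, `<footer>`, or `<aside>` and keeps the rest. It is a simplified stand-in, not the crawler's actual extractor; the class and function names are invented, and it does not attempt cookie-banner detection.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"nav", "header", "footer", "aside"}


class MainContentParser(HTMLParser):
    """Collect text that is not nested inside any skipped element."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0  # number of skipped elements currently open
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    parser = MainContentParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = """
<header>Acme Help</header>
<nav><a href="/">Home</a></nav>
<h1>Reset your password</h1>
<p>Open Settings, then choose Security.</p>
<footer>Copyright Acme</footer>
"""
print(extract_main_text(page))

# A client-rendered shell has no text in its HTML, so nothing is extracted.
print(repr(extract_main_text('<div id="root"></div><script src="app.js"></script>')))
```

The second call returns an empty string, which is the situation described above for pages whose content exists only after JavaScript runs.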

Re-crawl when content changes

Two ways:

  1. Manual — open the source row, click Retrain. The crawler runs again and replaces the old index.
  2. Automatic — turn on Auto-retraining (New) and pick daily, weekly, or monthly. The system re-crawls on schedule.
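As a rough guide to when a scheduled re-crawl would next run, the sketch below computes the following run date for each frequency. How the product actually anchors its schedule (time of day, time zone, what "monthly" means at month end) is not documented here; the calendar-month rule and the `next_crawl` function are assumptions for illustration.

```python
import calendar
from datetime import date, timedelta


def next_crawl(last_crawl: date, frequency: str) -> date:
    """Return the date of the next scheduled crawl after last_crawl."""
    if frequency == "daily":
        return last_crawl + timedelta(days=1)
    if frequency == "weekly":
        return last_crawl + timedelta(weeks=1)
    if frequency == "monthly":
        # Assumption: "monthly" means the same day in the next calendar month.
        year = last_crawl.year + last_crawl.month // 12
        month = last_crawl.month % 12 + 1
        # Clamp the day so that Jan 31 rolls to the last day of February.
        day = min(last_crawl.day, calendar.monthrange(year, month)[1])
        return date(year, month, day)
    raise ValueError(f"unknown frequency: {frequency!r}")


print(next_crawl(date(2025, 3, 1), "weekly"))    # 2025-03-08
print(next_crawl(date(2025, 1, 31), "monthly"))  # 2025-02-28
```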

Troubleshooting

  • “0 pages indexed.” There are three usual causes: robots.txt blocks our user-agent, the site requires JavaScript to render, or the URL returned a non-200 status. Verify by visiting the URL in a private browser tab, which shows what a visitor without a login session receives.
  • Wrong content extracted (cookie banner, navigation menu, etc.). Use Excluded paths to skip the offending sections of the site, or move to a Document source.
  • Crawl never finishes. A site with too many pages or pagination loops can hit limits. Raise Max pages or use a sitemap to give us a curated list.
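To see why endless pagination runs into the page limit, consider this small simulation. It is a sketch of a generic breadth-first crawl with depth and page caps, not the product's crawler; `bounded_crawl` and `endless_pagination` are invented names, and the simulated site simply links each listing page to a brand-new next page.

```python
from collections import deque
from typing import Callable, Iterable


def bounded_crawl(seed: str, get_links: Callable[[str], Iterable[str]],
                  max_depth: int = 3, max_pages: int = 500) -> list[str]:
    """Breadth-first crawl that stops at max_depth or max_pages."""
    visited = {seed}
    queue = deque([(seed, 1)])  # the seed counts as depth 1
    fetched: list[str] = []
    while queue and len(fetched) < max_pages:
        url, depth = queue.popleft()
        fetched.append(url)
        if depth == max_depth:
            continue  # do not follow links past the depth limit
        for link in get_links(url):
            if link not in visited:  # guards against true cycles
                visited.add(link)
                queue.append((link, depth + 1))
    return fetched


def endless_pagination(url: str) -> list[str]:
    """Simulated site: every listing page links to a new next page."""
    page = int(url.rsplit("=", 1)[1])
    return [f"https://example.com/list?page={page + 1}"]


start = "https://example.com/list?page=1"
pages = bounded_crawl(start, endless_pagination, max_depth=10_000, max_pages=500)
print(len(pages))  # 500
print(len(bounded_crawl(start, endless_pagination, max_depth=3)))  # 3
```

A visited set stops genuine cycles, but it cannot help when each "next" link is a URL the crawler has never seen. In that case only the page or depth cap ends the crawl, which is why a curated sitemap is the more reliable fix.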

What’s next