Websites

Train on a website

A website source is the most common way to train a chatbot. Hilal Chatbot crawls the URL you provide, extracts text, and indexes it. The bot then cites that content when answering.

In this guide:

Add a website source
Set crawl depth and limits
What gets indexed (and what doesn’t)
Re-crawl when content changes
Troubleshoot common crawl issues

Step 1: Add the source

On the chatbot detail page, go to Knowledge base → Add knowledge source → Website.

Paste the URL you want indexed. Public HTML pages work best. Sites that require login, render entirely from JavaScript without server-side HTML, or block bots in robots.txt may extract poorly. For those, use Documents or Snippets instead.

Screenshot: A website source crawling, with progress percentage.

Step 2: Configure crawl

Option	Default	When to change
Max depth	3	Raise to 5–6 for deeply nested help centers; lower to 1 to index a single page.
Max pages	500	Raise on Enterprise plans for large sites.
Include subdomains	off	Turn on if your help center lives on `help.example.com` and you want to also index `docs.example.com`.
Excluded paths	(none)	Glob patterns of pages to skip. Good for excluding `/blog/`, `/legal/`, `/login/`.

Click Add.
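
To make the interaction between these options concrete, here is a minimal sketch of a breadth-first crawl plan over an in-memory link graph. It is not Hilal Chatbot's crawler: the function names, the `links` dictionary, the rule that the start page counts as depth 1 (chosen so that a max depth of 1 yields a single page, as in the table), the two-label approximation of "same domain", and the `/blog/*` glob syntax are all assumptions made for illustration.

```python
from collections import deque
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def in_scope(url, start_url, include_subdomains, excluded_paths):
    """Return True if `url` may be crawled under the given settings."""
    start_host = urlparse(start_url).hostname or ""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if host != start_host:
        # Assumption: sibling subdomains share the last two host labels
        # (example.com). Real domains such as example.co.uk need more care.
        base = ".".join(start_host.split(".")[-2:])
        same_site = host == base or host.endswith("." + base)
        if not (include_subdomains and same_site):
            return False
    path = parsed.path or "/"
    # Assumption: excluded paths are shell-style globs matched on the URL path.
    return not any(fnmatchcase(path, pattern) for pattern in excluded_paths)


def plan_crawl(start_url, links, max_depth=3, max_pages=500,
               include_subdomains=False, excluded_paths=()):
    """Breadth-first walk over `links`, a dict of url -> list of linked urls."""
    queue = deque([(start_url, 1)])  # assumption: the start page is depth 1
    seen = {start_url}
    crawled = []
    while queue and len(crawled) < max_pages:
        url, depth = queue.popleft()
        crawled.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: index the page, skip its links
        for target in links.get(url, []):
            if target in seen:
                continue
            if in_scope(target, start_url, include_subdomains, excluded_paths):
                seen.add(target)
                queue.append((target, depth + 1))
    return crawled
```

Under these assumptions, `max_depth=1` returns only the start URL, and a pattern such as `/blog/*` drops every page whose path starts with `/blog/`.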

Step 3: Watch the crawl

Status sequence:

pending: queued for a crawl worker.
crawling: fetching pages. Progress is approximate because the real total is only known after discovery.
processing: extracting and chunking content.
training: building the vector index.
trained: live and answering.

A 50-page site typically finishes in 1–3 minutes. A 5,000-page site can take 30 minutes or more.
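
If you track these statuses in your own tooling, the sequence can be modeled as an ordered enum. This is only a sketch: the class and function names are hypothetical, this guide does not describe a status API, and failure states are not modeled because none are listed here.

```python
from enum import Enum


class SourceStatus(Enum):
    # Values mirror the status names listed in this guide, in order.
    PENDING = "pending"
    CRAWLING = "crawling"
    PROCESSING = "processing"
    TRAINING = "training"
    TRAINED = "trained"


# Enum members iterate in definition order, which is the crawl order.
STATUS_ORDER = list(SourceStatus)


def next_status(status):
    """Return the status that follows `status`, or None once trained."""
    index = STATUS_ORDER.index(status)
    if index + 1 < len(STATUS_ORDER):
        return STATUS_ORDER[index + 1]
    return None


def is_live(status_text):
    """Return True only when the source is trained and answering."""
    return SourceStatus(status_text) is SourceStatus.TRAINED
```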

What gets indexed

The crawler extracts the main content of each page: paragraphs, lists, tables, and headings. It skips:

<nav>, <header>, <footer>, <aside> elements.
Cookie banners and modals (best-effort).
Pure-image content (no OCR for crawled pages).
Pages that return a non-200 status.
Pages excluded by robots.txt (we respect Disallow for our user-agent).
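
To preview roughly what survives the element rules above, you can run a page's raw HTML through a small filter. The sketch below uses Python's standard `html.parser` and is an approximation, not the crawler's actual extraction logic. The names `MainTextExtractor` and `extract_main_text` are hypothetical, and skipping `<script>` and `<style>` is an added assumption so that code is not mistaken for page text.

```python
from html.parser import HTMLParser

# The first four tags come from the list in this guide; script and style
# are an assumption added so inline code and CSS do not count as content.
SKIPPED_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any skipped element."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # how many skipped elements are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_main_text(html):
    """Return the visible text of `html` outside the skipped elements."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

An empty result for a page that looks fine in a browser suggests the content is rendered by JavaScript rather than present in the server's HTML.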

If your content lives only in JavaScript-rendered components, the crawler may get an empty page. Either provide a server-side rendered version or upload a document export.

Re-crawl when content changes

Two ways:

Manual: open the source row and click Retrain. The crawler runs again and replaces the old index.
Automatic: turn on Auto-retraining and pick daily, weekly, or monthly. The system re-crawls on that schedule.

Troubleshooting

“0 pages indexed.” There are three usual causes: robots.txt blocks our user-agent, the site requires JavaScript to render, or the URL returned a non-200 status. Verify by visiting the URL in a private browser tab.
Wrong content extracted (cookie banner, navigation menu, etc.). Use Excluded paths to skip the offending sections of the site, or move to a Document source.
Crawl never finishes. A site with too many pages or pagination loops can hit limits. Raise Max pages or use a sitemap to give us a curated list.
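
For the robots.txt cause, you can check a site's rules locally with Python's standard `urllib.robotparser`. This is a sketch under stated assumptions: the helper name `is_allowed` is hypothetical, and `ExampleBot` is a placeholder because this guide does not state the crawler's actual user-agent string.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if `robots_txt` lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that blocks one bot and allows everyone else.
SAMPLE_ROBOTS = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Allow: /
"""
```

With this sample, `is_allowed(SAMPLE_ROBOTS, "ExampleBot", "https://example.com/help")` reports that the placeholder bot is blocked while other agents are allowed. Paste your own robots.txt contents and the real user-agent to run the same check.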

What’s next

Next → Upload documents, Auto-retraining
