# ChatAIze.RabbitHole

C# library for scraping text content from websites.

Rabbit Hole is a small, deterministic web text scraper for .NET. It discovers links within a root URL and extracts readable text from HTML pages. The output is a Markdown-like string suited for indexing, summarization, or offline processing.

## Installation

```shell
dotnet add package ChatAIze.RabbitHole
```
## Quick Start

```csharp
using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();

// Discover links: depth 2 fetches the root page and yields its links
await foreach (var link in scraper.ScrapeLinksAsync("https://example.com", depth: 2))
{
    Console.WriteLine(link);
}

// Extract readable text and metadata from a single page
var page = await scraper.ScrapeContentAsync("https://example.com");
Console.WriteLine(page.Title);
Console.WriteLine(page.Content);
```
## Scraping Links and Content Together

```csharp
using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();

// Crawl to depth 3 and print the title of every discovered page
await foreach (var link in scraper.ScrapeLinksAsync("https://example.com", depth: 3))
{
    var page = await scraper.ScrapeContentAsync(link);
    Console.WriteLine($"{page.Url} -> {page.Title}");
}
```
## Cancellation

```csharp
using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();

// Stop crawling after 30 seconds
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));

await foreach (var link in scraper.ScrapeLinksAsync("https://example.com", depth: 3, cts.Token))
{
    Console.WriteLine(link);
}
```
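When the token fires mid-enumeration, the standard .NET convention is for the iteration to throw `OperationCanceledException`. Assuming the library follows that convention (this is not stated in its docs), a defensive sketch looks like:

```csharp
using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));

try
{
    await foreach (var link in scraper.ScrapeLinksAsync("https://example.com", depth: 3, cts.Token))
    {
        Console.WriteLine(link);
    }
}
catch (OperationCanceledException)
{
    // Timeout reached; treat whatever was yielded so far as the result
    Console.WriteLine("Crawl stopped after 30 seconds.");
}
```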
## Filtering Links

```csharp
using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();

await foreach (var link in scraper.ScrapeLinksAsync("https://example.com", depth: 3))
{
    // Only scrape documentation pages
    if (!link.Contains("/docs/"))
    {
        continue;
    }

    var page = await scraper.ScrapeContentAsync(link);
    Console.WriteLine(page.Content);
}
```
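Since `ScrapeContentAsync` throws `HttpRequestException` for non-success status codes, a long crawl usually wants to skip broken pages rather than abort. A minimal sketch:

```csharp
using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();

await foreach (var link in scraper.ScrapeLinksAsync("https://example.com", depth: 3))
{
    try
    {
        var page = await scraper.ScrapeContentAsync(link);
        Console.WriteLine(page.Content);
    }
    catch (HttpRequestException ex)
    {
        // Non-success status code (404, 500, ...): log and move on
        Console.Error.WriteLine($"Skipping {link}: {ex.Message}");
    }
}
```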
## How Crawling Works

Crawl depth is controlled by the `depth` parameter:

- `depth: 2` fetches the root page and yields its links, but does not fetch those links.
- `depth: 3` fetches the root page and each linked page once, but does not go deeper.

Link handling:

- Relative links (starting with `/`) are resolved against the root host.
- `mailto:`, `tel:`, and anchor-only (`#...`) links are skipped.
- Only `text/html` pages are processed; some link types are skipped entirely (see `WebsiteScraper` for the list).
- Pages without extractable data yield a `PageDetails` instance with null metadata and content.

Content extraction:

- The title, description, and keywords come from `<title>`, `<meta name="description">`, and `<meta name="keywords">`.
- Body text is extracted from `article`, `main`, or `div.content`, falling back to the entire document.
- `h1`-`h6` map to `#`-style headings; lists become `-` or numbered list items.
- The output is Markdown-like and optimized for readability, not strict Markdown compliance.
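Because every `PageDetails` field except `Url` is nullable, and pages without extractable data come back with null metadata and content, it is worth null-checking before indexing. A small sketch:

```csharp
using ChatAIze.RabbitHole;

var scraper = new WebsiteScraper();
var page = await scraper.ScrapeContentAsync("https://example.com");

// Fall back gracefully when metadata is missing
var title = page.Title ?? "(untitled)";

if (string.IsNullOrWhiteSpace(page.Content))
{
    Console.WriteLine($"No readable text found at {page.Url}");
}
else
{
    Console.WriteLine($"{title}: {page.Content.Length} characters");
}
```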
Example output:

```markdown
# Welcome

This is a [link](https://example.com/about).

- First item
- Second item
```

### Notes

- `ScrapeLinksAsync` performs best-effort crawling and skips pages that fail to load or parse.
- `ScrapeContentAsync` throws `HttpRequestException` for non-success status codes.
- Scope matching is prefix-based: `/docs` and `/docs-old` are both treated as in-scope.
- Only anchor tags (`<a href=...>`) are used for link discovery.

## API Reference

### WebsiteScraper

```csharp
public async IAsyncEnumerable<string> ScrapeLinksAsync(
    string url,
    int depth = 2,
    CancellationToken cancellationToken = default)

public async ValueTask<PageDetails> ScrapeContentAsync(
    string url,
    CancellationToken cancellationToken = default)
```

### PageDetails

```csharp
public sealed record PageDetails(
    string Url,
    string? Title,
    string? Description,
    string? Keywords,
    string? Content);
```

## Development

Build the library:
```shell
dotnet build
```

Run the preview app:

```shell
dotnet run --project ChatAIze.RabbitHole.Preview
```

## License

GPL-3.0-or-later. See LICENSE.txt.