# WebSpark.HttpClientUtility.Crawler

Web crawling extension for WebSpark.HttpClientUtility. Includes SiteCrawler and SimpleSiteCrawler with robots.txt compliance, HTML link extraction (HtmlAgilityPack), sitemap generation (Markdig), CSV export (CsvHelper), and real-time SignalR progress updates. Perfect for web scraping, SEO audits, and site analysis. Supports .NET 8 LTS, .NET 9, and .NET 10 (Preview). Requires the WebSpark.HttpClientUtility base package [2.2.0]. Install both packages and call AddHttpClientUtility() + AddHttpClientCrawler() in your DI registration.

```shell
dotnet add package WebSpark.HttpClientUtility.Crawler
```

This package provides enterprise-grade web crawling capabilities with robots.txt compliance, HTML parsing, sitemap generation, and real-time progress tracking via SignalR.
**Important:** This is an extension package. You must also install the base package WebSpark.HttpClientUtility (version 2.0.0 or later).

Install both packages:

```shell
dotnet add package WebSpark.HttpClientUtility
dotnet add package WebSpark.HttpClientUtility.Crawler
```
Register the services in `Program.cs`:

```csharp
using WebSpark.HttpClientUtility;
using WebSpark.HttpClientUtility.Crawler;

var builder = WebApplication.CreateBuilder(args);

// Register base package (required)
builder.Services.AddHttpClientUtility();

// Register crawler package
builder.Services.AddHttpClientCrawler();

var app = builder.Build();

// Optional: Register SignalR hub for progress updates
app.MapHub<CrawlHub>("/crawlHub");

app.Run();
```
Inject `ISiteCrawler` to run a full crawl:

```csharp
public class CrawlerService
{
    private readonly ISiteCrawler _crawler;

    public CrawlerService(ISiteCrawler crawler)
    {
        _crawler = crawler;
    }

    public async Task<CrawlResult> CrawlWebsiteAsync(string url)
    {
        var options = new CrawlerOptions
        {
            StartUrl = url,
            MaxDepth = 3,
            MaxPages = 100,
            RespectRobotsTxt = true
        };

        var result = await _crawler.CrawlAsync(options);

        Console.WriteLine($"Crawled {result.TotalPages} pages in {result.Duration}");
        return result;
    }
}
```
For lightweight crawling without full recursion:

```csharp
public class SimpleCrawlerService
{
    private readonly SimpleSiteCrawler _simpleCrawler;

    public SimpleCrawlerService(SimpleSiteCrawler simpleCrawler)
    {
        _simpleCrawler = simpleCrawler;
    }

    public async Task<List<string>> GetAllLinksAsync(string url)
    {
        var result = await _simpleCrawler.CrawlAsync(url);
        return result.DiscoveredUrls;
    }
}
```
Configure crawler defaults at registration:

```csharp
builder.Services.AddHttpClientCrawler(options =>
{
    options.DefaultMaxDepth = 5;
    options.DefaultMaxPages = 500;
    options.DefaultTimeout = TimeSpan.FromSeconds(30);
    options.UserAgent = "MyBot/1.0";
});
```
Export crawl results to CSV:

```csharp
var result = await _crawler.CrawlAsync(options);
await result.ExportToCsvAsync("crawl-results.csv");
```
Subscribe to real-time progress updates from the browser:

```javascript
// Client-side JavaScript
const connection = new signalR.HubConnectionBuilder()
    .withUrl("/crawlHub")
    .build();

connection.on("CrawlProgress", (progress) => {
    console.log(`Progress: ${progress.pagesProcessed}/${progress.totalPages}`);
});

await connection.start();
```
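If you drive a crawl yourself and want to feed the client handler above, a standard ASP.NET Core SignalR broadcast looks like the sketch below. The payload shape mirrors the `pagesProcessed`/`totalPages` fields the JavaScript reads; whether the crawler package raises this event internally is up to the package itself, so treat this as a generic illustration, not its documented behavior.

```csharp
using Microsoft.AspNetCore.SignalR;
using WebSpark.HttpClientUtility.Crawler;

public class CrawlProgressNotifier
{
    private readonly IHubContext<CrawlHub> _hubContext;

    public CrawlProgressNotifier(IHubContext<CrawlHub> hubContext)
    {
        _hubContext = hubContext;
    }

    // Push a progress update to every connected client.
    // "CrawlProgress" matches the event name the client-side script listens for.
    public Task ReportAsync(int pagesProcessed, int totalPages) =>
        _hubContext.Clients.All.SendAsync("CrawlProgress",
            new { pagesProcessed, totalPages });
}
```

`IHubContext<THub>` is resolvable from DI anywhere in the app, so a background crawl job can report progress without holding a hub instance.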
| Property | Type | Default | Description |
|---|---|---|---|
| StartUrl | string | required | The URL to start crawling from |
| MaxDepth | int | 3 | Maximum depth to crawl (0 = no limit) |
| MaxPages | int | 100 | Maximum number of pages to crawl |
| RespectRobotsTxt | bool | true | Honor robots.txt directives |
| Timeout | TimeSpan | 30s | Request timeout per page |
| UserAgent | string | "WebSpark.Crawler" | User agent string |
| AllowedDomains | List&lt;string&gt; | null | Restrict crawling to specific domains |
| ExcludedPaths | List&lt;string&gt; | null | Paths to exclude from crawling |
The crawler automatically honors:

- `Disallow` directives
- `Crawl-delay` settings

Disable with:

```csharp
options.RespectRobotsTxt = false; // Not recommended
```
Set sensible `MaxDepth` and `MaxPages` limits to avoid overwhelming servers.

If you're upgrading from WebSpark.HttpClientUtility v1.x:

1. Install the crawler package: `dotnet add package WebSpark.HttpClientUtility.Crawler`
2. Add the namespace: `using WebSpark.HttpClientUtility.Crawler;`
3. Update your DI registration: `services.AddHttpClientCrawler();`

All crawler APIs remain unchanged - only the registration is different.
MIT License - see LICENSE