ReadableWeb extracts readable article content, images and metadata from HTML. Small, modular .NET 10 library for article extraction and readability parsing.
Dependencies: 10 | Vulnerabilities: 0 | Published: Dec 21, 2025
$ dotnet add package ReadableWeb

Small, modular .NET 10 library and tools to extract article content, images and metadata from web pages. The solution contains extractors, abstractions, HTML parsing implementations, tests and benchmarks.

ReadableWeb.Abstractions - public interfaces for extractors and processors
ReadableWeb - composition and higher-level services
ReadableWeb.HtmlAgilityPack - HTML Agility Pack based extractor implementation
ReadableWeb.AngleSharp - AngleSharp based extractor implementation
ReadableWeb.Tests - unit tests
ReadableWeb.Benchmarks - benchmark projects
ReadableWeb.TestConsole - sample/test console app

dotnet-ef or other tooling only if needed for local tasks

Quick extraction via the default HTTP helper:
using ReadableWeb.Extraction;

var extractor = HttpArticleExtractor.CreateDefault();
var article = await extractor.ExtractFromUrlAsync(
    "https://www.example.com/news/story");

Console.WriteLine(article.Title);
Console.WriteLine(article.Excerpt);

foreach (var image in article.Images)
{
    Console.WriteLine($"Image: {image.Url}");
}
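When the target site is slow, it can help to bound the extraction with a timeout. The following is a minimal sketch, not the library's documented pattern: it assumes the extractor returned by `HttpArticleExtractor.CreateDefault()` accepts the same `cancellationToken` parameter that `IHttpArticleExtractor.ExtractFromUrlAsync` is called with in the endpoint examples of this README. The 15 second limit is an arbitrary illustrative value.

```csharp
using System;
using System.Threading;
using ReadableWeb.Extraction;

// Cancel the whole extraction if it runs longer than 15 seconds (arbitrary value).
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(15));

var extractor = HttpArticleExtractor.CreateDefault();

try
{
    // Assumption: the default extractor exposes the cancellationToken parameter
    // used with IHttpArticleExtractor elsewhere in this README.
    var article = await extractor.ExtractFromUrlAsync(
        "https://www.example.com/news/story",
        cancellationToken: cts.Token);

    Console.WriteLine(article.Title);
}
catch (OperationCanceledException)
{
    // Raised when the token is cancelled, here because the timeout elapsed.
    Console.Error.WriteLine("Extraction timed out or was cancelled.");
}
```

Other failure modes (network errors, unparseable pages) are not covered here, because the exception types the library throws are not stated in this README.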
Register the library in an ASP.NET Core or worker service using the provided DI extension:
using ReadableWeb;
using ReadableWeb.Configuration;

builder.Services.AddArticleExtraction(builder.Configuration, "ArticleExtraction", options =>
{
    options.EnableImageFileCache = true;
    options.ImageFileCachePath = Path.Combine(builder.Environment.ContentRootPath, "wwwroot/images");
});
ReadableWeb supports two complementary caching mechanisms to improve performance and reduce bandwidth when extracting articles with images:
Enable the local image file cache by setting EnableImageFileCache = true in the configuration or by setting the option in the DI extension. When enabled, the extractor downloads the image assets referenced by the extracted article and persists them to the specified ImageFileCachePath on disk.
ImageFileCacheBaseUrl should be set to the public base URL segment from which the cached images are served (for example /article-images). The extractor rewrites image URLs in the returned ArticleContent to point at the base URL plus the cached filename.
ImageFileCachePath is a filesystem path relative to your application root (or an absolute path). Make sure the directory is writable by the process and is served by your web server (for example, place it under wwwroot in ASP.NET Core or configure static file serving for that path).
IgnoreImageDownloadErrors = true prevents extraction from failing when individual image downloads fail. When false, image download failures may surface as errors.

Example configuration for local image caching:
{
  "ArticleExtraction": {
    "Parser": "HtmlAgilityPack",
    "UseRedis": false,
    "EnableImageFileCache": true,
    "ImageFileCachePath": "wwwroot/article-images",
    "ImageFileCacheBaseUrl": "/article-images",
    "IgnoreImageDownloadErrors": true
  }
}
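If the cache directory sits under wwwroot, as in the configuration above, a plain `app.UseStaticFiles()` call already serves it at /article-images. When the directory lives elsewhere, the path has to be mapped explicitly. The following is a minimal sketch using standard ASP.NET Core static file middleware; the `cache/article-images` directory is a hypothetical location chosen for illustration, not something the library prescribes.

```csharp
using System.IO;
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.FileProviders;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Hypothetical cache directory outside wwwroot; it should match ImageFileCachePath.
var cachePath = Path.Combine(builder.Environment.ContentRootPath, "cache", "article-images");

// PhysicalFileProvider throws if the directory does not exist yet.
Directory.CreateDirectory(cachePath);

// Serve files from the cache directory under the URL prefix that
// ImageFileCacheBaseUrl points at ("/article-images" in the example config).
app.UseStaticFiles(new StaticFileOptions
{
    FileProvider = new PhysicalFileProvider(cachePath),
    RequestPath = "/article-images"
});

app.Run();
```

Keeping `RequestPath` identical to `ImageFileCacheBaseUrl` matters because the extractor rewrites image URLs to that base URL.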
When UseRedis = true, the library uses the configured distributed cache (typically backed by Redis) to store extraction results and/or intermediate data, depending on your configuration. This reduces repeated extraction work for the same URLs across multiple instances.
Register an IDistributedCache implementation (for example Microsoft.Extensions.Caching.StackExchangeRedis) in your application and make sure the ArticleExtraction configuration section enables UseRedis.

Example configuration snippet enabling Redis + image cache:
{
  "ArticleExtraction": {
    "Parser": "HtmlAgilityPack",
    "UseRedis": true,
    "EnableImageFileCache": true,
    "ImageFileCachePath": "wwwroot/article-images",
    "ImageFileCacheBaseUrl": "/article-images",
    "IgnoreImageDownloadErrors": true
  }
}
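The configuration above only switches the feature on; the application still has to register an `IDistributedCache`. The following is a minimal sketch using the Microsoft.Extensions.Caching.StackExchangeRedis package. The connection string name `Redis`, the `localhost:6379` fallback, and the `readableweb:` key prefix are illustrative assumptions, not values required by ReadableWeb.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Register a Redis-backed IDistributedCache.
builder.Services.AddStackExchangeRedisCache(options =>
{
    // Assumed connection string name; falls back to a local Redis instance.
    options.Configuration =
        builder.Configuration.GetConnectionString("Redis") ?? "localhost:6379";

    // Optional key prefix so cached entries are easy to tell apart (assumed value).
    options.InstanceName = "readableweb:";
});

// AddArticleExtraction(...) is registered as shown earlier in this README.

var app = builder.Build();
app.Run();
```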
Serve the files in ImageFileCachePath at the ImageFileCacheBaseUrl you configured.
Consult your IDistributedCache implementation for storage semantics.
When IgnoreImageDownloadErrors is enabled, the extraction still returns article text and metadata even if some images fail to download. Disable this setting during debugging to surface download issues.

Expose the extractor through a minimal API endpoint:
using Microsoft.AspNetCore.Mvc; // [FromServices] and [FromQuery]
using ReadableWeb.Extraction;

var app = builder.Build();

app.MapGet("/api/article", async (
    [FromServices] IHttpArticleExtractor extractor,
    [FromQuery] string url,
    CancellationToken token) =>
{
    // Reject requests that do not carry a url query parameter.
    if (string.IsNullOrWhiteSpace(url))
    {
        return Results.BadRequest("Url query parameter is required.");
    }

    var article = await extractor.ExtractFromUrlAsync(url, cancellationToken: token);
    return Results.Ok(article);
})
.Produces<ArticleContent>()
.WithName("GetArticle");

app.Run();
Or wire it up in a conventional API controller:
using ReadableWeb.Extraction;
using Microsoft.AspNetCore.Mvc;

[ApiController]
[Route("api/[controller]")]
public class ArticleController : ControllerBase
{
    private readonly IHttpArticleExtractor _extractor;

    public ArticleController(IHttpArticleExtractor extractor)
    {
        _extractor = extractor;
    }

    [HttpGet]
    public async Task<IActionResult> Get([FromQuery] string url, CancellationToken token)
    {
        if (string.IsNullOrWhiteSpace(url))
        {
            return BadRequest("Url query parameter is required.");
        }

        var article = await _extractor.ExtractFromUrlAsync(url, cancellationToken: token);
        return Ok(article);
    }
}
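Both endpoints above only reject an empty url value. Since they fetch whatever URL the caller supplies, it may be worth restricting input to absolute http/https URLs before calling the extractor. The following is a minimal sketch; `ArticleUrlValidator` and `TryParseHttpUrl` are hypothetical names and are not part of ReadableWeb.

```csharp
using System;

// Hypothetical helper, not part of ReadableWeb.
public static class ArticleUrlValidator
{
    // Returns true only for absolute URLs that use the http or https scheme.
    public static bool TryParseHttpUrl(string? input, out Uri? uri)
    {
        uri = null;

        if (string.IsNullOrWhiteSpace(input))
        {
            return false;
        }

        // Relative references and malformed strings are rejected here.
        if (!Uri.TryCreate(input.Trim(), UriKind.Absolute, out var parsed))
        {
            return false;
        }

        // Filter out file:, ftp: and other non-web schemes.
        if (parsed.Scheme != Uri.UriSchemeHttp && parsed.Scheme != Uri.UriSchemeHttps)
        {
            return false;
        }

        uri = parsed;
        return true;
    }
}
```

In either endpoint, the `string.IsNullOrWhiteSpace(url)` check could then be replaced by `!ArticleUrlValidator.TryParseHttpUrl(url, out _)`. This does not block requests to internal hosts; that would need a separate allow-list or network-level control.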
Example configuration section consumed by AddArticleExtraction:
{
  "ArticleExtraction": {
    "Parser": "HtmlAgilityPack",
    "UseRedis": false,
    "EnableImageFileCache": true,
    "ImageFileCachePath": "wwwroot/article-images",
    "ImageFileCacheBaseUrl": "/article-images",
    "IgnoreImageDownloadErrors": true
  }
}
Restore and build all projects:
dotnet restore
dotnet build --configuration Release
Run unit tests from solution root:
dotnet test
Benchmarks use BenchmarkDotNet. Run them from the solution root (the short -p form of --project is deprecated in recent .NET SDKs):
dotnet run -c Release --project ReadableWeb.Benchmarks
This repository includes a GitHub Actions workflow to pack and publish NuGet packages: .github/workflows/publish-nuget.yml.
The workflow builds and packs with a version based on commit count and pushes packages to NuGet when the NUGET_API_KEY secret is provided.
Dependabot configuration is provided in .github/dependabot.yml to open weekly PRs for NuGet package updates.
No license file is included in the repository. Add a LICENSE file if you intend to open-source this code.
For local development questions, run the sample console app ReadableWeb.TestConsole or inspect tests in ReadableWeb.Tests.