HttpClient static page loader for the ScrapeAAS library.
$ dotnet add package ScrapeAAS.HttpClient
ScrapeAAS integrates existing packages and ASP.NET features into a toolstack that lets you, the developer, design your scraping service in a familiar environment.
Add ASP.NET Hosting, ScrapeAAS, a validator of your choice (here Dawn.Guard, RIP), an object mapper of your choice (here AutoMapper), and the database or message queue you feel most comfortable with (here EF Core with SQLite).
dotnet add package Microsoft.Extensions.Hosting
dotnet add package ScrapeAAS
dotnet add package Dawn.Guard
dotnet add package AutoMapper.Extensions.Microsoft.DependencyInjection
Full example of scraping the r/dotnet subreddit.
Create a crawler, a service that periodically triggers scraping.
var builder = Host.CreateApplicationBuilder(args);
builder.Services
    .AddAutoMapper()
    .AddScrapeAAS()
    .AddHostedService<RedditSubredditCrawler>()
    .AddDataflow<RedditPostSpider>()
    .AddDataflow<RedditSqliteSink>();

// Build the host and run it until shutdown is requested.
builder.Build().Run();
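The snippets in this example pass RedditSubreddit and RedditPost messages between the services, but the README does not show the types themselves. The following is a minimal sketch of what they might look like, inferred only from the constructor calls in this example. The property names, the RedditUser record, and the use of System.Uri in place of the Url type are assumptions for illustration, not the library's or the sample's actual definitions.

```csharp
using System;

// Assumed message shapes, inferred from how the example constructs them.
// System.Uri stands in for the Url type used in the example code.
sealed record RedditSubreddit(string Name, Uri Url);

// Assumed: the sink looks users up by an Id, so the user carries one.
sealed record RedditUser(string Id);

sealed record RedditPost(
    Uri PostUrl,
    string Title,
    long Upvotes,
    long Comments,
    Uri CommentsUrl,
    DateTimeOffset PostedAt,
    RedditUser PostedBy);
```

Records give value equality for free, which is convenient when the same message passes through several handlers.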
sealed class RedditSubredditCrawler : BackgroundService {
    private readonly IAngleSharpBrowserPageLoader _browserPageLoader;
    private readonly IDataflowPublisher<RedditPost> _publisher;
    ...
    protected override async Task ExecuteAsync(CancellationToken stoppingToken) {
        // ... periodically execute CrawlAsync within a service scope
    }
    private async Task CrawlAsync(IDataflowPublisher<RedditSubreddit> publisher, CancellationToken stoppingToken)
    {
        _logger.LogInformation("Crawling /r/dotnet");
        await publisher.PublishAsync(new("dotnet", new("https://old.reddit.com/r/dotnet")), stoppingToken);
        _logger.LogInformation("Crawling complete");
    }
}
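The body of ExecuteAsync is elided above. One way to run the crawl periodically is a small loop around PeriodicTimer (available since .NET 6). The sketch below is an assumption about how this could be done, not the sample's actual implementation; PeriodicRunner and its RunAsync method are hypothetical names introduced for illustration.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of ScrapeAAS.
static class PeriodicRunner
{
    // Runs `work` once immediately, then once per `period`, until cancellation is requested.
    public static async Task RunAsync(
        TimeSpan period,
        Func<CancellationToken, Task> work,
        CancellationToken stoppingToken)
    {
        using var timer = new PeriodicTimer(period);
        try
        {
            do
            {
                await work(stoppingToken);
            }
            while (!stoppingToken.IsCancellationRequested
                && await timer.WaitForNextTickAsync(stoppingToken));
        }
        catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested)
        {
            // Cancellation while waiting or working is the normal shutdown path.
        }
    }
}
```

Inside ExecuteAsync, the callback could create a fresh scope per run (for example through an injected IServiceScopeFactory and its CreateAsyncScope method), resolve the IDataflowPublisher<RedditSubreddit> from that scope, and pass it to CrawlAsync. Whether the original sample does exactly this is not shown.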
Implement your spiders, services that collect and normalize data.
sealed class RedditPostSpider : IDataflowHandler<RedditSubreddit> {
    private readonly IAngleSharpBrowserPageLoader _browserPageLoader;
    // The spider publishes RedditPost items, so the publisher is typed accordingly.
    private readonly IDataflowPublisher<RedditPost> _publisher;
    ...
    private async Task ParseRedditTopLevelPosts(RedditSubreddit subreddit, CancellationToken stoppingToken)
    {
        Url root = new("https://old.reddit.com/");
        _logger.LogInformation("Parsing top level posts from {RedditSubreddit}", subreddit);
        var document = await _browserPageLoader.LoadAsync(subreddit.Url, stoppingToken);
        _logger.LogInformation("Request complete");
        var queriedContent = document
            .QuerySelectorAll("div.thing")
            .AsParallel()
            // Collect the raw strings first; every value may be null if the markup changes.
            .Select(div => new
            {
                PostUrl = div.QuerySelector("a.title")?.GetAttribute("href"),
                Title = div.QuerySelector("a.title")?.TextContent,
                Upvotes = div.QuerySelector("div.score.unvoted")?.GetAttribute("title"),
                Comments = div.QuerySelector("a.comments")?.TextContent,
                CommentsUrl = div.QuerySelector("a.comments")?.GetAttribute("href"),
                PostedAt = div.QuerySelector("time")?.GetAttribute("datetime"),
                PostedBy = div.QuerySelector("a.author")?.TextContent,
            })
            // Validate and normalize; items that fail to parse are logged and skipped.
            .Select(queried => new RedditPost(
                new(root, Guard.Argument(queried.PostUrl).NotEmpty()),
                Guard.Argument(queried.Title).NotEmpty(),
                long.Parse(queried.Upvotes.AsSpan()),
                Regex.Match(queried.Comments ?? "", "^\\d+") is { Success: true } commentCount ? long.Parse(commentCount.Value) : 0,
                new(queried.CommentsUrl),
                DateTimeOffset.Parse(queried.PostedAt.AsSpan()),
                new(Guard.Argument(queried.PostedBy).NotEmpty())
            ), IExceptionHandler.Handle((ex, item) => _logger.LogInformation(ex, "Failed to parse {RedditTopLevelPostBrief}", item)));
        foreach (var item in queriedContent)
        {
            await _publisher.PublishAsync(item, stoppingToken);
        }
        _logger.LogInformation("Parsing complete");
    }
}
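The comment count in the spider above is extracted with an inline regular expression: the link text is expected to start with a number (as in "12 comments"), and anything else counts as zero. Pulled out into a standalone function, that step could look like the sketch below. CommentCountParser is a hypothetical name used only for illustration; the logic mirrors the inline expression.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, extracted from the inline expression in the spider.
static class CommentCountParser
{
    // Returns the leading integer of the link text, or 0 when there is none
    // (for example when the text is missing or just reads "comment").
    public static long Parse(string? linkText) =>
        Regex.Match(linkText ?? "", @"^\d+") is { Success: true } match
            ? long.Parse(match.Value, CultureInfo.InvariantCulture)
            : 0;
}
```

Keeping this step in its own function makes the fallback to zero easy to test separately from the page loading.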
Add a sink, a service that commits the scraped data to disk or the network.
sealed class RedditSqliteSink : IAsyncDisposable, IDataflowHandler<RedditSubreddit>, IDataflowHandler<RedditPost>
{
    private readonly RedditPostSqliteContext _context;
    private readonly IMapper _mapper;
    ...
    public async ValueTask DisposeAsync()
    {
        // Pending inserts are committed when the sink is disposed.
        await _context.Database.EnsureCreatedAsync();
        await _context.SaveChangesAsync();
    }
    public async ValueTask HandleAsync(RedditSubreddit message, CancellationToken cancellationToken = default)
    {
        var messageDto = _mapper.Map<RedditSubredditDto>(message);
        await _context.Database.EnsureCreatedAsync(cancellationToken);
        await _context.Subreddits.AddAsync(messageDto, cancellationToken);
    }
    public async ValueTask HandleAsync(RedditPost message, CancellationToken cancellationToken = default)
    {
        var messageDto = _mapper.Map<RedditPostDto>(message);
        // Make sure the database exists before querying it for a known user.
        await _context.Database.EnsureCreatedAsync(cancellationToken);
        if (await _context.Users.FindAsync(new object[] { message.PostedBy.Id }, cancellationToken) is { } existingUser)
        {
            messageDto.PostedById = existingUser.Id;
            messageDto.PostedBy = existingUser;
        }
        await _context.Posts.AddAsync(messageDto, cancellationToken);
    }
}
I have tried both WebReaper and DotnetSpider, and found them wanting. So I tried to do better by delegating as much work as reasonable to existing projects.
Beyond my own goals, having evaluated both libraries I want to keep all their pros and discard all their cons. The verbosity of this library sits comfortably between WebReaper and DotnetSpider, but closer to the DotnetSpider end of things.
The overall data flow in ScrapeAAS is adopted from DotnetSpider: Crawler --> Spider --> Sink.
Dynamic-riddled design when storing to a database.
The Puppeteer browser handling is a mixture of the lifetime-tracking HTTP handler and the WebReaper Puppeteer integration.
Redis, MySql, and RabbitMq are always included in the package.