Scrape as a service

ScrapeAAS integrates existing packages and ASP.NET features into a tool stack that lets you, the developer, design your scraping service in a familiar environment.

Quickstart

Add ASP.NET Hosting, ScrapeAAS, a validator of your choice (here Dawn.Guard, RIP), an object mapper of your choice (here AutoMapper), and the database or message queue you feel most comfortable with (here EF Core with SQLite).

dotnet add package Microsoft.Extensions.Hosting
dotnet add package ScrapeAAS
dotnet add package Dawn.Guard
dotnet add package AutoMapper.Extensions.Microsoft.DependencyInjection
dotnet add package Microsoft.EntityFrameworkCore.Sqlite

Full example of scraping the r/dotnet subreddit.

Create a crawler, a service that periodically triggers scraping.

var builder = Host.CreateApplicationBuilder(args);
builder.Services
  .AddAutoMapper()
  .AddScrapeAAS()
  .AddHostedService<RedditSubredditCrawler>()
  .AddDataflow<RedditPostSpider>()
  .AddDataflow<RedditSqliteSink>();

// Build the host and run it until shutdown is requested.
builder.Build().Run();

sealed class RedditSubredditCrawler : BackgroundService {
  private readonly IAngleSharpBrowserPageLoader _browserPageLoader;
  private readonly IDataflowPublisher<RedditPost> _publisher;
  ...
  protected override async Task ExecuteAsync(CancellationToken stoppingToken) {
    // ... execute service scope periodically
  }

  private async Task CrawlAsync(IDataflowPublisher<RedditSubreddit> publisher, CancellationToken stoppingToken)
  {
    _logger.LogInformation("Crawling /r/dotnet");
    await publisher.PublishAsync(new("dotnet", new("https://old.reddit.com/r/dotnet")), stoppingToken);
    _logger.LogInformation("Crawling complete");
  }
}
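The body of ExecuteAsync is elided above. As a rough sketch of what "periodically" could look like, the helper below runs a callback once immediately and then once per interval until the host requests shutdown. It uses only the .NET 6+ PeriodicTimer type; the names PeriodicRunner and RunPeriodicallyAsync and the choice of interval are assumptions made for illustration and are not part of ScrapeAAS.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class PeriodicRunner
{
    // Hypothetical helper (not a ScrapeAAS API): invokes `work` once right away,
    // then once per `interval`, until `stoppingToken` is cancelled.
    public static async Task RunPeriodicallyAsync(
        Func<CancellationToken, Task> work,
        TimeSpan interval,
        CancellationToken stoppingToken)
    {
        using var timer = new PeriodicTimer(interval);
        try
        {
            do
            {
                await work(stoppingToken);
            }
            // WaitForNextTickAsync throws OperationCanceledException when the token is cancelled.
            while (await timer.WaitForNextTickAsync(stoppingToken));
        }
        catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested)
        {
            // Normal shutdown: cancellation was requested, so exit quietly.
        }
    }
}
```

Inside ExecuteAsync, such a helper might be called with a lambda that creates a service scope, resolves an IDataflowPublisher<RedditSubreddit> from it, and passes it to CrawlAsync. That wiring is an assumption based on the CrawlAsync signature, not something the README confirms.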

Implement your spiders, services that collect and normalize data.


sealed class RedditPostSpider : IDataflowHandler<RedditSubreddit> {
  private readonly IAngleSharpBrowserPageLoader _browserPageLoader;
  private readonly IDataflowPublisher<RedditPost> _publisher;
  ...

  private async Task ParseRedditTopLevelPosts(RedditSubreddit subreddit, CancellationToken stoppingToken)
  {
    Url root = new("https://old.reddit.com/");
    _logger.LogInformation("Parsing top level posts from {RedditSubreddit}", subreddit);
    var document = await _browserPageLoader.LoadAsync(subreddit.Url, stoppingToken);
    _logger.LogInformation("Request complete");
    var queriedContent = document
      .QuerySelectorAll("div.thing")
      .AsParallel()
      .Select(div => new
      {
        PostUrl = div.QuerySelector("a.title")?.GetAttribute("href"),
        Title = div.QuerySelector("a.title")?.TextContent,
        Upvotes = div.QuerySelector("div.score.unvoted")?.GetAttribute("title"),
        Comments = div.QuerySelector("a.comments")?.TextContent,
        CommentsUrl = div.QuerySelector("a.comments")?.GetAttribute("href"),
        PostedAt = div.QuerySelector("time")?.GetAttribute("datetime"),
        PostedBy = div.QuerySelector("a.author")?.TextContent,
      })
      .Select(queried => new RedditPost(
        new(root, Guard.Argument(queried.PostUrl).NotEmpty()),
        Guard.Argument(queried.Title).NotEmpty(),
        long.Parse(queried.Upvotes.AsSpan()),
        Regex.Match(queried.Comments ?? "", "^\\d+") is { Success: true } commentCount ? long.Parse(commentCount.Value) : 0,
        new(queried.CommentsUrl),
        DateTimeOffset.Parse(queried.PostedAt.AsSpan()),
        new(Guard.Argument(queried.PostedBy).NotEmpty())
      ), IExceptionHandler.Handle((ex, item) => _logger.LogInformation(ex, "Failed to parse {RedditTopLevelPostBrief}", item)));
    foreach (var item in queriedContent)
    {
      await _publisher.PublishAsync(item, stoppingToken);
    }
    _logger.LogInformation("Parsing complete");
  }
}
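The comment-count expression in the spider is dense, so here it is pulled out as a standalone function for illustration. This is only a sketch: the names RedditFieldParsing and ParseCommentCount are hypothetical, and the sample link texts are assumed shapes of what old.reddit.com renders, not captured data.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

static class RedditFieldParsing
{
    // Hypothetical helper mirroring the inline expression in the spider:
    // returns the leading digits of the link text as a number, or 0 when
    // the text is null or does not start with a digit.
    public static long ParseCommentCount(string? linkText)
    {
        var match = Regex.Match(linkText ?? "", @"^\d+");
        return match.Success
            ? long.Parse(match.Value, CultureInfo.InvariantCulture)
            : 0;
    }
}
```

With this helper, ParseCommentCount("12 comments") yields 12, while ParseCommentCount("comment") and ParseCommentCount(null) both yield 0, so a post without comments is normalized to zero instead of failing the whole item.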

ProphetLamb/ScrapeAAS.HttpClient v1.1.4
