James Beswick wrote this piece on using serverless to scale an old concept for the modern era. It describes a client project that required crawling a large media site to generate a list of URLs and site assets.
For Node users, there’s a package called Website Scraper that does this elegantly. It is mostly configuration-driven and offers plenty of options for controlling what gets crawled and downloaded, plus a lot of other features.
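As a rough illustration of that configuration-driven style, here is a sketch of an options object for the package. The option names match Website Scraper's documented API, but the target URL, output directory, and depth are placeholder assumptions.

```javascript
// Hypothetical configuration for the website-scraper package.
// The URL, directory, and depth values are placeholders.
const options = {
  urls: ['https://example.com/'], // start page(s) to download
  directory: './site-mirror',     // output directory; must not already exist
  recursive: true,                // follow links found in downloaded pages
  maxRecursiveDepth: 2,           // how many levels of links to follow
};

// With the package installed (npm install website-scraper), the crawl
// would be kicked off like this:
// const scrape = require('website-scraper');
// scrape(options).then((result) => console.log(`${result.length} page(s) saved`));

console.log(Object.keys(options));
```

Everything beyond the options object, such as asset filtering or request throttling, is handled through further configuration rather than code.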
The article is split into:
- Let’s crawl before we run
- Serverless Crawler — Version 1.0
- Serverless Web Crawler 2.0 — now slower!
- Show me the code!
- Serverless Web Crawler 3.0
There are many reasons to crawl a website – and crawling is different from scraping. The article describes the evolution of an AWS Lambda-based solution, with detailed experience and lessons learned along the way. Great!
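At the heart of any crawler (as opposed to a scraper, which extracts data from pages it already has) is URL discovery: pulling links out of each fetched page so they can be queued for the next pass. A minimal sketch of that step, using a naive regex where a real crawler would use a proper HTML parser, and restricting results to the same origin:

```javascript
// Extract same-origin links from a page's HTML so a crawler can queue them.
// Note: a regex is a simplification; production crawlers parse the HTML.
function extractLinks(html, baseUrl) {
  const links = new Set();
  const hrefPattern = /href="([^"#]+)"/g;
  const origin = new URL(baseUrl).origin;
  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    try {
      const url = new URL(match[1], baseUrl); // resolves relative links
      if (url.origin === origin) {
        links.add(url.href);
      }
    } catch {
      // ignore malformed URLs
    }
  }
  return [...links];
}

const page = '<a href="/about">About</a> <a href="https://other.example/x">Ext</a>';
console.log(extractLinks(page, 'https://example.com/'));
// → [ 'https://example.com/about' ]
```

The article's serverless versions wrap this kind of discovery loop in Lambda functions; the interesting part is how the queueing and fan-out evolve between versions.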
[Read More]