This blog post compares Python web scraping packages in terms of speed and ease of use, along with the author's own investigations. It won't cover what web scraping is or how parsers work. By Dmitriy Zub.
The article's recommendations:

- If you need to scrape data from a dynamic page that doesn't require clicking, scrolling, or similar interactions, but still requires rendering JavaScript, try requests-html. It uses pure XPath, like lxml, and should be faster than the two browser-automation packages below.
- If you need to do complex page manipulation on a dynamic page, use playwright or selenium.
- If you're scraping non-dynamic pages (not rendered via JavaScript), try selectolax over bs4, lxml, or parsel. It's a lot faster, uses less memory, and has almost identical syntax to parsel or bs4. A hidden gem, I would say.
- If you need to use XPath in your parser, use either lxml or parsel. parsel is built on top of lxml, translates every CSS query to XPath, and can combine (chain) CSS and XPath queries. However, lxml is faster.
Excellent read with charts and code to complement the comparison of each package!
[Read More]