Mark Litwintschik put together this article to help server admins separate bot traffic from human-generated traffic in web server logs, which can be challenging. In this blog post he walks through the steps he took to build a bot detection script based on IPv4 ownership and browser strings.
His solution combines a free IPv4 database of country and city registrations with a Python library for fast lookups. He then grepped the logs to find IPs that had requested “robots.txt”. From that list he spot-checked some of the more frequently appearing IP addresses and found a number of hosting and cloud providers listed as the owners of these IPs. In addition to Google, six firms came up a lot: Amazon, Baidu, Digital Ocean, Hetzner, Linode and New Dream Network.
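The "robots.txt" step can be illustrated with a minimal Python sketch. It is not the article's code: the regular expression assumes the common/combined access log format, and the function name `robots_txt_requesters` and the sample lines (using documentation-reserved IP addresses) are made up for illustration.

```python
import re
from collections import Counter

# Assumption: entries follow the common/combined log format, i.e. they start
# with the client IPv4 address and contain a quoted request line.
LOG_LINE = re.compile(
    r'^(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) .*?"(?:GET|HEAD) (?P<path>\S+)'
)


def robots_txt_requesters(lines):
    """Count how often each IPv4 address requested /robots.txt."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        # Ignore any query string when comparing the requested path.
        if match and match.group("path").split("?")[0] == "/robots.txt":
            counts[match.group("ip")] += 1
    return counts


# Hypothetical log lines; the addresses come from documentation-only ranges.
sample = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /robots.txt HTTP/1.1" 200 68 "-" "ExampleBot/1.0"',
    '203.0.113.7 - - [10/Oct/2023:14:01:02 +0000] "GET /robots.txt HTTP/1.1" 200 68 "-" "ExampleBot/1.0"',
    '198.51.100.23 - - [10/Oct/2023:14:02:10 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '192.0.2.99 - - [10/Oct/2023:14:05:45 +0000] "HEAD /robots.txt HTTP/1.1" 200 0 "-" "OtherCrawler/2.0"',
]

counts = robots_txt_requesters(sample)
# Most frequent requesters first: the candidates worth spot-checking.
for ip, hits in counts.most_common():
    print(ip, hits)
```

Sorting by frequency mirrors the spot-checking described above: the addresses at the top of the list are the ones to look up in the IP ownership database first.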
The final Python code for the monitoring script is provided, together with instructions on how to programmatically filter bots from both Apache and Nginx logs. Good read and very useful code.
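As a rough sketch of how an ownership check and a browser string check might be combined, the snippet below flags a request if its IP falls inside a known hosting network or its user agent contains a bot-like token. Everything here is an assumption for illustration: the network ranges are documentation-only placeholders rather than real provider allocations, and `HOSTING_NETWORKS`, `BOT_UA_TOKENS` and `is_probable_bot` are hypothetical names, not taken from the article.

```python
import ipaddress

# Placeholder ranges (reserved for documentation), standing in for networks
# that an IP ownership database attributes to hosting or cloud providers.
HOSTING_NETWORKS = [
    ipaddress.ip_network(cidr) for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]

# Assumed substrings that commonly appear in crawler user agent strings.
BOT_UA_TOKENS = ("bot", "crawler", "spider")


def is_probable_bot(ip, user_agent):
    """Return True if the IP owner or the user agent suggests a bot."""
    address = ipaddress.ip_address(ip)
    if any(address in network for network in HOSTING_NETWORKS):
        return True
    agent = user_agent.lower()
    return any(token in agent for token in BOT_UA_TOKENS)


print(is_probable_bot("192.0.2.44", "Mozilla/5.0"))
print(is_probable_bot("203.0.113.9", "Googlebot/2.1"))
print(is_probable_bot("203.0.113.9", "Mozilla/5.0 (X11; Linux x86_64)"))
```

A linear scan over networks is fine for a handful of ranges; with a full ownership database, a purpose-built lookup structure like the one the article relies on would be the better fit.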