Identify crawlers in real time – PHP – keep recent access-log data in memory
Legitimate spiders are great. It’s part of the web and I’m glad they visit my site.
Spiders crawling my site without authorization are bad and I want to get rid of them.
I have a PHP app that monitors my website's access logs. Each time a visitor with a suspicious UserAgent hits the site, the system checks the access log for entries from the same IP address and makes a judgment about their behavior. If it's not human, and I haven't authorized it, it gets recorded and I may (or may not) take steps such as blocking.
The way it works now, the access-log check runs every time a page is loaded. I only check suspicious UserAgents, so the number of checks stays minimal.
What I want to do is check every visit to the site (i.e. scan the last 50 lines of the access log for entries from that visitor's IP). But that would mean every child process spawned by my web server has to open the access log file, which sounds like a resource and I/O-blocking nightmare.
Is there a way to mirror the access.log file into some kind of central memory that all web processes can read simultaneously (or at least very quickly)? Maybe load it into Memcached or similar. But how do I do this in real time, so that the last 500 lines of access.log are always held in memory as a rolling buffer (old lines dropped as new ones arrive, instead of the buffer growing without bound)?
So to put it simply: is there a PHP, Linux, or "other" way to buffer an ever-growing file (i.e. an nginx log file) in memory so that other processes can read the information concurrently, or at least much faster than hitting the disk on every request?
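One low-tech way to get this is a rolling copy of the log tail on a tmpfs (RAM-backed) mount, refreshed by a tiny script run from a loop or cron; every PHP worker then reads a small in-memory file instead of the full log. A sketch, assuming a standard Linux setup where `/dev/shm` is a tmpfs mount (the function name and paths are my own, for illustration):

```shell
# refresh_tail LOGFILE BUFFER [N] - copy the last N (default 500) lines of
# a log file into a RAM-backed buffer, e.g. on /dev/shm (a tmpfs mount on
# most Linux systems), so readers never touch the disk.
# The temp-file + mv dance makes each update atomic: PHP workers reading
# the buffer never observe a half-written file.
refresh_tail() {
    log=$1 buf=$2 n=${3:-500}
    tail -n "$n" "$log" > "$buf.tmp" && mv "$buf.tmp" "$buf"
}

# Example (paths are assumptions, adjust to your setup), run once a second:
# refresh_tail /var/log/nginx/access.log /dev/shm/access_tail.log
```

Your PHP workers then just `file('/dev/shm/access_tail.log')` and grep for the visitor's IP, which is cheap because the file is small and lives in memory.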
It is important to know that a well-written crawler can always mimic the behavior of a browser, unless you do something so drastic that it also hurts the experience of legitimate visitors.
That said, there are measures that cope even with sophisticated crawlers:
1. Referrer URL and UA string heuristics. These are easy to fake, and some legitimate users have unusual or missing ones, so expect some false positives and negatives - though not many.
2. Request-rate limiting. Web servers like Apache or nginx have core or add-on functionality to limit the request rate for certain requests. For example, you can allow one *.html page every two seconds while leaving assets such as JS/CSS unthrottled. (Keep in mind that you should also announce the delay to legitimate bots via robots.txt.)
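In nginx, for instance, the stock ngx_http_limit_req_module can express roughly that policy. A sketch (the zone name and rates here are made up for illustration):

```nginx
# In the http {} block: track clients by IP; 30r/m is one request
# every two seconds, stored in a 10 MB shared-memory zone.
limit_req_zone $binary_remote_addr zone=html_zone:10m rate=30r/m;

server {
    location ~ \.html$ {
        # Small burst so real users clicking quickly aren't penalized.
        limit_req zone=html_zone burst=5 nodelay;
    }
    # Static assets (JS/CSS) are served without any rate limit.
    location ~ \.(js|css)$ { }
}
```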
Fail2ban does something close to what you want: it scans log files for malicious requests and blocks the offending IPs. It's very effective against malware bots and can be configured to handle crawlers too (at least the less clever ones).
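As a sketch of what a crawler-oriented fail2ban setup might look like (the filter name, UA markers, and thresholds below are illustrative assumptions, not a recommended list):

```ini
# /etc/fail2ban/filter.d/badbots.local  (hypothetical filter name)
[Definition]
# Ban hosts whose User-Agent field contains a marker you consider abusive.
failregex = ^<HOST> .*"[^"]*(?:EvilBot|scrapy|python-requests)[^"]*"$

# /etc/fail2ban/jail.local
[badbots]
enabled   = true
filter    = badbots
logpath   = /var/log/nginx/access.log
maxretry  = 10      ; matches within findtime before banning
findtime  = 60      ; seconds
bantime   = 3600    ; ban for one hour
banaction = iptables-allports
```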
Those points address your question directly; here are further measures worth considering:
3. Modify the content
This one is actually fun: every now and then we make small (automated) modifications to our HTML pages and JSON feeds, which forces the crawlers to adjust their parsers. Amusingly, we then see stale data on their sites for a few days until they catch up - and then we change it again.
4. Restrictions: CAPTCHA and sign-in
In addition to web-server-level throttling, we count the number of requests per hour per IP address. Above a certain number (one that should be plenty for a legitimate user), every further request has to solve a CAPTCHA first.
Other APIs require authentication, so crawlers never get into those areas at all.
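The counting itself can be as simple as aggregating the access log per IP. A sketch in awk (the function name and threshold are mine; it assumes the client IP is the first log field, as in the common/combined log formats):

```shell
# over_limit LOGFILE LIMIT - print every IP (first field of each log line,
# as in the common log format) seen more than LIMIT times, i.e. candidates
# for a CAPTCHA challenge or a ban. Tune the limit per site.
over_limit() {
    awk -v limit="$2" '
        { count[$1]++ }
        END { for (ip in count) if (count[ip] > limit) print ip, count[ip] }
    ' "$1"
}

# Example: over_limit /var/log/nginx/access.log 1000
```

Run it against an hourly log slice (e.g. the in-memory tail discussed above) and feed the output to whatever challenges or blocks the client.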
5. Abuse Notices
If an IP address or subnet hits you regularly, you can run a WHOIS query on it and contact the hosting provider the bot runs on. Providers usually publish an abuse contact, and those contacts are often very keen to hear about policy violations, because the last thing they want is to end up on a blacklist (and if they don't cooperate, that's where we report them).
Also, if you see ads on the crawler's website, you can inform the ad network that it is being used on stolen material.
6. IP bans
Obviously, you can block individual IP addresses. We go further and block entire data centers, such as AWS, Azure, etc. Lists of the IP ranges for all of these providers are available online.
Of course, if partner services legitimately access your website from a data center, you have to whitelist them.
By the way, we do not do this in the web server but at the firewall level (iptables).
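With iptables plus ipset this stays efficient even for thousands of ranges. A sketch (the set name and CIDRs are illustrative only; real cloud provider ranges are published by the providers themselves, e.g. AWS's ip-ranges.json):

```shell
# Block whole hosting-provider ranges with ipset + iptables (requires root).
ipset create datacenters hash:net
ipset add datacenters 203.0.113.0/24     # example range only
ipset add datacenters 198.51.100.0/24    # example range only

# Whitelist a partner that legitimately crawls from a data center
# (hypothetical IP), then drop everything else in the set.
iptables -I INPUT -m set --match-set datacenters src -j DROP
iptables -I INPUT -s 203.0.113.10 -j ACCEPT
```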
7. Legal Measures
In the end, fighting crawlers is tilting at windmills and can consume a lot of effort. You won't stop all of them, so focus on the ones that actually hurt you - e.g. those that waste your resources or siphon off revenue that should be yours.