Personally, I think there are a few options:
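1. iptables can block this kind of user.
iptables can track how many requests a single IP address makes within a given time window and block the address outright once it makes too many. That said, iptables is fairly complicated to operate and a mistake can be dangerous, so leave it alone unless you really know what you are doing.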
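As a rough sketch of the rate limiting described above, the iptables `recent` match can be used along these lines. The port (80), the 60-second window, the threshold of 20, and the list name `http_flood` are all assumptions chosen for illustration, not values from the original answer. Try anything like this on a test machine first.

```shell
# Illustrative only: port, window, threshold, and list name are assumptions.
# Record the source address of every new TCP connection to port 80.
iptables -A INPUT -p tcp --dport 80 -m conntrack --ctstate NEW \
  -m recent --name http_flood --set

# Drop new connections from any address seen 20 or more times
# within the last 60 seconds.
iptables -A INPUT -p tcp --dport 80 -m conntrack --ctstate NEW \
  -m recent --name http_flood --update --seconds 60 --hitcount 20 -j DROP
```

Note that this counts new connections rather than individual HTTP requests, so a client that reuses keep-alive connections is counted less often. Because `--update` refreshes the timestamp on every match, an address stays blocked for as long as it keeps retrying.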
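2. Front-end proxy: Varnish can reject abnormal requests directly. For the VCL, see the code below (it uses the Varnish 3 `error` statement; Varnish 4 and later use `return (synth(...))` instead):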
if ( req.http.user-agent ~ "^$"
|| req.http.user-agent ~ "^Java"
|| req.http.user-agent ~ "^Jakarta"
|| req.http.user-agent ~ "^Ruby"
|| req.http.user-agent ~ "IDBot"
|| req.http.user-agent ~ "Wget"
|| req.http.user-agent ~ "id-search"
|| req.http.user-agent ~ "User-Agent"
|| req.http.user-agent ~ "ConveraCrawler"
|| req.http.user-agent ~ "^Mozilla$"
|| req.http.user-agent ~ "libwww"
|| req.http.user-agent ~ "lwp-trivial"
|| req.http.user-agent ~ "curl"
|| req.http.user-agent ~ "PHP/"
|| req.http.user-agent ~ "urllib"
|| req.http.user-agent ~ "GT:WWW"
|| req.http.user-agent ~ "Snoopy"
|| req.http.user-agent ~ "MFC_Tear_Sample"
|| req.http.user-agent ~ "HTTP::Lite"
|| req.http.user-agent ~ "PHPCrawl"
|| req.http.user-agent ~ "URI::Fetch"
|| req.http.user-agent ~ "Zend_Http_Client"
|| req.http.user-agent ~ "http client"
|| req.http.user-agent ~ "PECL::HTTP"
|| req.http.user-agent ~ "Fetch API Request"
|| req.http.user-agent ~ "PleaseCrawl"
|| req.http.user-agent ~ "TurnitinBot"
|| req.http.user-agent ~ "python-requests"
|| req.http.user-agent ~ "Python-urllib"
|| req.http.user-agent ~ "CorporateNewsSearchEngine"
|| req.http.user-agent ~ "libwww-perl"
|| req.http.user-agent ~ "rogerbot"
|| req.http.user-agent ~ "Microsoft URL Control"
|| req.http.user-agent == "-"
|| req.http.user-agent == "MSIE 6.0"
|| req.http.user-agent == "Mozilla/4.0 (compatible; Advanced Email Extractor v2.xx)"
|| req.http.user-agent == "Mozilla/4.0 (compatible; Iplexx Spider/1.0 http://www.iplexx.at)"
|| req.http.user-agent == "Mozilla/5.0 (Version: xxxx Type:xx)"
|| req.http.user-agent == "MVAClient"
|| req.http.user-agent == "MJ12bot"
|| req.http.user-agent == "spider-ads"
|| req.http.user-agent == "bakey"
|| req.http.user-agent == "NameOfAgent (CMS Spider)"
|| req.http.user-agent == "PBrowse 1.4b"
|| req.http.user-agent == "Poirot"
|| req.http.user-agent == "searchbot admin@google.com"
|| req.http.user-agent == "sogou develop spider"
|| req.http.user-agent == "WEP Search 00"
) {
error 403 "You are banned from this site. Please contact via a different client configuration if you believe that this is a mistake.";
}
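3. If there is no front-end proxy, you can do the blocking in Apache itself. The Apache configuration looks like this:
a). First declare an environment variable (at the outermost level is fine): stayout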
SetEnvIfNoCase user-agent "^(Java|Jakarta|Ruby|Wget|PHP|Python|Zend_Http_Client|curl|libwww|User-Agent|URI::Fetch|HTTP::Lite|urllib|http client|PleaseCrawl|TurnitinBot).*" stayout=1
SetEnvIfNoCase user-agent "^(-|MSIE 6.0|Mozilla/4.0 \(compatible; Advanced Email Extractor v2.xx\)|MVAClient|MJ12bot|bakey|Poirot|sogou develop spider)" stayout=1
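b). Then add a deny rule under the specific virtual host (this is Apache 2.2 syntax; on Apache 2.4, Order/Allow/Deny only work through mod_access_compat and are superseded by the Require directives):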
<Directory /home/jenkins/www/web/yoursite>
Options FollowSymLinks
AllowOverride All
Order allow,deny
Allow from all
Deny from env=stayout
</Directory>
4. Honeypots are another approach. The best known is http:BL, which has many open-source implementations for WordPress, Drupal, and Apache:
http://www.projecthoneypot.org/httpbl_implementations.php
Overall, iptables is probably the most powerful and the smartest option, while the front-end proxy is the simpler one to configure.