cnblogs/dcrenl/IIS设置文件 Robots.txt 禁止爬虫.html
2024-09-24 12:43:01 +08:00


<p>robots.txt is used to keep web crawlers out of specified directories of a site. Its format uses a line-oriented syntax: blank lines, comment lines (starting with #), and rule lines. A rule line has the form "Field: value". The common rule lines are User-Agent, Disallow, and Allow.</p>
<p>The User-Agent line</p>
<pre>User-Agent: robot-name
User-Agent: *
</pre>
<p>Disallow and Allow lines</p>
<pre>Disallow: /path
Disallow: /   # "/" matches every path, i.e. the whole site is blocked
Disallow:     # an empty value matches nothing, i.e. nothing is blocked
Allow: /path
Allow: /      # "/" matches every path, i.e. the whole site is allowed
</pre>
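<p>Rule evaluation can be checked programmatically. Below is a minimal sketch using Python's standard urllib.robotparser module; the rules and URLs are hypothetical examples, not from any real site:</p>

```python
# A minimal sketch using Python's stdlib robots.txt parser.
# The rules and URLs below are hypothetical examples.
from urllib.robotparser import RobotFileParser

rules = """\
User-Agent: *
Allow: /private/readme.html
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("anybot", "http://example.com/index.html"))          # True (no rule matches)
print(rp.can_fetch("anybot", "http://example.com/private/data.html"))   # False
print(rp.can_fetch("anybot", "http://example.com/private/readme.html")) # True
```

<p>Note that urllib.robotparser applies the first matching rule within a group, which is why the more specific Allow line is placed before the broader Disallow line here.</p>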
<h2>User-Agent names of common search engines</h2>
<table>
<tbody>
<tr><th>Search engine</th><th>User-Agent value</th></tr>
<tr>
<td>Google</td>
<td>googlebot</td>
</tr>
<tr>
<td>Baidu</td>
<td>baiduspider</td>
</tr>
<tr>
<td>Yahoo</td>
<td>slurp</td>
</tr>
<tr>
<td>MSN</td>
<td>msnbot</td>
</tr>
<tr>
<td>Alexa</td>
<td>ia_archiver</td>
</tr>
</tbody>
</table>
<p>Some search-engine visits I observed by capturing packets on a Linux box:</p>
<pre># tcpdump -n -nn -A -l -s1024 'tcp port 80'|grep User-Agent
User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
User-Agent: Googlebot-Image/1.0
User-Agent: Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 5 subscribers; feed-id=4619555564728728616)
User-Agent: Mozilla/5.0(compatible; Sosospider/2.0; +http://help.soso.com/webspider.htm)
User-Agent: Mozilla/5.0 (compatible; YoudaoBot/1.0; http://www.youdao.com/help/webmaster/spider/; )
User-Agent: Mozilla/5.0 (compatible; JikeSpider; +http://shoulu.jike.com/spider.html)
</pre>
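<p>If you log such requests, the crawler name can be pulled out of the User-Agent string. A small sketch; the sample strings are copied from the capture above, and the regex is only one plausible heuristic:</p>

```python
import re

# Sample User-Agent strings copied from the tcpdump capture above.
samples = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0(compatible; Sosospider/2.0; +http://help.soso.com/webspider.htm)",
    "Googlebot-Image/1.0",
]

def bot_name(ua):
    # Prefer the token inside "(compatible; Name/version; ...)";
    # otherwise fall back to the leading "Name/version" token.
    m = re.search(r"compatible;\s*([\w.-]+)", ua)
    if m:
        return m.group(1)
    return ua.split("/")[0]

print([bot_name(ua) for ua in samples])  # ['Googlebot', 'Sosospider', 'Googlebot-Image']
```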
<p>JikeSpider belongs to Jike Search, the "People's Search" engine.</p>
<h2>A supplement to robots.txt</h2>
<p>If you have no write access to the site's root directory and therefore cannot create a robots.txt file, or you want a specific page kept out of search-engine indexes, you can use a <a href="http://www.berlinix.com/html/header.php#meta">meta tag</a> to block crawlers:</p>
<pre>&lt;meta name="robots" content="noindex"&gt;
&lt;meta name="googlebot" content="noindex"&gt;
</pre>
<p>The default value of the robots meta tag is "index,follow". Its possible values include (from <a href="http://support.google.com/webmasters/bin/answer.py?hl=zh-Hans&amp;answer=79812" target="_blank">Google Webmaster Help</a>):</p>
<dl><dt>noindex</dt><dd>Prevents the page from being indexed.</dd><dt>nofollow</dt><dd>Prevents Googlebot from following links on the page.</dd><dt>noarchive</dt><dd>Prevents Google from showing a cached-copy link for the page.</dd><dt>noimageindex</dt><dd>Prevents the page from being indexed by Google Image Search.</dd></dl>
<h1>robots.txt in the real world</h1>
<h2>Taobao blocks Baidu</h2>
<p>Taobao blocked Baidu's crawler. As of September 2008, the content of http://www.taobao.com/robots.txt was:</p>
<pre>User-agent: Baiduspider
Disallow: /
User-agent: baiduspider
Disallow: /
</pre>
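<p>The effect of these rules can be reproduced with Python's stdlib parser; a quick check of the Taobao rules quoted above (the Googlebot probe is just a counter-example of an unblocked crawler):</p>

```python
from urllib.robotparser import RobotFileParser

# The 2008 Taobao rules quoted above.
rules = """\
User-agent: Baiduspider
Disallow: /
User-agent: baiduspider
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Baiduspider", "http://www.taobao.com/"))  # False
print(rp.can_fetch("Googlebot", "http://www.taobao.com/"))    # True (no group matches; default is allow)
```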
<h2>The Baidu vs. 360 search-engine dispute</h2>
<p>In August 2012, 360 launched its search engine and clashed head-on with Baidu. Baidu engineers publicly accused 360 of violating the robots protocol and stealing Baidu's content. Take Baidu Zhidao as an example: the content of http://zhidao.baidu.com/robots.txt was roughly as follows:</p>
<pre>User-agent: Baiduspider
Disallow: /w?
Allow: /
User-agent: Googlebot
User-agent: MSNBot
User-agent: Baiduspider-image
User-agent: YoudaoBot
User-agent: Sogou web spider
User-agent: Sogou inst spider
User-agent: Sogou spider2
User-agent: Sogou blog
User-agent: Sogou News Spider
User-agent: Sogou Orion spider
User-agent: JikeSpider
User-agent: Sosospider
Allow: /
User-agent: *
Disallow: /
</pre>
<p>In other words, 360's crawler should fall through to the last group, which forbids crawling any of Baidu Zhidao. Yet 360's search results did include Baidu Zhidao content.</p>
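<p>This whitelist-plus-catch-all pattern can be verified with Python's stdlib parser. Below is a trimmed-down sketch of the Zhidao rules; "360Spider" is only a stand-in name for a crawler that is not on the whitelist:</p>

```python
from urllib.robotparser import RobotFileParser

# Trimmed-down version of the Baidu Zhidao rules quoted above.
# "360Spider" is a hypothetical stand-in for an unlisted crawler.
rules = """\
User-agent: Googlebot
Allow: /
User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "http://zhidao.baidu.com/question/1"))  # True  (whitelisted group)
print(rp.can_fetch("360Spider", "http://zhidao.baidu.com/question/1"))  # False (catch-all group)
```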