How to Block Useless Crawlers and Junk Spiders

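I noticed a large number of crawler entries in the logs for 當紅俱樂部, so I put together a list of useless crawlers along with ways to block them.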

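Useless crawlers, which some people call "junk spiders", are mostly bots run by SEO companies or bots that bring no value to your site. They request pages without any restraint and drive up the server load. The performance impact is tolerable, but they are still a nuisance: they generate a huge number of access.log entries, which makes searching the logs tedious.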

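There are two ways to approach the problem:

1. Treat the traffic as a stress test and work on improving the site's performance.
2. Block these crawlers' "attacks".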

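The blocking methods are listed below. The check is based mainly on the User-Agent header and is configured in nginx. Note that one of the patterns is python: if you run your own Python scripts to make curl-style requests against the site, remember to exclude them from the rule: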

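# Place inside a server or location block.
# ~* is a case-insensitive regex match against the User-Agent header.
if ($http_user_agent ~* (SemrushBot|python|MJ12bot|AhrefsBot|hubspot|opensiteexplorer|leiki|webmeup|DotBot|petalbot)) {
    return 400;
}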
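
Before deploying the rule, it can help to check which User-Agent strings it would actually catch. The following is a minimal sketch that mirrors the nginx pattern with Python's re module, assuming that Python's regex engine and nginx's PCRE behave the same for this simple alternation. The function name is_blocked and the sample User-Agent strings are made up for illustration and are not taken from real logs.

```python
import re

# Same alternation as the nginx rule. `~*` in nginx is a case-insensitive,
# unanchored match, which re.search with re.IGNORECASE mirrors here.
BLOCKED = re.compile(
    r"(SemrushBot|python|MJ12bot|AhrefsBot|hubspot|opensiteexplorer"
    r"|leiki|webmeup|DotBot|petalbot)",
    re.IGNORECASE,
)


def is_blocked(user_agent: str) -> bool:
    """Return True if the pattern matches anywhere in the User-Agent string."""
    return BLOCKED.search(user_agent) is not None


# Illustrative User-Agent strings (assumed samples, not from real logs).
samples = [
    "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)",
    "python-requests/2.31.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/121.0",
]

for ua in samples:
    print(is_blocked(ua), ua)
```

The second sample shows why the python pattern needs care: a script that uses the default python-requests User-Agent would be rejected too. One possible way to exclude your own scripts is to send a custom User-Agent that does not contain any of the blocked words.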

In addition, set up robots.txt rules for these crawlers:

User-agent: SemrushBot
Disallow: /

User-agent: python
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: AhrefsBot
Disallow: /

User-agent: hubspot
Disallow: /

User-agent: opensiteexplorer
Disallow: /

User-agent: leiki
Disallow: /

User-agent: webmeup
Disallow: /

User-agent: DotBot
Disallow: /

User-agent: petalbot
Disallow: /

Whether this file has any effect, however, depends on whether the crawler itself respects it.
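
To see how a well-behaved crawler would interpret these rules, the file can be fed to Python's standard urllib.robotparser module. This is only a sketch: it uses a two-entry excerpt of the rules above, and SomeOtherBot and the path /any/page are hypothetical names chosen for the example. It shows what a compliant client would decide and does nothing to stop a crawler that ignores robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Excerpt of the robots.txt rules above (two of the listed bots).
ROBOTS_TXT = """\
User-agent: SemrushBot
Disallow: /

User-agent: MJ12bot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A listed bot is disallowed everywhere.
print(parser.can_fetch("SemrushBot", "/any/page"))

# A bot with no matching entry (hypothetical name) is still allowed,
# because the file has no "User-agent: *" fallback group.
print(parser.can_fetch("SomeOtherBot", "/any/page"))
```

Since the file only names specific bots, every other crawler remains allowed, which matches the intent of blocking just the listed ones.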
