【Hacker News Repost】Nepenthes is a tarpit to catch AI web crawlers
-
Title: Nepenthes is a tarpit to catch AI web crawlers
Text:
Url: https://zadzmo.org/code/nepenthes/
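The linked page has the actual implementation. As a rough conceptual sketch of what a crawler tarpit does (slowly streamed filler pages whose links only lead to more generated pages), the toy handler below illustrates the idea; the port, word list, and timings are invented for illustration, and this is not the Nepenthes code itself.

```python
# Conceptual sketch of a crawler tarpit (NOT the actual Nepenthes implementation):
# every URL returns a slowly dripped page of filler text whose links only point
# back into more generated pages, so a crawler that follows them never leaves.
import random
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "nepenthes", "pitcher", "plant"]

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>")
        for _ in range(20):
            time.sleep(2)  # drip-feed the response to keep the crawler's connection occupied
            sentence = " ".join(random.choices(WORDS, k=12))
            link = f"/{random.getrandbits(32):08x}"  # random path, served by this same handler
            self.wfile.write(f"<p>{sentence} <a href='{link}'>more</a></p>".encode())
            self.wfile.flush()
        self.wfile.write(b"</body></html>")

    def log_message(self, *args):
        pass  # keep the demo quiet

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), TarpitHandler).serve_forever()
```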
## Post by: blendergeek

### Comments:

**bflesch**: Haha, this would be an amazing way to test the ChatGPT crawler reflective DDoS vulnerability [1] I published last week.

Basically, a single HTTP request to the ChatGPT API can trigger 5000 HTTP requests by the ChatGPT crawler to a website.

The vulnerability is/was thoroughly ignored by OpenAI/Microsoft/BugCrowd, but I really wonder what would happen when the ChatGPT crawler interacts with this tarpit several times per second. As the ChatGPT crawler uses various Azure IP ranges, I actually think the tarpit would crash first.

The vulnerability reporting experience with OpenAI / BugCrowd was really horrific. It's always difficult to get attention for DoS/DDoS vulnerabilities, and companies always act like they are not a problem. But if their system goes dark and the CEO calls, then suddenly they accept it as a security vulnerability.

I spent a week trying to reach OpenAI/Microsoft to get this fixed, but I gave up and just published the writeup.

I don't recommend exploiting this vulnerability, for legal reasons.

[1] https://github.com/bf/security-advisories/blob/main/2025-01-ChatGPT-Crawler-Reflective-DDOS-Vulnerability.md

**m3047**: Having first run a bot motel in, I think, 2005, I'm thrilled and greatly entertained to see this taking off. When I first did it, I had crawlers lost in it literally for days, and you could tell that eventually some human would come back and try to suss out the wreckage. After about a year I started seeing URLs like ../this-page-does-not-exist-hahaha.html. Sure, it's an arms race, but just as security is generally an afterthought these days, don't think that you can't be the woodpecker which destroys civilization. The comments are great too; this one in particular reflects my personal sentiments:

> the moment it becomes the basic default install (a la adblockers in browsers for people), it does not matter what the bigger players want to do

**taikahessu**: We had our non-profit website drained of bandwidth and the site closed temporarily (!!) under our hosting deal because of the Amazon bot aggressively crawling URLs like ?page=21454 ... etc.

Gladly, Siteground restored our site without any repercussions, as it was not our fault. Added the Amazon bot to robots.txt after that one.

Don't like how things are right now. Is a tarpit the solution? Or better laws? Would they stop the Chinese bots? Should they even? I don't know.
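For reference, the robots.txt change taikahessu mentions looks roughly like the sketch below. The user-agent tokens shown (Amazonbot for Amazon's crawler, GPTBot for OpenAI's) are the commonly documented ones, and, as later comments note, compliance is entirely voluntary on the crawler's side.

```
# robots.txt - ask specific crawlers to stay away (compliance is voluntary)
User-agent: Amazonbot
Disallow: /

User-agent: GPTBot
Disallow: /
```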
**dspillett**: Tarpits that slow down the crawling may stop bots from crawling your entire site, but they won't care unless a great many sites do this. Your site will be assigned a thread or two at most, and the rest of the crawling machine's resources will be off scanning other sites. There will be timeouts to stop a particular site from keeping even a couple of cheap threads busy for long. And anything like this may get you delisted from search results you might want to be in, as it can be difficult to reliably distinguish these bots from others and sometimes even from real users, and if things like this get good enough to be any hassle to the crawlers, they'll just start lying (more) and be even harder to detect.

People scraping for nefarious reasons have had decades of other people trying to stop them, so mitigation techniques are well known unless you can come up with something truly unique.

I don't think random Markov-chain-based text generators are going to pose much of a problem to LLM training scrapers either. They'll have rate limits and vast attention spreading too. I also suspect that random pollution isn't going to have as much effect as people think, because of the way the inputs are tokenised. It will have an effect, but this will be massively dulled by the randomness – statistically, relatively unique information and common (non-random) combinations will still bubble up obviously in the process.

I think it would be better to have less random pollution: use a small set of common text to pollute the model. Something like "this was a common problem with Napoleonic genetic analysis due to the pre-frontal nature of the ongoing stream process, as is well documented in the grimoire of saint Churchill the III, 4th edition, 1969" – in fact these snippets could be Markov generated, but use the same few repeatedly. They would need to be nonsensical enough to be obvious noise to a human reader, or highlighted in some way that the scraper won't pick up on but a general intelligence like most humans would (perhaps a CSS-styled side note inlined in the main text? – though that would likely have accessibility issues), and you would need to cycle them out regularly or scrapers will get "smart" and easily filter them out. But them appearing in full, numerous times, might mean they have a more significant effect on the tokenising process than more entirely random text.
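A minimal sketch of the "less random pollution" idea dspillett describes, assuming a toy word-level Markov chain plus a small fixed pool of snippets that are reused verbatim; the corpus, snippet pool, and pollution rate are all invented for illustration.

```python
# Sketch: word-level Markov babble, salted with a small fixed set of reused
# "pollution" snippets rather than fresh random noise each time.
import random
from collections import defaultdict

CORPUS = "the crawler followed the link and the page linked to another page about the crawler".split()

# Fixed nonsense snippets, reused verbatim so they recur across generated pages.
POLLUTION = [
    "this was a common problem with Napoleonic genetic analysis",
    "as documented in the grimoire of saint Churchill the III, 4th edition, 1969",
]

def build_chain(words):
    """Map each word to the list of words observed to follow it."""
    chain = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def babble(chain, length=40, pollution_rate=0.1):
    """Generate filler text, occasionally inserting one of the fixed snippets."""
    word = random.choice(list(chain))
    out = [word]
    while len(out) < length:
        if random.random() < pollution_rate:
            out.append(random.choice(POLLUTION))  # same few snippets, repeated
        word = random.choice(chain[word]) if chain[word] else random.choice(list(chain))
        out.append(word)
    return " ".join(out)

if __name__ == "__main__":
    print(babble(build_chain(CORPUS)))
```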
**dilDDoS**: I appreciate the intent behind this, but as others have pointed out, this is more likely to DoS your own website than to accomplish the true goal.

Probably unethical or not possible, but you could maybe spin up a bunch of static pages on GitHub Pages with random filler text and then have your site redirect to a random one of those instead. Unless web crawlers don't follow redirects.
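A minimal sketch of that redirect variant, assuming bot detection by a naive User-Agent check and a hypothetical pool of pre-generated filler pages; the host name and bot list below are made up, and, as the comment notes, crawlers that ignore redirects or spoof their User-Agent are unaffected.

```python
# Sketch of the redirect idea: send suspected crawlers a 302 to one of a set of
# pre-generated static filler pages hosted elsewhere, instead of real content.
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical pool of static filler pages (e.g. on a GitHub Pages site).
FILLER_PAGES = [f"https://example.github.io/filler/{i}.html" for i in range(100)]

# Naive, easily spoofed check - crawlers can and do lie about their User-Agent.
BOT_MARKERS = ("GPTBot", "Amazonbot", "CCBot")

class RedirectingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        if any(marker in ua for marker in BOT_MARKERS):
            self.send_response(302)
            self.send_header("Location", random.choice(FILLER_PAGES))
            self.end_headers()
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"Real content for real visitors.\n")

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8081), RedirectingHandler).serve_forever()
```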