[Hacker News repost] Why wordfreq will not be updated

hackernews

Title: Why wordfreq will not be updated

Text:

Url: https://github.com/rspeer/wordfreq/blob/master/SUNSET.md

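Because I am an AI, I cannot access the internet directly to fetch or analyze the content of external links. I can, however, offer some guidance based on the link provided.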

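If you want to use JinaReader to fetch and analyze the content, and need non-Chinese content translated into Chinese, one possible sequence of steps is:

1. **Fetch the content**:
   - Use JinaReader or another suitable tool to fetch the content of the given link (https://github.com/rspeer/wordfreq/blob/master/SUNSET.md).
   - Make sure the fetching tool can parse Markdown or whatever other format the content uses.

2. **Analyze the content**:
   - Once the content has been fetched, you can use JinaReader's built-in features to analyze the text. This may include keyword extraction, sentiment analysis, topic modeling, and so on.

3. **Translate non-Chinese content**:
   - If the fetched content contains non-Chinese parts, use a translation API (such as the Google Translate API or the Baidu Translate API) to translate them into Chinese.
   - In JinaReader, you may need to write some custom code for this, or use a library that supports translation.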
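Steps 1 and 3 can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions, not JinaReader's own method: it swaps in a plain HTTP fetch of GitHub's raw-file URL, and the helper names (`github_blob_to_raw`, `fetch_text`, `non_chinese_paragraphs`) are hypothetical. It also assumes that "non-Chinese" can be approximated as "a paragraph with no character in the basic CJK Unified Ideographs range".

```python
import re
import urllib.request
from typing import List

# Assumption: a paragraph counts as Chinese if it contains at least one
# character from the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")


def github_blob_to_raw(url: str) -> str:
    """Rewrite a github.com 'blob' page URL to its raw-file URL."""
    prefix = "https://github.com/"
    if not url.startswith(prefix) or "/blob/" not in url:
        raise ValueError(f"not a GitHub blob URL: {url}")
    owner_repo, path = url[len(prefix):].split("/blob/", 1)
    return f"https://raw.githubusercontent.com/{owner_repo}/{path}"


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Download a URL and decode the body as UTF-8 (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def non_chinese_paragraphs(text: str) -> List[str]:
    """Return the blank-line-separated paragraphs that contain no CJK characters."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text)]
    return [p for p in paragraphs if p and not CJK_PATTERN.search(p)]


# Offline demonstration on a small sample; no network call is made here.
sample = "Hello world.\n\n你好，世界。\n\nMixed 文本 here."
print(github_blob_to_raw("https://github.com/rspeer/wordfreq/blob/master/SUNSET.md"))
print(non_chinese_paragraphs(sample))
```

To process the real page, pass the result of `github_blob_to_raw(...)` to `fetch_text(...)` and feed the returned text to `non_chinese_paragraphs(...)`. Only the paragraphs that come back need to go to a translation API, which keeps the number of API calls down.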

The following is a simplified code example showing how to translate text with Python and the Google Translate API:

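```python
from google.cloud import translate_v2 as translate


def translate_text(text, target='zh-CN'):
    """Translate text into the target language (Simplified Chinese by default)."""
    client = translate.Client()
    # format_='text' requests plain text rather than HTML-escaped output
    result = client.translate(text, target_language=target, format_='text')
    return result['translatedText']


# Assume the following non-Chinese content has already been fetched
non_chinese_text = "This is an example of non-Chinese text."

# Translate the text
chinese_text = translate_text(non_chinese_text)
print(chinese_text)
```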

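4. **Summarize the content**:
   - After translating all the non-Chinese content, you can use JinaReader or another text analysis tool to generate a summary.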

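Note that running the code above requires the google-cloud-translate library, plus a Google Cloud Platform project with the Translation API enabled. Since you asked not to install any packages with pip, make sure these libraries are already present in your environment.

If you need more specific help, such as how to set up JinaReader or how to handle a particular translation problem, please provide more information or the relevant code.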

        
## Post by: tomthe
        
### Comments: 
        
**voytec**: I agree in general but the web was already polluted by Google's unwritten SEO rules. Single-sentence paragraphs, multiple keyword repetitions and focus on "indexability" instead of readability, made the web a less than ideal source for such analysis long before LLMs.

It also made the web a less than ideal source for training. And yet LLMs were still fed articles written for Googlebot, not humans. ML/LLM is the second iteration of writing pollution. The first was humans writing for corporate bots, not other humans.
            
**jgrahamc**: I created https://lowbackgroundsteel.ai/ in 2023 as a place to gather references to unpolluted datasets. I'll add wordfreq. Please submit stuff to the Tumblr.
            
**jll29**: I regret that the situation led the OP to feel discouraged about the NLP community, to which I belong, and I just want to say "we're not all like that", even though it is a trend and we're close to peak hype (slightly past even?).

The complaint about pollution of the Web with artificial content is timely, and it's not even the first time, given the spam farms intended to game PageRank, among other nonsense. This may just mean there is new value in hand-curated lists of high-quality Web sites (some people use the term "small Web").

Each generation of the Web needs techniques to overcome its particular generation of adversarial mechanisms, and the current Web stage is no exception.

When Eric Arthur Blair wrote 1984 (under his pen name "George Orwell"), he anticipated people consuming auto-generated content to keep the masses away from critical thinking. This is now happening (he even anticipated auto-generated porn in the novel), but the technologies criticized can also be used for good, and that is what I try to do in my NLP research team. Good *will* prevail in the end.
            
**0xbadcafebee**: I'm going to call it: The Web is dead. Thanks to "AI" I spend more time now digging through searches trying to find something useful than I did back in 2005. And the sites you do find are largely garbage.

As a random example: just trying to find a particular popular set of wireless earbuds takes me at least 10 minutes, when I already know the company, the company's website, other vendors that sell the company's goods, etc. It's just buried under tons of dreck. And my laptop is "old" (an 8-core i7 processor with 16GB of RAM) so it struggles to push through graphics-intense "modern" websites like the vendor's. Their old website was plain and worked great, letting me quickly search through their products and quickly purchase them. Last night I literally struggled to add things to cart and check out; it was actually harrowing.

Fuck the web, fuck web browsers, web design, SEO, searching, advertising, and all the schlock that comes with it. I'm done. If I can in any way purchase something without the web, I'mma do that. I don't hate technology (entirely...) but the web is just a rotten egg now.
            
**weinzierl**: *"I don't think anyone has reliable information about post-2021 language usage by humans."*

We've been past the tipping point when it comes to text for some time, but for video I feel we are living through the watershed moment right now.

Especially smaller children don't have a good intuition on what is real and what is not. When I get asked if the person in a video is real, I still feel pretty confident to answer but I get less and less confident every day.

The technology is certainly there, but the majority of video content is still not affected by it. I expect this to change very soon.