[Hacker News repost] Why wordfreq will not be updated
Title: Why wordfreq will not be updated
Url: https://github.com/rspeer/wordfreq/blob/master/SUNSET.md
## Post by: tomthe

### Comments:

**voytec**: I agree in general, but the web was already polluted by Google's unwritten SEO rules. Single-sentence paragraphs, multiple keyword repetitions, and a focus on "indexability" instead of readability made the web a less than ideal source for such analysis long before LLMs.

It also made the web a less than ideal source for training. And yet LLMs were still fed articles written for Googlebot, not humans. ML/LLM is the second iteration of writing pollution. The first was humans writing for corporate bots, not other humans.

**jgrahamc**: I created https://lowbackgroundsteel.ai/ in 2023 as a place to gather references to unpolluted datasets. I'll add wordfreq. Please submit stuff to the Tumblr.

**jll29**: I regret that the situation led the OP to feel discouraged about the NLP community, to which I belong, and I just want to say "we're not all like that", even though it is a trend and we're close to peak hype (slightly past even?).

The complaint about pollution of the Web with artificial content is timely, and it's not even the first time, given the spam farms intended to game PageRank, among other nonsense. This may just mean there is new value in hand-curated lists of high-quality Web sites (some people use the term "small Web").

Each generation of the Web needs techniques to overcome its particular generation of adversarial mechanisms, and the current Web stage is no exception.

When Eric Arthur Blair wrote 1984 (under his pen name "George Orwell"), he anticipated people consuming auto-generated content to keep the masses away from critical thinking. This is now happening (he even anticipated auto-generated porn in the novel), but the technologies criticized can also be used for good, and that is what I try to do in my NLP research team. Good *will* prevail in the end.

**0xbadcafebee**: I'm going to call it: The Web is dead. Thanks to "AI" I spend more time now digging through searches trying to find something useful than I did back in 2005. And the sites you do find are largely garbage.

As a random example: just trying to find a particular popular set of wireless earbuds takes me at least 10 minutes, when I already know the company, the company's website, other vendors that sell the company's goods, etc. It's just buried under tons of dreck. And my laptop is "old" (an 8-core i7 processor with 16GB of RAM) so it struggles to push through graphics-intense "modern" websites like the vendor's. Their old website was plain and worked great, letting me quickly search through their products and quickly purchase them. Last night I literally struggled to add things to cart and check out; it was actually harrowing.

Fuck the web, fuck web browsers, web design, SEO, searching, advertising, and all the schlock that comes with it. I'm done. If I can in any way purchase something without the web, I'mma do that. I don't hate technology (entirely...) but the web is just a rotten egg now.

**weinzierl**: *"I don't think anyone has reliable information about post-2021 language usage by humans."*

We've been past the tipping point when it comes to text for some time, but for video I feel we are living through the watershed moment right now.

Smaller children especially don't have a good intuition for what is real and what is not. When I get asked if the person in a video is real, I still feel pretty confident answering, but I get less and less confident every day.

The technology is certainly there, but the majority of video content is still not affected by it. I expect this to change very soon.