[Hacker News Roundup] Counterintuitive Properties of High Dimensional Space

Title: Counterintuitive Properties of High Dimensional Space
Url: https://people.eecs.berkeley.edu/~jrs/highd/
## Post by: nabla9

### Comments:

**rectang**: Time to share my favorite quote from *Symbols, Signals and Noise* by John R. Pierce, where he discusses how Shannon achieved a breakthrough in Information Theory:

> *This chapter has had another aspect. In it we have illustrated the use of a novel viewpoint and the application of a powerful field of mathematics in attacking a problem of communication theory. Equation 9.3 was arrived at by the by-no-means-obvious expedient of representing long electrical signals and the noises added to them by points in a multidimensional space. The square of the distance of a point from the origin was interpreted as the energy of the signal represented by a point.*

> *Thus a problem in communication theory was made to correspond to a problem in geometry, and the desired result was arrived at by geometrical arguments.*

**gcanyon**: One that isn't listed here, and which is critical to machine learning, is the idea of near-orthogonality. When you think of 2D or 3D space, you can only have 2 or 3 orthogonal directions, and allowing for near-orthogonality doesn't really gain you anything. But in higher dimensions, you can reasonably work with directions that are only somewhat orthogonal, and "somewhat" gets pretty silly large once you get to thousands of dimensions -- like 75 degrees is fine (I'm writing this from memory, don't quote me). And the number of orthogonal-enough directions you can have scales as maybe as much as 10^sqrt(dimension_count), meaning that yes, if your embeddings have 10,000 dimensions, you might be able to have literally 10^100 different orthogonal-enough directions. This is critical for turning embeddings + machine learning into LLMs.
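gcanyon's near-orthogonality point is easy to check numerically. A minimal sketch (my own illustration, not from the thread) using NumPy: sample random unit vectors and measure how far any pairwise angle strays from 90 degrees. In high dimensions even the worst-case deviation collapses toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_angle_deviation(dim, n_vectors=200):
    """Sample n_vectors random unit vectors in R^dim and return the
    largest deviation (in degrees) of any pairwise angle from 90."""
    v = rng.standard_normal((n_vectors, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)        # project onto the unit sphere
    cos = (v @ v.T)[np.triu_indices(n_vectors, k=1)]     # cosines of all distinct pairs
    angles = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return float(np.abs(angles - 90.0).max())

for dim in (3, 100, 10_000):
    print(dim, max_angle_deviation(dim))
```

With 200 vectors in 10,000 dimensions, every pair lands within a few degrees of orthogonal, while in 3 dimensions some pairs come out nearly parallel -- consistent with the claim that high-dimensional spaces admit enormous numbers of "orthogonal-enough" directions.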
**FabHK**: For high-dimensional spheres, most of the volume is in the "shell", i.e. near the boundary [0]. This sort of makes sense to me, but I don't know how to square that with the observation in the article that most of the surface area is near the equator. (In particular, by symmetry, it's near *any* equator; so, one would think, in their intersection. That is near the centre, though, not the shell.)

Anyway. Never buy a high-dimensional orange, it's mostly rind.

[0] https://www.math.wustl.edu/~feres/highdim

**mattxxx**: Yea - high dimensional spaces are weird and hard to reason about... and we're working very frequently in them, especially when dealing with ML.

**derbOac**: I love this stuff because it's so counterintuitive until you've worked through some of it. There was an article linked to on HN a while back about high-dimensional Gaussian distributions that was similar in message, and probably mathematically related at some level. It has so many implications for much of the work in deep learning and large data, among other things.
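FabHK's "mostly rind" observation follows directly from how volume scales with radius: an n-ball of radius r has volume proportional to r^n, so the fraction of a unit ball lying within eps of the surface is 1 - (1 - eps)^n. A quick sketch (my own illustration, not from the thread):

```python
def shell_fraction(dim, eps=0.01):
    """Fraction of a unit dim-ball's volume within eps of the surface.

    Volume scales as r**dim, so the inner ball of radius (1 - eps)
    holds (1 - eps)**dim of the total; the remainder is the 'rind'.
    """
    return 1.0 - (1.0 - eps) ** dim

for dim in (3, 100, 1000):
    print(dim, shell_fraction(dim))
```

At eps = 0.01 the shell holds about 3% of a 3-ball but over 99.99% of a 1000-ball. As for the apparent tension FabHK raises: each equatorial slab contains nearly all of the surface measure, so the intersection of several slabs still does -- points on the sphere can sit near the surface and near every equatorial hyperplane at once, since "near the equator" only constrains a few coordinates.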