[Hacker News Repost] Full-scale file system acceleration on GPU [pdf]
-
Title: Full-scale file system acceleration on GPU [pdf]
Text:
Url: https://dl.gi.de/server/api/core/bitstreams/7c7a8830-fd81-4e56-8507-cd4809020660/content
Post by: west0n
Comments:
magicalhippo: Given that PCIe allows data to be piped directly from one device to another without going through the host CPU [1][2], I guess it might make sense to just have the GPU read blocks straight from the NVMe (or even NVMe-oF [3]) rather than having the CPU do a lot of work.

edit: blind as a bat, says so right in the paper of course: "PMem is mapped directly to the GPU, and NVMe memory is accessed via Peer to Peer-DMA (P2PDMA)"

[1]: https://nvmexpress.org/wp-content/uploads/Enabling-the-NVMe-CMB-and-PMR-Ecosystem.pdf
[2]: https://lwn.net/Articles/767281/
[3]: https://www.nvmexpress.org/wp-content/uploads/NVMe_Over_Fabrics.pdf
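For readers unfamiliar with this path: NVIDIA's GPUDirect Storage is a shipping counterpart to the P2PDMA route the commenter describes, letting NVMe blocks DMA straight into VRAM via the cuFile API. The sketch below is a minimal illustration of that API, not the paper's GPU4FS code; the file path and sizes are placeholders, and error handling is omitted.

```cuda
// Minimal sketch: read a file from NVMe directly into GPU memory with
// NVIDIA's cuFile API (GPUDirect Storage). Error handling abbreviated.
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cuda_runtime.h>
#include <cufile.h>

int main() {
    const size_t size = 1 << 20;            // 1 MiB read; arbitrary
    cuFileDriverOpen();                     // bring up the GDS driver

    int fd = open("/mnt/nvme/data.bin", O_RDONLY | O_DIRECT);  // placeholder path
    CUfileDescr_t descr = {};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    CUfileHandle_t fh;
    cuFileHandleRegister(&fh, &descr);      // register the fd with GDS

    void *devPtr = nullptr;
    cudaMalloc(&devPtr, size);
    cuFileBufRegister(devPtr, size, 0);     // pin the GPU buffer for DMA

    // DMA straight from the NVMe device into VRAM; the host CPU only
    // issues the request, it never touches the payload.
    ssize_t n = cuFileRead(fh, devPtr, size, /*file_offset=*/0, /*dev_offset=*/0);
    printf("read %zd bytes into GPU memory\n", n);

    cuFileBufDeregister(devPtr);
    cudaFree(devPtr);
    cuFileHandleDeregister(fh);
    close(fd);
    cuFileDriverClose();
    return 0;
}
```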
west0n: According to this paper, GPU4FS is a file system that can run on the GPU and be accessed by applications. Since GPUs cannot make system calls, GPU4FS uses shared video memory (VRAM) and a parallel queue implementation. Applications running on the GPU can use GPU4FS after modifying their code, eliminating the need for a CPU-side file system. The experiments were done on Optane memory.

It would be interesting to know whether this approach could optimize the performance of training and inference for large models.
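The parallel queue is the key trick here: a kernel cannot trap into the OS, so file-system requests are instead posted to a queue in memory that both producer and consumer can see. Below is a hypothetical illustration of that idea using CUDA unified memory; all names (FsRequest, FsQueue, enqueue_reads) are invented for this sketch and are not taken from GPU4FS.

```cuda
// Toy version of a "syscall-free" request queue: GPU threads claim ring
// slots atomically and write their file-system requests into them.
#include <cstdio>
#include <cuda_runtime.h>

struct FsRequest {
    int op;          // e.g. 0 = read, 1 = write (invented encoding)
    long offset;     // byte offset within the file
    long len;        // request length in bytes
};

struct FsQueue {
    unsigned int tail;        // next free slot, advanced atomically
    FsRequest slots[1024];    // fixed-capacity ring
};

// Each GPU thread claims a slot with an atomic and writes its request.
__global__ void enqueue_reads(FsQueue *q, long chunk) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int slot = atomicAdd(&q->tail, 1u) % 1024u;
    q->slots[slot] = FsRequest{0, tid * chunk, chunk};
}

int main() {
    FsQueue *q;
    cudaMallocManaged(&q, sizeof(FsQueue));   // visible to GPU and CPU
    q->tail = 0;

    enqueue_reads<<<4, 256>>>(q, 4096);
    cudaDeviceSynchronize();

    // A service loop (GPU-resident in GPU4FS's design; host-side in other
    // schemes) would drain the ring here and issue the storage operations.
    printf("%u requests enqueued\n", q->tail);
    cudaFree(q);
    return 0;
}
```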
ec109685: Interesting they would discuss system call overhead of opening a file, reading from it and closing it. Seems like in almost all cases the open and close calls would be overwhelmed by the other operations.
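That claim is easy to sanity-check on the host. The sketch below times open()+close() alone against open() plus one bulk read() plus close(); the path, iteration count, and buffer size are arbitrary placeholders, and the file must exist beforehand.

```cuda
// Host-only micro-benchmark: how much of the per-file cost is open/close
// versus the actual read? Compile with g++ or nvcc.
#include <fcntl.h>
#include <unistd.h>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const char *path = "/tmp/testfile";   // placeholder; pre-create it
    const int iters = 10000;
    std::vector<char> buf(1 << 20);       // 1 MiB read buffer
    long total = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; i++) {     // open + close only
        int fd = open(path, O_RDONLY);
        close(fd);
    }
    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; i++) {     // open + one bulk read + close
        int fd = open(path, O_RDONLY);
        total += read(fd, buf.data(), buf.size());
        close(fd);
    }
    auto t2 = std::chrono::steady_clock::now();

    auto us = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
    };
    printf("open+close: %ld us, open+read+close: %ld us (%ld bytes read)\n",
           us(t0, t1), us(t1, t2), total);
    return 0;
}
```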