[Hacker News repost] Pagination widows, or, Why I'm embarrassed about my eBook (2023)
Title: Pagination widows, or, Why I'm embarrassed about my eBook (2023)
Url: https://clagnut.com/blog/2426
## Post by: OuterVale

### Comments:

**userbinator**: The fact that it's a book about typography may mean the requirements are a little different, because I personally (and likely many others) don't really pay attention to such things.

**gorgoiler**: In the page model, a heading says it needs only one line of vertical space, so if there’s a tiny bit of space at the bottom of the page it’ll get orphaned. (Vertical box space shown as `!` and `%` for the heading and paragraph, respectively.)

    Page 1      Page 2
    ..          Paragraph..%
    ..          ..text.    %
    ..          ..
    ..          ..
    !Heading

When instead it should be moved to the top of the next page:

    Page 1      Page 2
    ..          Heading     !
    ..          Paragraph.. %
    ..          ..text.     %
    ..          ..
    ..

Rather than being honest about needing one line…

    Heading     !
    Paragraph.. %
    ..text.     %

…the heading could instead claim it needs three lines, which would ensure it would never be orphaned:

    Heading     !
                !
                !
    Paragraph.. %
    ..text.     %

But now you have a big gap below the heading.

If you could then shift the paragraph up from where it should be in the flow such that the vertical space of the heading and paragraph overlapped…

    Heading     !
    Paragraph.. !%
    ..text.     !%

…then you’d get a heading that would never be orphaned on one line, but which looked as if it only used one line.

**acabal**: If you think it's bad that `break-*` isn't supported in Firefox or Chrome, wait till you see what your ebook looks like in Kindle, or worse, ADE-based readers, of which there are still many in use!

Kindle, the reading device with by far the largest market share, is basically the IE6 of ereaders - too big to ignore, and at the same time dragging down the entire ebook ecosystem with its crappy renderer. Amazon has shown little interest in improving it for over a decade now, while simultaneously fragmenting its own ecosystem with a variety of different proprietary formats that support different CSS and features.

ADE, while less common in new devices, is still very common in much older devices - B&N's eink Nooks were based on ADE at least as late as a few years ago. (Perhaps they still are?) ADE is closer to IE5 in terms of CSS support!

At Standard Ebooks we're often hamstrung in our attempts to make beautiful ebooks by these big players refusing to improve their renderers. We're forced to dumb down our CSS and use outdated techniques (like occasionally having to use tables for layout!) because ebook renderers are so bad.

iBooks is the top tier renderer, because as far as I can tell it's basically a wrapper for an up-to-date Webkit; next is Kobo - also Webkit-based - along with other Webkit-based indie apps. The rest of the big players are far, far, far distant.

**cratermoon**: Maybe the reason we're still stuck with LaTeX and PDFs is because ebook software can't be bothered to implement decent typesetting.

**fragmede**: Pragmatism wins out over waiting for CSS properties to get implemented, and a `div` with `display: inline-block` works today in epubs and doesn't need to be backported to iBooks.

https://ebooks.stackexchange.com/questions/7014/how-can-i-prevent-a-widowed-orphaned-header
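
The page model gorgoiler describes can be tried out in a few lines of Python. This is only a toy sketch under stated assumptions: every block is one line tall, a page holds a fixed number of lines, and the names `Block`, `claimed`, and `paginate` are hypothetical, not taken from any real ebook renderer. The idea it shows is that the fit check uses the height a block *claims*, while only its real height is consumed on the page, which plays the role of the overlap in the last diagram.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    height: int = 1   # lines the block really occupies on the page
    claimed: int = 1  # lines the block asks for when the fit check runs


def paginate(blocks, page_height):
    """Lay blocks out on pages of page_height lines (toy model)."""
    pages = [[]]
    used = 0
    for block in blocks:
        # The fit check uses the larger of the real and the claimed height.
        need = max(block.height, block.claimed)
        if used > 0 and used + need > page_height:
            pages.append([])
            used = 0
        pages[-1].append(block.name)
        # Only the real height is consumed, so the text that follows
        # "overlaps" the extra space the heading claimed.
        used += block.height
    return pages


def sample_blocks(heading_claim):
    """Four lines of earlier text, then a heading and a two-line paragraph."""
    return (
        [Block("..") for _ in range(4)]
        + [Block("Heading", claimed=heading_claim)]
        + [Block("Paragraph.."), Block("..text.")]
    )


honest = paginate(sample_blocks(heading_claim=1), page_height=5)
greedy = paginate(sample_blocks(heading_claim=3), page_height=5)
print(honest)
print(greedy)
```

With a five-line page, the honest heading ends up as the last line of page 1, while the heading that claims three lines moves to the top of page 2 together with its paragraph and leaves no gap below it.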