Title: | Convert Html into Text |
---|---|
Description: | Convert an HTML document to plain text by stripping all HTML tags. |
Authors: | Sangchul Park [aut, cre] |
Maintainer: | Sangchul Park <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.2.1 |
Built: | 2025-02-21 02:57:59 UTC |
Source: | https://github.com/replicable/htm2txt |
Display the plain text of a web page at a given URL
browse(URL, ...)
URL |
A character string giving the URL of a web page. |
... |
Other |
None (invisible NULL).
browse("https://www.wikipedia.org/")
Extract the plain text of a web page at a given URL
gettxt(URL, encoding = "UTF-8", ...)
URL |
A character string giving the URL of a web page. |
encoding |
Encoding method (e.g., "UTF-8", "latin1", "bytes", or "unknown"). |
... |
Other |
A character string containing the plain text converted from the HTML document at the URL.
text = gettxt("https://www.wikipedia.org/")
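To illustrate what the `encoding` argument is for, the following is a minimal base-R sketch of reading a page's raw HTML with an explicit encoding. It is not taken from the package source and may differ from how `gettxt` actually retrieves pages; `fetch_page_sketch` and its `location` argument are hypothetical names introduced here.

```r
# Hypothetical helper (not part of htm2txt): read the raw HTML at a URL or
# local file path, declaring the encoding the source is assumed to use.
fetch_page_sketch <- function(location, encoding = "UTF-8") {
  # Use url() for anything that looks like scheme://..., file() otherwise.
  is_url <- grepl("^[A-Za-z][A-Za-z0-9+.-]*://", location)
  con <- if (is_url) {
    url(location, encoding = encoding)
  } else {
    file(location, encoding = encoding)
  }
  on.exit(close(con))

  # readLines() opens the connection, reads every line, and re-encodes the
  # input from `encoding` to the native encoding.
  lines <- readLines(con, warn = FALSE)

  # Join the lines back into a single string holding the whole document.
  paste(lines, collapse = "\n")
}

# Example with a local file, so no network access is needed.
tmp <- tempfile(fileext = ".html")
writeLines(c("<p>Hi</p>", "<b>there</b>"), tmp)
raw_html <- fetch_page_sketch(tmp)
```

The returned string is still HTML; it could then be passed to `htm2txt()` to obtain plain text.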
Convert an HTML document to plain text by stripping all HTML tags
htm2txt(htm, list = "\n• ", pagebreak = "\n\n----------\n\n")
htm |
A character vector containing an HTML document to be converted into plain text (other objects are coerced to character vectors). |
list |
A character string that replaces "li" tags (used as the number or bullet for list items). The default is a line break followed by a bullet character and a space. |
pagebreak |
A character string that replaces "hr" tags (which mark a thematic change in the content or a page break). |
A character vector containing the plain text converted from the HTML document.
text = htm2txt("<html><body>html texts</body></html>")
text = htm2txt(c("Hello<p>World", "Goodbye<br>Friends"))
text = htm2txt("<p>Menu:</p><ul><li>Coffee</li><li>Tea</li></ul>", list = "\n- ")
text = htm2txt("Page 1<hr>Page 2", pagebreak = "\n\n[NEW PAGE]\n\n")
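The roles of `list` and `pagebreak` can be illustrated with a rough base-R approximation of tag stripping. This is a sketch under simplifying assumptions, not the package's implementation: `strip_tags_sketch` is a hypothetical name, it does not decode HTML entities or skip script and style content, and its output may differ from that of `htm2txt()`.

```r
# Hypothetical helper (not part of htm2txt): approximate tag stripping with
# regular expressions. Backslashes in `list` or `pagebreak` would be treated
# as replacement escapes by gsub(), so plain strings are assumed.
strip_tags_sketch <- function(htm, list = "\n- ",
                              pagebreak = "\n\n----------\n\n") {
  txt <- as.character(htm)  # coerce other objects to character

  # Opening <li> tags become the list marker.
  txt <- gsub("<li(\\s[^>]*)?>", list, txt, ignore.case = TRUE, perl = TRUE)
  # <hr> tags become the page-break marker.
  txt <- gsub("<hr(\\s[^>]*)?/?>", pagebreak, txt, ignore.case = TRUE, perl = TRUE)
  # <br> becomes a line break; <p> and </p> become paragraph breaks.
  txt <- gsub("<br(\\s[^>]*)?/?>", "\n", txt, ignore.case = TRUE, perl = TRUE)
  txt <- gsub("</?p(\\s[^>]*)?>", "\n\n", txt, ignore.case = TRUE, perl = TRUE)

  # Drop every remaining tag.
  txt <- gsub("<[^>]+>", "", txt, perl = TRUE)

  # Collapse runs of three or more newlines, then trim the ends.
  txt <- gsub("\n{3,}", "\n\n", txt, perl = TRUE)
  trimws(txt)
}

menu  <- strip_tags_sketch("<p>Menu:</p><ul><li>Coffee</li><li>Tea</li></ul>")
pages <- strip_tags_sketch("Page 1<hr>Page 2", pagebreak = "\n\n[NEW PAGE]\n\n")
```

Because `gsub()` is vectorised, the sketch accepts a character vector and returns one converted string per element, which matches the documented shape of the `htm2txt()` return value.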