github.com/WiseEcho/go-readability
Go-Readability is a Go package that finds the main readable content and the metadata of an HTML page. It works by removing clutter such as buttons, ads, background images, and scripts.
This package is based on Readability.js by Mozilla and was written line by line to look and work as similarly as possible. The goal is that every web page that can be parsed by Readability.js is parse-able by go-readability as well.
This package is stable enough for use and up to date with Readability.js v0.4.4 (commit b359811).
To install this package, run go get:
go get -u -v github.com/WiseEcho/go-readability
To get the readable content from a URL, you can use readability.FromURL. It fetches the web page from the specified URL, checks whether it's readable, then parses the response to find the readable content:
package main

import (
	"fmt"
	"log"
	"os"
	"time"

	readability "github.com/WiseEcho/go-readability"
)

var (
	urls = []string{
		// this one is an article, so it's parse-able
		"https://www.nytimes.com/2019/02/20/climate/climate-national-security-threat.html",
		// while this one is not an article, so readability will fail to parse it
		"https://www.nytimes.com/",
	}
)

func main() {
	for i, url := range urls {
		article, err := readability.FromURL(url, 30*time.Second)
		if err != nil {
			log.Fatalf("failed to parse %s, %v\n", url, err)
		}

		// Save the plain-text version of the article.
		dstTxtFile, err := os.Create(fmt.Sprintf("text-%02d.txt", i+1))
		if err != nil {
			log.Fatalf("failed to create text file: %v\n", err)
		}
		defer dstTxtFile.Close()
		dstTxtFile.WriteString(article.TextContent)

		// Save the cleaned-up HTML version of the article.
		dstHTMLFile, err := os.Create(fmt.Sprintf("html-%02d.html", i+1))
		if err != nil {
			log.Fatalf("failed to create HTML file: %v\n", err)
		}
		defer dstHTMLFile.Close()
		dstHTMLFile.WriteString(article.Content)

		fmt.Printf("URL : %s\n", url)
		fmt.Printf("Title : %s\n", article.Title)
		fmt.Printf("Author : %s\n", article.Byline)
		fmt.Printf("Length : %d\n", article.Length)
		fmt.Printf("Excerpt : %s\n", article.Excerpt)
		fmt.Printf("SiteName: %s\n", article.SiteName)
		fmt.Printf("Image : %s\n", article.Image)
		fmt.Printf("Favicon : %s\n", article.Favicon)
		fmt.Printf("Text content saved to \"text-%02d.txt\"\n", i+1)
		fmt.Printf("HTML content saved to \"html-%02d.html\"\n", i+1)
		fmt.Println()
	}
}
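Because log.Fatalf ends the program on the first failure, a page that is not an article stops the whole run. One way to keep going is to record the error for each URL and move on. The sketch below shows that pattern in isolation. The names parseFunc, result, parseAll and fakeParse are hypothetical and not part of go-readability; fakeParse takes the place of a call such as readability.FromURL so the example runs without network access.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"time"
)

// parseFunc stands in for a call such as readability.FromURL. It returns only
// a title so the sketch needs no third-party imports.
type parseFunc func(pageURL string, timeout time.Duration) (string, error)

// result pairs a URL with either the parsed title or the error it produced.
type result struct {
	URL   string
	Title string
	Err   error
}

// parseAll calls parse for every URL and records failures instead of exiting.
func parseAll(urls []string, timeout time.Duration, parse parseFunc) []result {
	results := make([]result, 0, len(urls))
	for _, u := range urls {
		title, err := parse(u, timeout)
		results = append(results, result{URL: u, Title: title, Err: err})
	}
	return results
}

// fakeParse is a hypothetical parser used only for this demo: it treats any
// URL ending in "/" as "not an article" and returns an error for it.
func fakeParse(pageURL string, _ time.Duration) (string, error) {
	if strings.HasSuffix(pageURL, "/") {
		return "", errors.New("the page is not readable")
	}
	return "Example article", nil
}

func main() {
	urls := []string{
		"https://example.com/news/some-article.html",
		"https://example.com/",
	}
	for _, r := range parseAll(urls, 30*time.Second, fakeParse) {
		if r.Err != nil {
			fmt.Printf("skipped %s: %v\n", r.URL, r.Err)
			continue
		}
		fmt.Printf("parsed %s: %s\n", r.URL, r.Title)
	}
}
```

In a real program, the parse argument would be a small wrapper around readability.FromURL that returns article.Title (or the whole article) together with the error.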
However, sometimes you want to parse a URL whether or not it's an article, for example when you only want the page's metadata. To do that, download the page manually using http.Get, then parse it using readability.FromReader:
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/url"

	readability "github.com/WiseEcho/go-readability"
)

var (
	urls = []string{
		// Both will be parse-able now
		"https://www.nytimes.com/2019/02/20/climate/climate-national-security-threat.html",
		// But this one will not have any content
		"https://www.nytimes.com/",
	}
)

func main() {
	for _, u := range urls {
		resp, err := http.Get(u)
		if err != nil {
			log.Fatalf("failed to download %s: %v\n", u, err)
		}

		parsedURL, err := url.Parse(u)
		if err != nil {
			log.Fatalf("failed to parse url %s: %v\n", u, err)
		}

		article, err := readability.FromReader(resp.Body, parsedURL)
		// Close the body as soon as parsing is done; a defer inside the loop
		// would keep every response open until main returns.
		resp.Body.Close()
		if err != nil {
			log.Fatalf("failed to parse %s: %v\n", u, err)
		}

		fmt.Printf("URL : %s\n", u)
		fmt.Printf("Title : %s\n", article.Title)
		fmt.Printf("Author : %s\n", article.Byline)
		fmt.Printf("Length : %d\n", article.Length)
		fmt.Printf("Excerpt : %s\n", article.Excerpt)
		fmt.Printf("SiteName: %s\n", article.SiteName)
		fmt.Printf("Image : %s\n", article.Image)
		fmt.Printf("Favicon : %s\n", article.Favicon)
		fmt.Println()
	}
}
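Note that http.Get uses Go's default client, which has no timeout, and that it returns a nil error for non-2xx responses. A minimal sketch of a download step that adds both checks before the body is handed to a parser could look like the following. The helper names fetchPage and fetchFromTestServer, the 10-second timeout, and the local test server are assumptions made for illustration and are not part of go-readability.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
	"net/url"
	"time"
)

// fetchPage is a hypothetical helper (not part of go-readability). It downloads
// rawURL with the given client and returns the body together with the parsed
// URL, the same pair of inputs that readability.FromReader takes.
func fetchPage(client *http.Client, rawURL string) ([]byte, *url.URL, error) {
	parsedURL, err := url.Parse(rawURL)
	if err != nil {
		return nil, nil, fmt.Errorf("parse url %q: %w", rawURL, err)
	}

	resp, err := client.Get(rawURL)
	if err != nil {
		return nil, nil, fmt.Errorf("download %s: %w", rawURL, err)
	}
	// The deferred close runs when fetchPage returns, so each response is
	// released before the next download starts.
	defer resp.Body.Close()

	// A 404 or 500 is not reported as an error by the client, so check it here.
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return nil, nil, fmt.Errorf("download %s: unexpected status %s", rawURL, resp.Status)
	}

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, nil, fmt.Errorf("read %s: %w", rawURL, err)
	}
	return body, parsedURL, nil
}

// fetchFromTestServer runs fetchPage against a throwaway local server so the
// sketch works without internet access.
func fetchFromTestServer(status int, page string) ([]byte, *url.URL, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
		io.WriteString(w, page)
	}))
	defer srv.Close()

	// The timeout value is an arbitrary choice for this demo.
	client := &http.Client{Timeout: 10 * time.Second}
	return fetchPage(client, srv.URL)
}

func main() {
	body, pageURL, err := fetchFromTestServer(http.StatusOK, "<html><body><p>Hello</p></body></html>")
	if err != nil {
		log.Fatal(err)
	}
	// bytes.NewReader(body) is an io.Reader, so it could be passed to
	// readability.FromReader together with pageURL.
	reader := bytes.NewReader(body)
	fmt.Printf("fetched %d bytes from host %s\n", reader.Len(), pageURL.Host)
}
```

Reading the whole body into memory keeps the helper simple; for very large pages, passing resp.Body straight to the parser avoids the extra copy.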
You can also use go-readability as a command-line app. To do that, first install the CLI:
go get -u -v github.com/WiseEcho/go-readability/cmd/...
Now you can use it by running go-readability in your terminal:
$ go-readability -h
go-readability is parser to fetch the readable content of a web page.
The source can be an url or existing file in your storage.

Usage:
  go-readability [flags] source

Flags:
  -h, --help          help for go-readability
  -l, --http string   start the http server at the specified address
  -m, --metadata      only print the page's metadata
  -t, --text          only print the page's text
Go-Readability is distributed under the MIT license, which means you can use and modify it however you want. However, if you make an enhancement, please send a pull request if possible. If you like this project, please consider donating to me via either PayPal or Ko-Fi.