# Multilingual Web Page Content Extractor

## Introduction
`ce` is a Go package for multilingual web page content extraction. It extracts the content of article-type web pages such as news stories and blog posts.

## Basic usage
```go
package main

import (
	"encoding/json"
	"flag"
	"fmt"
	"strings"

	"github.com/crawlerclub/ce"
	"github.com/crawlerclub/dl"
)

var (
	url = flag.String("url",
		"http://china.huanqiu.com/article/2017-07/11034896.html",
		"news url")
	debug = flag.Bool("debug", false, "debug mode")
)

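func main() {
	flag.Parse()

	// Download the page; res carries the body text, the remote address, and any error.
	res := dl.DownloadUrl(*url)
	if res.Error != nil {
		fmt.Println(res.Error)
		return
	}

	// Use the part of the remote address before the first ":" as the server IP
	// (this assumes RemoteAddr looks like "host:port").
	items := strings.Split(res.RemoteAddr, ":")
	ip := ""
	if len(items) > 0 {
		ip = items[0]
	}

	// Extract the article fields and print them as JSON.
	doc := ce.ParsePro(*url, res.Text, ip, *debug)
	j, err := json.Marshal(doc)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(string(j))
}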
```
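
The example above takes the server IP by splitting `res.RemoteAddr` on `:`. That is fine for an IPv4 `host:port` value, but it would cut an IPv6 literal such as `[2001:db8::1]:443` at the wrong place. The sketch below shows one possible alternative using the standard library's `net.SplitHostPort`. It assumes `RemoteAddr` is a `host:port` string, which the source does not state explicitly, and `hostFromAddr` is a hypothetical helper name, not part of `ce` or `dl`.

```go
package main

import (
	"fmt"
	"net"
)

// hostFromAddr is a hypothetical helper (not part of ce or dl).
// It returns the host portion of a "host:port" address and also
// handles bracketed IPv6 literals such as "[2001:db8::1]:443".
func hostFromAddr(addr string) string {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		// No port present or malformed input: fall back to the raw value.
		return addr
	}
	return host
}

func main() {
	// Illustrative addresses only; they are not taken from a real download.
	fmt.Println(hostFromAddr("93.184.216.34:80"))
	fmt.Println(hostFromAddr("[2001:db8::1]:443"))
}
```

In the example above, `ip := hostFromAddr(res.RemoteAddr)` could stand in for the `strings.Split` lines if the assumption about the address format holds.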

## Fields

`ce` can extract the following fields from raw HTML:
* `title`: the title of the article
* `text`: the main content of the article in plain text
* `html`: the main content of the article with basic HTML formatting, images included
* `publish_date`: the publish time of the article
* `language`: the language of the article
* `location`: the country code
* `author`: the author of the article
* `images`: the images used in the article
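
As a rough illustration of consuming these fields, the sketch below decodes JSON like that printed by the basic usage example into a Go struct. It assumes the JSON keys match the field names listed above and that the text-like fields are strings. The `Article` type and `decodeArticle` function are hypothetical names, not part of `ce`. Because the shapes of `publish_date` and `images` are not documented here, they are kept as raw JSON rather than given a guessed type.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Article is a hypothetical type mirroring the field names listed above.
// The Go types are assumptions; ce's own result type may differ.
type Article struct {
	Title       string          `json:"title"`
	Text        string          `json:"text"`
	HTML        string          `json:"html"`
	PublishDate json.RawMessage `json:"publish_date"` // shape not documented; kept raw
	Language    string          `json:"language"`
	Location    string          `json:"location"`
	Author      string          `json:"author"`
	Images      json.RawMessage `json:"images"` // shape not documented; kept raw
}

// decodeArticle parses one JSON document into an Article.
func decodeArticle(data []byte) (Article, error) {
	var a Article
	err := json.Unmarshal(data, &a)
	return a, err
}

func main() {
	// Made-up sample input, not real ce output.
	sample := []byte(`{"title":"Example","language":"zh","images":[]}`)
	a, err := decodeArticle(sample)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(a.Title, a.Language, string(a.Images))
}
```

Keeping the undocumented fields as `json.RawMessage` lets the decode succeed whatever their actual shape is; they can be given concrete types once the real output has been inspected.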

