article-parser
Extract the main article, main image and metadata from a URL.
Demo
Setup
npm i article-parser
Usage
import { extract } from 'article-parser'
const url = 'https://www.binance.com/en/blog/markets/15-new-years-resolutions-that-will-make-2022-your-best-year-yet-421499824684903249'
extract(url).then((article) => {
console.log(article)
}).catch((err) => {
console.trace(err)
})
Note:
Since Node.js v14, ECMAScript modules have become the official standard format.
Just ensure that you are using the module system, and enjoy the ES6 import/export syntax.
APIs
extract(String url)
Load and extract article data. Returns a Promise object.
Example:
import { extract } from 'article-parser'
const getArticle = async (url) => {
try {
const article = await extract(url)
return article
} catch (err) {
console.trace(err)
return null
}
}
getArticle('https://domain.com/path/to/article')
If the extraction works well, you should get an article object with the following structure:
{
"url": URI String,
"title": String,
"description": String,
"image": URI String,
"author": Person[],
"publisher": Organization,
"content": HTML String,
"published": Date String,
"source": String,
"links": Array,
"ttr": Number,
}
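For clarity, here is a minimal sketch of consuming such an object once extraction succeeds. The sample values and the summarize helper below are illustrative only, and ttr is assumed to be expressed in seconds:

```javascript
// Illustrative article object; real output comes from extract(url)
const article = {
  url: 'https://domain.com/path/to/article',
  title: 'Example title',
  description: 'A short description of the article.',
  image: 'https://domain.com/cover.jpg',
  content: '<p>Hello world</p>',
  published: '2022-01-01T00:00:00.000Z',
  source: 'domain.com',
  links: ['https://domain.com/path/to/article'],
  ttr: 120 // assumed: time-to-read, in seconds
}

// Build a one-line summary from the extracted fields
const summarize = (a) => `${a.title} (${Math.ceil(a.ttr / 60)} min read, ${a.source})`

console.log(summarize(article))
```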
addQueryRules(Array queryRules)
Add custom rules to extract the main article from specific domains.
This can be useful when the default extraction algorithm fails, or when you want to remove some parts of the main article content.
Example:
import { addQueryRules, extract } from 'article-parser'
// before adding rules: extraction may not work well on this site
extract('https://bad-website.domain/page/article')
addQueryRules([
{
patterns: [
{ hostname: 'bad-website.domain' }
],
selector: '#noop_article_locates_here',
unwanted: [
'.advertise-area',
'.stupid-banner'
]
}
])
// after adding rules: extraction applies the custom selector
extract('https://bad-website.domain/page/article')
While adding rules, you can specify a transform() function to fine-tune the article content more thoroughly.
Example rule with transformation:
import { addQueryRules } from 'article-parser'
addQueryRules([
{
patterns: [
{ hostname: 'bad-website.domain' }
],
selector: '#article_id_here',
rawTransformer: (document) => {
const metaChild = document.createElement('meta')
metaChild.setAttribute('name', 'publisher')
metaChild.setAttribute('content', 'Epoch Times')
document.head.appendChild(metaChild)
return document
},
transform: (document) => {
document.querySelectorAll('h1').forEach(node => {
const newNode = document.createElement('b')
newNode.innerHTML = node.innerHTML
node.parentNode.replaceChild(newNode, node)
})
return document
}
}
])
Please refer to the MDN documentation for more info.
Configuration methods
In addition, this lib provides some methods to customize the default settings. Don't touch them unless you have a good reason to do so.
- getParserOptions()
- setParserOptions(Object parserOptions)
- getRequestOptions()
- setRequestOptions(Object requestOptions)
- getSanitizeHtmlOptions()
- setSanitizeHtmlOptions(Object sanitizeHtmlOptions)
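A typical customization is a get/modify/set round-trip. The sketch below assumes the setters accept a full options object, so the safe pattern is to spread the current values returned by the getters and override only what you need:

```javascript
import {
  getParserOptions,
  setParserOptions,
  getRequestOptions,
  setRequestOptions
} from 'article-parser'

// Read the current defaults, tweak one field, write the full object back
const parserOpts = getParserOptions()
setParserOptions({ ...parserOpts, wordsPerMinute: 250 })

// Same pattern for request options; nested objects are spread too
const requestOpts = getRequestOptions()
setRequestOptions({
  ...requestOpts,
  headers: { ...requestOpts.headers, 'user-agent': 'my-bot/1.0' }
})
```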
Here are default properties/values:
Object parserOptions:
{
wordsPerMinute: 300,
urlsCompareAlgorithm: 'levenshtein',
descriptionLengthThreshold: 40,
descriptionTruncateLen: 156,
contentLengthThreshold: 200
}
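To illustrate how wordsPerMinute feeds into the ttr field, here is a back-of-the-envelope sketch. The estimateTTR helper is hypothetical and the library's actual formula may differ; the assumption is simply that reading time scales with word count divided by reading speed:

```javascript
// Hypothetical sketch: estimate time-to-read (in seconds) from a word count
const estimateTTR = (wordCount, wordsPerMinute = 300) =>
  Math.round((wordCount / wordsPerMinute) * 60)

console.log(estimateTTR(1500))      // default 300 wpm
console.log(estimateTTR(1500, 250)) // after lowering wordsPerMinute
```

Lowering wordsPerMinute therefore produces a larger ttr for the same article.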
Read the string-comparison docs for more info about urlsCompareAlgorithm.
Object requestOptions:
{
headers: {
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0',
accept: 'text/html; charset=utf-8'
},
responseType: 'text',
responseEncoding: 'utf8',
timeout: 6e4,
maxRedirects: 3
}
Read axios' request config for more info.
Object sanitizeHtmlOptions:
{
allowedTags: [
'h1', 'h2', 'h3', 'h4', 'h5',
'u', 'b', 'i', 'em', 'strong', 'small', 'sup', 'sub',
'div', 'span', 'p', 'article', 'blockquote', 'section',
'details', 'summary',
'pre', 'code',
'ul', 'ol', 'li', 'dd', 'dl',
'table', 'th', 'tr', 'td', 'thead', 'tbody', 'tfoot',
'fieldset', 'legend',
'figure', 'figcaption', 'img', 'picture',
'video', 'audio', 'source',
'iframe',
'progress',
'br', 'hr',
'label',
'abbr',
'a',
'svg'
],
allowedAttributes: {
a: ['href', 'target', 'title'],
abbr: ['title'],
progress: ['value', 'max'],
img: ['src', 'srcset', 'alt', 'width', 'height', 'style', 'title'],
picture: ['media', 'srcset'],
video: ['controls', 'width', 'height', 'autoplay', 'muted'],
audio: ['controls'],
source: ['src', 'srcset', 'data-srcset', 'type', 'media', 'sizes'],
iframe: ['src', 'frameborder', 'height', 'width', 'scrolling'],
svg: ['width', 'height']
},
allowedIframeDomains: ['youtube.com', 'vimeo.com']
}
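When customizing these options, it is usually safer to extend the defaults than to replace them wholesale. A minimal sketch, assuming setSanitizeHtmlOptions accepts a full options object; the defaults stand-in below is trimmed for brevity:

```javascript
// Trimmed stand-in for the object returned by getSanitizeHtmlOptions()
const defaults = {
  allowedTags: ['p', 'a', 'img'],
  allowedAttributes: { a: ['href', 'target', 'title'] }
}

// Extend, rather than replace, the existing allow-lists
const custom = {
  ...defaults,
  allowedTags: [...defaults.allowedTags, 'mark', 'time'],
  allowedAttributes: {
    ...defaults.allowedAttributes,
    time: ['datetime']
  }
}

// setSanitizeHtmlOptions(custom) would then apply the merged object
```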
Read sanitize-html docs for more info.
Test
git clone https://github.com/ndaidong/article-parser.git
cd article-parser
npm install
npm test
npm run eval {URL_TO_PARSE_ARTICLE}
License
The MIT License (MIT)