
newspaper4k
The Newspaper4k project grew from a fork of the well-known newspaper3k by codelucas, which had not been updated since September 2020. The initial goal of this fork was to keep the project alive, add new features, and fix bugs. As of version 0.9.3 there are many new features and improvements that make Newspaper4k a great tool for article scraping and curation. To make migrating to Newspaper4k easier, all classes and methods from the original project were kept, and the new features were added on top of them. All API calls from the original project still work as expected, so users familiar with newspaper3k will feel right at home with Newspaper4k.
At the moment of the fork, the original project had over 400 open issues, which I have duplicated into this repository. As of v0.9.3, only about 180 issues still need to be verified (many are already fixed, but checking them is pretty cumbersome - hint hint ... anyone contributing?). If you have any issues or feature requests, please open an issue here.
An experimental ChatGPT helper bot for Newspaper4k is available.
- Python 3.10+ minimum
```
pip install newspaper4k
```
You can start directly from the command line, using the included CLI:
```
python -m newspaper --url="https://edition.cnn.com/2023/11/17/success/job-seekers-use-ai/index.html" --language=en --output-format=json --output-file=article.json
```
More information about the CLI can be found in the CLI documentation.

Alternatively, you can use Newspaper4k in Python:
```python
import newspaper

article = newspaper.article('https://edition.cnn.com/2023/10/29/sport/nfl-week-8-how-to-watch-spt-intl/index.html')

print(article.authors)
# ['Hannah Brewitt']

print(article.publish_date)
# 2023-10-29 09:00:15.717000+00:00

print(article.text)
# New England Patriots head coach Bill Belichick, right, embraces Buffalo Bills head coach Sean McDermott ...

print(article.top_image)
# https://media.cnn.com/api/v1/images/stellar/prod/231015223702-06-nfl-season-gallery-1015.jpg?c=16x9&q=w_800,c_fill

print(article.movies)
# []

article.nlp()

print(article.keywords)
# ['patrick', 'mahomes', 'history', 'nfl', 'week', 'broncos', 'denver', 'p', 'm', '00', 'pittsburgh',...]

print(article.summary)
# Kevin Sabitus/Getty Images Denver Broncos running back Javonte Williams evades Green Bay Packers safety Darnell Savage, bottom.
# Kathryn Riley/Getty Images Kansas City Chiefs quarterback Patrick Mahomes calls a play during the Chiefs' 19-8 Thursday Night Football win over the Denver Broncos on October 12.
# Paul Sancya/AP New York Jets running back Breece Hall carries the ball during a game against the Denver Broncos.
# The Broncos have not beaten the Chiefs since 2015, and have never beaten Chiefs quarterback Patrick Mahomes.
# Australia: NFL+, ESPN, 7Plus Brazil: NFL+, ESPN Canada: NFL+, CTV, TSN, RDS Germany: NFL+, ProSieben MAXX, DAZN Mexico: NFL+, TUDN, ESPN, Fox Sports, Sky Sports UK: NFL+, Sky Sports, ITV, Channel 5 US: NFL+, CBS, NBC, FOX, ESPN, Amazon Prime
```

This way you can build a Source object from a newspaper website. This class gives you access to all the articles and categories on the site. When you build the source, articles are not yet downloaded: the build() call parses the front page, detects category links (if possible), reads any RSS feeds published by the news site, and creates a list of article links.
You then need to call download_articles() to download the articles; note that this can take a significant amount of time.
download_articles() downloads the articles in a multithreaded fashion using ThreadPoolExecutor from the concurrent.futures package. The number of concurrent threads can be configured via Configuration.number_threads or passed as an argument to build().
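Under the hood this is the standard fan-out pattern with `concurrent.futures.ThreadPoolExecutor`; a minimal self-contained sketch of the idea (the `fetch` function and URLs below are placeholders, not the library's real internals):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url: str) -> str:
    # Stand-in for the real HTTP download of one article.
    return f"downloaded:{url}"

urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]

# Download up to 3 articles concurrently; map() preserves input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fetch, urls))

print(results)
# ['downloaded:https://example.com/a', 'downloaded:https://example.com/b', 'downloaded:https://example.com/c']
```

Raising `max_workers` (or `number_threads` in newspaper4k) speeds up bulk downloads but also increases the load you place on the target site.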
```python
import newspaper

cnn_paper = newspaper.build('http://cnn.com', number_threads=3)
print(cnn_paper.category_urls())
# ['https://cnn.com', 'https://money.cnn.com', 'https://arabic.cnn.com',
#  'https://cnnespanol.cnn.com', 'http://edition.cnn.com',
#  'https://edition.cnn.com', 'https://us.cnn.com', 'https://www.cnn.com']

article_urls = [article.url for article in cnn_paper.articles]
print(article_urls[:3])
# ['https://arabic.cnn.com/middle-east/article/2023/10/30/number-of-hostages-held-in-gaza-now-up-to-239-idf-spokesperson',
#  'https://arabic.cnn.com/middle-east/video/2023/10/30/v146619-sotu-sullivan-hostage-negotiations',
#  'https://arabic.cnn.com/middle-east/article/2023/10/29/norwegian-pm-israel-gaza']

article = cnn_paper.articles[0]
article.download()
article.parse()
print(article.title)
# المتحدث باسم الجيش الإسرائيلي: عدد الرهائن المحتجزين في غزة يصل إلى
```
Or, if you want to fetch articles from the website in bulk (keep in mind that this can take a long time and could get your IP blocked by the news site):
```python
import newspaper

cnn_source = newspaper.build('http://cnn.com', number_threads=3)
print(len(cnn_source.article_urls()))

articles = cnn_source.download_articles()
print(len(articles))
print(articles[0].title)
```
First, make sure you have the gnews extra installed, since we rely on the GNews package to get the articles from Google News. You can install it using pip like this:
```
pip install newspaper4k[gnews]
```
Then you can use the GoogleNews class to get articles from Google News:
```python
from newspaper.google_news import GoogleNewsSource

source = GoogleNewsSource(
    country="US",
    period="7d",
    max_results=10,
)
source.build(top_news=True)

print(source.article_urls())
# ['https://www.cnn.com/2024/03/18/politics/trump-464-million-dollar-bond/index.html', 'https://www.cnn.com/2024/03/18/politics/supreme-court-new-york-nra/index.html', ...

source.download_articles()
```
Newspaper can detect languages seamlessly from the article's meta tags. You can also specify the language for the website or article explicitly. If no language is specified, Newspaper will attempt to auto-detect a language from the available metadata; the fallback language is English.
Language detection is crucial for accurate article extraction. If the wrong language is detected or provided, chances are that no article text will be returned. Before parsing, check that your language is supported by the package.
```python
from newspaper import Article

article = Article('https://www.bbc.com/zhongwen/simp/chinese-news-67084358')
article.download()
article.parse()

print(article.title)
# 晶片大战:台湾厂商助攻华为突破美国封锁?

if article.config.use_meta_language:
    # If we use the autodetected language, this config attribute will be True
    print(article.meta_lang)
else:
    print(article.config.language)
# zh
```
Check out The Docs for full and detailed guides using newspaper.
Using the dataset from ScrapingHub, I created an evaluator script that compares the performance of newspaper against its previous versions. This way we can see how newspaper updates improve or worsen the performance of the library.
| Version | Corpus BLEU Score | Corpus Precision Score | Corpus Recall Score | Corpus F1 Score |
|---|---|---|---|---|
| Newspaper3k 0.2.8 | 0.8660 | 0.9128 | 0.9071 | 0.9100 |
| Newspaper4k 0.9.0 | 0.9212 | 0.8992 | 0.9336 | 0.9161 |
| Newspaper4k 0.9.1 | 0.9224 | 0.8895 | 0.9242 | 0.9065 |
| Newspaper4k 0.9.2 | 0.9426 | 0.9070 | 0.9087 | 0.9078 |
| Newspaper4k 0.9.3 | 0.9531 | 0.9585 | 0.9339 | 0.9460 |
| Newspaper4k 0.9.4 | 0.9531 | 0.9585 | 0.9339 | 0.9460 |
| Newspaper4k 0.9.5 | 0.9531 | 0.9585 | 0.9339 | 0.9460 |
Precision, Recall and F1 are computed using the overlap of shingles (word n-grams of size 4). The corpus BLEU score is computed using NLTK's bleu_score.
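Reading "shingles" as the set of word 4-grams of each text (my interpretation of the description above, not necessarily the evaluator script's exact logic), the overlap metrics can be computed like this:

```python
def shingles(text: str, n: int = 4) -> set[tuple[str, ...]]:
    """Set of word n-grams ("shingles") of a text."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shingle_scores(reference: str, extracted: str, n: int = 4) -> tuple[float, float, float]:
    ref, ext = shingles(reference, n), shingles(extracted, n)
    if not ref or not ext:
        return 0.0, 0.0, 0.0
    overlap = len(ref & ext)
    precision = overlap / len(ext)  # how much of the extraction is correct
    recall = overlap / len(ref)     # how much of the reference was recovered
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

print(shingle_scores("a b c d e", "a b c d x"))
# (0.5, 0.5, 0.5)
```

Corpus-level scores aggregate these counts over all articles in the dataset rather than averaging per-article scores.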
We also use our own, newly created dataset, the Newspaper Article Extraction Benchmark (NAEB) which is a collection of over 400 articles from 200 different news sources to evaluate the performance of the library.
| Version | Corpus BLEU Score | Corpus Precision Score | Corpus Recall Score | Corpus F1 Score |
|---|---|---|---|---|
| Newspaper3k 0.2.8 | 0.8445 | 0.8760 | 0.8556 | 0.8657 |
| Newspaper4k 0.9.0 | 0.8357 | 0.8547 | 0.8909 | 0.8724 |
| Newspaper4k 0.9.1 | 0.8373 | 0.8505 | 0.8867 | 0.8682 |
| Newspaper4k 0.9.2 | 0.8422 | 0.8888 | 0.9240 | 0.9061 |
| Newspaper4k 0.9.3 | 0.8695 | 0.9140 | 0.8921 | 0.9029 |
| Newspaper4k 0.9.4 | 0.8689 | 0.9140 | 0.8921 | 0.9029 |
| Newspaper4k 0.9.5 | 0.8689 | 0.9140 | 0.8921 | 0.9029 |
The package has two kinds of requirements:
1. System packages needed to build native dependencies (lxml and image support)
2. Python packages (installed via pip or a pinned requirements.txt)

System packages (common):

- Image support: libjpeg-dev, zlib1g-dev, libpng-dev (or libpng12-dev on older systems)
- lxml: libxml2-dev, libxslt1-dev
- Python headers: python3-dev

Debian / Ubuntu (install prerequisites):
```
sudo apt-get install python3 python3-dev python3-pip libxml2-dev libxslt1-dev libjpeg-dev zlib1g-dev libpng-dev
```
macOS (Homebrew):
```
brew install libxml2 libxslt
brew install libtiff libjpeg webp little-cms2
```
Installing the package (pip)
Basic install:
```
pip install newspaper4k
```
Optional extras
Newspaper4k exposes several optional extras that enable additional features:
- gnews — Google News integration (gnews package)
- nlp — NLP helpers (e.g. nltk)
- cloudflare — cloudscraper for Cloudflare-protected sites
- zh, th, ja, bn, hi, np, ta — language-specific NLP support (e.g. jieba for Chinese)
- all — a convenience extra that installs many language and helper packages

Examples:
```
pip install newspaper4k[gnews]
pip install "newspaper4k[gnews,nlp]"
pip install newspaper4k[all]
```
Use whichever extras you need; extras can be combined as shown above.
Install using uv (recommended for reproducible, pinned installs)
uv can generate a pinned requirements.txt from pyproject.toml which is
handy for deployments or for reproducible installs.
Install uv (if you don't have it):
```
pip install uv
```
Add the newspaper4k package to your project using uv add (you must be in a project directory with a pyproject.toml file):

```
uv add newspaper4k
```
or, if you want to include extras, use the same bracket syntax as pip (note that uv's `--group` flag manages your project's dependency groups, not package extras):

```
uv add "newspaper4k[gnews]"
```
Notes

- If your system no longer ships libpng12-dev, try libpng-dev instead.
- The optional extras are declared in pyproject.toml under [project.optional-dependencies] (for example gnews, nlp, cloudflare, all).
- For contribution guidelines, see CONTRIBUTING.md.
Authored and maintained by Andrei Paraschiv.
Newspaper was originally developed by Lucas Ou-Yang (codelucas), the original repository can be found here. Newspaper is licensed under the MIT license.
Thanks to Lucas Ou-Yang for creating the original Newspaper3k project and to all contributors of the original project.
- Prefer datePublished over dateCreated in JSON-LD extraction (cdadb9e) (by Pontus Svensson)
- Made nltk an optional dependency for leaner deployments (e073459) (by Andrei)
- Bumped min Python version to 3.10. Versions 3.8 and 3.9 are no longer supported, but might still work.
- Fixed typing-extensions and lxml compatibility (#639) (c5e4170) (by Chris)
- Some fixes with regard to Python >= 3.11 dependencies. The numpy version was incompatible with Colab; now it is fixed. Also, there was a typo in the Nepali language code - it was "np" instead of "ne". This is now fixed.
Massive improvements in multi-language capabilities. Added over 40 new languages and completely reworked the language module; it is much easier to add new languages now. Additionally, added support for Google News as a source: you can now search and parse news based on keywords, topic, location or website. Integrated cloudscraper as an optional dependency. If installed, it will use cloudscraper as a layer over requests; cloudscraper tries to bypass Cloudflare protection. We now use two evaluation datasets - the one from ScrapingHub and one created by us from the top 200 most popular websites. This will help us keep track of future improvements and have a clear view of the impact of changes.
We see a steady improvement from version 0.9.0 up to 0.9.3. The evaluation results are available in the documentation. The evaluation dataset is also available in the following repository: Article Extraction Dataset
Some major changes in document parsing. In previous versions the chance that parts of the article body were missing was high. In addition, in some cases the order of the paragraphs was not correct. This release should fix these issues.
Highlighted features:
- New CLI: python -m newspaper --url https://www.test.com. More information in the documentation.
- news_pool was replaced with a fetch_news() function.
- New newspaper.article() function for convenience. It will create, download and parse an article in one step. It takes all the parameters of the Article class.
- bug: instead of memorize_articles the option / function / parameter was memoize_articles (aaef712) (by Andrei)
- bug: MEMO_DIR is now a Path object; an addition with str was forgotten from refactoring (0b98e71) (by Andrei)
- depend: removed feedfinder2 as a dependency; it was not used (c230aca) (by Andrei)
- doc: some minor documentation changes (764742a) (by Andrei)
- lang: added additional stopwords for "fa" (Issue #398) (3453538) (by Andrei)
- lang: fixed Serbian stopwords; added Cyrillic version (Issue #389) (dfcb760) (by Andrei)
- parse: itemprop containing, but not equal to, articleBody (510be0e) (by Andrei)
- parse: removed some additional advertising snippets (bd30d48) (by Andrei)
- parse: removed possible image caption remains from cleaned article text (Issue #44) (7298140) (by Andrei)
- parse: image parsing and movie parsing improvements; get links from additional attributes such as "data-src" (c02bb23) (by Andrei)
- parse: exclude some tags from get_text; tags such as script and option can add garbage to the text output (f0e1965) (by Andrei)
- parse: improved newline generation based on block-level tags; `<br>`'s are better taken into account (22327d8) (by Andrei)
- parse: added youtu.be to video sources (bf516a1) (by Andrei)
- parse: additional fixes for captions (3e7fdcc) (by Andrei)
- refactor: deprecated non-pythonic configuration attributes (all caps vs lower caps); for the moment both approaches work (691e12f) (by Andrei)
- sec: bumped nltk and requests minimum versions (553ef27) (by Andrei)
- sources: fixed a problem with some types of article links (9a5c0e2) (by Andrei)
First release after the fork. This release is based on the 0.1.7 release of the original newspaper3k project. I jumped versions so that it is clear that this is a fork and not the original project.
- Consider both `<article>` and `<div>` tags as candidates for the content parent_node (447a429) (by Andrei)

... see here for earlier changes.