
Script to fetch news URLs from news websites into a database using an API.
Install Python 3.8 or higher, install poetry, and run poetry install --no-dev.
Then you can run poetry run COMMAND to run specific commands under the Python virtual environment created by poetry.
Or you can enter the poetry shell (by running poetry shell) and then type script commands.
You can also use pip.
Assuming Python 3.8 or higher and poetry are installed.
Initialise and update virtual environment (assuming you are in the folder with this README file):
poetry install --no-dev
Run script:
poetry run python news_fetcher/news_fetcher.py --help
Assuming Python 3.8 or higher is installed.
Install poetry (in Windows PowerShell):
(Invoke-WebRequest -Uri https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py -UseBasicParsing).Content | python
You may need to restart PowerShell or reboot your computer.
To install or update libraries, run the batch file update.bat.
Run script:
poetry run python news_fetcher/news_fetcher.py --help
run_all.sh is the shell script for running all steps. It requires the following environment variables to be set in the .env file:
MEDIAWIKI_CREDENTIALS
DATABASE_URL
WIKI_TOOL_DIRECTORY
DATA_FILE
SOURCE_PATH
SOURCE_NAME
TARGET_API_URL
WIKI_PREFIX
BOT_NAME
REQUESTS_INTERVAL
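Before running run_all.sh, it can help to confirm that every required variable is defined. The following is a minimal sketch and not part of the project: it parses plain KEY=VALUE lines using only the standard library and reports which of the variables listed above are missing. The function names parse_env_text and missing_variables are hypothetical, and the parser assumes a simple .env layout without shell expansions or multi-line values.

```python
from pathlib import Path

# Variable names taken from the list above.
REQUIRED_VARIABLES = [
    "MEDIAWIKI_CREDENTIALS", "DATABASE_URL", "WIKI_TOOL_DIRECTORY", "DATA_FILE",
    "SOURCE_PATH", "SOURCE_NAME", "TARGET_API_URL", "WIKI_PREFIX", "BOT_NAME",
    "REQUESTS_INTERVAL",
]


def parse_env_text(text):
    """Parse simple KEY=VALUE lines; skip blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_variables(values):
    """Return required variable names that are absent or empty."""
    return [name for name in REQUIRED_VARIABLES if not values.get(name)]


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text(encoding="utf-8") if env_path.exists() else ""
    missing = missing_variables(parse_env_text(text))
    if missing:
        print("Missing variables:", ", ".join(missing))
    else:
        print("All required variables are set.")
```

Run it from the folder that contains the .env file; it only reads the file and never modifies it.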
news_fetcher/news_fetcher.py is the script entry point.
news_fetcher/db.py is the DB initialization module.
news_fetcher/models.py is the module with DB models.
news_fetcher/module.py is the module with the base class for "source modules", which are used to grab news from different sources.
This module fetches news using the prostoprosport.ru API.
news_fetcher/prostoprosport.py is the source module.
data/categories_from_js.json is category URL data grabbed from JS.
data/categories_bonus.json is additional category URL data grabbed from RSS.
This module fetches news using RSS.
news_fetcher/rss.py is the source module.
Source
Source website.
slug_name: string website ID (primary key), for example birmingham-post.
Tag
Tag for news articles.
tag_id: numerical ID (primary key).
title: tag text, for example Sport (must be unique).
Article
News article from source website.
article_id: numerical ID (primary key).
source: source website (foreign key).
slug_name: string identifier (must be unique per source website), for example sir-stanley-matthews-1915-2000-a-potteries-hero.
title: human-readable article title, for example Sir Stanley Matthews 1915-2000: A Potteries hero; Stanley stayed loyal to his beloved.
date: publication date, for example 2020-02-24T00:00:00.
source_url: full article URL, for example https://www.thefreelibrary.com/Sir+Stanley+Matthews+1915-2000%3A+A+Potteries+hero%3B+Stanley+stayed...-a060517953
source_url_ok: true if the URL can be retrieved, false if it cannot, null if it has not been checked yet.
author_name: human-readable author name, may be null.
wikitext_paragraphs: article content converted into wiki-text and stored as a JSON list of paragraphs, may be null if not fetched yet.
misc_data: miscellaneous data stored as JSON; the specific format and structure are module-dependent.
tags: article tags, a many-to-many relation with the Tag model through the technical ArticleTag model, whose table is named article_m2m_tag.
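To make the relationships between the three models easier to see, here is a rough sketch using plain dataclasses. It is only an illustration of the fields described above: the real definitions live in news_fetcher/models.py and may use a different mechanism, and the Python types chosen here (datetime for date, dict for misc_data, a list for tags) are assumptions. The example values come from the field descriptions and are combined purely for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Source:
    slug_name: str  # primary key, e.g. "birmingham-post"


@dataclass
class Tag:
    tag_id: int  # primary key
    title: str   # unique tag text, e.g. "Sport"


@dataclass
class Article:
    article_id: int                      # primary key
    source: Source                       # foreign key to the source website
    slug_name: str                       # unique per source website
    title: str
    date: datetime
    source_url: str
    source_url_ok: Optional[bool] = None               # None means "not checked yet"
    author_name: Optional[str] = None
    wikitext_paragraphs: Optional[List[str]] = None    # None means "not fetched yet"
    misc_data: dict = field(default_factory=dict)      # assumed type; format is module-dependent
    tags: List[Tag] = field(default_factory=list)      # many-to-many with Tag


# Example record built from the sample values in the field descriptions.
article = Article(
    article_id=1,
    source=Source(slug_name="birmingham-post"),
    slug_name="sir-stanley-matthews-1915-2000-a-potteries-hero",
    title="Sir Stanley Matthews 1915-2000: A Potteries hero; Stanley stayed loyal to his beloved",
    date=datetime.fromisoformat("2020-02-24T00:00:00"),
    source_url="https://www.thefreelibrary.com/Sir+Stanley+Matthews+1915-2000%3A+A+Potteries+hero%3B+Stanley+stayed...-a060517953",
    tags=[Tag(tag_id=1, title="Sport")],
)
```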
--help
Show help message and exit. If this option is used with a command, the help message for that specific command is printed.
--source-module TEXT (required): source module name, can be prostoprosport or rss.
--data-file FILENAME: file with categories data (can be built using the process-categories command).
--source-path TEXT: API method name, can be news or main_news.
--data-file FILENAME: JSON file with configuration, which should contain the following keys:
css_selector: CSS selector for article paragraphs on the web page.
source_title: source title.
source_template_name: template name for generated wiki-pages (optional, source title is used by default).
removed_last_lines: count of paragraphs at the end of the article that should be skipped (optional, 0 by default).
disable_bold_font: true to avoid bold font in the generated page (optional, false by default).
extra_first_lines: array of strings to add at the beginning of the generated page (optional, empty by default).
--source-name (required): source slug name (identifier) for DB.
--source-path TEXT (required): RSS feed URL.
fetch-news
Fetch news for page range and write data to DB. Pages are numbered from most recent (1) to least recent. Note that page numbers are now used in Prostoprosport source module only.
--first-page INTEGER: number of the first page to load, should not be less than 1.
--last-page INTEGER: number of the last page to load, should not be less than 1. If it is less than the first page number, no data will be fetched.
Fetch most recent page (1):
python news_fetcher/prostoprosport_news_fetcher.py fetch-news
Fetch the 5 most recent pages (5 to 1):
python news_fetcher/prostoprosport_news_fetcher.py fetch-news --last-page 5
Fetch pages 11 to 20:
python news_fetcher/prostoprosport_news_fetcher.py fetch-news --first-page 11 --last-page 20
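The page-range rules above can be sketched as a small helper. This is an illustration, not the project's code: the function name pages_to_fetch is hypothetical, the defaults of 1 for both options are inferred from the examples, and the ascending order of the returned list is an assumption (the actual fetch order is not specified here).

```python
def pages_to_fetch(first_page=1, last_page=1):
    """Return the page numbers selected by --first-page and --last-page.

    Both values must be at least 1. If last_page is less than first_page,
    the result is empty, so nothing would be fetched.
    """
    if first_page < 1 or last_page < 1:
        raise ValueError("page numbers should not be less than 1")
    return list(range(first_page, last_page + 1))


# The three command-line examples above, expressed as calls.
most_recent = pages_to_fetch()                       # page 1 only
five_most_recent = pages_to_fetch(last_page=5)       # pages 1 through 5
older_pages = pages_to_fetch(first_page=11, last_page=20)
```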
/post/ URL path, without category URL.
fetch-news-pages
Fetch news pages contents for pages which were:
python news_fetcher/prostoprosport_news_fetcher.py fetch-news-pages
generate-wiki-pages
Generate MediaWiki pages as text files for fetched news pages not marked as uploaded.
--output-file FILE: output JSON file with the list of generated pages. It contains a dictionary where keys are page titles and values are page file paths.
--output-directory FILE: directory to place generated MediaWiki page files.
--bot-name STRING: name of the bot user account to use in the page template.
python news_fetcher/prostoprosport_news_fetcher.py generate-wiki-pages --output-file ../data/pages.json --output-directory ../data/pages/
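Since the output file is described as a dictionary of page titles to page file paths, it can be inspected with a few lines of Python before uploading. The sketch below is an assumption-laden illustration: the helper names load_generated_pages and find_missing_files are hypothetical, and it assumes the stored paths are resolvable from the current working directory.

```python
import json
from pathlib import Path


def load_generated_pages(index_path):
    """Load the title-to-file mapping written by generate-wiki-pages."""
    with open(index_path, encoding="utf-8") as index_file:
        pages = json.load(index_file)
    if not isinstance(pages, dict):
        raise ValueError("expected a JSON object mapping page titles to file paths")
    return pages


def find_missing_files(pages):
    """Return the titles whose page file does not exist on disk."""
    return [title for title, path in pages.items() if not Path(path).is_file()]
```

For example, calling find_missing_files(load_generated_pages("../data/pages.json")) would list any titles whose text file was moved or deleted after generation.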
mark-uploaded-pages
Mark news articles as uploaded in database.
--input-file FILE: input JSON file generated by the generate-wiki-pages command.
python news_fetcher/prostoprosport_news_fetcher.py mark-uploaded-pages --input-file ../data/pages.json
process-categories in prostoprosport source module
Build the categories mapping file. It will contain data about the base URL for category slugs and IDs. For example, category rpl has the base URL (without leading slash) football/russia/rpl.
--input-from-js-file FILE: JSON file with categories data grabbed from JavaScript, default is data/categories_from_js.json
--input-bonus-file FILE: JSON file with additional data, default is data/categories_bonus.json
--input-colors-file FILE: JSON file with ID-to-color mapping data grabbed from JavaScript, default is data/category_colors.json
--output-file FILE: output JSON file, default is data1/categories_data.json
python news_fetcher/prostoprosport.py process-categories