# RSS News Crawler

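## Overview
This tool automatically fetches RSS news feeds and stores the results in a SQLite database. It supports:
- Parallel fetching from multiple RSS feeds
- Content deduplication and compressed storage
- Filtering for news published on the current day
- Logging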
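
The parallel fetching and same-day filtering listed above could look roughly like the sketch below. It is not the package's implementation: `fetch_all`, `only_today`, the injected `fetch_one` callable, and the `publish_time` dictionary key are all hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date


def fetch_all(urls, fetch_one, max_workers=4):
    """Fetch every URL concurrently; return {url: result, or None on failure}."""
    def safe_fetch(url):
        try:
            return fetch_one(url)
        except Exception:
            # One failing feed should not abort the whole run.
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each URL with its result.
        return dict(zip(urls, pool.map(safe_fetch, urls)))


def only_today(items, today=None):
    """Keep items whose 'publish_time' (a datetime) falls on the given day."""
    today = today or date.today()
    return [item for item in items if item["publish_time"].date() == today]
```

Passing the fetch function in as an argument keeps the sketch independent of any particular HTTP or feed-parsing library.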

## Installation
```
pip install rss-news-crawler
```

## Usage
```
from rss_news_crawler import NewsCrawler

# Create the crawler object
crawler = NewsCrawler(
    db_name='news.db',  # Path to the SQLite database
    log_file='news.log',  # Path to the log file
    rss_feeds_file='rss_feeds.txt',  # Path to the RSS feed list; default feeds are used if this file is missing or empty
)

# Crawl the RSS feeds
crawler.fetch_and_store_news()
```

## rss_feeds.txt file format
One RSS feed URL per line, for example:
```
https://www.example.com/rss.xml
https://www.example.com/rss2.xml
```
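
As a rough illustration of how such a file might be read, including the fallback to default feeds when the file is missing or empty, consider the sketch below. The `load_feeds` helper and the `DEFAULT_FEEDS` list are hypothetical; the package's actual default feeds and loading code are not documented here.

```python
from pathlib import Path

# Hypothetical fallback list; the package's real default feeds are not documented here.
DEFAULT_FEEDS = ["https://www.example.com/rss.xml"]


def load_feeds(path, defaults=DEFAULT_FEEDS):
    """Return the URLs listed in `path`, one per line, or `defaults` if none are found."""
    feed_file = Path(path)
    if not feed_file.is_file():
        return list(defaults)
    lines = feed_file.read_text(encoding="utf-8").splitlines()
    # Skip blank lines and strip surrounding whitespace.
    urls = [line.strip() for line in lines if line.strip()]
    return urls or list(defaults)
```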

## Database table structure
```
CREATE TABLE IF NOT EXISTS news (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    publish_time DATETIME NOT NULL,
    crawl_time DATETIME NOT NULL,
    title TEXT NOT NULL,
    content BLOB NOT NULL,
    url TEXT NOT NULL UNIQUE
)
```
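
Because `url` is declared `UNIQUE`, one way to skip articles that were already stored is `INSERT OR IGNORE`. The sketch below uses Python's standard `sqlite3` module with the schema above; the `store` helper is a hypothetical name, and the package may handle duplicates differently.

```python
import sqlite3
from datetime import datetime

SCHEMA = """
CREATE TABLE IF NOT EXISTS news (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    publish_time DATETIME NOT NULL,
    crawl_time DATETIME NOT NULL,
    title TEXT NOT NULL,
    content BLOB NOT NULL,
    url TEXT NOT NULL UNIQUE
)
"""


def store(conn, publish_time, title, content, url):
    """Insert one row; return True if it was new, False if the URL already existed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO news (publish_time, crawl_time, title, content, url) "
        "VALUES (?, ?, ?, ?, ?)",
        # Timestamps are stored as ISO 8601 strings; content is passed as bytes.
        (publish_time.isoformat(), datetime.now().isoformat(), title, content, url),
    )
    conn.commit()
    return cur.rowcount == 1


# An in-memory database keeps the example free of side effects.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
```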

The `content` column stores the news content after deduplication and compression. Compression is performed by `feed_handler.compress_content()`.
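
The algorithm behind `feed_handler.compress_content()` is not documented here. As a minimal sketch of the general idea, assuming zlib compression of UTF-8 text (an assumption, not the package's confirmed behaviour), the round trip to and from a BLOB value could look like this. `compress_text` and `decompress_text` are hypothetical names.

```python
import zlib


def compress_text(text):
    """Encode text as UTF-8 and compress it into bytes suitable for a BLOB column."""
    return zlib.compress(text.encode("utf-8"))


def decompress_text(blob):
    """Reverse compress_text: decompress the bytes and decode them as UTF-8."""
    return zlib.decompress(blob).decode("utf-8")
```

Data written by the real package should be read back with whatever counterpart the package itself provides, since its format may differ from this sketch.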

## Example
```
from rss_news_crawler import NewsCrawler

crawler = NewsCrawler(
    db_name='news.db',
    log_file='news.log',
    rss_feeds_file='rss_feeds.txt',
)

crawler.fetch_and_store_news()
```



John_MC_Python
