
Scrapy utils for Modis crawlers projects.
Some utils connected with mongodb.
MongoDBPipeline - pipeline for saving items in mongodb.
Params:
(default value is test).
MONGO_COLLECTION - where to save item (default value is collection).
Some utils connected with kafka.
KafkaPipeline - pipeline for pushing items into kafka.
Pipeline outputs data into stream with name {RESOURCE_TAG}.{DATA_TYPE}, where RESOURCE_TAG is the tag of the resource from which data is crawled and DATA_TYPE is the type of data crawled: data, post, comment, like, user, friend, share, member, news, community.
Params:
RESOURCE_TAG (default value is platform)
RESOURCE_TAG for crawled items without KAFKA_RESOURCE_TAG_KEY (default value is crawler)
DATA_TYPE (default value is type).
DATA_TYPE for crawled items without KAFKA_DATA_TYPE_KEY (default value is data).
gzip.
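The naming rule above can be illustrated with a minimal sketch. It is not the KafkaPipeline implementation: the function name `stream_name` and its parameter names are hypothetical, and the pairing of the defaults (`platform`, `type`, `crawler`, `data`) with item keys and fallback values is an inference from the parameter list.

```python
from typing import Any, Mapping


def stream_name(
    item: Mapping[str, Any],
    resource_tag_key: str = "platform",     # assumed: item key holding RESOURCE_TAG
    data_type_key: str = "type",            # assumed: item key holding DATA_TYPE
    default_resource_tag: str = "crawler",  # assumed fallback when the key is absent
    default_data_type: str = "data",        # assumed fallback when the key is absent
) -> str:
    """Build a stream name of the form {RESOURCE_TAG}.{DATA_TYPE} for an item."""
    resource_tag = item.get(resource_tag_key, default_resource_tag)
    data_type = item.get(data_type_key, default_data_type)
    return f"{resource_tag}.{data_type}"


# An item that carries both keys, and one that carries neither.
tagged = stream_name({"platform": "example", "type": "post"})
untagged = stream_name({"text": "hello"})
```

Under these assumptions, an item without either key falls back to the stream `crawler.data`.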
OpenSearchRequestsDownloaderMiddleware transforms a request-response pair into an item and then sends it to OpenSearch.
Settings:
`OPENSEARCH_REQUESTS_SETTINGS` - dict specifying OpenSearch client connections:
"hosts": Optional[str | list[str]] = "localhost:9200" - hosts with opensearch endpoint,
"timeout": Optional[int] = 60 - timeout of connections,
"http_auth": Optional[tuple[str, str]] = None - HTTP authentication if needed,
"port": Optional[int] = 443 - access port if not specified in hosts,
"use_ssl": Optional[bool] = True - usage of SSL,
"verify_certs": Optional[bool] = False - verifying certificates,
"ssl_show_warn": Optional[bool] = False - show SSL warnings,
"ca_certs": Optional[str] = None - CA certificate path,
"client_key": Optional[str] = None - client key path,
"client_cert": Optional[str] = None - client certificate path,
"buffer_length": Optional[int] = 500 - number of items in OpenSearchStorage's buffer.
`OPENSEARCH_REQUESTS_INDEX`: Optional[str] = "scrapy-job-requests" - index in OpenSearch.
See an example in examples/opensearch.
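As a rough illustration, the settings listed above could be written out in a project's `settings.py` as follows. Every value shown is the documented default, so in practice only the keys that differ would likely need to be set. Enabling the middleware itself in `DOWNLOADER_MIDDLEWARES` is not shown, because its import path and order are not stated here.

```python
# settings.py (sketch): OpenSearch request logging, all values are the documented defaults.
OPENSEARCH_REQUESTS_SETTINGS = {
    "hosts": "localhost:9200",  # a single host or a list of hosts
    "timeout": 60,              # connection timeout
    "http_auth": None,          # e.g. a (user, password) tuple if authentication is needed
    "port": 443,                # used when the port is not given in "hosts"
    "use_ssl": True,
    "verify_certs": False,
    "ssl_show_warn": False,
    "ca_certs": None,           # path to a CA certificate
    "client_key": None,         # path to a client key
    "client_cert": None,        # path to a client certificate
    "buffer_length": 500,       # number of items in OpenSearchStorage's buffer
}

OPENSEARCH_REQUESTS_INDEX = "scrapy-job-requests"
```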
Captcha detection middleware for scrapy crawlers. It gets the HTML code from the response (if present), sends it to the captcha detection web-server and logs the result.
To skip the captcha check for a particular response, set the meta key dont_check_captcha to True on its request.
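A small sketch of how the `dont_check_captcha` meta key might be set. The helper `skip_captcha_check` is hypothetical and not part of crawler_utils; it only builds the `meta` dict, which would then be passed to a Scrapy request.

```python
from typing import Any, Optional


def skip_captcha_check(meta: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    """Return a copy of `meta` with the captcha check disabled for that request."""
    new_meta = dict(meta or {})  # copy, so the caller's dict is not mutated
    new_meta["dont_check_captcha"] = True
    return new_meta


# In a spider this would typically be used as:
#   yield scrapy.Request(url, callback=self.parse, meta=skip_captcha_check())
request_meta = skip_captcha_check({"download_timeout": 30})
```

Passing `meta={"dont_check_captcha": True}` directly to the request has the same effect; the helper is only convenient when other meta keys need to be preserved.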
The middleware must be set up with higher precedence (lower number) than RetryMiddleware:
DOWNLOADER_MIDDLEWARES = {
    "crawler_utils.CaptchaDetectionDownloaderMiddleware": 549,  # By default, RetryMiddleware has 550
}
Middleware settings:
To log exceptions raised during crawling to Sentry, use the crawler_utils.sentry_logging.SentryLoggingExtension.
Note that sentry_sdk should be loaded as early as possible.
To ensure this, register the extension with a negative order:
EXTENSIONS = {
    # Load SentryLogging extension before other extensions.
    "crawler_utils.sentry_logging.SentryLoggingExtension": -1,
}
Settings:
SENTRY_DSN: str - Sentry's DSN, where to send events.
SENTRY_SAMPLE_RATE: float = 1.0 - sample rate for error events. Must be in range from 0.0 to 1.0.
SENTRY_TRACES_SAMPLE_RATE: float = 1.0 - the percentage chance a given transaction will be sent to Sentry.
SENTRY_ATTACH_STACKTRACE: bool = False - whether to attach stacktrace for error events.
SENTRY_MAX_BREADCRUMBS: int = 10 - max breadcrumbs to capture with Sentry.
For an example, check examples/sentry_logging.
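Put together, a `settings.py` fragment for the Sentry extension might look like the sketch below. The DSN is a placeholder, not a real endpoint, and the non-default values (a lower traces sample rate, stack traces enabled) are illustrative choices rather than recommendations from the project.

```python
# settings.py (sketch): Sentry logging for a Scrapy project.
EXTENSIONS = {
    # Negative order so that sentry_sdk is loaded before other extensions.
    "crawler_utils.sentry_logging.SentryLoggingExtension": -1,
}

SENTRY_DSN = "https://examplePublicKey@o0.ingest.sentry.io/0"  # placeholder DSN
SENTRY_SAMPLE_RATE = 1.0         # send every error event (must be within 0.0 to 1.0)
SENTRY_TRACES_SAMPLE_RATE = 0.1  # illustrative: sample 10% of transactions
SENTRY_ATTACH_STACKTRACE = True  # illustrative: the default is False
SENTRY_MAX_BREADCRUMBS = 10      # the documented default
```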