A session-management extension for Scrapy.
This library resolves at least three long-standing issues in Scrapy's session-management system that people have raised concerns about for years:
This library contains a CookiesMiddleware that exposes the Scrapy cookie jars in the spider attribute sessions. This is an instance of the new Sessions class (objects.Sessions) that allows one to examine the content of the current sessions and to clear and/or renew a session that is failing. The renewal procedure short-circuits the Scrapy request scheduling process, inducing an immediate download of the request specified, ahead of all others. This does not cause any adverse consequences (for example, scrape statistics are maintained perfectly).
This library also provides a tool for maintaining and rotating "profiles", making it easy to give the appearance that your scrape's requests are being generated by multiple, entirely distinct clients.
Another use case is handling session cookies collected outside of Scrapy and fed into your spider. Whenever this external collection is necessary (for websites that require some kind of demonstration of JavaScript rendering before they serve a session to an unknown client), this library provides a handy solution for cycling from one session to the next at each point of failure.
The scrapy-sessions CookiesMiddleware is designed to override the default Scrapy CookiesMiddleware. It is an extension of the default middleware, so there shouldn't be adverse consequences from adopting it.
The "COOKIES_ENABLED" and "COOKIES_DEBUG" settings work exactly as with the default middleware: if "COOKIES_ENABLED" is disabled, this middleware is disabled, and if "COOKIES_DEBUG" is enabled, you will get the same debug messages about cookies sent and received.
With this said, there are some important differences to note. With the default Scrapy middleware, the value of the "cookiejar" key in your request.meta names the session (cookie jar) that the request will use. If the session does not exist, a new session is created. The exact same applies in this library, except that you can now also use the "session_id" key for this purpose. The default value for this is now 0, rather than None. So, if you don't use either of these keywords in any of your requests, each request will by default send the cookies associated with session 0, and add any cookies it receives to session 0.
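The lookup described above can be sketched in plain Python. This is only an illustration, not the library's code: the helper name resolve_session_id is hypothetical, and the order in which the two keys are checked is an assumption, since the text does not say which one wins when both are present.

```python
DEFAULT_SESSION_ID = 0  # default session named in the text above


def resolve_session_id(meta):
    """Return the session a request would use, given its meta dict.

    Assumption: "session_id" is checked before "cookiejar". The README
    does not state which key takes precedence when both are set.
    """
    if "session_id" in meta:
        return meta["session_id"]
    # Fall back to the classic Scrapy key, then to the default session.
    return meta.get("cookiejar", DEFAULT_SESSION_ID)
```

A request with an empty meta therefore lands in session 0, which is the behaviour to keep in mind when mixing tagged and untagged requests in one spider.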
Override the default middleware:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': None,
    'scrapy_sessions.CookiesMiddleware': 700,
}
This will allow you to interact with the spider.sessions attribute, in order to inspect, modify, clear, and renew sessions (see usage). It will also give you access to the response cookies via response.meta["cookies"].
The profiles tool is a separate add-on that hooks onto the sessions.
After changing settings.py as above, add the following setting:
SESSIONS_PROFILES_SYNC = True
Then create a profiles.py file at the top level of your project, similar to the following:
from w3lib.http import basic_auth_header
PROFILES = [
    {"proxy": ['proxy_url', basic_auth_header('username', 'password')], "user-agent": "MY USER AGENT"},
    {"proxy": ['proxy_url', basic_auth_header('username', 'password')], "user-agent": "MY USER AGENT"},
]
Either the "proxy" key or the "user-agent" key can be omitted from a profile, but not both.
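The rule that a profile needs at least one of the two keys can be checked before a scrape starts. The following is a minimal sketch under that single rule; validate_profiles is a hypothetical helper and is not part of scrapy-sessions.

```python
def validate_profiles(profiles):
    """Return the profiles unchanged, or raise if one of them is empty.

    Hypothetical helper: it only enforces the rule stated above, that a
    profile must carry a "proxy" key, a "user-agent" key, or both.
    """
    for index, profile in enumerate(profiles):
        if "proxy" not in profile and "user-agent" not in profile:
            raise ValueError(
                f"profile {index} needs a 'proxy' or a 'user-agent' key"
            )
    return profiles
```

Running such a check on PROFILES at import time surfaces a malformed entry immediately rather than partway through a crawl.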
Finally, after importing the load_profiles function (from scrapy_sessions.utils import load_profiles), add the following to your spider settings:
custom_settings = {
    "SESSIONS_PROFILES": load_profiles('profiles.py')
}
Currently, this load_profiles function fails when trying to deploy on Zyte. I will try to solve this issue when I have time.
Access these methods via the sessions instance attached to your spider when using this library. For example usage, see the next section.
get(self, session_id=0, mode=None, domain=None)
For inspecting your sessions. Two formats toggled by mode: dictionary (name:value) or list of strings (containing the full cookie data for each cookie).
get_profile(self, session_id=0)
For inspecting the profile attached to the given session. Only works if SESSIONS_PROFILES_SYNC is enabled.
add_cookies_manually(self, cookies, url, session_id=0)
For explicitly adding a set of cookies to a given session. The cookies must be in the format {name:value}; the url is the url that these cookies would come from conventionally.
clear(self, session_id=0, renewal_request=None)
For clearing a session and/or immediately renewing it with a special one-off request. If you don't specify a renewal_request, the session will be retried with the first new request off the rank.
response.meta["cookies"]
response.meta["session_id"]
In the examples below, self refers to a scrapy.Spider instance.
The cookies in the first domain of the default session (session 0):
self.sessions.get()
The cookies in the first domain of a specified session:
self.sessions.get(response.meta["session_id"])
A specified session with a specified domain:
self.sessions.get(response.meta["session_id"], domain='exampledomain.com')
In dictionary format:
self.sessions.get(session_id, mode=dict)
The default session:
self.sessions.clear()
Specifying a session works the same as in get.
The default session:
self.sessions.clear(renewal_request=Request(url='renewal_url',callback=self.cb))
The callback is optional; if no callback is specified, the session is renewed just the same.
The profile for the default session:
self.sessions.get_profile()
Specifying a session works the same as before.
This method will only work if SESSIONS_PROFILES_SYNC is enabled in the spider settings.
self.sessions.add_cookies_manually({name1: val1, name2: val2}, 'https://exampledomain.com/', 0)
There are two use cases for this:
With the Profiles add-on. Set up your profiles, then within some part of an errback function or middleware that only gets activated when a session expires (you may need some custom logic here), clear and renew your session using sessions.clear. Because you are using profiles, any renewal_request you specify within the clear method will automatically get visited by a fresh profile.
Without the Profiles add-on. Within some part of an errback function or middleware that only gets activated when a session expires, clear and renew your session using sessions.clear by specifying a renewal_request that uses a fresh proxy and/or user-agent.
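Both use cases hinge on some custom logic that decides when a session has expired. Here is a minimal sketch of that decision, written so that it can be called from an errback or a middleware. The helper renew_if_expired and the EXPIRED_STATUSES set are hypothetical; the only library call used is the documented clear(session_id, renewal_request) method, and treating 401 and 403 as expiry signals is an assumption that will differ from site to site.

```python
EXPIRED_STATUSES = {401, 403}  # assumption: how expiry shows up varies per site


def renew_if_expired(sessions, status, session_id, renewal_request):
    """Clear and renew a session when a response suggests it has expired.

    `sessions` is expected to be the spider's `sessions` attribute; only
    its documented `clear(session_id, renewal_request)` method is called.
    Returns True if a renewal was triggered, False otherwise.
    """
    if status not in EXPIRED_STATUSES:
        return False
    sessions.clear(session_id, renewal_request=renewal_request)
    return True
```

Inside a spider callback this might be called as renew_if_expired(self.sessions, response.status, response.meta["session_id"], Request(url='renewal_url')), with the renewal URL being whatever page hands out a fresh session on the target site.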
Since this is the most complicated part of the library, it's worth describing the underlying process. The following is what happens when clear is called with a renewal_request argument:
The renewal_request will reach the process_response method of the middleware and therein re-fill the session.
Requests for that session that fall after the clear trigger but before the renewal event has occurred are re-scheduled or re-downloaded. To see the manner in which this is achieved, see the code.
The idea of this tool is to manage distinct client identities within a scrape. The identity consists of two or more of the following attributes: session + user agent + proxy.
The profiles are input via a special profiles.py file (see setting up profiles). Once you have these set up (and have tweaked the settings as required), one of these profiles is automatically associated with every new session created in your scrape. If there are more sessions than profiles, the profiles will be automatically recycled from the beginning. When a session is cleared, the profile is also removed.
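The assignment and recycling rule can be illustrated with a small stand-alone class. This is a sketch of the behaviour described above, not the library's implementation; ProfilePool and its method names are hypothetical.

```python
class ProfilePool:
    """Hand out profiles to sessions, wrapping around when they run out."""

    def __init__(self, profiles):
        if not profiles:
            raise ValueError("at least one profile is required")
        self._profiles = list(profiles)
        self._assigned = 0       # number of profiles handed out so far
        self._by_session = {}    # session_id -> profile

    def get(self, session_id):
        """Return the session's profile, assigning the next one if new."""
        if session_id not in self._by_session:
            index = self._assigned % len(self._profiles)  # recycle from the start
            self._by_session[session_id] = self._profiles[index]
            self._assigned += 1
        return self._by_session[session_id]

    def release(self, session_id):
        """Forget the profile of a cleared session."""
        self._by_session.pop(session_id, None)
```

With two profiles and three sessions, the third session receives the first profile again, and a session that has been released picks up whichever profile is next in line when it is recreated.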
Index 0 of any "proxy" value is fed into the request.meta["proxy"] field in the process_request function of the middleware. Index 1 is fed into request.headers['Proxy-Authorization'].
Similarly, the "user-agent" value is fed into request.headers["user-agent"].
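The mapping from a profile onto a request can be sketched with plain dictionaries standing in for request.meta and request.headers. The function apply_profile is hypothetical and only mirrors the three assignments described above.

```python
def apply_profile(profile, meta, headers):
    """Copy a profile's proxy and user agent onto request-like dicts.

    `meta` and `headers` are plain dicts standing in for request.meta
    and request.headers; keys missing from the profile are skipped.
    """
    proxy = profile.get("proxy")
    if proxy is not None:
        proxy_url, proxy_auth = proxy                 # index 0 and index 1
        meta["proxy"] = proxy_url
        headers["Proxy-Authorization"] = proxy_auth
    user_agent = profile.get("user-agent")
    if user_agent is not None:
        headers["user-agent"] = user_agent
    return meta, headers
```

A profile that omits "proxy" leaves meta untouched, which matches the rule that either key may be left out of a profile.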
I am planning to add tests, and then I may at some point submit a pull request on the Scrapy repository proposing this as a replacement for the default Scrapy CookiesMiddleware.
I've noticed what might be described as a bug in the default Scrapy implementation of the cookiejar via the http.cookiejar library. I'm not sure it rises to the level of a bug, but either way it is unexpected behaviour. This library has not addressed it because it would be best addressed within the http.cookiejar library itself. The behaviour is as follows:
In Scrapy, you can send off a number of cookies with a single request. These get merged with the existing cookies in the session before the request is sent off. One way of adding these cookies to the request is in list format, and within this format, you can specify a domain for each cookie (e.g. cookies=[{'name': name, 'value': value, 'domain': domain}]). If you specify this domain, these cookies will get stored in the session (cookiejar) under that domain, except with a leading dot (full stop). It doesn't matter whether you include this leading dot yourself or not. But, by default, any cookies that get added to the session when you make requests to that very same domain without explicit cookies will be added to the session without this leading dot. Therefore, because these two sets of cookies only get merged together if they are filed under an identical domain, the cookies may fail to merge properly in the next request to that domain, so that you end up sending cookies that overlap (with duplicate names or potentially even duplicate names + values).
This can be resolved either by simply not adding an explicit domain to cookies that you specify in this way, or by using the method add_cookies_manually to add these extra cookies to the session before you send off any requests that require them.
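The leading-dot behaviour can be reproduced with the standard library alone, outside Scrapy. The sketch below feeds http.cookiejar two Set-Cookie headers for the same host, one with an explicit Domain attribute and one without, and lists the domains the jar files them under. FakeResponse and stored_domains are hypothetical helpers written for this illustration, and example.com is a placeholder host.

```python
import email.message
import http.cookiejar
import urllib.request


class FakeResponse:
    """Minimal response stand-in: http.cookiejar only calls .info() on it."""

    def __init__(self, set_cookie):
        self._headers = email.message.Message()
        self._headers["Set-Cookie"] = set_cookie

    def info(self):
        return self._headers


def stored_domains(set_cookie_values, url="https://example.com/"):
    """Store each Set-Cookie value in one jar and return the cookie domains."""
    jar = http.cookiejar.CookieJar()
    request = urllib.request.Request(url)
    for value in set_cookie_values:
        jar.extract_cookies(FakeResponse(value), request)
    return sorted(cookie.domain for cookie in jar)


# Same cookie name, once with an explicit domain and once without.
domains = stored_domains(["token=1; Domain=example.com", "token=2"])
```

The two cookies share a name but end up under different domain keys, one dotted and one not, so the jar keeps both instead of letting the later value replace the earlier one.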