==============================================
Scrapy & JavaScript integration through Splash
==============================================
.. image:: https://img.shields.io/pypi/v/scrapy-splash.svg
   :target: https://pypi.python.org/pypi/scrapy-splash
   :alt: PyPI Version

.. image:: https://github.com/scrapy-plugins/scrapy-splash/workflows/Tests/badge.svg
   :target: https://github.com/scrapy-plugins/scrapy-splash/actions/workflows/tests.yml
   :alt: Test Status

.. image:: http://codecov.io/github/scrapy-plugins/scrapy-splash/coverage.svg?branch=master
   :target: http://codecov.io/github/scrapy-plugins/scrapy-splash?branch=master
   :alt: Code Coverage
This library provides Scrapy_ and JavaScript integration using Splash_.
The license is BSD 3-clause.

.. _Scrapy: https://github.com/scrapy/scrapy
.. _Splash: https://github.com/scrapinghub/splash
Installation
============

Install scrapy-splash using pip::

    $ pip install scrapy-splash

scrapy-splash uses the Splash_ HTTP API, so you also need a Splash instance.
Usually, to install and run Splash, something like this is enough::

    $ docker run -p 8050:8050 scrapinghub/splash

Check Splash `install docs`_ for more info.

.. _install docs: http://splash.readthedocs.org/en/latest/install.html
Configuration
=============

1. Add the Splash server address to ``settings.py`` of your Scrapy project
   like this::

       SPLASH_URL = 'http://192.168.59.103:8050'

2. Enable the Splash middleware by adding it to ``DOWNLOADER_MIDDLEWARES``
   in your ``settings.py`` file and changing ``HttpCompressionMiddleware``
   priority:

   .. code:: python

       DOWNLOADER_MIDDLEWARES = {
           'scrapy_splash.SplashCookiesMiddleware': 723,
           'scrapy_splash.SplashMiddleware': 725,
           'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
       }

   Order 723 is just before ``HttpProxyMiddleware`` (750) in the default
   Scrapy settings. The ``HttpCompressionMiddleware`` priority should be
   changed in order to allow advanced response processing; see
   https://github.com/scrapy/scrapy/issues/1895 for details.
3. Enable ``SplashDeduplicateArgsMiddleware`` by adding it to
   ``SPIDER_MIDDLEWARES`` in your ``settings.py``:

   .. code:: python

       SPIDER_MIDDLEWARES = {
           'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
       }

   This middleware is needed to support the ``cache_args`` feature; it saves
   disk space by not storing duplicate Splash arguments multiple times in the
   disk request queue. If Splash 2.1+ is used, the middleware also saves
   network traffic by not sending these duplicate arguments to the Splash
   server multiple times.
4. Set a custom ``REQUEST_FINGERPRINTER_CLASS``:

   .. code:: python

       REQUEST_FINGERPRINTER_CLASS = 'scrapy_splash.SplashRequestFingerprinter'
There are also some additional options available.
Put them into your ``settings.py`` if you want to change the defaults:

- ``SPLASH_COOKIES_DEBUG`` is ``False`` by default.
  Set to ``True`` to enable debugging cookies in the
  ``SplashCookiesMiddleware``. This option is similar to ``COOKIES_DEBUG``
  for the built-in Scrapy cookies middleware: it logs sent and received
  cookies for all requests.
- ``SPLASH_LOG_400`` is ``True`` by default - it instructs scrapy-splash to
  log all 400 errors from Splash. They are important because they show errors
  that occurred when executing the Splash script. Set it to ``False`` to
  disable this logging.
- ``SPLASH_SLOT_POLICY`` is ``scrapy_splash.SlotPolicy.PER_DOMAIN`` (as an
  object, not just a string) by default. It specifies how concurrency and
  politeness are maintained for Splash requests, and sets the default value
  for the ``slot_policy`` argument of ``SplashRequest``, which is described
  below.
- ``SCRAPY_SPLASH_REQUEST_FINGERPRINTER_BASE_CLASS`` is
  ``scrapy.settings.default_settings.REQUEST_FINGERPRINTER_CLASS`` by
  default. This changes the base class the fingerprinter uses to get a
  fingerprint.
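For reference, a ``settings.py`` fragment that just restates these defaults
might look like the following (a minimal sketch; the values shown are the
documented defaults, so only include the lines you actually want to change):

.. code:: python

    from scrapy_splash import SlotPolicy

    SPLASH_COOKIES_DEBUG = False  # log cookies sent to / received from Splash
    SPLASH_LOG_400 = True         # log HTTP 400 responses from Splash
    SPLASH_SLOT_POLICY = SlotPolicy.PER_DOMAIN  # an object, not a string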
Usage
=====

Requests
--------

The easiest way to render requests with Splash is to use
``scrapy_splash.SplashRequest``:
.. code:: python

    yield SplashRequest(url, self.parse_result,
        args={
            # optional; parameters passed to Splash HTTP API
            'wait': 0.5,

            # 'url' is prefilled from request url
            # 'http_method' is set to 'POST' for POST requests
            # 'body' is set to request body for POST requests
        },
        endpoint='render.json',  # optional; default is render.html
        splash_url='<url>',      # optional; overrides SPLASH_URL
        slot_policy=scrapy_splash.SlotPolicy.PER_DOMAIN,  # optional
    )
Alternatively, you can use a regular ``scrapy.Request`` and the ``'splash'``
request meta key:
.. code:: python

    yield scrapy.Request(url, self.parse_result, meta={
        'splash': {
            'args': {
                # set rendering arguments here
                'html': 1,
                'png': 1,

                # 'url' is prefilled from request url
                # 'http_method' is set to 'POST' for POST requests
                # 'body' is set to request body for POST requests
            },

            # optional parameters
            'endpoint': 'render.json',  # optional; default is render.json
            'splash_url': '<url>',      # optional; overrides SPLASH_URL
            'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,
            'splash_headers': {},       # optional; a dict with headers sent to Splash
            'dont_process_response': True,  # optional, default is False
            'dont_send_headers': True,  # optional, default is False
            'magic_response': False,    # optional, default is True
        }
    })
Use the ``request.meta['splash']`` API in middlewares or when
``scrapy.Request`` subclasses are used (there is also ``SplashFormRequest``,
described below). For example, ``meta['splash']`` makes it possible to create
a middleware which enables Splash for all outgoing requests by default, as
sketched below.
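Such a middleware could look like the following minimal sketch (the class
name, priority and default arguments are hypothetical, not part of
scrapy-splash; register it in ``DOWNLOADER_MIDDLEWARES`` with a priority
lower than 725 so that it runs before ``SplashMiddleware``):

.. code:: python

    class SplashForAllMiddleware:
        """Enable Splash rendering for requests that don't set it themselves."""

        def process_request(self, request, spider):
            # Leave requests that already carry Splash instructions untouched;
            # this also avoids re-processing requests scrapy-splash emits.
            if 'splash' not in request.meta:
                request.meta['splash'] = {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5},
                }
            # Returning None lets Scrapy continue processing the request.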
``SplashRequest`` is a convenient utility to fill ``request.meta['splash']``;
it should be easier to use in most cases. For each ``request.meta['splash']``
key there is a corresponding ``SplashRequest`` keyword argument: for example,
to set ``meta['splash']['args']`` use ``SplashRequest(..., args=myargs)``.
- ``meta['splash']['args']`` contains arguments sent to Splash.
  scrapy-splash adds some default keys/values to ``args``:

  - 'url' is set to request.url;
  - 'http_method' is set to 'POST' for POST requests;
  - 'body' is set to request.body for POST requests.

  You can override default values by setting them explicitly.

  Note that by default Scrapy escapes URL fragments using the AJAX escaping
  scheme. If you want to pass a URL with a fragment to Splash then set
  ``url`` in the ``args`` dict manually. This is handled automatically if you
  use ``SplashRequest``, but you need to keep that in mind if you use the raw
  ``meta['splash']`` API.

  Splash 1.8+ is required to handle POST requests; in earlier Splash versions
  the 'http_method' and 'body' arguments are ignored. If you work with the
  ``/execute`` endpoint and want to support POST requests, you have to handle
  the ``http_method`` and ``body`` arguments in your Lua script manually.
- ``meta['splash']['cache_args']`` is a list of argument names to cache
  on the Splash side. These arguments are sent to Splash only once, then the
  cached values are used; this saves network traffic and decreases request
  queue disk usage. Use ``cache_args`` only for large arguments which don't
  change with each request; ``lua_source`` is a good candidate (if you don't
  use string formatting to build it). Splash 2.1+ is required for this
  feature to work.
- ``meta['splash']['endpoint']`` is the Splash endpoint to use. In case of
  ``SplashRequest``,
  `render.html <http://splash.readthedocs.org/en/latest/api.html#render-html>`_
  is used by default. If you're using a raw ``scrapy.Request`` then
  `render.json <http://splash.readthedocs.org/en/latest/api.html#render-json>`_
  is the default (for historical reasons). It is better to always pass the
  endpoint explicitly.

  See Splash `HTTP API docs`_ for a full list of available endpoints
  and parameters.

  .. _HTTP API docs: http://splash.readthedocs.org/en/latest/api.html
- ``meta['splash']['splash_url']`` overrides the Splash URL set
  in ``settings.py``.
- ``meta['splash']['splash_headers']`` allows you to add or change headers
  which are sent to the Splash server. Note that this option is not for
  setting headers which are sent to the remote website.
- ``meta['splash']['slot_policy']`` customizes how concurrency & politeness
  are maintained for Splash requests. Currently there are 3 policies
  available:

  - ``scrapy_splash.SlotPolicy.PER_DOMAIN`` (default) - send Splash requests
    to downloader slots based on the URL being rendered. It is useful if you
    want to maintain per-domain politeness & concurrency settings.
  - ``scrapy_splash.SlotPolicy.SINGLE_SLOT`` - send all Splash requests to
    a single downloader slot. It is useful if you want to throttle requests
    to Splash.
  - ``scrapy_splash.SlotPolicy.SCRAPY_DEFAULT`` - don't do anything with
    slots. It is similar to the ``SINGLE_SLOT`` policy, but can be different
    if you access other services on the same address as Splash.
- ``meta['splash']['dont_process_response']`` - when set to True,
  SplashMiddleware won't change the response to a custom ``scrapy.Response``
  subclass. By default, for Splash requests one of SplashResponse,
  SplashTextResponse or SplashJsonResponse is passed to the callback.
- ``meta['splash']['dont_send_headers']``: by default scrapy-splash passes
  request headers to Splash in the 'headers' JSON POST field. For all
  render.xxx endpoints it means Scrapy header options are respected by
  default (see http://splash.readthedocs.org/en/stable/api.html#arg-headers).
  In Lua scripts you can use the ``headers`` argument of ``splash:go`` to
  apply the passed headers: ``splash:go{url, headers=splash.args.headers}``.

  Set 'dont_send_headers' to True if you don't want to pass headers
  to Splash.
- ``meta['splash']['http_status_from_error_code']`` - set ``response.status``
  to the HTTP error code when ``assert(splash:go(..))`` fails; it requires
  ``meta['splash']['magic_response']=True``. The
  ``http_status_from_error_code`` option is False by default if you use the
  raw meta API; SplashRequest sets it to True by default.
- ``meta['splash']['magic_response']`` - when set to True and a JSON
  response is received from Splash, several attributes of the response
  (headers, body, url, status code) are filled using data returned in JSON:

  - response.headers are filled from 'headers' keys;
  - response.url is set to the value of the 'url' key;
  - response.body is set to the value of the 'html' key, or to the
    base64-decoded value of the 'body' key;
  - response.status is set to the value of the 'http_status' key.

  When ``meta['splash']['http_status_from_error_code']`` is True and
  ``assert(splash:go(..))`` fails with an HTTP error, response.status is also
  set to the HTTP error code.

  The original URL, status and headers are available as ``response.real_url``,
  ``response.splash_response_status`` and ``response.splash_response_headers``.

  This option is set to True by default if you use SplashRequest.
  ``render.json`` and ``execute`` endpoints may not have all the necessary
  keys/values in the response. For non-JSON endpoints, only url is filled,
  regardless of the ``magic_response`` setting.
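Because each of these ``meta['splash']`` keys has a corresponding
``SplashRequest`` keyword argument, the options above can be combined without
touching ``meta`` directly. A minimal sketch (the URL, spider name and
argument values are placeholders):

.. code:: python

    import scrapy
    import scrapy_splash
    from scrapy_splash import SplashRequest

    class OptionsSpider(scrapy.Spider):
        name = "options_example"

        def start_requests(self):
            yield SplashRequest(
                'http://example.com',
                self.parse_result,
                endpoint='render.json',      # -> meta['splash']['endpoint']
                args={'html': 1, 'png': 1},  # -> meta['splash']['args']
                slot_policy=scrapy_splash.SlotPolicy.SINGLE_SLOT,
                magic_response=False,        # -> meta['splash']['magic_response']
            )

        def parse_result(self, response):
            pass  # ...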
Use ``scrapy_splash.SplashFormRequest`` if you want to make a ``FormRequest``
via Splash. It accepts the same arguments as ``SplashRequest``, and also
``formdata``, like ``FormRequest`` from Scrapy::

    >>> from scrapy_splash import SplashFormRequest
    >>> SplashFormRequest('http://example.com', formdata={'foo': 'bar'})
    <POST http://example.com>

``SplashFormRequest.from_response`` is also supported, and works as described
in the `scrapy documentation
<http://scrapy.readthedocs.org/en/latest/topics/request-response.html#scrapy.http.FormRequest.from_response>`_.
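For example, submitting a login form on a rendered page could look like this
(a sketch; the spider, form field names and callback are illustrative only):

.. code:: python

    import scrapy
    from scrapy_splash import SplashFormRequest

    class LoginSpider(scrapy.Spider):
        # ...
        def parse(self, response):
            # fill and submit the first form found in the rendered page,
            # routing the POST through Splash
            yield SplashFormRequest.from_response(
                response,
                formdata={'user': 'john', 'pass': 'secret'},
                callback=self.after_login,
            )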
Responses
---------

scrapy-splash returns Response subclasses for Splash requests:

- SplashResponse is returned for binary Splash responses - e.g. for
  /render.png responses;
- SplashTextResponse is returned when the result is text - e.g. for
  /render.html responses;
- SplashJsonResponse is returned when the result is a JSON object - e.g.
  for /render.json responses or /execute responses when the script returns
  a Lua table.

To use standard Response classes set
``meta['splash']['dont_process_response']=True`` or pass the
``dont_process_response=True`` argument to SplashRequest.

All these responses set ``response.url`` to the URL of the original request
(i.e. to the URL of the website you want to render), not to the URL of the
requested Splash endpoint. The "true" URL is still available as
``response.real_url``.
SplashJsonResponse provides extra features:

- The ``response.data`` attribute contains response data decoded from JSON;
  you can access it like ``response.data['html']``.
- If Splash session handling is configured, you can access the current
  cookies as ``response.cookiejar``; it is a CookieJar instance.
- If Scrapy-Splash response magic is enabled in the request (default),
  several response attributes (headers, body, url, status code)
  are set automatically from the original response body:

  - response.headers are filled from 'headers' keys;
  - response.url is set to the value of the 'url' key;
  - response.body is set to the value of the 'html' key, or to the
    base64-decoded value of the 'body' key;
  - response.status is set from the value of the 'http_status' key.

When ``response.body`` is updated in ``SplashJsonResponse`` (either from the
'html' or the 'body' key), the familiar ``response.css`` and
``response.xpath`` methods are available.

To turn off special handling of JSON result keys either set
``meta['splash']['magic_response']=False`` or pass the
``magic_response=False`` argument to SplashRequest.
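Putting this together, a callback can branch on the response class and read
extra keys from ``response.data``. A minimal sketch, assuming a /render.json
request that asked for both ``html`` and ``png``:

.. code:: python

    import base64

    import scrapy
    from scrapy_splash import SplashJsonResponse

    class MySpider(scrapy.Spider):
        # ...
        def parse_result(self, response):
            # response magic filled response.body from the 'html' key,
            # so the usual selectors work on the rendered HTML
            title = response.css('title::text').get()
            if isinstance(response, SplashJsonResponse):
                # the full decoded JSON result is still available
                png_bytes = base64.b64decode(response.data['png'])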
Session Handling
----------------

Splash itself is stateless - each request starts from a clean state.
In order to support sessions the following is required:

1. the client (Scrapy) must send current cookies to Splash;
2. the Splash script should make requests using these cookies and update
   them from HTTP response headers or JavaScript code;
3. updated cookies should be sent back to the client;
4. the client should merge current cookies with the updated cookies.

For (2) and (3) Splash provides the ``splash:get_cookies()`` and
``splash:init_cookies()`` methods which can be used in Splash Lua scripts.

scrapy-splash provides helpers for (1) and (4): to send current cookies
in the 'cookies' field and merge cookies back from the 'cookies' response
field, set ``request.meta['splash']['session_id']`` to the session
identifier. If you only want a single session, use the same ``session_id``
for all requests; any value like '1' or 'foo' is fine.
For scrapy-splash session handling to work you must use the ``/execute``
endpoint and a Lua script which accepts a 'cookies' argument and returns a
'cookies' field in the result:

.. code:: lua

    function main(splash)
        splash:init_cookies(splash.args.cookies)
        -- ... your script
        return {
            cookies = splash:get_cookies(),
            -- ... other results, e.g. html
        }
    end
SplashRequest sets ``session_id`` automatically for the ``/execute``
endpoint, i.e. cookie handling is enabled by default if you use
SplashRequest, the ``/execute`` endpoint and a compatible Lua rendering
script.

If you want to start from the same set of cookies, but then 'fork' sessions,
set ``request.meta['splash']['new_session_id']`` in addition to
``session_id``. Request cookies will be fetched from the ``session_id``
cookiejar, but response cookies will be merged back into the
``new_session_id`` cookiejar.

The standard Scrapy ``cookies`` argument can be used with ``SplashRequest``
to add cookies to the current Splash cookiejar.
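For illustration, 'forking' a session with the raw meta API could look like
the following sketch (the session ids '1' and '2' are arbitrary strings, and
``script`` is assumed to be a session-aware Lua script like the one above):

.. code:: python

    import scrapy

    class MySpider(scrapy.Spider):
        # ...
        def start_requests(self):
            yield scrapy.Request(self.start_urls[0], self.parse_result, meta={
                'splash': {
                    'endpoint': 'execute',
                    'args': {'lua_source': script},
                    # read request cookies from the '1' cookiejar ...
                    'session_id': '1',
                    # ... and merge response cookies into a separate
                    # '2' cookiejar, leaving '1' untouched
                    'new_session_id': '2',
                }
            })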
Examples
--------

Get HTML contents:

.. code:: python

    import scrapy
    from scrapy_splash import SplashRequest

    class MySpider(scrapy.Spider):
        name = "MySpider"

        start_urls = ["http://example.com", "http://example.com/foo"]

        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(url, self.parse, args={'wait': 0.5})

        def parse(self, response):
            # response.body is a result of render.html call; it
            # contains HTML processed by a browser.
            # ...
Get HTML contents and a screenshot:

.. code:: python

    import json
    import base64

    import scrapy
    from scrapy_splash import SplashRequest

    class MySpider(scrapy.Spider):

        # ...
            splash_args = {
                'wait': 1,
                'html': 1,
                'png': 1,
                'width': 600,
                'render_all': 1,
            }
            yield SplashRequest(url, self.parse_result, endpoint='render.json',
                                args=splash_args)

        # ...
        def parse_result(self, response):
            # magic responses are turned ON by default,
            # so the result under the 'html' key is available as response.body
            html = response.body

            # you can also query the html result as usual
            title = response.css('title').extract_first()

            # full decoded JSON data is available as response.data:
            png_bytes = base64.b64decode(response.data['png'])

            # ...
Run a simple `Splash Lua Script`_:

.. code:: python

    import json
    import base64

    import scrapy
    from scrapy_splash import SplashRequest

    class MySpider(scrapy.Spider):

        # ...
        script = """
        function main(splash)
            assert(splash:go(splash.args.url))
            return splash:evaljs("document.title")
        end
        """

        # ...
            yield SplashRequest(url, self.parse_result, endpoint='execute',
                                args={'lua_source': script})

        # ...
        def parse_result(self, response):
            doc_title = response.text
            # ...
A more complex `Splash Lua Script`_ example - get a screenshot of an HTML
element by its CSS selector (it requires Splash 2.1+).
Note how arguments are passed to the script:

.. code:: python

    import json
    import base64

    import scrapy
    from scrapy_splash import SplashRequest

    script = """
    -- Arguments:
    -- * url - URL to render;
    -- * css - CSS selector to render;
    -- * pad - screenshot padding size.

    -- this function adds padding around region
    function pad(r, pad)
        return {r[1]-pad, r[2]-pad, r[3]+pad, r[4]+pad}
    end

    -- main script
    function main(splash)

        -- this function returns element bounding box
        local get_bbox = splash:jsfunc([[
            function(css) {
                var el = document.querySelector(css);
                var r = el.getBoundingClientRect();
                return [r.left, r.top, r.right, r.bottom];
            }
        ]])

        assert(splash:go(splash.args.url))
        assert(splash:wait(0.5))

        -- don't crop image by a viewport
        splash:set_viewport_full()

        local region = pad(get_bbox(splash.args.css), splash.args.pad)
        return splash:png{region=region}
    end
    """

    class MySpider(scrapy.Spider):

        # ...
            yield SplashRequest(url, self.parse_element_screenshot,
                endpoint='execute',
                args={
                    'lua_source': script,
                    'pad': 32,
                    'css': 'a.title'
                }
            )

        # ...
        def parse_element_screenshot(self, response):
            image_data = response.body  # binary image data in PNG format
            # ...
Use a Lua script to get an HTML response with cookies, headers, body
and method set to correct values; the ``lua_source`` argument value is
cached on the Splash server and is not sent with each request (it requires
Splash 2.1+):

.. code:: python

    import scrapy
    from scrapy_splash import SplashRequest

    script = """
    function main(splash)
        splash:init_cookies(splash.args.cookies)
        assert(splash:go{
            splash.args.url,
            headers=splash.args.headers,
            http_method=splash.args.http_method,
            body=splash.args.body,
        })
        assert(splash:wait(0.5))

        local entries = splash:history()
        local last_response = entries[#entries].response
        return {
            url = splash:url(),
            headers = last_response.headers,
            http_status = last_response.status,
            cookies = splash:get_cookies(),
            html = splash:html(),
        }
    end
    """

    class MySpider(scrapy.Spider):

        # ...
            yield SplashRequest(url, self.parse_result,
                endpoint='execute',
                cache_args=['lua_source'],
                args={'lua_source': script},
                headers={'X-My-Header': 'value'},
            )

        # ...
        def parse_result(self, response):
            # here response.body contains result HTML;
            # response.headers are filled with headers from last
            # web page loaded to Splash;
            # cookies from all responses and from JavaScript are collected
            # and put into Set-Cookie response header, so that Scrapy
            # can remember them.
            # ...

.. _Splash Lua Script: http://splash.readthedocs.org/en/latest/scripting-tutorial.html
HTTP Basic Auth
===============

If you need to use HTTP Basic Authentication to access Splash, use the
``SPLASH_USER`` and ``SPLASH_PASS`` optional settings::

    SPLASH_USER = 'user'
    SPLASH_PASS = 'userpass'

Another option is ``meta['splash']['splash_headers']``: it allows you to set
custom headers which are sent to the Splash server; add an Authorization
header to ``splash_headers`` if you want to change credentials per-request::

    import scrapy
    from w3lib.http import basic_auth_header

    class MySpider(scrapy.Spider):
        # ...
        def start_requests(self):
            auth = basic_auth_header('user', 'userpass')
            yield SplashRequest(url, self.parse,
                                splash_headers={'Authorization': auth})
WARNING: Don't use `HttpAuthMiddleware`_ (i.e. the ``http_user`` /
``http_pass`` spider attributes) for Splash authentication: if you
occasionally send a non-Splash request from your spider, you may expose
Splash credentials to a remote website, as HttpAuthMiddleware sets
credentials for all requests unconditionally.

.. _HttpAuthMiddleware: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpauth
Why not use the Splash HTTP API directly?
=========================================

The obvious alternative to scrapy-splash would be to send requests directly
to the Splash `HTTP API`_. Take a look at the example below and make
sure to read the observations after it:
.. code:: python

    import json

    import scrapy
    from scrapy.http.headers import Headers

    RENDER_HTML_URL = "http://127.0.0.1:8050/render.html"

    class MySpider(scrapy.Spider):
        start_urls = ["http://example.com", "http://example.com/foo"]

        def start_requests(self):
            for url in self.start_urls:
                body = json.dumps({"url": url, "wait": 0.5}, sort_keys=True)
                headers = Headers({'Content-Type': 'application/json'})
                yield scrapy.Request(RENDER_HTML_URL, self.parse, method="POST",
                                     body=body, headers=headers)

        def parse(self, response):
            # response.body is a result of render.html call; it
            # contains HTML processed by a browser.
            # ...
It works and is easy enough, but there are some issues that you should be
aware of:

- There is a bit of boilerplate.

- As seen by Scrapy, we're sending requests to ``RENDER_HTML_URL`` instead
  of the target URLs. It affects concurrency and politeness settings:
  ``CONCURRENT_REQUESTS_PER_DOMAIN``, ``DOWNLOAD_DELAY``, etc. could behave
  in unexpected ways since delays and concurrency settings are no longer
  per-domain.

- As seen by Scrapy, response.url is the URL of the Splash server.
  scrapy-splash fixes it to be the URL of the requested page. The "real" URL
  is still available as ``response.real_url``. scrapy-splash also makes it
  possible to handle ``response.status`` and ``response.headers``
  transparently on the Scrapy side.

- Some options depend on each other - for example, if you use the timeout_
  Splash option then you may want to set the ``download_timeout``
  scrapy.Request meta key as well.

- It is easy to get it subtly wrong - e.g. if you don't use the
  ``sort_keys=True`` argument when preparing the JSON body then the binary
  POST body content could vary even if all keys and values are the same,
  which means the dupefilter and cache will work incorrectly.

- The default Scrapy duplication filter doesn't take Splash specifics into
  account. For example, if a URL is sent in a JSON POST request body, Scrapy
  will compute the request fingerprint without canonicalizing this URL.

- Splash Bad Request (HTTP 400) errors are hard to debug because by default
  the response content is not displayed by Scrapy. SplashMiddleware logs the
  content of HTTP 400 Splash responses by default (this can be turned off by
  setting ``SPLASH_LOG_400 = False``).

- Cookie handling is tedious to implement, and you can't use Scrapy's
  built-in Cookie middleware to handle cookies when working with Splash.

- Large Splash arguments which don't change with every request
  (e.g. ``lua_source``) may take a lot of space when saved to Scrapy disk
  request queues. ``scrapy-splash`` provides a way to store such static
  parameters only once.

- Splash 2.1+ provides a way to save network traffic by caching large
  static arguments on the server, but it requires client support: the client
  should send proper ``save_args`` and ``load_args`` values and handle HTTP
  498 responses.

scrapy-splash utilities allow handling such edge cases and reduce
the boilerplate.
.. _HTTP API: http://splash.readthedocs.org/en/latest/api.html
.. _timeout: http://splash.readthedocs.org/en/latest/api.html#arg-timeout
Getting help
============

- for problems with rendering pages read the "`Splash FAQ`_" page;
- for Scrapy-related bugs take a look at the "`reporting Scrapy bugs`_" page.

The best approach to get any other help is to ask a question on
`Stack Overflow`_.
.. _reporting Scrapy bugs: https://doc.scrapy.org/en/master/contributing.html#reporting-bugs
.. _Splash FAQ: http://splash.readthedocs.io/en/stable/faq.html#website-is-not-rendered-correctly
.. _Stack Overflow: https://stackoverflow.com/questions/tagged/scrapy-splash?sort=frequent&pageSize=15&mixed=1
Contributing
============

Source code and bug tracker are on GitHub:
https://github.com/scrapy-plugins/scrapy-splash

To run tests, install the "tox" Python package and then run the ``tox``
command from the source checkout.

To run integration tests, start Splash and set the SPLASH_URL env variable
to the Splash address before running the ``tox`` command::

    docker run -d --rm -p8050:8050 scrapinghub/splash:3.0
    SPLASH_URL=http://127.0.0.1:8050 tox -e py36
Changes
=======

0.10.1 (2025-01-27)
-------------------

0.10.0 (2025-01-21)
-------------------

- Removed official support for Python 3.7 and 3.8, and added official support
  for Python 3.12 and 3.13.

- Added support for Scrapy 2.12+. This includes deprecating
  ``SplashAwareDupeFilter`` and ``SplashAwareFSCacheStorage`` in favor of the
  corresponding built-in, default Scrapy components, and instead using the
  new ``SplashRequestFingerprinter`` component to ensure request
  fingerprinting for Splash requests stays the same, now for every Scrapy
  component doing request fingerprinting and not only for duplicate filtering
  and HTTP caching.
0.9.0 (2023-02-03)
------------------

- Removed official support for Python 2.7, 3.4, 3.5 and 3.6, and added
  official support for Python 3.9, 3.10 and 3.11.

- Deprecated ``SplashJsonResponse.body_as_unicode()``, to be replaced by
  ``SplashJsonResponse.text``.

- Removed calls to the obsolete ``to_native_str``, removed in Scrapy 2.8.
0.8.0 (2021-10-05)
------------------

- Security bug fix:

  If you use HttpAuthMiddleware_ (i.e. the ``http_user`` and ``http_pass``
  spider attributes) for Splash authentication, any non-Splash request will
  expose your credentials to the request target. This includes
  ``robots.txt`` requests sent by Scrapy when the ``ROBOTSTXT_OBEY`` setting
  is set to ``True``.

  Use the new ``SPLASH_USER`` and ``SPLASH_PASS`` settings instead to set
  your Splash authentication credentials safely.

  .. _HttpAuthMiddleware: http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.httpauth

- Responses now expose the HTTP status code and headers from Splash as
  ``response.splash_response_status`` and
  ``response.splash_response_headers`` (#158)

- The ``meta`` argument passed to the ``scrapy_splash.request.SplashRequest``
  constructor is no longer modified (#164)

- Website responses with 400 or 498 as HTTP status code are no longer
  handled as the equivalent Splash responses (#158)

- Cookies are no longer sent to Splash itself (#156)

- ``scrapy_splash.utils.dict_hash`` now also works with ``obj=None`` (225793b)

- Our test suite now includes integration tests (#156) and tests can be run
  in parallel (6fb8c41)

- There’s a new ‘Getting help’ section in the ``README.rst`` file (#161,
  #162), the documentation about ``SPLASH_SLOT_POLICY`` has been improved
  (#157) and a typo has been fixed (#121)

- Made some internal improvements (ee5000d, 25de545, 2aaa79d)
0.7.2 (2017-03-30)
------------------

- fixed issue with response type detection.

0.7.1 (2016-12-20)
------------------

- Scrapy 1.0.x support is back;
- README updates.
0.7 (2016-05-16)
----------------

- The ``SPLASH_COOKIES_DEBUG`` setting allows logging cookies sent and
  received to/from Splash in ``cookies`` request/response fields. It is
  similar to Scrapy's built-in ``COOKIES_DEBUG``, but works for Splash
  requests;
- README cleanup.
0.6.1 (2016-04-29)
------------------

- Warning about HTTP methods is no longer logged for non-Splash requests.
0.6 (2016-04-20)
----------------

- ``SplashAwareDupeFilter`` and ``splash_request_fingerprint`` are improved:
  they now canonicalize URLs and take URL fragments into account;
- ``cache_args`` value fingerprints are now calculated faster.
0.5 (2016-04-18)
----------------

The ``cache_args`` SplashRequest argument and the
``request.meta['splash']['cache_args']`` key allow saving network traffic
and disk storage by not storing duplicate Splash arguments in disk request
queues and not sending them to Splash multiple times. This feature requires
Splash 2.1+.

To upgrade from v0.4 enable ``SplashDeduplicateArgsMiddleware`` in
``settings.py``::

    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
0.4 (2016-04-14)
----------------

- SplashFormRequest class is added; it is a variant of FormRequest which uses
  Splash;
- Splash parameters are no longer stored in request.meta twice; this change
  should decrease disk queues data size;
- SplashMiddleware now increases request priority when rescheduling the
  request; this should decrease disk queue data size and help with stale
  cookie problems.
0.3 (2016-04-11)
Package is renamed from scrapyjs
to scrapy-splash
.
An easiest way to upgrade is to replace scrapyjs
imports with
scrapy_splash
and update settings.py
with new defaults
(check the README).
There are many new helpers to handle JavaScript rendering transparently;
the recommended way is now to use scrapy_splash.SplashRequest
instead
of request.meta['splash']
. Please make sure to read the README if
you're upgrading from scrapyjs - you may be able to drop some code from your
project, especially if you want to access response html, handle cookies
and headers.
- new SplashRequest class; it can be used as a replacement for scrapy.Request
to provide a better integration with Splash;
- added support for POST requests;
- SplashResponse, SplashTextResponse and SplashJsonResponse allow to
handle Splash responses transparently, taking care of response.url,
response.body, response.headers and response.status. SplashJsonResponse
allows to access decoded response JSON data as
response.data
. - cookie handling improvements: it is possible to handle Scrapy and Splash
cookies transparently; current cookiejar is exposed as response.cookiejar;
- headers are passed to Splash by default;
- URLs with fragments are handled automatically when using SplashRequest;
- logging is improved:
SplashRequest.__repr__
shows both requested URL
and Splash URL; - in case of Splash HTTP 400 errors the response is logged by default;
- an issue with dupefilters is fixed: previously the order of keys in
JSON request body could vary, making requests appear as non-duplicates;
- it is now possible to pass custom headers to Splash server itself;
- test coverage reports are enabled.
0.2 (2016-03-26)
----------------

0.1.1 (2015-03-16)
------------------

Fixed fingerprint calculation for non-string meta values.

0.1 (2015-02-28)
----------------

Initial release.