
.. image:: https://img.shields.io/pypi/pyversions/python-dataservice.svg
   :alt: Python Versions

Lightweight async data gathering for Python.
DataService is a lightweight web scraping and general-purpose data gathering library for Python.
Designed for simplicity, it's built upon common web scraping and data gathering patterns.
No complex API to learn, just standard Python idioms.
Dual synchronous and asynchronous support.

Please note that DataService requires Python 3.11 or higher.
You can install DataService via pip:

.. code-block:: bash

   pip install python-dataservice

You can also install the optional ``playwright`` dependency to use the ``PlaywrightClient``:

.. code-block:: bash

   pip install python-dataservice[playwright]

To install Playwright, run:

.. code-block:: bash

   python -m playwright install

or simply:

.. code-block:: bash

   playwright install
To start, create a ``DataService`` instance with an ``Iterable`` of ``Request`` objects.
This setup provides you with an ``Iterator`` of data objects that you can then iterate over or convert to a ``list``, a ``tuple``, a ``pd.DataFrame`` or any data structure of choice.
.. code-block:: python

   from dataservice import DataService, HttpXClient, Request  # top-level imports assumed; see the API reference

   # parse_books_page is defined below
   start_requests = [Request(url="https://books.toscrape.com/index.html", callback=parse_books_page, client=HttpXClient())]
   data_service = DataService(start_requests)
   data = tuple(data_service)
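
Since each data item in this example is a plain ``dict``, the results can be loaded straight into a DataFrame; a minimal sketch, assuming ``pandas`` is installed:

.. code-block:: python

   import pandas as pd

   # Each yielded dict becomes one row in the DataFrame.
   df = pd.DataFrame(data)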
A ``Request`` is a ``Pydantic`` model that includes the URL to fetch, a reference to the ``client`` callable, and a ``callback`` function for parsing the ``Response`` object.

The client can be any async Python callable that accepts a ``Request`` object and returns a ``Response`` object.
``DataService`` provides an ``HttpXClient`` class by default, which is based on the ``httpx`` library, but you are free to use your own custom async client, as sketched below.
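
For illustration, a custom client can be as small as the following sketch; the ``Response`` constructor arguments shown here are an assumption, so check the API reference for the actual signature:

.. code-block:: python

   import httpx

   # A client is just an async callable: Request in, Response out.
   async def my_custom_client(request: Request) -> Response:
       async with httpx.AsyncClient() as client:
           resp = await client.get(request.url)
       # Assumed constructor arguments; the real Response fields may differ.
       return Response(request=request, text=resp.text)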
The callback function processes a ``Response`` object and returns either ``data`` or additional ``Request`` objects.
In this trivial example we are requesting the `Books to Scrape <https://books.toscrape.com/index.html>`_ homepage and parsing the number of books on the page.

Example ``parse_books_page`` function:
.. code-block:: python

   def parse_books_page(response: Response):
       articles = response.html.find_all("article", {"class": "product_pod"})
       return {
           "url": response.url,
           "title": response.html.title.get_text(strip=True),
           "articles": len(articles),
       }
This function takes a ``Response`` object, which has an ``html`` attribute (a ``BeautifulSoup`` object of the HTML content). The function parses the HTML content and returns data.

The callback function can ``return`` or ``yield`` either ``data`` (a ``dict`` or a ``pydantic.BaseModel``) or more ``Request`` objects, as in the sketch below.
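
For example, a callback that yields both data and follow-up requests could look roughly like this; ``parse_book_detail`` is a hypothetical detail-page parser, and the relative-link resolution is an assumption:

.. code-block:: python

   from urllib.parse import urljoin

   def parse_listing_page(response: Response):
       # Yield a data item for the listing page itself.
       yield {"url": response.url, "title": response.html.title.get_text(strip=True)}
       # Then yield a follow-up Request for each product link on the page.
       for article in response.html.find_all("article", {"class": "product_pod"}):
           href = article.find("a")["href"]
           yield Request(
               url=urljoin(str(response.url), href),  # resolve relative links
               callback=parse_book_detail,  # hypothetical detail-page parser
               client=HttpXClient(),
           )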
If you have used ``Scrapy`` before, you will find this pattern familiar.
For more examples and advanced usage, check out the `examples <https://dataservice.readthedocs.io/en/latest/examples.html>`_ section.

For a detailed API reference, check out the `API <https://dataservice.readthedocs.io/en/latest/modules.html>`_ section.