Scrapfly SDK
Installation
pip install scrapfly-sdk
You can also install extra dependencies:
- pip install "scrapfly-sdk[seepdup]" for performance improvements
- pip install "scrapfly-sdk[concurrency]" for concurrency out of the box (asyncio / thread)
- pip install "scrapfly-sdk[scrapy]" for scrapy integration
- pip install "scrapfly-sdk[all]" for everything!
To use the built-in HTML parser (via the ScrapeApiResponse.selector property), either parsel or scrapy must also be installed.
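As a quick illustration, here is a minimal sketch of scraping a page and querying the result with the selector property; the target URL is only illustrative, and parsel (or scrapy) is assumed to be installed:
import os
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="Your Scrapfly API key")
result = client.scrape(ScrapeConfig(url="https://web-scraping.dev/products"))
# result.selector is a parsel Selector built from the scraped HTML
print(result.selector.css("title::text").get())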
For usage references and examples, please check out the /examples folder in this repository.
This SDK covers the following Scrapfly API endpoints:
Integrations
The Scrapfly Python SDK is integrated with LlamaIndex and LangChain. Both frameworks allow training Large Language Models (LLMs) using augmented context.
This augmented context is built by training LLMs on top of private or domain-specific data for common use cases:
- Question-Answering Chatbots (commonly referred to as RAG systems, which stands for "Retrieval-Augmented Generation")
- Document Understanding and Extraction
- Autonomous Agents that can perform research and take actions
In the context of web scraping, web page data can be extracted as Text or Markdown using Scrapfly's format feature to train LLMs with the scraped data.
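For context, below is a minimal sketch of requesting a page as Markdown directly through the SDK; the format parameter value and the .content accessor are assumptions about the SDK's scrape options, and the URL is illustrative:
from scrapfly import ScrapflyClient, ScrapeConfig

client = ScrapflyClient(key="Your Scrapfly API key")
# format="markdown" (assumed here) asks the API to return the page converted to Markdown
result = client.scrape(ScrapeConfig(url="https://web-scraping.dev/products", format="markdown"))
# the scraped content (assumed accessible via .content) is ready to feed into an LLM pipeline
print(result.content)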
LlamaIndex
Installation
Install llama-index, llama-index-readers-web, and scrapfly-sdk using pip:
pip install llama-index llama-index-readers-web scrapfly-sdk
Usage
Scrapfly is available on LlamaIndex as a data connector, known as a Reader. This reader gathers web page data into a Document representation, which can be used with the LLM directly. Below is an example of building a RAG system using LlamaIndex and scraped data. See the LlamaIndex use cases for more.
import os
from llama_index.readers.web import ScrapflyReader
from llama_index.core import VectorStoreIndex

scrapfly_reader = ScrapflyReader(
    api_key="Your Scrapfly API key",
    ignore_scrape_failures=True,
)
documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"]
)

os.environ['OPENAI_API_KEY'] = "Your OpenAI Key"
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What is the flavor of the dark energy potion?")
print(response)
"The flavor of the dark energy potion is bold cherry cola."
The load_data function accepts a ScrapeConfig object to use the desired Scrapfly API parameters:
from llama_index.readers.web import ScrapflyReader

scrapfly_reader = ScrapflyReader(
    api_key="Your Scrapfly API key",
    ignore_scrape_failures=True,
)

scrapfly_scrape_config = {
    "asp": True,
    "render_js": True,
    "proxy_pool": "public_residential_pool",
    "country": "us",
    "auto_scroll": True,
    "js": "",
}

documents = scrapfly_reader.load_data(
    urls=["https://web-scraping.dev/products"],
    scrape_config=scrapfly_scrape_config,
    scrape_format="markdown",
)
LangChain
Installation
Install langchain, langchain-community, and scrapfly-sdk using pip:
pip install langchain langchain-community scrapfly-sdk
Usage
Scrapfly is available on LangChain as a document loader, known as a Loader. This loader gathers web page data into a Document representation, which can be used with the LLM after a few operations. Below is an example of building a RAG system with LangChain using scraped data; see the LangChain tutorials for further use cases.
import os
from langchain import hub
from langchain_chroma import Chroma
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import ScrapflyLoader

scrapfly_loader = ScrapflyLoader(
    ["https://web-scraping.dev/products"],
    api_key="Your Scrapfly API key",
    continue_on_failure=True,
)
documents = scrapfly_loader.load()

os.environ["OPENAI_API_KEY"] = "Your OpenAI key"
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(documents)
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

model = ChatOpenAI()
prompt = hub.pull("rlm/rag-prompt")
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

response = rag_chain.invoke("What is the flavor of the dark energy potion?")
print(response)
"The flavor of the Dark Energy Potion is bold cherry cola."
To use the full Scrapfly features with LangChain, pass a ScrapeConfig object to the ScrapflyLoader:
from langchain_community.document_loaders import ScrapflyLoader

scrapfly_scrape_config = {
    "asp": True,
    "render_js": True,
    "proxy_pool": "public_residential_pool",
    "country": "us",
    "auto_scroll": True,
    "js": "",
}

scrapfly_loader = ScrapflyLoader(
    ["https://web-scraping.dev/products"],
    api_key="Your Scrapfly API key",
    continue_on_failure=True,
    scrape_config=scrapfly_scrape_config,
    scrape_format="markdown",
)
documents = scrapfly_loader.load()
print(documents)
Get Your API Key
You can create a free account on Scrapfly to get your API Key.
Migration
Migrate from 0.7.x to 0.8
The asyncio-pool dependency has been dropped.
scrapfly.concurrent_scrape is now an async generator. If concurrency is None or not defined, the maximum concurrency allowed by your current subscription is used.
async for result in scrapfly.concurrent_scrape(concurrency=10, scrape_configs=[ScrapeConfig(...), ...]):
    print(result)
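For reference, here is a minimal runnable sketch of driving the generator from an asyncio entrypoint; the client instance and target URLs are illustrative assumptions:
import asyncio
from scrapfly import ScrapflyClient, ScrapeConfig

async def main():
    client = ScrapflyClient(key="Your Scrapfly API key")
    # illustrative targets; any list of ScrapeConfig objects works
    configs = [
        ScrapeConfig(url="https://web-scraping.dev/products"),
        ScrapeConfig(url="https://web-scraping.dev/reviews"),
    ]
    # omitting concurrency uses the maximum allowed by the current subscription
    async for result in client.concurrent_scrape(scrape_configs=configs):
        print(result)

asyncio.run(main())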
The brotli argument is deprecated and will be removed in the next minor release. In most cases it provides no size benefit over gzip and uses more CPU.
What's new
0.8.x
- Better error logging
- Async improvements for concurrent scraping with asyncio
- Scrapy media pipelines are now supported out of the box