# langchain-scrapegraph
Library for extracting structured data from websites using ScrapeGraphAI
Supercharge your LangChain agents with AI-powered web scraping capabilities. LangChain-ScrapeGraph provides a seamless integration between LangChain and ScrapeGraph AI, enabling your agents to extract structured data from websites using natural language.
If you are looking for a quick way to integrate ScrapeGraph into your system, check out our powerful API!
We offer SDKs in both Python and Node.js, making it easy to integrate into your projects. Check them out below:
| SDK | Language | GitHub Link |
|---|---|---|
| Python SDK | Python | scrapegraph-py |
| Node.js SDK | Node.js | scrapegraph-js |
```bash
pip install langchain-scrapegraph
```
Convert any webpage into clean, formatted markdown.
```python
from langchain_scrapegraph.tools import MarkdownifyTool

tool = MarkdownifyTool()
markdown = tool.invoke({"website_url": "https://example.com"})
print(markdown)
```
Extract structured data from any webpage using natural language prompts.
```python
from langchain_scrapegraph.tools import SmartscraperTool

# Initialize the tool (uses SGAI_API_KEY from the environment)
tool = SmartscraperTool()

# Extract information using natural language
result = tool.invoke({
    "website_url": "https://www.example.com",
    "user_prompt": "Extract the main heading and first paragraph"
})
print(result)
```
You can define the structure of the output using Pydantic models:
```python
from typing import List

from pydantic import BaseModel, Field

from langchain_scrapegraph.tools import SmartscraperTool

class WebsiteInfo(BaseModel):
    title: str = Field(description="The main title of the webpage")
    description: str = Field(description="The main description or first paragraph")
    urls: List[str] = Field(description="The URLs inside the webpage")

# Initialize with schema
tool = SmartscraperTool(llm_output_schema=WebsiteInfo)

# The output will conform to the WebsiteInfo schema
result = tool.invoke({
    "website_url": "https://www.example.com",
    "user_prompt": "Extract the website information"
})
print(result)
# {
#     "title": "Example Domain",
#     "description": "This domain is for use in illustrative examples...",
#     "urls": ["https://www.iana.org/domains/example"]
# }
```
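Because the output conforms to the schema, you can also exercise the model locally with no API call. A minimal sketch using plain Pydantic v2 (`model_validate` and `ValidationError` are standard Pydantic, independent of the tool):

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError

class WebsiteInfo(BaseModel):
    title: str = Field(description="The main title of the webpage")
    description: str = Field(description="The main description or first paragraph")
    urls: List[str] = Field(description="The URLs inside the webpage")

# Validate a dict shaped like the tool's output against the schema
info = WebsiteInfo.model_validate({
    "title": "Example Domain",
    "description": "This domain is for use in illustrative examples...",
    "urls": ["https://www.iana.org/domains/example"],
})
print(info.title)  # Example Domain

# A payload missing required fields fails validation with a clear error
try:
    WebsiteInfo.model_validate({"title": "Only a title"})
except ValidationError as exc:
    print("validation failed:", exc.error_count(), "errors")
```

This is a convenient way to unit-test your schemas before wiring them into the tool.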
Extract information from HTML content using AI.
```python
from langchain_scrapegraph.tools import LocalscraperTool

tool = LocalscraperTool()
result = tool.invoke({
    "user_prompt": "Extract all contact information",
    "website_html": "<html>...</html>"
})
print(result)
```
You can define the structure of the output using Pydantic models:
```python
from typing import Optional

from pydantic import BaseModel, Field

from langchain_scrapegraph.tools import LocalscraperTool

class CompanyInfo(BaseModel):
    name: str = Field(description="The company name")
    description: str = Field(description="The company description")
    email: Optional[str] = Field(description="Contact email if available")
    phone: Optional[str] = Field(description="Contact phone if available")

# Initialize with schema
tool = LocalscraperTool(llm_output_schema=CompanyInfo)

html_content = """
<html>
<body>
    <h1>TechCorp Solutions</h1>
    <p>We are a leading AI technology company.</p>
    <div class="contact">
        <p>Email: contact@techcorp.com</p>
        <p>Phone: (555) 123-4567</p>
    </div>
</body>
</html>
"""

# The output will conform to the CompanyInfo schema
result = tool.invoke({
    "website_html": html_content,
    "user_prompt": "Extract the company information"
})
print(result)
# {
#     "name": "TechCorp Solutions",
#     "description": "We are a leading AI technology company.",
#     "email": "contact@techcorp.com",
#     "phone": "(555) 123-4567"
# }
```
```python
from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI

from langchain_scrapegraph.tools import SmartscraperTool

# Initialize tools
tools = [
    SmartscraperTool(),
]

# Create an agent
agent = initialize_agent(
    tools=tools,
    llm=ChatOpenAI(temperature=0),
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# Use the agent
response = agent.run("""
Visit example.com, summarize the content, and extract the main heading and first paragraph.
""")
```
Set your ScrapeGraph API key in your environment:
```bash
export SGAI_API_KEY="your-api-key-here"
```
Or set it programmatically:
```python
import os

os.environ["SGAI_API_KEY"] = "your-api-key-here"
```
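Tools fail at call time if the key is missing, so a fail-fast check at startup can save debugging. The helper below is a hypothetical convenience (not part of the library), using only the standard `os` module:

```python
import os

def require_sgai_api_key() -> str:
    """Return the ScrapeGraph API key, raising a clear error if it is unset."""
    # Hypothetical helper: SGAI_API_KEY is the variable the tools read
    key = os.environ.get("SGAI_API_KEY")
    if not key:
        raise RuntimeError(
            "SGAI_API_KEY is not set; export it or assign os.environ['SGAI_API_KEY']."
        )
    return key

# For demonstration only: provide a placeholder if no key is already set
os.environ.setdefault("SGAI_API_KEY", "your-api-key-here")
print(require_sgai_api_key())
```

Calling this once before constructing any tool turns a missing key into an immediate, descriptive error.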
This project is licensed under the MIT License - see the LICENSE file for details.
This project is built on top of:
Made with ❤️ by ScrapeGraph AI