langchain-scrapegraph


Name: langchain-scrapegraph
Version: 1.2.0
Home page: https://scrapegraphai.com/
Summary: Library for extracting structured data from websites using ScrapeGraphAI
Upload time: 2024-12-18 16:50:30
Maintainer: None
Docs URL: None
Author: Marco Perini
Requires Python: <4.0,>=3.10
License: MIT
Keywords: scrapegraph, ai, artificial intelligence, gpt, machine learning, natural language processing, nlp, openai, graph, llm, langchain, scrape, scrape graph
Requirements: No requirements were recorded.
# 🕷️🦜 langchain-scrapegraph

[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Python Support](https://img.shields.io/pypi/pyversions/langchain-scrapegraph.svg)](https://pypi.org/project/langchain-scrapegraph/)
[![Documentation](https://img.shields.io/badge/Documentation-Latest-green)](https://scrapegraphai.com/docs)

Supercharge your LangChain agents with AI-powered web scraping capabilities. LangChain-ScrapeGraph provides a seamless integration between [LangChain](https://github.com/langchain-ai/langchain) and [ScrapeGraph AI](https://scrapegraphai.com), enabling your agents to extract structured data from websites using natural language.

## 🔗 ScrapeGraph API & SDKs
If you are looking for a quick way to integrate ScrapeGraph into your system, check out our powerful API [here](https://dashboard.scrapegraphai.com/login)!

<p align="center">
  <img src="https://raw.githubusercontent.com/VinciGit00/Scrapegraph-ai/main/docs/assets/api-banner.png" alt="ScrapeGraph API Banner" style="width: 70%;">
</p>

We offer SDKs in both Python and Node.js, making it easy to integrate into your projects. Check them out below:

| SDK       | Language | GitHub Link                                                                 |
|-----------|----------|-----------------------------------------------------------------------------|
| Python SDK | Python   | [scrapegraph-py](https://github.com/ScrapeGraphAI/scrapegraph-sdk/tree/main/scrapegraph-py) |
| Node.js SDK | Node.js  | [scrapegraph-js](https://github.com/ScrapeGraphAI/scrapegraph-sdk/tree/main/scrapegraph-js) |

## 📦 Installation

```bash
pip install langchain-scrapegraph
```
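
To install the exact release documented on this page, pin the version:

```bash
pip install "langchain-scrapegraph==1.2.0"
```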

## 🛠️ Available Tools

### 📝 MarkdownifyTool
Convert any webpage into clean, formatted markdown.

```python
from langchain_scrapegraph.tools import MarkdownifyTool

tool = MarkdownifyTool()
markdown = tool.invoke({"website_url": "https://example.com"})

print(markdown)
```
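
As a small follow-up (continuing from the snippet above and assuming the tool returns the markdown as a plain string, as the `print` suggests), you can persist the output for later processing:

```python
from pathlib import Path

# Save the scraped markdown to disk for downstream processing
Path("example_com.md").write_text(markdown, encoding="utf-8")
```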

### 🔍 SmartscraperTool
Extract structured data from any webpage using natural language prompts.

```python
from langchain_scrapegraph.tools import SmartscraperTool

# Initialize the tool (uses SGAI_API_KEY from environment)
tool = SmartscraperTool()

# Extract information using natural language
result = tool.invoke({
    "website_url": "https://www.example.com",
    "user_prompt": "Extract the main heading and first paragraph"
})

print(result)
```

<details>
<summary>🔍 Using Output Schemas with SmartscraperTool</summary>

You can define the structure of the output using Pydantic models:

```python
from typing import List
from pydantic import BaseModel, Field
from langchain_scrapegraph.tools import SmartscraperTool

class WebsiteInfo(BaseModel):
    title: str = Field(description="The main title of the webpage")
    description: str = Field(description="The main description or first paragraph")
    urls: List[str] = Field(description="The URLs inside the webpage")

# Initialize with schema
tool = SmartscraperTool(llm_output_schema=WebsiteInfo)

# The output will conform to the WebsiteInfo schema
result = tool.invoke({
    "website_url": "https://www.example.com",
    "user_prompt": "Extract the website information"
})

print(result)
# {
#     "title": "Example Domain",
#     "description": "This domain is for use in illustrative examples...",
#     "urls": ["https://www.iana.org/domains/example"]
# }
```
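
If you want typed attribute access afterwards, you can validate the returned data back into the model; a small sketch, assuming the tool returns a plain dict shaped like the commented output above:

```python
# Pydantic v2: parse the dict back into the schema class
info = WebsiteInfo.model_validate(result)  # on Pydantic v1, use WebsiteInfo(**result)
print(info.title, info.urls)
```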
</details>

### 💻 LocalscraperTool
Extract information from HTML content using AI.

```python
from langchain_scrapegraph.tools import LocalscraperTool

tool = LocalscraperTool()
result = tool.invoke({
    "user_prompt": "Extract all contact information",
    "website_html": "<html>...</html>"
})

print(result)
```
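
`LocalscraperTool` works on HTML you already have, so it pairs naturally with any HTTP client. A minimal sketch using `requests` (an assumed extra dependency, not installed by this package):

```python
import requests

from langchain_scrapegraph.tools import LocalscraperTool

# Fetch the page yourself, then let the tool extract from the raw HTML
html = requests.get("https://www.example.com", timeout=30).text

tool = LocalscraperTool()  # reads SGAI_API_KEY from the environment
result = tool.invoke({
    "user_prompt": "Extract all contact information",
    "website_html": html,
})
print(result)
```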

<details>
<summary>🔍 Using Output Schemas with LocalscraperTool</summary>

You can define the structure of the output using Pydantic models:

```python
from typing import Optional
from pydantic import BaseModel, Field
from langchain_scrapegraph.tools import LocalscraperTool

class CompanyInfo(BaseModel):
    name: str = Field(description="The company name")
    description: str = Field(description="The company description")
    email: Optional[str] = Field(description="Contact email if available")
    phone: Optional[str] = Field(description="Contact phone if available")

# Initialize with schema
tool = LocalscraperTool(llm_output_schema=CompanyInfo)

html_content = """
<html>
    <body>
        <h1>TechCorp Solutions</h1>
        <p>We are a leading AI technology company.</p>
        <div class="contact">
            <p>Email: contact@techcorp.com</p>
            <p>Phone: (555) 123-4567</p>
        </div>
    </body>
</html>
"""

# The output will conform to the CompanyInfo schema
result = tool.invoke({
    "website_html": html_content,
    "user_prompt": "Extract the company information"
})

print(result)
# {
#     "name": "TechCorp Solutions",
#     "description": "We are a leading AI technology company.",
#     "email": "contact@techcorp.com",
#     "phone": "(555) 123-4567"
# }
```
</details>

## 🌟 Key Features

- 🐦 **LangChain Integration**: Seamlessly works with LangChain agents and chains
- 🔍 **AI-Powered Extraction**: Use natural language to describe what data to extract
- 📊 **Structured Output**: Get clean, structured data ready for your agents
- 🔄 **Flexible Tools**: Choose from multiple specialized scraping tools
- ⚡ **Async Support**: Built-in support for async operations (see the sketch below)
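
For example, a minimal async sketch, assuming these tools expose LangChain's standard `ainvoke` method (the tool and prompt are reused from the examples above):

```python
import asyncio

from langchain_scrapegraph.tools import SmartscraperTool


async def main():
    tool = SmartscraperTool()  # reads SGAI_API_KEY from the environment

    # ainvoke is the async counterpart of invoke on LangChain tools
    result = await tool.ainvoke({
        "website_url": "https://www.example.com",
        "user_prompt": "Extract the main heading and first paragraph",
    })
    print(result)


asyncio.run(main())
```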

## 💡 Use Cases

- 📖 **Research Agents**: Create agents that gather and analyze web data
- 📊 **Data Collection**: Automate structured data extraction from websites
- 📝 **Content Processing**: Convert web content into markdown for further processing
- 🔍 **Information Extraction**: Extract specific data points using natural language

## 🤖 Example Agent

```python
from langchain.agents import initialize_agent, AgentType
from langchain_scrapegraph.tools import SmartscraperTool
from langchain_openai import ChatOpenAI

# Initialize tools
tools = [
    SmartscraperTool(),
]

# Create an agent
agent = initialize_agent(
    tools=tools,
    llm=ChatOpenAI(temperature=0),
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# Use the agent
response = agent.run("""
    Visit example.com, summarize the content, and extract the main heading and first paragraph
""")
```
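
This agent needs two credentials at runtime: `SGAI_API_KEY` for the scraping tool (see Configuration below) and `OPENAI_API_KEY` for `ChatOpenAI`:

```bash
export SGAI_API_KEY="your-scrapegraph-api-key"
export OPENAI_API_KEY="your-openai-api-key"
```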

## ⚙️ Configuration

Set your ScrapeGraph API key in your environment:
```bash
export SGAI_API_KEY="your-api-key-here"
```

Or set it programmatically:
```python
import os
os.environ["SGAI_API_KEY"] = "your-api-key-here"
```

## 📚 Documentation

- [API Documentation](https://scrapegraphai.com/docs)
- [LangChain Documentation](https://python.langchain.com/docs/get_started/introduction.html)
- [Examples](examples/)

## 💬 Support & Feedback

- 📧 Email: support@scrapegraphai.com
- 💻 GitHub Issues: [Create an issue](https://github.com/ScrapeGraphAI/langchain-scrapegraph/issues)
- 🌟 Feature Requests: [Request a feature](https://github.com/ScrapeGraphAI/langchain-scrapegraph/issues/new)

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

This project is built on top of:
- [LangChain](https://github.com/langchain-ai/langchain)
- [ScrapeGraph AI](https://scrapegraphai.com)

---

Made with ❤️ by [ScrapeGraph AI](https://scrapegraphai.com)


            
