# scrapontologies

**Version**: 1.1.0 | **Requires Python**: >=3.9, <4.0 | **Released**: 2024-10-15

Library for extracting schemas and building ontologies from documents using LLMs.

![graph](docs/assets/graph_pyecharts.png)

The generated schemas are inferred directly from documents and can be used to define database tables or to build knowledge graphs.

## Features

- **Entity Extraction**: Automatically identifies and extracts entities from PDF files.
- **Schema Generation**: Constructs a schema based on the structure of the extracted entities.
- **Visualization**: Dynamically visualizes the generated schema.

## Quick Start

### Prerequisites

Before you begin, ensure you have the following installed on your system:

- **Python**: Make sure Python 3.9+ is installed.
- **Poppler**: This tool is necessary for converting PDF to images.

#### macOS Installation

To install Poppler on macOS, use the following command:

```bash
brew install poppler
```

#### Linux Installation

To install Poppler on Linux (Debian/Ubuntu), use the following command:

```bash
sudo apt-get install poppler-utils
```

#### Windows Installation

1. Download the latest Poppler release for Windows from [poppler releases](https://github.com/oschwartz10612/poppler-windows/releases/).
2. Extract the downloaded zip file to a location on your computer (e.g., `C:\Program Files\poppler`).
3. Add the `bin` directory of the extracted folder to your system's PATH environment variable.

To add it to PATH (a command-line alternative follows these steps):
1. Search for "Environment Variables" in the Start menu and open it.
2. Under "System variables", find and select "Path", then click "Edit".
3. Click "New" and add the path to the Poppler `bin` directory (e.g., `C:\Program Files\poppler\bin`).
4. Click "OK" to save the changes.
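
Alternatively, as a hedged sketch (adjust the example path to wherever you extracted Poppler), the user-level PATH can be appended from a Command Prompt with the built-in `setx` command. Note that `setx` truncates values longer than 1024 characters, so the GUI steps above are safer if your PATH is already long:

```cmd
:: Appends the Poppler bin directory to the user PATH (takes effect in new terminals)
setx PATH "%PATH%;C:\Program Files\poppler\bin"
```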

After installation, restart your terminal or command prompt for the changes to take effect.
If the Poppler tools are still not found after restarting the terminal, a full reboot may help.
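
To verify that Poppler is reachable on your PATH, ask one of its bundled utilities for its version (`pdftoppm` ships with every Poppler distribution):

```bash
# Prints the Poppler version if the tools are installed and on PATH
pdftoppm -v
```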

#### Installation
After installing the prerequisites, clone the repository and install its dependencies. You can then use scrape_schema to extract entities and their schema from PDFs:

```bash
git clone https://github.com/ScrapeGraphAI/scrape_schema
cd scrape_schema
pip install -r requirements.txt
```
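
The package is also published on PyPI as `scrapontologies` (version 1.1.0 at the time of writing), so installing the released version directly should work as well:

```bash
pip install scrapontologies
```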

## Usage

```python
from scrape_schema import FileExtractor, PDFParser, LLMClient  # LLMClient assumed to be exported here as well
import os
from dotenv import load_dotenv

load_dotenv()  # Load environment variables from a .env file
api_key = os.getenv("OPENAI_API_KEY")

# Path to your PDF file
pdf_path = "./test.pdf"

# Create an LLMClient instance
llm_client = LLMClient(api_key)

# Create a PDFParser instance with the LLMClient
pdf_parser = PDFParser(llm_client)

# Create a FileExtractor instance with the PDF parser
pdf_extractor = FileExtractor(pdf_path, pdf_parser)

# Generate a JSON schema describing the entities found in the PDF
entities = pdf_extractor.generate_json_schema()

print(entities)
```
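
The snippet above expects a `.env` file in the working directory holding your OpenAI key, for example (the value shown is a placeholder):

```bash
OPENAI_API_KEY=your-openai-api-key
```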
## Output
```json
{
  "ROOT": {
    "portfolio": {
      "type": "object",
      "properties": {
        "name": {
          "type": "string"
        },
        "series": {
          "type": "string"
        },
        "fees": {
          "type": "object",
          "properties": {
            "salesCharges": {
              "type": "string"
            },
            "fundExpenses": {
              "type": "object",
              "properties": {
                "managementExpenseRatio": {
                  "type": "string"
                },
                "tradingExpenseRatio": {
                  "type": "string"
                },
                "totalExpenses": {
                  "type": "string"
                }
              }
            },
            "trailingCommissions": {
              "type": "string"
            }
          }
        },
        "withdrawalRights": {
          "type": "object",
          "properties": {
            "timeLimit": {
              "type": "string"
            },
            "conditions": {
              "type": "array",
              "items": {
                "type": "string"
              }
            }
          }
        },
        "contactInformation": {
          "type": "object",
          "properties": {
            "companyName": {
              "type": "string"
            },
            "address": {
              "type": "string"
            },
            "phone": {
              "type": "string"
            },
            "email": {
              "type": "string"
            },
            "website": {
              "type": "string"
            }
          }
        },
        "yearByYearReturns": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "year": {
                "type": "string"
              },
              "return": {
                "type": "string"
              }
            }
          }
        },
        "bestWorstReturns": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "type": {
                "type": "string"
              },
              "return": {
                "type": "string"
              },
              "date": {
                "type": "string"
              },
              "investmentValue": {
                "type": "string"
              }
            }
          }
        },
        "averageReturn": {
          "type": "string"
        },
        "targetInvestors": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "taxInformation": {
          "type": "string"
        }
      }
    }
  }
}
```
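
As noted in the introduction, a generated schema like the one above can seed a knowledge graph. Below is a minimal, library-agnostic sketch (the file name and helper are illustrative, not part of the scrapontologies API) that walks the nested schema and emits parent/child edges:

```python
import json

def schema_to_edges(schema: dict, parent: str = "ROOT") -> list[tuple[str, str]]:
    """Recursively collect (parent, child) edges from a JSON-schema-like dict."""
    edges = []
    for key, value in schema.items():
        if not isinstance(value, dict):
            continue  # leaf values such as "type": "string"
        if key in ("properties", "items"):
            # Structural keywords: recurse into their contents under the same parent
            edges.extend(schema_to_edges(value, parent))
        else:
            edges.append((parent, key))
            edges.extend(schema_to_edges(value, key))
    return edges

# Assumes the output above was saved to disk, e.g. with json.dump(entities, f)
with open("schema.json") as f:
    schema = json.load(f)

for parent, child in schema_to_edges(schema["ROOT"]):
    print(f"{parent} -> {child}")
```

For the sample output this prints edges such as `ROOT -> portfolio`, `portfolio -> fees`, and `fees -> fundExpenses`, which map naturally onto graph nodes and relationships.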

## 🤝 Contributing

Contributions are welcome! Join our Discord server to discuss improvements and share your suggestions.

Please see the [contributing guidelines](https://github.com/VinciGit00/Scrapegraph-ai/blob/main/CONTRIBUTING.md).

[![My Skills](https://skillicons.dev/icons?i=discord)](https://discord.gg/uJN7TYcpNa)
[![My Skills](https://skillicons.dev/icons?i=linkedin)](https://www.linkedin.com/company/scrapegraphai/)
[![My Skills](https://skillicons.dev/icons?i=twitter)](https://twitter.com/scrapegraphai)

***
## Created by ScrapeGraphAI

![](docs/assets/scrapegraphai_logo.svg)
***
            
