validex


Namevalidex JSON
Version 0.0.1 PyPI version JSON
download
home_pagehttps://github.com/msoedov/validex
SummaryA Python package to extract data from unstructured into structured format
upload_time2024-08-04 20:19:52
maintainerAlexander Miasoiedov
docs_urlNone
authorAlexander Miasoiedov
requires_python<4.0,>=3.10
licenseMIT
keywords nlp extraction openai structured output parsing fastapi llm
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # ValidEx

ValidEx is a Python library that simplifies retrieval, extraction and training of structured data from various unstructured sources.

<p>
<img alt="GitHub Contributors" src="https://img.shields.io/github/contributors/msoedov/validex" />
<img alt="GitHub Last Commit" src="https://img.shields.io/github/last-commit/msoedov/validex" />
<img alt="" src="https://img.shields.io/github/repo-size/msoedov/validex" />
<img alt="GitHub Issues" src="https://img.shields.io/github/issues/msoedov/validex" />
<img alt="GitHub Pull Requests" src="https://img.shields.io/github/issues-pr/msoedov/validex" />
<img alt="Github License" src="https://img.shields.io/github/license/msoedov/validex" />
</p>

## 🏷 Features

- **Structured Data Extraction**: Parse and extract structured data from various unstructured sources including web pages, text files, PDFs, and more.
- **Heuristic data cleaning**  text normalization (case, whitespace, special characters), deduplication
- **Concurrency Support**: Efficiently process multiple data sources simultaneously.
- **Retry Mechanism**: Implement automatic retries for failed extraction attempts.
- **Hallucination check**: Implement strategies to detect and reduce LLM hallucinations in extracted data.
- **Fine-tuning Dataset Export**: Generate datasets in JSONL format for OpenAI chat fine-tuning.
- **Local Model Creation**: Build custom extraction models combining Named Entity Recognition (NER) and regular expressions.

## 📦 Installation

To get started with ValidEx, simply install the package using pip:

```shell
pip install validex
```

## ⛓️ Quick Start

```python
import validex
from pydantic import BaseModel


class Superhero(BaseModel):
    name: str
    age: int
    power: str
    enemies: list[str]


def main():
    app = validex.App()

    app.add("https://www.britannica.com/topic/list-of-superheroes-2024795")
    app.add("*.txt")
    app.add("*.pdf")
    app.add("*.md")

    superheroes = app.extract(Superhero)
    print(f"Extracted superheroes: {list(superheroes)}")

    first_hero = app.extract_first(Superhero)
    print(f"First extracted hero: {first_hero}")

    print(f"Total cost: ${app.cost()}")
    print(f"Total usage: {app.usage}")


if __name__ == "__main__":
    main()
```

```python
[
    (
        Superhero(
            name="Batman",
            age=81,
            power="Brilliant detective skills, martial arts",
            enemies=["Joker", "Penguin"],
        ),
        {"url": "https://www.britannica.com/topic/list-of-superheroes-2024795"},
    ),
    (
        Superhero(
            name="Wonder Woman",
            age=80,
            power="Superhuman strength, speed, agility",
            enemies=["Ares", "Cheetah"],
        ),
        {"url": "https://www.britannica.com/topic/list-of-superheroes-2024795"},
    ),
    (
        Superhero(
            name="Spider-Man",
            age=59,
            power="Wall-crawling, spider sense",
            enemies=["Green Goblin", "Venom"],
        ),
        {"url": "https://www.britannica.com/topic/list-of-superheroes-2024795"},
    ),
    (
        Superhero(
            name="Captain America",
            age=101,
            power="Super soldier serum, shield",
            enemies=["Red Skull", "Hydra"],
        ),
        {"url": "https://www.britannica.com/topic/list-of-superheroes-2024795"},
    ),
    (
        Superhero(
            name="Superman", age=35, power="Flight", enemies=["Lex Luthor", "Doomsday"]
        ),
        {"url": "https://www.britannica.com/robots.txt"},
    ),
    (
        Superhero(
            name="Wonder Woman",
            age=30,
            power="Super Strength",
            enemies=["Ares", "Cheetah"],
        ),
        {"url": "https://www.britannica.com/robots.txt"},
    ),
    (
        Superhero(
            name="Spider-Man",
            age=25,
            power="Wall-crawling",
            enemies=["Green Goblin", "Venom"],
        ),
        {"url": "https://www.britannica.com/robots.txt"},
    ),
]
```

### Hallucinations and autofix

```python
class Superhero(BaseModel):
    name: str
    age: int
    power: str
    enemies: list[str]

    def fix(self):
        # Logic to auto fix and normalize the generated data
        if self.age < 0:
            self.age = 0

    def check_hallucinations(self):
        # Check name
        if not re.match(r"^[A-Za-z\s-]+$", self.name):
            raise ValueError(f"Name '{self.name}' contains unusual characters")

        # Check age
        if self.age < 0 or self.age > 1000:
            raise ValueError(f"Age {self.age} seems unrealistic")

        # Check power
        if len(self.power) > 50:
            raise ValueError("Power description is unusually long")

        # Check enemies
        if len(self.enemies) > 10:
            raise ValueError("Unusually high number of enemies")

        for enemy in self.enemies:
            if not re.match(r"^[A-Za-z\s-]+$", enemy):
                raise ValueError(f"Enemy name '{enemy}' contains unusual characters")
```

### Experimental: Export and fine tunning

```python
# Use the OpenAI chat fine-tuning format to save data
app.export_jsonl("fine_tune.jsonl")

# Local model training
app.fit()
app.save("state.validex")


app.infer_extract("booob")
```

### Multi-model Extraction

ValidEx supports extracting multiple models at once

```python
class Superhero2(BaseModel):
    name: str
    age: int
    power: str
    enemies: list[str]


multi_results = app.multi_extract(Superhero, Superhero2)
print(f"Multi-extraction results: {multi_results}")
```

### Limitations

TBD

## 🛠️ Roadmap

## 👋 Contributing

Contributions to ValidEx are welcome! If you'd like to contribute, please follow these steps:

- Fork the repository on GitHub
- Create a new branch for your changes
- Commit your changes to the new branch
- Push your changes to the forked repository
- Open a pull request to the main ValidEx repository

Before contributing, please read the contributing guidelines.

## License

ValidEx is released under the MIT License.

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/msoedov/validex",
    "name": "validex",
    "maintainer": "Alexander Miasoiedov",
    "docs_url": null,
    "requires_python": "<4.0,>=3.10",
    "maintainer_email": "msoedov@gmail.com",
    "keywords": "nlp, extraction, openai, structured output parsing, fastapi, llm",
    "author": "Alexander Miasoiedov",
    "author_email": "msoedov@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/51/b5/f45a55eb39005d3bb684c1c6740e8cd0dd1419da7668f61f8eea467ef418/validex-0.0.1.tar.gz",
    "platform": null,
    "description": "# ValidEx\n\nValidEx is a Python library that simplifies retrieval, extraction and training of structured data from various unstructured sources.\n\n<p>\n<img alt=\"GitHub Contributors\" src=\"https://img.shields.io/github/contributors/msoedov/validex\" />\n<img alt=\"GitHub Last Commit\" src=\"https://img.shields.io/github/last-commit/msoedov/validex\" />\n<img alt=\"\" src=\"https://img.shields.io/github/repo-size/msoedov/validex\" />\n<img alt=\"GitHub Issues\" src=\"https://img.shields.io/github/issues/msoedov/validex\" />\n<img alt=\"GitHub Pull Requests\" src=\"https://img.shields.io/github/issues-pr/msoedov/validex\" />\n<img alt=\"Github License\" src=\"https://img.shields.io/github/license/msoedov/validex\" />\n</p>\n\n## \ud83c\udff7 Features\n\n- **Structured Data Extraction**: Parse and extract structured data from various unstructured sources including web pages, text files, PDFs, and more.\n- **Heuristic data cleaning**  text normalization (case, whitespace, special characters), deduplication\n- **Concurrency Support**: Efficiently process multiple data sources simultaneously.\n- **Retry Mechanism**: Implement automatic retries for failed extraction attempts.\n- **Hallucination check**: Implement strategies to detect and reduce LLM hallucinations in extracted data.\n- **Fine-tuning Dataset Export**: Generate datasets in JSONL format for OpenAI chat fine-tuning.\n- **Local Model Creation**: Build custom extraction models combining Named Entity Recognition (NER) and regular expressions.\n\n## \ud83d\udce6 Installation\n\nTo get started with ValidEx, simply install the package using pip:\n\n```shell\npip install validex\n```\n\n## \u26d3\ufe0f Quick Start\n\n```python\nimport validex\nfrom pydantic import BaseModel\n\n\nclass Superhero(BaseModel):\n    name: str\n    age: int\n    power: str\n    enemies: list[str]\n\n\ndef main():\n    app = validex.App()\n\n    app.add(\"https://www.britannica.com/topic/list-of-superheroes-2024795\")\n    app.add(\"*.txt\")\n    app.add(\"*.pdf\")\n    app.add(\"*.md\")\n\n    superheroes = app.extract(Superhero)\n    print(f\"Extracted superheroes: {list(superheroes)}\")\n\n    first_hero = app.extract_first(Superhero)\n    print(f\"First extracted hero: {first_hero}\")\n\n    print(f\"Total cost: ${app.cost()}\")\n    print(f\"Total usage: {app.usage}\")\n\n\nif __name__ == \"__main__\":\n    main()\n```\n\n```python\n[\n    (\n        Superhero(\n            name=\"Batman\",\n            age=81,\n            power=\"Brilliant detective skills, martial arts\",\n            enemies=[\"Joker\", \"Penguin\"],\n        ),\n        {\"url\": \"https://www.britannica.com/topic/list-of-superheroes-2024795\"},\n    ),\n    (\n        Superhero(\n            name=\"Wonder Woman\",\n            age=80,\n            power=\"Superhuman strength, speed, agility\",\n            enemies=[\"Ares\", \"Cheetah\"],\n        ),\n        {\"url\": \"https://www.britannica.com/topic/list-of-superheroes-2024795\"},\n    ),\n    (\n        Superhero(\n            name=\"Spider-Man\",\n            age=59,\n            power=\"Wall-crawling, spider sense\",\n            enemies=[\"Green Goblin\", \"Venom\"],\n        ),\n        {\"url\": \"https://www.britannica.com/topic/list-of-superheroes-2024795\"},\n    ),\n    (\n        Superhero(\n            name=\"Captain America\",\n            age=101,\n            power=\"Super soldier serum, shield\",\n            enemies=[\"Red Skull\", \"Hydra\"],\n        ),\n        {\"url\": \"https://www.britannica.com/topic/list-of-superheroes-2024795\"},\n    ),\n    (\n        Superhero(\n            name=\"Superman\", age=35, power=\"Flight\", enemies=[\"Lex Luthor\", \"Doomsday\"]\n        ),\n        {\"url\": \"https://www.britannica.com/robots.txt\"},\n    ),\n    (\n        Superhero(\n            name=\"Wonder Woman\",\n            age=30,\n            power=\"Super Strength\",\n            enemies=[\"Ares\", \"Cheetah\"],\n        ),\n        {\"url\": \"https://www.britannica.com/robots.txt\"},\n    ),\n    (\n        Superhero(\n            name=\"Spider-Man\",\n            age=25,\n            power=\"Wall-crawling\",\n            enemies=[\"Green Goblin\", \"Venom\"],\n        ),\n        {\"url\": \"https://www.britannica.com/robots.txt\"},\n    ),\n]\n```\n\n### Hallucinations and autofix\n\n```python\nclass Superhero(BaseModel):\n    name: str\n    age: int\n    power: str\n    enemies: list[str]\n\n    def fix(self):\n        # Logic to auto fix and normalize the generated data\n        if self.age < 0:\n            self.age = 0\n\n    def check_hallucinations(self):\n        # Check name\n        if not re.match(r\"^[A-Za-z\\s-]+$\", self.name):\n            raise ValueError(f\"Name '{self.name}' contains unusual characters\")\n\n        # Check age\n        if self.age < 0 or self.age > 1000:\n            raise ValueError(f\"Age {self.age} seems unrealistic\")\n\n        # Check power\n        if len(self.power) > 50:\n            raise ValueError(\"Power description is unusually long\")\n\n        # Check enemies\n        if len(self.enemies) > 10:\n            raise ValueError(\"Unusually high number of enemies\")\n\n        for enemy in self.enemies:\n            if not re.match(r\"^[A-Za-z\\s-]+$\", enemy):\n                raise ValueError(f\"Enemy name '{enemy}' contains unusual characters\")\n```\n\n### Experimental: Export and fine tunning\n\n```python\n# Use the OpenAI chat fine-tuning format to save data\napp.export_jsonl(\"fine_tune.jsonl\")\n\n# Local model training\napp.fit()\napp.save(\"state.validex\")\n\n\napp.infer_extract(\"booob\")\n```\n\n### Multi-model Extraction\n\nValidEx supports extracting multiple models at once\n\n```python\nclass Superhero2(BaseModel):\n    name: str\n    age: int\n    power: str\n    enemies: list[str]\n\n\nmulti_results = app.multi_extract(Superhero, Superhero2)\nprint(f\"Multi-extraction results: {multi_results}\")\n```\n\n### Limitations\n\nTBD\n\n## \ud83d\udee0\ufe0f Roadmap\n\n## \ud83d\udc4b Contributing\n\nContributions to ValidEx are welcome! If you'd like to contribute, please follow these steps:\n\n- Fork the repository on GitHub\n- Create a new branch for your changes\n- Commit your changes to the new branch\n- Push your changes to the forked repository\n- Open a pull request to the main ValidEx repository\n\nBefore contributing, please read the contributing guidelines.\n\n## License\n\nValidEx is released under the MIT License.\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A Python package to extract data from unstructured into structured format",
    "version": "0.0.1",
    "project_urls": {
        "Homepage": "https://github.com/msoedov/validex",
        "Repository": "https://github.com/msoedov/validex"
    },
    "split_keywords": [
        "nlp",
        " extraction",
        " openai",
        " structured output parsing",
        " fastapi",
        " llm"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "51b5f45a55eb39005d3bb684c1c6740e8cd0dd1419da7668f61f8eea467ef418",
                "md5": "2c343b51131791122f658215006f1453",
                "sha256": "8c089694b6076b6f7852d1f73ac753e03d10a4e152e49f53e4d2da1e187e0918"
            },
            "downloads": -1,
            "filename": "validex-0.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "2c343b51131791122f658215006f1453",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4.0,>=3.10",
            "size": 13345,
            "upload_time": "2024-08-04T20:19:52",
            "upload_time_iso_8601": "2024-08-04T20:19:52.704448Z",
            "url": "https://files.pythonhosted.org/packages/51/b5/f45a55eb39005d3bb684c1c6740e8cd0dd1419da7668f61f8eea467ef418/validex-0.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-08-04 20:19:52",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "msoedov",
    "github_project": "validex",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "validex"
}
        
Elapsed time: 0.92466s