# ValidEx
ValidEx is a Python library that simplifies retrieval, extraction and training of structured data from various unstructured sources.
<p>
<img alt="GitHub Contributors" src="https://img.shields.io/github/contributors/msoedov/validex" />
<img alt="GitHub Last Commit" src="https://img.shields.io/github/last-commit/msoedov/validex" />
<img alt="" src="https://img.shields.io/github/repo-size/msoedov/validex" />
<img alt="GitHub Issues" src="https://img.shields.io/github/issues/msoedov/validex" />
<img alt="GitHub Pull Requests" src="https://img.shields.io/github/issues-pr/msoedov/validex" />
<img alt="Github License" src="https://img.shields.io/github/license/msoedov/validex" />
</p>
## 🏷 Features
- **Structured Data Extraction**: Parse and extract structured data from various unstructured sources including web pages, text files, PDFs, and more.
- **Heuristic data cleaning** text normalization (case, whitespace, special characters), deduplication
- **Concurrency Support**: Efficiently process multiple data sources simultaneously.
- **Retry Mechanism**: Implement automatic retries for failed extraction attempts.
- **Hallucination check**: Implement strategies to detect and reduce LLM hallucinations in extracted data.
- **Fine-tuning Dataset Export**: Generate datasets in JSONL format for OpenAI chat fine-tuning.
- **Local Model Creation**: Build custom extraction models combining Named Entity Recognition (NER) and regular expressions.
## 📦 Installation
To get started with ValidEx, simply install the package using pip:
```shell
pip install validex
```
## ⛓️ Quick Start
```python
import validex
from pydantic import BaseModel
class Superhero(BaseModel):
name: str
age: int
power: str
enemies: list[str]
def main():
app = validex.App()
app.add("https://www.britannica.com/topic/list-of-superheroes-2024795")
app.add("*.txt")
app.add("*.pdf")
app.add("*.md")
superheroes = app.extract(Superhero)
print(f"Extracted superheroes: {list(superheroes)}")
first_hero = app.extract_first(Superhero)
print(f"First extracted hero: {first_hero}")
print(f"Total cost: ${app.cost()}")
print(f"Total usage: {app.usage}")
if __name__ == "__main__":
main()
```
```python
[
(
Superhero(
name="Batman",
age=81,
power="Brilliant detective skills, martial arts",
enemies=["Joker", "Penguin"],
),
{"url": "https://www.britannica.com/topic/list-of-superheroes-2024795"},
),
(
Superhero(
name="Wonder Woman",
age=80,
power="Superhuman strength, speed, agility",
enemies=["Ares", "Cheetah"],
),
{"url": "https://www.britannica.com/topic/list-of-superheroes-2024795"},
),
(
Superhero(
name="Spider-Man",
age=59,
power="Wall-crawling, spider sense",
enemies=["Green Goblin", "Venom"],
),
{"url": "https://www.britannica.com/topic/list-of-superheroes-2024795"},
),
(
Superhero(
name="Captain America",
age=101,
power="Super soldier serum, shield",
enemies=["Red Skull", "Hydra"],
),
{"url": "https://www.britannica.com/topic/list-of-superheroes-2024795"},
),
(
Superhero(
name="Superman", age=35, power="Flight", enemies=["Lex Luthor", "Doomsday"]
),
{"url": "https://www.britannica.com/robots.txt"},
),
(
Superhero(
name="Wonder Woman",
age=30,
power="Super Strength",
enemies=["Ares", "Cheetah"],
),
{"url": "https://www.britannica.com/robots.txt"},
),
(
Superhero(
name="Spider-Man",
age=25,
power="Wall-crawling",
enemies=["Green Goblin", "Venom"],
),
{"url": "https://www.britannica.com/robots.txt"},
),
]
```
### Hallucinations and autofix
```python
class Superhero(BaseModel):
name: str
age: int
power: str
enemies: list[str]
def fix(self):
# Logic to auto fix and normalize the generated data
if self.age < 0:
self.age = 0
def check_hallucinations(self):
# Check name
if not re.match(r"^[A-Za-z\s-]+$", self.name):
raise ValueError(f"Name '{self.name}' contains unusual characters")
# Check age
if self.age < 0 or self.age > 1000:
raise ValueError(f"Age {self.age} seems unrealistic")
# Check power
if len(self.power) > 50:
raise ValueError("Power description is unusually long")
# Check enemies
if len(self.enemies) > 10:
raise ValueError("Unusually high number of enemies")
for enemy in self.enemies:
if not re.match(r"^[A-Za-z\s-]+$", enemy):
raise ValueError(f"Enemy name '{enemy}' contains unusual characters")
```
### Experimental: Export and fine tunning
```python
# Use the OpenAI chat fine-tuning format to save data
app.export_jsonl("fine_tune.jsonl")
# Local model training
app.fit()
app.save("state.validex")
app.infer_extract("booob")
```
### Multi-model Extraction
ValidEx supports extracting multiple models at once
```python
class Superhero2(BaseModel):
name: str
age: int
power: str
enemies: list[str]
multi_results = app.multi_extract(Superhero, Superhero2)
print(f"Multi-extraction results: {multi_results}")
```
### Limitations
TBD
## 🛠️ Roadmap
## 👋 Contributing
Contributions to ValidEx are welcome! If you'd like to contribute, please follow these steps:
- Fork the repository on GitHub
- Create a new branch for your changes
- Commit your changes to the new branch
- Push your changes to the forked repository
- Open a pull request to the main ValidEx repository
Before contributing, please read the contributing guidelines.
## License
ValidEx is released under the MIT License.
Raw data
{
"_id": null,
"home_page": "https://github.com/msoedov/validex",
"name": "validex",
"maintainer": "Alexander Miasoiedov",
"docs_url": null,
"requires_python": "<4.0,>=3.10",
"maintainer_email": "msoedov@gmail.com",
"keywords": "nlp, extraction, openai, structured output parsing, fastapi, llm",
"author": "Alexander Miasoiedov",
"author_email": "msoedov@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/51/b5/f45a55eb39005d3bb684c1c6740e8cd0dd1419da7668f61f8eea467ef418/validex-0.0.1.tar.gz",
"platform": null,
"description": "# ValidEx\n\nValidEx is a Python library that simplifies retrieval, extraction and training of structured data from various unstructured sources.\n\n<p>\n<img alt=\"GitHub Contributors\" src=\"https://img.shields.io/github/contributors/msoedov/validex\" />\n<img alt=\"GitHub Last Commit\" src=\"https://img.shields.io/github/last-commit/msoedov/validex\" />\n<img alt=\"\" src=\"https://img.shields.io/github/repo-size/msoedov/validex\" />\n<img alt=\"GitHub Issues\" src=\"https://img.shields.io/github/issues/msoedov/validex\" />\n<img alt=\"GitHub Pull Requests\" src=\"https://img.shields.io/github/issues-pr/msoedov/validex\" />\n<img alt=\"Github License\" src=\"https://img.shields.io/github/license/msoedov/validex\" />\n</p>\n\n## \ud83c\udff7 Features\n\n- **Structured Data Extraction**: Parse and extract structured data from various unstructured sources including web pages, text files, PDFs, and more.\n- **Heuristic data cleaning** text normalization (case, whitespace, special characters), deduplication\n- **Concurrency Support**: Efficiently process multiple data sources simultaneously.\n- **Retry Mechanism**: Implement automatic retries for failed extraction attempts.\n- **Hallucination check**: Implement strategies to detect and reduce LLM hallucinations in extracted data.\n- **Fine-tuning Dataset Export**: Generate datasets in JSONL format for OpenAI chat fine-tuning.\n- **Local Model Creation**: Build custom extraction models combining Named Entity Recognition (NER) and regular expressions.\n\n## \ud83d\udce6 Installation\n\nTo get started with ValidEx, simply install the package using pip:\n\n```shell\npip install validex\n```\n\n## \u26d3\ufe0f Quick Start\n\n```python\nimport validex\nfrom pydantic import BaseModel\n\n\nclass Superhero(BaseModel):\n name: str\n age: int\n power: str\n enemies: list[str]\n\n\ndef main():\n app = validex.App()\n\n app.add(\"https://www.britannica.com/topic/list-of-superheroes-2024795\")\n app.add(\"*.txt\")\n app.add(\"*.pdf\")\n app.add(\"*.md\")\n\n superheroes = app.extract(Superhero)\n print(f\"Extracted superheroes: {list(superheroes)}\")\n\n first_hero = app.extract_first(Superhero)\n print(f\"First extracted hero: {first_hero}\")\n\n print(f\"Total cost: ${app.cost()}\")\n print(f\"Total usage: {app.usage}\")\n\n\nif __name__ == \"__main__\":\n main()\n```\n\n```python\n[\n (\n Superhero(\n name=\"Batman\",\n age=81,\n power=\"Brilliant detective skills, martial arts\",\n enemies=[\"Joker\", \"Penguin\"],\n ),\n {\"url\": \"https://www.britannica.com/topic/list-of-superheroes-2024795\"},\n ),\n (\n Superhero(\n name=\"Wonder Woman\",\n age=80,\n power=\"Superhuman strength, speed, agility\",\n enemies=[\"Ares\", \"Cheetah\"],\n ),\n {\"url\": \"https://www.britannica.com/topic/list-of-superheroes-2024795\"},\n ),\n (\n Superhero(\n name=\"Spider-Man\",\n age=59,\n power=\"Wall-crawling, spider sense\",\n enemies=[\"Green Goblin\", \"Venom\"],\n ),\n {\"url\": \"https://www.britannica.com/topic/list-of-superheroes-2024795\"},\n ),\n (\n Superhero(\n name=\"Captain America\",\n age=101,\n power=\"Super soldier serum, shield\",\n enemies=[\"Red Skull\", \"Hydra\"],\n ),\n {\"url\": \"https://www.britannica.com/topic/list-of-superheroes-2024795\"},\n ),\n (\n Superhero(\n name=\"Superman\", age=35, power=\"Flight\", enemies=[\"Lex Luthor\", \"Doomsday\"]\n ),\n {\"url\": \"https://www.britannica.com/robots.txt\"},\n ),\n (\n Superhero(\n name=\"Wonder Woman\",\n age=30,\n power=\"Super Strength\",\n enemies=[\"Ares\", \"Cheetah\"],\n ),\n {\"url\": \"https://www.britannica.com/robots.txt\"},\n ),\n (\n Superhero(\n name=\"Spider-Man\",\n age=25,\n power=\"Wall-crawling\",\n enemies=[\"Green Goblin\", \"Venom\"],\n ),\n {\"url\": \"https://www.britannica.com/robots.txt\"},\n ),\n]\n```\n\n### Hallucinations and autofix\n\n```python\nclass Superhero(BaseModel):\n name: str\n age: int\n power: str\n enemies: list[str]\n\n def fix(self):\n # Logic to auto fix and normalize the generated data\n if self.age < 0:\n self.age = 0\n\n def check_hallucinations(self):\n # Check name\n if not re.match(r\"^[A-Za-z\\s-]+$\", self.name):\n raise ValueError(f\"Name '{self.name}' contains unusual characters\")\n\n # Check age\n if self.age < 0 or self.age > 1000:\n raise ValueError(f\"Age {self.age} seems unrealistic\")\n\n # Check power\n if len(self.power) > 50:\n raise ValueError(\"Power description is unusually long\")\n\n # Check enemies\n if len(self.enemies) > 10:\n raise ValueError(\"Unusually high number of enemies\")\n\n for enemy in self.enemies:\n if not re.match(r\"^[A-Za-z\\s-]+$\", enemy):\n raise ValueError(f\"Enemy name '{enemy}' contains unusual characters\")\n```\n\n### Experimental: Export and fine tunning\n\n```python\n# Use the OpenAI chat fine-tuning format to save data\napp.export_jsonl(\"fine_tune.jsonl\")\n\n# Local model training\napp.fit()\napp.save(\"state.validex\")\n\n\napp.infer_extract(\"booob\")\n```\n\n### Multi-model Extraction\n\nValidEx supports extracting multiple models at once\n\n```python\nclass Superhero2(BaseModel):\n name: str\n age: int\n power: str\n enemies: list[str]\n\n\nmulti_results = app.multi_extract(Superhero, Superhero2)\nprint(f\"Multi-extraction results: {multi_results}\")\n```\n\n### Limitations\n\nTBD\n\n## \ud83d\udee0\ufe0f Roadmap\n\n## \ud83d\udc4b Contributing\n\nContributions to ValidEx are welcome! If you'd like to contribute, please follow these steps:\n\n- Fork the repository on GitHub\n- Create a new branch for your changes\n- Commit your changes to the new branch\n- Push your changes to the forked repository\n- Open a pull request to the main ValidEx repository\n\nBefore contributing, please read the contributing guidelines.\n\n## License\n\nValidEx is released under the MIT License.\n",
"bugtrack_url": null,
"license": "MIT",
"summary": "A Python package to extract data from unstructured into structured format",
"version": "0.0.1",
"project_urls": {
"Homepage": "https://github.com/msoedov/validex",
"Repository": "https://github.com/msoedov/validex"
},
"split_keywords": [
"nlp",
" extraction",
" openai",
" structured output parsing",
" fastapi",
" llm"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "51b5f45a55eb39005d3bb684c1c6740e8cd0dd1419da7668f61f8eea467ef418",
"md5": "2c343b51131791122f658215006f1453",
"sha256": "8c089694b6076b6f7852d1f73ac753e03d10a4e152e49f53e4d2da1e187e0918"
},
"downloads": -1,
"filename": "validex-0.0.1.tar.gz",
"has_sig": false,
"md5_digest": "2c343b51131791122f658215006f1453",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.10",
"size": 13345,
"upload_time": "2024-08-04T20:19:52",
"upload_time_iso_8601": "2024-08-04T20:19:52.704448Z",
"url": "https://files.pythonhosted.org/packages/51/b5/f45a55eb39005d3bb684c1c6740e8cd0dd1419da7668f61f8eea467ef418/validex-0.0.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-04 20:19:52",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "msoedov",
"github_project": "validex",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "validex"
}