Name | dataframes-haystack |
Version | 0.0.4 |
home_page | None |
Summary | Haystack custom components for your favourite dataframe library. |
upload_time | 2024-10-24 17:58:53 |
maintainer | None |
docs_url | None |
author | Edoardo Abati |
requires_python | >=3.8 |
license | MIT License Copyright (c) 2024-present Edoardo Abati Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. |
keywords | ai, dataframe, haystack, llm, machine-learning, nlp, pandas, polars |
VCS | |
bugtrack_url | |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
# Dataframes Haystack
[![PyPI - Version](https://img.shields.io/pypi/v/dataframes-haystack)](https://pypi.org/project/dataframes-haystack)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/dataframes-haystack?logo=python&logoColor=white)](https://pypi.org/project/dataframes-haystack)
[![PyPI - License](https://img.shields.io/pypi/l/dataframes-haystack)](https://pypi.org/project/dataframes-haystack)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![GH Actions Tests](https://github.com/EdAbati/dataframes-haystack/actions/workflows/test.yml/badge.svg)](https://github.com/EdAbati/dataframes-haystack/actions/workflows/test.yml)
[![pre-commit.ci status](https://results.pre-commit.ci/badge/github/EdAbati/dataframes-haystack/main.svg)](https://results.pre-commit.ci/latest/github/EdAbati/dataframes-haystack/main)
-----
## 📃 Description
`dataframes-haystack` is an extension for [Haystack 2](https://docs.haystack.deepset.ai/docs/intro) that enables integration with dataframe libraries.
The dataframe libraries currently supported are:
- [pandas](https://pandas.pydata.org/)
- [Polars](https://pola.rs)
The library offers several custom [Converter](https://docs.haystack.deepset.ai/docs/converters) components that turn files and dataframes into Haystack [`Document`](https://docs.haystack.deepset.ai/docs/data-classes#document) objects:
- `DataFrameFileToDocument` is the main generic converter: it reads files using a dataframe backend and converts them into `Document` objects.
- `FileToPandasDataFrame` and `FileToPolarsDataFrame` read files and convert them into dataframes.
- `PandasDataFrameConverter` and `PolarsDataFrameConverter` convert data stored in dataframes into Haystack `Document` objects.
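Conceptually, the dataframe converters map each row to one `Document`: the content column becomes the document text and the selected metadata columns become its `meta` dictionary. The sketch below illustrates that mapping with plain Python only. `SimpleDocument` and `rows_to_documents` are hypothetical stand-ins, not part of `dataframes-haystack` or Haystack, and the real components may differ in details such as ID generation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for Haystack's Document (illustration only)."""

    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def rows_to_documents(
    rows: List[Dict[str, Any]],
    content_column: str,
    meta_columns: Optional[List[str]] = None,
) -> List[SimpleDocument]:
    """Turn each row (a dict of column name -> value) into one document."""
    meta_columns = meta_columns or []
    return [
        SimpleDocument(
            content=str(row[content_column]),
            meta={col: row[col] for col in meta_columns},
        )
        for row in rows
    ]


# Rows as they might come out of a dataframe with "text" and "filename" columns
rows = [
    {"text": "Hello world", "filename": "doc1.txt"},
    {"text": "Hello everyone", "filename": "doc2.txt"},
]
docs = rows_to_documents(rows, content_column="text", meta_columns=["filename"])
```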
`dataframes-haystack` supports reading files in various formats:
- _csv_, _json_, _parquet_, _excel_, _html_, _xml_, _orc_, _pickle_, _fixed-width format_ for `pandas`. See the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) for more details.
- _csv_, _json_, _parquet_, _excel_, _avro_, _delta_, _ipc_ for `polars`. See the [polars documentation](https://docs.pola.rs/api/python/stable/reference/io.html) for more details.
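For orientation, the formats listed above correspond to `read_*` functions in each library's I/O reference. The lookup table below is a small sketch of that correspondence. The format keys (for example `"fwf"`) and the helper `reader_name` are assumptions made for illustration and are not taken from the `dataframes-haystack` source, so check the component's documentation for the exact `file_format` values it accepts.

```python
from typing import Dict

# Reader function names as listed in the pandas and Polars I/O documentation.
# The keys are illustrative format labels, not confirmed `file_format` values.
READERS: Dict[str, Dict[str, str]] = {
    "pandas": {
        "csv": "read_csv",
        "json": "read_json",
        "parquet": "read_parquet",
        "excel": "read_excel",
        "html": "read_html",
        "xml": "read_xml",
        "orc": "read_orc",
        "pickle": "read_pickle",
        "fwf": "read_fwf",  # fixed-width format
    },
    "polars": {
        "csv": "read_csv",
        "json": "read_json",
        "parquet": "read_parquet",
        "excel": "read_excel",
        "avro": "read_avro",
        "delta": "read_delta",
        "ipc": "read_ipc",
    },
}


def reader_name(backend: str, file_format: str) -> str:
    """Return the reader function name for a backend/format pair."""
    try:
        return READERS[backend][file_format]
    except KeyError:
        raise ValueError(
            f"Unsupported combination: {backend!r} / {file_format!r}"
        ) from None
```

With pandas installed, `getattr(pandas, reader_name("pandas", "csv"))` resolves to `pandas.read_csv`.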
## 🛠️ Installation
```sh
# for pandas (pandas is already included in `haystack-ai`)
pip install dataframes-haystack
# for polars
pip install "dataframes-haystack[polars]"
```
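To confirm which backends are importable after installation, a quick check with the standard library can help. This is a minimal sketch; `available_backends` is a hypothetical helper name, not something the package provides.

```python
from importlib.util import find_spec
from typing import Dict


def available_backends() -> Dict[str, bool]:
    """Report which dataframe backends can be imported in this environment."""
    return {name: find_spec(name) is not None for name in ("pandas", "polars")}


print(available_backends())
```

If `polars` is reported as unavailable, install the `polars` extra shown above.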
## 💻 Usage
> [!TIP]
> See the [Example Notebooks](./notebooks) for complete examples.
### DataFrameFileToDocument
[Complete example](https://github.com/EdAbati/dataframes-haystack/blob/main/notebooks/dataframe-file-to-doc-example.ipynb)
This converter can read your data with either the `pandas` or the `polars` backend, thanks to [`narwhals`](https://github.com/narwhals-dev/narwhals).
```python
from dataframes_haystack.components.converters import DataFrameFileToDocument
converter = DataFrameFileToDocument(content_column="text_str")
documents = converter.run(files=["file1.csv", "file2.csv"])
```
Result:
```python
>>> documents
{'documents': [
    Document(id=0, content: 'Hello world', meta: {}),
    Document(id=1, content: 'Hello everyone', meta: {})
]}
```
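The `DataFrameFileToDocument` example above assumes that `file1.csv` and `file2.csv` exist and contain a `text_str` column. To try it locally, the following sketch creates such files with the standard library. The file contents are assumptions chosen to match the result shown above, and `write_sample_files` is a hypothetical helper.

```python
import csv
from pathlib import Path
from typing import List

# One text row per file, matching the documents in the example output
SAMPLES = {
    "file1.csv": ["Hello world"],
    "file2.csv": ["Hello everyone"],
}


def write_sample_files(directory: str = ".") -> List[Path]:
    """Write small CSV files with a single `text_str` column."""
    paths = []
    for filename, texts in SAMPLES.items():
        path = Path(directory) / filename
        with path.open("w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["text_str"])  # header: the column used as content_column
            writer.writerows([text] for text in texts)
        paths.append(path)
    return paths
```

Calling `write_sample_files()` in the working directory produces the two files passed to `converter.run(files=[...])`.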
### Pandas
[Complete example](https://github.com/EdAbati/dataframes-haystack/blob/main/notebooks/pandas-example.ipynb)
#### FileToPandasDataFrame
```python
from dataframes_haystack.components.converters.pandas import FileToPandasDataFrame

# Read the files as CSV with the pandas backend
converter = FileToPandasDataFrame(file_format="csv")

output_dataframe = converter.run(
    file_paths=["data/doc1.csv", "data/doc2.csv"]
)
```
Result:
```python
>>> output_dataframe
{'dataframe': <pandas.DataFrame>}
```
#### PandasDataFrameConverter
```python
import pandas as pd

from dataframes_haystack.components.converters.pandas import PandasDataFrameConverter

df = pd.DataFrame({
    "text": ["Hello world", "Hello everyone"],
    "filename": ["doc1.txt", "doc2.txt"],
})

# "text" becomes the document content, "filename" is stored in the metadata
converter = PandasDataFrameConverter(content_column="text", meta_columns=["filename"])
documents = converter.run(df)
```
Result:
```python
>>> documents
{'documents': [
    Document(id=0, content: 'Hello world', meta: {'filename': 'doc1.txt'}),
    Document(id=1, content: 'Hello everyone', meta: {'filename': 'doc2.txt'})
]}
```
### Polars
[Complete example](https://github.com/EdAbati/dataframes-haystack/blob/main/notebooks/polars-example.ipynb)
#### FileToPolarsDataFrame
```python
from dataframes_haystack.components.converters.polars import FileToPolarsDataFrame

# Read the files as CSV with the Polars backend
converter = FileToPolarsDataFrame(file_format="csv")

output_dataframe = converter.run(
    file_paths=["data/doc1.csv", "data/doc2.csv"]
)
```
Result:
```python
>>> output_dataframe
{'dataframe': <polars.DataFrame>}
```
#### PolarsDataFrameConverter
```python
import polars as pl

from dataframes_haystack.components.converters.polars import PolarsDataFrameConverter

df = pl.DataFrame({
    "text": ["Hello world", "Hello everyone"],
    "filename": ["doc1.txt", "doc2.txt"],
})

# "text" becomes the document content, "filename" is stored in the metadata
converter = PolarsDataFrameConverter(content_column="text", meta_columns=["filename"])
documents = converter.run(df)
```
Result:
```python
>>> documents
{'documents': [
    Document(id=0, content: 'Hello world', meta: {'filename': 'doc1.txt'}),
    Document(id=1, content: 'Hello everyone', meta: {'filename': 'doc2.txt'})
]}
```
## 🤝 Contributing
Do you have an idea for a new feature? Did you find a bug that needs fixing?
Feel free to [open an issue](https://github.com/EdAbati/dataframes-haystack/issues) or submit a PR!
### Setup development environment
Requirements: [`hatch`](https://hatch.pypa.io/latest/install/), [`pre-commit`](https://pre-commit.com/#install)
1. Clone the repository
1. Run `hatch shell` to create and activate a virtual environment
1. Run `pre-commit install` to install the pre-commit hooks, which run the linting and formatting checks on every commit.
### Run tests
- Linting and formatting checks: `hatch run lint:fmt`
- Unit tests: `hatch run test-cov-all`
## ✍️ License
`dataframes-haystack` is distributed under the terms of the [MIT](https://spdx.org/licenses/MIT.html) license.
Raw data
{
"_id": null,
"home_page": null,
"name": "dataframes-haystack",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "ai, dataframe, haystack, llm, machine-learning, nlp, pandas, polars",
"author": "Edoardo Abati",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/eb/03/546fd5f508554c1529eac8ea9ca73a06431ea919a07859e31f1d6ce28de3/dataframes_haystack-0.0.4.tar.gz",
"platform": null,
"description": "# Dataframes Haystack\n\n[![PyPI - Version](https://img.shields.io/pypi/v/dataframes-haystack)](https://pypi.org/project/dataframes-haystack)\n[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/dataframes-haystack?logo=python&logoColor=white)](https://pypi.org/project/dataframes-haystack)\n[![PyPI - License](https://img.shields.io/pypi/l/dataframes-haystack)](https://pypi.org/project/dataframes-haystack)\n\n\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)\n\n[![GH Actions Tests](https://github.com/EdAbati/dataframes-haystack/actions/workflows/test.yml/badge.svg)](https://github.com/EdAbati/dataframes-haystack/actions/workflows/test.yml)\n[![pre-commit.ci status](https://results.pre-commit.ci/badge/github/EdAbati/dataframes-haystack/main.svg)](https://results.pre-commit.ci/latest/github/EdAbati/dataframes-haystack/main)\n\n-----\n\n## \ud83d\udcc3 Description\n\n`dataframes-haystack` is an extension for [Haystack 2](https://docs.haystack.deepset.ai/docs/intro) that enables integration with dataframe libraries.\n\nThe dataframe libraries currently supported are:\n- [pandas](https://pandas.pydata.org/)\n- [Polars](https://pola.rs)\n\nThe library offers various custom [Converters](https://docs.haystack.deepset.ai/docs/converters) components to transform dataframes into Haystack [`Document`](https://docs.haystack.deepset.ai/docs/data-classes#document) objects:\n- `DataFrameFileToDocument` is a main generic converter that reads files using a dataframe backend and converts them into `Document` objects.\n- `FileToPandasDataFrame` and `FileToPolarsDataFrame` read files and convert them into dataframes.\n- `PandasDataFrameConverter` or `PolarsDataFrameConverter` convert data stored in dataframes into Haystack 
`Document`objects.\n\n`dataframes-haystack` supports reading files in various formats:\n- _csv_, _json_, _parquet_, _excel_, _html_, _xml_, _orc_, _pickle_, _fixed-width format_ for `pandas`. See the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) for more details.\n- _csv_, _json_, _parquet_, _excel_, _avro_, _delta_, _ipc_ for `polars`. See the [polars documentation](https://docs.pola.rs/api/python/stable/reference/io.html) for more details.\n\n## \ud83d\udee0\ufe0f Installation\n\n```sh\n# for pandas (pandas is already included in `haystack-ai`)\npip install dataframes-haystack\n\n# for polars\npip install \"dataframes-haystack[polars]\"\n```\n\n## \ud83d\udcbb Usage\n\n> [!TIP]\n> See the [Example Notebooks](./notebooks) for complete examples.\n\n## DataFrameFileToDocument\n\n[Complete example](https://github.com/EdAbati/dataframes-haystack/blob/main/notebooks/dataframe-file-to-doc-example.ipynb)\n\nYou can leverage both `pandas` and `polars` backends (thanks to [`narwhals`](https://github.com/narwhals-dev/narwhals)) to read your data!\n\n```python\nfrom dataframes_haystack.components.converters import DataFrameFileToDocument\n\nconverter = DataFrameFileToDocument(content_column=\"text_str\")\ndocuments = converter.run(files=[\"file1.csv\", \"file2.csv\"])\n```\n\n```python\n>>> documents\n{'documents': [\n Document(id=0, content: 'Hello world', meta: {}),\n Document(id=1, content: 'Hello everyone', meta: {})\n]}\n```\n\n### Pandas\n\n[Complete example](https://github.com/EdAbati/dataframes-haystack/blob/main/notebooks/pandas-example.ipynb)\n\n#### FileToPandasDataFrame\n\n```python\nfrom dataframes_haystack.components.converters.pandas import FileToPandasDataFrame\n\nconverter = FileToPandasDataFrame(file_format=\"csv\")\n\noutput_dataframe = converter.run(\n file_paths=[\"data/doc1.csv\", \"data/doc2.csv\"]\n)\n```\n\nResult:\n```python\n>>> output_dataframe\n{'dataframe': <pandas.DataFrame>}\n```\n\n#### 
PandasDataFrameConverter\n\n```python\nimport pandas as pd\n\nfrom dataframes_haystack.components.converters.pandas import PandasDataFrameConverter\n\ndf = pd.DataFrame({\n \"text\": [\"Hello world\", \"Hello everyone\"],\n \"filename\": [\"doc1.txt\", \"doc2.txt\"],\n})\n\nconverter = PandasDataFrameConverter(content_column=\"text\", meta_columns=[\"filename\"])\ndocuments = converter.run(df)\n```\n\nResult:\n```python\n>>> documents\n{'documents': [\n Document(id=0, content: 'Hello world', meta: {'filename': 'doc1.txt'}),\n Document(id=1, content: 'Hello everyone', meta: {'filename': 'doc2.txt'})\n]}\n```\n\n### Polars\n\n[Complete example](https://github.com/EdAbati/dataframes-haystack/blob/main/notebooks/polars-example.ipynb)\n\n#### FileToPolarsDataFrame\n\n```python\nfrom dataframes_haystack.components.converters.polars import FileToPolarsDataFrame\n\nconverter = FileToPolarsDataFrame(file_format=\"csv\")\n\noutput_dataframe = converter.run(\n file_paths=[\"data/doc1.csv\", \"data/doc2.csv\"]\n)\n```\n\nResult:\n```python\n>>> output_dataframe\n{'dataframe': <polars.DataFrame>}\n```\n\n#### PolarsDataFrameConverter\n\n```python\nimport polars as pl\n\nfrom dataframes_haystack.components.converters.polars import PolarsDataFrameConverter\n\ndf = pl.DataFrame({\n \"text\": [\"Hello world\", \"Hello everyone\"],\n \"filename\": [\"doc1.txt\", \"doc2.txt\"],\n})\n\nconverter = PolarsDataFrameConverter(content_column=\"text\", meta_columns=[\"filename\"])\ndocuments = converter.run(df)\n```\n\nResult:\n```python\n>>> documents\n{'documents': [\n Document(id=0, content: 'Hello world', meta: {'filename': 'doc1.txt'}),\n Document(id=1, content: 'Hello everyone', meta: {'filename': 'doc2.txt'})\n]}\n```\n\n## \ud83e\udd1d Contributing\n\nDo you have an idea for a new feature? 
Did you find a bug that needs fixing?\n\nFeel free to [open an issue](https://github.com/EdAbati/dataframes-haystack/issues) or submit a PR!\n\n### Setup development environment\n\nRequirements: [`hatch`](https://hatch.pypa.io/latest/install/), [`pre-commit`](https://pre-commit.com/#install)\n\n1. Clone the repository\n1. Run `hatch shell` to create and activate a virtual environment\n1. Run `pre-commit install` to install the pre-commit hooks. This will force the linting and formatting checks.\n\n### Run tests\n\n- Linting and formatting checks: `hatch run lint:fmt`\n- Unit tests: `hatch run test-cov-all`\n\n## \u270d\ufe0f License\n\n`dataframes-haystack` is distributed under the terms of the [MIT](https://spdx.org/licenses/MIT.html) license.\n",
"bugtrack_url": null,
"license": "MIT License Copyright (c) 2024-present Edoardo Abati Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.",
"summary": "Haystack custom components for your favourite dataframe library.",
"version": "0.0.4",
"project_urls": {
"Documentation": "https://github.com/EdAbati/dataframes-haystack#readme",
"Issues": "https://github.com/EdAbati/dataframes-haystack/issues",
"Source": "https://github.com/EdAbati/dataframes-haystack"
},
"split_keywords": [
"ai",
" dataframe",
" haystack",
" llm",
" machine-learning",
" nlp",
" pandas",
" polars"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "9dd668ebf746f93fa9e99e3ba30879a612d3319e4e806cc41075ffd21d09ef8f",
"md5": "b263080ed239a69e938cece0a9dbced5",
"sha256": "e1b29cc09e9a165608ab4d98e33dd595f71c51799df5394eddffc412f309beba"
},
"downloads": -1,
"filename": "dataframes_haystack-0.0.4-py3-none-any.whl",
"has_sig": false,
"md5_digest": "b263080ed239a69e938cece0a9dbced5",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 12735,
"upload_time": "2024-10-24T17:58:52",
"upload_time_iso_8601": "2024-10-24T17:58:52.165335Z",
"url": "https://files.pythonhosted.org/packages/9d/d6/68ebf746f93fa9e99e3ba30879a612d3319e4e806cc41075ffd21d09ef8f/dataframes_haystack-0.0.4-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "eb03546fd5f508554c1529eac8ea9ca73a06431ea919a07859e31f1d6ce28de3",
"md5": "beb8ee592edd1a533c21def9a89dc101",
"sha256": "c7a4b95507eb0836a7279d8c11002092ee8efbb247d6de1ba17d172052fb38f3"
},
"downloads": -1,
"filename": "dataframes_haystack-0.0.4.tar.gz",
"has_sig": false,
"md5_digest": "beb8ee592edd1a533c21def9a89dc101",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 163271,
"upload_time": "2024-10-24T17:58:53",
"upload_time_iso_8601": "2024-10-24T17:58:53.154002Z",
"url": "https://files.pythonhosted.org/packages/eb/03/546fd5f508554c1529eac8ea9ca73a06431ea919a07859e31f1d6ce28de3/dataframes_haystack-0.0.4.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-10-24 17:58:53",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "EdAbati",
"github_project": "dataframes-haystack#readme",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "dataframes-haystack"
}