# data-harvest-reader
A Python utility for reading and processing multiple files with identical structures from directories, ZIP archives, or in-memory bytes.
## Features
1. **Reading Various File Formats**: Supports reading CSV, JSON, Parquet, and Excel files.
2. **Directory and ZIP File Handling**: Reads data from directories and ZIP archives, as well as from bytes and `zipfile.ZipFile` objects.
3. **Data Joining**: Joins DataFrames that share similar columns.
4. **Deduplication**: Removes duplicates based on specific columns.
5. **Custom Filters**: Applies custom filters to the DataFrames.
6. **Logging**: Detailed logging of read and data-manipulation operations.
## Installation Requirements
```bash
pip install polars loguru
```
## Usage
### Initialization
```python
from data_harvest_reader import DataReader
data_reader = DataReader(log_to_file=True, log_file="data_reader.log")
```
### Reading Data
#### From Directory
```python
data = data_reader.read_data('path/to/directory', join_similar=True)
```
#### From ZIP File
```python
data = data_reader.read_data('path/to/zipfile.zip', join_similar=False)
```
#### From Bytes
```python
with open('path/to/zipfile.zip', 'rb') as f:
    zip_bytes = f.read()

data = data_reader.read_data(zip_bytes, join_similar=False)
```
#### From a `zipfile.ZipFile` Object
```python
import zipfile

with zipfile.ZipFile('path/to/zipfile.zip', 'r') as zip_file:
    data = data_reader.read_data(zip_file, join_similar=False)
```
### Applying Deduplication
```python
duplicated_subset_dict = {'file1': ['column1', 'column2']}
data = data_reader.read_data('path/to/source', duplicated_subset_dict=duplicated_subset_dict)
```
### Applying Filters
```python
filter_subset = {
    'file1': [{'column': 'Col1', 'operation': '>', 'values': 100},
              {'column': 'Col2', 'operation': '==', 'values': 'Value'}]
}
data = data_reader.read_data('path/to/source', filter_subset=filter_subset)
```
### Handling Exceptions
```python
try:
    data = data_reader.read_data('path/to/source')
except UnsupportedFormatError:
    print("Unsupported file format provided")
except FilterConfigurationError:
    print("Error in filter configuration")
```
## Example
```python
data_reader = DataReader()
data = data_reader.read_data(r'C:\path\to\data', join_similar=True,
                             filter_subset={'example_file': [{'column': 'Age', 'operation': '>', 'values': 30}]})
```
## Contributing to DataReader
### Getting Started
1. **Fork the Repository**: Start by forking the main repository. This creates your own copy of the project where you can make changes.
2. **Clone the Forked Repository**: Clone your fork to your local machine. This step allows you to work on the codebase directly.
3. **Set Up the Development Environment**: Ensure you have all necessary dependencies installed. It's recommended to use a virtual environment.
4. **Create a New Branch**: Always create a new branch for your changes. This keeps the main branch stable and makes reviewing changes easier.
### Making Contributions
1. **Make Your Changes**: Implement your feature, fix a bug, or make your proposed changes. Ensure your code adheres to the project's coding standards and guidelines.
2. **Test Your Changes**: Before submitting, test your changes thoroughly. Write unit tests if applicable, and ensure all existing tests pass.
3. **Document Your Changes**: Update the documentation to reflect your changes. If you're adding a new feature, include usage examples.
4. **Commit Your Changes**: Make concise and clear commit messages, describing what each commit does.
5. **Push to Your Fork**: Push your changes to your fork on GitHub.
6. **Create a Pull Request (PR)**: Go to the original `DataReader` repository and create a pull request from your fork. Ensure you describe your changes in detail and link any relevant issues.
### Review Process
After submitting your PR, the maintainers will review your changes. Be responsive to feedback:
1. **Respond to Comments**: If the reviewers ask for changes, make them promptly. Discuss any suggestions or concerns.
2. **Update Your PR**: If needed, update your PR based on feedback. This may involve adding more tests or tweaking your approach.
### Final Steps
Once your PR is approved:
1. **Merge**: The maintainers will merge your changes into the main codebase.
2. **Stay Engaged**: Continue to stay involved in the project. Look out for feedback from users on your new feature or fix.
## Conclusion
Contributing to `DataReader` is a rewarding experience that benefits the entire user community. Your contributions help make `DataReader` a more robust and versatile tool. We welcome developers of all skill levels and appreciate every form of contribution, from code to documentation. Thank you for considering contributing to `DataReader`!