# pyrsona
<img align="left" src="https://github.com/johnbullnz/pyrsona/actions/workflows/python.yml/badge.svg"><br>
Text data file validation and structure management using the [pydantic](https://pydantic-docs.helpmanual.io/) and [parse](https://github.com/r1chardj0n3s/parse) Python packages.
## Installation
Install using `pip install pyrsona`.
## A Simple Example
For the text file `example.txt`:
```
operator name: Jane Smith
country: NZ
year: 2022

ID,Time,Duration (sec),Reading
1,20:04:05,12.2,2098
2,20:05:00,2.35,4328
```
The following *pyrsona* file structure model can be defined:
```python
from pyrsona import BaseStructure
from pydantic import BaseModel
from datetime import time


class ExampleStructure(BaseStructure):

    structure = (
        "operator name: {operator_name}\n"
        "country: {country}\n"
        "year: {}\n"
        "\n"
        "ID,Time,Duration (sec),Reading\n"
    )

    class meta_model(BaseModel):
        operator_name: str
        country: str

    class row_model(BaseModel):
        id: int
        time: time
        duration_sec: float
        value: float
```
The `read()` method can then be used to read the file, parse its contents and validate the meta data and table rows:
```python
meta, table_rows, structure_id = ExampleStructure.read("example.txt")
print(meta)
#> {'operator_name': 'Jane Smith', 'country': 'NZ'}
print(table_rows)
#> [{'id': 1, 'time': datetime.time(20, 4, 5), 'duration_sec': 12.2, 'value': 2098.0},
#   {'id': 2, 'time': datetime.time(20, 5), 'duration_sec': 2.35, 'value': 4328.0}]
print(structure_id)
#> ExampleStructure
```
**What's going on here:**
- The `structure` class attribute defines the basic file structure: the meta data lines and the table header lines. Variable text of interest is replaced with curly brackets containing a field name, e.g. `'{operator_name}'`, while variable text that should be ignored is replaced with empty curly brackets, e.g. `'{}'`. The `structure` definition must reproduce every space, tab and new line character exactly for a file to match it. The named fields in the `structure` definition are passed to `meta_model`.
- `meta_model` is simply a [pydantic model](https://pydantic-docs.helpmanual.io/usage/models/) with field names that match the named fields in the `structure` definition. All values sent to `meta_model` will be strings and these will be converted to the field types defined in `meta_model`. Custom [pydantic validators](https://pydantic-docs.helpmanual.io/usage/validators/) can be included in the `meta_model` definition as per standard pydantic models.
- `row_model` is also a [pydantic model](https://pydantic-docs.helpmanual.io/usage/models/). This time the field names do not need to match the header line in the `structure` definition; however, the `row_model` fields do need to be provided in the **same order as the table columns**. This allows the table column names to be customised/standardised where the user does not control the file structure itself. Again, custom [pydantic validators](https://pydantic-docs.helpmanual.io/usage/validators/) can be included in the `row_model` definition if required.
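The positional mapping described for `row_model` can be pictured with a small standard-library sketch. This is not pyrsona's implementation: `ROW_FIELDS` and `map_row` are hypothetical names, and plain converter functions stand in for the pydantic field types.

```python
from datetime import time

# Hypothetical stand-in for row_model: field names paired with converters,
# listed in the same order as the table columns.
ROW_FIELDS = {
    "id": int,
    "time": time.fromisoformat,
    "duration_sec": float,
    "value": float,
}


def map_row(line):
    """Pair each comma-separated value with the field at the same position."""
    values = line.strip().split(",")
    if len(values) != len(ROW_FIELDS):
        raise ValueError(f"expected {len(ROW_FIELDS)} columns, got {len(values)}")
    return {
        name: convert(value)
        for (name, convert), value in zip(ROW_FIELDS.items(), values)
    }


row = map_row("1,20:04:05,12.2,2098")
print(row)
```

Note that the file header calls the last column `Reading` while the resulting key is `value`: the names come from the model and only the order comes from the file.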
## Another Example
Should the file structure change at some point in the future, a new model can be created based on the original model. The new model is referred to as a *sub-model*, and the original model as its *parent* model.
Given the slightly modified file structure of `new_example.txt`:
```
operator name: Jane Smith
country: NZ
city: Auckland
year: 2022

ID,Time,Duration (sec),Reading
1,20:04:05,12.2,2098
2,20:05:00,2.35,4328
```
Attempting to parse this file using the original `ExampleStructure` model will raise a `PyrsonaError` due to the addition of the `'city: Auckland'` line. To parse the file successfully and capture the new `'city'` field, the following *sub-model* should be defined:
```python
from pyrsona import BaseStructure
from pydantic import BaseModel
from datetime import time


class NewExampleStructure(ExampleStructure):

    structure = (
        "operator name: {operator_name}\n"
        "country: {country}\n"
        "city: {city}\n"
        "year: {}\n"
        "\n"
        "ID,Time,Duration (sec),Reading\n"
    )

    class meta_model(BaseModel):
        operator_name: str
        country: str
        city: str
```
`ExampleStructure` is still used as the entry point; however, *pyrsona* will attempt to parse the file using any *sub-models* that exist (in this case `NewExampleStructure`) before using `ExampleStructure` itself.
```python
meta, table_rows, structure_id = ExampleStructure.read("new_example.txt")
print(meta)
#> {'operator_name': 'Jane Smith', 'country': 'NZ', 'city': 'Auckland'}
print(table_rows)
#> [{'id': 1, 'time': datetime.time(20, 4, 5), 'duration_sec': 12.2, 'value': 2098.0},
#   {'id': 2, 'time': datetime.time(20, 5), 'duration_sec': 2.35, 'value': 4328.0}]
print(structure_id)
#> NewExampleStructure
```
**What's going on here:**
- A new *pyrsona* file structure model is defined based on the original `ExampleStructure` model. This means that `structure`, `meta_model` and `row_model` will be inherited from `ExampleStructure`. This also provides a single entry point (i.e. `ExampleStructure.read()`) when attempting to read the different file versions.
- `structure` and `meta_model` are redefined to include the new `"city: Auckland"` meta data line. Alternatively, the original `meta_model` in `ExampleStructure` could have been updated to include an *optional* `city` field.
## Post-processors
It is sometimes necessary to modify some of the data following parsing by the `meta_model` and `row_model`. Two post-processing methods are available for this purpose.
Using the `ExampleStructure` class above, `meta_postprocessor` and `table_postprocessor` static methods are defined for post-processing the meta data and table_rows, respectively:
```python
class ExampleStructure(BaseStructure):

    # Lines omitted for brevity

    @staticmethod
    def meta_postprocessor(meta):
        meta["version"] = 3
        return meta

    @staticmethod
    def table_postprocessor(table_rows, meta):
        # Add a cumulative total and delete the "id" field:
        total = 0
        for ii, row in enumerate(table_rows):
            total += row["value"]
            row["total"] = total
            del row["id"]
            table_rows[ii] = row
        return table_rows
```
The meta data and table_rows are now run through the post-processing stages before being returned, resulting in the following changes:
- A new *version* field is added to the meta data.
- The *id* field is deleted from the table_rows and a cumulative total field is added.
```python
meta, table_rows, structure_id = ExampleStructure.read("example.txt")
print(meta)
#> {'operator_name': 'Jane Smith', 'country': 'NZ', 'version': 3}
print(table_rows)
#> [{'time': datetime.time(20, 4, 5), 'duration_sec': 12.2, 'value': 2098.0,
# 'total': 2098.0}, {'time': datetime.time(20, 5), 'duration_sec': 2.35, 'value': 4328.0,
# 'total': 6426.0}]
print(structure_id)
#> ExampleStructure
```
### Array data in field
Sometimes the table rows contain array data that is not easily converted to a pydantic model. In this case, the `row_model` can be omitted and the `table_postprocessor` method can be used to convert the table rows into a more suitable format.
```python
from pyrsona import BaseStructure
from pydantic import BaseModel


class ExampleStructure(BaseStructure):

    structure = (
        "operator name: {operator_name}\n"
        "country: {country}\n"
        "year: {}\n"
        "\n"
        "ID,Time,Duration (sec),Reading\n"
    )

    class meta_model(BaseModel):
        operator_name: str
        country: str

    @staticmethod
    def table_postprocessor(table_rows, meta):

        # Model used only inside the post-processor to validate each row
        class row_model(BaseModel):
            id: int
            array_data: list[str]

        # First column is the row id; the remaining columns form the array
        ids = [row[0] for row in table_rows]
        array_data = [row[1:] for row in table_rows]

        table_rows = [
            row_model(id=row_id, array_data=row_array_data).dict()
            for row_id, row_array_data in zip(ids, array_data)
        ]

        return table_rows
```
When no `row_model` is defined, each table row is returned as a list of strings. The `table_postprocessor` method can then convert the rows into a more suitable format using custom logic.
```python
print(table_rows)
#> [{'id': 1, 'array_data': ['20:04:05', '12.2', '2098']},
#   {'id': 2, 'array_data': ['20:05:00', '2.35', '4328']}]
```
## Extra details
### All meta lines MUST be included
While the *parse* package allows a wildcard `'{}'` to be used to ignore several lines, this can cause a named field to be unexpectedly included in the wildcard section. *pyrsona* therefore checks for the presence of a new line character `'\n'` in the named field values and fails if one is found.
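The kind of mismatch this check guards against can be reproduced with the standard library alone. The sketch below uses a regular expression as a rough stand-in for a *parse* pattern (an assumption made to keep the example dependency-free), and `check_named_fields` is a hypothetical helper rather than a pyrsona function. The pattern omits the `city` line, so the `country` field silently absorbs it:

```python
import re

text = (
    "operator name: Jane Smith\n"
    "country: NZ\n"
    "city: Auckland\n"
    "year: 2022\n"
)

# Regex emulation of a structure that forgets the "city" line. With DOTALL,
# ".+?" is allowed to run across line breaks, much like a permissive field.
pattern = re.compile(
    r"operator name: (?P<operator_name>.+?)\n"
    r"country: (?P<country>.+?)\n"
    r"year: (?:.+?)\n",
    re.DOTALL,
)


def check_named_fields(fields):
    """Raise if any named field value spans more than one line."""
    bad = [name for name, value in fields.items() if "\n" in value]
    if bad:
        raise ValueError(f"named fields contain a new line: {bad}")
    return fields


fields = pattern.match(text).groupdict()
print(repr(fields["country"]))

try:
    check_named_fields(fields)
except ValueError as error:
    print(error)
```

The match itself succeeds, which is why a separate check on the captured values is needed to reject the file.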
### Sub-sub-models
Calling the `read()` method will first build a list of *pyrsona* file structure models from the *parent* model down.
Any *sub-models* of the *parent* model will themselves be checked for *sub-models*, meaning that every model in the tree below the *parent* model will be used when attempting to parse a file.
Each branch of models will be ordered bottom-up so that the deepest nested model in a branch will be used first. The *parent* model will be the final model used if all others fail.
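One way to produce such an ordering is a depth-first walk over Python's built-in `__subclasses__()`. The sketch below is an assumption about the mechanism, not pyrsona's actual code; the class names and `models_bottom_up` are hypothetical.

```python
class Parent:
    pass


class Child(Parent):
    pass


class GrandChild(Child):
    pass


class OtherChild(Parent):
    pass


def models_bottom_up(model):
    """List every model in the tree, deepest models first, the root last."""
    ordered = []
    for sub_model in model.__subclasses__():
        # Recurse first so that nested sub-models precede their own parent.
        ordered.extend(models_bottom_up(sub_model))
    ordered.append(model)
    return ordered


names = [model.__name__ for model in models_bottom_up(Parent)]
print(names)
```

With this ordering, a reader would try `GrandChild` before `Child`, and `Parent` only after every sub-model has failed.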
### Model names
The `read()` method returns a `structure_id` variable that matches the model name. This `structure_id` can be useful when creating automated tests that sit alongside the *pyrsona* models as it provides a mechanism for confirming that a text file was parsed using the expected *pyrsona* model where multiple *sub-models* exist.
As the number of *sub-models* grows, a naming convention becomes more important. One option is to name each *sub-model* with a random hexadecimal value prefixed with a single underscore (in case the value begins with a number), e.g. `'_a4c15356'`. The initial underscore is removed from the model name when the `structure_id` value is returned.
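A name of this form can be generated with the standard library. In the sketch below, `random_model_name` and `to_structure_id` are hypothetical helpers that only illustrate the convention; they are not part of pyrsona.

```python
import secrets


def random_model_name():
    """Return an underscore-prefixed, 8-character hexadecimal name."""
    return "_" + secrets.token_hex(4)


def to_structure_id(model_name):
    """Drop a single leading underscore, as described for structure_id."""
    return model_name[1:] if model_name.startswith("_") else model_name


name = random_model_name()
print(name)
print(to_structure_id("_a4c15356"))
```

The underscore keeps the name a valid Python identifier even when the hexadecimal value starts with a digit.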
### *parse* formats
The *parse* package allows format specifications to be included alongside the fields, e.g. `'{year:d}'`. While including these format types in the structure definition is valid, more complex format conversions can be made using `meta_model`. Keeping all format conversions in `meta_model` means that all conversions are defined in one place.
Raw data
{
"_id": null,
"home_page": "https://github.com/johnbullnz/pyrsona",
"name": "pyrsona",
"maintainer": null,
"docs_url": null,
"requires_python": "<4.0,>=3.9",
"maintainer_email": null,
"keywords": null,
"author": "John",
"author_email": "johnbullnz@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/78/00/80bf3140b449f4a8917936fdbc40864296504052a3babcd56927f6dbc362/pyrsona-1.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": null,
"version": "1.0",
"project_urls": {
"Homepage": "https://github.com/johnbullnz/pyrsona"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "7a3b5450bc7197918f9379654617114dfed9ac3e88ef41b8f985b3bf6504760c",
"md5": "605e14dfcaeec4c6463a3c6d94b6256c",
"sha256": "7b8951b1b8d0ce2b0d385809e80ef1fc51bf1f44a3561827aa42dfee29348a61"
},
"downloads": -1,
"filename": "pyrsona-1.0-py3-none-any.whl",
"has_sig": false,
"md5_digest": "605e14dfcaeec4c6463a3c6d94b6256c",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<4.0,>=3.9",
"size": 7818,
"upload_time": "2024-08-07T04:01:27",
"upload_time_iso_8601": "2024-08-07T04:01:27.975155Z",
"url": "https://files.pythonhosted.org/packages/7a/3b/5450bc7197918f9379654617114dfed9ac3e88ef41b8f985b3bf6504760c/pyrsona-1.0-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "780080bf3140b449f4a8917936fdbc40864296504052a3babcd56927f6dbc362",
"md5": "6496f2d3d1c080eea4e2cd1c269398c3",
"sha256": "3e43e2007633d5a9c5480454922e4d1326ebc5fb82683b8a286db6b2965d798f"
},
"downloads": -1,
"filename": "pyrsona-1.0.tar.gz",
"has_sig": false,
"md5_digest": "6496f2d3d1c080eea4e2cd1c269398c3",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<4.0,>=3.9",
"size": 7469,
"upload_time": "2024-08-07T04:01:29",
"upload_time_iso_8601": "2024-08-07T04:01:29.324015Z",
"url": "https://files.pythonhosted.org/packages/78/00/80bf3140b449f4a8917936fdbc40864296504052a3babcd56927f6dbc362/pyrsona-1.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-08-07 04:01:29",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "johnbullnz",
"github_project": "pyrsona",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "pyrsona"
}