### ***TAB-dataset : A tool for structuring tabular data***
*TAB-dataset analyzes, measures and transforms the relationships between Fields in any tabular Dataset.*
*The TAB-dataset tool is part of the [Environmental Sensing Project](https://github.com/loco-philippe/Environmental-Sensing#readme)*
For more information, see the [user guide](https://loco-philippe.github.io/tab-dataset/docs/user_guide.html) or the [github repository](https://github.com/loco-philippe/tab-dataset).
# What is TAB-dataset ?
## Principles
In tabular data, columns and rows are not equivalent: the columns (or fields) represent the 'semantics' of the data, while the rows represent the objects, arranged according to the structure defined by the columns.
The TAB-dataset tool measures and analyzes relationships between fields via the [TAB-analysis tool](https://github.com/loco-philippe/tab-analysis#readme).
TAB-dataset uses the relationships between fields to produce an optimized JSON format (JSON-TAB format).
It also identifies data that does not respect given relationships.
Finally, it proposes transformations of the data set to respect a set of relationships.
TAB-dataset is used by [ntv_pandas](https://github.com/loco-philippe/ntv-pandas/blob/main/README.md) to identify consistency errors in a DataFrame.
## Examples
Here is a price list of different foods based on packaging.
| plants | quantity | product | price |
|-----------|----------|---------|-------|
| fruit | 1 kg | apple | 1 |
| fruit | 10 kg | apple | 10 |
| fruit | 1 kg | orange | 2 |
| fruit | 10 kg | orange | 20 |
| vegetable | 1 kg | peppers | 1.5 |
| vegetable | 10 kg | peppers | 15 |
| fruit | 1 kg | banana | 0.5 |
| fruit | 10 kg | banana | 5 |
In this example, we observe two kinds of relationships:
- classification ("derived" relationship): between 'plants' and 'product' (each product belongs to exactly one plant category)
- crossing ("crossed" relationship): between 'product' and 'quantity' (all the combinations of the two fields are present).
Each record also has a unique combination of 'product' and 'quantity', so the dataset can be converted into a matrix:
| price | 1 kg | 10 kg|
|---------|------|------|
| apple | 1 | 10 |
| orange | 2 | 20 |
| peppers | 1.5 | 15 |
| banana | 0.5 | 5 |
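The two relationships and the matrix conversion described above can be checked with plain Python. The sketch below is only an illustration under simplified definitions: `is_derived` and `is_crossed` are hypothetical helper names, not part of the TAB-dataset or TAB-analysis API, and they do not reproduce how those tools compute relationships.

```python
from collections import defaultdict

# The price list from the table above, one tuple per row.
rows = [
    ("fruit", "1 kg", "apple", 1),
    ("fruit", "10 kg", "apple", 10),
    ("fruit", "1 kg", "orange", 2),
    ("fruit", "10 kg", "orange", 20),
    ("vegetable", "1 kg", "peppers", 1.5),
    ("vegetable", "10 kg", "peppers", 15),
    ("fruit", "1 kg", "banana", 0.5),
    ("fruit", "10 kg", "banana", 5),
]
plants, quantity, product, price = map(list, zip(*rows))


def is_derived(parent, child):
    """Simplified test: every child value maps to exactly one parent value."""
    parents_of = defaultdict(set)
    for p, c in zip(parent, child):
        parents_of[c].add(p)
    return all(len(ps) == 1 for ps in parents_of.values())


def is_crossed(a, b):
    """Simplified test: every combination of distinct values of a and b occurs."""
    return len(set(zip(a, b))) == len(set(a)) * len(set(b))


print(is_derived(plants, product))    # each product has a single plant category
print(is_crossed(product, quantity))  # all product/quantity pairs are present

# Because (product, quantity) identifies each row, price can be laid out as a matrix.
matrix = defaultdict(dict)
for q, prod, value in zip(quantity, product, price):
    matrix[prod][q] = value
print(dict(matrix))
```

With the `prices` data, both checks succeed, and `matrix` holds one price per (product, quantity) cell, matching the table above.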
```python
In [1]: # creation of the `prices` object
from tab_dataset import Sdataset
tabular = {'plants': ['fruit', 'fruit','fruit', 'fruit','vegetable','vegetable','fruit', 'fruit' ],
'quantity': ['1 kg' , '10 kg', '1 kg', '10 kg', '1 kg', '10 kg', '1 kg', '10 kg' ],
'product': ['apple', 'apple', 'orange', 'orange', 'peppers', 'peppers', 'banana', 'banana'],
'price': [1, 10, 2, 20, 1.5, 15, 0.5, 5 ]}
prices = Sdataset.ntv(tabular)
In [2]: # the `field_partition` method returns the main structure of the dataset (see TAB-analysis)
prices.field_partition(mode='id')
Out[2]: {'primary': ['quantity', 'product'],
'secondary': ['plants'],
'unique': [],
'variable': ['price']}
In [4]: # we can send the data to tools supporting the identified data structure
prices.to_xarray()
Out[4]: <xarray.DataArray 'price' (quantity: 2, product: 4)>
array([[1, 2, 1.5, 0.5],
[10, 20, 15, 5]], dtype=object)
Coordinates:
* quantity (quantity) object '1 kg' '10 kg'
* product (product) object 'apple' 'orange' 'peppers' 'banana'
plants (product) object 'fruit' 'fruit' 'vegetable' 'fruit'
In [5]: # what if an error occurs ?
tabul_2 = {'plants': ['fruit', 'fruit','fruit', 'fruit','vegetable','vegetable','vegetable','fruit' ],
'quantity': ['1 kg' , '10 kg', '1 kg', '10 kg', '1 kg', '10 kg', '1 kg', '10 kg' ],
'product': ['apple', 'apple', 'orange', 'orange', 'peppers', 'peppers', 'banana', 'banana'],
'price': [1, 10, 2, 20, 1.5, 15, 0.5, 5 ]}
prices = Sdataset.ntv(tabul_2)
In [6]: # the relationship is no longer 'derived'
prices.relation('plants', 'product').typecoupl
Out[6]: 'linked'
In [7]: # how much data prevents the relationship from being 'derived' ?
prices.relation('plants', 'product').distomin
Out[7]: 1
In [8]: # What data needs to be corrected ?
prices.check_relation('product', 'plants', 'derived', value=True)
Out[8]: {'row': [6, 7],
'plants': ['vegetable', 'fruit'],
'product': ['banana', 'banana']}
```
## Dataset structure
To analyze the relationships between fields, a specific data model is used:
- each field is transformed into a list of distinct values and a list of pointers to these values
- the analysis is then carried out on these lists of pointers
> Example :
>
> The field: ['john', 'anna', 'paul', 'anna', 'john', 'lisa'] is transformed into:
>
> - a first list of values: ['john', 'anna', 'paul', 'lisa']
> - a second list of pointers: [0, 1, 2, 1, 0, 3].
>
> This format is used, for example, by the 'categorical' data type of pandas.
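The transformation in the example above can be sketched in a few lines of Python. The function names `encode` and `decode` are hypothetical and only illustrate the idea; they are not TAB-dataset functions.

```python
def encode(field):
    """Split a field into its distinct values (in order of first appearance)
    and a list of pointers into that list of values."""
    values = []
    position = {}  # value -> index in `values`
    pointers = []
    for item in field:
        if item not in position:
            position[item] = len(values)
            values.append(item)
        pointers.append(position[item])
    return values, pointers


def decode(values, pointers):
    """Rebuild the original field from its values and pointers."""
    return [values[p] for p in pointers]


field = ['john', 'anna', 'paul', 'anna', 'john', 'lisa']
values, pointers = encode(field)
print(values)    # ['john', 'anna', 'paul', 'lisa']
print(pointers)  # [0, 1, 2, 1, 0, 3]
print(decode(values, pointers) == field)  # True
```

Working on the pointer lists rather than on the raw values means that two fields can be compared independently of the type of the values they contain.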
## JSON interface
TAB-dataset uses the relationships between fields to produce an optimized JSON format ([JSON-TAB format](https://github.com/loco-philippe/NTV/blob/main/documentation/JSON-TAB-standard.pdf)).
```python
In [9]: # the JSON length (equivalent to CSV length) is not optimized
import json
len(json.dumps(tabular))
Out[9]: 309
In [10]: # the JSON-TAB format is optimized
len(json.dumps(prices.to_ntv().to_obj()))
Out[10]: 193
In [11]: prices.to_ntv().to_obj()
Out[11]: {'plants': [['fruit', 'vegetable'], 2, [0, 0, 1, 0]],
'quantity': [['1 kg', '10 kg'], [1]],
'product': [['apple', 'orange', 'peppers', 'banana'], [2]],
'price': [1, 10, 2, 20, 1.5, 15, 0.5, 5]}
In [12]: # the JSON-TAB format is reversible
Sdataset.from_ntv(prices.to_ntv().to_obj()) == prices
Out[12]: True
```
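To see why the compact form loses no information, here is a sketch that expands the output shown above back into full columns. It relies on an interpretation inferred from this example only, which is an assumption: a trailing `[n]` is read as "repeat each value n times and cycle", and `2, [0, 0, 1, 0]` is read as "one pointer per distinct value of field number 2 ('product')". The JSON-TAB specification linked above is the authoritative description, and the helper names `expand_implicit` and `expand_relative` are hypothetical.

```python
def expand_implicit(values, coef, length):
    """Assumed reading of [values, [coef]]: each value is repeated `coef`
    times and the pattern cycles until `length` rows are produced."""
    return [(i // coef) % len(values) for i in range(length)]


def expand_relative(parent_pointers, keys):
    """Assumed reading of [values, parent, keys]: `keys` gives one pointer
    per distinct value of the parent field."""
    return [keys[p] for p in parent_pointers]


length = 8  # number of rows, taken from the full 'price' list

quantity_values = ['1 kg', '10 kg']
product_values = ['apple', 'orange', 'peppers', 'banana']
plants_values = ['fruit', 'vegetable']

quantity_ptr = expand_implicit(quantity_values, 1, length)
product_ptr = expand_implicit(product_values, 2, length)
plants_ptr = expand_relative(product_ptr, [0, 0, 1, 0])

quantity = [quantity_values[p] for p in quantity_ptr]
product = [product_values[p] for p in product_ptr]
plants = [plants_values[p] for p in plants_ptr]

print(quantity)
print(product)
print(plants)
```

Under these assumptions the three expanded lists match the columns of the original `tabular` dictionary, which is consistent with the reversibility shown above.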
## Uses
TAB-dataset accepts pandas DataFrames, JSON data ([NTV format](https://github.com/loco-philippe/NTV#readme)), and simple structures such as a list of lists or a dict of lists.
Possible uses are as follows:
- control of a dataset in relation to a data model
- quality indicators of a dataset
- analysis of datasets
- error detection and correction
- generation of optimized data formats (alternative to CSV format)
- interface to specific applications
Raw data
{
"_id": null,
"home_page": "https://github.com/loco-philippe/tab_dataset/blob/main/README.md",
"name": "tab-dataset",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.9, <4",
"maintainer_email": "",
"keywords": "tabular data,open data,environmental data",
"author": "Philippe Thomy",
"author_email": "philippe@loco-labs.io",
"download_url": "https://files.pythonhosted.org/packages/40/1c/32e9f8c62ceedab8f271451e06bd0225eee3acf73517969a059d58156796/tab_dataset-0.1.1.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "TAB-dataset : A tool for structuring tabular data",
"version": "0.1.1",
"project_urls": {
"Homepage": "https://github.com/loco-philippe/tab_dataset/blob/main/README.md"
},
"split_keywords": [
"tabular data",
"open data",
"environmental data"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "a772ad86b04a384a0c90121c8f695435ce3c525aec746599faa8c5193f77ce0e",
"md5": "b2f0bb1a975cbb9534a9eecdddbbb988",
"sha256": "99004f1b50da1a1e9c63c2dbfc34dfd7165eaeeba2b40939e9753eb1002a4df2"
},
"downloads": -1,
"filename": "tab_dataset-0.1.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "b2f0bb1a975cbb9534a9eecdddbbb988",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9, <4",
"size": 37137,
"upload_time": "2024-01-05T10:47:35",
"upload_time_iso_8601": "2024-01-05T10:47:35.209971Z",
"url": "https://files.pythonhosted.org/packages/a7/72/ad86b04a384a0c90121c8f695435ce3c525aec746599faa8c5193f77ce0e/tab_dataset-0.1.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "401c32e9f8c62ceedab8f271451e06bd0225eee3acf73517969a059d58156796",
"md5": "fd972660e16b312bafa5ad2a43d46740",
"sha256": "ad710a7c53e8bbf38f24a15b16031237d1dae52499c82d8dfa7834ec3f3eb243"
},
"downloads": -1,
"filename": "tab_dataset-0.1.1.tar.gz",
"has_sig": false,
"md5_digest": "fd972660e16b312bafa5ad2a43d46740",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9, <4",
"size": 49040,
"upload_time": "2024-01-05T10:47:37",
"upload_time_iso_8601": "2024-01-05T10:47:37.251462Z",
"url": "https://files.pythonhosted.org/packages/40/1c/32e9f8c62ceedab8f271451e06bd0225eee3acf73517969a059d58156796/tab_dataset-0.1.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-01-05 10:47:37",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "loco-philippe",
"github_project": "tab_dataset",
"github_not_found": true,
"lcname": "tab-dataset"
}