# smart_tools: tools to make data analysis easy
**smart_tools** is a collection of command-line tools developed in Python. It aims to make common data-analysis activities easier.
# Table of Contents
- [Where to get it](#where-to-get-it)
- [Dependencies](#dependencies)
- [How to use command-line tools](#how-to-use-command-line-tools)
- [dissector](#dissector), analyze one or more files for data profiling
- [morpher](#morpher), convert files from one format to another
- [comparator](#comparator), compare two files for differences
- [aggregator](#aggregator), append two or more files row-wise
- [fusioner](#fusioner), transform columns in a file
# Where to get it
The source code is currently hosted on GitHub at: https://github.com/arcot23/smart_tools
Binary installers for the released version are available at the [Python Package Index (PyPI)](https://pypi.org/project/smart-tools/).
```text
# PyPI
python -m pip install smart-tools
```
# Dependencies
- [pandas](https://pandas.pydata.org/)
- [pyyaml](https://pyyaml.org/)
# How to use command-line tools
To get help, run the respective executable with the `-h` argument from your terminal; for example, `dissector.exe -h`. The positional arguments are mandatory, but review the optional arguments as well, e.g., `dissector.exe dir file*.txt`.
To access these command-line tools easily, add the executable's directory to the `PATH` environment variable (`$Env:PATH` on Windows). Most tools also depend on a `config.yaml` file for certain additional settings.
```text
dissector.exe
morpher.exe
comparator.exe
aggregator.exe
fusioner.exe
└── config/
    ├── dissector_config.yaml
    ├── morpher_config.yaml
    ├── comparator_config.yaml
    ├── aggregator_config.yaml
    ├── fusioner_config.yaml
    └── ...
```
All command-line tools take an input and generate an output. The input is typically a directory `dir` together with a file or files `file`. The output is created under `dir` and comprises an output directory and output files. `dir` can be a relative path from where the command is run, or an absolute path. The folder hierarchy below shows the structure.
```text
dir
├── file1.txt
├── file2.txt
├── ...
├── .d/
│   └── dissector_result.xlsx
├── .m/
│   └── morpher_result.xlsx
├── .c/
│   └── comparator_result.xlsx
├── .a/
│   └── aggregator_result.xlsx
└── .f/
    └── fusioner_result.xlsx
```
# Dissector
**dissector.exe** is a command-line tool to analyze CSV files. The input `file` can be a single file, or multiple files from a directory `dir` that share a common column separator `sep`. The _dissected_ results can be generated as an Excel file (`xlsx`) or as text (`json` or `csv`). By default, the analysis runs on the entire content of the file, i.e., without any filters; the `slicers` option lets you slice the data and run the analysis per slice.
```commandline
usage: dissector.exe [-h] [--to {xlsx,json,csv}] [--sep SEP]
                     [--slicers [SLICERS ...]] [--nsample NSAMPLE]
                     [--outfile OUTFILE] [--config CONFIG]
                     dir file

positional arguments:
  dir                   Input directory
  file                  Input file (for multiple files use wildcard)

optional arguments:
  -h, --help            show this help message and exit
  --to {xlsx,json,csv}  Save result to xlsx or json or csv (default: xlsx)
  --sep SEP             Column separator (default: ,)
  --slicers [SLICERS ...]
                        Informs how to slice data (default: for no slicing)
  --nsample NSAMPLE     Number of samples (default: 10)
  --outfile OUTFILE     Output file name (default: dissect_result)
  --config CONFIG       Config file for meta data (default:
                        `.\config\dissector_config.yaml`)
```
The output gives the following information for each column in the input file(s).
- column: column name.
- strlen: minimum and maximum string length.
- nnull: count of NaNs and empty strings.
- nrow: number of rows.
- nunique: number of unique values.
- nvalue: number of rows with values.
- freq: frequency distribution of top n values. n is configured in `dissector_config.yaml`.
- sample: a sample of top n values. n is configured in `dissector_config.yaml`.
- symbols: non-alphanumeric characters, i.e., characters outside [a-zA-Z0-9].
- n: column order.
- filename: name of the input file from where the column was picked.
- filetype: file type the file is associated with (e.g., csv).
The output also presents additional information:
- slice: the _slice_ used to select rows. Slices represent _filter conditions_ that select subsets of rows within a dataset.
- timestamp: file modified date timestamp of the input file.
- hash: md5 hash of the input file.
- size: file size of the input file in bytes.
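The per-column statistics above can be approximated with plain pandas. The sketch below is illustrative only (the function name and the exact null/length conventions are assumptions, not dissector's actual implementation):

```python
import pandas as pd

ALNUM = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")

def dissect_column(s: pd.Series) -> dict:
    """Approximate a few of dissector's per-column statistics."""
    s = s.fillna("").astype(str)
    lengths = s.str.len()
    has_value = s != ""
    return {
        "strlen": (int(lengths.min()), int(lengths.max())),  # min/max string length
        "nnull": int((~has_value).sum()),                    # NaNs and empty strings
        "nrow": len(s),
        "nunique": int(s[has_value].nunique()),
        "nvalue": int(has_value.sum()),
        "symbols": sorted(set("".join(s)) - ALNUM),          # chars outside [a-zA-Z0-9]
    }

df = pd.DataFrame({"COLUMN1": ["a-1", "b_2", "", "a-1"]})
print(dissect_column(df["COLUMN1"]))
```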
Ensure that a YAML config file is present at `.\config\dissector_config.yaml`, relative to `dissector.exe`, before executing the command.
```yaml
---
read_csv:
  skiprows: 0
  skipfooter: 0
  engine: 'python' # {'c', 'python', 'pyarrow'}
  encoding: 'latin-1' # {'utf-8', 'latin-1'}
  quotechar: '"'
  on_bad_lines: 'warn' # {'error', 'warn', 'skip'}
  dtype: 'str'
  keep_default_na: false
```
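The keys under `read_csv` mirror keyword arguments of `pandas.read_csv`. A minimal sketch of how such a config can be applied, assuming the keys are passed straight through (the loading code is illustrative, not the tool's source):

```python
import tempfile

import pandas as pd
import yaml

config_text = """
read_csv:
  skiprows: 0
  skipfooter: 0
  engine: 'python'
  encoding: 'latin-1'
  quotechar: '"'
  on_bad_lines: 'warn'
  dtype: 'str'
  keep_default_na: false
"""
# Each key under read_csv becomes a keyword argument of pandas.read_csv.
options = yaml.safe_load(config_text)["read_csv"]

with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("a,b\n1,x\n2,\n")
    path = f.name

df = pd.read_csv(path, **options)
print(df)  # all columns read as str; empty cells stay "" (keep_default_na: false)
```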
**Examples**
- Fetch `*.csv` from `.\temp` and dissect them with `,` as column separator.
`dissector .\temp *.csv -s ,`
- Fetch `myfile.text` from `c:\temp` and dissect the file with `;` as column separator.
`dissector c:\temp myfile.text -s ;`
- Fetch `myfile.text` from `c:\temp` and dissect the file with `;` as column separator, slicing the data with a filter `COLUMN1 == 'VALUE'` and also with no filter (an empty slicer).
`dissector c:\temp myfile.text -s ; --slicers "" "COLUMN1 == 'VALUE'"`
- Fetch `myfile.txt` from `c:\temp` and dissect the file with TAB (`\t`) as column separator, slicing the data with a filter on a column name that contains a space (`COLUMN 1 == 'VALUE'`).
``dissector c:\temp myfile.txt --sep "\t" --slicers "" "`COLUMN 1` == 'VALUE'"``
Using PowerShell, you can read the arguments from a text file.
```powershell
Get-Content args.txt | ForEach-Object {
$arguments = $_ -split '#'
& dissector.exe $arguments
}
```
Here is a sample `args.txt` file.
```
.\temp#*.csv#-s#,
```
# Morpher
**morpher.exe** is a command-line tool to convert the format of a file, or of files in a directory that share a common column separator. For example, convert `file` delimited by `sep` in `dir` from csv to `xlsx` or from csv to `json`.
```text
usage: morpher.exe [-h] [--sep SEP] [--replace] [--to {xlsx,json}] dir file

positional arguments:
  dir               Input directory
  file              Input file or files (wildcard)

optional arguments:
  -h, --help        show this help message and exit
  --sep SEP         Column separator (default: ,)
  --replace         Replace output file if it already exists (default: false)
  --to {xlsx,json}  Morph to xlsx or json (default: xlsx)
```
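Conceptually, the conversion is a pandas round-trip: read the delimited file, then write it in the target format. A hedged sketch (the `morph` helper and its defaults are hypothetical; writing `xlsx` additionally requires an Excel engine such as xlsxwriter or openpyxl):

```python
import pathlib
import tempfile

import pandas as pd

def morph(path: str, sep: str = ",", to: str = "xlsx") -> str:
    """Convert a delimited text file to xlsx or json (a sketch, not morpher's source)."""
    df = pd.read_csv(path, sep=sep, dtype=str, keep_default_na=False)
    out = str(pathlib.Path(path).with_suffix("." + to))
    if to == "xlsx":
        df.to_excel(out, index=False)  # needs xlsxwriter or openpyxl installed
    else:
        df.to_json(out, orient="records")
    return out

with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("a,b\n1,x\n")
    src = f.name

print(morph(src, to="json"))  # path of the converted .json file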
# Comparator
**comparator.exe** is a command-line tool to compare one file with another.
```text
usage: comparator.exe [-h] [-s SEP] [-t {xlsx,json,csv}] file1 file2

positional arguments:
  file1                 File to compare
  file2                 File to compare with

optional arguments:
  -h, --help            show this help message and exit
  -s SEP, --sep SEP     Column separator (default: `,`)
  -t {xlsx,json,csv}, --to {xlsx,json,csv}
                        Save result to xlsx or json or csv (default: `xlsx`)
```
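For two files with the same shape and columns, a cell-level diff can be expressed with `DataFrame.compare`; this is an illustrative sketch, not necessarily comparator's algorithm:

```python
import pandas as pd

# Stand-ins for two parsed input files with identical columns and row order
df1 = pd.DataFrame({"id": [1, 2, 3], "val": ["a", "b", "c"]})
df2 = pd.DataFrame({"id": [1, 2, 3], "val": ["a", "B", "c"]})

# DataFrame.compare keeps only the cells that differ
# ("self" = first frame, "other" = second frame)
diff = df1.compare(df2)
print(diff)
```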
# Aggregator
**aggregator.exe** is a command-line tool to aggregate two or more files into one.
```text
usage: aggregator.py [-h] [--sep SEP] [--to {xlsx,json,csv}]
                     [--outfile OUTFILE] [--config CONFIG]
                     dir file

positional arguments:
  dir                   Input directory
  file                  Input file or files (for multiple files use wildcard)

optional arguments:
  -h, --help            show this help message and exit
  --sep SEP             Column separator (default: `,`)
  --to {xlsx,json,csv}  Save result to xlsx or json or csv (default: `xlsx`)
  --outfile OUTFILE     Output directory and file name (default:
                        .\.a\aggregated_result)
  --config CONFIG       Config file for meta data (default:
                        `.\config\aggregator_config.yaml`)
```
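Appending files row-wise corresponds to a pandas `concat`; a minimal sketch of the idea (the sample frames are hypothetical):

```python
import pandas as pd

# Two frames with the same columns, e.g. two extracts of the same report
frames = [
    pd.DataFrame({"id": [1, 2], "val": ["a", "b"]}),
    pd.DataFrame({"id": [3], "val": ["c"]}),
]

# Row-wise append; ignore_index renumbers rows 0..n-1 across all inputs
combined = pd.concat(frames, ignore_index=True)
print(combined)
```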
# Fusioner
**fusioner.exe** is a command-line tool to transform columns in a file.
```text
usage: fusioner.py [-h] [--sep SEP] [--outfile OUTFILE] [--config CONFIG] file

positional arguments:
  file               Input file

optional arguments:
  -h, --help         show this help message and exit
  --sep SEP          Column separator (default: ,)
  --outfile OUTFILE  Output directory and file name (default:
                     .\.f\fusioner_result)
  --config CONFIG    Config file for ETL (default:
                     `.\config\fusioner_config.toml`)
```
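A column transformation of the kind fusioner performs can be sketched in pandas as deriving new columns from existing ones (the rule below is a hypothetical example, not taken from a real `fusioner_config.toml`):

```python
import pandas as pd

df = pd.DataFrame({"first": ["ada", "grace"], "last": ["lovelace", "hopper"]})

# A hypothetical transform rule, akin to what a fusioner config might declare:
# derive a new column by combining and title-casing existing ones.
df["full_name"] = (df["first"] + " " + df["last"]).str.title()
print(df)
```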