smart-tools


- Name: smart-tools
- Version: 0.10.3
- Home page: https://github.com/arcot23/smart_tools
- Summary: A variety of smart tools to make analytics easy
- Upload time: 2024-10-13 12:30:28
- Maintainer: None
- Docs URL: None
- Author: Prabhuram Venkatesan
- Requires Python: >=3.8
- License: None
- Keywords: smart, tools, dissector, morpher, comparator, aggregator, fusioner, analysis, analyze, data
- Requirements: pandas, pyyaml, xlsxwriter, build, sqlalchemy, oracledb, scikit-learn, matplotlib, tabulate, lxml, html5lib, beautifulsoup4, openpyxl

# smart_tools: tools to make data analysis easy

**smart_tools** contains a collection of command-line tools developed in Python. It aims to make common data analysis activities easier.

# Table of Contents

- [Where to get it](#where-to-get-it)
- [Dependencies](#dependencies)
- [How to use command-line tools](#how-to-use-command-line-tools)
- [dissector](#dissector), analyze one or more files for data profiling
- [morpher](#morpher), convert files from one format to another
- [comparator](#comparator), compare two files for differences
- [aggregator](#aggregator), append two or more files row-wise
- [fusioner](#fusioner), transform columns in a file

# Where to get it

The source code is currently hosted on GitHub at: https://github.com/arcot23/smart_tools

Binary installers for the released version are available at the [Python Package Index (PyPI)](https://pypi.org/project/smart-tools/)

```text
# PyPI
python -m pip install smart-tools
```

# Dependencies

- [pandas](https://pandas.pydata.org/)
- [pyyaml](https://pyyaml.org/)

# How to use command-line tools

To get help, run the respective executable with the `-h` argument from your terminal. For example, `dissector.exe -h` shows the help for dissector. Positional arguments are mandatory; review the optional arguments as well before running a command such as `dissector.exe dir file*.txt`.

To access these command-line tools easily, add the executables' directory to the `PATH` environment variable (`$Env:PATH` in Windows PowerShell). Most tools also depend on a `config.yaml` file for certain additional settings.

```text
dissector.exe
morpher.exe
comparator.exe
aggregator.exe
fusioner.exe
└── config/
    ├── dissector_config.yaml
    ├── morpher_config.yaml
    ├── comparator_config.yaml
    ├── aggregator_config.yaml
    ├── fusioner_config.yaml
    └── ...
```

All command-line tools take an input and generate an output. The input is typically a directory `dir` together with a file or files `file`. The output is created under `dir` and comprises an output directory and output files. `dir` can be an absolute path or a path relative to where the command is run. The folder hierarchy below shows the structure.

```text
dir
├── file1.txt
├── file2.txt
├── ...
├── .d/
│   └── dissector_result.xlsx
├── .m/
│   └── morpher_result.xlsx
├── .c/
│   └── comparator_result.xlsx
├── .a/
│   └── aggregator_result.xlsx
└── .f/
    └── fusioner_result.xlsx
```
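
As a rough illustration of the layout above, the sketch below builds the expected result path for a given tool. The `OUTPUT_DIRS` mapping and the `result_path` helper are hypothetical names used only for this example; the directory and file names are taken from the tree above, not from the smart_tools source.

```python
from pathlib import Path

# Hypothetical mapping from tool name to its output folder, per the tree above.
OUTPUT_DIRS = {
    "dissector": ".d",
    "morpher": ".m",
    "comparator": ".c",
    "aggregator": ".a",
    "fusioner": ".f",
}


def result_path(input_dir: str, tool: str, ext: str = "xlsx") -> Path:
    """Return where `tool` would place its result under `input_dir`."""
    return Path(input_dir) / OUTPUT_DIRS[tool] / f"{tool}_result.{ext}"


print(result_path("temp", "dissector").as_posix())
```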

# Dissector

**dissector.exe** is a command-line tool to analyze CSV files. The input `file` can be a single file or multiple files from a directory `dir` that share a common column separator `sep`. The _dissected_ results can be generated as an Excel file (`xlsx`) or as text (`json` or `csv`). By default, the analysis runs on the entire content of the file, i.e., without any filters. Use `slicers` to slice the data and run the analysis on each slice.


```commandline
usage: dissector.exe [-h] [--to {xlsx,json,csv}] [--sep SEP]
                    [--slicers [SLICERS ...]] [--nsample NSAMPLE]
                    [--outfile OUTFILE] [--config CONFIG]
                    dir file

positional arguments:
  dir                   Input directory
  file                  Input file (for multiple files use wildcard)

optional arguments:
  -h, --help            show this help message and exit
  --to {xlsx,json,csv}  Save result to xlsx or json or csv (default: xlsx)
  --sep SEP             Column separator (default: ,)
  --slicers [SLICERS ...]
                        Informs how to slice data (default: for no slicing)
  --nsample NSAMPLE     Number of samples (default: 10)
  --outfile OUTFILE     Output file name (default: dissect_result)
  --config CONFIG       Config file for meta data (default:
                        `.\config\dissector_config.yaml`)
```
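
The usage text above maps naturally onto Python's `argparse`. The following is a minimal sketch that mirrors the documented options; it is not the tool's actual source, and the `[""]` default for `--slicers` is an assumption based on the "no slicing" default described in the help.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the dissector usage text above."""
    parser = argparse.ArgumentParser(prog="dissector.exe")
    parser.add_argument("dir", help="Input directory")
    parser.add_argument("file", help="Input file (for multiple files use wildcard)")
    parser.add_argument("--to", choices=["xlsx", "json", "csv"], default="xlsx")
    parser.add_argument("--sep", default=",", help="Column separator")
    # Assumption: an empty string stands for "no slicing".
    parser.add_argument("--slicers", nargs="*", default=[""])
    parser.add_argument("--nsample", type=int, default=10)
    parser.add_argument("--outfile", default="dissect_result")
    parser.add_argument("--config", default=r".\config\dissector_config.yaml")
    return parser


args = build_parser().parse_args(
    ["temp", "*.csv", "--sep", ";", "--slicers", "", "COLUMN1 == 'VALUE'"]
)
print(args.sep, args.slicers)
```

Note that the long form `--sep` is used here because it is the only spelling the usage text lists for dissector.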


The output gives the following information for each column in the input file(s).

- column: column name.
- strlen: minimum and maximum string length.
- nnull: count of NaNs and empty strings.
- nrow: number of rows.
- nunique: number of unique values.
- nvalue: number of rows with values.
- freq: frequency distribution of top n values. n is configured in `dissector_config.yaml`.
- sample: a sample of top n values. n is configured in `dissector_config.yaml`.
- symbols: non-alphanumeric characters, i.e., characters not in [a-zA-Z0-9].
- n: column order.
- filename: name of the input file from where the column was picked.
- filetype: file type with which the file is associated (e.g., csv).
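
To make a few of these metrics concrete, here is a small standard-library sketch that computes them for one column of string values. It is an illustration only, not the dissector implementation: `profile_column` and `top_n` are hypothetical names, and treating empty strings as the only nulls (and excluding them from `nunique`) is an assumption.

```python
from collections import Counter
import re
from typing import Dict, List


def profile_column(name: str, values: List[str], top_n: int = 3) -> Dict:
    """Compute a few of the metrics listed above for one column of strings."""
    non_empty = [v for v in values if v != ""]
    lengths = [len(v) for v in non_empty]
    # Characters outside [a-zA-Z0-9], deduplicated and sorted.
    symbols = sorted(set(re.findall(r"[^a-zA-Z0-9]", "".join(non_empty))))
    return {
        "column": name,
        "strlen": (min(lengths), max(lengths)) if lengths else (0, 0),
        "nnull": len(values) - len(non_empty),
        "nrow": len(values),
        "nunique": len(set(non_empty)),
        "nvalue": len(non_empty),
        "freq": Counter(non_empty).most_common(top_n),
        "symbols": symbols,
    }


profile = profile_column("city", ["Oslo", "New York", "", "Oslo", "St. Louis"])
print(profile)
```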

The output also presents other additional info:

- slice: the _slice_ used to select rows. Slices represent _filter conditions_ that select subsets of rows within a dataset.
- timestamp: file modified date timestamp of the input file.
- hash: md5 hash of the input file.
- size: file size of the input file in bytes.

Ensure that a YAML config file is present at `.\config\dissector_config.yaml`, relative to `dissector.exe`, before executing the command.

```yaml
---
read_csv:
  skiprows: 0
  skipfooter: 0
  engine: 'python' # {'c', 'python', 'pyarrow'}
  encoding: 'latin-1' # {'utf-8', 'latin-1'}
  quotechar: '"'
  on_bad_lines: 'warn' # {'error', 'warn', 'skip'}
  dtype: 'str'
  keep_default_na: false
```

**Examples**

- Fetch `*.csv` from `.\temp` and dissect them with `,` as column separator.

    `dissector .\temp *.csv --sep ,`

- Fetch `myfile.text` from `c:\temp` and dissect the file with `;` as column separator.

    `dissector c:\temp myfile.text --sep ;`

- Fetch `myfile.text` from `c:\temp` and dissect the file with `;` as column separator, once without any filter and once with the filter `COLUMN1 == 'VALUE'`.

    `dissector c:\temp myfile.text --sep ; --slicers "" "COLUMN1 == 'VALUE'"`

- Fetch `myfile.text` from `c:\temp` and dissect the file with `;` as column separator, slicing the data with a filter on a column name that contains a space: `` `COLUMN 1` == 'VALUE' ``.

    ``dissector c:\temp myfile.text --sep ';' --slicers "" "`COLUMN 1` == 'VALUE'"``

- Using PowerShell, read the arguments from a text file in which `#` separates the arguments on each line.

    ```powershell
    Get-Content args.txt | ForEach-Object {
        $arguments = $_ -split '#'
        & dissector.exe $arguments
    }
    ```
    Here is a sample args.txt file.
  
    ```
    .\temp#*.csv#--sep#,
    ```
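
The slicer expressions in the examples above resemble pandas `DataFrame.query` syntax, including backtick-quoted column names. Assuming that is how slices are evaluated (the README does not say so), the filtering can be sketched as follows; `apply_slicer` is a hypothetical helper, not part of smart_tools.

```python
import pandas as pd


def apply_slicer(frame: pd.DataFrame, slicer: str) -> pd.DataFrame:
    """Return the rows selected by `slicer`; an empty slicer selects all rows."""
    if not slicer.strip():
        return frame  # "" means no filtering
    return frame.query(slicer)


data = pd.DataFrame(
    {"COLUMN1": ["VALUE", "OTHER", "VALUE"], "COLUMN 1": ["X", "VALUE", "Y"]},
    dtype="str",
)
for slicer in ["", "COLUMN1 == 'VALUE'", "`COLUMN 1` == 'VALUE'"]:
    print(repr(slicer), len(apply_slicer(data, slicer)))
```

The empty slicer is handled separately because `DataFrame.query` rejects an empty expression.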

# Morpher

**morpher.exe** is a command-line tool to convert the format of a file, or of multiple files in a directory that share a common column separator. For example, convert `file` delimited by `sep` in `dir` from csv to `xlsx` or from csv to `json`.

```text
usage: morpher.exe [-h] [--sep SEP] [--replace] [--to {xlsx,json}] dir file

positional arguments:
  dir               Input directory
  file              Input file or files (wildcard)

optional arguments:
  -h, --help        show this help message and exit
  --sep SEP         Column separator (default: ,)
  --replace         Replace output file if it already exists (default: false)
  --to {xlsx,json}  Morph to xlsx or json (default: xlsx)
```
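
To illustrate the csv-to-json direction, here is a minimal standard-library sketch. It assumes the first row is a header and that each remaining row becomes one JSON record; the actual JSON layout that morpher produces is not documented here, and `csv_to_json` is a hypothetical name.

```python
import csv
import io
import json


def csv_to_json(text: str, sep: str = ",") -> str:
    """Convert delimited text with a header row into a JSON array of records."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter=sep))
    return json.dumps(rows, indent=2)


print(csv_to_json("id;name\n1;Ada\n2;Grace\n", sep=";"))
```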

# Comparator

**comparator.exe** is a command-line tool to compare one file with another file.

```text
usage: comparator.exe [-h] [-s SEP] [-t {xlsx,json,csv}] file1 file2

positional arguments:
  file1                 File to compare
  file2                 File to compare with

optional arguments:
  -h, --help            show this help message and exit
  -s SEP, --sep SEP     Column separator (default: `,`)
  -t {xlsx,json,csv}, --to {xlsx,json,csv}
                        Save result to xlsx or json or csv (default: `xlsx`)
```
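
The README does not describe how differences are determined. As one possible interpretation, the sketch below reports rows that appear in only one of the two inputs; `compare_rows` is a hypothetical helper, and this set-based approach ignores row order and duplicate rows.

```python
import csv
import io


def compare_rows(text1: str, text2: str, sep: str = ","):
    """Return rows found only in the first input and only in the second."""
    rows1 = {tuple(r) for r in csv.reader(io.StringIO(text1), delimiter=sep)}
    rows2 = {tuple(r) for r in csv.reader(io.StringIO(text2), delimiter=sep)}
    return sorted(rows1 - rows2), sorted(rows2 - rows1)


only1, only2 = compare_rows("id,name\n1,Ada\n2,Grace\n", "id,name\n1,Ada\n2,Alan\n")
print(only1, only2)
```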

# Aggregator

**aggregator.exe** is a command-line tool to aggregate two or more files into one.

```text
usage: aggregator.py [-h] [--sep SEP] [--to {xlsx,json,csv}]
                     [--outfile OUTFILE] [--config CONFIG]
                     dir file

positional arguments:
  dir                   Input directory
  file                  Input file or files (for multiple files use wildcard)

optional arguments:
  -h, --help            show this help message and exit
  --sep SEP             Column separator (default: `,`)
  --to {xlsx,json,csv}  Save result to xlsx or json or csv (default: `xlsx`)
  --outfile OUTFILE     Output directory and file name (default:
                        .\.a\aggregated_result)
  --config CONFIG       Config file for meta data (default:
                        `.\config\aggregator_config.yaml`)
```
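
Row-wise aggregation can be pictured as appending the data rows of each input under a single header. The sketch below shows this with the standard library; it is an illustration under the assumption that all inputs share the same header, and `aggregate` is a hypothetical name rather than the tool's API.

```python
import csv
import io


def aggregate(texts, sep=","):
    """Append rows from several delimited inputs, keeping the first header."""
    header, rows = None, []
    for text in texts:
        reader = csv.reader(io.StringIO(text), delimiter=sep)
        current = next(reader, None)
        if current is None:
            continue  # skip empty inputs
        if header is None:
            header = current
        elif current != header:
            raise ValueError("inputs do not share the same header")
        rows.extend(reader)
    return header, rows


header, rows = aggregate(["id,name\n1,Ada\n", "id,name\n2,Grace\n"])
print(header, rows)
```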

# Fusioner

**fusioner.exe** is a command-line tool to transform columns in a file.

```text
usage: fusioner.py [-h] [--sep SEP] [--outfile OUTFILE] [--config CONFIG] file

positional arguments:
  file               Input file

optional arguments:
  -h, --help         show this help message and exit
  --sep SEP          Column separator (default: ,)
  --outfile OUTFILE  Output directory and file name (default:
                     .\.f\fusioner_result)
  --config CONFIG    Config file for ETL (default:
                     `.\config\fusioner_config.toml`)

```

            
