# Introduction to Pydit
Pydit is a library of data wrangling tools aimed at internal auditors, specifically for our use cases.
This library is also a learning exercise for me on how to create a package, build documentation & tests, and publish it.
Code quality varies, and I don't commit to keeping backward compatibility (see below for how I use it), so use it at your own peril!
If, despite all that, you wish to contribute, feel free to get in touch.
Shout out: Pydit takes ideas (and some code) from Pyjanitor, an awesome library.
Check it out!
## Why a dedicated library for auditors?
The problem Pydit solves is that a big part of our audit tests involves basic data quality checks (e.g. finding duplicates or blanks), as these may flag potential fraud or systemic errors.
But to do those checks I often end up pasting snippets from the internet or reusing code from previous audits, with no consistency and no tests.
Libraries like pyjanitor do a great job; however, they:
a) require installation, which is often not allowed in your environment,
b) tend to be very compact and non-verbose (e.g. they use method chaining), and
c) are difficult to review/verify.
What I really need is:
a) easy-to-review code, both the source and its execution (even for non-programmers),
b) portability: minimal dependencies, pure Python, ideally a drop-in module,
c) performance as a secondary concern, after readability and repeatability.
Pydit follows these principles:
1. Functions should be self-standing with minimal imports/dependencies.
The auditor should be able to import or copy-paste just a specific module into the project to perform a particular audit test. That makes it easier to understand, customise, and review. Plus, it removes dependencies on future versions of pydit. Note that, in any case, we need to file the actual code exactly as it was used during the audit.
2. Functions should include verbose logging, short of debug level.
3. Focus on documentation, tests, and simple code, with less concern about performance.
4. No method chaining, in the interest of source code readability.
While Pyjanitor is great and its method chaining approach is elegant, I've found that the good old "step by step" approach works better for documenting the test and explaining it to reviewers or newbies.
5. Functions return a new, transformed copy of the object; the code does not mutate the input object(s). Any previous inplace=True parameter is deprecated and will be removed in future versions.
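As a minimal sketch of principle 5 (hypothetical code under my own naming, not pydit's actual implementation), a function works on a copy and leaves the caller's DataFrame untouched:

```python
import pandas as pd

def cleanup_column_names_sketch(df):
    # Work on a copy so the caller's DataFrame is never mutated.
    out = df.copy()
    out.columns = [str(c).strip().lower().replace(" ", "_") for c in out.columns]
    return out

raw = pd.DataFrame({"Customer ID": [1], " Last Update ": ["2024-01-01"]})
clean = cleanup_column_names_sketch(raw)
# raw keeps its original column names; clean has the normalised ones
```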
## Quick start
```python
import pandas as pd
from pydit import start_logging_info # sets up nice logging params with rotation
from pydit import profile_dataframe # runs a few descriptive analysis on a df
from pydit import cleanup_column_names # opinionated cleanup of column names
logger = start_logging_info()
logger.info("Started")
```
The logging feature is used extensively by default, aiming to generate a human-readable audit log to be included in workpapers.
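For reference, a rotating, human-readable log in a similar spirit can be set up with the standard library alone (a sketch with my own function name, not pydit's `start_logging_info`):

```python
import logging
from logging.handlers import RotatingFileHandler

def start_audit_log(path="audit.log"):
    # Timestamped, human-readable format; rotation keeps log files manageable.
    logger = logging.getLogger("audit")
    logger.setLevel(logging.INFO)
    handler = RotatingFileHandler(path, maxBytes=1_000_000, backupCount=3)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger

logger = start_audit_log()
logger.info("Started")
```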
I recommend importing individual functions, so you can copy them locally to your project folder and just change the import statement to point to the local module; that way you freeze the version and reduce dependencies.
```python
from pydit import check_duplicates

df = pd.read_excel("mydata.xlsx")
df_profile = profile_dataframe(df)  # returns a df with summary statistics
# you may realise the columns from excel are all over the place with cases and
# special chars
df_clean = cleanup_column_names(df)
df_deduped = check_duplicates(
    df_clean,
    columns=["customer_id", "last_update_date"],
    ascending=[True, False],
    keep="first",
    indicator=True,
    also_return_non_duplicates=True,
)
# you will get a nice output with a report on duplicates, retaining the last
# modification entry (via the pre-sort descending by date) and returning
# the non-duplicates.
# It also adds a boolean column flagging the records that had duplicates removed.
```
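To make the pattern above concrete, here is a plain-pandas sketch of the same idea (a hypothetical illustration, not pydit's `check_duplicates`): pre-sort descending by date within each key, keep the first (latest) record, and flag the keys that had duplicates:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "customer_id": [1, 1, 2],
        "last_update_date": ["2024-01-01", "2024-03-01", "2024-02-01"],
    }
)
# Sort so the most recent record comes first within each customer_id.
df_sorted = df.sort_values(["customer_id", "last_update_date"], ascending=[True, False])
# Every row after the first per key is a duplicate to drop.
dupe_mask = df_sorted.duplicated(subset=["customer_id"], keep="first")
# Flag the keys that had at least one duplicate removed.
df_sorted["had_duplicates"] = df_sorted["customer_id"].isin(
    df_sorted.loc[dupe_mask, "customer_id"]
)
df_deduped = df_sorted[~dupe_mask]
```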
## Requires
- python >= 3.13 (should by and large work from 3.10 onwards, but I test on 3.13)
- pandas
- numpy
- openpyxl
- matplotlib (for the occasional plot, e.g. Benford)
## Installation
```bash
pip install pydit-jceresearch
```
(not available on Anaconda yet)
## Documentation
Documentation can be found [here](https://pydit.readthedocs.io/en/latest/index.html)
## Dev Install
```bash
git clone https://github.com/jceresearch/pydit.git
pip install -e .
```
This project uses:
- `pylint` for linting
- `black` for style
- `pytest` for testing
- `sphinx` for documentation in RTD
- `myst_parser`, also a requirement for RTD
- `poetry` for packaging