# PIIDigger
**PIIDigger** is a program to identify Personally Identifiable Information in common file types.
## Features
- Works anywhere Python is available
- Pre-built binaries available
- Customizable configuration file
- Identifies files based on file extension and MIME type
- Aware of OneDrive and Dropbox "cloud-only files" (see [ERRATA](https://github.com/kirkpatrickprice/PIIDigger/blob/main/ERRATA.md))
- Tunable [PERFORMANCE](https://github.com/kirkpatrickprice/PIIDigger/blob/main/PERFORMANCE.md) - especially useful for production servers
- Extensible file handlers to read any type of file
  - Initial release supports plain text files, Word documents and Excel spreadsheets
  - See the `--list-filetypes` command line option for currently supported file types
- Extensible data handlers to identify any type of data
  - Initial release supports primary account numbers for credit card data
  - See the `--list-datahandlers` command line option for currently supported data handlers
- Saves output in multiple formats
  - Initial release provides JSON and text file outputs
- Getting started with PIIDigger video on [YouTube](https://youtu.be/wnUNnzy1JDw)
## Errata
Check out the [ERRATA](https://github.com/kirkpatrickprice/PIIDigger/blob/main/ERRATA.md) page for known issues, troubleshooting tips and instructions on reporting new problems.
## Performance Tuning
Check out the [PERFORMANCE](https://github.com/kirkpatrickprice/PIIDigger/blob/main/PERFORMANCE.md) page for notes on tuning performance, especially on production servers.
## Installation
### Binary Packages
You can download OS-specific binaries from the [releases](https://github.com/kirkpatrickprice/PIIDigger/releases) page.
Additional information on [Windows Releases](https://github.com/kirkpatrickprice/PIIDigger/blob/main/WINDOWS_RELEASES.md)
### Installing from Pip (e.g. MacOS and/or Linux)
NOTE: A virtual environment is strongly recommended to isolate PIIDigger and its dependencies from any other Python programs already on your system. However, if you're not actively using Python, a system-wide installation is possible by running only the last command below.
**Linux/MacOS**

    python3 -m venv piidigger  # or use your own folder name instead of "piidigger"
    source piidigger/bin/activate
    python3 -m pip install -U piidigger

PIIDigger will now be available as a program. Run it with `piidigger` at the terminal prompt.
**Windows PowerShell**

    python.exe -m venv .venv  # or use your own folder name instead of ".venv"
    .venv/Scripts/activate
    python.exe -m pip install -U piidigger[win]

PIIDigger will now be available as a program. Run it with `piidigger.exe` in your PowerShell prompt.
NOTE:
* Update 26-MAR 2024: I'm trying a new packaging method that should avoid virus warnings from Defender and others.
* See the [ERRATA](https://github.com/kirkpatrickprice/PIIDigger/blob/main/ERRATA.md) page for information about antivirus products and packaged Python binaries.
## Usage
A getting-started video for PIIDigger is available on [YouTube](https://youtu.be/wnUNnzy1JDw).
```
usage: piidigger [-h] [-c CREATECONFIGFILE] [-d] [-f CONFIGFILE] [-p MAXPROC] [--cpu-count] [--list-datahandlers]
[--list-filetypes]
Search the file system for Personally Identifiable Information
NOTES:
* All program configuration is kept in 'piidigger.toml' -- a TOML-formatted configuration file
* A default configuration will be used if the default 'piidigger.toml' file doesn't exist
options:
-h, --help show this help message and exit
Configuration:
-c CREATECONFIGFILE, --create-conf CREATECONFIGFILE
Create a default configuration file for editing/reuse.
-d, --default-conf Use the default, internal config.
-f CONFIGFILE, --conf-file CONFIGFILE
path/to/configfile.toml configuration file (Default = "piidigger.toml"). If the file is not
found, the default, internal configuration will be used.
-p MAXPROC, --max-process MAXPROC
Override the number processes to use for searching files. Will use the lesser of CPU cores or
this value. On production servers, consider setting this to less than the number of physical
CPUs. See '--cpu-count' below.
Misc. Info:
--cpu-count Show the number of logical CPUs provided by the OS. Use this to tune performance. See '--max-
process' above.
--list-datahandlers Display the list of data handlers and exit
--list-filetypes Display the list of file types and exit
```
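The `--max-process` help text says PIIDigger uses the lesser of the CPU core count and the supplied value. A minimal sketch of that rule is below. The function name `effective_process_count` is hypothetical and this is not PIIDigger's actual code; it only illustrates the arithmetic, assuming the logical CPU count comes from `os.cpu_count()`.

```python
import os
from typing import Optional


def effective_process_count(max_proc: Optional[int] = None) -> int:
    """Return the lesser of the logical CPU count and the requested maximum."""
    cpu_count = os.cpu_count() or 1  # os.cpu_count() can return None
    if max_proc is None:
        return cpu_count
    # Assumption: never drop below one worker, even for zero or negative input.
    return max(1, min(cpu_count, max_proc))


if __name__ == "__main__":
    print("logical CPUs:", os.cpu_count())
    print("workers with a limit of 2:", effective_process_count(2))
```

On a production server, passing a value below the number of physical CPUs leaves headroom for the server's normal workload.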
If a configuration file doesn't exist, PIIDigger will use a default configuration as shown below.
## Advanced Configurations
All other options are configured from the configuration file. In most cases, the defaults should work just fine. You can create a configuration file with the `-c piidigger.toml` option. `piidigger.toml` is the default file and if found, PIIDigger will use it automatically. You can also create as many different configuration files as you like and reference them with `piidigger -f <filename>`.
An explanation of the configuration file options follows:
```
dataHandlers = ["pan"]
localFilesOnly = true
[results]
path = "piidigger-results/"
json = true
text = true
[includeFiles]
ext = "all"
mime = "all"
[includeFiles.startDirs]
windows = "all"
linux = ["/"]
darwin = ["/"]
[excludeDirs]
windows = ["C:\\Windows", "C:\\Program Files (x86)", "C:\\Program Files"]
linux = ["/proc", "/sys", "/dev", "/usr/bin", "/usr/lib", "/usr/lib32", "/usr/lib64", "/usr/libx32", "/usr/sbin", "*/.vscode-server", "/mnt/c", "/mnt/d", "/mnt/wslg"]
darwin = ["/dev", "/usr/bin", "/usr/lib", "/usr/sbin", "/Applications", "/System"]
[logging]
logLevel = "INFO"
logFile = "logs/piidigger.log"
```
| Option | Description |
| ------ | ---------- |
| `dataHandlers` | Default = `["pan"]`. Provides a list of the data handlers that should be used. "All" will load all data handlers currently defined in the datahandlers module. To limit the selection, use a `[bracket-list]`, such as `['pan', 'ssn']`. |
| `localFilesOnly` | Default True. For OneDrive and Dropbox files on Windows, only scan files which are already on the local disk. |
| `[results]path` | Where to save the results to. Current output formats are JSON and text files. A folder name can be included and PIIDigger will create any missing folders in the path. |
| `[results]json` | Default True. Whether to create a JSON output file |
| `[results]text` | Default True. Whether to create a text output file |
| `[includeFiles]` | Defines the criteria by which files will be included in the scan |
| `[includeFiles]ext` | Default = `"all"`. The file extensions to include. "All" will collect all supported file extensions from the file handlers currently supported. To limit the selection, use a `[bracket-list]`, such as `['.txt', '.xlsx']`. |
| `[includeFiles]mime` | Default = `"all"`. The MIME types to include. "All" will collect all supported MIME types from the file handlers currently supported. To limit the selection, use a `[bracket-list]`, such as `['text/plain', 'application/vnd.ms-excel']`. |
| `[includeFiles.startDirs]` | For each operating system, define the starting directories/drives to start the search from. OS types are `windows`, `linux`, `darwin` (for MacOS). For each OS type, you can also use a `[bracket-list]` to provide specific starting points, such as `['C:\Users\<username>']` on Windows or `['/home/<username>']` on Linux/MacOS. |
| `[includeFiles.startDirs]windows` | Default = `"all"` which will identify all currently accessible drive letters on the system. NOTE: This also includes network drives, which might not be desired behavior. You can use the `excludeDirs` option below to remove any network-mapped drive letters from the scan. |
| `[includeFiles.startDirs]linux` and `darwin` | Default = `["/"]`, or scan the entire file system. If there are network-mounted paths, you can exclude those with the `excludeDirs` option below. |
| `[excludeDirs]` | For each operating system, a `[bracket-list]` of the folders/directories to exclude. The defaults exclude operating system-specific directories such as `C:\Windows` and `/usr/bin`. Additional patterns can be supplied and will match as a simple string (no wildcards, regex or glob patterns) from the beginning of the path. The `[results]path` and `[logging]logFile` folders will always be excluded. |
| `[logging]` | Define the logging level and log file destination. The defaults should always be fine, unless directed to create a DEBUG-level log file for troubleshooting |
| `[logging]logLevel` | Default = `"INFO"`, can be overridden using Python logging levels (https://docs.python.org/3/howto/logging.html). Must be in ALL CAPS and enclosed in quotes. Would normally be either "INFO" (default) or, if advised for troubleshooting purposes, "DEBUG" |
| `[logging]logFile` | Default = "logs/piidigger.log" which should be just fine. |
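
The `[excludeDirs]` row above describes matching as a simple string from the beginning of the path. A minimal sketch of that kind of prefix test follows. The function `is_excluded` is hypothetical, and details such as case handling or path-separator normalization are not covered here because the table does not describe them.

```python
def is_excluded(path: str, exclude_dirs: list[str]) -> bool:
    """Return True if the path begins with any excluded directory string."""
    return any(path.startswith(prefix) for prefix in exclude_dirs)


excludes = ["/proc", "/usr/lib"]
print(is_excluded("/proc/cpuinfo", excludes))        # True: starts with "/proc"
print(is_excluded("/home/user/proc.txt", excludes))  # False: "/proc" is not at the start
print(is_excluded("/usr/lib64/libc.so", excludes))   # True: plain prefix match, no path-boundary check
```

The last case shows a consequence of plain prefix matching: an entry such as `/usr/lib` would also cover `/usr/lib64`, so a more specific entry is only needed when the shorter prefix is absent.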