piidigger


| Field | Value |
| ----- | ----- |
| Name | piidigger |
| Version | 1.0.0 |
| download | |
| home_page | |
| Summary | Python program to identify Personally Identifiable Information in common file types |
| upload_time | 2024-03-14 19:07:37 |
| maintainer | |
| docs_url | None |
| author | |
| requires_python | <4,>=3.9 |
| license | |
| keywords | pii discovery, data discovery, credit card discovery |
| VCS | |
| bugtrack_url | |
| requirements | No requirements were recorded. |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |

# PIIDigger

**PIIDigger** is a program that identifies Personally Identifiable Information (PII) in common file types.

## Features
- Works anywhere Python is available
- Pre-built binaries available (see [binaries](https://github.com/flyguy62n/PIIDigger/binaries/) folder)
- Customizable configuration file
- Identifies files based on file extension and MIME type
- Aware of OneDrive and Dropbox "cloud-only files" (see [ERRATA](https://github.com/flyguy62n/PIIDigger/blob/main/ERRATA.md))
- Tunable [PERFORMANCE](https://github.com/flyguy62n/PIIDigger/blob/main/PERFORMANCE.md) - especially useful for production servers
- Extensible file handlers to read any type of file
    - Initial release supports plain text files, Word Documents and Excel spreadsheets
    - See `--list-filetypes` command line option for currently supported file types
- Extensible data handlers to identify any type of data
    - Initial release supports primary account numbers for credit card data
    - See `--list-datahandlers` command line option for currently supported data handlers
- Saves output in multiple formats
    - Initial release provides JSON and text file outputs
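
As an illustration of what a data handler for primary account numbers might do, the sketch below finds 13 to 19 digit runs and keeps only those that pass the Luhn checksum. This is not PIIDigger's actual `pan` handler: the pattern, `luhn_valid`, and `find_pan_candidates` are hypothetical names chosen for this example.

```python
import re

# Hypothetical pattern: 13-19 digits, optionally separated by single spaces or hyphens.
PAN_CANDIDATE = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,18}(?!\d)")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_pan_candidates(text: str) -> list:
    """Return digit runs in `text` that look like PANs and pass the Luhn check."""
    hits = []
    for match in PAN_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

The checksum step matters because many long digit runs (order numbers, timestamps) are not card numbers; filtering on Luhn removes most of those false positives.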

## Errata
Check out the [ERRATA](https://github.com/flyguy62n/PIIDigger/blob/main/ERRATA.md) page for known issues, troubleshooting tips and instructions on reporting new problems.

## Performance Tuning
Check out the [PERFORMANCE](https://github.com/flyguy62n/PIIDigger/blob/main/PERFORMANCE.md) page for notes on tuning performance, especially on production servers.

## Installation
### If Python is Available (e.g. MacOS and/or Linux)

NOTE: A virtual environment is strongly recommended to isolate PIIDigger and its dependencies from any other Python programs already on your system.  However, if you're not actively using Python, a system-wide installation is possible by running only the last command below.

**Linux/MacOS**

    python3 -m venv piidigger  #(or use your own folder name instead of "piidigger")
    source piidigger/bin/activate
    python3 -m pip install -U piidigger

PIIDigger will now be available as a program.  Run it with `piidigger` at the terminal prompt.

**Windows PowerShell**

    python.exe -m venv .venv  #(or use your own folder name instead of ".venv")
    .venv/Scripts/activate
    python.exe -m pip install -U piidigger[win]

PIIDigger will now be available as a program.  Run it with `piidigger.exe` at your PowerShell prompt.

### Binary Packages
If Python is not available, you can download OS-specific binaries from the [binaries/](https://github.com/flyguy62n/PIIDigger/binaries/) folder above.

NOTE: See the [ERRATA](https://github.com/flyguy62n/PIIDigger/blob/main/ERRATA.md) page for information about antivirus products and packaged Python binaries.

## Usage
```
usage: piidigger [-h] [-c CREATECONFIGFILE] [-d] [-f CONFIGFILE] [-p MAXPROC] [--cpu-count] [--list-datahandlers]
                      [--list-filetypes]

Search the file system for Personally Identifiable Information

NOTES:
    * All program configuration is kept in 'piidigger.toml' -- a TOML-formatted configuration file
    * A default configuration will be used if the default 'piidigger.toml' file doesn't exist

options:
  -h, --help            show this help message and exit

Configuration:
  -c CREATECONFIGFILE, --create-conf CREATECONFIGFILE
                        Create a default configuration file for editing/reuse.
  -d, --default-conf    Use the default, internal config.
  -f CONFIGFILE, --conf-file CONFIGFILE
                        path/to/configfile.toml configuration file (Default = "piidigger.toml"). If the file is not
                        found, the default, internal configuration will be used.
  -p MAXPROC, --max-process MAXPROC
                        Override the number processes to use for searching files. Will use the lesser of CPU cores or
                        this value. On production servers, consider setting this to less than the number of physical
                        CPUs. See '--cpu-count' below.

Misc. Info:
  --cpu-count           Show the number of logical CPUs provided by the OS. Use this to tune performance. See '--max-
                        process' above.
  --list-datahandlers   Display the list of data handlers and exit
  --list-filetypes      Display the list of file types and exit
```
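
The help text for `--max-process` says the program uses the lesser of the CPU core count and the supplied value. A minimal sketch of that rule follows. The function name, the behavior when no value is given, and the floor of one process are assumptions made for this example, not PIIDigger's code.

```python
import os


def effective_process_count(requested=None):
    """Return the lesser of the logical CPU count and the requested value."""
    cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    if requested is None:
        # Assumption: with no override, use every logical CPU.
        return cpu_count
    # Assumption: never go below one process, even for a zero or negative request.
    return max(1, min(cpu_count, requested))
```

For example, on a machine where `--cpu-count` reports 8, a request of 4 would yield 4 and a request of 32 would yield 8.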

If a configuration file doesn't exist, PIIDigger will use a default configuration as shown below.

## Advanced Configurations

All other options are configured from the configuration file.  In most cases, the defaults should work just fine.  You can create a configuration file with the `-c piidigger.toml` option.  `piidigger.toml` is the default file and if found, PIIDigger will use it automatically.  You can also create as many different configuration files as you like and reference them with `piidigger -f <filename>`.

An explanation of the configuration file options follows:


```
dataHandlers = ["pan"]

localFilesOnly = true

[results]
path = "piidigger-results/"
json = true
text = true

[includeFiles]
ext = "all"
mime = "all"

[includeFiles.startDirs]
windows = "all"
linux = ["/"]
darwin = ["/"]

[excludeDirs]
windows = ["C:\\Windows", "C:\\Program Files (x86)", "C:\\Program Files"]
linux = ["/proc", "/sys", "/dev", "/usr/bin", "/usr/lib", "/usr/lib32", "/usr/lib64", "/usr/libx32", "/usr/sbin", "*/.vscode-server", "/mnt/c", "/mnt/d", "/mnt/wslg"]
darwin = ["/dev", "/usr/bin", "/usr/lib", "/usr/sbin", "/Applications", "/System"]

[logging]
logLevel = "INFO"
logFile = "logs/piidigger.log"
```
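
The fallback described above (use the file if it exists, otherwise use the internal defaults) could be sketched as follows. `DEFAULT_CONFIG` holds only a subset of the defaults shown, and `load_config` is a hypothetical helper, not a PIIDigger function. Whether PIIDigger merges a partial file with its defaults is not stated here, so this sketch does not merge.

```python
import copy
from pathlib import Path

try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport with the same API for 3.9/3.10

# Hypothetical subset of the default configuration shown above.
DEFAULT_CONFIG = {
    "dataHandlers": ["pan"],
    "localFilesOnly": True,
    "results": {"path": "piidigger-results/", "json": True, "text": True},
}


def load_config(path="piidigger.toml"):
    """Return the parsed TOML file, or a copy of the defaults if it is missing."""
    config_path = Path(path)
    if not config_path.is_file():
        return copy.deepcopy(DEFAULT_CONFIG)
    with config_path.open("rb") as handle:  # tomllib requires a binary file handle
        return tomllib.load(handle)
```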

| Option                                | Description  |
| ------                                | ----------   |
| `dataHandlers`                        | Default = `["pan"]`.  A list of the data handlers to use.  "All" will load all data handlers currently defined in the datahandlers module.  To limit the selection, use a `[bracket-list]`, such as `['pan', 'ssn']`. |
| `localFilesOnly`                      | Default True.  For OneDrive and Dropbox files on Windows, only scan files that are already on the local disk. |
| `[results]path`                       | Where to save the results to.  Current output formats are JSON and text files.  A folder name can be included and PIIDigger will create any missing folders in the path. |
| `[results]json`                       | Default True.  Whether to create a JSON output file |
| `[results]text`                       | Default True.  Whether to create a text output file |
| `[includeFiles]`                      | Defines the criteria by which files will be included in the scan |
| `[includeFiles]ext`                   | Default = `"all"`.  The file extensions to include.  "All" will collect all supported file extensions from the file handlers currently supported.  To limit the selection, use a `[bracket-list]`, such as `['.txt', '.xlsx']`. |
| `[includeFiles]mime`                  | Default = `"all"`.  The MIME types to include.  "All" will collect all supported MIME types from the file handlers currently supported.  To limit the selection, use a `[bracket-list]`, such as `['text/plain-text', 'application/vnd.ms-excel']`. |
| `[includeFiles.startDirs]`            | For each operating system, define the starting directories/drives to start the search from.  OS types are `windows`, `linux`, `darwin` (for MacOS).  For each OS type, you can also use a `[bracket-list]` to provide specific starting points, such as `['C:\Users\<username>']` on Windows or `['/home/<username>']` on Linux/MacOS. |
| `[includeFiles.startDirs]windows`     | Default = `"all"`, which identifies all currently accessible drive letters on the system.  NOTE: This also includes network drives, which might not be desired behavior.  You can use the `excludeDirs` option below to remove any network-mapped drive letters from the scan. |
| `[includeFiles.startDirs]linux` and `darwin` | Default = `["/"]`, which scans the entire file system.  If there are network-mounted paths, you can exclude them with the `excludeDirs` option below. |
| `[excludeDirs]`                       | For each operating system, a `[bracket-list]` of the folders/directories to exclude.  The defaults exclude operating system-specific directories such as `C:\Windows` and `/usr/bin`.  Additional patterns can be supplied and will match as a simple string (no wildcards, regex or glob patterns) from the beginning of the path.  The `[results]path` and `[logging]logFile` folders will always be excluded. |
| `[logging]`                           | Define the logging level and log file destination.  The defaults should always be fine, unless directed to create a DEBUG-level log file for troubleshooting |
| `[logging]logLevel`                   | Default = `"INFO"`.  Can be set to any of the Python logging levels (https://docs.python.org/3/howto/logging.html).  Must be in ALL CAPS and enclosed in quotes.  Would normally be either "INFO" (default) or, if advised for troubleshooting purposes, "DEBUG". |
| `[logging]logFile`                    | Default = "logs/piidigger.log" which should be just fine. |
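
To make the `excludeDirs` and `ext` rules concrete, here is a small sketch of a plain prefix match and an extension filter. Both function names are hypothetical, and the case-insensitive extension comparison is an assumption; PIIDigger's own matching code may differ.

```python
import os


def is_excluded(path, exclude_dirs):
    """Return True if `path` starts with any excluded prefix (plain string match)."""
    return any(path.startswith(prefix) for prefix in exclude_dirs)


def extension_included(path, include_ext, supported_ext):
    """Return True if the file's extension is selected by the include setting.

    `include_ext` is either the string "all" or a list such as ['.txt', '.xlsx'];
    `supported_ext` stands in for the extensions the file handlers report.
    """
    allowed = supported_ext if include_ext == "all" else include_ext
    extension = os.path.splitext(path)[1].lower()
    return extension in {ext.lower() for ext in allowed}
```

Because the match is a simple string comparison from the start of the path, an entry such as `/usr/bin` in this sketch also excludes a sibling like `/usr/binary`, but not `/home/user/usr/bin`.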

            

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "piidigger",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "<4,>=3.9",
    "maintainer_email": "",
    "keywords": "pii discovery,data discovery,credit card discovery",
    "author": "",
    "author_email": "Randy Bartels <rjbartels@outlook.com>",
    "download_url": "https://files.pythonhosted.org/packages/86/ed/6464fa258406a4ebf71e8b816b5d32d77a2960362254c9768e8f76c946ae/piidigger-1.0.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "Python program to identify Personally Identifiable Information in common file types",
    "version": "1.0.0",
    "project_urls": {
        "Homepage": "https://github.com/flyguy62n/PIIDigger",
        "Issues": "https://github.com/flyguy62n/PIIDigger/issues"
    },
    "split_keywords": [
        "pii discovery",
        "data discovery",
        "credit card discovery"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "4f0f688a549743aa7954b6813e217790eef1af6855dfc48c97c97051e4ccda8e",
                "md5": "3e3abff2c365a60884d1a7bec6902b89",
                "sha256": "4c50195fa97014db4aebdbac3f3a82e59f0417de0c67170bad997260d89923ef"
            },
            "downloads": -1,
            "filename": "piidigger-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3e3abff2c365a60884d1a7bec6902b89",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<4,>=3.9",
            "size": 42582,
            "upload_time": "2024-03-14T19:07:36",
            "upload_time_iso_8601": "2024-03-14T19:07:36.026557Z",
            "url": "https://files.pythonhosted.org/packages/4f/0f/688a549743aa7954b6813e217790eef1af6855dfc48c97c97051e4ccda8e/piidigger-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "86ed6464fa258406a4ebf71e8b816b5d32d77a2960362254c9768e8f76c946ae",
                "md5": "64a8089e0c6627db2ae236722943641a",
                "sha256": "16fe5a67956a6f955f918641c91d216ece9f807ff8b16cd39c1254f402972703"
            },
            "downloads": -1,
            "filename": "piidigger-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "64a8089e0c6627db2ae236722943641a",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<4,>=3.9",
            "size": 38715,
            "upload_time": "2024-03-14T19:07:37",
            "upload_time_iso_8601": "2024-03-14T19:07:37.945765Z",
            "url": "https://files.pythonhosted.org/packages/86/ed/6464fa258406a4ebf71e8b816b5d32d77a2960362254c9768e8f76c946ae/piidigger-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-03-14 19:07:37",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "flyguy62n",
    "github_project": "PIIDigger",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "piidigger"
}
        