datafog


Name: datafog
Version: 2.4.0
Home page: https://datafog.ai
Summary: Scan, redact, and manage PII in your documents before they get uploaded to a Retrieval Augmented Generation (RAG) system.
Upload time: 2024-04-02 04:14:59
Maintainer: DataFog
Author: DataFog
Requires Python: >=3.10
License: MIT
Keywords: pii, redaction, nlp, rag, retrieval augmented generation
Requirements: presidio_analyzer, pandas, pytest, Requests, aiohttp, yarl, frozenlist, en_spacy_pii_fast, unstructured

            <p align="center">
  <a href="https://www.datafog.ai"><img src="public/colorlogo.png" alt="DataFog logo"></a>
</p>

<p align="center">
    <b>Open-source DevSecOps for Generative AI Systems</b>. <br />
</p>

<p align="center">
  <a href="https://pypi.org/project/datafog/"><img src="https://img.shields.io/pypi/v/datafog.svg?style=flat-square" alt="PyPi Version"></a>
  <a href="https://pypi.org/project/datafog/"><img src="https://img.shields.io/pypi/pyversions/datafog.svg?style=flat-square" alt="PyPI pyversions"></a>
  <a href="https://github.com/datafog/datafog-python"><img src="https://img.shields.io/github/stars/datafog/datafog-python.svg?style=flat-square&logo=github&label=Stars&logoColor=white" alt="GitHub stars"></a>
  <a href="https://pypistats.org/packages/datafog"><img src="https://img.shields.io/pypi/dm/datafog.svg?style=flat-square" alt="PyPi downloads"></a>
  <a href="https://discord.gg/bzDth394R4"><img src="https://img.shields.io/discord/1173803135341449227?style=flat" alt="Discord"></a>
  <a href="https://github.com/psf/black"><img src="https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square" alt="Code style: black"></a>
  <a href="https://codecov.io/gh/datafog/datafog-python"><img src="https://img.shields.io/codecov/c/github/datafog/datafog-python.svg?style=flat-square" alt="codecov"></a>
  <a href="https://github.com/datafog/datafog-python/issues"><img src="https://img.shields.io/github/issues/datafog/datafog-python.svg?style=flat-square" alt="GitHub Issues"></a>
</p>

## Overview

### What is DataFog?

DataFog is an open-source DevSecOps platform that lets you scan for Personally Identifiable Information (PII) and redact it before it reaches your Generative AI applications.

### What problem are we solving?

**Context**

The primary use case today is Retrieval Augmented Generation (RAG) systems. As a refresher, a RAG system retrieves information from a custom knowledge base, built by you or your team, and uses it in the model's responses, either by citing the source files directly or by drawing on them implicitly. The knowledge base is assembled deliberately: files are uploaded into a workflow, segmented into logical blocks of information, and tagged according to their contextual significance. There are a thousand ways to add nuance to this description, but it covers the vast majority of cases.
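
To make the ingestion step concrete, here is a minimal sketch of segmenting a document into blocks and tagging each one. It is not part of DataFog: the `Chunk` class, the `chunk_document` function, the paragraph-based splitting rule, and the `source`/`index` tags are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """One block of a document plus the tags attached to it (hypothetical structure)."""
    text: str
    tags: dict = field(default_factory=dict)


def chunk_document(text: str, source: str, max_chars: int = 200) -> list[Chunk]:
    """Split text on blank lines, then pack paragraphs into blocks of roughly max_chars.

    A single paragraph longer than max_chars is kept whole as its own block.
    """
    chunks: list[Chunk] = []
    current = ""
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # The current block is full: emit it and start a new one.
            chunks.append(Chunk(current, {"source": source, "index": len(chunks)}))
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(Chunk(current, {"source": source, "index": len(chunks)}))
    return chunks


doc = "Q3 planning notes.\n\nJeff is going to be CTO.\n\nBudget review is next week."
for chunk in chunk_document(doc, source="ceo-email.txt", max_chars=45):
    print(chunk.tags, repr(chunk.text))
```

Each block carries its own tags, so a scan or redaction step can run per block before anything is written to the knowledge base.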

**Problem**

How do you keep:

- Customer PII
- Employee PII
- Sensitive company information pertaining to org changes or restructurings
- Pending M&A activity
- Conversations with external counsel on material corporate matters (e.g., product recalls)
- and more

from entering a Generative AI environment in the first place? What you need is a tool that scans and redacts your RAG-bound documents according to your organization's or team's needs.

That's where DataFog comes in. We approach the problem in two ways:

- **PII Observability**: takes in your batch or streaming data and returns a scan with character-level detection of entities.
- **Privacy Filter**: slots in as a pre-processor that redacts PII from your files before they are uploaded to a RAG database.
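
The two approaches fit together: a scan yields character offsets, and a filter uses those offsets to rewrite the text. The sketch below illustrates that idea in plain Python. It is not DataFog's implementation; the `Detection` type, the `redact` function, and the `<TYPE>` placeholder format are assumptions, and it assumes the detected spans do not overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """A character-level finding: text[start:end] is the sensitive span (hypothetical type)."""
    entity_type: str
    start: int
    end: int


def redact(text: str, detections: list[Detection]) -> str:
    """Replace each detected span with a <TYPE> placeholder."""
    # Work from the end of the string so earlier offsets stay valid
    # as replacements change the length of the text.
    for d in sorted(detections, key=lambda d: d.start, reverse=True):
        text = text[: d.start] + f"<{d.entity_type}>" + text[d.end :]
    return text


chunk = "I'm announcing on Friday that Jeff is going to be CTO."
findings = [Detection("PERSON", 30, 34), Detection("CUSTOM_PII", 50, 53)]
print(redact(chunk, findings))
# I'm announcing on Friday that <PERSON> is going to be <CUSTOM_PII>.
```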

You can import the SDK into any Python environment, such as a Google Colab notebook (see our [Getting Started](examples/getting-started.ipynb) notebook), and be up and running within a few lines of code.

### How it works

<img src="https://www.datafog.ai/hero.png" alt="DataFog Overview" style="width:50%;">

### There are lots of PII tools out there; why DataFog?

Many PII detection tools exist because of regulatory requirements (e.g., 'comply with CCPA/GDPR/HIPAA'). In that scenario the problem is well defined: there is a fixed set of entities to look for and a relatively static universe of document schemas to work with. As a result, those products are purpose-built for the specific problem they solve.

However, Generative AI changes how we think about privacy. Privacy requirements now shift over time (new M&A deals and internal discussions mean new terms to scan and redact), and documents come from many different sources. PII detection is no longer just about compliance; it is an active (and, for some, new) internal security threat that CISOs and engineering leaders must contend with. We want DataFog to be built and driven to meet the needs of the open-source community as it tackles this challenge.

## Installation

DataFog can be installed via pip:

```bash
pip install datafog
```

and in your Python environment:

```python
import datafog
from datafog import PresidioEngine as presidio

client = datafog.DataFog()  # avoid naming the instance `datafog`, which would shadow the module
```
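
If the import fails, a quick environment check can help. The snippet below is a small sketch using only the standard library; the `installed_version` helper is a hypothetical name, and the 3.10 threshold comes from the package's `requires_python` metadata.

```python
import sys
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# datafog 2.4.0 declares requires_python >= 3.10.
if sys.version_info < (3, 10):
    print("Python 3.10 or newer is required; found", sys.version.split()[0])

version = installed_version("datafog")
print("datafog version:", version if version else "not installed")
```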

## Examples

Here are some examples of DataFog being used to redact information in business contexts. See `/examples` for our [Getting Started](examples/getting-started.ipynb) notebook. We'll be regularly updating content and providing comprehensive guides to using DataFog in production contexts. If you have an idea for a tutorial or guide that you would like to see, please let us know!

### Scanning a single string

```python
from datafog import PresidioEngine as presidio

ceo_email_chunk = "I'm announcing on Friday that Jeff is going to be CTO."

# Base case: scan without a deny list.
scan_results1 = presidio.scan(ceo_email_chunk)
print("PII Detected - base case:", scan_results1)
# PII Detected - base case: [type: PERSON, start: 30, end: 34, score: 0.85]

# Flag additional custom terms by passing a deny list.
scan_results2 = presidio.scan(ceo_email_chunk, deny_list=['CTO'])
print("PII Detected with deny list:", scan_results2)
# PII Detected with deny list: [type: CUSTOM_PII, start: 50, end: 53, score: 1.0, type: PERSON, start: 30, end: 34, score: 0.85]
```
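
One way to picture the deny-list result is as an exact-match search over the text. The sketch below reproduces the `CUSTOM_PII` offsets from the output above using only the standard library. It is an illustration, not how DataFog or Presidio implements deny lists; the `deny_list_matches` function and its dictionary keys are hypothetical.

```python
import re


def deny_list_matches(text: str, deny_list: list[str]) -> list[dict]:
    """Return one finding per whole-word occurrence of any deny-list term."""
    if not deny_list:
        return []
    # Escape each term so it is matched literally; \b keeps "CTO" from
    # matching inside a longer word.
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in deny_list) + r")\b")
    return [
        {"type": "CUSTOM_PII", "start": m.start(), "end": m.end(), "score": 1.0}
        for m in pattern.finditer(text)
    ]


ceo_email_chunk = "I'm announcing on Friday that Jeff is going to be CTO."
print(deny_list_matches(ceo_email_chunk, ["CTO"]))
# [{'type': 'CUSTOM_PII', 'start': 50, 'end': 53, 'score': 1.0}]
```

Because the match is exact, the score is fixed at 1.0, in contrast to the model-based `PERSON` detection, which carries a probability.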

### Scanning a list of PDFs

```python
import datafog

# Paths are relative to the root of the datafog-python repository; adjust them to your own files.
file_dir = [
    "tests/files/input_files/agi-builder-meetup.pdf",
    "tests/files/input_files/pypdf-readthedocs-io-en-stable.pdf",
]
client = datafog.DataFog()
result = client.upload_files(uploaded_files=file_dir)
print(result)
```

The output is a dictionary in which the keys are the file names and the values are the scan results for each file. For example:

`{'agi-builder-meetup.pdf': "2/26/24, 2:16 PM\nAGI Builders Meetup SF · Luma\nContact the HostReport Event29\nEvent FullIf youʼd like"}`
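
Rather than listing every file by hand, you can build the list from a directory. This is a small standard-library sketch; `collect_pdfs` is a hypothetical helper, and the directory path is a placeholder.

```python
from pathlib import Path


def collect_pdfs(directory: str) -> list[str]:
    """Return the sorted paths of all .pdf files directly inside directory."""
    return sorted(str(p) for p in Path(directory).glob("*.pdf"))


file_dir = collect_pdfs("tests/files/input_files")
print(file_dir)
# Pass the list on, for example: datafog.DataFog().upload_files(uploaded_files=file_dir)
```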

## Contributing

DataFog is a community-driven **open-source** platform, and we've been fortunate to have a small and growing contributor base. We'd love to hear your ideas, feedback, and suggestions for improvement: anything you think would make DataFog better! Join our [Discord](https://discord.gg/bzDth394R4) to become part of our growing community.

### Dev Notes

- Justfile commands:
  - `just format` to apply formatting.
  - `just lint` to check formatting and style.

### Testing

To run the datafog unit tests, check out this repository and run:

```bash
tox
```

## License

This software is published under the [MIT
license](https://en.wikipedia.org/wiki/MIT_License).

            

Raw data

            {
    "_id": null,
    "home_page": "https://datafog.ai",
    "name": "datafog",
    "maintainer": "DataFog",
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": "hi@datafog.ai",
    "keywords": "pii, redaction, nlp, rag, retrieval augmented generation",
    "author": "DataFog",
    "author_email": "hi@datafog.ai",
    "download_url": "https://files.pythonhosted.org/packages/5e/68/d035cc02914f5f3b337016b9b237422eb2a9c491ee5e9899f2327f326ebf/datafog-2.4.0.tar.gz",
    "platform": null,
    "description": "(README text; identical to the project description rendered above)",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Scan, redact, and manage PII in your documents before they get uploaded to a Retrieval Augmented Generation (RAG) system.",
    "version": "2.4.0",
    "project_urls": {
        "Discord": "https://discord.gg/bzDth394R4",
        "Documentation": "https://docs.datafog.ai",
        "GitHub": "https://github.com/datafog/datafog-python",
        "Homepage": "https://datafog.ai",
        "Twitter": "https://twitter.com/datafoginc"
    },
    "split_keywords": [
        "pii",
        " redaction",
        " nlp",
        " rag",
        " retrieval augmented generation"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "5e68d035cc02914f5f3b337016b9b237422eb2a9c491ee5e9899f2327f326ebf",
                "md5": "538fb3bd32397776dc5ad3d6c11adc3b",
                "sha256": "303b9242c9db08897f5c80d87febb90ca87d1bc82bbeadebedc391807f6b75d3"
            },
            "downloads": -1,
            "filename": "datafog-2.4.0.tar.gz",
            "has_sig": false,
            "md5_digest": "538fb3bd32397776dc5ad3d6c11adc3b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 15223,
            "upload_time": "2024-04-02T04:14:59",
            "upload_time_iso_8601": "2024-04-02T04:14:59.523592Z",
            "url": "https://files.pythonhosted.org/packages/5e/68/d035cc02914f5f3b337016b9b237422eb2a9c491ee5e9899f2327f326ebf/datafog-2.4.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-04-02 04:14:59",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "datafog",
    "github_project": "datafog-python",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "presidio_analyzer",
            "specs": [
                [
                    "==",
                    "2.2.353"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    "==",
                    "2.2.1"
                ]
            ]
        },
        {
            "name": "pytest",
            "specs": [
                [
                    "==",
                    "8.0.2"
                ]
            ]
        },
        {
            "name": "Requests",
            "specs": [
                [
                    "==",
                    "2.31.0"
                ]
            ]
        },
        {
            "name": "aiohttp",
            "specs": [
                [
                    "==",
                    "3.8.2"
                ]
            ]
        },
        {
            "name": "yarl",
            "specs": [
                [
                    "==",
                    "1.8.1"
                ]
            ]
        },
        {
            "name": "frozenlist",
            "specs": [
                [
                    "==",
                    "1.3.1"
                ]
            ]
        },
        {
            "name": "en_spacy_pii_fast",
            "specs": []
        },
        {
            "name": "unstructured",
            "specs": []
        },
        {
            "name": "unstructured",
            "specs": []
        }
    ],
    "tox": true,
    "lcname": "datafog"
}
        