<p align="center">
<a href="https://www.datafog.ai"><img src="public/colorlogo.png" alt="DataFog logo"></a>
</p>
<p align="center">
<b>Open-source DevSecOps for Generative AI Systems</b>. <br />
</p>
<p align="center">
<a href="https://pypi.org/project/datafog/"><img src="https://img.shields.io/pypi/v/datafog.svg?style=flat-square" alt="PyPi Version"></a>
<a href="https://pypi.org/project/datafog/"><img src="https://img.shields.io/pypi/pyversions/datafog.svg?style=flat-square" alt="PyPI pyversions"></a>
<a href="https://github.com/datafog/datafog-python"><img src="https://img.shields.io/github/stars/datafog/datafog-python.svg?style=flat-square&logo=github&label=Stars&logoColor=white" alt="GitHub stars"></a>
<a href="https://pypistats.org/packages/datafog"><img src="https://img.shields.io/pypi/dm/datafog.svg?style=flat-square" alt="PyPi downloads"></a>
<a href="https://discord.gg/bzDth394R4"><img src="https://img.shields.io/discord/1173803135341449227?style=flat" alt="Discord"></a>
<a href="https://github.com/psf/black"><img src="https://img.shields.io/badge/code%20style-black-000000.svg?style=flat-square" alt="Code style: black"></a>
<a href="https://codecov.io/gh/datafog/datafog-python"><img src="https://img.shields.io/codecov/c/github/datafog/datafog-python.svg?style=flat-square" alt="codecov"></a>
<a href="https://github.com/datafog/datafog-python/issues"><img src="https://img.shields.io/github/issues/datafog/datafog-python.svg?style=flat-square" alt="GitHub Issues"></a>
</p>
## Overview
### What is DataFog?
DataFog is an open-source DevSecOps platform that lets you scan for and redact Personally Identifiable Information (PII) before it reaches your Generative AI applications.
### What problem are we solving?
**Context**
The primary use case today is Retrieval Augmented Generation (RAG) systems. As a refresher, a RAG system retrieves information from a custom knowledge base, built by you or your team, and uses it in the model's responses, either by citing the source files directly or by drawing on them implicitly. The knowledge base is assembled deliberately: files are uploaded into a workflow, segmented into logical blocks of information, and tagged according to their contextual significance. There are a thousand ways to add nuance to this characterization, but it suffices for the vast majority of cases!
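
To make the segment-and-tag step concrete, here is a minimal sketch in plain Python. It is not part of DataFog: the `Chunk` class, the `chunk_document` function, and the `max_chars` parameter are hypothetical names chosen for illustration, and real RAG pipelines usually split on tokens or semantic boundaries rather than blank lines.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """One block of text plus the tags a retriever could filter on (hypothetical shape)."""
    source: str
    index: int
    text: str
    tags: dict = field(default_factory=dict)


def chunk_document(source: str, text: str, max_chars: int = 200) -> list[Chunk]:
    """Split on blank lines, then pack paragraphs into blocks of roughly max_chars.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    bodies, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # The current block is full: close it and start a new one.
            bodies.append(current)
            current = para
        else:
            current = candidate
    if current:
        bodies.append(current)
    # Tag each block with where it came from and its position in the file.
    return [
        Chunk(source, i, body, {"source": source, "position": i})
        for i, body in enumerate(bodies)
    ]
```

Each `Chunk` is what would eventually be embedded and stored, which is why anything sensitive needs to be caught before this step.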
**Problem**
How do you keep:
- Customer PII
- Employee PII
- Sensitive company information pertaining to org changes or restructurings
- Pending M&A activity
- Conversations with external counsel on material corporate matters (e.g., product recalls)
- and more

from entering a Generative AI environment in the first place? What you need is a tool that scans and redacts your RAG-bound documents according to your organization's or team's needs.

That's where DataFog comes in. We approach the problem in two ways:

- **PII Observability:** takes in your batch or streaming data and returns a scan with character-level detection of entities.
- **Privacy Filter:** slots in as a pre-processor that redacts PII from your files before they are uploaded to a RAG database.

You can import the SDK into a Python environment (such as a Google Colab notebook; see our [Getting Started](examples/getting-started.ipynb) notebook) and be up and running within a few lines of code.
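
To illustrate what "character-level detection" means in practice, the sketch below finds deny-listed terms with Python's standard `re` module and reports where each one starts and ends. It is a simplified stand-in, not DataFog's implementation: `scan_deny_list` and the dictionary shape of each finding are assumptions made for this example, and the `CUSTOM_PII` label is borrowed from the sample output shown in the Examples section.

```python
import re


def scan_deny_list(text: str, deny_list: list[str]) -> list[dict]:
    """Return one finding per whole-word, case-sensitive occurrence of a deny-listed term."""
    findings = []
    for term in deny_list:
        # re.escape keeps punctuation in a term from being read as regex syntax.
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text):
            findings.append(
                {
                    "type": "CUSTOM_PII",
                    "start": match.start(),  # index of the first character
                    "end": match.end(),      # index one past the last character
                    "text": match.group(),
                }
            )
    # Order findings by position so they read left to right.
    return sorted(findings, key=lambda item: item["start"])


memo = "Project Falcon closes Friday; Falcon is confidential."
for finding in scan_deny_list(memo, ["Falcon"]):
    print(finding)
```

Because the offsets are half-open (`start` inclusive, `end` exclusive), `memo[start:end]` returns exactly the matched term, which is what makes later redaction straightforward.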
### How it works
<img src="https://www.datafog.ai/hero.png" alt="DataFog Overview" style="width:50%;">
### There are lots of PII tools out there; why DataFog?
If you look at the landscape of PII detection tools, many of them exist because of regulatory requirements (e.g., 'comply with CCPA/GDPR/HIPAA'). In that scenario, there is a well-defined problem, a specific set of immutable entities to look for, and a relatively static universe of document schemas to work with. As a result, these products are purpose-built for the problem they solve.

However, Generative AI changes how we think about privacy. Privacy requirements now change continually (new M&A deals and internal discussions mean new terms to scan and redact), and there are different and varying document sources to contend with. PII detection is no longer just about compliance; it is an active (and for some, new) internal security threat for CISOs and engineering leaders. We want DataFog to be built and driven to meet the needs of the open-source community as they tackle this challenge.
## Installation
DataFog can be installed via pip:
```bash
pip install datafog
```
Then, in your Python environment:
```python
import datafog
from datafog import PresidioEngine as presidio

client = datafog.DataFog()
```
## Examples
Here are some examples of DataFog being used to redact information in business contexts. See `/examples` for our [Getting Started](examples/getting-started.ipynb) notebook. We'll regularly update the content and provide comprehensive guides to using DataFog in production. If you have an idea for a tutorial or guide you would like to see, please let us know!
### Scanning a single string
```python
from datafog import PresidioEngine as presidio

ceo_email_chunk = "I'm announcing on Friday that Jeff is going to be CTO."

# Base case: scan with no deny list.
scan_results1 = presidio.scan(ceo_email_chunk)
print("PII Detected - base case:", scan_results1)
# PII Detected - base case: [type: PERSON, start: 30, end: 34, score: 0.85]

# With a deny list: the listed terms are reported as CUSTOM_PII.
scan_results2 = presidio.scan(ceo_email_chunk, deny_list=['CTO'])
print("PII Detected with deny list:", scan_results2)
# PII Detected with deny list: [type: CUSTOM_PII, start: 50, end: 53, score: 1.0, type: PERSON, start: 30, end: 34, score: 0.85]
```
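
The `start` and `end` offsets in a scan are enough to redact the original string. The sketch below shows one way to do that in plain Python. It assumes each finding has been copied into a dictionary with `type`, `start`, and `end` keys and that findings do not overlap; the `redact` helper is hypothetical and is not a DataFog function.

```python
def redact(text: str, findings: list[dict]) -> str:
    """Replace each reported span with a <TYPE> placeholder (assumes non-overlapping spans)."""
    redacted = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for item in sorted(findings, key=lambda item: item["start"], reverse=True):
        placeholder = "<" + item["type"] + ">"
        redacted = redacted[: item["start"]] + placeholder + redacted[item["end"]:]
    return redacted


ceo_email_chunk = "I'm announcing on Friday that Jeff is going to be CTO."
# Offsets copied by hand from the sample scan output above.
findings = [
    {"type": "CUSTOM_PII", "start": 50, "end": 53},
    {"type": "PERSON", "start": 30, "end": 34},
]
print(redact(ceo_email_chunk, findings))
# I'm announcing on Friday that <PERSON> is going to be <CUSTOM_PII>.
```

Replacing from the end of the string backwards avoids having to recompute offsets as the text changes length.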
### Scanning a list of PDFs
```python
import datafog

# Paths to the PDFs to scan (here, the sample files under tests/ in this repository).
file_paths = [
    "tests/files/input_files/agi-builder-meetup.pdf",
    "tests/files/input_files/pypdf-readthedocs-io-en-stable.pdf",
]
client = datafog.DataFog()
result = client.upload_files(uploaded_files=file_paths)
print(result)
```
The output is a dictionary whose keys are the file names and whose values are the scan results for each file. For example:
`{'agi-builder-meetup.pdf': "2/26/24, 2:16 PM\nAGI Builders Meetup SF · Luma\nContact the HostReport Event29\nEvent FullIf youʼd like"}`
## Contributing
DataFog is a community-driven **open-source** platform, and we've been fortunate to have a small and growing contributor base. We'd love to hear your ideas, feedback, and suggestions for improvement: anything you think would make DataFog better! Join our growing community on [Discord](https://discord.gg/bzDth394R4).
### Dev Notes
- Justfile commands:
  - `just format` to apply formatting.
  - `just lint` to check formatting and style.
### Testing
To run the DataFog unit tests, clone this repository and run:
```bash
tox
```
## License
This software is published under the [MIT
license](https://en.wikipedia.org/wiki/MIT_License).
Raw data
{
"_id": null,
"home_page": "https://datafog.ai",
"name": "datafog",
"maintainer": "DataFog",
"docs_url": null,
"requires_python": ">=3.10",
"maintainer_email": "hi@datafog.ai",
"keywords": "pii, redaction, nlp, rag, retrieval augmented generation",
"author": "DataFog",
"author_email": "hi@datafog.ai",
"download_url": "https://files.pythonhosted.org/packages/5e/68/d035cc02914f5f3b337016b9b237422eb2a9c491ee5e9899f2327f326ebf/datafog-2.4.0.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Scan, redact, and manage PII in your documents before they get uploaded to a Retrieval Augmented Generation (RAG) system.",
"version": "2.4.0",
"project_urls": {
"Discord": "https://discord.gg/bzDth394R4",
"Documentation": "https://docs.datafog.ai",
"GitHub": "https://github.com/datafog/datafog-python",
"Homepage": "https://datafog.ai",
"Twitter": "https://twitter.com/datafoginc"
},
"split_keywords": [
"pii",
" redaction",
" nlp",
" rag",
" retrieval augmented generation"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "5e68d035cc02914f5f3b337016b9b237422eb2a9c491ee5e9899f2327f326ebf",
"md5": "538fb3bd32397776dc5ad3d6c11adc3b",
"sha256": "303b9242c9db08897f5c80d87febb90ca87d1bc82bbeadebedc391807f6b75d3"
},
"downloads": -1,
"filename": "datafog-2.4.0.tar.gz",
"has_sig": false,
"md5_digest": "538fb3bd32397776dc5ad3d6c11adc3b",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.10",
"size": 15223,
"upload_time": "2024-04-02T04:14:59",
"upload_time_iso_8601": "2024-04-02T04:14:59.523592Z",
"url": "https://files.pythonhosted.org/packages/5e/68/d035cc02914f5f3b337016b9b237422eb2a9c491ee5e9899f2327f326ebf/datafog-2.4.0.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-04-02 04:14:59",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "datafog",
"github_project": "datafog-python",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "presidio_analyzer",
"specs": [
[
"==",
"2.2.353"
]
]
},
{
"name": "pandas",
"specs": [
[
"==",
"2.2.1"
]
]
},
{
"name": "pytest",
"specs": [
[
"==",
"8.0.2"
]
]
},
{
"name": "Requests",
"specs": [
[
"==",
"2.31.0"
]
]
},
{
"name": "aiohttp",
"specs": [
[
"==",
"3.8.2"
]
]
},
{
"name": "yarl",
"specs": [
[
"==",
"1.8.1"
]
]
},
{
"name": "frozenlist",
"specs": [
[
"==",
"1.3.1"
]
]
},
{
"name": "en_spacy_pii_fast",
"specs": []
},
{
"name": "unstructured",
"specs": []
},
{
"name": "unstructured",
"specs": []
}
],
"tox": true,
"lcname": "datafog"
}