inkognito

Name	inkognito
Version	0.1.0
home_page	None
Summary	Privacy-first document processing FastMCP server with PII anonymization
upload_time	2025-08-13 17:45:52
maintainer	None
docs_url	None
author	None
requires_python	<3.13,>=3.10
license	MIT
keywords	anonymization document-processing extraction fastmcp mcp pdf pii privacy

# Inkognito

Privacy-first document processing FastMCP server. Extract, anonymize, and segment documents through FastMCP's modern tool interface.

Please note: Because Inkognito runs as an MCP server, the privacy of file contents cannot be absolutely guaranteed, although it is a central design consideration. The risk of file _contents_ leaking should be low (but non-zero); file _names_, however, are unavoidably and by design read and written by the MCP. Plan accordingly, and consider using a local model.

## Quick Start

### Installation

```bash
# Install via pip
pip install inkognito

# Or via uvx (no Python setup needed)
uvx inkognito

# Or run directly with FastMCP
fastmcp run inkognito
```

### Configure Claude Desktop

You also need a filesystem MCP server. If one is not already configured, add it alongside Inkognito.

Add the following to your `claude_desktop_config.json`. JSON does not allow comments, so the `env` block is left empty; once the cloud extractors are implemented, `AZURE_DI_KEY` and `LLAMAPARSE_API_KEY` can be added there as optional entries:

```json
{
  "mcpServers": {
    "inkognito": {
      "command": "uvx",
      "args": ["inkognito"],
      "env": {}
    },
    "filesystem": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-filesystem",
        "/Users/you/input-files-or-whatever",
        "/Users/you/output-folder-if-you-want-one"
      ],
      "env": {},
      "transport": "stdio",
      "type": null,
      "cwd": null,
      "timeout": null,
      "description": null,
      "icon": null,
      "authentication": null
    }
  }
}
```

### Basic Usage

In Claude Desktop:

```
"Extract this PDF to markdown"
"Anonymize all documents in my contracts folder"
"Split this large document into chunks for processing"
"Create individual prompts from this documentation"
```

## Features

### 🔒 Privacy-First Anonymization

- Universal PII detection (50+ types)
- Consistent replacements across all documents
- Reversible with secure vault file
- No configuration needed - smart defaults

### 📄 Multiple Extraction Options

- **Available Now**: Docling (default, with OCR support)
- **Planned**: Azure DI, LlamaIndex, MinerU (placeholders only)
- Auto-selects best available option
- Falls back to Docling if no cloud options

### ✂️ Intelligent Segmentation

- **Large documents**: 10k-30k token chunks
- **Prompt generation**: Split by headings
- Preserves context and structure
- Markdown-native processing

## FastMCP Tools

All tools are exposed through FastMCP's modern interface with automatic progress reporting and error handling.

### anonymize_documents

Replace PII with consistent fake data across multiple files.

```python
anonymize_documents(
    directory="/path/to/docs",
    output_dir="/secure/output"
)
```
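
As a rough illustration of what "consistent" replacement means, the sketch below maps each detected value to one stable placeholder and reuses it across documents. It is not Inkognito's implementation: the email-only regex detector, the `Vault` class, and the `EMAIL_001` placeholder format are all assumptions made for this example.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

# Hypothetical, deliberately tiny detector: real PII detection covers far more than emails.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


@dataclass
class Vault:
    """Illustrative mapping store: original value -> placeholder."""

    forward: dict[str, str] = field(default_factory=dict)

    def placeholder_for(self, original: str) -> str:
        # Reuse an existing placeholder so the same value is replaced identically everywhere.
        if original not in self.forward:
            self.forward[original] = f"EMAIL_{len(self.forward) + 1:03d}"
        return self.forward[original]


def anonymize_text(text: str, vault: Vault) -> str:
    """Replace every detected email with its stable placeholder."""
    return EMAIL_RE.sub(lambda m: vault.placeholder_for(m.group(0)), text)


# One vault shared across two "documents" keeps the replacements consistent.
vault = Vault()
doc_a = anonymize_text("Contact jane@example.com today.", vault)
doc_b = anonymize_text("jane@example.com and bob@example.org replied.", vault)
print(doc_a)
print(doc_b)
```

Sharing a single vault across all files in a run is what lets the same address receive the same placeholder in both documents.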

### extract_document

Convert PDF/DOCX to markdown.

```python
extract_document(
    file_path="/path/to/document.pdf",
    extraction_method="auto"  # auto, docling (others coming soon)
)
```

### segment_document

Split large documents for LLM processing.

```python
segment_document(
    file_path="/path/to/large.md",
    output_dir="/output/segments",
    max_tokens=20000
)
```
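
One way to picture token-bounded segmentation is greedy packing of paragraphs under a budget. The sketch below assumes a crude four-characters-per-token estimate and blank-line paragraph boundaries; the README does not say which tokenizer or splitting rules Inkognito uses, so `estimate_tokens` and `segment_markdown` are hypothetical names.

```python
from __future__ import annotations


def estimate_tokens(text: str) -> int:
    """Crude estimate (assumption): roughly four characters per token."""
    return max(1, len(text) // 4)


def segment_markdown(text: str, max_tokens: int) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks under max_tokens."""
    chunks: list[str] = []
    current: list[str] = []
    used = 0
    for paragraph in text.split("\n\n"):
        cost = estimate_tokens(paragraph)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(paragraph)
        used += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "\n\n".join(f"Paragraph {i}: " + "x" * 40 for i in range(6))
segments = segment_markdown(sample, max_tokens=30)
print(len(segments), "segments")
```

In this sketch a single paragraph larger than the budget is kept whole rather than cut mid-sentence, which trades a strict size limit for intact context.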

### split_into_prompts

Create individual prompts from structured content.

```python
split_into_prompts(
    file_path="/path/to/guide.md",
    output_dir="/output/prompts",
    split_level="h2",  # configurable; an LLM should be able to read the contents of these files safely
)
```
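
Splitting by heading level can be sketched in a few lines. The function below is an assumption about the general approach, not the project's code: `split_by_heading` is a hypothetical name, and it takes a numeric level (2 for `h2`) rather than the `split_level` string shown above.

```python
from __future__ import annotations


def split_by_heading(markdown: str, level: int = 2) -> list[str]:
    """Split markdown into sections that each begin at a heading of the given level."""
    marker = "#" * level + " "
    sections: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in markdown.splitlines():
        # Track fenced code blocks so "## " inside code is not treated as a heading.
        if line.startswith("```"):
            in_fence = not in_fence
        if not in_fence and line.startswith(marker) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [section for section in sections if section]


guide = "# Guide\n\nIntro.\n\n## Install\n\nRun pip.\n\n### Details\n\nMore.\n\n## Use\n\nCall it.\n"
parts = split_by_heading(guide, level=2)
for part in parts:
    print(part.splitlines()[0])
```

Deeper headings such as `### Details` stay inside their parent section, so each resulting prompt keeps its own sub-structure.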

### restore_documents

Restore original PII using vault.

```python
restore_documents(
    directory="/anonymized/docs",
    output_dir="/restored",
    vault_path="/secure/vault.json"
)
```
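
Conceptually, restoration is the inverse lookup of anonymization. The sketch below assumes a plain JSON vault of the form `{placeholder: original}`; the real vault format is not documented here and is described as encrypted, so treat `restore_text` and the file layout as purely illustrative.

```python
from __future__ import annotations

import json
import re
import tempfile
from pathlib import Path


def restore_text(text: str, vault_path: Path) -> str:
    """Swap placeholders back to originals using a {placeholder: original} JSON vault."""
    mapping: dict[str, str] = json.loads(vault_path.read_text(encoding="utf-8"))
    if not mapping:
        return text
    # Try longer placeholders first so "NAME_10" is not clobbered by "NAME_1".
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


with tempfile.TemporaryDirectory() as tmp:
    vault_file = Path(tmp) / "vault.json"
    vault_file.write_text(
        json.dumps({"NAME_1": "Ada Lovelace", "NAME_10": "Alan Turing"}),
        encoding="utf-8",
    )
    restored = restore_text("NAME_10 met NAME_1.", vault_file)

print(restored)
```

Because anyone holding the vault can reverse the anonymization, it should be stored separately from the anonymized output.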

## Extractor Status

| Extractor      | Status               | Notes                                                                            |
| -------------- | -------------------- | -------------------------------------------------------------------------------- |
| **Docling**    | ✅ Fully Implemented | Default extractor with OCR support (OCRMac on macOS, EasyOCR on other platforms) |
| **Azure DI**   | ⚠️ Placeholder       | Requires `AZURE_DI_KEY` environment variable when implemented                    |
| **LlamaIndex** | ⚠️ Placeholder       | Requires `LLAMAPARSE_API_KEY` environment variable when implemented              |
| **MinerU**     | ⚠️ Placeholder       | Will require magic-pdf library when implemented                                  |
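
The "auto" selection with a Docling fallback could look something like the sketch below. The priority order, the `select_extractor` function, and the idea of passing in a set of implemented extractors are assumptions for illustration; only the environment variable names come from the table above.

```python
from __future__ import annotations

from typing import Mapping

# Environment variables each cloud extractor would need (from the table above).
REQUIRED_ENV = {
    "azure_di": "AZURE_DI_KEY",
    "llamaindex": "LLAMAPARSE_API_KEY",
}

# Hypothetical priority order: cloud options first, local Docling as the fallback.
PRIORITY = ["azure_di", "llamaindex", "docling"]


def select_extractor(env: Mapping[str, str], implemented: set[str]) -> str:
    """Return the first implemented extractor whose required key (if any) is set."""
    for name in PRIORITY:
        if name not in implemented:
            continue  # Placeholder extractors are skipped entirely.
        required_key = REQUIRED_ENV.get(name)
        if required_key is None or env.get(required_key):
            return name
    raise RuntimeError("No extractor is available")


# Today only Docling is implemented, so a cloud key alone changes nothing.
print(select_extractor({"AZURE_DI_KEY": "example"}, implemented={"docling"}))
```

Under these assumptions, setting an API key has no effect until the matching extractor moves out of placeholder status.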

## Configuration

Following FastMCP conventions, all configuration is via environment variables:

```bash
# Optional API keys for cloud extractors (when implemented)
export AZURE_DI_KEY="your-key-here"
export LLAMAPARSE_API_KEY="your-key-here"

# Optional OCR languages (comma-separated, default: all available)
export INKOGNITO_OCR_LANGUAGES="en,fr,de"
```
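
For readers curious how a comma-separated setting like `INKOGNITO_OCR_LANGUAGES` is typically consumed, here is a minimal parsing sketch. The `ocr_languages` helper and its "`None` means all available" convention are assumptions; the README only documents the variable name, its format, and its default.

```python
from __future__ import annotations

import os
from typing import Mapping, Optional


def ocr_languages(env: Optional[Mapping[str, str]] = None) -> Optional[list[str]]:
    """Return the configured OCR language codes, or None to mean 'all available'."""
    source = os.environ if env is None else env
    raw = source.get("INKOGNITO_OCR_LANGUAGES", "")
    # Tolerate stray spaces and empty entries such as "en, fr,,de".
    languages = [code.strip().lower() for code in raw.split(",") if code.strip()]
    return languages or None


print(ocr_languages({"INKOGNITO_OCR_LANGUAGES": "en, fr,de"}))
print(ocr_languages({}))
```

Passing the environment in as a parameter keeps the helper easy to test without touching the real process environment.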

## Examples

### Legal Document Processing

```
You: "Anonymize all contracts in the merger folder for review"

Claude: "I'll anonymize those contracts for you..."

[Processing 23 files...]

✓ Anonymized 23 contracts
✓ Replaced: 145 company names, 89 person names, 67 case numbers
✓ Vault saved to: /output/vault.json
```

### Research Paper Extraction

```
You: "Extract this 300-page research PDF"

Claude: "I'll extract that PDF to markdown..."

[Using Docling for extraction...]

✓ Extracted 300 pages
✓ Preserved: tables, figures, citations
✓ Output size: 487,000 tokens
✓ Saved to: research_paper.md
```

### Documentation to Prompts

```
You: "Split this API documentation into individual prompts"

Claude: "I'll split the documentation by endpoints..."

[Splitting by H2 headings...]

✓ Created 47 prompt files
✓ Each prompt includes endpoint context
✓ Ready for training or testing
```

## Performance

| Extractor  | Speed          | Requirements | Status       |
| ---------- | -------------- | ------------ | ------------ |
| Azure DI   | 0.2-1 sec/page | API key      | Planned      |
| LlamaIndex | 1-2 sec/page   | API key      | Planned      |
| MinerU     | 3-7 sec/page   | Local, GPU   | Planned      |
| Docling    | 5-10 sec/page  | Local, CPU   | ✅ Available |
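
To put the per-page figures above in perspective, the arithmetic below estimates total extraction time for a 300-page PDF with Docling at 5-10 sec/page. The `estimate_minutes` helper is hypothetical, and the result is only as good as the table's rough ranges; real throughput depends on hardware and document content.

```python
from __future__ import annotations


def estimate_minutes(pages: int, sec_per_page: tuple[float, float]) -> tuple[float, float]:
    """Convert a (low, high) seconds-per-page range into total minutes."""
    low, high = sec_per_page
    return pages * low / 60, pages * high / 60


# Docling range from the table: 5-10 seconds per page.
low_min, high_min = estimate_minutes(300, (5, 10))
print(f"Docling, 300 pages: about {low_min:.0f}-{high_min:.0f} minutes")
```

By this estimate, a document of that size is a background job rather than an interactive one when processed locally on CPU.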

## Privacy & Security

- **Local processing**: No cloud services required
- **No persistence**: Nothing saved without explicit paths
- **Secure vaults**: Encrypted mapping storage
- **API key safety**: Never logged or transmitted

## Development

### Running Locally

```bash
# Clone the repository
git clone https://github.com/phren0logy/inkognito
cd inkognito

# Run with FastMCP CLI
fastmcp dev

# Or run directly in development
uv run python server.py
```

### Testing with FastMCP

```bash
# Install the server configuration
fastmcp install inkognito

# Test a specific tool
fastmcp test inkognito extract_document
```

## Project Structure

```
inkognito/
├── pyproject.toml          # FastMCP-compatible packaging
├── LICENSE                 # MIT license
├── README.md               # This file
├── server.py               # FastMCP server and entry point
├── anonymizer.py           # PII detection and anonymization
├── vault.py                # Vault management for reversibility
├── segmenter.py            # Document segmentation
├── exceptions.py           # Custom exceptions
├── extractors/             # PDF extraction backends
│   ├── __init__.py
│   ├── base.py
│   ├── registry.py
│   ├── docling.py          # ✅ Implemented
│   ├── azure_di.py         # Placeholder
│   ├── llamaindex.py       # Placeholder
│   └── mineru.py           # Placeholder
└── tests/
```

## License

MIT License - see LICENSE file for details.

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "inkognito",
    "maintainer": null,
    "docs_url": null,
    "requires_python": "<3.13,>=3.10",
    "maintainer_email": null,
    "keywords": "anonymization, document-processing, extraction, fastmcp, mcp, pdf, pii, privacy",
    "author": null,
    "author_email": "Andrew Nanton <git-nanton@stanford.edu>",
    "download_url": "https://files.pythonhosted.org/packages/1b/e3/3ecc1b1c8650c06c277b4123326a804beb412d0b77617e8e88c014d90b74/inkognito-0.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Privacy-first document processing FastMCP server with PII anonymization",
    "version": "0.1.0",
    "project_urls": {
        "Documentation": "https://github.com/phren0logy/inkognito#readme",
        "Homepage": "https://github.com/phren0logy/inkognito",
        "Issues": "https://github.com/phren0logy/inkognito/issues",
        "Repository": "https://github.com/phren0logy/inkognito"
    },
    "split_keywords": [
        "anonymization",
        " document-processing",
        " extraction",
        " fastmcp",
        " mcp",
        " pdf",
        " pii",
        " privacy"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "78c31490de84b78e78dbd004f336f788e5ad299ae71eea03a1136e251a27780d",
                "md5": "bd2fe9a9d21281206c9e9f46b875ff48",
                "sha256": "de866f176003e0e83a24759e719599c96f363b911685e2191fa7d0cb75b3fda4"
            },
            "downloads": -1,
            "filename": "inkognito-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "bd2fe9a9d21281206c9e9f46b875ff48",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.13,>=3.10",
            "size": 29973,
            "upload_time": "2025-08-13T17:45:51",
            "upload_time_iso_8601": "2025-08-13T17:45:51.047983Z",
            "url": "https://files.pythonhosted.org/packages/78/c3/1490de84b78e78dbd004f336f788e5ad299ae71eea03a1136e251a27780d/inkognito-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "1be33ecc1b1c8650c06c277b4123326a804beb412d0b77617e8e88c014d90b74",
                "md5": "bce1fe51de78838bae2db8f26d1376b4",
                "sha256": "d9c2983fce901decb919165ace412603c7f1ec8376581c176fb8283316e92823"
            },
            "downloads": -1,
            "filename": "inkognito-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "bce1fe51de78838bae2db8f26d1376b4",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<3.13,>=3.10",
            "size": 88313,
            "upload_time": "2025-08-13T17:45:52",
            "upload_time_iso_8601": "2025-08-13T17:45:52.621931Z",
            "url": "https://files.pythonhosted.org/packages/1b/e3/3ecc1b1c8650c06c277b4123326a804beb412d0b77617e8e88c014d90b74/inkognito-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-13 17:45:52",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "phren0logy",
    "github_project": "inkognito#readme",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "inkognito"
}
