pdf-requirement-extractor


- **Name**: pdf-requirement-extractor
- **Version**: 2.0.0
- **Home page**: https://pypi.org/project/pdf-requirement-extractor/
- **Summary**: Extract structured brand requirements from PDF documents
- **Upload time**: 2025-08-19 06:27:19
- **Author**: S M Asiful Islam Saky
- **Requires Python**: >=3.9
- **Keywords**: pdf, extraction, brand, requirements, guidelines, parsing, text-processing, document-analysis, openai, gpt

# PDF Requirement Extractor

[![Python Version](https://img.shields.io/pypi/pyversions/pdf-requirement-extractor)](https://pypi.org/project/pdf-requirement-extractor/)
[![PyPI Version](https://img.shields.io/pypi/v/pdf-requirement-extractor)](https://pypi.org/project/pdf-requirement-extractor/)
[![License](https://img.shields.io/pypi/l/pdf-requirement-extractor)](https://pypi.org/project/pdf-requirement-extractor/)

- **PyPI**: [pdf-requirement-extractor](https://pypi.org/project/pdf-requirement-extractor/)
- **Documentation**: Available in this README and package docstrings

A Python package that extracts structured brand requirements from PDF documents using PyMuPDF, with optional GPT enhancement.

## 🚀 Features

- **📄 PDF Text Extraction**: Extract text from PDF documents using PyMuPDF (fitz)
- **🔍 Pattern Recognition**: Detect hashtags, URLs, emails, colors, fonts, and more using regex
- **🤖 GPT Enhancement**: Optional GPT-powered analysis for deeper insights (requires OpenAI API)
- **📊 Structured Output**: Get results in JSON format with organized categories
- **🎨 Brand-Specific**: Tailored for brand guidelines, style guides, and requirement documents  
- **⚡ Fast & Efficient**: Optimized for batch processing of multiple documents
- **🔧 Customizable**: Add custom regex patterns for specific extraction needs
- **💪 Type Hints**: Full type annotation support for better development experience

## 📦 Installation

### Basic Installation
```bash
pip install pdf-requirement-extractor
```

### With GPT Enhancement
```bash
pip install "pdf-requirement-extractor[gpt]"
```

### With OCR Support  
```bash
pip install "pdf-requirement-extractor[ocr]"
```

### All Features
```bash
pip install "pdf-requirement-extractor[all]"
```

## 🔧 Quick Start

### Basic Usage

```python
from pdf_requirements_extractor import PDFRequirementExtractor

# Initialize the extractor
extractor = PDFRequirementExtractor()

# Extract requirements from a PDF file
result = extractor.extract_from_file("brand_guidelines.pdf")

# Access the results
print(f"Document: {result.file_name}")
print(f"Pages: {result.total_pages}")
print(f"Requirements found: {len(result.extracted_requirements)}")

# Get specific requirement categories
requirements = result.extracted_requirements
hashtags = requirements.get("hashtags", [])
colors = requirements.get("colors", [])
fonts = requirements.get("fonts", [])

print(f"Hashtags: {hashtags}")
print(f"Colors: {colors}")
print(f"Fonts: {fonts}")
```

### Enhanced Extraction with GPT

```python
import os
from pdf_requirements_extractor import PDFRequirementExtractor
from pdf_requirements_extractor.gpt_parser import GPTRequirementParser, GPTConfig

# Set your OpenAI API key
os.environ["OPENAI_API_KEY"] = "your-api-key-here"

# Configure GPT (default model is gpt-4.1-mini)
gpt_config = GPTConfig(
    model="gpt-4.1-mini",  # Default model - you can change to your preferred model
    temperature=0.3,
    max_tokens=1500
)

# Or use factory methods for common configurations:
# gpt_config = GPTConfig.for_cost_effective()      # gpt-3.5-turbo optimized
# gpt_config = GPTConfig.for_high_quality()        # gpt-4 optimized  
# gpt_config = GPTConfig.for_fast_processing()     # gpt-3.5-turbo fast
# gpt_config = GPTConfig.for_mini_model()          # gpt-4.1-mini cost-effective
# gpt_config = GPTConfig.create_default()          # Uses gpt-4.1-mini by default

# Initialize enhanced parser
parser = GPTRequirementParser(gpt_config=gpt_config)

# Basic extraction first
extractor = PDFRequirementExtractor()
basic_result = extractor.extract_from_file("brand_guidelines.pdf")

# Enhance with GPT
enhanced_requirements = parser.parse_with_gpt(
    basic_result.raw_text_excerpt,
    basic_result.extracted_requirements
)

# Access enhanced results
print("Enhanced requirements:")
for category, data in enhanced_requirements.items():
    print(f"{category}: {data}")
```
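
Hardcoding the key, as the snippet above does, is convenient for a demo but easy to leak into version control. A minimal sketch, using only the standard library, of reading the key from the environment and failing early when it is absent. `require_api_key` is a hypothetical helper for illustration, not part of this package:

```python
import os


def require_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in env_var, or raise if it is missing or blank."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it in your shell before enabling GPT enhancement"
        )
    return key
```

Calling a check like this once at startup, before constructing the parser, surfaces a missing key immediately rather than partway through a run.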

### Custom Pattern Extraction

```python
from pdf_requirements_extractor import PDFRequirementExtractor

# Define custom patterns
custom_patterns = {
    "social_handles": r"@[A-Za-z0-9_]+",
    "price_mentions": r"\$\d+(?:\.\d{2})?",
    "version_numbers": r"v?\d+\.\d+(?:\.\d+)?",
}

# Initialize with custom patterns
extractor = PDFRequirementExtractor(custom_patterns=custom_patterns)

# Extract with custom patterns
result = extractor.extract_from_file("document.pdf")
social_handles = result.extracted_requirements.get("social_handles", [])
```

### Batch Processing

```python
from pdf_requirements_extractor import PDFRequirementExtractor
import json
from pathlib import Path

# Initialize extractor
extractor = PDFRequirementExtractor()

# Process multiple files
pdf_files = Path("pdf_directory").glob("*.pdf")
results = []

for pdf_file in pdf_files:
    try:
        result = extractor.extract_from_file(pdf_file)
        results.append(result.to_dict())
        print(f"✅ Processed: {pdf_file.name}")
    except Exception as e:
        print(f"❌ Error processing {pdf_file.name}: {e}")

# Save all results
with open("batch_results.json", "w") as f:
    json.dump(results, f, indent=2)
```
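
Once `batch_results.json` exists, it can be summarized without the package itself. The sketch below assumes each entry has the shape shown under Output Format (a `file_name` plus an `extracted_requirements` mapping of category to list). `summarize_batch` is a hypothetical helper, and the inline sample stands in for the real file:

```python
from collections import Counter
from typing import Any


def summarize_batch(results: list[dict[str, Any]]) -> dict[str, int]:
    """Count how many extracted items each category has across all documents."""
    totals: Counter[str] = Counter()
    for entry in results:
        for category, items in entry.get("extracted_requirements", {}).items():
            totals[category] += len(items)
    return dict(totals)


# Inline sample standing in for the contents of batch_results.json
sample_results = [
    {"file_name": "a.pdf", "extracted_requirements": {"hashtags": ["#One"], "colors": ["#FF5733"]}},
    {"file_name": "b.pdf", "extracted_requirements": {"hashtags": ["#Two", "#Three"]}},
]
print(summarize_batch(sample_results))
```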

## Output Format

The extraction result can be serialized (for example with `result.to_dict()`) to JSON with the following structure:

```json
{
  "file_name": "brand_guidelines.pdf",
  "total_pages": 24,
  "raw_text_excerpt": "Brand Guidelines Document...",
  "extracted_requirements": {
    "hashtags": ["#BrandName", "#QualityFirst"],
    "urls": ["https://company.com", "https://support.company.com"],
    "emails": ["contact@company.com", "support@company.com"],
    "colors": ["#FF5733", "RGB(255, 87, 51)", "Pantone 286 C"],
    "fonts": ["Helvetica Neue Bold", "Arial Regular"],
    "quoted_phrases": ["Excellence in Everything", "Your Trusted Partner"],
    "phone_numbers": ["(555) 123-4567", "1-800-COMPANY"],
    "dimensions": ["1 inch", "12pt", "300 DPI"],
    "tone": ["professional", "modern"],
    "logo_requirements": ["Minimum size: 1 inch", "Clear space required"],
    "spoken_phrases": ["Excellence in Everything"]
  },
  "metadata": {
    "title": "Brand Guidelines",
    "author": "Design Team", 
    "pages": 24,
    "creation_date": "2024-01-15"
  }
}
```
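
For downstream use such as a spreadsheet export, the nested structure can be flattened into rows. This is a sketch that assumes the output matches the sample above; `flatten_requirements` is a hypothetical helper and the JSON is a trimmed copy of that sample:

```python
import json

# Trimmed copy of the sample output above
raw = """
{
  "file_name": "brand_guidelines.pdf",
  "total_pages": 24,
  "extracted_requirements": {
    "hashtags": ["#BrandName", "#QualityFirst"],
    "colors": ["#FF5733"],
    "tone": []
  }
}
"""


def flatten_requirements(document: dict) -> list[tuple[str, str, str]]:
    """Turn the nested result into (file_name, category, value) rows."""
    rows = []
    for category, values in document.get("extracted_requirements", {}).items():
        for value in values:
            rows.append((document.get("file_name", ""), category, value))
    return rows


rows = flatten_requirements(json.loads(raw))
for row in rows:
    print(row)
```

Empty categories such as `tone` in this sample produce no rows.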

## Supported Extraction Patterns

### Default Patterns

The package comes with built-in patterns for:

- **Hashtags**: `#BrandName`, `#Marketing`
- **URLs**: `https://company.com`, `www.example.org`
- **Emails**: `contact@company.com`
- **Phone Numbers**: `(555) 123-4567`, `+1-800-555-0123`
- **Colors**: `#FF5733`, `RGB(255,87,51)`, `Pantone 286 C`
- **Fonts**: Font family names and specifications
- **Dimensions**: `12pt`, `1 inch`, `300 DPI`, `50px`
- **Quoted Text**: `"Brand taglines"`, `'Voice guidelines'`
- **Brand Terms**: Keywords related to brand guidelines
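
The package's actual built-in expressions are not shown in this README. To illustrate how regex-based detection of this kind can work, here is a sketch with stand-in patterns. Every pattern and the `extract_all` function are assumptions made for illustration, not the package's implementation:

```python
import re

# Illustrative stand-ins only; NOT the package's built-in patterns.
SKETCH_PATTERNS = {
    "hex_colors": re.compile(r"#(?:[0-9A-Fa-f]{6}|[0-9A-Fa-f]{3})\b"),
    "hashtags": re.compile(r"#[A-Za-z_]\w*"),
    "emails": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
}


def extract_all(text: str) -> dict[str, list[str]]:
    """Run every sketch pattern over text and collect unique matches in order."""
    found = {}
    for name, pattern in SKETCH_PATTERNS.items():
        matches = pattern.findall(text)
        found[name] = list(dict.fromkeys(matches))  # de-duplicate, keep order
    # Hex colors also look like hashtags, so drop them from the hashtag list.
    found["hashtags"] = [h for h in found["hashtags"] if h not in found["hex_colors"]]
    return found


sample = "Use #FF5733 with #BrandName. Questions: contact@company.com"
print(extract_all(sample))
```

One limitation of this sketch: a hashtag made up entirely of hex digits, such as `#Bad`, is classified as a color. Telling the two apart reliably needs context that a single regex does not have.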

### Custom Patterns

You can add custom patterns when initializing the extractor:

```python
from pdf_requirements_extractor import PDFRequirementExtractor

# Define custom patterns
custom_patterns = {
    "social_handles": r"@[A-Za-z0-9_]+",
    "price_ranges": r"\$\d+-\$\d+",
    "dates": r"\b\d{1,2}/\d{1,2}/\d{4}\b"
}

# Initialize with custom patterns
extractor = PDFRequirementExtractor(custom_patterns=custom_patterns)
```
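
Before handing patterns to the extractor, it can help to try them on sample text with the standard `re` module. The sketch assumes the extractor applies each pattern in a `findall`-like way, which this README does not state; `sample_text` is made up for the demonstration:

```python
import re

custom_patterns = {
    "social_handles": r"@[A-Za-z0-9_]+",
    "price_ranges": r"\$\d+-\$\d+",
    "dates": r"\b\d{1,2}/\d{1,2}/\d{4}\b",
}

sample_text = "Follow @brand_team. Plans cost $10-$25 until 12/31/2025."

# Compile each pattern (raises re.error on a malformed expression) and try it out.
preview = {
    name: re.compile(pattern).findall(sample_text)
    for name, pattern in custom_patterns.items()
}
print(preview)
```

A preview like this also exposes overlaps: the `social_handles` pattern would match the `@company` part of an email address, so it may need a lookbehind or post-filtering if the document contains emails.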

## GPT Enhancement Features

When using GPT enhancement, you get:

### Advanced Analysis
- **Brand Name Detection**: Automatic brand name identification
- **Tone Analysis**: Professional, casual, modern, traditional classifications
- **Logo Guidelines**: Structured logo usage rules and restrictions
- **Color Categorization**: Primary, secondary, accent color groupings
- **Typography Hierarchy**: Primary, secondary, body font classifications

### Strategic Insights
- **Priority Levels**: Importance ranking for each requirement
- **Implementation Notes**: Practical guidance for applying guidelines
- **Compliance Analysis**: Risk assessment for brand consistency
- **Missing Elements**: Identification of gaps in brand guidelines

### Enhanced Structure
```json
{
  "brand_name": "Company Name",
  "document_type": "Brand Guidelines",
  "primary_colors": ["#FF5733", "#0066CC"],
  "fonts": {
    "primary": "Helvetica Neue Bold",
    "secondary": "Arial Regular",
    "body": "Open Sans"
  },
  "logo_requirements": {
    "minimum_size": "1 inch",
    "clear_space": "2x logo height",
    "usage_rules": ["Always use on white background"],
    "prohibited_uses": ["Never stretch or distort"]
  },
  "tone_of_voice": {
    "personality": ["professional", "approachable"],
    "style": "Clear and confident communication",
    "do_say": ["We deliver excellence"],
    "dont_say": ["Maybe", "Possibly"]
  }
}
```
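
Model output can vary between runs, so it is prudent to check for the keys you rely on before using them. A minimal sketch, assuming the response is a dictionary shaped like the sample above; `missing_keys` is a hypothetical helper, not part of the package:

```python
from typing import Any

# Top-level keys taken from the sample structure above.
EXPECTED_KEYS = {
    "brand_name",
    "document_type",
    "primary_colors",
    "fonts",
    "logo_requirements",
    "tone_of_voice",
}


def missing_keys(enhanced: dict[str, Any]) -> list[str]:
    """Return the expected top-level keys that are absent, sorted by name."""
    return sorted(EXPECTED_KEYS - set(enhanced))


partial = {"brand_name": "Company Name", "fonts": {"primary": "Helvetica Neue Bold"}}
print(missing_keys(partial))
```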

## Configuration Options

### Extractor Configuration

```python
from pdf_requirements_extractor import PDFRequirementExtractor

extractor = PDFRequirementExtractor(
    max_text_length=2000,     # Maximum excerpt length
    enable_ocr=True,          # Enable OCR for scanned PDFs
    custom_patterns={         # Custom regex patterns
        "pattern_name": r"regex_pattern"
    }
)
```

### GPT Configuration

```python
from pdf_requirements_extractor.gpt_parser import GPTConfig

# Method 1: Explicit configuration (recommended)
gpt_config = GPTConfig(
    model="gpt-4.1-mini",   # Default model - choose your preferred model
    temperature=0.3,         # Sampling temperature (lower is more deterministic)
    max_tokens=2000,        # Maximum response tokens
    timeout=30              # Request timeout in seconds
)

# Method 2: Use factory methods for common use cases
cost_config = GPTConfig.for_cost_effective()    # gpt-3.5-turbo, optimized for cost
quality_config = GPTConfig.for_high_quality()   # gpt-4, optimized for quality
fast_config = GPTConfig.for_fast_processing()   # gpt-3.5-turbo, optimized for speed
mini_config = GPTConfig.for_mini_model()        # gpt-4.1-mini, cost-effective high quality

# Method 3: Create default with custom model (uses gpt-4.1-mini by default)
default_config = GPTConfig.create_default()
custom_config = GPTConfig.create_default(model="gpt-4-turbo", temperature=0.2)
```

## Command Line Usage

The package includes a command-line interface:

```bash
# Basic extraction
pdf-extract-requirements input.pdf

# Save to specific output file
pdf-extract-requirements input.pdf --output results.json

# Enable GPT enhancement (uses gpt-4.1-mini by default)
pdf-extract-requirements input.pdf --gpt --api-key your-key

# Use specific GPT model
pdf-extract-requirements input.pdf --gpt --gpt-model gpt-4 --api-key your-key

# Process multiple files
pdf-extract-requirements *.pdf --batch --output-dir results/

# Use custom patterns
pdf-extract-requirements input.pdf --patterns patterns.json
```
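
This README does not document the file format expected by `--patterns`. Assuming it is a flat JSON object mapping pattern names to regex strings, mirroring the `custom_patterns` dictionary, a file could be produced and checked as sketched below. Both the format and the `write_patterns_file` helper are assumptions:

```python
import json
import re
from pathlib import Path


def write_patterns_file(patterns: dict[str, str], path: Path) -> None:
    """Validate every regex, then save the name-to-pattern mapping as JSON."""
    for name, pattern in patterns.items():
        try:
            re.compile(pattern)
        except re.error as exc:
            raise ValueError(f"Pattern {name!r} is not a valid regex: {exc}") from exc
    path.write_text(json.dumps(patterns, indent=2), encoding="utf-8")
```

For example, `write_patterns_file({"social_handles": r"@[A-Za-z0-9_]+"}, Path("patterns.json"))` writes a file in the assumed format. Note that `json.dumps` escapes backslashes, so a pattern such as `\d+` appears as `\\d+` in the file.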

## Testing

Run the test suite:

```bash
# Install development dependencies
pip install "pdf-requirement-extractor[dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=pdf_requirements_extractor

# Run specific test categories
pytest -m unit          # Unit tests only
pytest -m integration   # Integration tests only
pytest -m "not slow"    # Skip slow tests
```

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- [PyMuPDF](https://pymupdf.readthedocs.io/) for excellent PDF processing capabilities
- [OpenAI](https://openai.com/) for GPT API integration
- All contributors and users of this package

## 📞 Support

- 📧 **Email**: saky.aiu22@gmail.com
- **PyPI**: [pdf-requirement-extractor](https://pypi.org/project/pdf-requirement-extractor/)
- **Documentation**: Available in this README and package docstrings


            

Raw data

            {
    "_id": null,
    "home_page": "https://pypi.org/project/pdf-requirement-extractor/",
    "name": "pdf-requirement-extractor",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "pdf, extraction, brand, requirements, guidelines, parsing, text-processing, document-analysis, openai, gpt",
    "author": "S M Asiful Islam Saky",
    "author_email": "S M Asiful Islam Saky <saky.aiu22@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/e7/02/f7303189f62f8be989ccb56dafe8e7e15fd9f2fcb2ac031759327290bfbf/pdf_requirement_extractor-2.0.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "Extract structured brand requirements from PDF documents",
    "version": "2.0.0",
    "project_urls": {
        "Documentation": "https://pypi.org/project/pdf-requirement-extractor/",
        "Download": "https://pypi.org/project/pdf-requirement-extractor/#files",
        "Homepage": "https://pypi.org/project/pdf-requirement-extractor/",
        "Source Code": "https://pypi.org/project/pdf-requirement-extractor/"
    },
    "split_keywords": [
        "pdf",
        " extraction",
        " brand",
        " requirements",
        " guidelines",
        " parsing",
        " text-processing",
        " document-analysis",
        " openai",
        " gpt"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "ee8feedfed3355cf7b900208aa82f74c8414941a7f6039633f9aad0d1456f184",
                "md5": "3355c460e56950536e64910b0ef52154",
                "sha256": "2a99017b01879661f3f1e42572ce2de481bf298f611d1cdbf5274fa39b75be69"
            },
            "downloads": -1,
            "filename": "pdf_requirement_extractor-2.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3355c460e56950536e64910b0ef52154",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 27898,
            "upload_time": "2025-08-19T06:27:17",
            "upload_time_iso_8601": "2025-08-19T06:27:17.603811Z",
            "url": "https://files.pythonhosted.org/packages/ee/8f/eedfed3355cf7b900208aa82f74c8414941a7f6039633f9aad0d1456f184/pdf_requirement_extractor-2.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "e702f7303189f62f8be989ccb56dafe8e7e15fd9f2fcb2ac031759327290bfbf",
                "md5": "fab16c4de48c1b581508f9793d37f08a",
                "sha256": "26480fc16dc4dd7204c443df5d91cfc6efa43ac71d64d37982001260979d8a28"
            },
            "downloads": -1,
            "filename": "pdf_requirement_extractor-2.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "fab16c4de48c1b581508f9793d37f08a",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 28723,
            "upload_time": "2025-08-19T06:27:19",
            "upload_time_iso_8601": "2025-08-19T06:27:19.562499Z",
            "url": "https://files.pythonhosted.org/packages/e7/02/f7303189f62f8be989ccb56dafe8e7e15fd9f2fcb2ac031759327290bfbf/pdf_requirement_extractor-2.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-19 06:27:19",
    "github": false,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "lcname": "pdf-requirement-extractor"
}
        