pricetag


| Field | Value |
| --- | --- |
| Name | pricetag |
| Version | 1.0.0 |
| Home page | None |
| Summary | A pure-Python library for extracting price and currency information from unstructured text |
| Upload time | 2025-08-21 23:50:22 |
| Maintainer | None |
| Docs URL | None |
| Author | None |
| Requires Python | >=3.13 |
| License | None |
| Keywords | price extraction, salary parsing, text processing, nlp, compensation, job postings, data extraction |
| Requirements | No requirements were recorded. |
| Travis CI | No Travis. |
| Coveralls test coverage | No coveralls. |

# Pricetag

A pure-Python library for extracting price and currency information from unstructured text, with a primary focus on salary data in job postings.

## Features

- **Zero Dependencies**: Uses only Python standard library
- **High Performance**: Processes 1000+ documents per second
- **Comprehensive Pattern Support**: 
  - Individual amounts: `$50,000`, `$50k`, `50k USD`
  - Ranges: `$50k-$75k`, `$50,000 to $75,000`
  - Hourly/Annual/Monthly rates: `$25/hour`, `$50k/year`, `$4k/month`
  - Contextual terms: `six figures`, `competitive`, `DOE`
- **Smart Normalization**: Converts hourly and monthly rates to annual USD amounts for easy comparison
- **Confidence Scoring**: Each extraction includes a confidence score
- **Validation & Sanity Checks**: Flags unusual or problematic values

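To make the pattern list concrete, here is a minimal sketch of how amounts such as `$50,000` and `$50k` could be matched with a pre-compiled regular expression. It is not taken from pricetag's source: the pattern, `AMOUNT_RE`, and `parse_amounts` are hypothetical names used only for illustration, and the sketch handles only dollar-prefixed amounts.

```python
import re

# Hypothetical pattern, not pricetag's own: a dollar sign, digits with
# optional thousands separators, an optional decimal part, and an
# optional "k" suffix that must end at a word boundary.
AMOUNT_RE = re.compile(r"\$(\d{1,3}(?:,\d{3})+|\d+)(\.\d+)?([kK]\b)?")


def parse_amounts(text: str) -> list[float]:
    """Return every dollar amount found in text as a float."""
    amounts = []
    for match in AMOUNT_RE.finditer(text):
        whole, fraction, suffix = match.groups()
        value = float(whole.replace(",", "") + (fraction or ""))
        if suffix:
            value *= 1_000  # "$50k" means 50 thousand
        amounts.append(value)
    return amounts


print(parse_amounts("Base $50,000, or $50k-$75k for seniors"))
```

A range such as `$50k-$75k` shows up here as two separate amounts; pairing them into a `(min, max)` tuple would be an additional step.
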
## Installation

```bash
pip install pricetag
```

## Quick Start

```python
from pricetag import PriceExtractor

# Initialize the extractor
extractor = PriceExtractor()

# Extract prices from text
text = "Senior Engineer position paying $120,000 - $150,000 annually"
results = extractor.extract(text)

# Access the results
for result in results:
    print(f"Value: {result['value']}")
    print(f"Type: {result['type']}")
    print(f"Confidence: {result['confidence']}")
    print(f"Annual value: {result['normalized_annual']}")
```
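
Because each result is a dictionary, ordinary list operations are enough for post-processing. The sketch below keeps only high-confidence results and sorts them; `confident_results` is a hypothetical helper, and the sample dictionaries are hand-written stand-ins with illustrative confidence values rather than real extractor output.

```python
def confident_results(results: list[dict], threshold: float = 0.8) -> list[dict]:
    """Keep results at or above threshold, highest confidence first."""
    kept = [r for r in results if r["confidence"] >= threshold]
    return sorted(kept, key=lambda r: r["confidence"], reverse=True)


# Hand-written stand-ins for extractor output; the numbers are illustrative.
sample = [
    {"raw_text": "six figures", "confidence": 0.7},
    {"raw_text": "$120,000 - $150,000", "confidence": 0.95},
]

print([r["raw_text"] for r in confident_results(sample)])
```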

## Configuration Options

```python
extractor = PriceExtractor(
    min_confidence=0.5,        # Minimum confidence score to include
    include_contextual=True,   # Extract terms like "six figures"
    normalize_to_annual=True,  # Convert to annual amounts
    min_salary=10000,          # Minimum reasonable salary
    max_salary=10000000,       # Maximum reasonable salary
    assume_hours_per_year=2080,  # For hourly→annual conversion
    fast_mode=False,           # Enable for better performance
    max_results_per_text=None  # Limit number of results
)
```

## Examples

### Basic Salary Extraction
```python
text = "The position offers $75,000 per year with benefits"
results = extractor.extract(text)
# Returns: [{'value': 75000.0, 'type': 'annual', ...}]
```

### Hourly Rate with Normalization
```python
text = "Paying $45/hour for senior developers"
results = extractor.extract(text)
# Returns: [{'value': 45.0, 'type': 'hourly', 'normalized_annual': 93600.0, ...}]
```
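
The `normalized_annual` figure above follows from the default `assume_hours_per_year=2080` (40 hours a week for 52 weeks): 45 × 2080 = 93,600. A minimal sketch of that arithmetic is below; `to_annual` is a hypothetical helper, and the monthly multiplier of 12 is an assumption rather than something the README states.

```python
def to_annual(value: float, rate_type: str, hours_per_year: int = 2080) -> float:
    """Convert an hourly, monthly, or annual amount to an annual amount."""
    if rate_type == "hourly":
        return value * hours_per_year
    if rate_type == "monthly":
        return value * 12  # assumed multiplier, not confirmed by the README
    if rate_type == "annual":
        return value
    raise ValueError(f"unsupported rate type: {rate_type!r}")


print(to_annual(45.0, "hourly"))  # 93600.0
```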

### Salary Range
```python
text = "Salary range: $60k-$80k depending on experience"
results = extractor.extract(text)
# Returns: [{'value': (60000.0, 80000.0), 'is_range': True, ...}]
```

### Contextual Terms
```python
text = "Looking for someone with 5+ years experience, six figures"
results = extractor.extract(text)
# Returns: [{'value': (100000, 999999), 'type': 'unknown', 'confidence': 0.7, ...}]
```
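
The range reported for "six figures" is simply the smallest and largest six-digit integers. As a sketch of that arithmetic (`figures_to_range` is a hypothetical name, not part of the library):

```python
def figures_to_range(figures: int) -> tuple[int, int]:
    """Return the smallest and largest integers with the given digit count."""
    if figures < 1:
        raise ValueError("figures must be at least 1")
    return 10 ** (figures - 1), 10 ** figures - 1


print(figures_to_range(6))  # (100000, 999999)
```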

### Multiple Prices
```python
text = "Base: $100k, Bonus: up to $30k, Equity: 0.5%"
results = extractor.extract(text)
# Returns multiple results with appropriate flags
```

### Batch Processing
```python
texts = [
    "Salary: $50,000",
    "Rate: $30/hour", 
    "Competitive pay"
]
results_batch = extractor.extract_batch(texts)
```

## Output Format

Each extraction returns a `PriceResult` dictionary:

```python
{
    'value': float | tuple[float, float],  # Single value or (min, max)
    'raw_text': str,                       # Original matched text
    'position': tuple[int, int],           # Character positions
    'type': str,                           # 'hourly', 'annual', 'monthly', etc.
    'confidence': float,                   # 0.0 to 1.0
    'normalized_annual': float | tuple,    # Annual USD amount
    'currency': str,                       # Always 'USD' in v1
    'is_range': bool,                      # True for ranges
    'flags': list[str]                     # Validation flags
}
```

### Validation Flags

- `invalid_range`: Max less than min
- `below_minimum`: Below configured threshold
- `above_maximum`: Above configured threshold
- `unreasonable_hourly_rate`: Outside $7-$500/hour
- `potential_inconsistency`: Large discrepancy with other prices
- `ambiguous_type`: Unclear if hourly/annual
- `approximate`: Estimated from "approximately"
- `requires_market_data`: Needs external data (e.g., "competitive")
- `experience_dependent`: Depends on experience (e.g., "DOE")

## Performance

The library is optimized for high-volume processing:

- Pre-compiled regex patterns
- Number parsing cache
- Quick pre-filtering
- Fast mode for bulk processing
- Batch processing support

```python
# Fast mode for high-volume processing
extractor = PriceExtractor(fast_mode=True, max_results_per_text=5)

# Process 1000 documents
texts = ["..." for _ in range(1000)]
results = extractor.extract_batch(texts)  # < 1 second
```
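
Throughput depends on hardware and on the length of the input texts, so it may be worth measuring on your own data. The harness below is a minimal sketch: `docs_per_second` and `stand_in` are hypothetical names, and the stand-in workload exists only so the example runs without pricetag installed. Passing `extractor.extract_batch` in its place would time the real library.

```python
import time
from collections.abc import Callable, Sequence


def docs_per_second(
    process: Callable[[Sequence[str]], object], texts: Sequence[str]
) -> float:
    """Time one call of process(texts) and return documents per second."""
    start = time.perf_counter()
    process(texts)
    elapsed = time.perf_counter() - start
    return len(texts) / elapsed if elapsed > 0 else float("inf")


def stand_in(texts: Sequence[str]) -> list[int]:
    """Placeholder workload: count dollar signs in each text."""
    return [text.count("$") for text in texts]


sample = ["Salary: $50,000"] * 1000
rate = docs_per_second(stand_in, sample)
print(f"{rate:,.0f} documents per second")
```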

## Testing

Run the test suite:

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Run with coverage
pytest tests/ --cov=pricetag
```

## Limitations

- Currently supports USD only
- Optimized for US salary formats
- Context window limited to surrounding text
- Does not handle equity/stock compensation

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

MIT License - see LICENSE file for details

## Author

Michelle Pellon - mgracepellon@gmail.com

## Acknowledgments

Built with pure Python for maximum compatibility and zero dependencies.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "pricetag",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.13",
    "maintainer_email": "Michelle Pellon <mgracepellon@gmail.com>",
    "keywords": "price extraction, salary parsing, text processing, nlp, compensation, job postings, data extraction",
    "author": null,
    "author_email": "Michelle Pellon <mgracepellon@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/a3/df/de79fb7001169aeddd859290db8483ad5d2ce918e04f96713045386a008d/pricetag-1.0.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A pure-Python library for extracting price and currency information from unstructured text",
    "version": "1.0.0",
    "project_urls": {
        "Documentation": "https://github.com/michellepellon/pricetag#readme",
        "Homepage": "https://github.com/michellepellon/pricetag",
        "Repository": "https://github.com/michellepellon/pricetag"
    },
    "split_keywords": [
        "price extraction",
        " salary parsing",
        " text processing",
        " nlp",
        " compensation",
        " job postings",
        " data extraction"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "e00eef55efbdd8e3c87e6cdf4f5c25138da9ccff46991257aec988bb8f65c12b",
                "md5": "da8346498daa0b029c2e69480349e543",
                "sha256": "ce18a9cfdf74492ef04273f043770516cd6c93acb1c4dc4b23854c490693273f"
            },
            "downloads": -1,
            "filename": "pricetag-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "da8346498daa0b029c2e69480349e543",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.13",
            "size": 19095,
            "upload_time": "2025-08-21T23:50:20",
            "upload_time_iso_8601": "2025-08-21T23:50:20.948132Z",
            "url": "https://files.pythonhosted.org/packages/e0/0e/ef55efbdd8e3c87e6cdf4f5c25138da9ccff46991257aec988bb8f65c12b/pricetag-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "a3dfde79fb7001169aeddd859290db8483ad5d2ce918e04f96713045386a008d",
                "md5": "d7fbcb8bd51e1823ec7b6e5a0a0f3e81",
                "sha256": "62119312c63a2678d9b1c65c819fca7b620111b7a98757d120b83f7d82d51346"
            },
            "downloads": -1,
            "filename": "pricetag-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "d7fbcb8bd51e1823ec7b6e5a0a0f3e81",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.13",
            "size": 27744,
            "upload_time": "2025-08-21T23:50:22",
            "upload_time_iso_8601": "2025-08-21T23:50:22.488777Z",
            "url": "https://files.pythonhosted.org/packages/a3/df/de79fb7001169aeddd859290db8483ad5d2ce918e04f96713045386a008d/pricetag-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-21 23:50:22",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "michellepellon",
    "github_project": "pricetag#readme",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "pricetag"
}
        