# Pricetag
A pure-Python library for extracting price and currency information from unstructured text, with a primary focus on salary data in job postings.
## Features
- **Zero Dependencies**: Uses only Python standard library
- **High Performance**: Processes 1000+ documents per second
- **Comprehensive Pattern Support**:
  - Individual amounts: `$50,000`, `$50k`, `50k USD`
  - Ranges: `$50k-$75k`, `$50,000 to $75,000`
  - Hourly/Annual/Monthly rates: `$25/hour`, `$50k/year`, `$4k/month`
  - Contextual terms: `six figures`, `competitive`, `DOE`
- **Smart Normalization**: Converts all amounts to annual USD for easy comparison
- **Confidence Scoring**: Each extraction includes a confidence score
- **Validation & Sanity Checks**: Flags unusual or problematic values
## Installation
```bash
pip install pricetag
```
## Quick Start
```python
from pricetag import PriceExtractor
# Initialize the extractor
extractor = PriceExtractor()
# Extract prices from text
text = "Senior Engineer position paying $120,000 - $150,000 annually"
results = extractor.extract(text)
# Access the results
for result in results:
    print(f"Value: {result['value']}")
    print(f"Type: {result['type']}")
    print(f"Confidence: {result['confidence']}")
    print(f"Annual value: {result['normalized_annual']}")
```
## Configuration Options
```python
extractor = PriceExtractor(
    min_confidence=0.5,          # Minimum confidence score to include
    include_contextual=True,     # Extract terms like "six figures"
    normalize_to_annual=True,    # Convert to annual amounts
    min_salary=10000,            # Minimum reasonable salary
    max_salary=10000000,         # Maximum reasonable salary
    assume_hours_per_year=2080,  # For hourly→annual conversion
    fast_mode=False,             # Enable for better performance
    max_results_per_text=None    # Limit number of results
)
```
## Examples
### Basic Salary Extraction
```python
text = "The position offers $75,000 per year with benefits"
results = extractor.extract(text)
# Returns: [{'value': 75000.0, 'type': 'annual', ...}]
```
### Hourly Rate with Normalization
```python
text = "Paying $45/hour for senior developers"
results = extractor.extract(text)
# Returns: [{'value': 45.0, 'type': 'hourly', 'normalized_annual': 93600.0, ...}]
```
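The `normalized_annual` figure above is consistent with multiplying the hourly rate by the `assume_hours_per_year` value shown in Configuration Options (2080, i.e. 40 hours × 52 weeks). A minimal sketch of that arithmetic follows; `hourly_to_annual` is a hypothetical helper written for illustration, not a function exported by pricetag.

```python
# Hypothetical helper for illustration only; not part of pricetag's API.
HOURS_PER_YEAR = 2080  # 40 hours/week * 52 weeks, the value shown in Configuration Options


def hourly_to_annual(rate: float, hours_per_year: int = HOURS_PER_YEAR) -> float:
    """Scale an hourly rate to an annual amount."""
    return rate * hours_per_year


print(hourly_to_annual(45.0))  # 93600.0
```

Passing a different `hours_per_year` (for example 2000) changes the annual figure proportionally, which is presumably why the library exposes it as a setting.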
### Salary Range
```python
text = "Salary range: $60k-$80k depending on experience"
results = extractor.extract(text)
# Returns: [{'value': (60000.0, 80000.0), 'is_range': True, ...}]
```
### Contextual Terms
```python
text = "Looking for someone with 5+ years experience, six figures"
results = extractor.extract(text)
# Returns: [{'value': (100000, 999999), 'type': 'unknown', 'confidence': 0.7, ...}]
```
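The `(100000, 999999)` range above is simply the span of all six-digit integers. The sketch below shows that mapping for any digit count; `figures_to_range` is a hypothetical name used for illustration, and how pricetag computes the range internally is not documented here.

```python
# Hypothetical helper for illustration only; not part of pricetag's API.
def figures_to_range(digits: int) -> tuple[int, int]:
    """Return the smallest and largest integers with the given number of digits."""
    if digits < 1:
        raise ValueError("digits must be at least 1")
    return 10 ** (digits - 1), 10 ** digits - 1


print(figures_to_range(6))  # (100000, 999999)
```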
### Multiple Prices
```python
text = "Base: $100k, Bonus: up to $30k, Equity: 0.5%"
results = extractor.extract(text)
# Returns multiple results with appropriate flags
```
### Batch Processing
```python
texts = [
    "Salary: $50,000",
    "Rate: $30/hour",
    "Competitive pay"
]
results_batch = extractor.extract_batch(texts)
```
## Output Format
Each extraction returns a `PriceResult` dictionary:
```python
{
    'value': float | tuple[float, float],  # Single value or (min, max)
    'raw_text': str,                       # Original matched text
    'position': tuple[int, int],           # Character positions
    'type': str,                           # 'hourly', 'annual', 'monthly', etc.
    'confidence': float,                   # 0.0 to 1.0
    'normalized_annual': float | tuple,    # Annual USD amount
    'currency': str,                       # Always 'USD' in v1
    'is_range': bool,                      # True for ranges
    'flags': list[str]                     # Validation flags
}
```
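Because `value` and `normalized_annual` may each be either a single number or a `(min, max)` tuple, downstream code usually needs to handle both shapes. One possible approach, sketched below, collapses a range to its midpoint. The `midpoint` helper and the sample dictionary are hypothetical and hand-written to match the documented fields; they do not come from the library.

```python
from typing import Union

Amount = Union[float, tuple[float, float]]


# Hypothetical helper for illustration only; not part of pricetag's API.
def midpoint(amount: Amount) -> float:
    """Collapse a (min, max) range to its midpoint; pass single values through."""
    if isinstance(amount, tuple):
        low, high = amount
        return (low + high) / 2
    return float(amount)


# Hand-written sample shaped like the PriceResult fields documented above.
sample = {"value": (60000.0, 80000.0), "is_range": True,
          "normalized_annual": (60000.0, 80000.0)}

print(midpoint(sample["normalized_annual"]))  # 70000.0
```

Whether a midpoint, the minimum, or the maximum is the right summary depends on the analysis; the midpoint is only one reasonable choice.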
### Validation Flags
- `invalid_range`: Max less than min
- `below_minimum`: Below configured threshold
- `above_maximum`: Above configured threshold
- `unreasonable_hourly_rate`: Outside $7-$500/hour
- `potential_inconsistency`: Large discrepancy with other prices
- `ambiguous_type`: Unclear if hourly/annual
- `approximate`: Amount was qualified in the text by a term such as "approximately"
- `requires_market_data`: Needs external data (e.g., "competitive")
- `experience_dependent`: Depends on experience (e.g., "DOE")
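
The flags above can be used to filter results before further analysis. The sketch below drops any result carrying a flag that the caller treats as disqualifying. The `BLOCKING_FLAGS` set, the `is_usable` helper, and the sample results are assumptions made for illustration; which flags should block a result is a judgment call for each application.

```python
# Flags this example chooses to treat as disqualifying (an assumption, not library policy).
BLOCKING_FLAGS = {
    "invalid_range",
    "below_minimum",
    "above_maximum",
    "unreasonable_hourly_rate",
}


def is_usable(result: dict) -> bool:
    """Return True when the result carries none of the blocking flags."""
    return not BLOCKING_FLAGS.intersection(result.get("flags", []))


# Hand-written sample results shaped like the documented output format.
sample_results = [
    {"value": 75000.0, "flags": []},
    {"value": 3.0, "flags": ["unreasonable_hourly_rate"]},
    {"value": 52000.0, "flags": ["approximate"]},
]

usable = [r for r in sample_results if is_usable(r)]
print(len(usable))  # 2
```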
## Performance
The library is optimized for high-volume processing:
- Pre-compiled regex patterns
- Number parsing cache
- Quick pre-filtering
- Fast mode for bulk processing
- Batch processing support
```python
# Fast mode for high-volume processing
extractor = PriceExtractor(fast_mode=True, max_results_per_text=5)
# Process 1000 documents
texts = ["..." for _ in range(1000)]
results = extractor.extract_batch(texts) # < 1 second
```
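Throughput depends on hardware and on the length of the input texts, so it is worth measuring on your own data. The harness below is a minimal sketch that times any batch function with `time.perf_counter`. The `docs_per_second` helper is hypothetical, and `fake_extract_batch` is a stand-in so the example runs without pricetag installed; in practice you would pass `extractor.extract_batch` instead.

```python
import time
from typing import Callable, Sequence


# Hypothetical helper for illustration only; not part of pricetag's API.
def docs_per_second(extract_batch: Callable[[Sequence[str]], list],
                    texts: Sequence[str]) -> float:
    """Time one call to extract_batch and return documents processed per second."""
    start = time.perf_counter()
    extract_batch(texts)
    elapsed = time.perf_counter() - start
    return len(texts) / elapsed if elapsed > 0 else float("inf")


def fake_extract_batch(texts: Sequence[str]) -> list:
    """Stand-in for extractor.extract_batch: returns an empty result list per text."""
    return [[] for _ in texts]


rate = docs_per_second(fake_extract_batch, ["Salary: $50,000"] * 1000)
print(f"{rate:.0f} documents/second")
```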
## Testing
Run the test suite:
```bash
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/
# Run with coverage
pytest tests/ --cov=pricetag
```
## Limitations
- Currently supports USD only
- Optimized for US salary formats
- Context window limited to surrounding text
- Does not handle equity/stock compensation
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
MIT License - see LICENSE file for details
## Author
Michelle Pellon - mgracepellon@gmail.com
## Acknowledgments
Built with pure Python for maximum compatibility and zero dependencies.