# oxidize-postal

| Field | Value |
|-------|-------|
| **Name** | oxidize-postal |
| **Version** | 0.1.2 |
| **Summary** | High-performance postal address parser and normalizer using libpostal with Rust bindings |
| **Upload time** | 2025-09-06 22:14:52 |
| **Requires Python** | >=3.9 |
| **Keywords** | address, postal, parsing, normalization, libpostal, rust, performance |

Python bindings for libpostal address parsing with improved performance and installation experience.

oxidize-postal provides the same address parsing capabilities as [pypostal](https://github.com/openvenues/pypostal) but addresses key limitations: it installs without C compilation, releases the Python GIL for true parallel processing, and offers a cleaner API. It is built in Rust on top of the [libpostal-rust](https://crates.io/crates/libpostal-rust) bindings to the [libpostal](https://github.com/openvenues/libpostal) C library.

## Key Improvements Over pypostal

| Feature | oxidize-postal | pypostal |
|---------|----------------|----------|
| **Installation** | `pip install` with pre-built wheels | Requires C compilation, system dependencies |
| **Parallel Processing** | GIL released, true multithreading | GIL blocks concurrent parsing |
| **API Design** | Single module, consistent naming | Multiple imports, scattered functions |
| **Error Handling** | Structured errors with context | Basic exception messages |
| **Platform Support** | Cross-platform wheels | Complex Windows build process |

## Core Functionality

- **Address Parsing**: Extract components (street, city, state, postal code, etc.) from address strings
- **Address Expansion**: Generate normalized variations with abbreviations expanded (St. → Street)
- **Address Normalization**: Standardize address formatting and component ordering
- **International Support**: Handles addresses worldwide with Unicode and multiple scripts

## Installation

```bash
pip install oxidize-postal

# Download language model data (one-time setup)
python -c "import oxidize_postal; oxidize_postal.download_data()"
```

## Usage

### Basic Address Parsing

```python
import oxidize_postal

# Parse an address into components
address = "781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA"
parsed = oxidize_postal.parse_address(address)
print(parsed)
# Output: {'house_number': '781', 'road': 'franklin ave', 'suburb': 'crown heights', 
#          'city': 'brooklyn', 'state': 'ny', 'postcode': '11216', 'country': 'usa'}

# Get parsed address as JSON string
json_result = oxidize_postal.parse_address_to_json(address)
```
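
Because `parse_address` returns a plain dictionary, downstream code can pick out components with ordinary dict operations. The sketch below is illustrative only: `format_components` is a hypothetical helper, not part of oxidize-postal, and the key names are taken from the sample output above. Components missing from a result are simply skipped.

```python
def format_components(parsed, keys=("house_number", "road", "city", "state", "postcode")):
    """Join the selected components that are present into one display string."""
    return " ".join(parsed[key] for key in keys if parsed.get(key))


# Sample dictionary copied from the example output above.
sample = {
    "house_number": "781", "road": "franklin ave", "suburb": "crown heights",
    "city": "brooklyn", "state": "ny", "postcode": "11216", "country": "usa",
}
print(format_components(sample))
```

Using `.get` rather than direct indexing matters here because not every address yields every component.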

### Address Expansion

```python
import oxidize_postal

# Expand address abbreviations
address = "123 Main St NYC NY"
expansions = oxidize_postal.expand_address(address)
print(expansions)
# Output: ['123 main street nyc new york', '123 main street nyc ny', ...]

# Get expansions as JSON
json_expansions = oxidize_postal.expand_address_to_json(address)
```
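
One common use of expansions is fuzzy duplicate detection: two address strings probably refer to the same place if their expansion sets overlap. The following is a minimal sketch, assuming `expand_fn` behaves like `expand_address` (a string in, a list of strings out). `share_expansion` and `stub_expand` are hypothetical names; the stub only stands in for the real function so the example runs without the libpostal data.

```python
from typing import Callable, List


def share_expansion(expand_fn: Callable[[str], List[str]], a: str, b: str) -> bool:
    """Return True if the two addresses have at least one expansion in common."""
    return not set(expand_fn(a)).isdisjoint(expand_fn(b))


def stub_expand(address: str) -> List[str]:
    """Stand-in for expand_address: lowercases and expands 'st' to 'street'."""
    words = address.lower().split()
    return [" ".join("street" if word in ("st", "st.") else word for word in words)]


print(share_expansion(stub_expand, "123 Main St", "123 main street"))
print(share_expansion(stub_expand, "123 Main St", "456 Oak Ave"))
```

With the library installed, `oxidize_postal.expand_address` would be passed as `expand_fn` in place of the stub.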

## Parallel Processing & Performance

A key advantage of oxidize-postal over pypostal is that parsing calls release the GIL, so multiple threads can parse at the same time. How much this helps depends on the workload.

### When Parallel Processing Helps

Parallel processing provides the most benefit when combined with slower I/O operations:

**Great for parallel processing:**
```python
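import oxidize_postal
from concurrent.futures import ThreadPoolExecutor
import requests

# `db` and `records` below are placeholders for your own database client and input data.

def process_customer_record(record):
    # Fetch from API (50-200ms); the timeout keeps a stalled request from blocking its worker thread
    url = f"https://api.example.com/customers/{record['id']}"
    customer = requests.get(url, timeout=10).json()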
    
    # Parse address (0.3ms) - GIL released so other threads can work
    parsed = oxidize_postal.parse_address(customer['address'])
    
    # Write to database (50-200ms)
    db.update(customer['id'], parsed)
    
    return parsed

# Process many records in parallel
with ThreadPoolExecutor(max_workers=20) as executor:
    results = list(executor.map(process_customer_record, records))
```

**Limited benefit for pure address parsing:**
```python
import oxidize_postal
from concurrent.futures import ThreadPoolExecutor

# Just parsing addresses without I/O
addresses = ["123 Main St", "456 Oak Ave"] * 100

# Parallel might even be slower due to thread overhead
with ThreadPoolExecutor() as executor:
    results = list(executor.map(oxidize_postal.parse_address, addresses))
```

### Real-World Use Cases

The GIL release pays off most in workloads that combine parsing with other work, typically I/O:

1. **ETL Pipelines**: Reading from databases/APIs, parsing, and writing back
2. **Stream Processing**: Handling Kafka/Kinesis streams with address data
3. **Web Services**: API endpoints that parse addresses alongside other operations
4. **File Processing**: Reading large CSV/Parquet files, parsing addresses, writing results
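
The file-processing case can be sketched as follows. This is an illustration under stated assumptions, not code from the project: `parse_csv_addresses` is a hypothetical helper, the input is assumed to have an `address` column, and `parse_fn` is assumed to behave like `parse_address`. Parsed components are written to new columns with a `parsed_` prefix to avoid clashing with existing column names. A stub parser is used so the example runs without the libpostal data.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def parse_csv_addresses(src, dst, parse_fn, column="address", max_workers=8):
    """Read rows from src, parse one column in a thread pool, write enriched rows to dst."""
    rows = list(csv.DictReader(src))
    if not rows:
        return 0
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves input order, so results line up with rows.
        parsed = list(executor.map(parse_fn, [row[column] for row in rows]))
    # Collect every component seen so the output header covers all rows.
    component_cols = sorted({f"parsed_{key}" for result in parsed for key in result})
    writer = csv.DictWriter(dst, fieldnames=list(rows[0]) + component_cols, restval="")
    writer.writeheader()
    for row, result in zip(rows, parsed):
        writer.writerow({**row, **{f"parsed_{k}": v for k, v in result.items()}})
    return len(rows)


def stub_parse(address):
    """Stand-in for parse_address: first token is the house number, the rest the road."""
    number, _, road = address.partition(" ")
    return {"house_number": number, "road": road.lower()}


source = io.StringIO("id,address\n1,123 Main St\n2,456 Oak Ave\n")
output = io.StringIO()
count = parse_csv_addresses(source, output, stub_parse)
print(output.getvalue())
```

For files too large to hold in memory, the same idea would need to be applied in chunks rather than reading all rows at once.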

### Threading vs Multiprocessing

Because oxidize-postal releases the GIL, **threading is usually preferable** to multiprocessing:

```python
import oxidize_postal
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool

addresses = ["123 Main St", "456 Oak Ave"] * 100

# The __main__ guard is needed for multiprocessing on platforms that spawn workers (macOS, Windows)
if __name__ == "__main__":
    # Threading - Lower overhead, shared memory
    with ThreadPoolExecutor(max_workers=8) as executor:
        results = list(executor.map(oxidize_postal.parse_address, addresses))

    # Multiprocessing - Higher overhead due to serialization
    # Only use if you need true CPU parallelism for other operations
    with Pool(processes=8) as pool:
        results = pool.map(oxidize_postal.parse_address, addresses)

For pure address parsing of small batches (fewer than roughly 5,000 to 20,000 addresses, depending on your machine), threading outperforms multiprocessing by 3-5x because of its lower overhead.
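
Since the break-even point depends on the machine, it is worth measuring on your own data. The harness below is a rough sketch, not a rigorous benchmark: it compares a sequential loop against a thread pool (it does not cover multiprocessing), and `time_sequential_vs_threads` and `stub_parse` are hypothetical names. The stub is far cheaper than real parsing, so its timings say nothing about oxidize-postal itself.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def time_sequential_vs_threads(parse_fn, addresses, max_workers=8):
    """Time parse_fn over addresses sequentially and in a thread pool, in seconds."""
    start = time.perf_counter()
    sequential = [parse_fn(address) for address in addresses]
    sequential_s = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        threaded = list(executor.map(parse_fn, addresses))
    threaded_s = time.perf_counter() - start

    # Both strategies should produce identical results in the same order.
    if sequential != threaded:
        raise RuntimeError("sequential and threaded results differ")
    return {"sequential_s": sequential_s, "threaded_s": threaded_s}


def stub_parse(address):
    """Stand-in for parse_address so the harness runs without libpostal data."""
    return {"raw": address.lower()}


timings = time_sequential_vs_threads(stub_parse, ["123 Main St", "456 Oak Ave"] * 100)
print(timings)
```

To measure the real parser, pass `oxidize_postal.parse_address` as `parse_fn` and use a batch that resembles your production input.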

## API Reference

### Core Functions

#### `parse_address(address: str) -> dict`
Parse an address string into its component parts.

**Parameters:**
- `address`: The address string to parse

**Returns:**
- Dictionary with keys like 'house_number', 'road', 'city', 'state', 'postcode', etc.

#### `expand_address(address: str) -> list[str]`
Generate normalized variations of an address.

**Parameters:**
- `address`: The address string to expand

**Returns:**
- List of expanded address strings

#### `download_data(force: bool = False) -> bool`
Download the libpostal data files.

**Parameters:**
- `force`: If True, re-download even if data exists

**Returns:**
- True if successful, False otherwise
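
Because `download_data` reports failure through its return value rather than an exception, callers may want to check it explicitly. A minimal sketch, assuming only the documented signature; `ensure_data` is a hypothetical wrapper, and the lambdas below are stand-ins for the real downloader.

```python
def ensure_data(download_fn, force=False):
    """Call the downloader and raise if it reports failure."""
    if not download_fn(force):
        raise RuntimeError("libpostal data download reported failure")


# Stand-in downloaders; with the library installed, pass oxidize_postal.download_data.
ensure_data(lambda force: True)  # succeeds silently
try:
    ensure_data(lambda force: False)
except RuntimeError as exc:
    print(exc)
```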

### Additional Functions

- `parse_address_to_json(address: str) -> str`: Parse and return as JSON
- `expand_address_to_json(address: str) -> str`: Expand and return as JSON
- `normalize_address(address: str) -> str`: Normalize an address string
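
The JSON variants are convenient when results are stored or forwarded rather than inspected in Python. The sketch below writes one result per line (JSON Lines). It assumes only that `to_json_fn` returns a valid JSON string, as `parse_address_to_json` is documented to do; the exact shape of that JSON is not specified here, so the code does not depend on it. `write_jsonl` and `stub_to_json` are hypothetical names.

```python
import io
import json


def write_jsonl(addresses, to_json_fn, out):
    """Write one compact JSON document per address to out; return the count."""
    count = 0
    for address in addresses:
        # Round-trip through json to validate the string and force a single-line encoding.
        record = json.loads(to_json_fn(address))
        out.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


def stub_to_json(address):
    """Stand-in for parse_address_to_json; emits multi-line JSON on purpose."""
    return json.dumps({"road": address.lower()}, indent=2)


buffer = io.StringIO()
written = write_jsonl(["123 Main St", "456 Oak Ave"], stub_to_json, buffer)
print(buffer.getvalue(), end="")
```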

### Constants

The module provides various constants for address components:

```python
import oxidize_postal

# Address component constants
oxidize_postal.ADDRESS_ANY
oxidize_postal.ADDRESS_NAME
oxidize_postal.ADDRESS_HOUSE_NUMBER
oxidize_postal.ADDRESS_STREET
oxidize_postal.ADDRESS_UNIT
oxidize_postal.ADDRESS_LEVEL
oxidize_postal.ADDRESS_POSTAL_CODE
# ... and more
```

## Requirements

- Python 3.9+
- libpostal data files (~2GB, downloaded separately)
- Rust toolchain (for building from source)

## Project Structure

```
oxidize-postal/
├── oxidize-postal/         # Rust extension module
│   ├── src/
│   │   ├── lib.rs          # PyO3 module definition
│   │   └── postal/
│   │       ├── parser.rs   # Core parsing functions
│   │       ├── python_api.rs   # Python-exposed functions
│   │       ├── error.rs    # Error types
│   │       └── constants.rs    # libpostal constants
│   ├── Cargo.toml          # Rust dependencies
│   └── pyproject.toml      # Python package config
├── tests/
│   ├── fixtures/           # Sample addresses
│   ├── unit/               # Unit tests
│   ├── integration/        # End-to-end tests
│   └── performance/        # Benchmarking tests
├── main.py                 # Usage examples
├── data_manager.py         # libpostal data downloader
├── build.sh                # Build script
└── pyproject.toml          # Root package config
```

### Architecture

- **Stack**: Python → PyO3 → Rust → libpostal-rust → libpostal C library
- **GIL Release**: All parsing operations release the Python GIL for true parallel processing
- **Error Handling**: Rust errors are converted to Python exceptions (ValueError, RuntimeError)
- **Data Requirements**: libpostal needs ~2GB of language model data (stored in `/usr/local/share/libpostal`)
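
Given that Rust errors surface as `ValueError` or `RuntimeError`, a batch job can catch both and continue with the next record. This is a sketch only: `safe_parse` is a hypothetical wrapper, and which inputs trigger which exception is not documented here, so the stub's behavior (raising `ValueError` on empty input) is an assumption made purely for demonstration.

```python
from typing import Callable, Optional


def safe_parse(parse_fn: Callable[[str], dict], address: str) -> Optional[dict]:
    """Return the parsed address, or None if the parser raises a documented error type."""
    try:
        return parse_fn(address)
    except (ValueError, RuntimeError) as exc:
        print(f"could not parse {address!r}: {exc}")
        return None


def stub_parse(address: str) -> dict:
    """Stand-in for parse_address; rejecting empty input is an assumed behavior."""
    if not address.strip():
        raise ValueError("empty address")
    return {"road": address.lower()}


print(safe_parse(stub_parse, "123 Main St"))
print(safe_parse(stub_parse, ""))
```

With the library installed, `oxidize_postal.parse_address` would be passed as `parse_fn`.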

### Build Process

1. `maturin` compiles the Rust extension with PyO3 bindings
2. The extension links against the libpostal-rust crate
3. The build produces a Python wheel containing the native extension
4. The resulting wheel has no Python runtime dependencies

## License

MIT License

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Acknowledgments

- [libpostal](https://github.com/openvenues/libpostal) - The core C library for address parsing
- [libpostal-rust](https://crates.io/crates/libpostal-rust) - Rust bindings for libpostal
- [pypostal](https://github.com/openvenues/pypostal) - The original Python bindings that inspired this project

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "oxidize-postal",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "address, postal, parsing, normalization, libpostal, rust, performance",
    "author": null,
    "author_email": "Eric Aleman <eric@example.com>",
    "download_url": null,
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "High-performance postal address parser and normalizer using libpostal with Rust bindings",
    "version": "0.1.2",
    "project_urls": {
        "Repository": "https://github.com/ericaleman/oxidize-postal"
    },
    "split_keywords": [
        "address",
        " postal",
        " parsing",
        " normalization",
        " libpostal",
        " rust",
        " performance"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "f45e070269819484c2611bbc22447ac741b98fbfc38e2dbc402eaa4a99a9aa5c",
                "md5": "f45813c91c3c24a720f62a8364eb4ae8",
                "sha256": "475ec0e98c0770b17c4606adedaa00de8f06b42ef9b9b1c53f8cf90cb68964a9"
            },
            "downloads": -1,
            "filename": "oxidize_postal-0.1.2-cp312-cp312-macosx_11_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "f45813c91c3c24a720f62a8364eb4ae8",
            "packagetype": "bdist_wheel",
            "python_version": "cp312",
            "requires_python": ">=3.9",
            "size": 1248661,
            "upload_time": "2025-09-06T22:14:52",
            "upload_time_iso_8601": "2025-09-06T22:14:52.926402Z",
            "url": "https://files.pythonhosted.org/packages/f4/5e/070269819484c2611bbc22447ac741b98fbfc38e2dbc402eaa4a99a9aa5c/oxidize_postal-0.1.2-cp312-cp312-macosx_11_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-09-06 22:14:52",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ericaleman",
    "github_project": "oxidize-postal",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "oxidize-postal"
}
        