# oxidize-postal
Python bindings for libpostal address parsing with improved performance and installation experience.

oxidize-postal provides the same address parsing capabilities as [pypostal](https://github.com/openvenues/pypostal) but addresses key limitations: it installs without C compilation, releases the Python GIL so that threads can parse in parallel, and offers a cleaner API. It is built in Rust on the [libpostal-rust](https://crates.io/crates/libpostal-rust) bindings to the [libpostal](https://github.com/openvenues/libpostal) C library.
## Key Improvements Over pypostal
| Feature | oxidize-postal | pypostal |
|---------|----------------|----------|
| **Installation** | `pip install` with pre-built wheels | Requires C compilation, system dependencies |
| **Parallel Processing** | GIL released, true multithreading | GIL blocks concurrent parsing |
| **API Design** | Single module, consistent naming | Multiple imports, scattered functions |
| **Error Handling** | Structured errors with context | Basic exception messages |
| **Platform Support** | Cross-platform wheels | Complex Windows build process |
## Core Functionality
- **Address Parsing**: Extract components (street, city, state, postal code, etc.) from address strings
- **Address Expansion**: Generate normalized variations with abbreviations expanded (St. → Street)
- **Address Normalization**: Standardize address formatting and component ordering
- **International Support**: Handles addresses worldwide with Unicode and multiple scripts
## Installation
```bash
pip install oxidize-postal
# Download language model data (one-time setup)
python -c "import oxidize_postal; oxidize_postal.download_data()"
```
## Usage
### Basic Address Parsing
```python
import oxidize_postal
# Parse an address into components
address = "781 Franklin Ave Crown Heights Brooklyn NYC NY 11216 USA"
parsed = oxidize_postal.parse_address(address)
print(parsed)
# Output: {'house_number': '781', 'road': 'franklin ave', 'suburb': 'crown heights',
# 'city': 'brooklyn', 'state': 'ny', 'postcode': '11216', 'country': 'usa'}
# Get parsed address as JSON string
json_result = oxidize_postal.parse_address_to_json(address)
```
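Because `parse_address` returns a plain dictionary and not every address yields every component, downstream code usually has to tolerate missing keys. The following is a minimal sketch of that pattern; `format_label` and its component order are hypothetical helpers for illustration, not part of oxidize-postal, and the sample dictionary is copied from the example output above.

```python
def format_label(parsed: dict) -> str:
    """Join whichever address components are present, in a fixed order."""
    # Hypothetical ordering chosen for this sketch; adjust to your locale.
    order = ["house_number", "road", "suburb", "city", "state", "postcode", "country"]
    parts = [parsed[key] for key in order if parsed.get(key)]
    return " ".join(parts)


# Components copied from the example output above.
parsed = {
    "house_number": "781", "road": "franklin ave", "suburb": "crown heights",
    "city": "brooklyn", "state": "ny", "postcode": "11216", "country": "usa",
}
print(format_label(parsed))
# A partial result is handled without a KeyError.
print(format_label({"road": "main st", "city": "springfield"}))
```

In real use, the dictionary would come from `oxidize_postal.parse_address(address)`.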
### Address Expansion
```python
# Expand address abbreviations
address = "123 Main St NYC NY"
expansions = oxidize_postal.expand_address(address)
print(expansions)
# Output: ['123 main street nyc new york', '123 main street nyc ny', ...]
# Get expansions as JSON
json_expansions = oxidize_postal.expand_address_to_json(address)
```
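A common use of expansions is duplicate detection: two differently written addresses are treated as candidates for the same place if their expansion lists share at least one string. The sketch below shows only the comparison step. `share_expansion` is a hypothetical helper, and the lists are illustrative values rather than real library output.

```python
def share_expansion(expansions_a: list[str], expansions_b: list[str]) -> bool:
    """Return True if the two expansion lists have at least one string in common."""
    return not set(expansions_a).isdisjoint(expansions_b)


# Illustrative expansion lists (assumed values, not produced by the library here).
a = ["123 main street nyc new york", "123 main street nyc ny"]
b = ["123 main street nyc ny", "123 main saint nyc ny"]
c = ["456 oak avenue"]

print(share_expansion(a, b))  # the lists overlap on "123 main street nyc ny"
print(share_expansion(a, c))  # no string in common
```

With the library installed, the inputs would be `oxidize_postal.expand_address(x)` and `oxidize_postal.expand_address(y)`.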
## Parallel Processing & Performance
One of the key advantages of oxidize-postal over pypostal is that it releases the GIL while parsing, so multiple threads can parse at the same time. How much this helps depends on the workload.
### When Parallel Processing Helps
Parallel processing provides the most benefit when address parsing is combined with slower I/O operations:

**Great for parallel processing:**
```python
import oxidize_postal
from concurrent.futures import ThreadPoolExecutor
import requests

def process_customer_record(record):
    # Fetch from API (50-200ms)
    customer = requests.get(f"https://api.example.com/customers/{record['id']}").json()

    # Parse address (0.3ms) - GIL released so other threads can work
    parsed = oxidize_postal.parse_address(customer['address'])

    # Write to database (50-200ms); `db` is a placeholder for your database client
    db.update(customer['id'], parsed)

    return parsed

# Process many records in parallel; `records` is a placeholder iterable of dicts with an 'id' key
with ThreadPoolExecutor(max_workers=20) as executor:
    results = list(executor.map(process_customer_record, records))
```
**Limited benefit for pure address parsing:**
```python
import oxidize_postal
from concurrent.futures import ThreadPoolExecutor

# Just parsing addresses without I/O
addresses = ["123 Main St", "456 Oak Ave"] * 100

# Parallel might even be slower due to thread overhead
with ThreadPoolExecutor() as executor:
    results = list(executor.map(oxidize_postal.parse_address, addresses))
```
### Real-World Use Cases
Workloads that benefit from oxidize-postal's GIL release:
1. **ETL Pipelines**: Reading from databases/APIs, parsing, and writing back
2. **Stream Processing**: Handling Kafka/Kinesis streams with address data
3. **Web Services**: API endpoints that parse addresses alongside other operations
4. **File Processing**: Reading large CSV/Parquet files, parsing addresses, writing results
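As one illustration of the file-processing case, the sketch below reads rows from a CSV source, parses the address column in a thread pool, and writes selected components back out. It is a minimal sketch under stated assumptions: `parse_csv_addresses` is a hypothetical helper, the output columns are arbitrary, and `fake_parse` is a stand-in with made-up behavior so the example runs without the libpostal data files.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor


def parse_csv_addresses(source, sink, parse_fn, column="address", max_workers=8):
    """Parse one CSV column in worker threads and write selected components to `sink`."""
    rows = list(csv.DictReader(source))
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in input order, so rows and results stay aligned.
        parsed = list(executor.map(parse_fn, (row[column] for row in rows)))

    writer = csv.DictWriter(sink, fieldnames=["address", "city", "postcode"])
    writer.writeheader()
    for row, components in zip(rows, parsed):
        writer.writerow({
            "address": row[column],
            "city": components.get("city", ""),
            "postcode": components.get("postcode", ""),
        })
    return len(rows)


def fake_parse(address):
    """Stand-in parser (assumption): treats the last two tokens as city and postcode."""
    *_, city, postcode = address.split()
    return {"city": city.lower(), "postcode": postcode}


source = io.StringIO(
    "id,address\n"
    "1,781 Franklin Ave Brooklyn 11216\n"
    "2,123 Main St Springfield 62701\n"
)
sink = io.StringIO()
print(parse_csv_addresses(source, sink, fake_parse))
print(sink.getvalue())
```

To use the real parser, pass `oxidize_postal.parse_address` as `parse_fn` and open files in place of the `io.StringIO` objects.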
### Threading vs Multiprocessing
Because oxidize-postal releases the GIL, **threading is usually preferable** to multiprocessing:
```python
import oxidize_postal
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool

addresses = ["123 Main St", "456 Oak Ave"] * 100

if __name__ == "__main__":  # guard needed where multiprocessing spawns new interpreters
    # Threading - Lower overhead, shared memory
    with ThreadPoolExecutor(max_workers=8) as executor:
        results = list(executor.map(oxidize_postal.parse_address, addresses))

    # Multiprocessing - Higher overhead due to serialization
    # Only use if you need true CPU parallelism for other operations
    with Pool(processes=8) as pool:
        results = pool.map(oxidize_postal.parse_address, addresses)
```
For pure address parsing of small batches (under roughly 5-20k addresses, depending on your machine), threading outperforms multiprocessing by 3-5x because it avoids process startup and serialization overhead.
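Since these figures depend on the machine and the batch size, it is worth measuring on your own data. The following is a rough timing sketch, not a rigorous benchmark: `compare_sequential_and_threaded` is a hypothetical helper, and `fake_parse` is a trivial stand-in so the code runs without the library installed.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def compare_sequential_and_threaded(parse_fn, addresses, max_workers=8):
    """Time one sequential pass and one thread-pool pass over the same addresses."""
    start = time.perf_counter()
    sequential = [parse_fn(address) for address in addresses]
    sequential_s = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        threaded = list(executor.map(parse_fn, addresses))
    threaded_s = time.perf_counter() - start

    return {
        "sequential_s": sequential_s,
        "threaded_s": threaded_s,
        "results_match": sequential == threaded,  # sanity check on ordering
    }


def fake_parse(address):
    """Stand-in parser (assumption) so the sketch runs without libpostal data."""
    return {"road": address.lower()}


timings = compare_sequential_and_threaded(fake_parse, ["123 Main St", "456 Oak Ave"] * 100)
print(timings)
```

Passing `oxidize_postal.parse_address` as `parse_fn` and a representative sample of your own addresses gives numbers that reflect your hardware. Run it several times, since a single pass is noisy.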
## API Reference
### Core Functions
#### `parse_address(address: str) -> dict`
Parse an address string into its component parts.

**Parameters:**
- `address`: The address string to parse

**Returns:**
- Dictionary with keys like 'house_number', 'road', 'city', 'state', 'postcode', etc.

#### `expand_address(address: str) -> list[str]`
Generate normalized variations of an address.

**Parameters:**
- `address`: The address string to expand

**Returns:**
- List of expanded address strings

#### `download_data(force: bool = False) -> bool`
Download the libpostal data files.

**Parameters:**
- `force`: If True, re-download even if data exists

**Returns:**
- True if successful, False otherwise
### Additional Functions
- `parse_address_to_json(address: str) -> str`: Parse and return as JSON
- `expand_address_to_json(address: str) -> str`: Expand and return as JSON
- `normalize_address(address: str) -> str`: Normalize an address string
### Constants
The module provides various constants for address components:
```python
import oxidize_postal
# Address component constants
oxidize_postal.ADDRESS_ANY
oxidize_postal.ADDRESS_NAME
oxidize_postal.ADDRESS_HOUSE_NUMBER
oxidize_postal.ADDRESS_STREET
oxidize_postal.ADDRESS_UNIT
oxidize_postal.ADDRESS_LEVEL
oxidize_postal.ADDRESS_POSTAL_CODE
# ... and more
```
## Requirements
- Python 3.9+
- libpostal data files (~2GB, downloaded separately)
- Rust toolchain (for building from source)
## Project Structure
```
oxidize-postal/
├── oxidize-postal/               # Rust extension module
│   ├── src/
│   │   ├── lib.rs                # PyO3 module definition
│   │   └── postal/
│   │       ├── parser.rs         # Core parsing functions
│   │       ├── python_api.rs     # Python-exposed functions
│   │       ├── error.rs          # Error types
│   │       └── constants.rs      # libpostal constants
│   ├── Cargo.toml                # Rust dependencies
│   └── pyproject.toml            # Python package config
├── tests/
│   ├── fixtures/                 # Sample addresses
│   ├── unit/                     # Unit tests
│   ├── integration/              # End-to-end tests
│   └── performance/              # Benchmarking tests
├── main.py                       # Usage examples
├── data_manager.py               # libpostal data downloader
├── build.sh                      # Build script
└── pyproject.toml                # Root package config
```
### Architecture
- **Stack**: Python → PyO3 → Rust → libpostal-rust → libpostal C library
- **GIL Release**: All parsing operations release the Python GIL for true parallel processing
- **Error Handling**: Rust errors are converted to Python exceptions (ValueError, RuntimeError)
- **Data Requirements**: libpostal needs ~2GB of language model data (stored in `/usr/local/share/libpostal`)
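Because failures surface as `ValueError` or `RuntimeError`, batch jobs can catch those two types and keep going instead of aborting on the first bad record. The sketch below shows one way to do that. `parse_or_none` is a hypothetical wrapper, and the stand-in parser's choice to raise `ValueError` on empty input is an assumption made for the demo; which inputs trigger which exception in oxidize-postal is not specified here.

```python
from typing import Callable, Optional


def parse_or_none(parse_fn: Callable[[str], dict], address: str) -> Optional[dict]:
    """Call the parser and return None instead of raising on a parse failure."""
    try:
        return parse_fn(address)
    except (ValueError, RuntimeError) as exc:
        # Report and continue; a real pipeline might log or collect these instead.
        print(f"could not parse {address!r}: {exc}")
        return None


def fake_parse(address: str) -> dict:
    """Stand-in parser (assumption): rejects empty input with ValueError."""
    if not address.strip():
        raise ValueError("address is empty")
    return {"road": address.lower()}


print(parse_or_none(fake_parse, "123 Main St"))
print(parse_or_none(fake_parse, ""))
```

In real use, `parse_fn` would be `oxidize_postal.parse_address`.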
### Build Process
1. `maturin` compiles the Rust extension with PyO3 bindings
2. The extension links against the libpostal-rust crate
3. The build produces a Python wheel containing the native extension
4. The resulting wheel requires no Python runtime dependencies
## License
MIT License
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Acknowledgments
- [libpostal](https://github.com/openvenues/libpostal) - The core C library for address parsing
- [libpostal-rust](https://crates.io/crates/libpostal-rust) - Rust bindings for libpostal
- [pypostal](https://github.com/openvenues/pypostal) - The original Python bindings that inspired this project
Raw data
{
"_id": null,
"home_page": null,
"name": "oxidize-postal",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "address, postal, parsing, normalization, libpostal, rust, performance",
"author": null,
"author_email": "Eric Aleman <eric@example.com>",
"download_url": null,
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "High-performance postal address parser and normalizer using libpostal with Rust bindings",
"version": "0.1.2",
"project_urls": {
"Repository": "https://github.com/ericaleman/oxidize-postal"
},
"split_keywords": [
"address",
" postal",
" parsing",
" normalization",
" libpostal",
" rust",
" performance"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "f45e070269819484c2611bbc22447ac741b98fbfc38e2dbc402eaa4a99a9aa5c",
"md5": "f45813c91c3c24a720f62a8364eb4ae8",
"sha256": "475ec0e98c0770b17c4606adedaa00de8f06b42ef9b9b1c53f8cf90cb68964a9"
},
"downloads": -1,
"filename": "oxidize_postal-0.1.2-cp312-cp312-macosx_11_0_arm64.whl",
"has_sig": false,
"md5_digest": "f45813c91c3c24a720f62a8364eb4ae8",
"packagetype": "bdist_wheel",
"python_version": "cp312",
"requires_python": ">=3.9",
"size": 1248661,
"upload_time": "2025-09-06T22:14:52",
"upload_time_iso_8601": "2025-09-06T22:14:52.926402Z",
"url": "https://files.pythonhosted.org/packages/f4/5e/070269819484c2611bbc22447ac741b98fbfc38e2dbc402eaa4a99a9aa5c/oxidize_postal-0.1.2-cp312-cp312-macosx_11_0_arm64.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-09-06 22:14:52",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ericaleman",
"github_project": "oxidize-postal",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "oxidize-postal"
}