oxidize


Nameoxidize JSON
Version 0.5.0 PyPI version JSON
download
home_pageNone
SummaryHigh-performance XML to JSON streaming parser built with Rust
upload_time2025-08-24 18:22:41
maintainerNone
docs_urlNone
authorEric Aleman
requires_python>=3.9
licenseNone
keywords xml json parser rust performance streaming
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # Oxidize

High-performance XML to JSON streaming parser built with Rust and PyO3. Specialized for extracting repeated elements from large XML files like API responses, log files, and data exports, particularly for engineers and analysts working in DuckDB, Polars, and Pandas.

## Key Features

- **High Performance**: 2-3x faster than lxml, built with Rust's quick-xml parser with batch processing in Rayon
- **Low Memory Usage**: Streaming architecture processes files larger than available RAM
- **Specialized Design**: Opinionated API and schema design for common data engineering and data analysis workflows

## Use Cases

Perfect for extracting structured data from XML files containing repeated elements into newline JSON
- **API responses**: Extract `<record>`, `<item>`, or `<entry>` elements from REST API responses
- **Log files**: Parse `<event>` or `<log>` entries from XML-formatted logs
- **Data exports**: Process `<row>`, `<product>`, or `<transaction>` elements from database exports
- **Configuration files**: Extract `<server>`, `<user>`, or similar repeated configuration blocks

## Installation

```bash
pip install oxidize
```

## Development Setup

```bash
# Install dependencies
poetry install

# Build the extension
./build.sh

# Run tests
pytest tests/
```

## Usage

Extract specific elements from XML and convert to JSON-Lines:

```python
import oxidize

# File to file
count = oxidize.parse_xml_file_to_json_file("data.xml", "book", "books.json")

# File to string  
json_lines = oxidize.parse_xml_file_to_json_string("data.xml", "book")

# String to string
result = oxidize.parse_xml_string_to_json_string(xml_content, "book")

# String to file
result = oxidize.parse_xml_string_to_json_file(xml_content, "book")
```

## Conversion Rules

**Uniform arrays**: All elements become arrays for consistent schema inference:

```xml
<book id="bk101">
    <author>J.K. Rowling</author>
    <title>Harry Potter</title>
</book>
```

```json
{
  "@id": "bk101",
  "author": ["J.K. Rowling"], 
  "title": ["Harry Potter"]
}
```

**Key behaviors:**
- **Attributes**: Prefixed with `@` to avoid conflicts with element names
- **Mixed content**: Text in elements with children stored as `#text` entries
- **Empty elements**: Self-closing/empty tags become `null` values
- **Structure preservation**: Element order maintained via IndexMap
- **Namespace handling**: Prefixes kept in element names, declarations treated as attributes

**Ignored features:**
- Processing instructions, DTDs, comments (not relevant for data extraction)
- Custom entity definitions (entity references passed through as text)
- Character references automatically unescaped by quick_xml


## API

```python
parse_xml_file_to_json_file(input_path, target_element, output_path, batch_size=1000) -> int
parse_xml_file_to_json_string(input_path, target_element, batch_size=1000) -> str  
parse_xml_string_to_json_file(xml_content, target_element, output_path, batch_size=1000) -> int
parse_xml_string_to_json_string(xml_content, target_element, batch_size=1000) -> str
```

**Parameters:**
- `batch_size`: Number of elements to process per batch (default: 1000, min: 1)
- Returns the number of elements processed, or raises `ValueError` for invalid inputs


## Testing

Run the test suite:

```bash
# All tests
pytest tests/

# Integration tests only  
pytest tests/integration/

# Performance benchmarks
pytest tests/performance/ --benchmark-only

# With coverage
pytest --cov=oxidize --cov-report=html
```

Test coverage includes:
- Core functionality validation
- Error handling with malformed XML
- Performance regression detection
- Memory usage monitoring
- Edge cases and concurrent operations

## Architecture

```
oxidize/src/io/
├── error.rs         # Centralized error handling and Python conversions
├── parser.rs        # Core XML streaming parser with security validations  
├── python_api.rs    # Clean Python function wrappers with shared logic
├── xml_utils.rs     # XML-to-JSON conversion utilities
└── mod.rs          # Module organization and exports
```


## Security

Oxidize includes various security protections against XML-based attacks:

### File Path Security
- **Path sanitization**: Prevents directory traversal attacks (`../` sequences)
- **Null byte protection**: Rejects paths containing null bytes
- **Path length limits**: Maximum 4096 character paths
- **Canonical path validation**: Uses system path normalization

### XML Bomb Protection
- **Element nesting limit**: Maximum 1000 levels of nesting depth
- **Element size limit**: Maximum 10MB per element
- **Attribute limits**: Maximum 1000 attributes per element
- **Attribute size limit**: Maximum 64KB per attribute value

### Security Limits
```rust
MAX_ELEMENT_DEPTH: 1000        // Maximum XML nesting depth
MAX_ELEMENT_SIZE: 10_000_000   // Maximum element size (10MB)
MAX_ATTRIBUTE_COUNT: 1000      // Maximum attributes per element  
MAX_ATTRIBUTE_SIZE: 65536      // Maximum attribute size (64KB)
```

These limits prevent:
- **Billion laughs attacks**: Exponential entity expansion
- **Quadratic blowup attacks**: Deeply nested structures
- **Memory exhaustion**: Oversized elements or attributes
- **Directory traversal**: Path-based security vulnerabilities


            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "oxidize",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": null,
    "keywords": "xml, json, parser, rust, performance, streaming",
    "author": "Eric Aleman",
    "author_email": null,
    "download_url": null,
    "platform": null,
    "description": "# Oxidize\n\nHigh-performance XML to JSON streaming parser built with Rust and PyO3. Specialized for extracting repeated elements from large XML files like API responses, log files, and data exports, particularly for engineers and analysts working in DuckDB, Polars, and Pandas.\n\n## Key Features\n\n- **High Performance**: 2-3x faster than lxml, built with Rust's quick-xml parser with batch processing in Rayon\n- **Low Memory Usage**: Streaming architecture processes files larger than available RAM\n- **Specialized Design**: Opinionated API and schema design for common data engineering and data analysis workflows\n\n## Use Cases\n\nPerfect for extracting structured data from XML files containing repeated elements into newline JSON\n- **API responses**: Extract `<record>`, `<item>`, or `<entry>` elements from REST API responses\n- **Log files**: Parse `<event>` or `<log>` entries from XML-formatted logs\n- **Data exports**: Process `<row>`, `<product>`, or `<transaction>` elements from database exports\n- **Configuration files**: Extract `<server>`, `<user>`, or similar repeated configuration blocks\n\n## Installation\n\n```bash\npip install oxidize\n```\n\n## Development Setup\n\n```bash\n# Install dependencies\npoetry install\n\n# Build the extension\n./build.sh\n\n# Run tests\npytest tests/\n```\n\n## Usage\n\nExtract specific elements from XML and convert to JSON-Lines:\n\n```python\nimport oxidize\n\n# File to file\ncount = oxidize.parse_xml_file_to_json_file(\"data.xml\", \"book\", \"books.json\")\n\n# File to string  \njson_lines = oxidize.parse_xml_file_to_json_string(\"data.xml\", \"book\")\n\n# String to string\nresult = oxidize.parse_xml_string_to_json_string(xml_content, \"book\")\n\n# String to file\nresult = oxidize.parse_xml_string_to_json_file(xml_content, \"book\")\n```\n\n## Conversion Rules\n\n**Uniform arrays**: All elements become arrays for consistent schema inference:\n\n```xml\n<book id=\"bk101\">\n    <author>J.K. Rowling</author>\n    <title>Harry Potter</title>\n</book>\n```\n\n```json\n{\n  \"@id\": \"bk101\",\n  \"author\": [\"J.K. Rowling\"], \n  \"title\": [\"Harry Potter\"]\n}\n```\n\n**Key behaviors:**\n- **Attributes**: Prefixed with `@` to avoid conflicts with element names\n- **Mixed content**: Text in elements with children stored as `#text` entries\n- **Empty elements**: Self-closing/empty tags become `null` values\n- **Structure preservation**: Element order maintained via IndexMap\n- **Namespace handling**: Prefixes kept in element names, declarations treated as attributes\n\n**Ignored features:**\n- Processing instructions, DTDs, comments (not relevant for data extraction)\n- Custom entity definitions (entity references passed through as text)\n- Character references automatically unescaped by quick_xml\n\n\n## API\n\n```python\nparse_xml_file_to_json_file(input_path, target_element, output_path, batch_size=1000) -> int\nparse_xml_file_to_json_string(input_path, target_element, batch_size=1000) -> str  \nparse_xml_string_to_json_file(xml_content, target_element, output_path, batch_size=1000) -> int\nparse_xml_string_to_json_string(xml_content, target_element, batch_size=1000) -> str\n```\n\n**Parameters:**\n- `batch_size`: Number of elements to process per batch (default: 1000, min: 1)\n- Returns the number of elements processed, or raises `ValueError` for invalid inputs\n\n\n## Testing\n\nRun the test suite:\n\n```bash\n# All tests\npytest tests/\n\n# Integration tests only  \npytest tests/integration/\n\n# Performance benchmarks\npytest tests/performance/ --benchmark-only\n\n# With coverage\npytest --cov=oxidize --cov-report=html\n```\n\nTest coverage includes:\n- Core functionality validation\n- Error handling with malformed XML\n- Performance regression detection\n- Memory usage monitoring\n- Edge cases and concurrent operations\n\n## Architecture\n\n```\noxidize/src/io/\n\u251c\u2500\u2500 error.rs         # Centralized error handling and Python conversions\n\u251c\u2500\u2500 parser.rs        # Core XML streaming parser with security validations  \n\u251c\u2500\u2500 python_api.rs    # Clean Python function wrappers with shared logic\n\u251c\u2500\u2500 xml_utils.rs     # XML-to-JSON conversion utilities\n\u2514\u2500\u2500 mod.rs          # Module organization and exports\n```\n\n\n## Security\n\nOxidize includes various security protections against XML-based attacks:\n\n### File Path Security\n- **Path sanitization**: Prevents directory traversal attacks (`../` sequences)\n- **Null byte protection**: Rejects paths containing null bytes\n- **Path length limits**: Maximum 4096 character paths\n- **Canonical path validation**: Uses system path normalization\n\n### XML Bomb Protection\n- **Element nesting limit**: Maximum 1000 levels of nesting depth\n- **Element size limit**: Maximum 10MB per element\n- **Attribute limits**: Maximum 1000 attributes per element\n- **Attribute size limit**: Maximum 64KB per attribute value\n\n### Security Limits\n```rust\nMAX_ELEMENT_DEPTH: 1000        // Maximum XML nesting depth\nMAX_ELEMENT_SIZE: 10_000_000   // Maximum element size (10MB)\nMAX_ATTRIBUTE_COUNT: 1000      // Maximum attributes per element  \nMAX_ATTRIBUTE_SIZE: 65536      // Maximum attribute size (64KB)\n```\n\nThese limits prevent:\n- **Billion laughs attacks**: Exponential entity expansion\n- **Quadratic blowup attacks**: Deeply nested structures\n- **Memory exhaustion**: Oversized elements or attributes\n- **Directory traversal**: Path-based security vulnerabilities\n\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "High-performance XML to JSON streaming parser built with Rust",
    "version": "0.5.0",
    "project_urls": {
        "Repository": "https://github.com/ericaleman/oxidize"
    },
    "split_keywords": [
        "xml",
        " json",
        " parser",
        " rust",
        " performance",
        " streaming"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "df8651284675eaa9aeefe6b3ff381212d9684739a6f77b7ca0b94d54c2df4b62",
                "md5": "8541187a08d6ee81b9a38484228caa4e",
                "sha256": "b19ddc0485102fd13921d01d39184f637e918f23f66dfbd944b2d43f39df2b10"
            },
            "downloads": -1,
            "filename": "oxidize-0.5.0-cp312-cp312-macosx_11_0_arm64.whl",
            "has_sig": false,
            "md5_digest": "8541187a08d6ee81b9a38484228caa4e",
            "packagetype": "bdist_wheel",
            "python_version": "cp312",
            "requires_python": ">=3.9",
            "size": 339941,
            "upload_time": "2025-08-24T18:22:41",
            "upload_time_iso_8601": "2025-08-24T18:22:41.435205Z",
            "url": "https://files.pythonhosted.org/packages/df/86/51284675eaa9aeefe6b3ff381212d9684739a6f77b7ca0b94d54c2df4b62/oxidize-0.5.0-cp312-cp312-macosx_11_0_arm64.whl",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-24 18:22:41",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ericaleman",
    "github_project": "oxidize",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "oxidize"
}
        
Elapsed time: 1.07927s