tidyxl


Nametidyxl JSON
Version 0.1.7 PyPI version JSON
download
home_pageNone
SummaryImport Excel files into tidy format with cell-level data extraction
upload_time2025-07-31 01:51:24
maintainerNone
docs_urlNone
authorNone
requires_python>=3.12
licenseMIT
keywords excel xlsx tidy data spreadsheet openpyxl pandas data-analysis data-science cell-extraction
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # tidyxl

A Python package that imports Excel files (.xlsx, .xlsm) into a tidy format where each cell is represented as a single row with detailed metadata. This package is inspired by and closely mirrors the functionality of the R tidyxl package.

## Installation

```bash
pip install tidyxl
```

## Quick Start

```python
from tidyxl import xlsx_cells, xlsx_sheet_names

# Get all sheet names
sheets = xlsx_sheet_names("data.xlsx")
print(f"Found sheets: {sheets}")

# Read all cells from all sheets
cells = xlsx_cells("data.xlsx")
print(f"Total cells: {len(cells)}")

# Read specific sheet only
employees = xlsx_cells("data.xlsx", sheets="Employees")

# Filter for cells with content
content_cells = cells[~cells['is_blank']]
print(content_cells[['sheet', 'address', 'data_type', 'character', 'numeric']].head())
```

## Features

- **Tidy Format**: Each Excel cell becomes one row with comprehensive metadata
- **Complete Cell Information**: Extract values, formulas, formatting, comments, and more
- **Multiple Worksheets**: Process all sheets or specify particular ones
- **Named Ranges**: Extract and analyze Excel named ranges and formulas
- **Data Validation**: Discover data validation rules applied to cells
- **Type Safety**: Separate columns for different data types (numeric, character, logical, date, error)
- **R tidyxl Compatible**: Identical API and output structure to the R package

## Core Functions

### xlsx_cells()
Extract all cell data in tidy format:

```python
from tidyxl import xlsx_cells

# Read all sheets
cells = xlsx_cells("file.xlsx")

# Read specific sheets
cells = xlsx_cells("file.xlsx", sheets=["Sheet1", "Sheet2"])
cells = xlsx_cells("file.xlsx", sheets="Data")

# Include/exclude blank cells
cells = xlsx_cells("file.xlsx", include_blank_cells=False)
```

### xlsx_sheet_names()
List all worksheet names:

```python
from tidyxl import xlsx_sheet_names

sheets = xlsx_sheet_names("file.xlsx")
# Returns: ['Sheet1', 'Data', 'Summary']
```

### xlsx_names()
Extract named ranges and formulas:

```python
from tidyxl import xlsx_names

names = xlsx_names("file.xlsx")
print(names[['name', 'formula', 'sheet', 'is_range']])
```

### xlsx_validation()
Extract data validation rules:

```python
from tidyxl import xlsx_validation

validation = xlsx_validation("file.xlsx")
print(validation[['sheet', 'ref', 'type', 'formula1']])
```

### xlsx_formats()
Extract formatting information:

```python
from tidyxl import xlsx_formats

formats = xlsx_formats("file.xlsx")
# Returns dict with keys: fonts, fills, borders, number_formats
```

## Output Structure

The `xlsx_cells()` function returns a pandas DataFrame with 23 columns matching the R tidyxl package:

| Column | Type | Description |
|--------|------|-------------|
| `sheet` | str | Worksheet name |
| `address` | str | Cell address (A1 notation) |
| `row` | int | Row number |
| `col` | int | Column number |
| `is_blank` | bool | Whether cell has a value |
| `content` | str | Raw cell value before type conversion |
| `data_type` | str | Cell type (character, numeric, logical, date, error, blank) |
| `error` | str | Error value if cell contains error |
| `logical` | bool | Boolean value if cell contains TRUE/FALSE |
| `numeric` | float | Numeric value if cell contains number |
| `date` | datetime | Date value if cell contains date |
| `character` | str | String value if cell contains text |
| `formula` | str | Formula if cell contains formula |
| `is_array` | bool | Whether formula is array formula |
| `formula_ref` | str | Range address for array/shared formulas |
| `formula_group` | int | Formula group identifier |
| `comment` | str | Cell comment text |
| `height` | float | Row height in Excel units |
| `width` | float | Column width in Excel units |
| `row_outline_level` | int | Row outline/grouping level |
| `col_outline_level` | int | Column outline/grouping level |
| `style_format` | str | Style format identifier |
| `local_format_id` | int | Local formatting identifier |

## License

This project is licensed under the MIT License.

## Acknowledgments

- Inspired by the excellent R tidyxl package by Duncan Garmonsway
- Built on top of openpyxl for Excel file processing
- Uses pandas for data structure management

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "tidyxl",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.12",
    "maintainer_email": "Wendi Wang <wwd1015@gmail.com>",
    "keywords": "excel, xlsx, tidy, data, spreadsheet, openpyxl, pandas, data-analysis, data-science, cell-extraction",
    "author": null,
    "author_email": "Wendi Wang <wwd1015@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/ef/ea/bbeea189c96256d6c89ade9cd0c927155e33141ffa07fb8058fe2e2477df/tidyxl-0.1.7.tar.gz",
    "platform": null,
    "description": "# tidyxl\n\nA Python package that imports Excel files (.xlsx, .xlsm) into a tidy format where each cell is represented as a single row with detailed metadata. This package is inspired by and closely mirrors the functionality of the R tidyxl package.\n\n## Installation\n\n```bash\npip install tidyxl\n```\n\n## Quick Start\n\n```python\nfrom tidyxl import xlsx_cells, xlsx_sheet_names\n\n# Get all sheet names\nsheets = xlsx_sheet_names(\"data.xlsx\")\nprint(f\"Found sheets: {sheets}\")\n\n# Read all cells from all sheets\ncells = xlsx_cells(\"data.xlsx\")\nprint(f\"Total cells: {len(cells)}\")\n\n# Read specific sheet only\nemployees = xlsx_cells(\"data.xlsx\", sheets=\"Employees\")\n\n# Filter for cells with content\ncontent_cells = cells[~cells['is_blank']]\nprint(content_cells[['sheet', 'address', 'data_type', 'character', 'numeric']].head())\n```\n\n## Features\n\n- **Tidy Format**: Each Excel cell becomes one row with comprehensive metadata\n- **Complete Cell Information**: Extract values, formulas, formatting, comments, and more\n- **Multiple Worksheets**: Process all sheets or specify particular ones\n- **Named Ranges**: Extract and analyze Excel named ranges and formulas\n- **Data Validation**: Discover data validation rules applied to cells\n- **Type Safety**: Separate columns for different data types (numeric, character, logical, date, error)\n- **R tidyxl Compatible**: Identical API and output structure to the R package\n\n## Core Functions\n\n### xlsx_cells()\nExtract all cell data in tidy format:\n\n```python\nfrom tidyxl import xlsx_cells\n\n# Read all sheets\ncells = xlsx_cells(\"file.xlsx\")\n\n# Read specific sheets\ncells = xlsx_cells(\"file.xlsx\", sheets=[\"Sheet1\", \"Sheet2\"])\ncells = xlsx_cells(\"file.xlsx\", sheets=\"Data\")\n\n# Include/exclude blank cells\ncells = xlsx_cells(\"file.xlsx\", include_blank_cells=False)\n```\n\n### xlsx_sheet_names()\nList all worksheet names:\n\n```python\nfrom tidyxl import xlsx_sheet_names\n\nsheets = xlsx_sheet_names(\"file.xlsx\")\n# Returns: ['Sheet1', 'Data', 'Summary']\n```\n\n### xlsx_names()\nExtract named ranges and formulas:\n\n```python\nfrom tidyxl import xlsx_names\n\nnames = xlsx_names(\"file.xlsx\")\nprint(names[['name', 'formula', 'sheet', 'is_range']])\n```\n\n### xlsx_validation()\nExtract data validation rules:\n\n```python\nfrom tidyxl import xlsx_validation\n\nvalidation = xlsx_validation(\"file.xlsx\")\nprint(validation[['sheet', 'ref', 'type', 'formula1']])\n```\n\n### xlsx_formats()\nExtract formatting information:\n\n```python\nfrom tidyxl import xlsx_formats\n\nformats = xlsx_formats(\"file.xlsx\")\n# Returns dict with keys: fonts, fills, borders, number_formats\n```\n\n## Output Structure\n\nThe `xlsx_cells()` function returns a pandas DataFrame with 23 columns matching the R tidyxl package:\n\n| Column | Type | Description |\n|--------|------|-------------|\n| `sheet` | str | Worksheet name |\n| `address` | str | Cell address (A1 notation) |\n| `row` | int | Row number |\n| `col` | int | Column number |\n| `is_blank` | bool | Whether cell has a value |\n| `content` | str | Raw cell value before type conversion |\n| `data_type` | str | Cell type (character, numeric, logical, date, error, blank) |\n| `error` | str | Error value if cell contains error |\n| `logical` | bool | Boolean value if cell contains TRUE/FALSE |\n| `numeric` | float | Numeric value if cell contains number |\n| `date` | datetime | Date value if cell contains date |\n| `character` | str | String value if cell contains text |\n| `formula` | str | Formula if cell contains formula |\n| `is_array` | bool | Whether formula is array formula |\n| `formula_ref` | str | Range address for array/shared formulas |\n| `formula_group` | int | Formula group identifier |\n| `comment` | str | Cell comment text |\n| `height` | float | Row height in Excel units |\n| `width` | float | Column width in Excel units |\n| `row_outline_level` | int | Row outline/grouping level |\n| `col_outline_level` | int | Column outline/grouping level |\n| `style_format` | str | Style format identifier |\n| `local_format_id` | int | Local formatting identifier |\n\n## License\n\nThis project is licensed under the MIT License.\n\n## Acknowledgments\n\n- Inspired by the excellent R tidyxl package by Duncan Garmonsway\n- Built on top of openpyxl for Excel file processing\n- Uses pandas for data structure management\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Import Excel files into tidy format with cell-level data extraction",
    "version": "0.1.7",
    "project_urls": {
        "Bug Tracker": "https://github.com/yourusername/tidyxl/issues",
        "Changelog": "https://github.com/yourusername/tidyxl/blob/main/CHANGELOG.md",
        "Documentation": "https://github.com/yourusername/tidyxl#readme",
        "Homepage": "https://github.com/yourusername/tidyxl",
        "Source Code": "https://github.com/yourusername/tidyxl"
    },
    "split_keywords": [
        "excel",
        " xlsx",
        " tidy",
        " data",
        " spreadsheet",
        " openpyxl",
        " pandas",
        " data-analysis",
        " data-science",
        " cell-extraction"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "e3799310b77d563ae1ab3bfb9f57fe88e432461f0934f9f465746ff6ccff3ad3",
                "md5": "2d513be5bec9790b2fd9e68795c08cbb",
                "sha256": "2f4f218e589cb7a094c46348dc08a52c4612d751598cfefb0608146762fa1ed8"
            },
            "downloads": -1,
            "filename": "tidyxl-0.1.7-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "2d513be5bec9790b2fd9e68795c08cbb",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.12",
            "size": 11985,
            "upload_time": "2025-07-31T01:51:23",
            "upload_time_iso_8601": "2025-07-31T01:51:23.048343Z",
            "url": "https://files.pythonhosted.org/packages/e3/79/9310b77d563ae1ab3bfb9f57fe88e432461f0934f9f465746ff6ccff3ad3/tidyxl-0.1.7-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "efeabbeea189c96256d6c89ade9cd0c927155e33141ffa07fb8058fe2e2477df",
                "md5": "64c4ea896a7f93b38a7c9e80ff5b1cf7",
                "sha256": "3dc7035b778e74fc5084e6349bbfdb69b9d52db0f0e07b19ff9a4bf898b6753d"
            },
            "downloads": -1,
            "filename": "tidyxl-0.1.7.tar.gz",
            "has_sig": false,
            "md5_digest": "64c4ea896a7f93b38a7c9e80ff5b1cf7",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.12",
            "size": 21547,
            "upload_time": "2025-07-31T01:51:24",
            "upload_time_iso_8601": "2025-07-31T01:51:24.037137Z",
            "url": "https://files.pythonhosted.org/packages/ef/ea/bbeea189c96256d6c89ade9cd0c927155e33141ffa07fb8058fe2e2477df/tidyxl-0.1.7.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-31 01:51:24",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "yourusername",
    "github_project": "tidyxl",
    "github_not_found": true,
    "lcname": "tidyxl"
}
        
Elapsed time: 0.62400s