# python-pwalk
**A high-performance toolkit for filesystem analysis and reporting — optimized for petabyte-scale filesystems and HPC environments.**
Generate comprehensive filesystem metadata reports **5-10x faster** than traditional tools, with true multi-threading, intelligent buffering, and automatic zstd compression achieving 23x size reduction. Perfect for system administrators, data scientists, and HPC users working with millions or billions of files.
## Why python-pwalk?
Traditional filesystem traversal tools struggle with modern storage systems containing billions of files. `python-pwalk` solves this with a battle-tested approach combining Python's ease of use with C's raw performance.
### Key Features
- 🚀 **Extreme Performance**: 8,000-30,000 files/second — traverse 50 million files in ~41 minutes
- 🔄 **True Parallelism**: Multi-threaded C implementation using up to 32 threads
- 🗜️ **23x Compression**: Automatic zstd compression reduces 100 GB CSV to 4 GB
- 📦 **Zero Dependencies**: No PyArrow, no numpy — just Python + C
- 🔌 **Drop-in Replacement**: 100% compatible with `os.walk()` API
- 💾 **Memory Efficient**: Thread-local buffers with automatic spillover for billions of files
- 🎯 **HPC Ready**: SLURM-aware, `.snapshot` filtering, cross-filesystem detection
- 🦆 **DuckDB Native**: Output `.csv.zst` files readable directly by DuckDB
- 🛡️ **Production Tested**: Based on John Dey's pwalk used in HPC environments worldwide
### Perfect For
- **System Administrators**: Audit multi-petabyte filesystems in minutes
- **Data Scientists**: Analyze file distributions across massive datasets
- **HPC Users**: Track storage usage in supercomputing environments
- **Storage Teams**: Generate reports for NetApp, Lustre, GPFS filesystems
- **Compliance**: Create auditable records of filesystem contents
## Installation
```bash
pip install pwalk
```
**That's it!** Pre-compiled binary wheels are available for:
- **Linux**: x86_64 (manylinux2014)
- **CPython**: 3.10, 3.11, 3.12, 3.13, 3.14
- **CPython (No GIL)**: 3.13t, 3.14t (experimental free-threading builds)
- **PyPy**: 3.10, 3.11 (fast JIT-compiled Python alternative)
**No system dependencies needed** — wheels include everything pre-compiled!
For free-threading Python:
```bash
python3.13t -m pip install pwalk
PYTHON_GIL=0 python3.13t your_script.py
```
For PyPy:
```bash
pypy3 -m pip install pwalk
```
## Quick Start
### 30-Second Example
```python
from pwalk import walk, report

# 1. Drop-in replacement for os.walk() — 100% compatible API
for dirpath, dirnames, filenames in walk('/data'):
    print(f"{dirpath}: {len(filenames)} files")

# 2. Generate compressed filesystem report (5-10x faster with multi-threading!)
output, errors = report('/data', compress='zstd')
# Creates scan.csv.zst - 23x smaller than plain CSV!

# 3. Analyze with DuckDB
import duckdb
df = duckdb.connect().execute(f"SELECT * FROM '{output}'").fetchdf()
print(df.head())
```
> **Note on Performance**: The `walk()` function uses `os.walk()` under the hood (single-threaded) for maximum compatibility across Python versions. For **5-10x faster performance**, use `report()` which leverages our multi-threaded C implementation. In Python 3.13+ with free-threading (no-GIL mode), `walk()` will automatically use parallel traversal for massive speedups!
### Basic Usage
```python
from pwalk import walk

# 100% compatible with os.walk() API
for dirpath, dirnames, filenames in walk('/data'):
    print(f"Directory: {dirpath}")
    print(f"  Subdirectories: {len(dirnames)}")
    print(f"  Files: {len(filenames)}")
```
### Full API Compatibility
```python
from pwalk import walk

def handle_error(error):
    # Called with the OSError raised for an unreadable directory
    print(f"Skipping: {error}")

# All os.walk() parameters supported
for dirpath, dirnames, filenames in walk(
    '/data',
    topdown=True,          # Process parents before children
    onerror=handle_error,  # Error callback
    followlinks=False      # Don't follow symlinks
):
    # Modify dirnames in-place to prune traversal
    dirnames[:] = [d for d in dirnames if not d.startswith('.')]
```
### Advanced: Thread Control
```python
from pwalk import walk

# Explicit thread count (default: cpu_count() or SLURM_CPUS_ON_NODE)
for dirpath, dirnames, filenames in walk('/data', max_threads=16):
    process_directory(dirpath, filenames)  # your per-directory handler

# Traverse .snapshot directories (skipped by default)
for dirpath, dirnames, filenames in walk('/data', ignore_snapshots=False):
    ...
```
## Filesystem Metadata Reports
### CSV Output with Zstd Compression (Default)
```python
from pwalk import report

# Generate compressed CSV (8-10x smaller, DuckDB compatible)
output, errors = report(
    '/data',
    output='scan.csv',
    compress='zstd'  # or 'gzip', 'auto', 'none'
)

print(f"Report saved to: {output}")
print(f"Inaccessible directories: {len(errors)}")
```
**CSV Format** (100% compatible with John Dey's pwalk):
```
inode,parent-inode,directory-depth,"filename","fileExtension",UID,GID,st_size,st_dev,st_blocks,st_nlink,"st_mode",st_atime,st_mtime,st_ctime,pw_fcount,pw_dirsum
```
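Since the CSV uses quoted string fields, it can be parsed with Python's standard `csv` module alone; the sketch below loads one record using the header above (the row values are made-up examples, not real pwalk output):

```python
import csv
import io

# The pwalk CSV header (from above); the data row is an invented example.
header = ('inode,parent-inode,directory-depth,"filename","fileExtension",UID,GID,'
          'st_size,st_dev,st_blocks,st_nlink,"st_mode",st_atime,st_mtime,st_ctime,'
          'pw_fcount,pw_dirsum')
row = ('131072,4096,2,"/data/logs/app.log","log",1000,1000,8192,64768,16,1,'
       '"0100644",1700000000,1699999000,1699999000,-1,-1')

columns = next(csv.reader(io.StringIO(header)))
record = dict(zip(columns, next(csv.reader(io.StringIO(row)))))
print(record["filename"], record["st_size"])
```

For anything beyond a quick inspection, querying the compressed file directly with DuckDB (shown below) avoids decompressing it by hand.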
**Compression Options**:
- `compress='auto'`: Use zstd if available, else uncompressed (default)
- `compress='zstd'`: Force zstd compression (8-10x, fast)
- `compress='gzip'`: Use gzip compression (6-7x, slower but universal)
- `compress='none'`: No compression
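The `'auto'` mode could be resolved along these lines; this is a hypothetical sketch of the selection logic, not pwalk's actual implementation (the function name `resolve_compression` is invented):

```python
import importlib.util

def resolve_compression(mode: str = "auto") -> str:
    """Hypothetical sketch of how compress= might resolve — not pwalk's code."""
    if mode == "auto":
        # Prefer zstd when a binding is importable, else fall back to uncompressed
        return "zstd" if importlib.util.find_spec("zstandard") else "none"
    if mode in ("zstd", "gzip", "none"):
        return mode
    raise ValueError(f"unsupported compression mode: {mode!r}")
```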
## DuckDB Analysis Workflow
```python
# 1. Generate compressed CSV report
from pwalk import report
output, errors = report('/data', output='scan.csv', compress='zstd')

# 2. DuckDB reads .csv.zst natively!
import duckdb
con = duckdb.connect('fs_analysis.db')
con.execute("CREATE TABLE fs AS SELECT * FROM 'scan.csv.zst'")

# 3. Answer questions like "Who used the last 10 TB?"
result = con.execute("""
    SELECT
        uid,
        count(*) AS file_count,
        sum(st_size) / (1024.0*1024*1024*1024) AS size_tb
    FROM fs
    WHERE st_ctime > epoch(now() - INTERVAL 7 DAY)
    GROUP BY uid
    ORDER BY size_tb DESC
    LIMIT 10
""").fetchdf()
print(result)

# 4. Optional: Convert to Parquet for long-term storage
con.execute("""
    COPY (SELECT * FROM 'scan.csv.zst')
    TO 'scan.parquet' (FORMAT PARQUET, COMPRESSION SNAPPY)
""")
```
## Filesystem Repair (Root Only)
```python
from pwalk import repair

# Dry-run to preview changes
repair(
    '/shared',
    dry_run=True,
    change_gids=[1234, 5678],    # Treat these GIDs like private groups
    force_group_writable=True,   # Ensure group read+write+execute
    exclude=['/shared/archive']  # Skip specific paths
)

# Apply changes
repair('/shared', dry_run=False, force_group_writable=True)
```
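The `force_group_writable` option amounts to setting the group read/write/execute bits on each entry; a minimal sketch of that mode arithmetic, using a hypothetical helper name (`with_group_rwx` is not part of the pwalk API):

```python
import stat

GROUP_RWX = stat.S_IRGRP | stat.S_IWGRP | stat.S_IXGRP  # 0o070

def with_group_rwx(mode: int) -> int:
    """Hypothetical helper: the mode a force_group_writable pass would aim for."""
    return mode | GROUP_RWX

# e.g. 0o750 lacks group write; OR-ing in the group bits yields 0o770
print(oct(with_group_rwx(0o750)))
```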
## Real-World Use Cases
### 1. Answer Critical Storage Questions Fast
```python
from pwalk import report
import duckdb

# Generate comprehensive filesystem metadata
report('/data', compress='zstd')  # Done in minutes, not hours!

# Who used the last 10 TB this week?
con = duckdb.connect()
result = con.execute("""
    SELECT UID, COUNT(*) AS files, SUM(st_size)/1e12 AS TB
    FROM 'scan.csv.zst'
    WHERE st_ctime > epoch(now() - INTERVAL 7 DAY)
    GROUP BY UID ORDER BY TB DESC LIMIT 10
""").fetchdf()
```
### 2. Find Storage Hogs and Opportunities
```python
# Find directories with >1M files (performance issues!)
huge_dirs = con.execute("""
    SELECT "filename", pw_fcount, pw_dirsum/1e9 AS GB
    FROM 'scan.csv.zst'
    WHERE pw_fcount > 1000000
    ORDER BY pw_fcount DESC
""").fetchdf()

# Find ancient files (cleanup candidates)
old_files = con.execute("""
    SELECT "filename", st_size, to_timestamp(st_mtime) AS mtime
    FROM 'scan.csv.zst'
    WHERE st_mtime < epoch(now() - INTERVAL 2 YEAR)
    ORDER BY st_size DESC LIMIT 100
""").fetchdf()
```
### 3. Monitor Growth Over Time
```python
# Weekly snapshots
import time
import schedule
from pwalk import report

def weekly_snapshot():
    timestamp = time.strftime('%Y%m%d')
    report('/data', output=f'snapshot_{timestamp}.csv.zst')

schedule.every().sunday.at("02:00").do(weekly_snapshot)

while True:
    schedule.run_pending()
    time.sleep(60)
```
## Performance
### Multi-Threading Support Matrix
| Python Version | `walk()` | `report()` |
|----------------|----------|------------|
| **CPython 3.10** | ❌ | ✅ |
| **CPython 3.11** | ❌ | ✅ |
| **CPython 3.12** | ❌ | ✅ |
| **CPython 3.13** | ❌ | ✅ |
| **CPython 3.13t** (No GIL) | ✅ | ✅ |
| **CPython 3.14** | ❌ | ✅ |
| **CPython 3.14t** (No GIL) | ✅ | ✅ |
| **PyPy 3.10** | ❌ | ✅ |
| **PyPy 3.11** | ❌ | ✅ |
**Legend**: ✅ = Multi-threaded (5-10x faster) | ❌ = Single-threaded
**Key Takeaway**: Use `report()` for maximum performance on all Python versions!
### Current Performance (Python 3.10-3.14)
**`walk()` function**: Uses `os.walk()` internally (single-threaded) for 100% compatibility
- Same speed as `os.walk()` — perfect drop-in replacement
- No threading overhead, works everywhere
**`report()` function**: Multi-threaded C implementation (5-10x faster!)
- **Speed**: 8,000-30,000 stat operations per second
- **Example**: 50 million files in ~41 minutes at 20K stats/sec
- **Parallelism**: Up to 32 threads (releases GIL, works on all Python implementations)
- **Scaling**: Performance depends on storage system, host CPU, and file layout
- **Compression**: Zstd reduces CSV size by 23x with minimal overhead
- **PyPy Compatible**: Multi-threading works on PyPy through C extension (cpyext)
### Future Performance (Python 3.13+ Free-Threading)
**What's Changing?** Python 3.13 introduced optional "free-threading" mode (also called "no-GIL mode").
**The Global Interpreter Lock (GIL) Explained**: For decades, Python had a "global lock" that prevented multiple threads from running Python code simultaneously. This meant that even with multiple CPU cores, only one thread could execute Python code at a time. Python 3.13+ can optionally remove this lock, allowing true parallel execution.
**What This Means for pwalk**:
- **CPython 3.13+ with free-threading**: `walk()` will automatically use parallel traversal for 5-10x speedup
- **CPython 3.10-3.12**: `walk()` uses `os.walk()` (single-threaded)
- **PyPy (all versions)**: `walk()` uses `os.walk()` (single-threaded, but JIT-compiled Python is faster)
- **`report()` is always fast**: Multi-threaded C code releases GIL - works on all implementations!
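A script can check at runtime whether it is actually getting free-threaded execution; `sys._is_gil_enabled()` was added in CPython 3.13, so the sketch below degrades gracefully on older interpreters:

```python
import sys

def free_threading_active() -> bool:
    """True only on a free-threaded CPython build running with the GIL disabled."""
    is_gil_enabled = getattr(sys, "_is_gil_enabled", None)  # added in CPython 3.13
    return is_gil_enabled is not None and not is_gil_enabled()

print(free_threading_active())
```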
**How to Get Free-Threading Python** (Python 3.13+):
Free-threading Python builds are now available! Here's how to get them:
**Option 1: Official Python.org Installers** (Easiest)
```bash
# Download from python.org and look for the separate "free-threaded" installers
# https://www.python.org/downloads/

# On Linux/macOS, the free-threaded interpreter has a 't' suffix
python3.13t --version  # e.g. "Python 3.13.x experimental free-threading build"
```
**Option 2: Build from Source** (Linux/macOS)
```bash
# Clone Python source
git clone https://github.com/python/cpython.git
cd cpython
git checkout v3.13.0 # or latest 3.13.x tag
# Configure with free-threading
./configure --disable-gil
make -j$(nproc)
sudo make install
# Verify
python3.13 --version
python3.13 -c "import sys; print(f'GIL disabled: {not sys._is_gil_enabled()}')"
```
**Option 3: Docker/Conda** (Recommended for Testing)
```bash
# Using official Python Docker image
docker run -it python:3.13-slim python3 -c "import sys; print(sys._is_gil_enabled())"
# Conda (check for free-threading builds)
conda install python=3.13
```
**Option 4: pyenv** (Developers)
```bash
# Install pyenv if not already installed
curl https://pyenv.run | bash
# Install free-threaded Python 3.13
pyenv install 3.13.0t # 't' suffix for free-threaded build
pyenv local 3.13.0t
# Verify
python --version
python -c "import sys; print(f'GIL: {sys._is_gil_enabled()}')"
```
**Using Free-Threading with pwalk**:
```bash
# Install pwalk
python3.13t -m pip install pwalk
# IMPORTANT: Set PYTHON_GIL=0 to keep GIL disabled
export PYTHON_GIL=0
# Run your script
python3.13t your_script.py
# Or inline:
PYTHON_GIL=0 python3.13t your_script.py
# Verify it's working
PYTHON_GIL=0 python3.13t -c "import sys; print(f'Free-threading: {not sys._is_gil_enabled()}')"
```
> **Important**: By default, Python 3.13t will **enable the GIL** when loading C extensions that haven't declared GIL-free compatibility. Use `PYTHON_GIL=0` or `-X gil=0` to keep it disabled.
> **Note**: As of 2025, free-threading is still **experimental**. Some packages may not be compatible yet. For production use today, stick with `report()` which is always multi-threaded!
## Technical Architecture
- **Single Optimized C Extension**: `_pwalk_core` — 320 lines of highly optimized C
- **Thread-Local Buffers**: 512KB per thread, zero lock contention during traversal
- **Multithreaded Traversal**: Up to 32 parallel threads using John Dey's proven algorithm
- **Streaming Compression**: Zstd level 1 for 23x compression at 200-400 MB/s
- **SLURM Integration**: Auto-detects `SLURM_CPUS_ON_NODE` for HPC environments
- **Zero Dependencies**: No external Python packages — ships ready to run
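The SLURM-aware thread default described above can be sketched as follows; this is an illustration of the documented behavior (`SLURM_CPUS_ON_NODE`, falling back to `cpu_count()`, capped at 32 threads), and the helper name `default_thread_count` is invented, not part of the pwalk API:

```python
import os

def default_thread_count(cap: int = 32) -> int:
    """Illustrative SLURM-aware thread selection, mirroring the README's description."""
    slurm = os.environ.get("SLURM_CPUS_ON_NODE", "")
    # Prefer the SLURM allocation when present, else the host CPU count
    n = int(slurm) if slurm.isdigit() else (os.cpu_count() or 1)
    return max(1, min(n, cap))
```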
## Why No Dependencies Matters
Unlike other tools that require PyArrow (~50 MB), numpy, pandas, etc., `pwalk` installs in seconds with zero dependencies. This means:
- ✅ Instant installation on air-gapped HPC systems
- ✅ No version conflicts with existing packages
- ✅ Minimal attack surface for security-conscious environments
- ✅ Works everywhere Python 3.10+ works
## Contributing
Contributions welcome! Based on the rock-solid [filesystem-reporting-tools](https://github.com/john-dey/filesystem-reporting-tools) by John Dey.
## License
GPL v2 — Same as John Dey's original pwalk implementation.
## Links
- **PyPI**: https://pypi.org/project/pwalk/
- **GitHub**: https://github.com/dirkpetersen/python-pwalk
- **Issues**: https://github.com/dirkpetersen/python-pwalk/issues
- **DuckDB**: https://duckdb.org — Perfect companion for analyzing pwalk output
- **Original pwalk**: https://github.com/john-dey/filesystem-reporting-tools
---
**Built with ❤️ for system administrators and HPC users worldwide. Based on John Dey's battle-tested pwalk implementation.**
},
"downloads": -1,
"filename": "pwalk-0.1.6-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl",
"has_sig": false,
"md5_digest": "cf5b8ba40d71dfc00e6cf9544dd2b6c9",
"packagetype": "bdist_wheel",
"python_version": "cp312",
"requires_python": ">=3.10",
"size": 38753,
"upload_time": "2025-10-19T18:09:58",
"upload_time_iso_8601": "2025-10-19T18:09:58.095142Z",
"url": "https://files.pythonhosted.org/packages/10/01/91a9584a8c214e23d33336a2f38458c9e2f9c1217ce0914b98951482bf1a/pwalk-0.1.6-cp312-cp312-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "f8ebfb36b38a9e0e4eec9b0cc7786837bfcf6d49812cf23c5a14edff24c042d7",
"md5": "a50270d7c19f1c0e8abe46fc0f823be7",
"sha256": "32cd1ac431e84e2b33b65b2f4ab0a517344691033fb3de409450185879aa3ead"
},
"downloads": -1,
"filename": "pwalk-0.1.6-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"has_sig": false,
"md5_digest": "a50270d7c19f1c0e8abe46fc0f823be7",
"packagetype": "bdist_wheel",
"python_version": "cp313",
"requires_python": ">=3.10",
"size": 367474,
"upload_time": "2025-10-19T18:09:59",
"upload_time_iso_8601": "2025-10-19T18:09:59.634361Z",
"url": "https://files.pythonhosted.org/packages/f8/eb/fb36b38a9e0e4eec9b0cc7786837bfcf6d49812cf23c5a14edff24c042d7/pwalk-0.1.6-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "522494b696ea9f52e55d9382703c5082e385fd32a0cc3242a69f7f977c27d183",
"md5": "4eeff4bb476d1b29bc47ea823b808098",
"sha256": "a53a9e48f4060f302cd1f2890ce1996e99ac4b7ecc01e335dfe780cd3e5832c7"
},
"downloads": -1,
"filename": "pwalk-0.1.6-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl",
"has_sig": false,
"md5_digest": "4eeff4bb476d1b29bc47ea823b808098",
"packagetype": "bdist_wheel",
"python_version": "cp313",
"requires_python": ">=3.10",
"size": 38715,
"upload_time": "2025-10-19T18:10:01",
"upload_time_iso_8601": "2025-10-19T18:10:01.278516Z",
"url": "https://files.pythonhosted.org/packages/52/24/94b696ea9f52e55d9382703c5082e385fd32a0cc3242a69f7f977c27d183/pwalk-0.1.6-cp313-cp313-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "dfab4bacb628d06c32508cfcc6d73d165fbf0a4b91eef37f790cfec16fe2df62",
"md5": "60b1a23600c8b0cc02bbe39060baf6c2",
"sha256": "f2f56e649730effbf266417b5a99b649e3ad388f7c32f7cd66a62c959a3de50c"
},
"downloads": -1,
"filename": "pwalk-0.1.6-cp313-cp313t-manylinux_2_34_x86_64.whl",
"has_sig": false,
"md5_digest": "60b1a23600c8b0cc02bbe39060baf6c2",
"packagetype": "bdist_wheel",
"python_version": "cp313",
"requires_python": ">=3.10",
"size": 370602,
"upload_time": "2025-10-19T18:10:02",
"upload_time_iso_8601": "2025-10-19T18:10:02.501555Z",
"url": "https://files.pythonhosted.org/packages/df/ab/4bacb628d06c32508cfcc6d73d165fbf0a4b91eef37f790cfec16fe2df62/pwalk-0.1.6-cp313-cp313t-manylinux_2_34_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "d9826c831c278d43639f32b19e2f8a2992e0d39b3080e3c026f74b7f70e10985",
"md5": "678d85615288de33341a1a60fc90b7df",
"sha256": "e9bc22d0a2f39ae261cf2faa75a4805704cc91adc977efb52dd8e0808735c19f"
},
"downloads": -1,
"filename": "pwalk-0.1.6-cp314-cp314t-manylinux_2_34_x86_64.whl",
"has_sig": false,
"md5_digest": "678d85615288de33341a1a60fc90b7df",
"packagetype": "bdist_wheel",
"python_version": "cp314",
"requires_python": ">=3.10",
"size": 370755,
"upload_time": "2025-10-19T18:10:03",
"upload_time_iso_8601": "2025-10-19T18:10:03.926693Z",
"url": "https://files.pythonhosted.org/packages/d9/82/6c831c278d43639f32b19e2f8a2992e0d39b3080e3c026f74b7f70e10985/pwalk-0.1.6-cp314-cp314t-manylinux_2_34_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "624e6c844f046c0f6b2ca0a4cb141f28bd810f80d01c73c61734c7fe0f66d250",
"md5": "2f7e83da2b11b3b351d6d075a5e218aa",
"sha256": "6afe8db5895dfd8be0f98d20e4adf047b5ba3eeb60bf57bae877cf55777de7c5"
},
"downloads": -1,
"filename": "pwalk-0.1.6-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"has_sig": false,
"md5_digest": "2f7e83da2b11b3b351d6d075a5e218aa",
"packagetype": "bdist_wheel",
"python_version": "pp310",
"requires_python": ">=3.10",
"size": 354396,
"upload_time": "2025-10-19T18:10:05",
"upload_time_iso_8601": "2025-10-19T18:10:05.318979Z",
"url": "https://files.pythonhosted.org/packages/62/4e/6c844f046c0f6b2ca0a4cb141f28bd810f80d01c73c61734c7fe0f66d250/pwalk-0.1.6-pp310-pypy310_pp73-manylinux_2_17_x86_64.manylinux2014_x86_64.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "ffbd56b0074f6b89959273ab92fbc9325aaf3f6781f5ed8a131cefab68c09b16",
"md5": "61e38ac64c942559ad8482cf542c364b",
"sha256": "eb6162dc596cbea891b6d4a94a1b5ae758bb01fbf08e1059f3adf3d840cdc276"
},
"downloads": -1,
"filename": "pwalk-0.1.6-pp310-pypy310_pp73-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl",
"has_sig": false,
"md5_digest": "61e38ac64c942559ad8482cf542c364b",
"packagetype": "bdist_wheel",
"python_version": "pp310",
"requires_python": ">=3.10",
"size": 27071,
"upload_time": "2025-10-19T18:10:07",
"upload_time_iso_8601": "2025-10-19T18:10:07.223259Z",
"url": "https://files.pythonhosted.org/packages/ff/bd/56b0074f6b89959273ab92fbc9325aaf3f6781f5ed8a131cefab68c09b16/pwalk-0.1.6-pp310-pypy310_pp73-manylinux_2_5_i686.manylinux1_i686.manylinux_2_17_i686.manylinux2014_i686.whl",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-19 18:09:50",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "dirkpetersen",
"github_project": "python-pwalk",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"lcname": "pwalk"
}