# DedupeCopy
A multi-threaded command-line tool for finding duplicate files and copying/restructuring file layouts while eliminating duplicates.
[License](LICENSE) · [Python 3.11+](https://www.python.org/downloads/)
## Table of Contents
- [Overview](#overview)
- [Architecture](#architecture)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Key Concepts](#key-concepts)
- [Usage Examples](#usage-examples)
- [Command-Line Options](#command-line-options)
- [Path and Extension Rules](#path-and-extension-rules)
- [Advanced Workflows](#advanced-workflows)
- [Performance Tips](#performance-tips)
- [Logging and Output Control](#logging-and-output-control)
- [Troubleshooting](#troubleshooting)
- [Safety and Best Practices](#safety-and-best-practices)
- [Output Files](#output-files)
## Overview
DedupeCopy is designed for consolidating and restructuring sprawling file systems, particularly useful for:
- **Backup consolidation**: Merge multiple backup sources while eliminating duplicates
- **Photo/media library organization**: Consolidate photos from various devices and organize by date
- **File system cleanup**: Identify and remove duplicate files
- **Server migration**: Copy files to new storage while preserving structure
- **Duplicate detection**: Generate reports of duplicate files without copying
- **Deleting duplicates**: Reclaim disk space by removing redundant files
**The good bits:**
- Uses MD5 (or optionally xxHash) checksums for accurate duplicate detection
- Multi-threaded for fast processing
- Manifest system for resuming interrupted operations
- Flexible path restructuring rules
- Can compare against multiple file systems without full re-scans
- Configurable logging with verbosity levels (quiet, normal, verbose, debug)
- Colored output for better readability (optional)
- Helpful error messages with actionable suggestions
- Real-time progress with processing rates
**Note:** This is *not* a replacement for rsync or Robocopy for incremental synchronization. Those are mature, well-tested tools; if they fit your use case, prefer them.
## Architecture
DedupeCopy uses a multi-threaded pipeline architecture to maximize performance when processing large file systems. Understanding this architecture helps explain performance characteristics and tuning options.
### High-Level Design
```
┌─────────────────────────────────────────────────────────────────┐
│ MAIN THREAD │
│ • Orchestrates the entire operation │
│ • Manages thread lifecycle and coordination │
│ • Handles manifest loading/saving │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ THREAD POOLS (Queues) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌────────────────┐ ┌────────────────┐ ┌───────────────┐ │
│ │ Walk Threads │───▶│ Read Threads │───▶│ Copy/Delete │ │
│ │ (4 default) │ │ (8 default) │ │ Threads │ │
│ │ │ │ │ │ (8 default) │ │
│ └────────────────┘ └────────────────┘ └───────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ Walk Queue Work Queue Copy/Delete │
│ (directories) (files to hash) Queue │
│ │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ PROGRESS THREAD │
│ • Collects status updates from all worker threads │
│ • Displays progress, rates, and statistics │
│ • Logs errors and warnings │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ RESULT PROCESSOR │
│ • Processes file hashes from read threads │
│ • Detects duplicate files │
│ • Updates manifest and collision dictionaries │
│ • Performs incremental saves every 50,000 files │
└─────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ PERSISTENT STORAGE │
│ • Manifest: Maps hash → list of files with that hash │
│ • Collision DB: Tracks duplicate files │
│ • SQLite-backed with disk caching (LRU-like eviction) │
│ • Auto-saves every 50,000 files for crash recovery │
└─────────────────────────────────────────────────────────────────┘
```
### Pipeline Stages
#### 1. **Walk Stage** (WalkThread)
- **Purpose**: Discover files and directories in the source paths
- **Thread Count**: 4 by default (configurable with `--walk-threads`)
- **Output**: Adds directories to `walk_queue` and files to `work_queue`
- **Filtering**: Applies extension filters and ignore patterns
#### 2. **Read Stage** (ReadThread)
- **Purpose**: Hash file contents to detect duplicates
- **Thread Count**: 8 by default (configurable with `--read-threads`)
- **Input**: Files from `work_queue`
- **Output**: Tuple of (hash, size, mtime, filepath) to `result_queue`
- **Algorithm**: `md5` or `xxh64` (configurable with `--hash-algo`)
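As a rough illustration, a read thread's hashing step might look like the sketch below. The chunk size and function name are illustrative, not DedupeCopy's actual code:

```python
# Minimal sketch of chunked file hashing, assuming md5 by default and the
# optional xxhash package for xxh64 (names here are illustrative).
import hashlib
import os

def hash_file(path: str, algo: str = "md5", chunk_size: int = 1 << 20):
    """Return (hexdigest, size, mtime), reading the file in chunks."""
    if algo == "xxh64":
        import xxhash  # optional dependency from the fast_hash extra
        hasher = xxhash.xxh64()
    else:
        hasher = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks until EOF so large files never load into RAM.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    stat = os.stat(path)
    return hasher.hexdigest(), stat.st_size, stat.st_mtime
```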
#### 3. **Result Processing** (ResultProcessor)
- **Purpose**: Aggregate hashes and detect collisions
- **Thread Count**: 1 (single-threaded for data consistency)
- **Input**: Hash results from `result_queue`
- **Output**: Updates `manifest` and `collisions` dictionaries
- **Auto-save**: Incremental saves every 50,000 processed files
#### 4. **Copy/Delete Stage** (CopyThread/DeleteThread)
- **Purpose**: Perform file operations based on duplicate analysis
- **Thread Count**: 8 by default (configurable with `--copy-threads`)
- **Input**: Files from `copy_queue` or `delete_queue`
- **Operations**:
- Copy unique files to destination (with optional path rules)
- Delete duplicate files (keeping first occurrence)
#### 5. **Progress Monitoring** (ProgressThread)
- **Purpose**: Centralized logging and status updates
- **Thread Count**: 1 (collects events from all other threads)
- **Features**:
- Real-time progress with processing rates
- Queue size monitoring (to detect bottlenecks)
- Error aggregation and reporting
- Final summary statistics
### Data Structures
#### Manifest
- **Storage**: Disk-backed dictionary (SQLite + cache layer)
- **Format**: `hash → [(filepath, size, mtime), ...]`
- **Cache**: In-memory LRU cache (10,000 items default)
- **Persistence**: Auto-saved during operation and on completion
#### Disk Cache Dictionary
- **Purpose**: Handle datasets larger than available RAM
- **Backend**: SQLite with Write-Ahead Logging (WAL)
- **Optimization**: Batched commits for performance
- **Eviction**: LRU or random eviction when cache is full
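For intuition, a minimal disk-backed mapping along these lines can be sketched with `sqlite3`. This is a toy illustration of the idea, not DedupeCopy's actual schema, cache, or eviction logic:

```python
# Toy disk-backed dict: SQLite with WAL enabled, values JSON-encoded.
import json
import sqlite3

class DiskDict:
    def __init__(self, path: str):
        self.conn = sqlite3.connect(path)
        self.conn.execute("PRAGMA journal_mode=WAL")  # concurrent-friendly writes
        self.conn.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")

    def __setitem__(self, key: str, value) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, json.dumps(value))
        )

    def __getitem__(self, key: str):
        row = self.conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        if row is None:
            raise KeyError(key)
        return json.loads(row[0])

    def commit(self) -> None:
        self.conn.commit()  # batched commits amortize fsync cost

# manifest = DiskDict("manifest.db")
# manifest["d41d8..."] = [("/path/file1.jpg", 1024, 1633024800.0)]
```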
### Performance Characteristics
**Queue-Based Throttling**: When queues exceed 100,000 items, the system introduces deliberate delays to prevent memory exhaustion.
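A minimal sketch of that backpressure idea, assuming a simple sleep-based producer (the 100,000 threshold is taken from the description above; the helper name is hypothetical):

```python
# Producer-side backpressure: sleep while the downstream queue is too full.
import queue
import time

QUEUE_SOFT_LIMIT = 100_000

def put_with_backpressure(q: queue.Queue, item, delay: float = 0.1) -> None:
    while q.qsize() > QUEUE_SOFT_LIMIT:
        time.sleep(delay)  # let consumers drain before adding more work
    q.put(item)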
**Bottleneck Detection**: Progress thread monitors queue sizes:
- **Walk queue growing**: Too many directories, consider reducing `--walk-threads`
- **Work queue growing**: Hashing is slower than discovery, increase `--read-threads`
- **Copy queue growing**: I/O is slower than hashing, increase `--copy-threads`
**I/O Patterns**:
- **Walk threads**: Mostly metadata operations (directory listings)
- **Read threads**: Sequential reads of entire files
- **Copy threads**: Large sequential writes
### Thread Safety
- **Queues**: Python's `queue.Queue` provides thread-safe operations
- **Manifest**: Uses database-level locking (SQLite RLock)
- **Save operations**: Coordinated via `save_event` to pause workers during saves
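A sketch of how such `save_event` coordination can work with `threading.Event` (illustrative; DedupeCopy's actual checkpointing may differ):

```python
# Workers wait on the event at each step; the saver clears it to pause them.
import threading

save_event = threading.Event()
save_event.set()  # set = workers may run; cleared while a save is in progress

def worker_step(process_one) -> None:
    save_event.wait()  # block here while a save is underway
    process_one()

def save_manifest(do_save) -> None:
    save_event.clear()  # ask workers to pause at their next checkpoint
    try:
        do_save()
    finally:
        save_event.set()  # resume workers even if the save raised
```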
## Installation
### Via pip (recommended)
```bash
pip install DedupeCopy
```
### With color support (optional)
For colored console output (errors in red, warnings in yellow, etc.):
```bash
pip install DedupeCopy[color]
```
### With fast hashing (optional)
For faster file hashing using xxhash:
```bash
pip install DedupeCopy[fast_hash]
```
### From source
```bash
git clone https://github.com/othererik/dedupe_copy.git
cd dedupe_copy
pip install -e .
# Or with color support:
pip install -e .[color]
```
### Requirements
- Python 3.11 or later
- Sufficient disk space for manifest files (typically small, but can grow for very large file sets)
- Optional: colorama for colored console output (installed with `[color]` extra)
## Quick Start
### Find duplicates in a directory
```bash
dedupecopy -p /path/to/search -r duplicates.csv
```
This scans `/path/to/search` and creates a CSV report of all duplicate files.
### Copy files while removing duplicates
```bash
dedupecopy -p /source/path -c /destination/path
```
This copies all files from source to destination, skipping any duplicates.
### Copy with manifest (recommended for large operations)
```bash
dedupecopy -p /source/path -c /destination/path -m manifest.db
```
Creates a manifest file that allows you to resume if interrupted. If the operation is stopped, you can resume it (note that the input `-i` and output `-m` manifest paths must differ):
```bash
dedupecopy -p /source/path -c /destination/path -i manifest.db -m manifest_v2.db
```
### Delete Duplicates
```bash
dedupecopy -p /path/to/search --delete --dry-run
```
This will scan the specified path and show you which files would be deleted. Once you are sure, you can run the command again without `--dry-run` to perform the deletion.
## Key Concepts
### Manifests
Manifests are database files that store:
- Checksums (MD5 by default) of processed files
- File metadata (size, modification time, path)
- Which files have been scanned
**Benefits:**
- **Resumability**: If an operation is interrupted, you can resume it by loading the manifest. The tool will skip any files that have already been processed.
- **Comparison**: Manifests can be used to compare file systems without re-scanning. For example, you can compare a source and a destination to see which files are missing from the destination.
- **Incremental Backups**: By loading a manifest from a previous backup, you can process only new or modified files.
- **Tracking**: Manifests keep a record of which files have been processed, which is useful for auditing and tracking.
**Manifest Options:**
- `-m manifest.db` - Save manifest after processing (output)
- `-i manifest.db` - Load existing manifest before processing (input)
- `--compare manifest.db` - Load manifest for duplicate checking only (does not copy those files)
**Important Safety Rule:** You **cannot** use the same file path for both `-i` (input) and `-m` (output). This prevents accidental manifest corruption during operations.
### Understanding `-i` vs `--compare`
Both `-i` and `--compare` use existing manifests to determine which files to skip, but they serve different purposes.
#### `-i` / `--manifest-read-path` (Input Manifest)
- **Purpose**: Resume an interrupted operation or continue from a previous run.
- **Behavior**: Files in this manifest are considered "already processed" and are skipped. They **are** included in the output manifest.
- **Use Case**: You started a large copy, it was interrupted, and you want to resume without re-scanning everything.
```bash
# Initial run (interrupted)
dedupecopy -p /source -c /dest -m progress.db
# Resume (skips files in progress.db)
dedupecopy -p /source -c /dest -i progress.db -m progress_new.db
```
#### `--compare` (Comparison Manifest)
- **Purpose**: Deduplicate against another location.
- **Behavior**: Files in this manifest are treated as duplicates and are **not** copied. They are **not** included in the output manifest.
- **Use Case**: You want to back up new photos from your phone, but you want to skip any photos that are already in your main archive.
```bash
# Incremental backup (skip files already in the main backup)
dedupecopy -p /phone_backup -c /main_archive --compare main_archive.db -m phone_backup_new.db
```
#### Key Differences
| Feature | `-i` (Input Manifest) | `--compare` (Comparison Manifest) |
|------------------------------|------------------------------|-----------------------------------|
| **Files Copied?** | No (already processed) | No (treated as duplicates) |
| **Included in Output?** | Yes | No |
| **Primary Use Case** | Resume Operations | Deduplicate Across Sources |
| **Can use with same output?**| **No** (safety rule) | Yes |
**Note on `--no-walk`**: When using `-i` or `--compare`, you can also use `--no-walk` to prevent the tool from scanning the source file system. This is useful when you want to operate *only* on the files listed in the manifests.
### Duplicate Detection
Files are considered duplicates when:
1. They have identical checksums (MD5 by default)
2. They have the same file size
**Special case:** Empty (zero-byte) files are treated as unique by default. Use `--dedupe-empty` to treat them as duplicates.
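Putting those criteria together, the core duplicate test might look like this sketch (names such as `is_duplicate` and `dedupe_empty` are illustrative, not DedupeCopy's actual code):

```python
# Duplicate test: same hash AND same size; zero-byte files are unique
# unless dedupe_empty is set (mirrors the behavior described above).
def is_duplicate(seen: dict, digest: str, size: int,
                 dedupe_empty: bool = False) -> bool:
    if size == 0 and not dedupe_empty:
        return False  # empty files treated as unique by default
    key = (digest, size)
    if key in seen:
        return True
    seen[key] = True  # first occurrence: remember it, not a duplicate
    return False
```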
## Usage Examples
### Basic Operations
#### 1. Generate a duplicate file report
```bash
dedupecopy -p /Users/johndoe -r dupes.csv -m manifest.db
```
Creates a CSV report of all duplicates and saves a manifest for future use.
**With quiet output (minimal):**
```bash
dedupecopy -p /Users/johndoe -r dupes.csv -m manifest.db --quiet
```
**With verbose output (detailed progress):**
```bash
dedupecopy -p /Users/johndoe -r dupes.csv -m manifest.db --verbose
```
#### 2. Copy specific file types
```bash
dedupecopy -p /source -c /backup -e jpg -e png -e gif
```
Copy only image files (jpg, png, gif) to the backup directory.
### Photo Organization
#### Organize photos by date
```bash
dedupecopy -p C:\pics -p D:\pics -e jpg -R "jpg:mtime" -c X:\organized_photos
```
Copies all JPG files from C: and D: drives, organizing them into folders by year/month (e.g., `2024_03/`). See the "Path and Extension Rules" section for more on combining rules.
### Multi-Source Consolidation
#### Copy from multiple sources to single destination
```bash
dedupecopy -p /source1 -p /source2 -p /source3 -c /backup -m backup_manifest.db
```
Scans all three source paths and copies unique files to backup.
#### Resume an interrupted copy
```bash
dedupecopy -p /source -c /destination -i manifest.db -m manifest_v2.db
```
Loads the previous manifest and resumes where it left off.
### Advanced Pattern Matching
#### Ignore specific patterns
```bash
dedupecopy -p /source -c /backup --ignore "*.tmp" --ignore "*.cache" --ignore "**/Thumbs.db"
```
Excludes temporary files and thumbnails from processing.
#### Extension-specific rules
```bash
dedupecopy -p /media -c /organized \
-R "*.jpg:mtime" \
-R "*.mp4:extension" \
-R "*.doc*:no_change"
```
Different organization rules for different file types.
## Command-Line Options
### Required Options (one of)
| Option | Description |
|--------|-------------|
| `-p PATH`, `--read-path PATH` | Source path(s) to scan. Can be specified multiple times. |
| `--no-walk` | Skip file system walk; use paths from loaded manifest only. |
### Core Options
| Option | Description |
|--------|-------------|
| `-c PATH`, `--copy-path PATH` | Destination path for copying files. Cannot be used with `--delete`. |
| `--delete` | Deletes duplicate files, keeping the first-seen file. Cannot be used with `--copy-path`. |
| `-r PATH`, `--result-path PATH` | Output path for CSV duplicate report. |
| `-m PATH`, `--manifest-dump-path PATH` | Path to save manifest file. |
| `-i PATH`, `--manifest-read-path PATH` | Path to load existing manifest. Can be specified multiple times. |
| `-e EXT`, `--extensions EXT` | File extension(s) to include (e.g., `jpg`, `*.png`). Can be specified multiple times. |
| `--ignore PATTERN` | File pattern(s) to exclude (supports wildcards). Can be specified multiple times. |
| `-R RULE`, `--path-rules RULE` | Path restructuring rule(s) in format `extension:rule`. Can be specified multiple times. |
### Special Options
| Option | Description |
|--------|-------------|
| `--compare PATH` | Load manifest but don't copy its files (for comparison only). This is useful for excluding files that are already present in a destination or another source. Can be specified multiple times. |
| `--copy-metadata` | Preserve file timestamps and permissions (uses `shutil.copy2` instead of `copyfile`). |
| `--dedupe-empty` | Treat empty (0-byte) files as duplicates rather than unique. |
| `--ignore-old-collisions` | Only detect new duplicates (ignore duplicates already in loaded manifest). |
| `--dry-run` | Simulate operations without making any changes to the filesystem. |
| `--min-delete-size BYTES` | Minimum size of a file to be considered for deletion (e.g., `1048576` for 1MB). Default: `0`. |
| `--delete-on-copy` | Deletes source files after a successful copy. Requires `--copy-path` and `-m`. WARNING: duplicates are treated as already copied and are removed from the source. |
### Output Control Options
| Option | Description |
|--------|-------------|
| `-q`, `--quiet` | Show only warnings and errors (minimal output). |
| `-v`, `--verbose` | Show detailed progress information (more frequent updates with rates and timing details). |
| `--debug` | Show debug information including queue states and internal diagnostics. |
| `--no-color` | Disable colored output (useful for logging to files or non-terminal output). |
**Output Verbosity Levels:**
- **Normal** (default): Standard progress updates, errors, and summaries.
- **Quiet** (`--quiet`): Only warnings, errors, and the final summary.
- **Verbose** (`--verbose`): More frequent progress updates that include processing rates and timing details.
- **Debug** (`--debug`): All output including queue states and internal operations for troubleshooting.
### Performance Options
| Option | Default | Description |
|--------|---------|-------------|
| `--walk-threads N` | 4 | Number of threads for file system traversal. |
| `--read-threads N` | 8 | Number of threads for reading and hashing files. |
| `--copy-threads N` | 8 | Number of threads for copying files. |
| `--hash-algo ALGO` | `md5` | Hashing algorithm to use (`md5` or `xxh64`). `xxh64` requires the `fast_hash` extra. |
### Path Conversion Options
| Option | Description |
|--------|-------------|
| `--convert-manifest-paths-from PREFIX` | Original path prefix in manifest to replace. |
| `--convert-manifest-paths-to PREFIX` | New path prefix (useful when drive letters or mount points change). |
## Path and Extension Rules
This section explains how to control file selection and organization using extension filters (`-e`), ignore patterns (`--ignore`), and path restructuring rules (`-R`).
### Filtering Files by Extension
Use the `-e` or `--extensions` option to specify which file types to process.
- **`jpg`**: Matches `.jpg` files.
- **`*.jp*g`**: Matches `.jpg`, `.jpeg`, `.jpng`, etc.
- **`*`**: Matches all extensions.
If no `-e` option is provided, all files are processed by default.
### Ignoring Files and Directories
Use `--ignore` to exclude files or directories that match a specific pattern.
- **`"*.tmp"`**: Ignores all files with a `.tmp` extension.
- **`"**/Thumbs.db"`**: Ignores `Thumbs.db` files in any directory.
- **`"*.cache"`**: Ignores all files ending in `.cache`.
### Restructuring Destination Paths
Path rules (`-R` or `--path-rules`) determine how files are organized in the destination directory. The format is `pattern:rule`.
**Default Behavior:** If no `-R` flag is specified, the original directory structure is preserved (equivalent to `-R "*:no_change"`). This is the most intuitive behavior for backup and copy operations.
#### Available Rules
| Rule | Description | Example Output |
|-------------|-------------------------------------------------|-----------------------------------------|
| `no_change` | **[DEFAULT]** Preserves the original directory structure | `/dest/original/path/file.jpg` |
| `mtime` | Organizes by modification date (`YYYY_MM`) | `/dest/2024_03/file.jpg` |
| `extension` | Organizes into folders by file extension | `/dest/jpg/file.jpg` |
#### Combining Rules
Rules are applied in the order they are specified, creating nested directories.
```bash
# Organize first by extension, then by date
dedupecopy -p /source -c /backup -R "*:extension" -R "*:mtime"
```
**Result:** `/backup/jpg/2024_03/photo.jpg`
#### Pattern Matching for Rules
The `pattern` part of the rule determines which files the rule applies to. It supports wildcards, just like the `-e` filter.
- **`"*.jpg:mtime"`**: Applies the `mtime` rule only to JPG files.
- **`"*.jp*g:mtime"`**: Applies to `.jpg`, `.jpeg`, etc.
- **`"*:no_change"`**: Applies the `no_change` rule to all files.
The most specific pattern wins if multiple patterns could match a file.
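As a sketch of how such a table might be resolved per file, assuming `fnmatch`-style wildcards and a crude longest-pattern tie-break (both are assumptions for illustration, not necessarily DedupeCopy's exact logic):

```python
# Pick the most specific matching rule, then compute the destination subfolder.
import fnmatch
import os
import time

RULES = {"*.jpg": "mtime", "*.mp4": "extension", "*": "no_change"}

def dest_subdir(path: str, rules: dict = RULES) -> str:
    name = os.path.basename(path)
    matches = [p for p in rules if fnmatch.fnmatch(name, p)]
    pattern = max(matches, key=len)  # crude "most specific wins" heuristic
    rule = rules[pattern]
    if rule == "mtime":
        # YYYY_MM folder from the file's modification time, e.g. "2024_03"
        return time.strftime("%Y_%m", time.localtime(os.path.getmtime(path)))
    if rule == "extension":
        return name.rsplit(".", 1)[-1] if "." in name else ""
    return os.path.dirname(path).lstrip("/")  # no_change keeps original layout
```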
#### Example: Different Rules for Different Files
```bash
dedupecopy -p /media -c /organized \
-R "*.jpg:mtime" \
-R "*.mp4:extension" \
-R "*.pdf:no_change"
```
- **JPG files** are organized by date.
- **MP4 files** are organized into an `mp4` folder.
- **PDF files** keep their original directory structure.
## Advanced Workflows
### Sequential Multi-Source Backup
When consolidating from multiple sources to a single target while avoiding duplicates between sources:
#### Step 1: Create manifests for all locations
```bash
# Scan target (if it has existing files)
dedupecopy -p /backup/target -m target_manifest.db
# Scan each source
dedupecopy -p /source1 -m source1_manifest.db
dedupecopy -p /source2 -m source2_manifest.db
```
#### Step 2: Copy each source sequentially
```bash
# Copy source1 (skip files already in target or source2)
dedupecopy -p /source1 -c /backup/target \
  -i source1_manifest.db \
  --compare target_manifest.db \
  --compare source2_manifest.db \
  -m target_v1.db \
  --no-walk
# Copy source2 (skip files already in target or source1)
dedupecopy -p /source2 -c /backup/target \
  -i source2_manifest.db \
  --compare target_v1.db \
  --compare source1_manifest.db \
  -m target_v2.db \
  --no-walk
```
**How it works:**
- `--no-walk` skips re-scanning the filesystem; the file list comes from the manifests loaded with `-i`
- `--compare` loads manifests for duplicate checking but doesn't copy those files
- Each source is copied only if files aren't already in target or other sources
- Each step creates a new manifest tracking what's been copied so far
**Note:** The target and other-source manifests use `--compare` rather than `-i` because:
- `-i` + `-m` cannot use the same file path (safety rule)
- `--compare` is designed for exactly this use case (deduplication across sources)
- Each source's own manifest is loaded with `-i` so `--no-walk` can supply the file list without re-scanning
### Manifest Path Conversion
If drive letters or mount points change between runs:
```bash
dedupecopy -i old_manifest.db -m new_manifest.db \
--convert-manifest-paths-from "/Volumes/OldDrive" \
--convert-manifest-paths-to "/Volumes/NewDrive" \
--no-walk
```
Updates all paths in the manifest without re-scanning files.
### Incremental Backup
The most common use case for incremental backups is to copy new files from a source to a destination, skipping files that are already in the destination.
#### Step 1: Create a manifest of the destination
First, create a manifest of your destination directory. This gives you a record of what's already there.
```bash
dedupecopy -p /path/to/backup -m backup.db
```
#### Step 2: Run the incremental copy
Now, you can copy new files from your source, using `--compare` to skip duplicates that are already in the backup.
```bash
dedupecopy -p /path/to/source -c /path/to/backup --compare backup.db -m backup_new.db
```
**How it works:**
- `--compare` efficiently checks for duplicates without re-scanning the entire destination.
- Only new files from the source are copied.
- A new manifest (`backup_new.db`) is created, which includes both the old and new files. You can use this for the next incremental backup.
#### Example: Golden Directory Backup
This is useful for maintaining a "golden" directory with unique files from multiple sources.
```bash
# 1. Create a manifest of the golden directory
dedupecopy -p /golden_dir -m golden.db
# 2. Copy new, unique files from multiple sources
dedupecopy -p /source1 -p /source2 -c /golden_dir --compare golden.db -m golden_new.db
```
### Comparison Without Copying
Compare two directories to find what's different:
```bash
# Scan both locations
dedupecopy -p /location1 -m manifest1.db
dedupecopy -p /location2 -m manifest2.db
# Generate report of files in location1 not in location2
dedupecopy -p /location1 -i manifest1.db --compare manifest2.db -r unique_files.csv --no-walk
```
## Performance Tips
### Thread Count Tuning
**Default settings (4/8/8)** work well for most scenarios.
**For SSDs/NVMe:**
```bash
--walk-threads 8 --read-threads 16 --copy-threads 16
```
**For HDDs:**
```bash
--walk-threads 2 --read-threads 4 --copy-threads 4
```
**For network shares:**
```bash
--walk-threads 2 --read-threads 4 --copy-threads 2
```
Network latency makes more threads counterproductive.
### Large File Sets
For very large directories (millions of files):
1. **Use manifests** - Essential for resumability
2. **Process in batches** - Use `--ignore` to exclude subdirectories, process separately
3. **Monitor memory** - Manifests use disk-based caching to minimize memory usage
4. **Incremental saves** - Manifests auto-save every 50,000 files
### Network Considerations
- **Network paths may time out** - The tool retries after 3 seconds
- **SMB/CIFS shares** - Use lower thread counts
- **Bandwidth limits** - Reduce copy threads to avoid saturation
- **VPN connections** - May need much lower thread counts
### Manifest Storage
- Manifest files are stored as SQLite database files
- Size is proportional to number of unique files (typically a few MB per 100k files)
- Keep manifests on fast storage (SSD) for best performance
- Manifests are incrementally saved every 50,000 processed files
## Logging and Output Control
### Verbosity Levels
DedupeCopy provides flexible output control to suit different use cases:
#### Normal Mode (Default)
Standard output with progress updates every 1,000 files:
```bash
dedupecopy -p /source -c /destination
```
**Output includes:**
- Pre-flight configuration summary
- Progress updates with file counts and processing rates
- Error messages with helpful suggestions
- Final summary statistics
#### Quiet Mode
Minimal output - only warnings, errors, and final results:
```bash
dedupecopy -p /source -c /destination --quiet
```
**Best for:**
- Cron jobs and automated scripts
- When you only care about problems
- Reducing log file sizes
#### Verbose Mode
Detailed progress information:
```bash
dedupecopy -p /source -c /destination --verbose
```
**Output includes:**
- All normal mode output
- More frequent progress updates
- Detailed timing and rate information
#### Debug Mode
Comprehensive diagnostic information:
```bash
dedupecopy -p /source -c /destination --debug
```
**Output includes:**
- All verbose mode output
- Queue sizes and internal state
- Thread activity details
- Useful for troubleshooting performance issues
### Color Output
By default, DedupeCopy uses colored output when writing to a terminal (if colorama is installed):
- **Errors**: Red text
- **Warnings**: Yellow text
- **Info messages**: Default color
- **Debug messages**: Cyan text
To disable colors (e.g., when logging to a file):
```bash
dedupecopy -p /source -c /destination --no-color
```
Colors are automatically disabled when output is redirected to a file or pipe.
### Enhanced Features
#### Pre-Flight Summary
Before starting operations, you'll see a summary of configuration:
```
======================================================================
DEDUPE COPY - Operation Summary
======================================================================
Source path(s): 2 path(s)
- /Volumes/Source1
- /Volumes/Source2
Destination: /Volumes/Backup
Extension filter: jpg, png, gif
Path rules: *.jpg:mtime
Threads: walk=4, read=8, copy=8
Options: dedupe_empty=False, preserve_stat=True, no_walk=False
======================================================================
```
#### Progress with Rates
During operation, you'll see processing rates:
```
Discovered 5000 files (dirs: 250), accepted 4850. Rate: 142.3 files/sec
Work queue has 234 items. Progress queue has 12 items. Walk queue has 5 items.
...
Copied 4800 items. Skipped 50 items. Rate: 125.7 files/sec
```
#### Helpful Error Messages
Errors include context and suggestions:
```
Error processing '/path/to/file.txt': [PermissionError] Permission denied
Suggestions: Check file permissions; Ensure you have read access to source files
```
#### Proactive Warnings
The tool warns about potential issues before they become problems:
```
WARNING: Work queue is large (42000 items). Consider reducing thread counts to avoid memory issues.
WARNING: Progress queue is backing up (12000 items). This may indicate slow processing.
```
### Examples
#### Silent operation for scripts
```bash
dedupecopy -p /source -c /backup --quiet 2>&1 | tee backup.log
```
#### Maximum detail for troubleshooting
```bash
dedupecopy -p /source -c /backup --debug --no-color > debug.log 2>&1
```
#### Normal operation with color
```bash
dedupecopy -p /source -c /backup --verbose
```
## Troubleshooting
### Common Issues
#### "Directory disappeared during walk"
**Cause:** Network path timeout or files deleted during scan.
**Solution:**
- Reduce `--walk-threads` for network paths
- Ensure stable network connection
- Exclude volatile directories with `--ignore`
#### Out of Memory Errors
**Cause:** Very large queue sizes.
**Solution:**
- Reduce thread counts
- Process in smaller batches
- Ensure sufficient swap space
#### Permission Errors
**Cause:** Insufficient permissions on source or destination.
**Solution:**
```bash
# Check permissions
ls -la /source/path
ls -la /destination/path
# Run with appropriate user or use sudo (carefully!)
```
#### Resuming Interrupted Runs
If a run is interrupted:
```bash
# Resume using the manifest
dedupecopy -p /source -c /destination -i manifest.db -m manifest_v2.db
```
Files already processed (in manifest) are skipped.
#### Manifest Corruption
If manifest files become corrupted:
**Solution:**
- Delete manifest files and restart
- Manifest files: `manifest.db` and `manifest.db.read`
- Consider keeping backup copies of manifests for very long operations
### Getting Help
Check the output during run:
- Progress updates every 1,000 files with processing rates
- Error messages show problematic files with helpful suggestions
- Warnings alert you to potential issues proactively
- Final summary shows counts and errors
For debugging, use `--debug` mode:
```bash
dedupecopy -p /source -c /destination --debug --no-color > debug.log 2>&1
```
Debug output includes:
- File counts and progress with timing
- Queue sizes and internal state (useful if growing unbounded)
- Thread activity and performance metrics
- Specific error messages with file paths and suggestions
## Safety and Best Practices
### ⚠️ Important Warnings
1. **Test first**: Run with `-r` (report only) before using `-c` (copy) on important data
2. **Backup important data**: Always have backups before restructuring
3. **Use manifests**: They provide a record of what was processed
4. **Verify results**: Check file counts and spot-check files after copy operations
5. **Watch disk space**: Ensure sufficient space on destination
### Manifest Safety Rules
To prevent accidental data loss, DedupeCopy enforces the following rules for manifest usage:
1. **Destructive Operations Require an Output Manifest**: Any operation that modifies the set of files being tracked (e.g., `--delete`, `--delete-on-copy`) **requires** the `-m`/`--manifest-dump-path` option. This ensures that the results of the operation are saved to a new manifest, preserving the original.
2. **Input and Output Manifests Must Be Different**: To protect your original manifest, you cannot use the same file path for both `-i`/`--manifest-read-path` and `-m`/`--manifest-dump-path`. This prevents the input manifest from being overwritten.
**Example of a safe delete operation:**
```bash
# Load an existing manifest and save the changes to a new one
dedupecopy --no-walk --delete -i manifest_original.db -m manifest_after_delete.db
```
### Recommended Workflow
```bash
# Step 1: Generate report to understand what will happen
dedupecopy -p /source -r preview.csv -m preview_manifest.db
# Step 2: Review the CSV report
# Check duplicate counts, file types, sizes
# Step 3: Run the actual copy with manifest
dedupecopy -p /source -c /destination -i preview_manifest.db -m final_manifest.db
# Step 4: Verify
# Check file counts, spot-check files, verify important files copied
```
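For step 4, a quick verification helper (hypothetical, not part of DedupeCopy) can compare file counts between source and destination:

```python
# Count files under each root; a coarse sanity check after a copy.
import os

def count_files(root: str) -> int:
    return sum(len(files) for _, _, files in os.walk(root))

# print("source:", count_files("/source"))
# print("destination:", count_files("/destination"))
```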
### What Gets Copied / Deleted
- **First occurrence** of each unique file (by hash; MD5 by default) is kept.
- Subsequent identical files are skipped during a copy, or removed from the source when `--delete` is used.
- Files are considered unique if their hashes differ.
- By default, empty files are treated as unique. Use `--dedupe-empty` to treat them as duplicates.
- Ignored patterns (`--ignore`) are never copied or deleted.
### What Doesn't Get Copied / Deleted
- The first-seen version of a file is never deleted.
- Files matching `--ignore` patterns.
- Files listed in `--compare` manifests (used for comparison only).
- Extensions not matching the `-e` filter (if specified).
### Preserving Metadata
By default, only file contents are copied. To preserve timestamps and permissions:
```bash
dedupecopy -p /source -c /destination --copy-metadata
```
This uses Python's `shutil.copy2()` which preserves:
- Modification time
- Access time
- File mode (permissions)
**Note:** Not all metadata may transfer across different file systems.
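The difference between the two modes boils down to which `shutil` function is used; a minimal illustration (the helper name is hypothetical):

```python
# copyfile copies contents only; copy2 also copies stat info
# (modification/access times and permission bits).
import shutil

def copy_one(src: str, dst: str, copy_metadata: bool = False) -> None:
    if copy_metadata:
        shutil.copy2(src, dst)     # contents + timestamps and permissions
    else:
        shutil.copyfile(src, dst)  # contents only (the default)
```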
## Output Files
### CSV Duplicate Report
Format: `Collision #, MD5, Path, Size (bytes), mtime`
```csv
Src: ['/source/path']
Collision #, MD5, Path, Size (bytes), mtime
1, d41d8cd98f00b204e9800998ecf8427e, '/path/file1.jpg', 1024, 1633024800.0
1, d41d8cd98f00b204e9800998ecf8427e, '/path/file2.jpg', 1024, 1633024800.0
2, a3d5c12f8b9e4a1c2d3e4f5a6b7c8d9e, '/path/doc1.pdf', 2048, 1633111200.0
2, a3d5c12f8b9e4a1c2d3e4f5a6b7c8d9e, '/path/doc2.pdf', 2048, 1633111200.0
```
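If you want to post-process the report, the standard `csv` module can group rows by collision number. A sketch assuming exactly the layout shown above (a `Src:` line, then the column header, then data rows); adjust if your report differs:

```python
# Group duplicate paths by collision number from the CSV report.
import csv
from collections import defaultdict

groups = defaultdict(list)
with open("dupes.csv", newline="") as f:
    reader = csv.reader(f, skipinitialspace=True)
    next(reader)  # skip the "Src: [...]" line
    next(reader)  # skip the column header line
    for collision, md5, path, size, mtime in reader:
        groups[collision].append(path.strip("'"))

for collision, paths in groups.items():
    print(collision, paths)
```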
### Manifest Files
Binary database files (not human-readable):
- `manifest.db` - MD5 hashes and file metadata
- `manifest.db.read` - List of processed file paths
These enable resuming and incremental operations.
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is licensed under the Simplified BSD License.
## Project Links
- **GitHub**: https://github.com/othererik/dedupe_copy
- **PyPI**: https://pypi.org/project/DedupeCopy/
## Author
Erik Schweller (othererik@gmail.com)
---
**Status**: Tested and seems to work, but use with caution and always backup important data!
Raw data
{
"_id": null,
"home_page": null,
"name": "DedupeCopy",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.11",
"maintainer_email": "Erik Schweller <othererik@gmail.com>",
"keywords": "de-duplication, file management, deduplication, file-copy, backup, cleanup, disk-space, file-organization, command-line, hashing, manifest",
"author": null,
"author_email": "Erik Schweller <othererik@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/5f/79/f4cf230b98ae9156e461153366580a5f6551cd8c9474798cb6601c00d908/dedupecopy-1.1.8.tar.gz",
"platform": null,
"description": "# DedupeCopy\n\nA multi-threaded command-line tool for finding duplicate files and copying/restructuring file layouts while eliminating duplicates.\n\n[](LICENSE)\n[](https://www.python.org/downloads/)\n\n## Table of Contents\n\n- [Overview](#overview)\n- [Architecture](#architecture)\n- [Installation](#installation)\n- [Quick Start](#quick-start)\n- [Key Concepts](#key-concepts)\n- [Usage Examples](#usage-examples)\n- [Command-Line Options](#command-line-options)\n- [Path Rules](#path-rules)\n- [Advanced Workflows](#advanced-workflows)\n- [Performance Tips](#performance-tips)\n- [Troubleshooting](#troubleshooting)\n- [Safety and Best Practices](#safety-and-best-practices)\n\n## Overview\n\nDedupeCopy is designed for consolidating and restructuring sprawling file systems, particularly useful for:\n\n- **Backup consolidation**: Merge multiple backup sources while eliminating duplicates\n- **Photo/media library organization**: Consolidate photos from various devices and organize by date\n- **File system cleanup**: Identify and remove duplicate files\n- **Server migration**: Copy files to new storage while preserving structure\n- **Duplicate detection**: Generate reports of duplicate files without copying\n- **Deleting duplicates**: Reclaim disk space by removing redundant files.\n\n**The good bits:**\n- Uses MD5 checksums for accurate duplicate detection\n- Multi-threaded for fast processing\n- Manifest system for resuming interrupted operations\n- Flexible path restructuring rules\n- Can compare against multiple file systems without full re-scans\n- Configurable logging with verbosity levels (quiet, normal, verbose, debug)\n- Colored output for better readability (optional)\n- Helpful error messages with actionable suggestions\n- Real-time progress with processing rates\n\n**Note:** This is *not* a replacement for rsync or Robocopy for incremental synchronization. Those are good tools that might work for you, so do try them.\n\n## Architecture\n\nDedupeCopy uses a multi-threaded pipeline architecture to maximize performance when processing large file systems. 
Understanding this architecture helps explain performance characteristics and tuning options.\n\n### High-Level Design\n\n```\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 MAIN THREAD \u2502\n\u2502 \u2022 Orchestrates the entire operation \u2502\n\u2502 \u2022 Manages thread lifecycle and coordination \u2502\n\u2502 \u2022 Handles manifest loading/saving \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502\n \u25bc\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 THREAD POOLS (Queues) \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502 \u2502\n\u2502 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510 \u2502\n\u2502 \u2502 Walk Threads \u2502\u2500\u2500\u2500\u25b6\u2502 Read Threads \u2502\u2500\u2500\u2500\u25b6\u2502 Copy/Delete \u2502 \u2502\n\u2502 \u2502 (4 default) \u2502 \u2502 (8 default) \u2502 \u2502 Threads \u2502 \u2502\n\u2502 \u2502 \u2502 \u2502 \u2502 \u2502 (8 default) \u2502 \u2502\n\u2502 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518 \u2502\n\u2502 \u2502 \u2502 \u2502 \u2502\n\u2502 \u25bc \u25bc \u25bc \u2502\n\u2502 Walk Queue Work Queue Copy/Delete \u2502\n\u2502 (directories) (files to hash) Queue \u2502\n\u2502 
\u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502\n \u25bc\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 PROGRESS THREAD \u2502\n\u2502 \u2022 Collects status updates from all worker threads \u2502\n\u2502 \u2022 Displays progress, rates, and statistics \u2502\n\u2502 \u2022 Logs errors and warnings \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502\n \u25bc\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 RESULT PROCESSOR \u2502\n\u2502 \u2022 Processes file hashes from read threads \u2502\n\u2502 \u2022 Detects duplicate files \u2502\n\u2502 \u2022 Updates manifest and collision dictionaries \u2502\n\u2502 \u2022 Performs incremental saves every 50,000 files \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n \u2502\n \u25bc\n\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502 PERSISTENT STORAGE \u2502\n\u2502 \u2022 Manifest: Maps hash \u2192 list of files with that hash \u2502\n\u2502 \u2022 Collision DB: Tracks duplicate files \u2502\n\u2502 \u2022 SQLite-backed with disk caching (LRU-like eviction) \u2502\n\u2502 \u2022 Auto-saves every 50,000 files for crash recovery 
\u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n```\n\n### Pipeline Stages\n\n#### 1. **Walk Stage** (WalkThread)\n- **Purpose**: Discover files and directories in the source paths\n- **Thread Count**: 4 by default (configurable with `--walk-threads`)\n- **Output**: Adds directories to `walk_queue` and files to `work_queue`\n- **Filtering**: Applies extension filters and ignore patterns\n\n#### 2. **Read Stage** (ReadThread)\n- **Purpose**: Hash file contents to detect duplicates\n- **Thread Count**: 8 by default (configurable with `--read-threads`)\n- **Input**: Files from `work_queue`\n- **Output**: Tuple of (hash, size, mtime, filepath) to `result_queue`\n- **Algorithm**: `md5` or `xxh64` (configurable with `--hash-algo`)\n\n#### 3. **Result Processing** (ResultProcessor)\n- **Purpose**: Aggregate hashes and detect collisions\n- **Thread Count**: 1 (single-threaded for data consistency)\n- **Input**: Hash results from `result_queue`\n- **Output**: Updates `manifest` and `collisions` dictionaries\n- **Auto-save**: Incremental saves every 50,000 processed files\n\n#### 4. **Copy/Delete Stage** (CopyThread/DeleteThread)\n- **Purpose**: Perform file operations based on duplicate analysis\n- **Thread Count**: 8 by default (configurable with `--copy-threads`)\n- **Input**: Files from `copy_queue` or `delete_queue`\n- **Operations**:\n - Copy unique files to destination (with optional path rules)\n - Delete duplicate files (keeping first occurrence)\n\n#### 5. 
**Progress Monitoring** (ProgressThread)\n- **Purpose**: Centralized logging and status updates\n- **Thread Count**: 1 (collects events from all other threads)\n- **Features**:\n - Real-time progress with processing rates\n - Queue size monitoring (to detect bottlenecks)\n - Error aggregation and reporting\n - Final summary statistics\n\n### Data Structures\n\n#### Manifest\n- **Storage**: Disk-backed dictionary (SQLite + cache layer)\n- **Format**: `hash \u2192 [(filepath, size, mtime), ...]`\n- **Cache**: In-memory LRU cache (10,000 items default)\n- **Persistence**: Auto-saved during operation and on completion\n\n#### Disk Cache Dictionary\n- **Purpose**: Handle datasets larger than available RAM\n- **Backend**: SQLite with Write-Ahead Logging (WAL)\n- **Optimization**: Batched commits for performance\n- **Eviction**: LRU or random eviction when cache is full\n\n### Performance Characteristics\n\n**Queue-Based Throttling**: When queues exceed 100,000 items, the system introduces deliberate delays to prevent memory exhaustion.\n\n**Bottleneck Detection**: Progress thread monitors queue sizes:\n- **Walk queue growing**: Too many directories, consider reducing `--walk-threads`\n- **Work queue growing**: Hashing is slower than discovery, increase `--read-threads`\n- **Copy queue growing**: I/O is slower than hashing, increase `--copy-threads`\n\n**I/O Patterns**:\n- **Walk threads**: Mostly metadata operations (directory listings)\n- **Read threads**: Sequential reads of entire files\n- **Copy threads**: Large sequential writes\n\n### Thread Safety\n\n- **Queues**: Python's `queue.Queue` provides thread-safe operations\n- **Manifest**: Uses database-level locking (SQLite RLock)\n- **Save operations**: Coordinated via `save_event` to pause workers during saves\n\n## Installation\n\n### Via pip (recommended)\n\n```bash\npip install DedupeCopy\n```\n\n### With color support (optional)\n\nFor colored console output (errors in red, warnings in yellow, etc.):\n\n```bash\npip install DedupeCopy[color]\n```\n\n### With fast hashing (optional)\n\nFor faster file hashing using xxhash:\n```bash\npip install DedupeCopy[fast_hash]\n```\n\n### From source\n\n```bash\ngit clone https://github.com/othererik/dedupe_copy.git\ncd dedupe_copy\npip install -e .\n# Or with color support:\npip install -e .[color]\n```\n\n### Requirements\n\n- Python 3.11 or later\n- Sufficient disk space for manifest files (typically small, but can grow for very large file sets)\n- Optional: colorama for colored console output (installed with `[color]` extra)\n\n## Quick Start\n\n### Find duplicates in a directory\n\n```bash\ndedupecopy -p /path/to/search -r duplicates.csv\n```\n\nThis scans `/path/to/search` and creates a CSV report of all duplicate files.\n\n### Copy files while removing duplicates\n\n```bash\ndedupecopy -p /source/path -c /destination/path\n```\n\nThis copies all files from source to destination, skipping any duplicates.\n\n### Copy with manifest (recommended for large operations)\n\n```bash\ndedupecopy -p /source/path -c /destination/path -m manifest.db\n```\n\nCreates a manifest file that allows you to resume if interrupted. For example, if the operation is stopped, you can resume it with:\n\n```bash\ndedupecopy -p /source/path -c /destination/path -i manifest.db -m manifest.db\n```\n\n### Delete Duplicates\n\n```bash\ndedupecopy -p /path/to/search --delete --dry-run\n```\n\nThis will scan the specified path and show you which files would be deleted. 
Once you are sure, you can run the command again without `--dry-run` to perform the deletion.\n\n## Key Concepts\n\n### Manifests\n\nManifests are database files that store:\n- MD5 checksums of processed files\n- File metadata (size, modification time, path)\n- Which files have been scanned\n\n**Benefits:**\n- **Resumability**: If an operation is interrupted, you can resume it by loading the manifest. The tool will skip any files that have already been processed.\n- **Comparison**: Manifests can be used to compare file systems without re-scanning. For example, you can compare a source and a destination to see which files are missing from the destination.\n- **Incremental Backups**: By loading a manifest from a previous backup, you can process only new or modified files.\n- **Tracking**: Manifests keep a record of which files have been processed, which is useful for auditing and tracking.\n\n**Manifest Options:**\n- `-m manifest.db` - Save manifest after processing (output)\n- `-i manifest.db` - Load existing manifest before processing (input)\n- `--compare manifest.db` - Load manifest for duplicate checking only (does not copy those files)\n\n**Important Safety Rule:** You **cannot** use the same file path for both `-i` (input) and `-m` (output). This prevents accidental manifest corruption during operations.\n\n### Understanding `-i` vs `--compare`\n\nBoth `-i` and `--compare` use existing manifests to determine which files to skip, but they serve different purposes.\n\n#### `-i` / `--manifest-read-path` (Input Manifest)\n\n- **Purpose**: Resume an interrupted operation or continue from a previous run.\n- **Behavior**: Files in this manifest are considered \"already processed\" and are skipped. They **are** included in the output manifest.\n- **Use Case**: You started a large copy, it was interrupted, and you want to resume without re-scanning everything.\n\n```bash\n# Initial run (interrupted)\ndedupecopy -p /source -c /dest -m progress.db\n\n# Resume (skips files in progress.db)\ndedupecopy -p /source -c /dest -i progress.db -m progress_new.db\n```\n\n#### `--compare` (Comparison Manifest)\n\n- **Purpose**: Deduplicate against another location.\n- **Behavior**: Files in this manifest are treated as duplicates and are **not** copied. They are **not** included in the output manifest.\n- **Use Case**: You want to back up new photos from your phone, but you want to skip any photos that are already in your main archive.\n\n```bash\n# Incremental backup (skip files already in the main backup)\ndedupecopy -p /phone_backup -c /main_archive --compare main_archive.db -m phone_backup_new.db\n```\n\n#### Key Differences\n\n| Feature | `-i` (Input Manifest) | `--compare` (Comparison Manifest) |\n|------------------------------|------------------------------|-----------------------------------|\n| **Files Copied?** | No (already processed) | No (treated as duplicates) |\n| **Included in Output?** | Yes | No |\n| **Primary Use Case** | Resume Operations | Deduplicate Across Sources |\n| **Can use with same output?**| **No** (safety rule) | Yes |\n\n**Note on `--no-walk`**: When using `-i` or `--compare`, you can also use `--no-walk` to prevent the tool from scanning the source file system. This is useful when you want to operate *only* on the files listed in the manifests.\n\n### Duplicate Detection\n\nFiles are considered duplicates when:\n1. They have identical MD5 checksums\n2. They have the same file size\n\n**Special case:** Empty (zero-byte) files are treated as unique by default. 
Use `--dedupe-empty` to treat them as duplicates.\n\n## Usage Examples\n\n### Basic Operations\n\n#### 1. Generate a duplicate file report\n\n```bash\ndedupecopy -p /Users/johndoe -r dupes.csv -m manifest.db\n```\n\nCreates a CSV report of all duplicates and saves a manifest for future use.\n\n**With quiet output (minimal):**\n```bash\ndedupecopy -p /Users/johndoe -r dupes.csv -m manifest.db --quiet\n```\n\n**With verbose output (detailed progress):**\n```bash\ndedupecopy -p /Users/johndoe -r dupes.csv -m manifest.db --verbose\n```\n\n#### 2. Copy specific file types\n\n```bash\ndedupecopy -p /source -c /backup -e jpg -e png -e gif\n```\n\nCopy only image files (jpg, png, gif) to the backup directory.\n\n### Photo Organization\n\n#### Organize photos by date\n\n```bash\ndedupecopy -p C:\\pics -p D:\\pics -e jpg -R \"jpg:mtime\" -c X:\\organized_photos\n```\n\nCopies all JPG files from C: and D: drives, organizing them into folders by year/month (e.g., `2024_03/`). See the \"Path and Extension Rules\" section for more on combining rules.\n\n### Multi-Source Consolidation\n\n#### Copy from multiple sources to single destination\n\n```bash\ndedupecopy -p /source1 -p /source2 -p /source3 -c /backup -m backup_manifest.db\n```\n\nScans all three source paths and copies unique files to backup.\n\n#### Resume an interrupted copy\n\n```bash\ndedupecopy -p /source -c /destination -i manifest.db -m manifest.db\n```\n\nLoads the previous manifest and resumes where it left off.\n\n### Advanced Pattern Matching\n\n#### Ignore specific patterns\n\n```bash\ndedupecopy -p /source -c /backup --ignore \"*.tmp\" --ignore \"*.cache\" --ignore \"**/Thumbs.db\"\n```\n\nExcludes temporary files and thumbnails from processing.\n\n#### Extension-specific rules\n\n```bash\ndedupecopy -p /media -c /organized \\\n -R \"*.jpg:mtime\" \\\n -R \"*.mp4:extension\" \\\n -R \"*.doc*:no_change\"\n```\n\nDifferent organization rules for different file types.\n\n## Command-Line Options\n\n### Required Options (one of)\n\n| Option | Description |\n|--------|-------------|\n| `-p PATH`, `--read-path PATH` | Source path(s) to scan. Can be specified multiple times. |\n| `--no-walk` | Skip file system walk; use paths from loaded manifest only. |\n\n### Core Options\n\n| Option | Description |\n|--------|-------------|\n| `-c PATH`, `--copy-path PATH` | Destination path for copying files. Cannot be used with `--delete`. |\n| `--delete` | Deletes duplicate files, keeping the first-seen file. Cannot be used with `--copy-path`. |\n| `-r PATH`, `--result-path PATH` | Output path for CSV duplicate report. |\n| `-m PATH`, `--manifest-dump-path PATH` | Path to save manifest file. |\n| `-i PATH`, `--manifest-read-path PATH` | Path to load existing manifest. Can be specified multiple times. |\n| `-e EXT`, `--extensions EXT` | File extension(s) to include (e.g., `jpg`, `*.png`). Can be specified multiple times. |\n| `--ignore PATTERN` | File pattern(s) to exclude (supports wildcards). Can be specified multiple times. |\n| `-R RULE`, `--path-rules RULE` | Path restructuring rule(s) in format `extension:rule`. Can be specified multiple times. |\n\n### Special Options\n\n| Option | Description |\n|--------|-------------|\n| `--compare PATH` | Load manifest but don't copy its files (for comparison only). This is useful for excluding files that are already present in a destination or another source. Can be specified multiple times. |\n| `--copy-metadata` | Preserve file timestamps and permissions (uses `shutil.copy2` instead of `copyfile`). 
|\n| `--dedupe-empty` | Treat empty (0-byte) files as duplicates rather than unique. |\n| `--ignore-old-collisions` | Only detect new duplicates (ignore duplicates already in loaded manifest). |\n| `--dry-run` | Simulate operations without making any changes to the filesystem. |\n| `--min-delete-size BYTES` | Minimum size of a file to be considered for deletion (e.g., `1048576` for 1MB). Default: `0`. |\n| `--delete-on-copy` | Deletes source files after a successful copy. Requires `--copy-path` and `-m`. WARNING: this will consider duplicated objects as copied and remove them. |\n\n### Output Control Options\n\n| Option | Description |\n|--------|-------------|\n| `-q`, `--quiet` | Show only warnings and errors (minimal output). |\n| `-v`, `--verbose` | Show detailed progress information (same as normal, kept for compatibility). |\n| `--debug` | Show debug information including queue states and internal diagnostics. |\n| `--no-color` | Disable colored output (useful for logging to files or non-terminal output). |\n\n\n**Output Verbosity Levels:**\n- **Normal** (default): Standard progress updates, errors, and summaries.\n- **Quiet** (`--quiet`): Only warnings, errors, and the final summary.\n- **Verbose** (`--verbose`): More frequent progress updates that include processing rates and timing details.\n- **Debug** (`--debug`): All output including queue states and internal operations for troubleshooting.\n\n### Performance Options\n\n| Option | Default | Description |\n|--------|---------|-------------|\n| `--walk-threads N` | 4 | Number of threads for file system traversal. |\n| `--read-threads N` | 8 | Number of threads for reading and hashing files. |\n| `--copy-threads N` | 8 | Number of threads for copying files. |\n| `--hash-algo ALGO` | `md5` | Hashing algorithm to use (`md5` or `xxh64`). `xxh64` requires the `fast_hash` extra. |\n\n### Path Conversion Options\n\n| Option | Description |\n|--------|-------------|\n| `--convert-manifest-paths-from PREFIX` | Original path prefix in manifest to replace. |\n| `--convert-manifest-paths-to PREFIX` | New path prefix (useful when drive letters or mount points change). |\n\n## Path and Extension Rules\n\nThis section explains how to control file selection and organization using extension filters (`-e`), ignore patterns (`--ignore`), and path restructuring rules (`-R`).\n\n### Filtering Files by Extension\n\nUse the `-e` or `--extensions` option to specify which file types to process.\n\n- **`jpg`**: Matches `.jpg` files.\n- **`*.jp*g`**: Matches `.jpg`, `.jpeg`, `.jpng`, etc.\n- **`*`**: Matches all extensions.\n\nIf no `-e` option is provided, all files are processed by default.\n\n### Ignoring Files and Directories\n\nUse `--ignore` to exclude files or directories that match a specific pattern.\n\n- **`\"*.tmp\"`**: Ignores all files with a `.tmp` extension.\n- **`\"**/Thumbs.db\"`**: Ignores `Thumbs.db` files in any directory.\n- **`\"*.cache\"`**: Ignores all files ending in `.cache`.\n\n### Restructuring Destination Paths\n\nPath rules (`-R` or `--path-rules`) determine how files are organized in the destination directory. The format is `pattern:rule`.\n\n**Default Behavior:** If no `-R` flag is specified, the original directory structure is preserved (equivalent to `-R \"*:no_change\"`). 
### Path Conversion Options

| Option | Description |
|--------|-------------|
| `--convert-manifest-paths-from PREFIX` | Original path prefix in manifest to replace. |
| `--convert-manifest-paths-to PREFIX` | New path prefix (useful when drive letters or mount points change). |

## Path and Extension Rules

This section explains how to control file selection and organization using extension filters (`-e`), ignore patterns (`--ignore`), and path restructuring rules (`-R`).

### Filtering Files by Extension

Use the `-e` or `--extensions` option to specify which file types to process.

- **`jpg`**: Matches `.jpg` files.
- **`*.jp*g`**: Matches `.jpg`, `.jpeg`, and similar variants.
- **`*`**: Matches all extensions.

If no `-e` option is provided, all files are processed by default.

### Ignoring Files and Directories

Use `--ignore` to exclude files or directories that match a specific pattern.

- **`"*.tmp"`**: Ignores all files with a `.tmp` extension.
- **`"**/Thumbs.db"`**: Ignores `Thumbs.db` files in any directory.
- **`"*.cache"`**: Ignores all files ending in `.cache`.
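
The two mechanisms compose: a file is processed only if it passes the `-e` filter and matches no `--ignore` pattern. A sketch with illustrative paths:

```bash
# Keep only photo formats, but skip editor temp files and Windows thumbnails
dedupecopy -p /photos -c /backup \
    -e jpg -e "*.jp*g" -e png \
    --ignore "*.tmp" --ignore "**/Thumbs.db"
```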
### Restructuring Destination Paths

Path rules (`-R` or `--path-rules`) determine how files are organized in the destination directory. The format is `pattern:rule`.

**Default Behavior:** If no `-R` flag is specified, the original directory structure is preserved (equivalent to `-R "*:no_change"`). This is the most intuitive behavior for backup and copy operations.

#### Available Rules

| Rule | Description | Example Output |
|-------------|----------------------------------------------------------|--------------------------------|
| `no_change` | **[DEFAULT]** Preserves the original directory structure | `/dest/original/path/file.jpg` |
| `mtime` | Organizes by modification date (`YYYY_MM`) | `/dest/2024_03/file.jpg` |
| `extension` | Organizes into folders by file extension | `/dest/jpg/file.jpg` |

#### Combining Rules

Rules are applied in the order they are specified, creating nested directories.

```bash
# Organize first by extension, then by date
dedupecopy -p /source -c /backup -R "*:extension" -R "*:mtime"
```
**Result:** `/backup/jpg/2024_03/photo.jpg`

#### Pattern Matching for Rules

The `pattern` part of the rule determines which files the rule applies to. It supports wildcards, just like the `-e` filter.

- **`"*.jpg:mtime"`**: Applies the `mtime` rule only to JPG files.
- **`"*.jp*g:mtime"`**: Applies to `.jpg`, `.jpeg`, etc.
- **`"*:no_change"`**: Applies the `no_change` rule to all files.

The most specific pattern wins if multiple patterns could match a file.

#### Example: Different Rules for Different Files

```bash
dedupecopy -p /media -c /organized \
    -R "*.jpg:mtime" \
    -R "*.mp4:extension" \
    -R "*.pdf:no_change"
```
- **JPG files** are organized by date.
- **MP4 files** are organized into an `mp4` folder.
- **PDF files** keep their original directory structure.

## Advanced Workflows

### Sequential Multi-Source Backup

When consolidating from multiple sources to a single target while avoiding duplicates between sources:

#### Step 1: Create manifests for all locations

```bash
# Scan target (if it has existing files)
dedupecopy -p /backup/target -m target_manifest.db

# Scan each source
dedupecopy -p /source1 -m source1_manifest.db
dedupecopy -p /source2 -m source2_manifest.db
```

#### Step 2: Copy each source sequentially

```bash
# Copy source1 (skip files already in target or source2)
dedupecopy -p /source1 -c /backup/target \
    --compare target_manifest.db \
    --compare source2_manifest.db \
    -m target_v1.db \
    --no-walk

# Copy source2 (skip files already in target or source1)
dedupecopy -p /source2 -c /backup/target \
    --compare target_v1.db \
    --compare source1_manifest.db \
    -m target_v2.db \
    --no-walk
```

**How it works:**
- `--no-walk` skips re-scanning the filesystem and relies on paths already recorded in the loaded manifests
- `--compare` loads manifests for duplicate checking but doesn't copy those files
- Each source is copied only if its files aren't already in the target or the other sources
- Each step creates a new manifest tracking what's been copied so far

**Note:** We use `--compare` instead of `-i` because:
- `-i` + `-m` cannot use the same file path (safety rule)
- `--compare` is designed for exactly this use case (deduplication across sources)
- The source manifests are used with `--no-walk` to avoid re-scanning

### Manifest Path Conversion

If drive letters or mount points change between runs:

```bash
dedupecopy -i old_manifest.db -m new_manifest.db \
    --convert-manifest-paths-from "/Volumes/OldDrive" \
    --convert-manifest-paths-to "/Volumes/NewDrive" \
    --no-walk
```

Updates all paths in the manifest without re-scanning files.

### Incremental Backup

The most common use case for incremental backups is to copy new files from a source to a destination, skipping files that are already in the destination.

#### Step 1: Create a manifest of the destination

First, create a manifest of your destination directory. This gives you a record of what's already there.

```bash
dedupecopy -p /path/to/backup -m backup.db
```

#### Step 2: Run the incremental copy

Now, copy new files from your source, using `--compare` to skip duplicates that are already in the backup.

```bash
dedupecopy -p /path/to/source -c /path/to/backup --compare backup.db -m backup_new.db
```

**How it works:**
- `--compare` efficiently checks for duplicates without re-scanning the entire destination.
- Only new files from the source are copied.
- A new manifest (`backup_new.db`) is created, which includes both the old and new files. You can use this for the next incremental backup (a scripted version is sketched below).
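
This pattern can be scripted. Below is a minimal sketch of a nightly job; the date-stamped manifest names and the `latest` symlink convention are our own invention for illustration, not something the tool provides:

```bash
#!/bin/sh
# Hypothetical nightly incremental backup (all paths and names are illustrative)
STAMP=$(date +%Y%m%d)
MANIFESTS=/backups/manifests

# Copy only files not already recorded in the previous manifest
dedupecopy -p /data -c /backups/data \
    --compare "$MANIFESTS/latest.db" \
    -m "$MANIFESTS/backup_$STAMP.db" --quiet

# Point "latest" (and its .read companion) at the new manifest for the next run
ln -sf "backup_$STAMP.db" "$MANIFESTS/latest.db"
ln -sf "backup_$STAMP.db.read" "$MANIFESTS/latest.db.read"
```

For the very first run, `latest.db` must exist: seed it with a scan of the destination as in Step 1 above.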
#### Example: Golden Directory Backup

This is useful for maintaining a "golden" directory with unique files from multiple sources.

```bash
# 1. Create a manifest of the golden directory
dedupecopy -p /golden_dir -m golden.db

# 2. Copy new, unique files from multiple sources
dedupecopy -p /source1 -p /source2 -c /golden_dir --compare golden.db -m golden_new.db
```

### Comparison Without Copying

Compare two directories to find what's different:

```bash
# Scan both locations
dedupecopy -p /location1 -m manifest1.db
dedupecopy -p /location2 -m manifest2.db

# Generate report of files in location1 not in location2
dedupecopy -p /location1 -i manifest1.db --compare manifest2.db -r unique_files.csv --no-walk
```

## Performance Tips

### Thread Count Tuning

**Default settings (4/8/8)** work well for most scenarios.

**For SSDs/NVMe:**
```bash
--walk-threads 8 --read-threads 16 --copy-threads 16
```

**For HDDs:**
```bash
--walk-threads 2 --read-threads 4 --copy-threads 4
```

**For network shares:**
```bash
--walk-threads 2 --read-threads 4 --copy-threads 2
```
Network latency makes more threads counterproductive.

### Large File Sets

For very large directories (millions of files):

1. **Use manifests** - Essential for resumability
2. **Process in batches** - Use `--ignore` to exclude subdirectories and process them separately (see the sketch after this list)
3. **Monitor memory** - Manifests use disk-based caching to minimize memory usage
4. **Incremental saves** - Manifests auto-save every 50,000 files
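
A sketch of the batching approach from item 2, assuming a large `archive/` subtree that is deferred to a second pass (names and the exact ignore pattern are illustrative, following the wildcard examples above):

```bash
# Pass 1: everything except the archive subtree
dedupecopy -p /data --ignore "*/archive/*" -c /backup -m pass1.db

# Pass 2: the archive subtree, deduplicated against pass 1
dedupecopy -p /data/archive -c /backup --compare pass1.db -m pass2.db
```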
### Network Considerations

- **Network paths may time out** - Tool retries after 3 seconds
- **SMB/CIFS shares** - Use lower thread counts
- **Bandwidth limits** - Reduce copy threads to avoid saturation
- **VPN connections** - May need much lower thread counts

### Manifest Storage

- Manifest files are stored as SQLite database files
- Size is proportional to the number of unique files (typically a few MB per 100k files)
- Keep manifests on fast storage (SSD) for best performance
- Manifests are incrementally saved every 50,000 processed files

## Logging and Output Control

### Verbosity Levels

DedupeCopy provides flexible output control to suit different use cases:

#### Normal Mode (Default)
Standard output with progress updates every 1,000 files:

```bash
dedupecopy -p /source -c /destination
```

**Output includes:**
- Pre-flight configuration summary
- Progress updates with file counts and processing rates
- Error messages with helpful suggestions
- Final summary statistics

#### Quiet Mode
Minimal output - only warnings, errors, and final results:

```bash
dedupecopy -p /source -c /destination --quiet
```

**Best for:**
- Cron jobs and automated scripts
- When you only care about problems
- Reducing log file sizes

#### Verbose Mode
Detailed progress information:

```bash
dedupecopy -p /source -c /destination --verbose
```

**Output includes:**
- All normal mode output
- More frequent progress updates
- Detailed timing and rate information

#### Debug Mode
Comprehensive diagnostic information:

```bash
dedupecopy -p /source -c /destination --debug
```

**Output includes:**
- All verbose mode output
- Queue sizes and internal state
- Thread activity details
- Useful for troubleshooting performance issues

### Color Output

By default, DedupeCopy uses colored output when writing to a terminal (if colorama is installed):

- **Errors**: Red text
- **Warnings**: Yellow text
- **Info messages**: Default color
- **Debug messages**: Cyan text

To disable colors (e.g., when logging to a file):

```bash
dedupecopy -p /source -c /destination --no-color
```

Colors are automatically disabled when output is redirected to a file or pipe.

### Enhanced Features

#### Pre-Flight Summary
Before starting operations, you'll see a summary of the configuration:

```
======================================================================
DEDUPE COPY - Operation Summary
======================================================================
Source path(s): 2 path(s)
 - /Volumes/Source1
 - /Volumes/Source2
Destination: /Volumes/Backup
Extension filter: jpg, png, gif
Path rules: *.jpg:mtime
Threads: walk=4, read=8, copy=8
Options: dedupe_empty=False, preserve_stat=True, no_walk=False
======================================================================
```

#### Progress with Rates
During operation, you'll see processing rates:

```
Discovered 5000 files (dirs: 250), accepted 4850. Rate: 142.3 files/sec
Work queue has 234 items. Progress queue has 12 items. Walk queue has 5 items.
...
Copied 4800 items. Skipped 50 items. Rate: 125.7 files/sec
```

#### Helpful Error Messages
Errors include context and suggestions:

```
Error processing '/path/to/file.txt': [PermissionError] Permission denied
  Suggestions: Check file permissions; Ensure you have read access to source files
```

#### Proactive Warnings
The tool warns about potential issues before they become problems:

```
WARNING: Work queue is large (42000 items). Consider reducing thread counts to avoid memory issues.
WARNING: Progress queue is backing up (12000 items). This may indicate slow processing.
```

### Examples

#### Silent operation for scripts
```bash
dedupecopy -p /source -c /backup --quiet 2>&1 | tee backup.log
```

#### Maximum detail for troubleshooting
```bash
dedupecopy -p /source -c /backup --debug --no-color > debug.log 2>&1
```

#### Normal operation with color
```bash
dedupecopy -p /source -c /backup --verbose
```

## Troubleshooting

### Common Issues

#### "Directory disappeared during walk"

**Cause:** Network path timeout or files deleted during scan.

**Solution:**
- Reduce `--walk-threads` for network paths
- Ensure a stable network connection
- Exclude volatile directories with `--ignore`

#### Out of Memory Errors

**Cause:** Very large queue sizes.

**Solution:**
- Reduce thread counts
- Process in smaller batches
- Ensure sufficient swap space

#### Permission Errors

**Cause:** Insufficient permissions on source or destination.

**Solution:**
```bash
# Check permissions
ls -la /source/path
ls -la /destination/path

# Run with appropriate user or use sudo (carefully!)
```

#### Resuming Interrupted Runs

If a run is interrupted:

```bash
# Resume using the manifest (note: -i and -m must be different files)
dedupecopy -p /source -c /destination -i manifest.db -m manifest_updated.db
```

Files already processed (in the manifest) are skipped.

#### Manifest Corruption

If manifest files become corrupted:

**Solution:**
- Delete the manifest files and restart
- Manifest files: `manifest.db` and `manifest.db.read`
- Consider keeping backup copies of manifests for very long operations

### Getting Help

Check the output during a run:
- Progress updates every 1,000 files with processing rates
- Error messages show problematic files with helpful suggestions
- Warnings alert you to potential issues proactively
- Final summary shows counts and errors

For debugging, use `--debug` mode:

```bash
dedupecopy -p /source -c /destination --debug --no-color > debug.log 2>&1
```

Debug output includes:
- File counts and progress with timing
- Queue sizes and internal state (useful if growing unbounded)
- Thread activity and performance metrics
- Specific error messages with file paths and suggestions

## Safety and Best Practices

### ⚠️ Important Warnings

1. **Test first**: Run with `-r` (report only) before using `-c` (copy) on important data
2. **Back up important data**: Always have backups before restructuring
3. **Use manifests**: They provide a record of what was processed
4. **Verify results**: Check file counts and spot-check files after copy operations
5. **Watch disk space**: Ensure sufficient space on the destination

### Manifest Safety Rules

To prevent accidental data loss, DedupeCopy enforces the following rules for manifest usage:

1. **Destructive Operations Require an Output Manifest**: Any operation that modifies the set of files being tracked (e.g., `--delete`, `--delete-on-copy`) **requires** the `-m`/`--manifest-dump-path` option. This ensures that the results of the operation are saved to a new manifest, preserving the original.

2. **Input and Output Manifests Must Be Different**: To protect your original manifest, you cannot use the same file path for both `-i`/`--manifest-read-path` and `-m`/`--manifest-dump-path`. This prevents the input manifest from being overwritten.

**Example of a safe delete operation:**
```bash
# Load an existing manifest and save the changes to a new one
dedupecopy --no-walk --delete -i manifest_original.db -m manifest_after_delete.db
```

### Recommended Workflow

```bash
# Step 1: Generate a report to understand what will happen
dedupecopy -p /source -r preview.csv -m preview_manifest.db

# Step 2: Review the CSV report
# Check duplicate counts, file types, sizes

# Step 3: Run the actual copy with the manifest
dedupecopy -p /source -c /destination -i preview_manifest.db -m final_manifest.db

# Step 4: Verify
# Check file counts, spot-check files, verify important files copied
```

### What Gets Copied / Deleted

- The **first occurrence** of each unique file (by content hash; MD5 by default) is kept.
- Subsequent identical files are skipped during a copy, or removed when `--delete` is used.
- Files are considered unique if their hashes differ.
- By default, empty files are treated as unique. Use `--dedupe-empty` to treat them as duplicates.
- Files matching `--ignore` patterns are never copied or deleted.

### What Doesn't Get Copied / Deleted

- The first-seen version of a file is never deleted.
- Files matching `--ignore` patterns.
- Files listed in `--compare` manifests (used for comparison only).
- Extensions not matching the `-e` filter (if specified).

### Preserving Metadata

By default, only file contents are copied. To preserve timestamps and permissions:

```bash
dedupecopy -p /source -c /destination --copy-metadata
```

This uses Python's `shutil.copy2()`, which preserves:
- Modification time
- Access time
- File mode (permissions)

**Note:** Not all metadata may transfer across different file systems.

## Output Files

### CSV Duplicate Report

Format: `Collision #, MD5, Path, Size (bytes), mtime`

```csv
Src: ['/source/path']
Collision #, MD5, Path, Size (bytes), mtime
1, d41d8cd98f00b204e9800998ecf8427e, '/path/file1.jpg', 1024, 1633024800.0
1, d41d8cd98f00b204e9800998ecf8427e, '/path/file2.jpg', 1024, 1633024800.0
2, a3d5c12f8b9e4a1c2d3e4f5a6b7c8d9e, '/path/doc1.pdf', 2048, 1633111200.0
2, a3d5c12f8b9e4a1c2d3e4f5a6b7c8d9e, '/path/doc2.pdf', 2048, 1633111200.0
```
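
Because the report is plain CSV, it can be summarized with standard tools. A rough sketch (assumes paths contain no commas, skips the two header lines, and reads column 1 as the collision group and column 4 as the size in bytes):

```bash
# Estimate how much space duplicates are consuming
tail -n +3 dupes.csv | awk -F', ' '
    { count[$1]++; size[$1] = $4 }
    END {
        for (g in count) {
            dups  += count[g] - 1              # extra copies in each collision group
            bytes += (count[g] - 1) * size[g]  # their combined size
        }
        printf "duplicate files: %d, reclaimable bytes: %d\n", dups, bytes
    }'
```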
### Manifest Files

Binary database files (not human-readable):
- `manifest.db` - MD5 hashes and file metadata
- `manifest.db.read` - List of processed file paths

These enable resuming and incremental operations.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the Simplified BSD License.

## Project Links

- **GitHub**: https://github.com/othererik/dedupe_copy
- **PyPI**: https://pypi.org/project/DedupeCopy/

## Author

Erik Schweller (othererik@gmail.com)

---

**Status**: Tested and seems to work, but use with caution and always back up important data!
"bugtrack_url": null,
"license": "BSD-2-Clause",
"summary": "Find duplicates / copy and restructure file layout command-line tool",
"version": "1.1.8",
"project_urls": {
"Homepage": "https://pypi.python.org/pypi/DedupeCopy/",
"Repository": "https://www.github.com/othererik/dedupe_copy"
},
"split_keywords": [
"de-duplication",
" file management",
" deduplication",
" file-copy",
" backup",
" cleanup",
" disk-space",
" file-organization",
" command-line",
" hashing",
" manifest"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "ca6c27260ebffca5e36f69db2f1c5d638ec31f526cea4b104efd9500c9a6b856",
"md5": "93c0203736ecf2c7d89e35e7749c352d",
"sha256": "4195f52093b1fc82cb31ba3ea4234f5a7bcd2c82eb1435de19e184818ba3b2d5"
},
"downloads": -1,
"filename": "dedupecopy-1.1.8-py3-none-any.whl",
"has_sig": false,
"md5_digest": "93c0203736ecf2c7d89e35e7749c352d",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.11",
"size": 121750,
"upload_time": "2025-10-23T02:13:43",
"upload_time_iso_8601": "2025-10-23T02:13:43.878905Z",
"url": "https://files.pythonhosted.org/packages/ca/6c/27260ebffca5e36f69db2f1c5d638ec31f526cea4b104efd9500c9a6b856/dedupecopy-1.1.8-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "5f79f4cf230b98ae9156e461153366580a5f6551cd8c9474798cb6601c00d908",
"md5": "675bb303778d29c93e50f918ee38a589",
"sha256": "7a71b5132f2fb41481af82041ae4cce2f6e4ff196ea3b4cdb92b113124155310"
},
"downloads": -1,
"filename": "dedupecopy-1.1.8.tar.gz",
"has_sig": false,
"md5_digest": "675bb303778d29c93e50f918ee38a589",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.11",
"size": 125258,
"upload_time": "2025-10-23T02:13:45",
"upload_time_iso_8601": "2025-10-23T02:13:45.394771Z",
"url": "https://files.pythonhosted.org/packages/5f/79/f4cf230b98ae9156e461153366580a5f6551cd8c9474798cb6601c00d908/dedupecopy-1.1.8.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-23 02:13:45",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "othererik",
"github_project": "dedupe_copy",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "colorama",
"specs": []
},
{
"name": "xxhash",
"specs": []
}
],
"lcname": "dedupecopy"
}