uneff


Nameuneff JSON
Version 1.0.1 PyPI version JSON
download
home_pagehttps://github.com/mkiiim/uneff
SummaryRemove BOM and problematic Unicode characters from text files
upload_time2025-08-06 11:54:47
maintainerNone
docs_urlNone
authorMark
requires_python>=3.7
licenseMIT
keywords unicode bom text-processing csv-cleaning data-cleaning file-processing
VCS
bugtrack_url
requirements No requirements were recorded.
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # Uneff

A versatile tool to remove BOM (Byte Order Mark) and problematic Unicode characters from text files and streams.

## Overview

Uneff is designed to clean text files by removing BOM markers and other invisible Unicode characters that can cause issues when processing data files. It's especially useful for:

- Cleaning CSV files before data processing
- Fixing encoding issues in text files
- Removing invisible control characters that break parsers
- Normalizing text data from various sources

I created this for use in my own personal project(s) and am including it here in case someone may find it useful as a starting point for their own data processing pipeline issues and challenges.

## ⚠️ Warning and Disclaimer

### Risk of Unintended Consequences

**IMPORTANT**: Modifying text data by removing or replacing characters can have unintended consequences. While Uneff is designed to help with common data cleaning challenges, improper use may introduce new issues into your data.

- **Always create backups** of your original files before processing
- **Always validate** the results after processing to ensure data integrity
- **Never use** on production data without thorough testing
- **Be especially cautious** when working with:
  - Internationalized text
  - Complex data formats (like JSON, XML)
  - Files with custom encoding
  - Critical data where character precision matters

### No Warranty

This tool is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.

### Recommended Workflow

1. **Analyze first**: Use the `--analyze` flag to identify issues without modifying files
2. **Test on samples**: Process a small sample before running on large datasets
3. **Verify results**: Implement verification checks after processing
4. **Document changes**: Keep records of what was processed and why
5. **Limit scope**: Only process the specific characters you know are problematic

## Features

- Removes UTF-8, UTF-16, and UTF-32 BOM markers
- Handles invisible and problematic Unicode characters
- Supports replacing characters instead of just removing them
- Works with in-memory content (no direct file I/O required)
- Configurable via a simple CSV file - no code changes needed
- Can be used as a command-line tool or as a library in your projects
- Available in both Python and JavaScript versions
- Preserves original files by creating clean copies
- Detailed reporting of changes made
- Precise location reporting of problematic characters (line, column, and absolute position)

## Installation

### Python Package (Recommended)

Install from PyPI:
```bash
pip install uneff
```

### Install from Git

Install the latest development version:
```bash
pip install git+https://github.com/mkiiim/uneff.git
```

### Download Individual Files

#### Python Script
1. Download `uneff.py` to your project directory
2. Make it executable (optional, for Unix-like systems):
   ```bash
   chmod +x uneff.py
   ```

#### JavaScript Version
1. Download `uneff.js` to your project directory
2. Ensure you have Node.js installed

## Usage

### Command Line

#### Python Package

```bash
uneff myfile.csv
```

#### Python Script

```bash
python uneff.py myfile.csv
```

#### Options

```bash
# Using package
uneff myfile.csv [-m mappings.csv] [-o output.csv] [-q] [-a]

# Using script  
python uneff.py myfile.csv [-m mappings.csv] [-o output.csv] [-q] [-a]
```

- `-m, --mapping`: Path to custom character mappings file
- `-o, --output`: Path to save cleaned file (default: adds uneffd_ prefix)
- `-q, --quiet`: Suppress status messages
- `-a, --analyze`: Only analyze file without cleaning

#### JavaScript Version

```bash
node uneff.js myfile.csv
```

Options:
```
node uneff.js myfile.csv [-m|--mapping mappings.csv] [-o|--output output.csv] [-q|--quiet] [-a|--analyze]
```

### In Your Code

#### Python Version

```python
import uneff

# Clean a file
uneff.clean_file('myfile.csv')

# Clean text directly without file I/O
dirty_text = 'Hello�World with invisible\u200bspaces'
clean_text = uneff.clean_text(dirty_text)

# Use with custom options
uneff.clean_file(
    file_path='myfile.csv',
    mapping_file='custom_mappings.csv',
    output_path='cleaned.csv',
    verbose=False
)

# Working with in-memory content (storage-agnostic)
content_bytes = get_content_from_your_system()  # Your custom storage API
cleaned_content, char_counts = uneff.clean_content(content_bytes)
save_content_to_your_system(cleaned_content)  # Your custom storage API

# Analyze file without modifying it
analysis = uneff.analyze_file('myfile.csv')
print(f"Found {analysis['problematic_char_count']} problematic characters")
```

#### JavaScript Version

```javascript
const uneff = require('./uneff');

// Clean a file
uneff.cleanFile('myfile.csv');

// Clean text directly without file I/O
const dirtyText = 'Hello�World with invisible\u200bspaces';
const cleanText = uneff.cleanText(dirtyText);

// Use with custom options
uneff.cleanFile(
    'myfile.csv',
    'custom_mappings.csv',
    'cleaned.csv',
    false  // verbose mode off
);

// Working with in-memory content (storage-agnostic)
const contentBytes = getContentFromYourSystem();  // Your custom storage API
const [cleanedContent, charCounts] = uneff.cleanContent(contentBytes);
saveContentToYourSystem(cleanedContent);  // Your custom storage API

// Analyze file without modifying it
const analysis = uneff.analyzeFile('myfile.csv');
console.log(`Found ${analysis.problematic_char_count} problematic characters`);
```

## Configuring Problematic Characters

On first run, Uneff creates a default `uneff_mappings.csv` file with common problematic characters. You can edit this file to customize how characters are handled:

| Character | Unicode  | Name                     | Remove | Replacement |
|-----------|----------|--------------------------|--------|-------------|
| �         | \ufffd   | Replacement Character    | True   |             |
| [empty]   | \u0000   | NULL                     | True   |             |
| [empty]   | \u2028   | Line Separator           | True   | " "         |
| [empty]   | \u2029   | Paragraph Separator      | True   | "\n"        |
| [empty]   | \u200b   | Zero Width Space         | True   |             |
|           | \ufeff    | BOM                       | True   |             |
| ...       | ...      | ...                      | ...    | ...         |

- To stop processing a specific character, change `True` to `False`
- To add new characters, add a new row with the appropriate Unicode escape sequence
- Add a replacement character in the "Replacement" column to replace instead of remove
- If "Remove" is True and "Replacement" is empty, the character is removed completely

### Notes on the Replacement Field

- **Do not add quotes** around replacement values in the CSV file unless they contain special characters
- Leave empty to remove the character completely
- For whitespace characters:
  - Use `\n` for newline
  - Use `\t` for tab
  - Use a space character for space
- For multiple characters, simply type them as-is (e.g., `-->`)

#### Special Characters in Replacements

When your replacement contains special characters like commas, you must use quotes in the CSV:

| Character | Unicode  | Name              | Remove | Replacement |
|-----------|----------|-------------------|--------|-------------|
| [empty]   | \u2063   | Invisible Separator | True  | ","         |

The CSV should look like this in a text editor:
```
Character,Unicode,Name,Remove,Replacement
,\u2063,Invisible Separator,True,","
```

Without the quotes, the comma would be interpreted as a new field separator.

#### Additional Examples

| Replacement For | Raw Text in CSV | Notes |
|-----------------|-----------------|-------|
| Comma | "," | Must be quoted |
| Quote | """" | Double quotes to escape |
| Tab character | \t | Escape sequence |
| Newline | \n | Escape sequence |
| Comma + text | ",.txt" | Must be quoted |

## What Characters are Processed?

By default, Uneff handles these types of characters:

- BOM (Byte Order Mark) characters
- Replacement Character (�)
- Control characters (NULL, SUB, FS, GS, RS, US)
- Zero-width spaces and joiners
- Bidirectional text control characters
- Line and paragraph separators (now replaced with spaces and newlines by default)
- Other invisible formatting characters

## Advanced Analysis Features

The analysis functions provide detailed information about problematic characters in your text:

```python
# Analyze a file and get detailed report
analysis = uneff.analyze_file('myfile.csv')

# Get total count of problematic characters
print(f"Found {analysis['problematic_char_count']} problematic characters")

# Get information about each type of problematic character
for detail in analysis['character_details']:
    print(f"Character: {detail['name']} [Unicode: {detail['unicode']}]")
    print(f"Count: {detail['count']} instances")
    
    # Print locations of up to 10 instances
    for location in detail['sample_locations']:
        print(f"  Found at Line {location['line']}, Column {location['column']}")
        print(f"  Context: ...{location['context']}...")
```

In JavaScript:

```javascript
// Analyze a file and get detailed report
const analysis = uneff.analyzeFile('myfile.csv');

// Get total count of problematic characters
console.log(`Found ${analysis.problematic_char_count} problematic characters`);

// Get information about each type of problematic character
for (const detail of analysis.character_details) {
    console.log(`Character: ${detail.name} [Unicode: ${detail.unicode}]`);
    console.log(`Count: ${detail.count} instances`);
    
    // Print locations of up to 10 instances
    for (const location of detail.sample_locations) {
        console.log(`  Found at Line ${location.line}, Column ${location.column}`);
        console.log(`  Context: ...${location.context}...`);
    }
}
```

The analysis results include:
- Exact line and column positions of each problematic character
- Absolute position in the file
- Surrounding context to help identify issues
- BOM detection and encoding information
- Character counts and statistics

## First-Run Behavior

When running Uneff for the first time:

1. If a mapping file doesn't exist, Uneff automatically creates one with default settings.
2. This happens seamlessly, allowing you to immediately process files without errors.
3. The default mapping file (`uneff_mappings.csv`) is created in the same directory as the script.

## Why Process These Characters?

### Common Problems Caused by Invisible Characters

1. **Parser Failures**: Many data processing systems and parsers fail when encountering unexpected Unicode characters
2. **Data Processing Errors**: Characters like zero-width spaces break tokenization, field detection, and data extraction
3. **Database Import Issues**: ETL processes often reject data with control characters
4. **Inconsistent Behavior**: The same file might work in one system but fail in another
5. **Hard-to-Debug Problems**: Since these characters are invisible, problems they cause can be extremely difficult to diagnose

### Specific Use Cases

- **CSV Processing**: BOM markers can cause the first column name to be misread
- **Data Analysis**: Invisible characters can cause misalignment in data processing
- **API Integrations**: Data passed between systems can accumulate problematic characters
- **Text Mining**: Control characters can corrupt text analysis
- **Custom Storage Systems**: Applications with their own storage systems can now leverage Uneff's capabilities

## Integration Options

### For Standard File Processing

Use the traditional methods with direct file I/O:

```python
uneff.clean_file('myfile.csv')
```

```javascript
uneff.cleanFile('myfile.csv');
```

### For Custom Storage Systems

Use the storage-agnostic core functions:

```python
# 1. Get content using your storage API
content = your_app.get_file_content(file_id)

# 2. Use Uneff to clean it in memory
cleaned_content, char_counts = uneff.clean_content(content)

# 3. Save using your storage API
your_app.save_file_content(file_id, cleaned_content)
```

```javascript
// 1. Get content using your storage API
const content = your_app.get_file_content(file_id);

// 2. Use Uneff to clean it in memory
const [cleaned_content, char_counts] = uneff.cleanContent(content);

// 3. Save using your storage API
your_app.save_file_content(file_id, cleaned_content);
```

## Safe vs. Unsafe: When to Use Uneff

### When It's Safe to Process These Characters

- **Data Processing Pipelines**: Before feeding data into analysis tools or databases
- **Plain Text Content**: Regular text documents, logs, configuration files
- **Data Exchange**: Files being transferred between different systems
- **CSV Files**: Almost always safe to remove BOM and control characters
- **Legacy Data Cleanup**: Fixing old files with encoding issues

### When to Be Cautious

- **Rich Text Documents**: Some zero-width characters have specific formatting purposes in RTF, HTML, or Word docs
- **Bidirectional Text**: Languages like Arabic or Hebrew sometimes use special Unicode control characters
- **Source Code**: Some IDEs and development tools might use BOM markers intentionally
- **XML/HTML**: These formats sometimes use control characters with specific meanings

### Potential Risks When Processing Characters

- **Text Layout Changes**: Modifying bidirectional controls might affect text rendering in some languages
- **Formatting Loss**: Some invisible characters serve legitimate formatting purposes
- **Semantic Change**: In rare cases, modifying zero-width joiners could change how text is displayed

## License

MIT

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/mkiiim/uneff",
    "name": "uneff",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.7",
    "maintainer_email": "Mark <mark.emila@gmail.com>",
    "keywords": "unicode, bom, text-processing, csv-cleaning, data-cleaning, file-processing",
    "author": "Mark",
    "author_email": "Mark <mark.emila@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/8c/db/1f925359e9d995500de7be77b206952bb7c7576e64239982b43c4753fc5a/uneff-1.0.1.tar.gz",
    "platform": null,
    "description": "# Uneff\n\nA versatile tool to remove BOM (Byte Order Mark) and problematic Unicode characters from text files and streams.\n\n## Overview\n\nUneff is designed to clean text files by removing BOM markers and other invisible Unicode characters that can cause issues when processing data files. It's especially useful for:\n\n- Cleaning CSV files before data processing\n- Fixing encoding issues in text files\n- Removing invisible control characters that break parsers\n- Normalizing text data from various sources\n\nI created this for use in my own personal project(s) and am including it here in case someone may find it useful as a starting point for their own data processing pipeline issues and challenges.\n\n## \u26a0\ufe0f Warning and Disclaimer\n\n### Risk of Unintended Consequences\n\n**IMPORTANT**: Modifying text data by removing or replacing characters can have unintended consequences. While Uneff is designed to help with common data cleaning challenges, improper use may introduce new issues into your data.\n\n- **Always create backups** of your original files before processing\n- **Always validate** the results after processing to ensure data integrity\n- **Never use** on production data without thorough testing\n- **Be especially cautious** when working with:\n  - Internationalized text\n  - Complex data formats (like JSON, XML)\n  - Files with custom encoding\n  - Critical data where character precision matters\n\n### No Warranty\n\nThis tool is provided \"as is\", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.\n\n### Recommended Workflow\n\n1. **Analyze first**: Use the `--analyze` flag to identify issues without modifying files\n2. **Test on samples**: Process a small sample before running on large datasets\n3. **Verify results**: Implement verification checks after processing\n4. **Document changes**: Keep records of what was processed and why\n5. **Limit scope**: Only process the specific characters you know are problematic\n\n## Features\n\n- Removes UTF-8, UTF-16, and UTF-32 BOM markers\n- Handles invisible and problematic Unicode characters\n- Supports replacing characters instead of just removing them\n- Works with in-memory content (no direct file I/O required)\n- Configurable via a simple CSV file - no code changes needed\n- Can be used as a command-line tool or as a library in your projects\n- Available in both Python and JavaScript versions\n- Preserves original files by creating clean copies\n- Detailed reporting of changes made\n- Precise location reporting of problematic characters (line, column, and absolute position)\n\n## Installation\n\n### Python Package (Recommended)\n\nInstall from PyPI:\n```bash\npip install uneff\n```\n\n### Install from Git\n\nInstall the latest development version:\n```bash\npip install git+https://github.com/mkiiim/uneff.git\n```\n\n### Download Individual Files\n\n#### Python Script\n1. Download `uneff.py` to your project directory\n2. Make it executable (optional, for Unix-like systems):\n   ```bash\n   chmod +x uneff.py\n   ```\n\n#### JavaScript Version\n1. Download `uneff.js` to your project directory\n2. Ensure you have Node.js installed\n\n## Usage\n\n### Command Line\n\n#### Python Package\n\n```bash\nuneff myfile.csv\n```\n\n#### Python Script\n\n```bash\npython uneff.py myfile.csv\n```\n\n#### Options\n\n```bash\n# Using package\nuneff myfile.csv [-m mappings.csv] [-o output.csv] [-q] [-a]\n\n# Using script  \npython uneff.py myfile.csv [-m mappings.csv] [-o output.csv] [-q] [-a]\n```\n\n- `-m, --mapping`: Path to custom character mappings file\n- `-o, --output`: Path to save cleaned file (default: adds uneffd_ prefix)\n- `-q, --quiet`: Suppress status messages\n- `-a, --analyze`: Only analyze file without cleaning\n\n#### JavaScript Version\n\n```bash\nnode uneff.js myfile.csv\n```\n\nOptions:\n```\nnode uneff.js myfile.csv [-m|--mapping mappings.csv] [-o|--output output.csv] [-q|--quiet] [-a|--analyze]\n```\n\n### In Your Code\n\n#### Python Version\n\n```python\nimport uneff\n\n# Clean a file\nuneff.clean_file('myfile.csv')\n\n# Clean text directly without file I/O\ndirty_text = '\ufeffHello\ufffdWorld with invisible\\u200bspaces'\nclean_text = uneff.clean_text(dirty_text)\n\n# Use with custom options\nuneff.clean_file(\n    file_path='myfile.csv',\n    mapping_file='custom_mappings.csv',\n    output_path='cleaned.csv',\n    verbose=False\n)\n\n# Working with in-memory content (storage-agnostic)\ncontent_bytes = get_content_from_your_system()  # Your custom storage API\ncleaned_content, char_counts = uneff.clean_content(content_bytes)\nsave_content_to_your_system(cleaned_content)  # Your custom storage API\n\n# Analyze file without modifying it\nanalysis = uneff.analyze_file('myfile.csv')\nprint(f\"Found {analysis['problematic_char_count']} problematic characters\")\n```\n\n#### JavaScript Version\n\n```javascript\nconst uneff = require('./uneff');\n\n// Clean a file\nuneff.cleanFile('myfile.csv');\n\n// Clean text directly without file I/O\nconst dirtyText = '\ufeffHello\ufffdWorld with invisible\\u200bspaces';\nconst cleanText = uneff.cleanText(dirtyText);\n\n// Use with custom options\nuneff.cleanFile(\n    'myfile.csv',\n    'custom_mappings.csv',\n    'cleaned.csv',\n    false  // verbose mode off\n);\n\n// Working with in-memory content (storage-agnostic)\nconst contentBytes = getContentFromYourSystem();  // Your custom storage API\nconst [cleanedContent, charCounts] = uneff.cleanContent(contentBytes);\nsaveContentToYourSystem(cleanedContent);  // Your custom storage API\n\n// Analyze file without modifying it\nconst analysis = uneff.analyzeFile('myfile.csv');\nconsole.log(`Found ${analysis.problematic_char_count} problematic characters`);\n```\n\n## Configuring Problematic Characters\n\nOn first run, Uneff creates a default `uneff_mappings.csv` file with common problematic characters. You can edit this file to customize how characters are handled:\n\n| Character | Unicode  | Name                     | Remove | Replacement |\n|-----------|----------|--------------------------|--------|-------------|\n| \ufffd         | \\ufffd   | Replacement Character    | True   |             |\n| [empty]   | \\u0000   | NULL                     | True   |             |\n| [empty]   | \\u2028   | Line Separator           | True   | \" \"         |\n| [empty]   | \\u2029   | Paragraph Separator      | True   | \"\\n\"        |\n| [empty]   | \\u200b   | Zero Width Space         | True   |             |\n| \ufeff          | \\ufeff    | BOM                       | True   |             |\n| ...       | ...      | ...                      | ...    | ...         |\n\n- To stop processing a specific character, change `True` to `False`\n- To add new characters, add a new row with the appropriate Unicode escape sequence\n- Add a replacement character in the \"Replacement\" column to replace instead of remove\n- If \"Remove\" is True and \"Replacement\" is empty, the character is removed completely\n\n### Notes on the Replacement Field\n\n- **Do not add quotes** around replacement values in the CSV file unless they contain special characters\n- Leave empty to remove the character completely\n- For whitespace characters:\n  - Use `\\n` for newline\n  - Use `\\t` for tab\n  - Use a space character for space\n- For multiple characters, simply type them as-is (e.g., `-->`)\n\n#### Special Characters in Replacements\n\nWhen your replacement contains special characters like commas, you must use quotes in the CSV:\n\n| Character | Unicode  | Name              | Remove | Replacement |\n|-----------|----------|-------------------|--------|-------------|\n| [empty]   | \\u2063   | Invisible Separator | True  | \",\"         |\n\nThe CSV should look like this in a text editor:\n```\nCharacter,Unicode,Name,Remove,Replacement\n,\\u2063,Invisible Separator,True,\",\"\n```\n\nWithout the quotes, the comma would be interpreted as a new field separator.\n\n#### Additional Examples\n\n| Replacement For | Raw Text in CSV | Notes |\n|-----------------|-----------------|-------|\n| Comma | \",\" | Must be quoted |\n| Quote | \"\"\"\" | Double quotes to escape |\n| Tab character | \\t | Escape sequence |\n| Newline | \\n | Escape sequence |\n| Comma + text | \",.txt\" | Must be quoted |\n\n## What Characters are Processed?\n\nBy default, Uneff handles these types of characters:\n\n- BOM (Byte Order Mark) characters\n- Replacement Character (\ufffd)\n- Control characters (NULL, SUB, FS, GS, RS, US)\n- Zero-width spaces and joiners\n- Bidirectional text control characters\n- Line and paragraph separators (now replaced with spaces and newlines by default)\n- Other invisible formatting characters\n\n## Advanced Analysis Features\n\nThe analysis functions provide detailed information about problematic characters in your text:\n\n```python\n# Analyze a file and get detailed report\nanalysis = uneff.analyze_file('myfile.csv')\n\n# Get total count of problematic characters\nprint(f\"Found {analysis['problematic_char_count']} problematic characters\")\n\n# Get information about each type of problematic character\nfor detail in analysis['character_details']:\n    print(f\"Character: {detail['name']} [Unicode: {detail['unicode']}]\")\n    print(f\"Count: {detail['count']} instances\")\n    \n    # Print locations of up to 10 instances\n    for location in detail['sample_locations']:\n        print(f\"  Found at Line {location['line']}, Column {location['column']}\")\n        print(f\"  Context: ...{location['context']}...\")\n```\n\nIn JavaScript:\n\n```javascript\n// Analyze a file and get detailed report\nconst analysis = uneff.analyzeFile('myfile.csv');\n\n// Get total count of problematic characters\nconsole.log(`Found ${analysis.problematic_char_count} problematic characters`);\n\n// Get information about each type of problematic character\nfor (const detail of analysis.character_details) {\n    console.log(`Character: ${detail.name} [Unicode: ${detail.unicode}]`);\n    console.log(`Count: ${detail.count} instances`);\n    \n    // Print locations of up to 10 instances\n    for (const location of detail.sample_locations) {\n        console.log(`  Found at Line ${location.line}, Column ${location.column}`);\n        console.log(`  Context: ...${location.context}...`);\n    }\n}\n```\n\nThe analysis results include:\n- Exact line and column positions of each problematic character\n- Absolute position in the file\n- Surrounding context to help identify issues\n- BOM detection and encoding information\n- Character counts and statistics\n\n## First-Run Behavior\n\nWhen running Uneff for the first time:\n\n1. If a mapping file doesn't exist, Uneff automatically creates one with default settings.\n2. This happens seamlessly, allowing you to immediately process files without errors.\n3. The default mapping file (`uneff_mappings.csv`) is created in the same directory as the script.\n\n## Why Process These Characters?\n\n### Common Problems Caused by Invisible Characters\n\n1. **Parser Failures**: Many data processing systems and parsers fail when encountering unexpected Unicode characters\n2. **Data Processing Errors**: Characters like zero-width spaces break tokenization, field detection, and data extraction\n3. **Database Import Issues**: ETL processes often reject data with control characters\n4. **Inconsistent Behavior**: The same file might work in one system but fail in another\n5. **Hard-to-Debug Problems**: Since these characters are invisible, problems they cause can be extremely difficult to diagnose\n\n### Specific Use Cases\n\n- **CSV Processing**: BOM markers can cause the first column name to be misread\n- **Data Analysis**: Invisible characters can cause misalignment in data processing\n- **API Integrations**: Data passed between systems can accumulate problematic characters\n- **Text Mining**: Control characters can corrupt text analysis\n- **Custom Storage Systems**: Applications with their own storage systems can now leverage Uneff's capabilities\n\n## Integration Options\n\n### For Standard File Processing\n\nUse the traditional methods with direct file I/O:\n\n```python\nuneff.clean_file('myfile.csv')\n```\n\n```javascript\nuneff.cleanFile('myfile.csv');\n```\n\n### For Custom Storage Systems\n\nUse the storage-agnostic core functions:\n\n```python\n# 1. Get content using your storage API\ncontent = your_app.get_file_content(file_id)\n\n# 2. Use Uneff to clean it in memory\ncleaned_content, char_counts = uneff.clean_content(content)\n\n# 3. Save using your storage API\nyour_app.save_file_content(file_id, cleaned_content)\n```\n\n```javascript\n// 1. Get content using your storage API\nconst content = your_app.get_file_content(file_id);\n\n// 2. Use Uneff to clean it in memory\nconst [cleaned_content, char_counts] = uneff.cleanContent(content);\n\n// 3. Save using your storage API\nyour_app.save_file_content(file_id, cleaned_content);\n```\n\n## Safe vs. Unsafe: When to Use Uneff\n\n### When It's Safe to Process These Characters\n\n- **Data Processing Pipelines**: Before feeding data into analysis tools or databases\n- **Plain Text Content**: Regular text documents, logs, configuration files\n- **Data Exchange**: Files being transferred between different systems\n- **CSV Files**: Almost always safe to remove BOM and control characters\n- **Legacy Data Cleanup**: Fixing old files with encoding issues\n\n### When to Be Cautious\n\n- **Rich Text Documents**: Some zero-width characters have specific formatting purposes in RTF, HTML, or Word docs\n- **Bidirectional Text**: Languages like Arabic or Hebrew sometimes use special Unicode control characters\n- **Source Code**: Some IDEs and development tools might use BOM markers intentionally\n- **XML/HTML**: These formats sometimes use control characters with specific meanings\n\n### Potential Risks When Processing Characters\n\n- **Text Layout Changes**: Modifying bidirectional controls might affect text rendering in some languages\n- **Formatting Loss**: Some invisible characters serve legitimate formatting purposes\n- **Semantic Change**: In rare cases, modifying zero-width joiners could change how text is displayed\n\n## License\n\nMIT\n",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Remove BOM and problematic Unicode characters from text files",
    "version": "1.0.1",
    "project_urls": {
        "Bug Reports": "https://github.com/mkiiim/uneff/issues",
        "Documentation": "https://github.com/mkiiim/uneff#readme",
        "Homepage": "https://github.com/mkiiim/uneff",
        "Repository": "https://github.com/mkiiim/uneff.git"
    },
    "split_keywords": [
        "unicode",
        " bom",
        " text-processing",
        " csv-cleaning",
        " data-cleaning",
        " file-processing"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "db2d7e8932602f55887ae6dd8c0ec19a42e72817262ae8c5bb23b35a25d133aa",
                "md5": "8edf455ff2d58abf5336fdf35ca98774",
                "sha256": "45f50cf89cb28c0bffb3146a9709205c1dad7c341b6b52d7b9938fc9cd373370"
            },
            "downloads": -1,
            "filename": "uneff-1.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "8edf455ff2d58abf5336fdf35ca98774",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.7",
            "size": 16334,
            "upload_time": "2025-08-06T11:54:46",
            "upload_time_iso_8601": "2025-08-06T11:54:46.435043Z",
            "url": "https://files.pythonhosted.org/packages/db/2d/7e8932602f55887ae6dd8c0ec19a42e72817262ae8c5bb23b35a25d133aa/uneff-1.0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "8cdb1f925359e9d995500de7be77b206952bb7c7576e64239982b43c4753fc5a",
                "md5": "5e2ea3c605f602671d8859b916651d9f",
                "sha256": "d1db6bbf7ae2c1ee89d091fb63d166681988f1c5d2956525feac7b0110e4c78b"
            },
            "downloads": -1,
            "filename": "uneff-1.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "5e2ea3c605f602671d8859b916651d9f",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.7",
            "size": 22882,
            "upload_time": "2025-08-06T11:54:47",
            "upload_time_iso_8601": "2025-08-06T11:54:47.338184Z",
            "url": "https://files.pythonhosted.org/packages/8c/db/1f925359e9d995500de7be77b206952bb7c7576e64239982b43c4753fc5a/uneff-1.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-06 11:54:47",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "mkiiim",
    "github_project": "uneff",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "uneff"
}
        
Elapsed time: 1.19913s