huggingface-text-data-analyzer 1.1.0 (PyPI)

- **Summary:** A comprehensive tool for analyzing text datasets from HuggingFace's datasets library
- **Author:** Sultan Alrashed
- **Home page:** https://github.com/SulRash/huggingface-text-data-analyzer
- **Requires Python:** >=3.8
- **Uploaded:** 2024-12-06 03:06:41
- **Keywords:** nlp, text-analysis, huggingface, datasets, machine-learning, data-analysis, text-processing
- **Requirements:** torch, transformers, datasets, rich, spacy, nltk, pandas, numpy, seaborn, matplotlib, tqdm, typing-extensions, scikit-learn, black, pylint, pytest
# Huggingface Text Data Analyzer

![Repository Image](assets/hftxtdata.png)

A comprehensive tool for analyzing text datasets from HuggingFace's datasets library. This tool provides both basic text statistics and advanced NLP analysis capabilities with optimized performance for large datasets.

## Analysis Types

The tool supports two types of analysis that can be run independently or together:

### Basic Analysis (Default)
- Average text length per field
- Word distribution analysis
- Junk text detection (HTML tags, special characters; see the sketch after this list)
- Tokenizer-based analysis (when tokenizer is specified)
- Token length statistics with batch processing
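
The README does not show how junk detection is implemented. Below is a minimal illustration of the idea using regular expressions; the patterns and the 0.3 special-character threshold are assumptions for the sketch, not the package's actual rules:

```python
import re

# Hypothetical heuristics: flag text containing HTML tags or an unusually
# high ratio of non-alphanumeric, non-whitespace characters.
HTML_TAG = re.compile(r"<[^>]+>")
SPECIAL = re.compile(r"[^\w\s]")

def looks_like_junk(text: str, special_ratio: float = 0.3) -> bool:
    if HTML_TAG.search(text):
        return True
    if not text:
        return False
    return len(SPECIAL.findall(text)) / len(text) > special_ratio

print(looks_like_junk("<div>hello</div>"))  # True: contains an HTML tag
print(looks_like_junk("Plain sentence."))   # False
print(looks_like_junk("@#$%^&*!!!"))        # True: mostly special characters
```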

### Advanced Analysis (Optional)
- Part-of-Speech (POS) tagging
- Named Entity Recognition (NER)
- Language detection using XLM-RoBERTa
- Sentiment analysis using distilbert-sst-2-english

You can control which analyses to run using these flags:
- `--skip-basic`: Skip basic analysis (must be used with `--advanced`)
- `--advanced`: Enable advanced analysis
- `--use-pos`: Enable POS tagging
- `--use-ner`: Enable NER
- `--use-lang`: Enable language detection
- `--use-sentiment`: Enable sentiment analysis

## Installation

### From PyPI
```bash
pip install huggingface-text-data-analyzer
```

### From Source
1. Clone the repository:
```bash
git clone https://github.com/SulRash/huggingface-text-data-analyzer.git
cd huggingface-text-data-analyzer
```

2. Install in development mode:
```bash
pip install -e .
```

3. Install spaCy's English model (if using advanced analysis):
```bash
python -m spacy download en_core_web_sm
```

## Usage

The tool is available as a command-line application after installation. You can run it using the `analyze-dataset` command:

Basic usage:
```bash
analyze-dataset "dataset_name" --split "train" --output-dir "results"
```

With tokenizer analysis:
```bash
analyze-dataset "dataset_name" --tokenizer "bert-base-uncased"
```

Analyze specific fields with chat template:
```bash
analyze-dataset "dataset_name" \
    --fields instruction response \
    --chat-field response \
    --tokenizer "meta-llama/Llama-2-7b-chat-hf"
```
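
The `--chat-field` option applies the tokenizer's chat template to that field before analysis. As a point of reference, this is roughly what chat templating looks like in the transformers API (`apply_chat_template` is available in recent transformers releases; the message layout here is illustrative, not the tool's internal code):

```python
from transformers import AutoTokenizer

# Llama-2 is a gated model; any tokenizer that defines a chat template works.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Wrap a raw field value in the model's chat format, then count tokens.
messages = [{"role": "user", "content": "example response text"}]
formatted = tok.apply_chat_template(messages, tokenize=False)
print(formatted)
print(len(tok(formatted)["input_ids"]))
```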

Run only advanced analysis:
```bash
analyze-dataset "dataset_name" --skip-basic --advanced --use-pos --use-lang
```

Run both analyses:
```bash
analyze-dataset "dataset_name" --advanced --use-sentiment
```

Run basic analysis only (default):
```bash
analyze-dataset "dataset_name"
```

Full analysis with all features:
```bash
analyze-dataset "dataset_name" \
    --advanced \
    --use-pos \
    --use-ner \
    --use-lang \
    --use-sentiment \
    --tokenizer "bert-base-uncased" \
    --output-dir "results" \
    --fields instruction response \
    --batch-size 64
```

### Command Line Arguments

- `dataset_name`: Name of the dataset on HuggingFace (required)
- `--split`: Dataset split to analyze (default: "train")
- `--output-dir`: Directory to save analysis results (default: "analysis_results")
- `--tokenizer`: HuggingFace tokenizer to use (optional)
- `--cache-tokenized`: Cache tokenized texts (default: True)
- `--batch-size`: Batch size for tokenization (default: 32)
- `--fields`: Specific fields to analyze (optional, analyzes all text fields if not specified)
- `--chat-field`: Field to apply chat template to (optional)
- `--advanced`: Run advanced analysis with models
- `--use-pos`: Include POS tagging analysis
- `--use-ner`: Include NER analysis
- `--use-lang`: Include language detection
- `--use-sentiment`: Include sentiment analysis
- `--basic-batch-size`: Batch size for basic analysis (default: 1; see Performance and Accuracy Considerations)
- `--advanced-batch-size`: Batch size for model-based analysis (default: 16)
- `--clear-cache`: Clear cached tokenized texts and results, forcing a fresh analysis

### Python API

You can also use the tool programmatically in your Python code:

```python
from huggingface_text_data_analyzer import BaseAnalyzer, AdvancedAnalyzer

# Basic analysis
analyzer = BaseAnalyzer(
    dataset_name="your_dataset",
    split="train",
    tokenizer="bert-base-uncased"
)
results = analyzer.analyze()

# Advanced analysis
advanced_analyzer = AdvancedAnalyzer(
    dataset_name="your_dataset",
    split="train",
    use_pos=True,
    use_ner=True
)
advanced_results = advanced_analyzer.analyze_advanced()
```

## Project Structure

```
huggingface_text_data_analyzer/
├── src/
│   ├── base_analyzer.py      # Basic text analysis functionality
│   ├── advanced_analyzer.py  # Model-based advanced analysis
│   ├── report_generator.py   # Markdown report generation
│   └── utils.py             # Utility functions and argument parsing
├── cli.py                   # Command-line interface
└── __init__.py             # Package initialization
```

## Output

The tool generates markdown reports in the specified output directory:
- `basic_stats.md`: Contains basic text statistics
- `word_distribution.md`: Word frequency analysis
- `advanced_stats.md`: Results from model-based analysis (if enabled)

## Caching and Results Management

The tool implements a two-level caching system to avoid redundant work across runs:

### Token Cache
- Tokenized texts are cached to avoid re-tokenization
- Cache is stored in `~/.cache/huggingface-text-data-analyzer/`
- Clear it with the `--clear-cache` flag (a plausible cache-key sketch follows this list)
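
The README does not specify how cache entries are keyed. A plausible sketch, assuming one entry per dataset/split/tokenizer combination (the hashing scheme and file naming below are assumptions, not the package's actual layout):

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "huggingface-text-data-analyzer"

def cache_path(dataset: str, split: str, tokenizer: str) -> Path:
    # Hash the identifying triple so arbitrary dataset names map to safe filenames.
    key = hashlib.sha256(f"{dataset}:{split}:{tokenizer}".encode()).hexdigest()
    return CACHE_DIR / f"{key}.tokens"

print(cache_path("imdb", "train", "bert-base-uncased"))
```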

### Analysis Results Cache
- Complete analysis results are cached per dataset/split
- Basic and advanced analysis results are cached separately
- When running analysis:
  - Tool checks for existing results
  - Prompts user before using cached results
  - Saves intermediate results after basic analysis
  - Prompts before overwriting existing results

### Cache Management Examples

Use cached results if available:
```bash
analyze-dataset "dataset_name"  # Will prompt if cache exists
```

Force fresh analysis:
```bash
analyze-dataset "dataset_name" --clear-cache
```

Add advanced analysis to existing basic analysis:
```bash
analyze-dataset "dataset_name" --advanced  # Will reuse basic results if available
```

### Cache Location
- Token cache: `~/.cache/huggingface-text-data-analyzer/`
- Analysis results: `~/.cache/huggingface-text-data-analyzer/analysis_results/`


## Performance and Accuracy Considerations

### Batch Sizes and Memory Usage

The tool uses two different batch sizes for processing:

1. **Basic Batch Size** (`--basic-batch-size`, default: 1):
   - Used for tokenization and basic text analysis
   - Higher values improve processing speed but may affect token count accuracy
   - Token counting in larger batches can be affected by padding, truncation, and memory constraints
   - If exact token counts are crucial, use smaller batch sizes (8-16); the sketch after this list shows why padded batches skew naive counts

2. **Advanced Batch Size** (`--advanced-batch-size`, default: 16):
   - Used for transformer models (language detection, sentiment analysis)
   - Adjust based on your GPU memory
   - Larger batches improve processing speed but require more GPU memory
   - CPU-only users might want to use smaller batches (4-8)
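
To make the padding point concrete: when a batch is padded to a common length, reading token counts off the padded tensor shape overstates the true counts, while summing the attention mask recovers them. A minimal sketch with the transformers API:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = ["short text", "a noticeably longer piece of example text"]
enc = tok(batch, padding=True, return_tensors="pt")

# Naive count from the padded tensor: every row appears equally long.
print(enc["input_ids"].shape[1])

# Accurate per-example counts: the attention mask is 0 on padding positions.
print(enc["attention_mask"].sum(dim=1).tolist())
```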

### GPU Support

The tool automatically detects and uses available CUDA GPUs for:
- Language detection model
- Sentiment analysis model
- Tokenizer operations

spaCy operations (POS tagging, NER) remain CPU-bound for better compatibility.
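
Device selection for transformers pipelines typically follows the pattern below; this is a sketch of the general approach, not the package's exact code (the model name is the standard SST-2 DistilBERT checkpoint, assumed here):

```python
import torch
from transformers import pipeline

# Use GPU 0 when CUDA is available, otherwise fall back to CPU (-1).
device = 0 if torch.cuda.is_available() else -1

sentiment = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=device,
)
print(sentiment(["I love this dataset!", "This is junk."]))
```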

### Examples

For exact token counting:
```bash
analyze-dataset "dataset_name" --basic-batch-size 8
```

For faster processing with GPU:
```bash
analyze-dataset "dataset_name" --advanced-batch-size 32 --basic-batch-size 64
```

For memory-constrained environments:
```bash
analyze-dataset "dataset_name" --advanced-batch-size 4 --basic-batch-size 16
```

## Performance Features

- Batch processing for tokenization
- Progress bars for long-running operations
- Tokenizer parallelism enabled (see the note after this list)
- Caching support for tokenized texts
- Memory-efficient processing of large datasets
- Optimized batch sizes for better performance
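
On tokenizer parallelism: Hugging Face fast tokenizers parallelize in Rust threads, controlled by the `TOKENIZERS_PARALLELISM` environment variable. Whether the package sets it internally is not stated here; setting it manually looks like this:

```python
import os

# Set before tokenizers are loaded; "false" silences the fork warning when
# tokenizers are combined with Python multiprocessing.
os.environ["TOKENIZERS_PARALLELISM"] = "true"

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
print(tok.is_fast)  # True when the Rust-backed fast tokenizer is in use
```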

## Requirements

- Python 3.8+
- transformers
- datasets
- spacy
- rich
- torch
- pandas
- numpy
- scikit-learn (for advanced features)
- tqdm

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the Apache License 2.0.

            
