# Huggingface Text Data Analyzer
![Repository Image](assets/hftxtdata.png)
A comprehensive tool for analyzing text datasets from HuggingFace's datasets library. This tool provides both basic text statistics and advanced NLP analysis capabilities with optimized performance for large datasets.
## Analysis Types
The tool supports two types of analysis that can be run independently or together:
### Basic Analysis (Default)
- Average text length per field
- Word distribution analysis
- Junk text detection (HTML tags, special characters; see the sketch below)
- Tokenizer-based analysis (when tokenizer is specified)
- Token length statistics with batch processing
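For context, "junk text" here means leftover markup and noise in the raw text. A minimal sketch of the kind of checks involved (illustrative only, not the tool's exact implementation):

```python
import re

HTML_TAG = re.compile(r"<[^>]+>")                    # e.g. <br>, <div class="x">
SPECIAL_CHARS = re.compile(r"[^\w\s.,!?;:'\"()-]")   # anything outside common punctuation

def junk_ratio(text: str) -> float:
    """Fraction of characters that are HTML markup or unusual symbols."""
    if not text:
        return 0.0
    markup_chars = sum(len(tag) for tag in HTML_TAG.findall(text))
    remainder = HTML_TAG.sub("", text)               # avoid double-counting tag characters
    special_chars = len(SPECIAL_CHARS.findall(remainder))
    return (markup_chars + special_chars) / len(text)

print(junk_ratio("Hello <b>world</b> ###"))  # small but non-zero ratio
```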
### Advanced Analysis (Optional)
- Part-of-Speech (POS) tagging
- Named Entity Recognition (NER)
- Language detection using XLM-RoBERTa
- Sentiment analysis using distilbert-sst-2-english
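For illustration, the language detection and sentiment models listed above are ordinary `transformers` pipelines; a minimal sketch of calling such models directly (the exact checkpoints below are assumptions based on the model families named, not necessarily the ones the tool loads):

```python
from transformers import pipeline

# Assumed checkpoints matching the model families named above.
language_detector = pipeline(
    "text-classification",
    model="papluca/xlm-roberta-base-language-detection",
)
sentiment_classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(language_detector("Bonjour tout le monde"))    # e.g. [{'label': 'fr', 'score': ...}]
print(sentiment_classifier("I love this dataset!"))  # e.g. [{'label': 'POSITIVE', 'score': ...}]
```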
You can control which analyses to run using these flags:
- `--skip-basic`: Skip basic analysis (must be used with `--advanced`)
- `--advanced`: Enable advanced analysis
- `--use-pos`: Enable POS tagging
- `--use-ner`: Enable NER
- `--use-lang`: Enable language detection
- `--use-sentiment`: Enable sentiment analysis
## Installation
### From PyPI
```bash
pip install huggingface-text-data-analyzer
```
### From Source
1. Clone the repository:
```bash
git clone https://github.com/SulRash/huggingface-text-data-analyzer.git
cd huggingface-text-data-analyzer
```
2. Install in development mode:
```bash
pip install -e .
```
3. Install spaCy's English model (if using advanced analysis):
```bash
python -m spacy download en_core_web_sm
```
## Usage
The tool is installed as a command-line application and is run with the `analyze-dataset` command.
Basic usage:
```bash
analyze-dataset "dataset_name" --split "train" --output-dir "results"
```
With tokenizer analysis:
```bash
analyze-dataset "dataset_name" --tokenizer "bert-base-uncased"
```
Analyze specific fields with chat template:
```bash
analyze-dataset "dataset_name" \
--fields instruction response \
--chat-field response \
--tokenizer "meta-llama/Llama-2-7b-chat-hf"
```
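For context, `--chat-field` wraps that field's text in the tokenizer's chat template before tokenization, so token counts reflect the conversation format the model actually sees. A rough sketch of that transformation in recent `transformers` versions (illustrative; the tool's exact handling may differ, and gated models such as Llama 2 require authentication):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# A single-turn conversation built from the chat field's text.
messages = [{"role": "user", "content": "Explain tokenization in one sentence."}]
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)  # text wrapped in the model's [INST] ... [/INST] conversation markers
```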
Run only advanced analysis:
```bash
analyze-dataset "dataset_name" --skip-basic --advanced --use-pos --use-lang
```
Run both analyses:
```bash
analyze-dataset "dataset_name" --advanced --use-sentiment
```
Run basic analysis only (default):
```bash
analyze-dataset "dataset_name"
```
Full analysis with all features:
```bash
analyze-dataset "dataset_name" \
--advanced \
--use-pos \
--use-ner \
--use-lang \
--use-sentiment \
--tokenizer "bert-base-uncased" \
--output-dir "results" \
--fields instruction response \
--batch-size 64
```
### Command Line Arguments
- `dataset_name`: Name of the dataset on HuggingFace (required)
- `--split`: Dataset split to analyze (default: "train")
- `--output-dir`: Directory to save analysis results (default: "analysis_results")
- `--tokenizer`: HuggingFace tokenizer to use (optional)
- `--cache-tokenized`: Cache tokenized texts (default: True)
- `--batch-size`: Batch size for tokenization (default: 32)
- `--basic-batch-size`: Batch size for basic analysis (see Performance and Accuracy Considerations)
- `--advanced-batch-size`: Batch size for model-based advanced analysis (see Performance and Accuracy Considerations)
- `--fields`: Specific fields to analyze (optional, analyzes all text fields if not specified)
- `--chat-field`: Field to apply chat template to (optional)
- `--advanced`: Run advanced analysis with models
- `--skip-basic`: Skip basic analysis (must be combined with `--advanced`)
- `--use-pos`: Include POS tagging analysis
- `--use-ner`: Include NER analysis
- `--use-lang`: Include language detection
- `--use-sentiment`: Include sentiment analysis
- `--clear-cache`: Clear cached tokenized texts and run a fresh analysis
### Python API
You can also use the tool programmatically in your Python code:
```python
from huggingface_text_data_analyzer import BaseAnalyzer, AdvancedAnalyzer

# Basic analysis
analyzer = BaseAnalyzer(
    dataset_name="your_dataset",
    split="train",
    tokenizer="bert-base-uncased"
)
results = analyzer.analyze()

# Advanced analysis
advanced_analyzer = AdvancedAnalyzer(
    dataset_name="your_dataset",
    split="train",
    use_pos=True,
    use_ner=True
)
advanced_results = advanced_analyzer.analyze_advanced()
```
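The structure of the returned results objects isn't documented here, so the snippet below only sketches one way to persist them for later inspection (assuming they serialize to JSON, with a `repr()` fallback for anything that doesn't):

```python
import json

def dump_results(results, path: str) -> None:
    """Best-effort JSON dump; non-serializable values fall back to their repr()."""
    with open(path, "w") as f:
        json.dump(results, f, indent=2, default=repr)

dump_results(results, "basic_results.json")
dump_results(advanced_results, "advanced_results.json")
```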
## Project Structure
```
huggingface_text_data_analyzer/
├── src/
│   ├── base_analyzer.py      # Basic text analysis functionality
│   ├── advanced_analyzer.py  # Model-based advanced analysis
│   ├── report_generator.py   # Markdown report generation
│   └── utils.py              # Utility functions and argument parsing
├── cli.py                    # Command-line interface
└── __init__.py               # Package initialization
```
## Output
The tool generates markdown reports in the specified output directory:
- `basic_stats.md`: Contains basic text statistics
- `word_distribution.md`: Word frequency analysis
- `advanced_stats.md`: Results from model-based analysis (if enabled)
## Caching and Results Management
The tool implements a two-level caching system to optimize performance and save time:
### Token Cache
- Tokenized texts are cached to avoid re-tokenization
- Cache is stored in `~/.cache/huggingface-text-data-analyzer/`
- Clear with `--clear-cache` flag
### Analysis Results Cache
- Complete analysis results are cached per dataset/split
- Basic and advanced analysis results are cached separately
- When running analysis:
  - Tool checks for existing results
  - Prompts user before using cached results
  - Saves intermediate results after basic analysis
  - Prompts before overwriting existing results
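To make that flow concrete, here is an illustrative sketch of how a per-dataset/split results cache along these lines could work (an assumption about the mechanism; only the cache directory below is documented):

```python
from pathlib import Path
import json

CACHE_DIR = Path.home() / ".cache" / "huggingface-text-data-analyzer" / "analysis_results"

def cached_results_path(dataset_name: str, split: str, kind: str) -> Path:
    """One file per dataset/split/analysis kind, e.g. squad_train_basic.json (hypothetical layout)."""
    safe_name = dataset_name.replace("/", "_")
    return CACHE_DIR / f"{safe_name}_{split}_{kind}.json"

def load_if_confirmed(path: Path):
    """Return cached results only if they exist and the user agrees to reuse them."""
    if path.exists() and input(f"Reuse cached results at {path}? [y/N] ").strip().lower() == "y":
        return json.loads(path.read_text())
    return None
```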
### Cache Management Examples
Use cached results if available:
```bash
analyze-dataset "dataset_name" # Will prompt if cache exists
```
Force fresh analysis:
```bash
analyze-dataset "dataset_name" --clear-cache
```
Add advanced analysis to existing basic analysis:
```bash
analyze-dataset "dataset_name" --advanced # Will reuse basic results if available
```
### Cache Location
- Token cache: `~/.cache/huggingface-text-data-analyzer/`
- Analysis results: `~/.cache/huggingface-text-data-analyzer/analysis_results/`
## Performance and Accuracy Considerations
### Batch Sizes and Memory Usage
The tool uses two different batch sizes for processing:
1. **Basic Batch Size** (`--basic-batch-size`, default: 1):
   - Used for tokenization and basic text analysis
   - Higher values improve processing speed but may affect token count accuracy
   - Token counting in larger batches can be affected by padding, truncation, and memory constraints
   - If exact token counts are crucial, use smaller batch sizes (8-16)
2. **Advanced Batch Size** (`--advanced-batch-size`, default: 16):
   - Used for transformer models (language detection, sentiment analysis)
   - Adjust based on your GPU memory
   - Larger batches improve processing speed but require more GPU memory
   - CPU-only users might want to use smaller batches (4-8)
### GPU Support
The tool automatically detects and uses available CUDA GPUs for:
- Language detection model
- Sentiment analysis model
- Tokenizer operations
spaCy operations (POS tagging, NER) run on the CPU for broader compatibility.
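Device selection follows the usual PyTorch pattern; a minimal sketch of the check involved (illustrative):

```python
import torch

# Transformer-based models are placed on the GPU when one is visible; otherwise everything runs on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Running transformer models on: {device}")
```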
### Examples
For exact token counting:
```bash
analyze-dataset "dataset_name" --basic-batch-size 8
```
For faster processing with GPU:
```bash
analyze-dataset "dataset_name" --advanced-batch-size 32 --basic-batch-size 64
```
For memory-constrained environments:
```bash
analyze-dataset "dataset_name" --advanced-batch-size 4 --basic-batch-size 16
```
## Performance Features
- Batch processing for tokenization
- Progress bars for long-running operations
- Tokenizer parallelism enabled
- Caching support for tokenized texts
- Memory-efficient processing of large datasets
- Optimized batch sizes for better performance
## Requirements
- Python 3.8+
- transformers
- datasets
- spacy
- rich
- torch
- pandas
- numpy
- scikit-learn (for advanced features)
- tqdm
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is licensed under the Apache License 2.0.