| Field | Value |
| --- | --- |
| Name | meno |
| Version | 1.0.3 |
| Home page | https://github.com/srepho/meno |
| Summary | Topic modeling toolkit for messy text data |
| Upload time | 2025-03-07 14:00:11 |
| Author | Stephen Oates |
| Requires Python | <3.14,>=3.8 |
| License | MIT |
# Meno: Topic Modeling Toolkit (v1.0.3)
<p align="center">
<img src="meno.webp" alt="Meno Logo" width="250"/>
</p>
Meno is a toolkit for topic modeling on messy text data, featuring an interactive workflow system that guides users from raw text to insights through acronym detection, spelling correction, topic modeling, and visualization.
## Installation
```bash
# Basic installation with core dependencies
pip install "numpy<2.0.0" # NumPy 1.x is required for compatibility
pip install meno
# Recommended: Minimal installation with essential topic modeling dependencies
pip install "numpy<2.0.0"
pip install "meno[minimal]"
# CPU-optimized installation without NVIDIA packages
pip install "numpy<2.0.0"
pip install "meno[embeddings]" -f https://download.pytorch.org/whl/torch_stable.html
```
### Offline/Air-gapped Environment Installation
For environments with limited internet access:
1. Download required models on a connected machine:
```python
from sentence_transformers import SentenceTransformer
# Download and cache model
model = SentenceTransformer("all-MiniLM-L6-v2")
# Note the model path (usually in ~/.cache/huggingface)
```
2. Transfer the downloaded model files to the offline machine, preserving the same directory structure.
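For example, a minimal sketch of the transfer (the `models--...` directory name follows the standard `huggingface_hub` cache layout; the host and user are placeholders):
```bash
# Copy the cached model directory to the offline machine (host and paths are illustrative)
scp -r ~/.cache/huggingface/hub/models--sentence-transformers--all-MiniLM-L6-v2 \
    user@offline-host:~/.cache/huggingface/hub/
```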
3. Use the `local_files_only` option when initializing:
```python
from meno.modeling.embeddings import DocumentEmbedding

# Option 1: Direct path to the downloaded model
embedding_model = DocumentEmbedding(
    local_model_path="/path/to/local/model",
    use_gpu=False
)

# Option 2: Using the standard Hugging Face cache location
embedding_model = DocumentEmbedding(
    model_name="all-MiniLM-L6-v2",
    local_files_only=True,
    use_gpu=False
)
```
See `examples/local_model_example.py` for detailed offline usage examples.
## Quick Start
### Basic Topic Modeling
```python
from meno import MenoTopicModeler
import pandas as pd
# Load your data
df = pd.read_csv("documents.csv")
# Initialize and run basic topic modeling
modeler = MenoTopicModeler()
processed_docs = modeler.preprocess(df, text_column="text")
topics_df = modeler.discover_topics(method="embedding_cluster", num_topics=5)
# Visualize results
fig = modeler.visualize_embeddings()
fig.write_html("topic_embeddings.html")
# Generate comprehensive HTML report
report_path = modeler.generate_report(output_path="topic_report.html")
```
### Interactive Workflow
```python
from meno import MenoWorkflow
import pandas as pd
# Load your data
data = pd.DataFrame({
    "text": [
        "The CEO and CFO met to discuss the AI implementation in our CRM system.",
        "Customer submitted a claim for their vehical accident on HWY 101.",
        "The CTO presented the ML strategy for improving cust retention.",
        "Policyholder recieved the EOB and was confused about the CPT codes."
    ]
})
# Initialize and run workflow
workflow = MenoWorkflow()
workflow.load_data(data=data, text_column="text")
# Generate interactive acronym report
workflow.generate_acronym_report(output_path="acronyms.html", open_browser=True)
# Apply acronym expansions
workflow.expand_acronyms({"CRM": "Customer Relationship Management", "CTO": "Chief Technology Officer"})
# Generate interactive misspelling report
workflow.generate_misspelling_report(output_path="misspellings.html", open_browser=True)
# Apply spelling corrections
workflow.correct_spelling({"vehical": "vehicle", "recieved": "received"})
# Preprocess and model topics
workflow.preprocess_documents()
workflow.discover_topics(num_topics=2)
# Generate comprehensive report
workflow.generate_comprehensive_report("final_report.html", open_browser=True)
```
## What's New in v1.0.0
- **Standardized API** - Consistent parameter names and method signatures across all models
- **Automatic Topic Detection** - Models can discover the optimal number of topics automatically (sketched below)
- **Enhanced Memory Efficiency** - Process larger datasets with streaming and quantization
- **Path Object Support** - Better file handling with pathlib integration
- **Return Type Standardization** - Consistent return values across all methods
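A minimal sketch of two of these changes, reusing the API from the examples in this README (omitting `num_topics` to trigger automatic detection mirrors the HDBSCAN example under Advanced Topic Discovery; `Path` acceptance is the pathlib integration noted above):

```python
from pathlib import Path

import pandas as pd

from meno import MenoTopicModeler

modeler = MenoTopicModeler()
df = pd.read_csv("documents.csv")
modeler.preprocess(df, text_column="text")

# Automatic topic detection: omit num_topics and let the backend choose
topics_df = modeler.discover_topics(method="embedding_cluster")

# Path object support: a pathlib.Path works wherever a file path is expected
modeler.generate_report(output_path=Path("topic_report.html"))
```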
## Overview
Meno streamlines topic modeling on messy text data, with a special focus on datasets like insurance claims and customer correspondence. It combines traditional methods (LDA) with modern techniques using large language models, dimensionality reduction with UMAP, and interactive visualizations.
## Key Features
- **Interactive Workflow System**
- Guided process from raw data to insights
- Acronym detection and expansion
- Spelling correction with contextual examples
- Topic discovery and visualization
- Interactive HTML reports
- **Versatile Topic Modeling**
- Unsupervised discovery with embedding-based clustering
- Supervised matching against predefined topics
- Automatic topic detection
- Integration with BERTopic and other advanced models
- **Team Configuration System**
- Share domain-specific dictionaries across teams
- Import/export terminology (JSON, YAML)
- CLI tools for configuration management
- **Performance Optimizations**
- Memory-efficient processing for large datasets
- Quantized embedding models
- Streaming processing for larger-than-memory data
- CPU-first design with optional GPU acceleration
- **Visualization & Reporting**
- Interactive embedding visualizations
- Topic distribution and similarity analysis
- Time series and geospatial visualizations
- Comprehensive HTML reports
## Installation Options
```bash
# For additional topic modeling approaches (BERTopic, Top2Vec)
pip install "meno[additional_models]"
# For embeddings with GPU acceleration
pip install "meno[embeddings-gpu]"
# For LDA topic modeling
pip install "meno[lda]"
# For visualization capabilities
pip install "meno[viz]"
# For NLP processing capabilities
pip install "meno[nlp]"
# For large dataset optimization using Polars
pip install "meno[optimization]"
# For memory-efficient embeddings
pip install "meno[memory_efficient]"
# For all features (CPU only)
pip install "meno[full]"
# For all features with GPU acceleration
pip install "meno[full-gpu]"
# For development
pip install "meno[dev,test]"
```
## Examples
### Advanced Topic Discovery
```python
from meno import MenoTopicModeler
import pandas as pd
# Initialize modeler
modeler = MenoTopicModeler()
# Load and preprocess data
df = pd.read_csv("documents.csv")
processed_docs = modeler.preprocess(
    df,
    text_column="text",
    lowercase=True,
    remove_punctuation=True,
    remove_stopwords=True,
    additional_stopwords=["specific", "custom", "words"]
)

# Discover topics (automatic detection with HDBSCAN)
topics_df = modeler.discover_topics(
    method="embedding_cluster",
    clustering_algorithm="hdbscan",
    min_cluster_size=10,
    min_samples=5
)

print(f"Discovered {len(topics_df['topic'].unique())} topics")

# Visualize results
fig = modeler.visualize_embeddings(
    plot_3d=True,
    include_topic_centers=True
)
fig.write_html("3d_topic_visualization.html")

# Generate report
report_path = modeler.generate_report(
    output_path="topic_report.html",
    include_interactive=True
)
```
### BERTopic Integration
```python
from meno import MenoWorkflow
import pandas as pd
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired
# Load data and initialize workflow
df = pd.read_csv("documents.csv")
workflow = MenoWorkflow()
workflow.load_data(data=df, text_column="text")
workflow.preprocess_documents()
# Get preprocessed data from workflow
preprocessed_df = workflow.get_preprocessed_data()
# Configure and fit BERTopic model
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)
keybert_model = KeyBERTInspired()
topic_model = BERTopic(
    embedding_model="all-MiniLM-L6-v2",
    ctfidf_model=ctfidf_model,  # ClassTfidfTransformer belongs here, not in vectorizer_model
    representation_model=keybert_model,
    calculate_probabilities=True
)

topics, probs = topic_model.fit_transform(
    preprocessed_df["processed_text"].tolist()
)

# Update workflow with BERTopic results
preprocessed_df["topic"] = [f"Topic_{t}" if t >= 0 else "Outlier" for t in topics]
# With calculate_probabilities=True, probs is an (n_docs, n_topics) array;
# keep the probability of the best topic for each document
preprocessed_df["topic_probability"] = probs.max(axis=1)
workflow.set_topic_assignments(preprocessed_df[["topic", "topic_probability"]])
# Generate visualizations and report
topic_model.visualize_topics().write_html("bertopic_similarity.html")
workflow.generate_comprehensive_report(
    output_path="bertopic_report.html",
    open_browser=True
)
```
### Matching Documents to Predefined Topics
```python
from meno import MenoTopicModeler
import pandas as pd
# Initialize and load data
modeler = MenoTopicModeler()
df = pd.read_csv("support_tickets.csv")
processed_docs = modeler.preprocess(df, text_column="description")
# Define topics and descriptions
predefined_topics = [
    "Account Access",
    "Billing Issue",
    "Technical Problem",
    "Feature Request",
    "Product Feedback"
]

topic_descriptions = [
    "Issues related to logging in, password resets, or account security",
    "Problems with payments, invoices, or subscription changes",
    "Technical issues, bugs, crashes, or performance problems",
    "Requests for new features or enhancements to existing functionality",
    "General feedback about the product, including compliments and complaints"
]

# Match documents to topics
matched_df = modeler.match_topics(
    topics=predefined_topics,
    descriptions=topic_descriptions,
    threshold=0.6,
    assign_multiple=True,
    max_topics_per_doc=2
)
# View topic assignments
print(matched_df[["description", "topic", "topic_probability"]].head())
```
### Large Dataset Processing
```python
from meno import MenoWorkflow
import pandas as pd
# Create optimized configuration
config_overrides = {
    "modeling": {
        "embeddings": {
            "model_name": "sentence-transformers/all-MiniLM-L6-v2",
            "batch_size": 64,
            "quantize": True,
            "low_memory": True
        }
    }
}
# Initialize workflow with optimized settings
workflow = MenoWorkflow(config_overrides=config_overrides)
# Process in batches
data = pd.read_csv("large_dataset.csv")
batch_size = 10000
for i in range(0, len(data), batch_size):
    batch = data.iloc[i:i+batch_size]

    if i == 0:  # First batch
        workflow.load_data(batch, text_column="text")
    else:  # Update with subsequent batches
        workflow.update_data(batch)
# Process with memory-efficient settings
workflow.preprocess_documents()
workflow.discover_topics(method="embedding_cluster")
workflow.generate_comprehensive_report("large_dataset_report.html")
```
## Team Configuration CLI
```bash
# Create a new team configuration
meno-config create "Healthcare" \
    --acronyms-file healthcare_acronyms.json \
    --corrections-file medical_spelling.json \
    --output-path healthcare_config.yaml

# Update an existing configuration
meno-config update healthcare_config.yaml \
    --acronyms-file new_acronyms.json

# Compare configurations from different teams
meno-config compare healthcare_config.yaml insurance_config.yaml \
    --output-path comparison.json
```
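The acronym and correction files are JSON; judging from the mappings passed to `expand_acronyms()` and `correct_spelling()` above, a flat term-to-replacement object is the plausible shape (the exact schema `meno-config` expects is an assumption):

```json
{
  "EOB": "Explanation of Benefits",
  "CPT": "Current Procedural Terminology",
  "HWY": "Highway"
}
```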
## Architecture
The package follows a modular design:
- **Data Preprocessing:** Spelling correction, acronym resolution, text normalization
- **Topic Modeling:** Unsupervised discovery, supervised matching, multiple model support
- **Visualization:** Interactive embeddings, topic distributions, time series
- **Report Generation:** HTML reports with Plotly and Jinja2
- **Team Configuration:** Domain knowledge sharing, CLI tools
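A short sketch of how these layers surface in the public API used throughout this README (`MenoWorkflow` is the facade over preprocessing, modeling, and reporting; `DocumentEmbedding` is the one modeling-layer class whose module path appears above, in the offline installation section):

```python
import pandas as pd

from meno import MenoWorkflow                            # facade over the full pipeline
from meno.modeling.embeddings import DocumentEmbedding   # modeling layer, usable directly

df = pd.DataFrame({"text": ["Claim filed for vehicle damage.", "Password reset request."]})

# Preprocessing -> topic modeling -> reporting through the workflow facade
workflow = MenoWorkflow()
workflow.load_data(data=df, text_column="text")
workflow.preprocess_documents()
workflow.discover_topics(method="embedding_cluster")
workflow.generate_comprehensive_report("architecture_demo_report.html")

# Or drive the modeling layer on its own, e.g. to produce embeddings for other tools
embedder = DocumentEmbedding(model_name="all-MiniLM-L6-v2", use_gpu=False)
```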
## Dependencies
- **Python:** 3.8-3.12 (primary target: 3.10)
- **Core Libraries:** pandas, scikit-learn, thefuzz, pydantic, PyYAML
- **Optional Libraries:** sentence-transformers, transformers, torch, umap-learn, hdbscan, plotly, bertopic
## Testing
```bash
# Run basic tests
python -m pytest -xvs tests/
# Run with coverage reporting
python -m pytest --cov=meno
```
## Documentation
For detailed usage information, see the [full documentation](https://github.com/srepho/meno/wiki).
## Future Development
With v1.0.0 complete, our focus is shifting to:
1. **Cloud Integration** - Native support for cloud-based services
2. **Multilingual Support** - Expand beyond English
3. **Domain-Specific Fine-Tuning** - Adapt models to specific industries
4. **Explainable AI Features** - Better interpret topic assignments
5. **Interactive Dashboards** - More powerful visualization tools
See our [detailed roadmap](ROADMAP.md) for more information.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.