# MALTopic: Multi-Agent LLM Topic Modeling Library
MALTopic is a powerful library designed for topic modeling using a multi-agent approach. It leverages the capabilities of large language models (LLMs) to enhance the analysis of survey responses by integrating structured and unstructured data.
The accompanying [MALTopic](https://ieeexplore.ieee.org/document/11105319) research paper was published at the 2025 World AI IoT Congress.
## Features
- **Multi-Agent Framework**: Decomposes topic modeling into specialized tasks executed by individual LLM agents.
- **Data Enrichment**: Enhances textual responses using structured and categorical survey data.
- **Latent Theme Extraction**: Extracts meaningful topics from enriched responses.
- **Topic Deduplication**: Intelligently refines and consolidates identified topics using LLM-powered semantic analysis for better interpretability.
- **Automatic Batching**: Handles large datasets by automatically splitting data into manageable batches when token limits are exceeded.
- **Intelligent Error Handling**: Detects token limit errors and seamlessly switches to batching mode without user intervention.
- **Comprehensive Statistics Tracking**: Automatically tracks LLM usage, token consumption, API performance, and costs with detailed metrics and reporting.
## Installation
To install the MALTopic library, use pip (Python 3.12 or newer is required):
```bash
pip install maltopic
```
## Usage
To use the MALTopic library, initialize the main class with your API key and model name. The `llm_type` parameter selects the LLM provider: OpenAI is supported today, while Google Gemini and Llama are planned but not yet supported.
```python
from maltopic import MALTopic
# Initialize the MALTopic class
client = MALTopic(
    api_key="your_api_key",
    default_model_name="gpt-4.1-nano",
    llm_type="openai",
)

# df: your pandas DataFrame of survey responses
enriched_df = client.enrich_free_text_with_structured_data(
    survey_context="context about survey, why, how of it...",
    free_text_column="column_1",
    structured_data_columns=["column_2", "column_3"],
    df=df,
    examples=["free text response, category 1 -> free text response with additional context", "..."],  # optional
)

topics = client.generate_topics(
    topic_mining_context="context about what kind of topics you want to mine",
    df=enriched_df,
    enriched_column="column_1_enriched",  # MALTopic appends the "_enriched" suffix.
)

# Optionally deduplicate and merge similar topics for cleaner results
deduplicated_topics = client.deduplicate_topics(
    topics=topics,
    survey_context="context about survey, why, how of it...",
)
print(deduplicated_topics)
# Access comprehensive statistics anytime
stats = client.get_stats()
print(f"Total tokens used: {stats['overview']['total_tokens_used']:,}")
print(f"API calls made: {stats['overview']['total_calls_made']}")
print(f"Success rate: {stats['overview']['success_rate_percent']}%")
# Print detailed formatted statistics
client.print_stats()
```
## Automatic Batching for Large Datasets
MALTopic v1.1.0 introduces intelligent automatic batching to handle large datasets that may exceed LLM token limits. This feature works seamlessly in the background:
### How It Works
1. **Automatic Detection**: When `generate_topics` encounters a token limit error, it automatically detects this and switches to batching mode.
2. **Smart Splitting**: The library uses `tiktoken` (OpenAI's token counting library) to intelligently split your data into optimally-sized batches based on actual token counts (a sketch of this approach follows the list).
3. **Batch Processing**: Each batch is processed independently, with progress tracking to keep you informed.
4. **Topic Consolidation**: Topics from all batches are automatically merged and deduplicated to provide a clean, comprehensive result.
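To make the splitting step concrete, here is a minimal sketch of greedy, token-budgeted batching with `tiktoken`. The function name `split_rows_into_batches` and the token budget are illustrative assumptions, not part of MALTopic's API; the library handles this internally.

```python
import tiktoken

def split_rows_into_batches(rows, max_tokens_per_batch, model="gpt-4.1-nano"):
    """Greedily pack rows into batches that stay under a token budget (illustrative sketch)."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")  # generic fallback for unknown model names
    batches, current, used = [], [], 0
    for row in rows:
        n = len(enc.encode(row))
        if current and used + n > max_tokens_per_batch:
            batches.append(current)  # close the full batch and start a new one
            current, used = [], 0
        current.append(row)
        used += n
    if current:
        batches.append(current)
    return batches

# e.g. split enriched responses so each batch fits within ~20k tokens (hypothetical budget)
batches = split_rows_into_batches(enriched_df["column_1_enriched"].tolist(), 20_000)
```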
### Key Benefits
- **No Code Changes Required**: Existing code works without modification; batching happens automatically when needed.
- **Optimal Performance**: Uses actual token counting for precise batch sizing, maximizing efficiency.
- **Robust Fallback**: Works even without `tiktoken` by falling back to simple batch splitting.
- **Progress Visibility**: Shows batch processing progress so you know what's happening.
- **Quality Preservation**: Maintains topic quality through intelligent consolidation and deduplication.
### Example Output
When batching is triggered, you'll see output like:
```
Token limit exceeded, splitting into batches...
Processing 3 batches...
Processing batches: 100%|██████████| 3/3 [00:45<00:00, 15.2s/it]
Batch 1/3: Generated 12 topics
Batch 2/3: Generated 8 topics
Batch 3/3: Generated 10 topics
Consolidated 30 topics into 25 unique topics
```
This feature makes MALTopic suitable for processing large-scale survey datasets without worrying about token limitations.
## Intelligent Topic Deduplication
MALTopic v1.2.0 introduces intelligent topic deduplication that goes beyond simple string matching to provide semantic analysis and consolidation of similar topics.
### How It Works
1. **Semantic Analysis**: Uses an LLM to analyze topic meanings, descriptions, and context rather than just comparing names.
2. **Smart Merging**: Identifies topics with significant semantic overlap (>80% similarity) and intelligently merges them while preserving unique perspectives.
3. **Structure Preservation**: Maintains the original topic structure and combines information from merged topics (a toy merge sketch follows this list):
- **Names**: Chooses the most descriptive and comprehensive name
- **Descriptions**: Combines descriptions to capture all relevant aspects
- **Relevance**: Merges relevance information from all source topics
- **Representative Words**: Combines word lists, removing duplicates
4. **Quality Preservation**: Preserves genuinely unique topics that represent distinct concepts with no significant overlap.
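As a toy illustration of the merging rules above, a sketch like the following captures the intent. Note that `merge_topics` is hypothetical: MALTopic performs the merge via LLM-powered semantic analysis, not string concatenation.

```python
def merge_topics(a: dict, b: dict) -> dict:
    """Combine two semantically overlapping topic dicts (illustrative, not MALTopic internals)."""
    return {
        # keep the longer (usually more descriptive) name
        "name": max(a["name"], b["name"], key=len),
        # concatenate descriptions so no aspect is lost
        "description": f'{a["description"]} {b["description"]}',
        # merge relevance information from both source topics
        "relevance": f'{a["relevance"]}; {b["relevance"]}',
        # union of word lists, preserving order and dropping duplicates
        "representative_words": list(dict.fromkeys(a["representative_words"] + b["representative_words"])),
    }
```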
### Key Benefits
- **Higher Quality Results**: Eliminates redundant or highly similar topics for cleaner analysis
- **Semantic Understanding**: Goes beyond keyword matching to understand topic meanings
- **Flexible Control**: Entirely optional; existing workflows continue to work unchanged
- **Robust Fallback**: Returns original topics unchanged if deduplication fails
- **Context-Aware**: Uses survey context to make better merging decisions
### Usage Example
```python
# Generate topics as usual
topics = client.generate_topics(
    topic_mining_context="Extract themes from customer feedback",
    df=enriched_df,
    enriched_column="feedback_enriched",
)

# Apply intelligent deduplication
deduplicated_topics = client.deduplicate_topics(
    topics=topics,
    survey_context="Customer satisfaction survey for mobile app",
)

print(f"Original topics: {len(topics)}")
print(f"After deduplication: {len(deduplicated_topics)}")
```
### Example Output
When deduplication is applied, you'll see output like:
```
Deduplicated 15 topics into 12 unique topics
```
This feature is particularly useful when:
- Working with large datasets that produce many overlapping topics
- You need cleaner, more consolidated results for reporting
- Multiple batches have generated similar topics that need consolidation
## Comprehensive Statistics Tracking
MALTopic includes built-in statistics tracking that automatically monitors your LLM usage, providing valuable insights into token consumption, API performance, and costs.
### Key Metrics Tracked
- **Token Usage**: Input, output, and total tokens from all API calls
- **API Performance**: Call counts, success/failure rates, and response times
- **Model Breakdown**: Statistics separated by each model used
- **Cost Monitoring**: Data needed to calculate estimated API costs
- **Real-time Updates**: Statistics update automatically as you use the library
### Accessing Statistics
MALTopic provides three simple methods to access your usage statistics:
```python
# Get comprehensive statistics as a dictionary
stats = client.get_stats()
print(f"Total tokens used: {stats['overview']['total_tokens_used']:,}")
print(f"Average response time: {stats['averages']['avg_response_time_seconds']:.2f}s")
# Print a formatted summary to console
client.print_stats()
# Reset statistics to start fresh
client.reset_stats()
```
### Example Statistics Output
When you call `client.print_stats()`, you'll see output like:
```
============================================================
MALTopic Library Usage Statistics
============================================================
📊 Overview:
  Total Tokens Used: 2,450
  - Input Tokens: 1,800
  - Output Tokens: 650
  Total API Calls: 8
  - Successful: 8
  - Failed: 0
  Success Rate: 100.0%
  Uptime: 125.3 seconds

📈 Averages:
  Avg Tokens per Call: 306.3
  - Avg Input Tokens: 225.0
  - Avg Output Tokens: 81.3
  Avg Response Time: 2.15s

🤖 Model Breakdown:
  gpt-4:
    Calls: 8 (Success: 8, Failed: 0)
    Tokens: 2,450 (Avg: 306.3)
    Success Rate: 100.0%
============================================================
```
### Cost Estimation Example
Use the statistics to estimate your API costs:
```python
stats = client.get_stats()
# Example with GPT-4 pricing (as of 2024)
input_cost = (stats['overview']['total_input_tokens'] / 1000) * 0.03 # $0.03 per 1K input tokens
output_cost = (stats['overview']['total_output_tokens'] / 1000) * 0.06 # $0.06 per 1K output tokens
total_estimated_cost = input_cost + output_cost
print(f"Estimated API cost: ${total_estimated_cost:.4f}")
```
### Benefits
- **Cost Control**: Monitor token usage to manage API expenses
- **Performance Optimization**: Identify bottlenecks and optimize prompts
- **Error Monitoring**: Track success rates to catch issues early
- **Usage Insights**: Understand patterns across different models and operations
Statistics tracking is **automatic** and **privacy-focused**: no tracking data leaves your environment, and statistics are stored only in memory during your session.
## Method Reference
### Core Methods
#### `enrich_free_text_with_structured_data()`
Enhances free-text survey responses with structured data context.
**Parameters:**
- `survey_context` (str): Context about the survey purpose and methodology
- `free_text_column` (str): Name of the column containing free-text responses
- `structured_data_columns` (list[str]): List of column names with structured data to use for enrichment
- `df` (pandas.DataFrame): DataFrame containing the survey data
- `examples` (list[str], optional): Examples of enrichment format
**Returns:** DataFrame with enriched text in a new column with "_enriched" suffix
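For example, enriching a column named `column_1` yields a new `column_1_enriched` column. A brief sketch, where the context strings and `df` are placeholders:

```python
enriched_df = client.enrich_free_text_with_structured_data(
    survey_context="Annual customer survey",   # placeholder context
    free_text_column="column_1",
    structured_data_columns=["column_2"],
    df=df,
)
print(enriched_df["column_1_enriched"].head())  # new column added by MALTopic
```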
#### `generate_topics()`
Extracts latent themes and topics from enriched survey responses.
**Parameters:**
- `topic_mining_context` (str): Context about what kind of topics to extract
- `df` (pandas.DataFrame): DataFrame containing enriched data
- `enriched_column` (str): Name of the column containing enriched text
**Returns:** List of topic dictionaries with structure:
```python
{
    "name": "Topic Name",
    "description": "Detailed description of the topic",
    "relevance": "Who this topic is relevant to",
    "representative_words": ["word1", "word2", "word3"]
}
```
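Because each topic is a plain dictionary, the returned list can be inspected directly, for example:

```python
for topic in topics:
    print(f'{topic["name"]}: {", ".join(topic["representative_words"])}')
```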
#### `deduplicate_topics()`
Intelligently consolidates similar topics using semantic analysis.
**Parameters:**
- `topics` (list[dict]): List of topic dictionaries to deduplicate
- `survey_context` (str): Context about the survey to help with merging decisions
**Returns:** List of deduplicated topic dictionaries with the same structure as input
#### `get_stats()`
Returns comprehensive statistics about LLM usage and performance.
**Returns:** Dictionary containing:
- `overview`: Total tokens, calls, success rates, and uptime
- `averages`: Average tokens per call, response times, etc.
- `model_breakdown`: Statistics separated by model
- `recent_calls`: Details of the most recent API calls
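For example, per-model usage can be read from the `model_breakdown` entry. The shape of each per-model record is an assumption based on the printed summary shown earlier:

```python
stats = client.get_stats()
# assumed to be a dict keyed by model name, e.g. calls, tokens, success rate per model
for model_name, model_stats in stats["model_breakdown"].items():
    print(model_name, model_stats)
```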
#### `print_stats()`
Prints a formatted summary of statistics to the console.
**Returns:** None (prints to console)
#### `reset_stats()`
Resets all statistics to zero and starts tracking fresh.
**Returns:** None
## Agents
- **Enrichment Agent**: Enhances free-text responses using structured data.
- **Topic Modeling Agent**: Extracts latent themes from enriched responses.
- **Deduplication Agent**: Intelligently refines and consolidates the extracted topics using LLM-powered semantic analysis.
## Changelog
For detailed release notes and version history, see [CHANGELOG.md](https://github.com/yash91sharma/MALTopic-py/blob/master/CHANGELOG.md).
### v1.3.0 (June 2025)
- **NEW**: Comprehensive statistics tracking with automatic LLM usage monitoring
- **NEW**: `get_stats()`, `print_stats()`, and `reset_stats()` methods for statistics access
- **NEW**: Real-time token usage, API performance, and cost monitoring
- **NEW**: Model-specific statistics breakdown and detailed metrics
- **IMPROVED**: Enhanced visibility into LLM usage patterns and costs
### v1.2.0 (June 2025)
- **NEW**: Intelligent topic deduplication using LLM-powered semantic analysis
- **NEW**: `deduplicate_topics()` method for consolidating similar topics
- **NEW**: Advanced topic merging that preserves meaningful distinctions
- **NEW**: Context-aware deduplication that considers survey background
- **NEW**: Robust error handling with fallback to original topics
- **IMPROVED**: Enhanced topic quality through semantic consolidation
- **IMPROVED**: Better user control over topic refinement process
### v1.1.0 (June 2025)
- **NEW**: Automatic batching for large datasets that exceed LLM token limits
- **NEW**: Intelligent token counting using `tiktoken` for optimal batch sizing
- **NEW**: Automatic error detection and seamless fallback to batching mode
- **NEW**: Topic consolidation and deduplication across batches
- **NEW**: Progress tracking for batch processing operations
- **IMPROVED**: Enhanced error handling and user feedback
- **IMPROVED**: Graceful degradation when `tiktoken` is not available
### v1.0.0 (May 2025)
- Multi-agent framework for topic modeling
- Data enrichment capabilities
- Basic topic extraction functionality
## Contributing
Contributions are welcome! Please feel free to submit a pull request or open an issue for any enhancements or bug fixes.
## License
This project is licensed under the MIT License. See the LICENSE file for more details.
## Citation
If you use MALTopic in your research, please cite:
```bibtex
@software{Sharma2025maltopic,
  author = {Sharma, Yash},
  title = {MALTopic: A library for topic modeling},
  year = {2025},
  url = {https://github.com/yash91sharma/MALTopic-py}
}
```