
# cat-llm
CatLLM: A Reproducible LLM Pipeline for Coding Open-Ended Survey Responses
-----
## Table of Contents
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Configuration](#configuration)
- [Supported Models](#supported-models)
- [API Reference](#api-reference)
- [explore_corpus()](#explore_corpus)
- [explore_common_categories()](#explore_common_categories)
- [multi_class()](#multi_class)
- [image_multi_class()](#image_multi_class)
- [image_score_drawing()](#image_score_drawing)
- [image_features()](#image_features)
- [build_web_research_dataset()](#build_web_research_dataset)
- [cerad_drawn_score()](#cerad_drawn_score)
- [Academic Research](#academic-research)
- [Contact](#contact)
- [License](#license)
## Installation
```console
pip install cat-llm
```
-----
## ⚠️ Beta Software Notice
**CatLLM is currently in beta.** While I'm actively working to make this tool robust and reliable, you may encounter bugs or unexpected behavior.
**I need your help!** This project thrives on community feedback. Please contribute by:
- 🐛 **Reporting bugs** - Found an issue? Let me know!
- 💡 **Sharing ideas** - Have suggestions for improvements or new features?
- 🔧 **Contributing code** - Submit pull requests with fixes or enhancements
**Visit our GitHub**: [github.com/chrissoria/cat-llm](https://github.com/chrissoria/cat-llm)
All feedback helps us build better research software for the community.
-----
## Quick Start
CatLLM helps social scientists and researchers automatically categorize open-ended survey responses, images, and web-scraped data using AI models like GPT-5 and Claude. Not to be confused with CAT-LLM for Chinese article‐style transfer ([Tao et al. 2024](https://arxiv.org/html/2401.05707v1)).
Text Analysis: Simply provide your survey responses and category list - the package handles the rest and outputs clean data ready for statistical analysis. It works with single or multiple categories per response and automatically skips missing data to save API costs.
Image Categorization: Uses the same intelligent categorization method to analyze images, extracting specific features, counting objects, identifying colors, or determining the presence of elements based on your research questions.
Web Data Collection: Builds comprehensive datasets by scraping web data and using Large Language Models to extract exactly the information you need. The function searches across multiple sources, processes the findings through AI models, and structures everything into clean dataframe format ready for export to CSV.
Whether you're working with messy text responses, analyzing visual content, or gathering information from across the web, CatLLM consistently transforms unstructured data into structured categories and datasets you can actually analyze. All outputs are formatted for immediate statistical analysis and can be exported directly to CSV for further research workflows.
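To picture the kind of structured output this workflow produces, here is a hypothetical sketch (plain Python, not the package's actual column layout) of how multi-label category assignments can be flattened into one-hot indicator columns ready for statistical analysis:

```python
# Hypothetical sketch: turning per-response category labels into
# 0/1 indicator columns. cat-llm's real output format may differ.
responses = ["flexible schedule", "good pay and interesting projects"]
assigned = [["schedule"], ["pay", "projects"]]  # labels assigned to each response
categories = ["schedule", "pay", "projects"]

rows = []
for text, labels in zip(responses, assigned):
    row = {"response": text}
    for cat in categories:
        row[cat] = int(cat in labels)  # 1 if the category applies, else 0
    rows.append(row)

print(rows[1])  # → {'response': 'good pay and interesting projects', 'schedule': 0, 'pay': 1, 'projects': 1}
```

Each row can then be written to CSV or loaded into a DataFrame for regression or frequency analysis.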
## Configuration
### Get Your OpenAI API Key
1. **Create an OpenAI Developer Account**:
- Go to [platform.openai.com](https://platform.openai.com) (separate from regular ChatGPT)
- Sign up with email, Google, Microsoft, or Apple
2. **Generate an API Key**:
- Log into your account and click your name in the top right corner
- Click "View API keys" or navigate to the "API keys" section
- Click "Create new secret key"
- Give your key a descriptive name
- Set permissions (choose "All" for full access)
3. **Add Payment Details**:
- Add a payment method to your OpenAI account
- Purchase credits (start with $5 - it lasts a long time for most research use)
- **Important**: Your API key won't work without credits
4. **Save Your Key Securely**:
- Copy the key immediately (you won't be able to see it again)
- Store it safely and never share it publicly
5. **Use Your Key**: Pass it to any catllm function via the `api_key` parameter
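Rather than hardcoding the key in scripts, a common pattern (general Python practice, not specific to catllm) is to read it from an environment variable:

```python
import os

# Store the key in an environment variable, e.g. in your shell profile:
#   export OPENAI_API_KEY="sk-..."
# then read it at runtime instead of pasting it into code you might share.
api_key = os.environ.get("OPENAI_API_KEY", "")
if not api_key:
    print("Warning: OPENAI_API_KEY is not set")
```

You can then pass `api_key=api_key` to any catllm function.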
## Supported Models
- **OpenAI**: GPT-4o, GPT-4, GPT-3.5-turbo, etc.
- **Anthropic**: Claude 3.7 Sonnet, Claude Haiku, etc.
- **Perplexity**: Sonar Large, Sonar Small, etc.
- **Mistral**: Mistral Large, Mistral Small, etc.
**Fully Tested (Beta):**
- ✅ OpenAI (GPT-4, GPT-4o, GPT-3.5-turbo, etc.)
- ✅ Anthropic (Claude 3.5 Sonnet, Haiku)
- ✅ Perplexity (Sonar models)
- ✅ Google Gemini - Free tier has severe rate limits (5 RPM). Requires Google AI Studio billing account for large-scale use.
**Supported but Limited:**
- ⚠️ Huggingface - API routing can be unstable
**Note:** For beta testing, I recommend starting with OpenAI or Anthropic.
## API Reference
### `explore_corpus()`
Extracts categories from a corpus of text responses and returns frequency counts.
**Methodology:**
The function divides the corpus into random chunks to address the probabilistic nature of LLM outputs. By processing multiple chunks and averaging results across many API calls rather than relying on a single call, this approach significantly improves reproducibility and provides more stable categorical frequency estimates.
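The chunking-and-pooling idea can be sketched in plain Python (an illustrative sketch of the methodology only, not the package's internal code; `label_fn` stands in for one LLM call per chunk):

```python
import random
from collections import Counter

def chunk_and_count(responses, divisions, label_fn, seed=42):
    """Shuffle responses, split them into `divisions` chunks, label each
    chunk separately, and pool the counts -- mimicking how repeated calls
    over random chunks stabilize category frequency estimates."""
    rng = random.Random(seed)
    shuffled = responses[:]
    rng.shuffle(shuffled)
    chunks = [shuffled[i::divisions] for i in range(divisions)]
    totals = Counter()
    for chunk in chunks:
        # In cat-llm this step would be one API call per chunk.
        totals.update(label_fn(r) for r in chunk)
    return totals

counts = chunk_and_count(
    ["good pay", "flexible hours", "pay raise", "remote work"],
    divisions=2,
    label_fn=lambda r: "pay" if "pay" in r else "flexibility",
)
print(counts["pay"])  # → 2
```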
**Parameters:**
- `survey_question` (str): The survey question being analyzed
- `survey_input` (list): List of text responses to categorize
- `api_key` (str): API key for the LLM service
- `cat_num` (int, default=10): Number of categories to extract in each iteration
- `divisions` (int, default=5): Number of chunks to divide the data into (larger corpora might require larger divisions)
- `specificity` (str, default="broad"): Category precision level (e.g., "broad", "narrow")
- `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
- `user_model` (str, default="gpt-4o"): Specific model (e.g., "gpt-4o", "claude-opus-4-20250514")
- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
- `filename` (str, optional): Output file path for saving results
**Returns:**
- `pandas.DataFrame`: Two-column dataset with category names and frequencies
**Example:**
```python
import catllm as cat

categories = cat.explore_corpus(
    survey_question="What motivates you most at work?",
    survey_input=["flexible schedule", "good pay", "interesting projects"],
    api_key="OPENAI_API_KEY",
    cat_num=5,
    divisions=10
)
```
### `explore_common_categories()`
Identifies the most frequently occurring categories across a text corpus and returns the top N categories by frequency count.
**Methodology:**
Divides the corpus into random chunks and averages results across multiple API calls to improve reproducibility and provide stable frequency estimates for the most prevalent categories, addressing the probabilistic nature of LLM outputs.
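The top-N selection step amounts to ranking the pooled counts, which is roughly (illustrative only, with made-up counts):

```python
from collections import Counter

# Hypothetical pooled category counts from the chunked API calls
pooled = Counter({"pay": 14, "flexibility": 9, "commute": 3, "management": 2})
top_categories = pooled.most_common(2)  # keep only the top N by frequency
print(top_categories)  # → [('pay', 14), ('flexibility', 9)]
```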
**Parameters:**
- `survey_question` (str): Survey question being analyzed
- `survey_input` (list): Text responses to categorize
- `api_key` (str): API key for the LLM service
- `top_n` (int, default=10): Number of top categories to return by frequency
- `cat_num` (int, default=10): Number of categories to extract per iteration
- `divisions` (int, default=5): Number of data chunks (increase for larger corpora)
- `user_model` (str, default="gpt-4o"): Specific model to use
- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
- `specificity` (str, default="broad"): Category precision level ("broad", "narrow")
- `research_question` (str, optional): Contextual research question to guide categorization
- `filename` (str, optional): File path to save output dataset
- `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
**Returns:**
- `pandas.DataFrame`: Dataset with category names and frequencies, limited to top N most common categories
**Example:**
```python
import catllm as cat

top_10_categories = cat.explore_common_categories(
    survey_question="What motivates you most at work?",
    survey_input=["flexible schedule", "good pay", "interesting projects"],
    api_key="OPENAI_API_KEY",
    top_n=10,
    cat_num=5,
    divisions=10
)
print(top_10_categories)
```
### `multi_class()`
Performs multi-label classification of text responses into user-defined categories, returning structured results with optional CSV export.
**Methodology:**
Processes each text response individually, assigning one or more categories from the provided list. Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows.
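Multi-label replies from a model are typically a list of the category numbers that apply; parsing them into per-category flags might look like this (a hypothetical reply format for illustration; cat-llm's actual prompt/response contract may differ):

```python
def parse_labels(model_reply, categories):
    """Turn a reply like '1, 3' (numbers of the categories that apply)
    into a 0/1 flag for each category, in order."""
    chosen = set()
    for token in model_reply.replace(",", " ").split():
        if token.isdigit():
            chosen.add(int(token))
    return [1 if i + 1 in chosen else 0 for i in range(len(categories))]

cats = ["partner", "job change", "financial"]
print(parse_labels("1, 3", cats))  # → [1, 0, 1]
```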
**Parameters:**
- `survey_input` (list): List of text responses to classify
- `categories` (list): List of predefined categories for classification
- `api_key` (str): API key for the LLM service
- `user_model` (str, default="gpt-5"): Specific model to use
- `user_prompt` (str, optional): Custom prompt template to override default prompting
- `survey_question` (str, default=""): The survey question being analyzed
- `example1` through `example6` (dict, optional): Few-shot learning examples (format: {"response": "...", "categories": [...]})
- `creativity` (float, optional): Temperature/randomness setting (0.0-1.0, varies by model)
- `safety` (bool, default=False): Enable safety checks on responses and save results to CSV at each API call step
- `to_csv` (bool, default=False): Whether to save results to CSV
- `chain_of_verification` (bool, default=False): Enable Chain-of-Verification prompting technique for improved accuracy
- `chain_of_thought` (bool, default=False): Enable Chain-of-Thought prompting technique for improved accuracy
- `step_back_prompt` (bool, default=False): Enable step-back prompting to analyze higher-level context before classification
- `context_prompt` (bool, default=False): Add expert role and behavioral guidelines to the prompt
- `filename` (str, default="categorized_data.csv"): Filename for CSV output
- `save_directory` (str, optional): Directory path to save the CSV file
- `model_source` (str, default="auto"): Model provider ("auto", "OpenAI", "Anthropic", "Google", "Mistral", "Perplexity", "Huggingface")
**Returns:**
- `pandas.DataFrame`: DataFrame with classification results, columns formatted as specified
**Example:**
```python
import catllm as cat

user_categories = [
    "to start living with or to stay with partner/spouse",
    "relationship change (divorce, breakup, etc)",
    "the person had a job or school or career change, including transferred and retired",
    "the person's partner's job or school or career change, including transferred and retired",
    "financial reasons (rent is too expensive, pay raise, etc)",
    "related specifically to features of the home, such as a bigger or smaller yard"]

question = "Why did you move?"

move_reasons = cat.multi_class(
    survey_question=question,
    survey_input=df[column1],
    user_model="gpt-4o",
    creativity=0,
    categories=user_categories,
    safety=True,
    api_key="OPENAI_API_KEY")
```
### `image_multi_class()`
Performs multi-label image classification into user-defined categories, returning structured results with optional CSV export.
**Methodology:**
Processes each image individually, assigning one or more categories from the provided list. Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows.
**Parameters:**
- `image_description` (str): A description of what the model should expect to see
- `image_input` (list): List of file paths or a folder to pull file paths from
- `categories` (list): List of predefined categories for classification
- `api_key` (str): API key for the LLM service
- `user_model` (str, default="gpt-4o"): Specific model to use
- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
- `safety` (bool, default=False): Enable safety checks on responses and save results to CSV at each API call step
- `filename` (str, default="categorized_data.csv"): Filename for CSV output
- `save_directory` (str, optional): Directory path to save the CSV file
- `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
**Returns:**
- `pandas.DataFrame`: DataFrame with classification results, columns formatted as specified
**Example:**
```python
import catllm as cat

user_categories = ["has a cat somewhere in it",
                   "looks cartoonish",
                   "Adrian Brody is in it"]

description = "Should be an image of a child's drawing"

image_categories = cat.image_multi_class(
    image_description=description,
    image_input=['desktop/image1.jpg', 'desktop/image2.jpg', 'desktop/image3.jpg'],
    user_model="gpt-4o",
    creativity=0,
    categories=user_categories,
    safety=True,
    api_key="OPENAI_API_KEY")
```
### `image_score_drawing()`
Performs quality scoring of images against a reference description and optional reference image, returning structured results with optional CSV export.
**Methodology:**
Processes each image individually, assigning a drawing quality score on a 5-point scale based on similarity to the expected description:
- **1**: No meaningful similarity (fundamentally different)
- **2**: Barely recognizable similarity (25% match)
- **3**: Partial match (50% key features)
- **4**: Strong alignment (75% features)
- **5**: Near-perfect match (90%+ similarity)
Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows.
**Parameters:**
- `reference_image_description` (str): A description of what the model should expect to see
- `image_input` (list): List of image file paths or folder path containing images
- `reference_image` (str): A file path to the reference image
- `api_key` (str): API key for the LLM service
- `user_model` (str, default="gpt-4o"): Specific vision model to use
- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
- `safety` (bool, default=False): Enable safety checks and save results at each API call step
- `filename` (str, default="image_scores.csv"): Filename for CSV output
- `save_directory` (str, optional): Directory path to save the CSV file
- `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
**Returns:**
- `pandas.DataFrame`: DataFrame with image paths, quality scores, and analysis details
**Example:**
```python
import catllm as cat

image_scores = cat.image_score_drawing(
    reference_image_description='Adrien Brody sitting in a lawn chair',
    image_input=['desktop/image1.jpg', 'desktop/image2.jpg', 'desktop/image3.jpg'],
    user_model="gpt-4o",
    creativity=0,
    safety=True,
    api_key="OPENAI_API_KEY")
```
### `image_features()`
Extracts specific features and attributes from images, returning exact answers to user-defined questions (e.g., counts, colors, presence of objects).
**Methodology:**
Processes each image individually using vision models to extract precise information about specified features. Unlike scoring and multi-class functions, this returns factual data such as object counts, color identification, or presence/absence of specific elements. Supports flexible output formatting and optional CSV export for quantitative analysis workflows.
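Vision APIs generally receive images as base64-encoded payloads; the encoding step looks roughly like this (a generic sketch, not cat-llm internals -- the file name and fake bytes are stand-ins so the example runs without a real image):

```python
import base64
from pathlib import Path

def encode_image(path):
    """Read an image file and return a base64 string, the form most
    vision model APIs expect in their request payloads."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")

# Tiny stand-in "image" so the sketch runs without a real file on disk:
demo = Path("demo_image.bin")
demo.write_bytes(b"\x89PNG_fake")
encoded = encode_image(demo)
print(encoded[:8])
demo.unlink()  # clean up the stand-in file
```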
**Parameters:**
- `image_description` (str): A description of what the model should expect to see
- `image_input` (list): List of image file paths or folder path containing images
- `features_to_extract` (list): List of specific features to extract (e.g., ["number of people", "primary color", "contains text"])
- `api_key` (str): API key for the LLM service
- `user_model` (str, default="gpt-4o"): Specific vision model to use
- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
- `to_csv` (bool, default=False): Whether to save the output to a CSV file
- `safety` (bool, default=False): Enable safety checks and save results at each API call step
- `filename` (str, default="categorized_data.csv"): Filename for CSV output
- `save_directory` (str, optional): Directory path to save the CSV file
- `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")
**Returns:**
- `pandas.DataFrame`: DataFrame with image paths and extracted feature values for each specified attribute
**Example:**
```python
import catllm as cat

extracted_features = cat.image_features(
    image_description='An AI generated image of Spongebob dancing with Patrick',
    features_to_extract=['Spongebob is yellow', 'Both are smiling', 'Patrick is chunky'],
    image_input=['desktop/image1.jpg', 'desktop/image2.jpg', 'desktop/image3.jpg'],
    model_source='OpenAI',
    user_model="gpt-4o",
    creativity=0,
    safety=True,
    api_key="OPENAI_API_KEY")
```
### `build_web_research_dataset()`
Conducts automated web research on specified topics and compiles the findings into a structured dataset, extracting answers and source URLs for comprehensive research workflows.
NOTE: This function currently only works with Anthropic models and requires an Anthropic API key. It is strongly recommended to increase your API rate limits before using this function to avoid interruptions during web research tasks.
SECOND NOTE: This function works best with specific search questions. For example, instead of search_question="Hottest temperature in 2024?" use "Hottest temperature in 2024 from extremeweatherwatch.com?" or "Hottest temperature in 2024 from weatherunderground.com?". Similarly, use "Where did these UC Berkeley professors get their PhDs according to LinkedIn?" instead of "Where did they get their PhDs according to LinkedIn?" to avoid matching people with the same name.
THIRD NOTE: This function works by scraping data from the web. Be aware that not all websites allow web scraping by Anthropic, so the function won't be able to retrieve information from those sites.
**Methodology:**
Performs systematic web searches using the specified search questions and processes the results through Anthropic's language models to extract relevant information. The function handles multiple search queries sequentially, applying time delays between requests to respect rate limits. Results are categorized according to user-defined criteria and can be exported to CSV format for further analysis and research documentation.
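The sequential-with-delay pattern described above is roughly the following (an illustrative sketch; `fetch` stands in for the web-search plus LLM-extraction step, and the tiny delay is only so the example runs quickly -- real usage would use the `time_delay` default of ~15 seconds):

```python
import time

def run_queries(queries, fetch, delay=0.01):
    """Process queries one at a time, sleeping between requests so the
    provider's rate limit is respected."""
    results = []
    for i, q in enumerate(queries):
        results.append(fetch(q))
        if i < len(queries) - 1:
            time.sleep(delay)  # pause between requests, not after the last one
    return results

answers = run_queries(["q1", "q2"], fetch=lambda q: q.upper())
print(answers)  # → ['Q1', 'Q2']
```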
**Rate Limits:**
Before using this function, review and increase your Anthropic API rate limits at: https://console.anthropic.com/settings/limits. For general information about API rate limits, consult the Anthropic documentation at: https://docs.anthropic.com/claude/reference/rate-limits
**Parameters:**
- `search_question` (str): Primary research question or topic to guide the search strategy
- `search_input` (list): List of specific search queries or questions to investigate
- `features_to_extract` (list): List of specific features to extract (e.g., ["number of people", "primary color", "contains text"])
- `api_key` (str): API key for the LLM service
- `answer_format` (str, default="concise"): Response detail level ("concise", "detailed", "comprehensive")
- `additional_instructions` (str, optional): Extra guidance for how search results should be processed
- `user_model` (str, default="claude-3-7-sonnet-20250219"): Specific Anthropic model to use
- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
- `safety` (bool, default=False): Enable safety checks and save results at each API call step
- `filename` (str, default="categorized_data.csv"): Filename for CSV output
- `save_directory` (str, optional): Directory path to save the CSV file
- `model_source` (str, default="Anthropic"): Model provider (currently only "Anthropic" is supported)
- `time_delay` (int, default=15): Delay in seconds between search requests to manage API rate limits
**Returns:**
- `pandas.DataFrame`: DataFrame with search queries, extracted answers, and source URLs
**Example:**
```python
import catllm as cat

research_data = cat.build_web_research_dataset(
    search_question="What are the latest developments in renewable energy technology?",
    search_input=["solar panel efficiency 2025", "wind turbine innovations", "battery storage breakthroughs"],
    api_key="ANTHROPIC_API_KEY",
    answer_format="detailed",
    additional_instructions="Focus on recent technological advances and commercial applications",
    features_to_extract=['Answer', 'URL', 'Date', 'Key_Technology'],
    model_source="Anthropic",
    user_model="claude-3-7-sonnet-20250219",
    creativity=0.1,
    safety=True,
    time_delay=3
)
```
### `cerad_drawn_score()`
Automatically scores drawings of circles, diamonds, overlapping rectangles, and cubes according to the official Consortium to Establish a Registry for Alzheimer's Disease (CERAD) scoring system, returning structured results with optional CSV export. Works even with images that contain other drawings or writing.
**Methodology:**
Processes each image individually, evaluating the drawn shapes based on CERAD criteria. Supports optional inclusion of reference shapes within images and can provide reference examples if requested. The function outputs standardized scores facilitating reproducible analysis and integrates optional safety checks and CSV export for research workflows.
**Parameters:**
- `shape` (str): The type of shape to score (e.g., "circle", "diamond", "overlapping rectangles", "cube")
- `image_input` (list): List of image file paths or folder path containing images
- `api_key` (str): API key for the LLM service
- `user_model` (str, default="gpt-4o"): Specific model to use
- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
- `reference_in_image` (bool, default=False): Whether a reference shape is present in the image for comparison
- `provide_reference` (bool, default=False): Whether to provide a reference example image (built in reference image)
- `safety` (bool, default=False): Enable safety checks and save results at each API call step
- `filename` (str, default="categorized_data.csv"): Filename for CSV output
- `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Mistral")
**Returns:**
- `pandas.DataFrame`: DataFrame with image paths, CERAD scores, and analysis details
**Example:**
```python
import catllm as cat

diamond_scores = cat.cerad_drawn_score(
    shape="diamond",
    image_input=df['diamond_pic_path'],
    api_key=open_ai_key,
    safety=True,
    filename="diamond_gpt_score.csv",
)
```
## Academic Research
This package implements methodology from research on LLM performance in social science applications, including the UC Berkeley Social Networks Study. The package addresses reproducibility challenges in LLM-assisted research by providing standardized interfaces and consistent output formatting.
If you use this package for research, please cite:
Soria, C. (2025). CatLLM (0.0.8). Zenodo. https://doi.org/10.5281/zenodo.15532317
## Contact
**Interested in research collaboration?** Email: [ChrisSoria@Berkeley.edu](mailto:ChrisSoria@Berkeley.edu)
## License
`cat-llm` is distributed under the terms of the [GNU GPL-3.0](https://www.gnu.org/licenses/gpl-3.0.en.html) license.
Raw data
{
"_id": null,
"home_page": null,
"name": "cat-llm",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": null,
"keywords": "categorizer, image classification, llm, structured output, survey data, text classification",
"author": null,
"author_email": "Chris Soria <chrissoria@berkeley.edu>",
"download_url": "https://files.pythonhosted.org/packages/f0/db/2a2e5b00f898f1d4bb903e2607985768e7cc6800da57194697772f764ccc/cat_llm-0.0.99.tar.gz",
"platform": null,
"description": "\n\n# cat-llm\n\nCatLLM: A Reproducible LLM Pipeline for Coding Open-Ended Survey Responses\n\n[](https://pypi.org/project/cat-llm)\n[](https://pypi.org/project/cat-llm)\n\n-----\n\n## Table of Contents\n\n- [Installation](#installation)\n- [Quick Start](#quick-start)\n- [Configuration](#configuration)\n- [Supported Models](#supported-models)\n- [API Reference](#api-reference)\n - [explore_corpus()](#explore_corpus)\n - [explore_common_categories()](#explore_common_categories)\n - [multi_class()](#multi_class)\n - [image_score()](#image_score_drawing)\n - [image_features()](#image_features)\n - [build_web_research_dataset()](#build_web_research_dataset)\n - [cerad_drawn_score()](#cerad_drawn_score)\n- [Academic Research](#academic-research)\n- [Contact](#contact)\n- [License](#license)\n\n## Installation\n\n```console\npip install cat-llm\n```\n\n-----\n\n## \u26a0\ufe0f Beta Software Notice\n\n**CatLLM is currently in beta.** While I'm actively working to make this tool robust and reliable, you may encounter bugs or unexpected behavior. \n\n**I need your help!** This project thrives on community feedback. Please contribute by:\n- \ud83d\udc1b **Reporting bugs** - Found an issue? Let me know!\n- \ud83d\udca1 **Sharing ideas** - Have suggestions for improvements or new features?\n- \ud83d\udd27 **Contributing code** - Submit pull requests with fixes or enhancements\n\n**Visit our GitHub**: [github.com/chrissoria/cat-llm](https://github.com/chrissoria/cat-llm)\n\nAll feedback helps us build better research software for the community.\n\n-----\n\n## Quick Start\n\nCatLLM helps social scientists and researchers automatically categorize open-ended survey responses, images, and web-scraped data using AI models like GPT-5 and Claude. Not to be confused with CAT-LLM for Chinese article\u2010style transfer ([Tao et al. 
2024](https://arxiv.org/html/2401.05707v1)).\n\nText Analysis: Simply provide your survey responses and category list - the package handles the rest and outputs clean data ready for statistical analysis. It works with single or multiple categories per response and automatically skips missing data to save API costs.\n\nImage Categorization: Uses the same intelligent categorization method to analyze images, extracting specific features, counting objects, identifying colors, or determining the presence of elements based on your research questions.\n\nWeb Data Collection: Builds comprehensive datasets by scraping web data and using Large Language Models to extract exactly the information you need. The function searches across multiple sources, processes the findings through AI models, and structures everything into clean dataframe format ready for export to CSV.\n\nWhether you're working with messy text responses, analyzing visual content, or gathering information from across the web, CatLLM consistently transforms unstructured data into structured categories and datasets you can actually analyze. All outputs are formatted for immediate statistical analysis and can be exported directly to CSV for further research workflows.\n\n\n\n## Configuration\n\n### Get Your OpenAI API Key\n\n1. **Create an OpenAI Developer Account**:\n - Go to [platform.openai.com](https://platform.openai.com) (separate from regular ChatGPT)\n - Sign up with email, Google, Microsoft, or Apple\n\n2. **Generate an API Key**:\n - Log into your account and click your name in the top right corner\n - Click \"View API keys\" or navigate to the \"API keys\" section\n - Click \"Create new secret key\"\n - Give your key a descriptive name\n - Set permissions (choose \"All\" for full access)\n\n3. 
**Add Payment Details**:\n - Add a payment method to your OpenAI account\n - Purchase credits (start with $5 - it lasts a long time for most research use)\n - **Important**: Your API key won't work without credits\n\n4. **Save Your Key Securely**:\n - Copy the key immediately (you won't be able to see it again)\n - Store it safely and never share it publicly\n\n5. Copy and paste your key into catllm in the api_key parameter\n\n## Supported Models\n\n- **OpenAI**: GPT-4o, GPT-4, GPT-3.5-turbo, etc.\n- **Anthropic**: Claude Sonnet 3.7, Claude Haiku, etc.\n- **Perplexity**: Sonnar Large, Sonnar Small, etc.\n- **Mistral**: Mistral Large, Mistral Small, etc.\n\n**Fully Tested (Beta):**\n- \u2705 OpenAI (GPT-4, GPT-4o, GPT-3.5-turbo, etc.)\n- \u2705 Anthropic (Claude 3.5 Sonnet, Haiku)\n- \u2705 Perplexity (Sonar models)\n- \u2705 Google Gemini - Free tier has severe rate limits (5 RPM). Requires Google AI Studio billing account for large-scale use.\n\n**Supported but Limited:**\n- \n- \u26a0\ufe0f Huggingface - API routing can be unstable\n\n**Note:** For beta testing, I recommend starting with OpenAI or Anthropic.\n\n\n## API Reference\n\n### `explore_corpus()`\n\nExtracts categories from a corpus of text responses and returns frequency counts.\n\n**Methodology:**\nThe function divides the corpus into random chunks to address the probabilistic nature of LLM outputs. 
By processing multiple chunks and averaging results across many API calls rather than relying on a single call, this approach significantly improves reproducibility and provides more stable categorical frequency estimates.\n\n**Parameters:**\n- `survey_question` (str): The survey question being analyzed\n- `survey_input` (list): List of text responses to categorize\n- `api_key` (str): API key for the LLM service\n- `cat_num` (int, default=10): Number of categories to extract in each iteration\n- `divisions` (int, default=5): Number of chunks to divide the data into (larger corpora might require larger divisions)\n- `specificity` (str, default=\"broad\"): Category precision level (e.g., \"broad\", \"narrow\")\n- `model_source` (str, default=\"OpenAI\"): Model provider (\"OpenAI\", \"Anthropic\", \"Perplexity\", \"Mistral\")\n- `user_model` (str, default=\"got-4o\"): Specific model (e.g., \"gpt-4o\", \"claude-opus-4-20250514\")\n- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)\n- `filename` (str, optional): Output file path for saving results\n\n**Returns:**\n- `pandas.DataFrame`: Two-column dataset with category names and frequencies\n\n**Example:***\n\n```\nimport catllm as cat\n\ncategories = cat.explore_corpus(\nsurvey_question=\"What motivates you most at work?\",\nsurvey_input=[\"flexible schedule\", \"good pay\", \"interesting projects\"],\napi_key=\"OPENAI_API_KEY\",\ncat_num=5,\ndivisions=10\n)\n```\n\n### `explore_common_categories()`\n\nIdentifies the most frequently occurring categories across a text corpus and returns the top N categories by frequency count.\n\n**Methodology:**\nDivides the corpus into random chunks and averages results across multiple API calls to improve reproducibility and provide stable frequency estimates for the most prevalent categories, addressing the probabilistic nature of LLM outputs.\n\n**Parameters:**\n- `survey_question` (str): Survey question being analyzed\n- `survey_input` (list): Text 
responses to categorize\n- `api_key` (str): API key for the LLM service\n- `top_n` (int, default=10): Number of top categories to return by frequency\n- `cat_num` (int, default=10): Number of categories to extract per iteration\n- `divisions` (int, default=5): Number of data chunks (increase for larger corpora)\n- `user_model` (str, default=\"gpt-4o\"): Specific model to use\n- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)\n- `specificity` (str, default=\"broad\"): Category precision level (\"broad\", \"narrow\")\n- `research_question` (str, optional): Contextual research question to guide categorization\n- `filename` (str, optional): File path to save output dataset\n- `model_source` (str, default=\"OpenAI\"): Model provider (\"OpenAI\", \"Anthropic\", \"Perplexity\", \"Mistral\")\n\n**Returns:**\n- `pandas.DataFrame`: Dataset with category names and frequencies, limited to top N most common categories\n\n**Example:**\n\n```\nimport catllm as cat\n\ntop_10_categories = cat.explore_common_categories(\nsurvey_question=\"What motivates you most at work?\",\nsurvey_input=[\"flexible schedule\", \"good pay\", \"interesting projects\"],\napi_key=\"OPENAI_API_KEY\",\ntop_n=10,\ncat_num=5,\ndivisions=10\n)\nprint(categories)\n```\n### `multi_class()`\n\nPerforms multi-label classification of text responses into user-defined categories, returning structured results with optional CSV export.\n\n**Methodology:**\nProcesses each text response individually, assigning one or more categories from the provided list. 
Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows.

**Parameters:**
- `survey_input` (list): List of text responses to classify
- `categories` (list): List of predefined categories for classification
- `api_key` (str): API key for the LLM service
- `user_model` (str, default="gpt-5"): Specific model to use
- `user_prompt` (str, optional): Custom prompt template to override the default prompting
- `survey_question` (str, default=""): The survey question being analyzed
- `example1` through `example6` (dict, optional): Few-shot learning examples (format: {"response": "...", "categories": [...]})
- `creativity` (float, optional): Temperature/randomness setting (0.0-1.0, varies by model)
- `safety` (bool, default=False): Enable safety checks on responses and save interim results to CSV after each API call
- `to_csv` (bool, default=False): Whether to save results to CSV
- `chain_of_verification` (bool, default=False): Enable Chain-of-Verification prompting for improved accuracy
- `chain_of_thought` (bool, default=False): Enable Chain-of-Thought prompting for improved accuracy
- `step_back_prompt` (bool, default=False): Enable step-back prompting to analyze higher-level context before classification
- `context_prompt` (bool, default=False): Add an expert role and behavioral guidelines to the prompt
- `filename` (str, default="categorized_data.csv"): Filename for CSV output
- `save_directory` (str, optional): Directory path to save the CSV file
- `model_source` (str, default="auto"): Model provider ("auto", "OpenAI", "Anthropic", "Google", "Mistral", "Perplexity", "Huggingface")

**Returns:**
- `pandas.DataFrame`: DataFrame with classification results, columns formatted as specified

**Example:**

```
import catllm as cat

user_categories = [
    "to start living with or to stay with partner/spouse",
    "relationship change (divorce, breakup, etc)",
    "the person had a job or school or career change, including transferred and retired",
    "the person's partner's job or school or career change, including transferred and retired",
    "financial reasons (rent is too expensive, pay raise, etc)",
    "related specifically to features of the home, such as a bigger or smaller yard"
]

question = "Why did you move?"

move_reasons = cat.multi_class(
    survey_question=question,
    survey_input=df["column1"],
    user_model="gpt-4o",
    creativity=0,
    categories=user_categories,
    safety=True,
    api_key="OPENAI_API_KEY")
```

### `image_multi_class()`

Performs multi-label image classification into user-defined categories, returning structured results with optional CSV export.

**Methodology:**
Processes each image individually, assigning one or more categories from the provided list. Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows.

**Parameters:**
- `image_description` (str): A description of what the model should expect to see
- `image_input` (list): List of file paths or a folder to pull file paths from
- `categories` (list): List of predefined categories for classification
- `api_key` (str): API key for the LLM service
- `user_model` (str, default="gpt-4o"): Specific model to use
- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
- `safety` (bool, default=False): Enable safety checks and save interim results to CSV after each API call
- `filename` (str, default="categorized_data.csv"): Filename for CSV output
- `save_directory` (str, optional): Directory path to save the CSV file
- `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")

**Returns:**
- `pandas.DataFrame`: DataFrame with classification results, columns formatted as specified

**Example:**

```
import catllm as cat

user_categories = [
    "has a cat somewhere in it",
    "looks cartoonish",
    "Adrien Brody is in it"
]

description = "Should be an image of a child's drawing"

image_categories = cat.image_multi_class(
    image_description=description,
    image_input=['desktop/image1.jpg', 'desktop/image2.jpg', 'desktop/image3.jpg'],
    user_model="gpt-4o",
    creativity=0,
    categories=user_categories,
    safety=True,
    api_key="OPENAI_API_KEY")
```

### `image_score_drawing()`

Performs quality scoring of images against a reference description and optional reference image, returning structured results with optional CSV export.

**Methodology:**
Processes each image individually, assigning a drawing quality score on a 5-point scale based on similarity to the expected description:

- **1**: No meaningful similarity (fundamentally different)
- **2**: Barely recognizable similarity (25% match)
- **3**: Partial match (50% of key features)
- **4**: Strong alignment (75% of features)
- **5**: Near-perfect match (90%+ similarity)

Supports flexible output formatting and optional saving of results to CSV for easy integration with data analysis workflows.

**Parameters:**
- `reference_image_description` (str): A description of what the model should expect to see
- `image_input` (list): List of image file paths or folder path containing images
- `reference_image` (str): A file path to the reference image
- `api_key` (str): API key for the LLM service
- `user_model` (str, default="gpt-4o"): Specific vision model to use
- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
- `safety` (bool, default=False): Enable safety checks and save interim results after each API call
- `filename` (str, default="image_scores.csv"): Filename for CSV output
- `save_directory` (str, optional): Directory path to save the CSV file
- `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")

**Returns:**
- `pandas.DataFrame`: DataFrame with image paths, quality scores, and analysis details

**Example:**

```
import catllm as cat

image_scores = cat.image_score(
    reference_image_description='Adrien Brody sitting in a lawn chair',
    image_input=['desktop/image1.jpg', 'desktop/image2.jpg', 'desktop/image3.jpg'],
    user_model="gpt-4o",
    creativity=0,
    safety=True,
    api_key="OPENAI_API_KEY")
```

### `image_features()`

Extracts specific features and attributes from images, returning exact answers to user-defined questions (e.g., counts, colors, presence of objects).

**Methodology:**
Processes each image individually using vision models to extract precise information about specified features. Unlike the scoring and multi-class functions, this returns factual data such as object counts, color identification, or presence/absence of specific elements. Supports flexible output formatting and optional CSV export for quantitative analysis workflows.

**Parameters:**
- `image_description` (str): A description of what the model should expect to see
- `image_input` (list): List of image file paths or folder path containing images
- `features_to_extract` (list): List of specific features to extract (e.g., ["number of people", "primary color", "contains text"])
- `api_key` (str): API key for the LLM service
- `user_model` (str, default="gpt-4o"): Specific vision model to use
- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
- `to_csv` (bool, default=False): Whether to save the output to a CSV file
- `safety` (bool, default=False): Enable safety checks and save interim results after each API call
- `filename` (str, default="categorized_data.csv"): Filename for CSV output
- `save_directory` (str, optional): Directory path to save the CSV file
- `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Perplexity", "Mistral")

**Returns:**
- `pandas.DataFrame`: DataFrame with image paths and extracted feature values for each specified attribute

**Example:**

```
import catllm as cat

extracted_features = cat.image_features(
    image_description='An AI generated image of Spongebob dancing with Patrick',
    features_to_extract=['Spongebob is yellow', 'Both are smiling', 'Patrick is chunky'],
    image_input=['desktop/image1.jpg', 'desktop/image2.jpg', 'desktop/image3.jpg'],
    model_source='OpenAI',
    user_model="gpt-4o",
    creativity=0,
    safety=True,
    api_key="OPENAI_API_KEY")
```

### `build_web_research_dataset()`

Conducts automated web research on specified topics and compiles the findings into a structured dataset, extracting answers and source URLs for comprehensive research workflows.

NOTE: This function currently works only with Anthropic models and requires an Anthropic API key. It is strongly recommended to increase your API rate limits before using this function to avoid interruptions during web research tasks.

SECOND NOTE: This function works best when your search question is specific. For example, instead of search_question="Hottest temperature in 2024?", use "Hottest temperature in 2024 from extremeweatherwatch.com?" or "Hottest temperature in 2024 from weatherunderground.com?". Similarly, use "Where did these UC Berkeley professors get their PhDs according to LinkedIn?" instead of "Where did they get their PhDs according to LinkedIn?" to avoid matching people with the same name.

THIRD NOTE: This function works by scraping data from the web. Be aware that not all websites allow scraping by Anthropic, so the function will not be able to retrieve information from those sites.

**Methodology:**
Performs systematic web searches using the specified search questions and processes the results through Anthropic's language models to extract relevant information. The function handles multiple search queries sequentially, applying time delays between requests to respect rate limits.
Results are categorized according to user-defined criteria and can be exported to CSV format for further analysis and research documentation.

**Rate Limits:**
Before using this function, review and increase your Anthropic API rate limits at: https://console.anthropic.com/settings/limits. For general information about API rate limits, consult the Anthropic documentation at: https://docs.anthropic.com/claude/reference/rate-limits

**Parameters:**
- `search_question` (str): Primary research question or topic to guide the search strategy
- `search_input` (list): List of specific search queries or questions to investigate
- `features_to_extract` (list): List of data fields to extract for each search result (e.g., ["Answer", "URL", "Date"])
- `api_key` (str): API key for the Anthropic service
- `answer_format` (str, default="concise"): Response detail level ("concise", "detailed", "comprehensive")
- `additional_instructions` (str, optional): Additional guidance for the model when extracting answers
- `user_model` (str, default="claude-3-7-sonnet-20250219"): Specific Anthropic model to use
- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
- `safety` (bool, default=False): Enable safety checks and save interim results after each API call
- `filename` (str, default="categorized_data.csv"): Filename for CSV output
- `save_directory` (str, optional): Directory path to save the CSV file
- `model_source` (str, default="Anthropic"): Model provider (currently only "Anthropic" is supported)
- `time_delay` (int, default=15): Delay in seconds between search requests to manage API rate limits

**Returns:**
- `pandas.DataFrame`: DataFrame with extracted answers and source URLs for each search query

**Example:**

```
import catllm as cat

research_data = cat.build_web_research_dataset(
    search_question="What are the latest developments in renewable energy technology?",
    search_input=["solar panel efficiency 2025", "wind turbine innovations", "battery storage breakthroughs"],
    api_key="ANTHROPIC_API_KEY",
    answer_format="detailed",
    additional_instructions="Focus on recent technological advances and commercial applications",
    features_to_extract=['Answer', 'URL', 'Date', 'Key_Technology'],
    model_source="Anthropic",
    user_model="claude-3-7-sonnet-20250219",
    creativity=0.1,
    safety=True,
    time_delay=3
)
```

### `cerad_drawn_score()`

Automatically scores drawings of circles, diamonds, overlapping rectangles, and cubes according to the official Consortium to Establish a Registry for Alzheimer's Disease (CERAD) scoring system, returning structured results with optional CSV export. Works even with images that contain other drawings or writing.

**Methodology:**
Processes each image individually, evaluating the drawn shapes based on CERAD criteria. Supports optional inclusion of reference shapes within images and can provide reference examples if requested.
The function outputs standardized scores, facilitating reproducible analysis, and integrates optional safety checks and CSV export for research workflows.

**Parameters:**
- `shape` (str): The type of shape to score (e.g., "circle", "diamond", "overlapping rectangles", "cube")
- `image_input` (list): List of image file paths or folder path containing images
- `api_key` (str): API key for the LLM service
- `user_model` (str, default="gpt-4o"): Specific model to use
- `creativity` (float, default=0): Temperature/randomness setting (0.0-1.0)
- `reference_in_image` (bool, default=False): Whether a reference shape is present in the image for comparison
- `provide_reference` (bool, default=False): Whether to provide a built-in reference example image to the model
- `safety` (bool, default=False): Enable safety checks and save interim results after each API call
- `filename` (str, default="categorized_data.csv"): Filename for CSV output
- `model_source` (str, default="OpenAI"): Model provider ("OpenAI", "Anthropic", "Mistral")

**Returns:**
- `pandas.DataFrame`: DataFrame with image paths, CERAD scores, and analysis details

**Example:**

```
import catllm as cat

diamond_scores = cat.cerad_drawn_score(
    shape="diamond",
    image_input=df['diamond_pic_path'],
    api_key=open_ai_key,
    safety=True,
    filename="diamond_gpt_score.csv",
)
```

## Academic Research

This package implements methodology from research on LLM performance in social science applications, including the UC Berkeley Social Networks Study. The package addresses reproducibility challenges in LLM-assisted research by providing standardized interfaces and consistent output formatting.

If you use this package for research, please cite:

Soria, C. (2025). CatLLM (0.0.8). Zenodo. https://doi.org/10.5281/zenodo.15532317

## Contact

**Interested in research collaboration?** Email: [ChrisSoria@Berkeley.edu](mailto:ChrisSoria@Berkeley.edu)

## License

`cat-llm` is distributed under the terms of the [GNU GPL-3.0](https://www.gnu.org/licenses/gpl-3.0.en.html) license.
"bugtrack_url": null,
"license": null,
"summary": "A tool for categorizing text data and images using LLMs and vision models",
"version": "0.0.99",
"project_urls": {
"Documentation": "https://github.com/chrissoria/cat-llm#readme",
"Issues": "https://github.com/chrissoria/cat-llm/issues",
"Source": "https://github.com/chrissoria/cat-llm"
},
"split_keywords": [
"categorizer",
" image classification",
" llm",
" structured output",
" survey data",
" text classification"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "e27fb988ddfc06234e7db2001333c25a41cde0d11974ab19c1cfe3a211881900",
"md5": "cbd2d0cd68a09964b627ed21af5f290a",
"sha256": "eeb6bb7268b38e604a0293fbc2b78a6fd9234d431b0acce8792942ac26c4aba6"
},
"downloads": -1,
"filename": "cat_llm-0.0.99-py3-none-any.whl",
"has_sig": false,
"md5_digest": "cbd2d0cd68a09964b627ed21af5f290a",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 353873,
"upload_time": "2025-11-04T19:31:53",
"upload_time_iso_8601": "2025-11-04T19:31:53.347245Z",
"url": "https://files.pythonhosted.org/packages/e2/7f/b988ddfc06234e7db2001333c25a41cde0d11974ab19c1cfe3a211881900/cat_llm-0.0.99-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "f0db2a2e5b00f898f1d4bb903e2607985768e7cc6800da57194697772f764ccc",
"md5": "db8f251f6c8c5645b9107a5449db361c",
"sha256": "53a54b4e4a3a28c75211b1c1f33ef501e47fd0c6108e136aa6fe362a1e2e44f7"
},
"downloads": -1,
"filename": "cat_llm-0.0.99.tar.gz",
"has_sig": false,
"md5_digest": "db8f251f6c8c5645b9107a5449db361c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 347339,
"upload_time": "2025-11-04T19:31:55",
"upload_time_iso_8601": "2025-11-04T19:31:55.744642Z",
"url": "https://files.pythonhosted.org/packages/f0/db/2a2e5b00f898f1d4bb903e2607985768e7cc6800da57194697772f764ccc/cat_llm-0.0.99.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-11-04 19:31:55",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "chrissoria",
"github_project": "cat-llm#readme",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "cat-llm"
}