simars

Name	simars
Version	0.1.0
home_page	None
Summary	FastText-based response similarity analyzer for human raters
upload_time	2025-09-11 08:51:59
maintainer	None
docs_url	None
author	None
requires_python	>=3.10
license	MIT
keywords	education fasttext jamo korean morphology nlp rater similarity

# Simars - FastText-based similarity analysis of answers & responses for human raters

Simars is a toolkit for analyzing response similarity with FastText embeddings. It is designed to support human raters in educational assessment and other text analysis tasks.

## 🚀 Features

- **Text Preprocessing**: Comprehensive Korean and English text cleaning and tokenization
- **FastText Integration**: Training and fine-tuning of FastText models with Korean support
- **Dimensionality Reduction**: Support for UMAP, PCA, and t-SNE algorithms
- **Clustering**: HDBSCAN clustering for response grouping
- **Interactive Visualization**: Plotly-based interactive scatter plots with multiple visualization modes
- **Jamo Processing**: Advanced Korean text processing with Jamo decomposition
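
FastText represents each token as a bag of character n-grams, which is one reason decomposing Hangul syllables into Jamo can expose shared sub-syllable structure. The following is a minimal sketch of that n-gram extraction in plain Python. The `char_ngrams` helper and its 3-to-6 range are illustrative assumptions, not simars settings.

```python
def char_ngrams(token: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Return character n-grams of a token wrapped in FastText-style boundary markers."""
    wrapped = f"<{token}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # No n-grams are produced once n exceeds the wrapped token length.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# Hypothetical example: two related responses overlap in their n-gram sets.
shared = set(char_ngrams("흡수율")) & set(char_ngrams("흡수"))
print(sorted(shared))
```

Tokens that share n-grams end up with related vectors, so near-miss responses tend to land close to the answer they resemble.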

## 📦 Installation

```bash
pip install simars

# Install spaCy English model (required for text processing)
python -m spacy download en_core_web_sm
```

### Development Installation

```bash
git clone https://github.com/h000000nkim/simars.git
cd simars
pip install -e ".[dev]"
```

## 🔧 Dependencies

- **Core**: `gensim`, `numpy`, `pandas`, `scikit-learn`, `umap-learn`
- **NLP**: `jamo`, `pecab`, `spacy`
- **Visualization**: `plotly`
- **Clustering**: `hdbscan`
- **Development**: `pytest`, `ruff`, `mkdocs-material`

## 📖 Quick Start

### Basic Usage

```python
import simars
import numpy as np

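# Sample data
answers = np.array([["허무"], ["흡수율"], ["부사어"]])
# The response lists differ in length, so NumPy needs dtype=object to build the array
responses = np.array([
    ["허무", "공허", "무상", "허무감", "초월"],
    ["흡수율", "흡수", "반사율", "알베도"],
    ["부사어", "부사", "수식어", "부가어"]
], dtype=object)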
informations = np.array([
    "문학 문제에 대한 정서적 태도",
    "과학 제재의 핵심 개념",
    "문법 성분 분석"
])

# Initialize Simars
analyzer = simars.Fastrs(
    answers=answers,
    responses=responses,
    informations=informations
)

# Preprocess text data
analyzer.preprocess()

# Train FastText model
model = analyzer.train(
    vector_size=100,
    window=5,
    min_count=1,
    epochs=10
)

# Reduce dimensionality
coordinates = analyzer.reduce(method="umap", n_neighbors=5)

# Perform clustering
analyzer.hdbscanize()

# Visualize results
figures = analyzer.visualize()
for fig in figures:
    fig.show()
```

### Advanced Usage with Custom Data Structure

```python
import simars

# Using dictionary format
data = {
    "item1": {
        "answer": ["정답1"],
        "response": ["정답1", "오답1", "오답2"],
        "information": "문항 설명"
    },
    "item2": {
        "answer": ["정답2"],
        "response": ["정답2", "유사답", "오답"],
        "information": "다른 문항 설명"
    }
}

analyzer = simars.Fastrs(data=data)
analyzer.preprocess()

# Fine-tune existing model
pretrained_model = simars.util.get_pretrained_model()
analyzer.finetune(model=pretrained_model, epochs=5)

# Advanced reduction with custom parameters
coordinates = analyzer.reduce(
    method="umap",
    n_neighbors=15,
    min_dist=0.1,
    metric="cosine"
)
```
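
The dictionary format and the parallel `answers` / `responses` / `informations` arguments appear to carry the same information. The sketch below converts one into the other in plain Python. The `split_items` helper is hypothetical and is not part of simars.

```python
def split_items(data: dict[str, dict]) -> tuple[list[list[str]], list[list[str]], list[str]]:
    """Split an item dictionary into parallel answer, response, and information lists."""
    answers, responses, informations = [], [], []
    for item in data.values():
        answers.append(list(item["answer"]))
        responses.append(list(item["response"]))
        # Treat the description as optional (an assumption made for this sketch).
        informations.append(item.get("information", ""))
    return answers, responses, informations


data = {
    "item1": {"answer": ["정답1"], "response": ["정답1", "오답1"], "information": "문항 설명"},
    "item2": {"answer": ["정답2"], "response": ["정답2", "유사답", "오답"]},
}
answers, responses, informations = split_items(data)
print(answers, responses, informations)
```

Dictionaries preserve insertion order in Python 3.7 and later, so the three lists stay aligned item by item.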

## 🛠️ Core Components

### Fastrs Class

Main class for response similarity analysis:

- `preprocess()`: Clean, tokenize, and prepare text data
- `train()`: Train new FastText model from scratch
- `finetune()`: Fine-tune existing FastText model
- `reduce()`: Reduce embeddings dimensionality
- `hdbscanize()`: Perform clustering analysis
- `visualize()`: Create interactive visualizations
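
Cosine similarity is a common way to compare embedding vectors such as those a FastText model produces. The sketch below ranks responses by their similarity to an answer vector. The vectors are made-up toy values, and whether simars computes similarity this way internally is an assumption.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors standing in for real embeddings (made-up values).
answer_vec = [1.0, 0.0, 1.0]
response_vecs = {
    "response_a": [1.0, 0.0, 1.0],
    "response_b": [0.0, 1.0, 0.0],
    "response_c": [1.0, 1.0, 0.0],
}
ranked = sorted(
    response_vecs,
    key=lambda name: cosine_similarity(answer_vec, response_vecs[name]),
    reverse=True,
)
print(ranked)
```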

### Item Class

Individual item processor for detailed analysis:

- `clean()`: Text cleaning with customizable options
- `tokenize()`: Korean/English tokenization
- `jamoize()`: Korean Jamo decomposition
- `formatize()`: Prepare data for FastText training

### Preprocessing Module

Advanced text preprocessing functions:

- `clean()`: Multi-option text cleaning
- `tokenize()`: Morphological analysis with PeCab
- `jamoize()`: Korean character decomposition
- `formatize()`: Data formatting for training
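
Jamo decomposition splits each precomposed Hangul syllable into its lead consonant, vowel, and optional tail consonant. The sketch below does this with the standard Unicode arithmetic instead of the `jamo` package. The `decompose` helper is hypothetical and only illustrates the idea behind `jamoize()`, whose actual output format is not documented here.

```python
import unicodedata

HANGUL_BASE, HANGUL_LAST = 0xAC00, 0xD7A3
LEAD_BASE, VOWEL_BASE, TAIL_BASE = 0x1100, 0x1161, 0x11A7
VOWEL_COUNT, TAIL_COUNT = 21, 28


def decompose(text: str) -> str:
    """Replace each precomposed Hangul syllable with its conjoining Jamo."""
    out = []
    for ch in text:
        code = ord(ch)
        if HANGUL_BASE <= code <= HANGUL_LAST:
            lead, rest = divmod(code - HANGUL_BASE, VOWEL_COUNT * TAIL_COUNT)
            vowel, tail = divmod(rest, TAIL_COUNT)
            out.append(chr(LEAD_BASE + lead))
            out.append(chr(VOWEL_BASE + vowel))
            if tail:  # tail index 0 means the syllable has no final consonant
                out.append(chr(TAIL_BASE + tail))
        else:
            out.append(ch)  # leave non-Hangul characters untouched
    return "".join(out)


# Cross-check against Unicode canonical decomposition (NFD).
print(decompose("허무") == unicodedata.normalize("NFD", "허무"))
```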

### Visualization Module

Interactive plotting with Plotly:

- `scatter()`: Unified scatter plot function
- Multiple plot types: simple, value count, labeled, combined
- Customizable themes and color schemes

## 📊 Visualization Types

### Simple Scatter Plot
Basic 2D visualization highlighting answers vs responses.

### Value Count Scatter Plot
3D visualization showing response frequency on the z-axis.

### Labeled Scatter Plot
Color-coded visualization based on clustering results.

### Combined Scatter Plot
3D visualization combining clustering and frequency information.
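
The frequency shown on the z-axis is the number of times each distinct response occurs. The sketch below computes those counts with the standard library. The sample responses are made up, and how simars derives the counts internally is not shown here.

```python
from collections import Counter

# Made-up raw responses to a single item, including repeats.
raw_responses = ["허무", "공허", "허무", "무상", "허무", "공허"]

# Map each distinct response to its frequency.
counts = Counter(raw_responses)

# most_common() lists (response, count) pairs from most to least frequent.
for text, count in counts.most_common():
    print(text, count)
```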

## ⚙️ Configuration

Simars uses JSON configuration files for customization:

- `color_schemes.json`: Color themes for visualizations
- `plot_config.json`: Plot layout and styling options
- `reduction_defaults.json`: Default parameters for dimensionality reduction
- `fasttext_defaults.json`: Default FastText training parameters
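
A defaults file is typically combined with call-time arguments by letting explicit arguments override the stored values. The sketch below shows that merge pattern. The JSON keys, the `DEFAULTS_JSON` string, and the `resolve_params` helper are all assumptions for illustration, since the actual schema of these files is not documented here.

```python
import json

# Hypothetical contents of a defaults file; the real schema may differ.
DEFAULTS_JSON = '{"umap": {"n_neighbors": 15, "min_dist": 0.1, "metric": "cosine"}}'


def resolve_params(method: str, overrides: dict) -> dict:
    """Merge stored defaults for a method with caller overrides (overrides win)."""
    defaults = json.loads(DEFAULTS_JSON).get(method, {})
    return {**defaults, **overrides}


params = resolve_params("umap", {"n_neighbors": 5})
print(params)
```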

## 🧪 Testing

Run the test suite:

```bash
# All tests
pytest

# Unit tests only
pytest tests/unit/

# Integration tests only
pytest tests/integration/

# With coverage
pytest --cov=simars tests/
```

## 📚 Use Cases

- **Educational Assessment**: Analyze student response patterns
- **Content Analysis**: Group similar text responses
- **Quality Assurance**: Identify outlier responses for review
- **Research**: Study response similarity patterns in surveys
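
For the quality assurance use case, HDBSCAN marks points that belong to no cluster with the label `-1`, which gives a natural review queue. The sketch below filters responses by that label. The `flag_outliers` helper and the sample labels are hypothetical, and how simars exposes cluster labels after `hdbscanize()` is not shown here.

```python
NOISE_LABEL = -1  # HDBSCAN assigns -1 to points it treats as noise


def flag_outliers(responses: list[str], labels: list[int]) -> list[str]:
    """Return the responses whose cluster label marks them as noise."""
    if len(responses) != len(labels):
        raise ValueError("responses and labels must have the same length")
    return [text for text, label in zip(responses, labels) if label == NOISE_LABEL]


# Made-up responses and cluster labels for illustration.
responses = ["흡수율", "흡수", "알베도", "반사율", "반사"]
labels = [0, 0, -1, 1, 1]
print(flag_outliers(responses, labels))
```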

## 🤝 Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 👤 Author

**Hoon Kim** - [h000000nkim@gmail.com](mailto:h000000nkim@gmail.com)

## 🔗 Links

- [GitHub Repository](https://github.com/h000000nkim/simars)
- [Documentation](https://h000000nkim.github.io/simars/)
- [PyPI Package](https://pypi.org/project/simars/)

## 📈 Roadmap

- [ ] Support for additional languages
- [ ] Web interface for easy usage
- [ ] Additional clustering algorithms
- [ ] Export functionality for results
- [ ] Integration with popular ML frameworks
