# Simars - FastText-based similarity analysis of answers & responses for human raters
Simars is a comprehensive toolkit for analyzing response similarity using FastText embeddings, specifically designed to support human raters in educational assessment and text analysis tasks.
## 🚀 Features
- **Text Preprocessing**: Comprehensive Korean and English text cleaning and tokenization
- **FastText Integration**: Training and fine-tuning of FastText models with Korean support
- **Dimensionality Reduction**: Support for UMAP, PCA, and t-SNE algorithms
- **Clustering**: HDBSCAN clustering for response grouping
- **Interactive Visualization**: Plotly-based interactive scatter plots with multiple visualization modes
- **Jamo Processing**: Advanced Korean text processing with Jamo decomposition
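
FastText represents a word through its character n-grams, which is why related surface forms end up with related vectors. The block below is a simplified sketch of that idea, not Simars code: `char_ngrams` and `ngram_overlap` are hypothetical helpers, the 3 to 6 n-gram range mirrors the common FastText default, and the Jaccard overlap is only an illustration of shared subwords, not the similarity FastText itself computes.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Return the character n-grams of `word`, FastText style (simplified)."""
    wrapped = f"<{word}>"  # FastText marks word boundaries with < and >
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


def ngram_overlap(a: str, b: str) -> float:
    """Jaccard overlap of two words' n-gram sets (illustrative only)."""
    grams_a, grams_b = set(char_ngrams(a)), set(char_ngrams(b))
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# "허무" (futility) shares a subword with "허무감" (sense of futility)
# but none with "초월" (transcendence).
print(char_ngrams("허무", min_n=3, max_n=3))
print(ngram_overlap("허무", "허무감") > ngram_overlap("허무", "초월"))
```

Decomposing syllables into Jamo before this step lengthens each word, so even short Korean words yield more shared n-grams.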
## 📦 Installation
```bash
pip install simars
# Install spaCy English model (required for text processing)
python -m spacy download en_core_web_sm
```
### Development Installation
```bash
git clone https://github.com/h000000nkim/simars.git
cd simars
pip install -e ".[dev]"
```
## 🔧 Dependencies
- **Core**: `gensim`, `numpy`, `pandas`, `scikit-learn`, `umap-learn`
- **NLP**: `jamo`, `pecab`, `spacy`
- **Visualization**: `plotly`
- **Clustering**: `hdbscan`
- **Development**: `pytest`, `ruff`, `mkdocs-material`
## 📖 Quick Start
### Basic Usage
```python
import numpy as np

import simars

# Sample data: one correct answer per item (Korean, with English glosses)
answers = np.array([["허무"], ["흡수율"], ["부사어"]])  # futility, absorption rate, adverbial

# The response lists differ in length, so NumPy needs dtype=object
# to build this ragged array without raising an error.
responses = np.array(
    [
        ["허무", "공허", "무상", "허무감", "초월"],
        ["흡수율", "흡수", "반사율", "알베도"],
        ["부사어", "부사", "수식어", "부가어"],
    ],
    dtype=object,
)

# One description per item
informations = np.array([
    "문학 문제에 대한 정서적 태도",  # emotional attitude toward a literature question
    "과학 제재의 핵심 개념",  # core concept of a science passage
    "문법 성분 분석",  # analysis of grammatical constituents
])

# Initialize Simars
analyzer = simars.Fastrs(
    answers=answers,
    responses=responses,
    informations=informations,
)

# Preprocess text data
analyzer.preprocess()

# Train FastText model
model = analyzer.train(
    vector_size=100,
    window=5,
    min_count=1,
    epochs=10,
)

# Reduce dimensionality
coordinates = analyzer.reduce(method="umap", n_neighbors=5)

# Perform clustering
analyzer.hdbscanize()

# Visualize results
figures = analyzer.visualize()
for fig in figures:
    fig.show()
```
### Advanced Usage with Custom Data Structure
```python
import simars

# Using dictionary format: one entry per item
data = {
    "item1": {
        "answer": ["정답1"],  # correct answer 1
        "response": ["정답1", "오답1", "오답2"],  # correct answer 1, wrong answers 1 and 2
        "information": "문항 설명",  # item description
    },
    "item2": {
        "answer": ["정답2"],  # correct answer 2
        "response": ["정답2", "유사답", "오답"],  # correct answer 2, similar answer, wrong answer
        "information": "다른 문항 설명",  # another item description
    },
}

analyzer = simars.Fastrs(data=data)
analyzer.preprocess()

# Fine-tune existing model
pretrained_model = simars.util.get_pretrained_model()
analyzer.finetune(model=pretrained_model, epochs=5)

# Advanced reduction with custom parameters
coordinates = analyzer.reduce(
    method="umap",
    n_neighbors=15,
    min_dist=0.1,
    metric="cosine",
)
```
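
If your data already lives in the parallel `answers` / `responses` / `informations` form from Basic Usage, it can be reshaped into the dictionary form shown above. The following is a minimal sketch under stated assumptions: `to_item_dict` is a hypothetical helper that is not part of Simars, and the `item1`, `item2`, ... key naming simply copies the example above.

```python
def to_item_dict(answers, responses, informations, prefix="item"):
    """Build the {"itemN": {"answer", "response", "information"}} mapping."""
    if not (len(answers) == len(responses) == len(informations)):
        raise ValueError("answers, responses, and informations must have equal length")

    data = {}
    rows = zip(answers, responses, informations)
    for index, (answer, response, information) in enumerate(rows, start=1):
        data[f"{prefix}{index}"] = {
            "answer": list(answer),          # copy so the inputs are not shared
            "response": list(response),
            "information": str(information),
        }
    return data


data = to_item_dict(
    answers=[["정답1"], ["정답2"]],
    responses=[["정답1", "오답1", "오답2"], ["정답2", "유사답", "오답"]],
    informations=["문항 설명", "다른 문항 설명"],
)
print(list(data))
```

The length check fails early when the three inputs are misaligned, which is easier to debug than a mismatch discovered during preprocessing.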
## 🛠️ Core Components
### Fastrs Class
Main class for response similarity analysis:
- `preprocess()`: Clean, tokenize, and prepare text data
- `train()`: Train new FastText model from scratch
- `finetune()`: Fine-tune existing FastText model
- `reduce()`: Reduce embeddings dimensionality
- `hdbscanize()`: Perform clustering analysis
- `visualize()`: Create interactive visualizations
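
The pipeline above rests on comparing embedding vectors of answers and responses. As a rough illustration of that comparison, the sketch below ranks responses by cosine similarity to an answer. The vectors are toy values, `cosine_similarity` and `rank_by_similarity` are hypothetical helpers, and this README does not state which metric Simars uses internally beyond the `metric="cosine"` option passed to `reduce()`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_u * norm_v)


def rank_by_similarity(answer_vec, response_vecs):
    """Return (label, score) pairs sorted from most to least similar."""
    scored = [
        (label, cosine_similarity(answer_vec, vec))
        for label, vec in response_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-D vectors standing in for real FastText embeddings
answer = [1.0, 0.0]
candidates = {"near": [0.9, 0.1], "far": [0.0, 1.0], "same": [2.0, 0.0]}
ranked = rank_by_similarity(answer, candidates)
print([label for label, _ in ranked])
```

Cosine similarity ignores vector length, so `same` scores as high as an exact match even though its magnitude differs.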
### Item Class
Individual item processor for detailed analysis:
- `clean()`: Text cleaning with customizable options
- `tokenize()`: Korean/English tokenization
- `jamoize()`: Korean Jamo decomposition
- `formatize()`: Prepare data for FastText training
### Preprocessing Module
Advanced text preprocessing functions:
- `clean()`: Multi-option text cleaning
- `tokenize()`: Morphological analysis with PeCab
- `jamoize()`: Korean character decomposition
- `formatize()`: Data formatting for training
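
Jamo decomposition splits each precomposed Hangul syllable into its leading consonant, vowel, and optional trailing consonant. The sketch below shows the standard Unicode arithmetic for this; it is an illustration rather than the Simars implementation. `decompose` is a hypothetical name, and the output uses conjoining Jamo code points (U+1100 block), which may differ from what `jamoize()` returns.

```python
import unicodedata

HANGUL_BASE, HANGUL_LAST = 0xAC00, 0xD7A3        # precomposed syllable range
LEAD_BASE, VOWEL_BASE, TAIL_BASE = 0x1100, 0x1161, 0x11A7
VOWEL_COUNT, TAIL_COUNT = 21, 28                 # 28 includes "no trailing consonant"


def decompose(text: str) -> str:
    """Split Hangul syllables into conjoining Jamo; leave other characters alone."""
    out = []
    for ch in text:
        code = ord(ch)
        if HANGUL_BASE <= code <= HANGUL_LAST:
            index = code - HANGUL_BASE
            lead, rest = divmod(index, VOWEL_COUNT * TAIL_COUNT)
            vowel, tail = divmod(rest, TAIL_COUNT)
            out.append(chr(LEAD_BASE + lead))
            out.append(chr(VOWEL_BASE + vowel))
            if tail:  # tail index 0 means the syllable has no final consonant
                out.append(chr(TAIL_BASE + tail))
        else:
            out.append(ch)
    return "".join(out)


word = "허무감"
# Unicode NFD normalization applies the same algorithmic decomposition.
print(decompose(word) == unicodedata.normalize("NFD", word))
print(len(decompose("허무")), len(decompose("감")))
```

Because the mapping is pure arithmetic, no lookup table is needed, and non-Hangul text such as English tokens passes through unchanged.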
### Visualization Module
Interactive plotting with Plotly:
- `scatter()`: Unified scatter plot function
- Multiple plot types: simple, value count, labeled, combined
- Customizable themes and color schemes
## 📊 Visualization Types
### Simple Scatter Plot
Basic 2D visualization highlighting answers vs responses.
### Value Count Scatter Plot
3D visualization showing response frequency in the z-axis.
### Labeled Scatter Plot
Color-coded visualization based on clustering results.
### Combined Scatter Plot
3D visualization combining clustering and frequency information.
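
The 3D plot types place response frequency on the z-axis. One way to derive that value is to count how often each distinct response occurs and attach the count to its 2D coordinates, as sketched below. `value_counts`, `add_frequency_axis`, and the coordinate values are hypothetical and only illustrate the idea; they are not the Simars internals.

```python
from collections import Counter


def value_counts(responses):
    """Return (response, count) pairs in first-seen order."""
    counts = Counter(responses)
    return list(counts.items())


def add_frequency_axis(coords_2d, responses):
    """Attach each unique response's frequency as a z value."""
    points = []
    for text, count in value_counts(responses):
        x, y = coords_2d[text]
        points.append((text, x, y, count))
    return points


# Raw responses with repeats, plus made-up 2D coordinates per unique response
raw = ["허무", "공허", "허무", "허무", "무상"]
coords = {"허무": (0.0, 0.0), "공허": (0.4, 0.1), "무상": (0.9, 0.7)}
for point in add_frequency_axis(coords, raw):
    print(point)
```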
## ⚙️ Configuration
Simars uses JSON configuration files for customization:
- `color_schemes.json`: Color themes for visualizations
- `plot_config.json`: Plot layout and styling options
- `reduction_defaults.json`: Default parameters for dimensionality reduction
- `fasttext_defaults.json`: Default FastText training parameters
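
A typical pattern for JSON defaults like these is to load them and let caller-supplied arguments override individual keys. The sketch below shows that pattern only. The contents of the real configuration files are not documented here, so the keys in `DEFAULTS_JSON` are assumptions borrowed from the `reduce()` call in Advanced Usage, and `resolve_params` is a hypothetical helper.

```python
import json

# Assumed contents, for illustration only; not the actual reduction_defaults.json
DEFAULTS_JSON = """
{
    "method": "umap",
    "n_neighbors": 15,
    "min_dist": 0.1,
    "metric": "cosine"
}
"""


def resolve_params(defaults_text, overrides):
    """Merge user overrides on top of JSON defaults without mutating either."""
    params = json.loads(defaults_text)
    params.update(overrides)
    return params


params = resolve_params(DEFAULTS_JSON, {"n_neighbors": 5})
print(params)
```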
## 🧪 Testing
Run the test suite:
```bash
# All tests
pytest
# Unit tests only
pytest tests/unit/
# Integration tests only
pytest tests/integration/
# With coverage
pytest --cov=simars tests/
```
## 📚 Use Cases
- **Educational Assessment**: Analyze student response patterns
- **Content Analysis**: Group similar text responses
- **Quality Assurance**: Identify outlier responses for review
- **Research**: Study response similarity patterns in surveys
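
For the quality assurance case, HDBSCAN marks points it cannot assign to any cluster with the label `-1`, which makes them natural candidates for manual review. The sketch below separates such responses from clustered ones. The labels are written by hand for illustration, `split_outliers` is a hypothetical helper, and how Simars exposes cluster labels after `hdbscanize()` is not covered by this README.

```python
NOISE_LABEL = -1  # HDBSCAN's label for points that belong to no cluster


def split_outliers(responses, labels):
    """Group responses by cluster label and collect noise points separately."""
    if len(responses) != len(labels):
        raise ValueError("responses and labels must have equal length")

    clusters, outliers = {}, []
    for text, label in zip(responses, labels):
        if label == NOISE_LABEL:
            outliers.append(text)
        else:
            clusters.setdefault(label, []).append(text)
    return clusters, outliers


# Hand-written labels standing in for real clustering output
texts = ["흡수율", "흡수", "반사율", "알베도"]
labels = [0, 0, 1, -1]
clusters, outliers = split_outliers(texts, labels)
print(clusters)
print(outliers)
```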
## 🤝 Contributing
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 👤 Author
**Hoon Kim** - [h000000nkim@gmail.com](mailto:h000000nkim@gmail.com)
## 🔗 Links
- [GitHub Repository](https://github.com/h000000nkim/simars)
- [Documentation](https://h000000nkim.github.io/simars/)
- [PyPI Package](https://pypi.org/project/simars/)
## 📈 Roadmap
- [ ] Support for additional languages
- [ ] Web interface for easy usage
- [ ] Additional clustering algorithms
- [ ] Export functionality for results
- [ ] Integration with popular ML frameworks