altmorph


Name: altmorph
Version: 0.1.0
Home page: https://github.com/yourusername/altmorph
Summary: Context-aware Norwegian morphological alternative generator
Upload time: 2025-10-09 12:51:55
Maintainer: None
Docs URL: None
Author: Pere
Requires Python: >=3.8
License: Apache License, Version 2.0 (http://www.apache.org/licenses/)
Keywords: morphology, norwegian, nlp, linguistics, alternatives, ordbank, pos, bert
Requirements: No requirements were recorded.
# AltMorph: Context-Aware Norwegian Morphological Alternative Generator

**AltMorph** is a tool for expanding Norwegian text by finding morphological alternatives for each word. It combines the Ordbank API with NLP techniques to provide alternatives that fit the surrounding context.

Outputs follow the AltMetrics bracket format, listing options as `[original|alt1|alt2]`.
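
To make the bracket format concrete, here is a minimal sketch of how a consumer might read it. The helper names `split_blocks` and `expand_all` are hypothetical and not part of AltMorph; the sketch also assumes that square brackets and `|` appear only as alternative markers, never as literal characters in the sentence.

```python
import re
from itertools import product
from typing import List

# Matches one bracketed group such as "[Katta|Katten]".
BLOCK_RE = re.compile(r"\[([^\[\]]*)\]")


def split_blocks(text: str) -> List[List[str]]:
    """Split text into segments, each a list of interchangeable options.

    Plain text becomes a one-element list; a bracket group becomes its options.
    """
    segments: List[List[str]] = []
    pos = 0
    for match in BLOCK_RE.finditer(text):
        if match.start() > pos:
            segments.append([text[pos:match.start()]])
        segments.append(match.group(1).split("|"))
        pos = match.end()
    if pos < len(text):
        segments.append([text[pos:]])
    return segments


def expand_all(text: str) -> List[str]:
    """Return every sentence obtained by picking one option per bracket group."""
    return ["".join(choice) for choice in product(*split_blocks(text))]
```

Calling `expand_all("[Katta|Katten] ligger på [matta|matten].")` yields the four combinations of the two alternative blocks, starting with the original sentence.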

## ✨ Features

- **🎯 Context-sensitive filtering**: Uses BERT-based acceptability scoring for ambiguous cases
- **📚 Lemma coverage**: Finds morphological forms across multiple lemmas
- **🔍 Position-specific analysis**: Looks at each word in its syntactic context  
- **⚡ Caching**: Persistent file-based caching to improve performance
- **🗣️ Multiple verbosity levels**: From silent operation to detailed pipeline insights
- **🌐 Language support**: Norwegian Bokmål (`nob`) and Nynorsk (`nno`)
- **🧠 POS-aware**: Uses NbAiLab BERT models for part-of-speech tagging
- **🚀 Parallel processing**: Runs concurrent API calls

## 🛠️ Installation

### Prerequisites
- Python 3.8+
- Ordbank API key (free registration at [Ordbank](https://www.ordbank.no/))

### Install from PyPI
```bash
pip install altmorph
```

### Install from Source (development)
```bash
pip install -e .
```

### Optional: Sync Development Requirements
```bash
pip install -r requirements.txt
```

### Get API Key
1. Register at [https://www.ordbank.no/](https://www.ordbank.no/)
2. Obtain your API key from your account dashboard
3. Set the environment variable:
   ```bash
   export ORDBANK_API_KEY="your_api_key_here"
   ```
   Or pass it directly with the `--api_key` flag.
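
The lookup order described above (explicit flag first, environment variable as fallback) can be sketched as follows. `resolve_api_key` is a hypothetical helper written for illustration, not a function exported by AltMorph.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(cli_value: Optional[str] = None,
                    environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the key passed on the command line, else the ORDBANK_API_KEY value."""
    env = os.environ if environ is None else environ
    key = (cli_value or env.get("ORDBANK_API_KEY", "")).strip()
    if not key:
        raise RuntimeError(
            "No Ordbank API key found: set ORDBANK_API_KEY or pass --api_key."
        )
    return key
```

Failing early with a clear message is a design choice of this sketch; it avoids sending unauthenticated requests that would only fail later.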

## 🚀 Quick Start

After installation you can invoke the CLI either with the `altmorph` command or via `python -m altmorph`.

### Basic Usage
```bash
python -m altmorph --sentence "Katta ligger på matta." --lang nob
```
**Output:**
```
[Katta|Katten] ligger på [matta|matten].
```

### With API Key
```bash
python -m altmorph \
  --sentence "Katta ligger på matta." \
  --lang nob \
  --api_key "your_api_key_here"
```
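
If you want to call the CLI from a Python script, one option is a thin `subprocess` wrapper. The sketch below uses only the flags documented in this README; `build_command` and `run_altmorph` are hypothetical names, and the wrapper assumes the result is written to standard output at the default verbosity.

```python
import subprocess
import sys
from typing import List, Optional


def build_command(sentence: str, lang: str = "nob",
                  api_key: Optional[str] = None) -> List[str]:
    """Assemble the documented CLI invocation as an argument list."""
    cmd = [sys.executable, "-m", "altmorph",
           "--sentence", sentence, "--lang", lang]
    if api_key:
        cmd += ["--api_key", api_key]
    return cmd


def run_altmorph(sentence: str, lang: str = "nob",
                 api_key: Optional[str] = None) -> str:
    """Run the CLI and return its standard output without the trailing newline.

    Raises subprocess.CalledProcessError if the command exits with an error.
    """
    result = subprocess.run(build_command(sentence, lang, api_key),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

Passing the arguments as a list, rather than a shell string, means sentences containing quotes or other shell metacharacters need no escaping.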

## 📖 Usage Examples

### Context-Sensitive Behaviour
The tool takes sentence context into account:

**Simple example:**
```bash
python -m altmorph --sentence "Katta ligger på matta." --lang nob
# Output: [Katta|Katten] ligger på [matta|matten].
# Shows different morphological forms for the same words
```

**Complex context:**
```bash
python -m altmorph --sentence "Katta ligger på matta i stua." --lang nob  
# Output: [Katta|Katten] ligger på [matta|matten] i stua.
# BERT-based filtering keeps alternatives that work in the sentence
```

### Position-Specific Analysis
```bash
python -m altmorph --sentence "Katta ligger på matta." --lang nob
# Each word occurrence is analyzed in its specific syntactic context
```

## 🎛️ Command Line Options

| Option | Default | Description |
|--------|---------|-------------|
| `--sentence` | *required* | Input sentence to process |
| `--lang` | `nob` | Language code (`nob` or `nno`) |
| `--api_key` | `$ORDBANK_API_KEY` | Ordbank API key |
| `--verbosity` | `0` | Verbosity level (0-3) |
| `--logit-threshold` | `3.0` | BERT acceptability threshold |
| `--timeout` | `6.0` | HTTP timeout per request, in seconds |
| `--max_workers` | `4` | Parallel API requests |
| `--no-cache` | `False` | Disable caching |
| `--delete-cache` | `False` | Clear cache and exit |
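
For readers who think in code, the table can be restated as an `argparse` declaration. This is a sketch reconstructed from the table alone, not the parser AltMorph actually ships; in particular, `--sentence` is left optional here because `--delete-cache` is documented as usable on its own.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options and defaults."""
    parser = argparse.ArgumentParser(prog="altmorph")
    parser.add_argument("--sentence", help="Input sentence to process")
    parser.add_argument("--lang", default="nob", choices=["nob", "nno"])
    # Falls back to the environment variable when the flag is absent.
    parser.add_argument("--api_key", default=os.environ.get("ORDBANK_API_KEY"))
    parser.add_argument("--verbosity", type=int, default=0, choices=range(4))
    parser.add_argument("--logit-threshold", type=float, default=3.0)
    parser.add_argument("--timeout", type=float, default=6.0)
    parser.add_argument("--max_workers", type=int, default=4)
    parser.add_argument("--no-cache", action="store_true")
    parser.add_argument("--delete-cache", action="store_true")
    return parser
```

Note that `argparse` exposes hyphenated flags with underscores, so `--logit-threshold` is read back as `args.logit_threshold`.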

## 🔊 Verbosity Levels

### Level 0: Quiet (Default)
```bash
python -m altmorph --sentence "Katta ligger på matta." --verbosity 0
```
**Output:** Just the final result
```
[Katta|Katten] ligger på [matta|matten].
```

### Level 1: Normal  
```bash
python -m altmorph --sentence "Katta ligger på matta." --verbosity 1
```
**Output:** Basic progress information
```
2025-XX-XX 12:00:00 INFO Loading POS tagger...
2025-XX-XX 12:00:02 INFO POS tagger loaded
[Katta|Katten] ligger på [matta|matten].
```

### Level 2: Verbose
```bash
python -m altmorph --sentence "Katta ligger på matta." --verbosity 2
```
**Output:** Processing details (POS tags, API lookups, alternatives found)
```
🎯 PROCESSING: Katta ligger på matta.
📝 WORDS: ['katta', 'ligger', 'på', 'matta']
🏷️ POS TAGS:
   katta: NOUN
   ligger: VERB
   på: ADP
   matta: NOUN
📡 API LOOKUP: katta (POS: NOUN)
   ✅ katta: 2 alternatives: ['katta', 'katten']
...
✨ RESULT: [Katta|Katten] ligger på [matta|matten].
```

### Level 3: Very Verbose
```bash
python -m altmorph --sentence "Katta ligger på matta." --verbosity 3
```
**Output:** Everything including cache operations, lemma analysis, BERT filtering
```
🎯 PROCESSING: Katta ligger på matta.
📝 FOUND 2 LEMMAS for katta
💾 CACHE HIT: lemmas for 'katta' (POS: NOUN)
🧠 ACCEPTABILITY FILTERING (threshold: 3.00)
🔍 ANALYZING: katta (position 0)
   Context: [Katta] ligger på matta.
   Alternatives: ['katta', 'katten']
📊 CACHE STATS: 8 hits, 0 misses (100.0% hit rate)
...
```
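
One way to implement four verbosity levels is to map them onto `logging` thresholds. The mapping below is an assumption made for illustration: the `TRACE` level, the logger name `altmorph_sketch`, and the choice of `WARNING` for quiet mode are not taken from AltMorph's source.

```python
import logging

TRACE = 5  # hypothetical level below DEBUG for the most detailed output
logging.addLevelName(TRACE, "TRACE")

# Assumed mapping from --verbosity values to logging thresholds.
VERBOSITY_TO_LEVEL = {
    0: logging.WARNING,  # quiet: only the final result is printed
    1: logging.INFO,     # basic progress messages
    2: logging.DEBUG,    # POS tags, API lookups, alternatives found
    3: TRACE,            # cache operations, lemma analysis, filtering
}


def configure_logging(verbosity: int) -> logging.Logger:
    """Return a logger whose threshold matches the requested verbosity."""
    clamped = min(max(verbosity, 0), 3)  # out-of-range values are clamped
    logger = logging.getLogger("altmorph_sketch")
    logger.setLevel(VERBOSITY_TO_LEVEL[clamped])
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()  # writes to stderr by default
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s",
                              datefmt="%Y-%m-%d %H:%M:%S")
        )
        logger.addHandler(handler)
    return logger
```

Sending log messages to stderr keeps them separate from the bracketed result, so the output can still be piped into other tools at higher verbosity.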

## 🗂️ Caching System

AltMorph includes caching to improve performance:

- **Cache location:** `~/.ordbank_cache/`
- **Cache types:** Lemma searches and inflection data
- **Performance:** Typically a 95%+ hit rate for repeated usage
- **Management:** 
  - `--no-cache`: Disable caching
  - `--delete-cache`: Clear all cache files

**Performance impact:**
- First run: ~3-4 seconds (API calls)
- Cached runs: ~0.5 seconds
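
The idea behind a persistent file cache can be sketched in a few lines. Everything here is an assumption for illustration: the class name `FileCache`, the one-JSON-file-per-key layout, and the hashed file names are not a description of what AltMorph writes to `~/.ordbank_cache/`.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Callable


class FileCache:
    """Tiny persistent cache: one JSON file per key, named by the key's SHA-256."""

    def __init__(self, directory: Path) -> None:
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.hits = 0
        self.misses = 0

    def _path(self, key: str) -> Path:
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.json"

    def get_or_fetch(self, key: str, fetch: Callable[[], Any]) -> Any:
        """Return the cached value for key, calling fetch() only on a miss."""
        path = self._path(key)
        if path.exists():
            self.hits += 1
            return json.loads(path.read_text(encoding="utf-8"))
        self.misses += 1
        value = fetch()
        path.write_text(json.dumps(value, ensure_ascii=False), encoding="utf-8")
        return value

    def clear(self) -> None:
        """Delete every cache file, similar in spirit to --delete-cache."""
        for path in self.directory.glob("*.json"):
            path.unlink()
```

Because the network call only happens on a miss, repeated runs over the same vocabulary skip the API entirely, which is consistent with the first-run versus cached-run timings listed above.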

## 🧠 Technical Details

### Code Architecture Deep-Dive
📖 **[Complete Code Walkthrough](docs/code_explanation.md)** - Detailed technical explanation of how AltMorph works for developers who need implementation details.

### Architecture
1. **Input Processing**: Tokenization preserving whitespace and punctuation
2. **POS Tagging**: NbAiLab/nb-bert-base-pos for accurate grammatical analysis
3. **Lemma Discovery**: Comprehensive search across all relevant Ordbank lemmas
4. **Inflection Analysis**: Full morphological paradigm extraction
5. **Acceptability Scoring**: NbAiLab/nb-bert-base for context-sensitive filtering
6. **Output Generation**: Case-preserving alternative presentation
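
Steps 1 and 6 of the pipeline above can be illustrated with a small sketch. The regular expression and the helper names `tokenize` and `match_case` are assumptions, not AltMorph's implementation; note that this simple pattern splits hyphenated words into separate tokens.

```python
import re
from typing import List

# Word runs (Unicode-aware, so "på" is one token), whitespace runs,
# or any single remaining character such as punctuation.
TOKEN_RE = re.compile(r"\w+|\s+|[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Split text into word, whitespace, and punctuation tokens, losing nothing."""
    return TOKEN_RE.findall(text)


def match_case(original: str, alternative: str) -> str:
    """Copy the capitalisation pattern of the original word onto an alternative."""
    if len(original) > 1 and original.isupper():
        return alternative.upper()
    if original[:1].isupper():
        return alternative[:1].upper() + alternative[1:]
    return alternative
```

Since every character lands in exactly one token, `"".join(tokenize(text))` reproduces the input, which is what allows alternatives to be spliced back in without disturbing spacing or punctuation.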

### Models Used
- **POS Tagging**: `NbAiLab/nb-bert-base-pos`
- **Acceptability**: `NbAiLab/nb-bert-base` 
- **API**: [Ordbank](https://www.ordbank.no/) - Norwegian morphological database

### Key Algorithms
- **Comprehensive lemma matching**: Finds all lemmas containing target word
- **Position-specific analysis**: Each word occurrence analyzed in context
- **Logit-based filtering**: Acceptability thresholding (default: 3.0)
- **Prioritization**: Balances morphological coverage with contextual fit
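
The README does not spell out how the logit threshold is applied. One plausible reading, sketched below purely as an assumption, is that an alternative is kept when its acceptability logit is no more than `threshold` below the logit of the original word. The function `filter_by_logit` and the scores in the usage note are made up for illustration.

```python
from typing import Dict, List


def filter_by_logit(original: str, scores: Dict[str, float],
                    threshold: float = 3.0) -> List[str]:
    """Keep the original plus alternatives scoring within `threshold` of it.

    `scores` maps each candidate form (including the original) to a logit
    that some acceptability model assigned to it in context.
    """
    baseline = scores[original]
    kept = [original]
    for form, logit in scores.items():
        if form != original and baseline - logit <= threshold:
            kept.append(form)
    return kept
```

With invented scores `{"katta": 9.1, "katten": 8.4, "katter": 2.0}`, the call `filter_by_logit("katta", ...)` keeps `katta` and `katten` and drops `katter`, whose score falls more than 3.0 below the original's. Under this reading, a larger threshold is more permissive.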

## 📊 Performance

### Typical Performance
- **Single sentence**: 0.5-4 seconds (depending on cache state)
- **Cache hit rate**: Typically 95%+ for repeated usage
- **API efficiency**: Parallel requests with batching
- **Memory usage**: ~500MB (loaded BERT models)

### Scaling Considerations
- **Concurrent requests**: Configurable via `--max_workers`
- **Timeout handling**: Robust error recovery with retries
- **Rate limiting**: Respectful API usage patterns
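
Concurrent lookups with retries can be sketched with the standard library alone. The `fetch` callable stands in for a real HTTP request, and the names `fetch_with_retries` and `lookup_all`, the retry count, and the linear backoff are all assumptions rather than AltMorph's actual behaviour.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def fetch_with_retries(fetch: Callable[[str], List[str]], word: str,
                       retries: int = 2, backoff: float = 0.0) -> List[str]:
    """Call fetch(word), retrying on any exception up to `retries` extra times."""
    for attempt in range(retries + 1):
        try:
            return fetch(word)
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            time.sleep(backoff * (attempt + 1))  # wait longer after each failure
    raise ValueError("retries must be non-negative")


def lookup_all(words: Iterable[str], fetch: Callable[[str], List[str]],
               max_workers: int = 4) -> Dict[str, List[str]]:
    """Look up each distinct word concurrently; return a word -> forms mapping."""
    unique = list(dict.fromkeys(words))  # de-duplicate while keeping order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda w: fetch_with_retries(fetch, w), unique)
        return dict(zip(unique, results))
```

De-duplicating before dispatch means a word that occurs several times in a sentence costs only one request, and `max_workers` bounds how many requests are in flight at once.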

## 🛠️ Tools

AltMorph includes additional helpers for batch processing and debugging:

- **[`corpus_tools/process_jsonl.py`](corpus_tools/process_jsonl.py)**: Batch-process JSONL files by adding morphological alternatives to text fields (resume-aware, batched).
- **[`corpus_tools/create_training_examples.py`](corpus_tools/create_training_examples.py)**: Sample one variant per alternative block to generate `unnorm` training strings.
- **[`corpus_tools/stream_ncc_text.py`](corpus_tools/stream_ncc_text.py)**: Stream Stortinget speeches from the NCC dataset on Hugging Face.
- **[`scripts/pos_tester.py`](scripts/pos_tester.py)**: Compare POS tagging across Norwegian NLP models.
- **[`scripts/hf_probe_fields.py`](scripts/hf_probe_fields.py)**: Inspect Hugging Face dataset metadata and stream example rows.

Browse [`corpus_tools/README.md`](corpus_tools/README.md) and [`scripts/README.md`](scripts/README.md) for more details.
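
To illustrate what "sample one variant per alternative block" means, here is a minimal sketch. It is not the code in `corpus_tools/create_training_examples.py`; the function name `sample_variant` and the uniform random choice are assumptions.

```python
import random
import re

# A block is "[a|b|...]" with at least two non-empty options.
BLOCK_RE = re.compile(r"\[([^\[\]|]+(?:\|[^\[\]|]+)+)\]")


def sample_variant(text: str, rng: random.Random) -> str:
    """Replace every [a|b|c] block with one uniformly chosen option."""
    return BLOCK_RE.sub(lambda m: rng.choice(m.group(1).split("|")), text)
```

Passing an explicit `random.Random(seed)` instance keeps the sampling reproducible across runs, which is useful when regenerating a training set.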

## 🔧 Development

### Project Structure
```
altmorph/
├── __init__.py                  # Main application / CLI
├── data/                        # Packaged lemma resources
├── corpus_tools/                # Corpus cleaning scripts and sample data
│   ├── process_jsonl.py         # JSONL batch processor
│   ├── create_training_examples.py
│   ├── stream_ncc_text.py
│   └── data/                    # Sample + placeholder corpora
├── docs/                        # Developer documentation
│   └── code_explanation.md
├── legacy/                      # Archived scripts kept for reference
├── scripts/                     # Standalone utilities (POS tester, HF helper)
├── README.md                    # Main documentation
├── pyproject.toml               # Packaging metadata
├── requirements.txt             # Dependencies
├── setup.py                     # Legacy packaging shim
└── ~/.ordbank_cache/            # Cache directory in the user's home, outside the repo (auto-created)
```

### Contributing
1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Ensure code follows existing style
5. Submit a pull request

### Testing
```bash
# Run the automated test suite
pytest

# Test basic functionality
python -m altmorph --sentence "Katta ligger på matta." --lang nob

# Test cache functionality  
python -m altmorph --delete-cache
python -m altmorph --sentence "Katta ligger på matta." --lang nob --verbosity 3

# Test without cache
python -m altmorph --sentence "Katta ligger på matta." --lang nob --no-cache

# Test POS comparison tool
python scripts/pos_tester.py --text "Katta ligger på matta."

# Test batch processing with sample data
python corpus_tools/process_jsonl.py \
  --input_file corpus_tools/data/samples/sample_input.jsonl \
  --output_file tmp/test_output.jsonl \
  --verbosity 2
```

## 🚢 Release Guide

Ready to publish? Follow the step-by-step instructions in [`docs/RELEASING.md`](docs/RELEASING.md) to build,
test, and upload the package (v0.1.0) to PyPI.

## 🤝 Related Projects

- **[altmetrics](https://github.com/peregilk/altmetrics)**: Depends on AltMorph's output format for Norwegian text evaluation. It lets you calculate WER, CER, BLEU, and chrF based on valid morphological alternatives.

## 📄 License

[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)


## 🙏 Acknowledgments

- **Ordbank Team**: For providing the comprehensive Norwegian morphological API
- **Clarino/UiB**: For hosting the API infrastructure
- **NbAiLab**: For the Norwegian BERT models
- **AltMorph**: Idea and coding by Magnus Breder Birkenes and Per Egil Kummervold

            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/yourusername/altmorph",
    "name": "altmorph",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "morphology, norwegian, nlp, linguistics, alternatives, ordbank, pos, bert",
    "author": "Pere",
    "author_email": "Pere <your.email@example.com>",
    "download_url": "https://files.pythonhosted.org/packages/18/90/54ea9f44938aa407a70a394a1832f6546f37704b3e139633c97969039ad1/altmorph-0.1.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "Apache License\n                                   Version 2.0, January 2004\n                                http://www.apache.org/licenses/\n        \n           TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION\n        \n           1. Definitions.\n        \n              \"License\" shall mean the terms and conditions for use, reproduction,\n              and distribution as defined by Sections 1 through 9 of this document.\n        \n              \"Licensor\" shall mean the copyright owner or entity authorized by\n              the copyright owner that is granting the License.\n        \n              \"Legal Entity\" shall mean the union of the acting entity and all\n              other entities that control, are controlled by, or are under common\n              control with that entity. For the purposes of this definition,\n              \"control\" means (i) the power, direct or indirect, to cause the\n              direction or management of such entity, whether by contract or\n              otherwise, or (ii) ownership of fifty percent (50%) or more of the\n              outstanding shares, or (iii) beneficial ownership of such entity.\n        \n              \"You\" (or \"Your\") shall mean an individual or Legal Entity\n              exercising permissions granted by this License.\n        \n              \"Source\" form shall mean the preferred form for making modifications,\n              including but not limited to software source code, documentation\n              source, and configuration files.\n        \n              \"Object\" form shall mean any form resulting from mechanical\n              transformation or translation of a Source form, including but\n              not limited to compiled object code, generated documentation,\n              and conversions to other media types.\n        \n              \"Work\" shall mean the work of authorship, whether in Source or\n              Object form, made available 
under the License, as indicated by a\n              copyright notice that is included in or attached to the work\n              (an example is provided in the Appendix below).\n        \n              \"Derivative Works\" shall mean any work, whether in Source or Object\n              form, that is based on (or derived from) the Work and for which the\n              editorial revisions, annotations, elaborations, or other modifications\n              represent, as a whole, an original work of authorship. For the purposes\n              of this License, Derivative Works shall not include works that remain\n              separable from, or merely link (or bind by name) to the interfaces of,\n              the Work and Derivative Works thereof.\n        \n              \"Contribution\" shall mean any work of authorship, including\n              the original version of the Work and any modifications or additions\n              to that Work or Derivative Works thereof, that is intentionally\n              submitted to Licensor for inclusion in the Work by the copyright owner\n              or by an individual or Legal Entity authorized to submit on behalf of\n              the copyright owner. 
For the purposes of this definition, \"submitted\"\n              means any form of electronic, verbal, or written communication sent\n              to the Licensor or its representatives, including but not limited to\n              communication on electronic mailing lists, source code control systems,\n              and issue tracking systems that are managed by, or on behalf of, the\n              Licensor for the purpose of discussing and improving the Work, but\n              excluding communication that is conspicuously marked or otherwise\n              designated in writing by the copyright owner as \"Not a Contribution.\"\n        \n              \"Contributor\" shall mean Licensor and any individual or Legal Entity\n              on behalf of whom a Contribution has been received by Licensor and\n              subsequently incorporated within the Work.\n        \n           2. Grant of Copyright License. Subject to the terms and conditions of\n              this License, each Contributor hereby grants to You a perpetual,\n              worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n              copyright license to reproduce, prepare Derivative Works of,\n              publicly display, publicly perform, sublicense, and distribute the\n              Work and such Derivative Works in Source or Object form.\n        \n           3. Grant of Patent License. 
Subject to the terms and conditions of\n              this License, each Contributor hereby grants to You a perpetual,\n              worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n              (except as stated in this section) patent license to make, have made,\n              use, offer to sell, sell, import, and otherwise transfer the Work,\n              where such license applies only to those patent claims licensable\n              by such Contributor that are necessarily infringed by their\n              Contribution(s) alone or by combination of their Contribution(s)\n              with the Work to which such Contribution(s) was submitted. If You\n              institute patent litigation against any entity (including a\n              cross-claim or counterclaim in a lawsuit) alleging that the Work\n              or a Contribution incorporated within the Work constitutes direct\n              or contributory patent infringement, then any patent licenses\n              granted to You under this License for that Work shall terminate\n              as of the date such litigation is filed.\n        \n           4. Redistribution. 
You may reproduce and distribute copies of the\n              Work or Derivative Works thereof in any medium, with or without\n              modifications, and in Source or Object form, provided that You\n              meet the following conditions:\n        \n              (a) You must give any other recipients of the Work or\n                  Derivative Works a copy of this License; and\n        \n              (b) You must cause any modified files to carry prominent notices\n                  stating that You changed the files; and\n        \n              (c) You must retain, in the Source form of any Derivative Works\n                  that You distribute, all copyright, patent, trademark, and\n                  attribution notices from the Source form of the Work,\n                  excluding those notices that do not pertain to any part of\n                  the Derivative Works; and\n        \n              (d) If the Work includes a \"NOTICE\" text file as part of its\n                  distribution, then any Derivative Works that You distribute must\n                  include a readable copy of the attribution notices contained\n                  within such NOTICE file, excluding those notices that do not\n                  pertain to any part of the Derivative Works, in at least one\n                  of the following places: within a NOTICE text file distributed\n                  as part of the Derivative Works; within the Source form or\n                  documentation, if provided along with the Derivative Works; or,\n                  within a display generated by the Derivative Works, if and\n                  wherever such third-party notices normally appear. The contents\n                  of the NOTICE file are for informational purposes only and\n                  do not modify the License. 
You may add Your own attribution\n                  notices within Derivative Works that You distribute, alongside\n                  or as an addendum to the NOTICE text from the Work, provided\n                  that such additional attribution notices cannot be construed\n                  as modifying the License.\n        \n              You may add Your own copyright statement to Your modifications and\n              may provide additional or different license terms and conditions\n              for use, reproduction, or distribution of Your modifications, or\n              for any such Derivative Works as a whole, provided Your use,\n              reproduction, and distribution of the Work otherwise complies with\n              the conditions stated in this License.\n        \n           5. Submission of Contributions. Unless You explicitly state otherwise,\n              any Contribution intentionally submitted for inclusion in the Work\n              by You to the Licensor shall be under the terms and conditions of\n              this License, without any additional terms or conditions.\n              Notwithstanding the above, nothing herein shall supersede or modify\n              the terms of any separate license agreement you may have executed\n              with Licensor regarding such Contributions.\n        \n           6. Trademarks. This License does not grant permission to use the trade\n              names, trademarks, service marks, or product names of the Licensor,\n              except as required for reasonable and customary use in describing the\n              origin of the Work and reproducing the content of the NOTICE file.\n        \n           7. Disclaimer of Warranty. 
Unless required by applicable law or\n              agreed to in writing, Licensor provides the Work (and each\n              Contributor provides its Contributions) on an \"AS IS\" BASIS,\n              WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or\n              implied, including, without limitation, any warranties or conditions\n              of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A\n              PARTICULAR PURPOSE. You are solely responsible for determining the\n              appropriateness of using or redistributing the Work and assume any\n              risks associated with Your exercise of permissions under this License.\n        \n           8. Limitation of Liability. In no event and under no legal theory,\n              whether in tort (including negligence), contract, or otherwise,\n              unless required by applicable law (such as deliberate and grossly\n              negligent acts) or agreed to in writing, shall any Contributor be\n              liable to You for damages, including any direct, indirect, special,\n              incidental, or consequential damages of any character arising as a\n              result of this License or out of the use or inability to use the\n              Work (including but not limited to damages for loss of goodwill,\n              work stoppage, computer failure or malfunction, or any and all\n              other commercial damages or losses), even if such Contributor\n              has been advised of the possibility of such damages.\n        \n           9. Accepting Warranty or Additional Liability. While redistributing\n              the Work or Derivative Works thereof, You may choose to offer,\n              and charge a fee for, acceptance of support, warranty, indemnity,\n              or other liability obligations and/or rights consistent with this\n              License. 
However, in accepting such obligations, You may act only\n              on Your own behalf and on Your sole responsibility, not on behalf\n              of any other Contributor, and only if You agree to indemnify,\n              defend, and hold each Contributor harmless for any liability\n              incurred by, or claims asserted against, such Contributor by reason\n              of your accepting any such warranty or additional liability.\n        \n           END OF TERMS AND CONDITIONS\n        \n           APPENDIX: How to apply the Apache License to your work.\n        \n              To apply the Apache License to your work, attach the following\n              boilerplate notice, with the fields enclosed by brackets \"[]\"\n              replaced with your own identifying information. (Don't include\n              the brackets!)  The text should be enclosed in the appropriate\n              comment syntax for the file format. We also recommend that a\n              file or class name and description of purpose be included on the\n              same \"printed page\" as the copyright notice for easier\n              identification within third-party archives.\n        \n           Copyright [yyyy] [name of copyright owner]\n        \n           Licensed under the Apache License, Version 2.0 (the \"License\");\n           you may not use this file except in compliance with the License.\n           You may obtain a copy of the License at\n        \n               http://www.apache.org/licenses/LICENSE-2.0\n        \n           Unless required by applicable law or agreed to in writing, software\n           distributed under the License is distributed on an \"AS IS\" BASIS,\n           WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n           See the License for the specific language governing permissions and\n           limitations under the License.\n        ",
    "summary": "Context-aware Norwegian morphological alternative generator",
    "version": "0.1.0",
    "project_urls": {
        "Bug Reports": "https://github.com/yourusername/altmorph/issues",
        "Homepage": "https://github.com/yourusername/altmorph",
        "Source": "https://github.com/yourusername/altmorph"
    },
    "split_keywords": [
        "morphology",
        " norwegian",
        " nlp",
        " linguistics",
        " alternatives",
        " ordbank",
        " pos",
        " bert"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "902e55c8a6272242180a84b913c75b8fcddeb7bc98c7bd994570d41cef69920b",
                "md5": "6d80273aa94b8e742aeac8bc3a19bc14",
                "sha256": "685cba554b5ba0e2625859eea89fa211bea2d1cbca601cfbf8bc158188fc0209"
            },
            "downloads": -1,
            "filename": "altmorph-0.1.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "6d80273aa94b8e742aeac8bc3a19bc14",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 500615,
            "upload_time": "2025-10-09T12:51:53",
            "upload_time_iso_8601": "2025-10-09T12:51:53.500489Z",
            "url": "https://files.pythonhosted.org/packages/90/2e/55c8a6272242180a84b913c75b8fcddeb7bc98c7bd994570d41cef69920b/altmorph-0.1.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "189054ea9f44938aa407a70a394a1832f6546f37704b3e139633c97969039ad1",
                "md5": "f1eaeac58fd9c76af16ecaa466390b40",
                "sha256": "c771ae3c0b483f34709ec5ddd2b0d2015e2064c1e4b357ccfb8d7695ab231397"
            },
            "downloads": -1,
            "filename": "altmorph-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "f1eaeac58fd9c76af16ecaa466390b40",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 553012,
            "upload_time": "2025-10-09T12:51:55",
            "upload_time_iso_8601": "2025-10-09T12:51:55.255423Z",
            "url": "https://files.pythonhosted.org/packages/18/90/54ea9f44938aa407a70a394a1832f6546f37704b3e139633c97969039ad1/altmorph-0.1.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-09 12:51:55",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "yourusername",
    "github_project": "altmorph",
    "github_not_found": true,
    "lcname": "altmorph"
}
        