lexi-align 0.3.0 (PyPI metadata)

- Summary: Word alignment between two languages using structured generation
- Requires Python: >=3.10
- Uploaded: 2024-11-11

# lexi-align

[![PyPI version](https://badge.fury.io/py/lexi-align.svg)](https://badge.fury.io/py/lexi-align)
[![CI](https://github.com/borh-lab/lexi-align/actions/workflows/ci.yaml/badge.svg)](https://github.com/borh-lab/lexi-align/actions/workflows/ci.yaml)

Word alignment of multilingual sentences using structured generation with Large Language Models.

## Installation

Install from PyPI:

```bash
pip install lexi-align
```

(or your favorite method)

The library is API-backend agnostic and only directly depends on [Pydantic](https://docs.pydantic.dev/latest/), so you will need to bring your own API code or use the provided [litellm](https://github.com/BerriAI/litellm) integration.

For LLM support via litellm (recommended), install with the optional dependency:

```bash
pip install lexi-align[litellm]
```

Using uv:

```bash
uv add lexi-align --extra litellm
```

For LLM support via Outlines (for local models), install with:

```bash
pip install lexi-align[outlines]
```

Using uv:

```bash
uv add lexi-align --extra outlines
```

For LLM support via llama.cpp (for local models), install with:

```bash
pip install lexi-align[llama]
```

Using uv:

```bash
uv add lexi-align --extra llama
```

## Usage

### Basic Usage

The library expects pre-tokenized input; it does not perform any tokenization. You must provide tokens as lists of strings:

```python
from lexi_align.adapters.litellm_adapter import LiteLLMAdapter
from lexi_align.core import align_tokens

# Initialize the LLM adapter
llm_adapter = LiteLLMAdapter(model_params={
    "model": "gpt-4o",
    "temperature": 0.0
})

# Provide pre-tokenized input with repeated tokens
source_tokens = ["the", "big", "cat", "saw", "the", "cat"]  # Note: "the" and "cat" appear twice
target_tokens = ["le", "gros", "chat", "a", "vu", "le", "chat"]

result = align_tokens(
    llm_adapter,
    source_tokens,
    target_tokens,
    source_language="English",
    target_language="French"
)

# Access the alignment result
if result.alignment:
    print("Successful alignment:")
    for align in result.alignment.alignment:
        print(f"{align.source_token} -> {align.target_token}")

# Example output will show the uniquified tokens:
# the₁ -> le₁
# big -> gros
# cat₁ -> chat₁
# saw -> a
# saw -> vu
# the₂ -> le₂
# cat₂ -> chat₂
```

### Batched Processing

**EXPERIMENTAL**

For processing multiple sequences efficiently using Outlines (which supports native batching):

```python
from lexi_align.adapters.outlines_adapter import OutlinesAdapter
from lexi_align.core import align_tokens_batched

# Initialize adapter with a local model
llm_adapter = OutlinesAdapter(
    model_name="Qwen/Qwen2.5-1.5B-Instruct",  # or any local model path
    dtype="bfloat16",  # optional: choose quantization
    device="cuda"      # optional: specify device
)

# Multiple sequences to align
source_sequences = [
    ["The", "cat", "sat"],
    ["I", "love", "coding"],
]
target_sequences = [
    ["Le", "chat", "assis"],
    ["J'", "aime", "coder"],
]

# Process in batches
results = align_tokens_batched(
    llm_adapter,
    source_sequences,
    target_sequences,
    source_language="English",
    target_language="French",
    batch_size=2  # Process 2 sequences at a time
)

# Each result contains alignment and diagnostic information
for result in results:
    if result.alignment:
        print(result.alignment.alignment)
    else:
        print("Failed attempts:", len(result.attempts))
```

### Async Processing

For asynchronous processing:

```python
import asyncio
from lexi_align.adapters.litellm_adapter import LiteLLMAdapter
from lexi_align.core import align_tokens_async

async def align_async():
    llm_adapter = LiteLLMAdapter(model_params={
        "model": "gpt-4o",
        "temperature": 0.0
    })

    source = ["The", "cat", "sat"]
    target = ["Le", "chat", "assis"]

    result = await align_tokens_async(
        llm_adapter,
        source,
        target,
        source_language="English",
        target_language="French"
    )

    return result

# Run async alignment
result = asyncio.run(align_async())
```

### Diagnostic Information

The alignment functions return an `AlignmentResult` object containing both the alignment and diagnostic information:

```python
result = align_tokens(
    llm_adapter,
    source_tokens,
    target_tokens,
    source_language="English",
    target_language="French"
)

# Access the alignment
if result.alignment:
    print("Successful alignment:", result.alignment.alignment)

# Access attempt history
for attempt in result.attempts:
    print(f"Attempt {attempt.attempt_number}:")
    print("Messages sent:", attempt.messages_sent)
    print("Validation passed:", attempt.validation_passed)
    if attempt.validation_errors:
        print("Validation errors:", attempt.validation_errors)
    if attempt.exception:
        print("Exception:", attempt.exception)
```

### Using Custom Guidelines and Examples

You can provide custom alignment guidelines and examples to improve alignment quality:

```python
from lexi_align.adapters.litellm_adapter import LiteLLMAdapter
from lexi_align.core import align_tokens
from lexi_align.models import TextAlignment, TokenAlignment

# Initialize adapter as before
llm_adapter = LiteLLMAdapter(model_params={
    "model": "gpt-4o",
    "temperature": 0.0
})

# Define custom guidelines
guidelines = """
1. Align content words (nouns, verbs, adjectives) first
2. Function words should be aligned when they have clear correspondences
3. Handle idiomatic expressions by aligning all components
4. One source token can align to multiple target tokens and vice versa
"""

# Provide examples to demonstrate desired alignments
examples = [
    (
        "The cat".split(),  # source tokens
        "Le chat".split(),  # target tokens
        TextAlignment(      # gold alignment
            alignment=[
                TokenAlignment(source_token="The", target_token="Le"),
                TokenAlignment(source_token="cat", target_token="chat"),
            ]
        )
    ),
    # Add more examples as needed
]

# Use guidelines and examples in alignment
alignment = align_tokens(
    llm_adapter,
    source_tokens,
    target_tokens,
    source_language="English",
    target_language="French",
    guidelines=guidelines,
    examples=examples
)
```

### Raw Message Control

For more control over the prompt, you can use `align_tokens_raw` to provide custom messages:

```python
from lexi_align.core import align_tokens_raw

custom_messages = [
    {"role": "system", "content": "You are an expert translator aligning English to French."},
    {"role": "user", "content": "Follow these guidelines:\n" + guidelines},
    # Add any other custom messages
]

alignment = align_tokens_raw(
    llm_adapter,
    source_tokens,
    target_tokens,
    custom_messages
)
```

### Token Uniquification

The library automatically handles repeated tokens by adding unique markers:

```python
from lexi_align.utils import make_unique, remove_unique

# Tokens with repeats
tokens = ["the", "cat", "the", "mat"]

# Add unique markers
unique_tokens = make_unique(tokens)
print(unique_tokens)  # ['the₁', 'cat', 'the₂', 'mat']

# Remove markers
original_tokens = remove_unique(unique_tokens)
print(original_tokens)  # ['the', 'cat', 'the', 'mat']
```

You can also customize the marker style:

```python
from lexi_align.text_processing import create_underscore_generator

# Use underscore markers instead of subscripts
marker_gen = create_underscore_generator()
unique_tokens = make_unique(tokens, marker_gen)
print(unique_tokens)  # ['the_1', 'cat', 'the_2', 'mat']
```
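Because uniquified tokens are guaranteed distinct within a sequence, they can be mapped back to positional indices unambiguously. As an illustration (the helper below, `to_index_pairs`, is hypothetical and not part of lexi-align's API), converting uniquified token pairs to Pharaoh-style index pairs might look like:

```python
def to_index_pairs(source, target, token_pairs):
    """Map (source_token, target_token) pairs over uniquified token lists
    to (source_index, target_index) pairs. Assumes every token is unique,
    which uniquification guarantees for repeated words."""
    src_idx = {tok: i for i, tok in enumerate(source)}
    tgt_idx = {tok: i for i, tok in enumerate(target)}
    return [(src_idx[s], tgt_idx[t]) for s, t in token_pairs]

pairs = to_index_pairs(
    ["the₁", "cat", "the₂", "mat"],
    ["le₁", "chat", "le₂", "tapis"],
    [("the₁", "le₁"), ("cat", "chat"), ("the₂", "le₂"), ("mat", "tapis")],
)
# → [(0, 0), (1, 1), (2, 2), (3, 3)]
```

Without uniquification, the repeated "the"/"le" tokens would collide in the index lookup.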

### Using Local Models with Outlines

For running local models, you can use the Outlines adapter:

```python
from lexi_align.adapters.outlines_adapter import OutlinesAdapter
from lexi_align.core import align_tokens

# Initialize the Outlines adapter with a local model
llm_adapter = OutlinesAdapter(
    model_name="Qwen/Qwen2.5-1.5B-Instruct",  # or any local model path
    dtype="bfloat16",  # optional: choose quantization
    device="cuda"      # optional: specify device
)

# Use the same API as with other adapters
alignment = align_tokens(
    llm_adapter,
    source_tokens,
    target_tokens,
    source_language="English",
    target_language="French"
)
```

### Using Local Models with llama.cpp

For running local models with llama.cpp:

```python
from lexi_align.adapters.llama_cpp_adapter import LlamaCppAdapter
from lexi_align.core import align_tokens

# Initialize the llama.cpp adapter with a local model
llm_adapter = LlamaCppAdapter(
    model_path="path/to/model.gguf",
    n_gpu_layers=-1,  # Use GPU acceleration
)

# Note that for some GGUF models the pre-tokenizer might fail,
# in which case you can specify the tokenizer_repo_id, which
# should point to the base model's repo_id on Huggingface.

# Use the same API as with other adapters
alignment = align_tokens(
    llm_adapter,
    source_tokens,
    target_tokens,
    source_language="English",
    target_language="French"
)
```

### Performance

Here are some preliminary results on the test EN-SL subset of XL-WA (using the older 0.1.0 version):

#### gpt-4o-2024-08-06 (1shot) (seed=42)

| Language Pair | Precision | Recall | F1 |
| --- | --- | --- | --- |
| EN-SL | 0.863 | 0.829 | 0.846 |
| **Average** | **0.863** | **0.829** | **0.846** |

#### claude-3-haiku-20240307 (1shot)

| Language Pair | Precision | Recall | F1 |
| --- | --- | --- | --- |
| EN-SL | 0.651 | 0.630 | 0.640 |
| **Average** | **0.651** | **0.630** | **0.640** |

#### meta-llama/Llama-3.2-3B-Instruct (1shot)

| Language Pair | Precision | Recall | F1 |
| --- | --- | --- | --- |
| EN-SL | 0.606 | 0.581 | 0.593 |
| **Average** | **0.606** | **0.581** | **0.593** |

For reference, the 1-shot (1 example) `gpt-4o-2024-08-06` results for EN-SL outperform all systems presented in the [paper](https://ceur-ws.org/Vol-3596/paper32.pdf) (Table 2).
Smaller LLMs perform below SOTA.
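The scores above are the standard set-based alignment metrics. As a reference point (this sketch is independent of the library's own evaluation code), they can be computed from gold and predicted index pairs:

```python
def alignment_prf(gold: set[tuple[int, int]], pred: set[tuple[int, int]]) -> tuple[float, float, float]:
    """Precision, recall and F1 over alignment index pairs."""
    tp = len(gold & pred)  # true positives: pairs in both sets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {(0, 0), (1, 1), (2, 2)}
pred = {(0, 0), (1, 1), (2, 3)}
p, r, f = alignment_prf(gold, pred)
# → p = r = f = 0.667 (2 of 3 predicted pairs correct, 2 of 3 gold pairs found)
```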

### Pharaoh Format Export

While the core alignment functions work with pre-tokenized input, the Pharaoh format utilities currently assume space-separated tokens when parsing/exporting. If your tokens contain spaces or require special tokenization, you'll need to handle this separately.

```python
from lexi_align.utils import export_pharaoh_format

# Note: Pharaoh format assumes space-separated tokens
# Default separator is tab
pharaoh_format = export_pharaoh_format(
    source_tokens,  # Pre-tokenized list of strings
    target_tokens,  # Pre-tokenized list of strings
    alignment
)

print(pharaoh_format)
# Output (will differ depending on chosen model):
# The cat sat on the mat    Le chat était assis sur le tapis    0-0 1-1 2-2 2-3 3-4 4-5 5-6

# Use custom separator
pharaoh_format = export_pharaoh_format(
    source_tokens,
    target_tokens,
    alignment,
    sep=" ||| "  # Custom separator
)

print(pharaoh_format)
# Output:
# The cat sat on the mat ||| Le chat était assis sur le tapis ||| 0-0 1-1 2-2 2-3 3-4 4-5 5-6
```

The Pharaoh format consists of three tab-separated fields:
1. Source sentence (space-separated tokens)
2. Target sentence (space-separated tokens)
3. Alignments as space-separated pairs of indices (source-target)
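Reading these three fields back out of a Pharaoh-format line can be sketched as follows (a minimal hypothetical helper, not part of lexi-align's API, assuming space-separated tokens as described above):

```python
def parse_pharaoh_line(line: str, sep: str = "\t"):
    """Split a Pharaoh-format line into source tokens, target tokens,
    and a list of (source_index, target_index) pairs."""
    src, tgt, pairs = line.split(sep)
    alignments = [tuple(map(int, p.split("-"))) for p in pairs.split()]
    return src.split(), tgt.split(), alignments

src, tgt, pairs = parse_pharaoh_line("The cat\tLe chat\t0-0 1-1")
# → (['The', 'cat'], ['Le', 'chat'], [(0, 0), (1, 1)])
```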

### Running Evaluations

The package includes scripts to evaluate alignment performance on the [XL-WA dataset](https://github.com/SapienzaNLP/XL-WA) (CC BY-NC-SA 4.0):

```bash
# Install dependencies
pip install lexi-align[litellm]

# Basic evaluation on a single language pair
python evaluations/xl-wa.py --lang-pairs EN-SL

# Evaluate on all language pairs
python evaluations/xl-wa.py --lang-pairs all

# Full evaluation with custom parameters
python evaluations/xl-wa.py \
    --lang-pairs EN-FR EN-DE \
    --model gpt-4o \
    --temperature 0.0 \
    --seed 42 \
    --num-train-examples 3 \
    --output results.json
```

Available command-line arguments:

- `--lang-pairs`: Language pairs to evaluate (e.g., EN-SL EN-DE) or "all"
- `--model`: LLM model to use (default: gpt-4o)
- `--temperature`: Temperature for LLM sampling (default: 0.0)
- `--seed`: Random seed for example selection (default: 42)
- `--model-seed`: Seed for LLM sampling (optional)
- `--num-train-examples`: Number of training examples for few-shot learning
- `--sample-size`: Number of test examples to evaluate per language pair
- `--output`: Path to save results JSON file
- `--verbose`: Enable verbose logging

## Changelog

### v0.3.0 (2024-11-11)
- Added support for batched processing with `align_tokens_batched`
- Added async support via `align_tokens_async`
- Added enhanced diagnostics and error reporting
- Added alignment visualization tools
- Added token-level analysis and metrics
- Added support for custom marker types (subscript/underscore)
- Added support for custom separators in Pharaoh format
- Improved retry logic and validation
- Added CI and evaluation scripts

### v0.2.x (2024-11-07)
- Added support for local models via Outlines and llama.cpp
- Added retries on errors or invalid alignments
- Added async completion support for litellm
- Added support for model weight quantization
- Added improved error messages and validation

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Citation

If you use this software in your research, please cite:

```bibtex
@software{lexi_align,
  title = {lexi-align: Word Alignment via Structured Generation},
  author = {Hodošček, Bor},
  year = {2024},
  url = {https://github.com/borh-lab/lexi-align}
}
```

## References

We use the XL-WA dataset ([repository](https://github.com/SapienzaNLP/XL-WA)) to perform evaluations:

```bibtex
@InProceedings{martelli-EtAl:2023:clicit,
  author    = {Martelli, Federico  and  Bejgu, Andrei Stefan  and  Campagnano, Cesare  and  Čibej, Jaka  and  Costa, Rute  and  Gantar, Apolonija  and  Kallas, Jelena  and  Koeva, Svetla  and  Koppel, Kristina  and  Krek, Simon  and  Langemets, Margit  and  Lipp, Veronika  and  Nimb, Sanni  and  Olsen, Sussi  and  Pedersen, Bolette Sandford  and  Quochi, Valeria  and  Salgado, Ana  and  Simon, László  and  Tiberius, Carole  and  Ureña-Ruiz, Rafael-J  and  Navigli, Roberto},
  title     = {XL-WA: a Gold Evaluation Benchmark for Word Alignment in 14 Language Pairs},
  booktitle      = {Proceedings of the Ninth Italian Conference on Computational Linguistics (CLiC-it 2023)},
  month          = {November},
  year           = {2023}
}
```

This code was spun out of the [hachidaishu-translation](https://github.com/borh/hachidaishu-translation) project, presented at [JADH2024](https://jadh2024.l.u-tokyo.ac.jp/).

## Development

Contributions are welcome! Please feel free to submit a Pull Request.

To set up the development environment:

```bash
git clone https://github.com/borh-lab/lexi-align.git
cd lexi-align
pip install -e ".[dev]"
```

Run tests:

```bash
pytest
```

            
