slice-score

Name	slice-score
Version	1.0.1
home_page	None
Summary	Schema Lineage Composite Evaluation - A Python package for evaluating schema lineage extraction accuracy
upload_time	2025-08-17 00:29:58
maintainer	None
docs_url	None
author	None
requires_python	>=3.9
license	MIT License Copyright (c) Microsoft Corporation. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE
keywords	schema lineage evaluation data-pipeline nlp ast
requirements	No requirements were recorded.

# SLiCE: Schema Lineage Composite Evaluation

[![PyPI version](https://img.shields.io/pypi/v/slice-score.svg)](https://pypi.org/project/slice-score/)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Paper: ArXiv](https://img.shields.io/badge/Paper-ArXiv-red.svg)](https://arxiv.org/abs/2508.07179)

SLiCE is a Python package for evaluating schema lineage extraction accuracy by comparing model predictions against gold-standard annotations. It scores each lineage component separately and combines the results into a single composite score for data pipeline analysis.

## Features

- **Component-wise Evaluation**: Separate scoring for source schema, source tables, transformations, and aggregations
- **Multiple Similarity Metrics**: BLEU scores, fuzzy matching, F1 scores, and AST-based similarity
- **Flexible Weighting**: Customizable weights for different components and metrics
- **Multi-language Support**: Handles Python, SQL, and C# code in transformations
- **Sample Data Module**: Built-in access to curated datasets for testing and demonstration
- **Batch Processing**: Parallel evaluation of multiple lineage pairs
- **Command Line Interface**: Easy-to-use CLI for quick evaluations

## Installation

### From PyPI (recommended)

```bash
pip install slice-score
```

### From Source

```bash
git clone https://github.com/microsoft/SLiCE.git
cd SLiCE
pip install -e .
```

### Development Installation

For development with all testing and linting tools:

```bash
git clone https://github.com/microsoft/SLiCE.git
cd SLiCE

# Using pip
pip install -e ".[dev]"

# Using uv (recommended - faster)
uv sync --extra dev
```

## Quick Start

### Python API

```python
from slice import SchemaLineageEvaluator

# Initialize evaluator
evaluator = SchemaLineageEvaluator()

# Example lineage data
predicted = {
    "source_schema": "cuisine_type",
    "source_table": "restaurants.ss",
    "transformation": "R.cuisine_type AS CuisineType", 
    "aggregation": "COUNT() GROUP BY restaurant_id"
}

ground_truth = {
    "source_schema": "cuisine_type",
    "source_table": "restaurants.ss", 
    "transformation": "R.cuisine_type AS CuisineType",
    "aggregation": ""
}

# Evaluate
results = evaluator.evaluate(predicted, ground_truth)
print(f"Overall Score: {results['overall']:.4f}")
```

### Command Line Interface

```bash
# Basic evaluation
slice-eval predicted.json ground_truth.json

# With custom weights
slice-eval --weights source_table=0.5,transformation=0.3,aggregation=0.2 predicted.json ground_truth.json

# Include metadata evaluation
slice-eval --metadata predicted.json ground_truth.json

# Save results to file
slice-eval predicted.json ground_truth.json --output results.txt
```

## Data Format

SLiCE expects lineage data as dictionaries with the following structure:

```json
{
    "source_schema": "column_name",
    "source_table": "table_references",
    "transformation": "transformation_logic",
    "aggregation": "aggregation_operations",
    "metadata": "additional_metadata (optional)"
}
```
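
As a sketch, the helper below checks that a record carries the four core keys before it is serialized for the CLI. `REQUIRED_KEYS` and `validate_lineage` are hypothetical names used for illustration and are not part of the SLiCE API. Treating `metadata` as the only optional key is an assumption drawn from the structure above.

```python
import json

# Assumption: the four core keys are required and "metadata" is optional,
# as suggested by the structure above. These names are illustrative only.
REQUIRED_KEYS = ("source_schema", "source_table", "transformation", "aggregation")


def validate_lineage(record: dict) -> list:
    """Return a list of problems found in a lineage record (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in record:
            problems.append(f"missing key: {key}")
        elif not isinstance(record[key], str):
            problems.append(f"{key} should be a string")
    return problems


record = {
    "source_schema": "cuisine_type",
    "source_table": "restaurants.ss",
    "transformation": "R.cuisine_type AS CuisineType",
    "aggregation": "",
}

print(validate_lineage(record))       # an empty list means no problems found
print(json.dumps(record, indent=4))   # JSON text that could be saved to a file
```

An empty string, as in `"aggregation": ""` above, is still a valid value: it states that the lineage has no aggregation.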

## Evaluation Metrics

### Component Scores

- **Source Schema**: Exact match of schema/column names
- **Source Table**: F1 score + fuzzy matching of table references  
- **Transformation**: BLEU + weighted BLEU + AST similarity
- **Aggregation**: BLEU + weighted BLEU + AST similarity
- **Metadata**: BLEU + weighted BLEU + AST similarity (optional)
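
To make the source-table metrics concrete, here is a minimal sketch of a set-based F1 score and a fuzzy string ratio. It assumes table references are comma-separated and uses the standard library's `difflib` as a stand-in matcher. SLiCE's actual tokenization, its fuzzy matcher, and the way it combines the two metrics are not specified in this README, so `table_f1` and `fuzzy_ratio` are illustrative names only.

```python
from difflib import SequenceMatcher


def table_f1(predicted: str, gold: str) -> float:
    """Set-based F1 over comma-separated table references (assumed format)."""
    pred = {t.strip().lower() for t in predicted.split(",") if t.strip()}
    ref = {t.strip().lower() for t in gold.split(",") if t.strip()}
    if not pred and not ref:
        return 1.0  # both empty: treat as a perfect match
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def fuzzy_ratio(predicted: str, gold: str) -> float:
    """Character-level similarity in [0, 1] computed by difflib."""
    return SequenceMatcher(None, predicted.lower(), gold.lower()).ratio()


# One extra predicted table: precision 0.5, recall 1.0
print(round(table_f1("restaurants.ss, reviews.ss", "restaurants.ss"), 4))  # 0.6667
print(fuzzy_ratio("restaurants.ss", "restaurants.ss"))  # 1.0
```

F1 rewards getting the right set of tables, while the fuzzy ratio gives partial credit for near-miss names such as a differing suffix.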

### Overall Score

The final score combines component scores using configurable weights:

```
Overall = format_correctness × source_schema × (
    w₁ × source_table_score + 
    w₂ × transformation_score + 
    w₃ × aggregation_score +
    w₄ × metadata_score  # if applicable
)
```

Default weights: `source_table=0.4, transformation=0.4, aggregation=0.2`
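
The formula can be worked through with a few lines of Python. This is a sketch of the arithmetic only, not the package's implementation: `overall_score` is a hypothetical helper, the component scores are made-up inputs, and it assumes the exact-match source schema score is either 0 or 1.

```python
DEFAULT_WEIGHTS = {"source_table": 0.4, "transformation": 0.4, "aggregation": 0.2}


def overall_score(format_correctness, source_schema, components, weights=DEFAULT_WEIGHTS):
    """Gate the weighted sum of component scores by format and schema match."""
    weighted_sum = sum(weights[name] * components[name] for name in weights)
    return format_correctness * source_schema * weighted_sum


# Made-up component scores for illustration
components = {"source_table": 1.0, "transformation": 0.8, "aggregation": 0.5}

# 0.4 * 1.0 + 0.4 * 0.8 + 0.2 * 0.5 = 0.82
print(round(overall_score(1.0, 1.0, components), 4))  # 0.82

# A source schema mismatch zeroes the whole score
print(overall_score(1.0, 0.0, components))  # 0.0
```

Because `format_correctness` and `source_schema` multiply the weighted sum, a malformed prediction or a wrong source column cannot be compensated by good scores on the other components.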

## Configuration

### Custom Weights

```python
from slice import SchemaLineageEvaluator

# Component weights
weights = {
    'source_table': 0.5,
    'transformation': 0.3, 
    'aggregation': 0.2
}

# Metric weights for transformations
transformation_weights = {
    'bleu': 0.6,
    'weighted_bleu': 0.3,
    'ast': 0.1
}

evaluator = SchemaLineageEvaluator(
    weights=weights,
    transformation_weights=transformation_weights
)
```
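
To show what the metric weights express, the sketch below combines three metric scores into one transformation score. It assumes the weights are applied as a normalized linear combination, which this README does not confirm. `combine_metrics` is a hypothetical helper and the metric scores are made-up inputs.

```python
transformation_weights = {"bleu": 0.6, "weighted_bleu": 0.3, "ast": 0.1}


def combine_metrics(metric_scores, metric_weights):
    """Weighted average of metric scores (assumed linear combination)."""
    total = sum(metric_weights.values())
    if total <= 0:
        raise ValueError("metric weights must sum to a positive value")
    weighted = sum(metric_weights[name] * metric_scores[name] for name in metric_weights)
    return weighted / total


# Made-up metric scores for illustration
metric_scores = {"bleu": 0.5, "weighted_bleu": 0.7, "ast": 1.0}

# 0.6 * 0.5 + 0.3 * 0.7 + 0.1 * 1.0 = 0.61
print(round(combine_metrics(metric_scores, transformation_weights), 4))  # 0.61
```

Raising the `ast` weight would favor structural similarity over token overlap, which may suit transformations that are written differently but parse to the same tree.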

### Language Support

```python
from slice import SchemaLineageEvaluator

# Custom syntax keyword sets for each supported language
evaluator = SchemaLineageEvaluator(
    sql_syntax={'SELECT', 'FROM', 'WHERE'},
    python_syntax={'def', 'class', 'import'},
    csharp_syntax={'using', 'namespace', 'class'}
)
```

## Examples

See the `examples/` directory for complete usage examples:

- `basic_usage.py`: Basic evaluation with default settings
- `custom_weights.py`: Using custom weights and configurations
- `batch_evaluation.py`: Processing multiple lineage pairs
- `sample_data_usage.py`: Using package sample data for evaluation
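
The example scripts are not reproduced here, so the following is only a rough sketch of what evaluating many lineage pairs could look like. `evaluate_batch` and `exact_match` are hypothetical names. The stand-in scorer lets the snippet run without SLiCE installed and does not reflect the package's own batch API.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_batch(evaluate_fn, pairs, max_workers=4):
    """Apply evaluate_fn(predicted, ground_truth) to each pair, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda pair: evaluate_fn(*pair), pairs))


# Stand-in scorer so the sketch is self-contained; in practice evaluate_fn
# could be evaluator.evaluate from the Quick Start.
def exact_match(predicted, ground_truth):
    return {"overall": float(predicted == ground_truth)}


pairs = [
    ({"source_schema": "a"}, {"source_schema": "a"}),
    ({"source_schema": "a"}, {"source_schema": "b"}),
]
results = evaluate_batch(exact_match, pairs)
print([r["overall"] for r in results])  # [1.0, 0.0]
```

Threads keep the sketch simple, but for CPU-bound scoring a process pool may parallelize better.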

## Testing

Run the test suite:

```bash
# Using pip
pip install -e ".[dev]"
pytest

# Using uv (recommended)
uv sync --extra dev
uv run pytest

# Run with coverage
uv run pytest --cov=slice

# Run specific test file
uv run pytest tests/test_schema_lineage_evaluator.py -v

# Code quality checks
uv run black slice/ tests/     # Format code
uv run flake8 slice/           # Lint code  
uv run mypy slice/             # Type checking
```

## Contributing

We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Add tests for your changes
5. Run the test suite (`pytest`)
6. Commit your changes (`git commit -m 'Add amazing feature'`)
7. Push to the branch (`git push origin feature/amazing-feature`)
8. Open a Pull Request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Citation

If you use SLiCE in your research, please cite:

```bibtex
@software{slice2025,
  title={SLiCE: Schema Lineage Composite Evaluation},
  author={Jiaqi Yin and Yi-Wei Chen and Meng-Lung Lee and Xiya Liu},
  year={2025},
  url={https://github.com/microsoft/SLiCE}
}

@misc{yin2025schemalineageextractionscale,
      title={Schema Lineage Extraction at Scale: Multilingual Pipelines, Composite Evaluation, and Language-Model Benchmarks}, 
      author={Jiaqi Yin and Yi-Wei Chen and Meng-Lung Lee and Xiya Liu},
      year={2025},
      eprint={2508.07179},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.07179}, 
}
```

## Support

- **Documentation**: [Read the Docs](https://slice-score.readthedocs.io/)
- **Issues**: [GitHub Issues](https://github.com/microsoft/SLiCE/issues)
- **Discussions**: [GitHub Discussions](https://github.com/microsoft/SLiCE/discussions)

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "slice-score",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.9",
    "maintainer_email": "Jackie Jiaqi Yin <jackie.yin@microsoft.com>, Yi-Wei Chen <yiweichen@microsoft.com>",
    "keywords": "schema, lineage, evaluation, data-pipeline, nlp, ast",
    "author": null,
    "author_email": "Jackie Jiaqi Yin <jackie.yin@microsoft.com>",
    "download_url": "https://files.pythonhosted.org/packages/8d/61/f4b36f5ba080069fcaf7ff0f6f4bcd597f1c08e29064aec04fe45f418e6d/slice_score-1.0.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License\n        \n            Copyright (c) Microsoft Corporation.\n        \n            Permission is hereby granted, free of charge, to any person obtaining a copy\n            of this software and associated documentation files (the \"Software\"), to deal\n            in the Software without restriction, including without limitation the rights\n            to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n            copies of the Software, and to permit persons to whom the Software is\n            furnished to do so, subject to the following conditions:\n        \n            The above copyright notice and this permission notice shall be included in all\n            copies or substantial portions of the Software.\n        \n            THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n            IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n            FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n            AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n            LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n            OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\n            SOFTWARE\n        ",
    "summary": "Schema Lineage Composite Evaluation - A Python package for evaluating schema lineage extraction accuracy",
    "version": "1.0.1",
    "project_urls": {
        "Documentation": "https://slice-score.readthedocs.io/",
        "Homepage": "https://github.com/microsoft/SLiCE",
        "Issues": "https://github.com/microsoft/SLiCE/issues",
        "Repository": "https://github.com/microsoft/SLiCE"
    },
    "split_keywords": [
        "schema",
        " lineage",
        " evaluation",
        " data-pipeline",
        " nlp",
        " ast"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "496199b17ed216e6e875f2b628d2d7003855d151cdd1ce0fdc40ac2b1ff0d845",
                "md5": "e1b34c50bbb6ceb32098b80ee27b75a3",
                "sha256": "20d308c7ef306f8c77948ae558bb20644d61f32a713766aa592d00fffb36324a"
            },
            "downloads": -1,
            "filename": "slice_score-1.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e1b34c50bbb6ceb32098b80ee27b75a3",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 37253,
            "upload_time": "2025-08-17T00:29:56",
            "upload_time_iso_8601": "2025-08-17T00:29:56.866346Z",
            "url": "https://files.pythonhosted.org/packages/49/61/99b17ed216e6e875f2b628d2d7003855d151cdd1ce0fdc40ac2b1ff0d845/slice_score-1.0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "8d61f4b36f5ba080069fcaf7ff0f6f4bcd597f1c08e29064aec04fe45f418e6d",
                "md5": "38ed37e9fda5cd77b1a9a31437ffe989",
                "sha256": "c1b00cc7f746c4605d4c12be13453e54fb2614c9054eba03a9180163fb7c6083"
            },
            "downloads": -1,
            "filename": "slice_score-1.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "38ed37e9fda5cd77b1a9a31437ffe989",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 164688,
            "upload_time": "2025-08-17T00:29:58",
            "upload_time_iso_8601": "2025-08-17T00:29:58.582683Z",
            "url": "https://files.pythonhosted.org/packages/8d/61/f4b36f5ba080069fcaf7ff0f6f4bcd597f1c08e29064aec04fe45f418e6d/slice_score-1.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-17 00:29:58",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "microsoft",
    "github_project": "SLiCE",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "slice-score"
}
