hf-for-legal

Name: hf-for-legal
Version: 0.0.13
Home page: https://github.com/louisbrulenaudet/hf-for-legal
Summary: HF for Legal: A Community Package for Legal Applications 🤗
Upload time: 2024-07-26 11:50:57
Maintainer: None
Docs URL: None
Author: Louis Brulé Naudet
Requires Python: None
License: Apache License 2.0
Keywords: language-models, retrieval, web-scraping, gpl, nlp, hf-for-legal, machine-learning, retrieval-augmented-generation, rag, huggingface, generative-ai, llama, mistral, inference-api, datasets, llm-as-judge
Requirements: No requirements were recorded.

<img src="https://huggingface.co/spaces/HFforLegal/README/resolve/main/assets/thumbnail.png">

# HF for Legal: A Community Package for Legal Applications 🤗

[![Python](https://img.shields.io/pypi/pyversions/hf-for-legal.svg)](https://pypi.org/project/hf-for-legal/) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) ![Maintainer](https://img.shields.io/badge/maintainer-@louisbrulenaudet-blue)

Welcome to the HF for Legal package, a library dedicated to breaking down the opacity of language models for legal professionals. Our mission is to empower legal practitioners, scholars, and researchers with the knowledge and tools they need to navigate the complex world of AI in the legal domain. At HF for Legal, we aim to:
- Demystify AI language models for the legal community
- Share curated resources, including specialized legal models, datasets, and tools
- Foster collaboration on projects that enhance legal research and practice through AI
- Provide a platform for discussing ethical implications and best practices of AI in law
- Offer tutorials and workshops on leveraging AI technologies in legal work

By bringing together legal experts, AI researchers, and technology enthusiasts, we strive to create an open ecosystem where legal professionals can easily access, understand, and utilize AI models tailored to their needs. Whether you're a practicing attorney, a legal scholar, or a technologist interested in legal applications of AI, HF for Legal is your hub for exploration, learning, and innovation in the evolving landscape of AI-assisted legal practice.

## Installation

To use hf-for-legal, you need to have the following Python packages installed:
- `numpy`
- `datasets`
- `tqdm`

You can install the package together with these dependencies via pip:

```bash
pip install numpy datasets hf-for-legal tqdm
```

## Usage

First, initialize the DatasetFormatter class with your dataset:

```python
import datasets
from hf_for_legal import DatasetFormatter

# Load a sample dataset
dataset = datasets.Dataset.from_dict(
  {
    "document": [
      "This is a test document.", 
      "Another test document.
    ]
  }
)

# Create an instance of DatasetFormatter
formatter = DatasetFormatter(dataset)

# Apply the hash and UUID functions
formatted_dataset = formatter()
print(formatted_dataset)
```

# Class: DatasetFormatter

## Parameters:

- **dataset** (`datasets.Dataset`): The dataset to be formatted.

## Attributes:

- **dataset** (`datasets.Dataset`): The original dataset.

## Methods

### hash(self, column_name: str = "document", hash_column_name: str = "hash") -> datasets.Dataset

Add a SHA-256 hash column to the dataset.

#### Parameters:

- **column_name** (`str`, optional): The name of the column containing the text to hash. Default is "document".
- **hash_column_name** (`str`, optional): The name of the column to store the hash values. Default is "hash".

#### Returns:

- `datasets.Dataset`: The dataset with the new hash column.

#### Raises:

- **ValueError**: If the specified column_name does not exist in the dataset.
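
For example, a minimal usage sketch reusing the `formatter` instance from the Usage section (the column names are illustrative):

```python
# Hash the "document" column into a new "sha256" column
hashed = formatter.hash(column_name="document", hash_column_name="sha256")
```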

### uuid(self, uuid_column_name: str = "uuid") -> datasets.Dataset

Add a UUID column to the dataset.

#### Parameters:

- **uuid_column_name** (`str`, optional): The name of the column to store the UUID values. Default is "uuid".

#### Returns:

- `datasets.Dataset`: The dataset with the new UUID column.
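
A minimal sketch, assuming a `formatter` instance as in the Usage section:

```python
# Assign a unique identifier to every row, stored in an "id" column
with_ids = formatter.uuid(uuid_column_name="id")
```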

### normalize_text(self, column_name: str, normalized_column_name: Optional[str] = None) -> datasets.Dataset

Normalize text in a specified column by converting to lowercase and stripping whitespace.

#### Parameters:

- **column_name** (`str`): The name of the column containing the text to be normalized.
- **normalized_column_name** (`str`, optional): The name of the new column to store the normalized text. If not provided, it overwrites the original column.

#### Returns:

- `datasets.Dataset`: The dataset with the normalized text column.

#### Raises:

- **ValueError**: If the specified column_name does not exist in the dataset.
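
A minimal sketch (column names are illustrative):

```python
# Lowercase and strip the "document" column into a new "document_clean" column,
# leaving the original column untouched
normalized = formatter.normalize_text(
    column_name="document",
    normalized_column_name="document_clean",
)
```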

### filter_rows(self, condition: Callable) -> datasets.Dataset

Filter rows based on a given condition.

#### Parameters:

- **condition** (`Callable`): A function that takes a row (dict) and returns True if the row should be included in the filtered dataset.

#### Returns:

- `datasets.Dataset`: The filtered dataset.
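
A minimal sketch; the predicate below simply keeps rows whose "document" field is non-empty:

```python
# Keep only rows with non-empty document text
filtered = formatter.filter_rows(lambda row: len(row["document"].strip()) > 0)
```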

### rename_column(self, old_column_name: str, new_column_name: str) -> datasets.Dataset

Rename a column in the dataset.

#### Parameters:

- **old_column_name** (`str`): The current name of the column to be renamed.
- **new_column_name** (`str`): The new name for the column.

#### Returns:

- `datasets.Dataset`: The dataset with the renamed column.

#### Raises:

- **ValueError**: If the specified old_column_name does not exist in the dataset.
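
A minimal sketch (column names are illustrative):

```python
# Rename the "document" column to "text"
renamed = formatter.rename_column("document", "text")
```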

### drop_column(self, column_name: str) -> datasets.Dataset

Drop a specified column from the dataset.

#### Parameters:

- **column_name** (`str`): The name of the column to be dropped.

#### Returns:

- `datasets.Dataset`: The dataset with the specified column dropped.

#### Raises:

- **ValueError**: If the specified column_name does not exist in the dataset.
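
A minimal sketch (the column name is illustrative):

```python
# Remove the "hash" column once it is no longer needed
slimmed = formatter.drop_column("hash")
```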

### add_constant_column(self, column_name: str, constant_value) -> datasets.Dataset

Add a new column with a constant value.

#### Parameters:

- **column_name** (`str`): The name of the new column to be added.
- **constant_value**: The constant value to be assigned to each row in the new column.

#### Returns:

- `datasets.Dataset`: The dataset with the new constant value column.
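
A minimal sketch (the column name and value are illustrative):

```python
# Tag every row with the same source label
tagged = formatter.add_constant_column("source", "example-corpus")
```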

### convert_column_type(self, column_name: str, new_type: Union[type, str]) -> datasets.Dataset

Convert a column to a specified data type.

#### Parameters:

- **column_name** (`str`): The name of the column to be converted.
- **new_type** (`Union[type, str]`): The new data type for the column, e.g., int, float, str.

#### Returns:

- `datasets.Dataset`: The dataset with the converted column.

#### Raises:

- **ValueError**: If the specified column_name does not exist in the dataset.
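
A minimal sketch, assuming a "year" column that holds numeric strings (the column is illustrative):

```python
# Cast the "year" column from strings to integers
converted = formatter.convert_column_type("year", int)
```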

### fill_missing(self, column_name: str, fill_value) -> datasets.Dataset

Fill missing values in a column with a specified value.

#### Parameters:

- **column_name** (`str`): The name of the column with missing values to be filled.
- **fill_value**: The value to fill in for missing values.

#### Returns:

- `datasets.Dataset`: The dataset with missing values filled.

#### Raises:

- **ValueError**: If the specified column_name does not exist in the dataset.
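
A minimal sketch, assuming a "title" column with some missing values (both names are illustrative):

```python
# Replace missing titles with a placeholder value
filled = formatter.fill_missing("title", "untitled")
```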

### compute_summary(self, column_name: str) -> Dict[str, float]

Compute summary statistics for a numerical column.

#### Parameters:

- **column_name** (`str`): The name of the numerical column to compute summary statistics for.

#### Returns:

- **Dict[str, float]**: A dictionary containing summary statistics (mean, median, std) for the column.

#### Raises:

- **ValueError**: If the specified column_name does not exist in the dataset.
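
A minimal sketch, assuming a numerical "word_count" column (the column is illustrative):

```python
# Summary statistics for the "word_count" column
stats = formatter.compute_summary("word_count")
print(stats)  # e.g. {"mean": ..., "median": ..., "std": ...}
```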

### __call__(self, hash_column_name: str = "hash", uuid_column_name: str = "uuid") -> datasets.Dataset

Apply both the hash and UUID functions to the dataset.

#### Parameters:

- **hash_column_name** (`str`, optional): The name of the new column to store the hash values. Default is "hash".
- **uuid_column_name** (`str`, optional): The name of the new column to store the UUID values. Default is "uuid".

#### Returns:

- `datasets.Dataset`: The dataset with both hash and UUID columns.
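
This is the shortcut used in the Usage example above; with explicit column names it would look like:

```python
# Equivalent to calling hash() and uuid() in sequence
formatted = formatter(hash_column_name="hash", uuid_column_name="uuid")
```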

## Community Discord

You can now join, communicate and share on the HF for Legal community server on Discord.

Link to the server: https://discord.gg/vNhXRsfw

This server is intended to simplify communication between members of the organization and to foster collaboration across its three areas of work: interactive applications, databases, and models.

One example of a project soon to be published: a copy of the Laws database that also contains precomputed embeddings for different models, enabling simpler integration within Spaces (for example, a RAG chatbot) and saving deployment costs for users who want to use these technologies in their professional and personal projects.

## Citing & Authors

If you use this code in your research, please use the following BibTeX entry.

```BibTeX
@misc{louisbrulenaudet2024,
  author =       {Louis Brulé Naudet},
  title =        {HF for Legal: A Community Package for Legal Applications},
  year =         {2024},
  howpublished = {\url{https://github.com/louisbrulenaudet/hf-for-legal}},
}
```

## Feedback

If you have any feedback, please reach out at [louisbrulenaudet@icloud.com](mailto:louisbrulenaudet@icloud.com).

            
