email-cleaning-service

Name: email-cleaning-service
Version: 1.0.0
Home page: https://github.com/JacksonKnew/email_cleaning_service
Summary: email_cleaning_service created by paul_lestrat
Author: paul_lestrat
Upload time: 2023-07-19 15:02:42
Requirements: keras==2.10.0, matplotlib==3.7.1, mlflow==2.0.1, pandas==1.5.3, pydantic==1.10.7, setuptools==67.6.1, tensorflow==2.10.1, transformers==4.27.2
# email_cleaning_service

This is an email segmenting service which takes a list of emails as input and returns the header, body, and signature of each message in the email.

## Getting Started

The project is published on PyPI and can be installed with the following command:

```bash
pip install email-cleaning-service
```

## Usage

The package can be used as follows:
```py
from email_cleaning_service.control import EmailCleaner

email_cleaner = EmailCleaner(tracking_uri, storage_uri)
```

Usage revolves around the EmailCleaner class, which is the preferred interface for the package. The class takes two arguments: tracking_uri, the URI of the MLflow tracking server, and storage_uri, the URI of the storage server (which can be a path to a local folder).

Pydantic BaseModel classes exist to simplify interactions with the class. The most important of these is the PipelineSpecs class, which defines the pipeline used to clean the emails.

```py
from email_cleaning_service.utils.request_classes import PipelineSpecs

pipeline_specs = PipelineSpecs(
    classifier_origin="mlflow", # or "h5" or "config"
    classifier_id="a1f66311816e417cb94db7c2457b25d1",
    encoder_origin="hugg", # or "mlflow"
    encoder_id="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    encoder_dim=384,
    features=[
            "phone_number",
            "url",
            "punctuation",
            "horizontal_separator",
            "hashtag",
            "pipe",
            "email",
            "capitalized",
            "full_caps"
        ]      # Can be any combination of the above features
)
```

**The above pipeline is the recommended one for multilingual email segmenting for now.**

The pipeline contains 3 main elements:
* The embedding model: the model used to embed the emails into a vector space. It can be either a Hugging Face model or an MLflow model. For a Hugging Face model, set encoder_origin to "hugg" and encoder_id to the model name on the platform. For an MLflow model, set encoder_origin to "mlflow" and encoder_id to the ID of the run whose model you want to use from the MLflow server.
* The extracted features: a list of regex features that are concatenated to the embedding of each sentence.
* The classifier: the model that performs the final classification and splits a thread into multiple messages. It can come from MLflow, in which case the run ID must be specified, or from an h5 file, in which case the path to the file must be specified.
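
For intuition, the regex features are per-line pattern matches. A minimal sketch of what such feature extraction could look like (the patterns below are simplified illustrations, not the package's actual implementation):

```python
import re

# Simplified, illustrative patterns; the package's real regexes may differ.
FEATURE_PATTERNS = {
    "phone_number": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "url": re.compile(r"https?://\S+"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "pipe": re.compile(r"\|"),
    "full_caps": re.compile(r"\b[A-Z]{2,}\b"),
}

def extract_features(line: str) -> list[int]:
    """Return a binary feature vector: 1 for each pattern that matches the line."""
    return [1 if pattern.search(line) else 0 for pattern in FEATURE_PATTERNS.values()]

print(extract_features("Call me at 0781759532 or visit https://example.com"))
# → [1, 1, 0, 0, 0]
```

In the real pipeline, such a vector would be concatenated to each sentence's embedding before classification.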



The pipeline can then be used as follows:
```py
email_list = [
    "This is a test email. I am testing the email cleaner.\nYours truly, Paul",
    "Hello team!\nThis is another test email with two lines.\n I am testing the email cleaner",
    "Bonjour!\nCeci est un autre email\n\nAu revoir!\nPaul",
]

email_cleaner.segment(email_list, pipeline_specs)
```

The output should look like this:

```py
{'threads': [{'source': 'This is a test email. I am testing the email cleaner.\nYours truly, Paul\n0781759532',
   'messages': [{'full': 'This is a test email. I am testing the email cleaner.\nYours truly, Paul\n0781759532',
     'header': '',
     'disclaimer': '',
     'greetings': '',
     'body': 'This is a test email. I am testing the email cleaner.\nYours truly, Paul\n0781759532',
     'signature': '',
     'caution': ''}]},
  {'source': 'Hello team!\nThis is another test email with two lines.\n I am testing the email cleaner.',
   'messages': [{'full': 'Hello team!\nThis is another test email with two lines.\n I am testing the email cleaner.',
     'header': '',
     'disclaimer': '',
     'greetings': '',
     'body': 'Hello team!',
     'signature': 'This is another test email with two lines.\n I am testing the email cleaner.',
     'caution': ''}]},
  {'source': 'Bonjour!\nCeci est un autre email\n\nAu revoir!\nPaul',
   'messages': [{'full': 'Bonjour!\nCeci est un autre email\nAu revoir!\nPaul',
     'header': '',
     'disclaimer': '',
     'greetings': '',
     'body': 'Bonjour!\nCeci est un autre email\nAu revoir!',
     'signature': 'Paul',
     'caution': ''}]}]}
```
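
Since the result is a plain dictionary, downstream code can walk it directly. A small helper that collects the body of every message, assuming the structure shown above:

```python
def extract_bodies(result: dict) -> list[str]:
    """Collect the body of every message in every thread of a segment() result."""
    return [
        message["body"]
        for thread in result["threads"]
        for message in thread["messages"]
    ]

# Using the third thread from the example output above:
sample = {"threads": [{"source": "Bonjour!\nCeci est un autre email\n\nAu revoir!\nPaul",
                       "messages": [{"full": "Bonjour!\nCeci est un autre email\nAu revoir!\nPaul",
                                     "header": "", "disclaimer": "", "greetings": "",
                                     "body": "Bonjour!\nCeci est un autre email\nAu revoir!",
                                     "signature": "Paul", "caution": ""}]}]}
print(extract_bodies(sample))  # → ['Bonjour!\nCeci est un autre email\nAu revoir!']
```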

## Model Training

This package also includes support for training pipelines. You can train (fine-tune) either an encoder model or a classifier model. A noteworthy difference between the two is that encoders are trained on a single email line at a time, while classifiers are trained on sequences of 64 lines.
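
Emails rarely have a multiple of 64 lines, so classifier inputs presumably need padding. An illustrative, hypothetical chunking helper (not part of the package) to show the idea:

```python
def chunk_lines(lines: list[str], seq_len: int = 64, pad: str = "") -> list[list[str]]:
    """Split an email's lines into fixed-length sequences, padding the last one."""
    chunks = [lines[i:i + seq_len] for i in range(0, len(lines), seq_len)]
    if chunks and len(chunks[-1]) < seq_len:
        chunks[-1] = chunks[-1] + [pad] * (seq_len - len(chunks[-1]))
    return chunks

# A 70-line email becomes two sequences of 64 lines, the second padded.
print([len(c) for c in chunk_lines(["line"] * 70)])  # → [64, 64]
```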

The CSV files used for training must contain lines from emails and have the following columns:
* Email: a unique ID for each email, used to group email lines together
* Text: the text of the email line
* Section: the section of the email line (disclaimer, header, greetings, body, signature, and caution, represented as 1 through 6 respectively)
* FragmentChange: a boolean (0 or 1) indicating whether the line marks a fragment change
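
As a sketch, a file with the required columns can be assembled with the standard csv module (the rows here are made-up examples; the section codes follow the 1-through-6 mapping above):

```python
import csv

rows = [
    # (Email, Text, Section, FragmentChange)
    ("mail_001", "Hello team!", 4, 0),                # body
    ("mail_001", "Yours truly, Paul", 5, 0),          # signature
    ("mail_002", "CAUTION: external sender", 6, 1),   # caution, new fragment
]

with open("train_example.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Email", "Text", "Section", "FragmentChange"])
    writer.writerows(rows)
```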

### Training an Encoder

To train an encoder, use the EncoderSpecs class and the RunSpecs class as follows:

```py
from email_cleaning_service.utils.request_classes import EncoderSpecs, RunSpecs

dataset = RunSpecs(
    run_name="demo_encoder_test_run_2",
    csv_train="./train_multi_71.csv",
    csv_test="./test_multi_48.csv",
    batch_size=4,
    metrics=[],
    lr=0.0001,
    epochs=1,
)

encoder_specs = EncoderSpecs(
    origin="mlflow",
    encoder="14a633237e734575ad7f8eac9bd0319e"
)

email_cleaner.train_encoder(dataset, encoder_specs)
```

The EncoderSpecs class takes two arguments, origin and encoder, which work the same way as in PipelineSpecs.
RunSpecs defines how you want to train the model. The arguments are:
* run_name: The name of the run on the MLflow server
* csv_train: The path to the csv file containing the training data
* csv_test: The path to the csv file containing the test data
* batch_size: The batch size to use for training
* metrics: A list of metrics to track during training. The metrics must be defined in the metrics.py file in utils.
* lr: The learning rate to use for training
* epochs: The number of epochs to train for

### Training a Classifier

To train a classifier, an entire pipeline must be defined. This is done using the PipelineSpecs class as follows:

```py
dataset = RunSpecs(
    run_name="with_fine_tuned_encoder",
    batch_size=4,
    csv_train="./train_multi.csv",
    csv_test="./test_multi.csv",
    metrics=["seq_f1", "frag_f1"],
    lr=0.007,
    epochs=3,
)

pipeline_specs = PipelineSpecs(
    classifier_origin="h5",
    classifier_id="./temp/base_multi_miniLM_classifier_optimized/multi_miniLM_classifier.h5",
    encoder_origin="mlflow",
    encoder_id="316fb5040b0a4353ade2e967290944ff",
    encoder_dim=384,
    features=[
            "phone_number",
            "url",
            "punctuation",
            "horizontal_separator",
            "hashtag",
            "pipe",
            "email",
            "capitalized",
            "full_caps"
        ]
)

email_cleaner.train_classifier(dataset, pipeline_specs)
```

### Evaluate a Model

To evaluate a model, use the evaluate method as follows:

```py
email_cleaner.evaluate(csv_path, pipeline_specs)
```

This evaluates the designated pipeline on the data in the CSV file and returns the metrics (all metrics are used by default).
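
The metric names seq_f1 and frag_f1 suggest F1 scores at the sequence and fragment level. As a refresher on what an F1 score computes, here is a generic binary version (illustrative only, not the package's implementation):

```python
def f1_score(y_true: list[int], y_pred: list[int]) -> float:
    """Binary F1: the harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f1_score([1, 1, 0, 1], [1, 0, 0, 1]))  # → 0.8
```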

## Package Structure

![package_architecture](./assets/package_architecture.png)

## Maintaining the Package

Tests have been set up using tox and pytest. To run them, use the following command:

```bash
tox
```

To add tests, feel free to modify the Python scripts in the tests folder. A GitHub Action automatically runs the tests on every push.

## Known Issues

Training classifiers for too long appears to cause overfitting: the metrics on the test data start to decrease after a certain number of epochs. This is not noticeable when training for a few epochs but becomes more apparent over longer runs.
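
This reads like ordinary overfitting, so the usual mitigations apply: reduce the epochs argument of RunSpecs, or stop training once the test metric degrades. A generic early-stopping check, independent of this package, could look like:

```python
def should_stop(test_scores: list[float], patience: int = 2) -> bool:
    """Stop when the test metric has not improved for `patience` consecutive epochs."""
    if len(test_scores) <= patience:
        return False
    best = max(test_scores[:-patience])
    return all(score <= best for score in test_scores[-patience:])

# Test F1 peaked at epoch 3 and has declined for 2 epochs since: time to stop.
print(should_stop([0.60, 0.72, 0.75, 0.74, 0.73]))  # → True
```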

            
