e2eAIOK-recdp


Name: e2eAIOK-recdp
Version: 1.2.0
Home page: https://github.com/intel/e2eAIOK/
Summary: A data processing bundle for spark based recommender system operations
Upload time: 2023-12-22 01:59:15
Author: INTEL BDF AIOK
Requires Python: >=3.6
Keywords: pyrecdp recdp distributed parallel auto-feature-engineering autofe llm python
Requirements: No requirements were recorded.

# RecDP - one-stop toolkit for AI data processing

We provide Intel-optimized solutions for:

* [**Auto Feature Engineering**](pyrecdp/autofe/README.md) - Provides an automatic way to generate new features for any tabular dataset containing numerical, categorical, and text features. It takes only 3 lines of code to automatically enrich features based on data analysis, statistics, clustering, and multi-feature interaction.
* [**LLM Data Preparation**](pyrecdp/LLM/README.md) - Provides a parallel, easy-to-use data pipeline for LLM data processing. It supports multiple data sources such as jsonlines, PDFs, images, and audio/video. Users can perform data extraction, deduplication (near dedup, rouge, exact), splitting, special character fixing, several types of filtering (length, perplexity, profanity, etc.), and quality analysis (diversity, GPT-3 quality, toxicity, perplexity, etc.). The tool can also save output as jsonlines or parquet files, or insert it into vector stores (FaissStore, ChromaStore, ElasticSearchStore).
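
As a rough illustration of two of the steps listed above (exact deduplication and length filtering over jsonlines input), here is a minimal standard-library sketch. It is not RecDP's implementation and does not call `pyrecdp`; the function name, the `text` key, and the `min_chars` threshold are assumptions made for the example.

```python
import hashlib
import json


def exact_dedup_and_length_filter(lines, min_chars=20, text_key="text"):
    """Yield parsed records whose text is long enough and not seen before."""
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the jsonlines input
        record = json.loads(line)
        text = record.get(text_key, "")
        if len(text) < min_chars:
            continue  # length filter
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier record
        seen.add(digest)
        yield record


sample = [
    '{"text": "RecDP prepares text data for LLM training."}',
    '{"text": "RecDP prepares text data for LLM training."}',
    '{"text": "too short"}',
    '{"text": "A second, distinct document about Spark and Ray."}',
]
kept = list(exact_dedup_and_length_filter(sample))
print(len(kept))
```

Hashing the text keeps the memory footprint per document constant, which is one common way to do exact deduplication; near-duplicate detection needs a different technique and is not shown here.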

## How it works

Install the system dependencies (a Java 8 runtime and Graphviz), then install this tool through pip.

```
DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre graphviz
pip install pyrecdp[all] --pre
```
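
After running the commands above, a quick check that the system tools are reachable can save debugging time later. The sketch below assumes the `openjdk-8-jre` package exposes a `java` executable and the `graphviz` package exposes `dot`; the helper name is hypothetical and is not part of RecDP.

```python
import shutil


def missing_system_tools(tools=("java", "dot")):
    """Return the names of required executables that are not on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = missing_system_tools()
if missing:
    print("Missing system tools:", ", ".join(missing))
else:
    print("java and dot (Graphviz) were found on PATH")
```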

## RecDP - Tabular
[learn more](pyrecdp/autofe/README.md)

* Auto Feature Engineering Pipeline
![Auto Feature Engineering Pipeline](resources/autofe_pipeline.jpg)

It takes only **3** lines of code to generate new features for your tabular data. Usually, 5x new features can be found, with up to a 1.2x accuracy boost.
```
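from pyrecdp.autofe import AutoFE

# train_data and target_label are supplied by the caller: the input dataset
# and its label. 'Day' is passed as the time-series column.
pipeline = AutoFE(dataset=train_data, label=target_label, time_series='Day')
transformed_train_df = pipeline.fit_transform()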
```
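
To give a feel for what enriching a tabular dataset can look like, here is a minimal standard-library sketch that derives calendar features from a `Day` column and a frequency count from a categorical column. It is not RecDP's implementation and does not call `AutoFE`; the `store` and `sales` columns, the `enrich` helper, and the generated feature names are all hypothetical.

```python
from collections import Counter
from datetime import date


def enrich(rows, time_col="Day", cat_col="store"):
    """Return copies of the rows with a few derived features added."""
    counts = Counter(row[cat_col] for row in rows)
    enriched = []
    for row in rows:
        day = date.fromisoformat(row[time_col])
        new_row = dict(row)  # leave the input rows untouched
        new_row[f"{time_col}__month"] = day.month
        new_row[f"{time_col}__weekday"] = day.weekday()  # Monday == 0
        new_row[f"{cat_col}__count"] = counts[row[cat_col]]  # frequency encoding
        enriched.append(new_row)
    return enriched


rows = [
    {"Day": "2023-12-22", "store": "A", "sales": 10.0},
    {"Day": "2023-12-23", "store": "A", "sales": 12.5},
    {"Day": "2023-12-24", "store": "B", "sales": 7.0},
]
for row in enrich(rows):
    print(row)
```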

* High performance on terabyte-scale tabular data processing
![Performance](resources/recdp_performance.jpg)

## RecDP - LLM
[learn more](pyrecdp/LLM/README.md)

* Low-code Fault-tolerant Auto-scaling Parallel Pipeline
![LLM Pipeline](resources/llm_pipeline.jpg)

```
from pyrecdp.primitives.operations import *
from pyrecdp.LLM import ResumableTextPipeline

pipeline = ResumableTextPipeline()
# `urls` is supplied by the caller; the `...` below stands for further operations.
ops = [
    UrlLoader(urls, max_depth=2),
    DocumentSplit(),
    ProfanityFilter(),
    PIIRemoval(),
    ...
    PerfileParquetWriter("ResumableTextPipeline_output")
]
pipeline.add_operations(ops)
pipeline.execute()
```
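
The idea of a resumable, per-file pipeline can be sketched without RecDP: record which input files have finished, and skip them on the next run. The following is a minimal illustration under that assumption only. It does not reflect how `ResumableTextPipeline` is implemented, and `run_resumable` and the JSON status file are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def run_resumable(input_files, process, status_path):
    """Apply `process` to each file once, recording finished files in a
    JSON status file so that a rerun skips them."""
    status_path = Path(status_path)
    done = set(json.loads(status_path.read_text())) if status_path.exists() else set()
    processed_now = []
    for path in input_files:
        key = str(path)
        if key in done:
            continue  # finished in an earlier run
        process(path)
        done.add(key)
        # Persist progress after every file so a crash loses at most one file.
        status_path.write_text(json.dumps(sorted(done)))
        processed_now.append(key)
    return processed_now


with tempfile.TemporaryDirectory() as tmp:
    files = []
    for name in ("a.txt", "b.txt"):
        p = Path(tmp) / name
        p.write_text("hello")
        files.append(p)
    status = Path(tmp) / "status.json"
    first = run_resumable(files, lambda p: p.read_text().upper(), status)
    second = run_resumable(files, lambda p: p.read_text().upper(), status)
    print(len(first), len(second))
```

The second call processes nothing because both files are already listed in the status file.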

## LICENSE
* Apache 2.0

## Dependencies
* Spark 3.4.*
* Python 3.*
* Ray 2.7.*
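
One way to check an environment against the version patterns above is to compare installed distribution versions with `fnmatch`. This is a sketch under assumptions: it assumes Spark is installed as the `pyspark` distribution and Ray as `ray`, it needs Python 3.8+ for `importlib.metadata`, and the helper names are hypothetical.

```python
from fnmatch import fnmatch
from importlib import metadata

# Patterns taken from the dependency list; distribution names are assumptions.
REQUIRED = {"pyspark": "3.4.*", "ray": "2.7.*"}


def version_matches(version, pattern):
    """Return True if a version string fits a wildcard pattern like '3.4.*'."""
    return fnmatch(version, pattern)


def check_installed(required=REQUIRED):
    """Map each distribution name to a short human-readable status."""
    report = {}
    for dist, pattern in required.items():
        try:
            version = metadata.version(dist)
        except metadata.PackageNotFoundError:
            report[dist] = "not installed"
            continue
        ok = version_matches(version, pattern)
        report[dist] = f"{version} ({'ok' if ok else 'expected ' + pattern})"
    return report


for dist, status in check_installed().items():
    print(f"{dist}: {status}")
```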
            

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/intel/e2eAIOK/",
    "name": "e2eAIOK-recdp",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "",
    "keywords": "pyrecdp recdp distributed parallel auto-feature-engineering autofe LLM python",
    "author": "INTEL BDF AIOK",
    "author_email": "bdf.aiok@intel.com",
    "download_url": "https://files.pythonhosted.org/packages/f3/07/97a6d868f3f123b655ab412b257d694e9453034052aa8eee08963aff50fc/e2eAIOK-recdp-1.2.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "A data processing bundle for spark based recommender system operations",
    "version": "1.2.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/intel/e2eAIOK/",
        "Homepage": "https://github.com/intel/e2eAIOK/"
    },
    "split_keywords": [
        "pyrecdp",
        "recdp",
        "distributed",
        "parallel",
        "auto-feature-engineering",
        "autofe",
        "llm",
        "python"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "f30797a6d868f3f123b655ab412b257d694e9453034052aa8eee08963aff50fc",
                "md5": "37b7c9042e8c6b73cd662e264790843c",
                "sha256": "548dcf58a246237c203d7530856b33d833071102b8f832111ba2cbaa5f287d11"
            },
            "downloads": -1,
            "filename": "e2eAIOK-recdp-1.2.0.tar.gz",
            "has_sig": false,
            "md5_digest": "37b7c9042e8c6b73cd662e264790843c",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 286814,
            "upload_time": "2023-12-22T01:59:15",
            "upload_time_iso_8601": "2023-12-22T01:59:15.230413Z",
            "url": "https://files.pythonhosted.org/packages/f3/07/97a6d868f3f123b655ab412b257d694e9453034052aa8eee08963aff50fc/e2eAIOK-recdp-1.2.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-12-22 01:59:15",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "intel",
    "github_project": "e2eAIOK",
    "travis_ci": true,
    "coveralls": false,
    "github_actions": true,
    "lcname": "e2eaiok-recdp"
}
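
The `sha256` digest recorded in the raw data can be used to verify a downloaded copy of the sdist. The sketch below assumes the archive has already been downloaded to a local path; `sha256_of` and `verify` are hypothetical helper names, and the expected digest is copied from the `digests` entry above.

```python
import hashlib

# Copied from the "sha256" field of the raw data above.
EXPECTED_SHA256 = "548dcf58a246237c203d7530856b33d833071102b8f832111ba2cbaa5f287d11"


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """Return True if the file at `path` has the expected SHA-256 digest."""
    return sha256_of(path) == expected
```

For example, `verify("e2eAIOK-recdp-1.2.0.tar.gz")` should return `True` for an intact download of the file named in the raw data.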
        