AutoDataPreprocess


- Name: AutoDataPreprocess
- Version: 0.1.2
- Home page: https://github.com/agirishkumar/AutoDataPreprocess
- Summary: A high-level library for automatic preprocessing of tabular data
- Upload time: 2024-08-15 00:23:18
- Maintainer: None
- Docs URL: None
- Author: Girish Kumar Adari
- Requires Python: >=3.6
- License: None
- Requirements: No requirements were recorded.

# AutoDataPreprocess

AutoDataPreprocess is a comprehensive Python library for automated data preprocessing. It provides a wide range of tools and techniques to clean, transform, and prepare data for machine learning models.

## Features

- Data loading from various sources (CSV, JSON, Excel, HTML, XML, Pickle, SQL, API)
- Basic data analysis and visualization
- Data cleaning (handling missing values, outliers, duplicates)
- Feature engineering
- Encoding of categorical variables (onehot, label, ordinal, target, woe, james_stein, catboost, binary)
- Scaling and normalization
- Dimensionality reduction
- Feature selection
- Handling imbalanced data
- Time series preprocessing
- Data anonymization

## Installation

You can install AutoDataPreprocess using pip: `pip install autodatapreprocess`

## Quick Start

```python
from autodatapreprocess import AutoDataPreprocess

# Load data
adp = AutoDataPreprocess('your_data_file.csv')

# Perform basic analysis
adp.basic_analysis()

# Clean the data
cleaned_data = adp.clean(missing='mean', outliers='iqr')

# Perform feature engineering
engineered_data = adp.fe(target_column='target', polynomial_degree=2)

# Encode categorical variables
encoded_data = adp.encode(methods={'category_column': 'onehot'})

# Scale the data
scaled_data = adp.scale(method='standard')
```

## Detailed Usage

### Data Loading
Load data from various sources:

```python
# From CSV
adp = AutoDataPreprocess('data.csv')

# From SQL
adp = AutoDataPreprocess(sql_query="SELECT * FROM my_table", sql_connection_string="your_connection_string")

# From API
adp = AutoDataPreprocess(api_url="https://api.example.com/data", api_params={"key": "value"})
```

### Data Cleaning
Clean your data with various options:

```python
cleaned_data = adp.clean(
    missing='mean',
    outliers='iqr',
    drop_threshold=0.7,
    date_format='%Y-%m-%d',
    remove_duplicates=True
)
```
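
To make the `missing='mean'` and `outliers='iqr'` options more concrete, here is a minimal standalone sketch of what those terms usually mean: missing values are filled with the column mean, and values outside the 1.5 × IQR fences are flagged. It uses only the standard library. The helper names `impute_mean` and `iqr_bounds` are hypothetical, and this is not necessarily how AutoDataPreprocess implements these steps.

```python
import statistics

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical column with one missing value and one extreme value
raw = [10.0, 12.0, None, 11.0, 13.0, 12.0, 11.0, 10.0, 95.0]

filled = impute_mean(raw)
lower, upper = iqr_bounds(filled)
outliers = [v for v in filled if v < lower or v > upper]
```

Note that the extreme value inflates the mean used for imputation, which is one reason the order of the cleaning steps matters.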

### Feature Engineering
Perform feature engineering:

```python
engineered_data = adp.fe(
    target_column='target',
    polynomial_degree=2,
    interaction_only=False,
    bin_numeric=True,
    num_bins=5,
    cyclical_features=['month', 'day_of_week'],
    text_columns=['description'],
    date_columns=['date']
)
```
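
The `cyclical_features` parameter name suggests the common sine/cosine encoding of periodic values such as months or weekdays. The sketch below illustrates that technique on its own, under the assumption that this is what the option refers to; `cyclical_encode` is a hypothetical helper, not part of the library.

```python
import math

def cyclical_encode(value, period):
    """Map a periodic value onto the unit circle as a (sin, cos) pair."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

# Months are 1..12; December and January should end up close together
dec = cyclical_encode(12, 12)
jan = cyclical_encode(1, 12)
jun = cyclical_encode(6, 12)

dec_to_jan = math.dist(dec, jan)
dec_to_jun = math.dist(dec, jun)
```

With the raw integers, December (12) and January (1) look far apart; after encoding, their distance is smaller than the distance between December and June.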

### Encoding
Encode categorical variables:

```python
encoded_data = adp.encode(
    methods={
        'category1': 'onehot',
        'category2': 'label',
        'category3': 'target'
    },
    target_column='target'
)
```
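
As a rough guide to what the three methods in this example do, the following standalone sketch implements plain versions of label, one-hot, and target encoding. The function names are hypothetical and the library's own output format may differ. The sketch also shows why `target_column` is needed: target encoding replaces each category with a statistic of the target.

```python
def label_encode(values):
    """Map each distinct category to an integer, in sorted order."""
    mapping = {cat: idx for idx, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Return one 0/1 indicator column per distinct category."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

def target_encode(values, targets):
    """Replace each category with the mean target value of that category."""
    totals, counts = {}, {}
    for v, t in zip(values, targets):
        totals[v] = totals.get(v, 0) + t
        counts[v] = counts.get(v, 0) + 1
    return [totals[v] / counts[v] for v in values]

colors = ["red", "green", "blue", "green"]
target = [1, 0, 1, 1]

labels, mapping = label_encode(colors)
one_hot = one_hot_encode(colors)
target_encoded = target_encode(colors, target)
```

Unsmoothed target encoding as written here leaks target information into the features; practical implementations usually add smoothing or fit the encoding on held-out folds.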

### Scaling and Normalization
Scale or normalize your data:

```python
scaled_data = adp.scale(method='standard')
normalized_data = adp.normalize(method='l2')
```
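
The two calls address different things in the usual terminology: standard scaling works per column (subtract the mean, divide by the standard deviation), while L2 normalization works per row (divide by the Euclidean norm). The sketch below illustrates both with the standard library, assuming the library follows this convention; `standard_scale` and `l2_normalize` are hypothetical helper names.

```python
import math
import statistics

def standard_scale(column):
    """Center a column to mean 0 and scale it to unit population std."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(x - mean) / std for x in column]

def l2_normalize(row):
    """Scale a row vector so that its Euclidean norm is 1."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row]

scaled = standard_scale([2.0, 4.0, 6.0])
unit = l2_normalize([3.0, 4.0])
```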

### Dimensionality Reduction
Reduce the dimensionality of your data:

```python
reduced_data = adp.dimreduction(method='pca', n_components=5)
```

### Feature Selection
Select the most important features:

```python
selected_data = adp.feature_selection(
    target_column='target',
    method='correlation',
    correlation_threshold=0.8
)
```
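
The README does not say how `correlation_threshold` is applied. One plausible reading, sketched below as an assumption, is to keep the features whose absolute Pearson correlation with the target reaches the threshold. Another common reading is to drop features that are highly correlated with each other. The helpers `pearson` and `select_by_correlation` are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def select_by_correlation(features, target, threshold):
    """Keep feature names with |correlation to target| >= threshold."""
    return [
        name
        for name, column in features.items()
        if abs(pearson(column, target)) >= threshold
    ]

# Hypothetical data: "size" tracks the target closely, "noise" does not
features = {
    "size": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [2.0, -1.0, 0.0, 1.0, -2.0],
}
target = [2.1, 3.9, 6.2, 8.0, 9.8]

selected = select_by_correlation(features, target, threshold=0.8)
```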

### Handling Imbalanced Data
Balance your dataset:

```python
balanced_data = adp.balance_data(
    target_column='target',
    method='smote',
    sampling_strategy='auto'
)
```
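
For readers unfamiliar with SMOTE, the following is a simplified SMOTE-style sketch: each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours. It is a teaching sketch with hypothetical names (`smote_like`, `k`, `seed`), not the library's implementation, and it balances the minority class up to the majority count as an assumed meaning of `sampling_strategy='auto'`.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest other minority points, by Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical two-feature dataset with a 6:3 class imbalance
majority = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.1, 0.0), (0.0, 0.2)]
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]

new_points = smote_like(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
```

Because the new points are interpolated rather than copied, they stay inside the region spanned by the existing minority samples.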

### Time Series Preprocessing
Preprocess time series data:

```python
preprocessed_ts_data = adp.time_series_preprocessing(
    time_column='date',
    freq='D',
    method='mean',
    detrend_columns=['value'],
    seasonality_columns=['value'],
    lag_columns=['value'],
    lags=[1, 7, 30]
)
```
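
The `lag_columns` and `lags` arguments suggest classic lag features, where each new column holds the value from a fixed number of steps earlier. The sketch below shows that idea on a plain list, assuming this is what the option produces; `add_lags` is a hypothetical helper and the library's column naming is not documented here.

```python
def add_lags(values, lags):
    """Return {lag: series shifted by `lag` steps}, padding with None."""
    return {lag: ([None] * lag + values)[: len(values)] for lag in lags}

series = [5, 6, 7, 8, 9]
lagged = add_lags(series, lags=[1, 2])

# lagged[1][t] holds the value observed one step before position t
```

The leading `None` entries mark positions with no earlier observation; they typically need to be dropped or imputed before model training.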

### Data Anonymization
Anonymize sensitive data:

```python
anonymized_data = adp.apply_anonymization(
    columns=['sensitive_column'],
    method='hash',
    hash_algorithm='sha256'
)
```
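
As an illustration of hash-based anonymization, the sketch below replaces each value with its SHA-256 hex digest using the standard `hashlib` module. The `hash_value` helper and its `salt` parameter are hypothetical and are not taken from the library's API.

```python
import hashlib

def hash_value(value, salt=""):
    """Return the SHA-256 hex digest of the salted string form of value."""
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()

emails = ["alice@example.com", "bob@example.com", "alice@example.com"]
hashed = [hash_value(e, salt="demo-salt") for e in emails]
```

Equal inputs map to equal digests, so joins and group-bys on the column still work. Unsalted hashes of guessable values (emails, phone numbers) can be reversed with a dictionary attack, so a secret salt is advisable.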


            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/agirishkumar/AutoDataPreprocess",
    "name": "AutoDataPreprocess",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": null,
    "keywords": null,
    "author": "Girish Kumar Adari",
    "author_email": "adari.girishkumar@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/91/0c/dd1da2dcfce8ec836ff7ff87a0a5261e434f991ab7364198d84b6cf777a0/AutoDataPreprocess-0.1.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": null,
    "summary": "A high-level library for automatic preprocessing of tabular data",
    "version": "0.1.2",
    "project_urls": {
        "Homepage": "https://github.com/agirishkumar/AutoDataPreprocess"
    },
    "split_keywords": [],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "85869c75332f0809a6511bbbebd1724e7b6f4fc63cc77280f9e4c9aeed1f2756",
                "md5": "17c01255a0e85c17ef37660a872e7bfb",
                "sha256": "c149aff939dbbb2bc3b2c75434dd0465885b69a831d133ed638638e36d66f2e0"
            },
            "downloads": -1,
            "filename": "AutoDataPreprocess-0.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "17c01255a0e85c17ef37660a872e7bfb",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 16263,
            "upload_time": "2024-08-15T00:23:17",
            "upload_time_iso_8601": "2024-08-15T00:23:17.545069Z",
            "url": "https://files.pythonhosted.org/packages/85/86/9c75332f0809a6511bbbebd1724e7b6f4fc63cc77280f9e4c9aeed1f2756/AutoDataPreprocess-0.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "910cdd1da2dcfce8ec836ff7ff87a0a5261e434f991ab7364198d84b6cf777a0",
                "md5": "1926c655cd045b381abd6ebd6ca5ead3",
                "sha256": "673cf0a873a8fd04013875f4b47c688382c53a65acb76ef16baef8996e62d7d7"
            },
            "downloads": -1,
            "filename": "AutoDataPreprocess-0.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "1926c655cd045b381abd6ebd6ca5ead3",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 15902,
            "upload_time": "2024-08-15T00:23:18",
            "upload_time_iso_8601": "2024-08-15T00:23:18.481620Z",
            "url": "https://files.pythonhosted.org/packages/91/0c/dd1da2dcfce8ec836ff7ff87a0a5261e434f991ab7364198d84b6cf777a0/AutoDataPreprocess-0.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-08-15 00:23:18",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "agirishkumar",
    "github_project": "AutoDataPreprocess",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [],
    "lcname": "autodatapreprocess"
}
        