# AutoDataPreprocess
AutoDataPreprocess is a comprehensive Python library for automated data preprocessing. It provides a wide range of tools and techniques to clean, transform, and prepare data for machine learning models.
## Features
- Data loading from various sources (CSV, JSON, Excel, HTML, XML, Pickle, SQL, API)
- Basic data analysis and visualization
- Data cleaning (handling missing values, outliers, duplicates)
- Feature engineering
- Encoding of categorical variables (one-hot, label, ordinal, target, WOE, James-Stein, CatBoost, binary)
- Scaling and normalization
- Dimensionality reduction
- Feature selection
- Handling imbalanced data
- Time series preprocessing
- Data anonymization
## Installation
You can install AutoDataPreprocess using pip: `pip install autodatapreprocess`
## Quick Start
```python
from autodatapreprocess import AutoDataPreprocess
# Load data
adp = AutoDataPreprocess('your_data_file.csv')
# Perform basic analysis
adp.basic_analysis()
# Clean the data
cleaned_data = adp.clean(missing='mean', outliers='iqr')
# Perform feature engineering
engineered_data = adp.fe(target_column='target', polynomial_degree=2)
# Encode categorical variables
encoded_data = adp.encode(methods={'category_column': 'onehot'})
# Scale the data
scaled_data = adp.scale(method='standard')
```
## Detailed Usage
### Data Loading
Load data from various sources:
```python
# From CSV
adp = AutoDataPreprocess('data.csv')
# From SQL
adp = AutoDataPreprocess(sql_query="SELECT * FROM table", sql_connection_string="your_connection_string")
# From API
adp = AutoDataPreprocess(api_url="https://api.example.com/data", api_params={"key": "value"})
```
### Data Cleaning
Clean your data with various options:
```python
cleaned_data = adp.clean(
    missing='mean',
    outliers='iqr',
    drop_threshold=0.7,
    date_format='%Y-%m-%d',
    remove_duplicates=True
)
```
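To make the `missing='mean'` and `outliers='iqr'` options more concrete, here is a minimal standard-library sketch of what mean imputation and IQR-based outlier filtering usually mean. The helper names (`impute_mean`, `iqr_bounds`, `drop_outliers`) and the 1.5 multiplier are assumptions for illustration; they are not taken from the library, whose internals may differ.

```python
import statistics

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences placed k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def drop_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]

# Hypothetical column with one missing value and one extreme value.
raw = [10.0, 12.0, None, 11.0, 13.0, 100.0]
filled = impute_mean(raw)
cleaned = drop_outliers(filled)
```

In this toy input the imputed mean (29.2) is inflated by the extreme value 100.0, which suggests that the order of imputation and outlier handling can matter.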
### Feature Engineering
Perform feature engineering:
```python
engineered_data = adp.fe(
    target_column='target',
    polynomial_degree=2,
    interaction_only=False,
    bin_numeric=True,
    num_bins=5,
    cyclical_features=['month', 'day_of_week'],
    text_columns=['description'],
    date_columns=['date']
)
```
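The `cyclical_features` option suggests that periodic columns such as `month` are mapped onto a circle. A common way to do this is a sine/cosine encoding, sketched below with the standard library only. The function name `cyclical_encode` and the choice of period are assumptions; the library may implement this differently.

```python
import math

def cyclical_encode(value, period):
    """Map a periodic value to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

# Months are assumed to run from 1 to 12, so the period is 12.
december = cyclical_encode(12, 12)
january = cyclical_encode(1, 12)
june = cyclical_encode(6, 12)

# Distances in the encoded space reflect closeness on the calendar.
dec_to_jan = math.dist(december, january)
dec_to_jun = math.dist(december, june)
```

With the raw integers, December (12) and January (1) look far apart; in the encoded space they are neighbours, while December and June sit on opposite sides of the circle.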
### Encoding
Encode categorical variables:
```python
encoded_data = adp.encode(
    methods={
        'category1': 'onehot',
        'category2': 'label',
        'category3': 'target'
    },
    target_column='target'
)
```
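As a rough illustration of two of the encoders named above, the sketch below implements one-hot encoding and a plain (unsmoothed) target encoding in pure Python. The helper names and the `is_<category>` column naming are hypothetical, and the library's own encoders may add smoothing or handle unseen categories differently.

```python
from collections import defaultdict

def one_hot(values):
    """Return one dict per row with a 0/1 indicator for each category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

def target_encode(values, targets):
    """Replace each category with the mean target observed for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[v] for v in values]

# Hypothetical categorical column and binary target.
colors = ["red", "blue", "red", "green"]
target = [1, 0, 0, 1]

onehot_rows = one_hot(colors)
target_encoded = target_encode(colors, target)
```

Target encoding computed on the full dataset uses the label itself, so in practice it is usually fitted on training data only.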
### Scaling and Normalization
Scale or normalize your data:
```python
scaled_data = adp.scale(method='standard')
normalized_data = adp.normalize(method='l2')
```
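The two calls above act along different axes in the usual convention: standard scaling works per column, while L2 normalization works per row. The sketch below shows both under the assumption that `'standard'` means z-scoring with the population standard deviation and `'l2'` means dividing each row by its Euclidean norm; the function names are illustrative only.

```python
import math
import statistics

def standard_scale(column):
    """Z-score a column: subtract the mean, divide by the population std."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]

def l2_normalize(row):
    """Rescale a row so that its Euclidean norm is 1."""
    norm = math.sqrt(sum(x * x for x in row))
    if norm == 0:
        return list(row)
    return [x / norm for x in row]

scaled_column = standard_scale([2.0, 4.0, 6.0])
unit_row = l2_normalize([3.0, 4.0])
```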
### Dimensionality Reduction
Reduce the dimensionality of your data:
```python
reduced_data = adp.dimreduction(method='pca', n_components=5)
```
### Feature Selection
Select the most important features:
```python
selected_data = adp.feature_selection(
    target_column='target',
    method='correlation',
    correlation_threshold=0.8
)
```
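This README does not say how `correlation_threshold` is applied. One common reading, assumed here, is that a feature is dropped when its absolute Pearson correlation with an already kept feature exceeds the threshold. The sketch below implements that reading with hypothetical helper names and made-up data; it is not the library's code.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def drop_correlated(features, threshold=0.8):
    """Keep a feature only if it is not highly correlated with a kept one."""
    kept = []
    for name, column in features.items():
        if all(abs(pearson(column, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical features: two near-duplicates and one unrelated column.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],
    "age": [40.0, 22.0, 35.0, 28.0],
}
selected = drop_correlated(features, threshold=0.8)
```

Because the loop keeps the first feature of each correlated group, the result depends on column order.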
### Handling Imbalanced Data
Balance your dataset:
```python
balanced_data = adp.balance_data(
    target_column='target',
    method='smote',
    sampling_strategy='auto'
)
```
### Time Series Preprocessing
Preprocess time series data:
```python
preprocessed_ts_data = adp.time_series_preprocessing(
    time_column='date',
    freq='D',
    method='mean',
    detrend_columns=['value'],
    seasonality_columns=['value'],
    lag_columns=['value'],
    lags=[1, 7, 30]
)
```
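The `lag_columns` and `lags` arguments point to lag features, where each row also carries the value observed a fixed number of steps earlier. A minimal sketch of that idea follows; the `lag` helper and the `value_lag_<k>` naming are assumptions, not the library's output format.

```python
def lag(values, k):
    """Shift values forward by k steps, padding the start with None."""
    if k <= 0:
        raise ValueError("k must be a positive number of steps")
    head = [None] * min(k, len(values))
    return head + values[: max(len(values) - k, 0)]

# Hypothetical daily series, already sorted by time.
series = [5, 6, 7, 8]
lagged = {f"value_lag_{k}": lag(series, k) for k in (1, 2, 7)}
```

Rows at the start of the series have no history for a given lag, so they are padded with `None`; a lag longer than the series yields a column with no observed values.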
### Data Anonymization
Anonymize sensitive data:
```python
anonymized_data = adp.apply_anonymization(
    columns=['sensitive_column'],
    method='hash',
    hash_algorithm='sha256'
)
```
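Hash-based anonymization replaces each value with a fixed-length digest. The sketch below uses Python's `hashlib` to show the idea. The `hash_value` helper and its optional `salt` parameter are assumptions added for illustration; this README does not say whether the library salts its hashes.

```python
import hashlib

def hash_value(value, salt="", algorithm="sha256"):
    """Return the hex digest of salt + value under the chosen algorithm."""
    digest = hashlib.new(algorithm)
    digest.update((salt + str(value)).encode("utf-8"))
    return digest.hexdigest()

# Hypothetical sensitive column.
emails = ["alice@example.com", "bob@example.com", "alice@example.com"]
hashed = [hash_value(e) for e in emails]
salted = [hash_value(e, salt="per-project-secret") for e in emails]
```

Equal inputs map to equal digests, so joins and group-bys still work on the hashed column. Unsalted hashes of low-entropy values such as email addresses can be reversed with a dictionary attack, which is why a secret salt is often added.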
## Package Metadata
- Summary: A high-level library for automatic preprocessing of tabular data
- Version: 0.1.2 (uploaded 2024-08-15)
- Author: Girish Kumar Adari (adari.girishkumar@gmail.com)
- Homepage: https://github.com/agirishkumar/AutoDataPreprocess
- Requires Python: >=3.6
- Distributions: `AutoDataPreprocess-0.1.2-py3-none-any.whl` (wheel, 16263 bytes), `AutoDataPreprocess-0.1.2.tar.gz` (sdist, 15902 bytes)