<!-- <h2 align="center"> <a href="https://upgini.com/">Upgini</a> : low-code feature search and enrichment library for machine learning </h2> -->
<!-- <h2 align="center"> <a href="https://upgini.com/">Upgini</a> : Free automated data enrichment library for machine learning: </br>only the accuracy improving features in 2 minutes </h2> -->
<!-- <h2 align="center"> <a href="https://upgini.com/">Upgini</a> • Free production-ready automated data enrichment library for machine learning</h2>-->
<h2 align="center"> <a href="https://upgini.com/">Upgini • Intelligent data search & enrichment for Machine Learning and AI</a></h2>
<p align="center"> <b>Easily find and add relevant features to your ML & AI pipeline from</br> hundreds of public, community, and premium external data sources, </br>including open & commercial LLMs</b> </p>
<p align="center">
<br />
<a href="https://colab.research.google.com/github/upgini/upgini/blob/main/notebooks/Upgini_Features_search%26generation.ipynb"><strong>Quick Start in Colab »</strong></a> |
<!--<a href="https://upgini.com/">Upgini.com</a> |-->
<a href="https://profile.upgini.com">Register / Sign In</a> |
<!-- <a href="https://gitter.im/upgini/community?utm_source=share-link&utm_medium=link&utm_campaign=share-link">Gitter Community</a> | -->
<a href="https://4mlg.short.gy/join-upgini-community">Slack Community</a> |
<a href="https://forms.gle/pH99gb5hPxBEfNdR7"><strong>Propose a new data source</strong></a>
</p>
<p align=center>
<a href="/LICENSE"><img alt="BSD-3 license" src="https://img.shields.io/badge/license-BSD--3%20Clause-green"></a>
<a href="https://pypi.org/project/upgini/"><img alt="PyPI - Python Version" src="https://img.shields.io/pypi/pyversions/upgini"></a>
<a href="https://pypi.org/project/upgini/"><img alt="PyPI" src="https://img.shields.io/pypi/v/upgini?label=Release"></a>
<a href="https://pepy.tech/project/upgini"><img alt="Downloads" src="https://static.pepy.tech/badge/upgini"></a>
<a href="https://4mlg.short.gy/join-upgini-community"><img alt="Upgini slack community" src="https://img.shields.io/badge/slack-@upgini-orange.svg?logo=slack"></a>
</p>
<!--
[](https://github.com/psf/black)
[](https://gitter.im/upgini/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge) -->
## ❔ Overview
**Upgini** is an intelligent data search engine with a Python library that helps you find and add relevant features to your ML pipeline from hundreds of public, community, and premium external data sources. Under the hood, Upgini automatically optimizes all connected data sources by [generating an optimal set of ML features using large language models (LLMs), GNNs (graph neural networks), and recurrent neural networks (RNNs)](https://upgini.com/#optimized_external_data).
**Motivation:** for most supervised ML models, external data and features boost accuracy significantly more than any hyperparameter tuning. But the lack of automated, time-efficient enrichment tools for external data blocks mass adoption of external features in ML pipelines. We want to radically simplify feature search and enrichment, making external data a standard approach - just like hyperparameter tuning is in machine learning today.
**Mission:** democratize access to data sources for the data science community.
## 🚀 Awesome features
⭐️ Automatically find only the relevant features that *improve your model's accuracy* - not just features correlated with the target variable, which in 9 out of 10 cases yields zero accuracy improvement
⭐️ Automated feature generation from the sources: feature generation with LLM‑based data augmentation, RNNs, and GraphNNs; ensembling across multiple data sources
⭐️ Automatic search key augmentation from all connected sources. If your search request is missing some search keys, such as postal/ZIP code, Upgini will try to derive them from the keys you provided, broadening the search across all available data sources
⭐️ Calculate accuracy metrics and uplift after enriching an existing ML model with external features
⭐️ Check the stability of accuracy gain from external data on out-of-time intervals and verification datasets. Mitigate the risks of unstable external data dependencies in the ML pipeline
⭐️ Easy to use - a single request to enrich the training dataset with [*all of the keys at once*](#-search-key-types-we-support-more-to-come):
<table>
<tr>
<td> date / datetime </td>
<td> phone number </td>
</tr>
<tr>
<td> postal / ZIP code </td>
<td> hashed email / HEM </td>
</tr>
<tr>
<td> country </td>
<td> IP-address </td>
</tr>
</table>
⭐️ Scikit-learn-compatible interface for quick data integration with existing ML pipelines
⭐️ Support for most common supervised ML tasks on tabular data:
<table>
<tr>
<td><a href="https://en.wikipedia.org/wiki/Binary_classification">☑️ binary classification</a></td>
<td><a href="https://en.wikipedia.org/wiki/Multiclass_classification">☑️ multiclass classification</a></td>
</tr>
<tr>
<td><a href="https://en.wikipedia.org/wiki/Regression_analysis">☑️ regression</a></td>
<td><a href="https://en.wikipedia.org/wiki/Time_series#Prediction_and_forecasting">☑️ time-series prediction</a></td>
</tr>
</table>
⭐️ [Simple Drag & Drop Search UI](https://www.upgini.com/data-search-widget):
<a href="https://upgini.com/upgini-widget">
<img width="710" alt="Drag & Drop Search UI" src="https://github.com/upgini/upgini/assets/95645411/36b6460c-51f3-400e-9f04-445b938bf45e">
</a>
## 🌎 Connected data sources and coverage
- **Public data**: public sector, academic institutions, other sources through open data portals. Curated and updated by the Upgini team
- **Community‑shared data**: royalty- or license-free datasets or features from the data science community (our users). This includes both public and scraped data
- **Premium data providers**: commercial data sources verified by the Upgini team in real-world use cases
👉 [**Details on datasets and features**](https://upgini.com/#data_sources)
#### 📊 Total: **239 countries** and **up to 41 years** of history
|Data sources|Countries|History (years)|# sources for ensembling|Update frequency|Search keys|API Key required
|--|--|--|--|--|--|--|
|Historical weather & Climate normals | 68 |22|-|Monthly|date, country, postal/ZIP code|No
|Location/Places/POI/Area/Proximity information from OpenStreetMap | 221 |2|-|Monthly|date, country, postal/ZIP code|No
|International holidays & events, Workweek calendar| 232 |22|-|Monthly|date, country|No
|Consumer Confidence index| 44 |22|-|Monthly|date, country|No
|World economic indicators|191 |41|-|Monthly|date, country|No
|Markets data|-|17|-|Monthly|date, datetime|No
|World mobile & fixed-broadband network coverage and performance |167|-|3|Monthly|country, postal/ZIP code|No
|World demographic data |90|-|2|Annual|country, postal/ZIP code|No
|World house prices |44|-|3|Annual|country, postal/ZIP code|No
|Public social media profile data |104|-|-|Monthly|date, email/HEM, phone |Yes
|Car ownership data and Parking statistics|3|-|-|Annual|country, postal/ZIP code, email/HEM, phone|Yes
|Geolocation profile for phone & IPv4 & email|239|-|6|Monthly|date, email/HEM, phone, IPv4|Yes
|🔜 Email/WWW domain profile|-|-|-|-|-|-
❓**Know other useful data sources for machine learning?** [Give us a hint and we'll add it for free](https://forms.gle/pH99gb5hPxBEfNdR7).
## 💼 Tutorials
### [Search of relevant external features & Automated feature generation for Salary prediction task (use as a template)](https://github.com/upgini/upgini/blob/main/notebooks/Upgini_Features_search%26generation.ipynb)
* The goal is to predict salary for a data science job posting based on information about the employer and job description.
* Following this guide, you'll learn how to **search and auto‑generate new relevant features with the Upgini library**
* The evaluation metric is [Mean Absolute Error (MAE)](https://en.wikipedia.org/wiki/Mean_absolute_error).
Run [Feature search & generation notebook](https://github.com/upgini/upgini/blob/main/notebooks/Upgini_Features_search%26generation.ipynb) inside your browser:
[](https://colab.research.google.com/github/upgini/upgini/blob/main/notebooks/Upgini_Features_search%26generation.ipynb)
<!--
[](https://mybinder.org/v2/gh/upgini/upgini/main?labpath=notebooks%2FUpgini_Features_search%26generation.ipynb)
[](https://gitpod.io/#/github.com/upgini/upgini)
-->
### ❓ [Simple sales prediction for retail stores](https://github.com/upgini/upgini/blob/main/notebooks/kaggle_example.ipynb)
* The goal is to **predict future sales of different goods in stores** based on a 5-year history of sales.
* Kaggle Competition [Store Item Demand Forecasting Challenge](https://www.kaggle.com/c/demand-forecasting-kernels-only) is a product sales forecasting competition. The evaluation metric is [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error).
Run [Simple sales prediction for retail stores](https://github.com/upgini/upgini/blob/main/notebooks/kaggle_example.ipynb) inside your browser:
[](https://colab.research.google.com/github/upgini/upgini/blob/main/notebooks/kaggle_example.ipynb)
<!--
[](https://mybinder.org/v2/gh/upgini/upgini/main?urlpath=notebooks%2Fnotebooks%2Fkaggle_example.ipynb)
[](https://gitpod.io/#/github.com/upgini/upgini)
-->
### ❓ [How to boost ML model accuracy for Kaggle Top-1 leaderboard in 15 minutes](https://www.kaggle.com/code/nikupgini/how-to-find-external-data-for-1-private-lb-4-53/notebook)
* The goal is **to improve a Top‑1 winning Kaggle solution** by adding new relevant external features and data.
* [Kaggle Competition](https://www.kaggle.com/competitions/tabular-playground-series-jan-2022/) is a product sales forecasting competition; the evaluation metric is [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error).
### ❓ [How to do low-code feature engineering for AutoML tools](https://www.kaggle.com/code/romaupgini/zero-feature-engineering-with-upgini-pycaret/notebook)
* **Save time on feature search and engineering**. Use ready-to-use external features and data sources to maximize overall AutoML accuracy, right out of the box.
* [Kaggle Competition](https://www.kaggle.com/competitions/tabular-playground-series-jan-2022/) is a product sales forecasting competition; the evaluation metric is [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error).
* Low-code AutoML frameworks: [Upgini](https://github.com/upgini/upgini) and [PyCaret](https://github.com/pycaret/pycaret)
### ❓ [How to improve accuracy of Multivariate time-series forecast from external features & data](https://www.kaggle.com/code/romaupgini/guide-external-data-features-for-multivariatets/notebook)
* The goal is **to improve the accuracy of multivariate time‑series forecasting** using new relevant external features and data. The main challenge is the data and feature enrichment strategy, in which a component of a multivariate time series depends not only on its past values but also on other components.
* [Kaggle Competition](https://www.kaggle.com/competitions/tabular-playground-series-jan-2022/) is a product sales forecasting competition; the evaluation metric is [RMSLE](https://www.kaggle.com/code/carlmcbrideellis/store-sales-using-the-average-of-the-last-16-days#Note-regarding-calculating-the-average).
### ❓ [How to speed up feature engineering hypothesis tests with ready-to-use external features](https://www.kaggle.com/code/romaupgini/statement-dates-to-use-or-not-to-use/notebook)
* **Save time on external data wrangling and feature calculation code** for hypothesis tests. The key challenge is the time‑dependent representation of information in the training dataset, which is uncommon for credit default prediction tasks. As a result, special data enrichment strategy is used.
* [Kaggle Competition](https://www.kaggle.com/competitions/amex-default-prediction) is a credit default prediction competition; the evaluation metric is the [normalized Gini coefficient](https://www.kaggle.com/competitions/amex-default-prediction/discussion/327464).
## 🏁 Quick start
### 1. Install from PyPI
```python
%pip install upgini
```
<details>
<summary>
🐳 <b>Docker-way</b>
</summary>
</br>
Clone the repo with <i>git clone https://github.com/upgini/upgini</i> or download it locally,</br>
then follow the steps below to build the Docker container 👇</br>
</br>
1. Build the Docker image from the cloned repo:</br>
<i>cd upgini</br>
docker build -t upgini .</i></br>
</br>
...or directly from GitHub:</br>
<i>DOCKER_BUILDKIT=0 docker build -t upgini git@github.com:upgini/upgini.git#main</i></br>
</br>
2. Run the Docker image:</br>
<i>docker run -p 8888:8888 upgini</i></br>
</br>
3. Open <i>http://localhost:8888?token=&lt;your_token_from_console_output&gt;</i> in your browser
</details>
### 2. 💡 Use your labeled training dataset for search
You can use your labeled training datasets "as is" to initiate the search. Under the hood, we'll search for relevant data using:
- **[search keys](#-search-key-types-we-support-more-to-come)** from the training dataset to match records from potential data sources with new features
- **labels** from the training dataset to estimate the relevance of features or datasets for your ML task and calculate feature importance metrics
- **your features** from the training dataset to find external datasets and features that improve accuracy of your existing data and estimate accuracy uplift ([optional](#find-features-only-give-accuracy-gain-to-existing-data-in-the-ml-model))
Load the training dataset into a Pandas DataFrame and separate feature columns from the label column in a Scikit-learn way:
```python
import pandas as pd
# labeled training dataset - customer_churn_prediction_train.csv
train_df = pd.read_csv("customer_churn_prediction_train.csv")
X = train_df.drop(columns="churn_flag")
y = train_df["churn_flag"]
```
<table border=1 cellpadding=10><tr><td>
⚠️ <b>Requirements for search initialization dataset</b>
<br>
We perform dataset verification and cleaning under the hood, but there are still some requirements to follow:
<br>
1. <b>pandas.DataFrame</b>, <b>pandas.Series</b> or <b>numpy.ndarray</b> representation;
<br>
2. correct label column types: boolean/integers/strings for binary and multiclass labels, floats for regression;
<br>
3. at least one column selected as a <a href="#-search-key-types-we-support-more-to-come">search key</a>;
<br>
4. min size after deduplication by search-key columns and removal of NaNs: <i>100 records</i>
</td></tr></table>
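As a quick self-check before initiating a search, the deduplication and minimum-size requirements above can be approximated in pandas. This is only an illustrative sketch with hypothetical column names; the Upgini library runs its own verification under the hood:

```python
import pandas as pd

# toy labeled dataset; "email" stands in for a search-key column
df = pd.DataFrame({
    "email": ["a@x.com", "a@x.com", "b@y.com", None],
    "churn_flag": [1, 1, 0, 1],
})
search_key_cols = ["email"]

# drop rows with NaN search keys, then deduplicate by the search-key columns
clean = df.dropna(subset=search_key_cols).drop_duplicates(subset=search_key_cols)

# requirement 4: at least 100 records must remain after cleaning
meets_min_size = len(clean) >= 100
```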
### 3. 🔦 Choose one or more columns as search keys
*Search key* columns are used to match your records against all potential external data sources and features.
Define one or more columns as search keys when initializing the `FeaturesEnricher` class.
```python
from upgini.features_enricher import FeaturesEnricher
from upgini.metadata import SearchKey
enricher = FeaturesEnricher(
search_keys={
"subscription_activation_date": SearchKey.DATE,
"country": SearchKey.COUNTRY,
"zip_code": SearchKey.POSTAL_CODE,
"hashed_email": SearchKey.HEM,
"last_visit_ip_address": SearchKey.IP,
"registered_with_phone": SearchKey.PHONE
})
```
#### ✨ Search key types we support (more to come!)
<table style="table-layout: fixed; text-align: left">
<tr>
<th> Search Key<br/>Meaning Type </th>
<th> Description </th>
<th> Allowed pandas dtypes (Python types) </th>
<th> Example </th>
</tr>
<tr>
<td> SearchKey.EMAIL </td>
<td> e-mail </td>
<td> <tt>object(str)</tt> <br/> <tt>string</tt> </td>
<td> <tt>support@upgini.com </tt> </td>
</tr>
<tr>
<td> SearchKey.HEM </td>
<td> <tt>sha256(lowercase(email)) </tt> </td>
<td> <tt>object(str)</tt> <br/> <tt>string</tt> </td>
<td> <tt>0e2dfefcddc929933dcec9a5c7db7b172482814e63c80b8460b36a791384e955</tt> </td>
</tr>
<tr>
<td> SearchKey.IP </td>
<td> IPv4 or IPv6 address</td>
<td> <tt>object(str, ipaddress.IPv4Address, ipaddress.IPv6Address)</tt> <br/> <tt>string</tt> <br/> <tt>int64</tt> </td>
<td> <tt>192.168.0.1 </tt> </td>
</tr>
<tr>
<td> SearchKey.PHONE </td>
<td> phone number (<a href="https://en.wikipedia.org/wiki/E.164">E.164 standard</a>) </td>
<td> <tt>object(str)</tt> <br/> <tt>string</tt> <br/> <tt>int64</tt> <br/> <tt>float64</tt> </td>
<td> <tt>443451925138 </tt> </td>
</tr>
<tr>
<td> SearchKey.DATE </td>
<td> date </td>
<td>
<tt>object(str)</tt> <br/>
<tt>string</tt> <br/>
<tt>datetime64[ns]</tt> <br/>
<tt>period[D]</tt> <br/>
</td>
<td>
<tt>2020-02-12 </tt> (<a href="https://en.wikipedia.org/wiki/ISO_8601">ISO-8601 standard</a>)
<br/> <tt>12.02.2020 </tt> (non‑standard notation)
</td>
</tr>
<tr>
<td> SearchKey.DATETIME </td>
<td> datetime </td>
<td>
<tt>object(str)</tt> <br/>
<tt>string</tt> <br/>
<tt>datetime64[ns]</tt> <br/>
<tt>period[D]</tt> <br/>
</td>
<td> <tt>2020-02-12 12:46:18 </tt> <br/> <tt>12:46:18 12.02.2020 </tt> </td>
</tr>
<tr>
<td> SearchKey.COUNTRY </td>
<td> <a href="https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2">Country ISO-3166 code</a>, Country name </td>
<td> <tt>object(str)</tt> <br/> <tt>string</tt> </td>
<td> <tt>GB </tt> <br/> <tt>US </tt> <br/> <tt>IN </tt> </td>
</tr>
<tr>
<td> SearchKey.POSTAL_CODE </td>
<td> Postal code a.k.a. ZIP code. Can only be used with SearchKey.COUNTRY </td>
<td> <tt>object(str)</tt> <br/> <tt>string</tt> </td>
<td> <tt>21174 </tt> <br/> <tt>061107 </tt> <br/> <tt>SE-999-99 </tt> </td>
</tr>
</table>
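For `SearchKey.HEM`, the hashed email can be computed from a raw address with the standard library, following the `sha256(lowercase(email))` convention from the table above (the `hashed_email` helper name is illustrative):

```python
import hashlib

def hashed_email(email: str) -> str:
    """sha256 of the stripped, lowercased email - the HEM search key format."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()

hem = hashed_email("Support@Upgini.com")  # 64-character hex digest
```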
For the search key types <tt>SearchKey.DATE</tt>/<tt>SearchKey.DATETIME</tt> with dtypes <tt>object</tt> or <tt>string</tt>, you have to specify the date/datetime format by passing the <tt>date_format</tt> parameter to `FeaturesEnricher`. For example:
```python
from upgini.features_enricher import FeaturesEnricher
from upgini.metadata import SearchKey
enricher = FeaturesEnricher(
search_keys={
"subscription_activation_date": SearchKey.DATE,
"country": SearchKey.COUNTRY,
"zip_code": SearchKey.POSTAL_CODE,
"hashed_email": SearchKey.HEM,
"last_visit_ip_address": SearchKey.IP,
"registered_with_phone": SearchKey.PHONE
},
date_format = "%Y-%d-%m"
)
```
To use a non-UTC timezone for datetime, you can cast datetime column explicitly to your timezone (example for Warsaw):
```python
df["date"] = pd.to_datetime(df["date"]).dt.tz_localize("Europe/Warsaw")
```
A single country for the whole training dataset can be passed via `country_code` parameter:
```python
from upgini.features_enricher import FeaturesEnricher
from upgini.metadata import SearchKey
enricher = FeaturesEnricher(
search_keys={
"subscription_activation_date": SearchKey.DATE,
"zip_code": SearchKey.POSTAL_CODE,
},
country_code = "US",
date_format = "%Y-%d-%m"
)
```
### 4. 🔍 Start your first feature search!
The main abstraction you interact with is `FeaturesEnricher`, a Scikit-learn-compatible estimator. You can easily add it to your existing ML pipelines.
Create an instance of the `FeaturesEnricher` class and call:
- `fit` to search relevant datasets & features
- then `transform` to enrich your dataset with features from the search result
Let's try it out!
```python
import pandas as pd
from upgini.features_enricher import FeaturesEnricher
from upgini.metadata import SearchKey
# load labeled training dataset to initiate search
train_df = pd.read_csv("customer_churn_prediction_train.csv")
X = train_df.drop(columns="churn_flag")
y = train_df["churn_flag"]
# now we're going to create an instance of the `FeaturesEnricher` class
enricher = FeaturesEnricher(
search_keys={
"subscription_activation_date": SearchKey.DATE,
"country": SearchKey.COUNTRY,
"zip_code": SearchKey.POSTAL_CODE
})
# Everything is ready to fit! For 100k records, fitting should take around 10 minutes
# We'll send an email notification; just register on profile.upgini.com
enricher.fit(X, y)
```
That's it! The `FeaturesEnricher` is now fitted.
### 5. 📈 Evaluate feature importances (SHAP values) from the search result
The `FeaturesEnricher` class has two properties for feature importances, populated after `fit` - `feature_names_` and `feature_importances_`:
- `feature_names_` - feature names from the search result and, if the `keep_input=True` parameter was used, the initial columns from the search dataset as well
- `feature_importances_` - SHAP values for the features from the search result, in the same order as `feature_names_`
The `get_features_info()` method returns a pandas DataFrame with the features and full statistics after fit, including SHAP values and match rates:
```python
enricher.get_features_info()
```
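To inspect importances without the full report, the two properties can also be paired directly. The feature names and SHAP values below are made up for illustration; in a real pipeline they come from a fitted enricher:

```python
import pandas as pd

# In a real pipeline:
# names, shap_values = enricher.feature_names_, enricher.feature_importances_
names = ["f_weather_temp", "f_poi_count", "subscription_age"]  # hypothetical
shap_values = [0.41, 0.22, 0.05]

# rank features by their SHAP importance
importance_df = (
    pd.DataFrame({"feature": names, "shap_value": shap_values})
    .sort_values("shap_value", ascending=False)
    .reset_index(drop=True)
)
```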
Get more details about `FeaturesEnricher` at runtime using docstrings via `help(FeaturesEnricher)` or `help(FeaturesEnricher.fit)`.
### 6. 🏭 Enrich Production ML pipeline with relevant external features
`FeaturesEnricher` is a Scikit-learn-compatible estimator, so any pandas dataframe can be enriched with external features from a search result (after `fit`).
Use the `transform` method of `FeaturesEnricher`, and let the magic do the rest 🪄
```python
# load dataset for enrichment
test_x = pd.read_csv("test.csv")
# enrich it!
enriched_test_features = enricher.transform(test_x)
```
#### 6.1 Reuse completed search for enrichment without 'fit' run
`FeaturesEnricher` can be initialized with a `search_id` from a completed search (after a `fit` call).
Use `enricher.get_search_id()` or copy the search id string from the `fit()` output.
The search keys and features in X must be the same as for `fit()`:
```python
enricher = FeaturesEnricher(
# same set of search keys as for the fit step
search_keys={"date": SearchKey.DATE},
api_key="<YOUR API_KEY>", # if you fitted the enricher with an api_key, then you should use it here
search_id = "abcdef00-0000-0000-0000-999999999999"
)
enriched_prod_dataframe = enricher.transform(input_dataframe)
```
#### 6.2 Enrichment with updated external data sources and features
In most ML cases, the training step requires a labeled dataset with historical observations. For production, you'll need updated, current data sources and features to generate predictions.
`FeaturesEnricher`, when initialized with a set of search keys that includes `SearchKey.DATE`, will match records from all potential external data sources **exactly on the specified date/datetime** based on `SearchKey.DATE`, to avoid enrichment with features "from the future" during the `fit` step.
And then, for `transform` in a production ML pipeline, you'll get enrichment with relevant features, current as of the present date.
⚠️ Include `SearchKey.DATE` in the set of search keys to get current features for production and avoid features from the future during training:
```python
enricher = FeaturesEnricher(
search_keys={
"subscription_activation_date": SearchKey.DATE,
"country": SearchKey.COUNTRY,
"zip_code": SearchKey.POSTAL_CODE,
},
)
```
## 💻 How does it work?
### 🧹 Search dataset validation
We validate and clean the search‑initialization dataset under the hood:
- check the format of your **search key** columns;
- check the label column for zero variance;
- check the dataset for full row duplicates; if we find any, we remove them and report their share;
- check for inconsistent labels - rows with the same features and keys but different labels; we remove them and report their share;
- remove columns with zero variance - we treat any non-**search key** column in the search dataset as a feature, so zero-variance columns are removed
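The duplicate, inconsistent-label, and zero-variance checks can be sketched in pandas like this. This is a simplified approximation for intuition, not the library's actual implementation:

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03"],
    "feature": [1, 1, 2, 2, 3],
    "label": [0, 0, 1, 0, 1],
})

# 1) drop full row duplicates
df = df[~df.duplicated()]

# 2) drop rows whose keys + features map to more than one label
key_cols = ["date", "feature"]
label_counts = df.groupby(key_cols)["label"].nunique()
inconsistent = label_counts[label_counts > 1].index
df = df[~df.set_index(key_cols).index.isin(inconsistent)]

# 3) drop zero-variance columns
df = df.loc[:, df.nunique(dropna=False) > 1]
```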
### ❔ Supervised ML tasks detection
We detect the ML task under the hood based on the label column values. Currently we support:
- ModelTaskType.BINARY
- ModelTaskType.MULTICLASS
- ModelTaskType.REGRESSION
For certain search datasets you can also override detection by passing the correct ML task type to `FeaturesEnricher`:
```python
from upgini.features_enricher import FeaturesEnricher
from upgini.metadata import SearchKey, ModelTaskType
enricher = FeaturesEnricher(
search_keys={"subscription_activation_date": SearchKey.DATE},
model_task_type=ModelTaskType.REGRESSION
)
```
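A rough approximation of such a label-based detection heuristic (not Upgini's actual logic, just an illustration of the idea) might look like:

```python
import numpy as np
import pandas as pd

def guess_task_type(y: pd.Series) -> str:
    """Rough heuristic: float labels with many distinct values -> regression."""
    if pd.api.types.is_float_dtype(y) and y.nunique() > 20:
        return "REGRESSION"
    return "BINARY" if y.nunique() == 2 else "MULTICLASS"

guess_task_type(pd.Series([0, 1, 0, 1]))               # "BINARY"
guess_task_type(pd.Series(np.linspace(0.0, 1.0, 50)))  # "REGRESSION"
```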
#### ⏰ Time-series prediction support
*Time-series prediction* is supported as `ModelTaskType.REGRESSION` or `ModelTaskType.BINARY` tasks with time-series‑specific cross-validation splits:
* [Scikit-learn time-series cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html#time-series-split) - `CVType.time_series` parameter
* [Blocked time-series cross-validation](https://goldinlocks.github.io/Time-Series-Cross-Validation/#Blocked-and-Time-Series-Split-Cross-Validation) - `CVType.blocked_time_series` parameter
To initiate feature search, you can pass the cross-validation type parameter to `FeaturesEnricher` with a time-series‑specific CV type:
```python
from upgini.features_enricher import FeaturesEnricher
from upgini.metadata import SearchKey, CVType
enricher = FeaturesEnricher(
search_keys={"sales_date": SearchKey.DATE},
cv=CVType.time_series
)
```
If you're working with multivariate time series, specify the id columns of the individual univariate series in `FeaturesEnricher`. For example, if your dataset predicts sales for different stores and products, specify the store and product id columns as follows:
```python
enricher = FeaturesEnricher(
search_keys={
"sales_date": SearchKey.DATE,
},
id_columns=["store_id", "product_id"],
cv=CVType.time_series
)
```
⚠️ **Preprocess the dataset** for time-series prediction: sort the rows by observation order; in most cases, ascending by date/datetime.
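For example, with a daily sales dataset (column names here follow the earlier `sales_date` examples):

```python
import pandas as pd

df = pd.DataFrame({
    "sales_date": ["2024-01-03", "2024-01-01", "2024-01-02"],
    "sales": [30, 10, 20],
})
# sort ascending by date so observations follow chronological order
df["sales_date"] = pd.to_datetime(df["sales_date"])
df = df.sort_values("sales_date").reset_index(drop=True)
```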
### 🆙 Accuracy and uplift metrics calculations
`FeaturesEnricher` automatically calculates model metrics and uplift from new relevant features, using either the `calculate_metrics()` method or the `calculate_metrics=True` parameter in the `fit` or `fit_transform` methods (example below).
You can use any model estimator with a scikit-learn-compatible interface, for example:
* [All Scikit-Learn supervised models](https://scikit-learn.org/stable/supervised_learning.html)
* [Xgboost](https://xgboost.readthedocs.io/en/stable/python/python_api.html#module-xgboost.sklearn)
* [LightGBM](https://lightgbm.readthedocs.io/en/latest/Python-API.html#scikit-learn-api)
* [CatBoost](https://catboost.ai/en/docs/concepts/python-quickstart)
<details>
<summary>
👈 Evaluation metric should be passed to <i>calculate_metrics()</i> by the <i>scoring</i> parameter,<br/>
out-of-the-box Upgini supports
</summary>
<table style="table-layout: fixed;">
<tr>
<th>Metric</th>
<th>Description</th>
</tr>
<tr>
<td><tt>explained_variance</tt></td>
<td>Explained variance regression score function</td>
</tr>
<tr>
<td><tt>r2</tt></td>
<td>R<sup>2</sup> (coefficient of determination) regression score function</td>
</tr>
<tr>
<td><tt>max_error</tt></td>
<td>Calculates the maximum residual error (negative - greater is better)</td>
</tr>
<tr>
<td><tt>median_absolute_error</tt></td>
<td>Median absolute error regression loss</td>
</tr>
<tr>
<td><tt>mean_absolute_error</tt></td>
<td>Mean absolute error regression loss</td>
</tr>
<tr>
<td><tt>mean_absolute_percentage_error</tt></td>
<td>Mean absolute percentage error regression loss</td>
</tr>
<tr>
<td><tt>mean_squared_error</tt></td>
<td>Mean squared error regression loss</td>
</tr>
<tr>
<td><tt>mean_squared_log_error</tt> (or aliases: <tt>msle</tt>, <tt>MSLE</tt>)</td>
<td>Mean squared logarithmic error regression loss</td>
</tr>
<tr>
<td><tt>root_mean_squared_log_error</tt> (or aliases: <tt>rmsle</tt>, <tt>RMSLE</tt>)</td>
<td>Root mean squared logarithmic error regression loss</td>
</tr>
<tr>
<td><tt>root_mean_squared_error</tt></td>
<td>Root mean squared error regression loss</td>
</tr>
<tr>
<td><tt>mean_poisson_deviance</tt></td>
<td>Mean Poisson deviance regression loss</td>
</tr>
<tr>
<td><tt>mean_gamma_deviance</tt></td>
<td>Mean Gamma deviance regression loss</td>
</tr>
<tr>
<td><tt>accuracy</tt></td>
<td>Accuracy classification score</td>
</tr>
<tr>
<td><tt>top_k_accuracy</tt></td>
<td>Top-k Accuracy classification score</td>
</tr>
<tr>
<td><tt>roc_auc</tt></td>
<td>Area Under the Receiver Operating Characteristic Curve (ROC AUC)
from prediction scores</td>
</tr>
<tr>
<td><tt>roc_auc_ovr</tt></td>
<td>Area Under the Receiver Operating Characteristic Curve (ROC AUC)
from prediction scores (multi_class="ovr")</td>
</tr>
<tr>
<td><tt>roc_auc_ovo</tt></td>
<td>Area Under the Receiver Operating Characteristic Curve (ROC AUC)
from prediction scores (multi_class="ovo")</td>
</tr>
<tr>
<td><tt>roc_auc_ovr_weighted</tt></td>
<td>Area Under the Receiver Operating Characteristic Curve (ROC AUC)
from prediction scores (multi_class="ovr", average="weighted")</td>
</tr>
<tr>
<td><tt>roc_auc_ovo_weighted</tt></td>
<td>Area Under the Receiver Operating Characteristic Curve (ROC AUC)
from prediction scores (multi_class="ovo", average="weighted")</td>
</tr>
<tr>
<td><tt>balanced_accuracy</tt></td>
<td>Compute the balanced accuracy</td>
</tr>
<tr>
<td><tt>average_precision</tt></td>
<td>Compute average precision (AP) from prediction scores</td>
</tr>
<tr>
<td><tt>log_loss</tt></td>
<td>Log loss, aka logistic loss or cross-entropy loss</td>
</tr>
<tr>
<td><tt>brier_score</tt></td>
<td>Compute the Brier score loss</td>
</tr>
</table>
</details>
In addition to that list, you can define a custom evaluation metric function using [scikit-learn make_scorer](https://scikit-learn.org/1.7/modules/model_evaluation.html#defining-your-scoring-strategy-from-score-functions), for example [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error).
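For instance, a SMAPE scorer could be built along these lines (a sketch; the resulting scorer would be passed to `calculate_metrics(scoring=...)`):

```python
import numpy as np
from sklearn.metrics import make_scorer

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # define the 0/0 case as zero error
    ratio = np.divide(np.abs(y_true - y_pred), denom,
                      out=np.zeros_like(denom), where=denom != 0)
    return 100.0 * ratio.mean()

# lower SMAPE is better, so flip the sign for scikit-learn scorers
smape_scorer = make_scorer(smape, greater_is_better=False)
# then: enricher.calculate_metrics(scoring=smape_scorer)
```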
By default, the `calculate_metrics()` method calculates the evaluation metric with the same cross-validation split as selected for `FeaturesEnricher.fit()` by the parameter `cv = CVType.<cross-validation-split>`.
But you can easily define a new split by passing a subclass of `BaseCrossValidator` to the `cv` parameter in `calculate_metrics()`.
Example with more tips-and-tricks:
```python
from upgini.features_enricher import FeaturesEnricher
from upgini.metadata import SearchKey
from lightgbm import LGBMRegressor
from sklearn.model_selection import TimeSeriesSplit
enricher = FeaturesEnricher(search_keys={"registration_date": SearchKey.DATE})
# Fit with default setup for metrics calculation
# CatBoost will be used
enricher.fit(X, y, eval_set=eval_set, calculate_metrics=True)
# LightGBM estimator for metrics
custom_estimator = LGBMRegressor()
enricher.calculate_metrics(estimator=custom_estimator)
# Custom metric function to scoring param (callable or name)
custom_scoring = "RMSLE"
enricher.calculate_metrics(scoring=custom_scoring)
# Custom cross validator
custom_cv = TimeSeriesSplit(n_splits=5)
enricher.calculate_metrics(cv=custom_cv)
# All of these custom parameters can be combined in fit, fit_transform and calculate_metrics:
enricher.fit(X, y, eval_set, calculate_metrics=True, estimator=custom_estimator, scoring=custom_scoring, cv=custom_cv)
```
## ✅ More tips-and-tricks
### 🤖 Automated feature generation from columns in a search dataset
If a training dataset has a text column, you can generate additional embeddings from it using instruction‑guided embedding generation with LLMs and data augmentation from external sources, just like Upgini does for all records from connected data sources.
In most cases, this gives better results than direct embedding generation from a text field. Currently, Upgini has two LLMs connected to the search engine - GPT-3.5 from OpenAI and GPT-J.
To use this feature, pass the column names as arguments to the `text_features` parameter. You can use up to 2 columns.
Here's an example for generating features from the "description" and "summary" columns:
```python
enricher = FeaturesEnricher(
search_keys={"date": SearchKey.DATE},
text_features=["description", "summary"]
)
```
With this code, Upgini will generate LLM embeddings from text columns and then check them for predictive power for your ML task.
Finally, Upgini will return a dataset enriched with only the relevant components of LLM embeddings.
### Find features that only provide accuracy gains to existing data in the ML model
If you already have features or other external data sources, you can specifically search for new datasets and features that only provide accuracy gains "on top" of them.
Just leave all these existing features in the labeled training dataset; the Upgini library will automatically use them during the feature search and as a baseline ML model to calculate the accuracy uplift. Only features that improve accuracy will be returned.
### Check robustness of accuracy improvement from external features
You can validate the robustness of external features on an out-of-time dataset using the `eval_set` parameter:
```python
# load train dataset
train_df = pd.read_csv("train.csv")
train_ids_and_features = train_df.drop(columns="label")
train_label = train_df["label"]
# load out-of-time validation dataset
eval_df = pd.read_csv("validation.csv")
eval_ids_and_features = eval_df.drop(columns="label")
eval_label = eval_df["label"]
# create FeaturesEnricher
enricher = FeaturesEnricher(search_keys={"registration_date": SearchKey.DATE})
# now we fit WITH the eval_set parameter to calculate accuracy metrics on the out-of-time dataset.
# the output will contain quality metrics for both the training dataset and
# the eval set (out-of-time validation dataset)
enricher.fit(
    train_ids_and_features,
    train_label,
    eval_set=[(eval_ids_and_features, eval_label)]
)
```
#### ⚠️ Requirements for out-of-time dataset
- Same data schema as for search initialization X dataset
- Pandas dataframe representation
The out-of-time dataset can be unlabeled. There are three ways to pass an out-of-time dataset without labels:
```python
enricher.fit(
    train_ids_and_features,
    train_label,
    eval_set=[
        (eval_ids_and_features_1,),  # a tuple with a single element
        (eval_ids_and_features_2, None),  # None as labels
        (eval_ids_and_features_3, [np.nan] * len(eval_ids_and_features_3)),  # list or Series of NaNs, same size as eval X
    ]
)
```
### Control feature stability with PSI parameters
`FeaturesEnricher` supports Population Stability Index (PSI) calculation on the eval_set to evaluate feature stability over time. You can control this behavior with stability parameters in the `fit` and `fit_transform` methods:
```python
enricher = FeaturesEnricher(
    search_keys={"registration_date": SearchKey.DATE}
)
# Control feature stability during fit
enricher.fit(
    X, y,
    stability_threshold=0.2,  # PSI threshold: features with PSI above this value will be dropped
    stability_agg_func="max"  # Aggregation function for stability values: "max", "min", "mean"
)
# Same parameters work for fit_transform
enriched_df = enricher.fit_transform(
    X, y,
    stability_threshold=0.1,  # Stricter threshold for more stable features
    stability_agg_func="mean"  # Use mean aggregation instead of max
)
```
**Stability parameters:**
- `stability_threshold` (float, default=0.2): PSI threshold value. Features with PSI above this threshold will be excluded from the final feature set. Lower values mean stricter stability requirements.
- `stability_agg_func` (str, default="max"): Function to aggregate PSI values across time intervals. Options: "max" (most conservative), "min" (least conservative), "mean" (balanced approach).
**PSI (Population Stability Index)** measures how much feature distribution changes over time. Lower PSI values indicate more stable features, which are generally more reliable for production ML models. PSI is calculated on the eval_set, which should contain the most recent dates relative to the training dataset.
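The PSI formula itself is straightforward: bin the feature on the training-period distribution, then compare bin shares between the two periods. A minimal numpy sketch of the standard PSI calculation (illustrative; Upgini's internal implementation may differ in binning details):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference (training-period)
    and a comparison (eval-period) sample of a single feature."""
    # bin edges taken from the reference distribution (decile bins by default)
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))

    def shares(x):
        idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, bins - 1)
        return np.bincount(idx, minlength=bins) / len(x)

    # floor the shares to avoid log(0) on empty bins
    e = np.clip(shares(expected), 1e-6, None)
    a = np.clip(shares(actual), 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(42)
train_period = rng.normal(0.0, 1.0, 10_000)
stable = psi(train_period, rng.normal(0.0, 1.0, 10_000))   # same distribution: PSI near 0
shifted = psi(train_period, rng.normal(0.5, 1.0, 10_000))  # mean shift: PSI well above 0
```

A commonly cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, which puts the default `stability_threshold=0.2` at the strict end of the moderate-shift range.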
### Use custom loss function in feature selection & metrics calculation
`FeaturesEnricher` can be initialized with the additional string parameter `loss`.
Depending on the ML task, you can use the following loss functions:
- `regression`: regression, regression_l1, huber, poisson, quantile, mape, gamma, tweedie;
- `binary`: binary;
- `multiclass`: multiclass, multiclassova.
For instance, if your target variable has a Poisson distribution (a count of events, the number of customers in a shop, and so on), try `loss="poisson"` to improve the quality of feature selection and get better evaluation metrics.
Usage example:
```python
enricher = FeaturesEnricher(
    search_keys={"date": SearchKey.DATE},
    loss="poisson",
    model_task_type=ModelTaskType.REGRESSION
)
enricher.fit(X, y)
```
### Exclude premium data sources from fit, transform and metrics calculation
`fit`, `fit_transform`, `transform` and `calculate_metrics` methods of `FeaturesEnricher` can be used with the `exclude_features_sources` parameter to exclude Trial or Paid features from Premium data sources:
```python
enricher = FeaturesEnricher(
    search_keys={"subscription_activation_date": SearchKey.DATE}
)
enricher.fit(X, y, calculate_metrics=False)
features_info = enricher.get_features_info()
trial_features = features_info.loc[features_info["Feature type"] == "Trial", "Feature name"].tolist()
paid_features = features_info.loc[features_info["Feature type"] == "Paid", "Feature name"].tolist()
enricher.calculate_metrics(exclude_features_sources=(trial_features + paid_features))
enricher.transform(X, exclude_features_sources=(trial_features + paid_features))
```
### Turn off autodetection for search key columns
Upgini autodetects search keys by default.
To turn this off, use `autodetect_search_keys=False`:
```python
enricher = FeaturesEnricher(
    search_keys={"date": SearchKey.DATE},
    autodetect_search_keys=False,
)
enricher.fit(X, y)
```
### Turn off removal of target outliers
Upgini detects rows with target outliers for regression tasks. By default such rows are dropped during metrics calculation. To turn off the removal of target‑outlier rows, use the `remove_outliers_calc_metrics=False` parameter in the fit, fit_transform, or calculate_metrics methods:
```python
enricher = FeaturesEnricher(
    search_keys={"date": SearchKey.DATE},
)
enricher.fit(X, y, remove_outliers_calc_metrics=False)
```
### Turn off feature generation on search keys
Upgini generates additional features from email, date, and datetime search keys; this generation is enabled by default. To disable it, use the `generate_search_key_features` parameter of the `FeaturesEnricher` constructor:
```python
enricher = FeaturesEnricher(
    search_keys={"date": SearchKey.DATE},
    generate_search_key_features=False,
)
```
## 🔑 Open up all capabilities of Upgini
[Register](https://profile.upgini.com) and get a free API key for exclusive data sources and features: 600M+ phone numbers, 350M+ emails, 2^32 IP addresses
|Benefit|No Sign-up | Registered user |
|--|--|--|
|Enrichment with **date/datetime, postal/ZIP code and country keys** | Yes | Yes |
|Enrichment with **phone number, hashed email/HEM and IP address keys** | No | Yes |
|Email notification on **search task completion** | No | Yes |
|Automated **feature generation with LLMs** from columns in a search dataset| Yes, *till 12/05/23* | Yes |
|Email notification on **new data source activation** 🔜 | No | Yes |
## 👩🏻💻 How to share data/features with the community?
You may publish ANY data which you consider as royalty‑ or license‑free ([Open Data](http://opendatahandbook.org/guide/en/what-is-open-data/)) and potentially valuable for ML applications for **community usage**:
1. Please Sign Up [here](https://profile.upgini.com)
2. Copy *Upgini API key* from your profile and upload your data from the Upgini Python library with this key:
```python
import pandas as pd
from upgini.metadata import SearchKey
from upgini.ads import upload_user_ads
import os
os.environ["UPGINI_API_KEY"] = "your_long_string_api_key_goes_here"
# you can define a custom search key that might not yet be supported; just use the SearchKey.CUSTOM_KEY type
sample_df = pd.read_csv("path_to_data_sample_file")
upload_user_ads("test", sample_df, {
    "city": SearchKey.CUSTOM_KEY,
    "stats_date": SearchKey.DATE
})
```
3. After data verification, search results on community data will be available in the usual way.
## 🛠 Getting Help & Community
Please note that we are still in beta.
Requests and support, in preferred order:
- [Slack Community](https://4mlg.short.gy/join-upgini-community)
- [GitHub Issues](https://github.com/upgini/upgini/issues)
❗Please try to create bug reports that are:
- **reproducible** - include steps to reproduce the problem.
- **specific** - include as much detail as possible: which Python version, what environment, etc.
- **unique** - do not duplicate existing opened issues.
- **scoped to a single bug** - one bug per report.
## 🧩 Contributing
We are not a large team, so we probably won't be able to:
- implement smooth integration with the most common low-code ML libraries and platforms ([PyCaret](https://www.github.com/pycaret/pycaret), [H2O AutoML](https://github.com//h2oai/h2o-3/blob/master/h2o-docs/src/product/automl.rst), etc.)
- implement all possible data verification and normalization capabilities for different types of search keys
And we need some help from the community!
So, we'll be happy about every **pull request** you open and every **issue** you report to make this library **even better**. Please note that it might sometimes take us a while to get back to you.
**For major changes**, please open an issue first to discuss what you would like to change.
#### Developing
Some convenient ways to start contributing are:
⚙️ [**Open in Visual Studio Code**](https://open.vscode.dev/upgini/upgini): open this repo remotely in VS Code without cloning it, or automatically clone and open it inside a Docker container.
⚙️ [**Gitpod**](https://gitpod.io/#https://github.com/upgini/upgini): launch a fully functional development environment right in your browser.
## 🔗 Useful links
- [Simple sales prediction template notebook](#-simple-sales-prediction-for-retail-stores)
- [Full list of Kaggle Guides & Examples](https://www.kaggle.com/romaupgini/code)
- [Project on PyPI](https://pypi.org/project/upgini)
- [More perks for registered users](https://profile.upgini.com)
<sup>😔 Found typo or a bug in code snippet? Our bad! <a href="https://github.com/upgini/upgini/issues/new?assignees=&title=readme%2Fbug">
Please report it here</a></sup>
Raw data
{
"_id": null,
"home_page": null,
"name": "upgini",
"maintainer": null,
"docs_url": null,
"requires_python": "<3.13,>=3.10",
"maintainer_email": null,
"keywords": "automl, data mining, data science, data search, machine learning",
"author": null,
"author_email": "Upgini Developers <madewithlove@upgini.com>",
"download_url": "https://files.pythonhosted.org/packages/e4/1a/5e7a48d287115bf54365482b600a702c65fde74842f6be3c14529db853b5/upgini-1.2.146.tar.gz",
"platform": null,
"description": "\n<!-- <h2 align=\"center\"> <a href=\"https://upgini.com/\">Upgini</a> : low-code feature search and enrichment library for machine learning </h2> -->\n<!-- <h2 align=\"center\"> <a href=\"https://upgini.com/\">Upgini</a> : Free automated data enrichment library for machine learning: </br>only the accuracy improving features in 2 minutes </h2> -->\n<!-- <h2 align=\"center\"> <a href=\"https://upgini.com/\">Upgini</a> \u2022 Free production-ready automated data enrichment library for machine learning</h2>--> \n<h2 align=\"center\"> <a href=\"https://upgini.com/\">Upgini \u2022 Intelligent data search & enrichment for Machine Learning and AI</a></h2>\n<p align=\"center\"> <b>Easily find and add relevant features to your ML & AI pipeline from</br> hundreds of public, community, and premium external data sources, </br>including open & commercial LLMs</b> </p>\n<p align=\"center\">\n\t<br />\n <a href=\"https://colab.research.google.com/github/upgini/upgini/blob/main/notebooks/Upgini_Features_search%26generation.ipynb\"><strong>Quick Start in Colab \u00bb</strong></a> |\n <!--<a href=\"https://upgini.com/\">Upgini.com</a> |-->\n <a href=\"https://profile.upgini.com\">Register / Sign In</a> |\n <!-- <a href=\"https://gitter.im/upgini/community?utm_source=share-link&utm_medium=link&utm_campaign=share-link\">Gitter Community</a> | -->\n <a href=\"https://4mlg.short.gy/join-upgini-community\">Slack Community</a> |\n <a href=\"https://forms.gle/pH99gb5hPxBEfNdR7\"><strong>Propose a new data source</strong></a>\n </p>\n<p align=center>\n<a href=\"/LICENSE\"><img alt=\"BSD-3 license\" src=\"https://img.shields.io/badge/license-BSD--3%20Clause-green\"></a>\n<a href=\"https://pypi.org/project/upgini/\"><img alt=\"PyPI - Python Version\" src=\"https://img.shields.io/pypi/pyversions/upgini\"></a>\n<a href=\"https://pypi.org/project/upgini/\"><img alt=\"PyPI\" src=\"https://img.shields.io/pypi/v/upgini?label=Release\"></a>\n<a 
href=\"https://pepy.tech/project/upgini\"><img alt=\"Downloads\" src=\"https://static.pepy.tech/badge/upgini\"></a>\n<a href=\"https://4mlg.short.gy/join-upgini-community\"><img alt=\"Upgini slack community\" src=\"https://img.shields.io/badge/slack-@upgini-orange.svg?logo=slack\"></a>\n</p>\n\n<!-- \n[](https://github.com/psf/black)\n\n[](https://gitter.im/upgini/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge) -->\n## \u2754 Overview\n\n**Upgini** is an intelligent data search engine with a Python library that helps you find and add relevant features to your ML pipeline from hundreds of public, community, and premium external data sources. Under the hood, Upgini automatically optimizes all connected data sources by [generating an optimal set of ML features using large language models (LLMs), GNNs (graph neural networks), and recurrent neural networks (RNNs)](https://upgini.com/#optimized_external_data). \n\n**Motivation:** for most supervised ML models external data & features boost accuracy significantly better than any hyperparameters tuning. But lack of automated and time-efficient enrichment tools for external data blocks massive adoption of external features in ML pipelines. We want to radically simplify feature search and enrichment to make external data a standard approach. Like hyperparameter tuning in machine learning today. \n\n**Mission:** Democratize access to data sources for data science community. \n\n## \ud83d\ude80 Awesome features\n\u2b50\ufe0f Automatically find only relevant features that *improve your model\u2019s accuracy*. Not just correlated with the target variable, which in 9 out of 10 cases yields zero accuracy improvement \n\u2b50\ufe0f Automated feature generation from the sources: feature generation with LLM\u2011based data augmentation, RNNs, and GraphNNs; ensembling across multiple data sources \n\u2b50\ufe0f Automatic search key augmentation from all connected sources. 
If you do not have all search keys in your search request, such as postal/ZIP code, Upgini will try to add those keys based on the provided set of search keys. This will broaden the search across all available data sources \n\u2b50\ufe0f Calculate accuracy metrics and uplift after enriching an existing ML model with external features \n\u2b50\ufe0f Check the stability of accuracy gain from external data on out-of-time intervals and verification datasets. Mitigate the risks of unstable external data dependencies in the ML pipeline \n\u2b50\ufe0f Easy to use - a single request to enrich the training dataset with [*all of the keys at once*](#-search-key-types-we-support-more-to-come): \n<table>\n <tr>\n <td> date / datetime </td>\n <td> phone number </td>\n </tr>\n <tr>\n <td> postal / ZIP code </td>\n <td> hashed email / HEM </td>\n </tr>\n <tr>\n <td> country </td>\n <td> IP-address </td>\n </tr>\n</table>\n\n\u2b50\ufe0f Scikit-learn-compatible interface for quick data integration with existing ML pipelines \n\u2b50\ufe0f Support for most common supervised ML tasks on tabular data: \n<table>\n <tr>\n <td><a href=\"https://en.wikipedia.org/wiki/Binary_classification\">\u2611\ufe0f binary classification</a></td>\n <td><a href=\"https://en.wikipedia.org/wiki/Multiclass_classification\">\u2611\ufe0f multiclass classification</a></td>\n </tr>\n <tr>\n <td><a href=\"https://en.wikipedia.org/wiki/Regression_analysis\">\u2611\ufe0f regression</a></td>\n <td><a href=\"https://en.wikipedia.org/wiki/Time_series#Prediction_and_forecasting\">\u2611\ufe0f time-series prediction</a></td>\n </tr>\n</table> \n\n\u2b50\ufe0f [Simple Drag & Drop Search UI](https://www.upgini.com/data-search-widget): \n<a href=\"https://upgini.com/upgini-widget\">\n<img width=\"710\" alt=\"Drag & Drop Search UI\" src=\"https://github.com/upgini/upgini/assets/95645411/36b6460c-51f3-400e-9f04-445b938bf45e\">\n</a>\n\n\n## \ud83c\udf0e Connected data sources and coverage\n\n- **Public data**: public 
sector, academic institutions, other sources through open data portals. Curated and updated by the Upgini team \n- **Community\u2011shared data**: royalty- or license-free datasets or features from the data science community (our users). This includes both public and scraped data \n- **Premium data providers**: commercial data sources verified by the Upgini team in real-world use cases \n\n\ud83d\udc49 [**Details on datasets and features**](https://upgini.com/#data_sources) \n#### \ud83d\udcca Total: **239 countries** and **up to 41 years** of history\n|Data sources|Countries|History (years)|# sources for ensembling|Update frequency|Search keys|API Key required\n|--|--|--|--|--|--|--|\n|Historical weather & Climate normals | 68 |22|-|Monthly|date, country, postal/ZIP code|No\n|Location/Places/POI/Area/Proximity information from OpenStreetMap | 221 |2|-|Monthly|date, country, postal/ZIP code|No\n|International holidays & events, Workweek calendar| 232 |22|-|Monthly|date, country|No\n|Consumer Confidence index| 44 |22|-|Monthly|date, country|No\n|World economic indicators|191 |41|-|Monthly|date, country|No\n|Markets data|-|17|-|Monthly|date, datetime|No\n|World mobile & fixed-broadband network coverage and performance |167|-|3|Monthly|country, postal/ZIP code|No\n|World demographic data |90|-|2|Annual|country, postal/ZIP code|No\n|World house prices |44|-|3|Annual|country, postal/ZIP code|No\n|Public social media profile data |104|-|-|Monthly|date, email/HEM, phone |Yes\n|Car ownership data and Parking statistics|3|-|-|Annual|country, postal/ZIP code, email/HEM, phone|Yes\n|Geolocation profile for phone & IPv4 & email|239|-|6|Monthly|date, email/HEM, phone, IPv4|Yes\n|\ud83d\udd1c Email/WWW domain profile|-|-|-|-\n\n\u2753**Know other useful data sources for machine learning?** [Give us a hint and we'll add it for free](https://forms.gle/pH99gb5hPxBEfNdR7). 
\n\n\n## \ud83d\udcbc Tutorials\n\n### [Search of relevant external features & Automated feature generation for Salary prediction task (use as a template)](https://github.com/upgini/upgini/blob/main/notebooks/Upgini_Features_search%26generation.ipynb)\n\n* The goal is to predict salary for a data science job posting based on information about the employer and job description.\n* Following this guide, you'll learn how to **search and auto\u2011generate new relevant features with the Upgini library**\n* The evaluation metric is [Mean Absolute Error (MAE)](https://en.wikipedia.org/wiki/Mean_absolute_error).\n \nRun [Feature search & generation notebook](https://github.com/upgini/upgini/blob/main/notebooks/Upgini_Features_search%26generation.ipynb) inside your browser:\n\n[](https://colab.research.google.com/github/upgini/upgini/blob/main/notebooks/Upgini_Features_search%26generation.ipynb)\n \n<!--\n[](https://mybinder.org/v2/gh/upgini/upgini/main?labpath=notebooks%2FUpgini_Features_search%26generation.ipynb)\n \n[](https://gitpod.io/#/github.com/upgini/upgini)\n-->\n### \u2753 [Simple sales prediction for retail stores](https://github.com/upgini/upgini/blob/main/notebooks/kaggle_example.ipynb)\n\n* The goal is to **predict future sales of different goods in stores** based on a 5-year history of sales. \n* Kaggle Competition [Store Item Demand Forecasting Challenge](https://www.kaggle.com/c/demand-forecasting-kernels-only) is a product sales forecasting competition. The evaluation metric is [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error). 
\n\nRun [Simple sales prediction for retail stores](https://github.com/upgini/upgini/blob/main/notebooks/kaggle_example.ipynb) inside your browser:\n\n[](https://colab.research.google.com/github/upgini/upgini/blob/main/notebooks/kaggle_example.ipynb)\n \n<!--\n[](https://mybinder.org/v2/gh/upgini/upgini/main?urlpath=notebooks%2Fnotebooks%2Fkaggle_example.ipynb)\n \n[](https://gitpod.io/#/github.com/upgini/upgini)\n--> \n\n### \u2753 [How to boost ML model accuracy for Kaggle Top-1 leaderboard in 15 minutes](https://www.kaggle.com/code/nikupgini/how-to-find-external-data-for-1-private-lb-4-53/notebook)\n\n* The goal is **to improve a Top\u20111 winning Kaggle solution** by adding new relevant external features and data. \n* [Kaggle Competition](https://www.kaggle.com/competitions/tabular-playground-series-jan-2022/) is a product sales forecasting competition; the evaluation metric is [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error). \n\n### \u2753 [How to do low-code feature engineering for AutoML tools](https://www.kaggle.com/code/romaupgini/zero-feature-engineering-with-upgini-pycaret/notebook)\n\n* **Save time on feature search and engineering**. Use ready-to-use external features and data sources to maximize overall AutoML accuracy, right out of the box. \n* [Kaggle Competition](https://www.kaggle.com/competitions/tabular-playground-series-jan-2022/) is a product sales forecasting, evaluation metric is [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error). 
\n* Low-code AutoML frameworks: [Upgini](https://github.com/upgini/upgini) and [PyCaret](https://github.com/pycaret/pycaret)\n\n### \u2753 [How to improve accuracy of Multivariate time-series forecast from external features & data](https://www.kaggle.com/code/romaupgini/guide-external-data-features-for-multivariatets/notebook)\n\n* The goal is **to improve the accuracy of multivariate time\u2011series forecasting** using new relevant external features and data. The main challenge is the data and feature enrichment strategy, in which a component of a multivariate time series depends not only on its past values but also on other components. \n* [Kaggle Competition](https://www.kaggle.com/competitions/tabular-playground-series-jan-2022/) is a product sales forecasting, evaluation metric is [RMSLE](https://www.kaggle.com/code/carlmcbrideellis/store-sales-using-the-average-of-the-last-16-days#Note-regarding-calculating-the-average). \n\n### \u2753 [How to speed up feature engineering hypothesis tests with ready-to-use external features](https://www.kaggle.com/code/romaupgini/statement-dates-to-use-or-not-to-use/notebook)\n\n* **Save time on external data wrangling and feature calculation code** for hypothesis tests. The key challenge is the time\u2011dependent representation of information in the training dataset, which is uncommon for credit default prediction tasks. As a result, special data enrichment strategy is used. \n* [Kaggle Competition](https://www.kaggle.com/competitions/amex-default-prediction) is a credit default prediction, evaluation metric is [normalized Gini coefficient](https://www.kaggle.com/competitions/amex-default-prediction/discussion/327464).\n\n## \ud83c\udfc1 Quick start \n\n### 1. 
Install from PyPI\n```python\n%pip install upgini\n```\n<details>\n\t<summary>\n\t\ud83d\udc33 <b>Docker-way</b>\n\t</summary>\n</br>\nClone <i>$ git clone https://github.com/upgini/upgini</i> or download upgini git repo locally </br>\nand follow steps below to build docker container \ud83d\udc47 </br>\n</br> \n1. Build docker image from cloned git repo:</br>\n<i>cd upgini </br>\ndocker build -t upgini .</i></br>\n</br>\n...or directly from GitHub:\n</br>\n<i>DOCKER_BUILDKIT=0 docker build -t upgini</i></br> <i>git@github.com:upgini/upgini.git#main</i></br>\n</br>\n2. Run docker image:</br>\n<i>\ndocker run -p 8888:8888 upgini</br>\n</i></br>\n3. Open http://localhost:8888?token=<your_token_from_console_output> in your browser \n</details>\n\n\n### 2. \ud83d\udca1 Use your labeled training dataset for search\n\nYou can use your labeled training datasets \"as is\" to initiate the search. Under the hood, we'll search for relevant data using:\n- **[search keys](#-search-key-types-we-support-more-to-come)** from the training dataset to match records from potential data sources with new features\n- **labels** from the training dataset to estimate the relevance of features or datasets for your ML task and calculate feature importance metrics \n- **your features** from the training dataset to find external datasets and features that improve accuracy of your existing data and estimate accuracy uplift ([optional](#find-features-only-give-accuracy-gain-to-existing-data-in-the-ml-model)) \n\n\nLoad the training dataset into a Pandas DataFrame and separate feature columns from the label column in a Scikit-learn way: \n```python\nimport pandas as pd\n# labeled training dataset - customer_churn_prediction_train.csv\ntrain_df = pd.read_csv(\"customer_churn_prediction_train.csv\")\nX = train_df.drop(columns=\"churn_flag\")\ny = train_df[\"churn_flag\"]\n```\n<table border=1 cellpadding=10><tr><td>\n\u26a0\ufe0f <b>Requirements for search initialization dataset</b>\n<br>\nWe 
perform dataset verification and cleaning under the hood, but still there are some requirements to follow: \n<br>\n1. <b>pandas.DataFrame</b>, <b>pandas.Series</b> or <b>numpy.ndarray</b> representation; \n<br>\n2. correct label column types: boolean/integers/strings for binary and multiclass labels, floats for regression; \n<br>\n3. at least one column selected as a <a href=\"#-search-key-types-we-support-more-to-come\">search key</a>;\n<br>\n4. min size after deduplication by search-key columns and removal of NaNs: <i>100 records</i>\n</td></tr></table>\n\n### 3. \ud83d\udd26 Choose one or more columns as search keys\n*Search keys* columns will be used to match records from all potential external data sources/features. \nDefine one or more columns as search keys when initializing the `FeaturesEnricher` class.\n```python\nfrom upgini.features_enricher import FeaturesEnricher\nfrom upgini.metadata import SearchKey\n\nenricher = FeaturesEnricher(\n\tsearch_keys={\n\t\t\"subscription_activation_date\": SearchKey.DATE,\n\t\t\"country\": SearchKey.COUNTRY,\n\t\t\"zip_code\": SearchKey.POSTAL_CODE,\n\t\t\"hashed_email\": SearchKey.HEM,\n\t\t\"last_visit_ip_address\": SearchKey.IP,\n\t\t\"registered_with_phone\": SearchKey.PHONE\n\t})\n```\n#### \u2728 Search key types we support (more to come!)\n<table style=\"table-layout: fixed; text-align: left\">\n <tr>\n <th> Search Key<br/>Meaning Type </th>\n <th> Description </th>\n <th> Allowed pandas dtypes (Python types) </th>\n <th> Example </th>\n </tr>\n <tr>\n <td> SearchKey.EMAIL </td>\n <td> e-mail </td>\n <td> <tt>object(str)</tt> <br/> <tt>string</tt> </td>\n <td> <tt>support@upgini.com </tt> </td>\n </tr>\n <tr>\n <td> SearchKey.HEM </td>\n <td> <tt>sha256(lowercase(email)) </tt> </td>\n <td> <tt>object(str)</tt> <br/> <tt>string</tt> </td>\n <td> <tt>0e2dfefcddc929933dcec9a5c7db7b172482814e63c80b8460b36a791384e955</tt> </td>\n </tr>\n <tr>\n <td> SearchKey.IP </td>\n <td> IPv4 or IPv6 address</td>\n <td> 
<tt>object(str, ipaddress.IPv4Address, ipaddress.IPv6Address)</tt> <br/> <tt>string</tt> <br/> <tt>int64</tt> </td>\n <td> <tt>192.168.0.1 </tt> </td>\n </tr>\n <tr>\n <td> SearchKey.PHONE </td>\n <td> phone number (<a href=\"https://en.wikipedia.org/wiki/E.164\">E.164 standard</a>) </td>\n <td> <tt>object(str)</tt> <br/> <tt>string</tt> <br/> <tt>int64</tt> <br/> <tt>float64</tt> </td>\n <td> <tt>443451925138 </tt> </td>\n </tr>\n <tr>\n <td> SearchKey.DATE </td>\n <td> date </td>\n <td> \n <tt>object(str)</tt> <br/> \n <tt>string</tt> <br/>\n <tt>datetime64[ns]</tt> <br/>\n <tt>period[D]</tt> <br/>\n </td>\n <td> \n <tt>2020-02-12 </tt> (<a href=\"https://en.wikipedia.org/wiki/ISO_8601\">ISO-8601 standard</a>) \n <br/> <tt>12.02.2020 </tt> (non\u2011standard notation) \n </td>\n </tr>\n <tr>\n <td> SearchKey.DATETIME </td>\n <td> datetime </td>\n <td> \n <tt>object(str)</tt> <br/> \n <tt>string</tt> <br/>\n <tt>datetime64[ns]</tt> <br/>\n <tt>period[D]</tt> <br/>\n </td>\n <td> <tt>2020-02-12 12:46:18 </tt> <br/> <tt>12:46:18 12.02.2020 </tt> </td>\n </tr>\n <tr>\n <td> SearchKey.COUNTRY </td>\n <td> <a href=\"https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2\">Country ISO-3166 code</a>, Country name </td>\n <td> <tt>object(str)</tt> <br/> <tt>string</tt> </td>\n <td> <tt>GB </tt> <br/> <tt>US </tt> <br/> <tt>IN </tt> </td>\n </tr> \n <tr>\n <td> SearchKey.POSTAL_CODE </td>\n <td> Postal code a.k.a. ZIP code. Can only be used with SearchKey.COUNTRY </td>\n <td> <tt>object(str)</tt> <br/> <tt>string</tt> </td>\n <td> <tt>21174 </tt> <br/> <tt>061107 </tt> <br/> <tt>SE-999-99 </tt> </td>\n </tr>\n</table>\n\n</details>\n\nFor the search key types <tt>SearchKey.DATE</tt>/<tt>SearchKey.DATETIME</tt> with dtypes <tt>object</tt> or <tt>string</tt> you have to specify the date/datetime format by passing <tt>date_format</tt> parameter to `FeaturesEnricher`. 
For example:\n```python\nfrom upgini.features_enricher import FeaturesEnricher\nfrom upgini.metadata import SearchKey\n\nenricher = FeaturesEnricher(\n\tsearch_keys={\n\t\t\"subscription_activation_date\": SearchKey.DATE,\n\t\t\"country\": SearchKey.COUNTRY,\n\t\t\"zip_code\": SearchKey.POSTAL_CODE,\n\t\t\"hashed_email\": SearchKey.HEM,\n\t\t\"last_visit_ip_address\": SearchKey.IP,\n\t\t\"registered_with_phone\": SearchKey.PHONE\n\t}, \n\tdate_format = \"%Y-%d-%m\"\n)\n```\n\nTo use a non-UTC timezone for datetime, you can cast datetime column explicitly to your timezone (example for Warsaw):\n```python\ndf[\"date\"] = df.date.astype(\"datetime64\").dt.tz_localize(\"Europe/Warsaw\")\n```\n\nA single country for the whole training dataset can be passed via `country_code` parameter:\n```python\nfrom upgini.features_enricher import FeaturesEnricher\nfrom upgini.metadata import SearchKey\n\nenricher = FeaturesEnricher(\n\tsearch_keys={\n\t\t\"subscription_activation_date\": SearchKey.DATE,\n\t\t\"zip_code\": SearchKey.POSTAL_CODE,\n\t}, \n\tcountry_code = \"US\",\n\tdate_format = \"%Y-%d-%m\"\n)\n```\n\n### 4. \ud83d\udd0d Start your first feature search!\nThe main abstraction you interact with is `FeaturesEnricher`, a Scikit-learn-compatible estimator. You can easily add it to your existing ML pipelines. 
\nCreate an instance of the `FeaturesEnricher` class and call:\n- `fit` to search relevant datasets & features \n- then `transform` to enrich your dataset with features from the search result \n\nLet's try it out!\n```python\nimport pandas as pd\nfrom upgini.features_enricher import FeaturesEnricher\nfrom upgini.metadata import SearchKey\n\n# load labeled training dataset to initiate search\ntrain_df = pd.read_csv(\"customer_churn_prediction_train.csv\")\nX = train_df.drop(columns=\"churn_flag\")\ny = train_df[\"churn_flag\"]\n\n# now we're going to create an instance of the `FeaturesEnricher` class\nenricher = FeaturesEnricher(\n\tsearch_keys={\n\t\t\"subscription_activation_date\": SearchKey.DATE,\n\t\t\"country\": SearchKey.COUNTRY,\n\t\t\"zip_code\": SearchKey.POSTAL_CODE\n\t})\n\n# Everything is ready to fit! For 100k records, fitting should take around 10 minutes\n# We'll send an email notification; just register on profile.upgini.com\nenricher.fit(X, y)\n```\n\nThat's it! The `FeaturesEnricher` is now fitted. \n### 5. \ud83d\udcc8 Evaluate feature importances (SHAP values) from the search result\n\n`FeaturesEnricher` class has two properties for feature importances, that are populated after fit - `feature_names_` and `feature_importances_`: \n- `feature_names_` - feature names from the search result, and if parameter `keep_input=True` was used, initial columns from search dataset as well \n- `feature_importances_` - SHAP values for features from the search result, same order as in `feature_names_` \n\nMethod `get_features_info()` returns pandas dataframe with features and full statistics after fit, including SHAP values and match rates:\n```python\nenricher.get_features_info()\n```\nGet more details about `FeaturesEnricher` at runtime using docstrings via `help(FeaturesEnricher)` or `help(FeaturesEnricher.fit)`.\n\n### 6. 
\ud83c\udfed Enrich Production ML pipeline with relevant external features\n`FeaturesEnricher` is a Scikit-learn-compatible estimator, so any pandas dataframe can be enriched with external features from a search result (after `fit`). \nUse the `transform` method of `FeaturesEnricher`, and let the magic do the rest \ud83e\ude84\n```python\n# load dataset for enrichment\ntest_x = pd.read_csv(\"test.csv\")\n# enrich it!\nenriched_test_features = enricher.transform(test_x)\n```\n #### 6.1 Reuse completed search for enrichment without 'fit' run\n\n`FeaturesEnricher` can be initialized with `search_id` from a completed search (after a fit call).\nJust use `enricher.get_search_id()` or copy search id string from the `fit()` output. \nSearch keys and features in X must be the same as for `fit()`\n```python\nenricher = FeaturesEnricher(\n # same set of search keys as for the fit step\n search_keys={\"date\": SearchKey.DATE},\n api_key=\"<YOUR API_KEY>\", # if you fitted the enricher with an api_key, then you should use it here\n search_id = \"abcdef00-0000-0000-0000-999999999999\"\n)\nenriched_prod_dataframe = enricher.transform(input_dataframe)\n```\n#### 6.2 Enrichment with updated external data sources and features\nIn most ML cases, the training step requires a labeled dataset with historical observations. For production, you'll need updated, current data sources and features to generate predictions. \n`FeaturesEnricher`, when initialized with a set of search keys that includes `SearchKey.DATE`, will match records from all potential external data sources **exactly on the specified date/datetime** based on `SearchKey.DATE`, to avoid enrichment with features \"from the future\" during the `fit` step. 
\nAnd then, for `transform` in a production ML pipeline, you'll get enrichment with relevant features, current as of the present date.\n\n\u26a0\ufe0f Include `SearchKey.DATE` in the set of search keys to get current features for production and avoid features from the future during training:\n```python\nenricher = FeaturesEnricher(\n\tsearch_keys={\n\t\t\"subscription_activation_date\": SearchKey.DATE,\n\t\t\"country\": SearchKey.COUNTRY,\n\t\t\"zip_code\": SearchKey.POSTAL_CODE,\n\t},\n) \n```\n\n## \ud83d\udcbb How does it work?\n\n### \ud83e\uddf9 Search dataset validation\nWe validate and clean the search\u2011initialization dataset under the hood: \n\n - check the formats of your **search key** columns; \n - check the label column for zero variance; \n - check the dataset for full row duplicates; if we find any, we remove them and report their share; \n - check for inconsistent labels - rows with the same features and keys but different labels; we remove them and report their share; \n - remove columns with zero variance - we treat any non-**search key** column in the search dataset as a feature, so columns with zero variance will be removed\n\n### \u2754 Supervised ML tasks detection\nWe detect the ML task type under the hood based on the label column values. 
Currently we support: \n - ModelTaskType.BINARY\n - ModelTaskType.MULTICLASS \n - ModelTaskType.REGRESSION \n\nBut for certain search datasets, you can pass the correct ML task type to `FeaturesEnricher` explicitly:\n```python\nfrom upgini.features_enricher import FeaturesEnricher\nfrom upgini.metadata import SearchKey, ModelTaskType\n\nenricher = FeaturesEnricher(\n\tsearch_keys={\"subscription_activation_date\": SearchKey.DATE},\n\tmodel_task_type=ModelTaskType.REGRESSION\n)\n```\n#### \u23f0 Time-series prediction support \n*Time-series prediction* is supported as `ModelTaskType.REGRESSION` or `ModelTaskType.BINARY` tasks with time-series\u2011specific cross-validation splits:\n* [Scikit-learn time-series cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html#time-series-split) - `CVType.time_series` parameter\n* [Blocked time-series cross-validation](https://goldinlocks.github.io/Time-Series-Cross-Validation/#Blocked-and-Time-Series-Split-Cross-Validation) - `CVType.blocked_time_series` parameter\n\nTo initiate a feature search, pass a time-series\u2011specific CV type to `FeaturesEnricher` via the `cv` parameter:\n```python\nfrom upgini.features_enricher import FeaturesEnricher\nfrom upgini.metadata import SearchKey, CVType\n\nenricher = FeaturesEnricher(\n\tsearch_keys={\"sales_date\": SearchKey.DATE},\n\tcv=CVType.time_series\n)\n```\n\nIf you're working with multivariate time series, you should specify the id columns of the individual univariate series in `FeaturesEnricher`. 
For example, if you have a dataset predicting sales for different stores and products, you should specify the store and product id columns as follows:\n```python\nenricher = FeaturesEnricher(\n\tsearch_keys={\n\t\t\"sales_date\": SearchKey.DATE,\n\t},\n\tid_columns=[\"store_id\", \"product_id\"],\n\tcv=CVType.time_series\n)\n```\n\u26a0\ufe0f **Preprocess the dataset** for time-series prediction: \nsort the rows by observation order, in most cases in ascending order by date/datetime.\n\n### \ud83c\udd99 Accuracy and uplift metrics calculations\n`FeaturesEnricher` automatically calculates model metrics and uplift from new relevant features, either with the `calculate_metrics()` method or with the `calculate_metrics=True` parameter of the `fit` or `fit_transform` methods (example below). \nYou can use any model estimator with a scikit-learn-compatible interface; some examples are:\n* [All Scikit-Learn supervised models](https://scikit-learn.org/stable/supervised_learning.html)\n* [Xgboost](https://xgboost.readthedocs.io/en/stable/python/python_api.html#module-xgboost.sklearn)\n* [LightGBM](https://lightgbm.readthedocs.io/en/latest/Python-API.html#scikit-learn-api)\n* [CatBoost](https://catboost.ai/en/docs/concepts/python-quickstart)\n\n<details>\n\t<summary>\n\t\t\ud83d\udc48 The evaluation metric should be passed to <i>calculate_metrics()</i> via the <i>scoring</i> parameter;<br/> \n\t\tout of the box, Upgini supports \n\t</summary>\n<table style=\"table-layout: fixed;\">\n <tr>\n <th>Metric</th>\n <th>Description</th>\n </tr>\n <tr>\n <td><tt>explained_variance</tt></td>\n <td>Explained variance regression score function</td>\n </tr>\n <tr>\n <td><tt>r2</tt></td>\n <td>R<sup>2</sup> (coefficient of determination) regression score function</td>\n </tr>\n <tr>\n <td><tt>max_error</tt></td>\n <td>Calculates the maximum residual error (negative - greater is better)</td>\n </tr>\n <tr>\n <td><tt>median_absolute_error</tt></td>\n <td>Median absolute error regression 
loss</td>\n </tr>\n <tr>\n <td><tt>mean_absolute_error</tt></td>\n <td>Mean absolute error regression loss</td>\n </tr>\n <tr>\n <td><tt>mean_absolute_percentage_error</tt></td>\n <td>Mean absolute percentage error regression loss</td>\n </tr>\n <tr>\n <td><tt>mean_squared_error</tt></td>\n <td>Mean squared error regression loss</td>\n </tr>\n <tr>\n\t <td><tt>mean_squared_log_error</tt> (or aliases: <tt>msle</tt>, <tt>MSLE</tt>)</td>\n <td>Mean squared logarithmic error regression loss</td>\n </tr>\n <tr>\n <td><tt>root_mean_squared_log_error</tt> (or aliases: <tt>rmsle</tt>, <tt>RMSLE</tt>)</td>\n <td>Root mean squared logarithmic error regression loss</td>\n </tr>\n <tr>\n <td><tt>root_mean_squared_error</tt></td>\n <td>Root mean squared error regression loss</td>\n </tr>\n <tr>\n <td><tt>mean_poisson_deviance</tt></td>\n <td>Mean Poisson deviance regression loss</td>\n </tr>\n <tr>\n <td><tt>mean_gamma_deviance</tt></td>\n <td>Mean Gamma deviance regression loss</td>\n </tr>\n <tr>\n <td><tt>accuracy</tt></td>\n <td>Accuracy classification score</td>\n </tr>\n <tr>\n <td><tt>top_k_accuracy</tt></td>\n <td>Top-k Accuracy classification score</td>\n </tr>\n <tr>\n <td><tt>roc_auc</tt></td>\n <td>Area Under the Receiver Operating Characteristic Curve (ROC AUC)\n from prediction scores</td>\n </tr>\n <tr>\n <td><tt>roc_auc_ovr</tt></td>\n <td>Area Under the Receiver Operating Characteristic Curve (ROC AUC)\n from prediction scores (multi_class=\"ovr\")</td>\n </tr>\n <tr>\n <td><tt>roc_auc_ovo</tt></td>\n <td>Area Under the Receiver Operating Characteristic Curve (ROC AUC)\n from prediction scores (multi_class=\"ovo\")</td>\n </tr>\n <tr>\n <td><tt>roc_auc_ovr_weighted</tt></td>\n <td>Area Under the Receiver Operating Characteristic Curve (ROC AUC)\n from prediction scores (multi_class=\"ovr\", average=\"weighted\")</td>\n </tr>\n <tr>\n <td><tt>roc_auc_ovo_weighted</tt></td>\n <td>Area Under the Receiver Operating Characteristic Curve (ROC AUC)\n from prediction 
scores (multi_class=\"ovo\", average=\"weighted\")</td>\n </tr>\n <tr>\n <td><tt>balanced_accuracy</tt></td>\n <td>Compute the balanced accuracy</td>\n </tr>\n <tr>\n <td><tt>average_precision</tt></td>\n <td>Compute average precision (AP) from prediction scores</td>\n </tr>\n <tr>\n <td><tt>log_loss</tt></td>\n <td>Log loss, aka logistic loss or cross-entropy loss</td>\n </tr>\n <tr>\n <td><tt>brier_score</tt></td>\n <td>Compute the Brier score loss</td>\n </tr>\n</table>\n</details>\n\nIn addition to that list, you can define a custom evaluation metric function using [scikit-learn make_scorer](https://scikit-learn.org/1.7/modules/model_evaluation.html#defining-your-scoring-strategy-from-score-functions), for example [SMAPE](https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error).\n\nBy default, the `calculate_metrics()` method calculates the evaluation metric with the same cross-validation split as selected for `FeaturesEnricher.fit()` by the parameter `cv = CVType.<cross-validation-split>`. 
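A custom SMAPE scorer, for instance, might be built with `make_scorer` as sketched below; the metric function itself is hand-rolled here for illustration and is not part of Upgini:

```python
import numpy as np
from sklearn.metrics import make_scorer

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent (0 is best)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # treat 0/0 terms as zero error to avoid division by zero
    terms = np.divide(np.abs(y_pred - y_true), denom,
                      out=np.zeros_like(denom), where=denom != 0)
    return 100.0 * terms.mean()

# greater_is_better=False: scikit-learn negates the score internally,
# because a lower SMAPE means a better model
smape_scorer = make_scorer(smape, greater_is_better=False)
```

The resulting `smape_scorer` can then be passed as the `scoring` argument, e.g. `enricher.calculate_metrics(scoring=smape_scorer)`.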
\nBut you can easily define a new split by passing a subclass of `BaseCrossValidator` to the `cv` parameter in `calculate_metrics()`.\n\nExample with more tips-and-tricks:\n```python\nfrom lightgbm import LGBMRegressor\nfrom sklearn.model_selection import TimeSeriesSplit\nfrom upgini.features_enricher import FeaturesEnricher\nfrom upgini.metadata import SearchKey\n\nenricher = FeaturesEnricher(search_keys={\"registration_date\": SearchKey.DATE})\n\n# Fit with default setup for metrics calculation\n# CatBoost will be used\nenricher.fit(X, y, eval_set=eval_set, calculate_metrics=True)\n\n# LightGBM estimator for metrics\ncustom_estimator = LGBMRegressor()\nenricher.calculate_metrics(estimator=custom_estimator)\n\n# Custom metric passed to the scoring param (callable or name)\ncustom_scoring = \"RMSLE\"\nenricher.calculate_metrics(scoring=custom_scoring)\n\n# Custom cross validator\ncustom_cv = TimeSeriesSplit(n_splits=5)\nenricher.calculate_metrics(cv=custom_cv)\n\n# All of these custom parameters can be combined and passed to fit, fit_transform and calculate_metrics:\nenricher.fit(X, y, eval_set, calculate_metrics=True, estimator=custom_estimator, scoring=custom_scoring, cv=custom_cv)\n```\n\n\n\n## \u2705 More tips-and-tricks\n\n### \ud83e\udd16 Automated feature generation from columns in a search dataset \n\nIf a training dataset has a text column, you can generate additional embeddings from it using instruction\u2011guided embedding generation with LLMs and data augmentation from external sources, just like Upgini does for all records from connected data sources.\n\nIn most cases, this gives better results than generating embeddings directly from a text field. Currently, Upgini has two LLMs connected to the search engine - GPT-3.5 from OpenAI and GPT-J.\n\nTo use this feature, pass the column names as arguments to the `text_features` parameter. 
You can use up to 2 columns.\n\nHere's an example of generating features from the \"description\" and \"summary\" columns:\n\n```python\nenricher = FeaturesEnricher(\n search_keys={\"date\": SearchKey.DATE},\n text_features=[\"description\", \"summary\"]\n)\n```\n\nWith this code, Upgini will generate LLM embeddings from the text columns and then check them for predictive power for your ML task.\n\nFinally, Upgini will return a dataset enriched with only the relevant components of the LLM embeddings.\n\n### Find features that only provide accuracy gains to existing data in the ML model\n\nIf you already have features or other external data sources, you can specifically search for new datasets and features that provide accuracy gains only \"on top\" of them. \n\nJust leave all these existing features in the labeled training dataset, and the Upgini library will automatically use them during the feature search process and as a baseline ML model to calculate the accuracy metric uplift. Only features that improve accuracy will be returned.\n\n### Check robustness of accuracy improvement from external features\n\nYou can validate the robustness of external features on an out-of-time dataset using the `eval_set` parameter:\n```python\n# load train dataset\ntrain_df = pd.read_csv(\"train.csv\")\ntrain_ids_and_features = train_df.drop(columns=\"label\")\ntrain_label = train_df[\"label\"]\n\n# load out-of-time validation dataset\neval_df = pd.read_csv(\"validation.csv\")\neval_ids_and_features = eval_df.drop(columns=\"label\")\neval_label = eval_df[\"label\"]\n# create FeaturesEnricher\nenricher = FeaturesEnricher(search_keys={\"registration_date\": SearchKey.DATE})\n\n# now we fit WITH the eval_set parameter to calculate accuracy metrics on the out-of-time dataset.\n# the output will contain quality metrics for both the training dataset and\n# the eval set (validation OOT dataset)\nenricher.fit(\n train_ids_and_features,\n train_label,\n eval_set = [(eval_ids_and_features, 
eval_label)]\n)\n```\n#### \u26a0\ufe0f Requirements for the out-of-time dataset \n- Same data schema as the search initialization X dataset\n- Pandas dataframe representation\n\nThe out-of-time dataset can be passed without labels. There are three ways to do this:\n```python\nenricher.fit(\n train_ids_and_features,\n train_label,\n eval_set = [\n (eval_ids_and_features_1,), # A tuple with 1 element\n (eval_ids_and_features_2, None), # None as labels\n (eval_ids_and_features_3, [np.nan] * len(eval_ids_and_features_3)), # List or Series of the same size as eval X\n ]\n)\n```\n\n### Control feature stability with PSI parameters\n\n`FeaturesEnricher` supports Population Stability Index (PSI) calculation on the eval_set to evaluate feature stability over time. You can control this behavior using stability parameters in the `fit` and `fit_transform` methods:\n\n```python\nenricher = FeaturesEnricher(\n search_keys={\"registration_date\": SearchKey.DATE}\n)\n\n# Control feature stability during fit\nenricher.fit(\n X, y, \n stability_threshold=0.2, # PSI threshold: features with PSI above this value will be dropped\n stability_agg_func=\"max\" # Aggregation function for stability values: \"max\", \"min\", \"mean\"\n)\n\n# Same parameters work for fit_transform\nenriched_df = enricher.fit_transform(\n X, y,\n stability_threshold=0.1, # Stricter threshold for more stable features\n stability_agg_func=\"mean\" # Use mean aggregation instead of max\n)\n```\n\n**Stability parameters:**\n- `stability_threshold` (float, default=0.2): PSI threshold value. Features with PSI above this threshold will be excluded from the final feature set. Lower values mean stricter stability requirements.\n- `stability_agg_func` (str, default=\"max\"): Function to aggregate PSI values across time intervals. 
Options: \"max\" (most conservative), \"min\" (least conservative), \"mean\" (balanced approach).\n\n**PSI (Population Stability Index)** measures how much a feature's distribution changes over time. Lower PSI values indicate more stable features, which are generally more reliable for production ML models. PSI is calculated on the eval_set, which should contain the most recent dates relative to the training dataset.\n\n### Use a custom loss function in feature selection & metrics calculation\n\n`FeaturesEnricher` can be initialized with the additional string parameter `loss`. \nDepending on the ML task, you can use the following loss functions:\n- `regression`: regression, regression_l1, huber, poisson, quantile, mape, gamma, tweedie;\n- `binary`: binary;\n- `multiclass`: multiclass, multiclassova.\n\nFor instance, if your target variable has a Poisson distribution (count of events, number of customers in the shop and so on), you should try `loss=\"poisson\"` to improve the quality of feature selection and get better evaluation metrics. 
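As a quick intuition for why this helps: for Poisson-distributed counts, the variance scales with the mean, a property that squared-error losses ignore but Poisson loss models directly. A small simulated check (not Upgini code, values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# simulated count target, e.g. daily number of customers, Poisson with mean 4
y = rng.poisson(lam=4.0, size=50_000)

# for a Poisson distribution the variance equals the mean,
# so the sample mean and sample variance should be close
print(round(float(y.mean()), 2), round(float(y.var()), 2))
```
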
\n\nUsage example:\n```python\nenricher = FeaturesEnricher(\n\tsearch_keys={\"date\": SearchKey.DATE},\n\tloss=\"poisson\",\n\tmodel_task_type=ModelTaskType.REGRESSION\n)\nenricher.fit(X, y)\n```\n\n### Exclude premium data sources from fit, transform and metrics calculation\n\nThe `fit`, `fit_transform`, `transform` and `calculate_metrics` methods of `FeaturesEnricher` accept the `exclude_features_sources` parameter to exclude Trial or Paid features from Premium data sources:\n```python\nenricher = FeaturesEnricher(\n search_keys={\"subscription_activation_date\": SearchKey.DATE}\n)\nenricher.fit(X, y, calculate_metrics=False)\nfeatures_info = enricher.get_features_info()\ntrial_features = features_info.loc[features_info[\"Feature type\"] == \"Trial\", \"Feature name\"].tolist()\npaid_features = features_info.loc[features_info[\"Feature type\"] == \"Paid\", \"Feature name\"].tolist()\nenricher.calculate_metrics(exclude_features_sources=(trial_features + paid_features))\nenricher.transform(X, exclude_features_sources=(trial_features + paid_features))\n```\n\n### Turn off autodetection for search key columns\nUpgini has autodetection of search keys enabled by default.\nTo turn it off, use `autodetect_search_keys=False`:\n\n```python\nenricher = FeaturesEnricher(\n search_keys={\"date\": SearchKey.DATE},\n autodetect_search_keys=False,\n)\n\nenricher.fit(X, y)\n```\n\n### Turn off removal of target outliers\nUpgini detects rows with target outliers for regression tasks. By default, such rows are dropped during metrics calculation. 
To turn off the removal of target\u2011outlier rows, use the `remove_outliers_calc_metrics=False` parameter in the `fit`, `fit_transform`, or `calculate_metrics` methods:\n\n```python\nenricher = FeaturesEnricher(\n search_keys={\"date\": SearchKey.DATE},\n)\n\nenricher.fit(X, y, remove_outliers_calc_metrics=False)\n```\n\n### Turn off feature generation on search keys\nUpgini attempts to generate features for email, date and datetime search keys. By default, this generation is enabled. To disable it, use the `generate_search_key_features` parameter of the `FeaturesEnricher` constructor:\n\n```python\nenricher = FeaturesEnricher(\n search_keys={\"date\": SearchKey.DATE},\n generate_search_key_features=False,\n)\n```\n\n## \ud83d\udd11 Open up all capabilities of Upgini\n\n[Register](https://profile.upgini.com) and get a free API key for exclusive data sources and features: 600M+ phone numbers, 350M+ emails, 2^32 IP addresses\n\n|Benefit|No Sign-up | Registered user |\n|--|--|--|\n|Enrichment with **date/datetime, postal/ZIP code and country keys** | Yes | Yes |\n|Enrichment with **phone number, hashed email/HEM and IP address keys** | No | Yes |\n|Email notification on **search task completion** | No | Yes |\n|Automated **feature generation with LLMs** from columns in a search dataset| Yes, *till 12/05/23* | Yes |\n|Email notification on **new data source activation** \ud83d\udd1c | No | Yes |\n\n## \ud83d\udc69\ud83c\udffb\u200d\ud83d\udcbb How to share data/features with the community?\nYou may publish ANY data that you consider royalty\u2011 or license\u2011free ([Open Data](http://opendatahandbook.org/guide/en/what-is-open-data/)) and potentially valuable for ML applications for **community usage**: \n1. Please Sign Up [here](https://profile.upgini.com)\n2. 
Copy *Upgini API key* from your profile and upload your data from the Upgini Python library with this key:\n```python\nimport pandas as pd\nfrom upgini.metadata import SearchKey\nfrom upgini.ads import upload_user_ads\nimport os\nos.environ[\"UPGINI_API_KEY\"] = \"your_long_string_api_key_goes_here\"\n# you can define a custom search key that might not yet be supported; just use the SearchKey.CUSTOM_KEY type\nsample_df = pd.read_csv(\"path_to_data_sample_file\")\nupload_user_ads(\"test\", sample_df, {\n \"city\": SearchKey.CUSTOM_KEY,\n \"stats_date\": SearchKey.DATE\n})\n```\n3. After data verification, search results on community data will be available in the usual way.\n\n## \ud83d\udee0 Getting Help & Community\nPlease note that we are still in beta.\nFor requests and support, in order of preference: \n[Slack community](https://4mlg.short.gy/join-upgini-community)\n[GitHub issues](https://github.com/upgini/upgini/issues) \n\n\u2757Please try to create bug reports that are:\n- **reproducible** - include steps to reproduce the problem.\n- **specific** - include as much detail as possible: which Python version, what environment, etc.\n- **unique** - do not duplicate existing open issues.\n- **scoped to a single bug** - one bug per report.\n\n## \ud83e\udde9 Contributing\nWe are not a large team, so we probably won't be able to:\n - implement smooth integration with the most common low-code ML libraries and platforms ([PyCaret](https://www.github.com/pycaret/pycaret), [H2O AutoML](https://github.com//h2oai/h2o-3/blob/master/h2o-docs/src/product/automl.rst), etc.)\n - implement all possible data verification and normalization capabilities for different types of search keys \nSo we need some help from the community!\n\nWe'll be happy about every **pull request** you open and every **issue** you report to make this library **even better**. 
Please note that it might sometimes take us a while to get back to you.\n**For major changes**, please open an issue first to discuss what you would like to change.\n#### Developing\nSome convenient ways to start contributing are: \n\u2699\ufe0f [**Open in Visual Studio Code**](https://open.vscode.dev/upgini/upgini) You can remotely open this repo in VS Code without cloning, or automatically clone and open it inside a docker container. \n\u2699\ufe0f **Gitpod** [Open in Gitpod](https://gitpod.io/#https://github.com/upgini/upgini) You can use Gitpod to launch a fully functional development environment right in your browser.\n\n## \ud83d\udd17 Useful links\n- [Simple sales prediction template notebook](#-simple-sales-prediction-for-retail-stores)\n- [Full list of Kaggle Guides & Examples](https://www.kaggle.com/romaupgini/code)\n- [Project on PyPI](https://pypi.org/project/upgini)\n- [More perks for registered users](https://profile.upgini.com)\n\n<sup>\ud83d\ude14 Found a typo or a bug in a code snippet? Our bad! <a href=\"https://github.com/upgini/upgini/issues/new?assignees=&title=readme%2Fbug\">\nPlease report it here</a></sup>\n",
"bugtrack_url": null,
"license": null,
"summary": "Intelligent data search & enrichment for Machine Learning",
"version": "1.2.146",
"project_urls": {
"Bug Reports": "https://github.com/upgini/upgini/issues",
"Homepage": "https://upgini.com/",
"Source": "https://github.com/upgini/upgini"
},
"split_keywords": [
"automl",
" data mining",
" data science",
" data search",
" machine learning"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "146d3a7d08b671777a414716482c22cbd4b93296f1a2d33508302f4632a75e30",
"md5": "1388534fe766c890f56c21fa826729da",
"sha256": "3ae6d7181e8cef775ed3615ecd726488c341755587cc2a0bfc70ad9d763310bc"
},
"downloads": -1,
"filename": "upgini-1.2.146-py3-none-any.whl",
"has_sig": false,
"md5_digest": "1388534fe766c890f56c21fa826729da",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": "<3.13,>=3.10",
"size": 305068,
"upload_time": "2025-11-12T06:11:50",
"upload_time_iso_8601": "2025-11-12T06:11:50.903048Z",
"url": "https://files.pythonhosted.org/packages/14/6d/3a7d08b671777a414716482c22cbd4b93296f1a2d33508302f4632a75e30/upgini-1.2.146-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "e41a5e7a48d287115bf54365482b600a702c65fde74842f6be3c14529db853b5",
"md5": "eeec7abd1017b8011e940a094d9de65f",
"sha256": "0ccd4b0d03d2c85df95baad77baf2c632916cad74b21213e407d3f104ce5ade2"
},
"downloads": -1,
"filename": "upgini-1.2.146.tar.gz",
"has_sig": false,
"md5_digest": "eeec7abd1017b8011e940a094d9de65f",
"packagetype": "sdist",
"python_version": "source",
"requires_python": "<3.13,>=3.10",
"size": 290854,
"upload_time": "2025-11-12T06:11:54",
"upload_time_iso_8601": "2025-11-12T06:11:54.153514Z",
"url": "https://files.pythonhosted.org/packages/e4/1a/5e7a48d287115bf54365482b600a702c65fde74842f6be3c14529db853b5/upgini-1.2.146.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-11-12 06:11:54",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "upgini",
"github_project": "upgini",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"circle": true,
"requirements": [
{
"name": "python-dateutil",
"specs": [
[
">=",
"2.8.0"
]
]
},
{
"name": "requests",
"specs": [
[
">=",
"2.8.0"
]
]
},
{
"name": "pandas",
"specs": [
[
">=",
"1.1.0"
],
[
"<",
"3.0.0"
]
]
},
{
"name": "numpy",
"specs": [
[
"<",
"3.0.0"
],
[
">=",
"1.19.0"
]
]
},
{
"name": "scikit-learn",
"specs": [
[
">=",
"1.3.0"
],
[
"<",
"1.8.0"
]
]
},
{
"name": "pydantic",
"specs": [
[
"<",
"3.0.0"
],
[
">",
"1.0.0"
]
]
},
{
"name": "fastparquet",
"specs": [
[
">=",
"0.8.1"
]
]
},
{
"name": "python-json-logger",
"specs": [
[
">=",
"3.3.0"
]
]
},
{
"name": "lightgbm",
"specs": [
[
">=",
"4.6.0"
]
]
},
{
"name": "shap",
"specs": [
[
">=",
"0.44.0"
]
]
},
{
"name": "pyjwt",
"specs": [
[
">=",
"2.8.0"
]
]
},
{
"name": "xhtml2pdf",
"specs": [
[
">=",
"0.2.11"
],
[
"<",
"0.3.0"
]
]
},
{
"name": "python-bidi",
"specs": [
[
"==",
"0.4.2"
]
]
},
{
"name": "ipywidgets",
"specs": [
[
">=",
"8.1.0"
]
]
},
{
"name": "jarowinkler",
"specs": [
[
">=",
"2.0.0"
]
]
},
{
"name": "levenshtein",
"specs": [
[
">=",
"0.25.1"
]
]
},
{
"name": "psutil",
"specs": [
[
">=",
"5.9.0"
]
]
},
{
"name": "category-encoders",
"specs": [
[
">=",
"2.8.1"
]
]
},
{
"name": "catboost",
"specs": [
[
">=",
"1.2.8"
]
]
},
{
"name": "more_itertools",
"specs": [
[
"==",
"10.7.0"
]
]
},
{
"name": "pyarrow",
"specs": [
[
"==",
"18.1.0"
]
]
},
{
"name": "black",
"specs": []
},
{
"name": "flake8",
"specs": []
},
{
"name": "flake8-bugbear",
"specs": []
},
{
"name": "pytest",
"specs": []
},
{
"name": "coverage",
"specs": []
},
{
"name": "pytest-cov",
"specs": []
},
{
"name": "pytest-datafiles",
"specs": []
},
{
"name": "pytest-timeout",
"specs": []
},
{
"name": "requests-mock",
"specs": []
},
{
"name": "unittest-xml-reporting",
"specs": []
},
{
"name": "pytest-parallel",
"specs": []
},
{
"name": "py",
"specs": []
},
{
"name": "build",
"specs": []
},
{
"name": "twine",
"specs": []
},
{
"name": "pytest-xdist",
"specs": []
},
{
"name": "pytest-testmon",
"specs": []
}
],
"tox": true,
"lcname": "upgini"
}