# mlslib
[PyPI version](https://badge.fury.io/py/mlslib) · [Python 3](https://www.python.org/downloads/) · [MIT License](https://opensource.org/licenses/MIT)
A lightweight utility library that simplifies working with Google Cloud Storage, BigQuery, and DataFrame evaluation on Google Cloud Platform (GCP). It provides high-level functions that streamline common data engineering, data science, and evaluation workflows.
---
## 🚀 Key Features
- **Google Cloud Storage Integration**: Upload pandas or Spark DataFrames to GCS
- **File Management**: Upload any local file (CSV, Parquet, Pickle, etc.) to GCS
- **Public Access**: Make GCS files public and get downloadable links
- **BigQuery Integration**: Query BigQuery tables directly into Spark DataFrames
- **Notebook Display**: Beautifully display PySpark DataFrames in Jupyter notebooks
- **Data Sampling**: Perform session-based sampling on pandas and Spark DataFrames
- **Evaluation Utilities**: Calculate MRR, save metrics, and display evaluation results
---
## 📦 Installation
Install `mlslib` directly from PyPI:
```bash
pip install mlslib
```
### Dependencies
- `ipython>=7.0.0` - For notebook display functionality
- `pyarrow>=6.0.0` - For efficient data serialization
- `python-dateutil>=2.8.2`
**Note:** Some functions require `google-cloud-storage` and `pyspark` to be installed in your environment.
---
## 🛠️ Setup
Before using `mlslib`, ensure you have:
1. **Google Cloud SDK** installed and configured
2. **Authentication** set up (a service account key or `gcloud auth`)
3. **Required packages** installed:
```bash
pip install google-cloud-storage pyspark
```
---
## 📖 Usage
### Importing Key Functions
```python
from mlslib import (
    display_df, download_csv, load_bigquery_table_spark,
    sample_by_session, upload_df_to_gcs, upload_df_to_gcs_csv,
    calculate_mrr, save_metrics_to_json, display_mrr_comparison
)
```
### Google Cloud Storage Utilities
```python
from mlslib.gcs_utils import upload_file_to_gcs, upload_df_to_gcs
# ... see full usage in the API Reference below ...
```
### BigQuery Utilities
```python
from mlslib.bigquery_utils import load_bigquery_table_spark
```
### Evaluation Utilities
#### Calculate Mean Reciprocal Rank (MRR)
```python
from mlslib import calculate_mrr
results = calculate_mrr(
    df=my_dataframe,
    position_col="rank",
    label_col="is_relevant",
    group_by_cols=["store_id"]
)
print(results)
```
#### Save Metrics to JSON
```python
from mlslib import save_metrics_to_json
save_metrics_to_json(results, "metrics.json")
```
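The saved file is plain JSON, so downstream tooling can read it with the standard library. A stdlib-only sketch of the equivalent behavior (the helper name and pretty-printing here are illustrative, not mlslib's exact implementation):

```python
import json
from pathlib import Path

def save_metrics(metrics: dict, output_path: str) -> None:
    # Serialize a metrics dict to pretty-printed JSON on disk
    # (illustrative stand-in for mlslib's save_metrics_to_json).
    Path(output_path).write_text(json.dumps(metrics, indent=2))

save_metrics({"mrr": 0.75, "n_queries": 2}, "metrics.json")
print(json.loads(Path("metrics.json").read_text())["mrr"])  # 0.75
```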
#### Display MRR Comparison
```python
from mlslib import display_mrr_comparison
# results_list = [results1, results2, ...]
display_mrr_comparison(results_list)
```
---
## 🗂️ Project Structure
```
mlslib/
├── __init__.py         # Package initialization and exports
├── gcs_utils.py        # Google Cloud Storage utilities
├── bigquery_utils.py   # BigQuery integration utilities
├── display_utils.py    # Notebook display utilities
├── sampling_utils.py   # Data sampling utilities
├── date_utils.py       # Date range utilities
└── evaluate_utils.py   # Evaluation utilities (MRR, metrics)
```
---
## 📚 API Reference
### gcs_utils
- `upload_file_to_gcs(file_path, bucket_name, gcs_path)`
- `upload_df_to_gcs(df, bucket_name, gcs_path, format='parquet')`
- `download_csv(bucket_name, file_path)`
### bigquery_utils
- `load_bigquery_table_spark(spark, sql_query, table_name, project_id, dataset_id)`
### display_utils
- `display_df(df, limit_rows=50, title=None)`
### sampling_utils
- `sample_by_session(df, session_column, fraction, seed=None)`
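The idea behind session-based sampling is to sample whole sessions rather than individual rows, so each sampled session keeps all of its rows together. A stdlib-only sketch of that idea (mlslib's `sample_by_session` operates on pandas/Spark DataFrames; the list-of-dicts form here is illustrative):

```python
import random

def sample_by_session_rows(rows, session_key, fraction, seed=None):
    """Keep every row whose session falls in a random `fraction` of sessions.

    Sampling at the session level (not the row level) preserves each
    selected session in full.
    """
    sessions = sorted({row[session_key] for row in rows})
    rng = random.Random(seed)  # seeded for reproducibility
    k = max(1, round(len(sessions) * fraction))
    keep = set(rng.sample(sessions, k))
    return [row for row in rows if row[session_key] in keep]

rows = [{"session": s, "item": i} for s in ("a", "b", "c", "d") for i in range(3)]
sampled = sample_by_session_rows(rows, "session", 0.5, seed=42)
print({r["session"] for r in sampled})  # two whole sessions
```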
### date_utils
- `generate_periodic_date_ranges(start_date_str, num_periods, period_days)`
- `get_relative_day_range(days, offset_days=0)`
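As a rough illustration of what a periodic date-range helper computes, here is a stdlib sketch that splits a start date into consecutive fixed-length periods (the return format and inclusive-end convention are assumptions; mlslib's `generate_periodic_date_ranges` may differ):

```python
from datetime import date, timedelta

def periodic_date_ranges(start_date_str, num_periods, period_days):
    """Return (start, end) ISO-date pairs for consecutive fixed-length periods.

    Each period spans `period_days` days, with an inclusive end date.
    """
    start = date.fromisoformat(start_date_str)
    ranges = []
    for i in range(num_periods):
        period_start = start + timedelta(days=i * period_days)
        period_end = period_start + timedelta(days=period_days - 1)
        ranges.append((period_start.isoformat(), period_end.isoformat()))
    return ranges

print(periodic_date_ranges("2025-01-01", 2, 7))
# [('2025-01-01', '2025-01-07'), ('2025-01-08', '2025-01-14')]
```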
### evaluate_utils
- `calculate_mrr(df, position_col, label_col, group_by_cols=None)`
- `save_metrics_to_json(metrics, output_path)`
- `display_mrr_comparison(results_list, model_col='Model', test_set_col='Test Set')`
---
## 🤝 Contributing
Contributions are welcome! Please open issues or submit pull requests for bug fixes, improvements, or new features.
---
## 📄 License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
---
## 📢 Contact
**Author:** Raj Jha  
**Email:** rjha4@wayfair.com
---
## 🚀 Publishing to PyPI
1. **Update the version** in `setup.py`
2. **Build the package:**
```bash
python -m pip install --upgrade build
python -m build
```
3. **Upload to PyPI:**
```bash
python -m pip install --upgrade twine
twine upload dist/*
```
4. **(Optional) Test on TestPyPI first:**
```bash
twine upload --repository testpypi dist/*
```
---