# Simulator
Simulator is a framework for training and evaluating recommendation algorithms on real or synthetic data. It is built on the PySpark library so that it can work with big data.
As part of the simulation process, the framework incorporates data generators, response functions, and other tools that allow the simulator to be used flexibly.
# Table of contents
* [Installation](#installation)
* [Quickstart](#quickstart)
* [Examples](#examples)
* [Build from sources](#build-from-sources)
* [Building documentation](#compile-documentation)
* [Running tests](#tests)
## Installation
```bash
pip install sim4rec
```
If the installation takes too long, try:
```bash
pip install sim4rec --use-deprecated=legacy-resolver
```
To install the dependencies with Poetry, run:
```bash
pip install --upgrade pip wheel poetry lightfm==1.17
poetry install
```
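To confirm that the installation succeeded, you can ask the interpreter which version of the package it sees. The sketch below uses only the standard library; the helper name `installed_version` is hypothetical and is not part of Sim4Rec.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package is not installed in the current environment
        return None


# Prints a version string if sim4rec is installed, otherwise None
print(installed_version("sim4rec"))
```

If this prints `None`, the package was most likely installed into a different environment than the one running the script.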
## Quickstart
The following example shows how to use the simulator to train a model iteratively by refitting the recommendation algorithm on the newly collected history log:
```python
import numpy as np
import pandas as pd

import pyspark.sql.types as st
from pyspark.ml import PipelineModel
from sim4rec.utils import pandas_to_spark
from sim4rec.modules import RealDataGenerator, Simulator
from sim4rec.response import NoiseResponse, BernoulliResponse

# `ucb` is a separate top-level module; it must be importable from your environment
from ucb import UCB
from replay.metrics import NDCG

# Schema of the interaction log: one row per (user, item) response
LOG_SCHEMA = st.StructType([
    st.StructField('user_idx', st.LongType(), True),
    st.StructField('item_idx', st.LongType(), True),
    st.StructField('relevance', st.DoubleType(), False),
    st.StructField('response', st.IntegerType(), False)
])

# Random feature tables: 100 users with 15 attributes, 30 items with 10 attributes
users_df = pd.DataFrame(
    data=np.random.normal(0, 1, size=(100, 15)),
    columns=[f'user_attr_{i}' for i in range(15)]
)
items_df = pd.DataFrame(
    data=np.random.normal(1, 1, size=(30, 10)),
    columns=[f'item_attr_{i}' for i in range(10)]
)

# A small initial history log
history_df = pandas_to_spark(pd.DataFrame({
    'user_idx': [1, 10, 10, 50],
    'item_idx': [4, 25, 26, 25],
    'relevance': [1.0, 0.0, 1.0, 1.0],
    'response': [1, 0, 1, 1]
}), schema=LOG_SCHEMA)

# Add integer keys and convert the feature tables to Spark dataframes
users_df['user_idx'] = np.arange(len(users_df))
items_df['item_idx'] = np.arange(len(items_df))

users_df = pandas_to_spark(users_df)
items_df = pandas_to_spark(items_df)

# Generators that sample from the real user and item tables
user_gen = RealDataGenerator(label='users_real')
item_gen = RealDataGenerator(label='items_real')
user_gen.fit(users_df)
item_gen.fit(items_df)
_ = user_gen.generate(100)
_ = item_gen.generate(30)

sim = Simulator(
    user_gen=user_gen,
    item_gen=item_gen,
    data_dir='test_simulator',
    user_key_col='user_idx',
    item_key_col='item_idx',
    log_df=history_df
)

# Response pipeline: Gaussian noise followed by a Bernoulli draw
noise_resp = NoiseResponse(mu=0.5, sigma=0.2, outputCol='__noise')
br = BernoulliResponse(inputCol='__noise', outputCol='response')
pipeline = PipelineModel(stages=[noise_resp, br])

model = UCB()
model.fit(log=history_df)

ndcg = NDCG()

train_ndcg = []
for i in range(10):
    # Sample 10% of the users for this iteration
    users = sim.sample_users(0.1).cache()

    # Recommend the top 5 unseen items for each sampled user
    recs = model.predict(
        log=sim.log, k=5, users=users, items=items_df, filter_seen_items=True
    ).cache()

    # Simulate the users' responses to the recommendations
    true_resp = sim.sample_responses(
        recs_df=recs,
        user_features=users,
        item_features=items_df,
        action_models=pipeline
    ).select('user_idx', 'item_idx', 'relevance', 'response').cache()

    sim.update_log(true_resp, iteration=i)

    # Score the recommendations against the positive responses
    train_ndcg.append(ndcg(recs, true_resp.filter(true_resp['response'] >= 1), 5))

    # Refit on the updated log, using the simulated response as relevance
    model.fit(sim.log.drop('relevance').withColumnRenamed('response', 'relevance'))

    users.unpersist()
    recs.unpersist()
    true_resp.unpersist()

print(train_ndcg)
```
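The Quickstart tracks NDCG@5 of the recommendations against the positive responses at every iteration. For readers unfamiliar with the metric, here is a minimal pure-Python sketch of the textbook binary-relevance NDCG@k for a single user. The function name `ndcg_at_k` is hypothetical, and the normalization used by `replay.metrics.NDCG` may differ in detail, so treat this as an illustration of the idea rather than the library's implementation.

```python
import math
from typing import Sequence, Set


def ndcg_at_k(recommended: Sequence[int], relevant: Set[int], k: int) -> float:
    """Binary-relevance NDCG@k for one user."""
    # DCG: a hit at 0-based rank r contributes 1 / log2(r + 2)
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    # Ideal DCG: all relevant items (at most k of them) ranked first
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    # Without any relevant items the metric is defined here as 0
    return dcg / idcg if idcg > 0 else 0.0


# Recommended items in ranked order versus the items the user responded to
print(ndcg_at_k([4, 25, 26], {25, 26}, k=3))
```

The score is 1.0 when every relevant item is ranked ahead of every irrelevant one, and it decreases as hits move further down the list.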
## Examples
You can find useful examples in the `notebooks` folder. They demonstrate how to use synthetic data generators and composite generators, evaluate generator scores, iteratively refit a recommendation algorithm, use response functions, and more.
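As a rough illustration of what a response function does, the sketch below mimics the noise-then-Bernoulli chain from the Quickstart in plain NumPy. It is not the Sim4Rec implementation: the function name `noisy_bernoulli_response` is hypothetical, and clipping the noisy score to the interval [0, 1] before sampling is an assumption made here so that the score is a valid probability.

```python
import numpy as np


def noisy_bernoulli_response(n, mu=0.5, sigma=0.2, seed=None):
    """Draw n click probabilities from Gaussian noise, then sample 0/1 responses."""
    rng = np.random.default_rng(seed)
    # Step 1: a noisy score for each (user, item) pair
    noise = rng.normal(mu, sigma, size=n)
    # Step 2 (assumption): clip the score so that it is a valid probability
    prob = np.clip(noise, 0.0, 1.0)
    # Step 3: one Bernoulli draw per pair
    response = rng.binomial(1, prob)
    return prob, response


prob, response = noisy_bernoulli_response(10, seed=0)
print(prob.round(2))
print(response)
```

With `mu=0.5`, roughly half of the simulated responses are positive on average, which is why the Quickstart filters on `response >= 1` before computing the metric.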
## Build from sources
```bash
poetry build
pip install ./dist/sim4rec-*.whl
```
## Compile documentation
```bash
cd docs
make clean && make html
```
## Tests
The tests use the pytest library. To run the tests for all modules, run the following command from the repository root directory:
```bash
pytest
```
## Licence
Sim4Rec is distributed under the [Apache License Version 2.0](https://github.com/sb-ai-lab/Sim4Rec/blob/main/LICENSE).
However, the SDV package, which Sim4Rec imports for synthetic data generation,
is distributed under the [Business Source License (BSL) 1.1](https://github.com/sdv-dev/SDV/blob/master/LICENSE).

Synthetic tabular data generation is not a purpose of the Sim4Rec framework.
Sim4Rec offers an API and wrappers to run simulations with synthetic data, but the method of synthetic data generation is chosen by the user.
The SDV package is imported for illustration purposes and may be replaced by another synthetic data generation solution.

Thus, the synthetic data generation functionality and quality evaluation that rely on the SDV library,
namely `SDVDataGenerator` from [generator.py](sim4rec/modules/generator.py) and `evaluate_synthetic` from [evaluation.py](sim4rec/modules/evaluation.py),
should be used for non-production purposes only, in accordance with the SDV License.