# NCS Package - Machine Learning in Business
![Build and Test](https://github.com/colingwuyu/rotman_ncs/actions/workflows/python-build.yml/badge.svg)
![Publish](https://github.com/colingwuyu/rotman_ncs/actions/workflows/python-publish.yml/badge.svg)
[![PyPI version](https://badge.fury.io/py/rotman-ncs.svg)](https://pypi.org/project/rotman-ncs)
The NCS package is designed for a case study in the "Machine Learning in Business" course at UoT Rotman. The goal of this project is to demonstrate the potential of Natural Language Processing (NLP) tools for extracting valuable insights from unstructured text data.
Using earnings call transcripts from S&P 500 constituents, we develop an investment strategy and measure its performance against the S&P 500 index.
## Data Description
The data used in this case study comprises earnings call transcripts and historical adjusted stock prices for S&P 500 constituents, collected from Fool.com and Yahoo Finance, respectively, covering the years 2018 to 2023.
The transcript data includes speaker identities and titles, which allows us to differentiate between the comments of, for example, a CEO and a financial analyst.
## Methodology
We explore the evolution of NLP techniques, from traditional methods to sophisticated large language models. We demonstrate their application to real-world data, charting the journey from basic bag-of-words models to sentiment analysis and topic modelling.
The NLP techniques used in this study are:
- **TF-IDF (Term Frequency-Inverse Document Frequency)**: Though a traditional method, TF-IDF effectively measures how important a word is to a document within a corpus (a short sketch follows below).
- **Word2Vec**: This technique creates dense vector representations of words, capturing their semantic relationships.
- **BERT (Bidirectional Encoder Representations from Transformers)**: BERT uses the transformer architecture to model a word's meaning from both its left and right context.
- **Large Language Models**: Trained on vast amounts of text data, these advanced models are capable of generating coherent text, understanding sentiment, completing prompts, and more.
Additional methods used for feature extraction are sentiment analysis and topic modelling.
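As a minimal sketch of the first technique above, the snippet below converts a few made-up, earnings-call-style sentences into TF-IDF features with scikit-learn. The sentences and variable names are illustrative assumptions and are not part of the `rotman-ncs` package or its data.
```python
# Minimal TF-IDF sketch (illustrative only, not part of the package).
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy statements standing in for earnings call sentences
statements = [
    'Revenue grew double digits this quarter',
    'We expect margin pressure from higher input costs',
    'Strong demand drove record revenue and earnings growth',
]

vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(statements)  # sparse matrix, shape (n_docs, n_terms)

# Terms with the highest TF-IDF weight in the first statement
terms = vectorizer.get_feature_names_out()
weights = tfidf_matrix[0].toarray().ravel()
print(sorted(zip(terms, weights), key=lambda tw: -tw[1])[:3])
```
The same idea scales to the full corpus of call statements; the solution notebooks walk through the package-specific workflow.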
## Expected Outcome
Using the insights derived from these NLP techniques, we construct an investment strategy. This strategy uses sentiment scores, main topics, and contextual embeddings from earnings calls to predict future stock returns, showing the practical application of NLP in investment decision-making. The performance of this strategy is compared to the S&P 500 index.
## Python Package
This Python package, `rotman-ncs`, contains a suite of functions for working with financial earnings call data. The functions are broken down into three categories: data loading, model training and inference, and strategy execution and analysis.
### Data Loading
There are four main functions for loading different types of data:
1. `load_call_description(data_type='train')`: Loads the earnings call description data. The `data_type` parameter is a string specifying which split to load, either 'train' or 'test'.
2. `load_call_statements(data_type='train')`: Loads the earnings call statements data. Again, `data_type` can be either 'train' or 'test'.
3. `load_stock_history()`: Loads the stock price history data.
4. `load_stock_returns_on_calls(data_type='train')`: Loads stock returns on calls data, with `data_type` specifying the type of data to load.
```python
import ncs

# Load the training split of the data
call_descriptions = ncs.load_call_description(data_type='train')
call_statements = ncs.load_call_statements(data_type='train')
stock_history = ncs.load_stock_history()
stock_returns = ncs.load_stock_returns_on_calls(data_type='train')
```
### Model Training and Inference
The `train` function trains a model on the provided feature files and parameters; the `inference` function uses the trained model to generate actions for the test set.
```python
# Train a model on the provided feature files
ncs.train(feature_files=list_of_feature_files)  # list_of_feature_files: a list of feature file paths

# Generate actions for the test set with the trained model
ncs.inference(feature_files=list_of_feature_files, model_file='model.pkl', action_file='actions.csv')
```
### Strategy Execution and Analysis
These functions are used to execute investment strategies and analyze their performance:
1. `ncs.run_strategy(action_file, holding_period, log_file, save_portfolio_path)`: Runs the investment strategy using the provided action file, holding period, log file, and save portfolio path.
2. `ncs.report_strategy_analysis(actions, portfolio, holding_period, model_name)`: Generates a report of the strategy analysis for a given set of actions and portfolio.
3. `ncs.demo_benchmark(strategy, holding_period)`: Runs a benchmark strategy, either 'spy' or 'random', with a specified holding period.
```python
# Run strategy
ncs.run_strategy(action_file='actions.csv', holding_period=5)
# Generate strategy report
ncs.report_strategy_analysis(actions='actions.csv', portfolio='portfolio.parquet', holding_period=5)
# Generate benchmark strategy report
ncs.demo_benchmark(strategy='random', holding_period=5)
```
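To have the portfolio file in the report come from the strategy run itself, one plausible wiring uses the `log_file` and `save_portfolio_path` parameters listed above; the specific file names and the model name below are illustrative assumptions, not prescribed by the package.
```python
# Run the strategy, persist its portfolio, and report against that portfolio.
# File names and model_name are illustrative; the keyword arguments come from
# the signatures listed above.
ncs.run_strategy(
    action_file='actions.csv',
    holding_period=5,
    log_file='strategy.log',
    save_portfolio_path='portfolio.parquet',
)
ncs.report_strategy_analysis(
    actions='actions.csv',
    portfolio='portfolio.parquet',
    holding_period=5,
    model_name='tfidf_model',
)
```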
The key dataset in the case study is the `call_statements` DataFrame, which contains unique identifiers (`statement_uid`, `call_uid`) along with text data (`text`, `clean_text`). The `text` field holds the original call transcript, while `clean_text` is the result of several text-cleaning steps. Apply different NLP models to convert these texts into numeric features that the model training functions can consume; the trained model then uses those features to drive the investment strategy.
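To make the feature-extraction step concrete, here is a hedged sketch that builds one possible feature file from `clean_text` using TF-IDF followed by truncated SVD. The column layout, the parquet file name, and the assumption that `ncs.train` accepts such a file are ours; the solution notebooks show the schema the package actually expects.
```python
import ncs
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Vectorize the cleaned transcripts and compress them into dense features
call_statements = ncs.load_call_statements(data_type='train')
tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
svd = TruncatedSVD(n_components=50, random_state=0)
dense = svd.fit_transform(tfidf.fit_transform(call_statements['clean_text']))

# Assemble a feature table keyed by statement_uid (this schema is an assumption)
features = pd.DataFrame(dense, columns=[f'tfidf_svd_{i}' for i in range(dense.shape[1])])
features.insert(0, 'statement_uid', call_statements['statement_uid'].values)

# Persist the features; the file name is illustrative
features.to_parquet('tfidf_features.parquet', index=False)
# ncs.train(feature_files=['tfidf_features.parquet'])
```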
## Usage and Examples
Please refer to the following notebooks for detailed examples and usage instructions:
- [TF-IDF Solution Notebook](https://colab.research.google.com/drive/1JYQK1IfBEBkOKRbAnNChC-aZrQsUcfGu?usp=sharing)
- Word2Vec Solution (Coming Soon)
- BERT Solution (Coming Soon)
- LLM (Large Language Model) Solution (Coming Soon)
These notebooks demonstrate how to utilize different NLP techniques for processing text data and building investment strategies.
Make sure to follow the instructions provided in each notebook for proper setup and execution of the code.
If you have any questions or need further assistance, please feel free to reach out.
Happy coding!
## Enjoy the Learning and Investing 🥳