# TextRegress
TextRegress is a Python package designed to help researchers perform linear regression analysis on text data. It supports:
- Configurable text encoding using SentenceTransformer or custom methods (e.g., TF-IDF).
- Automatic text chunking for long documents.
- A deep learning backend based on PyTorch Lightning with RNN (LSTM/GRU) layers.
- Integration of exogenous features through standard normalization and attention mechanisms.
- An sklearn-like API with `fit`, `predict`, and `fit_predict` methods.
## Installation
TextRegress requires Python 3.6 or higher. You can install it directly from the repository:
```bash
git clone https://github.com/jinhangjiang/textregress.git
cd textregress
pip install -e .
```
You can also install it from PyPI:
```bash
pip install textregress
```
## Features
- **Unified DataFrame Interface**
  The estimator methods (`fit`, `predict`, `fit_predict`) accept a single pandas DataFrame with:
  - **`text`**: Input text data (can be long-form text).
  - **`y`**: Continuous target variable.
  - Additional columns can be provided as exogenous features.
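
As a rough illustration of the expected input layout, the sketch below builds a small DataFrame with the `text` and `y` columns named above. The extra columns `word_count` and `is_weekend` are made up for this example, and treating every remaining column as exogenous is an assumption about how the estimator reads the frame.

```python
import pandas as pd

# `text` and `y` are the column names this README specifies.
# `word_count` and `is_weekend` are hypothetical exogenous features.
df = pd.DataFrame(
    {
        "text": [
            "Quarterly revenue beat expectations on strong cloud demand.",
            "The product launch was delayed by supply chain problems.",
            "Customer churn stayed flat while support costs fell.",
        ],
        "y": [0.8, -0.4, 0.1],
        "word_count": [8, 9, 8],
        "is_weekend": [0, 1, 0],
    }
)

# Assumption: any column other than `text` and `y` counts as exogenous.
exogenous_columns = [c for c in df.columns if c not in ("text", "y")]
print(exogenous_columns)
```

A frame shaped like this is what would be passed to `fit`, `predict`, or `fit_predict`.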
- **Configurable Text Encoding**
  Choose from multiple encoding methods:
  - **TF-IDF Encoder:** Activated when the model identifier contains `"tfidf"`.
  - **SentenceTransformer Encoder:** Activated when the model identifier contains `"sentence-transformers"`.
  - **Generic Hugging Face Encoder:** Supports any pre-trained Hugging Face model using `AutoTokenizer`/`AutoModel` with a mean-pooling strategy.
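
The selection rule and the mean-pooling step can be sketched in plain Python. This is not the package's code: the function names `select_encoder_kind` and `mean_pool` are hypothetical, and the order of the checks and the case-sensitive matching are assumptions.

```python
from typing import List, Sequence


def select_encoder_kind(model_id: str) -> str:
    """Map a model identifier to an encoder family by substring match.

    Assumption: "tfidf" is checked first and matching is case-sensitive.
    """
    if "tfidf" in model_id:
        return "tfidf"
    if "sentence-transformers" in model_id:
        return "sentence-transformer"
    # Anything else falls through to the generic Hugging Face path.
    return "huggingface"


def mean_pool(token_vectors: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


print(select_encoder_kind("sentence-transformers/all-MiniLM-L6-v2"))
print(mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0]))
```

In the second call the third vector is treated as padding, so only the first two contribute to the average.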
- **Text Chunking**
  Automatically splits long texts into overlapping, fixed-size chunks to ensure a consistent input size. Only full chunks are processed, so trailing text that does not fill a complete chunk is dropped.
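
A minimal sketch of overlapping, fixed-size chunking that keeps only full chunks is shown below. The function name `chunk_tokens` and its parameters are hypothetical, and splitting on whitespace is an assumption; this README does not say whether the package counts words, tokens, or characters.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into overlapping windows of exactly `chunk_size` items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    # The range stops at the last start index that still yields a full chunk,
    # so a shorter trailing remainder is never emitted.
    return [
        list(tokens[start:start + chunk_size])
        for start in range(0, len(tokens) - chunk_size + 1, step)
    ]


words = "one two three four five six seven eight nine ten eleven".split()
for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
    print(chunk)
```

With 11 words, a chunk size of 4, and an overlap of 1, the windows start at positions 0, 3, and 6. The last word is left out because it cannot start a full chunk.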
- **Deep Learning Regression Model**
  Utilizes an RNN-based (LSTM/GRU) network implemented with PyTorch Lightning:
  - Customizable number of layers, hidden size, and bidirectionality.
  - Optionally integrates exogenous features into the regression process.
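
The overview mentions that exogenous features go through standard normalization. The sketch below shows one common reading of that term, z-scoring each column with statistics learned from the training rows. The helper names are hypothetical, and the use of the population standard deviation (as scikit-learn's `StandardScaler` does) is an assumption about the package.

```python
import statistics
from typing import List, Sequence, Tuple


def fit_standardizer(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Learn per-column mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(rows: Sequence[Sequence[float]],
                means: Sequence[float],
                stds: Sequence[float]) -> List[List[float]]:
    """Apply (value - mean) / std column by column."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


train_rows = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
means, stds = fit_standardizer(train_rows)
scaled = standardize(train_rows, means, stds)
print(scaled)
```

Fitting on the training rows only and reusing the same `means` and `stds` at prediction time keeps validation data from leaking into the scaling.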
- **Custom Loss Functions**
  Multiple loss functions are available via `loss.py`:
  - MAE (default)
  - SMAPE
  - MSE
  - RMSE
  - wMAPE
  - MAPE
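
For reference, the listed metrics can be written out in plain Python as below. This is a sketch of the common textbook definitions, not the contents of `loss.py`: the package works on PyTorch tensors, SMAPE has several variants in the literature, and whether the percentage metrics are scaled by 100 is not stated here. The versions below return fractions.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return math.sqrt(mse(y_true, y_pred))


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    # Undefined when a target is exactly zero (raises ZeroDivisionError).
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def smape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    # Assumed variant: 2|t - p| / (|t| + |p|), averaged over samples.
    return sum(2 * abs(t - p) / (abs(t) + abs(p))
               for t, p in zip(y_true, y_pred)) / len(y_true)


def wmape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    # Total absolute error weighted by the total magnitude of the targets.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / sum(abs(t) for t in y_true)


y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]
for name, fn in [("MAE", mae), ("SMAPE", smape), ("MSE", mse),
                 ("RMSE", rmse), ("wMAPE", wmape), ("MAPE", mape)]:
    print(f"{name}: {fn(y_true, y_pred):.4f}")
```

MAE and RMSE are in the units of `y`, while MAPE, SMAPE, and wMAPE are relative errors, which matters when targets span very different scales.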
- **Training Customization**
  Fine-tune training behavior with parameters such as:
  - `max_steps`: Maximum training steps (default: 500).
  - `early_stop_enabled`: Enable early stopping (default: False).
  - `patience_steps`: Steps with no improvement before stopping (default: 10 when early stopping is enabled).
  - `val_check_steps`: Validation check interval (default: 50, automatically adjusted if needed).
  - `val_size`: Proportion of data reserved for validation when early stopping is enabled.
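
The defaults above can be collected in a small configuration object, sketched below. The `TrainingConfig` class is hypothetical and only mirrors the documented parameter names and defaults. The exact adjustment rule for `val_check_steps` is not described in this README, so capping it at `max_steps` is purely an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainingConfig:
    """Hypothetical container for the documented training parameters."""
    max_steps: int = 500
    early_stop_enabled: bool = False
    patience_steps: Optional[int] = None   # resolved to 10 if early stopping is on
    val_check_steps: int = 50
    val_size: Optional[float] = None       # no default is documented

    def resolved(self) -> dict:
        patience = self.patience_steps
        if self.early_stop_enabled and patience is None:
            patience = 10  # documented default when early stopping is enabled
        # Assumption: "automatically adjusted" means never exceeding max_steps.
        val_check = min(self.val_check_steps, self.max_steps)
        return {
            "max_steps": self.max_steps,
            "early_stop_enabled": self.early_stop_enabled,
            "patience_steps": patience,
            "val_check_steps": val_check,
            "val_size": self.val_size,
        }


print(TrainingConfig().resolved())
print(TrainingConfig(max_steps=20, early_stop_enabled=True, val_size=0.2).resolved())
```

The second call shows a short run with early stopping enabled, where the patience default is filled in and the validation interval is reduced to fit inside the step budget.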
- **GPU Auto-Detection**
  Automatically leverages available GPUs via PyTorch Lightning (using `accelerator="auto"` and `devices="auto"`).
- **Dynamic Embedding Dimension Handling**
  The model dynamically detects the encoder’s output dimension (e.g., 384 for `"sentence-transformers/all-MiniLM-L6-v2"`) and configures the RNN input accordingly.
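
One simple way to detect an encoder's output dimension is to encode a probe string and measure the result, as sketched below. The `toy_encode` function is a stand-in so the example runs without downloading a model, and `detect_embedding_dim` is a hypothetical helper. How the package actually performs the detection is not documented here. For SentenceTransformer models, the library also exposes `get_sentence_embedding_dimension()`.

```python
from typing import Callable, List, Sequence


def detect_embedding_dim(encode: Callable[[Sequence[str]], List[List[float]]]) -> int:
    """Encode one probe text and return the length of its vector."""
    probe = encode(["dimension probe"])
    return len(probe[0])


def toy_encode(texts: Sequence[str]) -> List[List[float]]:
    """Stand-in encoder producing 3 features: characters, words, vowels."""
    return [
        [
            float(len(t)),
            float(len(t.split())),
            float(sum(ch in "aeiou" for ch in t.lower())),
        ]
        for t in texts
    ]


# The detected size would be used as the RNN's input size.
input_size = detect_embedding_dim(toy_encode)
print(input_size)
```

Probing the encoder once at setup avoids hard-coding a dimension per model, so swapping encoders does not require touching the network definition.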
- **Extensive Testing Suite**
  Tests cover the utility functions, encoder types, and estimator functionality, which makes the package easier to maintain and extend.
## Package Metadata
- **Name:** TextRegress
- **Version:** 1.0.0
- **Summary:** A Python package for performing linear regression analysis on text data.
- **Authors:** Jinhang Jiang, Weiyao Peng, Karthik Srinivasan (jinhang@asu.edu)
- **License:** MIT License
- **Requires Python:** >=3.6
- **Keywords:** text, predictive NLP, machine learning, NLP, linear regression
- **Homepage / Repository:** https://github.com/jinhangjiang/textregress
- **Uploaded:** 2025-02-16
- **Distributions:**
  - `TextRegress-1.0.0-py3-none-any.whl` (wheel, 11841 bytes, sha256 `786c116f92bf77794d3b455f1976323f9c91aac78bf9057f09698e3ed9e87f3a`)
  - `textregress-1.0.0.tar.gz` (sdist, 12196 bytes, sha256 `81e7faa6f2c56b65a69b94f75e5348171be2c8bc8590fc8a9eaafd8002ca4530`)
- **Requirements:**
  - `pandas~=1.5.3`
  - `numpy~=1.23.5`
  - `scikit-learn~=1.2.2`
  - `tqdm~=4.64.1`
  - `torch~=2.0.1`
  - `pytorch-lightning~=2.0.2`
  - `sentence-transformers~=2.2.2`
  - `transformers~=4.27.0`