TextRegress

Name	TextRegress
Version	1.0.0
home_page	None
Summary	A Python package for performing linear regression analysis on text data.
upload_time	2025-02-16 07:01:53
maintainer	None
docs_url	None
author	Jinhang Jiang, Weiyao Peng, Karthik Srinivasan
requires_python	>=3.6
license	MIT License
keywords	text, predictive NLP, machine learning, NLP, linear regression
requirements	pandas, numpy, scikit-learn, tqdm, torch, pytorch-lightning, sentence-transformers, transformers
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

# TextRegress

TextRegress is a Python package designed to help researchers perform linear regression analysis on text data. It supports:
- Configurable text encoding using SentenceTransformer or custom methods (e.g., TFIDF).
- Automatic text chunking for long documents.
- A deep learning backend based on PyTorch Lightning with RNN (LSTM/GRU) layers.
- Integration of exogenous features through standard normalization and attention mechanisms.
- An sklearn-like API with `fit`, `predict`, and `fit_predict` methods.

## Installation

TextRegress requires Python 3.6 or higher. You can install it directly from the repository:

```bash
git clone https://github.com/jinhangjiang/textregress.git
cd textregress
pip install -e .
```

You can also install it from PyPI:

```bash
pip install textregress
```

## Features

- **Unified DataFrame Interface**  
  The estimator methods (`fit`, `predict`, `fit_predict`) accept a single pandas DataFrame with:
  - **`text`**: Input text data (can be long-form text).
  - **`y`**: Continuous target variable.
  - Additional columns can be provided as exogenous features.
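
As a minimal sketch of the input described above, the following builds such a DataFrame with pandas. The column names `text` and `y` come from the list above; `word_count` and `is_weekend` are made-up exogenous features used only for illustration, and the estimator call itself is left out because its constructor arguments are not documented here.

```python
import pandas as pd

# Each row is one document: `text` holds the raw text, `y` the continuous target.
# `word_count` and `is_weekend` are hypothetical exogenous features.
df = pd.DataFrame(
    {
        "text": [
            "Quarterly revenue rose on strong cloud demand.",
            "The product launch was delayed by supply issues.",
            "Customer churn fell after the pricing change.",
        ],
        "y": [0.82, -0.35, 0.41],
        "word_count": [7, 8, 7],
        "is_weekend": [0, 1, 0],
    }
)

# Columns other than `text` and `y` are the ones that would serve as exogenous features.
exog_columns = [c for c in df.columns if c not in ("text", "y")]
print(exog_columns)
```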

- **Configurable Text Encoding**  
  Choose from multiple encoding methods:
  - **TFIDF Encoder:** Activated when the model identifier contains `"tfidf"`.
  - **SentenceTransformer Encoder:** Activated when the model identifier contains `"sentence-transformers"`.
  - **Generic Hugging Face Encoder:** Supports any pre-trained Hugging Face model using `AutoTokenizer`/`AutoModel` with a mean-pooling strategy.
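
The selection rule above can be read as a substring check on the model identifier. The sketch below illustrates that reading; `select_encoder_kind` is a hypothetical helper, not a function from the package, and the order of the checks and the case-sensitive matching are assumptions.

```python
def select_encoder_kind(model_id: str) -> str:
    """Map a model identifier to an encoder family (hypothetical helper)."""
    # Assumption: "tfidf" is checked first, and matching is case-sensitive.
    if "tfidf" in model_id:
        return "tfidf"
    if "sentence-transformers" in model_id:
        return "sentence-transformer"
    # Any other identifier is assumed to name a generic Hugging Face model.
    return "huggingface"


for model_id in ("tfidf", "sentence-transformers/all-MiniLM-L6-v2", "bert-base-uncased"):
    print(model_id, "->", select_encoder_kind(model_id))
```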

- **Text Chunking**  
  Automatically splits long texts into overlapping, fixed-size chunks (only full chunks are processed) to ensure consistent input size.
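
To make the chunking rule concrete, here is a small sketch of overlapping, fixed-size chunking that drops any trailing partial chunk. It assumes the unit is whitespace-separated words; the package may chunk by tokens or characters instead, and `chunk_tokens`, `chunk_size`, and `overlap` are illustrative names rather than the package's API.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split tokens into overlapping chunks of exactly chunk_size items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap
    # Stop at the last start index that still yields a full chunk,
    # so a trailing partial chunk is never produced.
    return [
        tokens[start:start + chunk_size]
        for start in range(0, len(tokens) - chunk_size + 1, stride)
    ]


tokens = "a b c d e f g h i j k".split()  # 11 words
chunks = chunk_tokens(tokens, chunk_size=4, overlap=2)
for chunk in chunks:
    print(chunk)
```

With 11 words, a chunk size of 4, and an overlap of 2, the starts are 0, 2, 4, and 6, so the final word `k` does not fit into a full chunk and is left out.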

- **Deep Learning Regression Model**  
  Utilizes an RNN-based (LSTM/GRU) network implemented with PyTorch Lightning:
  - Customizable number of layers, hidden size, and bidirectionality.
  - Optionally integrates exogenous features into the regression process.

- **Custom Loss Functions**  
  Multiple loss functions are available via `loss.py`:
  - MAE (default)
  - SMAPE
  - MSE
  - RMSE
  - wMAPE
  - MAPE
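
For reference, the metrics listed above can be written out in plain Python as follows. This is a sketch of the common textbook definitions, not the contents of the package's `loss.py`: the percentage-style metrics are returned as fractions rather than percentages, SMAPE uses the `2|t - p| / (|t| + |p|)` form, and no guard against zero denominators is included.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    return math.sqrt(mse(y_true, y_pred))


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    # Undefined when any true value is zero.
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def smape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    # One common SMAPE variant; other definitions differ by a factor of 2.
    return sum(
        2 * abs(t - p) / (abs(t) + abs(p)) for t, p in zip(y_true, y_pred)
    ) / len(y_true)


def wmape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    # Total absolute error weighted by the total magnitude of the targets.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / sum(abs(t) for t in y_true)


y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]
for fn in (mae, smape, mse, rmse, wmape, mape):
    print(fn.__name__, round(fn(y_true, y_pred), 4))
```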

- **Training Customization**  
  Fine-tune training behavior with parameters such as:
  - `max_steps`: Maximum training steps (default: 500).
  - `early_stop_enabled`: Enable early stopping (default: False).
  - `patience_steps`: Steps with no improvement before stopping (default: 10 when early stopping is enabled).
  - `val_check_steps`: Validation check interval (default: 50, automatically adjusted if needed).
  - `val_size`: Proportion of data reserved for validation when early stopping is enabled.
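
The interaction between `max_steps`, `val_check_steps`, and `patience_steps` can be pictured with a small simulation. This is a sketch, not the package's training loop: `PatienceTracker` is a hypothetical class, the validation losses are made up, and patience is assumed to be counted in validation checks rather than individual training steps.

```python
class PatienceTracker:
    """Signal a stop after `patience` validation checks without improvement."""

    def __init__(self, patience: int = 10) -> None:
        self.patience = patience
        self.best = float("inf")
        self.bad_checks = 0

    def update(self, val_loss: float) -> bool:
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


max_steps, val_check_steps = 500, 50
tracker = PatienceTracker(patience=2)
val_losses = [0.90, 0.70, 0.71, 0.72, 0.60]  # made-up losses, one per validation check

stopped_at = None
for check, loss in enumerate(val_losses, start=1):
    step = check * val_check_steps  # validation runs every val_check_steps steps
    if step > max_steps:
        break
    if tracker.update(loss):
        stopped_at = step
        break

print("stopped at step", stopped_at, "with best validation loss", tracker.best)
```

In this run the loss stops improving after the second check, so with a patience of 2 the loop ends at step 200 and never sees the later, lower loss. That is the usual trade-off when choosing a small patience value.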

- **GPU Auto-Detection**  
  Automatically leverages available GPUs via PyTorch Lightning (using `accelerator="auto"` and `devices="auto"`).

- **Dynamic Embedding Dimension Handling**  
  The model dynamically detects the encoder’s output dimension (e.g., 384 for `"sentence-transformers/all-MiniLM-L6-v2"`) and configures the RNN input accordingly.
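
One simple way to detect an encoder's output dimension is to encode a short probe string and measure the result. The sketch below shows that idea with a toy encoder; `infer_embedding_dim` and `toy_encoder` are hypothetical names, and the package may obtain the dimension differently (for SentenceTransformer models, the library itself exposes `get_sentence_embedding_dimension()`).

```python
from typing import Callable, List, Sequence


def infer_embedding_dim(encode: Callable[[str], Sequence[float]]) -> int:
    """Encode a short probe string and return the length of the resulting vector."""
    return len(encode("probe"))


def toy_encoder(text: str) -> List[float]:
    # Stand-in for a real encoder: always returns a 3-dimensional vector.
    return [float(len(text)), float(text.count(" ")), 1.0]


# The detected size is what would be passed as the RNN's input size.
rnn_input_size = infer_embedding_dim(toy_encoder)
print(rnn_input_size)
```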

- **Extensive Testing Suite**  
  Comprehensive tests ensure that utility functions, encoder types, and estimator functionality work as expected, making it easy to maintain and extend the package.

Raw data

{
    "_id": null,
    "home_page": null,
    "name": "TextRegress",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": null,
    "keywords": "text, predictive NLP, machine learning, NLP, linear regression",
    "author": "Jinhang Jiang, Weiyao Peng, Karthik Srinivasan",
    "author_email": "jinhang@asu.edu",
    "download_url": "https://files.pythonhosted.org/packages/bb/6e/c9dcd2d1d87ff351ea5e8c4cb390c8c22a5fa319a0df1f99fe892b5ff68a/textregress-1.0.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License",
    "summary": "A Python package for performing linear regression analysis on text data.",
    "version": "1.0.0",
    "project_urls": {
        "Homepage": "https://github.com/jinhangjiang/textregress",
        "Repository": "https://github.com/jinhangjiang/textregress"
    },
    "split_keywords": [
        "text",
        " predictive nlp",
        " machine learning",
        " nlp",
        " linear regression"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "7ed18eedc5ec11930a8b4ede6c14922b7c9a15c1ae67d630acf738c99deefa3a",
                "md5": "1d687c85a41bd2587c13f4cba6275b2f",
                "sha256": "786c116f92bf77794d3b455f1976323f9c91aac78bf9057f09698e3ed9e87f3a"
            },
            "downloads": -1,
            "filename": "TextRegress-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "1d687c85a41bd2587c13f4cba6275b2f",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 11841,
            "upload_time": "2025-02-16T07:01:51",
            "upload_time_iso_8601": "2025-02-16T07:01:51.892754Z",
            "url": "https://files.pythonhosted.org/packages/7e/d1/8eedc5ec11930a8b4ede6c14922b7c9a15c1ae67d630acf738c99deefa3a/TextRegress-1.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "bb6ec9dcd2d1d87ff351ea5e8c4cb390c8c22a5fa319a0df1f99fe892b5ff68a",
                "md5": "96f5580669b65d8c142ebdacb96f1f56",
                "sha256": "81e7faa6f2c56b65a69b94f75e5348171be2c8bc8590fc8a9eaafd8002ca4530"
            },
            "downloads": -1,
            "filename": "textregress-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "96f5580669b65d8c142ebdacb96f1f56",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 12196,
            "upload_time": "2025-02-16T07:01:53",
            "upload_time_iso_8601": "2025-02-16T07:01:53.138812Z",
            "url": "https://files.pythonhosted.org/packages/bb/6e/c9dcd2d1d87ff351ea5e8c4cb390c8c22a5fa319a0df1f99fe892b5ff68a/textregress-1.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-02-16 07:01:53",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "jinhangjiang",
    "github_project": "textregress",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "pandas",
            "specs": [
                [
                    "~=",
                    "1.5.3"
                ]
            ]
        },
        {
            "name": "numpy",
            "specs": [
                [
                    "~=",
                    "1.23.5"
                ]
            ]
        },
        {
            "name": "scikit-learn",
            "specs": [
                [
                    "~=",
                    "1.2.2"
                ]
            ]
        },
        {
            "name": "tqdm",
            "specs": [
                [
                    "~=",
                    "4.64.1"
                ]
            ]
        },
        {
            "name": "torch",
            "specs": [
                [
                    "~=",
                    "2.0.1"
                ]
            ]
        },
        {
            "name": "pytorch-lightning",
            "specs": [
                [
                    "~=",
                    "2.0.2"
                ]
            ]
        },
        {
            "name": "sentence-transformers",
            "specs": [
                [
                    "~=",
                    "2.2.2"
                ]
            ]
        },
        {
            "name": "transformers",
            "specs": [
                [
                    "~=",
                    "4.27.0"
                ]
            ]
        }
    ],
    "lcname": "textregress"
}
