# 🚀 Data Proxy Service (data-stream)
A Python-based tool that allows you to stream data from a remote server to your local compute resources. This service is particularly useful when you need to **train models on large datasets stored on a remote server but don't have sufficient storage on your local compute node**.
This package wraps the [sshtunnel](https://github.com/pahaz/sshtunnel) library and uses [FastAPI](https://fastapi.tiangolo.com/) to expose a simple HTTP server that streams the data.
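Under the hood, the idea is an SSH port forward plus a local HTTP front end. The following is a minimal sketch built directly on sshtunnel — illustrative only, with placeholder host, user, and ports, and not the package's actual implementation:
```python
import os
from sshtunnel import SSHTunnelForwarder

# Forward a local port to a port on the remote host, where a simple HTTP
# server exposes the data directory. All values here are placeholders.
tunnel = SSHTunnelForwarder(
    "example.com",
    ssh_username="myusername",
    ssh_pkey=os.path.expanduser("~/.ssh/id_rsa"),
    remote_bind_address=("127.0.0.1", 8001),
    local_bind_address=("127.0.0.1", 8000),
)
tunnel.start()
# A FastAPI app can then proxy requests for /data/* through
# http://localhost:8000 and stream the responses to clients.
tunnel.stop()
```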
## ✨ Features
- 🔒 Stream data securely from a remote server using SSH tunneling
- 📝 Support for SSH config aliases and direct SSH parameters
- ⚡ FastAPI-powered HTTP endpoint for data access
- 🤖 Automatic management of remote Python HTTP server
- 🏥 Health check endpoint for monitoring
- 🔑 Support for both SSH key and password authentication
- ⚙️ Configurable ports for local and remote connections
- 🛑 Graceful shutdown handling
## 📦 Installation
Install the package using pip:
```bash
pip install data-stream
```
Alternatively, clone this repository and install it in editable mode:
```bash
git clone https://github.com/yourusername/data-proxy-service.git
cd data-proxy-service
pip install -e .
```
## 🔧 Usage: Command-line Interface
To start the Data Proxy Service, use one of the following methods:
### 1. Using SSH Config Alias 📋
If you have an SSH config file (`~/.ssh/config`) with your server details:
```bash
data-stream --ssh-host-alias myserver --data-path /path/to/remote/data
```
Here is an example of an SSH config file:
```
Host myserver
    HostName example.com
    User mouloud
    IdentityFile ~/.ssh/id_rsa
```
### 2. Using Direct SSH Parameters 🔑
```bash
data-stream \
    --ssh-host example.com \
    --ssh-username myusername \
    --ssh-key-path ~/.ssh/id_rsa \
    --data-path /path/to/remote/data
```
### Optional Parameters ⚙️
- `--local-port`: Local port for SSH tunnel (default: 8000)
- `--remote-port`: Remote port for HTTP server (default: 8001)
- `--fastapi-port`: FastAPI server port (default: 5001)
- `--ssh-password`: SSH password (if not using key-based authentication)
Example with all parameters:
```bash
data-stream \
    --ssh-host example.com \
    --ssh-username john \
    --data-path /home/john/datasets \
    --ssh-key-path ~/.ssh/id_rsa \
    --local-port 8000 \
    --remote-port 8001 \
    --fastapi-port 5000
```
### 3. Using Environment Variables 🔧
You can also configure the service using environment variables (a sketch follows the list):
- `PROXY_SSH_HOST_ALIAS`: SSH host alias (for SSH config)
- `PROXY_SSH_HOST`: SSH host (the remote server holding the data)
- `PROXY_SSH_USERNAME`: SSH username
- `PROXY_DATA_PATH`: Path to the data on the remote server
- `PROXY_SSH_KEY_PATH`: Path to SSH key
- `PROXY_SSH_PASSWORD`: SSH password (if not using key)
- `PROXY_LOCAL_PORT`: Local port for SSH tunnel
- `PROXY_REMOTE_PORT`: Remote port for HTTP server
- `PROXY_FASTAPI_PORT`: FastAPI server port
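As a minimal sketch, the variables can also be set from Python before the settings object is created (this assumes `Settings` reads `PROXY_`-prefixed variables from the environment, as the names above suggest; values are illustrative):
```python
import os

# Illustrative values; export these (or set them in your shell) before
# the service reads its configuration.
os.environ["PROXY_SSH_HOST_ALIAS"] = "myserver"
os.environ["PROXY_DATA_PATH"] = "/path/to/remote/data"
os.environ["PROXY_FASTAPI_PORT"] = "5000"

from data_stream import Settings

settings = Settings()  # assumed to pick up the PROXY_* variables above
```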
## 🖥️ HPC Usage
When using data-stream on an HPC (High-Performance Computing) system:
⚠️ **Important**: Always start the service on a compute node, not on the login node. Login nodes are shared resources and aren't suitable for running services.
Example using SLURM:
```bash
#!/bin/bash
#SBATCH --job-name=data-stream
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=24:00:00
data-stream \
    --ssh-host-alias myserver \
    --data-path /path/to/remote/data
```
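Once the job is running, you may want to block until the proxy reports healthy before starting training. A minimal sketch (a hypothetical helper, not part of the package; the port assumes `--fastapi-port 5000` as in the examples below):
```python
import time

import requests

def wait_for_proxy(url: str = "http://localhost:5000/health", timeout: float = 60.0) -> None:
    """Poll the proxy's health endpoint until it responds or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=2).status_code == 200:
                return
        except requests.ConnectionError:
            pass  # tunnel or server not up yet
        time.sleep(1)
    raise TimeoutError(f"data-stream proxy not reachable at {url}")

wait_for_proxy()
```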
## 📊 Integration Examples
### WebDataset Integration 📦
data-stream works seamlessly with WebDataset for efficient data loading in machine learning pipelines:
```python
import webdataset as wds
from torch.utils.data import DataLoader

# Start the data-stream service first (as shown above), then point
# WebDataset at the proxied shards (port assumes --fastapi-port 5000).
url = "http://localhost:5000/data/path/to/tarfiles/{000000..999999}.tar"

# Decode samples and map them to (input, target) tuples; the key names
# ("jpg", "cls" here) depend on how files inside your shards are named.
dataset = (
    wds.WebDataset(url)
    .decode("torchrgb")
    .to_tuple("jpg", "cls")
    .batched(32)
)

# batch_size=None: batching is done by the WebDataset pipeline above
dataloader = DataLoader(dataset, batch_size=None, num_workers=4)

for batch_input, batch_target in dataloader:
    # Your training code here
    pass
```
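With `batch_size=None`, collation happens inside the WebDataset pipeline (the `.batched(...)` stage) rather than in the DataLoader. Note that WebDataset splits shards across DataLoader workers, so with `num_workers=4` the brace expression should cover at least four shards.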
## 📂 Accessing Data
Once the service is running, you can access your data at (assuming `--fastapi-port 5000` as in the examples above; the default is 5001):
```
http://localhost:5000/data/path/to/file
```
You can test the data stream by downloading a shard (the filename is illustrative):
```bash
curl http://localhost:5000/data/path/to/shard_0001.tar -o test.tar
```
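The same check from Python, streaming the response to disk in chunks (the shard path is illustrative):
```python
import requests

# Stream a shard to a local file without holding it all in memory.
url = "http://localhost:5000/data/path/to/shard_0001.tar"
with requests.get(url, stream=True) as resp:
    resp.raise_for_status()
    with open("test.tar", "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
            f.write(chunk)
```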
## 🏥 Health Check
You can verify the service status using:
```bash
curl http://localhost:5000/health
```
This will return:
```json
{
  "status": "OK",
  "connection": {
    "hostname": "example.com",
    "username": "myusername",
    "using_ssh_config": true
  }
}
```
## 🐍 Using as a Python Package
You can also use data-stream in your Python code:
```python
import asyncio

from data_stream import DataProxyService, Settings

# Option 1: using an SSH config alias
settings = Settings(
    ssh_host_alias="myserver",
    data_path="/path/to/remote/data",
)

# Option 2: using direct parameters instead
# settings = Settings(
#     ssh_host="example.com",
#     ssh_username="myusername",
#     ssh_key_path="~/.ssh/id_rsa",
#     data_path="/path/to/remote/data",
# )

async def main() -> None:
    # Initialize and start the service
    service = DataProxyService(settings)
    await service.start()
    try:
        ...  # stream data while the service is running
    finally:
        # When done, shut everything down gracefully
        await service.stop()

asyncio.run(main())
```
## 📋 Requirements
- Python 3.7+
- SSH access to the remote server
- Python installation on the remote server
## 🔧 Troubleshooting
### Common Issues
1. **🚫 Permission Denied**
- Verify your username and SSH key are correct
- Check if your user has access to the data directory on the remote server
2. **⚠️ Port Already in Use**
- Try different ports using `--local-port`, `--remote-port`, or `--fastapi-port`
- Check if another instance of data-stream is already running
- On HPC, ensure no other jobs are using the same ports (this is why it's important to run on a compute node; see the port-check sketch after this list)
3. **🔌 Remote Server Issues**
- Ensure Python is installed on the remote server
- Check if the data path exists and is accessible
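To quickly check whether a port is already taken before launching, a minimal sketch:
```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0

# Local ports used by data-stream: the SSH tunnel's local end (default 8000)
# and the FastAPI server (default 5001)
for port in (8000, 5001):
    print(port, "in use" if port_in_use(port) else "free")
```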
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.