# 🚀 Data Proxy Service (data-stream)
A Python-based tool that allows you to stream data from a remote server to your local compute resources. This service is particularly useful when you need to **train models on large datasets stored on a remote server but don't have sufficient storage on your local compute node**.
This package wraps the [sshtunnel](https://github.com/pahaz/sshtunnel) library and uses [FastAPI](https://fastapi.tiangolo.com/) to expose a simple HTTP server that streams the data.
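Under the hood, the idea is an SSH port forward plus a local HTTP front end. The following is a minimal sketch built directly on sshtunnel — illustrative only, with placeholder host, user, and ports, and not the package's actual implementation:
```python
import os
from sshtunnel import SSHTunnelForwarder

# Forward a local port to a port on the remote host, where a simple HTTP
# server exposes the data directory. All values here are placeholders.
tunnel = SSHTunnelForwarder(
    "example.com",
    ssh_username="myusername",
    ssh_pkey=os.path.expanduser("~/.ssh/id_rsa"),
    remote_bind_address=("127.0.0.1", 8001),
    local_bind_address=("127.0.0.1", 8000),
)
tunnel.start()
# A FastAPI app can then proxy requests for /data/* through
# http://localhost:8000 and stream the responses to clients.
tunnel.stop()
```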
## ✨ Features
- 🔒 Stream data securely from a remote server using SSH tunneling
- 📝 Support for SSH config aliases and direct SSH parameters
- ⚡ FastAPI-powered HTTP endpoint for data access
- 🤖 Automatic management of remote Python HTTP server
- 🏥 Health check endpoint for monitoring
- 🔑 Support for both SSH key and password authentication
- ⚙️ Configurable ports for local and remote connections
- 🛑 Graceful shutdown handling
## 📦 Installation
Install the package using pip:
```bash
pip install data-stream
```
Alternatively, clone this repository and install it in editable mode:
```bash
git clone https://github.com/yourusername/data-proxy-service.git
cd data-proxy-service
pip install -e .
```
## 🔧 Usage: Command-line Interface
To start the Data Proxy Service, use one of the following methods:
### 1. Using SSH Config Alias 📋
If you have an SSH config file (`~/.ssh/config`) with your server details:
```bash
data-stream --ssh-host-alias myserver --data-path /path/to/remote/data
```
Here is an example of an SSH config file:
```
Host myserver
    HostName example.com
    User mouloud
    IdentityFile ~/.ssh/id_rsa
```
### 2. Using Direct SSH Parameters 🔑
```bash
data-stream \
    --ssh-host example.com \
    --ssh-username myusername \
    --ssh-key-path ~/.ssh/id_rsa \
    --data-path /path/to/remote/data
```
### Optional Parameters ⚙️
- `--local-port`: Local port for SSH tunnel (default: 8000)
- `--remote-port`: Remote port for HTTP server (default: 8001)
- `--fastapi-port`: FastAPI server port (default: 5001)
- `--ssh-password`: SSH password (if not using key-based authentication)
Example with all parameters:
```bash
data-stream \
    --ssh-host example.com \
    --ssh-username john \
    --data-path /home/john/datasets \
    --ssh-key-path ~/.ssh/id_rsa \
    --local-port 8000 \
    --remote-port 8001 \
    --fastapi-port 5000
```
### 3. Using Environment Variables 🔧
You can also configure the service using environment variables (a sketch follows the list):
- `PROXY_SSH_HOST_ALIAS`: SSH host alias (for SSH config)
- `PROXY_SSH_HOST`: SSH host (the remote server holding the data)
- `PROXY_SSH_USERNAME`: SSH username
- `PROXY_DATA_PATH`: Path to the data on the remote server
- `PROXY_SSH_KEY_PATH`: Path to SSH key
- `PROXY_SSH_PASSWORD`: SSH password (if not using key)
- `PROXY_LOCAL_PORT`: Local port for SSH tunnel
- `PROXY_REMOTE_PORT`: Remote port for HTTP server
- `PROXY_FASTAPI_PORT`: FastAPI server port
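As a minimal sketch, the variables can also be set from Python before the settings object is created (this assumes `Settings` reads `PROXY_`-prefixed variables from the environment, as the names above suggest; values are illustrative):
```python
import os

# Illustrative values; export these (or set them in your shell) before
# the service reads its configuration.
os.environ["PROXY_SSH_HOST_ALIAS"] = "myserver"
os.environ["PROXY_DATA_PATH"] = "/path/to/remote/data"
os.environ["PROXY_FASTAPI_PORT"] = "5000"

from data_stream import Settings

settings = Settings()  # assumed to pick up the PROXY_* variables above
```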
## 🖥️ HPC Usage
When using data-stream on an HPC (High-Performance Computing) system:
⚠️ **Important**: Always start the service on a compute node, not on the login node. Login nodes are shared resources and aren't suitable for running services.
Example using SLURM:
```bash
#!/bin/bash
#SBATCH --job-name=data-stream
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=24:00:00
data-stream \
    --ssh-host-alias myserver \
    --data-path /path/to/remote/data
```
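Once the job is running, you may want to block until the proxy reports healthy before starting training. A minimal sketch (a hypothetical helper, not part of the package; the port assumes `--fastapi-port 5000` as in the examples below):
```python
import time

import requests

def wait_for_proxy(url: str = "http://localhost:5000/health", timeout: float = 60.0) -> None:
    """Poll the proxy's health endpoint until it responds or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=2).status_code == 200:
                return
        except requests.ConnectionError:
            pass  # tunnel or server not up yet
        time.sleep(1)
    raise TimeoutError(f"data-stream proxy not reachable at {url}")

wait_for_proxy()
```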
## 📊 Integration Examples
### WebDataset Integration 📦
data-stream works seamlessly with WebDataset for efficient data loading in machine learning pipelines:
```python
import webdataset as wds
from torch.utils.data import DataLoader

# Start the data-stream service first (as shown above), then point
# WebDataset at the proxied shards (port assumes --fastapi-port 5000).
url = "http://localhost:5000/data/path/to/tarfiles/{000000..999999}.tar"

# Decode samples and map them to (input, target) tuples; the key names
# ("jpg", "cls" here) depend on how files inside your shards are named.
dataset = (
    wds.WebDataset(url)
    .decode("torchrgb")
    .to_tuple("jpg", "cls")
    .batched(32)
)

# batch_size=None: batching is done by the WebDataset pipeline above
dataloader = DataLoader(dataset, batch_size=None, num_workers=4)

for batch_input, batch_target in dataloader:
    # Your training code here
    pass
```
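With `batch_size=None`, collation happens inside the WebDataset pipeline (the `.batched(...)` stage) rather than in the DataLoader. Note that WebDataset splits shards across DataLoader workers, so with `num_workers=4` the brace expression should cover at least four shards.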
## 📂 Accessing Data
Once the service is running, you can access your data at (assuming `--fastapi-port 5000` as in the examples above; the default is 5001):
```
http://localhost:5000/data/path/to/file
```
You can test the data stream by downloading a shard (the filename is illustrative):
```bash
curl http://localhost:5000/data/path/to/shard_0001.tar -o test.tar
```
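The same check from Python, streaming the response to disk in chunks (the shard path is illustrative):
```python
import requests

# Stream a shard to a local file without holding it all in memory.
url = "http://localhost:5000/data/path/to/shard_0001.tar"
with requests.get(url, stream=True) as resp:
    resp.raise_for_status()
    with open("test.tar", "wb") as f:
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
            f.write(chunk)
```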
## 🏥 Health Check
You can verify the service status using:
```bash
curl http://localhost:5000/health
```
This will return:
```json
{
  "status": "OK",
  "connection": {
    "hostname": "example.com",
    "username": "myusername",
    "using_ssh_config": true
  }
}
```
## 🐍 Using as a Python Package
You can also use data-stream in your Python code:
```python
import asyncio

from data_stream import DataProxyService, Settings

# Option 1: using an SSH config alias
settings = Settings(
    ssh_host_alias="myserver",
    data_path="/path/to/remote/data",
)

# Option 2: using direct parameters instead
# settings = Settings(
#     ssh_host="example.com",
#     ssh_username="myusername",
#     ssh_key_path="~/.ssh/id_rsa",
#     data_path="/path/to/remote/data",
# )

async def main() -> None:
    # Initialize and start the service
    service = DataProxyService(settings)
    await service.start()
    try:
        ...  # stream data while the service is running
    finally:
        # When done, shut everything down gracefully
        await service.stop()

asyncio.run(main())
```
## 📋 Requirements
- Python 3.7+
- SSH access to the remote server
- Python installation on the remote server
## 🔧 Troubleshooting
### Common Issues
1. **🚫 Permission Denied**
- Verify your username and SSH key are correct
- Check if your user has access to the data directory on the remote server
2. **⚠️ Port Already in Use**
- Try different ports using `--local-port`, `--remote-port`, or `--fastapi-port`
- Check if another instance of data-stream is already running
- On HPC, ensure no other jobs are using the same ports (this is why it's important to run on a compute node; see the port-check sketch after this list)
3. **🔌 Remote Server Issues**
- Ensure Python is installed on the remote server
- Check if the data path exists and is accessible
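To quickly check whether a port is already taken before launching, a minimal sketch:
```python
import socket

def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0

# Local ports used by data-stream: the SSH tunnel's local end (default 8000)
# and the FastAPI server (default 5001)
for port in (8000, 5001):
    print(port, "in use" if port_in_use(port) else "free")
```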
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.