webis-llm


Name: webis-llm
Version: 1.0.6
Summary: HTML content extraction tool that uses AI to automatically identify valuable information on web pages
Upload time: 2025-08-05 06:11:45
Requires Python: >=3.8
License: MIT
Keywords: web, extraction, ai, content
# Webis - HTML Content Extraction Tool  
![Python Version](https://img.shields.io/badge/Python-3.10-blue)  
![Build Status](https://img.shields.io/badge/Build-Passed-green)  

Webis is an intelligent web data extraction tool that uses AI technology to automatically identify valuable information on web pages, filter out noise, and provide high-quality input for downstream AI training and knowledge base construction.  

## Table of Contents  

- [Installation](#installation)  
- [Usage](#usage)  
  - [API Usage Example](#api-usage-example)  
  - [CLI Usage Example](#cli-usage-example)  
- [About the Model](#about-the-model)  
- [Project Structure](#project-structure)  
- [Contributing](#contributing)  

## Installation  

### Prerequisites  

- **Python 3.10**  
- **Conda** (recommended for environment management)  
- **NVIDIA GPU** (optional, for CUDA support)  

### Installing Webis
#### Method 1: Install via pip (Recommended)
```bash
conda create -n webis python=3.10 -y

conda activate webis

pip install webis-llm
```
#### Method 2: Install from Source
```bash
git clone https://github.com/TheBinKing/Webis.git  

cd Webis  

pip install -e .  

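# Add the bin directory to PATH for the current shell
export PATH="$PATH:$(pwd)/bin"

# Persist it; double quotes expand $(pwd) now, so ~/.bashrc stores the absolute path
echo "export PATH=\"\$PATH:$(pwd)/bin\"" >> ~/.bashrc
source ~/.bashrc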
```
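
After either installation method, one quick check is to ask Python's packaging metadata whether the distribution is visible in the active environment. The sketch below uses only the standard library; the helper name `installed_version` is illustrative and not part of Webis.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("webis-llm")
    print(f"webis-llm version: {found}" if found else "webis-llm is not installed")
```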

## Usage
Webis supports both CLI and API service modes. **Always start the model server first!**  

### Step 1: Start the Servers
+ **Model Server** (port 8000):  

```bash
python scripts/start_model_server.py  
```

+ **Web API Server** (port 8002):  

```bash
python scripts/start_web_server.py  
```

> **Note**: The default model (`Easonnoway/Web_info_extra_1.5b`) is downloaded automatically from Hugging Face, so the first run may take some time.  
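
Because clients depend on both servers being up, it can help to confirm that the two ports accept connections before submitting work. The following is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper, and a successful TCP connection only shows that something is listening, not that the model has finished loading.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Ports taken from the steps above: model server 8000, web API server 8002.
    for name, port in (("model server", 8000), ("web API server", 8002)):
        state = "up" if port_is_open("localhost", port) else "not reachable"
        print(f"{name} (port {port}): {state}")
```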

### API Usage Example
The `api_usage.py` script demonstrates how to process HTML files through the API in both synchronous and asynchronous modes, and is a good starting point for writing your own client.  

#### Synchronous Processing Mode
Ideal for small numbers of files, where the client waits for the server to complete processing:  

```python
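import requests

# `files`, `data`, and `task_id` are prepared by the caller (see samples/api_usage.py).

# Send an HTML file for synchronous processing
response = requests.post(
    "http://localhost:8002/extract/process-html",
    files=files,
    data=data,
)
response.raise_for_status()  # fail early on an HTTP error

# Download the processed results
download_response = requests.get(f"http://localhost:8002/tasks/{task_id}/download", stream=True)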
```

#### Asynchronous Processing Mode
Ideal for large numbers of files or long processing times; submit the task and periodically check its status:  

```python
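import requests

# `files`, `data`, and `async_task_id` are prepared by the caller (see samples/api_usage.py).

# Submit an asynchronous processing task
response = requests.post(
    "http://localhost:8002/extract/process-async",
    files=files,
    data=data,
)
response.raise_for_status()  # fail early on an HTTP error

# Monitor task status (repeat until the task is reported as finished)
status_response = requests.get(f"http://localhost:8002/tasks/{async_task_id}")

# Download results after task completion
download_response = requests.get(f"http://localhost:8002/tasks/{async_task_id}/download", stream=True)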
```
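
The "periodically check its status" step can be sketched as a small polling loop. This is an illustration, not code from the Webis repository: `wait_for_task`, `fetch_status`, and `is_done` are hypothetical names, and the shape of the status payload is not documented here, so the caller supplies the function that decides when a task has finished.

```python
import time


def wait_for_task(fetch_status, is_done, interval=2.0, timeout=600.0, sleep=time.sleep):
    """Poll fetch_status() until is_done(payload) is true or the timeout elapses.

    fetch_status: zero-argument callable returning the latest status payload.
    is_done: callable deciding whether a payload means the task has finished.
    Returns the final payload; raises TimeoutError if the task never finishes.
    """
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch_status()
        if is_done(payload):
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish before the timeout")
        sleep(interval)
```

With `requests`, `fetch_status` could be something like `lambda: requests.get(f"http://localhost:8002/tasks/{async_task_id}").json()`, assuming the endpoint returns JSON; `is_done` would then inspect whichever field the server uses to report completion.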

#### Running the API Example
```bash
# Basic usage  
python samples/api_usage.py  

# Enhance processing results using the DeepSeek API (requires an API key)  
python samples/api_usage.py --use-deepseek --api-key YOUR_API_KEY_HERE  
```

> **Tip**: Make sure the `input_html/` directory contains HTML files. Results are saved as `{task_id}_results.zip` (synchronous) and `{async_task_id}_async_results.zip` (asynchronous).  
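
The file handling around an API call can be sketched with the standard library alone. The helpers below are hypothetical (they are not taken from `api_usage.py`): one lists the HTML files in an input directory, one builds the archive names mentioned in the tip, and one writes a streamed download to disk.

```python
from pathlib import Path


def collect_html_files(directory):
    """Return the .html/.htm files directly inside `directory`, sorted by path."""
    root = Path(directory)
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in {".html", ".htm"}
    )


def result_archive_name(task_id, asynchronous=False):
    """Build the result archive name for a synchronous or asynchronous task."""
    suffix = "_async_results.zip" if asynchronous else "_results.zip"
    return f"{task_id}{suffix}"


def save_chunks(chunks, destination):
    """Write an iterable of byte chunks to `destination`; return bytes written."""
    written = 0
    with open(destination, "wb") as fh:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                written += fh.write(chunk)
    return written
```

For a response obtained with `stream=True`, the chunks could come from `download_response.iter_content(chunk_size=8192)`, for example `save_chunks(download_response.iter_content(chunk_size=8192), result_archive_name(task_id))`.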

### CLI Usage Example
The `cli_usage.sh` script provides quick examples of command-line interface usage, suitable for batch processing or script integration.  

#### Basic Usage
```bash
# Process HTML files  
./samples/cli_usage.sh  
```

> **Note**: The script calls the `webis extract` command. Replace the `YOUR_API_KEY_HERE` placeholder with a valid API key before running it. Results are saved to the `output_basic/` directory.  

#### Other Commands
```bash
# View version information  
$PROJECT_ROOT/bin/webis version  

# Check API connection  
$PROJECT_ROOT/bin/webis check-api --api-key YOUR_API_KEY  

# View help  
$PROJECT_ROOT/bin/webis --help  
$PROJECT_ROOT/bin/webis extract --help  
```

## About the Model
### Model Details
+ **Name**: Web_info_extra_1.5b  
+ **HuggingFace**: [Easonnoway/Web_info_extra_1.5b](https://huggingface.co/Easonnoway/Web_info_extra_1.5b)  
+ **Parameters**: 1.5B  
+ **Function**: DOM tree node classification
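
To make "DOM tree node classification" concrete, here is a toy sketch of the general idea: walk the document, collect text nodes together with their ancestor tags, and label each node as content or noise. The classifier is a stand-in heuristic based on ancestor tags, not the Webis model, and every name in it is hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}
NOISE_ANCESTORS = {"nav", "footer", "aside"}  # assumption made for this toy example
SKIPPED = {"script", "style"}


class TextNodeCollector(HTMLParser):
    """Collect (ancestor_tags, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:  # void elements never get a closing tag
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not (SKIPPED & set(self.stack)):
            self.nodes.append((tuple(self.stack), text))


def classify(path, text):
    """Placeholder classifier: a real system would call a trained model here."""
    return "noise" if NOISE_ANCESTORS & set(path) else "content"


def extract_content(html):
    """Return the text of all nodes that the placeholder labels as content."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    return [text for path, text in collector.nodes if classify(path, text) == "content"]
```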

### Usage Instructions
+ Downloaded by default to `~/.cache/huggingface/hub`.  
+ Use `--model-path` to specify a local path.  
+ Cache management: Set `HF_HOME` or `TRANSFORMERS_CACHE` to customize the location; use `huggingface-cli delete-cache` to clear the cache.
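
As a rough illustration of how the two environment variables above change the download location, the sketch below resolves a cache directory from them. It is deliberately simplified: the real Hugging Face libraries consult additional variables (for example `HF_HUB_CACHE`), and the precedence can differ between versions, so treat `resolve_hub_cache` as a hypothetical helper rather than the libraries' actual logic.

```python
import os
from pathlib import Path


def resolve_hub_cache(env=None):
    """Approximate the model cache directory from the environment.

    Only the two variables named above are consulted (an assumption of this sketch).
    """
    env = os.environ if env is None else env
    if env.get("TRANSFORMERS_CACHE"):
        return Path(env["TRANSFORMERS_CACHE"]).expanduser()
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "hub"
    return Path.home() / ".cache" / "huggingface" / "hub"


if __name__ == "__main__":
    print(resolve_hub_cache())
```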

## Project Structure
+ `bin/` - Command-line tools  
+ `src/` - Source code  
    - `cli/` - CLI implementation  
    - `core/` - Core logic  
    - `server/` - API server
+ `scripts/` - Startup scripts  
+ `samples/` - Usage examples (including `api_usage.py` and `cli_usage.sh`)  
    - `input_html/` - Sample HTML files  
    - `output_basic/` - CLI output results
+ `config/` - Configuration files

## Contributing
Contributions are welcome! Please submit issues or pull requests on [GitHub](https://github.com/TheBinKing/Webis). For support, contact the maintainers or join the community discussion.  

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "webis-llm",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "TheBinKing <example@example.com>",
    "keywords": "web, extraction, ai, content",
    "author": null,
    "author_email": "TheBinKing <example@example.com>",
    "download_url": "https://files.pythonhosted.org/packages/c3/2f/97541cc3ca89acbdba9fc9fdc4cd22011a3d6a99ac7f3c346d2f6662b4de/webis_llm-1.0.6.tar.gz",
    "platform": null,
    "description": "(verbatim copy of the README text shown above; omitted here)",
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "HTML\u5185\u5bb9\u63d0\u53d6\u5de5\u5177\uff0c\u4f7f\u7528AI\u81ea\u52a8\u8bc6\u522b\u7f51\u9875\u4e0a\u7684\u6709\u4ef7\u503c\u4fe1\u606f",
    "version": "1.0.6",
    "project_urls": {
        "Bug Tracker": "https://github.com/TheBinKing/Webis/issues",
        "Homepage": "https://github.com/TheBinKing/Webis"
    },
    "split_keywords": [
        "web",
        " extraction",
        " ai",
        " content"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "b4fd907abe0f33b991adac902feea47c468d04605a6d499e9b5112e8461719d0",
                "md5": "8ab9eb325720afb74c7dbd27d1a346ef",
                "sha256": "8760b501919fd60e8d50ef593f9d65b71818c3cf9b8a4d25fad193be45ce5bc4"
            },
            "downloads": -1,
            "filename": "webis_llm-1.0.6-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "8ab9eb325720afb74c7dbd27d1a346ef",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 947667,
            "upload_time": "2025-08-05T06:11:42",
            "upload_time_iso_8601": "2025-08-05T06:11:42.026013Z",
            "url": "https://files.pythonhosted.org/packages/b4/fd/907abe0f33b991adac902feea47c468d04605a6d499e9b5112e8461719d0/webis_llm-1.0.6-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "c32f97541cc3ca89acbdba9fc9fdc4cd22011a3d6a99ac7f3c346d2f6662b4de",
                "md5": "e04f328fe04f49c1388e94e483b598a4",
                "sha256": "f6bc8319374ba80447d09066968c97fb94d2462ce8bd7586f362d2ba8a15e340"
            },
            "downloads": -1,
            "filename": "webis_llm-1.0.6.tar.gz",
            "has_sig": false,
            "md5_digest": "e04f328fe04f49c1388e94e483b598a4",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 941013,
            "upload_time": "2025-08-05T06:11:45",
            "upload_time_iso_8601": "2025-08-05T06:11:45.012229Z",
            "url": "https://files.pythonhosted.org/packages/c3/2f/97541cc3ca89acbdba9fc9fdc4cd22011a3d6a99ac7f3c346d2f6662b4de/webis_llm-1.0.6.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-08-05 06:11:45",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "TheBinKing",
    "github_project": "Webis",
    "github_not_found": true,
    "lcname": "webis-llm"
}
        