Name | webis-llm |
Version | 1.0.6 |
home_page | None |
Summary | HTML content extraction tool that uses AI to automatically identify valuable information on web pages |
upload_time | 2025-08-05 06:11:45 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.8 |
license | MIT |
keywords | web, extraction, ai, content |
VCS | |
bugtrack_url | |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
# Webis - HTML Content Extraction Tool


Webis is an intelligent web data extraction tool that uses AI technology to automatically identify valuable information on web pages, filter out noise, and provide high-quality input for downstream AI training and knowledge base construction.
## Table of Contents
- [Installation](#installation)
- [Usage](#usage)
  - [API Usage Example](#api-usage-example)
  - [CLI Usage Example](#cli-usage-example)
- [About the Model](#about-the-model)
- [Project Structure](#project-structure)
- [Troubleshooting](#troubleshooting)
- [Contributing](#contributing)
## Installation
### Prerequisites
- **Python 3.10** (recommended; the package metadata declares support for Python >=3.8)
- **Conda** (recommended for environment management)
- **NVIDIA GPU** (optional, for CUDA support)
### Installing Webis
#### Method 1: Install via pip (Recommended)
```bash
conda create -n webis python=3.10 -y
conda activate webis
pip install webis-llm
```
#### Method 2: Install from Source
```bash
git clone https://github.com/TheBinKing/Webis.git
cd Webis
pip install -e .
# Add the bin directory to PATH
export PATH="$PATH:$(pwd)/bin"
echo "export PATH=\"\$PATH:$(pwd)/bin\"" >> ~/.bashrc  # persist the absolute repo path for future shells
source ~/.bashrc
```
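After installing with either method, a quick sanity check from Python confirms the package is present and reports the installed release (1.0.6 at the time of writing). This is standard-library metadata inspection only, not a Webis command:

```python
from importlib.metadata import version, PackageNotFoundError

# Confirm the package is installed and print its version.
try:
    print("webis-llm", version("webis-llm"))
except PackageNotFoundError:
    print("webis-llm is not installed in this environment")
```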
## Usage
Webis supports both CLI and API service modes. **Always start the model server first!**
### Step 1: Start the Servers
+ **Model Server** (port 8000):
```bash
python scripts/start_model_server.py
```
+ **Web API Server** (port 8002):
```bash
python scripts/start_web_server.py
```
> **Note**: The default model (`Easonnoway/Web_info_extra_1.5b`) will be automatically downloaded from HuggingFace. The first run may take some time.
>
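Before sending any requests, it can help to confirm that both servers are listening on their default ports (8000 and 8002, as listed above). The sketch below only checks TCP connectivity with the standard library; it does not exercise any Webis endpoint:

```python
import socket

# Check that the model server (8000) and web API server (8002) accept connections.
for port in (8000, 8002):
    with socket.socket() as s:
        s.settimeout(2)
        reachable = s.connect_ex(("localhost", port)) == 0
        print(f"port {port}: {'open' if reachable else 'not reachable'}")
```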
### API Usage Example
The `api_usage.py` script demonstrates how to process HTML files through the API, supporting both synchronous and asynchronous modes; it is a good starting point for client integration.
#### Synchronous Processing Mode
Ideal for small numbers of files, where the client waits for the server to complete processing:
```python
import requests

# Send an HTML file for synchronous processing
# (`files` and `data` are the multipart upload and form fields prepared beforehand)
response = requests.post(
    "http://localhost:8002/extract/process-html",
    files=files,
    data=data,
)

# Download the processed results (`task_id` is taken from the server's response)
response = requests.get(f"http://localhost:8002/tasks/{task_id}/download", stream=True)
```
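The snippet above assumes `files`, `data`, and `task_id` are already prepared. Below is a minimal sketch of that preparation and of saving the streamed archive; the multipart field name `files`, the empty `data` dict, and the `task_id` response field are assumptions for illustration, not documented API details:

```python
import requests

# Hypothetical upload preparation; the exact form field names expected by
# /extract/process-html may differ from this sketch.
with open("input_html/example.html", "rb") as f:
    files = {"files": ("example.html", f, "text/html")}
    data = {}
    response = requests.post(
        "http://localhost:8002/extract/process-html",
        files=files,
        data=data,
    )
response.raise_for_status()
task_id = response.json().get("task_id")  # assumed response field name

# Save the streamed result archive to disk chunk by chunk.
with requests.get(f"http://localhost:8002/tasks/{task_id}/download", stream=True) as r:
    r.raise_for_status()
    with open(f"{task_id}_results.zip", "wb") as out:
        for chunk in r.iter_content(chunk_size=8192):
            out.write(chunk)
```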
#### Asynchronous Processing Mode
Ideal for large numbers of files or long processing times; submit the task and periodically check its status:
```python
import requests

# Submit an asynchronous processing task
# (`files` and `data` are the multipart upload and form fields prepared beforehand)
response = requests.post(
    "http://localhost:8002/extract/process-async",
    files=files,
    data=data,
)

# Monitor task status (`async_task_id` is taken from the submission response)
response = requests.get(f"http://localhost:8002/tasks/{async_task_id}")

# Download results after task completion
download_response = requests.get(f"http://localhost:8002/tasks/{async_task_id}/download", stream=True)
```
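Since asynchronous tasks are meant to be checked periodically, a small polling loop can wrap the status call. The `status` field name and its terminal values below are assumptions, as the response schema is not documented here:

```python
import time
import requests

# Poll the task endpoint until it reports a terminal state.
while True:
    resp = requests.get(f"http://localhost:8002/tasks/{async_task_id}")
    resp.raise_for_status()
    status = resp.json().get("status")  # assumed field name
    print("task status:", status)
    if status in ("completed", "failed"):  # assumed terminal values
        break
    time.sleep(5)  # wait a few seconds between checks
```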
#### Running the API Example
```bash
# Basic usage
python samples/api_usage.py
# Enhance processing results using the DeepSeek API (requires an API key)
python samples/api_usage.py --use-deepseek --api-key YOUR_API_KEY_HERE
```
> **Tip**: Ensure there are HTML files in the `input_html/` directory. Results will be saved as `{task_id}_results.zip` (synchronous) and `{async_task_id}_async_results.zip` (asynchronous).
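The downloaded archives are ordinary ZIP files, so the standard library is enough to inspect or extract them; the filename below is a placeholder following the naming pattern from the tip above:

```python
import zipfile

# Replace with the actual {task_id}_results.zip produced by your run.
archive = "example-task_results.zip"

with zipfile.ZipFile(archive) as zf:
    print(zf.namelist())       # list the contained files
    zf.extractall("results/")  # unpack into a local directory
```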
>
### CLI Usage Example
The `cli_usage.sh` script provides quick examples of command-line interface usage, suitable for batch processing or script integration.
#### Basic Usage
```bash
# Process HTML files
./samples/cli_usage.sh
```
> **Note**: The script calls the `webis extract` command; replace `YOUR_API_KEY_HERE` in the script with a valid API key before running it. Results are saved to the `output_basic/` directory.
>
#### Other Commands
```bash
# $PROJECT_ROOT is the root of the cloned repository
# View version information
$PROJECT_ROOT/bin/webis version
# Check API connection
$PROJECT_ROOT/bin/webis check-api --api-key YOUR_API_KEY
# View help
$PROJECT_ROOT/bin/webis --help
$PROJECT_ROOT/bin/webis extract --help
```
## About the Model
### Model Details
+ **Name**: Web_info_extra_1.5b
+ **HuggingFace**: [Easonnoway/Web_info_extra_1.5b](https://huggingface.co/Easonnoway/Web_info_extra_1.5b)
+ **Parameters**: 1.5B
+ **Function**: DOM tree node classification
### Usage Instructions
+ Downloaded by default to `~/.cache/huggingface/hub`.
+ Use `--model-path` to specify a local path.
+ Cache management: Set `HF_HOME` or `TRANSFORMERS_CACHE` to customize the location; use `huggingface-cli delete-cache` to clear the cache (see the sketch below).
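As a sketch of the cache workflow described above, the model can be pre-downloaded into a custom cache with `huggingface_hub`; setting `HF_HOME` before the import is what redirects the cache, and the path shown is a placeholder:

```python
import os

# Redirect the HuggingFace cache; this must happen before huggingface_hub is imported.
os.environ["HF_HOME"] = "/data/hf-cache"  # placeholder path

from huggingface_hub import snapshot_download

# Pre-download Easonnoway/Web_info_extra_1.5b so the first server start
# does not have to wait for the download.
local_path = snapshot_download("Easonnoway/Web_info_extra_1.5b")
print("Model cached at:", local_path)
```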
## Project Structure
+ `bin/` - Command-line tools
+ `src/` - Source code
- `cli/` - CLI implementation
- `core/` - Core logic
- `server/` - API server
+ `scripts/` - Startup scripts
+ `samples/` - Usage examples (including `api_usage.py` and `cli_usage.sh`)
- `input_html/` - Sample HTML files
- `output_basic/` - CLI output results
+ `config/` - Configuration files
## Contributing
Contributions are welcome! Please submit issues or pull requests on [GitHub](https://github.com/TheBinKing/Webis). For support, contact the maintainers or join the community discussion.
Raw data
{
"_id": null,
"home_page": null,
"name": "webis-llm",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.8",
"maintainer_email": "TheBinKing <example@example.com>",
"keywords": "web, extraction, ai, content",
"author": null,
"author_email": "TheBinKing <example@example.com>",
"download_url": "https://files.pythonhosted.org/packages/c3/2f/97541cc3ca89acbdba9fc9fdc4cd22011a3d6a99ac7f3c346d2f6662b4de/webis_llm-1.0.6.tar.gz",
"platform": null,
"description": "```markdown\n# Webis - HTML Content Extraction Tool \n \n \n\nWebis is an intelligent web data extraction tool that uses AI technology to automatically identify valuable information on web pages, filter out noise, and provide high-quality input for downstream AI training and knowledge base construction. \n\n## Table of Contents \n\n- [Installation](#installation) \n- [Usage](#usage) \n - [API Usage Example](#api-usage-example) \n - [CLI Usage Example](#cli-usage-example) \n- [About the Model](#about-the-model) \n- [Project Structure](#project-structure) \n- [Troubleshooting](#troubleshooting) \n- [Contributing](#contributing) \n\n## Installation \n\n### Prerequisites \n\n- **Python 3.10** \n- **Conda** (recommended for environment management) \n- **NVIDIA GPU** (optional, for CUDA support) \n\n### Installing Webis\n#### Method 1: Install via pip (Recommended)\n```bash\nconda create -n webis python=3.10 -y\n\nconda activate webis\n\npip install webis-llm\n```\n#### Method 2: Install from Source\n```bash\ngit clone https://github.com/TheBinKing/Webis.git \n\ncd Webis \n\npip install -e . \n\n# Add the bin directory to PATH \nexport PATH=\"$PATH:$(pwd)/bin\" \necho 'export PATH=\"$PATH:$(pwd)/bin\"' >> ~/.bashrc \nsource ~/.bashrc \n```\n\n## Usage\nWebis supports both CLI and API service modes. **Always start the model server first!** \n\n### Step 1: Start the Servers\n+ **Model Server** (port 8000): \n\n```bash\npython scripts/start_model_server.py \n```\n\n+ **Web API Server** (port 8002): \n\n```bash\npython scripts/start_web_server.py \n```\n\n> **Note**: The default model (`Easonnoway/Web_info_extra_1.5b`) will be automatically downloaded from HuggingFace. The first run may take some time. \n>\n\n### API Usage Example\nThe `api_usage.py` script demonstrates how to process HTML files via the API interface, supporting both synchronous and asynchronous modes, suitable for familiarizing clients with operations. \n\n#### Synchronous Processing Mode\nIdeal for small numbers of files, where the client waits for the server to complete processing: \n\n```python\n# Send an HTML file for synchronous processing \nresponse = requests.post( \n \"http://localhost:8002/extract/process-html\", \n files=files, \n data=data \n) \n\n# Download the processed results \nresponse = requests.get(f\"http://localhost:8002/tasks/{task_id}/download\", stream=True) \n```\n\n#### Asynchronous Processing Mode\nIdeal for large numbers of files or long processing times; submit the task and periodically check its status: \n\n```python\n# Submit an asynchronous processing task \nresponse = requests.post( \n \"http://localhost:8002/extract/process-async\", \n files=files, \n data=data \n) \n\n# Monitor task status \nresponse = requests.get(f\"http://localhost:8002/tasks/{async_task_id}\") \n\n# Download results after task completion \ndownload_response = requests.get(f\"http://localhost:8002/tasks/{async_task_id}/download\", stream=True) \n```\n\n#### Running the API Example\n```bash\n# Basic usage \npython samples/api_usage.py \n\n# Enhance processing results using the DeepSeek API (requires an API key) \npython samples/api_usage.py --use-deepseek --api-key YOUR_API_KEY_HERE \n```\n\n> **Tip**: Ensure there are HTML files in the `input_html/` directory. Results will be saved as `{task_id}_results.zip` (synchronous) and `{async_task_id}_async_results.zip` (asynchronous). 
\n>\n\n### CLI Usage Example\nThe `cli_usage.sh` script provides quick examples of command-line interface usage, suitable for batch processing or script integration. \n\n#### Basic Usage\n```bash\n# Process HTML files \n./samples/cli_usage.sh \n```\n\n> **Note**: The script calls the `webis extract` command and requires a valid `YOUR_API_KEY_HERE`. Results are saved to the `output_basic/` directory. \n>\n\n#### Other Commands\n```bash\n# View version information \n$PROJECT_ROOT/bin/webis version \n\n# Check API connection \n$PROJECT_ROOT/bin/webis check-api --api-key YOUR_API_KEY \n\n# View help \n$PROJECT_ROOT/bin/webis --help \n$PROJECT_ROOT/bin/webis extract --help \n```\n\n## About the Model\n### Model Details\n+ **Name**: Web_info_extra_1.5b \n+ **HuggingFace**: [Easonnoway/Web_info_extra_1.5b](https://huggingface.co/Easonnoway/Web_info_extra_1.5b) \n+ **Parameters**: 1.5B \n+ **Function**: DOM tree node classification\n\n### Usage Instructions\n+ Downloaded by default to `~/.cache/huggingface/hub`. \n+ Use `--model-path` to specify a local path. \n+ Cache management: Set `HF_HOME` or `TRANSFORMERS_CACHE` to customize the location; use `huggingface-cli delete-cache` to clear the cache.\n\n## Project Structure\n+ `bin/` - Command-line tools \n+ `src/` - Source code \n - `cli/` - CLI implementation \n - `core/` - Core logic \n - `server/` - API server\n+ `scripts/` - Startup scripts \n+ `samples/` - Usage examples (including `api_usage.py` and `cli_usage.sh`) \n - `input_html/` - Sample HTML files \n - `output_basic/` - CLI output results\n+ `config/` - Configuration files\n\n## Contributing\nContributions are welcome! Please submit issues or pull requests on [GitHub](https://github.com/TheBinKing/Webis). For support, contact the maintainers or join the community discussion. \n",
"bugtrack_url": null,
"license": "MIT",
"summary": "HTML\u5185\u5bb9\u63d0\u53d6\u5de5\u5177\uff0c\u4f7f\u7528AI\u81ea\u52a8\u8bc6\u522b\u7f51\u9875\u4e0a\u7684\u6709\u4ef7\u503c\u4fe1\u606f",
"version": "1.0.6",
"project_urls": {
"Bug Tracker": "https://github.com/TheBinKing/Webis/issues",
"Homepage": "https://github.com/TheBinKing/Webis"
},
"split_keywords": [
"web",
" extraction",
" ai",
" content"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "b4fd907abe0f33b991adac902feea47c468d04605a6d499e9b5112e8461719d0",
"md5": "8ab9eb325720afb74c7dbd27d1a346ef",
"sha256": "8760b501919fd60e8d50ef593f9d65b71818c3cf9b8a4d25fad193be45ce5bc4"
},
"downloads": -1,
"filename": "webis_llm-1.0.6-py3-none-any.whl",
"has_sig": false,
"md5_digest": "8ab9eb325720afb74c7dbd27d1a346ef",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.8",
"size": 947667,
"upload_time": "2025-08-05T06:11:42",
"upload_time_iso_8601": "2025-08-05T06:11:42.026013Z",
"url": "https://files.pythonhosted.org/packages/b4/fd/907abe0f33b991adac902feea47c468d04605a6d499e9b5112e8461719d0/webis_llm-1.0.6-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "c32f97541cc3ca89acbdba9fc9fdc4cd22011a3d6a99ac7f3c346d2f6662b4de",
"md5": "e04f328fe04f49c1388e94e483b598a4",
"sha256": "f6bc8319374ba80447d09066968c97fb94d2462ce8bd7586f362d2ba8a15e340"
},
"downloads": -1,
"filename": "webis_llm-1.0.6.tar.gz",
"has_sig": false,
"md5_digest": "e04f328fe04f49c1388e94e483b598a4",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.8",
"size": 941013,
"upload_time": "2025-08-05T06:11:45",
"upload_time_iso_8601": "2025-08-05T06:11:45.012229Z",
"url": "https://files.pythonhosted.org/packages/c3/2f/97541cc3ca89acbdba9fc9fdc4cd22011a3d6a99ac7f3c346d2f6662b4de/webis_llm-1.0.6.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-08-05 06:11:45",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "TheBinKing",
"github_project": "Webis",
"github_not_found": true,
"lcname": "webis-llm"
}