baidu-serp-api

Name: baidu-serp-api
Version: 1.1.7
Summary: A library to extract data from Baidu SERP and output it as JSON objects
Homepage: https://github.com/ohblue/baidu-serp-api
Upload time: 2025-08-01 05:59:58
Requires Python: >=3.12
License: GPL-3.0
Keywords: baidu, serp, search, scraping, json
[English](#baidu-serp-api) | [中文](README_CN.md)

# Baidu SERP API

A Python library to extract data from Baidu Search Engine Results Pages (SERP) and output it as JSON objects.

## Installation

```bash
pip install baidu-serp-api
```

## Usage

### Basic Usage

```python
from baidu_serp_api import BaiduPc, BaiduMobile

# Basic usage (default optimized for proxy rotation)
pc_serp = BaiduPc()
results = pc_serp.search('keyword', date_range='20240501,20240531', pn='2', proxies={'http': 'http://your-proxy-server:port'})
print(results)

m_serp = BaiduMobile()
results = m_serp.search('keyword', date_range='day', pn='2', proxies={'http': 'http://your-proxy-server:port'})
print(results)

# Exclude specified fields; the returned results will not contain 'recommend', 'last_page', or 'match_count'
results = m_serp.search('关键词', exclude=['recommend', 'last_page', 'match_count'])
```

### Network Connection Optimization

#### Connection Mode Configuration

```python
# Single connection mode (default, suitable for proxy rotation and scraping)
pc = BaiduPc(connection_mode='single')

# Connection pool mode (suitable for fixed proxy or high-performance scenarios)
pc = BaiduPc(connection_mode='pooled')

# Custom mode (fully customizable parameters)
pc = BaiduPc(
    connection_mode='custom',
    connect_timeout=5,
    read_timeout=15,
    pool_connections=5,
    pool_maxsize=20,
    keep_alive=True
)
```

#### Performance Monitoring

```python
# Get performance data
results = pc.search('keyword', include_performance=True)
if results['code'] == 200:
    performance = results['data']['performance']
    print(f"Response time: {performance['response_time']}s")
    print(f"Status code: {performance['status_code']}")
```

#### Resource Management

```python
# Manual resource management
pc = BaiduPc()
try:
    results = pc.search('keyword')
finally:
    pc.close()  # Manually release resources

# Recommended: Use context manager
with BaiduPc() as pc:
    results = pc.search('keyword')
# Automatically release resources
```

## Parameters

### Search Parameters

- `keyword`: The search keyword.
- `date_range` (optional): Restrict results to a date range. The format is a range string such as `'20240501,20240531'`, meaning results between May 1, 2024 and May 31, 2024.
- `pn` (optional): Page number of results to fetch, passed as a string (e.g. `'2'`).
- `proxies` (optional): Use proxies for searching.
- `exclude` (optional): Exclude specified fields, e.g., `['recommend', 'last_page']`.
- `include_performance` (optional): Whether to include performance data, default `False`.
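A `date_range` string in the explicit `'YYYYMMDD,YYYYMMDD'` form can be sanity-checked before a request is sent. The following helper is our own sketch, not part of the library's API, and it deliberately does not cover shorthand values such as the `'day'` seen in the mobile example above:

```python
from datetime import datetime

def validate_date_range(date_range: str) -> bool:
    """Return True if date_range looks like 'YYYYMMDD,YYYYMMDD' with start <= end."""
    try:
        start_s, end_s = date_range.split(',')
        start = datetime.strptime(start_s, '%Y%m%d')
        end = datetime.strptime(end_s, '%Y%m%d')
    except ValueError:
        # wrong separator count or a date that does not parse as YYYYMMDD
        return False
    return start <= end
```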

### Connection Configuration Parameters

- `connection_mode`: Connection mode, options:
  - `'single'` (default): Single connection mode, suitable for proxy rotation
  - `'pooled'`: Connection pool mode, suitable for high-performance scenarios
  - `'custom'`: Custom mode, use custom parameters
- `connect_timeout`: Connection timeout in seconds, default 5
- `read_timeout`: Read timeout in seconds, default 10
- `max_retries`: Maximum retry count, default 0
- `pool_connections`: Number of connection pools, default 1
- `pool_maxsize`: Maximum connections per pool, default 1
- `keep_alive`: Whether to enable keep-alive, default `False`

## Technical Details

### PC Version Request Headers & Cookies

**Key Request Parameters:**
- `rsv_pq`: Random query parameter (64-bit hex)
- `rsv_t`: Random timestamp hash
- `oq`: Original query (same as search keyword)

**Cookie Parameters (automatically generated):**
- `BAIDUID`: Unique browser identifier (32-char hex)
- `H_PS_645EC`: Synchronized with `rsv_t` parameter
- `H_PS_PSSID`: Session ID with multiple numeric segments
- `BAIDUID_BFESS`: Same as BAIDUID for security
- Plus 13 additional cookies for complete browser simulation

### Mobile Version Request Headers & Cookies

**Key Request Parameters:**
- `rsv_iqid`: Random identifier (19 digits)
- `rsv_t`: Random timestamp hash
- `sugid`: Suggestion ID (14 digits)
- `rqid`: Request ID (same as rsv_iqid)
- `inputT`: Input timestamp
- Plus 11 additional parameters for mobile simulation

**Cookie Parameters (automatically generated):**
- `BAIDUID`: Synchronized with internal parameters
- `H_WISE_SIDS`: Mobile-specific session with 80 numeric segments
- `rsv_i`: Complex encoded string (64 chars)
- `__bsi`: Special session ID format
- `FC_MODEL`: Feature model parameters
- Plus 14 additional cookies for mobile browser simulation

All parameters are automatically generated and synchronized to ensure realistic browser behavior.
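To illustrate the kind of parameter generation described above (this is a sketch of the idea only, not the library's internal code), a 32-character hex identifier of the `BAIDUID` shape can be produced with the standard library:

```python
import secrets

def make_browser_id() -> str:
    # 16 random bytes -> 32 hexadecimal characters, in the shape of a BAIDUID value
    return secrets.token_hex(16).upper()
```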

## Return Values

### Successful Response

- `{'code': 200, 'msg': 'ok', 'data': {...}}`: Successful response
  - `results`: Search results list
  - `recommend`: Basic recommendation keywords (may be an empty array)
  - `ext_recommend`: Extended recommendation keywords (mobile only; may be an empty array)
  - `last_page`: Whether this is the last page of results
  - `match_count`: Number of matching results
  - `performance` (optional): Performance data containing `response_time` and `status_code`
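The `last_page` flag makes paging straightforward. Below is a minimal sketch of a pagination loop; the `collect_all_pages` helper is our own, and `search` stands in for any callable with the `(keyword, pn=...)` signature used in the earlier examples (e.g. `pc.search`):

```python
def collect_all_pages(search, keyword, max_pages=10):
    """Page through results until the response reports 'last_page'."""
    all_results = []
    for page in range(1, max_pages + 1):
        resp = search(keyword, pn=str(page))  # pn is passed as a string, as in the examples
        if resp['code'] != 200:
            break
        data = resp['data']
        all_results.extend(data.get('results', []))
        if data.get('last_page'):
            break
    return all_results
```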

### Error Response

#### Application Errors (400-499)
- `{'code': 404, 'msg': '未找到相关结果'}`: No relevant results found
- `{'code': 405, 'msg': '无搜索结果'}`: No search results

#### Server Errors (500-523)
- `{'code': 500, 'msg': '请求异常'}`: General network request exception
- `{'code': 501, 'msg': '百度安全验证'}`: Baidu security verification required
- `{'code': 502, 'msg': '响应提前结束'}`: Response data incomplete
- `{'code': 503, 'msg': '连接超时'}`: Connection timeout
- `{'code': 504, 'msg': '读取超时'}`: Read timeout
- `{'code': 505-510}`: Proxy-related errors (connection reset, auth failure, etc.)
- `{'code': 511-513}`: SSL-related errors (certificate verification, handshake failure, etc.)
- `{'code': 514-519}`: Connection errors (connection refused, DNS resolution failure, etc.)
- `{'code': 520-523}`: HTTP errors (403 forbidden, 429 rate limit, server error, etc.)
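The numeric ranges above can be folded into a small helper when logging results or deciding whether to retry. The function and category names here are our own, mirroring the groupings documented above:

```python
def classify_error(code: int) -> str:
    """Map a response code to a coarse category per the documented ranges."""
    if code == 200:
        return 'ok'
    if 400 <= code <= 499:
        return 'application'   # no / empty results
    if code in (503, 504):
        return 'timeout'       # connection or read timeout
    if 505 <= code <= 510:
        return 'proxy'
    if 511 <= code <= 513:
        return 'ssl'
    if 514 <= code <= 519:
        return 'connection'
    if 520 <= code <= 523:
        return 'http'
    if 500 <= code <= 523:
        return 'server'        # 500-502: request exception, verification, truncated response
    return 'unknown'
```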

## Connection Optimization Best Practices

### Proxy Rotation Scenarios
```python
# Recommended configuration: default single mode is already optimized
with BaiduPc() as pc:  # Automatically uses single connection to avoid connection reuse issues
    for proxy in proxy_list:
        results = pc.search('keyword', proxies=proxy)
        # Process results...
```

### High-Performance Fixed Proxy Scenarios
```python
# Use pooled mode for better performance
with BaiduPc(connection_mode='pooled') as pc:
    results = pc.search('keyword', proxies=fixed_proxy)
    # Connection pool automatically manages connection reuse
```

### Error Handling and Retry
```python
def robust_search(keyword, max_retries=3):
    results = None
    for attempt in range(max_retries):
        with BaiduPc() as pc:
            results = pc.search(keyword, include_performance=True)

            if results['code'] == 200:
                return results
            elif results['code'] in [503, 504]:  # Timeout errors
                continue  # Retry
            elif results['code'] in [505, 506, 514, 515]:  # Connection issues
                continue  # Retry
            else:
                break  # Don't retry other errors

    return results  # None if max_retries == 0, otherwise the last response
```

## Mobile Extended Recommendations

The mobile version supports two types of recommendations:
- `recommend`: Basic recommendation keywords extracted directly from search results page
- `ext_recommend`: Extended recommendation keywords obtained through additional API call

How to get extended recommendations:

```python
# Get all recommendations (including extended recommendations)
results = m_serp.search('keyword', exclude=[])

# Get only basic recommendations (default behavior)
results = m_serp.search('keyword')  # equivalent to exclude=['ext_recommend']

# Get no recommendations
results = m_serp.search('keyword', exclude=['recommend'])  # automatically excludes ext_recommend
```

**Notes**:
- Extended recommendations require an additional network request and are only fetched on the first page (pn=1 or None)
- Extended recommendations depend on basic recommendations; if basic recommendations are excluded, extended recommendations are automatically excluded as well
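The dependency rules above can be expressed as a small normalization step. This is a sketch of the documented behavior, not the library's internal code, and the helper name is our own:

```python
def normalize_exclude(exclude=None, pn=None):
    """Apply the documented exclusion rules for recommendation fields.

    - default (exclude=None) behaves like ['ext_recommend']
    - excluding 'recommend' implies excluding 'ext_recommend'
    - beyond the first page, 'ext_recommend' is never fetched
    """
    excluded = set(exclude) if exclude is not None else {'ext_recommend'}
    if 'recommend' in excluded:
        excluded.add('ext_recommend')
    if pn not in (None, '1', 1):
        excluded.add('ext_recommend')
    return excluded
```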

## Disclaimer
This project is intended for educational purposes only and must not be used for commercial purposes or for large-scale scraping of Baidu data. It is licensed under the GPLv3 open-source license; other projects that use its content must themselves be open-sourced and credit the source. The author accepts no responsibility for legal risks arising from misuse; anyone who misuses this project does so at their own risk.
