xhs-crawl

Name: xhs-crawl
Version: 0.1.3
Summary: An asynchronous Xiaohongshu (RED) crawler that supports batch downloading of note content and images
Author: LGrok
Requires-Python: <4.0,>=3.9
License: MIT
Keywords: spider, crawler, xiaohongshu, download
Homepage: https://github.com/LinLL/xhs-crawl
Bug Tracker: https://github.com/LinLL/xhs-crawl/issues
Upload time: 2025-02-11 02:20:48
# XHS Crawl

A content crawler for Xiaohongshu (RED).

## Features

- Post content scraping
- Image download
- Asynchronous processing
- Automatic retry mechanism (see the sketch below)
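
The retry behavior is internal to the library. As a rough illustration of the idea only, a minimal async retry wrapper might look like the following; the `fetch_with_retry` helper and its parameters are hypothetical and not part of xhs-crawl's API:

```python
import asyncio

async def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Hypothetical helper: retry a failing async fetch with exponential backoff."""
    for attempt in range(1, retries + 1):
        try:
            return await fetch(url)
        except Exception:
            if attempt == retries:
                raise  # out of attempts, propagate the last error
            # wait delay, 2*delay, 4*delay, ... before the next try
            await asyncio.sleep(delay * 2 ** (attempt - 1))
```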

## Installation

```bash
pip install xhs-crawl
```

## Usage

### Command-line tool

Once installed, you can download Xiaohongshu post content directly from the command line:

```bash
xhs-crawl "https://www.xiaohongshu.com/explore/[POST_ID]" -d "./downloads"
```

Arguments:
- The first positional argument is the Xiaohongshu post URL (required)
- `-d` or `--dir`: directory to save images to; defaults to `./downloads`

### Calling from Python

You can also use it from Python code:

```python
import asyncio
from xhs_crawl import XHSSpider

async def main():
    # Initialize the crawler
    spider = XHSSpider()

    try:
        # Fetch the post data
        url = "https://www.xiaohongshu.com/explore/[POST_ID]"
        post = await spider.get_post_data(url)

        if post:
            print(f"Title: {post.title}")
            print(f"Content: {post.content}")
            print(f"Found {len(post.images)} images")

            # Download the images
            await spider.download_images(post, "./downloads")
    finally:
        # Close the client connection
        await spider.close()

if __name__ == "__main__":
    asyncio.run(main())
```
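
Because the spider is asynchronous, several posts can also be fetched concurrently. A minimal sketch, assuming `get_post_data` and `download_images` behave as in the example above; the `crawl_many` batch helper itself is hypothetical, not part of the package:

```python
import asyncio
from xhs_crawl import XHSSpider

async def crawl_many(urls, save_dir="./downloads"):
    # Hypothetical batch helper: one spider, all posts fetched concurrently.
    spider = XHSSpider()
    try:
        posts = await asyncio.gather(*(spider.get_post_data(u) for u in urls))
        for post in posts:
            if post:
                await spider.download_images(post, save_dir)
    finally:
        await spider.close()

# asyncio.run(crawl_many(["https://www.xiaohongshu.com/explore/[POST_ID]"]))
```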

### Returned data structure

The `post` object returned by `get_post_data` has the following attributes:

- `post_id`: the post ID
- `title`: the post title
- `content`: the post body text
- `images`: a list of image URLs contained in the post
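
For illustration, the returned object is shaped roughly like the dataclass below. This is a sketch based on the attribute list above; the actual class inside `xhs_crawl` may be defined differently:

```python
from dataclasses import dataclass, field

@dataclass
class Post:
    # Sketch of the structure described above; the real class may differ.
    post_id: str                                     # post ID
    title: str                                       # post title
    content: str                                     # post body text
    images: list[str] = field(default_factory=list)  # image URLs in the post
```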

## Notes

1. Make sure the URL you provide is well-formed
2. The download directory must be writable
3. Keep the crawl rate reasonable to avoid putting pressure on the target site (see the throttling sketch after this list)
4. This tool is for learning and research purposes only; please comply with applicable laws and regulations
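
To keep the request rate reasonable (note 3), you can wrap the fetch calls with a semaphore and a pause. A minimal sketch, assuming the `get_post_data` API from the usage example; the `polite_fetch` helper is hypothetical, not provided by the package:

```python
import asyncio

async def polite_fetch(spider, url, semaphore, pause=1.0):
    # Hypothetical throttle: cap concurrency and pause before releasing the slot.
    async with semaphore:
        post = await spider.get_post_data(url)
        await asyncio.sleep(pause)
        return post

# Inside an async main(): sem = asyncio.Semaphore(2), then
# await asyncio.gather(*(polite_fetch(spider, u, sem) for u in urls))
```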

## License

MIT License
