wespy


| Field | Value |
|---|---|
| Name | wespy |
| Version | 0.1.2 |
| Home page | https://github.com/tianchangNorth/WeSpy |
| Summary | A Python tool for fetching web articles and converting them to Markdown format |
| Upload time | 2025-07-30 07:52:30 |
| Maintainer | None |
| Docs URL | None |
| Author | tianchang |
| Requires Python | >=3.6 |
| License | MIT |
| Keywords | web scraping, article extraction, markdown converter, weixin |
| Requirements | No requirements were recorded. |

# WeSpy

[![PyPI version](https://badge.fury.io/py/wespy.svg)](https://badge.fury.io/py/wespy)
[![Python Support](https://img.shields.io/pypi/pyversions/wespy.svg)](https://pypi.org/project/wespy/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

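WeSpy is a Python tool that fetches WeChat Official Account articles and converts them to Markdown, with image hotlink-protection handling and multiple output formats.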

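## Features

- 🚀 **Smart article extraction**: automatically detects the article title, author, publish time, and body
- 📱 **WeChat Official Account support**: extraction is specifically tuned for WeChat Official Account articles
- 🖼️ **Image hotlink-protection handling**: automatically works around image hotlink protection so that images display correctly
- 📝 **Multiple output formats**: HTML, JSON, and Markdown
- 🌐 **General web page support**: article extraction works on most websites
- 🎯 **Command-line friendly**: a simple, easy-to-use command-line interface
- 📂 **Batch processing**: multiple article links can be processed by scripting the Python API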

## Installation

### Install with pip (recommended)

```bash
pip install wespy
```

### Install from source

```bash
git clone https://github.com/tianchangNorth/WeSpy.git
cd WeSpy
pip install -e .
```

## Quick start

### Command-line usage

```bash
# Fetch a WeChat Official Account article
wespy "https://mp.weixin.qq.com/s/xxxxx"

# Specify the output directory
wespy "https://mp.weixin.qq.com/s/xxxxx" -o /path/to/output

# Show verbose output
wespy "https://example.com/article" -v
```

### Interactive use

If no arguments are given, the program enters interactive mode:

```bash
wespy
```

Then enter the article URL and the output directory when prompted.

### Python API usage

```python
from wespy import ArticleFetcher

# Create an article fetcher instance
fetcher = ArticleFetcher()

# Fetch the article
article_info = fetcher.fetch_article(
    url="https://mp.weixin.qq.com/s/xxxxx",
    output_dir="articles"
)

if article_info:
    print(f"Title: {article_info['title']}")
    print(f"Author: {article_info['author']}")
    print(f"Published: {article_info['publish_time']}")
```

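## Output formats

WeSpy writes three files for each article, one per format:

1. **HTML file**: the original HTML content
2. **JSON file**: the article metadata
3. **Markdown file**: the content converted to Markdown

### Example output files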

```
articles/
├── Article_Title_1627834567.html      # Original HTML
├── Article_Title_1627834567.md        # Markdown
└── Article_Title_1627834567_info.json # Metadata
```

### JSON metadata format

```json
{
  "title": "Article title",
  "author": "Author name",
  "publish_time": "2023-07-30",
  "url": "https://example.com/article",
  "html_file": "Article_Title_1627834567.html",
  "fetch_time": "2023-07-30 12:34:56"
}
```
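
Because each article gets its own `_info.json` file, a small script can build an index of everything fetched so far. The sketch below is illustrative rather than part of WeSpy: the helper name `load_article_index` is hypothetical, and it assumes the files are UTF-8 encoded and carry the fields shown in the sample above.

```python
import json
from pathlib import Path


def load_article_index(output_dir="articles"):
    """Return the parsed metadata of every *_info.json file, sorted by fetch_time.

    Hypothetical helper: it only relies on the file-name suffix and the
    field names shown in the sample metadata.
    """
    records = []
    for path in Path(output_dir).glob("*_info.json"):
        with path.open(encoding="utf-8") as fh:
            records.append(json.load(fh))
    # Records without a fetch_time sort first instead of raising KeyError.
    return sorted(records, key=lambda record: record.get("fetch_time", ""))


if __name__ == "__main__":
    for record in load_article_index("articles"):
        print(record.get("fetch_time"), record.get("title"))
```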

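## Supported websites

### Full support
- WeChat Official Accounts (mp.weixin.qq.com)
- Most blogs and news sites built on a standard HTML structure

### General support
WeSpy uses heuristics to try to extract content from the following elements:
- `<article>` tags
- Elements with a class such as `content`, `article-content`, or `post-content`
- `<main>` tags
- Standard meta tag information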
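
The fallback idea behind this list can be sketched with the standard library alone. This is an illustration, not WeSpy's actual implementation: the names `_Collector` and `extract_main_text` are hypothetical, the priority order (`<article>`, then content classes, then `<main>`) is an assumption, and meta tags are left out for brevity.

```python
from html.parser import HTMLParser

# Class names taken from the list above.
CONTENT_CLASSES = {"content", "article-content", "post-content"}


class _Collector(HTMLParser):
    """Collect the text of the first <article>, content-class element, and <main>."""

    def __init__(self):
        super().__init__()
        self.found = {}  # rule name -> list of text fragments
        self._open = {}  # rule name -> [tag, nesting depth] while inside the element

    @staticmethod
    def _rules_for(tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        if tag == "article":
            yield "article"
        if classes & CONTENT_CLASSES:
            yield "class"
        if tag == "main":
            yield "main"

    def handle_starttag(self, tag, attrs):
        # Track nested elements with the same tag so the right end tag closes a rule.
        for state in self._open.values():
            if state[0] == tag:
                state[1] += 1
        for rule in self._rules_for(tag, attrs):
            if rule not in self.found:  # only the first match per rule is kept
                self.found[rule] = []
                self._open[rule] = [tag, 1]

    def handle_endtag(self, tag):
        for rule in list(self._open):
            state = self._open[rule]
            if state[0] == tag:
                state[1] -= 1
                if state[1] == 0:
                    del self._open[rule]

    def handle_data(self, data):
        for rule in self._open:
            self.found[rule].append(data)


def extract_main_text(html):
    """Return (rule, text) for the first rule that yields non-empty text."""
    parser = _Collector()
    parser.feed(html)
    parser.close()
    for rule in ("article", "class", "main"):
        text = " ".join("".join(parser.found.get(rule, [])).split())
        if text:
            return rule, text
    return None, ""


if __name__ == "__main__":
    page = '<main><div class="post-content"><p>Hello <b>world</b></p></div></main>'
    print(extract_main_text(page))
```

Returning the name of the rule that matched makes it easy to see which fallback was used for a given page.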

## Command-line options

```
wespy [-h] [-o OUTPUT] [-v] url

Fetch article content and convert it to Markdown

positional arguments:
  url                   Article URL

optional arguments:
  -h, --help            Show help message
  -o OUTPUT, --output OUTPUT
                        Output directory (default: articles)
  -v, --verbose         Show verbose output
```

## Requirements

- Python 3.6+
- requests >= 2.20.0
- beautifulsoup4 >= 4.9.0

## Development

### Setting up a development environment

```bash
git clone https://github.com/tianchangNorth/WeSpy.git
cd WeSpy
pip install -e ".[dev]"
```

### Running tests

```bash
python -m pytest tests/
```

### Code formatting

```bash
black wespy/
flake8 wespy/
```

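## FAQ

### Q: Why do some images fail to display?
A: WeSpy uses images.weserv.nl as a proxy service to work around image hotlink protection. If an image still does not display, the original image may have been deleted, or there may be a network problem.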
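
To make the proxy idea concrete, the sketch below rewrites an image URL so that it is requested through images.weserv.nl using that service's `url` query parameter. It is a minimal illustration, not WeSpy's code: the function name `proxied_image_url` is hypothetical, and how WeSpy builds its proxy URLs internally is not documented here.

```python
from urllib.parse import quote

WESERV_ENDPOINT = "https://images.weserv.nl/"


def proxied_image_url(src):
    """Wrap an image URL so it is fetched through the images.weserv.nl proxy.

    Hypothetical helper: the original URL is percent-encoded and passed
    in the `url` query parameter.
    """
    if src.startswith(WESERV_ENDPOINT):
        return src  # already proxied, leave unchanged
    return WESERV_ENDPOINT + "?url=" + quote(src, safe="")


if __name__ == "__main__":
    print(proxied_image_url("https://mmbiz.qpic.cn/a.jpg?wx_fmt=jpeg"))
```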

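### Q: Which websites are supported?
A: WeSpy is specifically tuned for WeChat Official Accounts and works reasonably well on most sites that use a standard HTML structure. If a site is not supported, feel free to open an issue.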

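### Q: How do I process articles in batch?
A: Batch processing currently requires a script that calls the Python API; the command-line interface does not support it yet.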
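
A batch script along these lines might look like the sketch below. The helper `fetch_many` is hypothetical; it accepts any callable with the `url=` and `output_dir=` keywords used by `fetch_article` in the Python API example, and it assumes a falsy return value means the fetch failed, as that example's `if article_info:` check suggests.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def fetch_many(
    fetch: Callable[..., Optional[dict]],
    urls: Iterable[str],
    output_dir: str = "articles",
) -> Tuple[List[dict], Dict[str, str]]:
    """Call fetch(url=..., output_dir=...) for each URL.

    Returns (successes, failures), where failures maps each URL to a
    short reason. One failing article does not stop the rest.
    """
    succeeded: List[dict] = []
    failed: Dict[str, str] = {}
    for url in urls:
        try:
            info = fetch(url=url, output_dir=output_dir)
        except Exception as exc:  # keep going if a single article raises
            failed[url] = repr(exc)
            continue
        if info:
            succeeded.append(info)
        else:
            failed[url] = "no article info returned"
    return succeeded, failed


# With WeSpy installed, the documented method can be passed in directly:
#   from wespy import ArticleFetcher
#   done, failed = fetch_many(ArticleFetcher().fetch_article,
#                             ["https://mp.weixin.qq.com/s/xxxxx"])
```

Passing the fetch function in as an argument keeps the loop easy to test without network access.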

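## Contributing

Issues and pull requests are welcome!

1. Fork this repository
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## License

This project is released under the MIT License. See the [LICENSE](LICENSE) file for details.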

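## Changelog

### v0.1.0 (2023-07-30)
- Initial release
- WeChat Official Account article extraction
- General web page article extraction
- HTML/JSON/Markdown output
- Image hotlink-protection handling
- Command-line interface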

## Contact

- GitHub: [https://github.com/tianchangNorth/WeSpy](https://github.com/tianchangNorth/WeSpy)
- Issues: [https://github.com/tianchangNorth/WeSpy/issues](https://github.com/tianchangNorth/WeSpy/issues)

---

**Note**: Please respect each website's robots.txt and terms of use, and use this tool responsibly.

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/tianchangNorth/WeSpy",
    "name": "wespy",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": null,
    "keywords": "web scraping, article extraction, markdown converter, weixin",
    "author": "tianchang",
    "author_email": null,
    "download_url": "https://files.pythonhosted.org/packages/79/27/d974eb24dfffb5df3fcd81b3428272145cf5539c5b6e1796bf13a0250723/wespy-0.1.2.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A Python tool for fetching web articles and converting them to Markdown format",
    "version": "0.1.2",
    "project_urls": {
        "Bug Reports": "https://github.com/tianchangNorth/WeSpy/issues",
        "Homepage": "https://github.com/tianchangNorth/WeSpy",
        "Source": "https://github.com/tianchangNorth/WeSpy"
    },
    "split_keywords": [
        "web scraping",
        " article extraction",
        " markdown converter",
        " weixin"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "85cc680fc6eb467bd58160af175792db5fe252121fd91c6dd55bffdc3ebfa434",
                "md5": "6472247222acc6798057fabf55d864f7",
                "sha256": "4bc1d577bd382b85b7fc052a88e057c61f8f481d630ef7ebb0ef9a558e10a6d0"
            },
            "downloads": -1,
            "filename": "wespy-0.1.2-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "6472247222acc6798057fabf55d864f7",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 9552,
            "upload_time": "2025-07-30T07:52:29",
            "upload_time_iso_8601": "2025-07-30T07:52:29.278250Z",
            "url": "https://files.pythonhosted.org/packages/85/cc/680fc6eb467bd58160af175792db5fe252121fd91c6dd55bffdc3ebfa434/wespy-0.1.2-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "7927d974eb24dfffb5df3fcd81b3428272145cf5539c5b6e1796bf13a0250723",
                "md5": "16d91367f8f3a235c7e04daabb2ceb26",
                "sha256": "fa21ca99f840a5d4867e88fbee4a380dbaefd16174ba001b518c407f58d8afa8"
            },
            "downloads": -1,
            "filename": "wespy-0.1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "16d91367f8f3a235c7e04daabb2ceb26",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 12269,
            "upload_time": "2025-07-30T07:52:30",
            "upload_time_iso_8601": "2025-07-30T07:52:30.644553Z",
            "url": "https://files.pythonhosted.org/packages/79/27/d974eb24dfffb5df3fcd81b3428272145cf5539c5b6e1796bf13a0250723/wespy-0.1.2.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-30 07:52:30",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "tianchangNorth",
    "github_project": "WeSpy",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "wespy"
}
        