# AI News Collector Library
A Python library for collecting AI-related news, with support for multiple search sources and advanced features.
## 🚀 Features
- **Multi-source search**: HackerNews, ArXiv, DuckDuckGo, NewsAPI, and more
- **Content extraction**: Automatically extracts full web page content
- **Keyword analysis**: Intelligent keyword extraction
- **Result caching**: Caches results to avoid redundant searches
- **Scheduled tasks**: Collects automatically on a schedule
- **Report generation**: Produces reports in multiple formats
- **Easy integration**: Simple API surface
## 📁 Project Structure
```
ai_news_collector_lib/
├── __init__.py                # Package entry point
├── cli.py                     # Command-line interface
├── config/                    # Configuration module
│   ├── __init__.py
│   ├── settings.py            # Search configuration
│   └── api_keys.py            # API key management
├── core/                      # Core functionality
│   ├── __init__.py
│   ├── collector.py           # Basic collector
│   └── advanced_collector.py  # Advanced collector
├── models/                    # Data models
│   ├── __init__.py
│   ├── article.py             # Article model
│   └── result.py              # Result model
├── tools/                     # Search tools
│   ├── __init__.py
│   └── search_tools.py        # Search tool implementations
├── utils/                     # Utilities
│   ├── __init__.py
│   ├── cache.py               # Cache management
│   ├── content_extractor.py   # Content extraction
│   ├── keyword_extractor.py   # Keyword extraction
│   ├── reporter.py            # Report generation
│   └── scheduler.py           # Task scheduling
├── tests/                     # Tests
├── examples/                  # Usage examples
├── scripts/                   # Build scripts
├── setup.py                   # Install configuration
├── pyproject.toml             # Project configuration
└── README.md                  # Project documentation
```
## 📦 Installation
### Basic installation
```bash
pip install ai-news-collector-lib
```
### With advanced features
```bash
pip install "ai-news-collector-lib[advanced]"
```
### Development install
```bash
git clone https://github.com/ai-news-collector/ai-news-collector-lib.git
cd ai-news-collector-lib
pip install -e .
```
## 🔧 Quick Start
### Basic usage
```python
import asyncio
from ai_news_collector_lib import AINewsCollector, SearchConfig

# Create a configuration
config = SearchConfig(
    enable_hackernews=True,
    enable_arxiv=True,
    enable_duckduckgo=True,
    max_articles_per_source=10
)

# Create a collector
collector = AINewsCollector(config)

# Collect news
async def main():
    result = await collector.collect_news("artificial intelligence")
    print(f"Collected {result.total_articles} articles")
    return result.articles

# Run
articles = asyncio.run(main())
```
### Advanced usage
```python
import asyncio
from ai_news_collector_lib import AdvancedAINewsCollector, AdvancedSearchConfig

# Create an advanced configuration
config = AdvancedSearchConfig(
    enable_hackernews=True,
    enable_arxiv=True,
    enable_duckduckgo=True,
    enable_content_extraction=True,
    enable_keyword_extraction=True,
    cache_results=True
)

# Create an advanced collector
collector = AdvancedAINewsCollector(config)

# Collect enriched news
async def main():
    result = await collector.collect_news_advanced("machine learning")

    # Analyze the result
    total_words = sum(article['word_count'] for article in result['articles'])
    print(f"Total word count: {total_words}")

    return result

# Run
enhanced_result = asyncio.run(main())
```
## 📊 Supported Search Sources
### Free sources
- 🔥 **HackerNews** - Tech community discussions
- 📚 **ArXiv** - Academic papers and preprints
- 🦆 **DuckDuckGo** - Privacy-focused web search
### Paid sources (API key required)
- 📡 **NewsAPI** - Multi-source news aggregation
- 🔍 **Tavily** - AI-powered search API
- 🌐 **Google Search** - Google Custom Search API
- 🔵 **Bing Search** - Microsoft Bing Search API
- ⚡ **Serper** - Fast Google search API
- 🦁 **Brave Search** - Independent, privacy-focused search API
- 🔬 **MetaSota Search** - Intelligent search service based on the MCP protocol
## ⚙️ Configuration
### Environment variables
```bash
# API keys
NEWS_API_KEY=your_newsapi_key
TAVILY_API_KEY=your_tavily_key
GOOGLE_SEARCH_API_KEY=your_google_key
GOOGLE_SEARCH_ENGINE_ID=your_engine_id
BING_SEARCH_API_KEY=your_bing_key
SERPER_API_KEY=your_serper_key
BRAVE_SEARCH_API_KEY=your_brave_key
METASOSEARCH_API_KEY=your_metasota_key
```
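These keys are consumed by the library's configuration layer (`config/api_keys.py`). As a minimal sketch of reading them yourself, using only the standard library (the `get_api_key` helper below is hypothetical, not a library API):

```python
import os
from typing import Optional

def get_api_key(name: str) -> Optional[str]:
    # Hypothetical helper: fetch a key from the environment, warn if missing.
    value = os.environ.get(name)
    if not value:
        print(f"Warning: {name} is not set; the corresponding source stays disabled")
    return value

news_api_key = get_api_key("NEWS_API_KEY")
```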
### Programmatic configuration
```python
from ai_news_collector_lib import SearchConfig

config = SearchConfig(
    # Traditional sources
    enable_hackernews=True,
    enable_arxiv=True,
    enable_newsapi=False,
    enable_rss_feeds=True,

    # Search engine sources
    enable_duckduckgo=True,
    enable_tavily=False,
    enable_google_search=False,
    enable_bing_search=False,
    enable_serper=False,
    enable_brave_search=False,
    enable_metasota_search=False,

    # Search parameters
    max_articles_per_source=10,
    days_back=7,
    similarity_threshold=0.85
)
```
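How the library applies `similarity_threshold` for deduplication is not spelled out here; as a rough illustration of what a threshold of 0.85 means, the sketch below compares article titles with `difflib` (an assumption for illustration, not the library's actual algorithm):

```python
from difflib import SequenceMatcher

def looks_duplicate(title_a: str, title_b: str, threshold: float = 0.85) -> bool:
    # Illustration only: treat two articles as duplicates when their titles
    # reach the similarity threshold under SequenceMatcher's ratio.
    ratio = SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio()
    return ratio >= threshold

print(looks_duplicate("OpenAI releases new model", "OpenAI Releases New Model!"))  # True
```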
## 🛠️ Advanced Features
### Scheduled tasks
```python
from ai_news_collector_lib import DailyScheduler

# Create a scheduler; collect_news is a callable you define (sketched below)
scheduler = DailyScheduler(
    collector_func=collect_news,
    schedule_time="09:00",
    timezone="Asia/Shanghai"
)

# Start the scheduler
scheduler.start()
```
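Whether `DailyScheduler` expects a sync or async callable is not documented here; a minimal sketch of a synchronous job body, built on the documented collector API:

```python
import asyncio
from ai_news_collector_lib import AINewsCollector, SearchConfig

def collect_news():
    # One collection pass, run to completion inside the scheduled job.
    collector = AINewsCollector(SearchConfig())
    return asyncio.run(collector.collect_news("artificial intelligence"))
```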
### Cache management
```python
from ai_news_collector_lib import CacheManager

# Create a cache manager
cache = CacheManager(cache_dir="./cache", default_ttl_hours=24)

# Check the cache (run this inside an async function, since collect_news is awaited)
cache_key = cache.get_cache_key("ai news", ["hackernews", "arxiv"])
cached_result = cache.get_cached_result(cache_key)

if cached_result:
    print("Using cached result")
else:
    # Perform the search and cache the result
    result = await collector.collect_news("ai news")
    cache.cache_result(cache_key, result)
```
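How `get_cache_key` derives its key is not specified here; one plausible scheme (purely an assumption, with a hypothetical `make_cache_key` helper) hashes the query together with the sorted source list, so that source order does not change the key:

```python
import hashlib

def make_cache_key(query: str, sources: list) -> str:
    # Hypothetical key scheme: stable digest of the query plus sorted sources.
    payload = query + "|" + ",".join(sorted(sources))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

print(make_cache_key("ai news", ["hackernews", "arxiv"]))
```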
### Report generation
```python
from ai_news_collector_lib import ReportGenerator

# Create a report generator
reporter = ReportGenerator(output_dir="./reports")

# Generate and save a report
report = reporter.generate_daily_report(result, format="markdown")
reporter.save_report(result, filename="daily_report.md")
```
## 📈 Usage Examples
### Daily collection script
```python
#!/usr/bin/env python3
import asyncio
from ai_news_collector_lib import AdvancedAINewsCollector, AdvancedSearchConfig

async def daily_collection():
    # Configuration
    config = AdvancedSearchConfig(
        enable_hackernews=True,
        enable_arxiv=True,
        enable_duckduckgo=True,
        enable_content_extraction=True,
        cache_results=True
    )

    # Create a collector
    collector = AdvancedAINewsCollector(config)

    # Collect several topics
    topics = ["artificial intelligence", "machine learning", "deep learning"]
    result = await collector.collect_multiple_topics(topics)

    print(f"Collection finished: {result['unique_articles']} unique articles")
    return result

if __name__ == "__main__":
    asyncio.run(daily_collection())
```
### Web API integration
```python
from fastapi import FastAPI
from ai_news_collector_lib import AINewsCollector, SearchConfig

app = FastAPI()
collector = AINewsCollector(SearchConfig())

@app.get("/ai-news")
async def get_ai_news(query: str = "artificial intelligence"):
    result = await collector.collect_news(query)
    return {
        "total": result.total_articles,
        "unique": result.unique_articles,
        "articles": [article.to_dict() for article in result.articles]
    }
```
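Assuming the snippet above is saved as `main.py`, it can be served with uvicorn (FastAPI's usual ASGI server, not a dependency of this library):

```bash
uvicorn main:app --reload
```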
## 🧪 Testing
```bash
# Run the test suite
pytest

# Run with verbose output
pytest -v

# Run a specific test module
pytest tests/test_collector.py
```
## 🗓️ ArXiv Date Parsing and Fallback
- By default, the `published` field is read via `BeautifulSoup` XML parsing; if that raises, parsing falls back to `feedparser`.
- In the `feedparser` branch, only one of the date fields may be present: `published_parsed` or `updated_parsed`, both of type `time.struct_time`.
- The fallback order is `published_parsed` → `updated_parsed` → `datetime.now()`, keeping each entry's timestamp as close as possible to the real publication time.
- When converting a `struct_time` to a `datetime`, only the first six fields (down to seconds) are used: `datetime(*entry.published_parsed[:6])` or `datetime(*entry.updated_parsed[:6])`.
- Time zones: a trailing `Z` in Atom means UTC. The BS4 branch applies `published_str.replace('Z', '+00:00')` and then parses with `datetime.fromisoformat` (see the sketch after this list); the `feedparser` branch builds a naive `datetime` directly from the `struct_time`.
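The BS4 branch is not excerpted below, so here is a minimal sketch of the parsing that last bullet describes, assuming an Atom `published` element and the `lxml` XML parser:

```python
from datetime import datetime
from bs4 import BeautifulSoup

xml = "<entry><published>2025-01-01T12:00:00Z</published></entry>"
soup = BeautifulSoup(xml, "xml")  # the "xml" feature requires lxml
published_str = soup.find("published").text

# A trailing 'Z' means UTC; fromisoformat needs an explicit offset before Python 3.11
published_date = datetime.fromisoformat(published_str.replace("Z", "+00:00"))
print(published_date)  # 2025-01-01 12:00:00+00:00
```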
Implementation excerpt (from `ArxivTool` in `ai_news_collector_lib/tools/search_tools.py`):
```python
feed = feedparser.parse(response.content)
for entry in feed.entries:
    # feedparser may provide only published_parsed or only updated_parsed
    # Fallback order: published_parsed > updated_parsed > current time
    try:
        if hasattr(entry, 'published_parsed') and entry.published_parsed:
            published_date = datetime(*entry.published_parsed[:6])
        elif hasattr(entry, 'updated_parsed') and entry.updated_parsed:
            published_date = datetime(*entry.updated_parsed[:6])
        else:
            published_date = datetime.now()
    except Exception:
        published_date = datetime.now()
```
Minimal verification script: `scripts/min_check_feedparser_fallback.py`
```bash
python scripts/min_check_feedparser_fallback.py
```
The script constructs sample RSS (`pubDate`) and Atom (`updated`) feeds and verifies that, when only one of the date fields is present, the fallback logic runs cleanly without raising.
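The script itself is not reproduced here; a minimal sketch of such a check, with made-up feed strings, could look like this:

```python
import feedparser
from datetime import datetime

RSS = """<rss version="2.0"><channel><item>
<title>t</title><pubDate>Wed, 01 Jan 2025 12:00:00 GMT</pubDate>
</item></channel></rss>"""

ATOM = """<feed xmlns="http://www.w3.org/2005/Atom"><entry>
<title>t</title><updated>2025-01-01T12:00:00Z</updated>
</entry></feed>"""

for raw in (RSS, ATOM):
    entry = feedparser.parse(raw).entries[0]
    # Same fallback order as the ArxivTool excerpt above
    if getattr(entry, "published_parsed", None):
        published = datetime(*entry.published_parsed[:6])
    elif getattr(entry, "updated_parsed", None):
        published = datetime(*entry.updated_parsed[:6])
    else:
        published = datetime.now()
    print(published)  # never raises; falls back when a field is absent
```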
## 📚 Documentation
- [Full documentation](https://ai-news-collector-lib.readthedocs.io/)
- [API reference](https://ai-news-collector-lib.readthedocs.io/api/)
- [Example code](https://github.com/ai-news-collector/ai-news-collector-lib/tree/main/examples)
## 🤝 Contributing
Contributions are welcome! See the [contributing guide](CONTRIBUTING.md) for details.
## 📄 License
This project is released under the MIT License. See the [LICENSE](LICENSE) file for details.
## 🆘 Support
- [Issue tracker](https://github.com/ai-news-collector/ai-news-collector-lib/issues)
- [Discussions](https://github.com/ai-news-collector/ai-news-collector-lib/discussions)
- [Email support](mailto:support@ai-news-collector.com)
## 🔄 Changelog
### v0.1.0 (2025-10-07)
- Initial pre-release
- Basic search functionality
- Multiple search sources
- Advanced features (content extraction, keyword analysis, caching, and more)
- ⚠️ Note: this is a pre-release; functionality may be unstable
---
**Happy collecting!** 🎉