pdf-ai-extractor

- Name: pdf-ai-extractor
- Version: 1.0.0
- Summary: PDF metadata and content extraction tool with AI-powered analysis
- Home page: https://github.com/changyy/py-pdf-ai-extractor
- Author: changyy
- Requires Python: >=3.9
- Uploaded: 2024-11-27 15:31:12
# pdf-ai-extractor

![PyPI](https://img.shields.io/pypi/v/pdf-ai-extractor.svg)

PDF metadata and content extraction tool with AI-powered analysis capabilities.

## Features

- Extract PDF metadata including abstract, keywords, and bookmarks/TOC
- Multiple analysis backends:
  - Local extraction (no AI)
  - OpenAI-powered analysis
  - xAI-powered analysis
  - Hugging Face models integration

## Installation

```bash
pip install pdf-ai-extractor
```

From source:
```bash
git clone https://github.com/changyy/py-pdf-ai-extractor.git
cd py-pdf-ai-extractor
pip install -e .
```

## Usage

### Command Line Interface

```bash
# Basic usage (local extraction)
pdf-ai-extractor input.pdf

# Using OpenAI backend
pdf-ai-extractor --backend openai --api-key YOUR_API_KEY input.pdf

# Using xAI backend
pdf-ai-extractor --backend xai --api-key YOUR_API_KEY input.pdf

# Using Hugging Face backend
pdf-ai-extractor --backend huggingface --model-name "model/name" input.pdf

# Save output to file
pdf-ai-extractor input.pdf --output result.json
```

## Example

```bash
% wget https://github.com/datawhalechina/leedl-tutorial/releases/download/v1.2.2/LeeDL_Tutorial_v.1.2.2.pdf -O ~/Downloads/LeeDL_Tutorial.pdf

% pdf-ai-extractor -b openai -k 'sk-XXXXXXX' ~/Downloads/LeeDL_Tutorial.pdf    
[
  {
    "path": "/path/Downloads/LeeDL_Tutorial.pdf",
    "abstract": "The 'LeeDL Tutorial' authored by Wang Qi, Yang Yiyuan, and Jiang Ji is a comprehensive guide to deep learning, inspired by the popular machine learning course by Professor Li Hongyi from National Taiwan University. This tutorial aims to make deep learning accessible to Chinese-speaking students by simplifying complex theories and providing detailed derivations of formulas. It covers essential topics in deep learning, including foundational concepts, practical methodologies, and advanced techniques, while integrating original content and supplementary materials from previous courses. The tutorial is designed for beginners and those seeking to deepen their understanding of deep learning, making it a recommended resource for students interested in the field. The authors, who are members of the Datawhale organization, have backgrounds in artificial intelligence, reinforcement learning, and computer vision, further enhancing the tutorial's credibility and depth.",
    "keywords": [
      "深度学习",
      "机器学习",
      "李宏毅",
      "教程",
      "人工智能",
      "数据挖掘",
      "优化算法",
      "模型训练",
      "中文教育",
      "开源组织"
    ],
    "bookmarks": [
      "机器学习基础",
      "案例学习",
      "线性模型",
      "分段线性曲线",
      "模型变形",
      "机器学习框架",
      "实践方法论",
      "模型偏差",
      "优化问题",
      "过拟合",
      "交叉验证",
      "不匹配",
      "深度学习基础",
      "局部极小值与鞍点",
      "临界点及其种类",
      "判断临界值种类的方法",
      "批量和动量",
      "批量大小对梯度下降法的影响",
      "动量法",
      "自适应学习率",
      "AdaGrad",
      "RMSProp",
      "Adam",
      "学习率调度",
      "优化总结",
      "分类",
      "分类与回归的关系",
      "带有 softmax 的分类",
      "分类损失",
      "批量归一化",
      "考虑深度学习",
      "测试时的批量归一化",
      "内部协变量偏移",
      "卷积神经网络",
      "观察1:检测模式不需要整张图像",
      "简化1:感受野",
      "观察2:同样的模式可能会出现在图像的不同区域",
      "简化2:共享参数",
      "简化1和2的总结",
      "观察3:下采样不影响模式检测",
      "简化3:汇聚",
      "卷积神经网络的应用:下围棋",
      "循环神经网络",
      "独热编码",
      "什么是RNN?",
      "RNN架构",
      "其他RNN",
      "Elman 网络 &Jordan 网络",
      "双向循环神经网络",
      "长短期记忆网络",
      "LSTM举例",
      "LSTM运算示例",
      "LSTM原理",
      "RNN学习方式",
      "如何解决RNN梯度消失或者爆炸",
      "RNN其他应用",
      "多对一序列",
      "多对多序列",
      "序列到序列",
      "自注意力机制",
      "输入是向量序列的情况",
      "类型1:输入与输出数量相同",
      "类型2:输入是一个序列,输出是一个标签",
      "类型3:序列到序列",
      "自注意力的运作原理",
      "多头注意力",
      "位置编码",
      "截断自注意力",
      "自注意力与卷积神经网络对比",
      "自注意力与循环神经网络对比",
      "Transformer",
      "序列到序列模型",
      "语音识别、机器翻译与语音翻译",
      "语音合成",
      "聊天机器人",
      "问答任务",
      "句法分析",
      "多标签分类",
      "Transformer结构",
      "Transformer编码器",
      "Transformer解码器",
      "自回归解码器",
      "非自回归解码器",
      "编码器-解码器注意力",
      "Transformer的训练过程",
      "序列到序列模型训练常用技巧",
      "复制机制",
      "引导注意力",
      "束搜索",
      "加入噪声",
      "使用强化学习训练",
      "计划采样",
      "生成模型",
      "生成对抗网络",
      "生成器",
      "辨别器",
      "生成器与辨别器的训练过程",
      "GAN的应用案例",
      "GAN的理论介绍",
      "WGAN算法",
      "训练GAN的难点与技巧",
      "GAN的性能评估方法",
      "条件型生成",
      "Cycle GAN",
      "扩散模型",
      "自监督学习",
      "来自Transformers的双向编码器表示(BERT)",
      "BERT的使用方式",
      "BERT有用的原因",
      "BERT的变种",
      "生成式预训练(GPT)",
      "自编码器",
      "自编码器的概念",
      "为什么需要自编码器?",
      "去噪自编码器",
      "自编码器应用之特征解耦",
      "自编码器应用之离散隐表征",
      "自编码器的其他应用",
      "对抗攻击",
      "对抗攻击简介",
      "如何进行网络攻击",
      "快速梯度符号法",
      "白盒攻击与黑盒攻击",
      "其他模态数据被攻击案例",
      "现实世界中的攻击",
      "防御方式中的被动防御",
      "防御方式中的主动防御",
      "迁移学习",
      "领域偏移",
      "领域自适应",
      "领域泛化",
      "强化学习",
      "强化学习应用",
      "玩电子游戏",
      "下围棋",
      "强化学习框架",
      "第1步:未知函数",
      "第2步:定义损失",
      "第3步:优化",
      "评价动作的标准",
      "使用即时奖励作为评价标准",
      "使用累积奖励作为评价标准",
      "使用折扣累积奖励作为评价标准",
      "使用折扣累积奖励减去基线作为评价标准",
      "Actor-Critic",
      "优势 Actor-Critic",
      "元学习",
      "元学习的概念",
      "元学习的三个步骤",
      "元学习与机器学习",
      "元学习的实例算法",
      "元学习的应用",
      "终身学习",
      "灾难性遗忘",
      "终身学习评估方法",
      "终身学习的主要解法",
      "网络压缩",
      "网络剪枝",
      "知识蒸馏",
      "参数量化",
      "网络架构设计",
      "动态计算",
      "可解释性人工智能",
      "可解释性人工智能的重要性",
      "决策树模型的可解释性",
      "可解释性机器学习的目标",
      "可解释性机器学习中的局部解释",
      "可解释性机器学习中的全局解释",
      "扩展与小结",
      "ChatGPT",
      "ChatGPT简介和功能",
      "对于ChatGPT的误解",
      "ChatGPT背后的关键技术——预训练",
      "ChatGPT带来的研究问题",
      "术语"
    ],
    "error": null
  }
]

% pdf-ai-extractor -b xai -k 'xai-XXXXX' ~/Downloads/LeeDL_Tutorial.pdf 
[
  {
    "path": "/path/Downloads/LeeDL_Tutorial.pdf",
    "abstract": "本教程基于李宏毅教授的《机器学习》(2021年春)课程,旨在为深度学习初学者提供一个轻松入门的中文学习资源。教程内容全面,涵盖了深度学习的基本理论和实践方法,通过幽默风趣的讲解和大量动漫相关的例子,使深奥的理论变得易于理解。教程不仅选取了课程的精华内容,还对公式进行了详细推导,对难点进行了重点讲解,并补充了其他深度学习相关知识。",
    "keywords": [
      "深度学习",
      "机器学习",
      "李宏毅",
      "教程",
      "入门",
      "中文",
      "强化学习",
      "计算机视觉",
      "时间序列",
      "数据挖掘"
    ],
    "bookmarks": [
      "机器学习基础",
      "案例学习",
      "线性模型",
      "分段线性曲线",
      "模型变形",
      "机器学习框架",
      ...
      "术语"
    ],
    "error": null
  }
]


% pdf-ai-extractor -b xai -k 'xai-XXXXX' ~/Downloads/LeeDL_Tutorial.pdf
[
  {
    "path": "/path/Downloads/LeeDL_Tutorial.pdf",
    "abstract": "This tutorial, based on Professor Li Hongyi's 'Machine Learning' course from NTU, offers an accessible introduction to deep learning. It simplifies complex theories through engaging examples, making it suitable for beginners. The tutorial covers essential deep learning concepts, optimization techniques, and practical methodologies, supplemented with original content to enhance understanding. It also includes contributions from other courses and additional knowledge to enrich the learning experience.",
    "keywords": [
      "深度学习",
      "机器学习",
      "李宏毅",
      "教程",
      "优化",
      "强化学习",
      "计算机视觉",
      "时间序列",
      "数据挖掘",
      "智能传感系统"
    ],
    "bookmarks": [
      "机器学习基础",
      "案例学习",
      "线性模型",
      "分段线性曲线",
      "模型变形",
      "机器学习框架",
      "实践方法论",
      "模型偏差",
      "优化问题",
      "过拟合",
      ...
      "术语"
    ],
    "error": null
  }
]
```

### Python API

```python
from pdf_ai_extractor.handlers import create_handler

# Local analysis
handler = create_handler("local")
result = handler.result("input.pdf")
print(result)

# Using OpenAI
handler = create_handler(
    "openai",
    config={"api_key": "YOUR_API_KEY"}
)
result = handler.result("input.pdf")
print(result)

# Using xAI
handler = create_handler(
    "xai",
    config={"api_key": "YOUR_API_KEY"}
)
result = handler.result("input.pdf")
print(result)

# Using HuggingFace - facebook/bart-large-cnn
handler = create_handler(
    "huggingface",
    config={"model_name": "facebook/bart-large-cnn"}
)
result = handler.result("input.pdf")
print(result)

# Output format
# {
#     "abstract": "Document abstract extracted from the PDF",
#     "keywords": ["keyword1", "keyword2", ...],
#     "bookmarks": ["Chapter 1", "Section 1.1", ...]
# }
```

You can also set additional parameters:

```python
# Advanced OpenAI settings
handler = create_handler(
    "openai",
    config={
        "api_key": "YOUR_API_KEY",
        "model": "gpt-4",
        "max_tokens": 500
    }
)

# Advanced xAI settings
handler = create_handler(
    "xai",
    config={
        "api_key": "YOUR_API_KEY",
        "model": "grok-beta",
        "max_content_length": 4000
    }
)

# Advanced HuggingFace settings
handler = create_handler(
    "huggingface",
    config={
        "model_name": "facebook/bart-large-cnn",
        "device": "cuda",  # or "cpu", "mps"
        "max_content_length": 1024,
        "min_length": 50,
        "max_length": 150
    }
)
```
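The CLI examples above show that each entry in the result list carries a `path` and an `error` field. When batch-processing many PDFs it can be convenient to separate successes from failures before further processing. A minimal sketch; the `partition_results` helper is hypothetical and not part of the package:

```python
# Hypothetical helper (not part of pdf-ai-extractor): split a result list
# into successes and failures using the per-entry "error" field shown in
# the CLI examples above.

def partition_results(results):
    """Return (ok, failed) given entries shaped like the CLI output."""
    ok = [r for r in results if r.get("error") is None]
    failed = [r for r in results if r.get("error") is not None]
    return ok, failed

sample = [
    {"path": "a.pdf", "abstract": "...", "keywords": [], "bookmarks": [], "error": None},
    {"path": "b.pdf", "abstract": None, "keywords": [], "bookmarks": [], "error": "encrypted PDF"},
]
ok, failed = partition_results(sample)
print([r["path"] for r in ok])      # ['a.pdf']
print([r["path"] for r in failed])  # ['b.pdf']
```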

## Development

Setup for development:

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows use: venv\Scripts\activate

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest
```

## Configuration

The tool can be configured using environment variables or a config file:

```bash
# Environment variables
export OPENAI_API_KEY="your-key"
export XAI_API_KEY="your-key"
```

Or create a `~/.pdf-ai-extractor.yaml` file:

```yaml
openai:
  api_key: "your-key"
xai:
  api_key: "your-key"
huggingface:
  model_name: "default/model"
```
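How the tool merges these sources is not documented here; a common convention is that an explicit `--api-key` wins over an environment variable, which wins over the config file. A sketch of that precedence under that assumption; the `resolve_api_key` helper is illustrative only:

```python
import os

# Illustrative only: resolve an API key with CLI > environment > config-file
# precedence. The actual lookup order used by pdf-ai-extractor may differ.
ENV_VARS = {"openai": "OPENAI_API_KEY", "xai": "XAI_API_KEY"}

def resolve_api_key(backend, cli_key=None, file_config=None):
    if cli_key:
        return cli_key
    env_key = os.environ.get(ENV_VARS.get(backend, ""))
    if env_key:
        return env_key
    return (file_config or {}).get(backend, {}).get("api_key")

# Example: the environment variable beats the config-file value.
os.environ["XAI_API_KEY"] = "xai-from-env"
print(resolve_api_key("xai", file_config={"xai": {"api_key": "xai-from-file"}}))
# xai-from-env
```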

## Output Format

The tool outputs JSON in the following format:

```json
{
  "abstract": "Document abstract extracted from the PDF",
  "keywords": ["keyword1", "keyword2", "keyword3"],
  "bookmarks": [
    "Chapter 1",
    "Section 1.1"
  ]
}
```
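When the CLI is run with `--output result.json`, downstream scripts only need the standard library to consume the file. A minimal sketch, assuming the file holds a list of per-file entries as in the CLI examples above; the `summarize` helper is illustrative, not part of the package:

```python
import json

def summarize(path):
    """Read a result file written with --output and return one summary
    line per document. Assumes the list-of-entries shape shown above."""
    with open(path, encoding="utf-8") as fh:
        entries = json.load(fh)
    return [
        f"{e['path']}: {len(e.get('keywords') or [])} keywords, "
        f"{len(e.get('bookmarks') or [])} bookmarks"
        for e in entries
    ]
```

For example, `summarize("result.json")` yields one line per processed PDF.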

## License

MIT License

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

