<img src="https://www.bitbuffet.dev/_next/image?url=%2Fbitbuffet-logo-closed-transparent.png&w=64&q=75" alt="BitBuffet Logo" width="64" height="64">
# BitBuffet Python SDK
A Python SDK for the BitBuffet API. It extracts structured data from web content into Pydantic models, or returns the content as raw markdown, in under two seconds.
## 🚀 Features
- **Universal**: Works with any website or web content (url, image, video, audio, youtube, pdf, etc.)
- **Type-safe**: Built with Pydantic for complete type safety and validation
- **Fast**: Extract structured data in under 2 seconds
- **Flexible**: Support for custom prompts and reasoning levels
- **Dual Output**: Extract structured JSON data or raw markdown content
- **Easy to use**: Simple, intuitive API
- **Well-tested**: Comprehensive test suite with integration tests
## 📦 Installation
```bash
pip install bitbuffet
# or
poetry add bitbuffet
# or
uv add bitbuffet
```
## 🏃‍♂️ Quick Start
### JSON Extraction (Structured Data)
```python
from bitbuffet import BitBuffet
from pydantic import BaseModel, Field
from typing import List

# Define your data structure with Pydantic
class Article(BaseModel):
    title: str = Field(description="The main title of the article")
    author: str = Field(description="The author of the article")
    publish_date: str = Field(description="When the article was published")
    content: str = Field(description="The main content/body of the article")
    tags: List[str] = Field(description="Article tags or categories")
    summary: str = Field(description="A brief summary of the article")

# Initialize the client with your API key
client = BitBuffet(api_key="your-api-key-here")

# Extract structured data from any URL
try:
    result: Article = client.extract(
        url="https://example.com/article",
        schema_class=Article
    )

    print(f"Title: {result.title}")
    print(f"Author: {result.author}")
    print(f"Published: {result.publish_date}")
    print(f"Tags: {', '.join(result.tags)}")
except Exception as error:
    print(f"Extraction failed: {error}")
```
### Markdown Extraction (Raw Content)
```python
from bitbuffet import BitBuffet

client = BitBuffet(api_key="your-api-key-here")

# Extract raw markdown content
try:
    markdown: str = client.extract(
        url="https://example.com/article",
        format="markdown"
    )

    print("Raw markdown content:")
    print(markdown)
except Exception as error:
    print(f"Extraction failed: {error}")
```
## ⚙️ Output Methods
Choose between structured JSON extraction or raw markdown content:
### JSON Format (Default)
Extracts structured data according to your Pydantic model:
```python
from pydantic import BaseModel

# `client` is the BitBuffet instance created in the Quick Start
class Product(BaseModel):
    name: str
    price: float
    description: str

product = client.extract(
    url="https://example.com/product",
    schema_class=Product,
    format="json"  # Optional - this is the default
)
```
### Markdown Format
Returns the raw markdown content of the webpage:
```python
markdown = client.extract(
    url="https://example.com/article",
    format="markdown"
)
```
**Note:** When using `format="markdown"`, do not provide a `schema_class` parameter.
## ⚙️ Configuration Options
Customize the extraction process with various options:
```python
# JSON extraction with configuration
result = client.extract(
    url="https://example.com/complex-page",
    schema_class=Article,
    format="json",            # Optional - this is the default
    timeout=30,               # Timeout in seconds (default: 30)
    reasoning_effort="high",  # 'medium' | 'high' - Higher effort for complex pages
    prompt="Focus on extracting the main article content, ignoring ads and navigation",
    temperature=0.1,          # Lower for more consistent results (0.0 - 1.5)
    # OR use top_p instead of temperature
    # top_p=0.9
)

# Markdown extraction with configuration
markdown = client.extract(
    url="https://example.com/article",
    format="markdown",  # Required for markdown extraction
    timeout=30,
    reasoning_effort="medium",
    prompt="Focus on the main content, ignore navigation and ads"
)
```
### Parameter Validation
- **Temperature vs Top-p**: `temperature` and `top_p` are mutually exclusive; pass at most one of them
- **Format Validation**: The SDK raises `ValueError` for invalid format/schema combinations
- **Type Safety**: Format overloads let static type checkers such as mypy or pyright infer the return type from the `format` argument
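
The rules above can be sketched as a small standalone check. This is an illustration only: `check_extract_args` and its error messages are hypothetical and not part of the `bitbuffet` package, and the SDK's actual checks may differ in order and wording.

```python
from typing import Optional

VALID_FORMATS = ("json", "markdown")


def check_extract_args(
    format: str = "json",
    schema_class: Optional[type] = None,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
) -> None:
    """Raise ValueError for argument combinations the rules above forbid.

    Hypothetical helper for illustration; not part of the bitbuffet package.
    """
    if format not in VALID_FORMATS:
        raise ValueError(f"format must be one of {VALID_FORMATS}, got {format!r}")
    if format == "json" and schema_class is None:
        raise ValueError("format='json' requires a schema_class")
    if format == "markdown" and schema_class is not None:
        raise ValueError("format='markdown' cannot be combined with a schema_class")
    if temperature is not None and top_p is not None:
        raise ValueError("pass either temperature or top_p, not both")


# Valid combinations pass silently
check_extract_args(format="json", schema_class=dict, temperature=0.1)
check_extract_args(format="markdown", top_p=0.9)
```

Running a check like this before calling `client.extract` surfaces a bad combination without spending an API request.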
## 📚 Advanced Examples
### E-commerce Product Extraction
```python
from pydantic import BaseModel, Field, HttpUrl
from typing import List, Optional

class Product(BaseModel):
    name: str
    price: float
    currency: str
    description: str
    images: List[HttpUrl]
    in_stock: bool
    rating: Optional[float] = Field(None, ge=0, le=5)
    reviews: Optional[int] = None

product = client.extract(
    url="https://shop.example.com/product/123",
    schema_class=Product,
    reasoning_effort="high"
)

print(f"Product: {product.name}")
print(f"Price: {product.price} {product.currency}")
print(f"In Stock: {product.in_stock}")
```
### News Article with Nested Models
```python
from pydantic import BaseModel, HttpUrl
from typing import List, Optional

class Author(BaseModel):
    name: str
    bio: Optional[str] = None

class RelatedArticle(BaseModel):
    title: str
    url: HttpUrl

class NewsArticle(BaseModel):
    headline: str
    subheadline: Optional[str] = None
    author: Author
    published_at: str
    category: str
    content: str
    related_articles: Optional[List[RelatedArticle]] = None

article = client.extract(
    url="https://news.example.com/breaking-news",
    schema_class=NewsArticle
)

print(f"Headline: {article.headline}")
print(f"Author: {article.author.name}")
print(f"Category: {article.category}")
```
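
Pydantic validates nested models recursively, so a schema can be tried against a hand-written dictionary before any API call is made. The sketch below uses a trimmed-down version of the models above and a made-up sample payload (not real API output). How the SDK builds model instances internally is not documented here, so treat this purely as a way to test a schema offline.

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class Author(BaseModel):
    name: str
    bio: Optional[str] = None


class NewsArticle(BaseModel):
    headline: str
    author: Author
    tags: Optional[List[str]] = None


# Hand-written sample payload (made up for this sketch)
sample = {"headline": "Example headline", "author": {"name": "Jane Doe"}}

# The nested dict under "author" becomes an Author instance
article = NewsArticle.model_validate(sample)
print(article.author.name)

# A payload missing a required nested field is rejected
try:
    NewsArticle.model_validate({"headline": "No author name", "author": {}})
except ValidationError as error:
    print(f"Rejected: {len(error.errors())} validation error(s)")
```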
### Raw Content for Processing
```python
# Extract raw markdown for further processing
raw_content = client.extract(
    url="https://blog.example.com/post/123",
    format="markdown"
)

# Process the markdown content
word_count = len(raw_content.split())
has_code_blocks = "```" in raw_content

print(f"Content has {word_count} words")
print(f"Contains code blocks: {has_code_blocks}")
```
## 🔧 API Reference
### `BitBuffet` Class
#### Constructor
```python
BitBuffet(api_key: str)
```
#### Methods
The `extract` method has two overloaded signatures for type safety:
##### JSON Extraction (Default)
```python
extract(
    url: str,
    schema_class: Type[BaseModel],
    timeout: int = 30,
    reasoning_effort: Optional[Literal['medium', 'high']] = None,
    prompt: Optional[str] = None,
    top_p: Optional[Union[int, float]] = None,
    temperature: Optional[Union[int, float]] = None,
    format: Literal['json'] = 'json'  # Optional - defaults to 'json'
) -> BaseModel
```
##### Markdown Extraction
```python
extract(
    url: str,
    format: Literal['markdown'],  # Required for markdown extraction
    timeout: int = 30,
    reasoning_effort: Optional[Literal['medium', 'high']] = None,
    prompt: Optional[str] = None,
    top_p: Optional[Union[int, float]] = None,
    temperature: Optional[Union[int, float]] = None
) -> str
```
**Parameters:**

- `url`: The URL to extract data from
- `schema_class`: Pydantic model class defining the expected data structure (JSON format only)
- `format`: Extraction format (`'json'` or `'markdown'`)
  - For JSON: Optional, defaults to `'json'`
  - For Markdown: Required, must be `'markdown'`
- `timeout`: Request timeout in seconds (default: 30)
- `reasoning_effort`: `'medium'` | `'high'` (optional; effective default: `'medium'`)
- `prompt`: Custom extraction prompt (optional)
- `temperature`: Sampling temperature, 0.0-1.5 (optional, cannot be used with `top_p`)
- `top_p`: Alternative to temperature (optional, cannot be used with `temperature`)

**Returns:**

- JSON format: Instance of the provided Pydantic model with extracted data
- Markdown format: Raw markdown content as a string

**Raises:**

- `ValueError`: When the format/schema combination is invalid or both `temperature` and `top_p` are provided
- `requests.RequestException`: When the API request fails
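
The two exception types listed above call for different handling: a `ValueError` signals a mistake in the arguments and will fail the same way every time, while a request failure may be transient. A minimal sketch of a retry wrapper built on that distinction follows. `call_with_retry` is a hypothetical helper, not part of the `bitbuffet` package, and the retry count and backoff are arbitrary choices.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    delay_seconds: float = 1.0,
) -> T:
    """Run `call`, retrying only on the exception types in `retry_on`.

    Hypothetical helper for illustration; not part of the bitbuffet package.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last failure
            time.sleep(delay_seconds * attempt)  # linear backoff between tries
    raise AssertionError("unreachable")
```

With the SDK this could be used as `call_with_retry(lambda: client.extract(url=..., schema_class=Article), retry_on=(requests.RequestException,))`. Any exception type not listed in `retry_on`, including `ValueError`, propagates on the first attempt.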
**Format Overload Rules:**

1. **JSON Format Requirements:**
   - A Pydantic model class MUST be provided via the `schema_class` parameter
   - Returns an instance of your Pydantic model with validated data
   - `format="json"` is optional (default behavior)
2. **Markdown Format Requirements:**
   - NO `schema_class` should be provided
   - `format="markdown"` MUST be specified
   - Returns raw markdown content as a string
   - Schema class and markdown format cannot be used together
3. **Type Safety:**
   - The SDK uses format overloads to enforce these rules at the type level
   - This ensures type safety and prevents invalid parameter combinations
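
To show how overloads can express these rules, here is a minimal sketch using `typing.overload`. `SketchClient` is a stand-in written for this example and is not the SDK's source. It also uses a `TypeVar` so that the return type is the specific class passed in, whereas the documented signature returns `BaseModel`.

```python
from typing import Literal, Optional, Type, TypeVar, Union, overload

T = TypeVar("T")


class SketchClient:
    """Stand-in class showing the overload pattern; not the SDK's source."""

    @overload
    def extract(
        self, url: str, schema_class: Type[T], format: Literal["json"] = "json"
    ) -> T: ...

    @overload
    def extract(self, url: str, *, format: Literal["markdown"]) -> str: ...

    def extract(
        self,
        url: str,
        schema_class: Optional[Type[T]] = None,
        format: str = "json",
    ) -> Union[T, str]:
        # Runtime checks backing up what the overloads express statically
        if format == "markdown":
            if schema_class is not None:
                raise ValueError("markdown format takes no schema_class")
            return f"# Placeholder markdown for {url}"
        if schema_class is None:
            raise ValueError("json format requires a schema_class")
        # Placeholder: a real client would populate the instance with extracted data
        return schema_class()
```

A static type checker reading these overloads infers `str` for a call with `format="markdown"` and the schema type otherwise, and it flags calls that match neither signature. The runtime `ValueError` checks cover callers that do not run a type checker.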
## 🛠️ Development
```bash
# Install dependencies with uv (recommended)
uv sync
# Or with pip
pip install -e ".[dev]"
# Run tests
pytest
# Run tests with coverage
pytest --cov=bitbuffet
# Run integration tests
pytest -m integration
# Build the package
python -m build
```
## 📋 Requirements
- Python >= 3.9
- pydantic >= 2.11.7
- requests >= 2.32.5
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🔗 Links
- [Complete API Documentation](https://bitbuffet.dev/docs/overview) - Full API reference and guides
- [GitHub Repository](https://github.com/ystefanov6/bitbuffet-clients)
- [PyPI Package](https://pypi.org/project/bitbuffet/)
- [Report Issues](https://github.com/ystefanov6/bitbuffet-clients/issues)
## 💡 Need Help?
For detailed documentation, examples, and API reference, visit our [complete documentation](https://bitbuffet.dev/docs/overview).
If you encounter any issues or have questions, please [open an issue](https://github.com/ystefanov6/bitbuffet-clients/issues) on GitHub.
Raw data
{
"_id": null,
"home_page": null,
"name": "bitbuffet",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.9",
"maintainer_email": null,
"keywords": "ai, api, client, scraper, structured-data, web-scraping",
"author": null,
"author_email": "Yuliyan Stefanov <ystefanov.dev@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/13/b6/519dc56ec6c9ed34d95fe5573611a65490c7bdd8e4b35cfa14c4f63fd17b/bitbuffet-1.0.2.tar.gz",
"platform": null,
"description": "BitBuffet Python SDK README (identical to the text above)",
"bugtrack_url": null,
"license": "MIT",
"summary": "Python SDK for the bitbuffet API - BitBuffet",
"version": "1.0.2",
"project_urls": {
"Bug Tracker": "https://github.com/ystefanov6/bitbuffet-clients/issues",
"Documentation": "https://github.com/ystefanov6/bitbuffet-clients#readme",
"Homepage": "https://github.com/ystefanov6/bitbuffet-clients",
"Repository": "https://github.com/ystefanov6/bitbuffet-clients"
},
"split_keywords": [
"ai",
" api",
" client",
" scraper",
" structured-data",
" web-scraping"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "8e2a28f46852d83ca87608437ee4dc1ac3b9925f41e07533701d42aab82f8820",
"md5": "b4d53dafa354fbe0c87eda7719638acc",
"sha256": "9ffae03293ad5ee035400a4b72742d6afacf47f3597da94ae3a3c00f34973f2a"
},
"downloads": -1,
"filename": "bitbuffet-1.0.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "b4d53dafa354fbe0c87eda7719638acc",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.9",
"size": 8118,
"upload_time": "2025-09-11T16:42:03",
"upload_time_iso_8601": "2025-09-11T16:42:03.411242Z",
"url": "https://files.pythonhosted.org/packages/8e/2a/28f46852d83ca87608437ee4dc1ac3b9925f41e07533701d42aab82f8820/bitbuffet-1.0.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "13b6519dc56ec6c9ed34d95fe5573611a65490c7bdd8e4b35cfa14c4f63fd17b",
"md5": "6430881225bef21db0e455c76e13e776",
"sha256": "6877d4fbe7ec2069b08a7fd327519265a718d1f2ca7c89415cb18aabf69f5b25"
},
"downloads": -1,
"filename": "bitbuffet-1.0.2.tar.gz",
"has_sig": false,
"md5_digest": "6430881225bef21db0e455c76e13e776",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.9",
"size": 52850,
"upload_time": "2025-09-11T16:42:04",
"upload_time_iso_8601": "2025-09-11T16:42:04.744898Z",
"url": "https://files.pythonhosted.org/packages/13/b6/519dc56ec6c9ed34d95fe5573611a65490c7bdd8e4b35cfa14c4f63fd17b/bitbuffet-1.0.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-09-11 16:42:04",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "ystefanov6",
"github_project": "bitbuffet-clients",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "bitbuffet"
}