olyptik 0.1.2

- Summary: Official Python SDK for Olyptik API
- Author email: Olyptik <support@olyptik.io>
- Uploaded: 2025-09-06 14:56:18
- Requires Python: >=3.8
- License: MIT
- Keywords: sdk, crawler, olyptik, api
- Homepage: https://www.olyptik.io
- Repository: https://github.com/olyptik/olyptik
- Issues: https://github.com/olyptik/olyptik/issues
# Olyptik Python SDK
The Olyptik Python SDK provides a simple and intuitive interface for web crawling and content extraction. It supports both synchronous and asynchronous programming patterns with full type hints.

## Installation

Install the SDK using pip:

```bash
pip install olyptik
```

## Configuration

First, initialize the SDK with your API key, which you can get from the [settings page](https://app.olyptik.io/settings/crawl). You can either pass it directly or load it from an environment variable.

```python
from olyptik import Olyptik

# Initialize with API key
client = Olyptik(api_key="your_api_key_here")
```
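
The variable name the SDK itself reads (if any) isn't documented here, so a minimal sketch loads the key explicitly via `os.environ`, using an assumed `OLYPTIK_API_KEY`:

```python
import os

from olyptik import Olyptik

# Assumed variable name; set it in your shell first:
#   export OLYPTIK_API_KEY="your_api_key_here"
client = Olyptik(api_key=os.environ["OLYPTIK_API_KEY"])
```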

## Synchronous Usage

### Start a crawl

<CodeGroup>

Minimal settings crawl:
```python
crawl = client.run_crawl({
    "startUrl": "https://example.com",
    "maxResults": 50
})

print(f"Crawl started with ID: {crawl.id}")
print(f"Status: {crawl.status}")
```

Full example:
```python
# Start a crawl
crawl = client.run_crawl({
    "startUrl": "https://example.com",
    "maxResults": 50,
    "maxDepth": 2,
    "engineType": "auto",
    "includeLinks": True,
    "timeout": 60,
    "useSitemap": False,
    "entireWebsite": False,
    "excludeNonMainTags": True,
    "useStaticIps": False
})

print(f"Crawl started with ID: {crawl.id}")
print(f"Status: {crawl.status}")
```
</CodeGroup>

### Query crawls

```python
from olyptik import CrawlStatus

result = client.query_crawls({
    "startUrls": ["https://example.com"],
    "status": [CrawlStatus.SUCCEEDED],
    "page": 0,
})

print("Crawls: ", result.results)
print("Page: ", result.page)
print("Total pages: ", result.totalPages)
print("Count of items per page: ", result.limit)
print("Total matched crawls: ", result.totalResults)
```
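
Because the response reports `totalPages`, walking every page of matching crawls is straightforward; a short sketch building on the query above:

```python
from olyptik import CrawlStatus

all_crawls = []
page = 0
while True:
    result = client.query_crawls({
        "startUrls": ["https://example.com"],
        "status": [CrawlStatus.SUCCEEDED],
        "page": page,
    })
    all_crawls.extend(result.results)
    page += 1
    if page >= result.totalPages:
        break

print(f"Fetched {len(all_crawls)} crawls across {page} page(s)")
```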

### Getting Crawl Results
Retrieve the results of your crawl using the crawl ID.
The results are paginated, and you can specify the page number and limit per page.

```python
limit = 50
page = 0
results = client.get_crawl_results(crawl.id, page, limit)
for result in results.results:
    print(f"URL: {result.url}")
    print(f"Title: {result.title}")
    print(f"Depth: {result.depthOfUrl}")
```
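
To collect every result rather than a single page, keep fetching until a page comes back short; this sketch assumes a page with fewer than `limit` items marks the end:

```python
limit = 50
page = 0
all_results = []
while True:
    batch = client.get_crawl_results(crawl.id, page, limit)
    all_results.extend(batch.results)
    # Assumption: a page shorter than `limit` is the last one.
    if len(batch.results) < limit:
        break
    page += 1

print(f"Collected {len(all_results)} results")
```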

### Abort a crawl

```python
aborted_crawl = client.abort_crawl(crawl.id)
print(f"Crawl aborted with ID: {aborted_crawl.id}")
```

## Asynchronous Usage

For better performance with I/O operations, use the async client:

### Start a crawl

<CodeGroup>

Minimal settings crawl:
```python
import asyncio
from olyptik import AsyncOlyptik

async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        crawl = await client.run_crawl({
            "startUrl": "https://example.com",
            "maxResults": 50
        })

        print(f"Crawl started with ID: {crawl.id}")
        print(f"Status: {crawl.status}")

asyncio.run(main())
```

Full example:
```python
import asyncio
from olyptik import AsyncOlyptik

async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        # Start a crawl
        crawl = await client.run_crawl({
            "startUrl": "https://example.com",
            "maxResults": 50,
            "maxDepth": 2,
            "engineType": "auto",
            "includeLinks": True,
            "timeout": 60,
            "useSitemap": False,
            "entireWebsite": False,
            "excludeNonMainTags": True,
            "useStaticIps": False
        })

        print(f"Crawl started with ID: {crawl.id}")
        print(f"Status: {crawl.status}")

asyncio.run(main())
```

</CodeGroup>

### Query crawls

```python
import asyncio
from olyptik import AsyncOlyptik, CrawlStatus

async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        result = await client.query_crawls({
            "startUrls": ["https://example.com"],
            "status": [CrawlStatus.SUCCEEDED],
            "page": 0,
        })
        
        print("Crawls: ", result.results)
        print("Page: ", result.page)
        print("Total pages: ", result.totalPages)
        print("Count of items per page: ", result.limit)
        print("Total matched crawls: ", result.totalResults)

asyncio.run(main())
```

### Get crawl results

```python
import asyncio
from olyptik import AsyncOlyptik

async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        # First start a crawl
        crawl = await client.run_crawl({
            "startUrl": "https://example.com",
            "maxResults": 50
        })
        
        # Get crawl results
        limit = 50
        page = 0
        results = await client.get_crawl_results(crawl.id, page, limit)
        for result in results.results:
            print(f"URL: {result.url}")
            print(f"Title: {result.title}")
            print(f"Depth: {result.depthOfUrl}")

asyncio.run(main())
```

### Abort a crawl

```python
import asyncio
from olyptik import AsyncOlyptik

async def main():
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        # First start a crawl
        crawl = await client.run_crawl({
            "startUrl": "https://example.com",
            "maxResults": 50
        })
        
        # Abort the crawl
        aborted_crawl = await client.abort_crawl(crawl.id)
        print(f"Crawl aborted with ID: {aborted_crawl.id}")

asyncio.run(main())
```
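
Where the async client pays off is running several crawls at once; a sketch using `asyncio.gather` to start crawls for multiple sites concurrently:

```python
import asyncio

from olyptik import AsyncOlyptik

async def main():
    urls = ["https://example.com", "https://example.org"]
    async with AsyncOlyptik(api_key="your_api_key_here") as client:
        # Fire off one run_crawl call per URL and await them together
        crawls = await asyncio.gather(*[
            client.run_crawl({"startUrl": url, "maxResults": 50})
            for url in urls
        ])
        for crawl in crawls:
            print(f"Started {crawl.id}: {crawl.status}")

asyncio.run(main())
```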

## Configuration Options

### StartCrawlPayload

The following crawl configuration options are available. You must provide at least one of `maxResults`, `useSitemap`, or `entireWebsite`.

| Property | Type | Required | Default | Description |
|--------|------|----------|---------|-------------|
| startUrl | string | ✅ | - | The URL to start crawling from |
| maxResults | number | ❌ | - | Maximum number of results to collect (1-5,000) |
| useSitemap | boolean | ❌ | false | Whether to use sitemap.xml to crawl the website |
| entireWebsite | boolean | ❌ | false | Whether to use sitemap.xml and all found links to crawl the website |
| maxDepth | number | ❌ | 10 | Maximum depth of pages to crawl (1-100) |
| includeLinks | boolean | ❌ | true | Whether to include links in the crawl results' markdown |
| excludeNonMainTags | boolean | ❌ | true | Whether to exclude non-main HTML tags (header, footer, aside, etc.) from the crawl results |
| timeout | number | ❌ | 60 | Timeout duration in minutes |
| engineType | string | ❌ | "auto" | The engine to use: "auto", "cheerio" (fast, static sites), "playwright" (dynamic sites) |
| useStaticIps | boolean | ❌ | false | Whether to use static IPs for the crawl |
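
For example, to satisfy that requirement with a sitemap-driven crawl instead of a result cap:

```python
# Crawl every URL listed in the site's sitemap.xml
crawl = client.run_crawl({
    "startUrl": "https://example.com",
    "useSitemap": True,
})
```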

### Engine Types

Choose the appropriate engine for your crawling needs:

```python
from olyptik import EngineType

# Available engine types
EngineType.AUTO        # Automatically choose the best engine
EngineType.PLAYWRIGHT  # Use Playwright for JavaScript-heavy sites
EngineType.CHEERIO     # Use Cheerio for faster, static content crawling
```
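
These correspond to the string values accepted in `StartCrawlPayload`. A sketch forcing Playwright for a JavaScript-heavy site, assuming the enum members carry the lowercase strings from the table above (passing the plain string "playwright" works per the payload docs):

```python
from olyptik import EngineType, Olyptik

client = Olyptik(api_key="your_api_key_here")

# Assumption: EngineType.PLAYWRIGHT.value is the lowercase string
# "playwright" from the configuration table.
crawl = client.run_crawl({
    "startUrl": "https://example.com",
    "maxResults": 50,
    "engineType": EngineType.PLAYWRIGHT.value,
})
print(f"Engine: {crawl.engineType}")
```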

### Crawl Status

Monitor your crawl status using the `CrawlStatus` enum:

```python
from olyptik import CrawlStatus

# Possible status values
CrawlStatus.RUNNING    # Crawl is currently running
CrawlStatus.SUCCEEDED  # Crawl completed successfully
CrawlStatus.FAILED     # Crawl failed due to an error
CrawlStatus.TIMED_OUT  # Crawl exceeded timeout limit
CrawlStatus.ABORTED    # Crawl was manually aborted
CrawlStatus.ERROR      # Crawl encountered an error
```
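
A common pattern is to block until a crawl finishes. No get-crawl-by-id call is documented here, so this sketch re-queries by start URL and looks for the crawl's ID among terminal-status results (pagination beyond page 0 is ignored for brevity):

```python
import time

from olyptik import CrawlStatus

TERMINAL_STATUSES = [
    CrawlStatus.SUCCEEDED,
    CrawlStatus.FAILED,
    CrawlStatus.TIMED_OUT,
    CrawlStatus.ABORTED,
    CrawlStatus.ERROR,
]

def wait_for_crawl(client, crawl, poll_seconds=10):
    """Poll query_crawls until `crawl` appears with a terminal status."""
    while True:
        result = client.query_crawls({
            "startUrls": crawl.startUrls,
            "status": TERMINAL_STATUSES,
            "page": 0,  # sketch checks only the first page of matches
        })
        done = next((c for c in result.results if c.id == crawl.id), None)
        if done is not None:
            return done
        time.sleep(poll_seconds)
```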

## Error Handling

The SDK raises exceptions for various failure scenarios. Always wrap your calls in try/except blocks:

```python
from olyptik import Olyptik, ApiError

client = Olyptik(api_key="your_api_key_here")

try:
    crawl = client.run_crawl({
        "startUrl": "https://example.com",
        "maxResults": 10
    })
except ApiError as e:
    # API returned an error response
    print(f"API Error: {e.message}")
    print(f"Status Code: {e.status_code}")
```
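
For transient failures you may want a retry wrapper around the same call; treating only 5xx responses as retryable is an assumption here, not documented SDK behavior:

```python
import time

from olyptik import ApiError, Olyptik

client = Olyptik(api_key="your_api_key_here")

def run_crawl_with_retry(payload, attempts=3, backoff_seconds=2):
    """Retry run_crawl on (assumed) transient 5xx API errors."""
    for attempt in range(1, attempts + 1):
        try:
            return client.run_crawl(payload)
        except ApiError as e:
            # Assumption: only 5xx responses are worth retrying.
            if attempt == attempts or e.status_code < 500:
                raise
            time.sleep(backoff_seconds * attempt)

crawl = run_crawl_with_retry({
    "startUrl": "https://example.com",
    "maxResults": 10,
})
```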

## Data Models

### CrawlResult

Each crawl result contains:

```python
@dataclass
class CrawlResult:
    crawlId: str          # Unique identifier for the crawl
    brandId: str          # Brand identifier
    url: str              # The crawled URL
    title: str            # Page title
    markdown: str         # Extracted content in markdown format
    depthOfUrl: int       # How deep this URL was in the crawl
    createdAt: str        # When the result was created
```
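
Because each result carries the extracted markdown, persisting a crawl to disk takes a few lines, reusing `client` and `crawl` from the earlier snippets; the output directory and filename scheme below are hypothetical:

```python
from pathlib import Path

out_dir = Path("crawl_output")  # hypothetical output location
out_dir.mkdir(exist_ok=True)

results = client.get_crawl_results(crawl.id, 0, 50)
for i, result in enumerate(results.results):
    # Hypothetical naming scheme: running index plus crawl depth.
    path = out_dir / f"{i:04d}_depth{result.depthOfUrl}.md"
    path.write_text(f"# {result.title}\n\n{result.markdown}", encoding="utf-8")
```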

### Crawl

Crawl metadata includes:

```python
@dataclass
class Crawl:
    id: str                    # Unique crawl identifier
    status: CrawlStatus        # Current status
    startUrls: List[str]       # Starting URLs
    includeLinks: bool         # Whether links are included
    maxDepth: int              # Maximum crawl depth
    maxResults: int            # Maximum number of results
    brandId: str               # Brand identifier
    createdAt: str             # Creation timestamp
    completedAt: Optional[str] # Completion timestamp
    durationInSeconds: int     # Total duration
    numberOfResults: int       # Number of results found
    useSitemap: bool           # Whether sitemap was used
    entireWebsite: bool        # Whether both sitemap and all found links were used
    excludeNonMainTags: bool   # Whether non-main HTML tags were excluded
    timeout: int               # Timeout setting in minutes
    useStaticIps: bool         # Whether static IPs were used
    engineType: EngineType    # Engine type used
```

            
