cdvl-crawler

- Name: cdvl-crawler
- Version: 0.4.0
- Summary: Crawl and download videos from the CDVL (Consumer Digital Video Library) research repository
- Upload time: 2025-10-13 19:33:45
- Author: Werner Robitza
- Requires Python: >=3.9
- Keywords: cdvl, video, crawler, downloader, research

# CDVL Crawler

[![PyPI version](https://img.shields.io/pypi/v/cdvl-crawler.svg)](https://pypi.org/project/cdvl-crawler)

Python tools for crawling and downloading videos from the [CDVL](https://cdvl.org) (Consumer Digital Video Library) research video repository.

**Contents:**

- [Disclaimer](#disclaimer)
- [What This Does](#what-this-does)
- [Requirements](#requirements)
- [Installation](#installation)
- [Usage](#usage)
  - [Crawling Metadata](#crawling-metadata)
  - [Downloading Videos](#downloading-videos)
  - [Generating Static Site](#generating-static-site)
- [Output Format](#output-format)
  - [Video Records](#video-records)
  - [Dataset Records](#dataset-records)
- [Configuration File (`config.json`)](#configuration-file-configjson)
  - [Configuration Options](#configuration-options)
- [API](#api)
- [License](#license)

## Disclaimer

> [!CAUTION]
>
> This software is provided as-is as a general-purpose tool for interacting with the CDVL web content. The author provides this tool solely as software code and has no control over how it is used, no relationship with CDVL, and no obligation to enforce any third-party terms of service or licenses. You, the user, are solely and independently responsible for:
>
> - Obtaining proper authorization to access CDVL
> - Complying with all applicable terms of service, license agreements, and usage policies
> - Understanding and accepting the CDVL Database Content User License Agreement
> - Your own actions and use of any content accessed through this tool
> - Any legal consequences arising from your use of this tool
>
> While this tool displays the CDVL license agreement for user convenience, this does not constitute legal advice, license enforcement, or any guarantee of compliance. Displaying the license is informational only and does not create any legal relationship between the author and CDVL or the user.
> Under no circumstances shall the author be held liable for any direct, indirect, incidental, special, consequential, or exemplary damages arising from your use or misuse of this tool. This includes, but is not limited to, any violation of terms of service, breach of license agreements, unauthorized access, or any other legal issues.
>
> Do not download more content than you would reasonably access manually through a web browser. Only use your own account credentials – never use credentials that do not belong to you. Respect the intellectual property rights and usage restrictions of content providers.
> Specifically, have a look at the [License page](https://www.cdvl.org/license/) before use.
>
> By using this tool, you acknowledge that you have read, understood, and agree to this disclaimer. If you do not agree, do not use this software.

## What This Does

This package provides a unified command-line tool for working with CDVL:

**`cdvl-crawler`** with three subcommands:

- **`crawl`** - Crawls and extracts metadata from all videos and datasets on CDVL
- **`download`** - Downloads individual videos by their ID
- **`generate-site`** - Generates a searchable, interactive HTML site from crawled metadata

Features:

- **Automatic Login**: Handles authentication automatically with username/password
- **Parallel Crawling**: Concurrent requests for efficient data collection
- **Smart Enumeration**: Automatically discovers videos and datasets by ID
- **Auto-Resume**: Continues from the last crawled ID if interrupted
- **Structured Output**: JSONL format for easy data processing
- **Progress Tracking**: Real-time progress bars showing success/empty/failed counts
- **Download Management**: Handles large file downloads with progress indicators
- **Static Site Generator**: Creates a beautiful, searchable HTML interface for browsing videos

## Requirements

- Python 3.9 or higher
- An active CDVL account with username and password

## Installation

Run [with `uvx`](https://docs.astral.sh/uv/getting-started/installation/), which needs no separate install step:

```bash
uvx cdvl-crawler --help
uvx cdvl-crawler crawl --help
uvx cdvl-crawler download --help
uvx cdvl-crawler generate-site --help
```

Or install with `pipx`:

```bash
pipx install cdvl-crawler
```

Or, with pip:

```bash
pip3 install --user cdvl-crawler
```

The examples below assume `uvx`. If you installed with `pipx` or `pip`, run `cdvl-crawler` directly, without the `uvx` prefix.

## Usage

Before using the tool, you need to provide your CDVL credentials. The tool supports three methods for providing credentials (in order of priority):

1. **Config file**: Create a `config.json` file containing `username` and `password` in your working directory (detected automatically), or point to a different file with `--config`
2. **Environment variables**: Set `CDVL_USERNAME` and `CDVL_PASSWORD`
3. **Interactive prompt**: The tool will ask for credentials if not found via other methods

**Note**: If a `config.json` file exists in your current directory, it will be automatically loaded. You don't need to specify `--config` unless you want to use a different file.

Choose the method that best suits your workflow. For example:

```bash
# Using environment variables (no config file needed)
export CDVL_USERNAME="your.email@example.com"
export CDVL_PASSWORD="your_password"
uvx cdvl-crawler crawl

# Using config.json (automatically detected if in current directory)
# Just create config.json:
# {
#   "username": "your.email@example.com",
#   "password": "your_password_here"
# }

# and run:
uvx cdvl-crawler crawl
```
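
The priority order described above can be sketched in Python. This is only an illustration of the lookup order, not the tool's actual code: the function name `resolve_credentials` and its parameters are hypothetical, and filling gaps field by field from the environment is an assumption.

```python
import getpass
import json
import os
from pathlib import Path
from typing import Callable, Mapping, Optional, Tuple


def resolve_credentials(
    config_path: Optional[str] = "config.json",
    env: Mapping[str, str] = os.environ,
    prompt: Callable[[str], str] = input,
    prompt_secret: Callable[[str], str] = getpass.getpass,
) -> Tuple[str, str]:
    """Return (username, password): config file first, then env vars, then a prompt."""
    username = password = None

    # 1. Config file, if one exists at the given path.
    if config_path and Path(config_path).is_file():
        with open(config_path, encoding="utf-8") as fh:
            config = json.load(fh)
        username = config.get("username")
        password = config.get("password")

    # 2. Environment variables fill in whatever the config file did not provide
    #    (assumption: the real tool may treat the two sources differently).
    username = username or env.get("CDVL_USERNAME")
    password = password or env.get("CDVL_PASSWORD")

    # 3. Interactive prompt as the last resort.
    if not username:
        username = prompt("CDVL username: ")
    if not password:
        password = prompt_secret("CDVL password: ")
    return username, password
```

Passing `env` and the prompt functions as parameters keeps the sketch easy to test without touching the real environment or terminal.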

### Crawling Metadata

To crawl all videos and datasets:

```bash
# Basic usage (outputs to current directory)
uvx cdvl-crawler crawl

# Save to specific directory
uvx cdvl-crawler crawl --output-dir ./data

# Accept license automatically (useful for automation/scripts)
uvx cdvl-crawler crawl --accept-license

# Crawl with custom concurrency and delays
uvx cdvl-crawler crawl --max-concurrent 10 --delay 0.2

# Crawl up to specific ID limits
uvx cdvl-crawler crawl --max-video-id 3000 --max-dataset-id 500

# Adjust failure threshold (stop after N consecutive failures)
uvx cdvl-crawler crawl --max-failures 2000

# Advanced: customize ID gap probing
uvx cdvl-crawler crawl --probe-step 50 --max-probe-attempts 40
```

For more options, run:

```bash
uvx cdvl-crawler crawl --help
```
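
To give an intuition for how `--max-concurrent` and `--delay` interact, here is a minimal sketch of batched concurrent fetching with `asyncio`. It assumes requests are issued in batches with a pause between batches, as the option descriptions suggest; `fetch_in_batches` and `fake_fetch` are hypothetical names and do not come from the package.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")


async def fetch_in_batches(
    ids: Iterable[int],
    fetch_one: Callable[[int], Awaitable[T]],
    max_concurrent: int = 5,
    delay: float = 0.1,
) -> List[T]:
    """Fetch IDs in batches of `max_concurrent`, sleeping `delay` seconds between batches."""
    id_list = list(ids)
    results: List[T] = []
    for start in range(0, len(id_list), max_concurrent):
        batch = id_list[start:start + max_concurrent]
        # All requests of one batch run concurrently; gather keeps the input order.
        results.extend(await asyncio.gather(*(fetch_one(i) for i in batch)))
        if start + max_concurrent < len(id_list):
            await asyncio.sleep(delay)
    return results


async def fake_fetch(video_id: int) -> dict:
    # Stand-in for a real HTTP request; hypothetical, for illustration only.
    await asyncio.sleep(0)
    return {"id": video_id}


if __name__ == "__main__":
    records = asyncio.run(fetch_in_batches(range(1, 8), fake_fetch, max_concurrent=3, delay=0.2))
    print(len(records), "records fetched")
```

With these semantics, a higher `max_concurrent` or a lower `delay` raises the request rate, so conservative values are the safer choice.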

The crawler will automatically:

1. Log in with your credentials (from config, env vars, or prompt)
2. Crawl videos and datasets in parallel
3. Save metadata to `videos.jsonl` and `datasets.jsonl` in the output directory
4. Resume from the last ID if run again

Example output:

```
2025-10-09 15:30:03 - INFO - ✓ Login successful!
2025-10-09 15:30:03 - INFO - Starting crawlers in parallel...
Videos:   12543 | Success: 8432 | Empty: 3891 | Failed: 220
Datasets:   142 | Success:   98 | Empty:   34 | Failed:  10
```

To start fresh, delete the output files before running:

```bash
rm ./data/videos.jsonl ./data/datasets.jsonl
uvx cdvl-crawler crawl --output-dir ./data
```

To resume, run the same command again; the crawler continues from where it left off.
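
One plausible way to derive a resume point is to scan the existing JSONL file for the highest recorded `id`. The sketch below assumes exactly that; how the crawler actually tracks its position is not documented here, and `next_start_id` is a hypothetical helper.

```python
import json
from pathlib import Path


def next_start_id(jsonl_path: str, default_start: int = 1) -> int:
    """Return the ID after the highest one recorded in a JSON Lines file."""
    path = Path(jsonl_path)
    if not path.is_file():
        return default_start
    highest = None
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # An interrupted write can leave a truncated last line; skip it.
                continue
            record_id = record.get("id") if isinstance(record, dict) else None
            if isinstance(record_id, int) and (highest is None or record_id > highest):
                highest = record_id
    return default_start if highest is None else highest + 1
```

For example, `next_start_id("./data/videos.jsonl")` returns `1` when the file does not exist yet, which matches the "start fresh" behavior after deleting the output files.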

### Downloading Videos

Download videos by their ID:

```bash
# Download a single video (to current directory)
uvx cdvl-crawler download 42

# Download to specific directory
uvx cdvl-crawler download 42 --output-dir ./downloads

# Accept license automatically (useful for automation/scripts)
uvx cdvl-crawler download 42 --accept-license

# Download multiple videos (comma-separated)
uvx cdvl-crawler download 1,5,10,20 --output-dir ./videos

# Get download URL without downloading
uvx cdvl-crawler download 42 --dry-run

# Download to specific filename (single video only)
uvx cdvl-crawler download 42 --output my_video.avi
```

For more options:

```bash
uvx cdvl-crawler download --help
```
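
If you build the comma-separated ID argument from a script, a small parser like the following can validate it first. This is a sketch under the assumption that IDs are plain positive integers; `parse_video_ids` is a hypothetical helper, not part of the package.

```python
from typing import List


def parse_video_ids(arg: str) -> List[int]:
    """Parse "1,5,10,20" into [1, 5, 10, 20], dropping duplicates but keeping order."""
    ids: List[int] = []
    for part in arg.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,5,"
        if not part.isdigit():
            raise ValueError(f"not a video ID: {part!r}")
        value = int(part)
        if value not in ids:
            ids.append(value)
    return ids
```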

### Generating Static Site

After crawling metadata, you can generate a beautiful, interactive HTML site to browse the video library:

```bash
# Generate site from videos.jsonl to index.html
uvx cdvl-crawler generate-site

# Specify custom input and output files
uvx cdvl-crawler generate-site -i ./data/videos.jsonl -o ./website/index.html
```

For more options:

```bash
uvx cdvl-crawler generate-site --help
```

## Output Format

Output files use JSON Lines format (one JSON object per line).

### Video Records

```json
{
  "id": 5,
  "url": "https://www.cdvl.org/members-section/view-file/?videoid=5",
  "paragraphs": [
    "Description:Pan over a children's ball pit, seen from above using a crane.",
    "Dataset:NTIA Source Scenes",
    "Audio Specifications:16-bit stereo PCM. Talk in English with balls rustling.",
    "Video Specification:The camera was a professional HDTV camera..."
  ],
  "extracted_at": "2025-10-09T15:30:00+00:00",
  "content_type": "video",
  "title": "NTIA children's ball pit from above, part 1, 525-line",
  "links": [{"text": "NTIA T1.801.01", "href": "/members-section/search?dataset=3"}],
  "media": [{"type": "img", "src": "/uploads/thumbnails/thumb_1.jpg"}],
  "filename": "ntia_bpit1-525_original.avi",
  "file_size": "242.50 MB"
}
```

Required fields:

- `id`: Video ID number
- `url`: Source URL on CDVL
- `paragraphs`: Structured list of text content from the page
- `extracted_at`: Timestamp when data was extracted (ISO 8601 format)
- `content_type`: Always "video" for videos

Optional fields:

- `title`: Video title
- `links`: Related links found on the page
- `media`: Images and media elements
- `filename`: Download filename (if available)
- `file_size`: File size (if available)
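
The following sketch shows one way to read video records in Python, check the required fields, and split the `Label:value` paragraphs seen in the example above into a dictionary. The helper names are hypothetical, and splitting at the first colon is an assumption based on the sample record; paragraphs that contain a colon for other reasons would be split as well.

```python
import json
from datetime import datetime
from typing import Dict, Iterator, List

REQUIRED_FIELDS = ("id", "url", "paragraphs", "extracted_at", "content_type")


def read_records(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def missing_fields(record: dict) -> List[str]:
    """List the required field names that are absent from a record."""
    return [name for name in REQUIRED_FIELDS if name not in record]


def paragraphs_to_dict(paragraphs: List[str]) -> Dict[str, str]:
    """Split "Label:value" paragraphs at the first colon; skip ones without a colon."""
    fields: Dict[str, str] = {}
    for text in paragraphs:
        label, sep, value = text.partition(":")
        if sep:
            fields[label.strip()] = value.strip()
    return fields


record = {
    "id": 5,
    "url": "https://www.cdvl.org/members-section/view-file/?videoid=5",
    "paragraphs": [
        "Description:Pan over a children's ball pit, seen from above using a crane.",
        "Dataset:NTIA Source Scenes",
    ],
    "extracted_at": "2025-10-09T15:30:00+00:00",
    "content_type": "video",
}
assert not missing_fields(record)
details = paragraphs_to_dict(record["paragraphs"])
extracted = datetime.fromisoformat(record["extracted_at"])  # timezone-aware datetime
```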

### Dataset Records

```json
{
  "id": 7,
  "url": "https://www.cdvl.org/members-section/search?dataset=7",
  "paragraphs": ["Description of the dataset...", "Additional information..."],
  "extracted_at": "2025-10-09T15:30:00+00:00",
  "content_type": "dataset",
  "title": "Mobile Quality Dataset",
  "links": [{"text": "Video 123", "href": "/members-section/view-file/?videoid=123"}],
  "tables_count": 2
}
```

Required fields:

- `id`: Dataset ID number
- `url`: Source URL on CDVL
- `paragraphs`: Structured list of text content from the page
- `extracted_at`: Timestamp when data was extracted (ISO 8601 format)
- `content_type`: Always "dataset" for datasets

Optional fields:

- `title`: Dataset title
- `links`: Related links found on the page
- `tables_count`: Number of tables in the dataset page
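
Dataset records link to their videos through `href` values such as `/members-section/view-file/?videoid=123`. As a sketch, the video IDs can be pulled out with the standard library; `video_ids_from_links` is a hypothetical helper, and it assumes the `videoid` query parameter shown in the example record.

```python
from typing import List
from urllib.parse import parse_qs, urlparse


def video_ids_from_links(links: List[dict]) -> List[int]:
    """Collect the numeric `videoid` query parameter from each link's href."""
    ids: List[int] = []
    for link in links:
        query = parse_qs(urlparse(link.get("href", "")).query)
        for value in query.get("videoid", []):
            if value.isdigit():
                ids.append(int(value))
    return ids


links = [{"text": "Video 123", "href": "/members-section/view-file/?videoid=123"}]
video_ids = video_ids_from_links(links)
```

The resulting IDs could then be joined with commas and passed to the `download` subcommand.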

Here are some processing examples using standard shell tools and `jq`.

Count records:

```bash
wc -l videos.jsonl datasets.jsonl
```

View first record:

```bash
head -n 1 videos.jsonl | jq .
```

Extract all titles:

```bash
# `title` is optional; `// empty` skips records without one instead of printing "null"
jq -r '.title // empty' videos.jsonl
```

Filter by keyword in paragraphs:

```bash
jq 'select(.paragraphs | join(" ") | contains("codec"))' videos.jsonl
```

Convert to CSV:

```bash
jq -r '[.id, .title, .url] | @csv' videos.jsonl > videos.csv
```
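
If `jq` is not available, the keyword filter and CSV export can be approximated with the Python standard library. This is a sketch, not an exact equivalent: the search here ignores case, and the CSV writer quotes only where needed, whereas `jq`'s `@csv` quotes every string. The function names are hypothetical.

```python
import csv
import io
import json
from typing import Iterable, Iterator


def filter_by_keyword(lines: Iterable[str], keyword: str) -> Iterator[dict]:
    """Yield records whose paragraphs contain `keyword`, ignoring case."""
    needle = keyword.lower()
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if needle in " ".join(record.get("paragraphs", [])).lower():
            yield record


def to_csv(records: Iterable[dict]) -> str:
    """Render id, title and url as CSV text; a missing title becomes an empty cell."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id", "title", "url"])
    for record in records:
        writer.writerow([record["id"], record.get("title", ""), record["url"]])
    return buffer.getvalue()


if __name__ == "__main__":
    with open("videos.jsonl", encoding="utf-8") as fh:
        print(to_csv(filter_by_keyword(fh, "codec")), end="")
```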

## Configuration File (`config.json`)

Configuration is **optional**. The tool has sensible defaults built-in, and you can use:

- Environment variables for authentication (see [Usage](#usage) above)
- Command-line options for crawling parameters (see `--help`)
- Interactive prompts if credentials are not found

**Auto-detection**: If a file named `config.json` exists in your current directory, it will be automatically loaded. You can override this with `--config path/to/other.json`.

If you want to customize settings permanently or override defaults, create a `config.json` file:

1. Download `config.example.json` from the repository
2. Rename it to `config.json`
3. Edit `config.json` with your settings:

```json
{
  "username": "your.email@example.com",
  "password": "your_password_here",
  "endpoints": {
    "video_base_url": "https://www.cdvl.org/members-section/view-file/",
    "dataset_base_url": "https://www.cdvl.org/members-section/search"
  },
  "output": {
    "videos_file": "videos.jsonl",
    "datasets_file": "datasets.jsonl"
  },
  "start_video_id": 1,
  "start_dataset_id": 1,
  "max_video_id": null,
  "max_dataset_id": null,
  "max_concurrent_requests": 5,
  "max_consecutive_failures": 1000,
  "request_delay": 0.1,
  "probe_step": 100,
  "max_probe_attempts": 20
}
```

### Configuration Options

All settings are optional with sensible defaults. CLI options override config file values.

| Setting                    | Default                              | CLI Option            | Description                                      |
| -------------------------- | ------------------------------------ | --------------------- | ------------------------------------------------ |
| `username`                 | (from env `CDVL_USERNAME` or prompt) | -                     | Your CDVL account email                          |
| `password`                 | (from env `CDVL_PASSWORD` or prompt) | -                     | Your CDVL account password                       |
| `start_video_id`           | 1                                    | `--start-video-id`    | Starting video ID for crawling                   |
| `start_dataset_id`         | 1                                    | `--start-dataset-id`  | Starting dataset ID for crawling                 |
| `max_video_id`             | None                                 | `--max-video-id`      | Maximum video ID to crawl (optional)             |
| `max_dataset_id`           | None                                 | `--max-dataset-id`    | Maximum dataset ID to crawl (optional)           |
| `max_concurrent_requests`  | 5                                    | `--max-concurrent`    | Number of parallel requests                      |
| `max_consecutive_failures` | 1000                                 | `--max-failures`      | Stop after N consecutive empty/failed responses  |
| `request_delay`            | 0.1                                  | `--delay`             | Delay between request batches (seconds)          |
| `probe_step`               | 100                                  | `--probe-step`        | How far ahead to jump when probing for ID gaps   |
| `max_probe_attempts`       | 20                                   | `--max-probe-attempts`| Max probe attempts (20*100=2000 ID range)        |
| `output.videos_file`       | videos.jsonl                         | -                     | Output filename for video metadata               |
| `output.datasets_file`     | datasets.jsonl                       | -                     | Output filename for dataset metadata             |
| `endpoints.video_base_url` | cdvl.org members section             | -                     | Base URL for video pages                         |
| `endpoints.dataset_base_url` | cdvl.org members section           | -                     | Base URL for dataset pages                       |
| `headers`                  | Browser-like headers                 | -                     | HTTP headers (User-Agent, Accept, etc.)          |
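
The precedence rule (defaults, then config file, then CLI options) can be pictured as layered dictionary updates. The sketch below copies the crawl defaults from the table; `effective_settings` is a hypothetical function, and treating a CLI value of `None` as "flag not given" is an assumption.

```python
from typing import Any, Dict, Optional

# Defaults copied from the table above.
DEFAULTS: Dict[str, Any] = {
    "start_video_id": 1,
    "start_dataset_id": 1,
    "max_video_id": None,
    "max_dataset_id": None,
    "max_concurrent_requests": 5,
    "max_consecutive_failures": 1000,
    "request_delay": 0.1,
    "probe_step": 100,
    "max_probe_attempts": 20,
}


def effective_settings(config: Optional[dict] = None, cli: Optional[dict] = None) -> Dict[str, Any]:
    """Layer config-file values over defaults, then CLI values over both."""
    settings = dict(DEFAULTS)
    settings.update({k: v for k, v in (config or {}).items() if k in DEFAULTS})
    # A CLI value of None means "flag not given", so it must not override anything.
    settings.update({k: v for k, v in (cli or {}).items() if k in DEFAULTS and v is not None})
    return settings


settings = effective_settings(config={"probe_step": 50}, cli={"probe_step": 25, "request_delay": None})
# With the defaults, probing covers probe_step * max_probe_attempts = 100 * 20 = 2000 IDs.
probe_range = DEFAULTS["probe_step"] * DEFAULTS["max_probe_attempts"]
```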

## API

You can also use the package programmatically:

```python
import asyncio
from cdvl_crawler import CDVLCrawler, CDVLDownloader, CDVLSiteGenerator

# Crawl videos and datasets
async def crawl():
    # Config file is optional - will use env vars or prompt
    crawler = CDVLCrawler(config_path=None, output_dir="./data")
    await crawler.crawl()

# Download a specific video
async def download():
    # Config file is optional - will use env vars or prompt
    downloader = CDVLDownloader(config_path=None, output_dir="./downloads")
    # Note: the underscore-prefixed methods are private by Python convention
    await downloader._init_session()
    try:
        await downloader._login()

        url = await downloader.get_download_link(42)
        if url:
            await downloader.download_file(url, "output.avi")
    finally:
        # Close the HTTP session even if login or download fails
        await downloader._close_session()

# Generate static site
def generate_site():
    generator = CDVLSiteGenerator(
        input_file="./data/videos.jsonl",
        output_file="./website/index.html"
    )
    success = generator.generate()
    print(f"Site generated: {success}")

# Run
asyncio.run(crawl())
asyncio.run(download())
generate_site()
```

## License

The MIT License (MIT)

Copyright (c) 2025 Werner Robitza

Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

            

    "bugtrack_url": null,
    "license": null,
    "summary": "Crawl and download videos from the CDVL (Consumer Digital Video Library) research repository",
    "version": "0.4.0",
    "project_urls": {
        "Homepage": "https://github.com/slhck/cdvl-crawler",
        "Repository": "https://github.com/slhck/cdvl-crawler"
    },
    "split_keywords": [
        "cdvl",
        " video",
        " crawler",
        " downloader",
        " research"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "e90b2cade9a7555262d0e8dbc65d026d26a272804198b504ee1d76dd5433fffa",
                "md5": "eada8e5b61b86cfaa1be5f400caabae1",
                "sha256": "f6d69bae82bac3b27489a0cf760c8d4698ba682c414d6b56eb9c8d9f17d0b3a2"
            },
            "downloads": -1,
            "filename": "cdvl_crawler-0.4.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "eada8e5b61b86cfaa1be5f400caabae1",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.9",
            "size": 31517,
            "upload_time": "2025-10-13T19:33:44",
            "upload_time_iso_8601": "2025-10-13T19:33:44.849929Z",
            "url": "https://files.pythonhosted.org/packages/e9/0b/2cade9a7555262d0e8dbc65d026d26a272804198b504ee1d76dd5433fffa/cdvl_crawler-0.4.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "828193575cae547848628cde7680483838bd6d80371420518c6c1b43ab9efd19",
                "md5": "ce2732a2f1c7a166cd62f9750c395562",
                "sha256": "13d4b80f147c1f2c8dff1bb005fde7c4a265ad2ffcbad5f0cc40526145e2eeee"
            },
            "downloads": -1,
            "filename": "cdvl_crawler-0.4.0.tar.gz",
            "has_sig": false,
            "md5_digest": "ce2732a2f1c7a166cd62f9750c395562",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.9",
            "size": 27783,
            "upload_time": "2025-10-13T19:33:45",
            "upload_time_iso_8601": "2025-10-13T19:33:45.785058Z",
            "url": "https://files.pythonhosted.org/packages/82/81/93575cae547848628cde7680483838bd6d80371420518c6c1b43ab9efd19/cdvl_crawler-0.4.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-13 19:33:45",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "slhck",
    "github_project": "cdvl-crawler",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "lcname": "cdvl-crawler"
}
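
The configuration table in the README states that CLI options override config file values. The following is a minimal sketch of that precedence, using the setting names and defaults listed in the table. The helper `resolve_settings` is hypothetical and is not part of `cdvl-crawler`; the package's actual merging logic may differ, and treating a `None` CLI value as "not given" is an assumption made here for illustration.

```python
import json

# Defaults copied from the configuration table in the README.
DEFAULTS = {
    "start_video_id": 1,
    "start_dataset_id": 1,
    "max_video_id": None,
    "max_dataset_id": None,
    "max_concurrent_requests": 5,
    "max_consecutive_failures": 1000,
    "request_delay": 0.1,
    "probe_step": 100,
    "max_probe_attempts": 20,
    "videos_file": "videos.jsonl",
    "datasets_file": "datasets.jsonl",
}


def resolve_settings(file_values, cli_values):
    """Hypothetical helper: merge defaults, config file values, and CLI values.

    Later sources win. CLI values left as None are treated as "not given"
    (an assumption for this sketch, not documented package behavior).
    """
    settings = dict(DEFAULTS)
    settings.update(file_values)
    settings.update({k: v for k, v in cli_values.items() if v is not None})
    return settings


# Example: a config file slows the crawl down, and the CLI overrides the delay.
file_values = json.loads('{"request_delay": 0.5, "max_concurrent_requests": 2}')
cli_values = {"request_delay": 1.0, "max_video_id": None}
settings = resolve_settings(file_values, cli_values)

# ID range covered by probing: probe_step * max_probe_attempts (100 * 20 in the table).
probe_range = settings["probe_step"] * settings["max_probe_attempts"]
print(settings["request_delay"], settings["max_concurrent_requests"], probe_range)
```

In this sketch the delay comes from the CLI value, the concurrency from the config file, and everything else from the defaults.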
        