kaggle-discussion-extractor

Name	kaggle-discussion-extractor
Version	1.3.0
home_page	https://github.com/Letemoin/kaggle-discussion-extractor
Summary	A professional-grade tool for extracting and analyzing discussions from Kaggle competitions
upload_time	2025-10-26 19:52:20
maintainer	Kaggle Discussion Extractor Contributors
docs_url	None
author	Kaggle Discussion Extractor Contributors
requires_python	>=3.8
license	MIT
keywords	kaggle discussion extractor web-scraping data-extraction machine-learning competition playwright async
requirements	playwright asyncio-extras nbformat nbconvert

# Kaggle Discussion Extractor

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Playwright](https://img.shields.io/badge/Playwright-45ba4b?style=flat&logo=playwright&logoColor=white)](https://playwright.dev/python/)

A professional-grade Python tool for extracting and analyzing discussions, solution writeups, and notebooks from Kaggle competitions. It features hierarchical reply extraction, automatic writeup extraction from leaderboards, competition notebook downloading with conversion to Python files, and clean Markdown output with rich metadata.

## 🚀 Key Features

### 📓 Notebook Download & Conversion
- **Competition Notebook Discovery**: Automatically finds all public notebooks from competitions
- **Kaggle API Integration**: Uses Kaggle CLI API for reliable notebook listing (primary method)
- **Web Scraping Fallback**: Advanced lazy loading with configurable retry attempts
- **Automatic Conversion**: Converts `.ipynb` files to clean Python `.py` files
- **Smart Naming**: Files saved as `{NotebookTitle}_{YYMMDD}.py` with metadata headers
- **Batch Processing**: Download and convert multiple notebooks efficiently
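
The `{NotebookTitle}_{YYMMDD}.py` naming scheme can be illustrated with a minimal sketch. The helper `notebook_filename` below is hypothetical and not part of the package; it assumes the title is derived from the notebook's URL slug, which happens to reproduce the file names shown in the sample output tree.

```python
from datetime import date


def notebook_filename(slug: str, updated: date) -> str:
    """Build '{NotebookTitle}_{YYMMDD}.py' from a URL slug (hypothetical helper)."""
    # Assumption: the title is the slug with hyphens replaced and words title-cased.
    title = slug.replace("-", " ").title()
    # %y%m%d yields a two-digit year, month, and day, e.g. 250908.
    return f"{title}_{updated.strftime('%y%m%d')}.py"


filename = notebook_filename("neurips-simple-baseline-with-lightgbm", date(2025, 9, 8))
print(filename)
```

The actual tool may sanitize titles differently; treat this only as a way to predict or match file names in your own scripts.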

### Competition Writeup Extraction
- **Leaderboard Scraping**: Automatically extracts writeup URLs from competition leaderboards
- **Private/Public Leaderboards**: Supports both private and public leaderboard tabs
- **Custom Naming**: Files saved as `{contest_name}_{rank}_{team_name}.md`
- **Rich Metadata**: Includes rank, team members, scores, and extraction timestamps
- **Top-N Selection**: Extract only top performers (e.g., top 10)
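
As a rough illustration of the `{contest_name}_{rank}_{team_name}.md` pattern, the sketch below builds such a name. `writeup_filename` is a hypothetical helper; the two-digit rank padding is inferred from the sample output tree, and the team-name sanitization rule is an assumption.

```python
import re


def writeup_filename(contest: str, rank: int, team: str) -> str:
    """Build '{contest_name}_{rank}_{team_name}.md' (hypothetical helper)."""
    # Assumption: characters other than letters, digits, '_' and '-' are dropped.
    safe_team = re.sub(r"[^\w-]+", "", team)
    # Assumption: ranks are zero-padded to two digits, as in the sample tree.
    return f"{contest}_{rank:02d}_{safe_team}.md"


first_place = writeup_filename("contest-name", 1, "Team Name")
print(first_place)
```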

### Hierarchical Discussion Extraction
- **Complete Thread Preservation**: Maintains the full discussion structure with parent-child relationships
- **Smart Reply Numbering**: Automatic hierarchical numbering (1, 1.1, 1.2, 2, 2.1, etc.)
- **No Content Duplication**: Intelligently separates parent and nested reply content
- **Deep Nesting Support**: Handles multiple levels of nested replies
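
The hierarchical numbering described above (1, 1.1, 1.2, 2, ...) can be sketched with a short recursive generator. The `Reply` dataclass and `number_replies` function are illustrative names, not the package's internal data model.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Reply:
    """A reply with optional nested children (illustrative structure)."""
    author: str
    children: List["Reply"] = field(default_factory=list)


def number_replies(replies: List[Reply], prefix: str = "") -> Iterator[Tuple[str, Reply]]:
    """Yield (label, reply) pairs in depth-first order with dotted labels."""
    for index, reply in enumerate(replies, start=1):
        label = f"{prefix}{index}"
        yield label, reply
        # Children inherit the parent's label as their prefix: 1 -> 1.1, 1.2, ...
        yield from number_replies(reply.children, prefix=f"{label}.")


thread = [Reply("user1", [Reply("user2"), Reply("user3")]), Reply("user4")]
labels = [label for label, _ in number_replies(thread)]
print(labels)  # ['1', '1.1', '1.2', '2']
```

Because the prefix is passed down the recursion, the same function handles any nesting depth without special cases.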

### Rich Metadata Extraction
- **Author Information**: Names, usernames, profile URLs
- **Competition Rankings**: Extracts "Nth in this Competition" rankings
- **User Badges**: Competition Host, Expert, Master, Grandmaster badges
- **Engagement Metrics**: Upvotes/downvotes for all posts and replies
- **Timestamps**: Full timestamp extraction for temporal analysis
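
For downstream analysis, the "Nth in this Competition" strings can be turned into integers. The following is a minimal sketch; `parse_rank` and `RANK_PATTERN` are hypothetical names, and the pattern only assumes the wording shown in this README.

```python
import re
from typing import Optional

# Matches strings such as "27th in this Competition" or "1st in this Competition".
RANK_PATTERN = re.compile(r"(\d+)(?:st|nd|rd|th) in this Competition")


def parse_rank(text: str) -> Optional[int]:
    """Return the numeric rank, or None when the text carries no ranking."""
    match = RANK_PATTERN.search(text)
    return int(match.group(1)) if match else None


rank = parse_rank("27th in this Competition")
print(rank)
```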

### Advanced Capabilities
- **Pagination Support**: Automatically handles multi-page discussion lists
- **Lazy Loading Handling**: Advanced infinite scroll with configurable retry attempts
- **Batch Processing**: Extract all discussions, writeups, or notebooks from a competition at once
- **Rate Limiting**: Built-in delays to respect server resources
- **Error Recovery**: Robust error handling with detailed logging
- **Multiple Output Formats**: Clean Markdown export and Python file conversion
- **Hybrid Extraction**: Kaggle API integration with web scraping fallback
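
The combination of retry attempts and built-in delays can be pictured with a generic asyncio pattern. This is a sketch under stated assumptions, not the package's implementation: `with_retries`, its parameters, and the `flaky` coroutine are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retries(task: Callable[[], Awaitable[T]], attempts: int = 2, delay: float = 1.0) -> T:
    """Await task() up to `attempts` times, sleeping `delay` seconds between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await task()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(delay)  # pause before the next try


calls = {"count": 0}


async def flaky() -> str:
    """Fail on the first call, succeed on the second (demo only)."""
    calls["count"] += 1
    if calls["count"] < 2:
        raise RuntimeError("temporary failure")
    return "ok"


result = asyncio.run(with_retries(flaky, attempts=2, delay=0.0))
print(result, calls["count"])
```

A fixed delay keeps the example short; a real scraper might prefer a growing delay between attempts.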

## 📦 Installation

### Method 1: Install from PyPI (Recommended)

```bash
pip install kaggle-discussion-extractor
playwright install chromium
```

### Method 2: Install from Source

```bash
# Clone the repository
git clone https://github.com/Letemoin/kaggle-discussion-extractor.git
cd kaggle-discussion-extractor

# Install in development mode
pip install -e .
playwright install chromium
```

## 🎯 Quick Start

### Command Line Usage

```bash
# Extract all discussions from a competition
kaggle-discussion-extractor https://www.kaggle.com/competitions/neurips-2025

# Extract only 10 discussions
kaggle-discussion-extractor https://www.kaggle.com/competitions/neurips-2025 --limit 10

# Enable development mode for detailed logging
kaggle-discussion-extractor https://www.kaggle.com/competitions/neurips-2025 --dev-mode

# Run with visible browser (useful for debugging)
kaggle-discussion-extractor https://www.kaggle.com/competitions/neurips-2025 --no-headless

# Extract top 10 writeups from private leaderboard
kaggle-discussion-extractor https://www.kaggle.com/competitions/cmi-detect-behavior --extract-writeups --limit 10

# Extract from public leaderboard with development mode
kaggle-discussion-extractor https://www.kaggle.com/competitions/cmi-detect-behavior --extract-writeups --leaderboard-tab public --dev-mode

# Download competition notebooks and convert to Python files
kaggle-discussion-extractor https://www.kaggle.com/competitions/neurips-2025 --download-notebooks

# Download notebooks with enhanced extraction (2 retry attempts)
kaggle-discussion-extractor https://www.kaggle.com/competitions/neurips-2025 --download-notebooks --extraction-attempts 2 --limit 20
```

### Python API Usage

#### Extract Discussions
```python
import asyncio
from kaggle_discussion_extractor import KaggleDiscussionExtractor

async def extract_discussions():
    # Initialize extractor
    extractor = KaggleDiscussionExtractor(dev_mode=True)
    
    # Extract discussions
    success = await extractor.extract_competition_discussions(
        competition_url="https://www.kaggle.com/competitions/neurips-2025",
        limit=5  # Optional: limit number of discussions
    )
    
    if success:
        print("Extraction completed successfully!")
    else:
        print("Extraction failed!")

# Run the extraction
asyncio.run(extract_discussions())
```

#### Extract Writeups
```python
import asyncio
from kaggle_discussion_extractor import KaggleWriteupExtractor

async def extract_writeups():
    # Initialize writeup extractor
    extractor = KaggleWriteupExtractor(dev_mode=True)
    
    # Extract top 5 writeups from private leaderboard
    success = await extractor.extract_writeups(
        competition_url="https://www.kaggle.com/competitions/cmi-detect-behavior",
        limit=5,
        leaderboard_tab="private"
    )
    
    if success:
        print("Writeup extraction completed successfully!")
    else:
        print("Writeup extraction failed!")

# Run the extraction
asyncio.run(extract_writeups())
```

#### Download Notebooks
```python
import asyncio
from pathlib import Path

from kaggle_discussion_extractor import KaggleNotebookDownloader

async def download_notebooks():
    # Initialize notebook downloader
    downloader = KaggleNotebookDownloader(
        dev_mode=True,
        extraction_attempts=2  # Enhanced extraction with 2 attempts
    )
    
    # Extract notebook list from competition
    notebooks = await downloader.extract_notebook_list(
        competition_url="https://www.kaggle.com/competitions/neurips-2025",
        limit=20  # List at most 20 notebooks
    )
    
    print(f"Found {len(notebooks)} notebooks:")
    for notebook in notebooks:
        print(f"  - {notebook.title} by {notebook.author}")
    
    # Download and convert notebooks to Python files
    output_dir = Path("competition_notebooks")
    output_dir.mkdir(exist_ok=True)  # Make sure the target directory exists
    success_count = 0
    
    for notebook in notebooks:
        success = await downloader.extract_notebook_comments(notebook, output_dir)
        if success:
            success_count += 1
    
    print(f"Successfully downloaded {success_count}/{len(notebooks)} notebooks")

# Run the download
asyncio.run(download_notebooks())
```

## 📋 CLI Options

| Option | Description | Default |
|--------|-------------|---------|
| `competition_url` | URL of the Kaggle competition (required) | - |
| `--limit, -l` | Number of discussions/writeups/notebooks to extract | All |
| `--dev-mode, -d` | Enable detailed logging | False |
| `--no-headless` | Run browser in visible mode | False (headless) |
| `--date-format` | Include YYMMDD date in filename | False |
| `--date-position` | Position of date (prefix/suffix) | suffix |
| `--extract-writeups` | Extract writeups from leaderboard | False |
| `--leaderboard-tab` | Leaderboard tab (private/public) | private |
| `--download-notebooks` | Download and convert notebooks to Python | False |
| `--extraction-attempts` | Number of retry attempts for notebook extraction | 1 |
| `--notebooks-input` | Text file with notebook URLs for batch download | - |
| `--version, -v` | Show version information | - |
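
To show how the options in the table fit together, here is a hypothetical `argparse` sketch that mirrors a subset of them. It is not the package's `cli.py`; only the flag names and defaults are taken from the table, and everything else (types, `build_parser`) is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Mirror a subset of the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="kaggle-discussion-extractor")
    parser.add_argument("competition_url", help="URL of the Kaggle competition")
    parser.add_argument("--limit", "-l", type=int, default=None, help="None means all")
    parser.add_argument("--dev-mode", "-d", action="store_true")
    parser.add_argument("--no-headless", action="store_true")
    parser.add_argument("--extract-writeups", action="store_true")
    parser.add_argument("--leaderboard-tab", choices=["private", "public"], default="private")
    parser.add_argument("--download-notebooks", action="store_true")
    parser.add_argument("--extraction-attempts", type=int, default=1)
    return parser


args = build_parser().parse_args(
    ["https://www.kaggle.com/competitions/neurips-2025", "--extract-writeups", "-l", "10"]
)
print(args.limit, args.leaderboard_tab, args.extract_writeups)
```

Options that are not passed keep the defaults listed in the table, so the example above still targets the private leaderboard.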

## 📁 Output Structure

### 📓 Notebook Downloads
The notebook downloader writes the converted files to the output directory you specify (default: the current directory). For example:

```
competition_notebooks/
├── Neurips Simple Baseline With Lightgbm_250908.py
├── Multi Seed Ensemble Of Gbdts And A Neural Network_250908.py
├── Lb 0 067 2Gpu Chemberta Train_250908.py
├── End To End Best Practice Solution_250908.py
└── ...
```

Each Python file includes:
- **Metadata Header**: Original title, author, source URL, download timestamp
- **Clean Python Code**: Converted from Jupyter notebook cells
- **Preserved Structure**: In[ ] comments maintain original cell organization

### Writeup Extraction
The writeup extractor creates a `writeups_extracted` directory with:

```
writeups_extracted/
├── contest-name_01_TeamName.md
├── contest-name_02_AnotherTeam.md 
├── contest-name_03_ThirdPlace.md
└── ...
```

### Discussion Extraction
The discussion extractor creates a `kaggle_discussions_extracted` directory with:

```
kaggle_discussions_extracted/
├── 01_Discussion_Title.md
├── 02_Another_Discussion.md
├── 03_Third_Discussion.md
└── ...
```

### Sample Notebook Output Format

```python
#!/usr/bin/env python3
"""
Neurips Simple Baseline With Lightgbm
Author: jade290395
Last Updated: 250908
Source: https://www.kaggle.com/code/jade290395/neurips-simple-baseline-with-lightgbm
Downloaded: 2025-09-08 22:04:47
"""

#!/usr/bin/env python
# coding: utf-8

# In[ ]:

import pandas as pd
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from lightgbm import LGBMRegressor

# In[ ]:

# Load data and build features
tg = pd.read_csv('/kaggle/input/modred-dataset/desc_tg.csv')
tc = pd.read_csv('/kaggle/input/modred-dataset/desc_tc.csv')

# Continue with notebook code...
```
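
Because the `# In[ ]:` comments mark the original cell boundaries, a converted file can be split back into cells for analysis. The sketch below assumes the marker format shown in the sample; `split_cells` and `CELL_MARKER` are hypothetical names.

```python
import re
from typing import List

# A cell marker is a line such as "# In[ ]:" or "# In[3]:".
CELL_MARKER = re.compile(r"^# In\[[ \d]*\]:[ \t]*$", re.MULTILINE)


def split_cells(source: str) -> List[str]:
    """Return the code of each cell, skipping the header before the first marker."""
    parts = CELL_MARKER.split(source)
    return [part.strip() for part in parts[1:] if part.strip()]


sample = """# coding: utf-8

# In[ ]:

import pandas as pd

# In[ ]:

tg = pd.read_csv('desc_tg.csv')
"""

cells = split_cells(sample)
print(len(cells))
```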

### Sample Discussion Output Format

```markdown
# Discussion Title

**URL**: https://www.kaggle.com/competitions/neurips-2025/discussion/123456
**Total Comments**: 15
**Extracted**: 2025-01-15T10:30:00

---

## Main Post

**Author**: username (@username)
**Rank**: 27th in this Competition
**Badges**: Competition Host
**Upvotes**: 36

Main discussion content goes here...

---

## Replies

### Reply 1

- **Author**: user1 (@user1)
- **Rank**: 154th in this Competition
- **Upvotes**: 11
- **Timestamp**: Tue Jun 17 2025 11:54:57 GMT+0300

Content of reply 1...

  #### Reply 1.1

  - **Author**: user2 (@user2)
  - **Upvotes**: 6
  - **Timestamp**: Sun Jun 29 2025 04:20:43 GMT+0300

  Nested reply content...

  #### Reply 1.2

  - **Author**: user3 (@user3)
  - **Upvotes**: 2
  - **Timestamp**: Wed Jul 16 2025 12:50:34 GMT+0300

  Another nested reply...

---

### Reply 2

- **Author**: user4 (@user4)
- **Upvotes**: -3

Content of reply 2...

---
```
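
The Markdown layout above is regular enough to read back with a few regular expressions. The following sketch collects the number, author, and upvotes of each reply; it assumes the exact heading and bullet format of the sample, and `parse_replies` is a hypothetical helper rather than part of the package.

```python
import re
from typing import Dict, List

REPLY_HEADING = re.compile(r"^\s*#{3,4} Reply ([\d.]+)\s*$")
FIELD = re.compile(r"^\s*- \*\*(\w+)\*\*: (.+)$")


def parse_replies(markdown: str) -> List[Dict[str, str]]:
    """Collect per-reply metadata from an extracted discussion file."""
    replies: List[Dict[str, str]] = []
    for line in markdown.splitlines():
        heading = REPLY_HEADING.match(line)
        if heading:
            replies.append({"number": heading.group(1)})
            continue
        meta = FIELD.match(line)
        if meta and replies:
            # Attach "- **Key**: value" bullets to the most recent reply.
            replies[-1][meta.group(1).lower()] = meta.group(2).strip()
    return replies


sample = """### Reply 1

- **Author**: user1 (@user1)
- **Upvotes**: 11

  #### Reply 1.1

  - **Author**: user2 (@user2)
  - **Upvotes**: 6
"""

replies = parse_replies(sample)
print([(r["number"], r["upvotes"]) for r in replies])
```

Values are kept as strings; convert `upvotes` with `int()` when you need numeric comparisons, since negative counts such as `-3` also appear in the output.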

## ⚙️ Configuration

### Development Mode

Enable development mode to see detailed logs and debugging information:

```python
from kaggle_discussion_extractor import KaggleDiscussionExtractor
extractor = KaggleDiscussionExtractor(dev_mode=True)
```

**What dev_mode does:**
- Enables DEBUG level logging
- Shows detailed progress information
- Displays browser automation steps
- Provides error stack traces
- Logs DOM element detection details
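
A flag like `dev_mode` typically maps onto the standard `logging` levels. The sketch below shows one plausible way to do that; `configure_logging` and the logger name are assumptions and do not describe the package's internals.

```python
import logging


def configure_logging(dev_mode: bool = False) -> logging.Logger:
    """Return a logger at DEBUG level in dev mode, INFO otherwise (illustrative)."""
    logger = logging.getLogger("kaggle_extractor_sketch")  # hypothetical logger name
    logger.setLevel(logging.DEBUG if dev_mode else logging.INFO)
    return logger


logger = configure_logging(dev_mode=True)
print(logging.getLevelName(logger.level))
```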

### Browser Mode

Run with visible browser for debugging:

```python
from kaggle_discussion_extractor import KaggleDiscussionExtractor
extractor = KaggleDiscussionExtractor(headless=False)
```

## 🧪 Examples

### Basic Example

```python
from kaggle_discussion_extractor import KaggleDiscussionExtractor
import asyncio

async def main():
    extractor = KaggleDiscussionExtractor()
    
    await extractor.extract_competition_discussions(
        "https://www.kaggle.com/competitions/neurips-2025"
    )

asyncio.run(main())
```

### Notebook Download Example

```python
from kaggle_discussion_extractor import KaggleNotebookDownloader
import asyncio
from pathlib import Path

async def download_competition_notebooks():
    # Initialize with enhanced extraction settings
    downloader = KaggleNotebookDownloader(
        dev_mode=True,           # Enable detailed logging
        headless=True,           # Run in background
        extraction_attempts=2    # Try extraction twice for better results
    )
    
    # List notebooks from the competition
    notebooks = await downloader.extract_notebook_list(
        "https://www.kaggle.com/competitions/neurips-2025",
        limit=15  # List at most 15 notebooks
    )
    
    # Create output directory
    output_dir = Path("neurips_notebooks")
    output_dir.mkdir(exist_ok=True)
    
    # Download and convert each notebook
    for i, notebook in enumerate(notebooks, 1):
        print(f"Processing {i}/{len(notebooks)}: {notebook.title}")
        success = await downloader.extract_notebook_comments(notebook, output_dir)
        if success:
            print(f"✅ Downloaded: {notebook.filename}")
        else:
            print(f"❌ Failed: {notebook.title}")

asyncio.run(download_competition_notebooks())
```

### Advanced Example with Logging

```python
import asyncio
import logging
from kaggle_discussion_extractor import KaggleDiscussionExtractor

# Setup custom logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

async def extract_with_monitoring():
    extractor = KaggleDiscussionExtractor(
        dev_mode=True,  # Enable detailed logging
        headless=True   # Run in background
    )
    
    logger.info("Starting extraction...")
    
    success = await extractor.extract_competition_discussions(
        competition_url="https://www.kaggle.com/competitions/neurips-2025",
        limit=20  # Extract first 20 discussions
    )
    
    if success:
        logger.info("✅ Extraction completed successfully!")
        logger.info("Check 'kaggle_discussions_extracted' directory for results")
    else:
        logger.error("❌ Extraction failed!")

if __name__ == "__main__":
    asyncio.run(extract_with_monitoring())
```

## 🔧 Development

### Setup Development Environment

```bash
# Clone repository
git clone https://github.com/Letemoin/kaggle-discussion-extractor.git
cd kaggle-discussion-extractor

# Install development dependencies
pip install -e ".[dev]"
playwright install chromium

# Run tests
pytest tests/
```

### Project Structure

```
kaggle_discussion_extractor/
├── __init__.py              # Package initialization and exports
├── core.py                 # Discussion extraction logic
├── writeup_extractor.py    # Leaderboard writeup extraction
├── notebook_downloader.py  # Competition notebook downloading
└── cli.py                  # Command-line interface
```

## 🤝 Contributing

Contributions are welcome! Please read our [Contributing Guidelines](CONTRIBUTING.md) for details on how to submit pull requests, report issues, and contribute to the project.


## 🙏 Acknowledgments

- Built with [Playwright](https://playwright.dev/) for reliable browser automation
- Uses [nbformat](https://nbformat.readthedocs.io/) and [nbconvert](https://nbconvert.readthedocs.io/) for Jupyter notebook processing
- Integrates with [Kaggle CLI](https://github.com/Kaggle/kaggle-api) for robust notebook discovery
- Inspired by the need for better Kaggle competition analysis tools
- Thanks to the open-source community for continuous support

Raw data

            {
    "_id": null,
    "home_page": "https://github.com/Letemoin/kaggle-discussion-extractor",
    "name": "kaggle-discussion-extractor",
    "maintainer": "Kaggle Discussion Extractor Contributors",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "kaggle, discussion, extractor, web-scraping, data-extraction, machine-learning, competition, playwright, async",
    "author": "Kaggle Discussion Extractor Contributors",
    "author_email": "contact@kaggle-extractor.com",
    "download_url": "https://files.pythonhosted.org/packages/aa/81/dac4f1fba1d8ff7db0157e52b8da272efb279d8201968038b42a05655e4e/kaggle_discussion_extractor-1.3.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A professional-grade tool for extracting and analyzing discussions from Kaggle competitions",
    "version": "1.3.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/Letemoin/kaggle-discussion-extractor/issues",
        "Documentation": "https://github.com/Letemoin/kaggle-discussion-extractor#readme",
        "Homepage": "https://github.com/Letemoin/kaggle-discussion-extractor",
        "Repository": "https://github.com/Letemoin/kaggle-discussion-extractor"
    },
    "split_keywords": [
        "kaggle",
        " discussion",
        " extractor",
        " web-scraping",
        " data-extraction",
        " machine-learning",
        " competition",
        " playwright",
        " async"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "6e5c7a2ba16be6de05cf54cfa6bcba0746cec8c5705ac108abfab8353428c8a5",
                "md5": "84d27f4a477449707ac15e30bdcd938a",
                "sha256": "876d0b433d6ca777a98b3bb309d10ea518f51e853e880e905f7ca5bf61d49f12"
            },
            "downloads": -1,
            "filename": "kaggle_discussion_extractor-1.3.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "84d27f4a477449707ac15e30bdcd938a",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 37415,
            "upload_time": "2025-10-26T19:52:19",
            "upload_time_iso_8601": "2025-10-26T19:52:19.145903Z",
            "url": "https://files.pythonhosted.org/packages/6e/5c/7a2ba16be6de05cf54cfa6bcba0746cec8c5705ac108abfab8353428c8a5/kaggle_discussion_extractor-1.3.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "aa81dac4f1fba1d8ff7db0157e52b8da272efb279d8201968038b42a05655e4e",
                "md5": "24090ccb9371694aff0a00951216d696",
                "sha256": "aa94df78d789596defffd70f400b08152214aafbb7a9e2c24d953a9f028a839f"
            },
            "downloads": -1,
            "filename": "kaggle_discussion_extractor-1.3.0.tar.gz",
            "has_sig": false,
            "md5_digest": "24090ccb9371694aff0a00951216d696",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 41343,
            "upload_time": "2025-10-26T19:52:20",
            "upload_time_iso_8601": "2025-10-26T19:52:20.644861Z",
            "url": "https://files.pythonhosted.org/packages/aa/81/dac4f1fba1d8ff7db0157e52b8da272efb279d8201968038b42a05655e4e/kaggle_discussion_extractor-1.3.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-10-26 19:52:20",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "Letemoin",
    "github_project": "kaggle-discussion-extractor",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "playwright",
            "specs": [
                [
                    ">=",
                    "1.40.0"
                ]
            ]
        },
        {
            "name": "asyncio-extras",
            "specs": [
                [
                    ">=",
                    "1.3.0"
                ]
            ]
        },
        {
            "name": "nbformat",
            "specs": [
                [
                    ">=",
                    "5.9.0"
                ]
            ]
        },
        {
            "name": "nbconvert",
            "specs": [
                [
                    ">=",
                    "7.8.0"
                ]
            ]
        }
    ],
    "lcname": "kaggle-discussion-extractor"
}
