| Field | Value |
| --- | --- |
| Name | ticket-miner |
| Version | 1.0.0 |
| Summary | A Python library that fetches a Jira ticket and scrapes all of its related and referenced content (including the full text of referenced Confluence and Help Center pages), the way a human reviewer would when trying to understand an issue, and returns structured JSON that is easy to pass to an LLM. |
| home_page | None |
| upload_time | 2025-02-22 20:17:18 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.8 |
| license | MIT |
| keywords | ticket mining, knowledge base, jira, confluence, documentation |
| requirements | No requirements were recorded. |
# Ticket Miner
A Python library that replicates how a human would investigate a Jira ticket by automatically mining all referenced content (linked tickets, Confluence pages, Help Center articles, and more) into a structured format suitable for Large Language Model processing.
## Overview
When troubleshooting or analyzing a ticket, a human would:
1. Read the ticket description and comments
2. Follow links to related Jira tickets
3. Check referenced Confluence documentation
4. Look up any Help Center or Developer Documentation pages
5. Analyze all this information together
This library automates this process by:
- Mining the main ticket content
- Following and extracting content from all references recursively
- Converting everything into a clean, structured JSON format
- Making the entire context available for LLM processing
The result is a comprehensive "knowledge bundle" containing all relevant information about a ticket and its context, perfect for:
- Ticket analysis and categorization by LLMs
- Automated troubleshooting
- Knowledge extraction and synthesis
- Pattern recognition across tickets
- Support workflow automation
## Features
- Comprehensive ticket content extraction:
  - Main ticket information (description, comments, metadata)
  - Linked Jira tickets (with their own references)
  - Referenced Confluence pages (with attachments)
  - Help Center articles
  - Developer documentation
  - External URLs
- Smart URL detection and categorization
- Configurable URL patterns and scraping rules
- Resource metadata extraction
- Flexible domain configuration
- Recursive reference processing
- Cycle detection to prevent infinite loops (see the sketch after this list)
- Structured output format optimized for LLM processing
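
To make the recursion and cycle-detection behavior concrete, here is a minimal sketch of the traversal idea. The helpers `fetch_ticket` and `extract_reference_ids` are hypothetical stand-ins, not part of the library's API, and the real implementation may differ:

```python
# Minimal sketch of depth-limited reference mining with cycle detection.
# fetch_ticket(ticket_id) -> dict and extract_reference_ids(ticket) -> list
# are hypothetical stand-ins for the library's internals.
def mine(ticket_id, fetch_ticket, extract_reference_ids, max_depth=2,
         _seen=None, _depth=0):
    _seen = set() if _seen is None else _seen
    if ticket_id in _seen or _depth > max_depth:
        return None  # already visited (cycle) or depth limit reached
    _seen.add(ticket_id)
    ticket = fetch_ticket(ticket_id)
    children = []
    for ref_id in extract_reference_ids(ticket):
        child = mine(ref_id, fetch_ticket, extract_reference_ids,
                     max_depth, _seen, _depth + 1)
        if child is not None:
            children.append(child)
    ticket["references"] = {"jira_tickets": children}
    return ticket
```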
## Installation
```bash
pip install ticket-miner
```
## Quick Start
First, set up your environment variables:
```bash
# .env file
BASE_DOMAIN=yourdomain.com
JIRA_URL=https://jira.yourdomain.com
CONFLUENCE_URL=https://confluence.yourdomain.com
JIRA_USERNAME=your_username
JIRA_API_TOKEN=your_api_token
```
Then mine a complete ticket bundle:
```python
from ticket_miner import TicketMiner
from ticket_miner.extractors import JiraExtractor, ConfluenceExtractor
# Initialize the miner with desired extractors
miner = TicketMiner(
    jira_extractor=JiraExtractor(),
    confluence_extractor=ConfluenceExtractor()
)

# Get complete ticket data with all references
ticket_data = miner.mine_ticket("PROJ-123")

# The ticket_data will contain everything a human would look at:
{
    # Main ticket information
    "id": "PROJ-123",
    "summary": "Example ticket",
    "description": "Ticket description...",
    "status": "Open",
    "priority": "High",
    "assignee": "John Smith",
    "reporter": "Jane Doe",
    "created": "2024-02-18T10:00:00.000Z",
    "updated": "2024-02-18T11:00:00.000Z",
    "labels": ["label1", "label2"],

    # Ticket comments in chronological order
    "comments": [
        {
            "author": "John Smith",
            "body": "Comment text...",
            "created": "2024-02-18T10:30:00.000Z",
            "is_support_team": True
        }
    ],

    # All referenced content
    "references": {
        # Documentation from Confluence
        "confluence_pages": [
            {
                "id": "12345",
                "title": "Documentation Page",
                "space_key": "DOCS",
                "content": "Page content in markdown...",
                "url": "https://confluence.example.com/pages/12345",
                "creator": "Jane Doe",
                "created": "2024-02-17T10:00:00.000Z",
                "updated": "2024-02-18T09:00:00.000Z",
                "attachments": [
                    {
                        "filename": "document.pdf",
                        "size": 1024,
                        "mediaType": "application/pdf"
                    }
                ]
            }
        ],

        # Other Jira tickets referenced (with their own references)
        "jira_tickets": [
            {
                "id": "PROJ-124",
                "summary": "Related ticket",
                "status": "Closed",
                "description": "Related ticket description...",
                "references": {
                    # Each linked ticket also includes its references
                    "confluence_pages": [...],
                    "jira_tickets": [...],
                    "scrapable_documentation": [...]
                }
            }
        ],

        # Help Center and Developer Documentation
        "scrapable_documentation": [
            {
                "url": "https://help.example.com/article/123",
                "title": "Help Article",
                "content": "Article content...",
                "author": "Support Team",
                "date": "2024-02-15"
            }
        ],

        # Any other referenced URLs
        "other_urls": [
            {
                "url": "https://example.com/some-page",
                "type": "external",
                "domain": "example.com",
                "context": "Referenced in comment"
            }
        ]
    }
}
```
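
Because the bundle is plain JSON-serializable data, handing it to an LLM is typically just a matter of serializing it into a prompt. A minimal sketch (the prompt wording is only an example):

```python
import json

# Serialize the mined bundle and embed it in a prompt for an LLM.
bundle = json.dumps(ticket_data, indent=2, ensure_ascii=False)
prompt = (
    "You are a support engineer. Using the ticket bundle below, "
    "summarize the issue and suggest a likely root cause.\n\n"
    + bundle
)
```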
## Configuration
### Environment Variables
The library uses environment variables for configuration. You can set these in a `.env` file:
```bash
# Base domain for your organization
BASE_DOMAIN=yourdomain.com

# Jira configuration
JIRA_URL=https://jira.yourdomain.com
JIRA_USERNAME=your_username
JIRA_API_TOKEN=your_api_token

# Confluence configuration
CONFLUENCE_URL=https://confluence.yourdomain.com
CONFLUENCE_USERNAME=your_username
CONFLUENCE_API_TOKEN=your_api_token
```
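
A quick pre-flight check that the variables are actually set can save a confusing failure later; this snippet is illustrative and not part of the library:

```python
import os

# Fail fast if any required variable is missing from the environment.
REQUIRED = ["BASE_DOMAIN", "JIRA_URL", "JIRA_USERNAME", "JIRA_API_TOKEN"]
missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```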
### Custom URL Patterns
Create a JSON file with your custom URL patterns:
```json
{
    "url_patterns": {
        "help_center": {
            "domains": ["help.yourdomain.com"],
            "scrape": true,
            "exclude_patterns": [
                "^/search(/.*)?$",
                "^/user(/.*)?$"
            ]
        }
    }
}
```
Initialize the analyzer with your patterns:
```python
from ticket_miner import URLAnalyzer  # import path assumed

analyzer = URLAnalyzer(patterns_file="path/to/patterns.json")
```
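
The `exclude_patterns` entries are regular expressions matched against the URL path. A small, self-contained illustration of that matching semantics (an assumption about how the patterns are presumably applied, not the library's code):

```python
import re

# URL paths matching any exclude pattern are skipped during scraping.
exclude_patterns = [r"^/search(/.*)?$", r"^/user(/.*)?$"]

def path_is_excluded(path: str) -> bool:
    return any(re.match(pattern, path) for pattern in exclude_patterns)

assert path_is_excluded("/search/results")
assert not path_is_excluded("/article/123")
```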
## Advanced Usage
### Controlling Reference Depth
You can control how deep the extractor follows references:
```python
# Only extract direct references
extractor = JiraExtractor(max_reference_depth=1)

# Extract references up to 3 levels deep (default is 2)
extractor = JiraExtractor(max_reference_depth=3)
```
### Async Support
For web applications or when processing multiple tickets:
```python
from ticket_miner.extractors import JiraExtractor

async def process_ticket():
    extractor = JiraExtractor()
    ticket_data = await extractor.get_ticket("PROJ-123")
    # Process the ticket data
```
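
Since `get_ticket` is awaitable, several tickets can be mined concurrently with `asyncio.gather`. A sketch, assuming the extractor is safe to reuse across concurrent calls:

```python
import asyncio

from ticket_miner.extractors import JiraExtractor

async def process_tickets(keys):
    extractor = JiraExtractor()
    # Mine all tickets concurrently; gather preserves input order.
    return await asyncio.gather(*(extractor.get_ticket(key) for key in keys))

results = asyncio.run(process_tickets(["PROJ-123", "PROJ-124"]))
```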
## API Reference
### URLAnalyzer
The main class for URL analysis and extraction.
#### Methods
- `analyze_content(content: str, source_content_id: str, source_type: str = "description") -> List[URLMatch]`
Analyzes content to find and categorize URLs.
- `is_scrapable_url(url: str, domain: str) -> bool`
Checks if a URL should be scraped based on configuration.
- `print_summary()`
Prints a summary of URL analysis statistics.
#### Configuration Options
- `base_domain`: Your organization's base domain
- `patterns_file`: Path to custom URL patterns JSON file
### URLMatch
Data class containing information about matched URLs.
#### Attributes
- `url`: The matched URL
- `url_type`: Type of URL (e.g., "collaboration", "help_center")
- `domain`: URL domain
- `path`: URL path
- `resource_metadata`: Extracted resource metadata
- `context`: Surrounding content context
- `source_content_id`: ID of source content
- `source_type`: Type of source content
- `should_scrape`: Whether URL should be scraped
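
Putting the two classes together: analyze a block of text and keep only the URLs flagged for scraping. The constructor arguments mirror the configuration options above, and the import path is assumed:

```python
from ticket_miner import URLAnalyzer  # import path assumed

analyzer = URLAnalyzer(base_domain="yourdomain.com")
matches = analyzer.analyze_content(
    content="See https://help.yourdomain.com/article/123 for details.",
    source_content_id="PROJ-123",
    source_type="comment",
)
# Keep only URLs the configuration marks as scrapable.
to_scrape = [match.url for match in matches if match.should_scrape]
```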
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.