doculyzer

Name: doculyzer
Version: 0.19.0
Summary: Universal, Searchable, Structured Document Manager
License: MIT
Keywords: document-management, semantic-search, embedding, document-parsing
Upload time: 2025-05-07 02:40:04

# Doculyzer

## Universal, Searchable, Structured Document Manager

Doculyzer is a powerful document management system that creates a universal, structured representation of documents from various sources while maintaining pointers to the original content rather than duplicating it.

```
┌─────────────────┐     ┌─────────────────┐     ┌────────────────┐
│ Content Sources │     │Document Ingester│     │  Storage Layer │
└────────┬────────┘     └────────┬────────┘     └────────┬───────┘
         │                       │                       │
┌────────┼────────┐     ┌────────┼────────┐     ┌────────┼──────┐
│ Confluence API  │     │Parser Adapters  │     │SQLite Backend │
│ Markdown Files  │◄───►│Structure Extract│◄───►│MongoDB Backend│
│ HTML from URLs  │     │Embedding Gen    │     │Vector Database│
│ DOCX Documents  │     │Relationship Map │     │Graph Database │
└─────────────────┘     └─────────────────┘     └───────────────┘
```

## Key Features

- **Universal Document Model**: Common representation across document types
- **Preservation of Structure**: Maintains hierarchical document structure
- **Content Resolution**: Resolves pointers back to original content when needed
- **Contextual Semantic Search**: Uses advanced embedding techniques that incorporate document context (hierarchy, neighbors) for more accurate semantic search
- **Element-Level Precision**: Maintains granular accuracy to specific document elements
- **Relationship Mapping**: Identifies connections between document elements
- **Configurable Vector Representations**: Support for different vector dimensions based on content needs, allowing larger vectors for technical content and smaller vectors for general content

## Supported Document Types

Doculyzer can ingest and process a variety of document formats:
- HTML pages
- Markdown files
- Plain text files
- PDF documents
- Microsoft Word documents (DOCX)
- Microsoft PowerPoint presentations (PPTX)
- Microsoft Excel spreadsheets (XLSX)
- CSV files
- XML files
- JSON files

## Content Sources

Doculyzer supports multiple content sources:
- File systems (local, mounted, and network shares)
- HTTP endpoints
- Confluence
- JIRA
- Amazon S3
- Relational Databases
- ServiceNow
- MongoDB

## Architecture

The system is built with a modular architecture:

1. **Content Sources**: Adapters for different content origins
2. **Document Parsers**: Transform content into structured elements
3. **Document Database**: Stores metadata, elements, and relationships
4. **Content Resolver**: Retrieves original content when needed
5. **Embedding Generator**: Creates vector representations for semantic search
6. **Relationship Detector**: Identifies connections between document elements
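
The sketch below shows how a single ingestion pass could thread through these components. The class and method names here (`ContentSource`, `DocumentParser`, `store_elements`, `detect`, `store_relationships`) are illustrative assumptions, not doculyzer's actual interfaces; only `generate_from_elements` and `store_embedding` appear later in this README.

```python
from typing import Any, Protocol


class ContentSource(Protocol):
    """1. Content source adapter: yields raw content plus a pointer to the original."""
    def fetch(self) -> list[dict[str, Any]]: ...


class DocumentParser(Protocol):
    """2. Parser: transforms raw content into structured elements."""
    def parse(self, raw: dict[str, Any]) -> list[dict[str, Any]]: ...


def ingest_one_pass(source: ContentSource, parser: DocumentParser, db, embedder, detector) -> None:
    """Hypothetical single ingestion pass through the components listed above.

    The content resolver (component 4) is not shown: it is used at query time
    to follow stored pointers back to the original content.
    """
    for raw in source.fetch():                                   # 1. Content Sources
        elements = parser.parse(raw)                             # 2. Document Parsers
        db.store_elements(elements)                              # 3. Document Database (hypothetical method)
        embeddings = embedder.generate_from_elements(elements)   # 5. Embedding Generator
        for element_id, vector in embeddings.items():
            db.store_embedding(element_id, vector)
        relationships = detector.detect(elements)                # 6. Relationship Detector (hypothetical method)
        db.store_relationships(relationships)                    # hypothetical method
```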

## Storage Backends

Doculyzer supports multiple storage backends:
- **File-based storage**: Simple storage using the file system
- **SQLite**: Lightweight, embedded database
- **Neo4j**: Graph database that stores all document elements, relationships, and embeddings
- **PostgreSQL**: Robust relational database for production deployments
- **MongoDB**: Document-oriented database for larger deployments
- **SQLAlchemy**: ORM layer supporting multiple relational databases:
  - MySQL/MariaDB
  - Oracle
  - Microsoft SQL Server
  - And other SQLAlchemy-compatible databases

## Content Monitoring and Updates

Doculyzer includes a robust system for monitoring content sources and handling updates:

### Change Detection

- **Efficient Monitoring**: Tracks content sources for changes using lightweight methods (timestamps, ETags, content hashes)
- **Selective Processing**: Only reprocesses documents that have changed since their last ingestion
- **Hash-Based Comparison**: Uses content hashes to avoid unnecessary processing when content hasn't changed
- **Source-Specific Strategies**: Each content source type implements its own optimal change detection mechanism
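
A minimal sketch of the hash-based comparison described above, assuming a simple record of previously seen content hashes (an in-memory dict is used here purely to keep the example self-contained; doculyzer keeps this state in its storage backend):

```python
import hashlib

# Illustrative only: maps a document pointer (URI) to the hash of its last ingested content.
seen_hashes: dict[str, str] = {}

def has_changed(doc_uri: str, content: bytes) -> bool:
    """Return True if the content differs from the last ingested version."""
    digest = hashlib.sha256(content).hexdigest()
    if seen_hashes.get(doc_uri) == digest:
        return False          # unchanged -> skip reprocessing
    seen_hashes[doc_uri] = digest
    return True               # new or changed -> reprocess

# Only changed documents are handed to the ingestion step.
for uri, content in [("docs/a.md", b"hello"), ("docs/a.md", b"hello")]:
    if has_changed(uri, content):
        print(f"reprocessing {uri}")
    else:
        print(f"skipping unchanged {uri}")
```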

### Update Process

```python
# Schedule regular updates
from doculyzer import ingest_documents
import schedule
import time

def update_documents():
    # This will only process documents that have changed
    stats = ingest_documents(config)
    print(f"Updates: {stats['documents']} documents, {stats['unchanged_documents']} unchanged")

# Run updates every hour
schedule.every(1).hour.do(update_documents)

while True:
    schedule.run_pending()
    time.sleep(60)
```

### Update Status Tracking

- **Processing History**: Maintains a record of when each document was last processed
- **Content Hash Storage**: Stores content hashes to quickly identify changes
- **Update Statistics**: Provides metrics on documents processed, unchanged, and updated
- **Pointer-Based Architecture**: Since Doculyzer stores pointers to original content rather than copies, it efficiently handles updates without versioning complications
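
As a rough illustration of the bookkeeping this implies, here is a hypothetical processing-history table; the column names are illustrative and the real schema depends on the configured storage backend.

```python
import sqlite3

# Illustrative processing-history table (not doculyzer's actual schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE processing_history (
        doc_uri        TEXT PRIMARY KEY,   -- pointer to the original content
        content_hash   TEXT NOT NULL,      -- hash of the content as last ingested
        last_processed TEXT NOT NULL       -- ISO timestamp of the last ingestion
    )
    """
)
conn.execute(
    "INSERT INTO processing_history VALUES (?, ?, ?)",
    ("docs/overview.md", "sha256:0123abcd", "2025-05-07T02:40:00Z"),
)
print(conn.execute("SELECT * FROM processing_history").fetchall())
```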

### Scheduled Crawling

For continuous monitoring of content sources, Doculyzer can be configured to run scheduled crawls:

```python
import argparse
import logging
import time
from doculyzer import crawl

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Doculyzer Crawler")
    parser.add_argument("--config", required=True, help="Path to configuration file")
    parser.add_argument("--interval", type=int, default=3600, help="Crawl interval in seconds")
    args = parser.parse_args()
    
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("Doculyzer Crawler")
    logger.info(f"Crawler initialized with interval {args.interval} seconds")
    
    while True:
        crawl(args.config, args.interval)
        logger.info(f"Sleeping for {args.interval} seconds")
        time.sleep(args.interval)
```

Run the crawler as a background process or service:

```bash
# Run crawler with 1-hour interval
python crawler.py --config config.yaml --interval 3600
```

For production environments, consider using a proper task scheduler like Celery or a cron job to manage the crawl process.
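
For example, scheduling could be handed to cron along these lines. This is a hypothetical crontab entry and assumes `crawler.py` has been adapted to perform a single crawl and exit (i.e. the `while` loop removed), so that cron owns the scheduling:

```bash
# Hypothetical crontab entry: run one crawl at the top of every hour.
0 * * * * cd /opt/doculyzer && python crawler.py --config config.yaml >> /var/log/doculyzer-crawl.log 2>&1
```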

## Getting Started

### Installation

```bash
# Clone the repository
git clone https://github.com/yourusername/doculyzer.git
cd doculyzer

# Install dependencies
pip install -r requirements.txt
```
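
Alternatively, since doculyzer is published on PyPI, the released package can be installed directly with pip:

```bash
# Install the latest released version from PyPI
pip install doculyzer
```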

### Configuration

Create a configuration file `config.yaml`:

```yaml
storage:
  backend: sqlite  # Options: file, sqlite, mongodb, postgresql, sqlalchemy
  path: "./data"
  
  # MongoDB-specific configuration (if using MongoDB)
  mongodb:
    host: localhost
    port: 27017
    db_name: doculyzer
    username: myuser  # optional
    password: mypassword  # optional

embedding:
  enabled: true
  model: "sentence-transformers/all-MiniLM-L6-v2"
  backend: "huggingface"  # Options: huggingface, openai, custom
  chunk_size: 512
  overlap: 128
  contextual: true  # Enable contextual embeddings
  vector_size: 384  # Configurable based on content needs
  
  # Contextual embedding configuration
  predecessor_count: 1
  successor_count: 1
  ancestor_depth: 1
  
  # Content-specific configurations
  content_types:
    technical:
      model: "sentence-transformers/all-mpnet-base-v2"
      vector_size: 768  # Larger vectors for technical content
    general:
      model: "sentence-transformers/all-MiniLM-L6-v2"
      vector_size: 384  # Smaller vectors for general content
  
  # OpenAI-specific configuration (if using OpenAI backend)
  openai:
    api_key: "your_api_key_here"
    model: "text-embedding-ada-002"
    dimensions: 1536  # Configurable embedding dimensions

content_sources:
  - name: "documentation"
    type: "file"
    base_path: "./docs"
    file_pattern: "**/*.md"
    max_link_depth: 2

relationship_detection:
  enabled: true
  link_pattern: '\[\[(.*?)\]\]|href=["''](.*?)["'']'

logging:
  level: "INFO"
  file: "./logs/docpointer.log"
```

### Basic Usage

```python
from doculyzer import Config, ingest_documents
from doculyzer.embeddings import get_embedding_generator

# Load configuration
config = Config("config.yaml")

# Initialize storage
db = config.initialize_database()

# Ingest documents
stats = ingest_documents(config)
print(f"Processed {stats['documents']} documents with {stats['elements']} elements")

# Search documents
results = db.search_elements_by_content("search term")
for element in results:
    print(f"Found in {element['element_id']}: {element['content_preview']}")

# Semantic search (if embeddings are enabled)
embedding_generator = get_embedding_generator(config)
query_embedding = embedding_generator.generate("search query")
results = db.search_by_embedding(query_embedding)
for element_id, score in results:
    element = db.get_element(element_id)
    print(f"Semantic match ({score:.2f}): {element['content_preview']}")
```

## Advanced Features

### Relationship Detection

Doculyzer can detect various types of relationships between document elements:

- **Explicit Links**: Links explicitly defined in the document
- **Structural Relationships**: Parent-child, sibling, and section relationships
- **Semantic Relationships**: Connections based on content similarity
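
As an illustration of the explicit-link case, the sketch below applies the same style of `link_pattern` regex shown in the configuration section to pull wiki-style links and HTML `href` attributes out of an element's text (the element representation is simplified for the example):

```python
import re

# Same style of pattern as the link_pattern option in config.yaml:
# matches [[wiki-style links]] and href="..." attributes.
LINK_PATTERN = re.compile(r"\[\[(.*?)\]\]|href=[\"'](.*?)[\"']")

def explicit_links(element_text: str) -> list[str]:
    """Return link targets explicitly present in an element's content."""
    links = []
    for wiki_target, href_target in LINK_PATTERN.findall(element_text):
        links.append(wiki_target or href_target)
    return links

print(explicit_links('See [[Design Doc]] and <a href="https://example.com/spec">the spec</a>.'))
# -> ['Design Doc', 'https://example.com/spec']
```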

### Embedding Generation

Doculyzer uses advanced contextual embedding techniques to generate vector representations of document elements:

- **Pluggable Embedding Backends**: Choose from different embedding providers or implement your own
  - **HuggingFace Transformers**: Use transformer-based models like BERT, RoBERTa, or Sentence Transformers
  - **OpenAI Embeddings**: Leverage OpenAI's powerful embedding models
  - **Custom Embeddings**: Implement your own embedding generator with the provided interfaces
- **Contextual Embeddings**: Incorporates hierarchical relationships, predecessors, and successors into each element's embedding
- **Element-Level Precision**: Maintains accuracy to specific document elements rather than just document-level matching
- **Content-Optimized Vector Dimensions**: Flexibility to choose vector sizes based on content type
  - Larger vectors for highly technical content requiring more nuanced semantic representation
  - Smaller vectors for general content to optimize storage and query performance
  - Select the embedding provider and model that best suits your specific use case
- **Improved Relevance**: Context-aware embeddings produce more accurate similarity search results
- **Temporal Semantics**: Detects date references and expands them into explicit descriptions of their date and time parts, improving approximate nearest neighbor (ANN) search (sketched below)
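
To illustrate the temporal expansion idea (a conceptual sketch, not doculyzer's actual implementation), a detected date can be rewritten into its explicit parts so that time-related queries have more semantic surface to match against:

```python
from datetime import date

def expand_date(d: date) -> str:
    """Illustrative expansion of a date into its named parts for embedding."""
    return (
        f"{d.isoformat()} (year {d.year}, month {d.strftime('%B')}, "
        f"day {d.day}, weekday {d.strftime('%A')}, quarter Q{(d.month - 1) // 3 + 1})"
    )

print(expand_date(date(2025, 5, 7)))
# -> "2025-05-07 (year 2025, month May, day 7, weekday Wednesday, quarter Q2)"
```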

```python
from doculyzer.embeddings import get_embedding_generator

# Create contextual embedding generator with the configured backend
embedding_generator = get_embedding_generator(config)

# Use a specific embedding backend
from doculyzer.embeddings.factory import create_embedding_generator
from doculyzer.embeddings.hugging_face import HuggingFaceEmbedding

# Create a HuggingFace embedding generator with a specific model and vector size
embedding_generator = create_embedding_generator(
    backend="huggingface",
    model_name="sentence-transformers/all-mpnet-base-v2",
    vector_size=768,  # Larger vector size for technical content
    contextual=True
)

# Or choose a different model with smaller vectors for general content
general_content_embedder = create_embedding_generator(
    backend="huggingface",
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    vector_size=384,  # Smaller vector size for general content
    contextual=True
)

# Generate embeddings for a document
elements = db.get_document_elements(doc_id)
embeddings = embedding_generator.generate_from_elements(elements)

# Store embeddings
for element_id, embedding in embeddings.items():
    db.store_embedding(element_id, embedding)
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the LICENSE file for details.

# Sample Config

```yaml
storage:
  backend: neo4j  # Can be neo4j, sqlite, file, mongodb, postgresql, or sqlalchemy
  path: "./data"  
  
  # Neo4j-specific configuration
  neo4j:
    uri: "bolt://localhost:7687"
    username: "neo4j"
    password: "password"
    database: "doculyzer"
    
  # File-based storage configuration (uncomment to use)
  # backend: file
  # path: "./data"
  # file:
  #   # Options for organizing file storage
  #   subdirectory_structure: "flat"  # Can be "flat" or "hierarchical"
  #   create_backups: true  # Whether to create backups before overwriting files
  #   backup_count: 3  # Number of backups to keep
  #   compression: false  # Whether to compress stored files
  #   index_in_memory: true  # Whether to keep indexes in memory for faster access
  
  # MongoDB-based storage configuration (uncomment to use)
  # backend: mongodb
  # mongodb:
  #   host: "localhost"
  #   port: 27017
  #   username: "admin"  # Optional
  #   password: "password"  # Optional
  #   db_name: "doculyzer"
  #   options:  # Optional connection options
  #     retryWrites: true
  #     w: "majority"
  #     connectTimeoutMS: 5000
  #   vector_search: true  # Whether to use vector search capabilities (requires MongoDB Atlas)
  #   create_vector_index: true  # Whether to create vector search index on startup
  
  # PostgreSQL-based storage configuration (uncomment to use)
  # backend: postgresql
  # postgresql:
  #   host: "localhost"
  #   port: 5432
  #   dbname: "doculyzer"
  #   user: "postgres"
  #   password: "password"
  #   # Optional SSL configuration
  #   sslmode: "prefer"  # Options: disable, prefer, require, verify-ca, verify-full
  #   # Vector search configuration using pgvector
  #   enable_vector: true  # Whether to try to enable pgvector extension
  #   create_vector_index: true  # Whether to create vector indexes automatically
  #   vector_index_type: "ivfflat"  # Options: ivfflat, hnsw
  
  # SQLAlchemy-based storage configuration (uncomment to use)
  # backend: sqlalchemy
  # sqlalchemy:
  #   # Database URI (examples for different database types)
  #   # SQLite:
  #   db_uri: "sqlite:///data/doculyzer.db"
  #   # PostgreSQL:
  #   # db_uri: "postgresql://user:password@localhost:5432/doculyzer"
  #   # MySQL:
  #   # db_uri: "mysql+pymysql://user:password@localhost:3306/doculyzer"
  #   # MS SQL Server:
  #   # db_uri: "mssql+pyodbc://user:password@server/database?driver=ODBC+Driver+17+for+SQL+Server"
  #   
  #   # Additional configuration options
  #   echo: false  # Whether to echo SQL statements for debugging
  #   pool_size: 5  # Connection pool size
  #   max_overflow: 10  # Maximum overflow connections
  #   pool_timeout: 30  # Connection timeout in seconds
  #   pool_recycle: 1800  # Connection recycle time in seconds
  #   
  #   # Vector extensions
  #   vector_extension: "auto"  # Options: auto, pgvector, sqlite_vss, sqlite_vec, none
  #   create_vector_index: true  # Whether to create vector indexes automatically
  #   vector_index_type: "ivfflat"  # For PostgreSQL: ivfflat, hnsw
  
  # SQLite-based storage configuration (uncomment to use)
  # backend: sqlite
  # path: "./data"  # Path where the SQLite database file will be stored
  # sqlite:
  #   # Extensions configuration
  #   sqlite_extensions:
  #     use_sqlean: true  # Whether to use sqlean.py (provides more SQLite extensions)
  #     auto_discover: true  # Whether to automatically discover and load vector extensions
  #   
  #   # Vector search configuration
  #   vector_extensions:
  #     preferred: "auto"  # Options: auto, vec0, vss0, none
  #     create_tables: true  # Whether to create vector tables on startup
  #     populate_existing: true  # Whether to populate vector tables with existing embeddings

embedding:
  enabled: true
  model: "sentence-transformers/all-MiniLM-L6-v2"
  dimensions: 384  # Embedding dimensions, used by database vector search
  
  # OpenAI embedding configuration (uncomment to use)
  # provider: "openai"  # Change from default sentence-transformers to OpenAI
  # model: "text-embedding-3-small"  # OpenAI embedding model name
  # dimensions: 1536  # Dimensions for the model (1536 for text-embedding-3-small, 3072 for text-embedding-3-large)
  # openai:
  #   api_key: "your-openai-api-key"  # Can also be set via OPENAI_API_KEY environment variable
  #   batch_size: 10  # Number of texts to embed in a single API call
  #   retry_count: 3  # Number of retries on API failure
  #   retry_delay: 1  # Delay between retries in seconds
  #   timeout: 60  # Timeout for API calls in seconds
  #   max_tokens: 8191  # Maximum tokens per text (8191 for text-embedding-3-small/large)
  #   cache_enabled: true  # Whether to cache embeddings
  #   cache_size: 1000  # Maximum number of embeddings to cache in memory

content_sources:
  # File content source
  - name: "local-files"
    type: "file"
    base_path: "./documents"
    file_pattern: "**/*"
    include_extensions: ["md", "txt", "pdf", "docx", "html"]
    exclude_extensions: ["tmp", "bak"]
    watch_for_changes: true
    recursive: true
    max_link_depth: 2
  
  # JIRA content source
  - name: "project-tickets"
    type: "jira"
    base_url: "https://your-company.atlassian.net"
    username: "jira_user@example.com"
    api_token: "your-jira-api-token"
    projects: ["PROJ", "FEAT"]
    issue_types: ["Bug", "Story", "Task"]
    statuses: ["In Progress", "To Do", "In Review"] 
    include_closed: false
    max_results: 100
    include_description: true
    include_comments: true
    include_attachments: false
    include_subtasks: true
    include_linked_issues: true
    include_custom_fields: ["customfield_10001", "customfield_10002"]
    max_link_depth: 1
  
  # MongoDB content source
  - name: "document-db"
    type: "mongodb"
    connection_string: "mongodb://localhost:27017/"
    database_name: "your_database"
    collection_name: "documents"
    query: {"status": "active"}
    projection: {"_id": 1, "title": 1, "content": 1, "metadata": 1, "updated_at": 1}
    id_field: "_id"
    content_field: "content"
    timestamp_field: "updated_at"
    limit: 1000
    sort_by: [["updated_at", -1]]
    follow_references: true
    reference_field: "related_docs"
    max_link_depth: 1
    
  # S3 content source
  - name: "cloud-documents"
    type: "s3"
    bucket_name: "your-document-bucket"
    prefix: "documents/"
    region_name: "us-west-2"
    aws_access_key_id: "your-access-key"
    aws_secret_access_key: "your-secret-key"
    assume_role_arn: "arn:aws:iam::123456789012:role/S3AccessRole"  # Optional
    endpoint_url: null  # For S3-compatible storage
    include_extensions: ["md", "txt", "pdf", "docx", "html"]
    exclude_extensions: ["tmp", "bak", "log"]
    include_prefixes: ["documents/important/", "documents/shared/"]  # Optional
    exclude_prefixes: ["documents/archive/", "documents/backup/"]  # Optional
    include_patterns: []  # Optional regex patterns
    exclude_patterns: []  # Optional regex patterns
    recursive: true
    max_depth: 5
    detect_mimetype: true
    temp_dir: "/tmp"
    delete_after_processing: true
    local_link_mode: "relative"  # Can be relative, absolute, or none
    max_link_depth: 2
    
  # ServiceNow content source
  - name: "it-service-management"
    type: "servicenow"
    base_url: "https://your-instance.service-now.com"
    username: "servicenow_user"
    api_token: "your-servicenow-api-token"
    # Alternatively, use password authentication
    # password: "your-password"
    
    # Content type settings
    include_knowledge: true
    include_incidents: true
    include_service_catalog: true
    include_cmdb: true
    
    # Filter settings
    knowledge_query: "workflow_state=published"  # ServiceNow knowledge API query
    incident_query: "active=true^priority<=2"  # ServiceNow table API query for incidents
    service_catalog_query: "active=true^category=hardware"  # Query for service catalog items
    cmdb_query: "sys_class_name=cmdb_ci_server"  # Query for CMDB items
    include_patterns: [".*prod.*", ".*critical.*"]  # Regex patterns to include
    exclude_patterns: [".*test.*", ".*dev.*"]  # Regex patterns to exclude
    limit: 100  # Maximum number of items to retrieve
    max_link_depth: 1  # For following links between ServiceNow items
    
  # Web content source
  - name: "web-content"
    type: "web"
    base_url: "https://www.example.com"  # Optional base URL for relative paths
    url_list:  # List of URLs to fetch
      - "https://www.example.com/docs/overview"
      - "https://www.example.com/docs/tutorials"
      - "https://www.example.com/blog"
    url_list_file: "./urls.txt"  # Optional path to file containing URLs (one per line)
    refresh_interval: 86400  # Refresh interval in seconds (default: 1 day)
    headers:  # Custom headers for requests
      User-Agent: "Mozilla/5.0 (compatible; DoculyzerBot/1.0)"
      Accept-Language: "en-US,en;q=0.9"
    authentication:  # Optional authentication
      type: "basic"  # Can be "basic" or "bearer"
      username: "web_user"
      password: "web_password"
      # For bearer token:
      # type: "bearer"
      # token: "your-access-token"
    include_patterns:  # Regex patterns to include when following links
      - "/docs/.*"
      - "/blog/[0-9]{4}/.*"
    exclude_patterns:  # Regex patterns to exclude when following links
      - ".*\\.pdf$"
      - "/archived/.*"
    max_link_depth: 3  # Maximum depth for following links
    
  # Confluence content source
  - name: "team-knowledge-base"
    type: "confluence"
    base_url: "https://your-company.atlassian.net"
    username: "confluence_user@example.com"
    api_token: "your-confluence-api-token"
    # Alternatively, use password authentication
    # password: "your-password"
    
    # Space configuration
    spaces: ["TEAM", "PROJ", "DOCS"]  # List of space keys to include (empty list fetches all accessible spaces)
    exclude_personal_spaces: true  # Skip personal spaces when fetching all spaces
    
    # Content type settings
    include_pages: true  # Include regular Confluence pages
    include_blogs: true  # Include blog posts
    include_comments: false  # Include comments on pages/blogs
    include_attachments: false  # Include file attachments
    
    # Content filtering
    include_patterns: ["^Project.*", "^Guide.*"]  # Regex patterns to include (matches against title)
    exclude_patterns: ["^Draft.*", "^WIP.*"]  # Regex patterns to exclude
    
    # Advanced settings
    expand_macros: true  # Expand Confluence macros in content
    link_pattern: "/wiki/spaces/[^/]+/pages/(\\d+)"  # Regex pattern to extract page IDs from links
    limit: 500  # Maximum number of items to retrieve
    max_link_depth: 2  # For following links between Confluence pages
  
  # Example of a blob-based database content source
  - name: "database-blobs"
    type: "database"
    connection_string: "postgresql://user:password@localhost:5432/mydatabase"
    query: "SELECT * FROM documents"
    id_column: "doc_id"
    content_column: "content_blob" 
    metadata_columns: ["title", "author", "created_date"]
    timestamp_column: "updated_at"
    
  # Example of a JSON-structured database content source
  - name: "database-json"
    type: "database"
    connection_string: "mysql://user:password@localhost:3306/customer_db"
    query: "SELECT * FROM customers"
    id_column: "customer_id"
    json_mode: true
    json_columns: ["first_name", "last_name", "email", "address", "phone_number", "signup_date"]
    metadata_columns: ["account_status", "customer_type"]
    timestamp_column: "last_modified"
    json_include_metadata: true  # Include metadata columns in the JSON document
    
  # Example of automatic column discovery (all columns except ID will be in JSON)
  - name: "database-json-auto"
    type: "database"
    connection_string: "sqlite:///local_database.db"
    query: "products"  # Simple table name
    id_column: "product_id"
    json_mode: true
    # No json_columns specified - will automatically use all non-ID columns
    metadata_columns: ["category", "supplier"]
    timestamp_column: "updated_at"
    
  # Example with a complex query
  - name: "database-complex-query"
    type: "database"
    connection_string: "mssql+pyodbc://user:password@server/database"
    query: "SELECT o.order_id, c.customer_name, o.order_date, p.product_name, oi.quantity 
            FROM orders o
            JOIN customers c ON o.customer_id = c.customer_id
            JOIN order_items oi ON o.order_id = oi.order_id
            JOIN products p ON oi.product_id = p.product_id"
    id_column: "order_id"
    json_mode: true
    json_columns: ["customer_name", "order_date", "product_name", "quantity"]
    timestamp_column: "order_date"

relationship_detection:
  enabled: true

logging:
  level: "INFO"
  file: "./logs/doculyzer.log"
```

# Here's what I've tested so far
- SQLite storage (with and without vector search plugins)
- Web Content Source
- File Content Source
- Content types: MD, HTML, XLSX, PDF, XML, CSV, DOCX, PPTX


            
