| Name | mosaicx |
| Version | 1.1.1 |
| home_page | None |
| Summary | Medical cOmputational Suite for Advanced Intelligent eXtraction |
| upload_time | 2025-10-21 12:47:22 |
| maintainer | None |
| docs_url | None |
| author | None |
| requires_python | >=3.11 |
| license | Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0) |
| keywords | extraction, llm, medical, nlp, pdf, radiology |
<div align="center">
<img src="assets/mosaicx_logo.png" alt="MOSAICX Logo" width="800"/>
</div>
<p align="center">
<a href="https://pypi.org/project/mosaicx/"><img alt="PyPI" src="https://img.shields.io/pypi/v/mosaicx.svg?label=PyPI&style=flat-square&logo=python&logoColor=white&color=bd93f9"></a>
<a href="https://www.python.org/downloads/"><img alt="Python" src="https://img.shields.io/badge/Python-3.11%2B-50fa7b?style=flat-square&logo=python&logoColor=white"></a>
<a href="https://www.apache.org/licenses/LICENSE-2.0"><img alt="License" src="https://img.shields.io/badge/License-Apache--2.0-ff79c6?style=flat-square&logo=apache&logoColor=white"></a>
<a href="https://pepy.tech/project/mosaicx"><img alt="Downloads" src="https://img.shields.io/pepy/dt/mosaicx?style=flat-square&color=8be9fd&label=Downloads"></a>
<a href="https://pydantic.dev"><img alt="Pydantic v2" src="https://img.shields.io/badge/Pydantic-v2-ffb86c?style=flat-square&logo=pydantic&logoColor=white"></a>
<a href="https://ollama.ai"><img alt="Ollama Compatible" src="https://img.shields.io/badge/Ollama-Compatible-6272a4?style=flat-square&logo=ghost&logoColor=white"></a>
<a href="mailto:lalith@zenta.solutions"><img alt="Commercial License" src="https://img.shields.io/badge/Commercial%20Use-Contact%20Zenta-orange?style=flat-square&logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iOTIiIGhlaWdodD0iOTIiIHZpZXdCb3g9IjAgMCA5MiA5MiIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPGNpcmNsZSBjeD0iNDUuODk2NiIgY3k9IjcuOTEyOTYiIHI9IjYuMjkxNzEiIHRyYW5zZm9ybT0icm90YXRlKDkwIDQ1Ljg5NjYgNy45MTI5NikiIGZpbGw9IiNFNkU0REUiIHN0cm9rZT0iI0U2RTRERSIgc3Ryb2tlLXdpZHRoPSIzLjI0MjUxIi8+CjxjaXJjbGUgY3g9IjgzLjg3ODEiIGN5PSIyNi45MDQyIiByPSI2LjI5MTcxIiB0cmFuc2Zvcm09InJvdGF0ZSg5MCA4My44NzgxIDI2LjkwNDIpIiBmaWxsPSIjRTZFNERFIiBzdHJva2U9IiNFNkU0REUiIHN0cm9rZS13aWR0aD0iMy4yNDI1MSIvPgo8Y2lyY2xlIGN4PSI3LjkxMzIxIiBjeT0iMjYuOTA0MiIgcj0iNi4yOTE3MSIgdHJhbnNmb3JtPSJyb3RhdGUoOTAgNy45MTMyMSAyNi45MDQyKSIgZmlsbD0iI0U2RTRERSIgc3Ryb2tlPSIjRTZFNERFIiBzdHJva2Utd2lkdGg9IjMuMjQyNTEiLz4KPGNpcmNsZSBjeD0iNy45MTMyMSIgY3k9IjQ1Ljg5NjQiIHI9IjYuMjkxNzEiIHRyYW5zZm9ybT0icm90YXRlKDkwIDcuOTEzMjEgNDUuODk2NCkiIGZpbGw9IiNFNkU0REUiIHN0cm9rZT0iI0U2RTRERSIgc3Ryb2tlLXdpZHRoPSIzLjI0MjUxIi8+CjxjaXJjbGUgY3g9IjY0Ljg4NjgiIGN5PSI4My44Nzg4IiByPSI2LjI5MTcxIiB0cmFuc2Zvcm09InJvdGF0ZSg5MCA2NC44ODY4IDgzLjg3ODgpIiBmaWxsPSIjRTZFNERFIiBzdHJva2U9IiNFNkU0REUiIHN0cm9rZS13aWR0aD0iMy4yNDI1MSIvPgo8Y2lyY2xlIGN4PSIyNi45MDQ0IiBjeT0iODMuODc4OCIgcj0iNi4yOTE3MSIgdHJhbnNmb3JtPSJyb3RhdGUoOTAgMjYuOTA0NCA4My44Nzg4KSIgZmlsbD0iI0U2RTRERSIgc3Ryb2tlPSIjRTZFNERFIiBzdHJva2Utd2lkdGg9IjMuMjQyNTEiLz4KPGNpcmNsZSBjeD0iNy45MTMyMSIgY3k9IjY0Ljg4NzYiIHI9IjYuMjkxNzEiIHRyYW5zZm9ybT0icm90YXRlKDkwIDcuOTEzMjEgNjQuODg3NikiIGZpbGw9IiNFNkU0REUiIHN0cm9rZT0iI0U2RTRERSIgc3Ryb2tlLXdpZHRoPSIzLjI0MjUxIi8+CjxyZWN0IHg9IjkwLjE2OTgiIHk9IjM5LjYwNDciIHdpZHRoPSIzMS41NzQ1IiBoZWlnaHQ9IjEyLjU4MzQiIHJ4PSI2LjI5MTcxIiB0cmFuc2Zvcm09InJvdGF0ZSg5MCA5MC4xNjk4IDM5LjYwNDcpIiBmaWxsPSIjRTZFNERFIiBzdHJva2U9IiNFNkU0REUiIHN0cm9rZS13aWR0aD0iMy4yNDI1MSIvPgo8cmVjdCB4PSI3MS4xNzg1IiB5PSIxLjYyMTI2IiB3aWR0aD0iMzEuNTc0NSIgaGVpZ2h0PSIxMi41ODM0IiByeD0iNi4yOTE3MSIgdHJhbnNmb3JtPSJyb3RhdGU
oOTAgNzEuMTc4NSAxLjYyMTI2KSIgZmlsbD0iI0U2RTRERSIgc3Ryb2tlPSIjRTZFNERFIiBzdHJva2Utd2lkdGg9IjMuMjQyNTEiLz4KPHJlY3QgeD0iNTIuMTg4MyIgeT0iNTguNTk1OSIgd2lkdGg9IjMxLjU3NDUiIGhlaWdodD0iMTIuNTgzNCIgcng9IjYuMjkxNzEiIHRyYW5zZm9ybT0icm90YXRlKDkwIDUyLjE4ODMgNTguNTk1OSkiIGZpbGw9IiNFNkU0REUiIHN0cm9rZT0iI0U2RTRERSIgc3Ryb2tlLXdpZHRoPSIzLjI0MjUxIi8+CjxyZWN0IHg9IjMzLjE5NjEiIHk9IjIyLjE5NTUiIHdpZHRoPSIzMS41NzQ1IiBoZWlnaHQ9IjEyLjU4MzQiIHJ4PSI2LjI5MTcxIiB0cmFuc2Zvcm09InJvdGF0ZSg5MCAzMy4xOTYxIDIyLjE5NTUpIiBmaWxsPSIjRTZFNERFIiBzdHJva2U9IiNFNkU0REUiIHN0cm9rZS13aWR0aD0iMy4yNDI1MSIvPgo8cmVjdCB4PSIzMy4xOTYxIiB5PSIxLjYyMTI2IiB3aWR0aD0iMzEuNTc0NSIgaGVpZ2h0PSIxMi41ODM0IiByeD0iNi4yOTE3MSIgdHJhbnNmb3JtPSJyb3RhdGUoOTAgMzMuMTk2MSAxLjYyMTI2KSIgZmlsbD0iI0U2RTRERSIgc3Ryb2tlPSIjRTZFNERFIiBzdHJva2Utd2lkdGg9IjMuMjQyNTEiLz4KPHJlY3QgeD0iMTQuMjA0OSIgeT0iMzkuNjA0NyIgd2lkdGg9IjMxLjU3NDUiIGhlaWdodD0iMTIuNTgzNCIgcng9IjYuMjkxNzEiIHRyYW5zZm9ybT0icm90YXRlKDkwIDE0LjIwNDkgMzkuNjA0NykiIGZpbGw9IiNFNkU0REUiIHN0cm9rZT0iI0U2RTRERSIgc3Ryb2tlLXdpZHRoPSIzLjI0MjUxIi8+Cjwvc3ZnPgo="></a>
</p>
<p align="center">
<strong>Developed by the <a href="https://www.linkedin.com/company/digitx-lmu/">DIGIT-X Lab</a> at LMU Munich University</strong><br>
<em>Quietly ambitious about the hard things</em>
</p>
---
## **Structure first. Insight follows.**
Medical data is inherently complex, unstructured, and heterogeneous. Before we can unlock meaningful patterns, predict outcomes, or enable clinical decision support, we must first impose order on chaos. **MOSAICX** embodies this fundamental principle: **structured data is the prerequisite for knowledge discovery**.
In healthcare, unstructured documents such as radiology reports, clinical notes, and pathology summaries contain critical information locked in narrative text. MOSAICX transforms this chaos into validated, machine-readable structures using AI-driven schema generation and extraction pipelines. Only when data is properly structured can we apply advanced analytics, machine learning, and knowledge graphs to generate actionable insights.
**Core Capabilities:**
- **Schema Generation**: Transform natural language descriptions into validated Pydantic models
- **Document Extraction**: Convert PDFs and clinical documents to structured JSON using generated schemas
- **Clinical Summarization**: Generate timeline-based summaries of radiology reports with standardized outputs
- **CLI & API**: Powerful command-line interface and Python API for production workflows
- **Privacy-First**: Process sensitive medical data locally using Ollama-compatible LLMs
- **Production-Ready**: Robust error handling, validation, and reproducible outputs
- **Demo WebApp**: Interactive web interface for demonstrations and testing
> *Powered by local LLMs via **Ollama**, PDF processing via **Docling**, and strict validation via **Pydantic v2***
## **Demo WebApp**
Interactive web interface for **demonstrations and testing only**. Use CLI/API for production workflows.
### **Quick Demo Start:**
```bash
cd webapp && ./start.sh
```
**Demo Features:**
- **Smart Contract Generator**: Create Pydantic schemas from natural language
- **PDF Extractor**: Drag-and-drop PDF processing
- **Report Summarizer**: Timeline-based clinical analysis
**Access Demo:** http://localhost:3000 | [Full Setup Guide →](webapp/README.md)
**Requirements:**
- **Docker**: Desktop or Engine 20.10+
- **RAM**: 16GB+ (32GB recommended for large models)
- **Storage**: 10GB+ for containers and models
- **GPU**: Optional but recommended for large models
**Architecture Notes:**
- **Option 1**: WebApp containers → Host Ollama (via `host.docker.internal:11434`)
- **Option 2**: WebApp containers → Ollama container (via internal Docker network)
**Features:**
- **Smart Contract Generator**: Create Pydantic models from natural language
- **PDF Extractor**: Drag-and-drop PDF processing with real-time results
- **Report Summarizer**: Timeline-based clinical report analysis
- **Sample Data**: Pre-loaded medical PDFs and schema templates
- **Glass Morphism UI**: Electric cyan theme with professional medical interface
[**→ Full WebApp Documentation**](webapp/README.md)
---
## **Installation & Setup**
### System Requirements
- **Python**: 3.11+ (3.12 recommended)
- **Operating System**: macOS, Linux, Windows (with WSL2)
- **Memory**: 16GB RAM minimum, 32GB recommended
- **Storage**: 10GB free space for models
### Step 1: Install Ollama
```bash
# macOS/Linux (automatic installation)
curl -fsSL https://ollama.com/install.sh | sh
# Windows: Download from https://ollama.com/download/windows
# Start Ollama service
ollama serve
```
### Step 2: Install MOSAICX
```bash
# Using pip (standard)
pip install mosaicx
# Using uv (faster dependency resolution)
uv add mosaicx
# Using pipx (isolated installation)
pipx install mosaicx
# Development installation
git clone https://github.com/LalithShiyam/MOSAICX.git
cd MOSAICX
pip install -e .
```
### Step 3: Download Required Models
```bash
# Default model (recommended for most use cases)
ollama pull gpt-oss:120b
# Alternative models
ollama pull llama3.1:8b-instruct # Smaller, faster
ollama pull qwen2.5:7b-instruct # Good balance
ollama pull deepseek-r1:7b # Reasoning model
# Verify installation
mosaicx --version
```
### Step 4: Quick Test
```bash
# Test connection to Ollama
mosaicx generate --desc "Simple patient record with name and age" --class-name TestModel
```
---
## **Usage Guide**
### Command Overview
MOSAICX provides four main commands with extensive options:
```bash
mosaicx --help # Show all commands
mosaicx generate --help # Schema generation options
mosaicx extract --help # Document extraction options
mosaicx summarize --help # Report summarization options
mosaicx schemas --help # Schema management options
```
### Default Settings
- **Model**: `gpt-oss:120b` (configurable via `--model`)
- **Temperature**:
  - Schema generation: `0.2` (balanced creativity)
  - Data extraction: `0.0` (deterministic)
  - Summarization: `0.2` (slight creativity for readability)
- **Base URL**: `http://localhost:11434/v1` (Ollama default)
- **API Key**: `ollama` (Ollama default)
### 1. Schema Generation from Natural Language
Transform clinical requirements into validated Pydantic models:
```bash
# Basic usage (uses defaults)
mosaicx generate \
  --desc "Echocardiography report with patient demographics, LVEF, valve grades, impression"
```

Generated Pydantic model:

```python
from pydantic import BaseModel, Field
from datetime import datetime
from typing import Literal, Optional


class EchocardiographyReport(BaseModel):
    """Echocardiography report with patient demographics, LVEF, valve grades, impression"""

    patient_id: str = Field(..., description="Unique patient identifier")
    patient_name: str = Field(..., description="Patient full name")
    date_of_birth: datetime = Field(..., description="Patient date of birth")
    exam_date: datetime = Field(..., description="Date of echocardiogram examination")
    lvef_percent: float = Field(..., ge=0, le=100, description="Left ventricular ejection fraction (%)")
    mitral_valve_grade: Literal["Normal", "Mild", "Moderate", "Severe"] = Field(
        ..., description="Mitral valve regurgitation severity"
    )
    aortic_valve_grade: Literal["Normal", "Mild", "Moderate", "Severe"] = Field(
        ..., description="Aortic valve stenosis/regurgitation severity"
    )
    tricuspid_valve_grade: Literal["Normal", "Mild", "Moderate", "Severe"] = Field(
        ..., description="Tricuspid valve regurgitation severity"
    )
    clinical_impression: str = Field(..., min_length=10, description="Cardiologist's clinical impression")
```
**Advanced usage with custom settings:**
```bash
mosaicx generate \
  --desc "Complete blood count with patient ID, test date, hemoglobin, hematocrit, WBC count, differential counts, and reference ranges" \
  --class-name CBCReport \
  --model llama3.1:8b-instruct \
  --temperature 0.1 \
  --schema-path schemas/cbc_report.py
```

Generated Pydantic model:

```python
from pydantic import BaseModel, Field
from datetime import datetime
from typing import Optional


class CBCReport(BaseModel):
    """Complete blood count with patient ID, test date, hemoglobin, hematocrit, WBC count, differential counts, and reference ranges"""

    patient_id: str = Field(..., description="Unique patient identifier")
    test_date: datetime = Field(..., description="Date when CBC test was performed")
    hemoglobin: float = Field(..., ge=0, le=25, description="Hemoglobin level in g/dL")
    hematocrit: float = Field(..., ge=0, le=70, description="Hematocrit percentage")
    wbc_count: float = Field(..., ge=0, description="White blood cell count (thousands/μL)")
    neutrophils_percent: float = Field(..., ge=0, le=100, description="Neutrophils percentage")
    lymphocytes_percent: float = Field(..., ge=0, le=100, description="Lymphocytes percentage")
    monocytes_percent: float = Field(..., ge=0, le=100, description="Monocytes percentage")
    eosinophils_percent: float = Field(..., ge=0, le=100, description="Eosinophils percentage")
    basophils_percent: float = Field(..., ge=0, le=100, description="Basophils percentage")
    hemoglobin_ref_range: str = Field(..., description="Reference range for hemoglobin")
    hematocrit_ref_range: str = Field(..., description="Reference range for hematocrit")
    wbc_ref_range: str = Field(..., description="Reference range for WBC count")
```
**Available Options:**
- `--desc` (required): Natural language description
- `--class-name`: Pydantic class name (default: "GeneratedModel")
- `--model`: LLM model to use (default: "gpt-oss:120b")
- `--temperature`: Sampling temperature 0.0-2.0 (default: 0.2)
- `--schema-path`: Write the generated schema to this file
- `--base-url`: Custom API endpoint
- `--api-key`: Custom API key
- `--debug`: Enable verbose logging
### 2. Document Extraction to Structured Data
Extract structured information from clinical documents:
```bash
# Basic extraction
mosaicx extract \
--document patient_reports/echo_001.pdf \
--schema EchoReport
# Advanced extraction with custom model
mosaicx extract \
--document "case studies/complex_cardiology_report.pdf" \
--schema EchoReport_20250925_143022 \
--model qwen2.5:7b-instruct \
--save results/structured_data.json
```
Supported formats include PDF, DOC/DOCX, PPT/PPTX, TXT/MD, and RTF; mix them freely in a single run.
Behind the scenes MOSAICX layers extraction: native Docling text, then forced OCR, and finally Gemma3:27b via Ollama for vision-language transcription when required.
Example CLI output (abridged; actual Rich formatting includes colors and panels):
```
Extraction results based on schema: EchoReport
Field                  Extracted Value
patient_id             ECG-001-2025
exam_date              2025-09-15T00:00:00
lvef_percent           55.0
mitral_valve_grade     Mild
aortic_valve_grade     Normal
tricuspid_valve_grade  Normal
clinical_impression    Normal left ventricular systolic function...
Extraction saved
JSON: results/structured_data.json
```
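The layered extraction described above (native text first, then forced OCR, then vision-language transcription) can be pictured as a fallback chain. The sketch below is not MOSAICX's implementation: the function names, the stand-in stages, and the `min_chars` threshold are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Optional, Tuple

# A stage maps a document path to text, or to None/"" when nothing usable was found.
Stage = Tuple[str, Callable[[str], Optional[str]]]


def extract_text_layered(path: str, stages: Iterable[Stage], min_chars: int = 20) -> Tuple[str, str]:
    """Return (stage_name, text) from the first stage that yields enough text."""
    for name, stage in stages:
        try:
            text = stage(path)
        except Exception:
            # Treat a failing stage like an empty result and fall through to the next one.
            text = None
        if text and len(text.strip()) >= min_chars:
            return name, text
    raise ValueError(f"no stage produced usable text for {path}")


# Stand-in stages for illustration only (hypothetical, not real Docling/OCR/VLM calls).
def native_text(path: str) -> Optional[str]:
    return ""  # e.g. a scanned PDF with no text layer


def forced_ocr(path: str) -> Optional[str]:
    return "LVEF 55%. Mild mitral regurgitation. No pericardial effusion."


def vlm_transcription(path: str) -> Optional[str]:
    return "transcribed page text from a vision-language model"


STAGES = [("native", native_text), ("ocr", forced_ocr), ("vlm", vlm_transcription)]
print(extract_text_layered("echo_001.pdf", STAGES)[0])  # prints: ocr
```

Ordering the stages from cheapest to most expensive means the vision-language model is only consulted when the earlier layers return too little text.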
### 3. Clinical Report Summarization
Generate timeline-based summaries from radiology reports:
```bash
# Single patient, multiple reports
mosaicx summarize \
--report patient_001/ct_baseline.pdf \
--report patient_001/ct_3month.pdf \
--report patient_001/ct_6month.pdf \
--patient P001 \
--json-out summaries/P001_longitudinal.json
# Process entire directory
mosaicx summarize \
--dir ./radiology_reports/patient_P001/ \
--patient P001 \
--model llama3.1:8b-instruct \
--temperature 0.1 \
--json-out P001_summary.json
```
Supported formats include PDF, DOC/DOCX, PPT/PPTX, TXT/MD, and RTF; mix them freely in a single run.
Example CLI output (abridged; actual Rich formatting includes colors and panels):
```
Patient: P001
DOB: - Sex: - Updated: 2025-09-25T14:30:22Z
Timeline
Date Source Critical Note
2025-08-01 CT Chest/Abdomen/Pelvis Baseline study: Multiple pulmonary nodules...
2025-09-15 CT Chest Follow-up Interval growth: RUL nodule now 12mm...
Overall Summary
Progressive pulmonary nodular disease with interval growth of the RUL lesion and new LLL nodule. [Source: CT Chest Follow-up]
```
---
## **Using MOSAICX as a Python Library**
The CLI features are also exposed as pure Python helpers so you can script or integrate them into other services.
```python
from pathlib import Path

from mosaicx import (
    extract_pdf,
    generate_schema,
    summarize_reports,
)

# 1) Generate a Pydantic schema from a plain-language description
schema = generate_schema(
    "Patient vitals with name, heart rate, systolic_bp, diastolic_bp",
    class_name="PatientVitals",
    model="gpt-oss:120b",
)
schema_path = schema.write(Path("schemas/patient_vitals.py"))

# 2) Extract structured data from a PDF using that schema
extraction = extract_pdf(
    pdf_path="tests/datasets/sample_patient_vitals.pdf",
    schema_path=schema_path,
)
payload = extraction.to_dict()

# 3) Summarize one or more clinical reports
summary = summarize_reports(
    paths=["tests/datasets/sample_patient_vitals.pdf"],
    patient_id="demo-patient",
)
```
Example Python output (illustrative values):
```python
payload
{
    "patient_name": "John Doe",
    "heart_rate": 72,
    "systolic_bp": 118,
    "diastolic_bp": 76,
}
summary.overall
'Stable vital signs with normal heart rate and blood pressure. [Source: sample_patient_vitals.pdf]'
summary.timeline[0].model_dump()
{
    "date": None,
    "source": "sample_patient_vitals.pdf",
    "note": "Vitals within normal limits; no acute concerns.",
}
```
All helpers accept optional `model`, `base_url`, and `api_key` arguments; when omitted the defaults mirror the CLI (environment variables first, then local Ollama).
---
## **Why Structure Matters in Medical AI**
At the DIGIT-X Lab, we believe that **structure precedes insight**. The proliferation of unstructured medical data (radiology reports, clinical notes, pathology summaries) represents both an opportunity and a challenge. While this data contains rich clinical knowledge, its unstructured nature makes it largely inaccessible to computational analysis.
Modern healthcare generates exabytes of unstructured text annually, yet most clinical decision support systems can only leverage structured fields from electronic health records. This fundamental disconnect limits our ability to develop robust clinical AI, conduct large-scale outcomes research, or enable personalized medicine approaches.
**MOSAICX addresses this gap by:**
- **Democratizing Data Structuring**: Transforming natural language descriptions into production-ready data schemas without requiring deep technical expertise
- **Enabling Reproducible Extraction**: Converting documents to validated JSON structures that can be reliably processed by downstream ML pipelines
- **Preserving Clinical Context**: Maintaining semantic meaning while imposing computational structure through intelligent schema design
- **Supporting Privacy Requirements**: Processing sensitive medical data locally without external API dependencies
The structured data produced by MOSAICX becomes the foundation for knowledge graphs, longitudinal analysis, cohort studies, and clinical prediction models. **Structure first. Insight follows.**
---
## **Advanced Features**
### Schema Registry Management
The schema registry tracks all generated Pydantic models for easy reuse:
```bash
# List all generated schemas with details
mosaicx schemas
# Filter by clinical domain or keywords
mosaicx schemas --description "cardiology"
mosaicx schemas --class-name "Echo"
# Clean up orphaned registry entries (files deleted outside MOSAICX)
mosaicx schemas --cleanup
# Scan and register existing schema files not tracked by registry
mosaicx schemas --scan
```
**Available Schema Registry:**
- **EchoReport_20250925_143022**: Echocardiography report with LVEF and valve assessments
- **CBCReport_20250925_101530**: Complete blood count with differential and references
- **PathologyReport_20250924_152045**: Surgical pathology with tumor staging and margins
**Tip**: Use schema ID, filename, or file path in extract commands
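The registry IDs listed above appear to follow a `ClassName_YYYYMMDD_HHMMSS` pattern. Assuming that pattern holds (it is inferred from the examples, not from documented behavior), a small hypothetical helper can split an ID into its class name and creation time, for instance to pick the newest schema for a class:

```python
import re
from datetime import datetime
from typing import Tuple

# Assumed ID layout: <ClassName>_<YYYYMMDD>_<HHMMSS>, inferred from the examples above.
_SCHEMA_ID = re.compile(r"^(?P<name>[A-Za-z_][A-Za-z0-9_]*?)_(?P<stamp>\d{8}_\d{6})$")


def parse_schema_id(schema_id: str) -> Tuple[str, datetime]:
    """Split a registry-style schema ID into (class_name, created_at)."""
    match = _SCHEMA_ID.match(schema_id)
    if match is None:
        raise ValueError(f"not a timestamped schema ID: {schema_id!r}")
    created_at = datetime.strptime(match.group("stamp"), "%Y%m%d_%H%M%S")
    return match.group("name"), created_at


ids = ["CBCReport_20250925_101530", "EchoReport_20250925_143022"]
newest = max(ids, key=lambda schema_id: parse_schema_id(schema_id)[1])
print(newest)  # prints: EchoReport_20250925_143022
```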
### Batch Processing
Process multiple documents or directories efficiently:
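```bash
# Batch summarization for multiple patients
mkdir -p summaries
for patient_dir in ./patients/*/; do
  patient_id=$(basename "$patient_dir")
  mosaicx summarize \
    --dir "$patient_dir" \
    --patient "$patient_id" \
    --json-out "summaries/${patient_id}_summary.json"
done
# Batch extraction using same schema
# (outputs are named after each PDF's base name; with find -exec, "{}" would
# expand to the full path, e.g. ./reports/a.pdf, and yield a nested output path)
mkdir -p structured_data
find ./reports -name "*.pdf" -print0 | while IFS= read -r -d '' pdf; do
  mosaicx extract \
    --document "$pdf" \
    --schema UniversalLabReport \
    --save "structured_data/$(basename "$pdf" .pdf).json"
done
```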
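The same batch extraction can be driven from Python when you want to collect failures programmatically. This is a minimal sketch that shells out to the CLI using only the flags shown above; the helper names are hypothetical, and it assumes `mosaicx` exits with a non-zero status when an extraction fails.

```python
import subprocess
from pathlib import Path
from typing import List


def build_extract_command(pdf: Path, schema: str, out_dir: Path) -> List[str]:
    """Build one `mosaicx extract` call; the output file is named after the PDF stem."""
    out_file = out_dir / f"{pdf.stem}.json"
    return [
        "mosaicx", "extract",
        "--document", str(pdf),
        "--schema", schema,
        "--save", str(out_file),
    ]


def run_batch(report_dir: Path, schema: str, out_dir: Path) -> List[Path]:
    """Run extraction for every PDF under report_dir and return the PDFs that failed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(report_dir.rglob("*.pdf")):
        # Assumption: a non-zero exit code signals a failed extraction.
        result = subprocess.run(build_extract_command(pdf, schema, out_dir))
        if result.returncode != 0:
            failed.append(pdf)
    return failed
```

Passing arguments as a list avoids shell quoting problems with paths that contain spaces. Note that two PDFs with the same file name in different subdirectories would write to the same output file.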
### Custom Model Endpoints
Use alternative LLM providers or local deployments:
```bash
# OpenAI API
mosaicx generate \
--desc "Pathology report with tumor staging" \
--base-url https://api.openai.com/v1 \
--api-key sk-your-openai-key \
--model gpt-4-turbo
# Local LM Studio
mosaicx extract \
--document report.pdf \
--schema PathologyReport \
--base-url http://localhost:1234/v1 \
--api-key lm-studio \
--model local-medical-llm
# Custom medical LLM deployment
mosaicx summarize \
--dir ./radiology_reports/ \
--base-url https://your-medical-llm.hospital.com/v1 \
--api-key your-internal-key \
--model hospital-radiology-model
```
### Environment Variables
Set default values to avoid repetitive command-line options:
```bash
# Set default model and endpoint
export OPENAI_BASE_URL="http://localhost:11434/v1"
export OPENAI_API_KEY="ollama"
export MOSAICX_DEFAULT_MODEL="gpt-oss:120b"
# Now use simplified commands
mosaicx generate --desc "Simple patient record"
mosaicx extract --document report.pdf --schema PatientRecord
```
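The precedence this implies (explicit option first, then environment variable, then the local Ollama default) can be sketched in a few lines. The helper and dataclass below are hypothetical and only illustrate the lookup order; the variable names and default values are the ones listed in this README.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Fallback values taken from the "Default Settings" list.
DEFAULT_MODEL = "gpt-oss:120b"
DEFAULT_BASE_URL = "http://localhost:11434/v1"
DEFAULT_API_KEY = "ollama"


@dataclass(frozen=True)
class LLMSettings:
    model: str
    base_url: str
    api_key: str


def resolve_settings(
    model: Optional[str] = None,
    base_url: Optional[str] = None,
    api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> LLMSettings:
    """Hypothetical helper: explicit argument > environment variable > Ollama default."""
    env = os.environ if env is None else env
    return LLMSettings(
        model=model or env.get("MOSAICX_DEFAULT_MODEL") or DEFAULT_MODEL,
        base_url=base_url or env.get("OPENAI_BASE_URL") or DEFAULT_BASE_URL,
        api_key=api_key or env.get("OPENAI_API_KEY") or DEFAULT_API_KEY,
    )


# An explicit model wins, the endpoint comes from the environment, the key falls back.
print(resolve_settings(model="qwen2.5:7b-instruct", env={"OPENAI_BASE_URL": "http://localhost:1234/v1"}))
```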
---
## **Best Practices & Model Selection**
### Recommended Models by Use Case
| Model | Size | Use Case | Memory | Speed | Accuracy |
|-------|------|----------|---------|-------|----------|
| `gpt-oss:120b` | ~120B | Complex schemas, high accuracy | 64GB+ | Slow | ★★★★★ |
| `llama3.1:8b-instruct` | ~8B | Balanced performance | 16GB+ | Fast | ★★★★☆ |
| `qwen2.5:7b-instruct` | ~7B | Batch processing | 12GB+ | Fastest | ★★★☆☆ |
| `deepseek-r1:7b` | ~7B | Reasoning tasks | 16GB+ | Medium | ★★★★☆ |
**Default Model**: `gpt-oss:120b` provides the best accuracy for medical schema generation and extraction tasks.
### Schema Design Guidelines
**Good Schema Design:**
```python
from datetime import datetime
from typing import Literal

from pydantic import BaseModel, Field


# Descriptive field names with medical terminology
class EchocardiographyReport(BaseModel):
    patient_id: str = Field(..., description="Unique patient identifier")
    exam_date: datetime = Field(..., description="Date of echocardiogram")
    lvef_percent: float = Field(..., ge=0, le=100, description="Left ventricular ejection fraction (%)")
    mitral_valve_grade: Literal["Normal", "Mild", "Moderate", "Severe"] = Field(
        ..., description="Mitral valve regurgitation severity"
    )
    clinical_impression: str = Field(..., min_length=10, description="Cardiologist's interpretation")
```
**❌ Poor Schema Design:**
```python
# Vague field names, no validation, no descriptions
from pydantic import BaseModel


class Report(BaseModel):
    data: str
    values: list
    result: float
```
### Extraction Optimization
**Document Preparation:**
- Ensure PDFs have searchable text layers (not just scanned images)
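- Use OCR preprocessing for scanned documents. Tesseract on its own does not read PDF input, so use a wrapper such as OCRmyPDF to add a text layer: `ocrmypdf input.pdf output.pdf`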
- Remove password protection from PDFs before processing
**Parameter Tuning:**
- **Temperature 0.0**: Most repeatable output; use for extraction, where consistency matters more than variety
- **Temperature 0.1-0.2**: Slight variation, useful for schema generation
- **Larger models**: Use for complex medical terminology and relationships
**Validation Best Practices:**
- Always review extracted data for clinical accuracy
- Implement post-processing validation against medical standards
- Use enum constraints for standardized medical values
- Set appropriate ranges for numeric clinical measurements
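Schema constraints catch many problems at extraction time, but a second, independent check on the saved JSON is cheap insurance. The sketch below shows one way to post-validate an extracted record using only the standard library; the field names follow the echocardiography example above, while `check_echo_record` and its rules are illustrative assumptions, not clinical guidance or a MOSAICX API.

```python
ALLOWED_GRADES = {"Normal", "Mild", "Moderate", "Severe"}


def check_echo_record(record: dict) -> list[str]:
    """Return human-readable problems found in one extracted record (empty if none)."""
    problems = []

    # Numeric range check; bool is excluded because it is a subclass of int.
    lvef = record.get("lvef_percent")
    if not isinstance(lvef, (int, float)) or isinstance(lvef, bool) or not 0 <= lvef <= 100:
        problems.append(f"lvef_percent out of range: {lvef!r}")

    # Enum check against the standardized grade vocabulary (case-sensitive).
    grade = record.get("mitral_valve_grade")
    if grade not in ALLOWED_GRADES:
        problems.append(f"mitral_valve_grade not a known grade: {grade!r}")

    # Free-text fields should at least be present and non-trivial.
    if len(str(record.get("clinical_impression") or "").strip()) < 10:
        problems.append("clinical_impression missing or too short")

    return problems
```

Records that return a non-empty list can be routed to manual review instead of being dropped silently.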
### Production Deployment
**Performance Optimization:**
```bash
# Use quantized models for faster inference
ollama pull llama3.1:8b-instruct-q4_0 # 4-bit quantization
# Process in batches to maximize GPU utilization
# Use parallel processing for independent documents
```
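For independent documents, a small thread pool is usually enough, since each worker mostly waits on the model server. The sketch below is a generic pattern rather than a MOSAICX feature: `process_all` is a hypothetical helper, and `worker` stands for whatever callable performs one extraction (for example a wrapper around the CLI). A single local Ollama instance may still serialize requests, so more workers do not necessarily mean more throughput.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_all(documents, worker, max_workers=4):
    """Run worker(doc) for each document in parallel; return (results, errors) dicts."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its document so failures can be attributed.
        futures = {pool.submit(worker, doc): doc for doc in documents}
        for future in as_completed(futures):
            doc = futures[future]
            try:
                results[doc] = future.result()
            except Exception as exc:  # one bad document should not stop the batch
                errors[doc] = exc
    return results, errors
```

Start with a low `max_workers` value and raise it only while GPU utilization and total throughput keep improving.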
**Error Handling:**
```bash
# Enable debug mode for troubleshooting
mosaicx extract --document document.pdf --schema MySchema --debug
# Implement retry logic for production systems
# Validate outputs against clinical standards
# Log failed extractions for manual review
```
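The retry advice in the comments above can be sketched as a small wrapper with exponential backoff. This is a generic pattern under stated assumptions, not MOSAICX's own error handling: `with_retries` is a hypothetical helper, `task` is any zero-argument callable that performs one extraction, and the injectable `sleep` parameter exists only to make the function testable.

```python
import time


def with_retries(task, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call task() until it succeeds or attempts run out, doubling the delay each time."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```

In production you would typically narrow the `except` clause to transient failures (timeouts, connection errors) and log each failed attempt so that persistent failures reach manual review.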
---
## 🏥 **Clinical Applications**
### Research & Analytics
- **Cohort Studies**: Structure clinical notes for population-level analysis
- **Outcomes Research**: Extract standardized endpoints from heterogeneous reports
- **Quality Metrics**: Automate clinical quality measure extraction
- **Biomarker Discovery**: Structure pathology and lab reports for analysis
### Clinical Decision Support
- **Risk Stratification**: Extract risk factors into computable formats
- **Care Pathway Optimization**: Structure clinical workflows and outcomes
- **Longitudinal Tracking**: Generate patient timelines from multiple reports
- **Adverse Event Detection**: Structure safety data from clinical narratives
### Operational Excellence
- **Revenue Cycle**: Extract billable procedures and diagnoses
- **Compliance Reporting**: Structure regulatory reporting requirements
- **Care Coordination**: Generate structured handoff summaries
- **Quality Assurance**: Standardize report review workflows
---
## ⚡ **Performance & Scalability**
### Local Processing Benefits
- **Privacy Compliance**: No PHI transmitted to external services
- **Cost Efficiency**: Eliminate per-token API costs for large-scale processing
- **Latency Optimization**: Sub-second processing for typical clinical documents
- **Offline Capability**: Process data in air-gapped environments
### Hardware Recommendations
- **Minimum**: 16GB RAM, modern CPU (M1/M2 Mac, Intel i7/AMD Ryzen 7)
- **Recommended**: 32GB RAM, GPU acceleration (RTX 4080/4090, M2 Max/Ultra)
- **High-throughput**: 64GB+ RAM, multiple GPUs for batch processing
---
## 🔍 **Troubleshooting Guide**
### Installation Issues
**Python Version Compatibility:**
```bash
# Check Python version (requires 3.11+)
python --version
# Install specific Python version if needed
pyenv install 3.12.0
pyenv global 3.12.0
```
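The same check can be made from inside Python, which is convenient at the top of a setup script. The sketch below compares the running interpreter against the package's stated minimum of 3.11; `meets_minimum` is a hypothetical helper written for this example.

```python
import sys

MINIMUM = (3, 11)  # from the package's requires_python (>=3.11)


def meets_minimum(version=None, minimum=MINIMUM) -> bool:
    """True if the (major, minor, ...) version tuple satisfies the minimum."""
    version = sys.version_info if version is None else version
    # Compare only major and minor; patch level does not matter here.
    return tuple(version[:2]) >= minimum
```

Calling `meets_minimum()` with no arguments checks the interpreter that is currently running.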
**Dependency Conflicts:**
```bash
# Use virtual environment to isolate dependencies
python -m venv mosaicx-env
source mosaicx-env/bin/activate # macOS/Linux
# mosaicx-env\Scripts\activate # Windows
pip install mosaicx
```
### Runtime Issues
| Issue | Cause | Solution |
|-------|-------|----------|
| `Connection refused` | Ollama not running | `ollama serve` |
| `Model not found` | Model not downloaded | `ollama pull model-name` |
| `Empty extraction` | Poor model/temperature | Try `gpt-oss:120b` with `--temperature 0.0` |
| `PDF processing error` | Scanned PDF without text | Add a text layer with OCR, e.g. `ocrmypdf input.pdf output.pdf` |
| `Memory error` | Model too large | Use quantized model: `llama3.1:8b-instruct-q4_0` |
| `JSON validation error` | Malformed output | Enable `--debug` and check model output |
| `Schema not found` | Registry out of sync | Run `mosaicx schemas --scan` |
### Debug Mode
Enable verbose logging to diagnose issues:
```bash
# Enable debug for all commands
mosaicx --debug generate --desc "Test schema"
mosaicx extract --document document.pdf --schema MySchema --debug
mosaicx summarize --dir ./reports --debug
# Check Ollama status
ollama list # Show downloaded models
ollama ps # Show running models
curl http://localhost:11434/api/tags # API health check
```
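The `curl` health check can also be scripted, for example as a pre-flight step before a long batch run. The sketch below queries Ollama's `/api/tags` endpoint with the standard library and returns the installed model names, or `None` when the server cannot be reached. The function names are illustrative, and the response is assumed to have the shape `{"models": [{"name": ...}, ...]}`.

```python
import json
import urllib.error
import urllib.request


def parse_model_names(payload: dict) -> list[str]:
    """Pull model names out of an /api/tags response body."""
    return [m["name"] for m in payload.get("models", []) if "name" in m]


def list_ollama_models(host="http://localhost:11434", timeout=2.0):
    """Return installed model names, or None if the server cannot be reached."""
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=timeout) as resp:
            return parse_model_names(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON body all count as "not healthy".
        return None
```

A `None` result corresponds to the `Connection refused` row in the table above, while a list that lacks the expected model corresponds to `Model not found`.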
### Common Error Messages
**"Schema class 'MySchema' not found"**
```bash
# Check available schemas
mosaicx schemas
# Regenerate if missing
mosaicx generate --desc "Your schema description" --class-name MySchema
```
**"No text extracted from PDF"**
```bash
# Test PDF text extraction
python -c "
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert('your_file.pdf')
print(result.document.export_to_markdown())
"
```
**"Temperature must be between 0.0 and 2.0"**
```bash
# Fix temperature value
mosaicx generate --desc "Test" --temperature 0.2 # Valid range: 0.0-2.0
```
### Performance Issues
**Slow Processing:**
- Use smaller models: `llama3.1:8b-instruct` instead of `gpt-oss:120b`
- Increase available RAM or use quantized models (`q4_0` suffix)
- Process documents in smaller batches
**High Memory Usage:**
- Close other applications
- Use quantized models
- Process one document at a time
**Inaccurate Results:**
- Use larger, more capable models
- Lower temperature for more deterministic output
- Improve schema descriptions with more specific field definitions
- Review and refine extracted data manually
### Getting Help
**Log Analysis:**
```bash
# Enable maximum verbosity
export MOSAICX_LOG_LEVEL=DEBUG
mosaicx extract --document document.pdf --schema MySchema --debug > debug.log 2>&1
```
**System Information:**
```bash
# Gather system info for bug reports
mosaicx --version
python --version
ollama --version
pip show mosaicx
```
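If you collect diagnostics often, the same information can be gathered into one structure and pasted into a bug report. The sketch below shells out to each tool's `--version` flag as in the commands above and tolerates tools that are not installed; `tool_version` and `collect_report` are hypothetical helpers written for this example.

```python
import platform
import shutil
import subprocess


def tool_version(command: str, flag: str = "--version") -> str:
    """Return the first line a tool prints for its version flag, or 'not found'."""
    if shutil.which(command) is None:
        return "not found"
    proc = subprocess.run([command, flag], capture_output=True, text=True, timeout=30)
    # Some tools print their version to stderr, so fall back to it.
    lines = (proc.stdout or proc.stderr).strip().splitlines()
    return lines[0] if lines else "unknown"


def collect_report() -> dict:
    """Gather interpreter, platform, and tool versions for a bug report."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "mosaicx": tool_version("mosaicx"),
        "ollama": tool_version("ollama"),
    }
```

Printing `collect_report()` with `json.dumps(..., indent=2)` gives a block that is easy to attach to an issue.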
For additional support:
- **GitHub Issues**: [Report bugs and feature requests](https://github.com/LalithShiyam/MOSAICX/issues)
- **Research Inquiries**: lalith.shiyam@med.uni-muenchen.de
- **Commercial Support**: lalith@zenta.solutions
---
## **From DIGIT-X Lab**
**MOSAICX** is developed by the [DIGIT-X Lab](https://www.linkedin.com/company/digitx-lmu/) at LMU Munich University, a research group focused on digital transformation in radiology and medical imaging. Our mission is to bridge the gap between clinical practice and computational methods through practical, privacy-preserving tools.
**Research Focus Areas:**
- Medical Image Analysis & AI
- Clinical Natural Language Processing
- Healthcare Data Standardization
- Privacy-Preserving Medical AI
- Radiomics & Quantitative Imaging
**Team**: Led by researchers and clinicians who understand both the technical challenges and clinical requirements of modern healthcare data processing.
*We are quietly ambitious about the hard things.*
---
## **License & Citation**
MOSAICX is released under the Apache License 2.0, as stated in the package metadata and the license badge. For commercial applications in healthcare organizations, please contact us for licensing options.
### Citation
```bibtex
@software{mosaicx2025,
  title={MOSAICX: Medical cOmputational Suite for Advanced Intelligent eXtraction},
  author={Sundar, Lalith Kumar Shiyam and {DIGIT-X Lab}},
  year={2025},
  institution={LMU Munich University},
  url={https://github.com/LalithShiyam/MOSAICX},
  note={Developed at DIGIT-X Lab, Department of Radiology}
}
```
---
## 🤝 **Contributing & Support**
We welcome contributions from the medical informatics and clinical AI communities:
- **Bug Reports**: Submit issues with minimal reproducible examples
- **Feature Requests**: Propose new clinical use cases and requirements
- **Documentation**: Improve clinical examples and best practices
- **Code Contributions**: Follow our development guidelines and testing requirements
**Contact**:
- Research Inquiries: [lalith.shiyam@med.uni-muenchen.de](mailto:lalith.shiyam@med.uni-muenchen.de)
- Commercial Licensing: [lalith@zenta.solutions](mailto:lalith@zenta.solutions)
- DIGIT-X Lab: [https://www.linkedin.com/company/digitx-lmu/](https://www.linkedin.com/company/digitx-lmu/)
---
*Built with ❤️ for the medical community by researchers who understand that great clinical AI starts with great data structure.*
**MOSAICX is infrastructure for clinical data**: schema-driven, validated, local, and reproducible. Structure reports once, then reuse the same schemas and summarizers across departments and over time, enabling longitudinal analysis, cross-modal integration, and downstream intelligence without sending data to the cloud.
Raw data
{
"_id": null,
"home_page": null,
"name": "mosaicx",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.11",
"maintainer_email": null,
"keywords": "extraction, llm, medical, nlp, pdf, radiology",
"author": null,
"author_email": "Lalith Kumar Shiyam Sundar <lalith.shiyam@med.uni-muenchen.de>",
"download_url": "https://files.pythonhosted.org/packages/40/10/1375b16df4ae39652958976021b9cb5dcaec8d9b3661f49227e92c8e92ba/mosaicx-1.1.1.tar.gz",
"platform": null,
"description": "<div align=\"center\">\n <img src=\"assets/mosaicx_logo.png\" alt=\"MOSAICX Logo\" width=\"800\"/>\n</div>\n\n<p align=\"center\">\n <a href=\"https://pypi.org/project/mosaicx/\"><img alt=\"PyPI\" src=\"https://img.shields.io/pypi/v/mosaicx.svg?label=PyPI&style=flat-square&logo=python&logoColor=white&color=bd93f9\"></a>\n <a href=\"https://www.python.org/downloads/\"><img alt=\"Python\" src=\"https://img.shields.io/badge/Python-3.11%2B-50fa7b?style=flat-square&logo=python&logoColor=white\"></a>\n <a href=\"https://www.apache.org/licenses/LICENSE-2.0\"><img alt=\"License\" src=\"https://img.shields.io/badge/License-Apache--2.0-ff79c6?style=flat-square&logo=apache&logoColor=white\"></a>\n <a href=\"https://pepy.tech/project/mosaicx\"><img alt=\"Downloads\" src=\"https://img.shields.io/pepy/dt/mosaicx?style=flat-square&color=8be9fd&label=Downloads\"></a>\n <a href=\"https://pydantic.dev\"><img alt=\"Pydantic v2\" src=\"https://img.shields.io/badge/Pydantic-v2-ffb86c?style=flat-square&logo=pydantic&logoColor=white\"></a>\n <a href=\"https://ollama.ai\"><img alt=\"Ollama Compatible\" src=\"https://img.shields.io/badge/Ollama-Compatible-6272a4?style=flat-square&logo=ghost&logoColor=white\"></a>\n <a href=\"mailto:lalith@zenta.solutions\"><img alt=\"Commercial License\" 
src=\"https://img.shields.io/badge/Commercial%20Use-Contact%20Zenta-orange?style=flat-square&logo=data:image/svg+xml;base64,PHN2ZyB3aWR0aD0iOTIiIGhlaWdodD0iOTIiIHZpZXdCb3g9IjAgMCA5MiA5MiIgZmlsbD0ibm9uZSIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KPGNpcmNsZSBjeD0iNDUuODk2NiIgY3k9IjcuOTEyOTYiIHI9IjYuMjkxNzEiIHRyYW5zZm9ybT0icm90YXRlKDkwIDQ1Ljg5NjYgNy45MTI5NikiIGZpbGw9IiNFNkU0REUiIHN0cm9rZT0iI0U2RTRERSIgc3Ryb2tlLXdpZHRoPSIzLjI0MjUxIi8+CjxjaXJjbGUgY3g9IjgzLjg3ODEiIGN5PSIyNi45MDQyIiByPSI2LjI5MTcxIiB0cmFuc2Zvcm09InJvdGF0ZSg5MCA4My44NzgxIDI2LjkwNDIpIiBmaWxsPSIjRTZFNERFIiBzdHJva2U9IiNFNkU0REUiIHN0cm9rZS13aWR0aD0iMy4yNDI1MSIvPgo8Y2lyY2xlIGN4PSI3LjkxMzIxIiBjeT0iMjYuOTA0MiIgcj0iNi4yOTE3MSIgdHJhbnNmb3JtPSJyb3RhdGUoOTAgNy45MTMyMSAyNi45MDQyKSIgZmlsbD0iI0U2RTRERSIgc3Ryb2tlPSIjRTZFNERFIiBzdHJva2Utd2lkdGg9IjMuMjQyNTEiLz4KPGNpcmNsZSBjeD0iNy45MTMyMSIgY3k9IjQ1Ljg5NjQiIHI9IjYuMjkxNzEiIHRyYW5zZm9ybT0icm90YXRlKDkwIDcuOTEzMjEgNDUuODk2NCkiIGZpbGw9IiNFNkU0REUiIHN0cm9rZT0iI0U2RTRERSIgc3Ryb2tlLXdpZHRoPSIzLjI0MjUxIi8+CjxjaXJjbGUgY3g9IjY0Ljg4NjgiIGN5PSI4My44Nzg4IiByPSI2LjI5MTcxIiB0cmFuc2Zvcm09InJvdGF0ZSg5MCA2NC44ODY4IDgzLjg3ODgpIiBmaWxsPSIjRTZFNERFIiBzdHJva2U9IiNFNkU0REUiIHN0cm9rZS13aWR0aD0iMy4yNDI1MSIvPgo8Y2lyY2xlIGN4PSIyNi45MDQ0IiBjeT0iODMuODc4OCIgcj0iNi4yOTE3MSIgdHJhbnNmb3JtPSJyb3RhdGUoOTAgMjYuOTA0NCA4My44Nzg4KSIgZmlsbD0iI0U2RTRERSIgc3Ryb2tlPSIjRTZFNERFIiBzdHJva2Utd2lkdGg9IjMuMjQyNTEiLz4KPGNpcmNsZSBjeD0iNy45MTMyMSIgY3k9IjY0Ljg4NzYiIHI9IjYuMjkxNzEiIHRyYW5zZm9ybT0icm90YXRlKDkwIDcuOTEzMjEgNjQuODg3NikiIGZpbGw9IiNFNkU0REUiIHN0cm9rZT0iI0U2RTRERSIgc3Ryb2tlLXdpZHRoPSIzLjI0MjUxIi8+CjxyZWN0IHg9IjkwLjE2OTgiIHk9IjM5LjYwNDciIHdpZHRoPSIzMS41NzQ1IiBoZWlnaHQ9IjEyLjU4MzQiIHJ4PSI2LjI5MTcxIiB0cmFuc2Zvcm09InJvdGF0ZSg5MCA5MC4xNjk4IDM5LjYwNDcpIiBmaWxsPSIjRTZFNERFIiBzdHJva2U9IiNFNkU0REUiIHN0cm9rZS13aWR0aD0iMy4yNDI1MSIvPgo8cmVjdCB4PSI3MS4xNzg1IiB5PSIxLjYyMTI2IiB3aWR0aD0iMzEuNTc0NSIgaGVpZ2h0PSIxMi41ODM0IiByeD0iNi4yOTE3MSIgdHJhbnNmb3JtPSJyb3RhdGUoOTAgNzEuMTc4NSAxLjYyMTI2KSIgZmlsbD0iI0U2RTRERSIgc3Ryb2tlPSIjRTZFNERF
IiBzdHJva2Utd2lkdGg9IjMuMjQyNTEiLz4KPHJlY3QgeD0iNTIuMTg4MyIgeT0iNTguNTk1OSIgd2lkdGg9IjMxLjU3NDUiIGhlaWdodD0iMTIuNTgzNCIgcng9IjYuMjkxNzEiIHRyYW5zZm9ybT0icm90YXRlKDkwIDUyLjE4ODMgNTguNTk1OSkiIGZpbGw9IiNFNkU0REUiIHN0cm9rZT0iI0U2RTRERSIgc3Ryb2tlLXdpZHRoPSIzLjI0MjUxIi8+CjxyZWN0IHg9IjMzLjE5NjEiIHk9IjIyLjE5NTUiIHdpZHRoPSIzMS41NzQ1IiBoZWlnaHQ9IjEyLjU4MzQiIHJ4PSI2LjI5MTcxIiB0cmFuc2Zvcm09InJvdGF0ZSg5MCAzMy4xOTYxIDIyLjE5NTUpIiBmaWxsPSIjRTZFNERFIiBzdHJva2U9IiNFNkU0REUiIHN0cm9rZS13aWR0aD0iMy4yNDI1MSIvPgo8cmVjdCB4PSIzMy4xOTYxIiB5PSIxLjYyMTI2IiB3aWR0aD0iMzEuNTc0NSIgaGVpZ2h0PSIxMi41ODM0IiByeD0iNi4yOTE3MSIgdHJhbnNmb3JtPSJyb3RhdGUoOTAgMzMuMTk2MSAxLjYyMTI2KSIgZmlsbD0iI0U2RTRERSIgc3Ryb2tlPSIjRTZFNERFIiBzdHJva2Utd2lkdGg9IjMuMjQyNTEiLz4KPHJlY3QgeD0iMTQuMjA0OSIgeT0iMzkuNjA0NyIgd2lkdGg9IjMxLjU3NDUiIGhlaWdodD0iMTIuNTgzNCIgcng9IjYuMjkxNzEiIHRyYW5zZm9ybT0icm90YXRlKDkwIDE0LjIwNDkgMzkuNjA0NykiIGZpbGw9IiNFNkU0REUiIHN0cm9rZT0iI0U2RTRERSIgc3Ryb2tlLXdpZHRoPSIzLjI0MjUxIi8+Cjwvc3ZnPgo=\"></a>\n</p>\n\n<p align=\"center\">\n <strong>Developed by the <a href=\"https://www.linkedin.com/company/digitx-lmu/\">DIGIT-X Lab</a> at LMU Munich University</strong><br>\n <em>Quietly ambitious about the hard things</em>\n</p>\n\n---\n\n## \ud83e\uddec **Structure first. Insight follows.**\n\nMedical data is inherently complex, unstructured, and heterogeneous. Before we can unlock meaningful patterns, predict outcomes, or enable clinical decision support, we must first impose order on chaos. **MOSAICX** embodies this fundamental principle: **structured data is the prerequisite for knowledge discovery**.\n\nIn healthcare, unstructured documents\u2014radiology reports, clinical notes, pathology summaries\u2014contain critical information locked in narrative text. MOSAICX transforms this chaos into validated, machine-readable structures using AI-driven schema generation and extraction pipelines. 
Only when data is properly structured can we apply advanced analytics, machine learning, and knowledge graphs to generate actionable insights.\n\n**Core Capabilities:**\n- \ud83d\udd2c **Schema Generation**: Transform natural language descriptions into validated Pydantic models\n- \ud83d\udcc4 **Document Extraction**: Convert PDFs and clinical documents to structured JSON using generated schemas \n- \ud83d\udcca **Clinical Summarization**: Generate timeline-based summaries of radiology reports with standardized outputs\n- \u26a1 **CLI & API**: Powerful command-line interface and Python API for production workflows\n- \ud83c\udfe5 **Privacy-First**: Process sensitive medical data locally using Ollama-compatible LLMs\n- \ud83c\udfaf **Production-Ready**: Robust error handling, validation, and reproducible outputs\n- \ud83c\udf10 **Demo WebApp**: Interactive web interface for demonstrations and testing\n\n> *Powered by local LLMs via **Ollama**, PDF processing via **Docling**, and strict validation via **Pydantic v2***\n\n## \ud83c\udf10 **Demo WebApp**\n\nInteractive web interface for **demonstrations and testing only**. 
Use CLI/API for production workflows.\n\n### **Quick Demo Start:**\n```bash\ncd webapp && ./start.sh\n```\n\n**Demo Features:**\n- \ud83d\udd2c **Smart Contract Generator**: Create Pydantic schemas from natural language\n- \ud83d\udcc4 **PDF Extractor**: Drag-and-drop PDF processing \n- \ud83d\udcca **Report Summarizer**: Timeline-based clinical analysis\n\n**Access Demo:** http://localhost:3000 | [Full Setup Guide \u2192](webapp/README.md)\n\n**Requirements:**\n- **Docker**: Desktop or Engine 20.10+\n- **RAM**: 16GB+ (32GB recommended for large models) \n- **Storage**: 10GB+ for containers and models\n- **GPU**: Optional but recommended for large models\n\n**Architecture Notes:**\n- **Option 1**: WebApp containers \u2192 Host Ollama (via `host.docker.internal:11434`)\n- **Option 2**: WebApp containers \u2192 Ollama container (via internal Docker network)\n\n**Features:**\n- \ud83d\udd2c **Smart Contract Generator**: Create Pydantic models from natural language\n- \ud83d\udcc4 **PDF Extractor**: Drag-and-drop PDF processing with real-time results \n- \ud83d\udcca **Report Summarizer**: Timeline-based clinical report analysis\n- \ud83d\udccb **Sample Data**: Pre-loaded medical PDFs and schema templates\n- \ud83c\udfa8 **Glass Morphism UI**: Electric cyan theme with professional medical interface\n\n[**\u2192 Full WebApp Documentation**](webapp/README.md)\n\n---\n\n## \ud83d\ude80 **Installation & Setup**\n\n### System Requirements\n- **Python**: 3.11+ (3.12 recommended)\n- **Operating System**: macOS, Linux, Windows (with WSL2)\n- **Memory**: 16GB RAM minimum, 32GB recommended\n- **Storage**: 10GB free space for models\n\n### Step 1: Install Ollama\n```bash\n# macOS/Linux (automatic installation)\ncurl -fsSL https://ollama.com/install.sh | sh\n\n# Windows: Download from https://ollama.com/download/windows\n\n# Start Ollama service\nollama serve\n```\n\n### Step 2: Install MOSAICX\n```bash\n# Using pip (standard)\npip install mosaicx\n\n# Using uv (faster dependency 
resolution)\nuv add mosaicx\n\n# Using pipx (isolated installation)\npipx install mosaicx\n\n# Development installation\ngit clone https://github.com/LalithShiyam/MOSAICX.git\ncd MOSAICX\npip install -e .\n```\n\n### Step 3: Download Required Models\n```bash\n# Default model (recommended for most use cases)\nollama pull gpt-oss:120b\n\n# Alternative models\nollama pull llama3.1:8b-instruct # Smaller, faster\nollama pull qwen2.5:7b-instruct # Good balance\nollama pull deepseek-r1:7b # Reasoning model\n\n# Verify installation\nmosaicx --version\n```\n\n\n\n### Step 4: Quick Test\n```bash\n# Test connection to Ollama\nmosaicx generate --desc \"Simple patient record with name and age\" --class-name TestModel\n```\n\n---\n\n## \ud83d\udd2c **Usage Guide**\n\n### Command Overview\nMOSAICX provides three main commands with extensive options:\n\n```bash\nmosaicx --help # Show all commands\nmosaicx generate --help # Schema generation options\nmosaicx extract --help # Document extraction options \nmosaicx summarize --help # Report summarization options\nmosaicx schemas --help # Schema management options\n```\n\n### Default Settings\n- **Model**: `gpt-oss:120b` (configurable via `--model`)\n- **Temperature**: \n - Schema generation: `0.2` (balanced creativity)\n - Data extraction: `0.0` (deterministic)\n - Summarization: `0.2` (slight creativity for readability)\n- **Base URL**: `http://localhost:11434/v1` (Ollama default)\n- **API Key**: `ollama` (Ollama default)\n\n\n### 1. 
Schema Generation from Natural Language\n\nTransform clinical requirements into validated Pydantic models:\n\n```bash\n# Basic usage (uses defaults)\nmosaicx generate \\\n --desc \"Echocardiography report with patient demographics, LVEF, valve grades, impression\"\n\n# Generated Pydantic Model:\n```python\nfrom pydantic import BaseModel, Field\nfrom datetime import datetime\nfrom typing import Literal, Optional\n\nclass EchocardiographyReport(BaseModel):\n \"\"\"Echocardiography report with patient demographics, LVEF, valve grades, impression\"\"\"\n \n patient_id: str = Field(..., description=\"Unique patient identifier\")\n patient_name: str = Field(..., description=\"Patient full name\")\n date_of_birth: datetime = Field(..., description=\"Patient date of birth\")\n exam_date: datetime = Field(..., description=\"Date of echocardiogram examination\")\n lvef_percent: float = Field(..., ge=0, le=100, description=\"Left ventricular ejection fraction (%)\")\n mitral_valve_grade: Literal[\"Normal\", \"Mild\", \"Moderate\", \"Severe\"] = Field(\n ..., description=\"Mitral valve regurgitation severity\"\n )\n aortic_valve_grade: Literal[\"Normal\", \"Mild\", \"Moderate\", \"Severe\"] = Field(\n ..., description=\"Aortic valve stenosis/regurgitation severity\"\n )\n tricuspid_valve_grade: Literal[\"Normal\", \"Mild\", \"Moderate\", \"Severe\"] = Field(\n ..., description=\"Tricuspid valve regurgitation severity\"\n )\n clinical_impression: str = Field(..., min_length=10, description=\"Cardiologist's clinical impression\")\n```\n\n**Advanced usage with custom settings:**\n```bash\nmosaicx generate \\\n --desc \"Complete blood count with patient ID, test date, hemoglobin, hematocrit, WBC count, differential counts, and reference ranges\" \\\n --class-name CBCReport \\\n --model llama3.1:8b-instruct \\\n --temperature 0.1 \\\n --schema-path schemas/cbc_report.py\n\n# Generated Pydantic Model:\n```python\nfrom pydantic import BaseModel, Field\nfrom datetime import 
datetime\nfrom typing import Optional\n\nclass CBCReport(BaseModel):\n \"\"\"Complete blood count with patient ID, test date, hemoglobin, hematocrit, WBC count, differential counts, and reference ranges\"\"\"\n \n patient_id: str = Field(..., description=\"Unique patient identifier\")\n test_date: datetime = Field(..., description=\"Date when CBC test was performed\")\n hemoglobin: float = Field(..., ge=0, le=25, description=\"Hemoglobin level in g/dL\")\n hematocrit: float = Field(..., ge=0, le=70, description=\"Hematocrit percentage\")\n wbc_count: float = Field(..., ge=0, description=\"White blood cell count (thousands/\u03bcL)\")\n neutrophils_percent: float = Field(..., ge=0, le=100, description=\"Neutrophils percentage\")\n lymphocytes_percent: float = Field(..., ge=0, le=100, description=\"Lymphocytes percentage\")\n monocytes_percent: float = Field(..., ge=0, le=100, description=\"Monocytes percentage\")\n eosinophils_percent: float = Field(..., ge=0, le=100, description=\"Eosinophils percentage\")\n basophils_percent: float = Field(..., ge=0, le=100, description=\"Basophils percentage\")\n hemoglobin_ref_range: str = Field(..., description=\"Reference range for hemoglobin\")\n hematocrit_ref_range: str = Field(..., description=\"Reference range for hematocrit\")\n wbc_ref_range: str = Field(..., description=\"Reference range for WBC count\")\n```\n\n**Available Options:**\n\n- `--desc` (required): Natural language description\n- `--class-name`: Pydantic class name (default: \"GeneratedModel\")\n- `--model`: LLM model to use (default: \"gpt-oss:120b\")\n- `--temperature`: Sampling temperature 0.0-2.0 (default: 0.2)\n- `--schema-path`: Write the generated schema to this file\n- `--base-url`: Custom API endpoint\n- `--api-key`: Custom API key\n- `--debug`: Enable verbose logging\n\n### 2. 
Document Extraction to Structured Data\n\nExtract structured information from clinical documents:\n\n```bash\n# Basic extraction\nmosaicx extract \\\n --document patient_reports/echo_001.pdf \\\n --schema EchoReport\n\n# Advanced extraction with custom model\nmosaicx extract \\\n --document \"case studies/complex_cardiology_report.pdf\" \\\n --schema CBCReport_20250925_143022 \\\n --model qwen2.5:7b-instruct \\\n --save results/structured_data.json\n```\n\nSupported formats include PDF, DOC/DOCX, PPT/PPTX, TXT/MD, and RTF\u2014mix them freely in a single run.\n\nBehind the scenes MOSAICX layers extraction: native Docling text, then forced OCR, and finally Gemma3:27b via Ollama for vision-language transcription when required.\n\nExample CLI output (abridged \u2013 actual Rich formatting includes colors and panels):\n\n```\n\ud83d\udccb Extraction results based on schema: EchoReport\n\nField Extracted Value\npatient_id ECG-001-2025\nexam_date 2025-09-15T00:00:00\nlvef_percent 55.0\nmitral_valve_grade Mild\naortic_valve_grade Normal\ntricuspid_valve_grade Normal\nclinical_impression Normal left ventricular systolic function...\n\n\ud83d\udcc1 Extraction saved\nJSON: results/structured_data.json\n```\n\n### 3. 
Clinical Report Summarization\n\nGenerate timeline-based summaries from radiology reports:\n\n```bash\n# Single patient, multiple reports\nmosaicx summarize \\\n --report patient_001/ct_baseline.pdf \\\n --report patient_001/ct_3month.pdf \\\n --report patient_001/ct_6month.pdf \\\n --patient P001 \\\n --json-out summaries/P001_longitudinal.json\n\n# Process entire directory\nmosaicx summarize \\\n --dir ./radiology_reports/patient_P001/ \\\n --patient P001 \\\n --model llama3.1:8b-instruct \\\n --temperature 0.1 \\\n --json-out P001_summary.json\n```\n\nSupported formats include PDF, DOC/DOCX, PPT/PPTX, TXT/MD, and RTF\u2014mix them freely in a single run.\n\nExample CLI output (abridged \u2013 actual Rich formatting includes colors and panels):\n\n```\nPatient: P001\nDOB: \u2014 Sex: \u2014 Updated: 2025-09-25T14:30:22Z\n\nTimeline\nDate Source Critical Note\n2025-08-01 CT Chest/Abdomen/Pelvis Baseline study: Multiple pulmonary nodules...\n2025-09-15 CT Chest Follow-up Interval growth: RUL nodule now 12mm...\n\nOverall Summary\nProgressive pulmonary nodular disease with interval growth of the RUL lesion and new LLL nodule. 
[Source: CT Chest Follow-up]\n```\n\n---\n\n## \ud83d\udc0d **Using MOSAICX as a Python Library**\n\nThe CLI features are also exposed as pure Python helpers so you can script or integrate them into other services.\n\n```python\nfrom pathlib import Path\n\nfrom mosaicx import (\n extract_pdf,\n generate_schema,\n summarize_reports,\n)\n\n# 1) Generate a Pydantic schema from a plain-language description\nschema = generate_schema(\n \"Patient vitals with name, heart rate, systolic_bp, diastolic_bp\",\n class_name=\"PatientVitals\",\n model=\"gpt-oss:120b\",\n)\nschema_path = schema.write(Path(\"schemas/patient_vitals.py\"))\n\n# 2) Extract structured data from a PDF using that schema\nextraction = extract_pdf(\n pdf_path=\"tests/datasets/sample_patient_vitals.pdf\",\n schema_path=schema_path,\n)\npayload = extraction.to_dict()\n\n# 3) Summarize one or more clinical reports\nsummary = summarize_reports(\n paths=[\"tests/datasets/sample_patient_vitals.pdf\"],\n patient_id=\"demo-patient\",\n)\n```\n\nExample Python output (illustrative values):\n\n```python\npayload\n{\n \"patient_name\": \"John Doe\",\n \"heart_rate\": 72,\n \"systolic_bp\": 118,\n \"diastolic_bp\": 76,\n}\n\nsummary.overall\n'Stable vital signs with normal heart rate and blood pressure. [Source: sample_patient_vitals.pdf]'\n\nsummary.timeline[0].model_dump()\n{\n \"date\": None,\n \"source\": \"sample_patient_vitals.pdf\",\n \"note\": \"Vitals within normal limits; no acute concerns.\",\n}\n```\n\nAll helpers accept optional `model`, `base_url`, and `api_key` arguments; when omitted the defaults mirror the CLI (environment variables first, then local Ollama).\n\n---\n\n## \ud83c\udfaf **Why Structure Matters in Medical AI**\n\nAt the DIGIT-X Lab, we believe that **structure precedes insight**. The proliferation of unstructured medical data\u2014radiology reports, clinical notes, pathology summaries\u2014represents both an opportunity and a challenge. 
While this data contains rich clinical knowledge, its unstructured nature makes it largely inaccessible to computational analysis. \n\nModern healthcare generates exabytes of unstructured text annually, yet most clinical decision support systems can only leverage structured fields from electronic health records. This fundamental disconnect limits our ability to develop robust clinical AI, conduct large-scale outcomes research, or enable personalized medicine approaches.\n\n**MOSAICX addresses this gap by:**\n- **Democratizing Data Structuring**: Transforming natural language descriptions into production-ready data schemas without requiring deep technical expertise\n- **Enabling Reproducible Extraction**: Converting documents to validated JSON structures that can be reliably processed by downstream ML pipelines\n- **Preserving Clinical Context**: Maintaining semantic meaning while imposing computational structure through intelligent schema design\n- **Supporting Privacy Requirements**: Processing sensitive medical data locally without external API dependencies\n\nThe structured data produced by MOSAICX becomes the foundation for knowledge graphs, longitudinal analysis, cohort studies, and clinical prediction models. **Structure first. 
Insight follows.**\n\n---\n\n## \ud83d\udd27 **Advanced Features**\n\n### Schema Registry Management\nThe schema registry tracks all generated Pydantic models for easy reuse:\n\n```bash\n# List all generated schemas with details\nmosaicx schemas\n\n# Filter by clinical domain or keywords\nmosaicx schemas --description \"cardiology\"\nmosaicx schemas --class-name \"Echo\"\n\n# Clean up orphaned registry entries (files deleted outside MOSAICX)\nmosaicx schemas --cleanup\n\n# Scan and register existing schema files not tracked by registry\nmosaicx schemas --scan\n```\n\n**Available Schema Registry:**\n- **EchoReport_20250925_143022**: Echocardiography report with LVEF and valve assessments\n- **CBCReport_20250925_101530**: Complete blood count with differential and references \n- **PathologyReport_20250924_152045**: Surgical pathology with tumor staging and margins\n\n\ud83d\udca1 **Tip**: Use schema ID, filename, or file path in extract commands\n\n### Batch Processing\nProcess multiple documents or directories efficiently:\n\n```bash\n# Batch summarization for multiple patients\nfor patient_dir in ./patients/*/; do\n patient_id=$(basename \"$patient_dir\")\n mosaicx summarize \\\n --dir \"$patient_dir\" \\\n --patient \"$patient_id\" \\\n --json-out \"summaries/${patient_id}_summary.json\"\ndone\n\n# Batch extraction using same schema\nfind ./reports -name \"*.pdf\" -exec mosaicx extract \\\n --document {} \\\n --schema UniversalLabReport \\\n --save \"structured_data/{}.json\" \\;\n```\n\n### Custom Model Endpoints\nUse alternative LLM providers or local deployments:\n\n```bash\n# OpenAI API\nmosaicx generate \\\n --desc \"Pathology report with tumor staging\" \\\n --base-url https://api.openai.com/v1 \\\n --api-key sk-your-openai-key \\\n --model gpt-4-turbo\n\n# Local LM Studio\nmosaicx extract \\\n --document report.pdf \\\n --schema PathologyReport \\\n --base-url http://localhost:1234/v1 \\\n --api-key lm-studio \\\n --model local-medical-llm\n\n# Custom 
medical LLM deployment\nmosaicx summarize \\\n --dir ./radiology_reports/ \\\n --base-url https://your-medical-llm.hospital.com/v1 \\\n --api-key your-internal-key \\\n --model hospital-radiology-model\n```\n\n### Environment Variables\nSet default values to avoid repetitive command-line options:\n\n```bash\n# Set default model and endpoint\nexport OPENAI_BASE_URL=\"http://localhost:11434/v1\"\nexport OPENAI_API_KEY=\"ollama\"\nexport MOSAICX_DEFAULT_MODEL=\"gpt-oss:120b\"\n\n# Now use simplified commands\nmosaicx generate --desc \"Simple patient record\"\nmosaicx extract --document report.pdf --schema PatientRecord\n```\n\n---\n\n## \ud83d\udccb **Best Practices & Model Selection**\n\n### Recommended Models by Use Case\n\n| Model | Size | Use Case | Memory | Speed | Accuracy |\n|-------|------|----------|---------|-------|----------|\n| `gpt-oss:120b` | ~120B | Complex schemas, high accuracy | 64GB+ | Slow | \u2605\u2605\u2605\u2605\u2605 |\n| `llama3.1:8b-instruct` | ~8B | Balanced performance | 16GB+ | Fast | \u2605\u2605\u2605\u2605\u2606 |\n| `qwen2.5:7b-instruct` | ~7B | Batch processing | 12GB+ | Fastest | \u2605\u2605\u2605\u2606\u2606 |\n| `deepseek-r1:7b` | ~7B | Reasoning tasks | 16GB+ | Medium | \u2605\u2605\u2605\u2605\u2606 |\n\n**Default Model**: `gpt-oss:120b` provides the best accuracy for medical schema generation and extraction tasks.\n\n### Schema Design Guidelines\n\n**\u2705 Good Schema Design:**\n```python\n# Descriptive field names with medical terminology\nclass EchocardiographyReport(BaseModel):\n patient_id: str = Field(..., description=\"Unique patient identifier\")\n exam_date: datetime = Field(..., description=\"Date of echocardiogram\")\n lvef_percent: float = Field(..., ge=0, le=100, description=\"Left ventricular ejection fraction (%)\")\n mitral_valve_grade: Literal[\"Normal\", \"Mild\", \"Moderate\", \"Severe\"] = Field(\n ..., description=\"Mitral valve regurgitation severity\"\n )\n clinical_impression: str = Field(..., 
min_length=10, description=\"Cardiologist's interpretation\")\n```\n\n**\u274c Poor Schema Design:**\n```python\n# Vague field names, no validation, poor descriptions\nclass Report(BaseModel):\n data: str\n values: list\n result: float\n```\n\n### Extraction Optimization\n\n**Document Preparation:**\n- Ensure PDFs have searchable text layers (not just scanned images)\n- Use OCR preprocessing for scanned documents, for example with OCRmyPDF (a Tesseract wrapper that accepts PDF input): `ocrmypdf input.pdf output.pdf`\n- Remove password protection from PDFs before processing\n\n**Parameter Tuning:**\n- **Temperature 0.0**: Deterministic extraction for consistent results\n- **Temperature 0.1-0.2**: Slight variation for creative schema generation\n- **Larger models**: Use for complex medical terminology and relationships\n\n**Validation Best Practices:**\n- Always review extracted data for clinical accuracy\n- Implement post-processing validation against medical standards\n- Use enum constraints for standardized medical values\n- Set appropriate ranges for numeric clinical measurements\n\n### Production Deployment\n\n**Performance Optimization:**\n```bash\n# Use quantized models for faster inference\nollama pull llama3.1:8b-instruct-q4_0 # 4-bit quantization\n\n# Process in batches to maximize GPU utilization\n# Use parallel processing for independent documents\n```\n\n**Error Handling:**\n```bash\n# Enable debug mode for troubleshooting\nmosaicx extract --document document.pdf --schema MySchema --debug\n\n# Implement retry logic for production systems\n# Validate outputs against clinical standards\n# Log failed extractions for manual review\n```\n\n---\n\n## \ud83c\udfe5 **Clinical Applications**\n\n### Research & Analytics\n- **Cohort Studies**: Structure clinical notes for population-level analysis\n- **Outcomes Research**: Extract standardized endpoints from heterogeneous reports\n- **Quality Metrics**: Automate clinical quality measure extraction\n- **Biomarker Discovery**: Structure pathology and lab reports for analysis\n\n### 
Clinical Decision Support\n- **Risk Stratification**: Extract risk factors into computable formats\n- **Care Pathway Optimization**: Structure clinical workflows and outcomes\n- **Longitudinal Tracking**: Generate patient timelines from multiple reports\n- **Adverse Event Detection**: Structure safety data from clinical narratives\n\n### Operational Excellence\n- **Revenue Cycle**: Extract billable procedures and diagnoses\n- **Compliance Reporting**: Structure regulatory reporting requirements\n- **Care Coordination**: Generate structured handoff summaries\n- **Quality Assurance**: Standardize report review workflows\n\n---\n\n## \u26a1 **Performance & Scalability**\n\n### Local Processing Benefits\n- **Privacy Compliance**: No PHI transmitted to external services\n- **Cost Efficiency**: Eliminate per-token API costs for large-scale processing\n- **Latency Optimization**: Sub-second processing for typical clinical documents\n- **Offline Capability**: Process data in air-gapped environments\n\n### Hardware Recommendations\n- **Minimum**: 16GB RAM, modern CPU (M1/M2 Mac, Intel i7/AMD Ryzen 7)\n- **Recommended**: 32GB RAM, GPU acceleration (RTX 4080/4090, M2 Max/Ultra)\n- **High-throughput**: 64GB+ RAM, multiple GPUs for batch processing\n\n---\n\n## \ud83d\udd0d **Troubleshooting Guide**\n\n### Installation Issues\n\n**Python Version Compatibility:**\n```bash\n# Check Python version (requires 3.11+)\npython --version\n\n# Install specific Python version if needed\npyenv install 3.12.0\npyenv global 3.12.0\n```\n\n**Dependency Conflicts:**\n```bash\n# Use virtual environment to isolate dependencies\npython -m venv mosaicx-env\nsource mosaicx-env/bin/activate # macOS/Linux\n# mosaicx-env\\Scripts\\activate # Windows\n\npip install mosaicx\n```\n\n### Runtime Issues\n\n| Issue | Cause | Solution |\n|-------|-------|----------|\n| `Connection refused` | Ollama not running | `ollama serve` |\n| `Model not found` | Model not downloaded | `ollama pull model-name` |\n| 
`Empty extraction` | Poor model/temperature | Try `gpt-oss:120b` with `--temperature 0.0` |\n| `PDF processing error` | Scanned PDF without text | Use OCR, e.g. OCRmyPDF: `ocrmypdf input.pdf output.pdf` |\n| `Memory error` | Model too large | Use quantized model: `llama3.1:8b-instruct-q4_0` |\n| `JSON validation error` | Malformed output | Enable `--debug` and check model output |\n| `Schema not found` | Registry out of sync | Run `mosaicx schemas --scan` |\n\n### Debug Mode\nEnable verbose logging to diagnose issues:\n\n```bash\n# Enable debug for all commands\nmosaicx --debug generate --desc \"Test schema\"\nmosaicx extract --document document.pdf --schema MySchema --debug\nmosaicx summarize --dir ./reports --debug\n\n# Check Ollama status\nollama list # Show downloaded models\nollama ps # Show running models\ncurl http://localhost:11434/api/tags # API health check\n```\n\n### Common Error Messages\n\n**\"Schema class 'MySchema' not found\"**\n```bash\n# Check available schemas\nmosaicx schemas\n\n# Regenerate if missing\nmosaicx generate --desc \"Your schema description\" --class-name MySchema\n```\n\n**\"No text extracted from PDF\"**\n```bash\n# Test PDF text extraction\npython -c \"\nfrom docling.document_converter import DocumentConverter\nconverter = DocumentConverter()\nresult = converter.convert('your_file.pdf')\nprint(result.document.export_to_markdown())\n\"\n```\n\n**\"Temperature must be between 0.0 and 2.0\"**\n```bash\n# Fix temperature value\nmosaicx generate --desc \"Test\" --temperature 0.2 # Valid range: 0.0-2.0\n```\n\n### Performance Issues\n\n**Slow Processing:**\n- Use smaller models: `llama3.1:8b-instruct` instead of `gpt-oss:120b`\n- Increase available RAM or use quantized models (`q4_0` suffix)\n- Process documents in smaller batches\n\n**High Memory Usage:**\n- Close other applications\n- Use quantized models\n- Process one document at a time\n\n**Inaccurate Results:**\n- Use larger, more capable models\n- Lower temperature for more deterministic output\n- Improve 
schema descriptions with more specific field definitions\n- Review and refine extracted data manually\n\n### Getting Help\n\n**Log Analysis:**\n```bash\n# Enable maximum verbosity\nexport MOSAICX_LOG_LEVEL=DEBUG\nmosaicx extract --document document.pdf --schema MySchema --debug > debug.log 2>&1\n```\n\n**System Information:**\n```bash\n# Gather system info for bug reports\nmosaicx --version\npython --version\nollama --version\npip show mosaicx\n```\n\nFor additional support:\n- **GitHub Issues**: [Report bugs and feature requests](https://github.com/LalithShiyam/MOSAICX/issues)\n- **Research Inquiries**: lalith.shiyam@med.uni-muenchen.de\n- **Commercial Support**: lalith@zenta.solutions\n\n---\n\n## \ud83c\udf93 **From DIGIT-X Lab**\n\n**MOSAICX** is developed by the [DIGIT-X Lab](https://www.linkedin.com/company/digitx-lmu/) at LMU Munich University, a research group focused on digital transformation in radiology and medical imaging. Our mission is to bridge the gap between clinical practice and computational methods through practical, privacy-preserving tools.\n\n**Research Focus Areas:**\n- Medical Image Analysis & AI\n- Clinical Natural Language Processing \n- Healthcare Data Standardization\n- Privacy-Preserving Medical AI\n- Radiomics & Quantitative Imaging\n\n**Team**: Led by researchers and clinicians who understand both the technical challenges and clinical requirements of modern healthcare data processing.\n\n*We are quietly ambitious about the hard things.*\n\n---\n\n## \ud83d\udcdd **License & Citation**\n\nMOSAICX is released under the AGPL-3.0 license for academic and open-source use. 
For commercial applications in healthcare organizations, please contact us for licensing options.\n\n### Citation\n```bibtex\n@software{mosaicx2025,\n title={MOSAICX: Medical cOmputational Suite for Advanced Intelligent eXtraction},\n author={Sundar, Lalith Kumar Shiyam and DIGIT-X Lab},\n year={2025},\n institution={LMU Munich University},\n url={https://github.com/LalithShiyam/MOSAICX},\n note={Developed at DIGIT-X Lab, Department of Radiology}\n}\n```\n\n---\n\n## \ud83e\udd1d **Contributing & Support**\n\nWe welcome contributions from the medical informatics and clinical AI communities:\n\n- **Bug Reports**: Submit issues with minimal reproducible examples\n- **Feature Requests**: Propose new clinical use cases and requirements \n- **Documentation**: Improve clinical examples and best practices\n- **Code Contributions**: Follow our development guidelines and testing requirements\n\n**Contact**: \n- Research Inquiries: [lalith.shiyam@med.uni-muenchen.de](mailto:lalith.shiyam@med.uni-muenchen.de)\n- Commercial Licensing: [lalith@zenta.solutions](mailto:lalith@zenta.solutions)\n- DIGIT-X Lab: [https://www.linkedin.com/company/digitx-lmu/](https://www.linkedin.com/company/digitx-lmu/)\n\n---\n\n*Built with \u2764\ufe0f for the medical community by researchers who understand that great clinical AI starts with great data structure.*\n\n**MOSAICX is infrastructure for clinical data**: schema-driven, validated, local, and reproducible. Structure reports once, then reuse the same schemas and summarizers across departments and time\u2014enabling longitudinal analysis, cross-modal integration, and downstream intelligence without sending data to the cloud.\n",
"bugtrack_url": null,
"license": "Apache License\n Version 2.0, January 2004\n http://www.apache.org/licenses/\n \n TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION\n \n 1. Definitions.\n \n \"License\" shall mean the terms and conditions for use, reproduction,\n and distribution as defined by Sections 1 through 9 of this document.\n \n \"Licensor\" shall mean the copyright owner or entity authorized by\n the copyright owner that is granting the License.\n \n \"Legal Entity\" shall mean the union of the acting entity and all\n other entities that control, are controlled by, or are under common\n control with that entity. For the purposes of this definition,\n \"control\" means (i) the power, direct or indirect, to cause the\n direction or management of such entity, whether by contract or\n otherwise, or (ii) ownership of fifty percent (50%) or more of the\n outstanding shares, or (iii) beneficial ownership of such entity.\n \n \"You\" (or \"Your\") shall mean an individual or Legal Entity\n exercising permissions granted by this License.\n \n \"Source\" form shall mean the preferred form for making modifications,\n including but not limited to software source code, documentation\n source, and configuration files.\n \n \"Object\" form shall mean any form resulting from mechanical\n transformation or translation of a Source form, including but\n not limited to compiled object code, generated documentation,\n and conversions to other media types.\n \n \"Work\" shall mean the work of authorship, whether in Source or\n Object form, made available under the License, as indicated by a\n copyright notice that is included in or attached to the work\n (an example is provided in the Appendix below).\n \n \"Derivative Works\" shall mean any work, whether in Source or Object\n form, that is based on (or derived from) the Work and for which the\n editorial revisions, annotations, elaborations, or other modifications\n represent, as a whole, an original work of authorship. 
For the purposes\n of this License, Derivative Works shall not include works that remain\n separable from, or merely link (or bind by name) to the interfaces of,\n the Work and Derivative Works thereof.\n \n \"Contribution\" shall mean any work of authorship, including\n the original version of the Work and any modifications or additions\n to that Work or Derivative Works thereof, that is intentionally\n submitted to Licensor for inclusion in the Work by the copyright owner\n or by an individual or Legal Entity authorized to submit on behalf of\n the copyright owner. For the purposes of this definition, \"submitted\"\n means any form of electronic, verbal, or written communication sent\n to the Licensor or its representatives, including but not limited to\n communication on electronic mailing lists, source code control systems,\n and issue tracking systems that are managed by, or on behalf of, the\n Licensor for the purpose of discussing and improving the Work, but\n excluding communication that is conspicuously marked or otherwise\n designated in writing by the copyright owner as \"Not a Contribution.\"\n \n \"Contributor\" shall mean Licensor and any individual or Legal Entity\n on behalf of whom a Contribution has been received by Licensor and\n subsequently incorporated within the Work.\n \n 2. Grant of Copyright License. Subject to the terms and conditions of\n this License, each Contributor hereby grants to You a perpetual,\n worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n copyright license to reproduce, prepare Derivative Works of,\n publicly display, publicly perform, sublicense, and distribute the\n Work and such Derivative Works in Source or Object form.\n \n 3. Grant of Patent License. 
Subject to the terms and conditions of\n this License, each Contributor hereby grants to You a perpetual,\n worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n (except as stated in this section) patent license to make, have made,\n use, offer to sell, sell, import, and otherwise transfer the Work,\n where such license applies only to those patent claims licensable\n by such Contributor that are necessarily infringed by their\n Contribution(s) alone or by combination of their Contribution(s)\n with the Work to which such Contribution(s) was submitted. If You\n institute patent litigation against any entity (including a\n cross-claim or counterclaim in a lawsuit) alleging that the Work\n or a Contribution incorporated within the Work constitutes direct\n or contributory patent infringement, then any patent licenses\n granted to You under this License for that Work shall terminate\n as of the date such litigation is filed.\n \n 4. Redistribution. You may reproduce and distribute copies of the\n Work or Derivative Works thereof in any medium, with or without\n modifications, and in Source or Object form, provided that You\n meet the following conditions:\n \n (a) You must give any other recipients of the Work or\n Derivative Works a copy of this License; and\n \n (b) You must cause any modified files to carry prominent notices\n stating that You changed the files; and\n \n (c) You must retain, in the Source form of any Derivative Works\n that You distribute, all copyright, patent, trademark, and\n attribution notices from the Source form of the Work,\n excluding those notices that do not pertain to any part of\n the Derivative Works; and\n \n (d) If the Work includes a \"NOTICE\" text file as part of its\n distribution, then any Derivative Works that You distribute must\n include a readable copy of the attribution notices contained\n within such NOTICE file, excluding those notices that do not\n pertain to any part of the Derivative Works, in at least 
one\n of the following places: within a NOTICE text file distributed\n as part of the Derivative Works; within the Source form or\n documentation, if provided along with the Derivative Works; or,\n within a display generated by the Derivative Works, if and\n wherever such third-party notices normally appear. The contents\n of the NOTICE file are for informational purposes only and\n do not modify the License. You may add Your own attribution\n notices within Derivative Works that You distribute, alongside\n or as an addendum to the NOTICE text from the Work, provided\n that such additional attribution notices cannot be construed\n as modifying the License.\n \n You may add Your own copyright statement to Your modifications and\n may provide additional or different license terms and conditions\n for use, reproduction, or distribution of Your modifications, or\n for any such Derivative Works as a whole, provided Your use,\n reproduction, and distribution of the Work otherwise complies with\n the conditions stated in this License.\n \n 5. Submission of Contributions. Unless You explicitly state otherwise,\n any Contribution intentionally submitted for inclusion in the Work\n by You to the Licensor shall be under the terms and conditions of\n this License, without any additional terms or conditions.\n Notwithstanding the above, nothing herein shall supersede or modify\n the terms of any separate license agreement you may have executed\n with Licensor regarding such Contributions.\n \n 6. Trademarks. This License does not grant permission to use the trade\n names, trademarks, service marks, or product names of the Licensor,\n except as required for reasonable and customary use in describing the\n origin of the Work and reproducing the content of the NOTICE file.\n \n 7. Disclaimer of Warranty. 
Unless required by applicable law or\n agreed to in writing, Licensor provides the Work (and each\n Contributor provides its Contributions) on an \"AS IS\" BASIS,\n WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or\n implied, including, without limitation, any warranties or conditions\n of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A\n PARTICULAR PURPOSE. You are solely responsible for determining the\n appropriateness of using or redistributing the Work and assume any\n risks associated with Your exercise of permissions under this License.\n \n 8. Limitation of Liability. In no event and under no legal theory,\n whether in tort (including negligence), contract, or otherwise,\n unless required by applicable law (such as deliberate and grossly\n negligent acts) or agreed to in writing, shall any Contributor be\n liable to You for damages, including any direct, indirect, special,\n incidental, or consequential damages of any character arising as a\n result of this License or out of the use or inability to use the\n Work (including but not limited to damages for loss of goodwill,\n work stoppage, computer failure or malfunction, or any and all\n other commercial damages or losses), even if such Contributor\n has been advised of the possibility of such damages.\n \n 9. Accepting Warranty or Additional Liability. While redistributing\n the Work or Derivative Works thereof, You may choose to offer,\n and charge a fee for, acceptance of support, warranty, indemnity,\n or other liability obligations and/or rights consistent with this\n License. 
However, in accepting such obligations, You may act only\n on Your own behalf and on Your sole responsibility, not on behalf\n of any other Contributor, and only if You agree to indemnify,\n defend, and hold each Contributor harmless for any liability\n incurred by, or claims asserted against, such Contributor by reason\n of your accepting any such warranty or additional liability.\n \n END OF TERMS AND CONDITIONS\n \n APPENDIX: How to apply the Apache License to your work.\n \n To apply the Apache License to your work, attach the following\n boilerplate notice, with the fields enclosed by brackets \"[]\"\n replaced with your own identifying information. (Don't include\n the brackets!) The text should be enclosed in the appropriate\n comment syntax for the file format. We also recommend that a\n file or class name and description of purpose be included on the\n same \"printed page\" as the copyright notice for easier\n identification within third-party archives.\n \n Copyright [yyyy] [name of copyright owner]\n \n Licensed under the Apache License, Version 2.0 (the \"License\");\n you may not use this file except in compliance with the License.\n You may obtain a copy of the License at\n \n http://www.apache.org/licenses/LICENSE-2.0\n \n Unless required by applicable law or agreed to in writing, software\n distributed under the License is distributed on an \"AS IS\" BASIS,\n WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n See the License for the specific language governing permissions and\n limitations under the License.",
"summary": "Medical cOmputational Suite for Advanced Intelligent eXtraction",
"version": "1.1.1",
"project_urls": {
"Bug Tracker": "https://github.com/LalithShiyam/MOSAICX/issues",
"Documentation": "https://github.com/LalithShiyam/MOSAICX#readme",
"Homepage": "https://github.com/LalithShiyam/MOSAICX",
"Repository": "https://github.com/LalithShiyam/MOSAICX"
},
"split_keywords": [
"extraction",
" llm",
" medical",
" nlp",
" pdf",
" radiology"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "0a8a538a05c50718e76e71afdb2730d0454d4fcb4769b25f7fe1988ef4c3b8aa",
"md5": "8daca675499280638b147113091ec314",
"sha256": "59b56f6c22e6c7ff8b3bd5ab907f598456b90819d220650c5d43fa203d42a21a"
},
"downloads": -1,
"filename": "mosaicx-1.1.1-py3-none-any.whl",
"has_sig": false,
"md5_digest": "8daca675499280638b147113091ec314",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.11",
"size": 95419,
"upload_time": "2025-10-21T12:47:20",
"upload_time_iso_8601": "2025-10-21T12:47:20.517915Z",
"url": "https://files.pythonhosted.org/packages/0a/8a/538a05c50718e76e71afdb2730d0454d4fcb4769b25f7fe1988ef4c3b8aa/mosaicx-1.1.1-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "40101375b16df4ae39652958976021b9cb5dcaec8d9b3661f49227e92c8e92ba",
"md5": "d99119e4e1ce1df66b2df03529cf125d",
"sha256": "89c528d1907885c8a2daf28984fd71161214b54a5aa2ba5757739653e54d7e9f"
},
"downloads": -1,
"filename": "mosaicx-1.1.1.tar.gz",
"has_sig": false,
"md5_digest": "d99119e4e1ce1df66b2df03529cf125d",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.11",
"size": 1099759,
"upload_time": "2025-10-21T12:47:22",
"upload_time_iso_8601": "2025-10-21T12:47:22.329327Z",
"url": "https://files.pythonhosted.org/packages/40/10/1375b16df4ae39652958976021b9cb5dcaec8d9b3661f49227e92c8e92ba/mosaicx-1.1.1.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-10-21 12:47:22",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "LalithShiyam",
"github_project": "MOSAICX",
"github_not_found": true,
"lcname": "mosaicx"
}