# Clause Mates Analyzer

<!-- Badges -->
<p align="left">
  <a href="https://github.com/jobschepens/clausemate/actions">
    <img src="https://github.com/jobschepens/clausemate/actions/workflows/python-app.yml/badge.svg" alt="Build Status">
  </a>
  <a href="https://codecov.io/gh/jobschepens/clausemate">
    <img src="https://codecov.io/gh/jobschepens/clausemate/branch/main/graph/badge.svg" alt="Coverage">
  </a>
  <a href="https://www.python.org/downloads/release/python-3110/">
    <img src="https://img.shields.io/badge/python-3.11%2B-blue.svg" alt="Python 3.11+">
  </a>
  <a href="https://github.com/charliermarsh/ruff">
    <img src="https://img.shields.io/badge/linting-ruff-%23f7ca18" alt="Ruff Linting">
  </a>
  <a href="https://pre-commit.com/">
    <img src="https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit" alt="pre-commit">
  </a>
  <a href="https://github.com/jobschepens/clausemate/blob/main/LICENSE">
    <img src="https://img.shields.io/badge/license-research-lightgrey.svg" alt="License">
  </a>
  <a href="https://github.com/jobschepens/clausemate/tree/main/docs">
    <img src="https://img.shields.io/badge/docs-available-brightgreen.svg" alt="Docs">
  </a>
</p>

> **⚠️ Disclaimer**: This repository contains experimental research code developed through iterative "vibe coding" sessions. While the functionality is complete and tested, the codebase reflects rapid prototyping, multiple refactoring attempts, and exploratory development. Code quality and organization may vary across different phases of development. Use with appropriate expectations for research/experimental software.

A Python tool for extracting and analyzing clause mate relationships from German pronoun data for linguistic research.

## Project Status

- ✅ **Phase 1 Complete**: Self-contained monolithic version with full functionality
- ✅ **Phase 2 Complete**: Modular architecture with adaptive parsing and 100% file compatibility
- ✅ **Phase 3.1 Complete**: Unified multi-file processing with cross-chapter coreference resolution
- ✅ **Documentation Complete**: Comprehensive format documentation for all supported file types
- ✅ **Deliverable Package Complete**: Professional HTML report with interactive visualizations ready for collaboration
- 📋 **Phase 3.2 Planned**: Advanced analytics and visualization features

> 📦 **Latest Achievement**: Complete deliverable package created with comprehensive HTML report, interactive visualizations, and professional documentation for collaborator delivery.

## Deliverable Package

A complete analysis package is available at [`data/output/deliverable_package_20250729/`](data/output/deliverable_package_20250729/) containing:

- **📊 Comprehensive HTML Report**: Interactive analysis with embedded visualizations
- **📈 Network Visualizations**: Character relationships and cross-chapter connections
- **📋 Complete Dataset**: 1,904 unified relationships in CSV format
- **📖 Documentation**: README and delivery summary for collaborators

> 🎯 **Ready for Delivery**: The package contains everything needed for collaborative analysis and can be shared independently.

## Description

This tool analyzes German pronouns and their clause mates in annotated linguistic data. It identifies critical pronouns (personal, demonstrative, and d-pronouns) and extracts their relationships with other referential expressions in the same sentence.

### Critical Pronouns Analyzed

- **Third person personal**: er, sie, es, ihm, ihr, ihn, ihnen
- **D-pronouns (pronominal)**: der, die, das, dem, den, deren, dessen, derer
- **Demonstrative**: dieser, diese, dieses, diesem, diesen

## Features

- **Unified Multi-File Processing**: Process all 4 chapter files as a single unified dataset
- **Cross-Chapter Coreference Resolution**: Identify and resolve coreference chains spanning multiple files
- **Adaptive TSV Parsing**: Supports multiple WebAnno TSV 3.3 format variations (12-37 columns)
- **Automatic Format Detection**: Preamble-based dynamic column mapping
- **100% File Compatibility**: Works with standard, extended, legacy, and incomplete formats
- **Cross-sentence Analysis**: Antecedent detection with 94.4% success rate
- **Single Unified Output**: One comprehensive file instead of four separate outputs
- **Comprehensive Documentation**: Detailed format specifications for all supported files
- **Robust Error Handling**: Graceful degradation and clear user feedback
- **Type-safe Implementation**: Full type hints and comprehensive testing
- **Timestamped Output**: Automatic organization with date/time-stamped directories
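
The timestamped-output convention from the last bullet can be sketched as follows (an illustration of the `data/output/YYYYMMDD_HHMMSS` layout used throughout this README; the helper name is hypothetical, not part of the tool's API):

```python
# Illustrative sketch of the date/time-stamped output directories described
# above; make_output_dir is a hypothetical helper, not the tool's API.
from datetime import datetime
from pathlib import Path

def make_output_dir(root: str = "data/output") -> Path:
    """Create and return a <root>/YYYYMMDD_HHMMSS directory."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    out = Path(root) / stamp
    out.mkdir(parents=True, exist_ok=True)
    return out
```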

## Supported File Formats

| Format | Columns | Description | Relationships | Status |
|--------|---------|-------------|---------------|---------|
| **Standard** | 15 | Basic linguistic annotations | 448 | ✅ Fully supported |
| **Extended** | 37 | Rich morphological features | 234 | ✅ Fully supported |
| **Legacy** | 14 | Compact annotation set | 527 | ✅ Fully supported |
| **Incomplete** | 12 | Limited annotations | 695 | ✅ Graceful handling |

> 📊 **Format Documentation**: See [`data/input/FORMAT_OVERVIEW.md`](data/input/FORMAT_OVERVIEW.md) for comprehensive technical specifications.
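
The actual detection is preamble-based (see `src/utils/format_detector.py`), but as a rough illustration, the column counts in the table above map to formats like this (a simplified sketch; the function name and thresholds are assumptions, not the tool's logic):

```python
# Simplified sketch: mapping column counts to the formats in the table above.
# The real tool detects formats from the WebAnno preamble, not raw counts.
def guess_format(column_count: int) -> str:
    if column_count >= 37:
        return "extended"
    if column_count == 15:
        return "standard"
    if column_count == 14:
        return "legacy"
    if column_count <= 12:
        return "incomplete"
    return "unknown"
```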

## Project Structure

```
├── src/                        # Phase 3.1 - Complete unified processing architecture
│   ├── main.py                     # Main orchestrator with adaptive parsing
│   ├── config.py                   # Generalized configuration system
│   ├── multi_file/                 # Multi-file processing components (Phase 3.1)
│   │   ├── multi_file_batch_processor.py   # Unified multi-file coordinator
│   │   ├── unified_sentence_manager.py     # Global sentence numbering
│   │   ├── cross_file_coreference_resolver.py # Cross-chapter chain resolution
│   │   ├── unified_relationship_model.py   # Extended relationship data model
│   │   └── __init__.py                     # Multi-file module exports
│   ├── parsers/                    # Adaptive TSV parsing components
│   │   ├── adaptive_tsv_parser.py      # Preamble-based dynamic parsing
│   │   ├── incomplete_format_parser.py # Specialized incomplete format handler
│   │   ├── preamble_parser.py          # WebAnno schema extraction
│   │   └── tsv_parser.py               # Legacy parser (fallback)
│   ├── extractors/                 # Feature extraction components
│   ├── utils/                      # Format detection and utilities
│   │   └── format_detector.py          # Automatic format analysis
│   └── data/                       # Data models and structures
├── scripts/                    # Executable scripts and utilities
│   ├── run_multi_file_analysis.py     # Production multi-file processing interface
│   ├── enhanced_cross_chapter_analysis.py # Enhanced cross-chapter analysis
│   ├── generate_advanced_analysis_simple.py # Advanced analysis generator
│   └── generate_visualizations.py     # Visualization generation
├── analysis/                   # Analysis scripts and tools
│   ├── analyze_4tsv_detailed.py       # Detailed TSV format analysis
│   ├── analyze_column_mapping.py      # Column mapping analysis
│   └── analyze_preambles.py           # Preamble structure analysis
├── tests/                      # Comprehensive test suite
│   ├── test_multi_file_processing.py  # Multi-file processing tests
│   ├── test_cross_chapter_coreference.py # Cross-chapter tests
│   └── test_4tsv_processing.py        # TSV format tests
├── logs/                       # Log files and execution records
│   ├── multi_file_analysis.log        # Multi-file processing logs
│   └── visualization_generation.log   # Visualization generation logs
├── data/                       # Input and output data
│   ├── input/                      # Source TSV files with documentation
│   │   ├── FORMAT_OVERVIEW.md          # Comprehensive format comparison
│   │   ├── gotofiles/                  # Standard and extended formats
│   │   │   ├── 2.tsv_DOCUMENTATION.md      # Standard format (15 cols)
│   │   │   └── later/                      # Alternative formats
│   │   │       ├── 1.tsv_DOCUMENTATION.md      # Extended format (37 cols)
│   │   │       ├── 3.tsv_DOCUMENTATION.md      # Legacy format (14 cols)
│   │   │       └── 4.tsv_DOCUMENTATION.md      # Incomplete format (12 cols)
│   └── output/                     # Analysis results and deliverables
│       ├── deliverable_package_20250729/   # 📦 COMPLETE DELIVERABLE PACKAGE
│       │   ├── comprehensive_analysis_report.html  # Interactive HTML report
│       │   ├── unified_relationships.csv           # Complete dataset (1,904 relationships)
│       │   ├── visualizations_20250729_123445/     # Interactive network visualizations
│       │   ├── README.md                           # Package documentation
│       │   └── DELIVERY_SUMMARY.md                 # Delivery instructions
│       └── unified_analysis_20250729_123353/       # Latest raw analysis results
├── tools/                      # Development and utility tools
│   └── temp_files/                 # Temporary files and cleanup
├── docs/                       # Project documentation
│   ├── MULTI_FILE_PROCESSING_DOCUMENTATION.md # Multi-file architecture guide
│   ├── unified_multi_file_processing_plan.md # Implementation plan
│   ├── 4tsv_analysis_specification.md # TSV format specifications
│   └── cross_chapter_coreference_analysis_spec.md # Cross-chapter analysis spec
```

## Installation

1. **Clone the repository**:
   ```bash
   git clone <repository-url>
   cd clause-mates-analyzer
   ```

2. **Set up environment** (choose one):

   **Option A: pip (recommended)**
   ```bash
   python -m venv .venv
   # Windows:
   .venv\Scripts\activate
   # macOS/Linux:
   source .venv/bin/activate

   pip install -e .[dev,benchmark]
   ```

   **Option B: conda**
   ```bash
   conda env create -f environment.yml
   conda activate clausemate
   ```

## Usage

### Multi-File Processing (Phase 3.1) - **RECOMMENDED**

Process all 4 chapter files as a unified dataset with cross-chapter coreference resolution:

```bash
# Unified multi-file processing (all chapters as single dataset)
python scripts/run_multi_file_analysis.py

# With verbose logging
python scripts/run_multi_file_analysis.py --verbose

# Custom output directory
python scripts/run_multi_file_analysis.py --output-dir custom_output
```

**Output**: A single unified dataset covering all 1,904 relationships plus 36 cross-chapter chains
- Creates timestamped directory: `data/output/unified_analysis_YYYYMMDD_HHMMSS/`
- **unified_relationships.csv**: Main CSV output with source file metadata
- **unified_relationships.json**: JSON format with complete relationship data
- **cross_chapter_statistics.json**: Cross-chapter chain resolution statistics
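
For downstream analysis, the unified CSV can be loaded with pandas, e.g. picking the most recent timestamped run (a sketch under the directory layout above; `load_latest_unified` is a hypothetical helper, not part of the tool):

```python
# Hypothetical helper for loading the most recent unified analysis run;
# assumes the data/output/unified_analysis_YYYYMMDD_HHMMSS layout above.
from pathlib import Path

import pandas as pd

def load_latest_unified(output_root: str = "data/output") -> pd.DataFrame:
    """Read unified_relationships.csv from the newest timestamped run."""
    runs = sorted(Path(output_root).glob("unified_analysis_*"))
    if not runs:
        raise FileNotFoundError(f"no unified_analysis_* runs under {output_root}")
    return pd.read_csv(runs[-1] / "unified_relationships.csv")
```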

### Single File Processing (Phase 2)

Process individual files with automatic format detection:

```bash
# Individual file processing with adaptive parsing
python src/main.py data/input/gotofiles/2.tsv                    # Standard format
python src/main.py data/input/gotofiles/later/1.tsv              # Extended format
python src/main.py data/input/gotofiles/later/3.tsv              # Legacy format
python src/main.py data/input/gotofiles/later/4.tsv              # Incomplete format

# Force legacy parser (disable adaptive features)
python src/main.py --disable-adaptive data/input/gotofiles/2.tsv

# Verbose output with format detection details
python src/main.py --verbose data/input/gotofiles/later/1.tsv
```

**Output**: Individual timestamped directories in `data/output/YYYYMMDD_HHMMSS/`

### Analysis Results Comparison

#### Unified Multi-File Processing (Recommended)

| **Unified Output** | **Total** | **Cross-Chapter Chains** | **Processing Time** |
|-------------------|-----------|--------------------------|-------------------|
| **All 4 Chapters** | **1,904 relationships** | **36 unified chains** | **~12 seconds** |

#### Individual File Processing

| File | Format | Sentences | Tokens | Relationships | Coreference Chains |
|------|--------|-----------|--------|---------------|-------------------|
| **2.tsv** | Standard | 222 | 3,665 | **448** | 235 |
| **1.tsv** | Extended | 127 | 2,267 | **234** | 195 |
| **3.tsv** | Legacy | 207 | 3,786 | **527** | 244 |
| **4.tsv** | Incomplete | 243 | 4,412 | **695** | 245 |

> 💡 **Recommendation**: Use multi-file processing for complete narrative analysis with cross-chapter relationships. Individual processing is available for specific chapter analysis or debugging.

### Analysis Tools

```bash
# Generate comprehensive analysis reports
python analysis/analyze_4tsv_detailed.py

# Analyze column mappings and format compatibility
python analysis/analyze_column_mapping.py

# Analyze preamble structures
python analysis/analyze_preambles.py

# Generate advanced analysis with visualizations
python scripts/generate_advanced_analysis_simple.py

# Create interactive visualizations
python scripts/generate_visualizations.py
```

### Testing

```bash
# Run all tests
python -m pytest

# Run specific test categories
python -m pytest tests/test_4tsv_processing.py
python -m pytest tests/test_multi_file_processing.py
python -m pytest tests/test_cross_chapter_coreference.py

# Run with verbose output
python -m pytest -v

# Run with coverage
python -m pytest --cov=src
```

## Development

### Quick Start

```bash
# Install development dependencies
pip install -e .[dev,benchmark]

# Run quality checks
nox                      # Run default sessions (lint, test)
nox -s lint              # Fast ruff linting
nox -s format            # Code formatting
nox -s test              # Run tests
nox -s ci                # Full CI pipeline

# Run tests directly
pytest
```

### Code Quality

This project uses **ruff** for fast, comprehensive code quality checking and formatting:

- **ruff**: Fast linting and formatting (replaces black, isort, flake8)
- **mypy**: Type checking
- **pytest**: Testing framework
- **pre-commit**: Git hooks for quality assurance

## Requirements

- **Python**: 3.11+
- **Core Dependencies**: pandas, standard library modules
- **Development**: ruff, mypy, pytest, pre-commit

## Contributing

This is a research project. For contributions:

1. Follow the established code style and type hints
2. Add tests for new functionality
3. Update documentation as needed
4. Ensure backward compatibility with existing data

See [`CONTRIBUTING.md`](CONTRIBUTING.md) for detailed setup instructions.

## Reproducibility

For exact result reproduction, see [`REPRODUCIBILITY.md`](REPRODUCIBILITY.md) for step-by-step instructions using locked dependencies and reference outputs.

## License

Research project - please contact maintainers for usage permissions.

## Contact

For questions about the linguistic methodology or data format, please refer to the project documentation or contact the research team.

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "clausemate",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.11",
    "maintainer_email": null,
    "keywords": "linguistics, computational-linguistics, german, pronouns, coreference",
    "author": null,
    "author_email": "Job Schepens <job.schepens@uni-koeln.com>",
    "download_url": "https://files.pythonhosted.org/packages/a7/0a/6ff433ea2aad38380db9af336da9b0cc56ad5d727e20def3d47fbe13b55d/clausemate-2.0.1.tar.gz",
    "platform": null,
    "description": "# Clause Mates Analyzer\r\n\r\n<!-- Badges -->\r\n<p align=\"left\">\r\n  <a href=\"https://github.com/jobschepens/clausemate/actions\">\r\n    <img src=\"https://github.com/jobschepens/clausemate/actions/workflows/python-app.yml/badge.svg\" alt=\"Build Status\">\r\n  </a>\r\n  <a href=\"https://codecov.io/gh/jobschepens/clausemate\">\r\n    <img src=\"https://codecov.io/gh/jobschepens/clausemate/branch/main/graph/badge.svg\" alt=\"Coverage\">\r\n  </a>\r\n  <a href=\"https://www.python.org/downloads/release/python-3110/\">\r\n    <img src=\"https://img.shields.io/badge/python-3.11%2B-blue.svg\" alt=\"Python 3.11+\">\r\n  </a>\r\n  <a href=\"https://github.com/charliermarsh/ruff\">\r\n    <img src=\"https://img.shields.io/badge/linting-ruff-%23f7ca18\" alt=\"Ruff Linting\">\r\n  </a>\r\n  <a href=\"https://pre-commit.com/\">\r\n    <img src=\"https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit\" alt=\"pre-commit\">\r\n  </a>\r\n  <a href=\"https://github.com/jobschepens/clausemate/blob/main/LICENSE\">\r\n    <img src=\"https://img.shields.io/badge/license-research-lightgrey.svg\" alt=\"License\">\r\n  </a>\r\n  <a href=\"https://github.com/jobschepens/clausemate/tree/main/docs\">\r\n    <img src=\"https://img.shields.io/badge/docs-available-brightgreen.svg\" alt=\"Docs\">\r\n  </a>\r\n</p>\r\n\r\n> **\u26a0\ufe0f Disclaimer**: This repository contains experimental research code developed through iterative \"vibe coding\" sessions. While the functionality is complete and tested, the codebase reflects rapid prototyping, multiple refactoring attempts, and exploratory development. Code quality and organization may vary across different phases of development. 
Use with appropriate expectations for research/experimental software.\r\n\r\nA Python tool for extracting and analyzing clause mate relationships from German pronoun data for linguistic research.\r\n\r\n## Project Status\r\n\r\n- \u2705 **Phase 1 Complete**: Self-contained monolithic version with full functionality\r\n- \u2705 **Phase 2 Complete**: Modular architecture with adaptive parsing and 100% file compatibility\r\n- \u2705 **Phase 3.1 Complete**: Unified multi-file processing with cross-chapter coreference resolution\r\n- \u2705 **Documentation Complete**: Comprehensive format documentation for all supported file types\r\n- \u2705 **Deliverable Package Complete**: Professional HTML report with interactive visualizations ready for collaboration\r\n- \ud83d\udccb **Phase 3.2 Planned**: Advanced analytics and visualization features\r\n\r\n> \ud83d\udce6 **Latest Achievement**: Complete deliverable package created with comprehensive HTML report, interactive visualizations, and professional documentation for collaborator delivery.\r\n\r\n## Deliverable Package\r\n\r\nA complete analysis package is available at [`data/output/deliverable_package_20250729/`](data/output/deliverable_package_20250729/) containing:\r\n\r\n- **\ud83d\udcca Comprehensive HTML Report**: Interactive analysis with embedded visualizations\r\n- **\ud83d\udcc8 Network Visualizations**: Character relationships and cross-chapter connections\r\n- **\ud83d\udccb Complete Dataset**: 1,904 unified relationships in CSV format\r\n- **\ud83d\udcd6 Documentation**: README and delivery summary for collaborators\r\n\r\n> \ud83c\udfaf **Ready for Delivery**: The package contains everything needed for collaborative analysis and can be shared independently.\r\n\r\n## Description\r\n\r\nThis tool analyzes German pronouns and their clause mates in annotated linguistic data. 
It identifies critical pronouns (personal, demonstrative, and d-pronouns) and extracts their relationships with other referential expressions in the same sentence.\r\n\r\n### Critical Pronouns Analyzed\r\n\r\n- **Third person personal**: er, sie, es, ihm, ihr, ihn, ihnen\r\n- **D-pronouns (pronominal)**: der, die, das, dem, den, deren, dessen, derer\r\n- **Demonstrative**: dieser, diese, dieses, diesem, diesen\r\n\r\n## Features\r\n\r\n- **Unified Multi-File Processing**: Process all 4 chapter files as a single unified dataset\r\n- **Cross-Chapter Coreference Resolution**: Identify and resolve coreference chains spanning multiple files\r\n- **Adaptive TSV Parsing**: Supports multiple WebAnno TSV 3.3 format variations (12-38 columns)\r\n- **Automatic Format Detection**: Preamble-based dynamic column mapping\r\n- **100% File Compatibility**: Works with standard, extended, legacy, and incomplete formats\r\n- **Cross-sentence Analysis**: Antecedent detection with 94.4% success rate\r\n- **Single Unified Output**: One comprehensive file instead of four separate outputs\r\n- **Comprehensive Documentation**: Detailed format specifications for all supported files\r\n- **Robust Error Handling**: Graceful degradation and clear user feedback\r\n- **Type-safe Implementation**: Full type hints and comprehensive testing\r\n- **Timestamped Output**: Automatic organization with date/time-stamped directories\r\n\r\n## Supported File Formats\r\n\r\n| Format | Columns | Description | Relationships | Status |\r\n|--------|---------|-------------|---------------|---------|\r\n| **Standard** | 15 | Basic linguistic annotations | 448 | \u2705 Fully supported |\r\n| **Extended** | 37 | Rich morphological features | 234 | \u2705 Fully supported |\r\n| **Legacy** | 14 | Compact annotation set | 527 | \u2705 Fully supported |\r\n| **Incomplete** | 12 | Limited annotations | 695 | \u2705 Graceful handling |\r\n\r\n> \ud83d\udcca **Format Documentation**: See 
[`data/input/FORMAT_OVERVIEW.md`](data/input/FORMAT_OVERVIEW.md) for comprehensive technical specifications.\r\n\r\n## Project Structure\r\n\r\n```\r\n\u251c\u2500\u2500 src/                        # Phase 3.1 - Complete unified processing architecture\r\n\u2502   \u251c\u2500\u2500 main.py                     # Main orchestrator with adaptive parsing\r\n\u2502   \u251c\u2500\u2500 config.py                   # Generalized configuration system\r\n\u2502   \u251c\u2500\u2500 multi_file/                 # Multi-file processing components (Phase 3.1)\r\n\u2502   \u2502   \u251c\u2500\u2500 multi_file_batch_processor.py   # Unified multi-file coordinator\r\n\u2502   \u2502   \u251c\u2500\u2500 unified_sentence_manager.py     # Global sentence numbering\r\n\u2502   \u2502   \u251c\u2500\u2500 cross_file_coreference_resolver.py # Cross-chapter chain resolution\r\n\u2502   \u2502   \u251c\u2500\u2500 unified_relationship_model.py   # Extended relationship data model\r\n\u2502   \u2502   \u2514\u2500\u2500 __init__.py                     # Multi-file module exports\r\n\u2502   \u251c\u2500\u2500 parsers/                    # Adaptive TSV parsing components\r\n\u2502   \u2502   \u251c\u2500\u2500 adaptive_tsv_parser.py      # Preamble-based dynamic parsing\r\n\u2502   \u2502   \u251c\u2500\u2500 incomplete_format_parser.py # Specialized incomplete format handler\r\n\u2502   \u2502   \u251c\u2500\u2500 preamble_parser.py          # WebAnno schema extraction\r\n\u2502   \u2502   \u2514\u2500\u2500 tsv_parser.py               # Legacy parser (fallback)\r\n\u2502   \u251c\u2500\u2500 extractors/                 # Feature extraction components\r\n\u2502   \u251c\u2500\u2500 utils/                      # Format detection and utilities\r\n\u2502   \u2502   \u2514\u2500\u2500 format_detector.py          # Automatic format analysis\r\n\u2502   \u2514\u2500\u2500 data/                       # Data models and structures\r\n\u251c\u2500\u2500 scripts/                    # Executable 
scripts and utilities\r\n\u2502   \u251c\u2500\u2500 run_multi_file_analysis.py     # Production multi-file processing interface\r\n\u2502   \u251c\u2500\u2500 enhanced_cross_chapter_analysis.py # Enhanced cross-chapter analysis\r\n\u2502   \u251c\u2500\u2500 generate_advanced_analysis_simple.py # Advanced analysis generator\r\n\u2502   \u2514\u2500\u2500 generate_visualizations.py     # Visualization generation\r\n\u251c\u2500\u2500 analysis/                   # Analysis scripts and tools\r\n\u2502   \u251c\u2500\u2500 analyze_4tsv_detailed.py       # Detailed TSV format analysis\r\n\u2502   \u251c\u2500\u2500 analyze_column_mapping.py      # Column mapping analysis\r\n\u2502   \u2514\u2500\u2500 analyze_preambles.py           # Preamble structure analysis\r\n\u251c\u2500\u2500 tests/                      # Comprehensive test suite\r\n\u2502   \u251c\u2500\u2500 test_multi_file_processing.py  # Multi-file processing tests\r\n\u2502   \u251c\u2500\u2500 test_cross_chapter_coreference.py # Cross-chapter tests\r\n\u2502   \u2514\u2500\u2500 test_4tsv_processing.py        # TSV format tests\r\n\u251c\u2500\u2500 logs/                       # Log files and execution records\r\n\u2502   \u251c\u2500\u2500 multi_file_analysis.log        # Multi-file processing logs\r\n\u2502   \u2514\u2500\u2500 visualization_generation.log   # Visualization generation logs\r\n\u251c\u2500\u2500 data/                       # Input and output data\r\n\u2502   \u251c\u2500\u2500 input/                      # Source TSV files with documentation\r\n\u2502   \u2502   \u251c\u2500\u2500 FORMAT_OVERVIEW.md          # Comprehensive format comparison\r\n\u2502   \u2502   \u251c\u2500\u2500 gotofiles/                  # Standard and extended formats\r\n\u2502   \u2502   \u2502   \u251c\u2500\u2500 2.tsv_DOCUMENTATION.md      # Standard format (15 cols)\r\n\u2502   \u2502   \u2502   \u2514\u2500\u2500 later/                      # Alternative formats\r\n\u2502   \u2502   \u2502       
\u251c\u2500\u2500 1.tsv_DOCUMENTATION.md      # Extended format (37 cols)\r\n\u2502   \u2502   \u2502       \u251c\u2500\u2500 3.tsv_DOCUMENTATION.md      # Legacy format (14 cols)\r\n\u2502   \u2502   \u2502       \u2514\u2500\u2500 4.tsv_DOCUMENTATION.md      # Incomplete format (12 cols)\r\n\u2502   \u2514\u2500\u2500 output/                     # Analysis results and deliverables\r\n\u2502       \u251c\u2500\u2500 deliverable_package_20250729/   # \ud83d\udce6 COMPLETE DELIVERABLE PACKAGE\r\n\u2502       \u2502   \u251c\u2500\u2500 comprehensive_analysis_report.html  # Interactive HTML report\r\n\u2502       \u2502   \u251c\u2500\u2500 unified_relationships.csv           # Complete dataset (1,904 relationships)\r\n\u2502       \u2502   \u251c\u2500\u2500 visualizations_20250729_123445/     # Interactive network visualizations\r\n\u2502       \u2502   \u251c\u2500\u2500 README.md                           # Package documentation\r\n\u2502       \u2502   \u2514\u2500\u2500 DELIVERY_SUMMARY.md                 # Delivery instructions\r\n\u2502       \u2514\u2500\u2500 unified_analysis_20250729_123353/       # Latest raw analysis results\r\n\u251c\u2500\u2500 tools/                      # Development and utility tools\r\n\u2502   \u2514\u2500\u2500 temp_files/                 # Temporary files and cleanup\r\n\u251c\u2500\u2500 docs/                       # Project documentation\r\n\u2502   \u251c\u2500\u2500 MULTI_FILE_PROCESSING_DOCUMENTATION.md # Multi-file architecture guide\r\n\u2502   \u251c\u2500\u2500 unified_multi_file_processing_plan.md # Implementation plan\r\n\u2502   \u251c\u2500\u2500 4tsv_analysis_specification.md # TSV format specifications\r\n\u2502   \u2514\u2500\u2500 cross_chapter_coreference_analysis_spec.md # Cross-chapter analysis spec\r\n```\r\n\r\n## Installation\r\n\r\n1. **Clone the repository**:\r\n   ```bash\r\n   git clone <repository-url>\r\n   cd clause-mates-analyzer\r\n   ```\r\n\r\n2. 
**Set up environment** (choose one):\r\n\r\n   **Option A: pip (recommended)**\r\n   ```bash\r\n   python -m venv .venv\r\n   # Windows:\r\n   .venv\\Scripts\\activate\r\n   # macOS/Linux:\r\n   source .venv/bin/activate\r\n\r\n   pip install -e .[dev,benchmark]\r\n   ```\r\n\r\n   **Option B: conda**\r\n   ```bash\r\n   conda env create -f environment.yml\r\n   conda activate clausemate\r\n   ```\r\n\r\n## Usage\r\n\r\n### Multi-File Processing (Phase 3.1) - **RECOMMENDED**\r\n\r\nProcess all 4 chapter files as a unified dataset with cross-chapter coreference resolution:\r\n\r\n```bash\r\n# Unified multi-file processing (all chapters as single dataset)\r\npython scripts/run_multi_file_analysis.py\r\n\r\n# With verbose logging\r\npython scripts/run_multi_file_analysis.py --verbose\r\n\r\n# Custom output directory\r\npython scripts/run_multi_file_analysis.py --output-dir custom_output\r\n```\r\n\r\n**Output**: Single unified file with all 1,904 relationships + 36 cross-chapter chains\r\n- Creates timestamped directory: `data/output/unified_analysis_YYYYMMDD_HHMMSS/`\r\n- **unified_relationships.csv**: Main CSV output with source file metadata\r\n- **unified_relationships.json**: JSON format with complete relationship data\r\n- **cross_chapter_statistics.json**: Cross-chapter chain resolution statistics\r\n\r\n### Single File Processing (Phase 2)\r\n\r\nProcess individual files with automatic format detection:\r\n\r\n```bash\r\n# Individual file processing with adaptive parsing\r\npython src/main.py data/input/gotofiles/2.tsv                    # Standard format\r\npython src/main.py data/input/gotofiles/later/1.tsv              # Extended format\r\npython src/main.py data/input/gotofiles/later/3.tsv              # Legacy format\r\npython src/main.py data/input/gotofiles/later/4.tsv              # Incomplete format\r\n\r\n# Force legacy parser (disable adaptive features)\r\npython src/main.py --disable-adaptive data/input/gotofiles/2.tsv\r\n\r\n# Verbose output with 
format detection details\r\npython src/main.py --verbose data/input/gotofiles/later/1.tsv\r\n```\r\n\r\n**Output**: Individual timestamped directories in `data/output/YYYYMMDD_HHMMSS/`\r\n\r\n### Analysis Results Comparison\r\n\r\n#### Unified Multi-File Processing (Recommended)\r\n\r\n| **Unified Output** | **Total** | **Cross-Chapter Chains** | **Processing Time** |\r\n|-------------------|-----------|--------------------------|-------------------|\r\n| **All 4 Chapters** | **1,904 relationships** | **36 unified chains** | **~12 seconds** |\r\n\r\n#### Individual File Processing\r\n\r\n| File | Format | Sentences | Tokens | Relationships | Coreference Chains |\r\n|------|--------|-----------|--------|---------------|-------------------|\r\n| **2.tsv** | Standard | 222 | 3,665 | **448** | 235 |\r\n| **1.tsv** | Extended | 127 | 2,267 | **234** | 195 |\r\n| **3.tsv** | Legacy | 207 | 3,786 | **527** | 244 |\r\n| **4.tsv** | Incomplete | 243 | 4,412 | **695** | 245 |\r\n\r\n> \ud83d\udca1 **Recommendation**: Use multi-file processing for complete narrative analysis with cross-chapter relationships. 
Individual processing is available for specific chapter analysis or debugging.\r\n\r\n### Analysis Tools\r\n\r\n```bash\r\n# Generate comprehensive analysis reports\r\npython analysis/analyze_4tsv_detailed.py\r\n\r\n# Analyze column mappings and format compatibility\r\npython analysis/analyze_column_mapping.py\r\n\r\n# Analyze preamble structures\r\npython analysis/analyze_preambles.py\r\n\r\n# Generate advanced analysis with visualizations\r\npython scripts/generate_advanced_analysis_simple.py\r\n\r\n# Create interactive visualizations\r\npython scripts/generate_visualizations.py\r\n```\r\n\r\n### Testing\r\n\r\n```bash\r\n# Run all tests\r\npython -m pytest\r\n\r\n# Run specific test categories\r\npython -m pytest tests/test_4tsv_processing.py\r\npython -m pytest tests/test_multi_file_processing.py\r\npython -m pytest tests/test_cross_chapter_coreference.py\r\n\r\n# Run with verbose output\r\npython -m pytest -v\r\n\r\n# Run with coverage\r\npython -m pytest --cov=src\r\n```\r\n\r\n## Development\r\n\r\n### Quick Start\r\n\r\n```bash\r\n# Install development dependencies\r\npip install -e .[dev,benchmark]\r\n\r\n# Run quality checks\r\nnox                      # Run default sessions (lint, test)\r\nnox -s lint              # Fast ruff linting\r\nnox -s format            # Code formatting\r\nnox -s test              # Run tests\r\nnox -s ci                # Full CI pipeline\r\n\r\n# Run tests directly\r\npytest\r\n```\r\n\r\n### Code Quality\r\n\r\nThis project uses **ruff** for fast, comprehensive code quality checking and formatting:\r\n\r\n- **ruff**: Fast linting and formatting (replaces black, isort, flake8)\r\n- **mypy**: Type checking\r\n- **pytest**: Testing framework\r\n- **pre-commit**: Git hooks for quality assurance\r\n\r\n## Requirements\r\n\r\n- **Python**: 3.8+\r\n- **Core Dependencies**: pandas, standard library modules\r\n- **Development**: ruff, mypy, pytest, pre-commit\r\n\r\n## Contributing\r\n\r\nThis is a research project. 
For contributions:

1. Follow the established code style and type hints
2. Add tests for new functionality
3. Update documentation as needed
4. Ensure backward compatibility with existing data

See [`CONTRIBUTING.md`](CONTRIBUTING.md) for detailed setup instructions.

## Reproducibility

For exact result reproduction, see [`REPRODUCIBILITY.md`](REPRODUCIBILITY.md) for step-by-step instructions using locked dependencies and reference outputs.

## License

Research project - please contact maintainers for usage permissions.

## Contact

For questions about the linguistic methodology or data format, please refer to the project documentation or contact the research team.
",
    "bugtrack_url": null,
    "license": null,
    "summary": "German pronoun clause mate extraction and analysis tool",
    "version": "2.0.1",
    "project_urls": {
        "Bug Tracker": "https://github.com/jobschepens/clausemate/issues",
        "Documentation": "https://github.com/jobschepens/clausemate#readme",
        "Homepage": "https://github.com/jobschepens/clausemate",
        "Repository": "https://github.com/jobschepens/clausemate.git"
    },
    "split_keywords": [
        "linguistics",
        " computational-linguistics",
        " german",
        " pronouns",
        " coreference"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "1d741559f2b4d882594c4cda1d22932531ded8dec404c280f099ad61918c5caf",
                "md5": "c73c95763501ec4423863272b3ed3cd5",
                "sha256": "dd0e48f2ad02ab0b8cbb2ed7c1a22b6bd8de3a82cf857dad703149113d11e059"
            },
            "downloads": -1,
            "filename": "clausemate-2.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "c73c95763501ec4423863272b3ed3cd5",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.11",
            "size": 96269,
            "upload_time": "2025-07-29T20:32:47",
            "upload_time_iso_8601": "2025-07-29T20:32:47.854722Z",
            "url": "https://files.pythonhosted.org/packages/1d/74/1559f2b4d882594c4cda1d22932531ded8dec404c280f099ad61918c5caf/clausemate-2.0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "a70a6ff433ea2aad38380db9af336da9b0cc56ad5d727e20def3d47fbe13b55d",
                "md5": "6c0c413b0ef077c5a19dacb99d4ecb01",
                "sha256": "9461f2c31c7749b2059cae899ec5097ca984358d1f2619b87283e6aade8c373f"
            },
            "downloads": -1,
            "filename": "clausemate-2.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "6c0c413b0ef077c5a19dacb99d4ecb01",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.11",
            "size": 95893,
            "upload_time": "2025-07-29T20:32:49",
            "upload_time_iso_8601": "2025-07-29T20:32:49.529342Z",
            "url": "https://files.pythonhosted.org/packages/a7/0a/6ff433ea2aad38380db9af336da9b0cc56ad5d727e20def3d47fbe13b55d/clausemate-2.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-29 20:32:49",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "jobschepens",
    "github_project": "clausemate",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "numpy",
            "specs": [
                [
                    "==",
                    "2.3.1"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    "==",
                    "2.3.1"
                ]
            ]
        },
        {
            "name": "python-dateutil",
            "specs": [
                [
                    "==",
                    "2.9.0.post0"
                ]
            ]
        },
        {
            "name": "pytz",
            "specs": [
                [
                    "==",
                    "2025.2"
                ]
            ]
        },
        {
            "name": "six",
            "specs": [
                [
                    "==",
                    "1.17.0"
                ]
            ]
        },
        {
            "name": "tzdata",
            "specs": [
                [
                    "==",
                    "2025.2"
                ]
            ]
        }
    ],
    "lcname": "clausemate"
}
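The release metadata above records a SHA-256 digest for each artifact, so a download can be checked before installation. A minimal sketch using only the standard library; the expected digest is copied from the wheel entry above, while the local filename is an assumption (whatever `pip download clausemate==2.0.1` produced in the current directory):

```python
import hashlib

# Expected SHA-256 for clausemate-2.0.1-py3-none-any.whl,
# copied from the release metadata above.
EXPECTED_SHA256 = "dd0e48f2ad02ab0b8cbb2ed7c1a22b6bd8de3a82cf857dad703149113d11e059"


def sha256_of(path: str, chunk_size: int = 1 << 16) -> str:
    """Stream a file through SHA-256 so large artifacts never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Usage would be comparing `sha256_of("clausemate-2.0.1-py3-none-any.whl")` against `EXPECTED_SHA256`; `pip install --require-hashes` offers the same guarantee without hand-rolled code.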
        