cjm-transcription-utils


| Field | Value |
|---|---|
| Name | cjm-transcription-utils |
| Version | 0.0.2 |
| Home page | https://github.com/cj-mills/cjm-transcription-utils |
| Summary | Miscellaneous utilities for helping with audio transcription. |
| Upload time | 2025-09-11 02:58:17 |
| Maintainer | None |
| Docs URL | None |
| Author | Christian J. Mills |
| Requires Python | >=3.11 |
| License | Apache Software License 2.0 |
| Keywords | nbdev, jupyter, notebook, python |
| Requirements | No requirements were recorded. |

# cjm-transcription-utils


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

## Install

``` bash
pip install cjm_transcription_utils
```

## Project Structure

    nbs/
    ├── chunking.ipynb            # VAD-based chunking and transcript merging
    ├── formatting.ipynb          # HMS timestamp range formatting
    ├── librosa.ipynb             # Audio loading and normalization
    ├── numerizer.ipynb           # Patched numeral conversion
    ├── postprocessing.ipynb      # Transcript post-processing
    ├── pydub.ipynb               # Audio segment extraction
    ├── silero_vad.ipynb          # Audio loading with VAD timestamp preparation
    └── timestamp_alignment.ipynb # Timestamp-to-transcript alignment

Total: 8 notebooks

## Module Dependencies

``` mermaid
graph LR
    chunking[chunking<br/>chunking]
    formatting[formatting<br/>formatting]
    librosa[librosa<br/>librosa]
    numerizer[numerizer<br/>numerizer]
    postprocessing[postprocessing<br/>postprocessing]
    pydub[pydub<br/>pydub]
    silero_vad[silero_vad<br/>silero vad]
    timestamp_alignment[timestamp_alignment<br/>timestamp alignment]

    silero_vad --> chunking
    silero_vad --> librosa
```

*2 cross-module dependencies detected*

## CLI Reference

No CLI commands found in this project.

## Module Overview

Detailed documentation for each module in the project:

### chunking (`chunking.ipynb`)

> Split audio into chunks using VAD timestamps, generate overlapping
> intermediate chunks, and merge the resulting transcripts.

#### Import

``` python
from cjm_transcription_utils.chunking import (
    get_extended_timestamp_boundaries,
    get_extended_chunk_boundaries,
    generate_chunks_with_vad,
    generate_intermediate_chunks,
    generate_intermediate_chunk_tuples,
    merge_transcripts_with_overlaps
)
```

#### Functions

``` python
def get_extended_timestamp_boundaries(
    timestamps: List[Dict[str, float]], 
    index: int  # Index of the current timestamp
) -> Tuple[float, float]
    "Get extended boundaries for a timestamp using adjacent timestamps."
```
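The README does not show how the "extended boundaries" are computed. The sketch below is one plausible reading, not the package's implementation: it assumes the extended start is the end of the previous timestamp and the extended end is the start of the next one. The name `extended_boundaries_sketch` is hypothetical.

```python
from typing import Dict, List, Tuple


def extended_boundaries_sketch(
    timestamps: List[Dict[str, float]],  # Dicts with 'start' and 'end' keys
    index: int,  # Index of the current timestamp
) -> Tuple[float, float]:
    """Extend a timestamp to the edges of its neighbours (assumed interpretation)."""
    current = timestamps[index]
    # Stretch the start back to where the previous timestamp ended, if one exists.
    start = timestamps[index - 1]["end"] if index > 0 else current["start"]
    # Stretch the end forward to where the next timestamp begins, if one exists.
    if index < len(timestamps) - 1:
        end = timestamps[index + 1]["start"]
    else:
        end = current["end"]
    return start, end


stamps = [
    {"start": 0.5, "end": 2.0},
    {"start": 3.0, "end": 4.5},
    {"start": 6.0, "end": 7.0},
]
print(extended_boundaries_sketch(stamps, 1))  # (2.0, 6.0)
```

Under this assumption the first and last timestamps keep their own outer edge, since they have no neighbour on that side.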

``` python
def get_extended_chunk_boundaries(
    chunks: List[Tuple[float, float]], 
    index: int  # Index of the current chunk
) -> Tuple[float, float]
    "Get extended boundaries for a chunk using adjacent chunks."
```

``` python
def generate_chunks_with_vad(
    audio_array: np.ndarray,  # Audio array
    duration: float,  # Total duration of audio in seconds
    max_chunk_seconds: float = 120,  # Maximum chunk duration in seconds
    max_chunk_seconds_offset: float = 0,  # Offset for chunk duration calculation
    speech_timestamps: Optional[List[Dict]] = None,  # List of speech timestamp dictionaries with 'start' and 'end' keys
    max_silence_threshold: float = 2.0  # Maximum silence duration (in seconds) before creating a new chunk
) -> Tuple[List[Tuple[float, float]], List[List[Dict]]]
    "Generate chunks using VAD timestamps with silence-based splitting"
```
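To illustrate what silence-based splitting of VAD timestamps can look like, here is a minimal sketch. It is an assumption about the general technique, not the package's code: it ignores `audio_array`, `duration`, and `max_chunk_seconds_offset`, and the name `chunk_by_silence_sketch` is hypothetical.

```python
from typing import Dict, List, Tuple


def chunk_by_silence_sketch(
    speech_timestamps: List[Dict[str, float]],  # Dicts with 'start' and 'end' keys, in seconds
    max_chunk_seconds: float = 120.0,  # Maximum chunk duration in seconds
    max_silence_threshold: float = 2.0,  # Silence gap (seconds) that forces a new chunk
) -> Tuple[List[Tuple[float, float]], List[List[Dict[str, float]]]]:
    """Group speech timestamps into chunks, splitting on long silences or length."""
    chunks: List[Tuple[float, float]] = []
    groups: List[List[Dict[str, float]]] = []
    current: List[Dict[str, float]] = []
    for ts in speech_timestamps:
        if current:
            gap = ts["start"] - current[-1]["end"]
            too_long = ts["end"] - current[0]["start"] > max_chunk_seconds
            if gap > max_silence_threshold or too_long:
                # Close the running chunk before starting a new one.
                chunks.append((current[0]["start"], current[-1]["end"]))
                groups.append(current)
                current = []
        current.append(ts)
    if current:
        chunks.append((current[0]["start"], current[-1]["end"]))
        groups.append(current)
    return chunks, groups


vad = [
    {"start": 0.0, "end": 1.5},
    {"start": 2.0, "end": 4.0},
    {"start": 7.0, "end": 9.0},
    {"start": 9.5, "end": 11.0},
]
chunks, groups = chunk_by_silence_sketch(vad)
print(chunks)  # [(0.0, 4.0), (7.0, 11.0)]
```

The return shape mirrors the documented signature: a list of `(start, end)` chunks plus the timestamps that fall inside each chunk.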

``` python
def generate_intermediate_chunks(
    chunks: List[Tuple[float, float]],
    chunk_timestamps: List[List[Dict]],  # List of timestamp dictionaries for each chunk
    use_extended_boundaries: bool  # Whether to use extended boundaries
) -> List[Tuple[float, float]]:  # (start, end) boundaries of the intermediate chunks
    "Generate overlapping chunks between consecutive chunk boundaries"
```

``` python
def generate_intermediate_chunk_tuples(
    chunks: List[Tuple[float, float]],
    chunk_timestamps: List[List[Dict]],  # List of timestamp dictionaries for each chunk
    use_extended_boundaries: bool  # Whether to use extended boundaries
) -> List[Tuple[Dict, Dict]]
    "Generate tuples of (last_timestamp, first_timestamp) from consecutive chunks."
```

``` python
def merge_transcripts_with_overlaps(
    normal_transcripts: List[str],  # List of transcripts for normal chunks
    intermediate_transcripts: List[str],  # List of transcripts for intermediate chunks
    segment_transcripts: List[Tuple[str, str]],
    verbose: bool = True  # Whether to print debug information
) -> str
    "Merge normal and intermediate transcripts with overlap correction"
```
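The README does not describe the overlap-correction algorithm. As a rough illustration of the idea of joining two transcripts that share words at the seam, the sketch below drops the longest word-level overlap between the end of one transcript and the start of the next. The name `merge_with_overlap_sketch` is hypothetical, and this is not the package's method.

```python
def merge_with_overlap_sketch(left: str, right: str) -> str:
    """Join two transcripts, dropping words repeated at the seam."""
    left_words = left.split()
    right_words = right.split()
    max_len = min(len(left_words), len(right_words))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_len, 0, -1):
        tail = [w.lower() for w in left_words[-size:]]
        head = [w.lower() for w in right_words[:size]]
        if tail == head:
            return " ".join(left_words + right_words[size:])
    # No shared words at the seam: plain concatenation.
    return " ".join(left_words + right_words)


print(merge_with_overlap_sketch("the quick brown fox", "brown fox jumps over"))
# the quick brown fox jumps over
```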

### formatting (`formatting.ipynb`)

> Format time intervals as HMS timestamp ranges.

#### Import

``` python
from cjm_transcription_utils.formatting import (
    time_interval_to_hms_range
)
```

#### Functions

``` python
def time_interval_to_hms_range(
    duration_tuple  # A tuple of (start_seconds, end_seconds) as floats
)
    "Convert a time interval tuple (start_seconds, end_seconds) to HMS timestamp range format."
```
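The exact output format is not shown in the README. The sketch below assumes an `HH:MM:SS - HH:MM:SS` layout and truncation of fractional seconds; both are assumptions, and the helper names are hypothetical.

```python
from typing import Tuple


def seconds_to_hms_sketch(seconds: float) -> str:
    """Format a number of seconds as HH:MM:SS, truncating the fraction."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def hms_range_sketch(interval: Tuple[float, float]) -> str:
    """Format a (start_seconds, end_seconds) tuple as an HMS range."""
    start, end = interval
    return f"{seconds_to_hms_sketch(start)} - {seconds_to_hms_sketch(end)}"


print(hms_range_sketch((75.4, 3725.9)))  # 00:01:15 - 01:02:05
```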

### librosa (`librosa.ipynb`)

> Load and normalize audio files.

#### Import

``` python
from cjm_transcription_utils.librosa import (
    load_audio
)
```

#### Functions

``` python
def load_audio(
    audio_path: str,  # Path to the audio file
    target_sr: int = 16000  # Target sample rate in Hz
) -> Tuple[np.ndarray, int]:  # Audio array and its sample rate
    "Load and normalize audio file"
```

### numerizer (`numerizer.ipynb`)

> Patched numeral conversion that does not convert 'a' to '1'.

#### Import

``` python
from cjm_transcription_utils.numerizer import (
    original_numerize_numerals,
    patched_numerize_numerals,
    smart_numerize
)
```

#### Functions

``` python
def patched_numerize_numerals(
    s,  # TODO: Add type hint and description
    ignore=None,  # TODO: Add type hint and description
    bias=None  # TODO: Add type hint and description
): # TODO: Add type hint
    "Patched version that doesn't convert 'a' to '1'"
```

``` python
def smart_numerize(
    text  # TODO: Add type hint and description
): # TODO: Add type hint
    "TODO: Add function description"
```

### postprocessing (`postprocessing.ipynb`)

> Post-process transcription text.

#### Import

``` python
from cjm_transcription_utils.postprocessing import (
    replace_integers_in_string,
    transcription_post_processing
)
```

#### Functions

``` python
def replace_integers_in_string(
    text  # TODO: Add type hint and description
): # TODO: Add type hint
    "TODO: Add function description"
```

``` python
def transcription_post_processing(
    transcript: str  # Transcript text to post-process
) -> str:  # Post-processed transcript
    "Apply post-processing to a transcript"
```

### pydub (`pydub.ipynb`)

> Extract audio segments from `AudioSegment` objects.

#### Import

``` python
from cjm_transcription_utils.pydub import (
    get_audio_segment
)
```

#### Functions

``` python
def get_audio_segment(
    audio: AudioSegment,  # Source audio
    start: float,  # Start time of the segment
    end: float,  # End time of the segment
    offset: float=0  # TODO: Add description
) -> AudioSegment:  # Audio between the start and end times
    "Extract audio segment between start and end times"
```
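pydub slices an `AudioSegment` by milliseconds (`audio[start_ms:end_ms]`), so a helper like this typically converts its time arguments first. The sketch below shows only that conversion. It assumes `start` and `end` are given in seconds and that `offset` shifts both bounds, neither of which the README confirms, and the name `segment_bounds_ms_sketch` is hypothetical.

```python
from typing import Tuple


def segment_bounds_ms_sketch(
    start: float,  # Assumed to be in seconds
    end: float,  # Assumed to be in seconds
    offset: float = 0.0,  # Assumed to shift both bounds, in seconds
) -> Tuple[int, int]:
    """Convert second-based bounds to the millisecond indices pydub slices by."""
    start_ms = max(0, round((start + offset) * 1000))
    # Never let the end fall before the start.
    end_ms = max(start_ms, round((end + offset) * 1000))
    return start_ms, end_ms


start_ms, end_ms = segment_bounds_ms_sketch(1.5, 4.0)
print(start_ms, end_ms)  # 1500 4000
# With pydub installed, the slice itself would be: audio[start_ms:end_ms]
```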

### silero vad (`silero_vad.ipynb`)

> Load audio and prepare VAD timestamps.

#### Import

``` python
from cjm_transcription_utils.silero_vad import (
    prepare_audio_and_vad
)
```

#### Functions

``` python
def prepare_audio_and_vad(
    audio_path: str,  # Path to audio file
    max_chunk_seconds: float,  # Maximum chunk duration in seconds
    max_silence_threshold: float,  # Maximum silence duration before creating a new chunk
    include_timestamps: bool,  # Whether timestamps will be needed
    verbose: bool = True  # Whether to print progress
)
    "Load audio and prepare VAD timestamps if needed."
```

### timestamp alignment (`timestamp_alignment.ipynb`)

> Align speech timestamps to a final transcript.

#### Import

``` python
from cjm_transcription_utils.timestamp_alignment import (
    TranscriptAligner,
    align_timestamps_to_transcript
)
```

#### Functions

``` python
def align_timestamps_to_transcript(
    final_transcript: str,  # The final merged transcript
    timestamp_transcripts: List[str],  # List of transcripts for each timestamp segment
    speech_timestamps: List[Dict],  # List of speech timestamp dictionaries
    verbose: bool = True  # Whether to print alignment details
) -> List[Dict]
    "Align timestamp segments to the final transcript."
```

#### Classes

``` python
class TranscriptAligner:
    "Align segment-level timestamps to a full, correct transcript"
    
    def __init__(self,
                 correct_transcript: str, # The full, correct transcript text
                 segment_transcripts: List[str], # List of individual segment transcriptions (may have errors)
                 timestamps: List[Dict[str, float]], # List of timestamp dictionaries with 'start' and 'end' keys
                 confidence_threshold: int = 70 # Minimum confidence score to accept an alignment
                )
        "Initialize the transcript aligner with complete coverage and correction mechanisms."
    
    def align_timestamps_to_correct_transcript(
            self
        ) -> List[Dict]:  # Timestamp dictionaries aligned to the correct transcript
        "Align timestamps to the correct transcript with optional corrections."
```
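The README does not say how alignment or the confidence score is computed. As a rough sketch of the idea, the example below walks through the correct transcript in windows sized to each segment, scores each window against the noisy segment text with `difflib.SequenceMatcher` on a 0 to 100 scale, and keeps the corrected text only when the score meets the threshold. The scoring method, the fixed-window walk, the output keys, and the name `align_segments_sketch` are all assumptions, not the package's behaviour.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def align_segments_sketch(
    correct_transcript: str,  # The full, correct transcript text
    segment_transcripts: List[str],  # Per-segment transcriptions (may have errors)
    timestamps: List[Dict[str, float]],  # Dicts with 'start' and 'end' keys
    confidence_threshold: int = 70,  # Minimum score (0-100) to accept an alignment
) -> List[Dict]:
    """Attach corrected text to each timestamp when the match is good enough."""
    words = correct_transcript.split()
    aligned: List[Dict] = []
    cursor = 0
    for text, ts in zip(segment_transcripts, timestamps):
        # Assume each segment covers as many words as its noisy transcription.
        n_words = max(1, len(text.split()))
        window = " ".join(words[cursor:cursor + n_words])
        ratio = SequenceMatcher(None, text.lower(), window.lower()).ratio()
        score = round(100 * ratio)
        aligned.append({
            "start": ts["start"],
            "end": ts["end"],
            # Fall back to the noisy text when the match is below the threshold.
            "text": window if score >= confidence_threshold else text,
            "confidence": score,
        })
        cursor += n_words
    return aligned


result = align_segments_sketch(
    "the quick brown fox jumps over the lazy dog",
    ["the quikc brown fox", "jumps over teh lazy dog"],
    [{"start": 0.0, "end": 1.8}, {"start": 2.1, "end": 4.0}],
)
for item in result:
    print(item["start"], item["end"], item["text"])
```

A real aligner would also need to handle segments whose word counts differ from the correct transcript, which this fixed-window walk does not attempt.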

            

        