# cjm-transcription-utils
<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->
## Install
``` bash
pip install cjm_transcription_utils
```
## Project Structure
    nbs/
    ├── chunking.ipynb # Fill in a module description here
    ├── formatting.ipynb # Fill in a module description here
    ├── librosa.ipynb # Fill in a module description here
    ├── numerizer.ipynb # Fill in a module description here
    ├── postprocessing.ipynb # Fill in a module description here
    ├── pydub.ipynb # Fill in a module description here
    ├── silero_vad.ipynb # Fill in a module description here
    └── timestamp_alignment.ipynb # Fill in a module description here
Total: 8 notebooks
## Module Dependencies
``` mermaid
graph LR
chunking[chunking<br/>chunking]
formatting[formatting<br/>formatting]
librosa[librosa<br/>librosa]
numerizer[numerizer<br/>numerizer]
postprocessing[postprocessing<br/>postprocessing]
pydub[pydub<br/>pydub]
silero_vad[silero_vad<br/>silero vad]
timestamp_alignment[timestamp_alignment<br/>timestamp alignment]
silero_vad --> chunking
silero_vad --> librosa
```
*2 cross-module dependencies detected*
## CLI Reference
No CLI commands found in this project.
## Module Overview
Detailed documentation for each module in the project:
### chunking (`chunking.ipynb`)
> Split audio into chunks using VAD speech timestamps, generate overlapping intermediate chunks, and merge the resulting transcripts.
#### Import
``` python
from cjm_transcription_utils.chunking import (
get_extended_timestamp_boundaries,
get_extended_chunk_boundaries,
generate_chunks_with_vad,
generate_intermediate_chunks,
generate_intermediate_chunk_tuples,
merge_transcripts_with_overlaps
)
```
#### Functions
``` python
def get_extended_timestamp_boundaries(
timestamps: List[Dict[str, float]],
index: int # Index of the current timestamp
) -> Tuple[float, float]
"Get extended boundaries for a timestamp using adjacent timestamps."
```
``` python
def get_extended_chunk_boundaries(
chunks: List[Tuple[float, float]],
index: int # Index of the current chunk
) -> Tuple[float, float]
"Get extended boundaries for a chunk using adjacent chunks."
```
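The package does not document what "extended" means for a boundary. The following is a minimal sketch under one plausible reading: a chunk is widened to reach the end of its left neighbour and the start of its right neighbour. The function name `sketch_extended_chunk_boundaries` is hypothetical and is not part of `cjm_transcription_utils`.

```python
from typing import List, Tuple


def sketch_extended_chunk_boundaries(
    chunks: List[Tuple[float, float]],  # (start, end) pairs in seconds, sorted by start
    index: int,                         # index of the chunk to extend
) -> Tuple[float, float]:
    """Hypothetical helper: widen a chunk into the gaps next to its neighbours."""
    start, end = chunks[index]
    if index > 0:
        # Assumption: extend backwards to where the previous chunk ends.
        start = chunks[index - 1][1]
    if index < len(chunks) - 1:
        # Assumption: extend forwards to where the next chunk starts.
        end = chunks[index + 1][0]
    return start, end


chunks = [(0.0, 4.0), (5.5, 9.0), (10.0, 12.0)]
extended = sketch_extended_chunk_boundaries(chunks, 1)
print(extended)
```

The first and last chunks have only one neighbour, so only one side is extended for them.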
``` python
def generate_chunks_with_vad(
audio_array: np.ndarray, # Audio array
duration: float, # Total duration of audio in seconds
max_chunk_seconds: float = 120, # Maximum chunk duration in seconds
max_chunk_seconds_offset: float = 0, # Offset for chunk duration calculation
speech_timestamps: Optional[List[Dict]] = None, # List of speech timestamp dictionaries with 'start' and 'end' keys
max_silence_threshold: float = 2.0 # Maximum silence duration (in seconds) before creating a new chunk
) -> Tuple[List[Tuple[float, float]], List[List[Dict]]]
"Generate chunks using VAD timestamps with silence-based splitting"
```
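To illustrate silence-based splitting, here is a minimal sketch that groups speech timestamps into chunks. It is not the package's implementation: `sketch_chunks_from_vad` is a hypothetical name, the `'start'` and `'end'` values are assumed to be in seconds, and the `audio_array`, `duration`, and `max_chunk_seconds_offset` parameters are left out.

```python
from typing import Dict, List, Tuple


def sketch_chunks_from_vad(
    speech_timestamps: List[Dict[str, float]],  # assumed: 'start'/'end' in seconds
    max_chunk_seconds: float = 120,
    max_silence_threshold: float = 2.0,
) -> Tuple[List[Tuple[float, float]], List[List[Dict[str, float]]]]:
    """Hypothetical sketch: start a new chunk after a long silence or when a chunk grows too long."""
    chunks: List[Tuple[float, float]] = []
    chunk_timestamps: List[List[Dict[str, float]]] = []
    current: List[Dict[str, float]] = []

    for ts in speech_timestamps:
        if current:
            silence = ts["start"] - current[-1]["end"]
            too_long = ts["end"] - current[0]["start"] > max_chunk_seconds
            if silence > max_silence_threshold or too_long:
                # Close the running chunk before adding this timestamp.
                chunks.append((current[0]["start"], current[-1]["end"]))
                chunk_timestamps.append(current)
                current = []
        current.append(ts)

    if current:
        # Flush the final chunk.
        chunks.append((current[0]["start"], current[-1]["end"]))
        chunk_timestamps.append(current)
    return chunks, chunk_timestamps


timestamps = [
    {"start": 0.0, "end": 1.5},
    {"start": 2.0, "end": 3.0},
    {"start": 6.0, "end": 7.0},  # 3.0 s of silence before this one
    {"start": 7.5, "end": 9.0},
]
chunks, per_chunk = sketch_chunks_from_vad(timestamps)
print(chunks)
```

The return shape mirrors the documented signature: a list of `(start, end)` chunks and, for each chunk, the timestamps it contains.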
``` python
def generate_intermediate_chunks(
    chunks: List[Tuple[float, float]],
    chunk_timestamps: List[List[Dict]], # List of timestamp dictionaries for each chunk
    use_extended_boundaries: bool # Whether to use extended boundaries
) -> List[Tuple[float, float]] # List of intermediate (start, end) chunks
    "Generate overlapping chunks between consecutive chunk boundaries"
```
``` python
def generate_intermediate_chunk_tuples(
    chunks: List[Tuple[float, float]],
    chunk_timestamps: List[List[Dict]], # List of timestamp dictionaries for each chunk
    use_extended_boundaries: bool # Whether to use extended boundaries
) -> List[Tuple[Dict, Dict]]
    "Generate tuples of (last_timestamp, first_timestamp) from consecutive chunks."
```
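The two functions above pair the last timestamp of one chunk with the first timestamp of the next. A minimal sketch of that idea follows, assuming the intermediate chunk spans from the start of the last timestamp to the end of the first one. `sketch_intermediate_chunks` is a hypothetical name, and the `use_extended_boundaries` option is not modelled.

```python
from typing import Dict, List, Tuple


def sketch_intermediate_chunks(
    chunk_timestamps: List[List[Dict[str, float]]],  # timestamps grouped per chunk
) -> List[Tuple[float, float]]:
    """Hypothetical sketch: build one bridging chunk across each chunk boundary."""
    intermediate: List[Tuple[float, float]] = []
    for prev_group, next_group in zip(chunk_timestamps, chunk_timestamps[1:]):
        if not prev_group or not next_group:
            continue  # nothing to bridge if either side has no speech
        last_ts, first_ts = prev_group[-1], next_group[0]
        # Assumption: the bridge covers the speech on both sides of the boundary.
        intermediate.append((last_ts["start"], first_ts["end"]))
    return intermediate


groups = [
    [{"start": 0.0, "end": 1.5}, {"start": 2.0, "end": 3.0}],
    [{"start": 6.0, "end": 7.0}, {"start": 7.5, "end": 9.0}],
]
bridges = sketch_intermediate_chunks(groups)
print(bridges)
```

For `n` non-empty chunks this yields `n - 1` bridging chunks, each overlapping the tail of one chunk and the head of the next.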
``` python
def merge_transcripts_with_overlaps(
normal_transcripts: List[str], # List of transcripts for normal chunks
intermediate_transcripts: List[str], # List of transcripts for intermediate chunks
segment_transcripts: List[Tuple[str, str]],
verbose: bool = True # Whether to print debug information
) -> str
"Merge normal and intermediate transcripts with overlap correction"
```
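The merging strategy used by `merge_transcripts_with_overlaps` is not described here. As a rough illustration of overlap correction in general, the sketch below joins two transcripts by dropping the longest run of words that ends the first and begins the second. This is a simpler, swapped-in technique, and `sketch_merge_with_overlap` is a hypothetical name.

```python
def sketch_merge_with_overlap(left: str, right: str) -> str:
    """Hypothetical sketch: join two transcripts, removing a duplicated word run at the seam."""
    left_words = left.split()
    right_words = right.split()
    max_overlap = min(len(left_words), len(right_words))
    # Try the longest possible overlap first, then shrink.
    for size in range(max_overlap, 0, -1):
        if left_words[-size:] == right_words[:size]:
            return " ".join(left_words + right_words[size:])
    # No shared words at the seam: concatenate as-is.
    return " ".join(left_words + right_words)


merged = sketch_merge_with_overlap("the quick brown fox", "brown fox jumps over")
print(merged)
```

Exact word matching is brittle for real transcripts, where casing and punctuation often differ between overlapping chunks, so a practical version would likely normalize or fuzzy-match words first.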
### formatting (`formatting.ipynb`)
> Convert time intervals in seconds to HMS timestamp ranges.
#### Import
``` python
from cjm_transcription_utils.formatting import (
time_interval_to_hms_range
)
```
#### Functions
``` python
def time_interval_to_hms_range(
duration_tuple # A tuple of (start_seconds, end_seconds) as floats
)
"Convert a time interval tuple (start_seconds, end_seconds) to HMS timestamp range format."
```
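The exact output format of `time_interval_to_hms_range` is not shown in this README. The sketch below assumes a zero-padded `HH:MM:SS - HH:MM:SS` string with fractional seconds truncated. Both function names are hypothetical.

```python
from typing import Tuple


def sketch_seconds_to_hms(seconds: float) -> str:
    """Hypothetical helper: format seconds as HH:MM:SS, truncating fractions."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def sketch_interval_to_hms_range(interval: Tuple[float, float]) -> str:
    """Hypothetical sketch: format a (start_seconds, end_seconds) tuple as a range."""
    start, end = interval
    # Assumption: the two timestamps are joined with ' - '.
    return f"{sketch_seconds_to_hms(start)} - {sketch_seconds_to_hms(end)}"


hms_range = sketch_interval_to_hms_range((75.0, 3725.5))
print(hms_range)
```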
### librosa (`librosa.ipynb`)
> Load and normalize audio files.
#### Import
``` python
from cjm_transcription_utils.librosa import (
load_audio
)
```
#### Functions
``` python
def load_audio(
    audio_path: str, # Path to the audio file
    target_sr: int = 16000 # Target sample rate in Hz
) -> Tuple[np.ndarray, int] # Audio array and its sample rate
    "Load and normalize audio file"
```
### numerizer (`numerizer.ipynb`)
> Number-word conversion helpers, including a patched numerizer that does not convert 'a' to '1'.
#### Import
``` python
from cjm_transcription_utils.numerizer import (
original_numerize_numerals,
patched_numerize_numerals,
smart_numerize
)
```
#### Functions
``` python
def patched_numerize_numerals(
s, # TODO: Add type hint and description
ignore=None, # TODO: Add type hint and description
bias=None # TODO: Add type hint and description
): # TODO: Add type hint
"Patched version that doesn't convert 'a' to '1'"
```
``` python
def smart_numerize(
text # TODO: Add type hint and description
): # TODO: Add type hint
"TODO: Add function description"
```
### postprocessing (`postprocessing.ipynb`)
> Fill in a module description here
#### Import
``` python
from cjm_transcription_utils.postprocessing import (
replace_integers_in_string,
transcription_post_processing
)
```
#### Functions
``` python
def replace_integers_in_string(
text # TODO: Add type hint and description
): # TODO: Add type hint
"TODO: Add function description"
```
``` python
def transcription_post_processing(
transcript:str # TODO: Add description
)->str: # TODO: Add return description
"TODO: Add function description"
```
### pydub (`pydub.ipynb`)
> Extract audio segments between start and end times.
#### Import
``` python
from cjm_transcription_utils.pydub import (
get_audio_segment
)
```
#### Functions
``` python
def get_audio_segment(
audio: AudioSegment, # TODO: Add description
start: float, # TODO: Add description
end: float, # TODO: Add description
offset: float=0 # TODO: Add description
) -> AudioSegment: # TODO: Add return description
"Extract audio segment between start and end times"
```
### silero vad (`silero_vad.ipynb`)
> Load audio and prepare VAD timestamps.
#### Import
``` python
from cjm_transcription_utils.silero_vad import (
prepare_audio_and_vad
)
```
#### Functions
``` python
def prepare_audio_and_vad(
audio_path: str, # Path to audio file
max_chunk_seconds: float, # Maximum chunk duration in seconds
max_silence_threshold: float, # Maximum silence duration before creating a new chunk
include_timestamps: bool, # Whether timestamps will be needed
verbose: bool = True # Whether to print progress
)
"Load audio and prepare VAD timestamps if needed."
```
### timestamp alignment (`timestamp_alignment.ipynb`)
> Align timestamp segments to a final transcript.
#### Import
``` python
from cjm_transcription_utils.timestamp_alignment import (
TranscriptAligner,
align_timestamps_to_transcript
)
```
#### Functions
``` python
def align_timestamps_to_transcript(
final_transcript: str, # The final merged transcript
timestamp_transcripts: List[str], # List of transcripts for each timestamp segment
speech_timestamps: List[Dict], # List of speech timestamp dictionaries
verbose: bool = True # Whether to print alignment details
) -> List[Dict]
"Align timestamp segments to the final transcript."
```
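To give a feel for what aligning segments to a final transcript involves, here is a minimal sketch that walks through the final transcript in order and scores each segment against a same-length word window. It uses `difflib.SequenceMatcher` from the standard library as a stand-in scorer; the package's actual scoring method, the 0-100 score scale, and the `'text'` and `'score'` output keys are all assumptions, and `sketch_align_segments` is a hypothetical name.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def sketch_align_segments(
    final_transcript: str,
    segment_transcripts: List[str],
    speech_timestamps: List[Dict[str, float]],
    confidence_threshold: float = 70,  # assumed to be on a 0-100 scale
) -> List[Dict]:
    """Hypothetical sketch: attach corrected text from the final transcript to each timestamp."""
    words = final_transcript.split()
    aligned: List[Dict] = []
    cursor = 0  # position of the next unconsumed word in the final transcript
    for segment, ts in zip(segment_transcripts, speech_timestamps):
        seg_len = max(1, len(segment.split()))
        window = " ".join(words[cursor:cursor + seg_len])
        score = 100 * SequenceMatcher(None, segment.lower(), window.lower()).ratio()
        # Keep the corrected wording only when the match is convincing.
        text = window if score >= confidence_threshold else segment
        aligned.append({"start": ts["start"], "end": ts["end"], "text": text, "score": score})
        cursor += seg_len
    return aligned


aligned = sketch_align_segments(
    "Hello world, this is a test.",
    ["hello word", "this is a test"],
    [{"start": 0.0, "end": 1.0}, {"start": 1.5, "end": 3.0}],
)
for item in aligned:
    print(item["start"], item["end"], item["text"])
```

This sketch assumes each segment has the same word count as its counterpart in the final transcript, which will not hold when words are inserted or dropped; handling that drift is presumably what a fuller aligner has to solve.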
#### Classes
``` python
class TranscriptAligner:
    def __init__(self,
        correct_transcript: str, # The full, correct transcript text
        segment_transcripts: List[str], # List of individual segment transcriptions (may have errors)
        timestamps: List[Dict[str, float]], # List of timestamp dictionaries with 'start' and 'end' keys
        confidence_threshold: int = 70 # Minimum confidence score to accept an alignment
    )
    "TODO: Add class description"

    def __init__(self,
        correct_transcript: str, # The full, correct transcript text
        segment_transcripts: List[str], # List of individual segment transcriptions (may have errors)
        timestamps: List[Dict[str, float]], # List of timestamp dictionaries with 'start' and 'end' keys
        confidence_threshold: int = 70 # Minimum confidence score to accept an alignment
    )
        "Initialize the transcript aligner with complete coverage and correction mechanisms."

    def align_timestamps_to_correct_transcript(
        self
    ) -> List[Dict]: # TODO: Add return description
        "Align timestamps to the correct transcript with optional corrections."
```
Raw data
{
"_id": null,
"home_page": "https://github.com/cj-mills/cjm-transcription-utils",
"name": "cjm-transcription-utils",
"maintainer": null,
"docs_url": null,
"requires_python": ">=3.11",
"maintainer_email": null,
"keywords": "nbdev jupyter notebook python",
"author": "Christian J. Mills",
"author_email": "9126128+cj-mills@users.noreply.github.com",
"download_url": "https://files.pythonhosted.org/packages/16/28/f919ac3ff70ce70c20a2172876b25d4afbe99452ffa32e45eab17da530ca/cjm_transcription_utils-0.0.2.tar.gz",
"platform": null,
"description": "# cjm-transcription-utils\n\n\n<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->\n\n## Install\n\n``` bash\npip install cjm_transcription_utils\n```\n\n## Project Structure\n\n nbs/\n \u251c\u2500\u2500 chunking.ipynb # Fill in a module description here\n \u251c\u2500\u2500 formatting.ipynb # Fill in a module description here\n \u251c\u2500\u2500 librosa.ipynb # Fill in a module description here\n \u251c\u2500\u2500 numerizer.ipynb # Fill in a module description here\n \u251c\u2500\u2500 postprocessing.ipynb # Fill in a module description here\n \u251c\u2500\u2500 pydub.ipynb # Fill in a module description here\n \u251c\u2500\u2500 silero_vad.ipynb # Fill in a module description here\n \u2514\u2500\u2500 timestamp_alignment.ipynb # Fill in a module description here\n\nTotal: 8 notebooks\n\n## Module Dependencies\n\n``` mermaid\ngraph LR\n chunking[chunking<br/>chunking]\n formatting[formatting<br/>formatting]\n librosa[librosa<br/>librosa]\n numerizer[numerizer<br/>numerizer]\n postprocessing[postprocessing<br/>postprocessing]\n pydub[pydub<br/>pydub]\n silero_vad[silero_vad<br/>silero vad]\n timestamp_alignment[timestamp_alignment<br/>timestamp alignment]\n\n silero_vad --> chunking\n silero_vad --> librosa\n```\n\n*2 cross-module dependencies detected*\n\n## CLI Reference\n\nNo CLI commands found in this project.\n\n## Module Overview\n\nDetailed documentation for each module in the project:\n\n### chunking (`chunking.ipynb`)\n\n> Fill in a module description here\n\n#### Import\n\n``` python\nfrom cjm_transcription_utils.chunking import (\n get_extended_timestamp_boundaries,\n get_extended_chunk_boundaries,\n generate_chunks_with_vad,\n generate_intermediate_chunks,\n generate_intermediate_chunk_tuples,\n merge_transcripts_with_overlaps\n)\n```\n\n#### Functions\n\n``` python\ndef get_extended_timestamp_boundaries(\n timestamps: List[Dict[str, float]], \n index: int # Index of the current timestamp\n) -> Tuple[float, float]\n \"Get 
extended boundaries for a timestamp using adjacent timestamps.\"\n```\n\n``` python\ndef get_extended_chunk_boundaries(\n chunks: List[Tuple[float, float]], \n index: int # Index of the current chunk\n) -> Tuple[float, float]\n \"Get extended boundaries for a chunk using adjacent chunks.\"\n```\n\n``` python\ndef generate_chunks_with_vad(\n audio_array: np.ndarray, # Audio array\n duration: float, # Total duration of audio in seconds\n max_chunk_seconds: float = 120, # Maximum chunk duration in seconds\n max_chunk_seconds_offset: float = 0, # Offset for chunk duration calculation\n speech_timestamps: Optional[List[Dict]] = None, # List of speech timestamp dictionaries with 'start' and 'end' keys\n max_silence_threshold: float = 2.0 # Maximum silence duration (in seconds) before creating a new chunk\n) -> Tuple[List[Tuple[float, float]], List[List[Dict]]]\n \"Generate chunks using VAD timestamps with silence-based splitting\"\n```\n\n``` python\ndef generate_intermediate_chunks(\n chunks: List[Tuple[float, float]],\n chunk_timestamps: List[List[Dict]], # TODO: Add description\n use_extended_boundaries:bool # TODO: Add description\n \n) -> List[Tuple[float, float]]: # TODO: Add return description\n \"Generate overlapping chunks between consecutive chunk boundaries\"\n```\n\n``` python\ndef generate_intermediate_chunk_tuples(\n chunks: List[Tuple[float, float]],\n chunk_timestamps: List[List[Dict]], # List of timestamp dictionaries for each chunk\n use_extended_boundaries:bool # TODO: Add description\n) -> List[Tuple[Dict, Dict]]\n \"Generate tuples of (last_timestamp, first_timestamp) from consecutive chunks.\"\n```\n\n``` python\ndef merge_transcripts_with_overlaps(\n normal_transcripts: List[str], # List of transcripts for normal chunks\n intermediate_transcripts: List[str], # List of transcripts for intermediate chunks\n segment_transcripts: List[Tuple[str, str]],\n verbose: bool = True # Whether to print debug information\n) -> str\n \"Merge normal and 
intermediate transcripts with overlap correction\"\n```\n\n### formatting (`formatting.ipynb`)\n\n> Fill in a module description here\n\n#### Import\n\n``` python\nfrom cjm_transcription_utils.formatting import (\n time_interval_to_hms_range\n)\n```\n\n#### Functions\n\n``` python\ndef time_interval_to_hms_range(\n duration_tuple # A tuple of (start_seconds, end_seconds) as floats\n)\n \"Convert a time interval tuple (start_seconds, end_seconds) to HMS timestamp range format.\"\n```\n\n### librosa (`librosa.ipynb`)\n\n> Fill in a module description here\n\n#### Import\n\n``` python\nfrom cjm_transcription_utils.librosa import (\n load_audio\n)\n```\n\n#### Functions\n\n``` python\ndef load_audio(\n audio_path: str, # TODO: Add description\n target_sr: int = 16000 # TODO: Add description\n) -> Tuple[np.ndarray, int]: # TODO: Add return description\n \"Load and normalize audio file\"\n```\n\n### numerizer (`numerizer.ipynb`)\n\n> Fill in a module description here\n\n#### Import\n\n``` python\nfrom cjm_transcription_utils.numerizer import (\n original_numerize_numerals,\n patched_numerize_numerals,\n smart_numerize\n)\n```\n\n#### Functions\n\n``` python\ndef patched_numerize_numerals(\n s, # TODO: Add type hint and description\n ignore=None, # TODO: Add type hint and description\n bias=None # TODO: Add type hint and description\n): # TODO: Add type hint\n \"Patched version that doesn't convert 'a' to '1'\"\n```\n\n``` python\ndef smart_numerize(\n text # TODO: Add type hint and description\n): # TODO: Add type hint\n \"TODO: Add function description\"\n```\n\n### postprocessing (`postprocessing.ipynb`)\n\n> Fill in a module description here\n\n#### Import\n\n``` python\nfrom cjm_transcription_utils.postprocessing import (\n replace_integers_in_string,\n transcription_post_processing\n)\n```\n\n#### Functions\n\n``` python\ndef replace_integers_in_string(\n text # TODO: Add type hint and description\n): # TODO: Add type hint\n \"TODO: Add function 
description\"\n```\n\n``` python\ndef transcription_post_processing(\n transcript:str # TODO: Add description\n)->str: # TODO: Add return description\n \"TODO: Add function description\"\n```\n\n### pydub (`pydub.ipynb`)\n\n> Fill in a module description here\n\n#### Import\n\n``` python\nfrom cjm_transcription_utils.pydub import (\n get_audio_segment\n)\n```\n\n#### Functions\n\n``` python\ndef get_audio_segment(\n audio: AudioSegment, # TODO: Add description\n start: float, # TODO: Add description\n end: float, # TODO: Add description\n offset: float=0 # TODO: Add description\n) -> AudioSegment: # TODO: Add return description\n \"Extract audio segment between start and end times\"\n```\n\n### silero vad (`silero_vad.ipynb`)\n\n> Fill in a module description here\n\n#### Import\n\n``` python\nfrom cjm_transcription_utils.silero_vad import (\n prepare_audio_and_vad\n)\n```\n\n#### Functions\n\n``` python\ndef prepare_audio_and_vad(\n audio_path: str, # Path to audio file\n max_chunk_seconds: float, # Maximum chunk duration in seconds\n max_silence_threshold: float, # Maximum silence duration before creating a new chunk\n include_timestamps: bool, # Whether timestamps will be needed\n verbose: bool = True # Whether to print progress\n)\n \"Load audio and prepare VAD timestamps if needed.\"\n```\n\n### timestamp alignment (`timestamp_alignment.ipynb`)\n\n> Fill in a module description here\n\n#### Import\n\n``` python\nfrom cjm_transcription_utils.timestamp_alignment import (\n TranscriptAligner,\n align_timestamps_to_transcript\n)\n```\n\n#### Functions\n\n``` python\ndef align_timestamps_to_transcript(\n final_transcript: str, # The final merged transcript\n timestamp_transcripts: List[str], # List of transcripts for each timestamp segment\n speech_timestamps: List[Dict], # List of speech timestamp dictionaries\n verbose: bool = True # Whether to print alignment details\n) -> List[Dict]\n \"Align timestamp segments to the final transcript.\"\n```\n\n#### 
Classes\n\n``` python\nclass TranscriptAligner:\n def __init__(self, \n correct_transcript: str, # The full, correct transcript text\n segment_transcripts: List[str], # List of individual segment transcriptions (may have errors)\n timestamps: List[Dict[str, float]], # List of timestamp dictionaries with 'start' and 'end' keys\n confidence_threshold: int = 70 # Minimum confidence score to accept an alignment\n )\n \"TODO: Add class description\"\n \n def __init__(self,\n correct_transcript: str, # The full, correct transcript text\n segment_transcripts: List[str], # List of individual segment transcriptions (may have errors)\n timestamps: List[Dict[str, float]], # List of timestamp dictionaries with 'start' and 'end' keys\n confidence_threshold: int = 70 # Minimum confidence score to accept an alignment\n )\n \"Initialize the transcript aligner with complete coverage and correction mechanisms.\"\n \n def align_timestamps_to_correct_transcript(\n self\n ) -> List[Dict]: # TODO: Add return description\n \"Align timestamps to the correct transcript with optional corrections.\"\n```\n",
"bugtrack_url": null,
"license": "Apache Software License 2.0",
"summary": "Miscellaneous utilities for helping with audio transcription.",
"version": "0.0.2",
"project_urls": {
"Homepage": "https://github.com/cj-mills/cjm-transcription-utils"
},
"split_keywords": [
"nbdev",
"jupyter",
"notebook",
"python"
],
"urls": [
{
"comment_text": null,
"digests": {
"blake2b_256": "a288c4c0fbd741f325b1c756fd9c2a49575aa1126c57d62fcbdfcf5874e26191",
"md5": "d8b0e8ebe1e57723da4c6ab69783cbb5",
"sha256": "f27c685f651cbc656827ecf4dbd26b776fd1a33054aa220ead1780ad49727373"
},
"downloads": -1,
"filename": "cjm_transcription_utils-0.0.2-py3-none-any.whl",
"has_sig": false,
"md5_digest": "d8b0e8ebe1e57723da4c6ab69783cbb5",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.11",
"size": 20794,
"upload_time": "2025-09-11T02:58:16",
"upload_time_iso_8601": "2025-09-11T02:58:16.355684Z",
"url": "https://files.pythonhosted.org/packages/a2/88/c4c0fbd741f325b1c756fd9c2a49575aa1126c57d62fcbdfcf5874e26191/cjm_transcription_utils-0.0.2-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": null,
"digests": {
"blake2b_256": "1628f919ac3ff70ce70c20a2172876b25d4afbe99452ffa32e45eab17da530ca",
"md5": "4407d715f39fbffb9150b5b3117e6be0",
"sha256": "6cf035a40c1df9c442724d86e0689d9b604c494771b85ea9df3c883f3b7fb790"
},
"downloads": -1,
"filename": "cjm_transcription_utils-0.0.2.tar.gz",
"has_sig": false,
"md5_digest": "4407d715f39fbffb9150b5b3117e6be0",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.11",
"size": 19739,
"upload_time": "2025-09-11T02:58:17",
"upload_time_iso_8601": "2025-09-11T02:58:17.587898Z",
"url": "https://files.pythonhosted.org/packages/16/28/f919ac3ff70ce70c20a2172876b25d4afbe99452ffa32e45eab17da530ca/cjm_transcription_utils-0.0.2.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2025-09-11 02:58:17",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "cj-mills",
"github_project": "cjm-transcription-utils",
"github_not_found": true,
"lcname": "cjm-transcription-utils"
}