trsproc


| Field | Value |
| --- | --- |
| Name | trsproc |
| Version | 2.0.0 |
| Summary | A Python library to process Transcriber TRS files |
| Upload time | 2024-03-13 14:20:03 |
| Requires Python | >=3.6 |
| License | MIT License, Copyright (c) 2024 ELDA/ELRA |
| Keywords | python, transcriber, trs, transcription, textgrid, nlp |
| Requirements | praat-parselmouth, praat-textgrids, rich, tomli |

# README

![GitHub Tag](https://img.shields.io/github/v/tag/ELDAELRA/trsproc)

*trsproc* is a Python module allowing multiple operations and automatic processing of TRS files from [Transcriber](https://sourceforge.net/projects/trans/ "Download link").

Python 3.6 or later must be installed beforehand. Install *trsproc* with pip; the project can also be forked on GitHub.

```
pip install trsproc
```

## USAGE FROM THE COMMAND LINE

*trsproc* may be called directly from the terminal; by default, it applies the operation selected by the given flag to the current directory.

### OPTIONAL ARGUMENTS

Some optional arguments are available for advanced processing.

```
trsproc flag [-option [option_argument_if_needed]]
```

**NB** `-h` produces a help summary including the possible arguments and the links to the documentation.
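
As a hedged illustration of the synopsis above, the sketch below assembles a few invocations as argument lists, using only flags and options documented in this README. The helper `build_command` and the folder path are hypothetical and are not part of *trsproc*; the commented-out `subprocess.run` call shows how such a list could be executed once *trsproc* is installed.

```python
import shlex

def build_command(flag, folder=None, audio=None, punct=False):
    """Assemble a trsproc command line as a list of arguments (hypothetical helper)."""
    cmd = ["trsproc", flag]
    if folder is not None:
        cmd += ["-f", folder]   # target a directory other than the current one
    if audio is not None:
        cmd += ["-a", audio]    # audio format, when it is not WAV
    if punct:
        cmd.append("-punct")    # remove punctuation from the resulting txt files
    return cmd

# Example: run the `txt` flag on a given folder, without punctuation.
cmd = build_command("txt", folder="corpus/session01", punct=True)
print(shlex.join(cmd))
# import subprocess; subprocess.run(cmd, check=True)  # requires trsproc on the PATH
```

Keeping the command as a list avoids shell quoting problems when the folder path contains spaces.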

* `-a` or `--audio` followed by the audio format used for the audio data corresponding to the input TRS if it is different from WAV.

* `-cl` or `--correctionlevel` followed by the number corresponding to the correction applied to the original text by ELDA's internal script lexicalproc:
  * 0, no corrections;
  * 1, custom spelling corrections (_csp);
  * 2, automatic spelling corrections (_sp);
  * 3, automatic grammatical corrections (_gram);
  * 12, custom and automatic spelling corrections (_csp_sp);
  * 13, custom spelling and grammatical corrections (_csp_gram);
  * 23, automatic spelling and grammatical corrections (_sp_gram);
  * 123, custom spelling, automatic spelling and grammatical corrections (_csp_sp_gram).

* `-f` or `--folder` followed by a path targets the specified directory instead of the current one.

* `-jkz` or `--japkorzh` must be specified if the language to be processed in the input TRS does not use ASCII/Latin based characters.

* `-plh` or `--placeholder` must be specified if the processing of the `txt` flag must only produce txt files.

* `-punct` or `--punctuation` may be used to remove all punctuation from the resulting txt files. The punctuation list is given under `parser.replacingPunctuations(sentence)`.

* `-s` or `--section` followed by the alternative target section name if the processing of the `rpt` or `tmp` flag must target a section other than the default one, i.e. "report".

* `-t` or `--tag` followed by the alternative language to be added during the processing of the `lang` flag.
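
The correction levels listed for `-cl` read as a digit-by-digit encoding: each digit selects one correction, and the label in parentheses is the concatenation of the corresponding fragments. The sketch below reproduces that mapping. It is an illustration only: `correction_suffix` is a hypothetical helper, and how *trsproc* or lexicalproc actually use these labels is not stated here.

```python
# Label fragments as listed for the -cl option; the helper itself is hypothetical.
SUFFIX_BY_DIGIT = {"1": "_csp", "2": "_sp", "3": "_gram"}

def correction_suffix(level):
    """Return the label implied by a correction level such as 0, 2, 13 or 123."""
    if level == 0:
        return ""  # level 0 means no corrections, hence no label
    return "".join(SUFFIX_BY_DIGIT[digit] for digit in str(level))

for level in (0, 1, 2, 3, 12, 13, 23, 123):
    print(level, repr(correction_suffix(level)))
```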

### FLAGS

If the flag is incorrect, the list of available flags and their functions is printed in the console. The same list also appears if no flag is provided.

* `cne` deletes the Named Entity annotations if any are present in the input TRS.

* `crt` applies specific corrections according to the function chosen from the prompted list.
  * `turnDifferenceTRS` searches for differences in segmentation between the input TRS and its twin placed in a subfolder named "twin";
  * `trsEmptySpaceBeforeNE` adds an empty space before each NE annotation and saves the new TRS in a separate subfolder;
  * `correctionLà` replaces là with la at the end of sentences in the input txt. The `txt` flag must be run beforehand;
  * `correctionMaj` corrects misplaced capital letters.

* `ne` extracts the Named Entity annotations if any are present in the input TRS and puts them in a tabular file.

* `lang` adds a language tag to each transcription segment that lacks one in the input TRS. It also modifies the existing language tags using the language dictionary provided in JSON format, named "lang-tag.json", in the same input folder.

* `pne` pre-annotates the input TRS using the table created by the `ne` flag as a custom annotation dictionary.

* `prt` prints the parsed TRS contents directly in the console.

* `rpt` performs the operations of the `tmp` and `vsi` flags in order to obtain the basic elements for data validation. An additional report is produced with pause segments longer than 0.5s and speech segments shorter than 10s.

* `rs` calculates the minimum sample needed for the validation of the input TRS transcription and then extracts random segments (audio and text, the latter in a tabular file) according to a given quantity.

* `rsne` calculates the minimum sample needed for the validation of Named Entities of the input TRS and extracts them (audio segments and text, the latter in a tabular file) randomly by a given amount.

* `tg` converts TRS files to TextGrid files.

* `tgrs` converts TextGrid files to TRS files.

* `tmp` creates TRS-temporary files in a directory named "tmp". By default, these files contain only the "report" section(s) of the original TRS.

* `trs` rewrites a TRS file using the input txt file and a TRS-placeholder placed in a subfolder of the parent input folder. The rewritten TRS will have the content of the txt and the structure of the TRS-placeholder.

* `tsv` produces a tabular file with the structure and contents of the TRS files.

* `txt` creates txt and TRS-placeholder files. The former contains only the transcription of the original TRS; the latter retains its XML structure.

* `vad` converts TextGrid files resulting from the use of a voice activity detection algorithm (VAD) into TRS files.

* `vsi-lang` produces a tabular file containing basic information about the language tags present in the input TRS.

* `vsi` produces a tabular file containing basic lexical information and statistics concerning the input TRS.
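
The `rs` and `rsne` flags compute a minimum sample size from a population size, but this README does not say which formula is used. The sketch below is therefore an assumption, not the *trsproc* implementation: it uses Cochran's formula with a finite population correction, a 95 % confidence level (`z = 1.96`), a 5 % margin of error and `p = 0.5`. The function names and the population of 1000 segments are hypothetical.

```python
import math
import random

def minimum_sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Cochran's formula with finite population correction (assumed, not taken from trsproc)."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # sample size for an infinite population
    n = n0 / (1 + (n0 - 1) / population)        # correction for a finite population
    return min(population, math.ceil(n))        # never ask for more items than exist

def draw_sample(segment_ids, size, seed=None):
    """Pick `size` distinct segment ids at random, without replacement."""
    return sorted(random.Random(seed).sample(list(segment_ids), size))

population = 1000                                # e.g. number of transcribed segments
size = minimum_sample_size(population)
sample = draw_sample(range(1, population + 1), size, seed=0)
print(size, sample[:5])
```

Passing a `seed` makes the draw reproducible, which can help when a validation sample has to be regenerated later.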

## IMPORTING AS A MODULE

The class *TRSParser* may be imported in Python for scripting purposes using `from trsproc.parser import TRSParser`. It may be used to convert a TRS file into a Python TRSParser object.

### TRSParser class

When the class is instantiated, only the TRS file path is required. The parameters `audio_format` and `lang` may be changed from their default values if needed. `audio_format` defaults to `'wav'` and is used to find an audio file with the same name and location as the TRS. `lang` defaults to `'eu'` and is mainly used for the word count of the transcription; it can be changed to `'jkz'` to count Unicode characters instead.
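
To illustrate why the `lang` parameter matters, the sketch below contrasts the two counting modes on plain strings. It does not call *trsproc*: `count_tokens` is a hypothetical stand-in, and ignoring white space in the `'jkz'` mode is an assumption rather than documented behaviour.

```python
def count_tokens(text, lang="eu"):
    """Count words for 'eu' and characters for 'jkz' (hypothetical stand-in)."""
    if lang == "jkz":
        # Character count; white space is ignored here, which is an assumption.
        return sum(1 for char in text if not char.isspace())
    # Word count: tokens are separated by white space.
    return len(text.split())

print(count_tokens("bonjour tout le monde"))      # word count
print(count_tokens("你好世界"))                    # a single "word" without spaces
print(count_tokens("你好世界", lang="jkz"))        # character count instead
```

A text written without spaces between words collapses to a single token in the default mode, which is why a character count is more informative for such languages.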

#### Attributes

* `tree` is the parsed XML tree.

* `root` is the root of the parsed XML tree.

* `inputTRS` is the complete path to the parsed TRS file.

* `filepath` is the path to the folder of the parsed TRS file.

* `corpus` is the name of the folder where the TRS file is located.

* `filename` is the name of the parsed TRS file.

* `lang` refers to the lang parameter from the TRSParser call.

* `sectionduration` represents the sum of the duration of all the sections present in the TRS file. Returns 'Section not found' value if there is no section tag in the TRS file.

* `fileduration` represents the duration of an audio file having the same name and location of the TRS. Returns 'audio not found' value if it fails to find an audio file.

* `speakers` is a Python dictionary containing the speakers' information provided in the TRS header if any. The speaker's id is used as key, the value is a tuple of speaker's name and sex.
        
* `contents` is a Python dictionary containing all the information about the TRS' segments and the overall transcription.
  * `contents[n]` represents a segment where n is its rank in the transcription. Each segment has seven keys:
    * `contents[n]['xmin']` is the segment starting point in seconds.
    * `contents[n]['xmax']` is the segment ending point in seconds.
    * `contents[n]['duration']` is the segment duration in seconds.
    * `contents[n]['tokens']` is the number of tokens present in the segment.
    * `contents[n]['content']` is the segment transcription.
    * `contents[n]['speaker']` is the segment speaker if any.
    * `contents[n]['SNR']` is the Signal-to-noise ratio of the segment. It returns `'NA'` if the SNR computation fails.
  * `contents['NE']` is a dictionary of all the Named Entities present in the TRS if any.  
    * `contents['NE'][n]` represents a Named Entity entry where n is its rank in the transcription. Each NE has four keys:
      * `contents['NE'][n]['class']` is the Named Entity class.
      * `contents['NE'][n]['xmin']` is the Named Entity starting point in seconds.
      * `contents['NE'][n]['segmentID']` is the id of the segment where the Named Entity has been annotated.
      * `contents['NE'][n]['content']` is the transcription associated with the Named Entity.

  * `contents[0]` contains overall statistics about the TRS file.
    * `contents[0]['totalSegments']` returns the total number of segments in the TRS.
    * `contents[0]['totalWords']` returns the total number of words (separated by white space) in case of `lang='eu'`. It returns the total number of Unicode characters in case of `lang='jkz'`.
    * `contents[0]['totalNE']` returns the total number of Named Entities in the TRS if any.
    * `contents[0]['totalNonTrans']` returns the total number of nontrans tags in the TRS if any.
    * `contents[0]['totalPronPi']` returns the total number of pi tags in the TRS if any.
    * `contents[0]['totalTrans']` returns the total number of segments having an actual transcription.
    * `contents[0]['totalLang']` returns the total number of language tags in the TRS if any.
    * `contents[0]['otherLang']` contains a list of the different languages annotated in the TRS.
    * `contents[0]['duration']` returns the total duration of the segments having transcription, pi and nontrans annotations.
    * `contents[0]['durationTrans']` returns the duration of the transcribed segments.
    * `contents[0]['durationNonTrans']` returns the duration of the nontrans segments.
    * `contents[0]['durationPronPi']` returns the duration of the pi segments.
    * `contents[0]['meanSNR']` returns the mean SNR of the audio file or `'NA'` if it fails to compute it.
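
The layout of `contents` can be explored with ordinary dictionary code. The sketch below uses a hand-built stand-in with the documented keys rather than a real `TRSParser` object, and flags speech segments shorter than 10 s and pauses longer than 0.5 s, the thresholds mentioned for the `rpt` flag. The sample values are invented, and treating a segment with an empty `content` as a pause is an assumption.

```python
# Hand-built stand-in for TRSParser.contents, using only the keys documented above.
contents = {
    0: {"totalSegments": 3},
    1: {"xmin": 0.0, "xmax": 4.2, "duration": 4.2, "tokens": 5,
        "content": "bonjour à tous et bienvenue", "speaker": "spk1", "SNR": "NA"},
    2: {"xmin": 4.2, "xmax": 5.0, "duration": 0.8, "tokens": 0,
        "content": "", "speaker": "", "SNR": "NA"},
    3: {"xmin": 5.0, "xmax": 17.5, "duration": 12.5, "tokens": 2,
        "content": "merci beaucoup", "speaker": "spk1", "SNR": "NA"},
    "NE": {},
}

def segment_ids(contents):
    """Segment ranks: integer keys other than 0 (0 holds statistics, 'NE' the entities)."""
    return sorted(key for key in contents if isinstance(key, int) and key != 0)

def flag_segments(contents, max_speech=10.0, min_pause=0.5):
    """Return (short speech segments, long pauses) as two lists of segment ranks."""
    short_speech, long_pauses = [], []
    for n in segment_ids(contents):
        segment = contents[n]
        if segment["content"].strip():            # segment with a transcription
            if segment["duration"] < max_speech:
                short_speech.append(n)
        elif segment["duration"] > min_pause:     # empty segment, assumed to be a pause
            long_pauses.append(n)
    return short_speech, long_pauses

print(flag_segments(contents))
```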
    
#### Functions

* `retrieveContents(self)` is a basic function used to retrieve all the contents information from the input TRS file into a dictionary structure.

* `print(self)` prints the TRS contents in the console.

* `summaryLangTRS(self)` creates a tsv file containing the information about the languages spoken in the TRS.
 
* `trsToTxt(self, need_placeholder=True)` creates a txt file and a TRS-placeholder from the input TRS.
 
* `txtToTrs(input_txt, from_correction=0)` creates a TRS file from the content of a txt file and the structure of a TRS-placeholder one.
 
* `cleanNEfromTRS(self)` creates a new TRS file without the Named Entity annotations of the original one.
 
* `validateTRS(self)` creates a tsv file with the contents information and statistics from the input TRS.
 
* `trsToTsv(self)` transforms the input TRS structure and content into a tsv file.
 
* `vadToTRS(input_tg)` creates a TRS file following the structure of the input TextGrid file having only one Tier called 'VAD'.
 
* `trsToTextGrid(self, tiers_list=['transcription', 'speaker', 'sex', 'NE'])` creates a TextGrid file based on the segmentation and content of the input TRS. The newly created TextGrid will have 'transcription', 'speaker', 'sex' and 'NE' as tiers.

* `textGridToTRS(input_tg)` creates a TRS file following the structure of the input TextGrid. The TextGrid's tiers must include 'speaker', 'transcription' and 'sex'.

* `retrieveNEToTsv(self)` retrieves all the Named Entity annotations from the input TRS and writes them to a tsv file.
 
* `trsTMP(self, section_type="report")` creates a partial TRS file retaining only the target section content.
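
Since a TRS file is XML, the idea behind `trsTMP`, `sectionduration` and `speakers` can be sketched with the standard library alone. The example below is not the *trsproc* implementation: the sample document is invented, the element and attribute names follow the Transcriber TRS format as commonly produced by Transcriber, and reading the speaker's sex from the `type` attribute is an assumption.

```python
import xml.etree.ElementTree as ET

# Minimal, invented TRS document with one "nontrans" and one "report" section.
TRS_SAMPLE = """<Trans audio_filename="demo" version="1">
  <Speakers>
    <Speaker id="spk1" name="Alice" type="female"/>
  </Speakers>
  <Episode>
    <Section type="nontrans" startTime="0" endTime="2.5">
      <Turn startTime="0" endTime="2.5"><Sync time="0"/></Turn>
    </Section>
    <Section type="report" startTime="2.5" endTime="9.0">
      <Turn speaker="spk1" startTime="2.5" endTime="9.0"><Sync time="2.5"/>bonjour à tous</Turn>
    </Section>
  </Episode>
</Trans>"""

root = ET.fromstring(TRS_SAMPLE)

# Speaker id -> (name, sex), mirroring the layout described for `speakers`.
speakers = {s.get("id"): (s.get("name"), s.get("type")) for s in root.iter("Speaker")}

def sections_of_type(root, section_type="report"):
    """Keep only the sections whose `type` attribute matches the target."""
    return [s for s in root.iter("Section") if s.get("type") == section_type]

def total_duration(sections):
    """Sum of (endTime - startTime) over the given sections, in seconds."""
    return sum(float(s.get("endTime")) - float(s.get("startTime")) for s in sections)

report = sections_of_type(root)
print(speakers, len(report), total_duration(report))
```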

### Other functions

* `parser.replacingPunctuations(sentence)` deletes the punctuations in the following list from the input: `["\ufeff", "\u00A0", "\u2019", ".", ":", ";", "!", '"', "/", "\\", "%", "'"]`
 
* `parser.praatSNRforSegment(audio, seg_start, seg_end)` computes the signal-to-noise ratio, using Praat through parselmouth, on the part of the input audio signal between `seg_start` and `seg_end`.
 
* `utils.importJSON(json_input)` returns a Python dictionary from the input json file.

* `utils.tmpReport(trs_input, section_type="report")` creates a tsv file containing the statistical information of the input TRS and the target section validation report with segments < 10s and pauses > 0.5s.

* `utils.sampleFromDict(input_dict, sample)` returns random keys from the input dictionary.
 
* `utils.randomSampling(list_trs, save_path)` asks the user for the population size and returns the minimum sample size, a tsv table with randomly sampled segments from the population and the corresponding audio segment files.
 
* `utils.randomSamplingNE(list_trs, save_path)` asks the user for the population size and returns the minimum sample size, a table with randomly sampled Named Entities from the population and the corresponding audio segment files.
 
* `utils.createUpdateDictNE(table_info, ne_dict, ne_origin)` creates or updates the table with extracted Named Entities from TRS annotations.
 
* `utils.trsPreannotation(input_trs: TRSParser)` creates a new TRS pre-annotated using the previously created Named Entities table.
 
* `utils.preAnnotateNElen1(input_trs: TRSParser, dict_ne)` pre-annotates the input TRS with Named Entities of length 1.
 
* `utils.preAnnotateNElenPlus(input_file, list_ne, dict_ne)` pre-annotates the input TRS with Named Entities of length greater than 1.
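
The effect of the punctuation list given for `parser.replacingPunctuations(sentence)` can be previewed without installing *trsproc*. The sketch below is a stand-in that deletes each listed item, as the description states; `strip_punctuation` is a hypothetical name, and the real function may differ in detail.

```python
# List copied from the description of parser.replacingPunctuations above.
PUNCTUATION = ["\ufeff", "\u00A0", "\u2019", ".", ":", ";", "!", '"', "/", "\\", "%", "'"]

def strip_punctuation(sentence):
    """Delete every listed character from the sentence (hypothetical stand-in)."""
    for mark in PUNCTUATION:
        sentence = sentence.replace(mark, "")
    return sentence

print(strip_punctuation("Hello: world!"))
print(strip_punctuation("l\u2019ami, c'est bien."))
```

Note that commas and question marks are not in the list, so they are left untouched, and that deleting apostrophes joins the words on either side of them.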
 
The following functions are used in case of custom corrections:

* `utils.turnDifferenceTRS(input_trs: TRSParser)` returns the list of differences in segmentation between the input TRS and its twin.
 
* `utils.trsEmptySpaceBeforeNE(input_trs: TRSParser)` creates a new TRS with an empty space before each Named Entity annotation.
 
* `utils.correctionLà(input_trs: TRSParser)` creates a new txt correcting 'là' to 'la' at the end of its sentences.
 
* `utils.correctionMaj(input_trs: TRSParser)` creates a new TRS with the corrected misplaced capital letters from the input one.
 

            

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "trsproc",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.6",
    "maintainer_email": "",
    "keywords": "python,transcriber,trs,transcription,textgrid,nlp",
    "author": "",
    "author_email": "Gabriele Chingoli <gabriele@elda.org>",
    "download_url": "https://files.pythonhosted.org/packages/d8/dc/796d2ae61663513c2ee61468f9747000def9670f1cfc174279fe4ef2dff8/trsproc-2.0.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT License  Copyright (c) 2024 ELDA/ELRA  Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:  The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.  THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ",
    "summary": "A Python library to process Transcriber TRS files",
    "version": "2.0.0",
    "project_urls": {
        "Homepage": "https://github.com/ELDAELRA/trsproc"
    },
    "split_keywords": [
        "python",
        "transcriber",
        "trs",
        "transcription",
        "textgrid",
        "nlp"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8becbbe7ef7d239f784686a3b19545b6b95f3e54a1079eb8df4e598aae2fc925",
                "md5": "e0b0ab1c2eee044374bbe44047f224bc",
                "sha256": "4ea4845155390209369fdf254e4137902968fe78d2e8e847bdf2f82965eac0b4"
            },
            "downloads": -1,
            "filename": "trsproc-2.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "e0b0ab1c2eee044374bbe44047f224bc",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.6",
            "size": 21604,
            "upload_time": "2024-03-13T14:19:58",
            "upload_time_iso_8601": "2024-03-13T14:19:58.903348Z",
            "url": "https://files.pythonhosted.org/packages/8b/ec/bbe7ef7d239f784686a3b19545b6b95f3e54a1079eb8df4e598aae2fc925/trsproc-2.0.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "d8dc796d2ae61663513c2ee61468f9747000def9670f1cfc174279fe4ef2dff8",
                "md5": "3e217accc37fc3e7b40213178c1804ea",
                "sha256": "0e0a756bf9b533c2d25d04ce2c02c83a56fcf75b91f2e4c2609e3a955985299d"
            },
            "downloads": -1,
            "filename": "trsproc-2.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "3e217accc37fc3e7b40213178c1804ea",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.6",
            "size": 23475,
            "upload_time": "2024-03-13T14:20:03",
            "upload_time_iso_8601": "2024-03-13T14:20:03.617771Z",
            "url": "https://files.pythonhosted.org/packages/d8/dc/796d2ae61663513c2ee61468f9747000def9670f1cfc174279fe4ef2dff8/trsproc-2.0.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-03-13 14:20:03",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "ELDAELRA",
    "github_project": "trsproc",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": false,
    "requirements": [
        {
            "name": "praat-parselmouth",
            "specs": [
                [
                    "==",
                    "0.4.3"
                ]
            ]
        },
        {
            "name": "praat-textgrids",
            "specs": [
                [
                    "==",
                    "1.4.0"
                ]
            ]
        },
        {
            "name": "rich",
            "specs": [
                [
                    "==",
                    "13.7.0"
                ]
            ]
        },
        {
            "name": "tomli",
            "specs": []
        }
    ],
    "lcname": "trsproc"
}
        