newsegmentation

Name: newsegmentation
Version: 1.4.0
Home page: https://github.com/iTzAlver/newsegmentation
Summary: Package for news segmentation architecture.
Upload time: 2022-12-19 17:25:26
Author: Alberto Palomo Alonso
Requires Python: >=3.6
Keywords: deeplearning, ml, api
Requirements: matplotlib>=3.5.0, numpy>=1.22.3, nltk>=3.6.5, sklearn, sentence-transformers>=2.2.0, googletrans==4.0.0rc1
<p align="center">
    <img src="./tests/logo.png">

# News Segmentation Package - 1.4.0

This package takes subtitle VTT files (Video Text Track files) and extracts the pieces of 
news from the whole newscast inside the file. News are stored in a Tree structure with useful NLP features inside. 
The user can specify their own segmentation algorithm; default algorithms are also provided. 

## About ##

Author: A.Palomo-Alonso (alberto.palomo@uah.es)\
Contributors: D.Casillas-Pérez, S.Jiménez-Fernández, A.Portilla-Figueras, S.Salcedo-Sanz.\
Universidad de Alcalá.\
Escuela Politécnica Superior.\
Departamento de Teoría De la Señal y Comunicaciones (TDSC).\
Cátedra ISDEFE.

## What's new?

### < 0.2.4
1. `NewSegmentation` abstract class for custom algorithms.
2. Architecture implemented.
3. `Segmentation` class for default modules.
4. Precision, Recall, F1, WD, Pk score evaluation for trees.
5. ``plot_matrix()`` method for the matrix generated.
6. ``where_is()`` method for finding pieces of news.
7. ``gtreader()`` for reading reference trees, in a specific format, for evaluation.
8. ``Tree`` and ``Leaf`` structures.
9. Default ``PBMM`` and ``FB-BCM`` algorithms.
10. Default ``TDM``, ``DBM``, ``SDM`` implemented.
11. ``GPA`` implemented inside ``SDM``.

### 0.2.5
1. Code speed up (60% faster).
   1. Implemented cache for embeddings.
2. Data serializer implemented.
   1. Method ``save()`` implemented in Segmentation class.
   2. Function ``load3s`` implemented for reading trees from files.

### 0.2.6 - 0.2.8
1. Documentation bug fixing.
2. Logo added.

### 0.2.9
1. `Segmentation.evaluate()` can now take a path as a parameter!
2. `Segmentation.evaluate()` now takes `integrity_validation=True` as a default parameter 
for integrity validation.
> NOTE: If your custom algorithm removes sentences from the original text, you should pass 
> ``integrity_validation=False``, as the validation checks that every sentence is present in each tree.
3. Added an external cache file to the ``Segmentation`` class, passed as a new parameter: 
``Segmentation(cache_file='./myjson.json')``. This speeds up the architecture when the sentences sent to the ``LCM`` are the same. 
For instance, when testing parameters on the same database, the process is around 1000% faster.
4. Bug fix: ``'.'`` not inserted when constructing the payload from leaves.

### 0.3.0
1. Solved cache bugs.
2. One read and write in cache per call to the architecture.
3. Exception handler blockage for cache. Now an exception with cache won't block the architecture.
4. Best parameters found and set as default.
5. Preprocessing speed up explanation in doc.

### 0.3.1
1. Bug fixing with ``.TXT`` input files and cache.

### 0.3.2
1. Setuptools rework.
2. Updated performance image.
3. To do: update CITE AS to an IEEE reference.

### 0.3.3 - 0.3.5
1. Documentation rework.
2. Now the project is a library!

### 1.0.0
1. Deployment and bug fixing.

### 1.1.0 - 1.1.4
1. User errors and formatting handled.
2. Debugging.

### 1.2.0
1. Bug fix in ``evaluate(<gt>, show=True)`` where the correct segmentation and the performed segmentation switched places 
in the plot representation.

### 1.2.1
1. Switched from printing logging information to the ``logging`` library.
2. Added a try-except clause around the ``googletrans`` module. Now you can omit it.


## Architecture

The whole architecture and algorithms are described in depth in [this paper](https://???) or in 
[this master thesis](https://???).\
The architecture takes advantage of three main features in order to perform news segmentation:
* Temporal distance: the distance (measured in jumps) between different pieces of text inside the VTT file.
* Spatial distance: the distance (measured in slots) between different pieces of text inside the VTT file.
* Semantic correlation: the correlation between the meaning of the sentences of two different pieces of text.

This architecture works with a _correlation matrix_ formed by the semantic correlation between each pair of sentences 
in the news broadcast. Each module modifies the correlation matrix so that temporal / spatial features are 
reflected in the matrix. The algorithms shall be able to identify each piece of news inside the matrix. 
The following differentiated modules make up the architecture (a toy sketch of the matrix manipulation follows the list):

* **Database Transformer (DT)**: Takes the original VTT file and converts it to plain-text sentences (TXT), with time jumps specified 
at the beginning of each sentence and a temporal information vector at the end indicating the temporal 
length of each sentence, measured in seconds. 
* **Specific Language Model (SLM)**: Takes the blocks of text as input and outputs the semantic correlation 
between each block of text, arranged into a _correlation matrix_.
* **Temporal Distance Manager (TDM)**: Takes the temporal jumps as input and modifies the initial correlation matrix 
depending on those jumps.
* **Spatial Distance Manager (SDM)**: Implements an algorithm which identifies boundaries between 
consecutive pieces of text and merges them.
* **Late Correlation Manager (LCM)**: Implements an algorithm which identifies 
high semantic correlation between separate pieces of text and merges them.
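
The following toy sketch is purely illustrative (the real implementations live inside the package, e.g. ``ns.default_sdm``): it shows the idea of weakening the correlation between blocks separated by long temporal jumps and then cutting the matrix into pieces.

    import numpy as np

    rng = np.random.default_rng(0)
    r = rng.uniform(0.0, 1.0, size=(6, 6))   # toy semantic correlation matrix
    r = (r + r.T) / 2                        # semantic correlation is symmetric
    np.fill_diagonal(r, 1.0)

    # TDM idea: long temporal jumps (in seconds) between consecutive blocks
    # weaken the correlation between those blocks.
    jumps = [0.5, 8.0, 0.5, 12.0, 0.5]
    for i, jump in enumerate(jumps):
        if jump > 5.0:
            r[i, i + 1] *= 0.2
            r[i + 1, i] *= 0.2

    # SDM idea: a new piece of news starts wherever the correlation between
    # consecutive blocks drops below a threshold.
    boundaries = [0] + [i + 1 for i in range(len(jumps)) if r[i, i + 1] < 0.3]
    print(boundaries)   # indices of the blocks where each piece of news starts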

The user can implement their own algorithms depending on their application.

<p align="center">
    <img src="./tests/model.png">

The results are stored in a Tree structure whose fields represent different features of 
the piece of news.
* **Payload**: the whole text of the piece of news; it combines all the sentences related to the same piece of news into a single piece of text. It can be defined as a text structure.
* **Embedding**: a vector of real numbers which defines a semantic representation of the payload. In this model, it is the output of the SLM, the specific language model. It can be defined as a high-dimensional vector of real numbers. This embedding is stored for computational efficiency, as some models may take a long time to compute it.
* **ID**: a natural number defining the tree identity; this number must be unique for each tree in the results' storage. It can be defined as a natural number. 
* **Time information**: the whole temporal length of the piece of news. It can be defined as a real positive number.
* **Correlation power (CP)**: a real number indicating how correlated the sentences of the leaves are within the tree. This number can be very interesting when studying the reliability of algorithms. It can be defined as a real positive number, following the equation below, where M is the size of R^(1+K) and R is, in our architecture, the very last output matrix R^(1+K). This function does not take into account the main diagonal of the correlation matrix, as it does not provide any information about the correlation between sentences. The correlation power is defined on the (0, 1) interval, where 0 means no correlation between any sentences in the tree and 1 means absolute correlation between all the sentences within the tree. This measurement helps to evaluate the reliability of the model.

<p align="center">
    <img src="./tests/eqp.png">




* **Reference**: when several trees share the same results storage system, it is convenient to define a group of trees which refer to a common group. For example, in an analysis spanning several days, where some pieces of news can be repeated and those trees are later merged into a subsequent tree, it is convenient to reference the day those trees belong to. This can be done via the reference field, which can be defined as a natural number. 
* **Leafs**: this structure stores information about the initial state of the model. Each leaf stores a unique _ID_ value and a _Payload_ value containing the minimum text element considered; in this architecture this element is a sentence, but a single word or any group of words could also be considered.

<p align="center">
    <img src="./tests/tree.png">


## Usage

First, install the Python package. After this, you can use your ``VTT`` files to get the 
news. Any other type of file can be considered, but the user must implement their own database 
transformer according to the file type and language used. Spanish news segmentation is the default model.

### Install:

You can install the package via pip:

    pip install newsegmentation -r requirements.txt

If any error occurs, try installing the requirements before installing the package:

    numpy
    matplotlib
    googletrans == 4.0.0rc1
    sentence_transformers >= 2.2.0
    sklearn
    nltk
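
For example, the requirements above can be installed in a single pip call (assuming a standard pip setup):

    pip install numpy matplotlib "googletrans==4.0.0rc1" "sentence_transformers>=2.2.0" sklearn nltk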

### Basic Usage:

In this demo, we extract the news inside the first 5 minutes of the ``VTT`` file:

    $ python
    >>> import newsegmentation as ns
    >>> myNews = ns.Segmentation(r'./1.vtt')
    >>> print(myNews)

    NewsSegmentation object: 8 news classified.

    >>> myNews.info()

    News segmentation package:
    --------------------------------------------
    FAST USAGE:
    --------------------------------------------
    PATH_TO_MY_FILE = <PATH>
    import newsegmentation as ns
    news = ns.NewsSegmentation(PATH_TO_MY_FILE)
    for pon in news:
        print(pon)
    --------------------------------------------

    >>> myNews.about()

    Institution:
    ------------------------------------------------------
    Universidad de Alcalá.
    Escuela Politécnica Superior.
    Departamento de Teoría De la Señal y Comunicaciones.
    Cátedra ISDEFE.
    ------------------------------------------------------
    Author: Alberto Palomo Alonso
    ------------------------------------------------------

    >>> for pieceOfNews in myNews:
    >>>     print(pieceOfNews)

    No hay descanso. Desde hace más de 24 horas se trabaja sin tregua para encontrar a Julen. El niño de 2 años se cayó en un pozo en Totalán, en Málaga. Las horas pasan, los equipos de rescate luchan contrarreloj y buscan nuevas opciones en un terreno escarpado y con riesgo de derrumbes bajo tierra. Buenas noches. Arrancamos este Telediario, allí, en el lugar del rescate. ¿Cuáles son las opciones para encontrar a Julen? Se trabaja en 3 frentes retirar la arena que está taponando el pozo de prospección. Excavar en 2 pozo, y abrir en el lateral de la montaña
    El objetivo rescatar al pequeño. El proyecto de presupuestos llega al Congreso. Son las cuentas con más gasto público desde 2010 Destacan más partidas para programas sociales, contra la pobreza infantil o la dependencia, y también el aumento de inversiones en Cataluña. El gobierno necesita entre otros el apoyo de los independentistas catalanes que por ahora mantienen el NO a los presupuestos, aunque desde el ejecutivo nacional se escuchan voces más optimistas
    La familia de Laura Sanz Nombela, fallecida en París por una explosión de gas espera poder repatriar su cuerpo este próximo miércoles. Hemos hablado con su padre, que está en Francia junto a su yerno y nos ha contado que se sintieron abandonados en las primeras horas tras el accidente. La guardia civil busca en una zona de grutas volcánicas de difícil acceso el cuerpo de la joven desaparecida en Lanzarote, Romina Celeste. Su marido está detenido en relación con su muerte aunque él defiende que no la mató, que solo discutieron y que luego se la encontró muerta la noche de Año Nuevo
    Dormir poco hace que suba hasta un 27 por ciento el riesgo de enfermedades cardiovasculares
    Es la conclusión de un estudio que ha realizado durante 10 años el Centro Nacional para estas dolencias
    Y una noticia de esta misma tarde de la que estamos muy pendientes: Un tren ha descarrilado esta tarde cerca de Torrijos en Toledo sin causar heridos. Había salido de Cáceres con dirección a Madrid. Los 33 pasajeros han sido trasladados a la capital en otro tren. La circulación en la vía entre Madrid y Extremadura está interrumpida. Renfe ha organizado un transporte alternativo en autobús para los afectados
    A 15 días de la gran gala de los Goya hoy se ha entregado ya el primer premio. La cita es el próximo 2 de febrero en Sevilla, pero hoy, aquí en Madrid, en el Teatro Real gran fiesta de los denominados a los Premios Goya. Solo uno de ellos se llevará hoy su estatuilla. Chicho Ibáñez Serrador consigue el Premio Goya de Honor por toda una vida dedicada al cine de terror
    Y en los deportes Nadal gana en Australia, Sergio

    >>> myNews.plotmtx()
<p align="center">
   <img src="./tests/mtx.png">

### Finding news from text:

You can also find information inside the news using the method ``whereis()``:
    
    >>> myNews.whereis('Nadal')

    [7]

    >>> myNews.whereis('2')

    [0, 1, 3, 6]

### Evaluate performance:

If you can create a tree from any ground-truth database, this package also has a method for evaluation:
    
First, read a custom ground truth / golden data tree with ``gtreader()``:

    >>> from newsegmentation import gtreader
    >>> myGt = gtreader('path.txt')
    
Then evaluate the news against the reference; use the argument ``show=True`` in ``evaluate(ref, show=True)`` to plot some graphics about the evaluation:

    >>> myNews.evaluate(myGt, show=True)
  
<p align="center">
    <img src="./tests/evaluation.png">


### Save and load trees:
This package defines a data structure called news trees; this format is read and written by the code via parsers:

    >>> save_file = './testing' # or save_file = './testing.3s'
    >>> myNews.save(save_file)
    >>> sameNews = ns.load3s(save_file)
    >>> results = myNews.evaluate(sameNews)
    >>> print(results)

    {'Precision': 1.0, 'Recall': 1.0, 'F1': 1.0, 'WD': 0.0, 'Pk': 0.0}

This saves the trees generated (not the ``Segmentation`` instance) inside a ``.3s`` file given as a parameter. 

### Speeding up process:
If you want to run the same database several times (for algorithm design, parameter testing or other reasons) you should
use the cache serialization system. This system stores into a ``.json`` file all the embeddings generated in the ``SLM``. 
If any sentence is repeated, the system will not compute the embeddings again. All sentences computed in the ``SLM`` are 
stored into the ``cache_file`` if provided. Here is an example of speeding up process:

      >>> import time
      >>> import newsegmentation as ns
      >>>
      >>> myDatabase = ['./1.vtt', './2.vtt', './3.vtt']
      >>> cache_file = './cache.json'
      >>> lcm_parameters = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
      >>> elapsed_time = list()
      >>>
      >>> for parameter in lcm_parameters:
      >>>   initial_time = time.perf_counter()
      >>>   for news in myDatabase:
      >>>      myNews = ns.Segmentation(news, lcm=(parameter,), cache_file=cache_file)
      >>>   elapsed_time.append(time.perf_counter() - initial_time)
      >>>
      >>> [print(f'{i + 1} iteration: {seconds} seconds.') for i, seconds in enumerate(elapsed_time)]

      1 iteration: 89.23 seconds.
      2 iteration: 9.28 seconds.
      3 iteration: 8.91 seconds.
      4 iteration: 12.2 seconds.
      5 iteration: 7.22 seconds.
      6 iteration: 13.9 seconds.

If any further speed-up is needed: the model reads the original ``.VTT`` files and stores them as temporary ``.TXT`` files. If the 
model is continuously re-reading these files, it is better to convert the ``.VTT`` files to ``.TXT`` once, store them, and give the model the ``.TXT`` files instead. 
This skips the first preprocessing step in every iteration. You can do something similar to this:

      >>> import newsegmentation as ns
      >>>
      >>> in_files = ['./1.vtt', './2.vtt', './3.vtt', './4.vtt', './5.vtt']
      >>> txt_files = [ns.default_dbt(vtt_file) for vtt_file in in_files]
      >>> times = 200
      >>> for i in range(times):
      >>>   for txt_file in txt_files:
      >>>      myNews = ns.Segmentation(txt_file)

This method speeds up the process slightly, and it is only worthwhile if each file is going to be transformed more than once.

### Custom Algorithms:

Subclass the abstract class ``NewsSegmentation`` to implement custom algorithms, using this demo as a template:

    import newsegmentation as ns

    class MySegmentation(ns.NewsSegmentation):
        @staticmethod
        def _spatial_manager(r, param):
            # return ns.default_sdm(r, param)
            return myown_sdm(r, param)
    
        @staticmethod
        def _specific_language_model(s):
            # return ns.default_slm(s)
            return myown_slm(s)
    
        @staticmethod
        def _later_correlation_manager(lm, s, t, param):
            # return ns.default_lcm(lm, s, t, param)
            return myown_lcm(lm, s, t, param)
    
        @staticmethod
        def _database_transformation(path, op):
            # return ns.default_dbt(path, op)
            return myown_dbt(path, op)

Note that _``ns.default_xxx``_ are the default managers for the architecture and can be replaced by your own functions. 
Take into account the following constraints before implementing your own module managers (an illustrative SDM sketch follows the list):

* SDM: Takes as input the correlation matrix (r) and the algorithm parameters (param). It returns a list of integers 
pointing to the index of the block in (r) where each piece of news starts.
* SLM: Takes as input the list of sentences and returns the embeddings of the sentences. For further information about 
word embeddings, check the master thesis cited.
* LCM: Takes as input the SLM function reference (lm), the list of sentences (s), the temporal information vector (t) 
and the algorithm parameters (param). It returns (rk, sk, tk): the very last correlation matrix (rk), the last blocks of 
text (sk) and their corresponding temporal information (tk). Note that you don't need to manage the embeddings; the SLM handles that job.
* DT: Takes as input the path of the VTT file (path) and the requested output path (op), and returns the actual output path. 
Note that the architecture creates temporary TXT files for reading the news from the DT.
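
For instance, a minimal custom SDM obeying this contract could mark a boundary wherever the correlation between consecutive blocks falls below a threshold (an illustrative sketch only, far simpler than the default ``PBMM`` algorithm):

    def myown_sdm(r, param):
        # r: correlation matrix (NumPy-style square array); param: tuple whose
        # first element is assumed here to be a correlation threshold (an
        # illustrative convention, not the package's parameter layout).
        # Returns the indices of the blocks in r where each piece of news starts.
        threshold = param[0]
        boundaries = [0]  # the first block always starts a piece of news
        for i in range(1, r.shape[0]):
            if r[i - 1, i] < threshold:
                boundaries.append(i)
        return boundaries
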
## Performance

Here we compare two different algorithms inside the architecture. LGA is a kernel-based algorithm with cellular automata techniques. The PBMM algorithm is 
the default and has better F1-score performance and reliability. This is tested over a Spanish news broadcast database with 10 files.

<p align="center">
    <img src="./tests/perf.png">

### Cite as:
~~~
@inproceedings{newsegmentation,
  title={News Segmentation Architecture for NLP},
  author={A.Palomo-Alonso, D.Casillas-Pérez, S.Jiménez-Fernández, A.Portilla-Figueras, S.Salcedo-Sanz},
  booktitle={Master Thesis in Telecommunication Engineering},
  year={2022}
}
~~~

            
