:Name: invenio-classifier
:Version: 1.3.10
:Home page: https://github.com/inveniosoftware-contrib/invenio-classifier
:Summary: Invenio module for record classification.
:Upload time: 2025-10-24 09:16:21
:Documentation: https://pythonhosted.org/invenio-classifier/
:Author: CERN
:License: GPLv2
:Keywords: invenio, keyword, classification, pdf

..
    This file is part of Invenio.
    Copyright (C) 2015 CERN.

    Invenio is free software; you can redistribute it
    and/or modify it under the terms of the GNU General Public License as
    published by the Free Software Foundation; either version 2 of the
    License, or (at your option) any later version.

    Invenio is distributed in the hope that it will be
    useful, but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
    General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with Invenio; if not, write to the
    Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
    MA 02111-1307, USA.

    In applying this license, CERN does not
    waive the privileges and immunities granted to it by virtue of its status
    as an Intergovernmental Organization or submit itself to any jurisdiction.

====================
 Invenio-Classifier
====================

.. image:: https://img.shields.io/travis/inveniosoftware-contrib/invenio-classifier.svg
        :target: https://travis-ci.org/inveniosoftware-contrib/invenio-classifier

.. image:: https://img.shields.io/coveralls/inveniosoftware-contrib/invenio-classifier.svg
        :target: https://coveralls.io/r/inveniosoftware-contrib/invenio-classifier

.. image:: https://img.shields.io/github/tag/inveniosoftware-contrib/invenio-classifier.svg
        :target: https://github.com/inveniosoftware-contrib/invenio-classifier/releases

.. image:: https://img.shields.io/pypi/dm/invenio-classifier.svg
        :target: https://pypi.python.org/pypi/invenio-classifier

.. image:: https://img.shields.io/github/license/inveniosoftware-contrib/invenio-classifier.svg
        :target: https://github.com/inveniosoftware-contrib/invenio-classifier/blob/master/LICENSE


Invenio module for record classification.

* Free software: GPLv2 license
* Documentation: https://pythonhosted.org/invenio-classifier


Features
========

Classifier automatically extracts keywords from full-text documents.
Assigning keywords to textual documents automatically has clear benefits
in a digital library environment, as it aids the cataloguing,
classification and retrieval of documents.

Keyword extraction is simple
============================

.. note:: Classifier requires Python `RDFLib <http://rdflib.net/>`__ in order
    to process the RDF/SKOS taxonomy.

In order to extract relevant keywords from a document ``fulltext.pdf``
based on a controlled vocabulary ``thesaurus.rdf``, you would run
Classifier as follows:

.. code-block:: shell

    ${INVENIO_WEB_INSTANCE} classifier extract -k thesaurus.rdf -f fulltext.pdf

Launching ``${INVENIO_WEB_INSTANCE} classifier --help`` shows the options available.

As an example, running Classifier on document
`nucl-th/0204033 <http://cds.cern.ch/record/547024>`__ using the
high-energy physics RDF/SKOS taxonomy (``HEP.rdf``) would yield the
following results (based on the HEP taxonomy from October 10th 2008):

.. code-block:: text

    Input file: 0204033.pdf

    Author keywords:
    Dense matter
    Saturation
    Unstable nuclei

    Composite keywords:
    10  nucleus: stability [36, 14]
    6  saturation: density [25, 31]
    6  energy: symmetry [35, 11]
    4  nucleon: density [13, 31]
    3  energy: Coulomb [35, 3]
    2  energy: density [35, 31]
    2  nuclear matter: asymmetry [21, 2]
    1  n: matter [54, 36]
    1  n: density [54, 31]
    1  n: mass [54, 16]

    Single keywords:
    61  K0
    23  equation of state
    12  slope
    4  mass number
    4  nuclide
    3  nuclear model
    3  mass formula
    2  charge distribution
    2  elastic scattering
    2  binding energy
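
If this report needs to be processed further, its count lines can be read back
with a small parser. The following is a minimal sketch, not part of
Invenio-Classifier: the function name ``parse_keyword_lines`` is hypothetical,
and reading the bracketed numbers as the occurrence counts of the two
component keywords is an assumption, not something stated above.

.. code-block:: python

    import re

    # Matches count lines such as "10  nucleus: stability [36, 14]" or "61  K0".
    LINE_RE = re.compile(r"^\s*(\d+)\s+(.+?)(?:\s+\[(\d+(?:,\s*\d+)*)\])?\s*$")


    def parse_keyword_lines(lines):
        """Return (label, count, component_counts) for each count line.

        Lines that do not start with a number (section titles, author
        keywords, blank lines) are skipped.
        """
        results = []
        for line in lines:
            match = LINE_RE.match(line)
            if match is None:
                continue
            count, label, components = match.groups()
            # Assumption: the bracketed values belong to the component keywords.
            component_counts = (
                [int(n) for n in components.split(",")] if components else []
            )
            results.append((label, int(count), component_counts))
        return results


    report = [
        "Composite keywords:",
        "10  nucleus: stability [36, 14]",
        "Single keywords:",
        "61  K0",
    ]
    parsed = parse_keyword_lines(report)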


Thesaurus
=========

Classifier extracts keywords based on the recurrence of specific terms
taken from a controlled vocabulary. A controlled vocabulary is a
thesaurus of all the terms that are relevant in a specific context.
When the context is a discipline or branch of knowledge, the vocabulary
is called a *subject thesaurus*. A list of existing subject thesauri is
available
`here <http://www.fbi.fh-koeln.de/institut/labor/Bir/thesauri_new/thesen.htm>`__.

A subject thesaurus can be expressed in several different formats.
Different institutions and disciplines have developed different ways of
representing their vocabulary systems. The taxonomy used by Classifier
is expressed in RDF/SKOS. This format makes it possible not only to list
keywords but also to specify relations between keywords and alternative
ways of representing the same keyword.

.. code-block:: xml

        <Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#scalar">
         <composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.fieldtheoryscalar"/>
         <prefLabel xml:lang="en">scalar</prefLabel>
         <note xml:lang="en">nostandalone</note>
        </Concept>

        <Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#fieldtheory">
         <composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.fieldtheoryscalar"/>
         <prefLabel xml:lang="en">field theory</prefLabel>
         <altLabel xml:lang="en">QFT</altLabel>
         <hiddenLabel xml:lang="en">/field theor\w*/</hiddenLabel>
         <note xml:lang="en">nostandalone</note>
        </Concept>

        <Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#Composite.fieldtheoryscalar">
         <compositeOf rdf:resource="http://cern.ch/thesauri/HEP.rdf#scalar"/>
         <compositeOf rdf:resource="http://cern.ch/thesauri/HEP.rdf#fieldtheory"/>
         <prefLabel xml:lang="en">field theory: scalar</prefLabel>
         <altLabel xml:lang="en">scalar field</altLabel>
        </Concept>


In RDF/SKOS, every keyword is wrapped in a *concept*, which
encapsulates the full semantics and hierarchical status of a term
(including synonyms, alternative forms, broader concepts, notes and so
on) rather than just a plain keyword.
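
To make the structure of a concept concrete, the sketch below reads the labels
and notes of one concept from the sample above. It is an illustration only:
Classifier itself relies on RDFLib, whereas this sketch uses the standard
library ``xml.etree.ElementTree``. The enclosing ``rdf:RDF`` element, the
choice of the SKOS core namespace as the default namespace, and the function
name ``read_concepts`` are assumptions added to make the fragment parseable.

.. code-block:: python

    import xml.etree.ElementTree as ET

    SKOS = "http://www.w3.org/2004/02/skos/core#"
    RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

    # Assumption: the wrapper element and namespace declarations are added
    # here; the concept itself is copied from the sample above.
    SAMPLE = r"""<rdf:RDF xmlns="http://www.w3.org/2004/02/skos/core#"
             xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
     <Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#fieldtheory">
      <prefLabel xml:lang="en">field theory</prefLabel>
      <altLabel xml:lang="en">QFT</altLabel>
      <hiddenLabel xml:lang="en">/field theor\w*/</hiddenLabel>
      <note xml:lang="en">nostandalone</note>
     </Concept>
    </rdf:RDF>"""


    def read_concepts(rdf_xml):
        """Map each concept URI to its preferred, alternative and hidden labels."""
        root = ET.fromstring(rdf_xml)
        concepts = {}
        for node in root.findall(f"{{{SKOS}}}Concept"):
            uri = node.get(f"{{{RDF}}}about")
            concepts[uri] = {
                "prefLabel": node.findtext(f"{{{SKOS}}}prefLabel"),
                "altLabels": [e.text for e in node.findall(f"{{{SKOS}}}altLabel")],
                "hiddenLabels": [
                    e.text for e in node.findall(f"{{{SKOS}}}hiddenLabel")
                ],
                "notes": [e.text for e in node.findall(f"{{{SKOS}}}note")],
            }
        return concepts


    concepts = read_concepts(SAMPLE)
    field_theory = concepts["http://cern.ch/thesauri/HEP.rdf#fieldtheory"]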

The specification of the SKOS language and `various
manuals <http://www.w3.org/TR/2005/WD-swbp-thesaurus-pubguide-20050517/>`__
that aid the building of a semantic thesaurus can be found at the `SKOS
W3C
website <http://www.w3.org/TR/2005/WD-swbp-skos-core-guide-20051102/>`__.
Furthermore, Classifier can function on top of an extended version of
SKOS, which includes special elements such as key chains, composite
keywords and special annotations.

Keyword extraction
==================

Classifier computes the keywords of a full-text document from the
frequency of thesaurus terms in it. In other words, it counts how
many times each thesaurus keyword (including its alternative and hidden
labels, as defined in the taxonomy) appears in the text and ranks the
results. Unlike other similar systems, Classifier does not use any machine
learning or AI methodologies, just plain phrase matching using
`regular expressions <http://en.wikipedia.org/wiki/Regex>`__: it
exploits the structure and richness of the thesaurus to produce
accurate results. Classifier therefore performs best on top
of rich, well-structured subject thesauri expressed in RDF/SKOS.
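
The counting-and-ranking idea can be sketched in a few lines of Python. This
is a simplified illustration under stated assumptions, not the code
Classifier actually runs: the function names are hypothetical, labels written
between slashes are treated as regular expressions (as the ``hiddenLabel`` in
the thesaurus sample suggests), matching is case-insensitive, and matches
that start at the same position are counted once so that overlapping labels
do not inflate the score.

.. code-block:: python

    import re
    from collections import Counter


    def label_to_regex(label):
        """Compile a label; labels written as /.../ are used as regexes."""
        if len(label) > 2 and label.startswith("/") and label.endswith("/"):
            pattern = label[1:-1]
        else:
            pattern = re.escape(label)
        # Assumption: whole-word, case-insensitive matching.
        return re.compile(r"\b" + pattern + r"\b", re.IGNORECASE)


    def count_concept(text, labels):
        """Count distinct match positions of any label of one concept."""
        starts = set()
        for label in labels:
            for match in label_to_regex(label).finditer(text):
                starts.add(match.start())
        return len(starts)


    def rank_keywords(text, thesaurus):
        """Return (preferred label, count) pairs, most frequent first."""
        counts = Counter()
        for pref_label, labels in thesaurus.items():
            total = count_concept(text, labels)
            if total:
                counts[pref_label] = total
        return counts.most_common()


    # Hypothetical mini-thesaurus built from the labels in the sample above.
    thesaurus = {
        "scalar": ["scalar"],
        "field theory": ["field theory", "QFT", r"/field theor\w*/"],
    }
    text = (
        "Scalar field theories extend QFT. "
        "A field theory with a scalar is simple."
    )
    ranking = rank_keywords(text, thesaurus)

Here the plural "field theories" is only caught by the hidden-label pattern,
which shows why a thesaurus with rich alternative and hidden labels improves
the results.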

Happy hacking and thanks for flying Invenio-Classifier.

| Inspirehep Development Team
|   Email: admin@inspirehep.net
|   Twitter: http://twitter.com/inveniosoftware
|   GitHub: https://github.com/inveniosoftware-contrib/invenio-classifier
|   URL: http://inveniosoftware.org

            
