holmes-extractor

Name	holmes-extractor
Version	4.2.1
home_page	https://github.com/richardpaulhudson/holmes-extractor
Summary	Information extraction from English and German texts based on predicate logic
upload_time	2023-06-06 11:08:37
maintainer
docs_url	None
author	Richard Paul Hudson
requires_python	<3.12,>=3.6
license	MIT
keywords	nlp information-extraction spacy spacy-extension python machine-learning ontology semantics
VCS
bugtrack_url
requirements	No requirements were recorded.
Travis-CI	No Travis.
coveralls test coverage	No coveralls.

**Holmes** is a Python 3 library (v3.6 to v3.11) running on top of
[spaCy](https://spacy.io/) (v3.1 to v3.5) that supports a number of use cases
involving information extraction from English and German texts. In all use cases, the information
extraction is based on analysing the semantic relationships expressed by the component parts of
each sentence:

- In the [chatbot](#getting-started) use case, the system is configured using one or more **search phrases**.
Holmes then looks for structures whose meanings correspond to those of these search phrases within
a searched **document**, which in this case corresponds to an individual snippet of text or speech
entered by the end user. Within a match, each word with its own meaning (i.e. that does not merely fulfil a grammatical function) in the search phrase
corresponds to one or more such words in the document. Both the fact that a search phrase was matched and any structured information the search phrase extracts can be used to drive the chatbot.

- The [structural extraction](#structural-extraction) use case uses exactly the same
[structural matching](#how-it-works-structural-matching) technology as the chatbot use
case, but searching takes place with respect to a pre-existing document or documents that are typically much
longer than the snippets analysed in the chatbot use case, and the aim is to extract and store structured information. For example, a set of business articles could be searched to find all the places where one company is said to be planning to
take over a second company. The identities of the companies concerned could then be stored in a database.

- The [topic matching](#topic-matching) use case aims to find passages in a document or documents whose meaning
is close to that of another document, which takes on the role of the **query document**, or to that of a **query phrase** entered ad hoc by the user. Holmes extracts a number of small **phraselets** from the query phrase or
query document, matches the documents being searched against each phraselet, and conflates the results to find
the most relevant passages within the documents. Because there is no strict requirement that every
word with its own meaning in the query document match a specific word or words in the searched documents, more matches are found
than in the structural extraction use case, but the matches do not contain structured information that can be
used in subsequent processing. The topic matching use case is demonstrated by [a website allowing searches within
six Charles Dickens novels (for English) and around 350 traditional stories (for German)](https://holmes-demo.explosion.services/).

- The [supervised document classification](#supervised-document-classification) use case uses training data to
learn a classifier that assigns one or more **classification labels** to new documents based on what they are about.
It classifies a new document by matching it against phraselets that were extracted from the training documents in the
same way that phraselets are extracted from the query document in the topic matching use case. The technique is
inspired by bag-of-words-based classification algorithms that use n-grams, but aims to derive n-grams whose component
words are related semantically rather than merely happening to be neighbours in the surface representation of a language.
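
The phraselet idea behind the topic matching and classification use cases can be illustrated with a toy sketch. It is not Holmes's implementation: Holmes derives phraselets from semantic structures built on spaCy parses, whereas this sketch assumes that each sentence has already been reduced by hand to `(governor, relation, dependent)` triples. The names `extract_phraselets`, `score_passage` and `rank_passages`, the relation labels and the weighting (a matched relation counts three times as much as a matched single word) are all assumptions made for illustration.

```python
from typing import Iterable, List, Sequence, Set, Tuple

# Hypothetical stand-in for a parsed sentence:
# (governor lemma, relation label, dependent lemma) triples written by hand.
Triple = Tuple[str, str, str]
Phraselet = Tuple[str, ...]


def extract_phraselets(triples: Iterable[Triple]) -> Set[Phraselet]:
    """Derive small matchable units: one per word and one per relation."""
    phraselets: Set[Phraselet] = set()
    for governor, relation, dependent in triples:
        phraselets.add((governor,))
        phraselets.add((dependent,))
        phraselets.add((governor, relation, dependent))
    return phraselets


def score_passage(query_phraselets: Set[Phraselet], passage: Iterable[Triple]) -> int:
    """Sum the weights of the query phraselets found in the passage.

    The weight is simply the phraselet length (1 for a word, 3 for a
    relation), an arbitrary choice for this sketch.
    """
    matched = query_phraselets & extract_phraselets(passage)
    return sum(len(phraselet) for phraselet in matched)


def rank_passages(
    query: Iterable[Triple], passages: Sequence[Iterable[Triple]]
) -> List[Tuple[int, int]]:
    """Return (passage index, score) pairs, best first, omitting zero scores."""
    phraselets = extract_phraselets(query)
    scored = [(i, score_passage(phraselets, p)) for i, p in enumerate(passages)]
    return sorted(((i, s) for i, s in scored if s > 0), key=lambda pair: -pair[1])


# Query phrase: "A dog chases a cat"
QUERY = [("chase", "agent", "dog"), ("chase", "patient", "cat")]

PASSAGES = [
    # "The hound barked"
    [("bark", "agent", "hound")],
    # "A cat chased a dog" (same words, different relations)
    [("chase", "agent", "cat"), ("chase", "patient", "dog")],
    # "The dog chased the cat up a tree"
    [("chase", "agent", "dog"), ("chase", "patient", "cat"), ("chase", "direction", "tree")],
]

print(rank_passages(QUERY, PASSAGES))  # [(2, 9), (1, 3)]
```

Passage 1 still scores because no single phraselet is mandatory, which mirrors the point above that topic matching finds more matches than structural extraction. Passage 2 ranks higher because its words stand in the same relations as in the query.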

In all four use cases, the **individual words** are matched using a [number of strategies](#word-level-matching-strategies).
To work out whether two grammatical structures that contain individually matching words correspond logically and
constitute a match, Holmes transforms the syntactic parse information provided by the [spaCy](https://spacy.io/) library
into semantic structures that allow texts to be compared using predicate logic. As a user of Holmes, you do not need to
understand the intricacies of how this works, although there are some
[important tips](#writing-effective-search-phrases) on writing effective search phrases for the chatbot and
structural extraction use cases that are well worth following.
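
As a rough illustration of what comparing semantic structures can look like, the sketch below binds every word of a search phrase to a distinct document word so that all of the phrase's relations also hold between the bound document words. It is a simplified stand-in rather than Holmes's algorithm: the `Structure` class, the `*` wildcard, the relation labels and the hand-written example structure are all assumptions, and in Holmes the structures are derived from the spaCy parse rather than typed in.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Set, Tuple

# (governor index, relation label, dependent index); labels are hypothetical.
Relation = Tuple[int, str, int]


@dataclass
class Structure:
    lemmas: List[str]
    relations: Set[Relation]


def words_match(phrase_lemma: str, doc_lemma: str) -> bool:
    # "*" is an assumed wildcard standing for "any word"; Holmes itself
    # uses several richer word-level matching strategies.
    return phrase_lemma == "*" or phrase_lemma == doc_lemma


def find_matches(phrase: Structure, doc: Structure) -> Iterator[Dict[int, int]]:
    """Yield every binding of phrase word indices to document word indices."""

    def extend(binding: Dict[int, int]) -> Iterator[Dict[int, int]]:
        if len(binding) == len(phrase.lemmas):
            yield dict(binding)
            return
        p = len(binding)  # phrase words are bound in order 0, 1, 2, ...
        for d, doc_lemma in enumerate(doc.lemmas):
            if d in binding.values() or not words_match(phrase.lemmas[p], doc_lemma):
                continue
            binding[p] = d
            # Every phrase relation whose two ends are bound must exist in the document.
            if all(
                (binding[g], label, binding[dep]) in doc.relations
                for g, label, dep in phrase.relations
                if g in binding and dep in binding
            ):
                yield from extend(binding)
            del binding[p]

    yield from extend({})


# Search phrase: "<something> takes over <something>"
PHRASE = Structure(
    lemmas=["*", "take_over", "*"],
    relations={(1, "agent", 0), (1, "patient", 2)},
)

# Document: "Acme said that Globex plans to take over Initech."
# Hand-written semantic structure: Globex is recorded as the agent of
# "take_over" although it is syntactically the subject of "plan".
DOC = Structure(
    lemmas=["Acme", "say", "Globex", "plan", "take_over", "Initech"],
    relations={
        (1, "agent", 0),
        (1, "theme", 3),
        (3, "agent", 2),
        (3, "theme", 4),
        (4, "agent", 2),
        (4, "patient", 5),
    },
)

MATCHES = list(find_matches(PHRASE, DOC))
EXTRACTED = [(DOC.lemmas[m[0]], DOC.lemmas[m[2]]) for m in MATCHES]
print(EXTRACTED)  # [('Globex', 'Initech')]
```

Because every word of the search phrase must be bound, the match carries structured information (here the acquirer and the target) that could be stored in a database, as in the structural extraction use case.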

Holmes aims to offer generalist solutions that can be used more or less out of the box with
relatively little tuning, tweaking or training and that are rapidly applicable to a wide range of use cases.
At its core lies a logical, programmed, rule-based system that describes how syntactic representations in each
language express semantic relationships. The supervised document classification use case does incorporate a
neural network, and the spaCy library upon which Holmes builds has itself been pre-trained using machine
learning. Nevertheless, the essentially rule-based nature of Holmes means that the chatbot, structural extraction and topic matching use
cases can be put to use out of the box without any training, and that the supervised document classification use case
typically requires relatively little training data. This is a great advantage because pre-labelled training data is
not available for many real-world problems.

Holmes has a long and complex history and is now published under the MIT license thanks to the goodwill and openness of several companies. I, Richard Hudson, wrote the versions up to 3.0.0 while working at [msg systems](https://www.msg.group/en), a large international software consultancy based near Munich. From 2021 to 2023, I worked for [Explosion](https://explosion.ai/), the creators of [spaCy](https://spacy.io/) and [Prodigy](https://prodi.gy/). Elements of the Holmes library are covered by a [US patent](https://patents.google.com/patent/US8155946B2/en) that I myself wrote in the early 2000s while working at a startup called Definiens that has since been acquired by [AstraZeneca](https://www.astrazeneca.com/). With the kind permission of both AstraZeneca and msg systems, Holmes is now offered under a permissive license: anyone can now use Holmes under the terms of the MIT license without having to worry about the patent.

For more information, please see the [main documentation on GitHub](https://github.com/richardpaulhudson/holmes-extractor).

Raw data

{
    "_id": null,
    "home_page": "https://github.com/richardpaulhudson/holmes-extractor",
    "name": "holmes-extractor",
    "maintainer": "",
    "docs_url": null,
    "requires_python": "<3.12,>=3.6",
    "maintainer_email": "",
    "keywords": "nlp,information-extraction,spacy,spacy-extension,python,machine-learning,ontology,semantics",
    "author": "Richard Paul Hudson",
    "author_email": "hudsonrichardpaul@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/8d/46/ebb98468fb7516be94e08e7f7e8d21beec04fc6b0e153aa5d3e40e374c5c/holmes-extractor-4.2.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "Information extraction from English and German texts based on predicate logic",
    "version": "4.2.1",
    "project_urls": {
        "Homepage": "https://github.com/richardpaulhudson/holmes-extractor"
    },
    "split_keywords": [
        "nlp",
        "information-extraction",
        "spacy",
        "spacy-extension",
        "python",
        "machine-learning",
        "ontology",
        "semantics"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "1f70dd6d699818679422b54aa68d467c9e5620a69839e2727e32a29b8d3c6963",
                "md5": "3063083d96793ad4b10057a160c6f775",
                "sha256": "a367e462c4f0dc627e4a11e10b2ddb49ec1bf0bd9f0297a03cbc3bab04ca72ab"
            },
            "downloads": -1,
            "filename": "holmes_extractor-4.2.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "3063083d96793ad4b10057a160c6f775",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": "<3.12,>=3.6",
            "size": 116466,
            "upload_time": "2023-06-06T11:08:36",
            "upload_time_iso_8601": "2023-06-06T11:08:36.151384Z",
            "url": "https://files.pythonhosted.org/packages/1f/70/dd6d699818679422b54aa68d467c9e5620a69839e2727e32a29b8d3c6963/holmes_extractor-4.2.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "8d46ebb98468fb7516be94e08e7f7e8d21beec04fc6b0e153aa5d3e40e374c5c",
                "md5": "8842820eec4dc031e8422dd273848682",
                "sha256": "08ea53f7eb566be97ed9c2d9897784c64dd8bc79521a88d081a97d6dee898ffb"
            },
            "downloads": -1,
            "filename": "holmes-extractor-4.2.1.tar.gz",
            "has_sig": false,
            "md5_digest": "8842820eec4dc031e8422dd273848682",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": "<3.12,>=3.6",
            "size": 143509,
            "upload_time": "2023-06-06T11:08:37",
            "upload_time_iso_8601": "2023-06-06T11:08:37.826293Z",
            "url": "https://files.pythonhosted.org/packages/8d/46/ebb98468fb7516be94e08e7f7e8d21beec04fc6b0e153aa5d3e40e374c5c/holmes-extractor-4.2.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-06-06 11:08:37",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "richardpaulhudson",
    "github_project": "holmes-extractor",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "holmes-extractor"
}
