- Name: docusense
- Version: 0.0.1
- Home page: https://github.com/GytisBaravykas/docusense
- Summary: A tool to extract logic from document
- Upload time: 2023-05-19 23:37:01
- Author: Gytis Baravykas
- Requires Python: >=3.8.0
- License: MIT
- Keywords: nlp, text-mining, algorithms, development
- Requirements: No requirements were recorded.

docusense
======
A Python library for extracting the main logic from a document using NLP transformers and stop words.
With multilingual transformers, the extraction process should work well enough on many types of documents.

The purpose of this library is to extract the main logical sense from documents without needing to open them.


Features
---
- summary extraction
- questions & answers from the document
- strongest keywords from the document

Setup
---
Install using pip:
```bash
pip install docusense
```

Terminal usage
---
### Possible parameters
- path: path to the document file to process.
- text: document text for summarization.
- lang: language used for stop words. Stop words of the selected language are removed.
- question: question to answer from the text.
- min_length: minimum length of the generated summary.
- max_length: maximum length of the generated summary.
- max_answer_len: maximum length of the answer.
- n_keywords: number of keywords to return.
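
The `lang` and `n_keywords` parameters suggest that stop words are filtered out before keywords are ranked. The sketch below illustrates that general idea with a plain frequency count. It is not docusense's implementation: `STOPWORDS`, `top_keywords`, and the tiny stop-word set are hypothetical names and data chosen for illustration.

```python
import re
from collections import Counter
from typing import List

# Hypothetical, deliberately tiny stop-word table; a real one would be
# much larger and available for several languages.
STOPWORDS = {
    "english": {"the", "a", "an", "and", "of", "to", "in", "is", "are", "on"},
}


def top_keywords(text: str, lang: str = "english", n_keywords: int = 3) -> List[str]:
    """Return the n_keywords most frequent words that are not stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    stop = STOPWORDS.get(lang, set())
    # Skip stop words and very short tokens, then count what is left.
    counts = Counter(w for w in words if w not in stop and len(w) > 2)
    return [word for word, _ in counts.most_common(n_keywords)]


print(top_keywords("The bus and the bus driver are late; the bus is late.", n_keywords=2))
```

A frequency count is the simplest possible ranking; it only serves to show where a language-specific stop-word list enters the process.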

### Simple case
Logs the summary, the answer to the asked question, and the keywords.
```bash
python extract.py --path "path/to/file"
```
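
The source of `extract.py` is not shown here, so the following is only a sketch of how a command-line entry point could expose the parameters listed above with the standard library's `argparse`. The flag names mirror the documented parameters; the `build_parser` helper, the default values, and the choice to make `--path` and `--text` mutually exclusive are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the documented parameters (defaults are assumed)."""
    parser = argparse.ArgumentParser(
        description="Extract a summary, an answer and keywords from a document."
    )
    # Assumption: the input comes either from a file or from inline text.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--path", help="Path to the document file.")
    source.add_argument("--text", help="Document text passed directly.")
    parser.add_argument("--lang", default="english", help="Stop-word language.")
    parser.add_argument("--question", default=None, help="Question to answer.")
    parser.add_argument("--min_length", type=int, default=20)
    parser.add_argument("--max_length", type=int, default=100)
    parser.add_argument("--max_answer_len", type=int, default=50)
    parser.add_argument("--n_keywords", type=int, default=3)
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["--path", "path/to/file", "--n_keywords", "5"])
print(vars(args))
```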

Examples as Python code
---
### Simple case
```python
from docusense.sense import SenseExtractor

text = """
Good evening,
Carriers constantly violate the passenger and baggage transport rules.
In the evenings from 21:00 to 24:00 drivers are specially late to leave
around the ring for 3-4 minutes (all this time the buses are standing still in the ring with the engines running,
at the same time creating additional air and noise pollution close to residential houses in the evenings
and during the night).
Also, some buses often arrive a few minutes earlier than indicated
in the schedule. Apparently, the transporters live and work in a parallel world, somewhere at night
traffic jams occur in the district.
The company "Communication Services" has been informing about violations for several months, but
does not take any action, although it is required to carry out the control of public transport carriers and
ensure compliance with passenger and baggage regulations.
The director of the municipal administration also does not carry out any control, rules 3
point - To instruct the Director of Administration to control how this is carried out
solution.
Please provide an answer: why are the same violations repeated every night and how
it is ensured that the carriers comply with the passenger and baggage transport rules.
"""

# Create the extractor and run it on the text
extractor = SenseExtractor()
output = extractor(text=text)

```
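
Since the stated goal is to get the sense of documents without opening them, a natural extension is to run the extractor over a whole folder. The helper below is a sketch under assumptions: `extract_folder` is a hypothetical name, it expects plain-text files, and it accepts any callable that takes a `text` keyword argument, as `SenseExtractor` does in the example above.

```python
from pathlib import Path
from typing import Any, Callable, Dict


def extract_folder(
    folder: str,
    extractor: Callable[..., Any],
    pattern: str = "*.txt",
) -> Dict[str, Any]:
    """Run extractor(text=...) on every matching file, keyed by file name."""
    results: Dict[str, Any] = {}
    for file_path in sorted(Path(folder).glob(pattern)):
        text = file_path.read_text(encoding="utf-8")
        results[file_path.name] = extractor(text=text)
    return results


# Possible usage with docusense installed (not executed here):
# from docusense.sense import SenseExtractor
# results = extract_folder("path/to/folder", SenseExtractor())
```

Passing the extractor in as an argument means it is constructed once and reused for every file.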

### Divided use
You can use each part of the extractor separately.

#### Summarizer, question answering, and keywords
```python

text = """
Good evening,
Carriers constantly violate the passenger and baggage transport rules.
In the evenings from 21:00 to 24:00 drivers are specially late to leave
around the ring for 3-4 minutes (all this time the buses are standing still in the ring with the engines running,
at the same time creating additional air and noise pollution close to residential houses in the evenings
and during the night).
Also, some buses often arrive a few minutes earlier than indicated
in the schedule. Apparently, the transporters live and work in a parallel world, somewhere at night
traffic jams occur in the district.
The company "Communication Services" has been informing about violations for several months, but
does not take any action, although it is required to carry out the control of public transport carriers and
ensure compliance with passenger and baggage regulations.
The director of the municipal administration also does not carry out any control, rules 3
point - To instruct the Director of Administration to control how this is carried out
solution.
Please provide an answer: why are the same violations repeated every night and how
it is ensured that the carriers comply with the passenger and baggage transport rules.
"""

# Summarizer: generate a summary with a minimum length of 20
from docusense.summary import Summarizer
summarizer = Summarizer()
summary = summarizer(text, min_length=20)

# Questions and answers: look for the answer to a question in the text
from docusense.qa import QAExtractor
qa_extractor = QAExtractor()
answer = qa_extractor(text, question="What is the question in the text?")

# Keywords: return the 3 strongest keywords, using English stop words
from docusense.keywords import KeywordsExtractor
keywords_extractor = KeywordsExtractor()
keywords = keywords_extractor(text, lang='english', n_keywords=3)

```
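
The three separate calls above can be gathered into one result. The function below is a minimal sketch of such a wrapper: `extract_all` and the dictionary keys are hypothetical, and the actual return value of `SenseExtractor` is not documented here. The components are passed in as callables using the same keyword arguments as in the example above.

```python
from typing import Any, Callable, Dict


def extract_all(
    text: str,
    summarizer: Callable[..., Any],
    qa_extractor: Callable[..., Any],
    keywords_extractor: Callable[..., Any],
    question: str,
    min_length: int = 20,
    lang: str = "english",
    n_keywords: int = 3,
) -> Dict[str, Any]:
    """Call each component once and collect the results in a dictionary."""
    return {
        "summary": summarizer(text, min_length=min_length),
        "answer": qa_extractor(text, question=question),
        "keywords": keywords_extractor(text, lang=lang, n_keywords=n_keywords),
    }


# Possible usage with the objects created above (not executed here):
# result = extract_all(text, summarizer, qa_extractor, keywords_extractor,
#                      question="What is the question in the text?")
```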


Development
---
1. Install Poetry for your platform: https://python-poetry.org/docs/#installation
2. Run `poetry install`

            

Raw data

{
    "_id": null,
    "home_page": "https://github.com/GytisBaravykas/docusense",
    "name": "docusense",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8.0",
    "maintainer_email": "",
    "keywords": "nlp,text-mining,algorithms,development",
    "author": "Gytis Baravykas",
    "author_email": "",
    "download_url": "https://files.pythonhosted.org/packages/24/ed/f926a511ff9ec24de44cecb719698fca9f964db851ba9cf20e285e3a286d/docusense-0.0.1.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "MIT",
    "summary": "A tool to extract logic from document",
    "version": "0.0.1",
    "project_urls": {
        "Homepage": "https://github.com/GytisBaravykas/docusense"
    },
    "split_keywords": [
        "nlp",
        "text-mining",
        "algorithms",
        "development"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "00e4944997ca519fc4b0c59cce4cffec1a1b1bfba2e2fdb4a266c576ebf95788",
                "md5": "d253523e23b0d626bc54efafe27dd42d",
                "sha256": "221e8090a59c722f6af642fcdc0d46b7409b716d77a4869bc7ec248211935c3a"
            },
            "downloads": -1,
            "filename": "docusense-0.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "d253523e23b0d626bc54efafe27dd42d",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8.0",
            "size": 8735,
            "upload_time": "2023-05-19T23:36:58",
            "upload_time_iso_8601": "2023-05-19T23:36:58.983604Z",
            "url": "https://files.pythonhosted.org/packages/00/e4/944997ca519fc4b0c59cce4cffec1a1b1bfba2e2fdb4a266c576ebf95788/docusense-0.0.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "24edf926a511ff9ec24de44cecb719698fca9f964db851ba9cf20e285e3a286d",
                "md5": "b1b0ea6f84af2d387f42e55e69594b2a",
                "sha256": "9377ecc14fa2f78948667725fabb002253b43563f8563d5bbe4817e57912fe29"
            },
            "downloads": -1,
            "filename": "docusense-0.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "b1b0ea6f84af2d387f42e55e69594b2a",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8.0",
            "size": 8248,
            "upload_time": "2023-05-19T23:37:01",
            "upload_time_iso_8601": "2023-05-19T23:37:01.127477Z",
            "url": "https://files.pythonhosted.org/packages/24/ed/f926a511ff9ec24de44cecb719698fca9f964db851ba9cf20e285e3a286d/docusense-0.0.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-05-19 23:37:01",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "GytisBaravykas",
    "github_project": "docusense",
    "github_not_found": true,
    "lcname": "docusense"
}
        