# Acronym Extractor
This package extracts acronym-definition pairs from PDF or text files.
## Prerequisites
The package requires Java to be installed on the system, because it runs Tika to convert PDF files to text.
Install the JDK from https://www.oracle.com/java/technologies/javase-jdk15-downloads.html
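
Before running the package, it can help to confirm that Java is reachable. The snippet below is a minimal sketch of such a check, not part of the package: `java_available` is a hypothetical helper that only uses the Python standard library and assumes the `java` executable is expected on `PATH`.

```python
import shutil
import subprocess


def java_available() -> bool:
    """Return True if a `java` executable is on PATH and runs (hypothetical helper)."""
    java = shutil.which("java")
    if java is None:
        return False
    try:
        # `java -version` exits with status 0 when the runtime starts correctly.
        result = subprocess.run(
            [java, "-version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if java_available():
        print("Java found")
    else:
        print("Java not found; install a JDK first")
```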
## How it works
If the file is a PDF, the package uses Tika to convert it to text. If the file is a text file, it reads the text directly from the file.
Before extracting acronyms, it cleans the text by removing extra blank lines, repeated spaces, and redundant punctuation.
The package uses a combination of regular expressions, extraction patterns, and context to extract acronyms.
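
The reading and cleaning steps described above could be sketched roughly as follows. This is an illustration under assumptions, not the package's actual code: `read_text` and `clean_text` are hypothetical names, the PDF branch assumes the `tika` Python bindings (`parser.from_file`), and the exact cleaning rules are guesses at what "extra lines, spaces, and redundant punctuation" might mean.

```python
import re
from pathlib import Path


def read_text(path: str) -> str:
    """Return the text of a PDF (via Tika) or of a plain text file."""
    if Path(path).suffix.lower() == ".pdf":
        # Imported lazily so text files work without Java or tika installed.
        from tika import parser  # assumption: the tika-python bindings

        parsed = parser.from_file(path)
        return parsed.get("content") or ""
    return Path(path).read_text(encoding="utf-8", errors="replace")


def clean_text(text: str) -> str:
    """Hypothetical cleaning: collapse blank lines, spaces, and punctuation runs."""
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+", " ", text)          # runs of spaces or tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)         # trim spaces around line breaks
    text = re.sub(r"\n{2,}", "\n", text)         # drop extra blank lines
    text = re.sub(r"([,;:!?])\1+", r"\1", text)  # ",," -> ",", "!!" -> "!"
    text = re.sub(r"\.{2,}", ".", text)          # "..." -> "."
    return text.strip()
```

For example, `clean_text("Hello   world!!\r\n\r\n\r\nNext  line,, ok...")` yields two lines, `Hello world!` and `Next line, ok.`.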
1. It first extracts acronyms using a regular expression. The expression assumes that acronyms are usually written in capital letters.
It therefore matches tokens that start and end with a capital letter but may contain lowercase letters in between, such as `PhD`.
2. It then extracts acronyms using extraction patterns. The patterns assume that acronyms are usually defined in one of the following formats:
> - acronym, which is an abbreviation for long-form.
> - acronym, which is a short-form for long-form.
> - acronym, also known as long-form.
> - ...
3. If neither of the two methods above finds a definition for an acronym, it returns the context in which the acronym appears and leaves it to the user to decide whether it is an acronym.
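
The three steps above can be illustrated with a short sketch. It is written under assumptions and is not the package's implementation: the regular expressions, the cue phrases (taken from the formats listed in step 2), the function names, the 40-character context window, and the shape of the returned dictionary are all hypothetical.

```python
import re
from typing import Dict, List, Optional

# Step 1 (assumed regex): starts and ends with a capital letter,
# lowercase letters allowed in between, e.g. "NASA" or "PhD".
ACRONYM_RE = re.compile(r"\b[A-Z][A-Za-z]*[A-Z]\b")

# Step 2 (assumed patterns): cue phrases from the formats listed above.
CUES = (
    "which is an abbreviation for",
    "which is a short-form for",
    "also known as",
)
DEFINITION_RE = re.compile(
    r"\b(?P<acronym>[A-Z][A-Za-z]*[A-Z]),\s+(?:"
    + "|".join(re.escape(cue) for cue in CUES)
    + r")\s+(?P<long_form>[^.,;\n]+)"
)


def find_candidates(text: str) -> List[str]:
    """Return candidate acronyms in order of first appearance, without duplicates."""
    return list(dict.fromkeys(ACRONYM_RE.findall(text)))


def find_definitions(text: str) -> Dict[str, str]:
    """Return acronym -> long form for every cue-phrase match."""
    return {
        m.group("acronym"): m.group("long_form").strip()
        for m in DEFINITION_RE.finditer(text)
    }


def find_context(text: str, acronym: str, width: int = 40) -> str:
    """Step 3: return the text surrounding the first occurrence of the acronym."""
    m = re.search(r"\b" + re.escape(acronym) + r"\b", text)
    if m is None:
        return ""
    start = max(0, m.start() - width)
    end = min(len(text), m.end() + width)
    return text[start:end].strip()


def extract(text: str) -> Dict[str, Dict[str, Optional[str]]]:
    """Combine the steps: use a definition if found, otherwise fall back to context."""
    definitions = find_definitions(text)
    result: Dict[str, Dict[str, Optional[str]]] = {}
    for acronym in find_candidates(text):
        if acronym in definitions:
            result[acronym] = {"definition": definitions[acronym], "context": None}
        else:
            result[acronym] = {"definition": None, "context": find_context(text, acronym)}
    return result


if __name__ == "__main__":
    sample = (
        "NASA, which is an abbreviation for National Aeronautics and Space "
        "Administration. The PhD student used GPU clusters."
    )
    for acronym, info in extract(sample).items():
        print(acronym, info)
```

In this sketch, `NASA` gets a definition from the cue phrase, while `PhD` and `GPU` have no matching pattern and are returned with their surrounding context only.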
## Installation
To install the package, run the following command:
```bash
pip install acronym_extractor
```
## Usage
To use the package, import `AcronymExtractor`, create an instance, and call its `extract_acronyms` method. The method takes one argument, the path to the file, and returns
a dictionary of acronyms and their definitions.
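```python
from acronym_extraction import AcronymExtractor

extractor = AcronymExtractor()
# The path must be a string; replace the placeholder with a real PDF or text file.
acronyms = extractor.extract_acronyms("path/to/file")
print(acronyms)  # dictionary of acronyms and their definitions
```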
For more information, visit our official GitHub page: https://github.com/ali-izhar/acronym_extraction
## Raw data
{
"_id": null,
"home_page": "https://github.com/ali-izhar/acronym_extraction",
"name": "acronym-extractor",
"maintainer": "",
"docs_url": null,
"requires_python": ">=3.6",
"maintainer_email": "",
"keywords": "python,java,acronyms,extraction,text analysis,pdf to text",
"author": "Izhar Ali",
"author_email": "<izharali.skt@gmail.com>",
"download_url": "https://files.pythonhosted.org/packages/0d/cc/6b0c46b72e87a4313e351ff15b6e92062ee2f9fda090415141be2ac4d8d7/acronym_extractor-2.0.7.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "",
"summary": "Extracting acronym-definition pairs from pdf or text files",
"version": "2.0.7",
"split_keywords": [
"python",
"java",
"acronyms",
"extraction",
"text analysis",
"pdf to text"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "8f502f3bbc5a8a562931aeb79a41a36be88b49e091e79395f5c4b70a48c58f54",
"md5": "18aba122a18fb9c9fbe7f7d47287e67a",
"sha256": "68ef8f73c55d774f9e5ae2a2968edfcd60ec7868aaa4b2f8341f55878719486a"
},
"downloads": -1,
"filename": "acronym_extractor-2.0.7-py3-none-any.whl",
"has_sig": false,
"md5_digest": "18aba122a18fb9c9fbe7f7d47287e67a",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": ">=3.6",
"size": 6610,
"upload_time": "2023-03-17T22:02:51",
"upload_time_iso_8601": "2023-03-17T22:02:51.865621Z",
"url": "https://files.pythonhosted.org/packages/8f/50/2f3bbc5a8a562931aeb79a41a36be88b49e091e79395f5c4b70a48c58f54/acronym_extractor-2.0.7-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "0dcc6b0c46b72e87a4313e351ff15b6e92062ee2f9fda090415141be2ac4d8d7",
"md5": "ef2f4be8cc878c9731fa4e6c933d1d36",
"sha256": "3867a8dabe041dda0a02372ad9b754f13553218916ccb2efba8ff7d7bafaae0f"
},
"downloads": -1,
"filename": "acronym_extractor-2.0.7.tar.gz",
"has_sig": false,
"md5_digest": "ef2f4be8cc878c9731fa4e6c933d1d36",
"packagetype": "sdist",
"python_version": "source",
"requires_python": ">=3.6",
"size": 6207,
"upload_time": "2023-03-17T22:02:53",
"upload_time_iso_8601": "2023-03-17T22:02:53.179558Z",
"url": "https://files.pythonhosted.org/packages/0d/cc/6b0c46b72e87a4313e351ff15b6e92062ee2f9fda090415141be2ac4d8d7/acronym_extractor-2.0.7.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2023-03-17 22:02:53",
"github": true,
"gitlab": false,
"bitbucket": false,
"github_user": "ali-izhar",
"github_project": "acronym_extraction",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "acronym-extractor"
}