| Field | Value |
| ----- | ----- |
| Name | JAAT |
| Version | 0.1.2 |
| Summary | Repository for JAAT: efficient and accurate analysis of job ads for task matching, title matching, firm name extraction, and attribute classification. |
| home_page | None |
| docs_url | None |
| author | None |
| author_email | Stephen Meisenbacher <stephen.meisenbacher@tum.de>, Peter Norlander <pnorlander@luc.edu> |
| maintainer | None |
| maintainer_email | Stephen Meisenbacher <stephen.meisenbacher@tum.de> |
| upload_time | 2024-11-05 17:09:54 |
| requires_python | >=3.6 |
| license | MIT License Copyright (c) 2024 Loyola University Chicago Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. |
| keywords | text analysis, text classification, semantic matching |
| VCS | GitHub |
| bugtrack_url | None |
| project_urls | Homepage, Documentation, Repository: https://github.com/Job-Ad-Research-at-QSB-LUC/JAAT; Issues: https://github.com/Job-Ad-Research-at-QSB-LUC/JAAT/issues |
| download_url | https://files.pythonhosted.org/packages/9c/1f/de6fb7ed8b0995d95a382488b29d1a0a48269f1bc0fcf6aa64bc3d9c3562/jaat-0.1.2.tar.gz |
| requirements | importlib_resources, nltk, pandas, psutil, sentence_transformers, torch, tqdm, transformers, xlrd, swifter, openpyxl, pure-predict, compress_pickle |
| Travis-CI | No Travis. |
| coveralls test coverage | No coveralls. |
# Job Ad Analysis Toolkit (JAAT)
This repository contains the code for efficient and accurate analysis of job ads. Running the code is simple!
## Installation
To install JAAT, simply run the following:
`pip install JAAT`
Then, import JAAT using the following command:
`from JAAT import JAAT`
## TaskMatch
The first module is a tool that extracts relevant tasks, as defined by O*NET, from input job ad texts.
After importing the module, simply instantiate the `TaskMatch` object:
`TM = JAAT.TaskMatch()`
Optionally, you can provide a threshold value (default = 0.9, range [0, 1]), which governs how lenient the matching is (a lower threshold yields more matches, but potentially less accurate ones).
`TM = JAAT.TaskMatch(threshold=0.85)`
Then, run it on any given (job ad) text:
`tasks = TM.get_tasks(TEXT)`
The output will be a list of tuples, each pairing an O*NET task ID with its title (description). If no tasks are matched, an empty list is returned. Example output:
`[('16363', 'Identify operational requirements for new systems to inform selection of technological solutions.'), ('16987', 'Prepare documentation or presentations, including charts, photos, or graphs.'), ('9583', 'Assign duties to other staff and give instructions regarding work methods and routines.')]`
For batch processing, run:
`tasks = TM.get_tasks_batch(LIST_OF_TEXTS)`
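Putting the above together, a minimal end-to-end sketch might look as follows (the job ad snippet and threshold choice are illustrative, not prescriptive):

```python
from JAAT import JAAT

# Instantiate with a slightly more lenient threshold than the 0.9 default.
TM = JAAT.TaskMatch(threshold=0.85)

# A made-up job ad snippet for illustration.
ad = (
    "We are seeking an analyst to prepare documentation and presentations, "
    "including charts and graphs, and to assign duties to junior staff."
)

# Single-text matching: a list of (task_id, task_description) tuples.
tasks = TM.get_tasks(ad)
for task_id, description in tasks:
    print(task_id, description)

# Batch matching over several ads at once.
all_tasks = TM.get_tasks_batch([ad, "Another job ad text..."])
```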
## TitleMatch
The second module assists in matching input job ad titles to coded job titles from O*NET.
After importing the module, simply instantiate the `TitleMatch` object:
`TM = JAAT.TitleMatch()`
Then, run it on any given (job ad) text:
`matched_title = TM.get_title(TEXT)`
Note that this function works on either a single text or a list of input texts. The return value is a list of tuples, each of the format:
`(MATCHED_TITLE, MATCHED_TITLE_CODE, MATCH_SCORE)`
Each tuple returned corresponds in order to the input text(s).
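For instance, a short sketch (with invented input titles) could look like this:

```python
from JAAT import JAAT

TM = JAAT.TitleMatch()

# Works on a single title or a list of titles; these inputs are invented.
titles = ["Senior Software Engineer", "Registered Nurse - ICU"]
matches = TM.get_title(titles)

# Each result is (MATCHED_TITLE, MATCHED_TITLE_CODE, MATCH_SCORE),
# in the same order as the inputs.
for matched_title, code, score in matches:
    print(matched_title, code, score)
```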
## FirmExtract
The third module extracts the firm (company) name from a text (not necessarily only job ad texts).
After importing the module, simply instantiate the `FirmExtract` object:
`FE = JAAT.FirmExtract()`
This initializes the firm extraction object with our custom NER model, firmNER. Optionally, you can choose to have all extracted firm names standardized according to the method proposed by [Wasi and Flaaen](https://www.aaronflaaen.com/uploads/3/1/2/4/31243277/wasi_flaaen_statarecordlinkageutilities_20140401.pdf). This can be done by setting the `standardize` parameter to `True`.
Following this, run it on any given (job ad) text:
`firms = FE.get_firm(TEXT)`
This will return a firm name if found, otherwise `None`.
`FirmExtract` also features batch processing. For batch processing, run:
`firm_names = FE.get_firm_batch(LIST_OF_TEXTS)`
This will return a list of firm names (or `None` where no name is found).
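As a brief sketch (the input texts are invented for illustration):

```python
from JAAT import JAAT

# standardize=True applies the Wasi and Flaaen name standardization.
FE = JAAT.FirmExtract(standardize=True)

# Single text: returns a firm name string, or None if nothing is found.
firm = FE.get_firm("Acme Logistics Inc. is hiring warehouse associates.")
print(firm)

# Batch mode: one entry (firm name or None) per input text.
firms = FE.get_firm_batch([
    "Join the team at Acme Logistics Inc. today!",
    "No company name appears in this text.",
])
print(firms)
```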
## CREAM
`CREAM` is a tool that allows you to extract concepts that are hidden within texts. These concepts, called *classes*, can be defined arbitrarily by you - anything goes! You only need to provide two things:
- **keywords**: each class should contain a list of relevant keywords, i.e., words/phrases that would "trigger" a potential class candidate
- **rules**: *rules* define archetypical text chunks that either support or refute an instance of a defined class given a found keyword. Rules should be manually defined using domain expertise, and an arbitrary number of rules may be used.
Keywords should be presented as a list of strings, e.g., `[k1, k2, ..., kn]`.
Rules should be presented as a list of tuples, in the form: `[(rule_1, label), (rule_2, label), ..., (rule_n, label)]`.
In the most basic form, the *labels* are binary: 1 denotes the presence of the class, 0 its absence.
Given these two inputs, one can instantiate the `CREAM` object.
`C = JAAT.CREAM(keywords=KEYWORDS, rules=RULES)`
There are also three optional parameters:
- `class_name`: the name of the class (i.e., labels)
- `n`: used by CREAM internals; essentially how many context words should be considered on either side of identified keywords
- `threshold`: useful for embedding functions - the minimum similarity threshold a candidate text chunk should meet in order to be matched with a label. The higher the threshold, the stricter the matching criterion.
With this set up, all you need to do is run `CREAM` on a list of texts, and the output will be a DataFrame with the relevant results.
`res = C.run(LIST_OF_TEXTS)`
Specifically, the output will be a Pandas DataFrame with the following columns:
- **text**: the input texts
- **inferred_rule**: the best matching rule, if any
- **inferred_label**: the label assigned based on the best matching rule, if any
- **inferred_confidence**: the "confidence score" of the matching, if a match was made. Note that this is embedding-model-specific and should be interpreted in relative terms.
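To make this concrete, here is a minimal sketch for a hypothetical "remote work" class; the keywords, rules, and texts below are invented for illustration:

```python
from JAAT import JAAT

# Invented keywords that could trigger a "remote work" candidate.
keywords = ["remote", "work from home", "telecommute"]

# Invented rules: archetypical text chunks with binary labels
# (1 = class present, 0 = class absent).
rules = [
    ("this position is fully remote", 1),
    ("work from home options are available", 1),
    ("no remote work is permitted", 0),
]

C = JAAT.CREAM(keywords=keywords, rules=rules, class_name="remote_work")

texts = [
    "This role is fully remote, with quarterly on-site meetings.",
    "Candidates must work on-site; no remote work is permitted.",
]

# Returns a Pandas DataFrame with the text, inferred_rule,
# inferred_label, and inferred_confidence columns described above.
res = C.run(texts)
print(res)
```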
## ActivityMatch
In a similar way to `TaskMatch`, `ActivityMatch` will extract general activity statements from your texts, according to a set of predefined daily activities (see `data/lexiconwex2023.csv`).
`AM = JAAT.ActivityMatch()`
Optionally, you can provide a threshold value (default = 0.9, range [0, 1]), which governs how lenient the matching is (a lower threshold yields more matches, but potentially less accurate ones).
`AM = JAAT.ActivityMatch(threshold=0.85)`
Then, run it on any given text:
`activities = AM.get_activities(TEXT)`
For batch processing, run:
`activities = AM.get_activities_batch(LIST_OF_TEXTS)`
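A short sketch (with an invented input text):

```python
from JAAT import JAAT

AM = JAAT.ActivityMatch(threshold=0.85)

# Single text: extract matched activity statements.
activities = AM.get_activities(
    "The role involves meeting with clients and documenting requirements."
)
print(activities)

# Batch variant over many texts.
all_activities = AM.get_activities_batch(["First text...", "Second text..."])
```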
## JobTag
`JobTag` is used to classify pieces of text (such as job ads) according to expert-defined classification schemes. This is done using niche classifiers, which we also release publicly here.
As of now, the following classes are supported:
`['CitizenshipReq', 'GovContract', 'VisaExclude', 'VisaInclude', 'WorkAuthReq', 'driverslicense', 'ind_contractor', 'proflicenses', 'wfh', 'yesunion']`
To get started, create a new `JobTag` object by doing the following:
`J = JAAT.JobTag(class_name=CLASS)`
where `CLASS` is replaced by one of the supported classes. Optionally, you can also specify an `n` parameter (default: 4), which defines how large of a context window around keywords to consider.
Then, you can classify any text (binary classification, 1 == positive) by calling the following function:
`prediction = J.get_tag(TEXT)`
This will return a tuple of the form `(class_name, 1/0)`. For larger batches of texts, use the batch function:
`predictions = J.get_tag_batch(LIST_OF_TEXTS)`
This will return a list of 1/0 predictions, in the same order as the input texts.
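As a final sketch, using the supported `wfh` class (the job ad texts are invented):

```python
from JAAT import JAAT

# 'wfh' is one of the supported class names listed above.
J = JAAT.JobTag(class_name="wfh", n=4)

# Single text: returns a tuple of the form (class_name, 1/0).
prediction = J.get_tag("This position offers a work-from-home schedule.")
print(prediction)  # e.g., ('wfh', 1)

# Batch variant: a list of 1/0 predictions, in input order.
predictions = J.get_tag_batch([
    "Fully remote role with flexible hours.",
    "On-site presence required five days a week.",
])
print(predictions)
```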
## Acknowledgements
This project has received generous support from the National Labor Exchange, the Russell Sage Foundation, and the Washington Center for Equitable Growth.
### Data Citation
In the demo notebook `JAATDemo.ipynb` and the companion slides, we use the data made available by the following publication:
```Zhou, Steven, John Aitken, Peter McEachern, and Renee McCauley. “Data from 990 Public Real-World Job Advertisements Organized by O*NET Categories.” Journal of Open Psychology Data 10 (November 21, 2022): 17. https://doi.org/10.5334/jopd.69.```