Name | upwork-analysis |
Version | 1.0.0 |
home_page | None |
Summary | A Python package that scrapes and analyzes Upwork job listings. |
upload_time | 2024-05-28 05:50:12 |
maintainer | None |
docs_url | None |
author | None |
requires_python | >=3.7 |
license | MIT License Copyright (c) 2024, Yazan Sharaya. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. |
keywords | scraping, undetected, selenium, crawling, automation, data analysis |
VCS | |
bugtrack_url | |
requirements | No requirements were recorded. |
Travis-CI | No Travis. |
coveralls test coverage | No coveralls. |
Upwork Analysis
===============
[![PyPI version](https://badge.fury.io/py/upwork_analysis.svg)](https://badge.fury.io/py/upwork_analysis)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/upwork_analysis)
Upwork is a freelancing platform where clients post jobs and freelancers fight to get hired.
I do freelance work there from time to time, and I decided to follow the advice "Work smarter, not harder," so I created this project.
The aim is to scrape job data from Upwork and analyze it to extract insights.
<!-- TOC -->
* [Objective](#objective)
* [Features](#features)
* [Usage](#usage)
* [The easy way](#the-easy-way)
* [Scraping](#scraping)
* [CLI usage](#cli-usage)
* [Parameters](#parameters)
* [Python](#python)
* [Analysis](#analysis)
* [CLI usage](#cli-usage-1)
* [Python](#python-1)
* [Jupyter](#jupyter)
* [Installation](#installation)
* [Automatic](#automatic)
* [Manual](#manual)
* [Documentation](#documentation)
* [Limitations](#limitations)
* [License](#license)
<!-- TOC -->
Objective
---------
This repository aims to perform the following two tasks.
1. Scrape Job listings on Upwork for a specific search query.
2. Analyze the scraped data to find the following:
1. The countries that pay the most.
2. Job frequency on different days of the week.
3. The most asked for skills.
4. The budget ranges/groups and their frequency.
5. The skills that correlate with higher budgets.
6. The relationship between these skills and the number of proposals.
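As a small illustration of point 2 above, job frequency across weekdays can be computed from the scraped post dates with pandas. This is only a sketch with made-up data; the `posted` column name is hypothetical, not the package's actual field name:

```python
import pandas as pd

# Three fake job postings; the "posted" field name is an assumption for illustration.
jobs = pd.DataFrame({"posted": pd.to_datetime(["2024-05-27", "2024-05-28", "2024-06-03"])})

# Count how many jobs were posted on each day of the week.
by_day = jobs["posted"].dt.day_name().value_counts()
print(by_day["Monday"])  # -> 2 (two of the fake postings fall on a Monday)
```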
Features
--------
* **Fault tolerance**: The scraper was built with fault tolerance in mind and has been thoroughly tested to ensure it.
  Even if an error occurs that isn't caused by the scraping process itself, like a network error, a captcha, or the user closing the
  browser session, the scraper gracefully stops and saves any jobs scraped so far.
* **Retry**: Retry functionality is baked in, so if a network error occurs or a captcha pops up, the scraper retries
  the specific page where the error occurred. This can be controlled by the 'retries' argument.
* **Undetectability**: Upwork is guarded by Cloudflare's various protection measures. The scraper bypasses these protections,
  and in my many test runs it never triggered any of them.
* **Concurrency**: The scraping workload can be distributed across multiple workers to speed up the scraping process tremendously.
  This can be controlled by the 'workers' argument.
* **No API**: The scraper doesn't need an Upwork account or an API key to work, making it more broadly usable, especially given
  how hard it is to get an Upwork API key (they ask for more legal documents than when you sign up!).
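The retry behavior described above can be sketched generically. This is an illustrative pattern, not the package's internal implementation:

```python
def with_retries(scrape_page, retries=3):
    """Call scrape_page, retrying up to `retries` extra times on failure (illustrative)."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return scrape_page()
        except TimeoutError as error:  # stand-in for a network error or captcha
            last_error = error
    raise last_error
```

In the real scraper this policy is controlled by the `retries` argument shown in the parameter table below.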
Usage
-----
There are two main ways to get started with this project, one is through the command
line and the other is by directly importing its functions and classes into your script.
Both achieve the same functionality, choose the one that best suits your needs.
### The easy way
If you installed the package using pip, the easiest and most straightforward way to use it is through the entry points.
For scraping
```
scrape-upwork scrape Python Developer -o python_jobs.json --headless
```
For analysis
```
analyze-upwork SAVED/JOBS/FILE.json -o PATH/TO/SAVE/DIR -s
```
Continue reading below for more ways to use the package.
### Scraping
##### CLI usage
To scrape a new search query from scratch
```
cd PATH/TO/upwork_analysis
python scrape_data.py scrape Python Developer -o python_jobs.json --headless
```
To update existing data for a search query with any new job listings
```
cd PATH/TO/upwork_analysis
python scrape_data.py update Python Developer -o PATH/TO/SAVE/FILE.json --headless
```
##### Parameters
| Parameter | Options | Default | Description |
|------------------------|---------------|-----------------|---------------------------------------------------------------------------------------------------|
| Action | scrape/update | scrape | Scrape new jobs, or update existing scraped data with any new job postings. |
| -q / --search-query | str | None (Required) | The query to search for. |
| -j / --jobs-per-page | 10, 20, 50 | 10 | How many jobs should be displayed per page. |
| -s / --start-page | int | 1 | The page number to start searching from. |
| -p / --pages-to-scrape | int | 10 | How many pages to scrape. If not passed, scrape all the pages.¹ |
| -o / --output | str | - | Where to save the scraped data. |
| -r / --retries | int | 3 | Number of retries when encountering a Captcha or timeout before failing. |
| -w / --workers | int | 1 | How many webdriver instances to concurrently spin up for scraping. |
| -f / --fast | passed or not | False | Whether to use the fast scraping method, can be 10 to 50x faster but leaves some information out. |
| --headless | passed or not | False | Whether to enable headless mode (slower and more detectable). |
<sup>¹ See the [limitations](#limitations) section</sup>
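One way the `workers` argument might divide the page range can be sketched as follows; the round-robin assignment here is an assumption for illustration, and the package's actual distribution strategy may differ:

```python
def split_pages(start_page, pages_to_scrape, workers):
    """Assign page numbers to workers round-robin (illustrative, not the package's internals)."""
    pages = range(start_page, start_page + pages_to_scrape)
    return [list(pages[i::workers]) for i in range(workers)]

print(split_pages(1, 10, 2))  # -> [[1, 3, 5, 7, 9], [2, 4, 6, 8, 10]]
```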
##### Python
```python
from upwork_analysis.scrape_data import JobsScraper
jobs_scraper = JobsScraper(
search_query="Python Developer",
jobs_per_page=10,
start_page=1,
pages_to_scrape=10,
save_path='PATH/TO/SAVE/FILE.json',
retries=3,
headless=True,
workers=2,
fast=False)
jobs_scraper.scrape_jobs()
jobs_scraper.update_existing()
```
This will scrape all resulting job listings for "Python Developer" from page 1 to page 10 and save the results to
"PATH/TO/SAVE/FILE.json" using a headless browser. It will scrape a total of `jobs_per_page` * `pages_to_scrape` jobs or
100 in this case.
### Analysis
A quick note: even though the analysis might run with as few as one data point _(or it might not, and throw errors
because there isn't enough data)_, it's better to scrape more data so the results are meaningful.
##### CLI usage
```
cd PATH/TO/upwork_analysis
python analyze_data.py SAVED/JOBS/FILE.json -o PATH/TO/SAVE/DIR -s
```
##### Python
```python
from upwork_analysis.analyze_data import perform_analysis
perform_analysis(
dataset_path='SAVED/JOBS/FILE.json',
save_plots_dir='PATH/TO/SAVE/DIR',
show_plots=True)
```
##### Jupyter
```
jupyter notebook data_analysis.ipynb
```
Then change the first line of cell 3 from `dataset_path = ""` to `dataset_path = "SAVED/JOBS/FILE.json"`.
This will analyze the data saved at "SAVED/JOBS/FILE.json" and save the resulting plots to "PATH/TO/SAVE/DIR";
passing -s (--show, or show_plots=True) will show the resulting plots all at once.
For more documentation about available functions, their parameters and how to use them, see the [Documentation](#documentation) section.
Installation
------------
This package requires Python 3.7 or later.
##### Automatic
```
pip install upwork_analysis
```
**Note:** If you encounter an error during installation, update pip using `python -m pip install -U pip`; that should fix the issue.
##### Manual
1. Clone this repository
```
git clone https://github.com/Yazan-Sharaya/upwork_analysis
```
2. Download the dependencies
* If you just want the scraping functionality
```
pip install seleniumbase beautifulsoup4
```
* And additionally for analyzing the data
```
pip install pandas scipy seaborn scikit-learn
```
3. Build the package using
```
python -m build -s
```
4. Install the package
```
pip install upwork_analysis-1.0.0.tar.gz
```
Documentation
-------------
Both modules and all the functions they implement are documented using function and module docstrings.
To save you some time, here's a list that covers the documentation for 99% of use cases.
* For documentation about scraping, check out the docstrings for `JobsScraper`, `JobsScraper.scrape_jobs` and `JobsScraper.update_existing`.
  **Note:** You can display the documentation of a function, class or module using the built-in `help()` function.
* As for the data analysis part, check out the `analyze_data` module docstring and the `perform_analysis` function.
* For help on command line usage, append `-h` option to `python scrape_data.py` or `python analyze_data.py`.
Limitations
-----------
* Upwork won't load more than 5050 jobs on their website, even if the site says there are more.\
  You can still get more than 5050 jobs: first scrape all the data, then keep updating the scraped data routinely.
  For information on how to do this, see the [usage](#cli-usage) or [documentation](#documentation) sections.
* The post date for jobs is relative, and its accuracy decreases as you go further into the past:\
  minute precision for jobs posted up to an hour ago, hour precision up to a day ago (e.g. "2 hours ago"), day precision up to a week ago (e.g. "3 days ago"), and so on.\
  This can also be worked around using the same method mentioned in point 1.
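The routine-update workaround above boils down to merging newly scraped listings into the saved file without duplicates. A minimal sketch, assuming each scraped job carries a unique `url` field (the key name is an assumption, not the package's documented schema):

```python
def merge_jobs(existing, new):
    """Append only jobs whose URL hasn't been seen before (illustrative)."""
    seen = {job["url"] for job in existing}
    return existing + [job for job in new if job["url"] not in seen]

old = [{"url": "https://www.upwork.com/jobs/~01"}]
fresh = [{"url": "https://www.upwork.com/jobs/~01"}, {"url": "https://www.upwork.com/jobs/~02"}]
print(len(merge_jobs(old, fresh)))  # -> 2 (the duplicate listing is skipped)
```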
License
-------
This project is licensed under the [MIT license](https://github.com/Yazan-Sharaya/upwork_analysis/blob/main/LICENSE).