# corpress
<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->
Geoff Ford
<https://geoffford.nz>
Version 1.0.3
[Corpress documentation](https://geoffford.nz/corpress)
Corpress (Cor from Corpus, Press from WordPress) provides a simple way
to retrieve posts or pages from a WordPress site’s [REST
API](https://developer.wordpress.org/rest-api/) and create a corpus
(i.e. a dataset of texts). It offers an efficient and standardized way
to collect text data from WordPress sites, avoiding the need for
customized scrapers. Not all WordPress sites provide access to the REST
API, but many do.
I’m a political scientist who applies corpus linguistics and digital
methods in [my research](https://geoffford.nz), and I’m releasing this
with academic researchers in mind. This tool is intended for academic
research. Please cite the GitHub repository for
[Corpress](https://github.com/polsci/corpress) if you use it in your
research.
Corpress attempts to detect a REST API endpoint from a website URL for
[posts](https://developer.wordpress.org/rest-api/reference/posts/#list-posts)
(default) and
[pages](https://developer.wordpress.org/rest-api/reference/pages/#list-pages),
then downloads JSON from the API, and then processes the JSON to create
a corpus. You can create a corpus in:

1. ‘txt’ format: texts are saved in separate .txt files, compatible with
   common corpus linguistics tools like AntConc. An optional meta-data
   file can be output with the link to each file, title, and date; or
2. ‘csv’ format: meta-data and text are saved in a single CSV file.
I’ve used [nbdev](https://nbdev.fast.ai/) to develop this library, which
uses Jupyter notebooks to develop the code,
[documentation](https://geoffford.nz/corpress), code examples and tests.
If you want to contribute, you will need to clone the [GitHub
repo](https://github.com/polsci/corpress) and [set up
nbdev](https://nbdev.fast.ai/getting_started.html).
## Acknowledgement
This library was developed through my research on these projects:

- [Mapping LAWS project: Issue Mapping and Analysing the Lethal
  Autonomous Weapons Debate](https://mappinglaws.net/) (Funded by Royal
  Society of New Zealand’s Marsden Fund, Grant 19-UOC-068)
- [Into the Deep: Analysing the Actors and Controversies Driving the
  Adoption of the World’s First Deep Sea Mining
  Governance](https://miningthesea.net/) (Funded by Royal Society of New
  Zealand’s Marsden Fund, Grant 22-UOC-059)
## TODO
- Add a way to zip a txt format corpus
- Sort out encoding handling - currently assumes UTF-8 throughout.
- Add checks on JSON save path.
## Install
``` sh
pip install corpress
```
## Before using
- There are good reasons not to collect and/or distribute corpora and it
is the end-user’s responsibility to use this software in an ethical
way.
- Depending on the nature of the texts collected, what you are doing
when analyzing the texts, and how you disseminate your research, it
may be appropriate to process the texts further (e.g. to remove
personally identifying information).
- Not all WordPress sites make the REST API accessible. See [example
output when there is no REST API available](#no-rest-api-available).
- It is possible the API data may differ from what is visible online.
You should check the texts in your corpus to make sure you have what
you expect!
- Corpress will exit with informative log messages if an API endpoint is
  not found, is not accessible, or returns unexpected data. Read the log
  output to understand what happened.
- Collecting data uses energy and server resources. It is your
  responsibility to [set an appropriate User
  Agent](#set-an-appropriate-user-agent) and a sensible number of seconds
  between requests, so that your use of this tool is thoughtful and
  respectful.
## How to use
The [corpress function](https://geoffford.nz/corpress/core#corpress) is
the intended way to invoke Corpress and create a corpus. Other functions
are relevant if you just want to get the API endpoint or download the
JSON data. If you want the data in a different format, you could just
generate the CSV and then convert that to whatever format you need.
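For example, here is a minimal sketch of converting a Corpress metadata
CSV (from a ‘csv’ format run) to JSON Lines with Pandas; the file names
used here are hypothetical:

``` python
import pandas as pd

# Hypothetical file name: 'corpus.csv' is a corpus produced with corpus_format='csv'
corpus = pd.read_csv('corpus.csv')

# Write one JSON object per line (JSON Lines), a format many text-processing tools accept
corpus.to_json('corpus.jsonl', orient='records', lines=True)
```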
Corpress is intentionally verbose in its log output. This is helpful
for recording and understanding the process of collecting the data.

Most WordPress sites have no more than hundreds to thousands of posts.
Running Corpress from a Jupyter notebook is a convenient way to view and
capture the log output and to scope the corpus.
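If you also want to keep the log output in a file, and assuming Corpress
logs via Python’s standard `logging` module (the timestamped INFO/ERROR
lines shown below suggest this), a minimal sketch is:

``` python
import logging

# Send log records to a file as well as to the console/notebook.
# Assumes Corpress uses Python's standard logging module.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('corpress.log', encoding='utf-8'),
        logging.StreamHandler(),
    ],
)
```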
Here’s a step-by-step description, with discussion of the key
functionality.
First import the corpress function.
``` python
from corpress.core import corpress
```
You are going to need to set a few arguments for corpress. The corpress
function is [documented in full
here](https://geoffford.nz/corpress/core#corpress). Here I’m breaking it
down and showing an example.
- `url`: Set the URL of the WordPress website; Corpress will try to
  determine the API endpoint from this.
- `endpoint_type`: Do you want ‘posts’ or ‘pages’? If you want both, see
  the note on [collecting both posts and
  pages](#collecting-both-posts-and-pages).
- `corpus_format`: How do you want your corpus saved? ‘txt’ is a
directory of txt files, ‘csv’ is a single CSV with meta-data and text.
``` python
url = 'https://www.adho.org/'
endpoint_type = 'posts'
corpus_format = 'txt'
```
Set up where and how to save the data. Corpress will try to create
directory paths if they don’t exist.

- `json_save_path` (required): Specify the directory where Corpress will
  save the JSON data. Note: you should set a new path for every new
  WordPress site you collect.
- `corpus_save_path`: Required for ‘txt’ corpus format, this is where
  the .txt files will be saved. Set as `None` or omit if using ‘csv’
  format.
- `csv_save_file`:
  - For ‘txt’ corpus format this is optional. It provides a way to export
    meta-data (date, title, link to text etc.) for each text in the corpus.
  - For ‘csv’ corpus format this is required. It specifies the file where
    the meta-data and text will be saved.
- `include_title_in_text`: Depending on the data you are collecting and
  what you want to do with it, you can save the title of the post/page as
  part of the text or not. This is set to `True` by default.
``` python
json_save_path = '../test_data/example/json/'
corpus_save_path = '../test_data/example/txt/'
csv_save_file = '../test_data/example/metadata.csv'
include_title_in_text = True
```
Set how you query the API:

- `seconds_between_requests`: By default this is set to one request
  every 5 seconds. You can’t specify less than 1 second. If you are
  collecting lots of texts, it may be appropriate to specify a larger
  number of seconds between requests.
- `headers`: Corpress uses the
  [Requests](https://requests.readthedocs.io/en/latest/) Python library
  for HTTP requests. You can pass the headers you want in HTTP requests
  directly as a `dict`. [See documentation
  here](https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers).
  The most relevant one is the [User-Agent
  header](https://en.wikipedia.org/wiki/User-Agent_header). See the note
  below about how to [set an appropriate
  User-Agent](#set-an-appropriate-user-agent).
- `params`: The
  [posts](https://developer.wordpress.org/rest-api/reference/posts/#list-posts)
  and
  [pages](https://developer.wordpress.org/rest-api/reference/pages/#list-pages)
  endpoints support a number of parameters, including parameters to
  specify a search term, restrict dates and set the way results are
  ordered. Set additional parameters as a `dict`. See the Requests
  library documentation on [passing parameters in
  URLs](https://requests.readthedocs.io/en/latest/user/quickstart/#passing-parameters-in-urls)
  to understand this (see the example sketched after this list).
- `max_pages`: By default Corpress will collect *all* posts (or pages).
  That might not be necessary. Interpret `max_pages` as the maximum
  number of successful API requests. The REST API normally returns 10
  posts/pages per request, so if you want 100 posts you would set
  `max_pages` to 10.
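As an illustration of `params`, here is a minimal sketch of a `dict`
restricting the posts endpoint to a date range and ordering results
oldest first. The parameter names (`after`, `before`, `orderby`,
`order`) come from the WordPress REST API reference linked above; the
values are arbitrary examples:

``` python
# Hypothetical example: only posts published in 2020, oldest first
params = {
    'after': '2020-01-01T00:00:00',   # posts published after this ISO 8601 datetime
    'before': '2021-01-01T00:00:00',  # posts published before this ISO 8601 datetime
    'orderby': 'date',                # sort by publication date
    'order': 'asc',                   # ascending, i.e. oldest first
}
```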
#### Set an appropriate User-Agent
Here’s a suggested format:
`Your Research Project (https://university.edu/webpage)`. See how to set
this below.
``` python
seconds_between_requests = 5
headers = {'User-Agent': 'Your Research Project (https://university.edu/webpage)'}
params = {'search': 'common'} # 'common' is an arbitrary example search term; remove this line to collect every post
max_pages = None # collect all available data; set to an integer if you want less data
```
Now you can call the
[`corpress`](https://geoffford.nz/corpress/core.html#corpress) function
and create a corpus. There will be lots of information logged about
collecting and processing the texts. When completed it will output a
table with a summary of the process and texts collected. This is the
same data returned by the
[`corpress`](https://geoffford.nz/corpress/core.html#corpress) function.
``` python
result = corpress(url=url,
endpoint_type=endpoint_type,
corpus_format=corpus_format,
json_save_path=json_save_path,
corpus_save_path=corpus_save_path,
csv_save_file=csv_save_file,
include_title_in_text=include_title_in_text,
seconds_between_requests=seconds_between_requests,
headers=headers,
params=params,
max_pages=max_pages)
```
2024-08-23 11:21:25 - INFO - Found REST API endpoint link
2024-08-23 11:21:25 - INFO - Setting posts route https://adho.org/wp-json/wp/v2/posts
2024-08-23 11:21:25 - INFO - Using JSON save path: ../test_data/example/json/
2024-08-23 11:21:27 - INFO - Downloading https://adho.org/wp-json/wp/v2/posts?search=common&page=1
2024-08-23 11:21:27 - INFO - Total pages to retrieve is 3
2024-08-23 11:21:34 - INFO - Downloading https://adho.org/wp-json/wp/v2/posts?search=common&page=2
2024-08-23 11:21:40 - INFO - Downloading https://adho.org/wp-json/wp/v2/posts?search=common&page=3
2024-08-23 11:21:45 - INFO - Creating corpus in txt format
2024-08-23 11:21:45 - INFO - Using corpus save path: ../test_data/example/txt/
2024-08-23 11:21:45 - INFO - Creating CSV file for metadata: ../test_data/example/metadata.csv
2024-08-23 11:21:45 - INFO - Processing JSON: posts-3.json
2024-08-23 11:21:45 - INFO - Processing JSON: posts-2.json
2024-08-23 11:21:45 - INFO - Processing JSON: posts-1.json
| | Key | Value |
|-----|--------------------|---------------------------------------------------|
| 0 | url | https://www.adho.org/ |
| 1 | endpoint_url | https://adho.org/wp-json/wp/v2/posts |
| 2 | headers | {'User-Agent': 'Your Research Project (https:/... |
| 3 | params | {'search': 'common'} |
| 4 | get_api_url | True |
| 5 | get_json | True |
| 6 | create_corpus | True |
| 7 | corpus_format | txt |
| 8 | corpus_save_path | ../test_data/example/txt/ |
| 9 | csv_save_file | ../test_data/example/metadata.csv |
| 10 | corpus_texts_count | 29 |
You can now preview the data you’ve collected.
``` python
import pandas as pd
pd.set_option('display.max_colwidth', None) # to display full text in pandas dataframe
metadata = pd.read_csv(csv_save_file)
metadata = metadata.sort_values('date')
metadata[['date', 'link', 'title', 'filename']].head(5) # display first 5 rows of metadata, this is not all the fields available
```
|     | date       | link                                                                                                    | title                                                                      | filename                                                                                            |
|-----|------------|---------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|
| 8   | 2012-12-06 | https://adho.org/2012/12/06/adho-adopts-creative-commons-license-for-its-web-site/                       | ADHO Adopts Creative Commons License for Its Web Site                      | 2012-12-06-post-382-adho-adopts-creative-commons-license-for-its-web-site.txt                       |
| 7   | 2013-03-28 | https://adho.org/2013/03/28/apply-to-be-adhos-publications-liaison/                                      | Apply to be ADHO’s Publications Liaison                                    | 2013-03-28-post-366-apply-to-be-adhos-publications-liaison.txt                                      |
| 6   | 2013-06-23 | https://adho.org/2013/06/23/adho-calls-for-proposals-for-new-special-interest-groups/                    | ADHO Calls for Proposals for New Special Interest Groups                   | 2013-06-23-post-338-adho-calls-for-proposals-for-new-special-interest-groups.txt                    |
| 5   | 2013-07-09 | https://adho.org/2013/07/09/participate-in-the-joint-adho-and-centernet-agm-at-digital-humanities-2013/  | Participate in the Joint ADHO and centerNet AGM at Digital Humanities 2013 | 2013-07-09-post-408-participate-in-the-joint-adho-and-centernet-agm-at-digital-humanities-2013.txt  |
| 4   | 2013-07-14 | https://adho.org/2013/07/14/digital-humanities-2015-to-be-held-in-sydney-australia/                      | Digital Humanities 2015 to be held in Sydney, Australia                    | 2013-07-14-post-288-digital-humanities-2015-to-be-held-in-sydney-australia.txt                      |
You can view a specific text file (if you used the ‘txt’ format) like
this:
``` python
import os
filename = '2012-12-06-post-382-adho-adopts-creative-commons-license-for-its-web-site.txt'
with open(os.path.join(corpus_save_path, filename), 'r', encoding = 'utf-8') as file:
text = file.read()
print(text)
```
ADHO Adopts Creative Commons License for Its Web Site
The Alliance of Digital Humanities Organizations (ADHO) is pleased to announce that all content on its web site is now available under a Creative Commons Attribution (CC-BY) license. This means that individuals and organizations are welcome to re-use and adapt ADHO’s documents and resources, so long as ADHO is cited as the source. Neil Fraistat, Chair of ADHO’s Steering Committee, notes that “this is one of an ongoing series of actions this year that are being designed to make ADHO resources more open and available to the larger community.”
ADHO’s decision to adopt the CC-BY license was prompted by the recognition that through explicitly sharing its work it can have a greater impact, contribute to best practices, and demonstrate its support for open access. Recently the Program Committee for the 2013 Digital Humanities conference revamped ADHO’s Guidelines for Proposal Authors & Reviewers, making them more inclusive, concrete, and transparent. PC chair Bethany Nowviskie received a request from the organizers of another conference to re-use these guidelines. Prompted by Nowviskie's suggestion, the ADHO Steering Committee determined that not only should the conference guidelines be made freely available, but its entire web site.
In adopting a Creative Commons license for its website, ADHO follows suit with several of its existing publications, including Digital Studies/Le Champ Numerique, Digital Humanities Quarterly, and DH Answers.
## Collecting both posts and pages
If you want to collect both posts and pages, just invoke corpress twice:
once with `endpoint_type` set to ‘posts’ and then with it set to
‘pages’.
If you are outputting in the ‘txt’ corpus format without a metadata file
(i.e. `csv_save_file` set to `None` or omitted from the function call),
you won’t have a problem. The filenames for posts/pages won’t conflict.
If you are specifying a `csv_save_file` - either because you are
outputting in the ‘csv’ corpus format or because you want the meta-data
alongside the ‘txt’ format - make sure you use a separate `csv_save_file`
for ‘posts’ and ‘pages’. You will get two separate files; combining these
with a library like [Pandas](https://pandas.pydata.org/), which is
installed with Corpress, is trivial. A minimal sketch follows.
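Here is a minimal sketch of merging the two metadata CSVs with Pandas;
the file names used here are hypothetical:

``` python
import pandas as pd

# Hypothetical file names: separate metadata CSVs from a 'posts' run and a 'pages' run
posts = pd.read_csv('metadata-posts.csv')
pages = pd.read_csv('metadata-pages.csv')

# Stack the two tables and write a single combined CSV
combined = pd.concat([posts, pages], ignore_index=True)
combined.to_csv('metadata-combined.csv', index=False)
```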
## No REST API available
Here’s an example showing what you will see if no REST API is
accessible.
``` python
# test a site that has no endpoint
result = corpress(url = 'https://www.whitehouse.gov/',
endpoint_type='posts',
corpus_format='txt',
json_save_path = '../test_data/json/',
corpus_save_path = '../test_data/corpus/',
max_pages=2)
```
2024-08-23 11:21:46 - INFO - No REST API endpoint link in markup
2024-08-23 11:21:46 - INFO - Guessing posts route based on URL https://www.whitehouse.gov/wp-json/wp/v2/posts
2024-08-23 11:21:46 - INFO - Using JSON save path: ../test_data/json/
2024-08-23 11:21:46 - INFO - Max pages to retrieve from API is set: 2
2024-08-23 11:21:47 - INFO - Downloading https://www.whitehouse.gov/wp-json/wp/v2/posts?page=1
2024-08-23 11:21:47 - ERROR - Error downloading page 1 from https://www.whitehouse.gov/wp-json/wp/v2/posts
2024-08-23 11:21:47 - ERROR - Status code: 403
2024-08-23 11:21:47 - ERROR - It appears that this website does not provide access to the REST API. Exiting.
2024-08-23 11:21:47 - ERROR - Error downloading data. Exiting.
| | Key | Value |
|-----|--------------------|------------------------------------------------|
| 0 | url | https://www.whitehouse.gov/ |
| 1 | endpoint_url | https://www.whitehouse.gov/wp-json/wp/v2/posts |
| 2 | headers | None |
| 3 | params | None |
| 4 | get_api_url | True |
| 5 | get_json | False |
| 6 | create_corpus | False |
| 7 | corpus_format | txt |
| 8 | corpus_save_path | ../test_data/corpus/ |
| 9 | csv_save_file | None |
| 10 | corpus_texts_count | 0 |