cdx-toolkit


Name: cdx-toolkit
Version: 0.9.37
Home page: https://github.com/cocrawler/cdx_toolkit
Summary: A toolkit for working with CDX indices
Upload time: 2024-09-09 02:50:49
Author: Greg Lindahl and others
Requires Python: >=3.7
License: Apache 2.0
Requirements: requests, warcio, pytest, pytest-cov, pytest-sugar, coveralls, twine, setuptools, setuptools-scm

# cdx_toolkit

[![build](https://github.com/cocrawler/cdx_toolkit/actions/workflows/ci.yaml/badge.svg)](https://github.com/cocrawler/cdx_toolkit/actions/workflows/ci.yaml) [![coverage](https://codecov.io/gh/cocrawler/cdx_toolkit/graph/badge.svg?token=M1YJB998LE)](https://codecov.io/gh/cocrawler/cdx_toolkit) [![Apache License 2.0](https://img.shields.io/github/license/cocrawler/cdx_toolkit.svg)](LICENSE)

cdx_toolkit is a set of tools for working with CDX indices of web
crawls and archives, including those at the Common Crawl Foundation
(CCF) and those at the Internet Archive's Wayback Machine.

Common Crawl uses Ilya Kreymer's pywb to serve the CDX API, which is
somewhat different from the Internet Archive's CDX API server.
cdx_toolkit hides these differences as best it can. cdx_toolkit also
knits together the monthly Common Crawl CDX indices into a single,
virtual index.

Finally, cdx_toolkit allows extracting archived pages from CC and IA
into WARC files. If you're looking to create subsets of CC or IA data
and then further process them, this is a feature you'll find useful.

## Installing

```
$ pip install cdx_toolkit
```

or clone this repo and use `pip install .`

## Command-line tools

```
$ cdxt --cc size 'commoncrawl.org/*'
$ cdxt --cc --limit 10 iter 'commoncrawl.org/*'  # returns the most recent year
$ cdxt --crawl 3 --limit 10 iter 'commoncrawl.org/*'  # returns the most recent 3 crawls
$ cdxt --cc --limit 10 --filter '=status:200' iter 'commoncrawl.org/*'

$ cdxt --ia --limit 10 iter 'commoncrawl.org/*'  # will show the beginning of IA's crawl
$ cdxt --ia --limit 10 warc 'commoncrawl.org/*'
```

cdxt takes a large number of command-line switches controlling
the time period and all other CDX query options. cdxt can generate
WARC, JSONL, and CSV output.

If you don't specify much about the crawls or dates or number of
records you're interested in, some default limits will kick in to
prevent overly-large queries. These default limits include a maximum
of 1000 records (`--limit 1000`) and a limit of 1 year of CC indexes.
To exceed these limits, use `--limit` and `--crawl` or `--from` and
`--to`.

If it seems like nothing is happening, add `-v` or `-vv` at the start:

```
$ cdxt -vv --cc size 'commoncrawl.org/*'
```

## Selecting particular CCF crawls

Common Crawl's data is divided into "crawls", which were yearly at the
start, and are currently done monthly. There are over 100 of them.
[You can find details about these crawls here.](https://data.commoncrawl.org/crawl-data/index.html)

Unlike some web archives, CCF doesn't have a single CDX index that
covers all of these crawls -- we have 1 index per crawl. The way
you ask for a particular crawl is:

```
$ cdxt --crawl CC-MAIN-2024-33 iter 'commoncrawl.org/*'
```

- `--crawl CC-MAIN-2024-33` is a single crawl.
- `--crawl 3` is the latest 3 crawls.
- `--crawl CC-MAIN-2018` will match all of the crawls from 2018.
- `--crawl CC-MAIN-2018,CC-MAIN-2019` will match all of the crawls from 2018 and 2019.

CCF also has a hive-sharded parquet index (called the columnar index)
that covers all of our crawls. Querying broad time ranges is much
faster with the columnar index. You can find more information about
this index at [the blog post about it](https://commoncrawl.org/blog/index-to-warc-files-and-urls-in-columnar-format).

The Internet Archive CDX index is organized as a single crawl that runs
from the very beginning until now. That's why there is no `--crawl` for
`--ia`. Note that CDX queries to `--ia` default to a one-year range
and a limit of 1000 entries if you do not specify `--from`, `--to`, and `--limit`.

## Selecting by time

In most cases you'll probably use `--crawl` to select the time range for
Common Crawl queries, but for the Internet Archive you'll need to specify
a time range like this:

```
$ cdxt --ia --limit 1 --from 2008 --to 200906302359 iter 'commoncrawl.org/*'
```

In this example the time range starts at the beginning of 2008 and
ends on June 30, 2009 at 23:59. All times are in UTC. If you do not
specify a time range (and also don't use `--crawl`), you'll get the
most recent year.
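
The same time range can be expressed from Python. Here is a minimal
sketch of the CLI example above, assuming `from_ts=` and `to=` are
accepted as keyword arguments to `iter()` (the option names follow the
jargon section below):

```
import cdx_toolkit

cdx = cdx_toolkit.CDXFetcher(source='ia')
url = 'commoncrawl.org/*'

# assumes from_ts= and to= are iter() keyword arguments, mirroring the
# --from and --to command-line switches
for obj in cdx.iter(url, from_ts='2008', to='200906302359', limit=1):
    print(obj)
```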

## The full syntax for command-line tools

```
$ cdxt --help
$ cdxt iter --help
$ cdxt warc --help
$ cdxt size --help
```

Run these for full details. Note that argument order matters: each switch
is valid either only before or only after the {iter,warc,size} command.

Add `-v` (or `-vv`) to see what's going on under the hood.

## Python programming example

Everything that you can do on the command line, and much more, can
be done by writing a Python program.

```
import cdx_toolkit

cdx = cdx_toolkit.CDXFetcher(source='cc')
url = 'commoncrawl.org/*'

print(url, 'size estimate', cdx.get_size_estimate(url))

for obj in cdx.iter(url, limit=1):
    print(obj)
```

At the moment this will print:

```
commoncrawl.org/* size estimate 36000
{'urlkey': 'org,commoncrawl)/', 'timestamp': '20180219112308', 'mime-detected': 'text/html', 'url': 'http://commoncrawl.org/', 'status': '200', 'filename': 'crawl-data/CC-MAIN-2018-09/segments/1518891812584.40/warc/CC-MAIN-20180219111908-20180219131908-00494.warc.gz', 'mime': 'text/html', 'length': '5365', 'digest': 'FM7M2JDBADOQIHKCSFKVTAML4FL2HPHT', 'offset': '81614902'}
```

You can also fetch the content of the web capture as bytes:

```
    print(obj.content)
```

There's a full example of iterating and selecting a subset of captures
to write into an extracted WARC file in [examples/iter-and-warc.py](examples/iter-and-warc.py).
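
As a smaller illustration of that pattern, the sketch below iterates
over a handful of captures and keeps only those with HTTP status 200.
It assumes fields can be read from each capture object with dict-style
indexing, as the printed output above suggests:

```
import cdx_toolkit

cdx = cdx_toolkit.CDXFetcher(source='cc')

# keep only captures whose status field is '200'; obj['status'] access
# is assumed to match the dict-like output shown above
ok = []
for obj in cdx.iter('commoncrawl.org/*', limit=50):
    if obj['status'] == '200':
        ok.append(obj['url'])

print(len(ok), 'captures with status 200')
```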

## Filter syntax

Filters can be used to limit captures to a subset of the results.

Any field name listed in `cdxt iter --all-fields` can be used in a
filter.  These field names are appropriately renamed if the source is
'ia'.  The different syntax of filter modifiers for 'ia' and 'cc' is
not fully abstracted away by cdx_toolkit.

The basic syntax of a filter is `[modifier]field:expression`, for
example `=status:200` or `!=status:200`.

'cc'-style filters (pywb) come in six flavors: substring match, exact
string, full-match regex, and their inversions. These are indicated by
a modifier of nothing, '=', '\~', '!', '!=', and '!\~', respectively.

'ia'-style filters (Wayback/OpenWayback) come in two flavors: a full-match
regex and an inverted full-match regex, for example 'status:200' and '!status:200'.

Multiple filters will be combined with AND. For example, to limit
captures to those which do not have status 200 and do not have status 404:

```
$ cdxt --cc --filter '!=status:200' --filter '!=status:404' iter ...
```

Note that filters that discard large numbers of captures put a high
load on the CDX server -- for example, a filter that returns just a
few captures from a domain that has tens of millions of captures is
likely to run very slowly and annoy the owner of the CDX server.

See https://github.com/webrecorder/pywb/wiki/CDX-Server-API#filter (pywb)
and https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server#filtering (wayback)
for full details of filter modifiers.
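
Filters can also be applied from Python. A minimal sketch, assuming the
keyword argument is named `filter=` and accepts a list of filter strings
(as described in the jargon section below):

```
import cdx_toolkit

cdx = cdx_toolkit.CDXFetcher(source='cc')

# programmatic equivalent of the CLI example above; the filter= keyword
# and its list form are assumptions based on the jargon section
for obj in cdx.iter('commoncrawl.org/*',
                    filter=['!=status:200', '!=status:404'],
                    limit=10):
    print(obj)
```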

## CDX Jargon, Field Names, and such

cdx_toolkit supports all (ok, most!) of the options and fields discussed
in the CDX API documentation:

* https://github.com/webrecorder/pywb/wiki/CDX-Server-API
* https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server

A **capture** is a single crawled url, be it a copy of a webpage, a
redirect to another page, an error such as 404 (page not found), or a
revisit record (page identical to a previous capture).

The **url** used by cdx_toolkit can be wildcarded in two ways. One way
is `*.example.com`, which in CDX jargon sets **matchType='domain'**,
and will return captures for example.com, blog.example.com, and
support.example.com. The other, `example.com/*`, will return captures
for any page on example.com.
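
One quick way to compare the scope of the two wildcard forms is to
reuse `get_size_estimate()` from the Python example above; the numbers
returned are rough estimates:

```
import cdx_toolkit

cdx = cdx_toolkit.CDXFetcher(source='cc')

# matchType='domain' (all subdomains) vs. a path prefix on one host;
# both counts are rough estimates
print('*.commoncrawl.org  ', cdx.get_size_estimate('*.commoncrawl.org'))
print('commoncrawl.org/*  ', cdx.get_size_estimate('commoncrawl.org/*'))
```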

A **timestamp** represents year-month-day-time as a string of digits run together.
Example: January 5, 2016 at 12:34:56 UTC is 20160105123456. These timestamps are
a field in the index, and are also used to specify the dates used
by **--from=**, **--to**, and **--closest** on the command line. (Programmatically,
use **from_ts=**, **to=**, and **closest=**.)
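
For reference, these 14-digit timestamps can be converted to and from
Python datetimes with the standard library; this is only an illustration
of the format, not part of the cdx_toolkit API:

```
from datetime import datetime, timezone

ts = '20160105123456'
dt = datetime.strptime(ts, '%Y%m%d%H%M%S').replace(tzinfo=timezone.utc)
print(dt)                           # 2016-01-05 12:34:56+00:00
print(dt.strftime('%Y%m%d%H%M%S'))  # back to 20160105123456
```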

An **urlkey** is a SURT, which is a munged-up url suitable for
deduplication and sorting. This sort order is how CDX indices
efficiently support queries like `*.example.com`. The SURTs for
www.example.com and example.com are identical, which is handy when
these 2 hosts actually have identical web content. The original url
should be present in all records, if you want to know exactly what it
is.
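
To make the idea concrete, here is a deliberately simplified toy that
reverses the host labels the way a SURT does. Real SURT canonicalization
(dropping 'www', lowercasing, handling paths and ports) is more involved,
so treat this as illustration only:

```
def toy_surt(host):
    # reverse the dotted host labels and join with commas, e.g.
    # 'blog.example.com' -> 'com,example,blog)/'
    return ','.join(reversed(host.split('.'))) + ')/'

print(toy_surt('commoncrawl.org'))   # org,commoncrawl)/
print(toy_surt('blog.example.com'))  # com,example,blog)/
```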

The **limit** argument limits how many captures will be returned.  To
help users not shoot themselves in the foot, a limit of 1,000 is
applied to --get and .get() calls.

A **filter** allows a user to select a subset of CDX records, reducing
network traffic between the CDX API server and the user. For example,
filter='!status:200' will only show captures whose http status is not
200. Multiple filters can be specified as a list (in the API) and on
the command line (by specifying --filter more than once). Filters and
**limit** work together, with the limit applying to the count of
captures after the filter is applied. Note that revisit records have a
status of '-', not 200.

CDX API servers support a **paged interface** for efficient access to
large sets of URLs. cdx_toolkit iterators always use the paged interface.
cdx_toolkit is also polite to CDX servers by being single-threaded and
serial. If it's not fast enough for you, consider downloading Common
Crawl's index files directly.

A **digest** is a SHA-1 checksum of the contents of a capture. The
purpose of a digest is to make it easy to figure out whether two captures
have identical content.
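
If you want to check a digest yourself, the values in the example output
above look like base32-encoded SHA-1 hashes of the capture payload; a
sketch under that assumption:

```
import base64
import hashlib

def sketch_digest(payload_bytes):
    # assumes the CDX digest is the base32 encoding of a SHA-1 hash of
    # the capture payload; details such as transfer decoding may differ
    return base64.b32encode(hashlib.sha1(payload_bytes).digest()).decode('ascii')

print(sketch_digest(b'hello world'))
```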

Common Crawl publishes a new index each month. cdx_toolkit will start
using new ones as soon as they are published. **By default,
cdx_toolkit will use the most recent 12 months of Common Crawl**; you
can change that using **--from** or **from_ts=** and **--to** or
**to=**.

CDX implementations do not efficiently support reversed sort orders,
so cdx_toolkit results will be ordered by ascending SURT and by
ascending timestamp. However, since CC has an individual index for
each month, and because most users want more recent results,
cdx_toolkit defaults to querying CC's CDX indices in decreasing month
order, but each month's result will be in ascending SURT and ascending
timestamp. This default sort order is named 'mixed'. If you'd like
pure ascending, set **--cc-sort** or **cc_sort=** to 'ascending'. You
may want to also specify **--from** or **from_ts=** to set a starting
timestamp.

The main problem with this ascending sort order is that it's a pain to
get the most recent N captures: --limit and limit= will return the
oldest N captures. With the 'mixed' ordering, a large enough limit=
will get close to returning the most recent N captures.
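
A minimal sketch of asking for the pure ascending order from Python,
assuming `cc_sort=` and `from_ts=` are accepted as keyword arguments to
`iter()` (the option names come from the paragraphs above):

```
import cdx_toolkit

cdx = cdx_toolkit.CDXFetcher(source='cc')

# cc_sort= and from_ts= as iter() keywords are assumptions based on the
# --cc-sort and --from switches described above
for obj in cdx.iter('commoncrawl.org/*', cc_sort='ascending',
                    from_ts='20230101000000', limit=10):
    print(obj['timestamp'], obj['url'])
```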

## TODO

Content downloading needs help with charset issues, preferably
figuring out the charset using an algorithm similar to browsers.

WARC generation should do smart(er) things with revisit records.

Right now the CC code selects which monthly CC indices to use based
solely on date ranges. It would be nice to have an alternative so that
a client could iterate against the most recent N CC indices, and
also have the default one-year lookback use an entire monthly index
instead of a partial one.

## Status

cdx_toolkit has reached the beta-testing stage of development.

## License

Copyright 2018-2020 Greg Lindahl and others

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this software except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.



            
