# pyedgar
Python package for downloading EDGAR documents and data.
[![PyPI version shields.io](https://img.shields.io/pypi/v/pyedgar.svg)](https://pypi.python.org/pypi/pyedgar/)
[![PyPI license](https://img.shields.io/pypi/l/pyedgar.svg)](https://pypi.python.org/pypi/pyedgar/)
[![PyPI pyversions](https://img.shields.io/pypi/pyversions/pyedgar.svg)](https://pypi.python.org/pypi/pyedgar/)
[![GitHub latest commit](https://badgen.net/github/last-commit/gaulinmp/pyedgar)](https://GitHub.com/gaulinmp/pyedgar/commit/)
## Usage
There are two primary interfaces to this library, namely filings and indices.
### filing.py
[filing.py](pyedgar/filing.py) is the main module for interacting with EDGAR forms.
Simple example:
```python
from pyedgar import Filing
f = Filing(20, '0000893220-96-000500')
print(f)
# Output: <EDGAR filing (20/0000893220-96-000500) Headers:False, Text:False, Documents:False>
print(f.type, f)
# Output: 10-K <EDGAR filing (20/0000893220-96-000500) Headers:True, Text:True, Documents:False>
print(f.documents[0]['full_text'][:800])
# Output:
# SECURITIES AND EXCHANGE COMMISSION
# WASHINGTON, D.C. 20549
#
# FORM 10-K
#
# (Mark One)
# /X/ Annual report pursuant to section 13 or 15(d) of the Securities Exchange
# Act of 1934 [Fee Required] for the fiscal year ended December 30, 1995 or
#
# / / Transition report pursuant to section 13 or 15(d) of the Securities
# Exchange Act of 1934 [No Fee Required] for the transition period from
# ________ to ________
#
# COMMISSION FILE NUMBER 0-9576
#
#
# K-TRON INTERNATIONAL, INC.
# (EXACT NAME OF REGISTRANT AS SPECIFIED IN ITS CHARTER)
#
# New Jersey 22-1759452
# (State or other jurisdiction of (I.R.S. Employer Identification No.)
```
Filings are loaded lazily: the file is read from disk or downloaded from the EDGAR website only when you request its data.
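The lazy-loading behaviour can be illustrated with a minimal sketch. The `LazyFiling` class and `fake_loader` function below are hypothetical stand-ins written for this example; they are not pyedgar's implementation, only one common way to defer a read until a property is first accessed.

```python
class LazyFiling:
    """Hypothetical stand-in for a lazily loaded filing (not pyedgar's class)."""

    def __init__(self, cik, accession, loader):
        self.cik = cik
        self.accession = accession
        self._loader = loader    # callable that fetches the raw text
        self._full_text = None   # nothing is read until first access

    @property
    def full_text(self):
        # Load once on first access, then reuse the cached value.
        if self._full_text is None:
            self._full_text = self._loader(self.cik, self.accession)
        return self._full_text

    def __repr__(self):
        return "<LazyFiling ({}/{}) Text:{}>".format(
            self.cik, self.accession, self._full_text is not None)


calls = []

def fake_loader(cik, accession):
    # Stands in for a disk read or a download; records each invocation.
    calls.append((cik, accession))
    return "<TYPE>10-K\n..."

f = LazyFiling(20, "0000893220-96-000500", fake_loader)
print(f)          # Text:False, nothing has been loaded yet
_ = f.full_text   # first access triggers the loader
_ = f.full_text   # second access is served from the cache
print(f, len(calls))
```

The same idea explains why the `Headers` and `Text` flags in the example above flip from `False` to `True` only after `f.type` is accessed.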
Filing objects have the following properties:
* ``path``: path to cached filing on disk
* ``urls``: URLs on the EDGAR website for the full text file and the index file
* ``full_text``: Full text of the entire `.nc` filing (not just the first document)
* ``headers``: Dictionary of all the headers from the full filing (i.e. not the exhibits). E.g. CIK, ACCESSION, PERIOD, etc.
* ``type``: The general type of the document, extracted from the TYPE header and cleaned up (so 10-K405 --> 10-K)
* ``type_exact``: The exact text extracted from the TYPE field
* ``documents``: List of all the documents (each enclosed in `<DOCUMENT>` and `</DOCUMENT>` tags). The 0th is typically the main form (e.g. the 10-K filing); subsequent documents are exhibits.
  * Each document in this list is itself a dictionary, with fields: TYPE, SEQUENCE, DESCRIPTION (typically the file name), FULL_TEXT. FULL_TEXT holds the text of that document alone, e.g. just the 10-K filing or a single exhibit, in text or HTML.
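To make the document structure concrete, here is a rough sketch of how a full-text filing could be split into per-document dictionaries. The `split_documents` function and the sample text are assumptions made for illustration; pyedgar's own parser may work differently, and its dictionary keys may differ in case (the example above reads `'full_text'`).

```python
import re

DOC_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL)
TAG_RE = re.compile(r"^<(TYPE|SEQUENCE|DESCRIPTION)>(.*)$", re.MULTILINE)

def split_documents(full_text):
    """Return one dict per <DOCUMENT> block (hypothetical helper)."""
    docs = []
    for match in DOC_RE.finditer(full_text):
        body = match.group(1)
        # Header tags come before <TEXT>; the document body sits inside it.
        header, _, rest = body.partition("<TEXT>")
        doc = {name: value.strip() for name, value in TAG_RE.findall(header)}
        text, _, _ = rest.partition("</TEXT>")
        doc["FULL_TEXT"] = text.strip()
        docs.append(doc)
    return docs

sample = """<DOCUMENT>
<TYPE>10-K
<SEQUENCE>1
<DESCRIPTION>FORM 10-K
<TEXT>
Annual report body
</TEXT>
</DOCUMENT>
<DOCUMENT>
<TYPE>EX-27
<SEQUENCE>2
<TEXT>
Exhibit body
</TEXT>
</DOCUMENT>"""

docs = split_documents(sample)
print([d["TYPE"] for d in docs])
print(docs[0]["FULL_TEXT"])
```

Note that a header tag such as DESCRIPTION can be absent from a document, so `doc.get("DESCRIPTION")` is safer than direct indexing in this sketch.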
### index.py
[index.py](pyedgar/index.py) is the main module for accessing extracted EDGAR indices.
The indices are created in [pyedgar.utilities.indices](pyedgar/utilities/indices.py#L34) by the IndexMaker class.
Once these indices are created (which you can do by setting ``force_download=True``), you can view them via the ``indices`` property:
```python
from pyedgar import EDGARIndex
all_indices = EDGARIndex(force_download=False)
print(all_indices.indices)
# Output:
# {'form_all.tab': '/data/storage/edgar/indices/form_all.tab',
# 'form_10-Q.tab': '/data/storage/edgar/indices/form_10-Q.tab',
# 'form_13s.tab': '/data/storage/edgar/indices/form_13s.tab',
# 'form_DEF14A.tab': '/data/storage/edgar/indices/form_DEF14A.tab',
# 'form_8-K.tab': '/data/storage/edgar/indices/form_8-K.tab',
# 'form_20-F.tab': '/data/storage/edgar/indices/form_20-F.tab',
# 'form_10-K.tab': '/data/storage/edgar/indices/form_10-K.tab'}
```
These indices are accessible as pandas dataframes via ``[]`` indexing or the ``get_index`` method, where the index is selected by one of the keys above (with or without the ``form_`` prefix or the ``.tab`` suffix).
```python
form_10k = all_indices['10-K']
print(form_10k.head(1))
# Output:
# cik name form filedate accession
# 0 20 K TRON INTERNATIONAL INC 10-K 1996-03-28 0000893220-96-000500
```
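The flexible key lookup (with or without ``form_`` and ``.tab``) can be pictured as a small normalization step. The `normalize_index_key` helper below is a hypothetical sketch of that idea, not code taken from pyedgar.

```python
def normalize_index_key(key):
    """Map '10-K', 'form_10-K' or 'form_10-K.tab' to 'form_10-K.tab'."""
    name = key
    if name.startswith("form_"):
        name = name[len("form_"):]
    if name.endswith(".tab"):
        name = name[:-len(".tab")]
    return "form_{}.tab".format(name)

for key in ("10-K", "form_10-K", "10-K.tab", "form_all.tab"):
    print(key, "->", normalize_index_key(key))
```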
To get a type of form that isn't automatically extracted, you can use form_all:
```python
df_s1 = EDGARIndex().get_index('all').query("form.str.startswith('S-1')")
print(df_s1.head(1))
# Output:
# cik name form filedate accession
# 5600 1961 WORLDS INC S-1 2014-02-04 0001264931-14-000033
```
All indices are loaded and saved by pandas, so pandas is a requirement for using this functionality.
## Config
Config files named ``pyedgar.conf``, ``.pyedgar``, ``pyedgar.ini`` are searched for at (in order):
1. ``os.environ.get("PYEDGAR_CONF", '.')`` <-- PYEDGAR_CONF environment variable
2. ``./``
3. ``~/.config/pyedgar``
4. ``~/AppData/Local/pyedgar``
5. ``~/AppData/Roaming/pyedgar``
6. ``~/Library/Preferences/pyedgar``
7. ``~/.config/``
8. ``~/``
9. ``~/Documents/``
10. ``os.path.abspath(os.path.dirname(__file__))`` <-- directory of the package. The package ships with a default config file in this location.
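The search order above amounts to "first match wins". The sketch below shows one way such a lookup could be written; `candidate_dirs` and `find_config` are hypothetical names, and the handling of ``PYEDGAR_CONF`` as either a directory or a direct file path is an assumption, not pyedgar's confirmed behaviour.

```python
import os
import tempfile

CONFIG_NAMES = ("pyedgar.conf", ".pyedgar", "pyedgar.ini")

def candidate_dirs(package_dir):
    """Locations in the documented order (package_dir is supplied by the caller)."""
    home = os.path.expanduser("~")
    return [
        os.environ.get("PYEDGAR_CONF", "."),
        ".",
        os.path.join(home, ".config", "pyedgar"),
        os.path.join(home, "AppData", "Local", "pyedgar"),
        os.path.join(home, "AppData", "Roaming", "pyedgar"),
        os.path.join(home, "Library", "Preferences", "pyedgar"),
        os.path.join(home, ".config"),
        home,
        os.path.join(home, "Documents"),
        package_dir,
    ]

def find_config(locations, names=CONFIG_NAMES):
    """Return the first config file found, or None."""
    for location in locations:
        # Assumption: a location may point straight at a config file.
        if os.path.isfile(location):
            return os.path.abspath(location)
        for name in names:
            path = os.path.join(location, name)
            if os.path.isfile(path):
                return os.path.abspath(path)
    return None

# Demonstration with throwaway directories: only the second holds a config.
with tempfile.TemporaryDirectory() as first, tempfile.TemporaryDirectory() as second:
    open(os.path.join(second, "pyedgar.ini"), "w").close()
    found = find_config([first, second])
    print(os.path.basename(found))
```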
See the [example config file](pyedgar/pyedgar.conf) for commented config settings.
To switch between multiple configs, set ``PYEDGAR_CONF`` in ``os.environ`` before importing ``pyedgar.config``:
```python
import os
# os.environ['PYEDGAR_CONF'] = os.path.expanduser('~/Dropbox/config/pyedgar/hades.local.pyedgar.conf')
os.environ['PYEDGAR_CONF'] = os.path.expanduser('~/Dropbox/config/pyedgar/hades.desb.pyedgar.conf')
from pyedgar import config
print(config.CONFIG_FILE)
# Output:
# WARNING:pyedgar.config:Loaded config file from '[~]/Dropbox/config/pyedgar/hades.desb.pyedgar.conf'.
# ALERT!!!! FILING_PATH_FORMAT is '{accession[11:13]}/{accession}.nc'.
# [~]/Dropbox/config/pyedgar/hades.desb.pyedgar.conf
```
## Downloader
A convenience downloader script is included for downloading filing feed files and indices.
To see the status of current cached downloads (shows the latest downloaded files) and to see the config setup:
```bash
$ python -m pyedgar.downloader --status --config
```
To download and extract index files:
```bash
$ python -m pyedgar.downloader -i --log info
```
And to download and extract the last 30 days of filings:
```bash
$ python -m pyedgar.downloader -d
```
To download and extract filings since the beginning:
```bash
$ python -m pyedgar.downloader -d --start-date 1995-01-01
```
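If you prefer to launch the downloader from Python (for example from a scheduled job), the command line can be assembled as a list and passed to ``subprocess.run``. The `downloader_command` helper is a hypothetical wrapper written for this sketch; it only uses the ``-d`` and ``--start-date`` flags shown above.

```python
import sys
from datetime import date

def downloader_command(start=None):
    """Build the argument list for the filing download (hypothetical helper)."""
    cmd = [sys.executable, "-m", "pyedgar.downloader", "-d"]
    if start is not None:
        # ISO format (YYYY-MM-DD) matches the --start-date example above.
        cmd += ["--start-date", start.isoformat()]
    return cmd

cmd = downloader_command(start=date(1995, 1, 1))
print(" ".join(cmd[1:]))
# To actually run it: subprocess.run(cmd, check=True)
```

Using ``sys.executable`` keeps the download in the same Python environment in which pyedgar is installed.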
## Install
Install from PyPI with pip:
```bash
pip install pyedgar
```
Or install with pip directly from GitHub:
```bash
pip install git+https://github.com/gaulinmp/pyedgar#egg=pyedgar
```
Or clone the repository from GitHub and install it in editable mode:
```bash
git clone https://github.com/gaulinmp/pyedgar
cd pyedgar
pip install -e ./
```
## Requirements
w3m is required for converting HTML to plaintext (tested on Linux).
A fallback method might one day be added.
Tested only on Python >3.4.
HTML parsing has been tested only on Linux.
Other HTML->text conversion methods were tried (html2text, BeautifulSoup, lxml), but w3m was the fastest even with the overhead of a subprocess call.
Converting multiple HTML files could probably be optimized by reusing one instance of w3m instead of spawning a subprocess for each call.
But that's for future Mac to work on.
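For readers curious what the w3m call and a possible fallback might look like, here is a rough sketch. It is not pyedgar's code: `html_to_text` and `html_to_text_fallback` are hypothetical names, the w3m invocation (``w3m -dump -T text/html`` reading from standard input) is one standard way to call it, and the fallback uses only the standard library's ``html.parser``, so its output will be cruder than w3m's. It requires Python 3.7+ for ``capture_output``.

```python
import shutil
import subprocess
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text_fallback(html):
    # Pure-Python fallback: one line per text fragment.
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)

def html_to_text(html):
    # Use w3m when it is installed; otherwise fall back to the parser above.
    if shutil.which("w3m") is None:
        return html_to_text_fallback(html)
    result = subprocess.run(
        ["w3m", "-dump", "-T", "text/html"],
        input=html, capture_output=True, text=True, check=True)
    return result.stdout

page = "<html><body><h1>FORM 10-K</h1><p>Annual report</p></body></html>"
print(html_to_text_fallback(page))
```

Each call to `html_to_text` still spawns one w3m process, which is the per-call cost described above.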
Raw data
{
"_id": null,
"home_page": "https://github.com/gaulinmp/pyedgar",
"name": "pyedgar",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": "SEC EDGAR filings",
"author": "Mac Gaulin",
"author_email": "gaulinmp+pyedgar@gmail.com",
"download_url": "https://files.pythonhosted.org/packages/9d/31/2b24ad2fe79a9db9cde8bf3b75ac0979634f2e265a9ba3b91f38ef266c97/pyedgar-0.1.10.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": "MIT",
"summary": "Python interface to EDGAR filings.",
"version": "0.1.10",
"project_urls": {
"Homepage": "https://github.com/gaulinmp/pyedgar"
},
"split_keywords": [
"sec",
"edgar",
"filings"
],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "607c95c1cbc2f43ebde62db7aa2b2abf946825454340c2c95bee5b80330cb535",
"md5": "93169a2106b8421b4ab94c64fd91e623",
"sha256": "625ab37a888cc7fe9193e4c98b90ecca09e36e2aad6cf4f4596dd54c3ef04d35"
},
"downloads": -1,
"filename": "pyedgar-0.1.10-py3-none-any.whl",
"has_sig": false,
"md5_digest": "93169a2106b8421b4ab94c64fd91e623",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 51700,
"upload_time": "2024-12-17T00:01:59",
"upload_time_iso_8601": "2024-12-17T00:01:59.394960Z",
"url": "https://files.pythonhosted.org/packages/60/7c/95c1cbc2f43ebde62db7aa2b2abf946825454340c2c95bee5b80330cb535/pyedgar-0.1.10-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "9d312b24ad2fe79a9db9cde8bf3b75ac0979634f2e265a9ba3b91f38ef266c97",
"md5": "d3d2f9870eecd10cf02bfa3100838f53",
"sha256": "7b8014b0860ca08333cc2e0f997c2c09976af3d53362fdb6616756baef57455f"
},
"downloads": -1,
"filename": "pyedgar-0.1.10.tar.gz",
"has_sig": false,
"md5_digest": "d3d2f9870eecd10cf02bfa3100838f53",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 47892,
"upload_time": "2024-12-17T00:02:01",
"upload_time_iso_8601": "2024-12-17T00:02:01.937441Z",
"url": "https://files.pythonhosted.org/packages/9d/31/2b24ad2fe79a9db9cde8bf3b75ac0979634f2e265a9ba3b91f38ef266c97/pyedgar-0.1.10.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-12-17 00:02:01",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "gaulinmp",
"github_project": "pyedgar",
"travis_ci": false,
"coveralls": false,
"github_actions": true,
"requirements": [
{
"name": "requests",
"specs": []
},
{
"name": "BeautifulSoup4",
"specs": []
},
{
"name": "pandas",
"specs": []
},
{
"name": "tqdm",
"specs": []
}
],
"lcname": "pyedgar"
}