## SEC Parsers


Parses non-standardized SEC filings into structured XML. Use cases include LLMs, NLP, and textual analysis. The average parse time for a 100-page document is 0.4 seconds. The package is a work in progress and is updated frequently.

Supported filing types are 10-K, 10-Q, 8-K, S-1, and 20-F. More will be added soon, or you can write your own! [How to write a Custom Parser in 5 minutes](https://medium.com/@jgfriedman99/how-to-write-a-custom-sec-parser-in-5-minutes-5c7a8d5d81b0)

`sec-parsers` is maintained by John Friedman and is released under the MIT License. If you use `sec-parsers` for a project, please let me know! [Feedback](https://forms.gle/hZRgDoDGmsHs3wiF6)

<em>URGENT</em>: I need advice on how to name user-facing functions, since I don't want to deprecate function names in the future. [Link](contributors.md)

<em>Notice</em>: `download_sec_filing` is being deprecated.

<div align="center">
<img src="https://raw.githubusercontent.com/john-friedman/SEC-Parsers/main/Assets/tesla_visualizationv3.png">
</div>
<div align="center">
<img src="https://raw.githubusercontent.com/john-friedman/SEC-Parsers/main/Assets/tesla_tree_v4.png" width="500">
</div>
Installation
```
pip install sec-parsers # base package
pip install "sec-parsers[all]" # installs all extras
pip install "sec-parsers[downloaders]" # installs downloaders extras
pip install "sec-parsers[visualizers]" # installs visualizers extras
```
### Quickstart
Load package
```
from sec_parsers import Filing
```
Downloading html file (new)
```
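from sec_downloaders import SEC_Downloader

url = 'https://www.sec.gov/Archives/edgar/data/1318605/000162828024002390/tsla-20231231.htm'
downloader = SEC_Downloader()
downloader.set_headers("John Doe", "johndoe@example.com") # identifies you to the SEC
download = downloader.download(url)
filing = Filing(download)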
```
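The SEC asks automated clients to identify themselves with a `User-Agent` header containing a name and contact email, which is why the downloader takes both. The sketch below shows that idea with the standard library only. It is not the `SEC_Downloader` implementation, and the helper name `build_sec_request` is hypothetical.

```python
from urllib.request import Request


def build_sec_request(url, name, email):
    """Build a request whose User-Agent identifies the caller by name and email."""
    # Hypothetical helper for illustration; SEC_Downloader may set headers differently.
    return Request(url, headers={"User-Agent": f"{name} {email}"})


request = build_sec_request(
    "https://www.sec.gov/Archives/edgar/data/1318605/000162828024002390/tsla-20231231.htm",
    "John Doe",
    "johndoe@example.com",
)

# urllib stores header names capitalized, so the lookup key is "User-agent".
print(request.get_header("User-agent"))

# To actually fetch the filing (requires network access):
# from urllib.request import urlopen
# html = urlopen(request).read().decode("utf-8")
```

The resulting HTML string could then be passed to `Filing`, as in the examples in this section.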
Downloading html file (old)
```
from sec_parsers import download_sec_filing
html = download_sec_filing('https://www.sec.gov/Archives/edgar/data/1318605/000162828024002390/tsla-20231231.htm')
filing = Filing(html)
```
Parsing
```
filing.parse() # parses the filing
filing.visualize() # opens the filing in a web browser with section headers highlighted
filing.find_sections_from_title(title) # finds sections by title, e.g. 'item 1a'
filing.find_sections_from_text(text) # finds sections that contain your text
filing.get_tree(node) # with no argument, returns the full xml tree; with a node, returns that node's tree
filing.get_title_tree() # returns the xml tree using titles instead of tags. More descriptive than get_tree.
filing.get_subsections_from_section() # gets the children of a section
filing.get_nested_subsections_from_section() # gets the descendants of a section
filing.set_filing_type(type) # e.g. 'S-1'. Use when automatic detection fails
filing.save_xml(file_name,encoding='utf-8') # saves the parsed filing as xml
filing.save_csv(file_name,encoding='ascii') # saves the parsed filing as csv
```
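Once a filing has been saved with `save_xml`, the file can be explored with ordinary XML tools. Below is a minimal sketch using `xml.etree.ElementTree` that prints an indented outline of a document. The inline sample, its tag names, and the `title` attribute are hypothetical stand-ins, so check a real output file (for example `Examples/tesla_10k.xml`) for the actual structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real output of save_xml may use different tags and attributes.
SAMPLE_XML = """<document>
  <part title="PART I">
    <item title="ITEM 1. BUSINESS">Overview text.</item>
    <item title="ITEM 1A. RISK FACTORS">Risk text.</item>
  </part>
</document>"""


def outline(element, depth=0):
    """Return (depth, label) pairs for every node, preferring a 'title' attribute."""
    label = element.get("title", element.tag)
    rows = [(depth, label)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows


root = ET.fromstring(SAMPLE_XML)
rows = outline(root)
for depth, label in rows:
    print("  " * depth + label)
```

To run this against a saved file, replace `ET.fromstring(SAMPLE_XML)` with `ET.parse(file_name).getroot()`.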
### Additional Resources:
* [quickstart](Examples/quickstart.ipynb)
* [How to write a Custom Parser in 5 minutes](https://medium.com/@jgfriedman99/how-to-write-a-custom-sec-parser-in-5-minutes-5c7a8d5d81b0)
* [Archive of Parsed XMLs / CSVs](https://www.dropbox.com/scl/fo/np1lpow7r3bissz80ze3o/AKGM8skBrUfEGlSweofAUDU?rlkey=cz1r78jofntjeq4ax2vb2yd0u&e=1&st=mdcwgfcm&dl=0) - Last updated 7/24/24.
* [example parsed filing](Examples/tesla_10k.xml)
* [example parsed filing exported to csv](Examples/tesla_10k.csv).
### Features:
* supports 10-K, 10-Q, 8-K, S-1, and 20-F filings
* export to XML or CSV, with an option to convert text to ASCII
* visualization
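
For readers wondering what converting text to ASCII involves, here is one common approach, sketched with the standard library. It is an assumption for illustration, not necessarily how `save_csv` does it: the text is decomposed with Unicode NFKD normalization and any character without an ASCII equivalent is dropped. The function name `to_ascii` is hypothetical.

```python
import unicodedata


def to_ascii(text):
    """Approximate text in ASCII: decompose characters, then drop what is left over."""
    # NFKD splits accented letters into base letter + combining mark and
    # maps compatibility characters (e.g. non-breaking space) to plain forms.
    decomposed = unicodedata.normalize("NFKD", text)
    # Characters with no ASCII form (combining marks, curly quotes) are discarded.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("caf\u00e9 r\u00e9sum\u00e9"))  # accented letters lose their accents
print(to_ascii("Form\u00a010-K"))              # non-breaking space becomes a space
print(to_ascii("company\u2019s"))              # curly apostrophe is dropped entirely
```

Note the trade-off: this approach silently deletes characters such as curly quotes rather than replacing them, which matters if punctuation is important to downstream analysis.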
### Feature Requests:
[Request a Feature](contributors.md)
* company metadata (sharif) - will add to downloader
* filing metadata (sharif) - waiting for SEC Downloaders first release
* Export to dta (Denis)
* DEF 14A, DEFM14A (Denis)
* Export to markdown (Astarag)
* Better parsing_string handling. Opened an issue. (sharif)
#### SEC Downloader
Not released yet, different repo.
* Download by company name, ticker, etc
* Download all 10-Ks, etc
* Rate limit handling
* asynchronous downloads
### Statistics
* Speed: On average, a 10-K filing parses in 0.25 seconds. 7,118 10-K annual reports were filed in 2023, so parsing all of them should take about half an hour (7,118 × 0.25 s ≈ 1,780 s).
* Improving speed is currently not a priority. If you need more speed, let me know. I think I can reduce parse time to roughly 0.01 seconds per 10-K.
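
The half-hour estimate above is simple arithmetic, worked through below using only the figures quoted in this section. The 0.01-second figure is the author's estimate of what may be achievable, not a measured result.

```python
FILINGS_2023 = 7_118              # 10-K annual reports filed in 2023
SECONDS_PER_FILING = 0.25         # current average parse time per 10-K
TARGET_SECONDS_PER_FILING = 0.01  # estimated achievable parse time, not measured


def total_minutes(filing_count, seconds_per_filing):
    """Total sequential parse time in minutes."""
    return filing_count * seconds_per_filing / 60


current = total_minutes(FILINGS_2023, SECONDS_PER_FILING)
target = total_minutes(FILINGS_2023, TARGET_SECONDS_PER_FILING)

print(f"Current: {current:.1f} minutes")  # Current: 29.7 minutes
print(f"Target:  {target:.1f} minutes")   # Target:  1.2 minutes
```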
### Other packages useful for SEC filings
* https://github.com/dgunning/edgartools
### Updates
#### Towards Version 1:
* Most/All SEC text filings supported
* Few errors
* xml
Might be done along the way:
* Faster parsing, probably using streaming approach, and combining modules together.
* Introduction section parsing
* Signatures section parsing
* Better visualization interface (e.g. like pdfviewer for sections)
#### Beyond Version 1:
To improve the package beyond V1 it looks like I need compute and storage. Not sure how to get that. Working on it.
Metadata
* Clustering similar section titles using ML (e.g. seasonality headers)
* Adding tags to individual sections using small LLMs (e.g. tags for mentions of supply chains, energy, etc.)
Other
* Table parsing
* Image OCR
* Parsing non-html filings
### Current Priority list:
* look at code duplication w.r.t. style detectors, e.g. all caps and emphasis. May want to combine into one detector.
  - Yes, this is a priority. Have to handle e.g. Introduction and Segment Overview under the same rule. Bit difficult. Will think it over.
* better function names - need to decide terminology soon.
* consider adding table of contents, forward-looking information, etc.
  - forward-looking information, DOCUMENTS INCORPORATED BY REFERENCE, TABLE OF CONTENTS - go with a bunch
* fix layering issue - e.g. top div hides sections
* make trees nicer
* add more filing types
* fix all caps and emphasis issue
* clean text
* Better historical conversion: handle the case where PART I appears multiple times as a header, e.g. add logic for "Item 1 (continued)".
Raw data
{
"_id": null,
"home_page": "https://github.com/john-friedman/SEC-Parsers",
"name": "sec-parsers",
"maintainer": null,
"docs_url": null,
"requires_python": null,
"maintainer_email": null,
"keywords": null,
"author": "John Friedman",
"author_email": null,
"download_url": "https://files.pythonhosted.org/packages/23/06/f4639263631b08f692e53c0e8e52b730b52bea02fb204d425b76176df057/sec_parsers-0.549.tar.gz",
"platform": null,
"bugtrack_url": null,
"license": null,
"summary": "A package to parse SEC filings",
"version": "0.549",
"project_urls": {
"Homepage": "https://github.com/john-friedman/SEC-Parsers"
},
"split_keywords": [],
"urls": [
{
"comment_text": "",
"digests": {
"blake2b_256": "e87032e07c28c127f76ad44a3398128cb6968b6aa80a9f57a955478c14cb847c",
"md5": "04c245baf74e0f9f2368f30077a039dd",
"sha256": "1989e9e4d894bb6a65876e123941d53423e2ed9301eae9766fdfe26cd3bf965b"
},
"downloads": -1,
"filename": "sec_parsers-0.549-py3-none-any.whl",
"has_sig": false,
"md5_digest": "04c245baf74e0f9f2368f30077a039dd",
"packagetype": "bdist_wheel",
"python_version": "py3",
"requires_python": null,
"size": 21011,
"upload_time": "2024-07-29T18:17:53",
"upload_time_iso_8601": "2024-07-29T18:17:53.271010Z",
"url": "https://files.pythonhosted.org/packages/e8/70/32e07c28c127f76ad44a3398128cb6968b6aa80a9f57a955478c14cb847c/sec_parsers-0.549-py3-none-any.whl",
"yanked": false,
"yanked_reason": null
},
{
"comment_text": "",
"digests": {
"blake2b_256": "2306f4639263631b08f692e53c0e8e52b730b52bea02fb204d425b76176df057",
"md5": "bb86af3719d2d0db31cca4fb32cd0158",
"sha256": "bfd4a610a45aec57b5757552c4a2421990ecfe0e89873efd08718545512c2e2b"
},
"downloads": -1,
"filename": "sec_parsers-0.549.tar.gz",
"has_sig": false,
"md5_digest": "bb86af3719d2d0db31cca4fb32cd0158",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 19904,
"upload_time": "2024-07-29T18:17:54",
"upload_time_iso_8601": "2024-07-29T18:17:54.770481Z",
"url": "https://files.pythonhosted.org/packages/23/06/f4639263631b08f692e53c0e8e52b730b52bea02fb204d425b76176df057/sec_parsers-0.549.tar.gz",
"yanked": false,
"yanked_reason": null
}
],
"upload_time": "2024-07-29 18:17:54",
"github": true,
"gitlab": false,
"bitbucket": false,
"codeberg": false,
"github_user": "john-friedman",
"github_project": "SEC-Parsers",
"travis_ci": false,
"coveralls": false,
"github_actions": false,
"lcname": "sec-parsers"
}