sec-parsers

Name: sec-parsers
Version: 0.549
Summary: A package to parse SEC filings
Home page: https://github.com/john-friedman/SEC-Parsers
Author: John Friedman
Upload time: 2024-07-29 18:17:54
## SEC Parsers
![PyPI - Downloads](https://img.shields.io/pypi/dm/sec-parsers)
[![Hits](https://hits.seeyoufarm.com/api/count/incr/badge.svg?url=https%3A%2F%2Fgithub.com%2Fjohn-friedman%2FSEC-Parsers&count_bg=%2379C83D&title_bg=%23555555&icon=&icon_color=%23E7E7E7&title=hits&edge_flat=false)](https://hits.seeyoufarm.com)
![GitHub](https://img.shields.io/github/stars/john-friedman/sec-parsers)

Parses non-standardized SEC filings into structured XML. Use cases include LLMs, NLP, and textual analysis. Average parse time for a 100-page document is 0.4 seconds. The package is a work in progress and is updated frequently.

Supported filing types are 10-K, 10-Q, 8-K, S-1, 20-F. More will be added soon, or you can write your own! [How to write a Custom Parser in 5 minutes](https://medium.com/@jgfriedman99/how-to-write-a-custom-sec-parser-in-5-minutes-5c7a8d5d81b0)

`sec-parsers` is maintained by John Friedman, and is under the MIT License. If you use `sec-parsers` for a project, please let me know! [Feedback](https://forms.gle/hZRgDoDGmsHs3wiF6)

<em>URGENT</em>: Advice is needed on naming the user-facing functions. I don't want to have to deprecate function names in the future. [Link](contributors.md)

<em>Notice</em>: `download_sec_filing` is being deprecated.

<div align="center">
  <img src="https://raw.githubusercontent.com/john-friedman/SEC-Parsers/main/Assets/tesla_visualizationv3.png">
</div>
<div align="center">
  <img src="https://raw.githubusercontent.com/john-friedman/SEC-Parsers/main/Assets/tesla_tree_v4.png" width="500">
</div>

### Installation
```
pip install sec-parsers                  # base package
pip install "sec-parsers[all]"           # installs all extras
pip install "sec-parsers[downloaders]"   # installs the downloaders extra
pip install "sec-parsers[visualizers]"   # installs the visualizers extra
```

### Quickstart
Load package
```
from sec_parsers import Filing
```

Downloading an HTML file (new)
```
from sec_downloaders import SEC_Downloader

downloader = SEC_Downloader()
downloader.set_headers("John Doe", "johndoe@example.com")
download = downloader.download(url)
filing = Filing(download)
```
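If you'd rather not install the downloader extra, the standard library works too. A minimal sketch, assuming only that EDGAR's fair-access policy expects a descriptive `User-Agent` header identifying you; the function name here is hypothetical, not part of `sec-parsers`:

```python
from urllib.request import Request, urlopen

def fetch_filing_html(url: str, user_agent: str) -> str:
    """Fetch a filing's HTML from EDGAR, sending the descriptive
    User-Agent header that SEC's fair-access policy asks for."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")

if __name__ == "__main__":
    html = fetch_filing_html(
        "https://www.sec.gov/Archives/edgar/data/1318605/000162828024002390/tsla-20231231.htm",
        "John Doe johndoe@example.com",
    )
    # filing = Filing(html)
```

The resulting string can be passed to `Filing(...)` just like the output of the bundled downloader.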

Downloading an HTML file (old, deprecated)
```
from sec_parsers import download_sec_filing
html = download_sec_filing('https://www.sec.gov/Archives/edgar/data/1318605/000162828024002390/tsla-20231231.htm')
filing = Filing(html)
```

Parsing
```
filing.parse() # parses the filing
filing.visualize() # opens the filing in a web browser with section headers highlighted
filing.find_sections_from_title(title) # finds sections by title, e.g. 'item 1a'
filing.find_sections_from_text(text) # finds sections which contain your text
filing.get_tree(node) # with no argument, returns the XML tree; with a node, returns that node's subtree
filing.get_title_tree() # returns an XML tree using titles instead of tags; more descriptive than get_tree
filing.get_subsections_from_section() # gets the children of a section
filing.get_nested_subsections_from_section() # gets the descendants of a section
filing.set_filing_type(type) # e.g. 'S-1'; use when automatic detection fails
filing.save_xml(file_name, encoding='utf-8')
filing.save_csv(file_name, encoding='ascii')
```
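Once saved with `save_xml`, the output can be explored with the standard library alone. A minimal sketch, assuming a nested section structure; the tag names below are illustrative stand-ins, since the real tags come from each filing's detected section titles:

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for a filing saved with filing.save_xml();
# actual tag names depend on the sections the parser detects.
sample = """
<filing>
  <part1>
    <item1>Tesla designs and manufactures electric vehicles.</item1>
    <item1a>Risk factors text here.</item1a>
  </part1>
</filing>
"""

root = ET.fromstring(sample)
# Walk every section and collect (tag, text) pairs.
sections = [(el.tag, (el.text or "").strip()) for el in root.iter() if el is not root]
for tag, text in sections:
    print(tag, "->", text[:40])
```

The same traversal works on a real parsed filing such as the Tesla 10-K example linked below.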
### Additional Resources:
* [quickstart](Examples/quickstart.ipynb)
* [How to write a Custom Parser in 5 minutes](https://medium.com/@jgfriedman99/how-to-write-a-custom-sec-parser-in-5-minutes-5c7a8d5d81b0)
* [Archive of Parsed XMLs / CSVs](https://www.dropbox.com/scl/fo/np1lpow7r3bissz80ze3o/AKGM8skBrUfEGlSweofAUDU?rlkey=cz1r78jofntjeq4ax2vb2yd0u&e=1&st=mdcwgfcm&dl=0) - Last updated 7/24/24.
* [example parsed filing](Examples/tesla_10k.xml)
* [example parsed filing exported to csv](Examples/tesla_10k.csv).

### Features:
* lots of filing types
* export to xml, csv, with option to convert to ASCII
* visualization

### Feature Requests:
[Request a Feature](contributors.md)
* Company metadata (sharif) - will add to the downloader
* Filing metadata (sharif) - waiting for SEC Downloader's first release
* Export to dta (Denis)
* DEF 14A, DEFM14A (Denis)
* Export to markdown (Astarag)
* Better parsing_string handling - opened an issue (sharif)

#### SEC Downloader
Not yet released; lives in a different repo.
* Download by company name, ticker, etc
* Download all 10-Ks, etc
* Rate limit handling
* asynchronous downloads

### Statistics
* Speed: on average, a 10-K filing parses in 0.25 seconds. There were 7,118 10-K annual reports filed in 2023, so parsing every 2023 10-K should take about half an hour.
* Improving speed is currently not a priority. If you need more speed, let me know; I think I can increase parsing speed to ~0.01 seconds per 10-K.
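The half-hour figure above checks out arithmetically:

```python
# Back-of-the-envelope check of the parse-time estimate.
filings_2023 = 7118          # 10-K annual reports filed in 2023
seconds_per_filing = 0.25    # average parse time per 10-K
total_minutes = filings_2023 * seconds_per_filing / 60
print(round(total_minutes, 1))  # ~29.7 minutes, i.e. about half an hour
```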

### Other packages useful for SEC filings
* https://github.com/dgunning/edgartools

### Updates
#### Towards Version 1:
* Most/All SEC text filings supported
* Few errors
* xml 

Might be done along the way:
* Faster parsing, probably using a streaming approach and combining modules together.
* Introduction section parsing
* Signatures section parsing
* Better visualization interface (e.g. like pdfviewer for sections)

#### Beyond Version 1:
To improve the package beyond V1 it looks like I need compute and storage. Not sure how to get that. Working on it.

Metadata
* Clustering similar section titles using ML (e.g. seasonality headers)
* Adding tags to individual sections using small LLMs (e.g. tag for mentions supply chains, energy, etc)

Other
* Table parsing
* Image OCR
* Parsing non-html filings

### Current Priority list:
* Look at code duplication w.r.t. style detectors, e.g. all caps and emphasis; may want to combine into one detector.
  - Yes, this is a priority. Have to handle e.g. Introduction and Segment Overview under the same rule. A bit difficult; will think it over.
* Better function names - need to decide on terminology soon.
* Consider adding table of contents, forward-looking information, etc.
  - Forward-looking information, DOCUMENTS INCORPORATED BY REFERENCE, TABLE OF CONTENTS - go with a bunch.
* Fix layering issue - e.g. a top div hides sections.
* Make trees nicer.
* Add more filing types.
* Fix the all caps and emphasis issue.
* Clean text.
* Better historical conversion: handle PART I appearing multiple times as a header, e.g. "item 1 continued" logic.



            
