secfsdstools

Name	secfsdstools JSON
Version	2.4.1 JSON
Summary	A few python tools to analyze the SEC.gov financial statements data sets (https://www.sec.gov/dera/data/financial-statement-data-sets)
upload_time	2025-07-26 05:16:38
maintainer	Hansjoerg Wingeier
author	Hansjoerg
requires_python	>=3.10
license	Apache-2.0
keywords	sec.gov sec edgar sec filing edgar finance cik 10-q 10-k 8-k financial statements financial statements dataset financial analysis data processing financial data sec api xbrl

# sec-fincancial-statement-data-set tools (SFSDSTools 2)

Helper tools to analyze the [Financial Statement Data Sets](https://www.sec.gov/dera/data/financial-statement-data-sets) from the U.S. Securities and Exchange Commission (sec.gov).
The SEC releases quarterly zip files, each containing four CSV files with numerical data from all financial reports filed within that quarter. However, accessing data from the past 12 years can be time-consuming due to the large amount of data - over 120 million data points in over 2GB of zip files by 2023.

This library simplifies the process of working with this data and provides a
convenient way to extract information from the primary financial statements - the balance sheet (BS), income statement (IS), and statement of cash flows (CF).

Check out my article on Medium, [Understanding the SEC Financial Statement Data Sets](https://medium.com/@hansjoerg.wingeier/understanding-the-sec-financial-statement-data-sets-6148e07d1715), to get
an introduction to the [Financial Statement Data Sets](https://www.sec.gov/dera/data/financial-statement-data-sets).

The main features include:
- all data is on your local hard drive and can be updated automatically, no need for numerous API calls
- data is loaded as pandas dataframes
- fast and efficient reading of a single report, all reports of one or multiple companies, or even all available reports 
- filter framework with predefined filters, easy to extend, supports easy way of saving, loading, and combining filtered data (see [01_quickstart.ipynb](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/01_quickstart.ipynb) and
[03_explore_with_interactive_notebook.ipynb](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/03_explore_with_interactive_notebook.ipynb))
- standardize the data for balance sheets, income statements, and cash flow statements to make reports easily comparable
(see [07_00_standardizer_basics.ipynb](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_00_standardizer_basics.ipynb), 
[07_01_BS_standardizer.ipynb](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_01_BS_standardizer.ipynb), 
[07_02_IS_standardizer.ipynb](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_02_IS_standardizer.ipynb), and
[07_03_CF_standardizer.ipynb](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_03_CF_standardizer.ipynb))
- automate processing and standardizing by configuring customized process steps that are executed whenever a new 
  data file is detected on sec.gov (see [08_00_automation_basics.ipynb](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/08_00_automation_basics.ipynb))
- version 2 supports the new "segments" column that was added in December 2024
- **experimental - introduced in version 2.4.0: support for daily updates of the financial reports (see [10_00_daily_financial_report_updates.ipynb](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/10_00_daily_financial_report_updates.ipynb))**

Have a look at the [Release Notes](https://hansjoergw.github.io/sec-fincancial-statement-data-set/releasenotes/)
<br/>
<br/>
<br/>

<span style="color: #FF8C00;">==========================================================</span>
### If you find this tool useful, a sponsorship would be greatly appreciated! ###

**https://github.com/sponsors/HansjoergW**

### How to get in touch ###
* Found a bug: https://github.com/HansjoergW/sec-fincancial-statement-data-set/issues
* Have a remark: https://github.com/HansjoergW/sec-fincancial-statement-data-set/discussions/categories/general
* Have an idea: https://github.com/HansjoergW/sec-fincancial-statement-data-set/discussions/categories/ideas
* Have a question: https://github.com/HansjoergW/sec-fincancial-statement-data-set/discussions/categories/q-a
* Have something to show: https://github.com/HansjoergW/sec-fincancial-statement-data-set/discussions/categories/show-and-tell

<span style="color: #FF8C00;">==========================================================</span>


# Principles

The goal is to be able to do bulk processing of the data without the need to do countless API calls to sec.gov. 
Therefore, the quarterly zip files are downloaded and indexed using a SQLite database table.
The index table contains information on all filed reports since about 2010 - over 500,000 in total. The first
download will take a couple of minutes, but after that, all the data is on your local hard disk.

Using the index in the sqlite db allows for direct extraction of data for a specific report from the
appropriate zip file, reducing the need to open and search through each zip file.
Moreover, the downloaded zip files are converted to the parquet format which provides faster read access
to the data compared to reading the csv files inside the zip files.

The library is designed to have a low memory footprint.


# Installation and basic usage

The library has been tested with Python 3.8, 3.9, 3.10, and 3.11; note that the current release (2.4.1) declares `requires_python >=3.10`.
The project is published on [pypi.org](https://pypi.org/project/secfsdstools/). Simply use the following command to install the latest version:

```
pip install secfsdstools
```

If you want to contribute, just clone the project and use a Python environment that satisfies the package's Python requirement (`>=3.10` for the current release).
The dependencies are defined in the requirements.txt file or use the pyproject.toml to install them.

To have a first glance at the library, check out the interactive jupyter notebooks [01_quickstart.ipynb](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/01_quickstart.ipynb) 
and [03_explore_with_interactive_notebook.ipynb](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/03_explore_with_interactive_notebook.ipynb), which are located in the `notebooks` directory of the GitHub repo.

Upon using the library for the first time, it downloads the data files and creates the index by calling the `update()`
method. You can manually trigger the update using the following code:

```
from secfsdstools.update import update

if __name__ == '__main__':
    update()
```

The following tasks will be executed:
1. All currently available zip files are downloaded from sec.gov (these are over 50 files that will need over 2 GB of space on your local drive).
2. All the zip files are transformed and stored as parquet files. By default, each zip file is deleted afterward. If you want to keep the zip files, set the parameter 'KeepZipFiles' in the config file to True.
3. An index inside a sqlite db file is created


Moreover, at most once a day, it is checked if there is a new zip file available on sec.gov. If there is, a download will be started automatically. 
If you don't want 'auto-update', set the 'AutoUpdate' in your config file to False.



## Configuration (optional)

If you don't provide a config file, a config file named `secfsdstools.cfg` will be created in your home directory the first time you use the API.
The file only requires the following entries:

```
[DEFAULT]
downloaddirectory = c:/users/me/secfsdstools/data/dld
parquetdirectory = c:/users/me/secfsdstools/data/parquet
dbdirectory = c:/users/me/secfsdstools/data/db
useragentemail = your.email@goeshere.com
```

The `downloaddirectory` is the place where quarterly zip files from the sec.gov are downloaded to.
The `parquetdirectory` is the folder where the data is stored in parquet format.
The `dbdirectory` is the directory in which the SQLite db is created.
The `useragentemail` is used in the requests made to the sec.gov website. Since we only make limited calls to the sec.gov,
you can leave the example "your.email@goeshere.com". 
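
As a minimal sketch, a config file with the four entries above could also be generated with Python's standard `configparser` module. The helper name `write_default_config` and its parameters are placeholders for illustration and are not part of `secfsdstools`; the sub-directory names simply mirror the example above.

```python
import configparser
from pathlib import Path


def write_default_config(cfg_path: Path, base_dir: Path) -> configparser.ConfigParser:
    """Write a config file with the four entries shown above (hypothetical helper)."""
    config = configparser.ConfigParser()
    # All entries live in the [DEFAULT] section, as in the example above.
    config["DEFAULT"] = {
        "downloaddirectory": (base_dir / "dld").as_posix(),
        "parquetdirectory": (base_dir / "parquet").as_posix(),
        "dbdirectory": (base_dir / "db").as_posix(),
        "useragentemail": "your.email@goeshere.com",
    }
    with open(cfg_path, "w", encoding="utf-8") as handle:
        config.write(handle)
    return config
```

Calling `write_default_config(Path.home() / "secfsdstools.cfg", Path.home() / "secfsdstools" / "data")` would write a file equivalent to the example, with the paths adapted to your home directory. Be aware that this overwrites an existing file at that location.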

## A first simple example
Goal: present the information in the balance sheet of Apple's 2022 10-K report in the same way as it appears in the
original report on page 31 ("CONSOLIDATED BALANCE SHEETS"): https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm

**Note:** Version 2 of the framework now supports the `segments` column that was introduced in January 2025. By adjusting the
parameter `show_segments`, you can define whether the segment information is shown or not.

````
from secfsdstools.e_collector.reportcollecting import SingleReportCollector
from secfsdstools.e_filter.rawfiltering import ReportPeriodAndPreviousPeriodRawFilter
from secfsdstools.e_presenter.presenting import StandardStatementPresenter

if __name__ == '__main__':
    # the unique identifier for apple's 10-K report of 2022
    apple_10k_2022_adsh = "0000320193-22-000108"
  
    # use a Collector to grab the data of the 10-K report and filter for balance sheet information
    collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(
          adsh=apple_10k_2022_adsh,
          stmt_filter=["BS"]
    )  
    rawdatabag = collector.collect() # load the data from the disk
    
   
    bs_df = (rawdatabag
                       # ensure only data from the period (2022) and the previous period (2021) is in the data
                       .filter(ReportPeriodAndPreviousPeriodRawFilter())
                       # join the content of pre.txt and num.txt together
                       .join()  
                       # format the data in the same way as it appears in the report
                       .present(StandardStatementPresenter(show_segments=False))) 
    print(bs_df) 
````


## Viewing metadata

The recommended way to view and use the metadata is to use the `secfsdstools` library functions as described in [notebooks/01_quickstart.ipynb](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/01_quickstart.ipynb).

Of course, the created "index of reports" can be viewed also using a database viewer that supports the SQLite format,
such as [DB Browser for SQLite](https://sqlitebrowser.org/).

(The location of the SQLite database file is specified in the `dbdirectory` field of the config file, which is set to
`<home>/secfsdstools/data/db` in the default configuration. The name of the database file is `secfsdstools.db`.)

There are only two relevant tables in the database: `index_parquet_reports` and `index_parquet_processing_state`.

The `index_parquet_reports` table provides an overview of all available reports in the downloaded
data and includes the following relevant columns:

* **adsh** : The unique id of the report (a string).
* **cik** : The unique id of the company (an int).
* **name** : The name of the company in uppercase.
* **form** : The type of the report (e.g.: annual: 10-K, quarterly: 10-Q).
* **filed** : The date when the report was filed, in the format YYYYMMDD (stored as an integer).
* **period** : The date for which the report was created. This is the date on the balance sheet (stored as an integer).
* **fullPath** : The path to the downloaded zip file that contains the details of that report.
* **url** : The url which takes you directly to the filing of this report on the sec.gov website.
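
Because `filed` and `period` are stored as integers in `YYYYMMDD` form, they usually need converting before any date arithmetic. The following is a minimal sketch using only the standard library; the helper name `yyyymmdd_to_date` is illustrative and not part of `secfsdstools`, and the example value is taken from the file name of the Apple report linked earlier (`aapl-20220924`).

```python
from datetime import date, datetime


def yyyymmdd_to_date(value: int) -> date:
    """Convert an integer such as 20220924 into a datetime.date."""
    return datetime.strptime(str(value), "%Y%m%d").date()


period = yyyymmdd_to_date(20220924)
print(period.isoformat())  # 2022-09-24
```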

For instance, if you want to have an overview of all reports that Apple has filed since 2010,
just search for "%APPLE INC%" in the name column.

Searching for "%APPLE INC%" will also reveal its cik: 320193

If you accidentally delete data in the database file, don't worry. Just delete the database file
and run `update()` again (see previous chapter).
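
If you prefer a script over a database viewer, the same lookup can be sketched with Python's built-in `sqlite3` module. The table and column names are the ones listed above; the helper names `find_reports_by_name` and `default_db_path` are hypothetical and only serve this example.

```python
import sqlite3
from pathlib import Path


def default_db_path() -> Path:
    """Database location in the default configuration described above."""
    return Path.home() / "secfsdstools" / "data" / "db" / "secfsdstools.db"


def find_reports_by_name(conn: sqlite3.Connection, name_pattern: str) -> list[tuple]:
    """Return (adsh, cik, name, form, filed) rows whose name matches a SQL LIKE pattern."""
    query = (
        "SELECT adsh, cik, name, form, filed "
        "FROM index_parquet_reports "
        "WHERE name LIKE ? "
        "ORDER BY filed"
    )
    # The pattern is passed as a bound parameter rather than formatted into the SQL string.
    return conn.execute(query, (name_pattern,)).fetchall()
```

With the default configuration, this could be used as `conn = sqlite3.connect(default_db_path())` followed by `find_reports_by_name(conn, "%APPLE INC%")`. The query only reads from the index, so it does not modify the database.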


## Overview
The following diagram gives an overview of the SECFSDSTools library.

![Overview](https://github.com/HansjoergW/sec-fincancial-statement-data-set/raw/main/docs/images/overview.png)

It mainly consists of two processes. The first one is the "Data Update Process", which is responsible for
downloading the Financial Statement Data Sets zip files from the sec.gov website, transforming the content into parquet
format, and indexing the content of these files in a simple SQLite database. Again, this whole process can be started
"manually" by calling the update method, or it is done automatically, as described above.

The second main process is the "Data Processing Process", which works with the data that is stored inside the
sub.txt, pre.txt, and num.txt files from the zip files. The "Data Processing Process" mainly consists of four steps:

* **Collect** <br/> Collect the rawdata from one or more different zip files. For instance, get all the data for a single
report, or get the data for all 10-K reports of a single or multiple companies from several zip files.
* **Raw Processing** <br/> Once the data is collected, the collected data for sub.txt, pre.txt, and num.txt is available
as a pandas dataframe. Filters can be applied, the content can directly be saved and loaded.
* **Joined Processing** <br/> From the "Raw Data", a "joined" representation can be created. This joins the data from
the pre.txt and num.txt content together based on the "adsh", "tag", and "version" attributes. "Joined data" can also be
filtered, concatenated, directly saved and loaded.
* **Present** <br/> Produce a single pandas dataframe out of the data and use it for further processing or use the standardizers
 to create comparable data for the balance sheet, the income statement, and the cash flow statement.
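
To make the join step more concrete, the sketch below performs the same kind of inner join on `adsh`, `tag`, and `version` with plain Python dictionaries. It only illustrates the concept and is not how the library implements `join()`; the function name `inner_join` and all row values are made up for the example.

```python
def inner_join(pre_rows, num_rows, keys=("adsh", "tag", "version")):
    """Inner-join two lists of dicts on the given key columns."""
    # Index the num rows by their key tuple so each pre row can look up its matches.
    index = {}
    for num in num_rows:
        index.setdefault(tuple(num[k] for k in keys), []).append(num)
    joined = []
    for pre in pre_rows:
        for num in index.get(tuple(pre[k] for k in keys), []):
            joined.append({**pre, **num})
    return joined


ADSH = "0000320193-22-000108"
pre_rows = [  # presentation info: where a tag appears in which statement
    {"adsh": ADSH, "tag": "Assets", "version": "us-gaap/2022", "stmt": "BS", "line": 1},
    {"adsh": ADSH, "tag": "Liabilities", "version": "us-gaap/2022", "stmt": "BS", "line": 2},
]
num_rows = [  # numeric values (illustrative numbers, not real filing data)
    {"adsh": ADSH, "tag": "Assets", "version": "us-gaap/2022", "ddate": 20220930, "value": 100.0},
    {"adsh": ADSH, "tag": "Assets", "version": "us-gaap/2022", "ddate": 20210930, "value": 90.0},
    {"adsh": ADSH, "tag": "Revenues", "version": "us-gaap/2022", "ddate": 20220930, "value": 50.0},
]

joined = inner_join(pre_rows, num_rows)
for row in joined:
    print(row["stmt"], row["tag"], row["ddate"], row["value"])
```

Only the two `Assets` rows survive: `Liabilities` has no numeric value and `Revenues` has no presentation entry, so an inner join drops both.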

The diagram also shows the main classes with which a user interacts. Their use is described in the following chapters.


## Feature Overview

This section shows some example code of the different features. Have a look at the [notebooks/01_quickstart.ipynb](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/01_quickstart.ipynb)
notebook and all other notebooks to get more details on how to use the framework.

### Working with the Index

* Access the index in the SQLite database to find the CIK (central index key) for a company:
  ```
  from secfsdstools.c_index.searching import IndexSearch
  
  index_search = IndexSearch.get_index_search()
  results = index_search.find_company_by_name("apple")
  print(results)
  ```

* Get the information on the latest filing of a company:
  ````
  from secfsdstools.c_index.companyindexreading import CompanyIndexReader
  
  apple_cik = 320193
  apple_index_reader = CompanyIndexReader.get_company_index_reader(cik=apple_cik)
  print(apple_index_reader.get_latest_company_filing())
  ````

* Show all annual reports of a company by using its CIK number:
  ````
  from secfsdstools.c_index.companyindexreading import CompanyIndexReader
  
  apple_cik = 320193
  apple_index_reader = CompanyIndexReader.get_company_index_reader(cik=apple_cik)
  
  # only show the annual reports of apple
  print(apple_index_reader.get_all_company_reports_df(forms=["10-K"]))
  ````

### Loading Data
The previously introduced `IndexSearch` and `CompanyIndexReader` let you know what data is available, but they do not
return the real data of the financial statements. This is what the `Collector` classes are used for.

All the `Collector` classes have their own factory method(s) which instantiates the class. 

Most of these factory methods
also provide parameters to filter the data directly when being loaded from the parquet files.
These are the `forms_filter` (which type of reports you want to read, for instance "10-K"), the `stmt_filter`
(which statements you want to read, for instance the balance sheet), and the `tag_filter` (which defines the tags
you want to read, for instance "Assets"). Of course, such filters could also be applied afterward, but it is slightly
more efficient to apply them directly when loading.

All `Collector` classes have a `collect` method which then loads the data from the parquet files and returns an instance
of `RawDataBag`. The `RawDataBag` instance then contains a pandas dataframe for the `sub` (submission) data,
`pre` (presentation) data, and `num` (the numeric values) data.

* Load a single report using the `SingleReportCollector`:
    ````
    from secfsdstools.e_collector.reportcollecting import SingleReportCollector

    apple_10k_2022_adsh = "0000320193-22-000108"

    collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(adsh=apple_10k_2022_adsh)
    rawdatabag = collector.collect()

    # as expected, there is just one entry in the submission dataframe
    print(rawdatabag.sub_df)
    # just print the size of the pre and num dataframes
    print(rawdatabag.pre_df.shape)
    print(rawdatabag.num_df.shape)
    ````
* Load multiple reports with the `MultiReportCollector`:
    ````
    from secfsdstools.e_collector.multireportcollecting import MultiReportCollector
    apple_10k_2022_adsh = "0000320193-22-000108"
    apple_10k_2012_adsh = "0001193125-12-444068"

    if __name__ == '__main__':
        # load only the assets tags that are present in the 10-K report of apple in the years
        # 2022 and 2012
        collector: MultiReportCollector = \
            MultiReportCollector.get_reports_by_adshs(adshs=[apple_10k_2022_adsh,
                                                             apple_10k_2012_adsh],
                                                      tag_filter=['Assets'])
        rawdatabag = collector.collect()
        # as expected, there are just two entries in the submission dataframe
        print(rawdatabag.sub_df)
        print(rawdatabag.num_df)  
    ```` 

* Load all data for one or multiple quarters using the `ZipCollector`:
    ````
    from secfsdstools.e_collector.zipcollecting import ZipCollector

    # only collect the Balance Sheet of annual reports that
    # were filed during the first quarter in 2022
    if __name__ == '__main__':
        collector: ZipCollector = ZipCollector.get_zip_by_name(name="2022q1.zip",
                                                               forms_filter=["10-K"],
                                                               stmt_filter=["BS"])
    
        rawdatabag = collector.collect()
    
        # only show the size of the data frame
        # .. over 4000 companies filed a 10 K report in q1 2022
        print(rawdatabag.sub_df.shape)
        print(rawdatabag.pre_df.shape)
        print(rawdatabag.num_df.shape)    
    ```` 

* Load all data for a single company or multiple companies
    ````
    from secfsdstools.e_collector.companycollecting import CompanyReportCollector
    
    if __name__ == '__main__':
        apple_cik = 320193
        collector = CompanyReportCollector.get_company_collector(ciks=[apple_cik],
                                                                 forms_filter=["10-K"])
    
        rawdatabag = collector.collect()
    
        # all filed 10-K reports for apple since 2010 are in the databag
        print(rawdatabag.sub_df)
    
        print(rawdatabag.pre_df.shape)
        print(rawdatabag.num_df.shape)    
    ```` 

Have a look at the [collector_deep_dive notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/04_collector_deep_dive.ipynb).

### Working with raw data
When the `collect` method of a `Collector` class is called, the data for the sub, pre, and num dataframes is loaded
and stored in the sub_df, pre_df, and num_df attributes of a `RawDataBag` instance.

* `save` and `load`
    ````
    from secfsdstools.d_container.databagmodel import RawDataBag
    from secfsdstools.e_collector.reportcollecting import SingleReportCollector

    # read data
    apple_10k_2022_adsh = "0000320193-22-000108"
    collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(adsh=apple_10k_2022_adsh)
    rawdatabag = collector.collect()

    # save it
    rawdatabag.save("<path>")
    
    # load it back
    bag = RawDataBag.load("<path>")
  
    # load it back with predicate pushdown (filter while reading)
    bag = RawDataBag.load("<path>", stmt_filter=["BS"])
    ```` 

*  `concat` multiple instances of `RawDataBag`
    ````
    concat_bag = RawDataBag.concat(list_of_rawdatabags)    
    ````

*  `concat_filebased` concatenates multiple RawDataBag folders into a new folder in a very memory-efficient way
    ````
    RawDataBag.concat_filebased(list_of_rawdatabag_folders, target_folder)    
    ````

* `join` produces a `JoinedDataBag` by joining the content of the pre_df and num_df
   based on the columns adsh, tag, and version. It is an inner join. The joined dataframe appears as pre_num_df in
   the `JoinedDataBag`.
    ````
    from secfsdstools.e_collector.reportcollecting import SingleReportCollector

    # read data
    apple_10k_2022_adsh = "0000320193-22-000108"
    collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(adsh=apple_10k_2022_adsh)
    rawdatabag = collector.collect()

    joineddatabag = rawdatabag.join()
  
    print(joineddatabag.pre_num_df)
    ````

*  Use filters to `filter` the data. There are many predefined filters, but it is also easy to write your own.
   ````
   
   # Note: instead of using a_RawDataBag.filter(<myFilter>) you could also use a_RawDataBag[<myFilter>]

   # The raw filters are imported from the same module as in the other examples of this README
   from secfsdstools.e_filter.rawfiltering import (
       AdshRawFilter, StmtRawFilter, TagRawFilter, MainCoregRawFilter,
       ReportPeriodAndPreviousPeriodRawFilter, ReportPeriodRawFilter,
       OfficialTagsOnlyRawFilter, USDOnlyRawFilter, NoSegmentInfoRawFilter)
   
   # Filters the `RawDataBag` instance based on the list of adshs that were provided in the constructor. 
   a_filtered_RawDataBag = a_RawDataBag.filter(AdshRawFilter(adshs=['0001193125-09-214859', '0001193125-10-238044']))

   # Filters the `RawDataBag` instance based on the list of statements ('BS', 'CF', 'IS', ...).
   a_filtered_RawDataBag = a_RawDataBag.filter(StmtRawFilter(stmts=['BS', 'CF']))

   # Filters the `RawDataBag` instance based on the list of tags that is provided.
   a_filtered_RawDataBag = a_RawDataBag.filter(TagRawFilter(tags=['Assets', 'Liabilities']))

   # Filters the `RawDataBag` so that data of subsidiaries are removed.
   a_filtered_RawDataBag = a_RawDataBag.filter(MainCoregRawFilter()) 

   # The data of a report usually also contains data from previous years.
   # However, often you just want to analyze the data of the current and the previous year. This filter ensures that
   # only data for the current period and the previous period are contained in the data.
   a_filtered_RawDataBag = a_RawDataBag.filter(ReportPeriodAndPreviousPeriodRawFilter()) 

   # If you are just interested in the data of a report that is from the current period
   # of the report, then you can use this filter.
   a_filtered_RawDataBag = a_RawDataBag.filter(ReportPeriodRawFilter()) 

   # Sometimes companies provide their own tags, which are not defined by the us-gaap XBRL
   # definition. In such cases, the version column contains the value of the adsh instead of something like us-gaap/2022.
   # This filter removes unofficial tags.
   a_filtered_RawDataBag = a_RawDataBag.filter(OfficialTagsOnlyRawFilter()) 

   # Reports often also contain datapoints, or even the same datapoint, in currencies other than USD.
   # This filter ensures that only USD datapoints are kept.
   a_filtered_RawDataBag = a_RawDataBag.filter(USDOnlyRawFilter()) 

   # If you don't care about segment information, you can use this filter.
   a_filtered_RawDataBag = a_RawDataBag.filter(NoSegmentInfoRawFilter()) 
   
   ````  

   Have a look at the [filter_deep_dive notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/05_filter_deep_dive.ipynb).

### Working with joined data
When the `join` method of a `RawDataBag` instance is called an instance of `JoinedDataBag` is returned.

The `JoinedDataBag` provides `save`, `load`, `concat`, and `concat_filebased` in the same manner as the
`RawDataBag` does. 
Moreover, `filter` is also available with the same set of filters. They just go by the name
`...JoinedFilter` instead of `...RawFilter`.

`present` The idea of the present method is to make a final presentation of the data as a pandas dataframe. 
The method has a parameter presenter of type Presenter.
It is simple to write your own presenter classes. So far, the framework provides the following Presenter 
implementations (module `secfsdstools.e_presenter.presenting`):

* `StandardStatementPresenter` <br> This presenter provides the data in the same form as it appears in
  the reports themselves.
  ````
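  from secfsdstools.e_collector.reportcollecting import SingleReportCollector
  from secfsdstools.e_filter.rawfiltering import ReportPeriodAndPreviousPeriodRawFilter
  from secfsdstools.e_presenter.presenting import StandardStatementPresenter

  apple_10k_2022_adsh = "0000320193-22-000108"

  collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(
        adsh=apple_10k_2022_adsh,
        stmt_filter=["BS"]
  )
  rawdatabag = collector.collect()
  # the chained calls are wrapped in parentheses so that they can span multiple lines
  bs_df = (rawdatabag.filter(ReportPeriodAndPreviousPeriodRawFilter())
                     .join()
                     .present(StandardStatementPresenter()))
  print(bs_df) 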
  ```` 

### Standardize financial data
  Even though XBRL is a standard for tagging positions and numbers in financial statements, that doesn't mean that financial
  statements can be compared easily. For instance, there are over 3000 tags which can be used in a balance sheet.
  Moreover, some tags can mean similar things or can be grouped under a "parent" tag, which itself might not be present.
  For instance, "AssetsNoncurrent" is often not shown in statements. So you would find the positions for "Assets"
  and "AssetsCurrent", but not for "AssetsNoncurrent". Instead, only child tags of "AssetsNoncurrent" might be
  present.<br><br>
  The standardizer helps to solve these problems by unifying the information of financial statements.<br> <br>
  With the standardized financial statements, you can then actually compare the statements between different
  companies or different years, and you can use the dataset for ML. <br><br>
  For details, have a look at the following notebooks:
  * [standardizer_basics](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_00_standardizer_basics.ipynb)
  * [standardize the balance sheets and make them comparable](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_01_BS_standardizer.ipynb)
  * [standardize the income statements and make them comparable](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_02_IS_standardizer.ipynb)
  * [standardize the cash flow statements and make them comparable](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_03_CF_standardizer.ipynb)
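
One way to picture what "calculating" a missing position means: if a balance sheet reports `Assets` and `AssetsCurrent` but not `AssetsNoncurrent`, the missing value follows from the difference. The sketch below shows this single rule in plain Python. It is a simplified illustration with made-up numbers and a hypothetical helper name, not the rule set that the standardizer actually applies (see the notebooks above for that).

```python
from typing import Optional


def fill_assets_noncurrent(positions: dict[str, Optional[float]]) -> dict[str, Optional[float]]:
    """Return a copy with AssetsNoncurrent derived as Assets - AssetsCurrent when it is missing."""
    result = dict(positions)
    assets = result.get("Assets")
    current = result.get("AssetsCurrent")
    # Only derive the value if it is missing and both inputs are available.
    if result.get("AssetsNoncurrent") is None and assets is not None and current is not None:
        result["AssetsNoncurrent"] = assets - current
    return result


reported = {"Assets": 1000.0, "AssetsCurrent": 400.0, "AssetsNoncurrent": None}
print(fill_assets_noncurrent(reported)["AssetsNoncurrent"])  # 600.0
```

A value that is already reported is left untouched, and nothing is derived when one of the inputs is missing.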


  * `BalanceSheetStandardizer` <br>
  The `BalanceSheetStandardizer` collects and/or calculates the following positions of balance sheets:  

    ````
    - Assets
      - AssetsCurrent
        - Cash
      - AssetsNoncurrent
    - Liabilities
      - LiabilitiesCurrent
      - LiabilitiesNoncurrent
    - Equity
      - HolderEquity (mainly StockholderEquity or PartnerCapital)
        - RetainedEarnings
        - AdditionalPaidInCapital
        - TreasuryStockValue
      - TemporaryEquity
      - RedeemableEquity
    - LiabilitiesAndEquity
    ````

    With just a few lines of code, you'll get a comparable dataset with the main positions of a balance sheet for Microsoft, Alphabet, and Amazon:
    (see the [standardize the balance sheets and make them comparable notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_01_BS_standardizer.ipynb) for details)
     ````python
     from secfsdstools.e_collector.companycollecting import CompanyReportCollector
     from secfsdstools.e_filter.rawfiltering import ReportPeriodRawFilter, MainCoregRawFilter, OfficialTagsOnlyRawFilter, USDOnlyRawFilter
     from secfsdstools.f_standardize.bs_standardize import BalanceSheetStandardizer
   
     bag = CompanyReportCollector.get_company_collector(ciks=[789019, 1652044,1018724]).collect() #Microsoft, Alphabet, Amazon
     filtered_bag = bag[ReportPeriodRawFilter()][MainCoregRawFilter()][OfficialTagsOnlyRawFilter()][USDOnlyRawFilter()]
     joined_bag = filtered_bag.join()
   
     standardizer = BalanceSheetStandardizer()
   
     standardized_bs_df = joined_bag.present(standardizer)
   
     import matplotlib.pyplot as plt
     # Group by 'name' and plot equity for each group
     # Note: using the `present` method ensured that the same cik has always the same name even if the company name did change in the past
     for name, group in standardized_bs_df.groupby('name'):
       plt.plot(group['date'], group['Equity'], label=name, linestyle='-')
   
     # Add labels and title
     plt.xlabel('Date')
     plt.ylabel('Equity')
     plt.title('Equity Over Time for Different Companies (CIKs)')
   
     # Display legend
     plt.legend()
     ````
     ![Equity Compare](https://github.com/HansjoergW/sec-fincancial-statement-data-set/raw/main/docs/images/equity_compare.png)

  * `IncomeStatementStandardizer` <br>
  The `IncomeStatementStandardizer` collects and/or calculates the following positions of income statements:
    
    ````  
      Revenues
      - CostOfRevenue
      ---------------
      = GrossProfit
      - OperatingExpenses
      -------------------
      = OperatingIncomeLoss
        
      IncomeLossFromContinuingOperationsBeforeIncomeTaxExpenseBenefit
      - AllIncomeTaxExpenseBenefit
      ----------------------------
      = IncomeLossFromContinuingOperations
      + IncomeLossFromDiscontinuedOperationsNetOfTax
      -----------------------------------------------
      = ProfitLoss
      - NetIncomeLossAttributableToNoncontrollingInterest
      ---------------------------------------------------
      = NetIncomeLoss
    
      OutstandingShares
      EarningsPerShare
    ````
  
    With just a few lines of code, you'll get a comparable dataset with the main positions of an income statement for Microsoft, Alphabet, and Amazon:
  (see the [standardize the income statement and make them comparable notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_02_IS_standardizer.ipynb) for details)
   
    ````python
    from secfsdstools.e_collector.companycollecting import CompanyReportCollector
    from secfsdstools.e_filter.rawfiltering import ReportPeriodRawFilter, MainCoregRawFilter, OfficialTagsOnlyRawFilter, USDOnlyRawFilter
    from secfsdstools.f_standardize.is_standardize import IncomeStatementStandardizer
      
    bag = CompanyReportCollector.get_company_collector(ciks=[789019, 1652044,1018724]).collect() #Microsoft, Alphabet, Amazon
    filtered_bag = bag[ReportPeriodRawFilter()][MainCoregRawFilter()][OfficialTagsOnlyRawFilter()][USDOnlyRawFilter()]
    joined_bag = filtered_bag.join()
      
    standardizer = IncomeStatementStandardizer()
      
    standardized_is_df = joined_bag.present(standardizer)
    # just use the yearly reports with data for the whole year
    standardized_is_df = standardized_is_df[(standardized_is_df.fp=="FY") & (standardized_is_df.qtrs==4)].copy()
      
    import matplotlib.pyplot as plt
    # Group by 'name' and plot gross profit for each group
    # Note: using the `present` method ensured that the same cik has always the same name even if the company name did change in the past
    for name, group in standardized_is_df.groupby('name'):
      plt.plot(group['date'], group['GrossProfit'], label=name, linestyle='-')
      
    # Add labels and title
    plt.xlabel('Date')
    plt.ylabel('GrossProfit')
    plt.title('GrossProfit Over Time for Different Companies (CIKs)')
      
    # Display legend
    plt.legend()
     ````

  ![GrossProfit Compare](https://github.com/HansjoergW/sec-fincancial-statement-data-set/raw/main/docs/images/grossprofit_compare.png)
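
The annual filter used above (`fp == "FY"` and `qtrs == 4`) can be tried out without downloading any data. The following is a minimal sketch on a toy dataframe: the company names and all numbers are invented, and it assumes that `date` holds datetime values. It only reuses column names that appear in the example above and then pivots the result into one column per company.

```python
import pandas as pd

# Toy stand-in for the standardized dataframe; names and values are invented.
toy_is_df = pd.DataFrame(
    {
        "name": ["ALPHA CORP", "ALPHA CORP", "ALPHA CORP", "BETA INC", "BETA INC"],
        "date": pd.to_datetime(
            ["2021-12-31", "2022-06-30", "2022-12-31", "2021-12-31", "2022-12-31"]
        ),
        "fp": ["FY", "Q2", "FY", "FY", "FY"],
        "qtrs": [4, 1, 4, 4, 4],
        "GrossProfit": [100.0, 30.0, 120.0, 50.0, 65.0],
    }
)

# Keep only full-year values: fiscal period "FY" covering four quarters.
annual_df = toy_is_df[(toy_is_df.fp == "FY") & (toy_is_df.qtrs == 4)].copy()

# One row per date and one column per company, ready for a side-by-side comparison.
gross_profit_wide = annual_df.pivot(index="date", columns="name", values="GrossProfit")
print(gross_profit_wide)
```

The wide layout is convenient for plotting or for computing year-over-year changes per company, but it only works if each company has at most one row per date after filtering.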

  * `CashFlowStandardizer` <br>
   The `CashFlowStandardizer` collects and/or calculates the following positions of cash flow statements:
     
    ````
     NetCashProvidedByUsedInOperatingActivities
       CashProvidedByUsedInOperatingActivitiesDiscontinuedOperations
       NetCashProvidedByUsedInOperatingActivitiesContinuingOperations
           DepreciationDepletionAndAmortization
           DeferredIncomeTaxExpenseBenefit
           ShareBasedCompensation
           IncreaseDecreaseInAccountsPayable
           IncreaseDecreaseInAccruedLiabilities
           InterestPaidNet
           IncomeTaxesPaidNet
    
     NetCashProvidedByUsedInInvestingActivities
         CashProvidedByUsedInInvestingActivitiesDiscontinuedOperations
         NetCashProvidedByUsedInInvestingActivitiesContinuingOperations
           PaymentsToAcquirePropertyPlantAndEquipment
           ProceedsFromSaleOfPropertyPlantAndEquipment
           PaymentsToAcquireInvestments
           ProceedsFromSaleOfInvestments
           PaymentsToAcquireBusinessesNetOfCashAcquired
           ProceedsFromDivestitureOfBusinessesNetOfCashDivested
           PaymentsToAcquireIntangibleAssets
           ProceedsFromSaleOfIntangibleAssets
    
     NetCashProvidedByUsedInFinancingActivities
         CashProvidedByUsedInFinancingActivitiesDiscontinuedOperations
         NetCashProvidedByUsedInFinancingActivitiesContinuingOperations
           ProceedsFromIssuanceOfCommonStock
           ProceedsFromStockOptionsExercised
           PaymentsForRepurchaseOfCommonStock
           ProceedsFromIssuanceOfDebt
           RepaymentsOfDebt
           PaymentsOfDividends
    
    
     EffectOfExchangeRateFinal
     CashPeriodIncreaseDecreaseIncludingExRateEffectFinal
    
     CashAndCashEquivalentsEndOfPeriod
    ````

     With just a few lines of code, you'll get a comparable dataset with the main positions of a cash flow statement for Microsoft, Alphabet, and Amazon:
(see the [standardize the cash flow statements and make them comparable](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_03_CF_standardizer.ipynb) for details)
    ````python
    from secfsdstools.e_collector.companycollecting import CompanyReportCollector
    from secfsdstools.e_filter.rawfiltering import ReportPeriodRawFilter, MainCoregRawFilter, OfficialTagsOnlyRawFilter, USDOnlyRawFilter
    from secfsdstools.f_standardize.cf_standardize import CashFlowStandardizer
    
    bag = CompanyReportCollector.get_company_collector(ciks=[789019, 1652044, 1018724]).collect()  # Microsoft, Alphabet, Amazon
    filtered_bag = bag[ReportPeriodRawFilter()][MainCoregRawFilter()][OfficialTagsOnlyRawFilter()][USDOnlyRawFilter()]
    joined_bag = filtered_bag.join()
    
    standardizer = CashFlowStandardizer()
    
    standardized_cf_df = joined_bag.present(standardizer)
    standardized_cf_df = standardized_cf_df[(standardized_cf_df.fp=="FY") & (standardized_cf_df.qtrs==4)].copy()
    
    import matplotlib.pyplot as plt
    # Group by 'name' and plot NetCashProvidedByUsedInOperatingActivities for each group
    # Note: using the `present` method ensures that the same cik always has the same name, even if the company name changed in the past
    for name, group in standardized_cf_df.groupby('name'):
        plt.plot(group['date'], group['NetCashProvidedByUsedInOperatingActivities'], label=name, linestyle='-')
    
    # Add labels and title
    plt.xlabel('Date')
    plt.ylabel('NetCashProvidedByUsedInOperatingActivities')
    plt.title('NetCashProvidedByUsedInOperatingActivities Over Time for Different Companies (CIKs)')
    
    # Display legend
    plt.legend()
    ````
  ![NetCashOperating Compare](https://github.com/HansjoergW/sec-fincancial-statement-data-set/raw/main/docs/images/netcashoperating_compare.png)
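
Once the positions are standardized, simple plausibility checks become possible. The sketch below uses the position names from the list above on a toy dataframe with invented values. It checks the usual cash flow identity: the three activity totals plus the exchange rate effect should add up to the change in cash for the period. The 1% tolerance is an arbitrary assumption, and this is not a description of the validation the standardizer itself performs.

```python
import pandas as pd

# Toy rows using the standardized position names; names and values are invented.
toy_cf_df = pd.DataFrame(
    {
        "name": ["ALPHA CORP", "BETA INC"],
        "NetCashProvidedByUsedInOperatingActivities": [500.0, 300.0],
        "NetCashProvidedByUsedInInvestingActivities": [-200.0, -150.0],
        "NetCashProvidedByUsedInFinancingActivities": [-250.0, -100.0],
        "EffectOfExchangeRateFinal": [-10.0, 0.0],
        "CashPeriodIncreaseDecreaseIncludingExRateEffectFinal": [40.0, 80.0],
    }
)

# Sum of the three activity totals plus the exchange rate effect.
expected_change = (
    toy_cf_df["NetCashProvidedByUsedInOperatingActivities"]
    + toy_cf_df["NetCashProvidedByUsedInInvestingActivities"]
    + toy_cf_df["NetCashProvidedByUsedInFinancingActivities"]
    + toy_cf_df["EffectOfExchangeRateFinal"]
)

# Flag rows where the reported change deviates from the sum by more than 1%.
reported = toy_cf_df["CashPeriodIncreaseDecreaseIncludingExRateEffectFinal"]
toy_cf_df["deviation"] = (reported - expected_change).abs()
toy_cf_df["consistent"] = toy_cf_df["deviation"] <= 0.01 * reported.abs()
print(toy_cf_df[["name", "deviation", "consistent"]])
```

Rows flagged as inconsistent are good candidates for a closer look at the original filing before they are used in a comparison.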


## Automate processing
The framework provides two hook methods that are called whenever the default update process is executed.
This way, you can implement additional processing steps that run after a new data file from sec.gov has been
downloaded, transformed to parquet, and indexed.

Have a look at the [08_00_automation_basics](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/08_00_automation_basics.ipynb) notebook for details.
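
The general idea of such hooks can be illustrated with a small, library-independent sketch. Everything in it (`run_update`, `post_update_hooks`, `standardize_new_data`, the file name) is hypothetical and only mimics the pattern of "run the default steps, then call the registered hooks". It also assumes that hooks are called on every update run. How the real hooks are defined and registered is shown in the notebook linked above.

```python
from typing import Callable, List

# Hypothetical hook type: a callable that receives the names of the processed files.
Hook = Callable[[List[str]], None]


def run_update(new_files: List[str], post_update_hooks: List[Hook]) -> List[str]:
    """Mimic an update run: default steps first, then every registered hook."""
    log: List[str] = []
    for file_name in new_files:
        # Stand-ins for the default steps named in the text.
        for step in ("download", "transform to parquet", "index"):
            log.append(f"{step}: {file_name}")
    # Hooks run only after all default steps have finished.
    for hook in post_update_hooks:
        hook(new_files)
    return log


processed: List[str] = []


def standardize_new_data(new_files: List[str]) -> None:
    """Example hook: remember which files would need additional processing."""
    processed.extend(new_files)


update_log = run_update(["2024q4.zip"], [standardize_new_data])
print(update_log)
print(processed)
```

Keeping the additional steps in separate callables means the default update logic stays untouched, and each step can be tested on its own.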


## Daily Updates (Experimental)
Introduced with version 2.4.0, secfsdstools now also provides daily updates for reports filed with the SEC.

You have to activate it by adding `dailyprocessing = True` in the `DEFAULT` section of the configuration file.
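
As a sketch of what the resulting file could look like, the snippet below embeds an example configuration and reads the flag back with Python's standard `configparser`. The directory values and the email address are the placeholders from the configuration section of this README. Whether the library parses the file in exactly this way is an assumption; the snippet is only meant as a quick check of your own config file.

```python
import configparser

# Example configuration text; the directory values are placeholders.
EXAMPLE_CFG = """
[DEFAULT]
downloaddirectory = c:/users/me/secfsdstools/data/dld
parquetdirectory = c:/users/me/secfsdstools/data/parquet
dbdirectory = c:/users/me/secfsdstools/data/db
useragentemail = your.email@goeshere.com
dailyprocessing = True
"""

parser = configparser.ConfigParser()
parser.read_string(EXAMPLE_CFG)

# getboolean accepts true/false, yes/no, on/off and 1/0 (case-insensitive).
daily_enabled = parser["DEFAULT"].getboolean("dailyprocessing", fallback=False)
print(daily_enabled)
```

To check a real file instead of the embedded text, `parser.read(path_to_your_cfg)` can be used in place of `read_string`.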

Note that there are some limitations (see [10_00_daily_financial_report_updates](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/10_00_daily_financial_report_updates.ipynb) for details).




# Links 
* [Detailed description of the content and the structure of the dataset](https://www.sec.gov/files/aqfs.pdf)
* [Release Notes](https://hansjoergw.github.io/sec-fincancial-statement-data-set/releasenotes/)
* [Documentation](https://hansjoergw.github.io/sec-fincancial-statement-data-set/)
* [QuickStart Jupyter Notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/01_quickstart.ipynb)
* [Explore the data with an interactive Notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/03_explore_with_interactive_notebook.ipynb)
* [collector_deep_dive Notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/04_collector_deep_dive.ipynb)
* [filter_deep_dive Notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/05_filter_deep_dive.ipynb)
* [bulk_data_processing_deep_dive Notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/06_bulk_data_processing_deep_dive.ipynb)
* [bulk_data_processing_memory_efficiency](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/06_01_bulk_data_memory_efficiency.ipynb)
* [standardizer_basics](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_00_standardizer_basics.ipynb)
* [standardize the balance sheets and make them comparable](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_01_BS_standardizer.ipynb)
* [standardize the income statements and make them comparable](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_02_IS_standardizer.ipynb)
* [standardize the cash flow statements and make them comparable](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_03_CF_standardizer.ipynb)
* [automate additional processing steps that are executed after new data is discovered](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/08_00_automation_basics.ipynb)
* [check out the `u_usecases` package](https://hansjoergw.github.io/sec-fincancial-statement-data-set/doc_latest/api/secfsdstools/u_usecases/index.html)
* [Troubleshooting and known issues](KNOWN_ISSUES.md)

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "secfsdstools",
    "maintainer": "Hansjoerg Wingeier",
    "docs_url": null,
    "requires_python": ">=3.10",
    "maintainer_email": "hansjoeg.wingeier@gmail.com",
    "keywords": "SEC.GOV, SEC EDGAR, SEC Filing, EDGAR, Finance, CIK, 10-Q, 10-K, 8-K, Financial Statements, Financial Statements Dataset, Financial Analysis, Data Processing, Financial Data, SEC API, XBRL",
    "author": "Hansjoerg",
    "author_email": "hansjoerg.wingeier@gmail.com",
    "download_url": "https://files.pythonhosted.org/packages/91/96/19f59cab0debb2299fad2b734ae7c4a4cd5d43558cfd3507c8d94561bbb2/secfsdstools-2.4.1.tar.gz",
    "platform": null,
    "description": "# sec-fincancial-statement-data-set tools (SFSDSTools 2)\n\nHelper tools to analyze the [Financial Statement Data Sets](https://www.sec.gov/dera/data/financial-statement-data-sets) from the U.S. securities and exchange commission (sec.gov).\nThe SEC releases quarterly zip files, each containing four CSV files with numerical data from all financial reports filed within that quarter. However, accessing data from the past 12 years can be time-consuming due to the large amount of data - over 120 million data points in over 2GB of zip files by 2023.\n\nThis library simplifies the process of working with this data and provides a\nconvenient way to extract information from the primary financial statements - the balance sheet (BS), income statement (IS), and statement of cash flows (CF).\n\nCheck out my article at Medium [Understanding the the SEC Financial Statement Data Sets](https://medium.com/@hansjoerg.wingeier/understanding-the-sec-financial-statement-data-sets-6148e07d1715) to get\nan introduction to the [Financial Statement Data Sets](https://www.sec.gov/dera/data/financial-statement-data-sets).\n\nThe main features include:\n- all data is on your local hard drive and can be updated automatically, no need for numerous API calls\n- data is loaded as pandas files\n- fast and efficient reading of a single report, all reports of one or multiple companies, or even all available reports \n- filter framework with predefined filters, easy to extend, supports easy way of saving, loading, and combining filtered data (see [01_quickstart.ipynb](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/01_quickstart.ipynb) and\n[03_explore_with_interactive_notebook.ipynb](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/03_explore_with_interactive_notebook.ipynb))\n- standardize the data for balance sheets, income statements, and cash flow statements to make reports easily 
comparable\n(see [07_00_standardizer_basics.ipynb](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_00_standardizer_basics.ipynb), \n[07_01_BS_standardizer.ipynb](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_01_BS_standardizer.ipynb), \n[07_01_BS_standardizer.ipynb](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_01_BS_standardizer.ipynb), and\n[07_03_CF_standardizer.ipynb](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_03_CF_standardizer.ipynb))\n- automate processing and standardizing by configuring customized process steps that are executed whenever a new \n  data file is detected on sec.gov (see [08_00_automation_basics.ipynb](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/08_00_automation_basics.ipynb))\n- version 2 supports the new \"segments\" column that was added in December 2024\n- **experimental - instroduced in version 2.4.0: support for daily updates of the financial reports (see [10_00_daily_financial_report_updates.ipynb](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/10_00_daily_financial_report_updates.ipynb))**\n\nHave a look at the [Release Notes](https://hansjoergw.github.io/sec-fincancial-statement-data-set/releasenotes/)\n<br/>\n<br/>\n<br/>\n\n<span style=\"color: #FF8C00;\">==========================================================</span>\n### If you find this tool useful, a sponsorship would be greatly appreciated! 
###\n\n**https://github.com/sponsors/HansjoergW**\n\n### How to get in touch ###\n* Found a bug: https://github.com/HansjoergW/sec-fincancial-statement-data-set/issues\n* Have a remark: https://github.com/HansjoergW/sec-fincancial-statement-data-set/discussions/categories/general\n* Have an idea: https://github.com/HansjoergW/sec-fincancial-statement-data-set/discussions/categories/ideas\n* Have a question: https://github.com/HansjoergW/sec-fincancial-statement-data-set/discussions/categories/q-a\n* Have something to show: https://github.com/HansjoergW/sec-fincancial-statement-data-set/discussions/categories/show-and-tell\n\n<span style=\"color: #FF8C00;\">==========================================================</span>\n\n\n# Principles\n\nThe goal is to be able to do bulk processing of the data without the need to do countless API calls to sec.gov. \nTherefore, the quarterly zip files are downloaded and indexed using a SQLite database table.\nThe index table contains information on all filed reports since about 2010 - over 500,000 in total. The first\ndownload will take a couple of minutes but after that, all the data is on your local harddisk.\n\nUsing the index in the sqlite db allows for direct extraction of data for a specific report from the\nappropriate zip file, reducing the need to open and search through each zip file.\nMoreover, the downloaded zip files are converted to the parquet format which provides faster read access\nto the data compared to reading the csv files inside the zip files.\n\nThe library is designed to have a low memory footprint.\n\n\n# Installation and basic usage\n\nThe library has been tested for python version 3.8, 3.9, 3.10 and 3.11.\nThe project is published on [pypi.org](https://pypi.org/project/secfsdstools/). 
Simply use the following command to install the latest version:\n\n```\npip install secfsdstools\n```\n\nIf you want to contribute, just clone the project and use a python 3.8 environment.\nThe dependencies are defined in the requirements.txt file or use the pyproject.toml to install them.\n\nTo have a first glance at the library, check out the interactive jupyter notebooks [01_quickstart.ipynb](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/01_quickstart.ipynb) \nand [03_explore_with_interactive_notebook.ipynb](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/03_explore_with_interactive_notebook.ipynb) that are located in `notebooks` directory in the github repo.\n\nUpon using the library for the first time, it downloads the data files and creates the index by calling the `update()`\nmethod. You can manually trigger the update using the following code:\n\n```\nfrom secfsdstools.update import update\n\nif __name__ == '__main__':\n    update()\n```\n\nThe following tasks will be executed:\n1. All currently available zip-files are downloaded form sec.gov (these are over 50 files that will need over 2 GB of space on your local drive)\n2. All the zipfiles are transformed and stored as parquet files. Per default, the zipfile is deleted afterward. If you want to keep the zip files, set the parameter 'KeepZipFiles' in the config file to True.\n3. An index inside a sqlite db file is created\n\n\nMoreover, at most once a day, it is checked if there is a new zip file available on sec.gov. If there is, a download will be started automatically. \nIf you don't want 'auto-update', set the 'AutoUpdate' in your config file to False.\n\n\n\n## Configuration (optional)\n\nIf you don't provide a config file, a config file with name `secfsdstools.cfg` will be created the first time you use the api and placed inside your home directory. 
\nThe file only requires the following entries:\n\n```\n[DEFAULT]\ndownloaddirectory = c:/users/me/secfsdstools/data/dld\nparquetdirectory = c:/users/me/secfsdstools/data/parquet\ndbdirectory = c:/users/me/secfsdstools/data/db\nuseragentemail = your.email@goeshere.com\n```\n\nThe `downloaddirectory` is the place where quarterly zip files from the sec.gov are downloaded to.\nThe `parquetdirectory` is the folder where the data is stored in parquet format.\nThe `dbdirectory` is the directory in which the sqllite db is created.\nThe `useragentemail` is used in the requests made to the sec.gov website. Since we only make limited calls to the sec.gov,\nyou can leave the example \"your.email@goeshere.com\". \n\n## A first simple example\nGoal: present the information in the balance sheet of Apple's 2022 10-K report in the same way as it appears in the\noriginal report on page 31 (\"CONSOLIDATED BALANCE SHEETS\"): https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm\n\n**Note:** Version 2 of the framework supports now the `segments` that was introduced in January 2025. By adjusting the \nparameter `show_segments` you can define whether the segments information are shown or not\n\n````\nfrom secfsdstools.e_collector.reportcollecting import SingleReportCollector\nfrom secfsdstools.e_filter.rawfiltering import ReportPeriodAndPreviousPeriodRawFilter\nfrom secfsdstools.e_presenter.presenting import StandardStatementPresenter\n\nif __name__ == '__main__':\n    # the unique identifier for apple's 10-K report of 2022\n    apple_10k_2022_adsh = \"0000320193-22-000108\"\n  \n    # us a Collector to grab the data of the 10-K report. 
an filter for balancesheet information\n    collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(\n          adsh=apple_10k_2022_adsh,\n          stmt_filter=[\"BS\"]\n    )  \n    rawdatabag = collector.collect() # load the data from the disk\n    \n   \n    bs_df = (rawdatabag\n                       # ensure only data from the period (2022) and the previous period (2021) is in the data\n                       .filter(ReportPeriodAndPreviousPeriodRawFilter())\n                       # join the the content of the pre_txt and num_txt together\n                       .join()  \n                       # format the data in the same way as it appears in the report\n                       .present(StandardStatementPresenter(show_segments=False))) \n    print(bs_df) \n````\n\n\n## Viewing metadata\n\nThe recommend way to view and use the metadata is using `secfsdstools` library functions as described in [notebooks/01_quickstart.ipynb](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/01_quickstart.ipynb)  \n\nOf course, the created \"index of reports\" can be viewed also using a database viewer that supports the SQLite format,\nsuch as [DB Browser for SQLite](https://sqlitebrowser.org/).\n\n(The location of the SQLite database file is specified in the `dbdirectory` field of the config file, which is set to\n`<home>/secfsdstools/data/db` in the default configuration. 
The name of the database file is `secfsdstools.db`.)\n\nThere are only two relevant tables in the database: `index_parquet_reports` and `index_parquet_processing_state`.\n\nThe `index_parquet_reports` table provides an overview of all available reports in the downloaded\ndata and includes the following relevant columns:\n\n* **adsh** : The unique id of the report (a string).\n* **cik** : The unique id of the company (an int).\n* **name** : The name of the company in uppercases.\n* **form** : The type of the report (e.g.: annual: 10-K, quarterly: 10-Q).\n* **filed** : The date when the report has been filed in the format YYYYMMDD (stored as a integer number).\n* **period** : The date for which the report was create. this is the date on the balancesheet.(stored as a integer number) \n* **fullPath** : The path to the downloaded zip files that contains the details of that report.\n* **url** : The url which takes you directly to the filing of this report on the sec.gov website.\n\nFor instance, if you want to have an overview of all reports that Apple has filed since 2010,\njust search for \"%APPLE INC%\" in the name column.\n\nSearching for \"%APPLE INC%\" will also reveal its cik: 320193\n\nIf you accidentally delete data in the database file, don't worry. Just delete the database file\nand run `update()` again (see previous chapter).\n\n\n## Overview\nThe following diagram gives an overview on SECFSDSTools library.\n\n![Overview](https://github.com/HansjoergW/sec-fincancial-statement-data-set/raw/main/docs/images/overview.png)\n\nIt mainly exists out of two main processes. The first one ist the \"Date Update Process\" which is responsible for the\ndownload of the Financial Statement Data Sets zip files from the sec.gov website, transforming the content into parquet\nformat, and indexing the content of these files in a simple SQLite database. 
Again, this whole process can be started\n\"manually\" by calling the update method, or it is done automatically, as it described above.\n\nThe second main process is the \"Data Processing Process\", which is working with the data that is stored inside the\nsub.txt, pre.txt, and num.txt files from the zip files. The \"Data Processing Process\" mainly exists out of four steps:\n\n* **Collect** <br/> Collect the rawdata from one or more different zip files. For instance, get all the data for a single\nreport, or get the data for all 10-K reports of a single or multiple companies from several zip files.\n* **Raw Processing** <br/> Once the data is collected, the collected data for sub.txt, pre.txt, and num.txt is available\nas a pandas dataframe. Filters can be applied, the content can directly be saved and loaded.\n* **Joined Processing** <br/> From the \"Raw Data\", a \"joined\" representation can be created. This joins the data from\nthe pre.txt and num.txt content together based on the \"adhs\", \"tag\", and \"version\" attributes. \"Joined data\" can also be\nfiltered, concatenated, directly saved and loaded.\n* **Present** <br/> Produce a single pandas dataframe out of the data and use it for further processing or use the standardizers\n to create comparable data for the balance sheet, the income statement, and the cash flow statement.\n\nThe diagramm also shows the main classes with which a user interacts. The use of them  is described in the following chapters.\n\n\n## Feature Overview\n\nThis section shows some example code of the different features. 
Have a look at the [notebooks/01_quickstart.ipynb](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/01_quickstart.ipynb)\nnotebook and all other notebooks to get more details on how to use the framework.\n\n### Working with the Index\n\n* Access the index in the slite database to find the CIK (central index key) for a company:\n  ```\n  from secfsdstools.c_index.searching import IndexSearch\n  \n  index_search = IndexSearch.get_index_search()\n  results = index_search.find_company_by_name(\"apple\")\n  print(results)\n  ```\n\n* Get the information on the latest filing of a company:\n  ````\n  from secfsdstools.c_index.companyindexreading import CompanyIndexReader\n  \n  apple_cik = 320193\n  apple_index_reader = CompanyIndexReader.get_company_index_reader(cik=apple_cik)\n  print(apple_index_reader.get_latest_company_filing())\n  ````\n\n* Show all annual reports of company by using its CIK number:\n  ````\n  from secfsdstools.c_index.companyindexreading import CompanyIndexReader\n  \n  apple_cik = 320193\n  apple_index_reader = CompanyIndexReader.get_company_index_reader(cik=apple_cik)\n  \n  # only show the annual reports of apple\n  print(apple_index_reader.get_all_company_reports_df(forms=[\"10-K\"]))\n  ````\n\n### Loading Data\nThe previously introduced `IndexSearch` and `CompanyIndexReader` let you know what data is available, but they do not\nreturn the real data of the financial statements. This is what the `Collector` classes are used for.\n\nAll the `Collector` classes have their own factory method(s) which instantiates the class. 
\n\nMost of these factory methods\nalso provide parameters to filter the data directly when being loaded from the parquet files.\nThese are the `forms_filter` (which type of reports you want to read, for instance \"10-K\"), the `stmt_filter`\n(which statements you want to read, for instance the balance sheet), and the `tag_filter` (which defines the tags\nyou want to read, for instance \"Assets\"). Of course, such filters could also be applied afterward, but it is slightly\nmore efficient to apply them directly when loading.\n\nAll `Collector` classes have a `collect` method which then loads the data from the parquet files and returns an instance\nof `RawDataBag`. The `RawDataBag` instance contains then a pandas dataframe for the `sub` (subscription) data,\n`pre` (presentation) data, and `num` (the numeric values) data.\n\n* Load a single report using the `SingleReportCollector`:\n    ````\n    from secfsdstools.e_collector.reportcollecting import SingleReportCollector\n\n    apple_10k_2022_adsh = \"0000320193-22-000108\"\n\n    collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(adsh=apple_10k_2022_adsh)\n    rawdatabag = collector.collect()\n\n    # as expected, there is just one entry in the submission dataframe\n    print(rawdatabag.sub_df)\n    # just print the size of the pre and num dataframes\n    print(rawdatabag.pre_df.shape)\n    print(rawdatabag.num_df.shape)\n    ````\n* Load multiple reports with the `MultiReportCollector`:\n    ````\n    from secfsdstools.e_collector.multireportcollecting import MultiReportCollector\n    apple_10k_2022_adsh = \"0000320193-22-000108\"\n    apple_10k_2012_adsh = \"0001193125-12-444068\"\n\n    if __name__ == '__main__':\n        # load only the assets tags that are present in the 10-K report of apple in the years\n        # 2022 and 2012\n        collector: MultiReportCollector = \\\n            MultiReportCollector.get_reports_by_adshs(adshs=[apple_10k_2022_adsh,\n                               
                              apple_10k_2012_adsh],\n                                                      tag_filter=['Assets'])\n        rawdatabag = collector.collect()\n        # as expected, there are just two entries in the submission dataframe\n        print(rawdatabag.sub_df)\n        print(rawdatabag.num_df)  \n    ```` \n\n* Load all data for one or multiple quarters using the `ZipCollector`:\n    ````\n    from secfsdstools.e_collector.zipcollecting import ZipCollector\n\n    # only collect the Balance Sheet of annual reports that\n    # were filed during the first quarter in 2022\n    if __name__ == '__main__':\n        collector: ZipCollector = ZipCollector.get_zip_by_name(name=\"2022q1.zip\",\n                                                               forms_filter=[\"10-K\"],\n                                                               stmt_filter=[\"BS\"])\n    \n        rawdatabag = collector.collect()\n    \n        # only show the size of the data frame\n        # .. 
over 4000 companies filed a 10 K report in q1 2022\n        print(rawdatabag.sub_df.shape)\n        print(rawdatabag.pre_df.shape)\n        print(rawdatabag.num_df.shape)    \n    ```` \n\n* Load all data for a single company or multiple companies\n    ````\n    from secfsdstools.e_collector.companycollecting import CompanyReportCollector\n    \n    if __name__ == '__main__':\n        apple_cik = 320193\n        collector = CompanyReportCollector.get_company_collector(ciks=[apple_cik],\n                                                                 forms_filter=[\"10-K\"])\n    \n        rawdatabag = collector.collect()\n    \n        # all filed 10-K reports for apple since 2010 are in the databag\n        print(rawdatabag.sub_df)\n    \n        print(rawdatabag.pre_df.shape)\n        print(rawdatabag.num_df.shape)    \n    ```` \n\nHave a look at the [collector_deep_dive notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/04_collector_deep_dive.ipynb).\n\n### Working with raw data\nWhen the `collect` method of a `Collector` class is called, the data for the sub, pre, and num dataframes are loaded\nand being stored in the sub_df, pre_df, and num_df attributes inside an instance of `RawDataBag`.\n\n* `save` and `load`\n    ````\n    from secfsdstools.e_collector.reportcollecting import SingleReportCollector\n\n    # read data\n    apple_10k_2022_adsh = \"0000320193-22-000108\"\n    collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(adsh=apple_10k_2022_adsh)\n    rawdatabag = collector.collect()\n\n    # save it\n    rawdatabag.save(\"<path>\")\n    \n    # load it back\n    bag = RawDataBag.load(\"<path\")\n  \n    # load it back with Predicate Pushdown (filter while reading)\n    bag = RawDataBag.load(\"<path\", stmt_filter=[\"BS\"])\n    ```` \n\n*  `concat` multiple instances of `RawDataBag`\n    ````\n    concat_bag = RawDataBag.concat(list_of_rawdatabags)    \n    ````\n\n*  
`concat_filebased` concat multiple RawDataBag folders into a new folder in very memory efficient way\n    ````\n    RawDataBag.concat_filebased(list_of_rawdatabag_folders, target_folder)    \n    ````\n\n* `join` produces a `JoinedRawDataBag` by joining the content of the pre_df and num_df\n   based on the columns adsh, tag, and version. It is an inner join. The joined dataframe appears as pre_num_df in\n   the `JoinedRawDataBag`.\n    ````\n    from secfsdstools.e_collector.reportcollecting import SingleReportCollector\n\n    # read data\n    apple_10k_2022_adsh = \"0000320193-22-000108\"\n    collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(adsh=apple_10k_2022_adsh)\n    rawdatabag = collector.collect()\n\n    joineddatabag = rawdatabag.join()\n  \n    print(joineddatabag.pre_num_df)\n    ```\n\n*  Use filters to `filter` the data. There are many predefined filters, but it is also easy to write your own.\n   ````\n   \n   # Note, instead of using a_RawDataBag.filter(<myFilter>) you could also use a_RawDataBag[<myFilter>]\n   \n   # Filters the `RawDataBag` instance based on the list of adshs that were provided in the constructor. \n   a_filtered_RawDataBag = a_RawDataBag.filter(AdshRawFilter(adshs=['0001193125-09-214859', '0001193125-10-238044']))\n\n   # Filters the `RawDataBag`instance based on the list of statements ('BS', 'CF', 'IS', ...). <br>\n   a_filtered_RawDataBag = a_RawDataBag.filter(StmtRawFilter(stmts=['BS', 'CF']))\n\n   # Filters the `RawDataBag`instance based on the list of tags that is provided. <br>\n   a_filtered_RawDataBag = a_RawDataBag.filter(TagRawFilter(tags=['Assets', 'Liabilities']))\n\n   # Filters the `RawDataBag` so that data of subsidiaries are removed.\n   a_filtered_RawDataBag = a_RawDataBag.filter(MainCoregRawFilter()) \n\n   # The data of a report usually also contains data from previous years.\n   # However, often you want just to analyze the data of the current and the previous year. 
This filter ensures that\n   # only data for the current period and the previous period are contained in the data.\n   a_filtered_RawDataBag = a_RawDataBag.filter(ReportPeriodAndPreviousPeriodRawFilter()) \n\n   # If you are just interested in the data of a report that is from the current period\n   # of the report then you can use this filter.\n   a_filtered_RawDataBag = a_RawDataBag.filter(ReportPeriodRawFilter()) \n\n   # Sometimes company provide their own tags, which are not defined by the us-gaap XBRL\n   # definition. In such cases, the version columns contains the value of the adsh instead of something like us-gab/2022.\n   # This filter removes unofficial tags.\n   a_filtered_RawDataBag = a_RawDataBag.filter(OfficialTagsOnlyRawFilter()) \n\n   # Reports often also contain datapoints or also the same datapint in other currencies than USD.\n   # This filters ensures that only USD  datapoints are kept  \n   a_filtered_RawDataBag = a_RawDataBag.filter(USDOnlyRawFilter()) \n\n   # If you dont care about Segments information, you can use this filter.\n   a_filtered_RawDataBag = a_RawDataBag.filter(NoSegmentInfoRawFilter()) \n   \n   ````  \n\n   Have a look at the [filter_deep_dive notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/05_filter_deep_dive.ipynb).\n\n### Working with joined data\nWhen the `join` method of a `RawDataBag` instance is called an instance of `JoinedDataBag` is returned.\n\nThe `JoinedDataBag` provides `save`, `load`, `concat`, and `concat_filebased` in the same manner as the\n`RawDataBag`does. \nMore over, also `filter` is possible and the same filters are available. They just go by the name\n`...JoinedFilter` instead of `...RawFilter`.\n\n`present` The idea of the present method is to make a final presentation of the data as pandas dataframe. \nThe method has a parameter presenter of type Presenter.\nIt is simple to write your own presenter classes. 
So far, the framework provides the following Presenter \nimplementations (module `secfsdstools.e_presenter.presenting`):\n\n* `StandardStatementPresenter` <br> This presenter provides the data in the same form as you see it in\n  the reports themselves.\n  ````\n  from secfsdstools.e_collector.reportcollecting import SingleReportCollector\n  from secfsdstools.e_filter.rawfiltering import ReportPeriodAndPreviousPeriodRawFilter\n  from secfsdstools.e_presenter.presenting import StandardStatementPresenter\n\n  apple_10k_2022_adsh = \"0000320193-22-000108\"\n\n  collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(\n        adsh=apple_10k_2022_adsh,\n        stmt_filter=[\"BS\"]\n  )\n  rawdatabag = collector.collect()\n  # the surrounding parentheses allow the chained calls to span several lines\n  bs_df = (rawdatabag.filter(ReportPeriodAndPreviousPeriodRawFilter())\n                     .join()\n                     .present(StandardStatementPresenter()))\n  print(bs_df) \n  ```` \n\n### Standardize financial data\n  Even though XBRL is a standard for tagging positions and numbers in financial statements, that doesn't mean that financial\n  statements can then be compared easily. For instance, there are over 3000 tags which can be used in a balance sheet.\n  Moreover, some tags can mean similar things or can be grouped behind a \"parent\" tag, which itself might not be present.\n  For instance, \"AssetsNoncurrent\" is often not shown in statements. So you would find the position for \"Assets\"\n  and \"AssetsCurrent\", but not for \"AssetsNoncurrent\". Instead, only child tags for \"AssetsNoncurrent\" might be\n  present.<br><br>\n  The standardizer helps to solve these problems by unifying the information of financial statements.<br> <br>\n  With the standardized financial statements, you can then actually compare the statements between different\n  companies or different years, and you can use the dataset for ML. 
<br><br>\n  For details, have a look at the following notebooks:\n  * [standardizer_basics](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_00_standardizer_basics.ipynb)\n  * [standardize the balance sheets and make them comparable](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_01_BS_standardizer.ipynb)\n  * [standardize the income statements and make them comparable](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_02_IS_standardizer.ipynb)\n  * [standardize the cash flow statements and make them comparable](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_03_CF_standardizer.ipynb)\n\n\n  * `BalanceSheetStandardizer` <br>\n  The `BalanceSheetStandardizer` collects and/or calculates the following positions of balance sheets:  \n\n    ````\n    - Assets\n      - AssetsCurrent\n        - Cash\n      - AssetsNoncurrent\n    - Liabilities\n      - LiabilitiesCurrent\n      - LiabilitiesNoncurrent\n    - Equity\n      - HolderEquity (mainly StockholderEquity or PartnerCapital)\n        - RetainedEarnings\n        - AdditionalPaidInCapital\n        - TreasuryStockValue\n      - TemporaryEquity\n      - RedeemableEquity\n    - LiabilitiesAndEquity\n    ````\n\n    With just a few lines of code, you'll get a comparable dataset with the main positions of a balance sheet for Microsoft, Alphabet, and Amazon:\n    (see the [standardize the balance sheets and make them comparable notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_01_BS_standardizer.ipynb) for details)\n     ````python\n     from secfsdstools.e_collector.companycollecting import CompanyReportCollector\n     from secfsdstools.e_filter.rawfiltering import ReportPeriodRawFilter, MainCoregRawFilter, OfficialTagsOnlyRawFilter, USDOnlyRawFilter\n     from 
secfsdstools.f_standardize.bs_standardize import BalanceSheetStandardizer\n   \n     bag = CompanyReportCollector.get_company_collector(ciks=[789019, 1652044, 1018724]).collect()  # Microsoft, Alphabet, Amazon\n     filtered_bag = bag[ReportPeriodRawFilter()][MainCoregRawFilter()][OfficialTagsOnlyRawFilter()][USDOnlyRawFilter()]\n     joined_bag = filtered_bag.join()\n   \n     standardizer = BalanceSheetStandardizer()\n   \n     standardized_bs_df = joined_bag.present(standardizer)\n   \n     import matplotlib.pyplot as plt\n     # Group by 'name' and plot equity for each group\n     # Note: using the `present` method ensures that the same cik always has the same name, even if the company name changed in the past\n     for name, group in standardized_bs_df.groupby('name'):\n       plt.plot(group['date'], group['Equity'], label=name, linestyle='-')\n   \n     # Add labels and title\n     plt.xlabel('Date')\n     plt.ylabel('Equity')\n     plt.title('Equity Over Time for Different Companies (CIKs)')\n   \n     # Display legend\n     plt.legend()\n     ````\n     ![Equity Compare](https://github.com/HansjoergW/sec-fincancial-statement-data-set/raw/main/docs/images/equity_compare.png)\n\n  * `IncomeStatementStandardizer` <br>\n  The `IncomeStatementStandardizer` collects and/or calculates the following positions of income statements:\n    \n    ````  \n      Revenues\n      - CostOfRevenue\n      ---------------\n      = GrossProfit\n      - OperatingExpenses\n      -------------------\n      = OperatingIncomeLoss\n        \n      IncomeLossFromContinuingOperationsBeforeIncomeTaxExpenseBenefit\n      - AllIncomeTaxExpenseBenefit\n      ----------------------------\n      = IncomeLossFromContinuingOperations\n      + IncomeLossFromDiscontinuedOperationsNetOfTax\n      -----------------------------------------------\n      = ProfitLoss\n      - NetIncomeLossAttributableToNoncontrollingInterest\n      ---------------------------------------------------\n      = 
NetIncomeLoss\n    \n      OutstandingShares\n      EarningsPerShare\n    ````\n  \n    With just a few lines of code, you'll get a comparable dataset with the main positions of an income statement for Microsoft, Alphabet, and Amazon:\n  (see the [standardize the income statements and make them comparable notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_02_IS_standardizer.ipynb) for details)\n   \n    ````python\n    from secfsdstools.e_collector.companycollecting import CompanyReportCollector\n    from secfsdstools.e_filter.rawfiltering import ReportPeriodRawFilter, MainCoregRawFilter, OfficialTagsOnlyRawFilter, USDOnlyRawFilter\n    from secfsdstools.f_standardize.is_standardize import IncomeStatementStandardizer\n      \n    bag = CompanyReportCollector.get_company_collector(ciks=[789019, 1652044, 1018724]).collect()  # Microsoft, Alphabet, Amazon\n    filtered_bag = bag[ReportPeriodRawFilter()][MainCoregRawFilter()][OfficialTagsOnlyRawFilter()][USDOnlyRawFilter()]\n    joined_bag = filtered_bag.join()\n      \n    standardizer = IncomeStatementStandardizer()\n      \n    standardized_is_df = joined_bag.present(standardizer)\n    # just use the yearly reports with data for the whole year\n    standardized_is_df = standardized_is_df[(standardized_is_df.fp==\"FY\") & (standardized_is_df.qtrs==4)].copy()\n      \n    import matplotlib.pyplot as plt\n    # Group by 'name' and plot GrossProfit for each group\n    # Note: using the `present` method ensures that the same cik always has the same name, even if the company name changed in the past\n    for name, group in standardized_is_df.groupby('name'):\n      plt.plot(group['date'], group['GrossProfit'], label=name, linestyle='-')\n      \n    # Add labels and title\n    plt.xlabel('Date')\n    plt.ylabel('GrossProfit')\n    plt.title('GrossProfit Over Time for Different Companies (CIKs)')\n      \n    # Display legend\n    plt.legend()\n     ````\n\n  ![GrossProfit 
Compare](https://github.com/HansjoergW/sec-fincancial-statement-data-set/raw/main/docs/images/grossprofit_compare.png)\n\n  * `CashFlowStandardizer` <br>\n   The `CashFlowStandardizer` collects and/or calculates the following positions of cash flow statements:\n     \n    ````\n     NetCashProvidedByUsedInOperatingActivities\n       CashProvidedByUsedInOperatingActivitiesDiscontinuedOperations\n       NetCashProvidedByUsedInOperatingActivitiesContinuingOperations\n           DepreciationDepletionAndAmortization\n           DeferredIncomeTaxExpenseBenefit\n           ShareBasedCompensation\n           IncreaseDecreaseInAccountsPayable\n           IncreaseDecreaseInAccruedLiabilities\n           InterestPaidNet\n           IncomeTaxesPaidNet\n    \n     NetCashProvidedByUsedInInvestingActivities\n         CashProvidedByUsedInInvestingActivitiesDiscontinuedOperations\n         NetCashProvidedByUsedInInvestingActivitiesContinuingOperations\n           PaymentsToAcquirePropertyPlantAndEquipment\n           ProceedsFromSaleOfPropertyPlantAndEquipment\n           PaymentsToAcquireInvestments\n           ProceedsFromSaleOfInvestments\n           PaymentsToAcquireBusinessesNetOfCashAcquired\n           ProceedsFromDivestitureOfBusinessesNetOfCashDivested\n           PaymentsToAcquireIntangibleAssets\n           ProceedsFromSaleOfIntangibleAssets\n    \n     NetCashProvidedByUsedInFinancingActivities\n         CashProvidedByUsedInFinancingActivitiesDiscontinuedOperations\n         NetCashProvidedByUsedInFinancingActivitiesContinuingOperations\n           ProceedsFromIssuanceOfCommonStock\n           ProceedsFromStockOptionsExercised\n           PaymentsForRepurchaseOfCommonStock\n           ProceedsFromIssuanceOfDebt\n           RepaymentsOfDebt\n           PaymentsOfDividends\n    \n    \n     EffectOfExchangeRateFinal\n     CashPeriodIncreaseDecreaseIncludingExRateEffectFinal\n    \n     CashAndCashEquivalentsEndOfPeriod\n    ````\n\n     With just a few lines of code, 
you'll get a comparable dataset with the main positions of a cash flow statement for Microsoft, Alphabet, and Amazon:\n(see the [standardize the cash flow statements and make them comparable notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_03_CF_standardizer.ipynb) for details)\n    ````python\n    from secfsdstools.e_collector.companycollecting import CompanyReportCollector\n    from secfsdstools.e_filter.rawfiltering import ReportPeriodRawFilter, MainCoregRawFilter, OfficialTagsOnlyRawFilter, USDOnlyRawFilter\n    from secfsdstools.f_standardize.cf_standardize import CashFlowStandardizer\n    \n    bag = CompanyReportCollector.get_company_collector(ciks=[789019, 1652044, 1018724]).collect()  # Microsoft, Alphabet, Amazon\n    filtered_bag = bag[ReportPeriodRawFilter()][MainCoregRawFilter()][OfficialTagsOnlyRawFilter()][USDOnlyRawFilter()]\n    joined_bag = filtered_bag.join()\n    \n    standardizer = CashFlowStandardizer()\n    \n    standardized_cf_df = joined_bag.present(standardizer)\n    # just use the yearly reports with data for the whole year\n    standardized_cf_df = standardized_cf_df[(standardized_cf_df.fp==\"FY\") & (standardized_cf_df.qtrs==4)].copy()\n    \n    import matplotlib.pyplot as plt\n    # Group by 'name' and plot NetCashProvidedByUsedInOperatingActivities for each group\n    # Note: using the `present` method ensures that the same cik always has the same name, even if the company name changed in the past\n    for name, group in standardized_cf_df.groupby('name'):\n        plt.plot(group['date'], group['NetCashProvidedByUsedInOperatingActivities'], label=name, linestyle='-')\n    \n    # Add labels and title\n    plt.xlabel('Date')\n    plt.ylabel('NetCashProvidedByUsedInOperatingActivities')\n    plt.title('NetCashProvidedByUsedInOperatingActivities Over Time for Different Companies (CIKs)')\n    \n    # Display legend\n    plt.legend()\n    ````\n  ![NetCashOperating 
Compare](https://github.com/HansjoergW/sec-fincancial-statement-data-set/raw/main/docs/images/netcashoperating_compare.png)\n\n\n## Automate processing\nThe framework provides two hook methods that are called whenever the default update process is executed.\nThis way, you can implement additional processing steps that are executed after a new data file from sec.gov has been \ndownloaded, transformed to parquet, and indexed.\n\nHave a look at [08_00_automation_basics](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/08_00_automation_basics.ipynb)\n\n\n## Daily Updates (Experimental)\nIntroduced with version 2.4.0, secfsdstools now also provides daily updates for reports filed at the SEC.\n\nYou have to activate it by adding `dailyprocessing = True` in the `DEFAULT` section of the configuration file.\n\nNote that there are some limitations (see [10_00_daily_financial_report_updates](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/10_00_daily_financial_report_updates.ipynb) for details).\n\n\n\n\n# Links \n* [For a detailed description of the content and the structure of the dataset](https://www.sec.gov/files/aqfs.pdf)\n* [Release Notes](https://hansjoergw.github.io/sec-fincancial-statement-data-set/releasenotes/)\n* [Documentation](https://hansjoergw.github.io/sec-fincancial-statement-data-set/)\n* [QuickStart Jupyter Notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/01_quickstart.ipynb)\n* [Explore the data with an interactive Notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/03_explore_with_interactive_notebook.ipynb)\n* [collector_deep_dive Notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/04_collector_deep_dive.ipynb)\n* [filter_deep_dive 
Notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/05_filter_deep_dive.ipynb)\n* [bulk_data_processing_deep_dive Notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/06_bulk_data_processing_deep_dive.ipynb)\n* [bulk_data_processing_memory_efficiency](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/06_01_bulk_data_memory_efficiency.ipynb)\n* [standardizer_basics](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_00_standardizer_basics.ipynb)\n* [standardize the balance sheets and make them comparable](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_01_BS_standardizer.ipynb)\n* [standardize the income statements and make them comparable](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_02_IS_standardizer.ipynb)\n* [standardize the cash flow statements and make them comparable](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_03_CF_standardizer.ipynb)\n* [automate additional processing steps that are executed after new data is discovered](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/08_00_automation_basics.ipynb)\n* [check out the `u_usecases` package](https://hansjoergw.github.io/sec-fincancial-statement-data-set/doc_latest/api/secfsdstools/u_usecases/index.html)\n* [Troubleshooting and known issues](KNOWN_ISSUES.md)\n\n",
    "bugtrack_url": null,
    "license": "Apache-2.0",
    "summary": "A few python tools to analyze the SEC.gov financial statements data sets (https://www.sec.gov/dera/data/financial-statement-data-sets)",
    "version": "2.4.1",
    "project_urls": {
        "Bug Tracker": "https://github.com/HansjoergW/sec-fincancial-statement-data-set/issues",
        "Change Log": "https://github.com/HansjoergW/sec-fincancial-statement-data-set/blob/main/CHANGELOG.md",
        "Forum": "https://github.com/HansjoergW/sec-fincancial-statement-data-set/discussions",
        "Funding": "https://github.com/sponsors/HansjoergW",
        "Github": "https://github.com/HansjoergW/sec-fincancial-statement-data-set",
        "Homepage": "https://hansjoergw.github.io/sec-fincancial-statement-data-set/"
    },
    "split_keywords": [
        "sec.gov",
        " sec edgar",
        " sec filing",
        " edgar",
        " finance",
        " cik",
        " 10-q",
        " 10-k",
        " 8-k",
        " financial statements",
        " financial statements dataset",
        " financial analysis",
        " data processing",
        " financial data",
        " sec api",
        " xbrl"
    ],
    "urls": [
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "6303eafdc58e6437f6c8442c52b39d2037bc5224dbd08732379adac043f38678",
                "md5": "56e7fbdc442bcaaa249d52a3bf89b3f3",
                "sha256": "027569c863aa37f3a4d8f03e79a64bb1e1ec026deaa5c5b6255d0e181a022e87"
            },
            "downloads": -1,
            "filename": "secfsdstools-2.4.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "56e7fbdc442bcaaa249d52a3bf89b3f3",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.10",
            "size": 142531,
            "upload_time": "2025-07-26T05:16:36",
            "upload_time_iso_8601": "2025-07-26T05:16:36.887957Z",
            "url": "https://files.pythonhosted.org/packages/63/03/eafdc58e6437f6c8442c52b39d2037bc5224dbd08732379adac043f38678/secfsdstools-2.4.1-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": null,
            "digests": {
                "blake2b_256": "919619f59cab0debb2299fad2b734ae7c4a4cd5d43558cfd3507c8d94561bbb2",
                "md5": "9bf18928001a55c67d70613b634b3073",
                "sha256": "388617ab67fc585b29a8970e8ac49fa8aa3a690cf4a21c9fd1600cb3da3edc27"
            },
            "downloads": -1,
            "filename": "secfsdstools-2.4.1.tar.gz",
            "has_sig": false,
            "md5_digest": "9bf18928001a55c67d70613b634b3073",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.10",
            "size": 115005,
            "upload_time": "2025-07-26T05:16:38",
            "upload_time_iso_8601": "2025-07-26T05:16:38.689345Z",
            "url": "https://files.pythonhosted.org/packages/91/96/19f59cab0debb2299fad2b734ae7c4a4cd5d43558cfd3507c8d94561bbb2/secfsdstools-2.4.1.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2025-07-26 05:16:38",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "HansjoergW",
    "github_project": "sec-fincancial-statement-data-set",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "lcname": "secfsdstools"
}

Hansjoerg