secfsdstools


Namesecfsdstools JSON
Version 1.7.0 PyPI version JSON
download
home_pageNone
SummaryA few python tools to analyze the SEC.gov financial statements data sets (https://www.sec.gov/dera/data/financial-statement-data-sets)
upload_time2024-12-22 06:56:36
maintainerNone
docs_urlNone
authorNone
requires_python>=3.8
licenseNone
keywords 10-k 10-q 8-k cik data processing edgar finance financial analysis financial statements financial statements dataset sec edgar sec filing sec.gov
VCS
bugtrack_url
requirements pytest pylint pandas requests pathos pyarrow fastparquet
Travis-CI No Travis.
coveralls test coverage No coveralls.
            # sec-fincancial-statement-data-set

Helper tools to analyze the [Financial Statement Data Sets](https://www.sec.gov/dera/data/financial-statement-data-sets) from the U.S. securities and exchange commission (sec.gov).
The SEC releases quarterly zip files, each containing four CSV files with numerical data from all financial reports filed within that quarter. However, accessing data from the past 12 years can be time-consuming due to the large amount of data - over 120 million data points in over 2GB of zip files by 2023.

This library simplifies the process of working with this data and provides a
convenient way to extract information from the primary financial statements - the balance sheet (BS), income statement (IS), and statement of cash flows (CF).

Check out my article at Medium [Understanding the the SEC Financial Statement Data Sets](https://medium.com/@hansjoerg.wingeier/understanding-the-sec-financial-statement-data-sets-6148e07d1715) to get
an introduction to the [Financial Statement Data Sets](https://www.sec.gov/dera/data/financial-statement-data-sets).

The main features include:
- all data is on your local hard drive and can be updated automatically, no need for numerous API calls
- data is loaded as pandas files
- fast and efficient reading of a single report, all reports of one or multiple companies, or even all available reports 
- filter framework with predefined filters, easy to extend, supports easy way of saving, loading, and combining filtered data (see [01_quickstart.ipynb](notebooks/01_quickstart.ipynb) and
[03_explore_with_interactive_notebook.ipynb](notebooks/03_explore_with_interactive_notebook.ipynb))
- standardize the data for balance sheets, income statements, and cash flow statements to make reports easily comparable
(see [07_00_standardizer_basics.ipynb](notebooks/07_00_standardizer_basics.ipynb), 
[07_01_BS_standardizer.ipynb](notebooks/07_01_BS_standardizer.ipynb), 
[07_01_BS_standardizer.ipynb](notebooks/07_01_BS_standardizer.ipynb), and
[07_03_CF_standardizer.ipynb](notebooks/07_03_CF_standardizer.ipynb))

# Principles

The goal is to be able to do bulk processing of the data without the need to do countless API calls to sec.gov. Therefore, the quarterly zip files are downloaded and indexed using a SQLite database table.
The index table contains information on all filed reports since about 2010, over 500,000 in total. The first
download will take a couple of minutes but after that, all the data is on your local harddisk.

Using the index in the sqlite db allows for direct extraction of data for a specific report from the
appropriate zip file, reducing the need to open and search through each zip file.
Moreover, the downloaded zip files are converted to the parquet format which provides faster read access
to the data compared to reading the csv files inside the zip files.

The library is designed to have a low memory footprint, only parsing and reading the data for a specific report into pandas dataframe tables.


# Installation and basic usage

The library has been tested for python version 3.7, 3.8, 3.9, and 3.10.
The project is published on [pypi.org](https://pypi.org/project/secfsdstools/). Simply use the following command to install the latest version:

```
pip install secfsdstools
```



If you want to contribute, just clone the project and use a python 3.7 environment.
The dependencies are defined in the requirements.txt file or use the pyproject.toml to install them.

It is possible to write standalone python script but I recommend to first start with interactive jupyter notebooks [01_quickstart.ipynb](notebooks/01_quickstart.ipynb) and [03_explore_with_interactive_notebook.ipynb](notebooks/03_explore_with_interactive_notebook.ipynb) that are located in `notebooks` directory.

Upon using the library for the first time, it downloads the data files and creates the index by calling the `update()` method. You can manually trigger the update using the following code:

```
from secfsdstools.update import update

if __name__ == '__main__':
    update()
```

The following tasks will be executed:
1. All currently available zip-files are downloaded form sec.gov (these are over 50 files that will need over 2 GB of space on your local drive)
2. All the zipfiles are transformed and stored as parquet files. Per default, the zipfile is deleted afterwards. If you want to keep the zip files, set the parameter 'KeepZipFiles' in the config file to True.
3. An index inside a sqlite db file is created


Moreover, at most once a day, it is checked if there is a new zip file available on sec.gov. If there is, a download will be started automatically. 
If you don't want 'auto-update', set the 'AutoUpdate' in your config file to False.



## Configuration (optional)

If you don't provide a config file, a config file with name `secfsdstools.cfg` will be created the first time you use the api and placed inside your home directory. 
The file only requires the following entries:

```
[DEFAULT]
downloaddirectory = c:/users/me/secfsdstools/data/dld
parquetdirectory = c:/users/me/secfsdstools/data/parquet
dbdirectory = c:/users/me/secfsdstools/data/db
useragentemail = your.email@goeshere.com
```



The `downloaddirectory` is the place where quarterly zip files from the sec.gov are downloaded to.
The `parquetdirectory` is the folder where the data is stored in parquet format.
The `dbdirectory` is the directory in which the sqllite db is created.
The `useragentemail` is used in the requests made to the sec.gov website. Since we only make limited calls to the sec.gov,
you can leave the example "your.email@goeshere.com". 

## Viewing metadata

The recommend way to view and use the metadata is using `secfsdstools` library functions as described in [notebooks/01_quickstart.ipynb](notebooks/01_quickstart.ipynb)  

The "index of reports" that was created in the previous step can be viewed using a database viewer that supports the SQLite format,
such as [DB Browser for SQLite](https://sqlitebrowser.org/).

(The location of the SQLite database file is specified in the `dbdirectory` field of the config file, which is set to
`<home>/secfsdstools/data/db` in the default configuration. The database file is named `secfsdstools.db`.)

There are only two relevant tables in the database: `index_parquet_reports` and `index_parquet_processing_state`.

The `index_parquet_reports` table provides an overview of all available reports in the downloaded
data and includes the following relevant columns:

* **adsh** : The unique id of the report (a string).
* **cik** : The unique id of the company (an int).
* **name** : The name of the company in uppercases.
* **form** : The type of the report (e.g.: annual: 10-K, quarterly: 10-Q).
* **filed** : The date when the report has been filed in the format YYYYMMDD (stored as a integer number).
* **period** : The date for which the report was create. this is the date on the balancesheet.(stored as a integer number) 
* **fullPath** : The path to the downloaded zip files that contains the details of that report.
* **url** : The url which takes you directly to the filing of this report on the sec.gov website.

For instance, if you want to have an overview of all reports that Apple has filed since 2010,
just search for "%APPLE INC%" in the name column.

Searching for "%APPLE INC%" will also reveal its cik: 320193

If you accidentally delete data in the database file, don't worry. Just delete the database file
and run `update()` again (see previous chapter).




## A first simple example
Goal: present the information in the balance sheet of Apple's 2022 10-K report in the same way as it appears in the
original report on page 31 ("CONSOLIDATED BALANCE SHEETS"): https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm

````
from secfsdstools.e_collector.reportcollecting import SingleReportCollector
from secfsdstools.e_filter.rawfiltering import ReportPeriodAndPreviousPeriodRawFilter
from secfsdstools.e_presenter.presenting import StandardStatementPresenter

if __name__ == '__main__':
    # the unique identifier for apple's 10-K report of 2022
    apple_10k_2022_adsh = "0000320193-22-000108"
  
    # us a Collector to grab the data of the 10-K report. an filter for balancesheet information
    collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(
          adsh=apple_10k_2022_adsh,
          stmt_filter=["BS"]
    )  
    rawdatabag = collector.collect() # load the data from the disk
    
   
    bs_df = (rawdatabag
                       # ensure only data from the period (2022) and the previous period (2021) is in the data
                       .filter(ReportPeriodAndPreviousPeriodRawFilter())
                       # join the the content of the pre_txt and num_txt together
                       .join()  
                       # format the data in the same way as it appears in the report
                       .present(StandardStatementPresenter())) 
    print(bs_df) 
````

## Overview
The following diagram gives an overview on SECFSDSTools library.

![Overview](https://github.com/HansjoergW/sec-fincancial-statement-data-set/raw/main/docs/images/overview.png)

It mainly exists out of two main processes. The first one ist the "Date Update Process" which is responsible for the
download of the Financial Statement Data Sets zip files from the sec.gov website, transforming the content into parquet
format, and indexing the content of these files in a simple SQLite database. Again, this whole process can be started
"manually" by calling the update method, or it is done automatically, as it described above.

The second main process is the "Data Processing Process", which is working with the data that is stored inside the
sub.txt, pre.txt, and num.txt files from the zip files. The "Data Processing Process" mainly exists out of four steps:

* **Collect** <br/> Collect the rawdata from one or more different zip files. For instance, get all the data for a single
report, or get the data for all 10-K reports of a single or multiple companies from several zip files.
* **Raw Processing** <br/> Once the data is collected, the collected data for sub.txt, pre.txt, and num.txt is available
as a pandas dataframe. Filters can be applied, the content can directly be saved and loaded.
* **Joined Processing** <br/> From the "Raw Data", a "joined" representation can be created. This joins the data from
the pre.txt and num.txt content together based on the "adhs", "tag", and "version" attributes. "Joined data" can also be
filtered, concatenated, directly saved and loaded.
* **Present** <br/> Produce a single pandas dataframe out of the data and use it for further processing or use the standardizers
 to create comparable data for the balance sheet, the income statement, and the cash flow statement.

The diagramm also shows the main classes with which a user interacts. The use of them  is described in the following chapters.

## General
Most of the classes you can interact with have a factory method which name starts with `get_`. All this factory method
take at least one **optional** parameter called configuration which is of type `Configuration`.

If you do not provide this parameter, the class will read the configuration info from you configuration file in your home
directory. If, for whatever reason, you do want to provide an alternative configuration, you can overwrite it.

However, normally you do not have to provide the `configuration` parameter.

## Index: working with the index
The first class that interacts with the index is the `IndexSearch` class. It provides a single method `find_company_by_name`
which executes a SQL Like search on the name of the available companies and returns a pandas dataframe with the columns
`name` and `cik` (the central index key, or the unique id of a company in the financial statements data sets).
The main purpose of this class is to find the cik for a company (of course, you can also directly search the cik on https://www.sec.gov/edgar/searchedgar/companysearch).


```
from secfsdstools.c_index.searching import IndexSearch

index_search = IndexSearch.get_index_search()
results = index_search.find_company_by_name("apple")
print(results)
```

*Output:*
````
                           name      cik
      APPLE GREEN HOLDING, INC.  1510976
   APPLE HOSPITALITY REIT, INC.  1418121
                      APPLE INC   320193
         APPLE REIT EIGHT, INC.  1387361
          APPLE REIT NINE, INC.  1418121
         APPLE REIT SEVEN, INC.  1329011
             APPLE REIT SIX INC  1277151
           APPLE REIT TEN, INC.  1498864
         APPLETON PAPERS INC/WI  1144326
  DR PEPPER SNAPPLE GROUP, INC.  1418135
   MAUI LAND & PINEAPPLE CO INC    63330
          PINEAPPLE ENERGY INC.    22701
  PINEAPPLE EXPRESS CANNABIS CO  1710495
        PINEAPPLE EXPRESS, INC.  1654672
       PINEAPPLE HOLDINGS, INC.    22701
                PINEAPPLE, INC.  1654672
````


Once you have the cik of a company, you can use the `CompanyIndexReader` to get information on available reports of a company.
To get an instance of the class, you use the get `get_company_index_reader` method and provide the cik parameter.

````
from secfsdstools.c_index.companyindexreading import CompanyIndexReader

apple_cik = 320193
apple_index_reader = CompanyIndexReader.get_company_index_reader(cik=apple_cik)
````

First, you could use the method `get_latest_company_filing` which returns a dictionary with the latest filing of the company:

````
print(apple_index_reader.get_latest_company_filing())
````
*Output:*
````
{'adsh': '0001140361-23-023909', 'cik': 320193, 'name': 'APPLE INC', 'sic': 3571.0, 'countryba': 'US', 'stprba': 'CA', 'cityba': 'CUPERTINO', 
'zipba': '95014', 'bas1': 'ONE APPLE PARK WAY', 'bas2': None, 'baph': '(408) 996-1010', 'countryma': 'US', 'stprma': 'CA', 
'cityma': 'CUPERTINO', 'zipma': '95014', 'mas1': 'ONE APPLE PARK WAY', 'mas2': None, 'countryinc': 'US', 'stprinc': 'CA', 
'ein': 942404110, 'former': 'APPLE INC', 'changed': 20070109.0, 'afs': '1-LAF', 'wksi': 0, 'fye': '0930', 'form': '8-K', 
'period': 20230430, 'fy': nan, 'fp': None, 'filed': 20230510, 'accepted': '2023-05-10 16:31:00.0', 'prevrpt': 0, 'detail': 0, 
'instance': 'ny20007635x4_8k_htm.xml', 'nciks': 1, 'aciks': None}
````

Next there are two methods which return the metadata of the reports that a company has filed. The result is either
returned as a list of `IndexReport` instances, if you use the method `get_all_company_reports` or as pandas dataframe if
you use the method `get_all_company_reports_df`. Both method can take an optional parameter forms, which defines the
type of the report that shall be returned. For instance, if you are only interested in the annual and quarterly report,
set forms to `["10-K", "10-Q"]`.

````
# only show the annual reports of apple
print(apple_index_reader.get_all_company_reports_df(forms=["10-K"]))
````

*Output:*
````
                 adsh     cik       name  form     filed    period                                           fullPath  originFile originFileType                                                url
 0000320193-22-000108  320193  APPLE INC  10-K  20221028  20220930  C:\Users\hansj\secfsdstools\data\parquet\quart...  2022q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...
 0000320193-21-000105  320193  APPLE INC  10-K  20211029  20210930  C:\Users\hansj\secfsdstools\data\parquet\quart...  2021q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...
 0000320193-20-000096  320193  APPLE INC  10-K  20201030  20200930  C:\Users\hansj\secfsdstools\data\parquet\quart...  2020q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...
 0000320193-19-000119  320193  APPLE INC  10-K  20191031  20190930  C:\Users\hansj\secfsdstools\data\parquet\quart...  2019q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...
 0000320193-18-000145  320193  APPLE INC  10-K  20181105  20180930  C:\Users\hansj\secfsdstools\data\parquet\quart...  2018q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...
 0000320193-17-000070  320193  APPLE INC  10-K  20171103  20170930  C:\Users\hansj\secfsdstools\data\parquet\quart...  2017q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...
 0001628280-16-020309  320193  APPLE INC  10-K  20161026  20160930  C:\Users\hansj\secfsdstools\data\parquet\quart...  2016q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...
 0001193125-15-356351  320193  APPLE INC  10-K  20151028  20150930  C:\Users\hansj\secfsdstools\data\parquet\quart...  2015q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...
 0001193125-14-383437  320193  APPLE INC  10-K  20141027  20140930  C:\Users\hansj\secfsdstools\data\parquet\quart...  2014q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...
 0001193125-13-416534  320193  APPLE INC  10-K  20131030  20130930  C:\Users\hansj\secfsdstools\data\parquet\quart...  2013q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...
 0001193125-12-444068  320193  APPLE INC  10-K  20121031  20120930  C:\Users\hansj\secfsdstools\data\parquet\quart...  2012q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...
 0001193125-11-282113  320193  APPLE INC  10-K  20111026  20110930  C:\Users\hansj\secfsdstools\data\parquet\quart...  2011q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...
 0001193125-10-238044  320193  APPLE INC  10-K  20101027  20100930  C:\Users\hansj\secfsdstools\data\parquet\quart...  2010q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...
 0001193125-09-214859  320193  APPLE INC  10-K  20091027  20090930  C:\Users\hansj\secfsdstools\data\parquet\quart...  2009q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...
````

## Collect: collecting the data for reports
The previously introduced `IndexSearch` and `CompanyIndexReader` let you know what data is available, but they do not
return the real data of the financial statements. This is what the `Collector` classes are used for.

All the `Collector` classes have their own factory method(s) which instantiates the class. Most of these factory methods
also provide parameters to filter the data directly when being loaded from the parquet files.
These are
* the `forms_filter` <br> lets you select which report type should be loaded (e.g. "10-K" or "10-Q").<br>
  Note: the fomrs filter affects all dataframes (sub, pre, num).
* the `stmt_filter` <br> defines the statements that should be loaded (e.g., "BS" if only "Balance Sheet" data should be loaded) <br>
  Note: the stmt filter only affects the pre dataframe.
* the `tag_filter` <br> defines the tags, that should be loaded (e.g., "Assets" if only the "Assets" tag should be loaded) <br>
  Note: the tag filter affects the pre and num dataframes.

It is also possible to apply filter for these attributes after the data is loaded, but since the `Collector` classes
apply this filters directly during the load process from the parquet files (which means that fewer data is loaded from
the disk) this is generally more efficient.

All `Collector` classes have a `collect` method which then loads the data from the parquet files and returns an instance
of `RawDataBag`. The `RawDataBag` instance contains then a pandas dataframe for the `sub` (subscription) data,
`pre` (presentation) data, and `num` (the numeric values) data.

The framework provides the following collectors:
* `SingleReportCollector` <br> As the name suggests, this `Collector` returns the data of a single report. It is 
  instantiated by providing the `adsh` of the desired report as parameter of the `get_report_by_adsh` factory method, 
  or by using an instance of the `IndexReport` as parameter of the `get_report_by_indexreport`. (As a reminder: 
  instances of `IndexReport` are returned by the `CompanyIndexReader` class).
  <br><br>*Example:*
    ````
    from secfsdstools.e_collector.reportcollecting import SingleReportCollector

    apple_10k_2022_adsh = "0000320193-22-000108"

    collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(adsh=apple_10k_2022_adsh)
    rawdatabag = collector.collect()

    # as expected, there is just one entry in the submission dataframe
    print(rawdatabag.sub_df)
    # just print the size of the pre and num dataframes
    print(rawdatabag.pre_df.shape)
    print(rawdatabag.num_df.shape)
    ````
    <br>*Output*:
    ````
                       adsh     cik       name     sic countryba stprba     cityba  ...
    0  0000320193-22-000108  320193  APPLE INC  3571.0        US     CA  CUPERTINO  ...
    (185, 10)
    (503, 9)  
    ````
    <br>

* `MultiReportCollector` <br> Contrary to the `SingleReportCollector`, this `Collector` can collect data from several
  reports. Moreover, the data of the reports are loaded in parallel, this  especially improves the performance if the
  reports are from different quarters (resp. are in different zip files). The class provides the factory methods 
  `get_reports_by_adshs` and `get_reports_by_indexreports`. The first takes a list of adsh strings, the second a list
  of `IndexReport` instances.
  <br><br>*Example:*
    ````
    from secfsdstools.e_collector.multireportcollecting import MultiReportCollector
    apple_10k_2022_adsh = "0000320193-22-000108"
    apple_10k_2012_adsh = "0001193125-12-444068"

    if __name__ == '__main__':
        # load only the assets tags that are present in the 10-K report of apple in the years
        # 2022 and 2012
        collector: MultiReportCollector = \
            MultiReportCollector.get_reports_by_adshs(adshs=[apple_10k_2022_adsh,
                                                             apple_10k_2012_adsh],
                                                      tag_filter=['Assets'])
        rawdatabag = collector.collect()
        # as expected, there are just two entries in the submission dataframe
        print(rawdatabag.sub_df)
        print(rawdatabag.num_df)  
    ```` 
  <br>*Output*:
    ````
                       adsh     cik       name     sic countryba stprba     cityba  ...          
    0  0000320193-22-000108  320193  APPLE INC  3571.0        US     CA  CUPERTINO  ...
    1  0001193125-12-444068  320193  APPLE INC  3571.0        US     CA  CUPERTINO  ...
    
                       adsh     tag       version coreg     ddate  qtrs  uom         value footnote
    0  0000320193-22-000108  Assets  us-gaap/2022        20210930     0  USD  3.510020e+11     None
    1  0000320193-22-000108  Assets  us-gaap/2022        20220930     0  USD  3.527550e+11     None
    2  0001193125-12-444068  Assets  us-gaap/2012        20110930     0  USD  1.163710e+11     None
    3  0001193125-12-444068  Assets  us-gaap/2012        20120930     0  USD  1.760640e+11     None  
    ````
    <br>
* `ZipCollector` <br> This `Collector` collects the data of one or more zip (resp. the folders that contain the parquet
  files of this zip files). And since every of the original zip files contains the data for one quarter, the names you provide
  in the `get_zip_by_name` or `get_zip_by_names` factory methods reflect the quarter which data you want to load: 
  e.g. `2022q1.zip`.
 
  <br><br>*Example:*
    ````
    from secfsdstools.e_collector.zipcollecting import ZipCollector

    # only collect the Balance Sheet of annual reports that
    # were filed during the first quarter in 2022
    if __name__ == '__main__':
        collector: ZipCollector = ZipCollector.get_zip_by_name(name="2022q1.zip",
                                                               forms_filter=["10-K"],
                                                               stmt_filter=["BS"])
    
        rawdatabag = collector.collect()
    
        # only show the size of the data frame
        # .. over 4000 companies filed a 10 K report in q1 2022
        print(rawdatabag.sub_df.shape)
        print(rawdatabag.pre_df.shape)
        print(rawdatabag.num_df.shape)    
    ```` 
  <br>*Output*:
    ````
    (4875, 36)
    (232863, 10)
    (2404949, 9)
    ````

* `CompanyReportCollector` <br> This class returns reports for one or more companies. The factory method 
  `get_company_collector` provides the parameter `ciks` which takes a list of cik numbers.
  <br><br>*Example:*
    ````
    from secfsdstools.e_collector.companycollecting import CompanyReportCollector
    
    if __name__ == '__main__':
        apple_cik = 320193
        collector = CompanyReportCollector.get_company_collector(ciks=[apple_cik],
                                                                 forms_filter=["10-K"])
    
        rawdatabag = collector.collect()
    
        # all filed 10-K reports for apple since 2010 are in the databag
        print(rawdatabag.sub_df)
    
        print(rawdatabag.pre_df.shape)
        print(rawdatabag.num_df.shape)    
    ```` 
  <br>*Output*:
    ````
                        adsh     cik       name     sic ...
    0   0000320193-22-000108  320193  APPLE INC  3571.0 ...
    1   0000320193-21-000105  320193  APPLE INC  3571.0 ...
    2   0000320193-20-000096  320193  APPLE INC  3571.0 ...
    3   0000320193-19-000119  320193  APPLE INC  3571.0 ...
    4   0000320193-18-000145  320193  APPLE INC  3571.0 ...
    5   0000320193-17-000070  320193  APPLE INC  3571.0 ...
    6   0001628280-16-020309  320193  APPLE INC  3571.0 ...
    7   0001193125-15-356351  320193  APPLE INC  3571.0 ...
    8   0001193125-14-383437  320193  APPLE INC  3571.0 ...
    9   0001193125-13-416534  320193  APPLE INC  3571.0 ...
    10  0001193125-12-444068  320193  APPLE INC  3571.0 ...
    11  0001193125-11-282113  320193  APPLE INC  3571.0 ...
    12  0001193125-10-238044  320193  APPLE INC  3571.0 ...
    13  0001193125-09-214859  320193  APPLE INC  3571.0 ...
    (2246, 10)
    (7925, 9)
    Process finished with exit code 0  
    ````

Have a look at the [collector_deep_dive notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/04_collector_deep_dive.ipynb).



## Raw Processing: working with the raw data
When the `collect` method of a `Collector` class is called, the data for the sub, pre, and num dataframes are loaded
and being stored in the sub_df, pre_df, and num_df attributes inside an instance of `RawDataBag`.

The `RawDataBag` provides the following methods:
* `save`, `load`<br> The content of a `RawDataBag` can be saved into a directory. Within that directory, 
   parquet files are stored for the content of the sub_df, pre_df, and num_df. In order to load this 
   data directly, the static method `RawDataBag.load()` can be used.
* `concat`<br> Several instances of a `RawDataBag` can be concatenated into one single instance. In order to do 
   that, the static method `RawDataBag.concat()` takes a list of RawDataBag as parameter.
* `join` <br> This method produces a `JoinedRawDataBag` by joining the content of the pre_df and num_df
   based on the columns adsh, tag, and version. It is an inner join. The joined dataframe appears as pre_num_df in
   the `JoinedRawDataBag`.
* `filter` <br> The filter method takes a parameter of the type `FilterRaw`, applies it to the data and
   produces a new instance of `RawDataBag` with the filtered data. Therefore, filters can also be chained like
   `a_filtered_RawDataBag = a_RawDataBag.filter(filter1).filter(filter2)`. Moreover, the `__get__item` method
   is forwarded to the filter method, so you can also write `a_filtered_RawDataBag = a_RawDataBag[filter1][filter2]`.

It is simple to write your own filters, just get some inspiration from the once that are already present in the
Framework (module `secfsdstools.e_filter.rawfiltering`:

* `AdshRawFilter` <br> Filters the `RawDataBag` instance based on the list of adshs that were provided in the constructor. <br>
   ````
   a_filtered_RawDataBag = a_RawDataBag.filter(AdshRawFilter(adshs=['0001193125-09-214859', '0001193125-10-238044']))
   ````
* `StmtRawFilter` <br> Filters the `RawDataBag`instance based on the list of statements ('BS', 'CF', 'IS', ...). <br>
   ````
   a_filtered_RawDataBag = a_RawDataBag.filter(StmtRawFilter(stmts=['BS', 'CF']))
   ````
* `TagRawFilter` <br> Filters the `RawDataBag`instance based on the list of tags that is provided. <br>
   ````
   a_filtered_RawDataBag = a_RawDataBag.filter(TagRawFilter(tags=['Assets', 'Liabilities']))
   ````
* `MainCoregRawFilter` <br> Filters the `RawDataBag` so that data of subsidiaries are removed.
   ````
   a_filtered_RawDataBag = a_RawDataBag.filter(MainCoregRawFilter()) 
   ````
* `ReportPeriodAndPreviousPeriodRawFilter` <br> The data of a report usually also contains data from previous years.
  However, often you want just to analyze the data of the current and the previous year. This filter ensures that
  only data for the current period and the previous period are contained in the data.
   ````
   a_filtered_RawDataBag = a_RawDataBag.filter(ReportPeriodAndPreviousPeriodRawFilter()) 
   ````
* `ReportPeriodRawFilter` <br> If you are just interested in the data of a report that is from the current period
  of the report then you can use this filter. For instance, if you use a `CompanyReportCollector` to collect all
  10-K reports of this company, you want to ensure that every report only contains data for its own period and not for
  previous periods.
   ````
   a_filtered_RawDataBag = a_RawDataBag.filter(ReportPeriodRawFilter()) 
   ````
* `OfficialTagsOnlyRawFilter` <br> Sometimes company provide their own tags, which are not defined by the us-gaap XBRL
  definition. In such cases, the version columns contains the value of the adsh instead of something like us-gab/2022.
  This filter removes unofficial tags.
   ````
   a_filtered_RawDataBag = a_RawDataBag.filter(OfficialTagsOnlyRawFilter()) 
   ````  
* `USDOnlyRawFilter` <br> Reports often also contain datapoints in other currency than USD. So it might happen that
  the same datapoint in a balance sheet is present in different currencies. If you are just interested in the USD
  currency, then we can use this filter.
   ````
   a_filtered_RawDataBag = a_RawDataBag.filter(USDOnlyRawFilter()) 
   ````  

Have a look at the [filter_deep_dive notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/05_filter_deep_dive.ipynb).



## Joined Processing: working with joined data
When the `join` method of a `RawDataBag` instance is called an instance of `JoinedDataBag` is returned. The returned
instance contains an attribute sub_df, which is a reference to the same sub_df that is in the `RawDataBag`.
In addition to that, the `JoinedDataBag` contains an attribut pre_num_df, which is an inner join of the pre_df and 
the num_df based on the columns adsh, tag, and version. Note that an entry in the pre_df can be joined with more than 
one entry in the num_df.

The `JoinedDataBag` provides the following methods:
* `save`, `load`<br> The content of a `JoinedDataBag` can be saved into a directory. Within that directory,
  parquet files are stored for the content of the sub_df, pre_df, and num_df. In order to load this
  data directly, the static method `JoinedDataBag.save()` can be used.
* `concat`<br> Several instances of a `JoinedDataBag` can be concatenated in one single instance. In order to do
  that, the static method `JoinedDataBag.concat()` takes a list of RawDataBag as parameter.
* `filter` <br> The filter method takes a parameter of the type `FilterJoined`, applies it to the data and
  produces a new instance of `JoinedDataBag` with the filtered data. Therefore, filters can also be chained like
  `a_filtered_JoinedDataBag = a_JoinedDataBag.filter(filter1).filter(filter2)`. Moreover, the `__get__item` method
  is forwarded to the filter method, so you can also write `a_filtered_JoinedDataBag = a_JoinedDataBag[filter1][filter2]`.
  **Note**: The same filters that are present for the `RawDataBag` are also available for the `JoinedDataBag`. Just
  look into the module `secfsdstools.e_filter.joinedfiltering`
* `present` <br> The idea of the present method is to make a final presentation of the data as pandas dataframe. 
  The method has a parameter presenter of type Presenter.

## Present
It is simple to write your own presenter classes. So far, the framework provides the following Presenter 
implementations (module `secfsdstools.e_presenter.presenting`):

* `StandardStatementPresenter` <br> This presenter provides the data in the same form, as you are used to see in
  the reports itself. For instance, the primary financial statements balance sheet, income statement, and cash flow
  display the different positions in rows and the columns contain the different dates/periods of the data.
  Let us say you want to recreate the BS information of the apples 10-K report of 2022, you would write:
  ````
  apple_10k_2022_adsh = "0000320193-22-000108"

  collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(
        adsh=apple_10k_2022_adsh,
        stmt_filter=["BS"]
  )
  rawdatabag = collector.collect()
  bs_df = rawdatabag.filter(ReportPeriodAndPreviousPeriodRawFilter())
                    .join()
                    .present(StandardStatementPresenter())
  print(bs_df) 
  ````
  <br>*Output*:
  ````  
                        adsh coreg                                              tag       version stmt  report  line     uom  negating  inpth  qrtrs_0/20220930  qrtrs_0/20210930
   0   0000320193-22-000108                  CashAndCashEquivalentsAtCarryingValue  us-gaap/2022   BS       5     3     USD         0      0        2.364600e+10        3.494000e+10
   1   0000320193-22-000108                            MarketableSecuritiesCurrent  us-gaap/2022   BS       5     4     USD         0      0        2.465800e+10        2.769900e+10
   2   0000320193-22-000108                           AccountsReceivableNetCurrent  us-gaap/2022   BS       5     5     USD         0      0        2.818400e+10        2.627800e+10
   3   0000320193-22-000108                                           InventoryNet  us-gaap/2022   BS       5     6     USD         0      0        4.946000e+09        6.580000e+09
   4   0000320193-22-000108                             NontradeReceivablesCurrent  us-gaap/2022   BS       5     7     USD         0      0        3.274800e+10        2.522800e+10
   5   0000320193-22-000108                                     OtherAssetsCurrent  us-gaap/2022   BS       5     8     USD         0      0        2.122300e+10        1.411100e+10
   6   0000320193-22-000108                                          AssetsCurrent  us-gaap/2022   BS       5     9     USD         0      0        1.354050e+11        1.348360e+11
   7   0000320193-22-000108                         MarketableSecuritiesNoncurrent  us-gaap/2022   BS       5    11     USD         0      0        1.208050e+11        1.278770e+11
   8   0000320193-22-000108                           PropertyPlantAndEquipmentNet  us-gaap/2022   BS       5    12     USD         0      0        4.211700e+10        3.944000e+10
   9   0000320193-22-000108                                  OtherAssetsNoncurrent  us-gaap/2022   BS       5    13     USD         0      0        5.442800e+10        4.884900e+10
   10  0000320193-22-000108                                       AssetsNoncurrent  us-gaap/2022   BS       5    14     USD         0      0        2.173500e+11        2.161660e+11
   11  0000320193-22-000108                                                 Assets  us-gaap/2022   BS       5    15     USD         0      0        3.527550e+11        3.510020e+11
   12  0000320193-22-000108                                 AccountsPayableCurrent  us-gaap/2022   BS       5    18     USD         0      0        6.411500e+10        5.476300e+10
   13  0000320193-22-000108                                OtherLiabilitiesCurrent  us-gaap/2022   BS       5    19     USD         0      0        6.084500e+10        4.749300e+10
   14  0000320193-22-000108                   ContractWithCustomerLiabilityCurrent  us-gaap/2022   BS       5    20     USD         0      0        7.912000e+09        7.612000e+09
   15  0000320193-22-000108                                        CommercialPaper  us-gaap/2022   BS       5    21     USD         0      0        9.982000e+09        6.000000e+09
   16  0000320193-22-000108                                    LongTermDebtCurrent  us-gaap/2022   BS       5    22     USD         0      0        1.112800e+10        9.613000e+09
   17  0000320193-22-000108                                     LiabilitiesCurrent  us-gaap/2022   BS       5    23     USD         0      0        1.539820e+11        1.254810e+11
   18  0000320193-22-000108                                 LongTermDebtNoncurrent  us-gaap/2022   BS       5    25     USD         0      0        9.895900e+10        1.091060e+11
   19  0000320193-22-000108                             OtherLiabilitiesNoncurrent  us-gaap/2022   BS       5    26     USD         0      0        4.914200e+10        5.332500e+10
   20  0000320193-22-000108                                  LiabilitiesNoncurrent  us-gaap/2022   BS       5    27     USD         0      0        1.481010e+11        1.624310e+11
   21  0000320193-22-000108                                            Liabilities  us-gaap/2022   BS       5    28     USD         0      0        3.020830e+11        2.879120e+11
   22  0000320193-22-000108           CommonStocksIncludingAdditionalPaidInCapital  us-gaap/2022   BS       5    31     USD         0      0        6.484900e+10        5.736500e+10
   23  0000320193-22-000108                     RetainedEarningsAccumulatedDeficit  us-gaap/2022   BS       5    32     USD         0      0       -3.068000e+09        5.562000e+09
   24  0000320193-22-000108        AccumulatedOtherComprehensiveIncomeLossNetOfTax  us-gaap/2022   BS       5    33     USD         0      0       -1.110900e+10        1.630000e+08
   25  0000320193-22-000108                                     StockholdersEquity  us-gaap/2022   BS       5    34     USD         0      0        5.067200e+10        6.309000e+10
   26  0000320193-22-000108                       LiabilitiesAndStockholdersEquity  us-gaap/2022   BS       5    35     USD         0      0        3.527550e+11        3.510020e+11
   27  0000320193-22-000108                    CommonStockParOrStatedValuePerShare  us-gaap/2022   BS       6     1     USD         0      1        0.000000e+00        0.000000e+00
   28  0000320193-22-000108                            CommonStockSharesAuthorized  us-gaap/2022   BS       6     2  shares         0      1        5.040000e+10        5.040000e+10
   29  0000320193-22-000108                                CommonStockSharesIssued  us-gaap/2022   BS       6     3  shares         0      1        1.594342e+10        1.642679e+10
   30  0000320193-22-000108                           CommonStockSharesOutstanding  us-gaap/2022   BS       6     4  shares         0      1        1.594342e+10        1.642679e+10  
  ````  
  If you compare this with the real report at https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm
  you will notice, that order of the tags and the values are the same.


* `Standardizer` <br>
  Even if xbrl is a standard on how to tag positions and numbers in financial statements, that doesn't mean that financial
  statements can then be compared easily. For instance, there are over 3000 tags which can be used in a balance sheet.
  Moreover, some tags can mean similar things or can be grouped behind a "parent" tag, which itself might not be present.
  For instance, "AccountsNoncurrent" is often not shown in statements. So you would find the position for "Accounts"
  and "AccountsCurrent", but not for "AccountsNoncurrent". Instead, only child tags for "AccountsNoncurrent" might be
  present.<br><br>
  The standardizer helps to solve these problems by unifying the information of financial statements.<br> <br>
  With the standardized financial statements, you can then actually compare the statements between different
  companies or different years, and you can use the dataset for ML. <br><br>
  Have a look at [standardizer_basics](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_00_standardizer_basics.ipynb) which explains it in more details.<br><br>

  * `BalanceSheetStandardizer` <br>
  The `BalanceSheetStandardizer` collects and/or calculates the following positions of balance sheets:  

    ````
    - Assets
      - AssetsCurrent
        - Cash
      - AssetsNoncurrent
    - Liabilities
      - LiabilitiesCurrent
      - LiabilitiesNoncurrent
    - Equity
      - HolderEquity (mainly StockholderEquity or PartnerCapital)
        - RetainedEarnings
        - AdditionalPaidInCapital
        - TreasuryStockValue
      - TemporaryEquity
      - RedeemableEquity
    - LiabilitiesAndEquity
    ````

    With just a few lines of code, you'll get a comparable dataset with the main positions of a balance sheet for Microsoft, Alphabet, and Amazon:
    (see the [stanardize the balance sheets and make them comparable notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_01_BS_standardizer.ipynb) for details)
     ````python
     from secfsdstools.e_collector.companycollecting import CompanyReportCollector
     from secfsdstools.e_filter.rawfiltering import ReportPeriodRawFilter, MainCoregRawFilter, OfficialTagsOnlyRawFilter, USDOnlyRawFilter
     from secfsdstools.f_standardize.bs_standardize import BalanceSheetStandardizer
   
     bag = CompanyReportCollector.get_company_collector(ciks=[789019, 1652044,1018724]).collect() #Microsoft, Alphabet, Amazon
     filtered_bag = bag[ReportPeriodRawFilter()][MainCoregRawFilter()][OfficialTagsOnlyRawFilter()][USDOnlyRawFilter()]
     joined_bag = filtered_bag.join()
   
     standardizer = BalanceSheetStandardizer()
   
     standardized_bs_df = joined_bag.present(standardizer)
   
     import matplotlib.pyplot as plt
     # Group by 'name' and plot equity for each group
     # Note: using the `present` method ensured that the same cik has always the same name even if the company name did change in the past
     for name, group in standardized_bs_df.groupby('name'):
       plt.plot(group['date'], group['Equity'], label=name, linestyle='-')
   
     # Add labels and title
     plt.xlabel('Date')
     plt.ylabel('Equity')
     plt.title('Equity Over Time for Different Companies (CIKs)')
   
     # Display legend
     plt.legend()
     ````
     ![Equity Compare](https://github.com/HansjoergW/sec-fincancial-statement-data-set/raw/main/docs/images/equity_compare.png)

  * `IncomeStatementStandardizer` <br>
  The `IncomeStatementStandardizer` collects and/or calculates the following positions of balance sheets:
    
    ````  
      Revenues
      - CostOfRevenue
      ---------------
      = GrossProfit
      - OperatingExpenses
      -------------------
      = OperatingIncomeLoss
        
      IncomeLossFromContinuingOperationsBeforeIncomeTaxExpenseBenefit
      - AllIncomeTaxExpenseBenefit
      ----------------------------
      = IncomeLossFromContinuingOperations
      + IncomeLossFromDiscontinuedOperationsNetOfTax
      -----------------------------------------------
      = ProfitLoss
      - NetIncomeLossAttributableToNoncontrollingInterest
      ---------------------------------------------------
      = NetIncomeLoss
    
      OustandingShares
      EarningsPerShare
    ````
  
    With just a few lines of code, you'll get a comparable dataset with the main positions of an income statement for Microsoft, Alphabet, and Amazon:
  (see the [standardize the income statement and make them comparable notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_02_IS_standardizer.ipynb) for details)
   
    ````python
    from secfsdstools.e_collector.companycollecting import CompanyReportCollector
    from secfsdstools.e_filter.rawfiltering import ReportPeriodRawFilter, MainCoregRawFilter, OfficialTagsOnlyRawFilter, USDOnlyRawFilter
    from secfsdstools.f_standardize.is_standardize import IncomeStatementStandardizer
      
    bag = CompanyReportCollector.get_company_collector(ciks=[789019, 1652044,1018724]).collect() #Microsoft, Alphabet, Amazon
    filtered_bag = bag[ReportPeriodRawFilter()][MainCoregRawFilter()][OfficialTagsOnlyRawFilter()][USDOnlyRawFilter()]
    joined_bag = filtered_bag.join()
      
    standardizer = IncomeStatementStandardizer()
      
    standardized_is_df = joined_bag.present(standardizer)
    # just use the yearly reports with data for the whole year
    standardized_is_df = standardized_is_df[(standardized_is_df.fp=="FY") & (standardized_is_df.qtrs==4)].copy()
      
    import matplotlib.pyplot as plt
    # Group by 'name' and plot equity for each group
    # Note: using the `present` method ensured that the same cik has always the same name even if the company name did change in the past
    for name, group in standardized_is_df.groupby('name'):
      plt.plot(group['date'], group['GrossProfit'], label=name, linestyle='-')
      
    # Add labels and title
    plt.xlabel('Date')
    plt.ylabel('GrossProfit')
    plt.title('GrossProfit Over Time for Different Companies (CIKs)')
      
    # Display legend
    plt.legend()
     ````

  ![GrossProfit Compare](https://github.com/HansjoergW/sec-fincancial-statement-data-set/raw/main/docs/images/grossprofit_compare.png)

  * `CashFlowStandardizer` <br>
   The `CashFlowStandardizer` collects and/or calculates the following positions of cash flow statements:
     
    ````
     NetCashProvidedByUsedInOperatingActivities
       CashProvidedByUsedInOperatingActivitiesDiscontinuedOperations
       NetCashProvidedByUsedInOperatingActivitiesContinuingOperations
           DepreciationDepletionAndAmortization
           DeferredIncomeTaxExpenseBenefit
           ShareBasedCompensation
           IncreaseDecreaseInAccountsPayable
           IncreaseDecreaseInAccruedLiabilities
           InterestPaidNet
           IncomeTaxesPaidNet
    
     NetCashProvidedByUsedInInvestingActivities
         CashProvidedByUsedInInvestingActivitiesDiscontinuedOperations
         NetCashProvidedByUsedInInvestingActivitiesContinuingOperations
           PaymentsToAcquirePropertyPlantAndEquipment
           ProceedsFromSaleOfPropertyPlantAndEquipment
           PaymentsToAcquireInvestments
           ProceedsFromSaleOfInvestments
           PaymentsToAcquireBusinessesNetOfCashAcquired
           ProceedsFromDivestitureOfBusinessesNetOfCashDivested
           PaymentsToAcquireIntangibleAssets
           ProceedsFromSaleOfIntangibleAssets
    
     NetCashProvidedByUsedInFinancingActivities
         CashProvidedByUsedInFinancingActivitiesDiscontinuedOperations
         NetCashProvidedByUsedInFinancingActivitiesContinuingOperations
           ProceedsFromIssuanceOfCommonStock
           ProceedsFromStockOptionsExercised
           PaymentsForRepurchaseOfCommonStock
           ProceedsFromIssuanceOfDebt
           RepaymentsOfDebt
           PaymentsOfDividends
    
    
     EffectOfExchangeRateFinal
     CashPeriodIncreaseDecreaseIncludingExRateEffectFinal
    
     CashAndCashEquivalentsEndOfPeriod
    ````

     With just a few lines of code, you'll get a comparable dataset with the main positions of an cash flow statement for Microsoft, Alphabet, and Amazon:
(see the [standardize the cash flow statements and make them comparable](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_03_CF_standardizer.ipynb) for details)
    ````python
    from secfsdstools.e_collector.companycollecting import CompanyReportCollector
    from secfsdstools.e_filter.rawfiltering import ReportPeriodRawFilter, MainCoregRawFilter, OfficialTagsOnlyRawFilter, USDOnlyRawFilter
    from secfsdstools.f_standardize.cf_standardize import CashFlowStandardizer
    
    bag = CompanyReportCollector.get_company_collector(ciks=[789019, 1652044,1018724]).collect() #Microsoft, Alphabet, Amazon
    filtered_bag = bag[ReportPeriodRawFilter()][MainCoregRawFilter()][OfficialTagsOnlyRawFilter()][USDOnlyRawFilter()]
    joined_bag = filtered_bag.join()
    
    standardizer = CashFlowStandardizer()
    
    standardized_cf_df = joined_bag.present(standardizer)
    standardized_cf_df = standardized_cf_df[(standardized_cf_df.fp=="FY") & (standardized_cf_df.qtrs==4)].copy()
    
    import matplotlib.pyplot as plt
    # Group by 'name' and plot NetCashProvidedByUsedInOperatingActivities for each group
    # Note: using the `present` method ensured that the same cik has always the same name even if the company name did change in the past
    for name, group in standardized_cf_df.groupby('name'):
        plt.plot(group['date'], group['NetCashProvidedByUsedInOperatingActivities'], label=name, linestyle='-')
    
    # Add labels and title
    plt.xlabel('Date')
    plt.ylabel('NetCashProvidedByUsedInOperatingActivities')
    plt.title('NetCashProvidedByUsedInOperatingActivities Over Time for Different Companies (CIKs)')
    
    # Display legend
    plt.legend()
    ````
  ![NetCashOperating Compare](https://github.com/HansjoergW/sec-fincancial-statement-data-set/raw/main/docs/images/netcashoperating_compare.png)

# Links 
* [For a detail description of the content and the structure of the dataset](https://www.sec.gov/files/aqfs.pdf)
* [Release Notes](https://hansjoergw.github.io/sec-fincancial-statement-data-set/releasenotes/)
* [Documentation](https://hansjoergw.github.io/sec-fincancial-statement-data-set/)
* [QuickStart Jupyter Notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/01_quickstart.ipynb)
* [Connect to the daily-sec-financial-statement-dataset Notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/02_connect_rapidapi.ipynb)
* [Explore the data with an interactive Notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/03_explore_with_interactive_notebook.ipynb)
* [collector_deep_dive Notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/04_collector_deep_dive.ipynb)
* [filter_deep_dive Notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/05_filter_deep_dive.ipynb).
* [bulk_data_processing_deep_dive Notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/06_bulk_data_processing_deep_dive.ipynb)
* [standardizer_basics](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_00_standardizer_basics.ipynb)
* [BS_standardizer](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_01_BS_standardizer.ipynb)
* [IS_standardizer](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_02_IS_standardizer.ipynb)
* [CF_standardizer](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_03_CF_standardizer.ipynb)
* [checkout the `u_usecases` package](https://hansjoergw.github.io/sec-fincancial-statement-data-set/doc_latest/api/secfsdstools/u_usecases/index.html)
* [standardize the balance sheets and make them comparable](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_01_BS_standardizer.ipynb)
* [standardize the income statements and make them comparable](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_02_IS_standardizer.ipynb)
* [standardize the cash flow statements and make them comparable](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_03_CF_standardizer.ipynb)
* [Trouble hssting and known issues](KNOWN_ISSUES.md)
* [Changelog](CHANGELOG.md)

            

Raw data

            {
    "_id": null,
    "home_page": null,
    "name": "secfsdstools",
    "maintainer": null,
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": null,
    "keywords": "10-K, 10-Q, 8-K, CIK, Data Processing, EDGAR, Finance, Financial Analysis, Financial Statements, Financial Statements Dataset, SEC EDGAR, SEC Filing, SEC.GOV",
    "author": null,
    "author_email": "Hansjoerg Wingeier <hansjoerg.wingeier@gmail.com>",
    "download_url": "https://files.pythonhosted.org/packages/95/54/08e931914ccc5dbc0b2f72431ea4c88ba8fb0cc1c5b4c86bac434a8a351f/secfsdstools-1.7.0.tar.gz",
    "platform": null,
    "description": "# sec-fincancial-statement-data-set\n\nHelper tools to analyze the [Financial Statement Data Sets](https://www.sec.gov/dera/data/financial-statement-data-sets) from the U.S. securities and exchange commission (sec.gov).\nThe SEC releases quarterly zip files, each containing four CSV files with numerical data from all financial reports filed within that quarter. However, accessing data from the past 12 years can be time-consuming due to the large amount of data - over 120 million data points in over 2GB of zip files by 2023.\n\nThis library simplifies the process of working with this data and provides a\nconvenient way to extract information from the primary financial statements - the balance sheet (BS), income statement (IS), and statement of cash flows (CF).\n\nCheck out my article at Medium [Understanding the the SEC Financial Statement Data Sets](https://medium.com/@hansjoerg.wingeier/understanding-the-sec-financial-statement-data-sets-6148e07d1715) to get\nan introduction to the [Financial Statement Data Sets](https://www.sec.gov/dera/data/financial-statement-data-sets).\n\nThe main features include:\n- all data is on your local hard drive and can be updated automatically, no need for numerous API calls\n- data is loaded as pandas files\n- fast and efficient reading of a single report, all reports of one or multiple companies, or even all available reports \n- filter framework with predefined filters, easy to extend, supports easy way of saving, loading, and combining filtered data (see [01_quickstart.ipynb](notebooks/01_quickstart.ipynb) and\n[03_explore_with_interactive_notebook.ipynb](notebooks/03_explore_with_interactive_notebook.ipynb))\n- standardize the data for balance sheets, income statements, and cash flow statements to make reports easily comparable\n(see [07_00_standardizer_basics.ipynb](notebooks/07_00_standardizer_basics.ipynb), \n[07_01_BS_standardizer.ipynb](notebooks/07_01_BS_standardizer.ipynb), \n[07_01_BS_standardizer.ipynb](notebooks/07_01_BS_standardizer.ipynb), and\n[07_03_CF_standardizer.ipynb](notebooks/07_03_CF_standardizer.ipynb))\n\n# Principles\n\nThe goal is to be able to do bulk processing of the data without the need to do countless API calls to sec.gov. Therefore, the quarterly zip files are downloaded and indexed using a SQLite database table.\nThe index table contains information on all filed reports since about 2010, over 500,000 in total. The first\ndownload will take a couple of minutes but after that, all the data is on your local harddisk.\n\nUsing the index in the sqlite db allows for direct extraction of data for a specific report from the\nappropriate zip file, reducing the need to open and search through each zip file.\nMoreover, the downloaded zip files are converted to the parquet format which provides faster read access\nto the data compared to reading the csv files inside the zip files.\n\nThe library is designed to have a low memory footprint, only parsing and reading the data for a specific report into pandas dataframe tables.\n\n\n# Installation and basic usage\n\nThe library has been tested for python version 3.7, 3.8, 3.9, and 3.10.\nThe project is published on [pypi.org](https://pypi.org/project/secfsdstools/). Simply use the following command to install the latest version:\n\n```\npip install secfsdstools\n```\n\n\n\nIf you want to contribute, just clone the project and use a python 3.7 environment.\nThe dependencies are defined in the requirements.txt file or use the pyproject.toml to install them.\n\nIt is possible to write standalone python script but I recommend to first start with interactive jupyter notebooks [01_quickstart.ipynb](notebooks/01_quickstart.ipynb) and [03_explore_with_interactive_notebook.ipynb](notebooks/03_explore_with_interactive_notebook.ipynb) that are located in `notebooks` directory.\n\nUpon using the library for the first time, it downloads the data files and creates the index by calling the `update()` method. You can manually trigger the update using the following code:\n\n```\nfrom secfsdstools.update import update\n\nif __name__ == '__main__':\n    update()\n```\n\nThe following tasks will be executed:\n1. All currently available zip-files are downloaded form sec.gov (these are over 50 files that will need over 2 GB of space on your local drive)\n2. All the zipfiles are transformed and stored as parquet files. Per default, the zipfile is deleted afterwards. If you want to keep the zip files, set the parameter 'KeepZipFiles' in the config file to True.\n3. An index inside a sqlite db file is created\n\n\nMoreover, at most once a day, it is checked if there is a new zip file available on sec.gov. If there is, a download will be started automatically. \nIf you don't want 'auto-update', set the 'AutoUpdate' in your config file to False.\n\n\n\n## Configuration (optional)\n\nIf you don't provide a config file, a config file with name `secfsdstools.cfg` will be created the first time you use the api and placed inside your home directory. \nThe file only requires the following entries:\n\n```\n[DEFAULT]\ndownloaddirectory = c:/users/me/secfsdstools/data/dld\nparquetdirectory = c:/users/me/secfsdstools/data/parquet\ndbdirectory = c:/users/me/secfsdstools/data/db\nuseragentemail = your.email@goeshere.com\n```\n\n\n\nThe `downloaddirectory` is the place where quarterly zip files from the sec.gov are downloaded to.\nThe `parquetdirectory` is the folder where the data is stored in parquet format.\nThe `dbdirectory` is the directory in which the sqllite db is created.\nThe `useragentemail` is used in the requests made to the sec.gov website. Since we only make limited calls to the sec.gov,\nyou can leave the example \"your.email@goeshere.com\". \n\n## Viewing metadata\n\nThe recommend way to view and use the metadata is using `secfsdstools` library functions as described in [notebooks/01_quickstart.ipynb](notebooks/01_quickstart.ipynb)  \n\nThe \"index of reports\" that was created in the previous step can be viewed using a database viewer that supports the SQLite format,\nsuch as [DB Browser for SQLite](https://sqlitebrowser.org/).\n\n(The location of the SQLite database file is specified in the `dbdirectory` field of the config file, which is set to\n`<home>/secfsdstools/data/db` in the default configuration. The database file is named `secfsdstools.db`.)\n\nThere are only two relevant tables in the database: `index_parquet_reports` and `index_parquet_processing_state`.\n\nThe `index_parquet_reports` table provides an overview of all available reports in the downloaded\ndata and includes the following relevant columns:\n\n* **adsh** : The unique id of the report (a string).\n* **cik** : The unique id of the company (an int).\n* **name** : The name of the company in uppercases.\n* **form** : The type of the report (e.g.: annual: 10-K, quarterly: 10-Q).\n* **filed** : The date when the report has been filed in the format YYYYMMDD (stored as a integer number).\n* **period** : The date for which the report was create. this is the date on the balancesheet.(stored as a integer number) \n* **fullPath** : The path to the downloaded zip files that contains the details of that report.\n* **url** : The url which takes you directly to the filing of this report on the sec.gov website.\n\nFor instance, if you want to have an overview of all reports that Apple has filed since 2010,\njust search for \"%APPLE INC%\" in the name column.\n\nSearching for \"%APPLE INC%\" will also reveal its cik: 320193\n\nIf you accidentally delete data in the database file, don't worry. Just delete the database file\nand run `update()` again (see previous chapter).\n\n\n\n\n## A first simple example\nGoal: present the information in the balance sheet of Apple's 2022 10-K report in the same way as it appears in the\noriginal report on page 31 (\"CONSOLIDATED BALANCE SHEETS\"): https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm\n\n````\nfrom secfsdstools.e_collector.reportcollecting import SingleReportCollector\nfrom secfsdstools.e_filter.rawfiltering import ReportPeriodAndPreviousPeriodRawFilter\nfrom secfsdstools.e_presenter.presenting import StandardStatementPresenter\n\nif __name__ == '__main__':\n    # the unique identifier for apple's 10-K report of 2022\n    apple_10k_2022_adsh = \"0000320193-22-000108\"\n  \n    # us a Collector to grab the data of the 10-K report. an filter for balancesheet information\n    collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(\n          adsh=apple_10k_2022_adsh,\n          stmt_filter=[\"BS\"]\n    )  \n    rawdatabag = collector.collect() # load the data from the disk\n    \n   \n    bs_df = (rawdatabag\n                       # ensure only data from the period (2022) and the previous period (2021) is in the data\n                       .filter(ReportPeriodAndPreviousPeriodRawFilter())\n                       # join the the content of the pre_txt and num_txt together\n                       .join()  \n                       # format the data in the same way as it appears in the report\n                       .present(StandardStatementPresenter())) \n    print(bs_df) \n````\n\n## Overview\nThe following diagram gives an overview on SECFSDSTools library.\n\n![Overview](https://github.com/HansjoergW/sec-fincancial-statement-data-set/raw/main/docs/images/overview.png)\n\nIt mainly exists out of two main processes. The first one ist the \"Date Update Process\" which is responsible for the\ndownload of the Financial Statement Data Sets zip files from the sec.gov website, transforming the content into parquet\nformat, and indexing the content of these files in a simple SQLite database. Again, this whole process can be started\n\"manually\" by calling the update method, or it is done automatically, as it described above.\n\nThe second main process is the \"Data Processing Process\", which is working with the data that is stored inside the\nsub.txt, pre.txt, and num.txt files from the zip files. The \"Data Processing Process\" mainly exists out of four steps:\n\n* **Collect** <br/> Collect the rawdata from one or more different zip files. For instance, get all the data for a single\nreport, or get the data for all 10-K reports of a single or multiple companies from several zip files.\n* **Raw Processing** <br/> Once the data is collected, the collected data for sub.txt, pre.txt, and num.txt is available\nas a pandas dataframe. Filters can be applied, the content can directly be saved and loaded.\n* **Joined Processing** <br/> From the \"Raw Data\", a \"joined\" representation can be created. This joins the data from\nthe pre.txt and num.txt content together based on the \"adhs\", \"tag\", and \"version\" attributes. \"Joined data\" can also be\nfiltered, concatenated, directly saved and loaded.\n* **Present** <br/> Produce a single pandas dataframe out of the data and use it for further processing or use the standardizers\n to create comparable data for the balance sheet, the income statement, and the cash flow statement.\n\nThe diagramm also shows the main classes with which a user interacts. The use of them  is described in the following chapters.\n\n## General\nMost of the classes you can interact with have a factory method which name starts with `get_`. All this factory method\ntake at least one **optional** parameter called configuration which is of type `Configuration`.\n\nIf you do not provide this parameter, the class will read the configuration info from you configuration file in your home\ndirectory. If, for whatever reason, you do want to provide an alternative configuration, you can overwrite it.\n\nHowever, normally you do not have to provide the `configuration` parameter.\n\n## Index: working with the index\nThe first class that interacts with the index is the `IndexSearch` class. It provides a single method `find_company_by_name`\nwhich executes a SQL Like search on the name of the available companies and returns a pandas dataframe with the columns\n`name` and `cik` (the central index key, or the unique id of a company in the financial statements data sets).\nThe main purpose of this class is to find the cik for a company (of course, you can also directly search the cik on https://www.sec.gov/edgar/searchedgar/companysearch).\n\n\n```\nfrom secfsdstools.c_index.searching import IndexSearch\n\nindex_search = IndexSearch.get_index_search()\nresults = index_search.find_company_by_name(\"apple\")\nprint(results)\n```\n\n*Output:*\n````\n                           name      cik\n      APPLE GREEN HOLDING, INC.  1510976\n   APPLE HOSPITALITY REIT, INC.  1418121\n                      APPLE INC   320193\n         APPLE REIT EIGHT, INC.  1387361\n          APPLE REIT NINE, INC.  1418121\n         APPLE REIT SEVEN, INC.  1329011\n             APPLE REIT SIX INC  1277151\n           APPLE REIT TEN, INC.  1498864\n         APPLETON PAPERS INC/WI  1144326\n  DR PEPPER SNAPPLE GROUP, INC.  1418135\n   MAUI LAND & PINEAPPLE CO INC    63330\n          PINEAPPLE ENERGY INC.    22701\n  PINEAPPLE EXPRESS CANNABIS CO  1710495\n        PINEAPPLE EXPRESS, INC.  1654672\n       PINEAPPLE HOLDINGS, INC.    22701\n                PINEAPPLE, INC.  1654672\n````\n\n\nOnce you have the cik of a company, you can use the `CompanyIndexReader` to get information on available reports of a company.\nTo get an instance of the class, you use the get `get_company_index_reader` method and provide the cik parameter.\n\n````\nfrom secfsdstools.c_index.companyindexreading import CompanyIndexReader\n\napple_cik = 320193\napple_index_reader = CompanyIndexReader.get_company_index_reader(cik=apple_cik)\n````\n\nFirst, you could use the method `get_latest_company_filing` which returns a dictionary with the latest filing of the company:\n\n````\nprint(apple_index_reader.get_latest_company_filing())\n````\n*Output:*\n````\n{'adsh': '0001140361-23-023909', 'cik': 320193, 'name': 'APPLE INC', 'sic': 3571.0, 'countryba': 'US', 'stprba': 'CA', 'cityba': 'CUPERTINO', \n'zipba': '95014', 'bas1': 'ONE APPLE PARK WAY', 'bas2': None, 'baph': '(408) 996-1010', 'countryma': 'US', 'stprma': 'CA', \n'cityma': 'CUPERTINO', 'zipma': '95014', 'mas1': 'ONE APPLE PARK WAY', 'mas2': None, 'countryinc': 'US', 'stprinc': 'CA', \n'ein': 942404110, 'former': 'APPLE INC', 'changed': 20070109.0, 'afs': '1-LAF', 'wksi': 0, 'fye': '0930', 'form': '8-K', \n'period': 20230430, 'fy': nan, 'fp': None, 'filed': 20230510, 'accepted': '2023-05-10 16:31:00.0', 'prevrpt': 0, 'detail': 0, \n'instance': 'ny20007635x4_8k_htm.xml', 'nciks': 1, 'aciks': None}\n````\n\nNext there are two methods which return the metadata of the reports that a company has filed. The result is either\nreturned as a list of `IndexReport` instances, if you use the method `get_all_company_reports` or as pandas dataframe if\nyou use the method `get_all_company_reports_df`. Both method can take an optional parameter forms, which defines the\ntype of the report that shall be returned. For instance, if you are only interested in the annual and quarterly report,\nset forms to `[\"10-K\", \"10-Q\"]`.\n\n````\n# only show the annual reports of apple\nprint(apple_index_reader.get_all_company_reports_df(forms=[\"10-K\"]))\n````\n\n*Output:*\n````\n                 adsh     cik       name  form     filed    period                                           fullPath  originFile originFileType                                                url\n 0000320193-22-000108  320193  APPLE INC  10-K  20221028  20220930  C:\\Users\\hansj\\secfsdstools\\data\\parquet\\quart...  2022q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...\n 0000320193-21-000105  320193  APPLE INC  10-K  20211029  20210930  C:\\Users\\hansj\\secfsdstools\\data\\parquet\\quart...  2021q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...\n 0000320193-20-000096  320193  APPLE INC  10-K  20201030  20200930  C:\\Users\\hansj\\secfsdstools\\data\\parquet\\quart...  2020q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...\n 0000320193-19-000119  320193  APPLE INC  10-K  20191031  20190930  C:\\Users\\hansj\\secfsdstools\\data\\parquet\\quart...  2019q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...\n 0000320193-18-000145  320193  APPLE INC  10-K  20181105  20180930  C:\\Users\\hansj\\secfsdstools\\data\\parquet\\quart...  2018q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...\n 0000320193-17-000070  320193  APPLE INC  10-K  20171103  20170930  C:\\Users\\hansj\\secfsdstools\\data\\parquet\\quart...  2017q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...\n 0001628280-16-020309  320193  APPLE INC  10-K  20161026  20160930  C:\\Users\\hansj\\secfsdstools\\data\\parquet\\quart...  2016q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...\n 0001193125-15-356351  320193  APPLE INC  10-K  20151028  20150930  C:\\Users\\hansj\\secfsdstools\\data\\parquet\\quart...  2015q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...\n 0001193125-14-383437  320193  APPLE INC  10-K  20141027  20140930  C:\\Users\\hansj\\secfsdstools\\data\\parquet\\quart...  2014q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...\n 0001193125-13-416534  320193  APPLE INC  10-K  20131030  20130930  C:\\Users\\hansj\\secfsdstools\\data\\parquet\\quart...  2013q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...\n 0001193125-12-444068  320193  APPLE INC  10-K  20121031  20120930  C:\\Users\\hansj\\secfsdstools\\data\\parquet\\quart...  2012q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...\n 0001193125-11-282113  320193  APPLE INC  10-K  20111026  20110930  C:\\Users\\hansj\\secfsdstools\\data\\parquet\\quart...  2011q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...\n 0001193125-10-238044  320193  APPLE INC  10-K  20101027  20100930  C:\\Users\\hansj\\secfsdstools\\data\\parquet\\quart...  2010q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...\n 0001193125-09-214859  320193  APPLE INC  10-K  20091027  20090930  C:\\Users\\hansj\\secfsdstools\\data\\parquet\\quart...  2009q4.zip        quarter  https://www.sec.gov/Archives/edgar/data/320193...\n````\n\n## Collect: collecting the data for reports\nThe previously introduced `IndexSearch` and `CompanyIndexReader` let you know what data is available, but they do not\nreturn the real data of the financial statements. This is what the `Collector` classes are used for.\n\nAll the `Collector` classes have their own factory method(s) which instantiates the class. Most of these factory methods\nalso provide parameters to filter the data directly when being loaded from the parquet files.\nThese are\n* the `forms_filter` <br> lets you select which report type should be loaded (e.g. \"10-K\" or \"10-Q\").<br>\n  Note: the fomrs filter affects all dataframes (sub, pre, num).\n* the `stmt_filter` <br> defines the statements that should be loaded (e.g., \"BS\" if only \"Balance Sheet\" data should be loaded) <br>\n  Note: the stmt filter only affects the pre dataframe.\n* the `tag_filter` <br> defines the tags, that should be loaded (e.g., \"Assets\" if only the \"Assets\" tag should be loaded) <br>\n  Note: the tag filter affects the pre and num dataframes.\n\nIt is also possible to apply filter for these attributes after the data is loaded, but since the `Collector` classes\napply this filters directly during the load process from the parquet files (which means that fewer data is loaded from\nthe disk) this is generally more efficient.\n\nAll `Collector` classes have a `collect` method which then loads the data from the parquet files and returns an instance\nof `RawDataBag`. The `RawDataBag` instance contains then a pandas dataframe for the `sub` (subscription) data,\n`pre` (presentation) data, and `num` (the numeric values) data.\n\nThe framework provides the following collectors:\n* `SingleReportCollector` <br> As the name suggests, this `Collector` returns the data of a single report. It is \n  instantiated by providing the `adsh` of the desired report as parameter of the `get_report_by_adsh` factory method, \n  or by using an instance of the `IndexReport` as parameter of the `get_report_by_indexreport`. (As a reminder: \n  instances of `IndexReport` are returned by the `CompanyIndexReader` class).\n  <br><br>*Example:*\n    ````\n    from secfsdstools.e_collector.reportcollecting import SingleReportCollector\n\n    apple_10k_2022_adsh = \"0000320193-22-000108\"\n\n    collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(adsh=apple_10k_2022_adsh)\n    rawdatabag = collector.collect()\n\n    # as expected, there is just one entry in the submission dataframe\n    print(rawdatabag.sub_df)\n    # just print the size of the pre and num dataframes\n    print(rawdatabag.pre_df.shape)\n    print(rawdatabag.num_df.shape)\n    ````\n    <br>*Output*:\n    ````\n                       adsh     cik       name     sic countryba stprba     cityba  ...\n    0  0000320193-22-000108  320193  APPLE INC  3571.0        US     CA  CUPERTINO  ...\n    (185, 10)\n    (503, 9)  \n    ````\n    <br>\n\n* `MultiReportCollector` <br> Contrary to the `SingleReportCollector`, this `Collector` can collect data from several\n  reports. Moreover, the data of the reports are loaded in parallel, this  especially improves the performance if the\n  reports are from different quarters (resp. are in different zip files). The class provides the factory methods \n  `get_reports_by_adshs` and `get_reports_by_indexreports`. The first takes a list of adsh strings, the second a list\n  of `IndexReport` instances.\n  <br><br>*Example:*\n    ````\n    from secfsdstools.e_collector.multireportcollecting import MultiReportCollector\n    apple_10k_2022_adsh = \"0000320193-22-000108\"\n    apple_10k_2012_adsh = \"0001193125-12-444068\"\n\n    if __name__ == '__main__':\n        # load only the assets tags that are present in the 10-K report of apple in the years\n        # 2022 and 2012\n        collector: MultiReportCollector = \\\n            MultiReportCollector.get_reports_by_adshs(adshs=[apple_10k_2022_adsh,\n                                                             apple_10k_2012_adsh],\n                                                      tag_filter=['Assets'])\n        rawdatabag = collector.collect()\n        # as expected, there are just two entries in the submission dataframe\n        print(rawdatabag.sub_df)\n        print(rawdatabag.num_df)  \n    ```` \n  <br>*Output*:\n    ````\n                       adsh     cik       name     sic countryba stprba     cityba  ...          \n    0  0000320193-22-000108  320193  APPLE INC  3571.0        US     CA  CUPERTINO  ...\n    1  0001193125-12-444068  320193  APPLE INC  3571.0        US     CA  CUPERTINO  ...\n    \n                       adsh     tag       version coreg     ddate  qtrs  uom         value footnote\n    0  0000320193-22-000108  Assets  us-gaap/2022        20210930     0  USD  3.510020e+11     None\n    1  0000320193-22-000108  Assets  us-gaap/2022        20220930     0  USD  3.527550e+11     None\n    2  0001193125-12-444068  Assets  us-gaap/2012        20110930     0  USD  1.163710e+11     None\n    3  0001193125-12-444068  Assets  us-gaap/2012        20120930     0  USD  1.760640e+11     None  \n    ````\n    <br>\n* `ZipCollector` <br> This `Collector` collects the data of one or more zip (resp. the folders that contain the parquet\n  files of this zip files). And since every of the original zip files contains the data for one quarter, the names you provide\n  in the `get_zip_by_name` or `get_zip_by_names` factory methods reflect the quarter which data you want to load: \n  e.g. `2022q1.zip`.\n \n  <br><br>*Example:*\n    ````\n    from secfsdstools.e_collector.zipcollecting import ZipCollector\n\n    # only collect the Balance Sheet of annual reports that\n    # were filed during the first quarter in 2022\n    if __name__ == '__main__':\n        collector: ZipCollector = ZipCollector.get_zip_by_name(name=\"2022q1.zip\",\n                                                               forms_filter=[\"10-K\"],\n                                                               stmt_filter=[\"BS\"])\n    \n        rawdatabag = collector.collect()\n    \n        # only show the size of the data frame\n        # .. over 4000 companies filed a 10 K report in q1 2022\n        print(rawdatabag.sub_df.shape)\n        print(rawdatabag.pre_df.shape)\n        print(rawdatabag.num_df.shape)    \n    ```` \n  <br>*Output*:\n    ````\n    (4875, 36)\n    (232863, 10)\n    (2404949, 9)\n    ````\n\n* `CompanyReportCollector` <br> This class returns reports for one or more companies. The factory method \n  `get_company_collector` provides the parameter `ciks` which takes a list of cik numbers.\n  <br><br>*Example:*\n    ````\n    from secfsdstools.e_collector.companycollecting import CompanyReportCollector\n    \n    if __name__ == '__main__':\n        apple_cik = 320193\n        collector = CompanyReportCollector.get_company_collector(ciks=[apple_cik],\n                                                                 forms_filter=[\"10-K\"])\n    \n        rawdatabag = collector.collect()\n    \n        # all filed 10-K reports for apple since 2010 are in the databag\n        print(rawdatabag.sub_df)\n    \n        print(rawdatabag.pre_df.shape)\n        print(rawdatabag.num_df.shape)    \n    ```` \n  <br>*Output*:\n    ````\n                        adsh     cik       name     sic ...\n    0   0000320193-22-000108  320193  APPLE INC  3571.0 ...\n    1   0000320193-21-000105  320193  APPLE INC  3571.0 ...\n    2   0000320193-20-000096  320193  APPLE INC  3571.0 ...\n    3   0000320193-19-000119  320193  APPLE INC  3571.0 ...\n    4   0000320193-18-000145  320193  APPLE INC  3571.0 ...\n    5   0000320193-17-000070  320193  APPLE INC  3571.0 ...\n    6   0001628280-16-020309  320193  APPLE INC  3571.0 ...\n    7   0001193125-15-356351  320193  APPLE INC  3571.0 ...\n    8   0001193125-14-383437  320193  APPLE INC  3571.0 ...\n    9   0001193125-13-416534  320193  APPLE INC  3571.0 ...\n    10  0001193125-12-444068  320193  APPLE INC  3571.0 ...\n    11  0001193125-11-282113  320193  APPLE INC  3571.0 ...\n    12  0001193125-10-238044  320193  APPLE INC  3571.0 ...\n    13  0001193125-09-214859  320193  APPLE INC  3571.0 ...\n    (2246, 10)\n    (7925, 9)\n    Process finished with exit code 0  \n    ````\n\nHave a look at the [collector_deep_dive notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/04_collector_deep_dive.ipynb).\n\n\n\n## Raw Processing: working with the raw data\nWhen the `collect` method of a `Collector` class is called, the data for the sub, pre, and num dataframes are loaded\nand being stored in the sub_df, pre_df, and num_df attributes inside an instance of `RawDataBag`.\n\nThe `RawDataBag` provides the following methods:\n* `save`, `load`<br> The content of a `RawDataBag` can be saved into a directory. Within that directory, \n   parquet files are stored for the content of the sub_df, pre_df, and num_df. In order to load this \n   data directly, the static method `RawDataBag.load()` can be used.\n* `concat`<br> Several instances of a `RawDataBag` can be concatenated into one single instance. In order to do \n   that, the static method `RawDataBag.concat()` takes a list of RawDataBag as parameter.\n* `join` <br> This method produces a `JoinedRawDataBag` by joining the content of the pre_df and num_df\n   based on the columns adsh, tag, and version. It is an inner join. The joined dataframe appears as pre_num_df in\n   the `JoinedRawDataBag`.\n* `filter` <br> The filter method takes a parameter of the type `FilterRaw`, applies it to the data and\n   produces a new instance of `RawDataBag` with the filtered data. Therefore, filters can also be chained like\n   `a_filtered_RawDataBag = a_RawDataBag.filter(filter1).filter(filter2)`. Moreover, the `__get__item` method\n   is forwarded to the filter method, so you can also write `a_filtered_RawDataBag = a_RawDataBag[filter1][filter2]`.\n\nIt is simple to write your own filters, just get some inspiration from the once that are already present in the\nFramework (module `secfsdstools.e_filter.rawfiltering`:\n\n* `AdshRawFilter` <br> Filters the `RawDataBag` instance based on the list of adshs that were provided in the constructor. <br>\n   ````\n   a_filtered_RawDataBag = a_RawDataBag.filter(AdshRawFilter(adshs=['0001193125-09-214859', '0001193125-10-238044']))\n   ````\n* `StmtRawFilter` <br> Filters the `RawDataBag`instance based on the list of statements ('BS', 'CF', 'IS', ...). <br>\n   ````\n   a_filtered_RawDataBag = a_RawDataBag.filter(StmtRawFilter(stmts=['BS', 'CF']))\n   ````\n* `TagRawFilter` <br> Filters the `RawDataBag`instance based on the list of tags that is provided. <br>\n   ````\n   a_filtered_RawDataBag = a_RawDataBag.filter(TagRawFilter(tags=['Assets', 'Liabilities']))\n   ````\n* `MainCoregRawFilter` <br> Filters the `RawDataBag` so that data of subsidiaries are removed.\n   ````\n   a_filtered_RawDataBag = a_RawDataBag.filter(MainCoregRawFilter()) \n   ````\n* `ReportPeriodAndPreviousPeriodRawFilter` <br> The data of a report usually also contains data from previous years.\n  However, often you want just to analyze the data of the current and the previous year. This filter ensures that\n  only data for the current period and the previous period are contained in the data.\n   ````\n   a_filtered_RawDataBag = a_RawDataBag.filter(ReportPeriodAndPreviousPeriodRawFilter()) \n   ````\n* `ReportPeriodRawFilter` <br> If you are just interested in the data of a report that is from the current period\n  of the report then you can use this filter. For instance, if you use a `CompanyReportCollector` to collect all\n  10-K reports of this company, you want to ensure that every report only contains data for its own period and not for\n  previous periods.\n   ````\n   a_filtered_RawDataBag = a_RawDataBag.filter(ReportPeriodRawFilter()) \n   ````\n* `OfficialTagsOnlyRawFilter` <br> Sometimes company provide their own tags, which are not defined by the us-gaap XBRL\n  definition. In such cases, the version columns contains the value of the adsh instead of something like us-gab/2022.\n  This filter removes unofficial tags.\n   ````\n   a_filtered_RawDataBag = a_RawDataBag.filter(OfficialTagsOnlyRawFilter()) \n   ````  \n* `USDOnlyRawFilter` <br> Reports often also contain datapoints in other currency than USD. So it might happen that\n  the same datapoint in a balance sheet is present in different currencies. If you are just interested in the USD\n  currency, then we can use this filter.\n   ````\n   a_filtered_RawDataBag = a_RawDataBag.filter(USDOnlyRawFilter()) \n   ````  \n\nHave a look at the [filter_deep_dive notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/05_filter_deep_dive.ipynb).\n\n\n\n## Joined Processing: working with joined data\nWhen the `join` method of a `RawDataBag` instance is called an instance of `JoinedDataBag` is returned. The returned\ninstance contains an attribute sub_df, which is a reference to the same sub_df that is in the `RawDataBag`.\nIn addition to that, the `JoinedDataBag` contains an attribut pre_num_df, which is an inner join of the pre_df and \nthe num_df based on the columns adsh, tag, and version. Note that an entry in the pre_df can be joined with more than \none entry in the num_df.\n\nThe `JoinedDataBag` provides the following methods:\n* `save`, `load`<br> The content of a `JoinedDataBag` can be saved into a directory. Within that directory,\n  parquet files are stored for the content of the sub_df, pre_df, and num_df. In order to load this\n  data directly, the static method `JoinedDataBag.save()` can be used.\n* `concat`<br> Several instances of a `JoinedDataBag` can be concatenated in one single instance. In order to do\n  that, the static method `JoinedDataBag.concat()` takes a list of RawDataBag as parameter.\n* `filter` <br> The filter method takes a parameter of the type `FilterJoined`, applies it to the data and\n  produces a new instance of `JoinedDataBag` with the filtered data. Therefore, filters can also be chained like\n  `a_filtered_JoinedDataBag = a_JoinedDataBag.filter(filter1).filter(filter2)`. Moreover, the `__get__item` method\n  is forwarded to the filter method, so you can also write `a_filtered_JoinedDataBag = a_JoinedDataBag[filter1][filter2]`.\n  **Note**: The same filters that are present for the `RawDataBag` are also available for the `JoinedDataBag`. Just\n  look into the module `secfsdstools.e_filter.joinedfiltering`\n* `present` <br> The idea of the present method is to make a final presentation of the data as pandas dataframe. \n  The method has a parameter presenter of type Presenter.\n\n## Present\nIt is simple to write your own presenter classes. So far, the framework provides the following Presenter \nimplementations (module `secfsdstools.e_presenter.presenting`):\n\n* `StandardStatementPresenter` <br> This presenter provides the data in the same form, as you are used to see in\n  the reports itself. For instance, the primary financial statements balance sheet, income statement, and cash flow\n  display the different positions in rows and the columns contain the different dates/periods of the data.\n  Let us say you want to recreate the BS information of the apples 10-K report of 2022, you would write:\n  ````\n  apple_10k_2022_adsh = \"0000320193-22-000108\"\n\n  collector: SingleReportCollector = SingleReportCollector.get_report_by_adsh(\n        adsh=apple_10k_2022_adsh,\n        stmt_filter=[\"BS\"]\n  )\n  rawdatabag = collector.collect()\n  bs_df = rawdatabag.filter(ReportPeriodAndPreviousPeriodRawFilter())\n                    .join()\n                    .present(StandardStatementPresenter())\n  print(bs_df) \n  ````\n  <br>*Output*:\n  ````  \n                        adsh coreg                                              tag       version stmt  report  line     uom  negating  inpth  qrtrs_0/20220930  qrtrs_0/20210930\n   0   0000320193-22-000108                  CashAndCashEquivalentsAtCarryingValue  us-gaap/2022   BS       5     3     USD         0      0        2.364600e+10        3.494000e+10\n   1   0000320193-22-000108                            MarketableSecuritiesCurrent  us-gaap/2022   BS       5     4     USD         0      0        2.465800e+10        2.769900e+10\n   2   0000320193-22-000108                           AccountsReceivableNetCurrent  us-gaap/2022   BS       5     5     USD         0      0        2.818400e+10        2.627800e+10\n   3   0000320193-22-000108                                           InventoryNet  us-gaap/2022   BS       5     6     USD         0      0        4.946000e+09        6.580000e+09\n   4   0000320193-22-000108                             NontradeReceivablesCurrent  us-gaap/2022   BS       5     7     USD         0      0        3.274800e+10        2.522800e+10\n   5   0000320193-22-000108                                     OtherAssetsCurrent  us-gaap/2022   BS       5     8     USD         0      0        2.122300e+10        1.411100e+10\n   6   0000320193-22-000108                                          AssetsCurrent  us-gaap/2022   BS       5     9     USD         0      0        1.354050e+11        1.348360e+11\n   7   0000320193-22-000108                         MarketableSecuritiesNoncurrent  us-gaap/2022   BS       5    11     USD         0      0        1.208050e+11        1.278770e+11\n   8   0000320193-22-000108                           PropertyPlantAndEquipmentNet  us-gaap/2022   BS       5    12     USD         0      0        4.211700e+10        3.944000e+10\n   9   0000320193-22-000108                                  OtherAssetsNoncurrent  us-gaap/2022   BS       5    13     USD         0      0        5.442800e+10        4.884900e+10\n   10  0000320193-22-000108                                       AssetsNoncurrent  us-gaap/2022   BS       5    14     USD         0      0        2.173500e+11        2.161660e+11\n   11  0000320193-22-000108                                                 Assets  us-gaap/2022   BS       5    15     USD         0      0        3.527550e+11        3.510020e+11\n   12  0000320193-22-000108                                 AccountsPayableCurrent  us-gaap/2022   BS       5    18     USD         0      0        6.411500e+10        5.476300e+10\n   13  0000320193-22-000108                                OtherLiabilitiesCurrent  us-gaap/2022   BS       5    19     USD         0      0        6.084500e+10        4.749300e+10\n   14  0000320193-22-000108                   ContractWithCustomerLiabilityCurrent  us-gaap/2022   BS       5    20     USD         0      0        7.912000e+09        7.612000e+09\n   15  0000320193-22-000108                                        CommercialPaper  us-gaap/2022   BS       5    21     USD         0      0        9.982000e+09        6.000000e+09\n   16  0000320193-22-000108                                    LongTermDebtCurrent  us-gaap/2022   BS       5    22     USD         0      0        1.112800e+10        9.613000e+09\n   17  0000320193-22-000108                                     LiabilitiesCurrent  us-gaap/2022   BS       5    23     USD         0      0        1.539820e+11        1.254810e+11\n   18  0000320193-22-000108                                 LongTermDebtNoncurrent  us-gaap/2022   BS       5    25     USD         0      0        9.895900e+10        1.091060e+11\n   19  0000320193-22-000108                             OtherLiabilitiesNoncurrent  us-gaap/2022   BS       5    26     USD         0      0        4.914200e+10        5.332500e+10\n   20  0000320193-22-000108                                  LiabilitiesNoncurrent  us-gaap/2022   BS       5    27     USD         0      0        1.481010e+11        1.624310e+11\n   21  0000320193-22-000108                                            Liabilities  us-gaap/2022   BS       5    28     USD         0      0        3.020830e+11        2.879120e+11\n   22  0000320193-22-000108           CommonStocksIncludingAdditionalPaidInCapital  us-gaap/2022   BS       5    31     USD         0      0        6.484900e+10        5.736500e+10\n   23  0000320193-22-000108                     RetainedEarningsAccumulatedDeficit  us-gaap/2022   BS       5    32     USD         0      0       -3.068000e+09        5.562000e+09\n   24  0000320193-22-000108        AccumulatedOtherComprehensiveIncomeLossNetOfTax  us-gaap/2022   BS       5    33     USD         0      0       -1.110900e+10        1.630000e+08\n   25  0000320193-22-000108                                     StockholdersEquity  us-gaap/2022   BS       5    34     USD         0      0        5.067200e+10        6.309000e+10\n   26  0000320193-22-000108                       LiabilitiesAndStockholdersEquity  us-gaap/2022   BS       5    35     USD         0      0        3.527550e+11        3.510020e+11\n   27  0000320193-22-000108                    CommonStockParOrStatedValuePerShare  us-gaap/2022   BS       6     1     USD         0      1        0.000000e+00        0.000000e+00\n   28  0000320193-22-000108                            CommonStockSharesAuthorized  us-gaap/2022   BS       6     2  shares         0      1        5.040000e+10        5.040000e+10\n   29  0000320193-22-000108                                CommonStockSharesIssued  us-gaap/2022   BS       6     3  shares         0      1        1.594342e+10        1.642679e+10\n   30  0000320193-22-000108                           CommonStockSharesOutstanding  us-gaap/2022   BS       6     4  shares         0      1        1.594342e+10        1.642679e+10  \n  ````  \n  If you compare this with the real report at https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm\n  you will notice, that order of the tags and the values are the same.\n\n\n* `Standardizer` <br>\n  Even if xbrl is a standard on how to tag positions and numbers in financial statements, that doesn't mean that financial\n  statements can then be compared easily. For instance, there are over 3000 tags which can be used in a balance sheet.\n  Moreover, some tags can mean similar things or can be grouped behind a \"parent\" tag, which itself might not be present.\n  For instance, \"AccountsNoncurrent\" is often not shown in statements. So you would find the position for \"Accounts\"\n  and \"AccountsCurrent\", but not for \"AccountsNoncurrent\". Instead, only child tags for \"AccountsNoncurrent\" might be\n  present.<br><br>\n  The standardizer helps to solve these problems by unifying the information of financial statements.<br> <br>\n  With the standardized financial statements, you can then actually compare the statements between different\n  companies or different years, and you can use the dataset for ML. <br><br>\n  Have a look at [standardizer_basics](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_00_standardizer_basics.ipynb) which explains it in more details.<br><br>\n\n  * `BalanceSheetStandardizer` <br>\n  The `BalanceSheetStandardizer` collects and/or calculates the following positions of balance sheets:  \n\n    ````\n    - Assets\n      - AssetsCurrent\n        - Cash\n      - AssetsNoncurrent\n    - Liabilities\n      - LiabilitiesCurrent\n      - LiabilitiesNoncurrent\n    - Equity\n      - HolderEquity (mainly StockholderEquity or PartnerCapital)\n        - RetainedEarnings\n        - AdditionalPaidInCapital\n        - TreasuryStockValue\n      - TemporaryEquity\n      - RedeemableEquity\n    - LiabilitiesAndEquity\n    ````\n\n    With just a few lines of code, you'll get a comparable dataset with the main positions of a balance sheet for Microsoft, Alphabet, and Amazon:\n    (see the [stanardize the balance sheets and make them comparable notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_01_BS_standardizer.ipynb) for details)\n     ````python\n     from secfsdstools.e_collector.companycollecting import CompanyReportCollector\n     from secfsdstools.e_filter.rawfiltering import ReportPeriodRawFilter, MainCoregRawFilter, OfficialTagsOnlyRawFilter, USDOnlyRawFilter\n     from secfsdstools.f_standardize.bs_standardize import BalanceSheetStandardizer\n   \n     bag = CompanyReportCollector.get_company_collector(ciks=[789019, 1652044,1018724]).collect() #Microsoft, Alphabet, Amazon\n     filtered_bag = bag[ReportPeriodRawFilter()][MainCoregRawFilter()][OfficialTagsOnlyRawFilter()][USDOnlyRawFilter()]\n     joined_bag = filtered_bag.join()\n   \n     standardizer = BalanceSheetStandardizer()\n   \n     standardized_bs_df = joined_bag.present(standardizer)\n   \n     import matplotlib.pyplot as plt\n     # Group by 'name' and plot equity for each group\n     # Note: using the `present` method ensured that the same cik has always the same name even if the company name did change in the past\n     for name, group in standardized_bs_df.groupby('name'):\n       plt.plot(group['date'], group['Equity'], label=name, linestyle='-')\n   \n     # Add labels and title\n     plt.xlabel('Date')\n     plt.ylabel('Equity')\n     plt.title('Equity Over Time for Different Companies (CIKs)')\n   \n     # Display legend\n     plt.legend()\n     ````\n     ![Equity Compare](https://github.com/HansjoergW/sec-fincancial-statement-data-set/raw/main/docs/images/equity_compare.png)\n\n  * `IncomeStatementStandardizer` <br>\n  The `IncomeStatementStandardizer` collects and/or calculates the following positions of balance sheets:\n    \n    ````  \n      Revenues\n      - CostOfRevenue\n      ---------------\n      = GrossProfit\n      - OperatingExpenses\n      -------------------\n      = OperatingIncomeLoss\n        \n      IncomeLossFromContinuingOperationsBeforeIncomeTaxExpenseBenefit\n      - AllIncomeTaxExpenseBenefit\n      ----------------------------\n      = IncomeLossFromContinuingOperations\n      + IncomeLossFromDiscontinuedOperationsNetOfTax\n      -----------------------------------------------\n      = ProfitLoss\n      - NetIncomeLossAttributableToNoncontrollingInterest\n      ---------------------------------------------------\n      = NetIncomeLoss\n    \n      OustandingShares\n      EarningsPerShare\n    ````\n  \n    With just a few lines of code, you'll get a comparable dataset with the main positions of an income statement for Microsoft, Alphabet, and Amazon:\n  (see the [standardize the income statement and make them comparable notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_02_IS_standardizer.ipynb) for details)\n   \n    ````python\n    from secfsdstools.e_collector.companycollecting import CompanyReportCollector\n    from secfsdstools.e_filter.rawfiltering import ReportPeriodRawFilter, MainCoregRawFilter, OfficialTagsOnlyRawFilter, USDOnlyRawFilter\n    from secfsdstools.f_standardize.is_standardize import IncomeStatementStandardizer\n      \n    bag = CompanyReportCollector.get_company_collector(ciks=[789019, 1652044,1018724]).collect() #Microsoft, Alphabet, Amazon\n    filtered_bag = bag[ReportPeriodRawFilter()][MainCoregRawFilter()][OfficialTagsOnlyRawFilter()][USDOnlyRawFilter()]\n    joined_bag = filtered_bag.join()\n      \n    standardizer = IncomeStatementStandardizer()\n      \n    standardized_is_df = joined_bag.present(standardizer)\n    # just use the yearly reports with data for the whole year\n    standardized_is_df = standardized_is_df[(standardized_is_df.fp==\"FY\") & (standardized_is_df.qtrs==4)].copy()\n      \n    import matplotlib.pyplot as plt\n    # Group by 'name' and plot equity for each group\n    # Note: using the `present` method ensured that the same cik has always the same name even if the company name did change in the past\n    for name, group in standardized_is_df.groupby('name'):\n      plt.plot(group['date'], group['GrossProfit'], label=name, linestyle='-')\n      \n    # Add labels and title\n    plt.xlabel('Date')\n    plt.ylabel('GrossProfit')\n    plt.title('GrossProfit Over Time for Different Companies (CIKs)')\n      \n    # Display legend\n    plt.legend()\n     ````\n\n  ![GrossProfit Compare](https://github.com/HansjoergW/sec-fincancial-statement-data-set/raw/main/docs/images/grossprofit_compare.png)\n\n  * `CashFlowStandardizer` <br>\n   The `CashFlowStandardizer` collects and/or calculates the following positions of cash flow statements:\n     \n    ````\n     NetCashProvidedByUsedInOperatingActivities\n       CashProvidedByUsedInOperatingActivitiesDiscontinuedOperations\n       NetCashProvidedByUsedInOperatingActivitiesContinuingOperations\n           DepreciationDepletionAndAmortization\n           DeferredIncomeTaxExpenseBenefit\n           ShareBasedCompensation\n           IncreaseDecreaseInAccountsPayable\n           IncreaseDecreaseInAccruedLiabilities\n           InterestPaidNet\n           IncomeTaxesPaidNet\n    \n     NetCashProvidedByUsedInInvestingActivities\n         CashProvidedByUsedInInvestingActivitiesDiscontinuedOperations\n         NetCashProvidedByUsedInInvestingActivitiesContinuingOperations\n           PaymentsToAcquirePropertyPlantAndEquipment\n           ProceedsFromSaleOfPropertyPlantAndEquipment\n           PaymentsToAcquireInvestments\n           ProceedsFromSaleOfInvestments\n           PaymentsToAcquireBusinessesNetOfCashAcquired\n           ProceedsFromDivestitureOfBusinessesNetOfCashDivested\n           PaymentsToAcquireIntangibleAssets\n           ProceedsFromSaleOfIntangibleAssets\n    \n     NetCashProvidedByUsedInFinancingActivities\n         CashProvidedByUsedInFinancingActivitiesDiscontinuedOperations\n         NetCashProvidedByUsedInFinancingActivitiesContinuingOperations\n           ProceedsFromIssuanceOfCommonStock\n           ProceedsFromStockOptionsExercised\n           PaymentsForRepurchaseOfCommonStock\n           ProceedsFromIssuanceOfDebt\n           RepaymentsOfDebt\n           PaymentsOfDividends\n    \n    \n     EffectOfExchangeRateFinal\n     CashPeriodIncreaseDecreaseIncludingExRateEffectFinal\n    \n     CashAndCashEquivalentsEndOfPeriod\n    ````\n\n     With just a few lines of code, you'll get a comparable dataset with the main positions of an cash flow statement for Microsoft, Alphabet, and Amazon:\n(see the [standardize the cash flow statements and make them comparable](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_03_CF_standardizer.ipynb) for details)\n    ````python\n    from secfsdstools.e_collector.companycollecting import CompanyReportCollector\n    from secfsdstools.e_filter.rawfiltering import ReportPeriodRawFilter, MainCoregRawFilter, OfficialTagsOnlyRawFilter, USDOnlyRawFilter\n    from secfsdstools.f_standardize.cf_standardize import CashFlowStandardizer\n    \n    bag = CompanyReportCollector.get_company_collector(ciks=[789019, 1652044,1018724]).collect() #Microsoft, Alphabet, Amazon\n    filtered_bag = bag[ReportPeriodRawFilter()][MainCoregRawFilter()][OfficialTagsOnlyRawFilter()][USDOnlyRawFilter()]\n    joined_bag = filtered_bag.join()\n    \n    standardizer = CashFlowStandardizer()\n    \n    standardized_cf_df = joined_bag.present(standardizer)\n    standardized_cf_df = standardized_cf_df[(standardized_cf_df.fp==\"FY\") & (standardized_cf_df.qtrs==4)].copy()\n    \n    import matplotlib.pyplot as plt\n    # Group by 'name' and plot NetCashProvidedByUsedInOperatingActivities for each group\n    # Note: using the `present` method ensured that the same cik has always the same name even if the company name did change in the past\n    for name, group in standardized_cf_df.groupby('name'):\n        plt.plot(group['date'], group['NetCashProvidedByUsedInOperatingActivities'], label=name, linestyle='-')\n    \n    # Add labels and title\n    plt.xlabel('Date')\n    plt.ylabel('NetCashProvidedByUsedInOperatingActivities')\n    plt.title('NetCashProvidedByUsedInOperatingActivities Over Time for Different Companies (CIKs)')\n    \n    # Display legend\n    plt.legend()\n    ````\n  ![NetCashOperating Compare](https://github.com/HansjoergW/sec-fincancial-statement-data-set/raw/main/docs/images/netcashoperating_compare.png)\n\n# Links \n* [For a detail description of the content and the structure of the dataset](https://www.sec.gov/files/aqfs.pdf)\n* [Release Notes](https://hansjoergw.github.io/sec-fincancial-statement-data-set/releasenotes/)\n* [Documentation](https://hansjoergw.github.io/sec-fincancial-statement-data-set/)\n* [QuickStart Jupyter Notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/01_quickstart.ipynb)\n* [Connect to the daily-sec-financial-statement-dataset Notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/02_connect_rapidapi.ipynb)\n* [Explore the data with an interactive Notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/03_explore_with_interactive_notebook.ipynb)\n* [collector_deep_dive Notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/04_collector_deep_dive.ipynb)\n* [filter_deep_dive Notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/05_filter_deep_dive.ipynb).\n* [bulk_data_processing_deep_dive Notebook](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/06_bulk_data_processing_deep_dive.ipynb)\n* [standardizer_basics](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_00_standardizer_basics.ipynb)\n* [BS_standardizer](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_01_BS_standardizer.ipynb)\n* [IS_standardizer](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_02_IS_standardizer.ipynb)\n* [CF_standardizer](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_03_CF_standardizer.ipynb)\n* [checkout the `u_usecases` package](https://hansjoergw.github.io/sec-fincancial-statement-data-set/doc_latest/api/secfsdstools/u_usecases/index.html)\n* [standardize the balance sheets and make them comparable](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_01_BS_standardizer.ipynb)\n* [standardize the income statements and make them comparable](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_02_IS_standardizer.ipynb)\n* [standardize the cash flow statements and make them comparable](https://nbviewer.org/github/HansjoergW/sec-fincancial-statement-data-set/blob/main/notebooks/07_03_CF_standardizer.ipynb)\n* [Trouble hssting and known issues](KNOWN_ISSUES.md)\n* [Changelog](CHANGELOG.md)\n",
    "bugtrack_url": null,
    "license": null,
    "summary": "A few python tools to analyze the SEC.gov financial statements data sets (https://www.sec.gov/dera/data/financial-statement-data-sets)",
    "version": "1.7.0",
    "project_urls": {
        "Bug Tracker": "https://github.com/HansjoergW/sec-fincancial-statement-data-set/issues",
        "Github": "https://github.com/HansjoergW/sec-fincancial-statement-data-set",
        "Homepage": "https://hansjoergw.github.io/sec-fincancial-statement-data-set/"
    },
    "split_keywords": [
        "10-k",
        " 10-q",
        " 8-k",
        " cik",
        " data processing",
        " edgar",
        " finance",
        " financial analysis",
        " financial statements",
        " financial statements dataset",
        " sec edgar",
        " sec filing",
        " sec.gov"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "6a45627ad8ff68e71859aa01a1162319b1d1701275d19041e13c0db11fad4d16",
                "md5": "fe759c195385ae8a496b97ce86480364",
                "sha256": "e932c74dbbfe17d3a268b0732e41ac611ee5ca289878460c48440adc76d63286"
            },
            "downloads": -1,
            "filename": "secfsdstools-1.7.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "fe759c195385ae8a496b97ce86480364",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 114187,
            "upload_time": "2024-12-22T06:35:42",
            "upload_time_iso_8601": "2024-12-22T06:35:42.056598Z",
            "url": "https://files.pythonhosted.org/packages/6a/45/627ad8ff68e71859aa01a1162319b1d1701275d19041e13c0db11fad4d16/secfsdstools-1.7.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "955408e931914ccc5dbc0b2f72431ea4c88ba8fb0cc1c5b4c86bac434a8a351f",
                "md5": "86d4d8a5e0e6aea36d5dc89fb4f45d39",
                "sha256": "e56222a3854024340aaa5ec38c0e579a1ab4572950ec04c98f9e76c2d0beaa26"
            },
            "downloads": -1,
            "filename": "secfsdstools-1.7.0.tar.gz",
            "has_sig": false,
            "md5_digest": "86d4d8a5e0e6aea36d5dc89fb4f45d39",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 68190793,
            "upload_time": "2024-12-22T06:56:36",
            "upload_time_iso_8601": "2024-12-22T06:56:36.575164Z",
            "url": "https://files.pythonhosted.org/packages/95/54/08e931914ccc5dbc0b2f72431ea4c88ba8fb0cc1c5b4c86bac434a8a351f/secfsdstools-1.7.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2024-12-22 06:56:36",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "HansjoergW",
    "github_project": "sec-fincancial-statement-data-set",
    "travis_ci": false,
    "coveralls": false,
    "github_actions": true,
    "requirements": [
        {
            "name": "pytest",
            "specs": [
                [
                    "~=",
                    "7.2"
                ]
            ]
        },
        {
            "name": "pylint",
            "specs": [
                [
                    "~=",
                    "2.13"
                ]
            ]
        },
        {
            "name": "pandas",
            "specs": [
                [
                    "~=",
                    "1.3"
                ]
            ]
        },
        {
            "name": "requests",
            "specs": [
                [
                    "~=",
                    "2.28"
                ]
            ]
        },
        {
            "name": "pathos",
            "specs": [
                [
                    "~=",
                    "0.3"
                ]
            ]
        },
        {
            "name": "pyarrow",
            "specs": [
                [
                    "~=",
                    "11.0"
                ]
            ]
        },
        {
            "name": "fastparquet",
            "specs": [
                [
                    "~=",
                    "0.8"
                ]
            ]
        }
    ],
    "lcname": "secfsdstools"
}
        
Elapsed time: 0.51642s