itemseg


| Field | Value |
| --- | --- |
| Name | itemseg |
| Version | 1.6.0 |
| Summary | 10-K Report Item Segmentation with Line-based Attention (ISLA) |
| Upload time | 2023-12-28 05:10:45 |
| Requires Python | >=3.8 |
| Keywords | 10-k, item segmentation, sequence labeling |
| Requirements | No requirements were recorded. |

# itemseg

![](https://raw.githubusercontent.com/hsinmin/itemseg/main/ITEMSEG%20LOGO1%20SMALL.jpg)


10-K Item Segmentation with Line-based Attention (ISLA) is a tool to process
EDGAR 10-K reports and extract item-specific text. 


[![PyPI - Version](https://img.shields.io/pypi/v/itemseg.svg)](https://pypi.org/project/itemseg)
[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/itemseg.svg)](https://pypi.org/project/itemseg)

-----

**Table of Contents**

- [Installation](#installation)
- [License](#license)

## Installation

```console
pip3 install itemseg
```

### Download resource file
```console
python3 -m itemseg --get_resource
```

### Download nltk data

Launch a python3 console and run:
```console
>>> import nltk
>>> nltk.download('punkt')
```

### Segment items in a 10-K file
Using Apple 10-K (2023) as an example:
```console
python3 -m itemseg --input https://www.sec.gov/Archives/edgar/data/320193/000032019323000106/0000320193-23-000106.txt
```

See the results in ./segout01/

The `*.csv` file contains line-by-line predictions for items as Begin-Inside-Outside (BIO) style tags. The other files contain item-specific text. Change the output file types via `--outfn_type`.
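
To illustrate how BIO tags map back to item boundaries, here is a minimal sketch that groups a per-line tag sequence into spans. The tag spelling (`B-item1`, `I-item1`, `O`) and the function name `bio_to_spans` are assumptions made for illustration; check the csv that itemseg actually writes for its real column names and tag values.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (label, start, end) spans; end is exclusive.

    Assumes tags look like "B-item1", "I-item1", or "O" (hypothetical
    spelling). An I- tag that does not continue an open span is ignored.
    """
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        # The open span continues only on an I- tag with the same label.
        continues = prefix == "I" and name == label
        if label is not None and not continues:
            spans.append((label, start, i))
            label, start = None, None
        if prefix == "B":
            label, start = name, i
    if label is not None:
        # Close a span that runs to the end of the file.
        spans.append((label, start, len(tags)))
    return spans
```

Each span gives the line range of one item, so the text of an item can be recovered by slicing the report's lines with `start` and `end`.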


### About 10-K files
A 10-K report is an annual report that publicly traded companies file with the U.S. Securities and Exchange Commission (SEC). It provides a comprehensive overview of the company's financial performance and is typically more detailed than the annual report sent to shareholders. Key items of a 10-K report include:

* Item 1 (Business): Describes the company's main operations, products, and services.
* Item 1A (Risk Factors): Outlines risks that could affect the company's business, financial condition, or operating results. 
* Item 3 (Legal Proceedings)
* Item 7 (Management’s Discussion and Analysis of Financial Condition and Results of Operations; MD&A): Offers management's perspective on the financial results, including discussion of liquidity, capital resources, and results of operations.

You can search and read 10-K reports through the [EDGAR web interface](https://www.sec.gov/edgar/search-and-access). The itemseg module takes the URL of the `Complete submission text file`, converts the HTML to a formatted text file, and segments the text file by item.

As an example, the Amazon 10-K report page for [fiscal year 2022](https://www.sec.gov/Archives/edgar/data/1018724/000101872423000004/0001018724-23-000004-index.htm) shows the link to the HTML 10-K report and a `Complete submission text file`, [0001018724-23-000004.txt](https://www.sec.gov/Archives/edgar/data/1018724/000101872423000004/0001018724-23-000004.txt). Pass this link (https://www.sec.gov/Archives/edgar/data/1018724/000101872423000004/0001018724-23-000004.txt) to the itemseg module, and it will retrieve the file and segment the items for you.

```console
python3 -m itemseg --input https://www.sec.gov/Archives/edgar/data/1018724/000101872423000004/0001018724-23-000004.txt
```
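
The two example URLs above follow the same pattern: the company's CIK, the accession number without dashes as a folder, and the dashed accession number as the file name. The sketch below builds such a URL under that assumption; `submission_url` is a hypothetical helper and not part of itemseg.

```python
import re


def submission_url(cik: int, accession: str) -> str:
    """Build the URL of a filing's complete submission text file.

    Assumes the layout seen in the examples above:
    .../edgar/data/<CIK>/<accession without dashes>/<accession>.txt
    """
    # Accession numbers look like 0001018724-23-000004.
    if not re.fullmatch(r"\d{10}-\d{2}-\d{6}", accession):
        raise ValueError(f"unexpected accession number: {accession!r}")
    folder = accession.replace("-", "")
    return (
        "https://www.sec.gov/Archives/edgar/data/"
        f"{int(cik)}/{folder}/{accession}.txt"
    )
```

For instance, `submission_url(1018724, "0001018724-23-000004")` reproduces the Amazon link used above.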

By default, itemseg writes the line-by-line tags (BIO style) to a csv file and Item 1, Item 1A, Item 3, and Item 7 to separate files (`--outfn_type "csv,item1,item1a,item3,item7"`). Use `--outfn_type` to change the combination of output files. For example, to output only Item 1A and Item 7, set `--outfn_type "item1a,item7"`.
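
When calling itemseg from a script, it can help to assemble the command line in one place. The following is a minimal sketch that uses only the `--input` and `--outfn_type` flags described above; `build_command` is a hypothetical helper, not something itemseg provides.

```python
from typing import List, Sequence


def build_command(url: str, outputs: Sequence[str] = ()) -> List[str]:
    """Return the argument list for one itemseg run.

    With no outputs given, --outfn_type is omitted so that itemseg
    falls back to its default output combination.
    """
    cmd = ["python3", "-m", "itemseg", "--input", url]
    if outputs:
        # The flag takes a comma-separated list such as "item1a,item7".
        cmd += ["--outfn_type", ",".join(outputs)]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a shell string avoids quoting problems with the URL.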

To process a large number of 10-K files, a good starting point is the master index (https://www.sec.gov/Archives/edgar/full-index/), which lists all available filings and offers a convenient way to build a comprehensive list of target files.
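
As a rough sketch of that workflow, the function below extracts 10-K links from the text of a quarterly `master.idx` file. It assumes the usual pipe-delimited layout (`CIK|Company Name|Form Type|Date Filed|Filename`) with a few header lines, and that the `Filename` column is a path relative to `https://www.sec.gov/Archives/`. The name `ten_k_urls` is hypothetical, and downloading the index file itself is left out.

```python
from typing import List

ARCHIVES = "https://www.sec.gov/Archives/"


def ten_k_urls(master_idx_text: str, form_type: str = "10-K") -> List[str]:
    """Return submission text file URLs for filings of the given form type."""
    urls = []
    for line in master_idx_text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Data rows have five fields; descriptive header lines do not.
        if len(parts) != 5:
            continue
        _cik, _company, form, _date_filed, filename = parts
        # The column-name row also has five fields, but its form type
        # field reads "Form Type", so it never matches.
        if form == form_type and filename.endswith(".txt"):
            urls.append(ARCHIVES + filename)
    return urls
```

Each returned URL can then be passed to `--input`. The exact match on the form type means that variants such as `10-K/A` are skipped unless requested explicitly.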

The module also installs a script that lets you run it via the `itemseg` command. On Ubuntu, the default location is `~/.local/bin`. Add this location to your `PATH` to enable the `itemseg` command.


## License

`itemseg` is distributed under the terms of the [CC BY-NC](https://creativecommons.org/licenses/by-nc/4.0/) license.

            

Raw data

{
    "_id": null,
    "home_page": "",
    "name": "itemseg",
    "maintainer": "",
    "docs_url": null,
    "requires_python": ">=3.8",
    "maintainer_email": "",
    "keywords": "10-K,Item Segmentation,Sequence Labeling",
    "author": "",
    "author_email": "\"Hsin-Min Lu; Huan-Hsun Yen; Yen-Hsiu Chen\" <luim@ntu.edu.tw>",
    "download_url": "https://files.pythonhosted.org/packages/95/0d/59631b5191e200f67673ea6092c8c89933ec704de5b9c6aa15b004308709/itemseg-1.6.0.tar.gz",
    "platform": null,
    "bugtrack_url": null,
    "license": "",
    "summary": "10-K Report Item Segmentation with Line-based Attention (ISLA)",
    "version": "1.6.0",
    "project_urls": {
        "Documentation": "https://github.com/hsinmin/isla#readme",
        "Issues": "https://github.com/hsinmin/itemseg/issues",
        "Source": "https://github.com/hsinmin/itemseg"
    },
    "split_keywords": [
        "10-k",
        "item segmentation",
        "sequence labeling"
    ],
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "59daac83889445ee20a0273f978571f321cf8ddc475712bf9f04ee285cade410",
                "md5": "f8347c647eaec65646cdaf2ebb5bef08",
                "sha256": "ffa4d2f231ecb12df61e2e761d19b9aff78387aae4d51bcce942706a4485a6d9"
            },
            "downloads": -1,
            "filename": "itemseg-1.6.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f8347c647eaec65646cdaf2ebb5bef08",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">=3.8",
            "size": 33427,
            "upload_time": "2023-12-28T05:10:43",
            "upload_time_iso_8601": "2023-12-28T05:10:43.524987Z",
            "url": "https://files.pythonhosted.org/packages/59/da/ac83889445ee20a0273f978571f321cf8ddc475712bf9f04ee285cade410/itemseg-1.6.0-py3-none-any.whl",
            "yanked": false,
            "yanked_reason": null
        },
        {
            "comment_text": "",
            "digests": {
                "blake2b_256": "950d59631b5191e200f67673ea6092c8c89933ec704de5b9c6aa15b004308709",
                "md5": "fb09a79396bffcfbc8fbd177f8eb2255",
                "sha256": "95db659e2030853677bb8570fe7bb5427f42447c21aee3f86c954f22c7519fcc"
            },
            "downloads": -1,
            "filename": "itemseg-1.6.0.tar.gz",
            "has_sig": false,
            "md5_digest": "fb09a79396bffcfbc8fbd177f8eb2255",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">=3.8",
            "size": 31575,
            "upload_time": "2023-12-28T05:10:45",
            "upload_time_iso_8601": "2023-12-28T05:10:45.540304Z",
            "url": "https://files.pythonhosted.org/packages/95/0d/59631b5191e200f67673ea6092c8c89933ec704de5b9c6aa15b004308709/itemseg-1.6.0.tar.gz",
            "yanked": false,
            "yanked_reason": null
        }
    ],
    "upload_time": "2023-12-28 05:10:45",
    "github": true,
    "gitlab": false,
    "bitbucket": false,
    "codeberg": false,
    "github_user": "hsinmin",
    "github_project": "isla#readme",
    "github_not_found": true,
    "lcname": "itemseg"
}
        