
Security News
ECMAScript 2025 Finalized with Iterator Helpers, Set Methods, RegExp.escape, and More
ECMAScript 2025 introduces Iterator Helpers, Set methods, JSON modules, and more in its latest spec update approved by Ecma in June 2025.
10-K Item Segmentation with Line-based Attention (ISLA) is a tool to process EDGAR 10-K reports and extract item-specific text.
Table of Contents
pip3 install itemseg
python3 -m itemseg --get_resource
Launch python3 console
>>> import nltk
>>> nltk.download('punkt')
Using Apple 10-K (2023) as an example:
python3 -m itemseg --input https://www.sec.gov/Archives/edgar/data/320193/000032019323000106/0000320193-23-000106.txt
See the results in ./segout01/
The *.csv file contain line-by-line prediction for items in a Begin-Inside-Outside (BIO) style tags. Other files contain item-sepcific text. Change output file types via --outfn_type
.
A 10-K report is an annual report filed by publicly traded companies with the U.S. Securities and Exchange Commission (SEC). It provides a comprehensive overview of the company's financial performance and is more detailed than an annual report. Key items of a 10-K report include:
You can search and read 10-K reports through the EDGAR web interface. The itemseg module takes the URL of the Complete submission text file
, convert the HTML to formated txt file, and segment the txt file by items.
As an example, the AMAZON 10-K report page for fiscal year 2022 shows the link to the HTML 10-K report and a Complete submission text file
0001018724-23-000004.txt. Pass this link (https://www.sec.gov/Archives/edgar/data/1018724/000101872423000004/0001018724-23-000004.txt) to the itemseg module, and it will retrive the file and segment items for you.
python3 -m itemseg --input https://www.sec.gov/Archives/edgar/data/1018724/000101872423000004/0001018724-23-000004.txt
The default setting is to output line-by-line tag (BIO style) in a csv file, together with Item 1, Item 1A, Item 3, and Item 7 in separate files (--outfn_type "csv,item1,item1a,item3,item7"). You can change output file type combination with --outfn_type. For example, if you only want to output Item 1A and Item 7, then set --outfn_type "item1a,item7".
If you are trying to process large amounts of 10-K files, a good starting point is the master index (https://www.sec.gov/Archives/edgar/full-index/), which lists all available files and provides a convenient venue to construct a comprehensive list of target files.
The module also comes with a script file that allow you to run the module via itemseg
command. The default location (for Ubuntu) is at ~/.local/bin. Add this location to your path to enable itemseg
command.
itemseg
is distributed under the terms of the CC BY-NC license.
FAQs
10-K Report Item Segmentation with Line-based Attention (ISLA)
We found that itemseg demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.
Did you know?
Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.
Security News
ECMAScript 2025 introduces Iterator Helpers, Set methods, JSON modules, and more in its latest spec update approved by Ecma in June 2025.
Security News
A new Node.js homepage button linking to paid support for EOL versions has sparked a heated discussion among contributors and the wider community.
Research
North Korean threat actors linked to the Contagious Interview campaign return with 35 new malicious npm packages using a stealthy multi-stage malware loader.