Parse SEC EDGAR HTML documents into a tree of elements that correspond to the visual structure of the document.
The sec-parser project simplifies extracting meaningful information from SEC EDGAR HTML documents by organizing them into semantic elements and a tree structure. Semantic elements might include section titles, paragraphs, and tables, each classified for easier data manipulation. This forms a semantic tree that corresponds to the visual and informational structure of the document. If you're familiar with the Image Semantic Segmentation concept, it's the same but applied to HTML documents.
This tool is especially beneficial for Artificial Intelligence (AI), Machine Learning (ML), and Large Language Models (LLM) applications by streamlining data pre-processing and feature extraction.
sec-parser is versatile and can be applied in various scenarios, including but not limited to:
These use cases demonstrate the flexibility and power of sec-parser in handling both traditional data extraction tasks and facilitating more advanced AI-driven analysis.
[!IMPORTANT] This project, sec-parser, is an independent, open-source initiative and has no affiliation with, endorsement from, or verification by the United States Securities and Exchange Commission (SEC). It utilizes public APIs and data provided by the SEC solely for research, informational, and educational objectives. This tool is not intended for financial advisement or as a substitute for professional investment advice or compliance with securities regulations. The creators and maintainers make no warranties, expressed or implied, about the accuracy, completeness, or reliability of the data and analyses presented. Use this software at your own risk. For accurate and comprehensive financial analysis, consult with qualified financial professionals and comply with all relevant legal requirements. The project maintainers and contributors are not liable for any financial or legal consequences arising from the use of this tool.
This guide will walk you through the process of installing the sec-parser package and using it to extract the "Segment Operating Performance" section as a semantic tree from the latest Apple 10-Q filing.
[!TIP] To run the example code in a ready-to-code environment, you can open it in GitHub Codespaces and start experimenting with sec-parser.
First, install the sec-parser package using pip:
pip install sec-parser
To run the example code in this README, you'll also need the sec_downloader package:
pip install sec-downloader
Once you've installed the necessary packages, you can start by downloading the filing from the SEC EDGAR website. Here's how you can do it:
from sec_downloader import Downloader
# Initialize the downloader with your company name and email
dl = Downloader("MyCompanyName", "email@example.com")
# Download the latest 10-Q filing for Apple
html = dl.get_filing_html(ticker="AAPL", form="10-Q")
[!NOTE] The company name and email address are used to form a user-agent string that adheres to the SEC EDGAR's fair access policy for programmatic downloading. Source
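To make the note above more concrete, here is a minimal sketch of how a company name and a contact email could be combined into a single User-Agent value. It uses only the standard library. `build_user_agent` and `build_request` are hypothetical helper names, not part of sec-downloader, and the exact string that sec-downloader sends may differ. The request is prepared but never sent.

```python
from urllib.request import Request


def build_user_agent(company_name: str, email: str) -> str:
    """Combine a company name and a contact email into one User-Agent value."""
    company_name = company_name.strip()
    email = email.strip()
    if not company_name or "@" not in email:
        raise ValueError("a company name and a contact email are both required")
    return f"{company_name} {email}"


def build_request(url: str, company_name: str, email: str) -> Request:
    """Prepare (but do not send) a request that declares who is asking."""
    headers = {"User-Agent": build_user_agent(company_name, email)}
    return Request(url, headers=headers)


request = build_request(
    "https://www.sec.gov/",  # placeholder URL, for illustration only
    "MyCompanyName",
    "email@example.com",
)
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

In practice you would not build this yourself: passing the same two values to `Downloader(...)`, as shown above, is enough.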
[!TIP] Read sec-downloader documentation (and examples) for more advanced usage (such as downloading three latest Apple 10-Q filings instead of just one, or downloading based on a specific CIK or Filing ID (i.e. accession number)).
Now, we can parse the filing HTML into a list of semantic elements:
# Utility function to make the example code a bit more compact
def print_first_n_lines(text: str, *, n: int):
    print("\n".join(text.split("\n")[:n]), "...", sep="\n")
import sec_parser as sp
elements: list = sp.Edgar10QParser().parse(html)
demo_output: str = sp.render(elements)
print_first_n_lines(demo_output, n=7)
TopSectionTitle: PART I — FINANCIAL INFORMATION
TopSectionTitle: Item 1. Financial Statements
TitleElement: CONDENSED CONSOLIDATED STATEMENTS OF OPERATIONS (Unaudited)
SupplementaryText: (In millions, except number of ...housands and per share amounts)
TableElement: Table with 24 rows, 80 numbers, and 1058 characters.
SupplementaryText: See accompanying Notes to Conde...solidated Financial Statements.
TitleElement: CONDENSED CONSOLIDATED STATEMEN...OMPREHENSIVE INCOME (Unaudited)
...
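Each rendered line has the shape "ElementType: summary", so a quick way to get a feel for a filing is to count how often each element type appears. The sketch below works on the rendered text only, using the seven lines shown above as sample input. `count_element_types` is a hypothetical helper, not part of sec-parser.

```python
from collections import Counter

# First lines of the rendered output shown above (copied from this README).
rendered = """\
TopSectionTitle: PART I — FINANCIAL INFORMATION
TopSectionTitle: Item 1. Financial Statements
TitleElement: CONDENSED CONSOLIDATED STATEMENTS OF OPERATIONS (Unaudited)
SupplementaryText: (In millions, except number of ...housands and per share amounts)
TableElement: Table with 24 rows, 80 numbers, and 1058 characters.
SupplementaryText: See accompanying Notes to Conde...solidated Financial Statements.
TitleElement: CONDENSED CONSOLIDATED STATEMEN...OMPREHENSIVE INCOME (Unaudited)
..."""


def count_element_types(rendered_text: str) -> Counter:
    """Count lines per element type in text shaped like 'TypeName: summary'."""
    counts: Counter = Counter()
    for line in rendered_text.splitlines():
        type_name, separator, _summary = line.partition(": ")
        # Skip lines such as the trailing "..." that carry no type name.
        if separator and type_name.isidentifier():
            counts[type_name] += 1
    return counts


counts = count_element_types(rendered)
for type_name, count in counts.most_common():
    print(f"{type_name}: {count}")
```

When you have the parsed `elements` list at hand, counting `type(element).__name__` directly is likely simpler than parsing the rendered text.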
[!TIP] FAQ: How do I get the text of each element (or all of the document)? How do I get all of the text in a specific section? Use the element.text field. Check out this notebook for a full example.
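As a rough sketch of the first half of that FAQ, the code below joins the `text` field of every element into one document string and picks out the text of one element type. To stay runnable without a downloaded filing, it uses hypothetical stand-in classes (`StandInElement` and its subclasses) that expose only a `text` field; with real data you would pass the list returned by `sp.Edgar10QParser().parse(html)` instead.

```python
from dataclasses import dataclass


@dataclass
class StandInElement:
    """Hypothetical stand-in exposing only the `text` field mentioned above."""

    text: str


class StandInTitle(StandInElement):
    """Stand-in for a title-like element."""


class StandInParagraph(StandInElement):
    """Stand-in for a paragraph-like element."""


def document_text(elements) -> str:
    """Join the text of all non-empty elements, one element per line."""
    return "\n".join(e.text for e in elements if e.text.strip())


def text_of_type(elements, element_type) -> list[str]:
    """Return the text of every element that is an instance of element_type."""
    return [e.text for e in elements if isinstance(e, element_type)]


elements = [
    StandInTitle("Item 1. Financial Statements"),
    StandInParagraph("(In millions, except per share amounts)"),
    StandInParagraph("   "),  # whitespace-only element, skipped when joining
    StandInTitle("Item 2. Management's Discussion and Analysis"),
]

print(document_text(elements))
print(text_of_type(elements, StandInTitle))
```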
We can also construct a semantic tree to allow for easy filtering by parent sections:
tree = sp.TreeBuilder().build(elements)
demo_output: str = sp.render(tree)
print_first_n_lines(demo_output, n=7)
TopSectionTitle: PART I — FINANCIAL INFORMATION
├── TopSectionTitle: Item 1. Financial Statements
│   ├── TitleElement: CONDENSED CONSOLIDATED STATEMENTS OF OPERATIONS (Unaudited)
│   │   ├── SupplementaryText: (In millions, except number of ...housands and per share amounts)
│   │   ├── TableElement: Table with 24 rows, 80 numbers, and 1058 characters.
│   │   ├── SupplementaryText: See accompanying Notes to Conde...solidated Financial Statements.
│   ├── TitleElement: CONDENSED CONSOLIDATED STATEMEN...OMPREHENSIVE INCOME (Unaudited)
...
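Filtering by parent section comes down to finding a node and then walking everything beneath it. The sketch below shows that idea on a hypothetical `StandInNode` class with `text` and `children` attributes. These names are assumptions made for illustration, so check the sec-parser documentation for the attributes the real tree nodes expose.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class StandInNode:
    """Hypothetical tree node: some text plus a list of child nodes."""

    text: str
    children: list[StandInNode] = field(default_factory=list)


def walk(node: StandInNode) -> Iterator[StandInNode]:
    """Yield the node itself, then all of its descendants, depth-first."""
    yield node
    for child in node.children:
        yield from walk(child)


def find_section(roots: list[StandInNode], title_prefix: str) -> Optional[StandInNode]:
    """Return the first node whose text starts with title_prefix, if any."""
    for root in roots:
        for node in walk(root):
            if node.text.startswith(title_prefix):
                return node
    return None


def section_text(section: StandInNode) -> list[str]:
    """Collect the text of everything under a section, excluding its title."""
    return [node.text for node in walk(section)][1:]


# A small tree mirroring the rendered output shown above.
roots = [
    StandInNode("PART I — FINANCIAL INFORMATION", [
        StandInNode("Item 1. Financial Statements", [
            StandInNode("CONDENSED CONSOLIDATED STATEMENTS OF OPERATIONS (Unaudited)", [
                StandInNode("(In millions, except number of shares and per share amounts)"),
                StandInNode("Table with 24 rows, 80 numbers, and 1058 characters."),
                StandInNode("See accompanying Notes."),
            ]),
        ]),
    ]),
]

section = find_section(roots, "CONDENSED CONSOLIDATED STATEMENTS OF OPERATIONS")
if section is not None:
    print("\n".join(section_text(section)))
```

The same pattern addresses the "all of the text in a specific section" question from the FAQ above: locate the section title, then join the text of its descendants.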
[!TIP] Feel free to experiment with the example code provided above. You can easily do this by launching a GitHub Codespace for the sec-parser repository, which will set up a development environment for you in the cloud. This is a great way to play around with the code without having to set up anything on your local machine. Give it a try!
For more examples and advanced usage, you can continue learning how to use sec-parser by referring to the User Guide, Developer Guide, and Documentation.
Please refer to this document.
Your turn to explore the capabilities of sec-parser! With the tools and examples provided, you can now dive into parsing and analyzing SEC filings.
The semantic elements and tree structures created by the parser will serve as a solid foundation for your financial analysis and research tasks with the tools of your choice.
For a tailored experience, consider using our free and open-source library for AI-powered financial analysis:
pip install sec-ai
To ensure your code remains functional even when we change the internal structure of sec-parser, it's recommended to avoid deep imports. Here is an example of a deep import (not recommended):
[!CAUTION]
from sec_parser.semantic_tree.internal_utils.core import SomeInternalClass
Instead, use the suggested ways to import modules from sec-parser:

- `import sec_parser as sp`: This imports the main package as `sp`. You can then access its functionalities using the `sp.` prefix.
- `from sec_parser import SomeClass`: This allows you to directly use `SomeClass` without any prefix.
- `import sec_parser.semantic_tree`: This imports the `semantic_tree` submodule, and you can access its classes and functions using the `sec_parser.semantic_tree.` prefix.
- `from sec_parser.semantic_tree import SomeClass`: This imports a specific class `SomeClass` from the `semantic_tree` submodule.

[!NOTE] The main package `sec_parser` contains only the most common functionalities. For specialized tasks, please use submodule or submodule-level imports.
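The four import forms behave the same way for any Python package. So that it runs without sec-parser installed, the sketch below uses the standard library's `json` package as a stand-in for `sec_parser`, and its `json.decoder` submodule as a stand-in for `semantic_tree`. Note that a plain `import package.submodule` binds the top-level package name, so names are reached through the full dotted path.

```python
# Stand-in: json plays the role of sec_parser, json.decoder the role of
# the semantic_tree submodule. None of this touches sec-parser itself.

import json as js                         # package alias: use the "js." prefix
from json import JSONDecoder              # top-level name: no prefix needed
import json.decoder                       # submodule: use the full "json.decoder." prefix
from json.decoder import JSONDecodeError  # submodule-level name: no prefix needed

# Form 1: access functionality through the alias.
parsed = js.loads('{"form": "10-Q"}')

# Form 2: use the imported class directly.
decoder = JSONDecoder()

# Forms 3 and 4 reach the same object by two different routes.
same_class = json.decoder.JSONDecodeError is JSONDecodeError

try:
    decoder.decode("not json")
except JSONDecodeError as error:
    failure = type(error).__name__

print(parsed, same_class, failure)
```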
For information about setting up the development environment, coding standards, and contribution workflows, please refer to our CONTRIBUTING.md guide.
This project is licensed under the MIT License - see the LICENSE file for details.
We found that sec-parser demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.