Enhancing Open-Source Compliance: Introducing Socket’s Advanced License Analysis

Open-source license detection and compliance tools often fall short in accuracy, scope, and/or usability, making comprehensive compliance management quite a challenge. Our survey of existing solutions found a clustering around two extremes; tools which apply simple analysis strategies to obvious locations are able to produce results which seem clear and concise, but are in fact incomplete and inaccurate. On the other hand, tools that storm the gates and apply every detection strategy, everywhere, all of the time, tend to produce excessive amounts of data, with false positives and extraneous information obscuring otherwise useful feedback.

To address these issues, we have developed our own suite of license analysis tools, which have now reached version 2.0, with significant improvements over the initial release. Our new state-of-the-art license analysis and compliance features offer the most precise license detection capabilities available to open source developers today, supporting over 1,000 license types and incorporating robust and highly configurable license enforcement.

Designed to meet enterprise standards, Socket’s license analysis and compliance completes our traditional Software Composition Analysis (SCA) toolkit. Let’s take a technical deep dive into the advancements that make our license analysis one of the most accurate and comprehensive tool sets available for the task.

Why we’re tackling the challenge of open source license detection and analysis

Properly managing license information is an important part of risk management. Because software (in both source and binary form) is generally the proper subject matter of copyright, the author of a piece of software becomes the exclusive owner of the rights to copy, modify, and distribute the work as soon as it is fixed in some tangible medium of expression. Downstream users therefore need a license if they wish to copy, modify, or distribute the work without exposing themselves to liability for copyright infringement.

In the United States, penalties for copyright infringement under Title 17 (the Copyright Act) include not only money damages and attorney’s fees, but more importantly disgorgement and an injunction prohibiting further infringement (see 17 U.S.C. §§ 502-505), not to mention the time, expense, and reputational damage that can accompany a lawsuit.

License detection and analysis pose a significant challenge; license and authorship data can appear almost anywhere in a package, and can be communicated to licensees in whatever form the author chooses. While standards and conventions have emerged and are slowly seeing increased adoption, like the use of SPDX license identifiers and placement of license information in certain locations, huge swathes of the open source universe do not adhere to these (or any) standards or conventions, owing to historical inertia, lack of awareness, the oddities found in machine-generated code, vendored dependencies, etc.

Implementing useful license analysis features requires developers to learn and adapt to many patterns and historical trends both new and old, addressing a host of ecosystem and registry-specific idioms, and to spend significant time identifying and examining edge cases in the wild.

Socket has been hard at work developing and refining a suite of features for analyzing, managing, and complying with licenses across a range of supported languages and ecosystems, intent on surpassing the accuracy and capabilities offered by existing tools.

Example: Django 5.1.1

Readers may more readily appreciate the finer points of license analysis by way of a real world example; in this case we’ll examine the popular Python package Django, version 5.1.1. A public overview of Socket’s license analysis can be found on the package's license page.

The contents of Django-5.1.1 directly implicate the terms and conditions of seven distinct licenses across ten license files throughout the project, and include a PKG-INFO file which combines SPDX license expressions, the much-maligned PEP 301 classifier system, paths redirecting to license text files, and a path pointing to a file containing authorship information, which is important for attribution as may be demanded by the terms of one or more of the applicable licenses.

A thorough analysis of the package, including exploration of vendored dependencies and other bundled data, reveals a number of different licenses, including a copyleft license (CC-BY-SA-4.0) not exposed by the registry metadata.

Though most package registries display a modicum of license information for a package, the information displayed is generally taken verbatim from a single source of information, like the package.json file’s “license” property for npm. While this is certainly better than nothing, package registry license data is more often than not a gross underestimate. Case in point, PyPI's page for Django only associates BSD-3-Clause with this version.

LICENSE File Detection and Analysis

The most common means of communicating license information to users is by including the relevant terms and conditions in a LICENSE file in the project’s root directory, a practice which may be familiar to many readers.

Even when developers use an off the shelf license like Apache-2.0 or MIT, they may make modifications to the text to reflect authorship, dates, contact info, or other information. Accordingly, the first pillar of license detection is determining what terms and conditions are described by a given LICENSE file, generally by matching its contents to the terms and conditions of a known license. This task seems simple at first glance, but presents subtle challenges even in its purest form. Once we enter the arena of public package registries however, the problem becomes much more difficult, as proper detection and reporting demands much more thoughtful analysis.

Most license analysis tools begin by applying some kind of transformation or normalization to the file’s contents to negate the effects of non-critical changes made to the license text (capitalization, formatting, filling in author names, dates, etc.), then they attempt to match the text to a known license using some kind of text similarity metric.

Because there is a relatively large set of known licenses to compare the text to, and because those licenses have very different ways of presenting author and copyright information (and ultimately authors are free to choose their own), the precision of this normalization step is critical. If a normalization procedure is too aggressive in its determination of what is extraneous, substantive content may be cut or modified, leading to misidentification. Conversely, if normalization is too conservative and fails to remove enough non-essential content, substantially similar licenses will fail to be recognized as a match.
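To make the trade-off concrete, here is a minimal sketch of the normalize-then-compare approach described above. It is not Socket’s implementation: the normalization rules, the word-level `difflib` similarity metric, the 0.9 threshold, and the names `normalize`, `similarity`, and `best_match` are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical rule: treat any line that starts with "copyright" as author-specific noise.
# A rule this blunt is exactly the kind that can remove substantive text by accident.
COPYRIGHT_LINE = re.compile(r"^\s*copyright\b.*$", re.IGNORECASE | re.MULTILINE)


def normalize(text: str) -> str:
    """Drop copyright lines, lowercase, and collapse punctuation and whitespace."""
    text = COPYRIGHT_LINE.sub(" ", text)
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


def similarity(a: str, b: str) -> float:
    """Word-level similarity ratio in [0, 1] between two normalized texts."""
    words_a = normalize(a).split()
    words_b = normalize(b).split()
    return SequenceMatcher(None, words_a, words_b, autojunk=False).ratio()


def best_match(candidate: str, known: dict, threshold: float = 0.9):
    """Return (license_id, score) for the closest known text, or None below the threshold."""
    score, license_id = max(
        (similarity(candidate, text), license_id) for license_id, text in known.items()
    )
    return (license_id, score) if score >= threshold else None
```

Loosening or tightening `COPYRIGHT_LINE`, or moving the threshold, changes which texts match, which is the sensitivity the paragraph above describes.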

One of the things that surprised us most during development was the degree to which small changes in these normalization techniques could completely make or break the end result. While the SPDX specification provides a set of standards for normalization during license matching (guidelines which are used by a number of peer tools), the reality is that developers do not consult the SPDX guidelines when modifying their project’s LICENSE files, which in many cases significantly blunts their effectiveness when exposed to real-world data.

We were able to achieve a significant increase in our license detection accuracy by eschewing the SPDX matching guidelines and developing our own procedures in-house through a massive amount of trial and error applied to data observed on public package registries.

The most basic benchmark for normalization effectiveness, and the most rudimentary test in Socket’s license matching test suite, creates a matrix of all known licenses, normalizes them with some noise, and attempts to match each row, requiring the main diagonal of the matrix to contain perfect self-matches. Socket’s test suite requires 100% accuracy to pass.
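A self-match test of this kind might be sketched as follows. This is an illustrative reconstruction rather than Socket’s test suite: the noise function, the similarity metric, and the two-entry toy corpus (stand-in strings, not real license texts) are assumptions.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> list:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def add_noise(text: str) -> str:
    """Deterministic, meaning-preserving noise: upper-casing plus odd whitespace."""
    return "  \n".join(text.upper().split())


def score(a: str, b: str) -> float:
    return SequenceMatcher(None, normalize(a), normalize(b), autojunk=False).ratio()


def self_match_failures(corpus: dict) -> list:
    """Return the ids whose noisy copy is not a perfect, best match for its own row."""
    failures = []
    for row_id, row_text in corpus.items():
        noisy = add_noise(row_text)
        # One row of the matrix: the noisy text scored against every known text.
        row = {col_id: score(noisy, col_text) for col_id, col_text in corpus.items()}
        best = max(row, key=row.get)
        if best != row_id or row[row_id] < 1.0:
            failures.append(row_id)
    return failures


# Toy stand-ins for license texts; a real run would load the full corpus.
TOY_CORPUS = {
    "toy-permissive": "Permission is granted to use, copy, and modify this software.",
    "toy-reciprocal": "You may copy this software only if derived works carry these same terms.",
}
```

An empty list from `self_match_failures` corresponds to a clean main diagonal.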

When we surveyed peer tools, we were surprised to see significant weaknesses emerge at this early stage; on further investigation this was generally due to overly aggressive normalization excising key portions of the text.

The TypeScript port of the official SPDX license matching tool (the official Python version fails to build) fails to produce a correct result for 53 of the roughly 600 licenses in the base SPDX license list.
Trivy, another open source tool, identifies only 258 of the more than 600 licenses in the SPDX license list.
Licensee, the popular open source Ruby gem which forms the core of GitHub’s license identification, similarly fails to correctly identify a number of common licenses, even in their unmodified form, or when seemingly trivial changes like added whitespace or repeated words are introduced (GitHub issues 602, 631, 655, among others).

Of the peer tools we explored, scancode-toolkit deserves a shout out for achieving effectively 100% accuracy in this basic test. Socket’s test suite also incorporates a growing collection of user-modified LICENSE files found in the wild to ensure accurate normalization and basic license identification.

However, the real challenge is found by stepping out into the open frontier of public package registries, where a broad range of totally free-form LICENSE files seek to frustrate naive matching techniques. The most common non-standard specimen is a combination of more than one license text in a single file, either as a form of dual licensing (offering licensees a choice of e.g. Apache-2.0 or MIT), or as a way of communicating license information for vendored dependencies.

Socket has developed new metrics for matching these complex LICENSE files, which may also mix short and long-form license texts. Users may view an example of such a file, found by Socket’s license analysis, here.

Other examples of degenerate LICENSE files we’ve encountered in the wild and adapted to are license files containing just a raw SPDX expression, files with custom begin/end delimiters surrounded by extraneous front/back matter, and files called “LICENSE” which only contain paths redirecting to other license files.
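As a rough sketch of why these cases need dedicated handling, the snippet below triages a LICENSE file before any text matching is attempted. The categories, the path pattern, and the tiny `KNOWN_IDS` sample are hypothetical simplifications, not Socket’s rules, and the expression check is deliberately crude (it does not validate grammar).

```python
import re

# Tiny illustrative sample; a real list would cover every SPDX identifier.
KNOWN_IDS = {"MIT", "Apache-2.0", "BSD-3-Clause", "GPL-3.0-only"}
OPERATORS = {"AND", "OR"}

# Hypothetical pattern: a relative path with at least one "/" or a bare file
# name with a common documentation extension.
PATH_LIKE = re.compile(r"^(?:\./)?[\w.-]+(?:/[\w.-]+)+$|^[\w.-]+\.(?:txt|md|rst)$")


def is_raw_spdx_expression(text: str) -> bool:
    """True if the text consists only of known identifiers, AND/OR, and parentheses."""
    tokens = text.replace("(", " ").replace(")", " ").split()
    operands = [token for token in tokens if token not in OPERATORS]
    return bool(operands) and all(token in KNOWN_IDS for token in operands)


def classify_license_file(content: str) -> str:
    """Triage a LICENSE file as 'empty', 'redirect', 'spdx-expression', or 'text'."""
    lines = [line.strip() for line in content.splitlines() if line.strip()]
    if not lines:
        return "empty"
    if all(PATH_LIKE.match(line) for line in lines):
        return "redirect"  # the file only points at other license files
    if is_raw_spdx_expression(" ".join(lines)):
        return "spdx-expression"
    return "text"  # hand off to normalization and text matching
```

Only the last category should reach the text matcher; the others would otherwise surface as unidentifiable licenses.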

One of Socket’s major goals is to minimize noise directed at users, providing only essential information; without proper detection and case-specific handling of these patterns, for example files simply pointing to a license elsewhere, a match on a primary LICENSE file would fail, producing a false-positive for an unidentifiable license. Based on our exploration of other license detection tools available to us for comparison, we believe Socket is unique in offering effective analysis of complex and otherwise degenerate license files.

We also felt strongly that Socket’s license analysis should find all license files, anywhere in a package, not just in the project root. These files may follow a number of different naming patterns (LICENSE, COPYING, some variation on the name of the actual license, random filenames within an aptly named directory, etc.), or may be identified by ecosystem- and language-specific patterns frequently seen in manifest files (looking at you, Python). The only peer tool we surveyed which fully explored the package files was scancode-toolkit which, while admirably thorough, took almost exactly an hour to complete license analysis on our benchmark repository, ostensibly due to its reliance on a large database of one-off rules (for reference, Socket’s license analysis takes seconds). Some tools, including licensee (as used by GitHub), have explicitly stated that this level of analysis is a non-goal and will not be implemented.
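A whole-package search along these lines could look like the sketch below. The file-name pattern and directory names are assumptions for illustration and cover only the LICENSE/COPYING conventions and the “aptly named directory” case mentioned above.

```python
import os
import re

# Hypothetical patterns: LICENSE/LICENCE/COPYING with an optional suffix such as
# ".txt" or "-MIT", plus any file inside a directory named after licenses.
LICENSE_NAME = re.compile(r"^(licen[sc]e|copying)([-._].*)?$", re.IGNORECASE)
LICENSE_DIRS = {"license", "licenses", "licence", "licences"}


def find_license_files(root: str) -> list:
    """Return sorted paths (relative to root) of likely license files anywhere under root."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        in_license_dir = os.path.basename(dirpath).lower() in LICENSE_DIRS
        for name in filenames:
            if in_license_dir or LICENSE_NAME.match(name):
                full_path = os.path.join(dirpath, name)
                matches.append(os.path.relpath(full_path, root))
    return sorted(matches)
```

Walking the full tree is what surfaces licenses for vendored dependencies, which a root-only check never sees.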

Lastly, Socket’s license analysis searches all files for copyright, authorship, and attribution information which may be inlined in the file’s header, regardless of whether they’ve been identified as potential “main” LICENSE text files.
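For illustration, a header scan of this kind might be sketched as follows. The regular expression handles only one common notice shape (“Copyright (c) YEARS HOLDER”) and the 30-line window is an arbitrary assumption; real-world notices vary far more than this.

```python
import re

# Hypothetical pattern for one common shape of copyright notice.
COPYRIGHT = re.compile(
    r"copyright\s+(?:\(c\)\s+|©\s+)?"
    r"(?P<years>\d{4}(?:\s*[-,]\s*\d{4})*)[,\s]+"
    r"(?P<holder>.+)",
    re.IGNORECASE,
)


def scan_header(source: str, max_lines: int = 30) -> list:
    """Collect copyright notices from the first max_lines lines of a file."""
    notices = []
    for line in source.splitlines()[:max_lines]:
        match = COPYRIGHT.search(line)
        if match:
            notices.append(
                {
                    "years": match["years"],
                    # Trim trailing comment punctuation such as "*/".
                    "holder": match["holder"].strip(" .*/"),
                }
            )
    return notices
```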

Ecosystem-specific License Analysis

While the practice of putting license terms in a root LICENSE file may have already been familiar to many readers, a large amount of critical license and attribution information is also communicated by manifest and configuration files, almost always within language specific patterns, some of which are quite complex.

Python in particular has a host of manifest and configuration file formats which may contain license data in completely different formats (including but not limited to setup.py, pyproject.toml, and PKG-INFO/METADATA), which permit different combinations of SPDX license expressions, arbitrary strings conveying license information, PEP 301 license classifiers (many of which are irreconcilably ambiguous), arbitrary glob patterns as relative pointers to license files, as well as author information needed for attribution as may be required by the license terms.
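To give a sense of how many separate signals a single Python metadata file can carry, the sketch below pulls the license-related fields out of a PKG-INFO/METADATA file using the standard library’s `email.parser`, which can read this header-style format. The function name and the shape of the returned dictionary are assumptions; reconciling the fields with one another is the hard part and is not shown.

```python
from email.parser import HeaderParser


def read_license_metadata(pkg_info: str) -> dict:
    """Extract license-related core metadata fields from PKG-INFO/METADATA text."""
    message = HeaderParser().parsestr(pkg_info)
    classifiers = message.get_all("Classifier", [])
    return {
        # SPDX expression, when the package declares one.
        "expression": message.get("License-Expression"),
        # Free-form string: may be an identifier, a sentence, or a full license text.
        "license": message.get("License"),
        # Trove classifiers; only the "License ::" ones are relevant here.
        "classifiers": [c for c in classifiers if c.startswith("License ::")],
        # Relative paths pointing at license files shipped with the package.
        "license_files": message.get_all("License-File", []),
        # Authorship data that attribution may require.
        "author": message.get("Author") or message.get("Author-email"),
    }
```

Each key may be absent, redundant, or in conflict with the others, so a consumer has to decide which source wins.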

While many npm packages conform to the modern standard of having a single SPDX expression in a “license” property, many packages use an older and much more complex “licenses” property with a list of nested objects. Unbeknownst to many, npm’s documentation endorses a rather frightening extension of the SPDX expression grammar, allowing the special string “UNLICENSED” to signal that the work is not offered under license. For this reason, among others, Socket developed its own SPDX expression parser to recognize a superset of the SPDX grammar (as a side note, Socket’s license analysis became dependency-free as development progressed; we required more from our tools than we could squeeze out of existing libraries).
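The sketch below shows one way the npm variants could be reduced to a single string before parsing. It is illustrative only: the function names are hypothetical, and joining multiple legacy “licenses” entries with OR is an assumption (it treats them as alternatives), not something the manifest itself states.

```python
from typing import Optional


def npm_license_expression(manifest: dict) -> Optional[str]:
    """Reduce the "license"/"licenses" properties of a package.json to one string."""
    current = manifest.get("license")
    if isinstance(current, str) and current.strip():
        # Modern form: an SPDX expression, or the special string "UNLICENSED".
        return current.strip()
    if isinstance(current, dict) and current.get("type"):
        # Older object form: {"type": "...", "url": "..."}.
        return current["type"]
    legacy = manifest.get("licenses")
    if isinstance(legacy, list):
        types = [
            entry.get("type") if isinstance(entry, dict) else entry
            for entry in legacy
        ]
        types = [t for t in types if isinstance(t, str) and t]
        if len(types) == 1:
            return types[0]
        if types:
            # Assumption: several legacy entries are offered as alternatives.
            return "(" + " OR ".join(types) + ")"
    return None


def is_unlicensed(expression: Optional[str]) -> bool:
    """"UNLICENSED" is not part of the SPDX grammar and needs special handling."""
    return expression == "UNLICENSED"
```

A strict SPDX parser would reject “UNLICENSED” outright, which is one reason a parser for a superset of the grammar is useful.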

None of the peer tools available for us to survey, including licensee (used by GitHub), Trivy, the official SPDX tools, scancode-toolkit, and CycloneDX, implemented a significant subset of these ecosystem-specific features.

Automating License Attribution and Compliance

Finding and analyzing license information is only half the story; what remains is to act on the available information, not only to insulate yourself from risk, but also to be a good citizen of the open source community and to respect the rights of developers.

Many common open source licenses are conditioned on proper attribution and inclusion of the license itself in downstream works. Even for permissive licenses, if the terms require attribution, compliance requires an exhaustive traversal of the package for relevant authorship and attribution data, and subsequent inclusion of that information.

Socket provides users the ability to generate attribution data for any package, both through the front-end on the package’s license page, and in bulk through the public API for any collection of packages. The included data is sourced from the same exhaustive analysis described above, and is provided in a structured JSON format, permitting users to sort, aggregate, and modify as needed.
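As an example of the kind of post-processing structured attribution data allows, the sketch below groups records by license and renders a plain-text notice. The record shape (`package`, `license`, `authors`) is entirely hypothetical and is not the schema returned by Socket’s API.

```python
from collections import defaultdict


def build_notice(records: list) -> str:
    """Render attribution records as a notice grouped by license identifier."""
    by_license = defaultdict(list)
    for record in records:
        by_license[record["license"]].append(record)

    sections = []
    for license_id in sorted(by_license):
        lines = [f"== {license_id} =="]
        for record in sorted(by_license[license_id], key=lambda r: r["package"]):
            holders = ", ".join(record["authors"]) or "unknown authors"
            lines.append(f"{record['package']}: {holders}")
        sections.append("\n".join(lines))
    return "\n\n".join(sections)
```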

Socket’s License Policy Feature

Socket’s license policy feature allows users to create and manage sets of allowed licenses, apply those allow lists to their dependencies, and receive data detailing the presence and location of any license data not permitted by the user’s allow list. The allow list functionality is available through the web front-end, and through the public API.

Because Socket recognizes more than a thousand licenses and exceptions, we offer users the ability to add families of licenses in bulk based on characteristics like FSF or OSI approval, and whether they’re permissive or copyleft, including subcategories like strong, weak, and network copyleft. Users can view the “expanded” form of their license policy (an exhaustive list of the licenses allowed by a combination of categories) via the license policy saturation endpoint, if they wish to manually review what’s actually allowed, or to create custom categories like “all permissive licenses except for this specific license”.
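Conceptually, expanding a category-based policy amounts to a set operation over license metadata, as in the sketch below. The six-license table, the tag names, and the `expand_policy` function are illustrative assumptions; they do not reflect Socket’s data model or its saturation endpoint.

```python
# Small illustrative sample of licenses and category tags (hypothetical tag names).
LICENSE_TAGS = {
    "MIT": {"permissive"},
    "Apache-2.0": {"permissive"},
    "BSD-3-Clause": {"permissive"},
    "LGPL-3.0-only": {"copyleft", "weak-copyleft"},
    "GPL-3.0-only": {"copyleft", "strong-copyleft"},
    "AGPL-3.0-only": {"copyleft", "network-copyleft"},
}


def expand_policy(categories: set, exclude: frozenset = frozenset()) -> list:
    """Return the sorted license ids allowed by any category, minus explicit exclusions."""
    return sorted(
        license_id
        for license_id, tags in LICENSE_TAGS.items()
        if tags & categories and license_id not in exclude
    )
```

For instance, `expand_policy({"permissive"}, exclude=frozenset({"Apache-2.0"}))` expresses “all permissive licenses except for this specific license”.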

Due to the free form nature of license data and the complexities of intellectual property, no license analysis will be perfect. For that reason, Socket’s license policy feature is designed such that users can place licenses, files, and packages in their allow list once they’ve been manually reviewed.

Returning to our Django example above, the presence of a copyleft license (CC-BY-SA-4.0) would be caught by application of a license policy which disallows copyleft licenses. If, on manual review, it is believed that the share-alike provisions of the license are not triggered by this specific use case, the offending license file (or package) can be placed in the license allow list to silence this warning as unnecessary for all future checks while excising only the smallest unit of code necessary. This level of control is important, because certain license conditions often create edge cases to be analyzed on a case by case basis; requiring users to either allow or deny all instances of a given license globally creates a blunt tool which forecloses certain possibilities which may otherwise be beneficial.
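The granularity described above can be modeled as an allow list with three levels, sketched below. The class names, fields, and the file path used in the usage example are hypothetical; the path is a placeholder, not the actual location of the file in Django.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Finding:
    package: str      # e.g. "django@5.1.1"
    path: str         # file in which the license data was found
    license_id: str   # detected license identifier


@dataclass
class Policy:
    allowed_licenses: set = field(default_factory=set)
    allowed_files: set = field(default_factory=set)     # (package, path) pairs
    allowed_packages: set = field(default_factory=set)

    def violations(self, findings: list) -> list:
        """Return findings not covered by any level of the allow list."""
        return [
            f
            for f in findings
            if f.license_id not in self.allowed_licenses
            and (f.package, f.path) not in self.allowed_files
            and f.package not in self.allowed_packages
        ]


findings = [
    Finding("django@5.1.1", "LICENSE", "BSD-3-Clause"),
    Finding("django@5.1.1", "path/to/bundled/LICENSE", "CC-BY-SA-4.0"),  # placeholder path
]
policy = Policy(allowed_licenses={"BSD-3-Clause"})
flagged = policy.violations(findings)  # only the CC-BY-SA-4.0 finding

# After manual review, allow just that one file rather than the license globally.
policy.allowed_files.add(("django@5.1.1", "path/to/bundled/LICENSE"))
remaining = policy.violations(findings)  # now empty
```

Because the exception is scoped to one file in one package, CC-BY-SA-4.0 would still be flagged anywhere else it appears.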

While other license analysis tools offer the ability to apply an allow list to detected licenses (generally with less control) or the ability to generate some form of attribution, the usefulness of these features is predicated on the tool’s thoroughness and accuracy of detection. As our investigation of peer tools has led us to conclude that Socket’s license detection is superior, we submit that our attribution and license policy features are similarly the best implementations currently available.

Take the updated Socket license scanner for a test drive on your open-source dependencies to check out the technical improvements and see how it can improve your license compliance today!
