
sgl-kernel
Kernel Library for LLM inference engines
sgl-kernel provides optimized compute primitives for LLM inference engines, enabling efficient inference for large language models and vision-language models through custom kernel operations. It is used by engines such as LightLLM and SGLang.
Requires torch == 2.9.1
# Latest version
pip3 install sgl-kernel --upgrade
To build from source, run:
make build
By default, make build uses all available CPU cores. You can override build parallelism and NVCC compile threads:
# Limit parallel jobs (controls both make and cmake parallelism)
make build MAX_JOBS=2
# Additionally limit NVCC internal threads (reduces CPU and peak memory)
make build MAX_JOBS=2 CMAKE_ARGS="-DSGL_KERNEL_COMPILE_THREADS=1"
Register each operator with a schema definition via m.def and a device binding via m.impl (for schema syntax, see the PyTorch schema reference):
// We need def with schema here for torch.compile
m.def(
"bmm_fp8(Tensor A, Tensor B, Tensor! D, Tensor A_scale, Tensor B_scale, Tensor workspace_buffer, "
"int cublas_handle) -> ()");
m.impl("bmm_fp8", torch::kCUDA, &bmm_fp8);
Third-party C++ libraries often use int and float, but PyTorch bindings require int64_t and double due to Python's type mapping.
Use make_pytorch_shim from sgl_kernel_torch_shim.h to handle conversions automatically:
// Add type conversion for int -> int64_t
template <>
struct pytorch_library_compatible_type<int> {
  using type = int64_t;
  static int convert_from_type(int64_t arg) {
    TORCH_CHECK(arg <= std::numeric_limits<int>::max(), "value too large");
    TORCH_CHECK(arg >= std::numeric_limits<int>::min(), "value too small");
    return static_cast<int>(arg);
  }
};
// Wrap your function
m.impl("fwd", torch::kCUDA, make_pytorch_shim(&mha_fwd));
Use @pytest.mark.skipif to skip tests on unsupported hardware:
@pytest.mark.skipif(
    skip_condition, reason="Nvfp4 requires compute capability of 10 or above."
)
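The skip_condition is typically derived from the device's compute capability. A minimal sketch of the tuple comparison involved; real test code would obtain the tuple from torch.cuda.get_device_capability(), but here it is passed in so the logic runs without a GPU:

```python
def below_capability(device_capability, required=(10, 0)):
    """True when the device's (major, minor) compute capability is below `required`."""
    # Tuples compare lexicographically, so (9, 0) < (10, 0) and (10, 1) >= (10, 0).
    return device_capability < required

print(below_capability((9, 0)))    # True  -> test is skipped
print(below_capability((10, 0)))   # False -> test runs
```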
Add benchmarks in benchmark/ using the Triton benchmarking utilities. We recommend triton.testing.do_bench_cudagraph for kernel benchmarking: compared to triton.testing.do_bench, it replays the kernel inside a CUDA graph, which removes per-launch CPU overhead and gives more stable timings for short-running kernels.
Run the test suite.
Analyze CUDA kernel sizes in compiled wheel files to identify oversized kernels and template-instantiation bloat:
This tool requires cubloaty:
# Install cubloaty
pip install cubloaty
# Analyze a wheel file
python analyze_whl_kernel_sizes.py path/to/sgl_kernel-*.whl
# Custom output file
python analyze_whl_kernel_sizes.py path/to/sgl_kernel-*.whl --output my_analysis.txt
The tool generates a per-kernel size report (written to the output file). Use it to identify large kernels and potential template-instantiation bloat.
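As a rough first pass without cubloaty, you can inspect member sizes directly, since a wheel is a zip archive. This standalone sketch is not the analysis tool itself and reports file sizes, not per-kernel CUDA sizes:

```python
import zipfile

def largest_members(wheel_path, top=5):
    """Return (name, size_in_bytes) for the `top` largest files in a wheel.

    A .whl is an ordinary zip archive; the biggest members (usually the
    compiled extension modules) hint at where the binary bulk lives.
    """
    with zipfile.ZipFile(wheel_path) as zf:
        infos = sorted(zf.infolist(), key=lambda i: i.file_size, reverse=True)
    return [(i.filename, i.file_size) for i in infos[:top]]
```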