LMDeploy is a toolkit for compressing, deploying, and serving LLMs, developed by the MMRazor and MMDeploy teams. It has the following core features:
Efficient Inference: LMDeploy delivers up to 1.8x higher request throughput than vLLM by introducing key features such as persistent batching (a.k.a. continuous batching), blocked KV cache, dynamic split & fuse, tensor parallelism, and high-performance CUDA kernels.
Effective Quantization: LMDeploy supports weight-only and k/v quantization, and its 4-bit inference performance is 2.4x higher than FP16. The quantization quality has been confirmed via OpenCompass evaluation.
Effortless Distribution Server: Leveraging the request distribution service, LMDeploy makes it easy and efficient to deploy multi-model services across multiple machines and GPUs.
Interactive Inference Mode: By caching the attention k/v during multi-round dialogue, the engine remembers dialogue history and avoids re-processing historical sessions.
Excellent Compatibility: LMDeploy supports using KV cache quantization, AWQ, and automatic prefix caching simultaneously (see the sketch after this list).
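As a rough illustration of this compatibility, the following sketch combines 4-bit AWQ weights, 8-bit online KV cache quantization, and prefix caching in one pipeline. The model ID and parameter values are illustrative, and TurbomindEngineConfig fields may differ across LMDeploy versions, so treat this as a sketch rather than canonical usage:

from lmdeploy import pipeline, TurbomindEngineConfig

# Illustrative configuration combining the features listed above:
# - model_format='awq' loads 4-bit AWQ-quantized weights
# - quant_policy=8 enables 8-bit online k/v cache quantization
# - enable_prefix_caching=True reuses KV blocks shared across prompts
engine_config = TurbomindEngineConfig(
    model_format='awq',
    quant_policy=8,
    enable_prefix_caching=True,
)

# The 4-bit model ID is an example; substitute any AWQ-quantized checkpoint.
pipe = pipeline('internlm/internlm2-chat-7b-4bit', backend_config=engine_config)
print(pipe(['Hi, pls intro yourself']))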
For detailed inference benchmarks on more devices and in more settings, please refer to the benchmark pages for LLMs and VLMs.
LMDeploy has developed two inference engines, TurboMind and PyTorch, each with a different focus. The former strives for ultimate inference performance, while the latter, developed purely in Python, aims to lower the barrier for developers.
They differ in the models and inference data types they support. Please refer to this table for each engine's capabilities and choose the one that best fits your actual needs.
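For example, you can select an engine explicitly by passing a backend config to the pipeline. This is a minimal sketch assuming the TurbomindEngineConfig and PytorchEngineConfig classes exported by lmdeploy; field names and defaults may vary between versions:

from lmdeploy import pipeline, TurbomindEngineConfig, PytorchEngineConfig

# TurboMind: the performance-oriented engine with high-performance CUDA kernels.
# tp=2 shards the model across two GPUs via tensor parallelism (illustrative).
turbomind_pipe = pipeline('internlm/internlm2-chat-7b',
                          backend_config=TurbomindEngineConfig(tp=2))

# PyTorch: the pure-Python engine, easier for developers to read and extend.
pytorch_pipe = pipeline('internlm/internlm2-chat-7b',
                        backend_config=PytorchEngineConfig(tp=1))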
It is recommended to install lmdeploy using pip in a conda environment (Python 3.8 - 3.12):
conda create -n lmdeploy python=3.8 -y
conda activate lmdeploy
pip install lmdeploy
The default prebuilt package is compiled on CUDA 12 since v0.3.0. For information on installing on CUDA 11+ platforms, or for instructions on building from source, please refer to the installation guide.
import lmdeploy

# Build an inference pipeline from a HuggingFace model ID
pipe = lmdeploy.pipeline("internlm/internlm2-chat-7b")
# Run batched inference over a list of prompts
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
[!NOTE] By default, LMDeploy downloads models from HuggingFace. If you would like to use models from ModelScope, please install ModelScope with
pip install modelscope
and set the environment variable:
export LMDEPLOY_USE_MODELSCOPE=True
If you would like to use models from openMind Hub, please install openMind Hub with
pip install openmind_hub
and set the environment variable:
export LMDEPLOY_USE_OPENMIND_HUB=True
For more information about the inference pipeline, please refer to the pipeline guide.
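As one example of pipeline customization, sampling behavior can be controlled per call with a generation config. This sketch assumes the GenerationConfig class exported by lmdeploy; the parameter values are illustrative, not recommendations:

from lmdeploy import pipeline, GenerationConfig

pipe = pipeline('internlm/internlm2-chat-7b')

# Illustrative sampling settings; tune for your workload.
gen_config = GenerationConfig(
    max_new_tokens=256,  # cap on generated tokens per request
    top_p=0.8,           # nucleus sampling threshold
    temperature=0.7,     # softmax temperature
)

response = pipe(['Hi, pls intro yourself'], gen_config=gen_config)
print(response)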
Please review the getting_started section for the basic usage of LMDeploy.
For detailed user guides and advanced guides, please refer to our tutorials:
Deploying LLMs offline on the NVIDIA Jetson platform by LMDeploy: LMDeploy-Jetson
Example project for deploying LLMs using LMDeploy and BentoML: BentoLMDeploy
We appreciate all contributions to LMDeploy. Please refer to CONTRIBUTING.md for the contributing guideline.
@misc{2023lmdeploy,
title={LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM},
author={LMDeploy Contributors},
howpublished = {\url{https://github.com/InternLM/lmdeploy}},
year={2023}
}
This project is released under the Apache 2.0 license.