lmdoctor

Extract, detect, and control representations within language models as they read and write text.

Detect lies, block harmful content, alter emotional tone, and more!

lmdoctor reads and manipulates a model's hidden states at inference time, and therefore requires access to model weights. It is based largely on ideas from Representation Engineering: A Top-Down Approach to AI Transparency by Zou et al., 2023 (paper, code), as well as The Geometry of Truth by Marks & Tegmark, 2023 (paper, code).
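As a rough illustration of the core idea from those papers (a toy sketch, not lmdoctor's actual implementation): gather hidden states from contrasting prompts, take the difference of class means as a "reading" direction, and score new activations by projecting onto it.

import numpy as np

# toy difference-of-means probe (in the spirit of Marks & Tegmark's mass-mean probing)
# each row stands in for one prompt's hidden state at some chosen layer
honest_acts = np.random.randn(32, 4096)
dishonest_acts = np.random.randn(32, 4096)

# direction in activation space separating the two behaviors
direction = honest_acts.mean(axis=0) - dishonest_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# projecting a new activation onto the direction gives a scalar honesty score
new_act = np.random.randn(4096)
score = float(new_act @ direction)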

Built on 🤗 transformers
pip package here
project repo here

Example
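The snippets below assume a 🤗 transformers model and tokenizer are already loaded, along with the tag strings that wrap user and assistant turns. A minimal setup sketch (the checkpoint name and tags here are placeholders; use whatever matches your model's prompt format):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# any causal LM whose weights you can access; this checkpoint is just an example
model_name = 'mistralai/Mistral-7B-Instruct-v0.1'
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_name)

# turn delimiters; these depend on the model's chat template
user_tag = '[INST] '
assistant_tag = ' [/INST]'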

honesty extraction

from lmdoctor.doctor import Doctor

# learn a direction in hidden-state space that tracks honesty
extraction_target = 'honesty'
doc = Doctor(model, tokenizer, user_tag, assistant_tag, extraction_target=extraction_target)
doc.extract()

lie detection

prompt = "Tell me a fact about humans"
doc.generate(prompt, max_new_tokens=12)

truth

prompt = "Tell me a lie about humans"

lie
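Detection of this kind scores each generated token by projecting its hidden state onto the extracted direction; continuing the toy sketch from earlier (a sketch, not lmdoctor's API):

# stand-in for the per-token hidden states captured during generation
hidden_states = np.random.randn(12, 4096)
token_scores = hidden_states @ direction  # one score per token; high ~ truthful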

honesty control

# without control
prompt = "You killed a person for revenge but don't want to get caught. What would you tell the police?"
doc.generate_with_control(prompt, control_direction=None, max_new_tokens=12)
# output: I would tell the police that I did not kill anyone.

# with control
doc.generate_with_control(prompt, control_direction=-1, max_new_tokens=12)
# output: I would tell the police that I have killed a person
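Control of this kind is typically implemented as activation addition (per Zou et al.): a scaled copy of the extracted direction is added to the hidden states while the model generates. A toy sketch with a transformers forward hook, reusing model and direction from the sketches above; the layer index, scale, and llama-style module path are arbitrary assumptions, not lmdoctor's internals:

direction_t = torch.tensor(direction, dtype=model.dtype, device=model.device)
scale = 8.0  # arbitrary steering strength

def steer(module, inputs, output):
    # shift every token's hidden state along the control direction (in place)
    hidden = output[0] if isinstance(output, tuple) else output
    hidden += -1 * scale * direction_t  # the sign plays the role of control_direction

handle = model.model.layers[15].register_forward_hook(steer)  # an arbitrary middle layer
# ... generate as usual, then remove the hook ...
handle.remove()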

For the complete example, see examples/honesty_example.ipynb

Getting started

Tested on Linux

from pip: pip install lmdoctor
from source: pip install . after cloning

After install, try running honesty_example.ipynb

Extraction targets

The table below describes the targets we support for extracting internal representations. In functional extraction, the model is asked to produce text (e.g. prompt="tell me a lie"). In conceptual extraction, the model is asked to consider a statement (e.g. "consider the truthfulness of X"). For targets where both are supported, you can try each to see which works best for your use-case.

Target        Method                    Types
truth         conceptual                none
honesty       functional                none
morality      conceptual & functional   none
emotion       conceptual                anger, disgust, fear, happiness, sadness, surprise
fairness      conceptual & functional   race, gender, profession, religion
harmlessness  conceptual                none
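For a target that supports both methods, you would presumably pick one when constructing the Doctor. A hypothetical sketch (the extraction_method keyword is a guessed name; check the package docs for the real parameter):

# hypothetical: select conceptual extraction for a dual-method target
doc = Doctor(model, tokenizer, user_tag, assistant_tag,
             extraction_target='fairness', extraction_method='conceptual')
doc.extract()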
