obfuscation-detection
Python module for obfuscation classification in command line executions
Command obfuscation is a technique that makes a piece of code intentionally hard to read while preserving its functionality. Attackers often abuse obfuscation to help their malicious software (malware) evade traditional malware detection techniques. This creates a headache for defenders, since there is a virtually infinite number of ways to obfuscate a given piece of malware. Traditional malware detection techniques are often rule-based, which makes them inflexible in the face of new types of malware and obfuscation techniques. Deep learning has been used in various domains to create models that can adapt to new types of information. Our project uses deep learning techniques to detect command obfuscation.
You can install our package through pip!
pip install obfuscation-detection
This is a basic usage of our package:
import obfuscation_detection as od
oc = od.ObfuscationClassifier(od.PlatformType.ALL)
commands = ['cmd.exe /c "echo Invoke-DOSfuscation"',
'cm%windir:~ -4, -3%.e^Xe,;^,/^C",;,S^Et ^^o^=fus^cat^ion&,;,^se^T ^ ^ ^B^=o^ke-D^OS&&,;,s^Et^^ d^=ec^ho I^nv&&,;,C^Al^l,;,^%^D%^%B%^%o^%"',
'cat /etc/passwd']
classifications = oc(commands)
# 1 is obfuscated, 0 is non-obfuscated
print(classifications) # [0, 1, 0]
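As a small follow-on, the sketch below pairs each command with its label and keeps only the ones flagged as obfuscated. It assumes only what the example above shows: the classifier is callable on a list of commands and returns a list of 0/1 labels in the same order. `flag_obfuscated` and `stub_classifier` are hypothetical names used for illustration; the stub stands in for `ObfuscationClassifier` so the snippet runs without the model.

```python
def flag_obfuscated(classifier, commands):
    """Return the commands that the classifier labels 1 (obfuscated)."""
    labels = classifier(commands)
    return [cmd for cmd, label in zip(commands, labels) if label == 1]


def stub_classifier(commands):
    # Hypothetical stand-in, NOT the real model: flags any command containing a caret.
    return [1 if "^" in cmd else 0 for cmd in commands]


if __name__ == "__main__":
    sample = ["cat /etc/passwd", "c^m^d /c echo hi"]
    print(flag_obfuscated(stub_classifier, sample))
```

With the real package, passing the `oc` object from the example above in place of `stub_classifier` should work the same way, provided it keeps returning one label per command.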
The input to the model is a single command from the command line. We represent a command character by character. Each character is a one-hot vector: every value is 0 except at one index, and that index identifies the character. We also include an extra case bit to differentiate between uppercase and lowercase characters. We computed character frequencies over our dataset and kept the 73 most common characters. With the case bit, each character vector is 74-dimensional. Each command is represented by its first 4096 characters: longer commands are truncated, and shorter commands are padded with zeros. Therefore, the input to our model is a 74x4096 matrix.
Below is a simplified illustration of the input matrix, where the vertical axis represents the command and the horizontal axis represents the one-hot encoding.
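The encoding described above can be sketched in plain Python. This is a minimal illustration, not the package's implementation: `build_char_index` and `encode_command` are hypothetical names, and the choices to lowercase characters before counting, to leave unknown characters as all-zero columns, and to set the case bit only for known characters are assumptions.

```python
from collections import Counter


def build_char_index(commands, top_k=73):
    """Map the top_k most frequent (lowercased) characters to row indices."""
    counts = Counter(ch.lower() for cmd in commands for ch in cmd)
    return {ch: row for row, (ch, _) in enumerate(counts.most_common(top_k))}


def encode_command(command, char_index, max_len=4096):
    """Encode one command as a (len(char_index) + 1) x max_len matrix of 0/1."""
    case_row = len(char_index)  # extra row that holds the case bit
    matrix = [[0] * max_len for _ in range(case_row + 1)]
    for pos, ch in enumerate(command[:max_len]):  # truncate long commands
        row = char_index.get(ch.lower())
        if row is None:
            continue  # assumption: characters outside the alphabet stay all-zero
        matrix[row][pos] = 1
        if ch.isupper():
            matrix[case_row][pos] = 1
    return matrix  # columns past len(command) remain zero (padding)


if __name__ == "__main__":
    index = build_char_index(["echo Hi", "cat /etc/passwd"], top_k=5)
    encoded = encode_command("Echo", index, max_len=8)
    print(len(encoded), len(encoded[0]))  # rows, columns
```

With a 73-character index and the default `max_len`, the result has 74 rows and 4096 columns, matching the input shape described above.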
Our model is a character-level deep convolutional neural network (CNN). What does this mean? Let's look at the first layer, turning our input matrix into a convolutional layer (conv layer). We look at a few characters that are close to each other, multiply some weights onto these characters, and come out with a resulting vector. In the image below, we first look at 3 characters depicted by the left red box. We multiply these 3 characters by the kernel weight vector and it results in the right red box vector. We continue this process for all 3-character blocks next to each other, depicted by the purple box. We stack all these resulting vectors to form a matrix that results in our 1st conv layer.
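To make the sliding-window step concrete, here is a minimal pure-Python sketch of a 1-D convolution over a channels x length matrix. It is only an illustration under simplifying assumptions (no bias, no padding, no activation function, stride 1); `conv1d` is a hypothetical helper, and the real model would use a deep learning framework's convolution layer instead.

```python
def conv1d(matrix, filters, width=3):
    """Slide `width`-character windows over a channels x length matrix.

    `filters` is a list of kernels, each shaped channels x width.
    Returns one row per window; each row holds one value per filter.
    """
    channels, length = len(matrix), len(matrix[0])
    rows = []
    for start in range(length - width + 1):
        row = []
        for kernel in filters:
            total = 0.0
            for c in range(channels):
                for k in range(width):
                    total += kernel[c][k] * matrix[c][start + k]
            row.append(total)
        rows.append(row)  # one output vector per 3-character block
    return rows


if __name__ == "__main__":
    toy_input = [[1, 0, 1, 0],
                 [0, 1, 0, 1]]           # 2 channels, 4 characters
    one_filter = [[[1, 1, 1],
                   [0, 0, 0]]]           # sums the first channel over each window
    print(conv1d(toy_input, one_filter))
```

Each returned row corresponds to one 3-character block, and stacking the rows gives the matrix described as the 1st conv layer.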
The 1st conv layer now contains a matrix of vectors, where each row carries the semantic meaning of 3 characters. We continue applying convolutions, thereby increasing the "window size" that each row in the matrix sees. If we apply one more layer of convolutions to our example, the next conv layer (conv layer 2) will contain rows that each carry the semantic meaning of 5 characters. The higher the layer, the bigger the window size, and the bigger the window size, the more semantic meaning each row of the conv layer can carry.
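The growth of the window size can be checked with a few lines of arithmetic. The sketch below assumes every layer uses the same kernel size with stride 1 and no dilation, which matches the 3-then-5 example above; `receptive_field` is a hypothetical helper name.

```python
def receptive_field(num_layers, kernel_size=3):
    """Number of input characters seen by one row after `num_layers` conv layers."""
    field = 1
    for _ in range(num_layers):
        field += kernel_size - 1  # each stride-1 layer widens the window by k - 1
    return field


if __name__ == "__main__":
    for layer in range(1, 5):
        print(layer, receptive_field(layer))
```

Under these assumptions the window grows linearly, by 2 characters per layer for a kernel size of 3.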
We apply this process of convolutions with weights to extract features from the input. After a few layers of convolutions, we make a decision by averaging the CNN's output (the final conv layer). This average is then run through a final fully connected (FC) layer to produce the output, a 2-dimensional vector. The first dimension is the model's score for how non-obfuscated the command is; the second is its score for how obfuscated the command is. We pick whichever dimension is larger to decide whether or not the command is obfuscated. For example, a prediction that the command is not obfuscated is <1, 0>, while a prediction that the command is obfuscated is <0, 1>.
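The decision step (average, FC layer, pick the larger score) can be sketched as follows. This is an illustrative pure-Python version with made-up weights, not the trained model; `classify`, `fc_weights`, and `fc_bias` are hypothetical names.

```python
def classify(features, fc_weights, fc_bias):
    """Average the conv output over positions, apply an FC layer, pick the larger score.

    `features` has one row per position; `fc_weights` has one row per output class.
    Returns (scores, label) where label 0 = non-obfuscated and 1 = obfuscated.
    """
    n = len(features)
    width = len(features[0])
    pooled = [sum(row[j] for row in features) / n for j in range(width)]
    scores = [
        sum(w * x for w, x in zip(weights, pooled)) + bias
        for weights, bias in zip(fc_weights, fc_bias)
    ]
    label = max(range(len(scores)), key=lambda i: scores[i])  # index of the larger score
    return scores, label


if __name__ == "__main__":
    toy_features = [[1.0, 3.0], [3.0, 5.0]]   # 2 positions, 2 feature channels
    identity_fc = [[1.0, 0.0], [0.0, 1.0]]    # made-up weights for illustration
    print(classify(toy_features, identity_fc, [0.0, 0.0]))
```

Taking the index of the larger score rather than its value is what turns the 2-dimensional output into the 0/1 label returned by the classifier.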
Our model also has aspects of a ResNet. Since we are using this CNN for a language task, we found it natural to apply to CNNs the same types of methods used in RNNs. Now you might ask, why not just use RNNs? Well, in a nutshell, CNNs are faster than RNNs. CNNs can compute in parallel, while an RNN must process the previous step of the sequence before it can calculate the current one. Since our task is character-level, we don't need to treat it as a sequence task. Therefore, we believe CNNs better capture the semantics of this task.
So, what ResNet components are there? Here they are:
Upgrading our model from a plain CNN to a CNN + ResNet gave us much better performance!
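The specific ResNet components are not reproduced here, but the defining idea of any ResNet is the residual (skip) connection, where a block's input is added back to its output. The sketch below illustrates only that general idea; `residual_block` and the toy transform are hypothetical and do not describe this model's actual layers.

```python
def residual_block(x, transform):
    """Return x + transform(x), the skip connection at the heart of a ResNet."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("transform must preserve the vector length")
    return [xi + fi for xi, fi in zip(x, fx)]


if __name__ == "__main__":
    halve = lambda vec: [0.5 * v for v in vec]  # stand-in for conv layers
    print(residual_block([1.0, 2.0], halve))
```

Because the input is carried through unchanged, a block whose transform outputs zeros behaves as the identity, which is part of why residual networks are easier to train when stacked deep.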
Overall, our model performs very well on Windows and Linux commands!
mkdir data
mkdir data/prep
mkdir data/processed_tensors
mkdir data/scripts
mkdir models
Download the PowerShell Corpus: https://aka.ms/PowerShellCorpus. Unzip the file into the data directory.
Download PS corpus labels: https://github.com/danielbohannon/Revoke-Obfuscation. Clone the repo into the same parent directory as this repo. The labels are found in the DataScience folder.
Download DOSfuscated commands: https://github.com/danielbohannon/Invoke-DOSfuscation/tree/master/Samples. Download the four STATIC_#_of_4_* files and put them inside the data directory.
We also use internal Adobe command line executions as the bulk of our training data. We may not release this dataset to the public, so we encourage you to either omit this dataset or find an open-source command prompt dataset on the internet.
cd data-prep, then run the following Python scripts in order:
python char_frequency.py: creates a .txt file of each character's frequency in the dataset
python char_dict.py: creates a Python dict of the most common characters mapped to a numeric index
python ps_data_preprocess.py: creates the dataset for the PowerShell corpus
python dos_data_preprocess.py: creates the dataset for the DOSfuscated commands
python hubble_data_preprocess.py and python cb_data_preprocess.py: create the datasets for internal Adobe data. Replace this step with other data you may find on the internet.
python win_data_preprocess.py: creates the train/dev/test tensor split by accumulating all Windows data
python linux_obf_data_preprocess.py: loads internal Adobe Linux commands and obfuscates them
python linux_data_preprocess.py: creates the train/dev/test tensor split with all Linux data
python all_data_preprocess.py: combines the Windows and Linux train/dev/test sets into one big train/dev/test dataset
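Since the scripts above must run in a fixed order, a small driver can save some typing. The sketch below is a convenience wrapper, not part of the project: `PREP_SCRIPTS`, `build_commands`, and `run_pipeline` are hypothetical names, and it assumes the scripts take no arguments and are run from the data-prep directory as described.

```python
import subprocess
import sys

# Script order copied from the list above.
PREP_SCRIPTS = [
    "char_frequency.py",
    "char_dict.py",
    "ps_data_preprocess.py",
    "dos_data_preprocess.py",
    "hubble_data_preprocess.py",    # internal Adobe data; replace with your own
    "cb_data_preprocess.py",        # internal Adobe data; replace with your own
    "win_data_preprocess.py",
    "linux_obf_data_preprocess.py",
    "linux_data_preprocess.py",
    "all_data_preprocess.py",
]


def build_commands(scripts=PREP_SCRIPTS, python=sys.executable):
    """Return one argv list per script, in pipeline order."""
    return [[python, script] for script in scripts]


def run_pipeline(cwd="data-prep", scripts=PREP_SCRIPTS):
    """Run each script in order, stopping at the first failure."""
    for cmd in build_commands(scripts):
        subprocess.run(cmd, cwd=cwd, check=True)
```

Calling `run_pipeline()` from the repository root would run every step; passing a shorter `scripts` list lets you skip the steps that depend on data you do not have.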
cd ../oddl, then run python main.py with the given options!
main.py:
--model - choose a model architecture
--model-file - filename of model checkpoint
--cuda-device - which cuda device to use
--reset - start training from scratch
--eval - evaluate on best checkpoint on train/dev
--analyze - create fp/fn files from dev
--run - run on real scripts in the test-scripts dir
--test - run model on test set
models.py: contains the different model architectures we experimented with
Training model:
python main.py --model resnet --model_file resnet.pth --reset
python main.py --model resnet --model_file resnet.pth
Eval model: python main.py --model resnet --model_file resnet.pth --eval
Test model: python main.py --model resnet --model_file resnet.pth --test
Running model on new data in test-scripts: python main.py --model resnet --model_file resnet.pth --run
We have included our best model in models/best-resnet.pth!
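For readers who want to script against these options, here is a hypothetical sketch of how such flags could be declared with `argparse`. It is not the project's actual main.py: treating `--reset`, `--eval`, `--analyze`, `--run`, and `--test` as boolean switches is an assumption based on how the commands above use them, and accepting both `--model-file` and `--model_file` is an assumption made because both spellings appear above.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the option list above (not the real main.py)."""
    parser = argparse.ArgumentParser(description="Sketch of the main.py options")
    parser.add_argument("--model", help="choose a model architecture")
    # Assumption: accept both spellings, since the option list and the
    # example commands differ in hyphen vs. underscore.
    parser.add_argument("--model-file", "--model_file", dest="model_file",
                        help="filename of model checkpoint")
    parser.add_argument("--cuda-device", dest="cuda_device",
                        help="which cuda device to use")
    parser.add_argument("--reset", action="store_true",
                        help="start training from scratch")
    parser.add_argument("--eval", action="store_true",
                        help="evaluate on best checkpoint on train/dev")
    parser.add_argument("--analyze", action="store_true",
                        help="create fp/fn files from dev")
    parser.add_argument("--run", action="store_true",
                        help="run on real scripts in the test-scripts dir")
    parser.add_argument("--test", action="store_true",
                        help="run model on test set")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--model", "resnet", "--model_file", "resnet.pth", "--reset"]
    )
    print(args.model, args.model_file, args.reset)
```

If the real script defines only one spelling of the checkpoint option, use that spelling on the command line.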
Contributions are welcomed! Read the Contributing Guide for more information.
This project is licensed under the Apache V2 License. See LICENSE for more information.
FAQs
We found that obfuscation-detection demonstrated a healthy version release cadence and project activity because the last version was released less than a year ago. It has 1 open source maintainer collaborating on the project.