MS2Query - Reliable and fast MS/MS spectral-based analogue search
Contents
Overview
The publication can be found here: https://rdcu.be/c8Hkc
Please cite this article when using MS2Query
MS2Query uses MS2 mass spectral data to find the best match in a library
and is able to search for both analogues and exact matches.
A pretrained library for MS2Query is available based on the GNPS library.
In our benchmarking we show that MS2Query performs better compared to current standards in the field like Cosine Score and the Modified Cosine score.
MS2Query is easy to install (see below) and is scalable to large numbers of MS2 spectra.
Workflow
MS2Query is a tool for MSMS library matching, searching both for analogues and exact matches in one run. The workflow for running MS2Query first uses MS2Deepscore to calculate spectral similarity scores between all library spectra and a query spectrum. By using pre-computed MS2Deepscore embeddings for library spectra, this full-library comparison can be computed very quickly. The top 2000 spectra with the highest MS2Deepscore are selected. In contrast to other analogue search methods, no preselection on precursor m/z is performed. MS2Query optimizes re-ranking the best analogue or exact match at the top by using a random forest that combines 5 features. The random forest predicts a score between 0 and 1 between each library and query spectrum and the highest scoring library match is selected. By using a minimum threshold for this score, unreliable matches are filtered out.
For questions regarding MS2Query please make an issue on github or contact niek.dejonge@wur.nl
Installation guide
Prepare environmnent
We recommend to create an Anaconda environment with
conda create --name ms2query python=3.9
conda activate ms2query
Pip install MS2Query
MS2Query can simply be installed by running:
pip install ms2query
All dependencies are automatically installed, the dependencies can be found in setup.py.
The installation is expected to take about 2 minutes.
MS2Query is tested by continous integration on MacOS, Windows and Ubuntu for python version 3.9 and 3.1
Run MS2Query from command line
Download default library
When running for the first time a pretrained ms2query library should be downloaded.
Change the file locations to the location where the library should be stored.
Change the --ionmode to the needed ionmode (positive or negative)
ms2query --library .\folder_to_store_the_library --download --ionmode positive
Alternatively all model files can be manually downloaded from
https://zenodo.org/record/6124552 for positive mode and
https://zenodo.org/record/7104184 for negative mode.
Preprocessing mass spectra
MS2Query is run on all MS2 spectra in a spectrum file.
MS2Query does not do any peak picking or clustering of similar MS2 spectra.
If your files contain many MS2 spectra per feature it is advised to first reduce the number of MS2 spectra
by clustering or feature selection. There are multiple tools available that do this.
One reliable method is using MZMine for preprocessing, https://mzmine.github.io/mzmine_documentation/index.html.
As input for MS2Query you can use the MGF file of the FBMN output of MZMine,
see https://ccms-ucsd.github.io/GNPSDocumentation/featurebasedmolecularnetworking-with-mzmine2/.
Running MS2Query
After downloading a default library MS2Query can be run on your MS2 spectra.
Run the command below and specify the location where your spectra are stored.
If a spectrum file is specified all spectra in this folder will be processed.
If a folder is specified all spectrum files within this folder will be processed.
The results generated by MS2Query, are stored as csv files in a results directory within the same directory as your query spectra.
ms2query --spectra .\location_of_spectra --library .\library_folder --ionmode positive
To do a test run with dummy data you can download the file
dummy_spectra.mgf.
The expected results can be found in expected_results_dummy_data.csv.
After downloading the library files, running on the dummy data is expected to take less than half a minute.
Run ms2query --help for more info/options, or see below:
usage: MS2Query [-h] [--spectra SPECTRA] --library LIBRARY_FOLDER [--ionmode {positive,negative}] [--download] [--results RESULTS] [--filter_ionmode]
MS2Query is a tool for MSMS library matching, searching both for analogues and exact matches in one run
optional arguments:
-h, --help show this help message and exit
--spectra SPECTRA The MS2 query spectra that should be processed. If a directory is specified all spectrum files in the directory will be processed. Accepted formats are: "mzML", "json", "mgf", "msp", "mzxml", "usi" or a
pickled matchms object
--library LIBRARY_FOLDER
The directory containing the library spectra (in sqlite), models and precalculated embeddings, to download add --download
--ionmode {positive,negative}
Specify the ionization mode used
--download This will download the most up to date model and library.The model will be stored in the folder given as the second argumentThe model will be downloaded in the in the ionization mode specified under --mode
--results RESULTS The folder in which the results should be stored. The default is a new results folder in the folder with the spectra
--filter_ionmode Filter out all spectra that are not in the specified ion-mode. The ion mode can be specified by using --ionmode
--addional_metadata Return additional metadata columns in the results, for example --additional_metadata retention_time feature_id
Interpretation of results
As output a csv file is returned, an example of results can be found in expected_results_dummy_data.csv. For each of your input spectra MS2Query predicts a library match. It is important to check the ms2query_model_prediction column. This column contains a score, which indicates the likelihood that the found match is a good match. This score ranges between 0 and 1, the closer this score is to 1 the more likely that it is a good match/analogue. It is important to use this score to select only the reliable hits, since a prediction is given for each spectrum, regardless of the ms2query score. There is no strict minimum for this score, but the higher the MS2Query model prediction the more likely it is a good match/analogue. It will depend on your research goal, what a good threshold is. If a high recall is important you might want a low threshold and if a high reliability is more important you might want a high threshold. To give a general indication, a score > 0,7 has many good analogues and exact matches. In the range of 0.6-0.7, the results can still be useful, but should be analysed with more caution and results below 0.6 can often best be discarded.
MS2Query does not need two different workflows for searching for analogues and searching for exact matches, it automatically selects the most likely library spectra. If it is important to separate potential exact matches from potential analogues for your research question, the column with the precursor mz difference can be used to separate these results, since exact matches should have no precursor mz difference.
The columns completely to the right are estimated molecular classes based on the molecular structure of the predicted library molecule, these columns can be used to get a quick overview of the kind of compounds that were found.
Build MS2Query into other tools
If you want to incorporate MS2Query into another tool it might be easier to run MS2Query from a python script,
instead of running from the command line. The guide below can be used as a starting point.
Below you can find an example script for running MS2Query.
Before running the script, replace the variables ms2query_library_files_directory
and ms2_spectra_directory
with the correct directories.
This script will first download files for a default MS2Query library.
This default library is trained on the GNPS library from 2021-15-12.
After downloading, a library search and an analog search is performed on the query spectra in your directory (ms2_spectra_directory
).
The results generated by MS2Query, are stored as csv files in a results directory within the same directory as your query spectra.
from ms2query.run_ms2query import download_zenodo_files, run_complete_folder
from ms2query.ms2library import create_library_object_from_one_dir
ms2query_library_files_directory = "./ms2query_library_files"
ms2_spectra_directory =
ion_mode =
download_zenodo_files(ion_mode,
ms2query_library_files_directory)
ms2library = create_library_object_from_one_dir(ms2query_library_files_directory)
run_complete_folder(ms2library, ms2_spectra_directory)
Create your own library (without training new models)
The code below creates all required library files for your own in house library.
No new models for MS2deepscore, Spec2Vec and MS2Query will be trained, to do this see the next section.
First install MS2Query (see above under installation guide)
To create your own library you also need to install RDKit, by running the following in your command line (while in the ms2query conda environment):
conda install -c conda-forge rdkit
It is important that the library spectra are annotated with smiles, inchi's or inchikeys in the metadata otherwise they
are not included in the library.
Fill in the blank spots with the file locations.
The models for spec2vec, ms2deepscore and ms2query can be downloaded from the zenodo links (see above).
import os
from ms2query.create_new_library.library_files_creator import LibraryFilesCreator
from ms2query.clean_and_filter_spectra import clean_normalize_and_split_annotated_spectra
from ms2query.utils import load_matchms_spectrum_objects_from_file, select_files_in_directory
from ms2query.run_ms2query import download_zenodo_files
from ms2query.ms2library import select_files_for_ms2query
spectrum_file_location = "../tests/test_files/general_test_files/100_test_spectra.mgf"
ionisation_mode = "positive"
directory_for_library_and_models = "./ms2query_library"
download_zenodo_files(ionisation_mode, directory_for_library_and_models, only_models=True)
library_spectra = load_matchms_spectrum_objects_from_file(spectrum_file_location)
files_in_directory = select_files_in_directory(directory_for_library_and_models)
dict_with_file_names = select_files_for_ms2query(files_in_directory, ["s2v_model", "ms2ds_model"])
ms2ds_model_file_name = os.path.join(directory_for_library_and_models, dict_with_file_names["ms2ds_model"])
s2v_model_file_name = os.path.join(directory_for_library_and_models, dict_with_file_names["s2v_model"])
cleaned_library_spectra = clean_normalize_and_split_annotated_spectra(library_spectra,
ion_mode_to_keep=ionisation_mode)[0]
library_creator = LibraryFilesCreator(cleaned_library_spectra,
output_directory=directory_for_library_and_models,
ms2ds_model_file_name=ms2ds_model_file_name,
s2v_model_file_name=s2v_model_file_name, )
library_creator.create_all_library_files()
To run MS2Query on your own created library. Check out the instructions under Run MS2Query. Both command line and the code version should work.
Make sure that the downloaded models and the SQLite file, S2V embeddings file and the ms2ds embeddings file,
just generated by you, are in the same directory.
The results will be returned as csv files in a results directory.
An alternative for loading in a ms2library is by specifying each file type manually, which is needed if not all file are in one dir or if the library files or models have unexpected names.
from ms2query.ms2library import MS2Library
from ms2query.run_ms2query import run_complete_folder
ms2_spectra_directory =
ms2library = MS2Library(sqlite_file_name= ,
s2v_model_file_name= ,
ms2ds_model_file_name= ,
pickled_s2v_embeddings_file_name= ,
pickled_ms2ds_embeddings_file_name= ,
ms2query_model_file_name= ,
classifier_csv_file_name= ,
)
run_complete_folder(ms2library, ms2_spectra_directory)
Create your own library and train new models
The code trains new MS2Deepscore, Spec2Vec and MS2Query models for your in house library,
and creates all needed files for running MS2Query.
It is important that the library spectra are annotated with smiles, inchi's or inchikeys in the metadata otherwise they
are not included in the library and training.
Fill in the blank spots below and run the code (can take several days).
The models will be stored in the specified output_folder. MS2Query can be run
from ms2query.create_new_library.train_models import clean_and_train_models
clean_and_train_models(spectrum_file=,
ion_mode=,
output_folder=
)
To run MS2Query on your own created library run the code below (again fill in the blanks).
from ms2query.run_ms2query import run_complete_folder
from ms2query.ms2library import create_library_object_from_one_dir
ms2_spectra_directory =
ms2_library_directory =
ms2library = create_library_object_from_one_dir(ms2_library_directory)
run_complete_folder(ms2library, ms2_spectra_directory)
After running you can run MS2Query on your newly created models and library. See above on how to run MS2Query.
Documentation for developers
Prepare environmnent
We recommend to create an Anaconda environment with
conda create --name ms2query python=3.9
conda activate ms2query
Clone repository
Clone the present repository, e.g. by running
git clone https://github.com/iomega/ms2query.git
And then install the required dependencies, e.g. by running the following from within the cloned directory
pip install -e .
To run all unit tests, to check if everything was installed successfully run:
pytest
Recreate Results Manuscript
To recreate the results in Figure 2 in the MS2Query manuscript download the in between results from https://zenodo.org/record/7427094.
Install MS2Query as described above (no need to download the models) and run the code below. This code should work with version 0.5.7.
from ms2query.benchmarking.create_accuracy_vs_recall_plot import create_plot
base_folder = "./benchmarking_results"
create_plot(exact_matches=True,
positive=False,
recalculate_means=True,
save_figure=False,
base_folder=base_folder
)
Above code only recreates the figures based on the already generated test results. To reproduce the test results from scratch the models have to be retrained. The test split is random and the models trained have a random component to it, so the results could vary sligtly, but the general conclusions from the results are expected to be the same.
From https://zenodo.org/record/7427094 download the ALL_GNPS_NO_PROPOGATED.mgf file to use the same starting data as used for the 20-fold cross-validation, this set was downloaded from https://gnps-external.ucsd.edu/gnpslibrary on 01-11-2022, alternatively a more recent version could be downloaded. To redo the analysis with exactly the same test split as in the Manuscript the test sets can be downloaded from https://zenodo.org/record/7427094 the training data can be constructed by combining the 19 other test sets together for each of the 20 data splits.
If you want to randomly recreate the test splits from scratch run:
from ms2query.benchmarking.k_fold_cross_validation import split_and_store_annotated_unannotated, split_k_fold_cross_validation_analogue_test_set, split_k_fold_cross_validation_exact_match_test_set
spectrum_file_name = "./ALL_GNPS_NO_PROPOGATED.mgf"
split_and_store_annotated_unannotated(spectrum_file_name, ionmode="positive", output_folder="./positive_mode_data_split")
split_and_store_annotated_unannotated(spectrum_file_name, ionmode="negative", output_folder="./negative_mode_data_split")
positive_annotated_spectra = load_matchms_spectrum_objects_from_file("positive_mode_data_split/annotated_training_spectra.pickle")
negative_annotated_spectra = load_matchms_spectrum_objects_from_file("negative_mode_data_split/annotated_training_spectra.pickle")
split_k_fold_cross_validation_analogue_test_set(positive_annotated_spectra, 20, output_folder = "./positive_mode/analogue_test_sets_splits/)
split_k_fold_cross_validation_exact_match_test_set(positive_annotated_spectra, 20, output_folder = "./positive_mode/exact_matches_test_sets_splits/)
split_k_fold_cross_validation_analogue_test_set(negative_annotated_spectra, 20, output_folder = "./negative_mode/analogue_test_sets_splits/)
split_k_fold_cross_validation_exact_match_test_set(negative_annotated_spectra, 20, output_folder = "./negative_mode/exact_matches_test_sets_splits/)
To train the models and to create the test results for MS2Query and all benchmarking methods (cosine, modified cosine and MS2Deepscore) run for each of the test split. So the script should be started 20 times for each type of test split. The running time for this script is a few days, since it trains all models and creates all test results.
from ms2query.benchmarking.k_fold_cross_validation import train_models_and_test_result_from_k_fold_folder
k_fold_split_number = 0
train_models_and_test_result_from_k_fold_folder(
"./benchmarking_test_sets/exact_matches_test_sets_splits/",
k_fold_split_number,
exact_matches=True)
After creating all the results, run the create_plot (the first block of python code) to create the new plots.
Contributing
If you want to contribute to the development of ms2query,
have a look at the contribution guidelines.
License
Copyright (c) 2021, Netherlands eScience Center
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.