Huge News!Announcing our $40M Series B led by Abstract Ventures.Learn More
Socket
Sign inDemoInstall
Socket

somadata

Package Overview
Dependencies
Maintainers
1
Alerts
File Explorer

Advanced tools

Socket logo

Install Socket

Detect and block malicious and high-risk dependencies

Install

somadata

SomaLogic Python Data Input/Output Library

  • 1.1.0
  • PyPI
  • Socket score

Maintainers
1

The Python SomaData Package from Somalogic, Inc.

License:
MIT PyPI Downloads


Overview

This document accompanies the Python package somadata, which loads the SomaLogic, Inc. structured text data file called an *.adat. The somadata.Adat object is an extension of the pandas.DataFrame class. The package provides auxiliary functions for extracting relevant information from the ADAT object once in the Python environment. Basic familiarity with the Python environment is assumed, as is the ability to install contributed packages from the Python Package Installer (pip)


Table of Contents:

  1. Installation
  2. Basic Use
  3. Reading ADAT text files
  4. Wrangling Data
  5. Adding Metadata
  6. Slicing Data
  7. SomaScan Version Lifting
  8. Writing an ADAT text file
  9. Example Data Analysis

Installation

The easiest way to install SomaData is to install directly from PyPI

PIP:

pip install SomaData

Alternatively one can install from the GitHub repository.

GitHub:

pip install git+https://github.com/SomaLogic/Canopy.git#egg=somadata

Alternatively, if you wish to develop or change the source code, you may clone the repository and install manually via:

git clone https://github.com/SomaLogic/Canopy.git
pip install -e ./somadata

Dependencies

Python >=3.8 is required to install somadata. The following package dependencies are installed on a pip install:

  • pandas >= 1.1.0
  • numpy >= 1.19.1

return to top

Basics

Upon installation, load somadata as normal:

return to top

import somadata

For a traversable index of the library:

help(somadata)
# help(somadata.adat) ... etc
Help on package somadata:

NAME
    somadata

PACKAGE CONTENTS
    adat
    annotations
    base (package)
    data (package)
    errors
    io (package)
    tools (package)

FILE
    /Users/tjohnson/code/repos/SomaData/somadata/__init__.py

Internal Objects

The somadata package comes with one internal object available to users to run canned examples (or analyses). It can be accessed by perform the import:

  • from somadata.data.example_data import example_data

Main Features (I/O)

  • Loading data (Import)
    • Import a text file in the *.adat format into a Python session as an adat object.
  • Wrangling data (Manipulation)
    • Subset, reorder, and list various fields of an adat object.
  • Exporting data (Output)
    • Write out an adat object as a *.adat text file.

Loading an ADAT

return to top

Loading the sample file from within the somadata library via its path

adat = somadata.read_adat('./somadata/data/example_data.adat')
type(adat)
somadata.adat.Adat
adat.shape
(192, 5284)
adat.columns
MultiIndex([( '10000-28', '3', 'SL019233', ...),
            (  '10001-7', '3', 'SL002564', ...),
            ( '10003-15', '3', 'SL019245', ...),
            ( '10006-25', '3', 'SL019228', ...),
            ( '10008-43', '3', 'SL019234', ...),
            ( '10011-65', '3', 'SL019246', ...),
            (  '10012-5', '3', 'SL014669', ...),
            ( '10013-34', '3', 'SL025418', ...),
            ( '10014-31', '3', 'SL007803', ...),
            ('10015-119', '3', 'SL014924', ...),
            ...
            (  '9981-18', '3', 'SL018293', ...),
            (  '9983-97', '3', 'SL019202', ...),
            (  '9984-12', '3', 'SL019205', ...),
            (  '9986-14', '3', 'SL005356', ...),
            (  '9989-12', '3', 'SL019194', ...),
            (  '9993-11', '3', 'SL019212', ...),
            ( '9994-217', '3', 'SL019217', ...),
            (   '9995-6', '3', 'SL013164', ...),
            (  '9997-12', '3', 'SL019215', ...),
            (   '9999-1', '3', 'SL019231', ...)],
           names=['SeqId', 'SeqIdVersion', 'SomaId', 'TargetFullName', 'Target', 'UniProt', 'EntrezGeneID', 'EntrezGeneSymbol', 'Organism', 'Units', 'Type', 'Dilution', 'PlateScale_Reference', 'CalReference', 'Cal_Example_Adat_Set001', 'ColCheck', 'CalQcRatio_Example_Adat_Set001_170255', 'QcReference_170255', 'Cal_Example_Adat_Set002', 'CalQcRatio_Example_Adat_Set002_170255'], length=5284)
from IPython.display import HTML
#Display the first five rows and columns of the adat
HTML(adat.iloc[:5,:5].to_html()) # Need to use HTML & to_html() here to display nicely for this README
# Output is left-right scrollable in both this readme and Jupyter notebooks
SeqId10000-2810001-710003-1510006-2510008-43
SeqIdVersion33333
SomaIdSL019233SL002564SL019245SL019228SL019234
TargetFullNameBeta-crystallin B2RAF proto-oncogene serine/threonine-protein kinaseZinc finger protein 41ETS domain-containing protein Elk-1Guanylyl cyclase-activating protein 1
TargetCRBB2c-RafZNF41ELK1GUC1A
UniProtP43320P04049P51814P19419P43080
EntrezGeneID14155894759220022978
EntrezGeneSymbolCRYBB2RAF1ZNF41ELK1GUCA1A
OrganismHumanHumanHumanHumanHuman
UnitsRFURFURFURFURFU
TypeProteinProteinProteinProteinProtein
Dilution20200.52020
PlateScale_Reference687.4227.8126.9634.2585.0
CalReference687.4227.8126.9634.2585.0
Cal_Example_Adat_Set0011.012520251.016057090.950561800.996073500.94051447
ColCheckPASSPASSPASSPASSPASS
CalQcRatio_Example_Adat_Set001_1702551.0080.9701.0461.0421.036
QcReference_170255505.4223.9119.6667.2587.5
Cal_Example_Adat_Set0021.014762331.036868461.152588560.935812310.96201283
CalQcRatio_Example_Adat_Set002_1702551.0671.0070.9811.0260.998
PlateIdPlateRunDateScannerIDPlatePositionSlideIdSubarraySampleIdSampleTypePercentDilutionSampleMatrixBarcodeBarcode2dSampleNameSampleNotesAliquotingNotesSampleDescriptionAssayNotesTimePointExtIdentifierSsfExtIdSampleGroupSiteIdTubeUniqueIDCLIHybControlNormScaleRowCheckNormScale_20NormScale_0_005NormScale_0_5ANMLFractionUsed_20ANMLFractionUsed_0_005ANMLFractionUsed_0_5AgeSex
Example Adat Set0012020-06-18SG15214400H925849580001231Sample20Plasma-PPT0.98185998PASS1.036935800.857016240.777174910.9140.8690.90376F476.5310.1100.3602.8561.8
H825849580000472Sample20Plasma-PPT0.96671829PASS0.960225050.848584200.852019530.9370.9560.97355F474.4293.5101.8561.9541.9
H725849580001083Sample20Plasma-PPT1.00193072PASS0.984116171.032701560.915191530.9070.9190.91547M415.6299.63030.1563.9423.9
H625849580000344Sample20Plasma-PPT0.94017961PASS1.078398780.946268410.912467310.9340.9190.91237M442.6247.9112.9563.7469.8
H525849580000945Sample20Plasma-PPT0.94621098PASS0.846794460.929045530.774130560.7070.8940.70871F465.7710.795.9791.0443.5

You may also access the dict header metadata via:

adat.header_metadata
{'!AdatId': 'GID-1234-56-789-abcdef',
 '!Version': '1.2',
 '!AssayType': 'PharmaServices',
 '!AssayVersion': 'V4',
 '!AssayRobot': 'Fluent 1 L-307',
 '!Legal': 'Experiment details and data have been processed to protect Personally Identifiable Information (PII) and comply with existing privacy laws.',
 '!CreatedBy': 'PharmaServices',
 '!CreatedDate': '2020-07-24',
 '!EnteredBy': 'Technician1',
 '!ExpDate': '2020-06-18, 2020-07-20',
 '!GeneratedBy': 'Px (Build:  : ), Canopy_0.1.1',
 '!RunNotes': "2 columns ('Age' and 'Sex') have been added to this ADAT. Age has been randomly increased or decreased by 1-2 years to protect patient information",
 '!ProcessSteps': 'Raw RFU, Hyb Normalization, medNormInt (SampleId), plateScale, Calibration, anmlQC, qcCheck, anmlSMP',
 '!ProteinEffectiveDate': '2019-08-06',
 '!StudyMatrix': 'EDTA Plasma',
 '!PlateType': '',
 '!LabLocation': 'SLUS',
 '!StudyOrganism': '',
 '!Title': 'Example Adat Set001, Example Adat Set002',
 '!AssaySite': 'SW',
 '!CalibratorId': '170261',
 '!ReportConfig': {'analysisSteps': [{'stepType': 'hybNorm',
    'referenceSource': 'intraplate',
    'includeSampleTypes': ['QC', 'Calibrator', 'Buffer']},
   {'stepName': 'medNormInt',
    'stepType': 'medNorm',
    'includeSampleTypes': ['Calibrator', 'Buffer'],
    'referenceSource': 'intraplate',
    'referenceFields': ['SampleId']},
   {'stepType': 'plateScale',
    'referenceSource': 'Reference_v4_Plasma_Calibrator_170261'},
   {'stepType': 'calibrate',
    'referenceSource': 'Reference_v4_Plasma_Calibrator_170261'},
   {'stepName': 'anmlQC',
    'stepType': 'ANML',
    'effectSizeCutoff': 2.0,
    'minFractionUsed': 0.3,
    'includeSampleTypes': ['QC'],
    'referenceSource': 'Reference_v4_Plasma_ANML'},
   {'stepType': 'qcCheck',
    'QCReferenceSource': 'Reference_v4_Plasma_QC_ANML_170255',
    'tailsCriteriaLower': 0.8,
    'tailsCriteriaUpper': 1.2,
    'tailThreshold': 15.0,
    'QCAdditionalReferenceSources': ['Reference_v4_Plasma_QC_ANML_170259',
     'Reference_v4_Plasma_QC_ANML_170260'],
    'prenormalized': True},
   {'stepName': 'anmlSMP',
    'stepType': 'ANML',
    'effectSizeCutoff': 2.0,
    'minFractionUsed': 0.3,
    'includeSampleTypes': ['Sample'],
    'referenceSource': 'Reference_v4_Plasma_ANML'}],
  'qualityReports': ['SQS Report'],
  'filter': {'proteinEffectiveDate': '2019-08-06'}},
 'HybNormReference': 'intraplate',
 'MedNormReference': 'intraplate',
 'NormalizationAlgorithm': 'ANML',
 'PlateScale_ReferenceSource': 'Reference_v4_Plasma_Calibrator_170261',
 'PlateScale_Scalar_Example_Adat_Set001': '1.08091554',
 'PlateScale_PassFlag_Example_Adat_Set001': 'PASS',
 'CalibrationReference': 'Reference_v4_Plasma_Calibrator_170261',
 'CalPlateTailPercent_Example_Adat_Set001': '0.1',
 'PlateTailPercent_Example_Adat_Set001': '1.2',
 'PlateTailTest_Example_Adat_Set001': 'PASS',
 'PlateScale_Scalar_Example_Adat_Set002': '1.09915270',
 'PlateScale_PassFlag_Example_Adat_Set002': 'PASS',
 'CalPlateTailPercent_Example_Adat_Set002': '2.6',
 'PlateTailPercent_Example_Adat_Set002': '4.2',
 'PlateTailTest_Example_Adat_Set002': 'PASS'}

SomaData's Adat object inherits the pandas printing methods which displays nicely in Jupyter Notebooks when using IPython.display.display().

Wrangling

Dataframe columns Contain Feature Information

return to top

# Using the `adat` loaded from above
aptamer_df = adat.columns.to_frame(index=False)
type(aptamer_df)
pandas.core.frame.DataFrame
HTML(aptamer_df.head(5).to_html())
SeqIdSeqIdVersionSomaIdTargetFullNameTargetUniProtEntrezGeneIDEntrezGeneSymbolOrganismUnitsTypeDilutionPlateScale_ReferenceCalReferenceCal_Example_Adat_Set001ColCheckCalQcRatio_Example_Adat_Set001_170255QcReference_170255Cal_Example_Adat_Set002CalQcRatio_Example_Adat_Set002_170255
010000-283SL019233Beta-crystallin B2CRBB2P433201415CRYBB2HumanRFUProtein20687.4687.41.01252025PASS1.008505.41.014762331.067
110001-73SL002564RAF proto-oncogene serine/threonine-protein kinasec-RafP040495894RAF1HumanRFUProtein20227.8227.81.01605709PASS0.970223.91.036868461.007
210003-153SL019245Zinc finger protein 41ZNF41P518147592ZNF41HumanRFUProtein0.5126.9126.90.95056180PASS1.046119.61.152588560.981
310006-253SL019228ETS domain-containing protein Elk-1ELK1P194192002ELK1HumanRFUProtein20634.2634.20.99607350PASS1.042667.20.935812311.026
410008-433SL019234Guanylyl cyclase-activating protein 1GUC1AP430802978GUCA1AHumanRFUProtein20585.0585.00.94051447PASS1.036587.50.962012830.998

Accessing feature data

The .to_frame() method creates a lookup table that links the feature names in the adat object to the annotation data in columns:

col_df = adat.columns.to_frame(index=False)
type(col_df)
pandas.core.frame.DataFrame
HTML(col_df.head(5).to_html())
SeqIdSeqIdVersionSomaIdTargetFullNameTargetUniProtEntrezGeneIDEntrezGeneSymbolOrganismUnitsTypeDilutionPlateScale_ReferenceCalReferenceCal_Example_Adat_Set001ColCheckCalQcRatio_Example_Adat_Set001_170255QcReference_170255Cal_Example_Adat_Set002CalQcRatio_Example_Adat_Set002_170255
010000-283SL019233Beta-crystallin B2CRBB2P433201415CRYBB2HumanRFUProtein20687.4687.41.01252025PASS1.008505.41.014762331.067
110001-73SL002564RAF proto-oncogene serine/threonine-protein kinasec-RafP040495894RAF1HumanRFUProtein20227.8227.81.01605709PASS0.970223.91.036868461.007
210003-153SL019245Zinc finger protein 41ZNF41P518147592ZNF41HumanRFUProtein0.5126.9126.90.95056180PASS1.046119.61.152588560.981
310006-253SL019228ETS domain-containing protein Elk-1ELK1P194192002ELK1HumanRFUProtein20634.2634.20.99607350PASS1.042667.20.935812311.026
410008-433SL019234Guanylyl cyclase-activating protein 1GUC1AP430802978GUCA1AHumanRFUProtein20585.0585.00.94051447PASS1.036587.50.962012830.998

Display features

adat.columns.get_level_values('SeqId')[:20] # first 20 features
Index(['10000-28', '10001-7', '10003-15', '10006-25', '10008-43', '10011-65',
       '10012-5', '10013-34', '10014-31', '10015-119', '10021-1', '10022-207',
       '10023-32', '10024-44', '10030-8', '10034-16', '10035-6', '10036-201',
       '10037-98', '10040-63'],
      dtype='object', name='SeqId')

Get # Features

adat.shape[1]
5284

Clinical Data

Dataframe index Contains Sample Information

# Using the `adat` loaded from above
sample_df = adat.index.to_frame(index=False)
type(sample_df)
pandas.core.frame.DataFrame
HTML(sample_df.head(5).to_html())
PlateIdPlateRunDateScannerIDPlatePositionSlideIdSubarraySampleIdSampleTypePercentDilutionSampleMatrixBarcodeBarcode2dSampleNameSampleNotesAliquotingNotesSampleDescriptionAssayNotesTimePointExtIdentifierSsfExtIdSampleGroupSiteIdTubeUniqueIDCLIHybControlNormScaleRowCheckNormScale_20NormScale_0_005NormScale_0_5ANMLFractionUsed_20ANMLFractionUsed_0_005ANMLFractionUsed_0_5AgeSex
0Example Adat Set0012020-06-18SG15214400H925849580001231Sample20Plasma-PPT0.98185998PASS1.036935800.857016240.777174910.9140.8690.90376F
1Example Adat Set0012020-06-18SG15214400H825849580000472Sample20Plasma-PPT0.96671829PASS0.960225050.848584200.852019530.9370.9560.97355F
2Example Adat Set0012020-06-18SG15214400H725849580001083Sample20Plasma-PPT1.00193072PASS0.984116171.032701560.915191530.9070.9190.91547M
3Example Adat Set0012020-06-18SG15214400H625849580000344Sample20Plasma-PPT0.94017961PASS1.078398780.946268410.912467310.9340.9190.91237M
4Example Adat Set0012020-06-18SG15214400H525849580000945Sample20Plasma-PPT0.94621098PASS0.846794460.929045530.774130560.7070.8940.70871F

Modifying Metadata

The Adat index and columns are pandas.MultiIndex objects. Several convenience functions exist to help you modify these objects. Typically, the row metadata (axis=0) represents data describing the sample or the individual from whom the sample was collected. The column metadata (axis=1) contains data regarding the SOMAmer reagent, the reagent's target and scalars applied to columns during normalization, these columns are not usually edited by the end user but can be using the same methods demonstrated on row metadata below.

return to top

Add Row Metadata

Row metadata is sample level information which could include added clinical data like age, sex or clinical measurements. The Adat class facilitates managing this data.

# using ittertools to simulate some metadata:
from itertools import cycle, islice
import pandas as pd

# for demonstration we will simulate two types of metadata.  Metadata stored in a list and metadata stored with key-value pairs.
metadata_list = list(islice(cycle(['A', 'B', 'X', 'Y']), adat.shape[0]))
metadata_dictionary = {k:v for k, v in zip(adat.index.get_level_values('SampleId').to_list(), metadata_list)}
Add unlabeled metadata

You might do this if you know your metadata and Adat are ordered the same way but you are not using a shared key.

return to top

new_meta_adat = adat.insert_meta(0,'GroupData', metadata_list)
# this will produce a new Adat file with group data in the right most column of the index
new_meta_adat.index.to_frame(index=False).loc[0:6, ['PlateId', 'SampleId', 'GroupData']]
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
PlateIdSampleIdGroupData
0Example Adat Set0011A
1Example Adat Set0012B
2Example Adat Set0013X
3Example Adat Set0014Y
4Example Adat Set0015A
5Example Adat Set0016B
6Example Adat Set0017X
Add Keyed Metadata

You might have data coming in as key value pairs from another data source. In that case it is easier to insert metadata using those keys:

return to top

# The arguments are `axis` 0 for row metadata, 1 for column metadata, `name` the name of the new index,
#`key_meta_name` the nameo of the axis used to match the keys. `values_dict` the dictionary containing the new data
new_meta_adat = adat.insert_keyed_meta(0,'GroupData', 'SampleId', metadata_dictionary)
# this will produce a new Adat file with group data in the right most column of the index
new_meta_adat.index.to_frame(index=False).loc[0:6, ['PlateId', 'SampleId', 'GroupData']]
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
PlateIdSampleIdGroupData
0Example Adat Set0011A
1Example Adat Set0012B
2Example Adat Set0013X
3Example Adat Set0014Y
4Example Adat Set0015A
5Example Adat Set0016B
6Example Adat Set0017X
Replace Metadata with Unlabeled Metadata

You might do this if you know your metadata and Adat are ordered the same way but you are not using a shared key.

return to top

new_meta_adat = adat.replace_meta(0,'SampleName', metadata_list)
# this will produce a new Adat file with group data in the right most column of the index
new_meta_adat.index.to_frame(index=False).loc[0:6, ['PlateId', 'SampleId', 'SampleName']]

.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
PlateIdSampleIdSampleName
0Example Adat Set0011A
1Example Adat Set0012B
2Example Adat Set0013X
3Example Adat Set0014Y
4Example Adat Set0015A
5Example Adat Set0016B
6Example Adat Set0017X

Replace Metadata with Keyed Metadata

You might need to replace metadata based on another document using key-value pairs.

return to top

#Here we replace the values in `SampleName` with the values from `metadata_dictionary`
new_meta_adat = adat.replace_keyed_meta(0,'SampleName', metadata_dictionary, 'SampleId')
# this will produce a new Adat file with group data in the right most column of the index
new_meta_adat.index.to_frame(index=False).loc[0:6, ['PlateId', 'SampleId', 'SampleName']]
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
PlateIdSampleIdSampleName
0Example Adat Set0011A
1Example Adat Set0012B
2Example Adat Set0013X
3Example Adat Set0014Y
4Example Adat Set0015A
5Example Adat Set0016B
6Example Adat Set0017X

Math

You may perform mathematical transformations on the feature data via apply or calling those functions and passing the entire dataframe.

import numpy as np
# Using the `adat` loaded from above
log10_adat = adat.apply(np.log10)  # equivalent to `np.log10(adat)`
rounded_adat = adat.apply(round, args=[5,])  # equivalent to `round(adat, 5)`
sqrt_adat = adat.apply(np.sqrt)  # equivlane to `np.sqrt(adat)`

Subsetting/Slicing the Dataframe

You may extract certain subgroups of samples and/or features. SomaData augments the pandas dataframe with a number of helper functions to aid the user.

return to top

# Extract rows of only calibrator-type samples
calibrator_adat = adat.pick_on_meta(axis=0, name='SampleType', values=['Calibrator'])

# Exclude calibrator-type samples
non_calibrator_adat = adat.exclude_on_meta(axis=0, name='SampleType', values=['Calibrator'])

# Extract columns containing features that start with 'MMP'
target_names = adat.columns.get_level_values('Target')
mmp_names = [target for target in target_names if target.startswith('MMP')]
mmp_adat = adat.pick_on_meta(axis=1, name='Target', values=mmp_names)
mmp_adat
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead tr th { text-align: left; } .dataframe thead tr:last-of-type th { text-align: right; }
SeqId15419-152579-172788-552789-262838-534160-494496-604924-324925-545002-765268-496425-878479-49172-699719-145
SeqIdVersion351211212133333
SomaIdSL012374SL000527SL000524SL000525SL003332SL000124SL000522SL000521SL000523SL002646SL003331SL007616SL000645SL000526SL003331
TargetFullNameMatrix metalloproteinase-20Matrix metalloproteinase-9Stromelysin-1MatrilysinMatrix metalloproteinase-1772 kDa type IV collagenaseMacrophage metalloelastaseInterstitial collagenaseCollagenase 3Matrix metalloproteinase-14Matrix metalloproteinase-16Matrix metalloproteinase-19Stromelysin-2Neutrophil collagenaseMatrix metalloproteinase-16
TargetMMP20MMP-9MMP-3MMP-7MMP-17MMP-2MMP-12MMP-1MMP-13MMP-14MMP-16MMP19MMP-10MMP-8MMP-16
UniProtO60882P14780P08254P09237Q9ULZ9P08253P39900P03956P45452P50281P51512Q99542P09238P22894P51512
EntrezGeneID931343184314431643264313432143124322432343254327431943174325
EntrezGeneSymbolMMP20MMP9MMP3MMP7MMP17MMP2MMP12MMP1MMP13MMP14MMP16MMP19MMP10MMP8MMP16
OrganismHumanHumanHumanHumanHumanHumanHumanHumanHumanHumanHumanHumanHumanHumanHuman
UnitsRFURFURFURFURFURFURFURFURFURFURFURFURFURFURFU
TypeProteinProteinProteinProteinProteinProteinProteinProteinProteinProteinProteinProteinProteinProteinProtein
Dilution200.50.520200.520202020200.5202020
PlateScale_Reference937.519472.3117.22392.91520.614888.51014.97611.5376.6632.1565.85063.41534.01088.51166.8
CalReference937.519472.3117.22392.91520.614888.51014.97611.5376.6632.1565.85063.41534.01088.51166.8
Cal_Example_Adat_Set0011.069472961.019572220.984047020.901314551.137832980.989611030.961808190.911622390.986897271.024971620.979062120.978434780.950610400.947098231.02243253
ColCheckPASSPASSPASSPASSPASSPASSPASSPASSPASSPASSPASSPASSPASSPASSPASS
CalQcRatio_Example_Adat_Set001_1702551.1321.0020.8931.1300.9550.9871.0141.0541.0230.9871.0241.0051.0231.0291.003
QcReference_170255793.410420.6124.29482.51252.016044.71115.313192.8394.4492.1559.26162.71946.6845.51171.0
Cal_Example_Adat_Set0021.023806921.001177411.312430010.678125091.093170380.984995340.959534840.951544550.905941781.081437130.981439720.988211870.927000241.039835691.02055454
CalQcRatio_Example_Adat_Set002_1702551.1921.0090.8211.1231.0831.0281.0061.0250.9910.9491.0020.9901.0101.0390.951
PlateIdPlateRunDateScannerIDPlatePositionSlideIdSubarraySampleIdSampleTypePercentDilutionSampleMatrixBarcodeBarcode2dSampleNameSampleNotesAliquotingNotesSampleDescriptionAssayNotesTimePointExtIdentifierSsfExtIdSampleGroupSiteIdTubeUniqueIDCLIHybControlNormScaleRowCheckNormScale_20NormScale_0_005NormScale_0_5ANMLFractionUsed_20ANMLFractionUsed_0_005ANMLFractionUsed_0_5AgeSex
Example Adat Set0012020-06-18SG15214400H925849580001231Sample20Plasma-PPT0.98185998PASS1.036935800.857016240.777174910.9140.8690.90376F729.916230.2177.911903.31378.111997.22455.920988.1442.5414.2712.58965.52030.62181.51096.4
H825849580000472Sample20Plasma-PPT0.96671829PASS0.960225050.848584200.852019530.9370.9560.97355F873.317253.4152.816508.81652.014574.61595.011498.5501.1505.9629.98669.71301.61571.21149.0
H725849580001083Sample20Plasma-PPT1.00193072PASS0.984116171.032701560.915191530.9070.9190.91547M993.013094.1127.57577.01711.714468.7503.314697.72883.2445.8510.66728.91067.31528.91027.8
H625849580000344Sample20Plasma-PPT0.94017961PASS1.078398780.946268410.912467310.9340.9190.91237M906.520991.4155.08673.71667.59913.1438.420819.0375.2644.8547.88629.01162.41173.71091.6
H525849580000945Sample20Plasma-PPT0.94621098PASS0.846794460.929045530.774130560.7070.8940.70871F747.08070.4124.020423.71426.611345.01954.516168.9356.9446.4541.78125.21667.01048.61132.6
...................................................................................................................................................
Example Adat Set0022020-07-20SG15214400A22584958001083188Sample20Plasma-PPT0.96699908PASS0.959932751.089101380.994919790.5660.9120.71938F714.719877.3114.515151.9894.717266.2682.827419.4480.3705.1527.96638.23919.71787.51191.1
A122584958001042189Sample20Plasma-PPT0.91482584PASS1.218801291.010226970.992443740.9180.9190.92640F865.825801.4123.615711.61308.416598.31369.215153.5474.2655.0592.28953.92494.92156.21391.5
A112584958001085190Sample20Plasma-PPT0.88282283PASS1.366991421.162714271.196735870.9270.9810.96443M869.713728.5140.911437.31353.517996.21344.210575.3433.7439.8530.76193.92067.62466.61114.8
A102584958001055191Sample20Plasma-PPT0.95792282PASS1.305903740.983951660.974601190.8350.9630.94455M529.213298.2161.714210.81026.014549.01466.19683.8291.6435.8676.19584.81799.51347.81194.2
A12584958001105192Sample20Plasma-PPT0.97384118PASS1.307106460.932301231.008043410.7930.9630.93356F1934.47567.8133.013614.31220.216223.81110.97737.4364.23407.0645.27867.11720.91888.21293.8

192 rows × 15 columns

Lifting ADAT data between assay versions.

Adat data can be lifted from one SomaScan Assay version's RFU space to another SomaScan Assay version's RFU space. This is achieved by scaling SOMAmer reagent measurement columns using scale factors available at menu.somalogic.com and built in to this tool in the Adat.lift() method.

The example Adat exists in SomaScan Version V4.0 assay space (also called 5K in some literature). In this example we will lift to SomaScan V5.0 (11K) assay space. It is important to know that only SOMAmer reagent measurements in both assay versions can be lifted. When lifting from a smaller to a larger plex (as demonstrated) the resulting Adat will remain in the smaller plex size. When lifting from a larger to smaller plex size reagents that don't exist in the small plex size will be scaled by 1.0. The end user might choose to redact the lifted Adat to the smaller plex to better merge data.

The tool will raise in error if the end user attempts to lift an Adat object to its current version or an unsupported assay version.

Assay version mapping:

SomaScan data can be referred to by the assay version i.e. 'V5.0' or by the plex size i.e. '11K'. The tool can use either 'V5.0' or '11K' interchangeably in it's input. The mapping between these terms is shown in the table below:

SomaScan Assay VersionSomaScan PlexSomaScan Menu Size
V45K5284
V4.17K7596
V5.011K11083

return to top

lifted_adat = adat.lift('V5.0')
lifted_adat
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead tr th { text-align: left; } .dataframe thead tr:last-of-type th { text-align: right; }
SeqId10000-2810001-710003-1510006-2510008-4310011-6510012-510013-3410014-3110015-119...9981-189983-979984-129986-149989-129993-119994-2179995-69997-129999-1
SeqIdVersion3333333333...3333333333
SomaIdSL019233SL002564SL019245SL019228SL019234SL019246SL014669SL025418SL007803SL014924...SL018293SL019202SL019205SL005356SL019194SL019212SL019217SL013164SL019215SL019231
TargetFullNameBeta-crystallin B2RAF proto-oncogene serine/threonine-protein kinaseZinc finger protein 41ETS domain-containing protein Elk-1Guanylyl cyclase-activating protein 1Inositol polyphosphate 5-phosphatase OCRL-1SAM pointed domain-containing Ets transcription factorFc_MOUSEZinc finger protein SNAI2Voltage-gated potassium channel subunit beta-2...Protein FAM234BInactive serine protease 35Protein YIPF6Neuropeptide WLeucine-rich repeat-containing protein 24Zinc finger protein 264Potassium-transporting ATPase subunit betaDeoxyuridine 5'-triphosphate nucleotidohydrolase, mitochondrialUBX domain-containing protein 4Interferon regulatory factor 6
TargetCRBB2c-RafZNF41ELK1GUC1AOCRLSPDEFFc_MOUSESLUGKCAB2...K1467PRS35YIPF6Neuropeptide WLRC24ZN264ATP4BDUTUBXN4IRF6
UniProtP43320P04049P51814P19419P43080Q01968O95238Q99LC4O43623Q13303...A2RU67Q8N3Z0Q96EC8Q8N729Q50LG9O43296P51164P33316Q92575O14896
EntrezGeneID1415589475922002297849522580365918514...5761316768128645128386944138194224961854231903664
EntrezGeneSymbolCRYBB2RAF1ZNF41ELK1GUCA1AOCRLSPDEFSNAI2KCNAB2...KIAA1467PRSS35YIPF6NPWLRRC24ZNF264ATP4BDUTUBXN4IRF6
OrganismHumanHumanHumanHumanHumanHumanHumanMouseHumanHuman...HumanHumanHumanHumanHumanHumanHumanHumanHumanHuman
UnitsRFURFURFURFURFURFURFURFURFURFU...RFURFURFURFURFURFURFURFURFURFU
TypeProteinProteinProteinProteinProteinProteinProteinProteinProteinProtein...ProteinProteinProteinProteinProteinProteinProteinProteinProteinProtein
Dilution20200.520202020202020...20202020202020202020
PlateScale_Reference687.4227.8126.9634.2585.02807.11623.3499.6857.2443.3...643.9430.0627.53644.5449.4953.31971.11275.64426.9851.9
CalReference687.4227.8126.9634.2585.02807.11623.3499.6857.2443.3...643.9430.0627.53644.5449.4953.31971.11275.64426.9851.9
Cal_Example_Adat_Set0011.012520251.016057090.950561800.996073500.940514471.053834891.172904621.070953911.034640921.07466667...0.980359321.048780491.035136920.963414311.014446951.045514370.982994220.974261060.968962720.96042841
ColCheckPASSPASSPASSPASSPASSPASSPASSPASSPASSPASS...PASSPASSPASSPASSPASSPASSPASSPASSPASSPASS
CalQcRatio_Example_Adat_Set001_1702551.0080.9701.0461.0421.0360.9751.0100.9530.9780.975...0.9820.9491.0030.9381.0170.9981.0710.9850.9600.974
QcReference_170255505.4223.9119.6667.2587.52617.61340.6443.01289.4441.5...700.7393.2612.63089.2455.1885.61389.7950.95560.71033.6
Cal_Example_Adat_Set0021.014762331.036868461.152588560.935812310.962012831.031339551.212503731.181925720.989267171.13173347...0.960757981.152506031.120135671.082964370.993149171.082680301.027845860.973517520.948289530.92900763
CalQcRatio_Example_Adat_Set002_1702551.0671.0070.9811.0260.9981.0131.0780.9960.9710.941...0.9820.9930.9900.9290.9780.9611.0220.9701.0270.997
PlateIdPlateRunDateScannerIDPlatePositionSlideIdSubarraySampleIdSampleTypePercentDilutionSampleMatrixBarcodeBarcode2dSampleNameSampleNotesAliquotingNotesSampleDescriptionAssayNotesTimePointExtIdentifierSsfExtIdSampleGroupSiteIdTubeUniqueIDCLIHybControlNormScaleRowCheckNormScale_20NormScale_0_005NormScale_0_5ANMLFractionUsed_20ANMLFractionUsed_0_005ANMLFractionUsed_0_5AgeSex
Example Adat Set0012020-06-18SG15214400H925849580001231Sample20Plasma-PPT0.98185998PASS1.036935800.857016240.777174910.9140.8690.90376F386.0309.597.6449.1396.64965.91106.7274.9786.3567.2...551.9352.5408.43027.5538.8686.35202.42188.412697.7966.5
H825849580000472Sample20Plasma-PPT0.96671829PASS0.960225050.848584200.852019530.9370.9560.97355F384.3292.999.1418.6382.62149.61307.8324.1779.4371.8...689.3358.2456.05724.5470.3663.01195.92302.713247.8824.2
H725849580001083Sample20Plasma-PPT1.00193072PASS0.984116171.032701560.915191530.9070.9190.91547M336.6299.02948.3420.1299.32306.61290.9348.6845.0416.8...547.0382.0503.93380.7405.3647.61552.41183.211072.11145.8
H625849580000344Sample20Plasma-PPT0.94017961PASS1.078398780.946268410.912467310.9340.9190.91237M358.5247.4109.9420.0331.72261.41184.1362.04348.0374.6...561.9384.4477.71361.8493.0643.51005.31399.49082.4804.6
H525849580000945Sample20Plasma-PPT0.94621098PASS0.846794460.929045530.774130560.7070.8940.70871F377.2709.393.3589.3313.11949.4990.0272.8787.7723.8...480.0343.0742.34846.0487.7701.41132.49852.938461.12865.9
.....................................................................................................................................................................
Example Adat Set0022020-07-20SG15214400A22584958001083188Sample20Plasma-PPT0.96699908PASS0.959932751.089101380.994919790.5660.9120.71938F393.3954.0124.2719.81284.41312.0750.7312.1727.7829.2...454.3301.8531.01658.5541.8861.51352.311604.645580.63687.1
A122584958001042189Sample20Plasma-PPT0.91482584PASS1.218801291.010226970.992443740.9180.9190.92640F337.4281.682.5625.526153.51856.41004.8297.4734.0388.3...636.7292.7410.79236.8487.6629.31910.91855.09778.61004.4
A112584958001085190Sample20Plasma-PPT0.88282283PASS1.366991421.162714271.196735870.9270.9810.96443M372.9270.8204.2472.6446.11733.81067.7354.3745.7410.7...566.0364.5448.22597.7515.6621.41113.31302.78766.2770.8
A102584958001055191Sample20Plasma-PPT0.95792282PASS1.305903740.983951660.974601190.8350.9630.94455M320.4319.1105.9527.1370.81701.9756.9266.4618.4453.9...536.3309.1434.05167.2522.8588.2891.12466.615455.91190.4
A12584958001105192Sample20Plasma-PPT0.97384118PASS1.307106460.932301231.008043410.7930.9630.93356F370.3288.1187.2469.9438.51777.9787.3249.4930.8423.7...601.9285.9467.02564.2590.0673.21036.72142.17950.7722.8

192 rows × 5284 columns

# because some common documentation refers to plex size instead of assay version the tool also supports lifing by naming plex size.
lifted_adat = adat.lift('11K')
# Observing the Lin's CCC of the lifting scale factors.
from somadata.data import getSomaScanLiftCCC

# the method returns a pandas dataframe containing the available lift Lins's CCC values:
ccc = getSomaScanLiftCCC()

ccc
.dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; }
Serum Lin's CCC v5.0 11K to v4.1 7KPlasma Lin's CCC v5.0 11K to v4.1 7KSerum Lin's CCC v5.0 11K to v4.0 5KPlasma Lin's CCC v5.0 11K to v4.0 5KSerum Lin's CCC v4.1 7K to v4.0 5KPlasma Lin's CCC v4.1 7K to v4.0 5K
10000-280.9770.9820.9700.9660.9670.963
10001-70.8570.9610.8190.8600.8750.875
10003-150.7590.7870.7610.6740.7740.668
10006-250.9370.9270.9030.8640.9370.877
10008-430.9510.9390.9150.8790.9250.908
.....................
9993-110.8230.8550.7040.7530.7140.711
9994-2170.4920.9640.5020.7670.8090.778
9995-60.9750.9760.9650.9160.9830.922
9997-120.8770.9550.8570.8920.9260.885
9999-10.9090.9620.8700.8830.9440.898

11083 rows × 6 columns

Lin's CCC Between Lifted and Assay Space Native Data

The tool allows you to display Lin's concordance correlation coefficient (Lin 1989) derived during the calculation of the lifting scale factors. This metric allows you to see how well lifted data is expected to correlate with date collected originally in the target assay data signal space. A Lin's CCC close to 1.0 indicates strong correlation indicating the signal would be highly concordant with the lifted value if the sample data were collected in the target assay version space.

# your exact transformation's Lin's CCC can be selected by filtering the column that contains your matrix and versions
# Lin's CCC are symmetrical v5.0 -> v4.0 == v4.0 -> v5.0.
ccc["Plasma Lin's CCC v5.0 11K to v4.0 5K"]
10000-28    0.966
10001-7     0.860
10003-15    0.674
10006-25    0.864
10008-43    0.879
            ...
9993-11     0.753
9994-217    0.767
9995-6      0.916
9997-12     0.892
9999-1      0.883
Name: Plasma Lin's CCC v5.0 11K to v4.0 5K, Length: 11083, dtype: float64

Writing an ADAT file

In order to store or share analysis the user may need to write out an ADAT file. This utility supports writing to the file system.

return to top

adat.to_adat('/tmp/out_file.adat')

Typical Analyses

Although it is beyond the scope of the SomaData package, below are 3 sample analyses that typical users/clients would perform on SomaLogic data. They are not intended to be a definitive guide in statistical analysis and existing packages do exist in the Python universe that perform parts or extensions of these techniques. Many variations of the workflows below exist, however the framework highlights how one could perform standard preliminary analyses on SomaLogic data for:

  • Two-group differential expression (t-test)
  • Binary classification (logistic regression)
  • Linear regression

return to top

Compare Groups (M/F) via t-test

from somadata.data.example_data import example_data # Example ADAT included with SomaData
from scipy.stats import ttest_ind
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from io import StringIO

Display the shape of the adat (rows, columns)

example_data.shape
(192, 5284)

Describe the sample types within the adat and display their counts

Counter(example_data.index.get_level_values('SampleType'))
Counter({'Sample': 170, 'Calibrator': 10, 'Buffer': 6, 'QC': 6})

Prepare the adat for analysis

filtered_transformed_data = (
    example_data
        .exclude_on_meta(axis=0, name='Sex', values=[''])            # rm NAs if present
        .pick_on_meta(axis=0, name='SampleType', values=['Sample'])  # rm control samples
        .apply(np.log10)                                             # log10-transform
)

clean_data = (
    filtered_transformed_data
        .insert_keyed_meta(                                          # map Sex -> 0/1
            axis=0,
            key_meta_name='Sex',
            inserted_meta_name='Group',
            values_dict={'M': 1, 'F': 0}
        )
        .apply(lambda x: x - x.mean(), axis=0)                       # center features
        .apply(lambda x: x / x.std(), axis=0)                        # scale features
)

Display the grouping counts

print(clean_data.index.to_frame()['Sex'].value_counts())
print(clean_data.index.to_frame()['Group'].value_counts())
Sex
F    85
M    85
Name: count, dtype: int64
Group
0    85
1    85
Name: count, dtype: int64

Split the adat based on Group and perform t-test across all aptamers

tt_g0 = clean_data.pick_on_meta(axis=0, name='Group', values=[0])
tt_g1 = clean_data.pick_on_meta(axis=0, name='Group', values=[1])

tt_res = ttest_ind(tt_g0, tt_g1)
t_tests = list(zip(clean_data.columns.get_level_values('TargetFullName'), tt_res.pvalue))

Sort the results and display the 12 aptamers with the most significant p-values

t_tests_sorted = sorted(t_tests, key=lambda x: x[1])
tt_top_12_analytes = [name for name, p_value in t_tests_sorted[:12]]
tt_top_12_analytes
['Prostate-specific antigen',
 'Pregnancy zone protein',
 'Kunitz-type protease inhibitor 3',
 'Follicle stimulating hormone',
 'Ectonucleotide pyrophosphatase/phosphodiesterase family member 2',
 'Beta-defensin 104',
 'Luteinizing hormone',
 'Cysteine-rich secretory protein 2',
 'Human Chorionic Gonadotropin',
 'Serum amyloid P-component',
 'SLIT and NTRK-like protein 4',
 'Neurotrimin']

Plot the Group log(RFU) for each aptamer

tt_df= (
    filtered_transformed_data
        .pick_meta(axis=1, names=['TargetFullName'])
        .pick_on_meta(axis=1, name='TargetFullName', values=tt_top_12_analytes)[tt_top_12_analytes]
        .reset_index()
)

tt_melted_df = pd.melt(tt_df, value_vars=tt_top_12_analytes, id_vars='Sex', value_name='log10(RFU)')

tt_p = sns.catplot(
    x='Sex',
    y='log10(RFU)',
    col='TargetFullName',
    data=tt_melted_df,
    kind='box',
    col_wrap=3,
    sharey=False
)
tt_p.set_titles(row_template='{row_name}', col_template='{col_name}')
plt.show()

png

Logistic Regression (Predict Sex)

# Import the libraries that we need for this analysis
from sklearn.model_selection import train_test_split
from sklearn import metrics
from scipy.stats import pearsonr
import statsmodels.api as sm
from IPython.display import HTML

Prepare the data for LogisticRegression

# Wrangle `clean_data` into a simpler form
logr_x_df = (
    clean_data
        .pick_meta(axis=1, names=['SeqId', 'TargetFullName'])
        .reset_index(drop=True)
)
logr_y_df = (
    clean_data.index.get_level_values('Group')
)

# Split the dataset into train and test, holding back 25 samples for testing
logr_x_train, logr_x_test, logr_y_train, logr_y_test = train_test_split(logr_x_df, logr_y_df, test_size=25, random_state=0)

Perform univariate logistic regression for each aptamer

logr_apt_perf = []
for seq_info in logr_x_train:
    x = sm.add_constant(logr_x_train[seq_info]) # Need to add the intercept term since sm.GLM does not automatically do it
    mod = sm.GLM(logr_y_train, x, family=sm.families.Binomial())
    res = mod.fit()
    logr_apt_perf.append(res.summary2().tables[1].loc[[seq_info]])

Wrangle the GLM results of each aptamer into a dataframe and sort them by p-value

logr_df = pd.concat(logr_apt_perf).reset_index()
logr_df['SeqId'] = [x[0] for x in logr_df['index']]
logr_df['TargetFullName'] = [x[1] for x in logr_df['index']]
logr_df = logr_df.drop('index', axis=1)
logr_df = logr_df[['SeqId', 'TargetFullName', 'Coef.', 'Std.Err.', 'z', 'P>|z|', '[0.025', '0.975]']].set_index('SeqId')
logr_df_sorted = logr_df.sort_values('P>|z|')
HTML(logr_df_sorted.head(20).to_html()) # Need to use HTML here to display nicely for this README
TargetFullNameCoef.Std.Err.zP>|z|[0.0250.975]
SeqId
6580-29Pregnancy zone protein-3.0798180.489558-6.2910203.153866e-10-4.039334-2.120302
5763-67Beta-defensin 1042.9747780.4784006.2181815.029509e-102.0371313.912425
3032-11Follicle stimulating hormone-1.5057180.250398-6.0132921.817935e-09-1.996490-1.014946
7926-13Kunitz-type protease inhibitor 32.8874750.4825265.9840872.176067e-091.9417423.833208
16892-23Ectonucleotide pyrophosphatase/phosphodiesterase family member 2-2.3351130.396641-5.8872163.927542e-09-3.112516-1.557710
9282-12Cysteine-rich secretory protein 21.7680260.3090505.7208371.060006e-081.1622992.373754
2953-31Luteinizing hormone-1.3197280.240323-5.4914663.986115e-08-1.790753-0.848702
4914-10Human Chorionic Gonadotropin-1.2445510.229781-5.4162406.086534e-08-1.694914-0.794188
8468-19Prostate-specific antigen5.8411311.1130305.2479531.537986e-073.6596328.022630
2474-54Serum amyloid P-component1.4349290.2792185.1391082.760458e-070.8876731.982185
8428-102Neurotrimin-1.2643170.246142-5.1365432.798380e-07-1.746745-0.781888
9002-36Serpin A11-1.0873850.219035-4.9644346.890169e-07-1.516685-0.658084
3066-12Galectin-3-1.0056150.206735-4.8642761.148764e-06-1.410808-0.600423
5116-62Roundabout homolog 2-1.2915940.270447-4.7757671.790237e-06-1.821661-0.761527
7139-14SLIT and NTRK-like protein 41.0185200.2186254.6587613.181183e-060.5900231.447016
8484-24Leptin-0.9915850.219260-4.5224156.113800e-06-1.421326-0.561843
5934-1Ferritin1.0123000.2275254.4491888.619536e-060.5663601.458240
15324-58Ferritin light chain1.0028130.2264294.4288229.474919e-060.5590211.446606
4234-8Interleukin-1 receptor-like 11.1340580.2586324.3848311.160761e-050.6271491.640968
2696-87Persephin1.4128330.3233484.3693891.245947e-050.7790832.046584

Fit model

logr_top_analytes = [(index, row['TargetFullName']) for index, row in logr_df_sorted.head(5).iterrows()] # Select the top 5 aptamers based on p-value
x = sm.add_constant(logr_x_train[logr_top_analytes])
logr_mod = sm.GLM(logr_y_train, x, family=sm.families.Binomial())
logr_res = logr_mod.fit()
logr_res.summary()
Generalized Linear Model Regression Results
Dep. Variable:y No. Observations: 145
Model:GLM Df Residuals: 139
Model Family:Binomial Df Model: 5
Link Function:Logit Scale: 1.0000
Method:IRLS Log-Likelihood: -8.4167
Date:Fri, 01 Mar 2024 Deviance: 16.833
Time:13:19:34 Pearson chi2: 17.8
No. Iterations:10 Pseudo R-squ. (CS):0.7186
Covariance Type:nonrobust
coefstd errzP>|z|[0.0250.975]
const 1.6106 1.178 1.367 0.172 -0.699 3.920
('6580-29', 'Pregnancy zone protein') -7.0008 3.112 -2.250 0.024 -13.099 -0.902
('5763-67', 'Beta-defensin 104') 2.7821 1.255 2.217 0.027 0.322 5.242
('3032-11', 'Follicle stimulating hormone') -1.1722 0.818 -1.432 0.152 -2.776 0.432
('7926-13', 'Kunitz-type protease inhibitor 3') 2.2901 1.053 2.174 0.030 0.226 4.354
('16892-23', 'Ectonucleotide pyrophosphatase/phosphodiesterase family member 2') -3.6045 1.498 -2.407 0.016 -6.540 -0.669
# Create confusion matrix
x = sm.add_constant(logr_x_test[logr_top_analytes])
logr_predictions = [1 if val > 0.5 else 0 for val in logr_res.predict(x)]
cm = metrics.confusion_matrix(logr_y_test.values, logr_predictions)
# Print out performance metrics via Pandas
tp = cm[1, 1]
tn = cm[0, 0]
fp = cm[0, 1]
fn = cm[1, 0]
logr_perf_df = pd.DataFrame.from_records({
    'Sensitivity': tp / (tp + fn),
    'Specificity': tn / (tn + fp),
    'Accuracy': (tp + tn) / sum(sum(cm)),
    'PPV': tp / (tp + fp),
    'NPV': tn / (tn + fn)
}, index=['Value'])
HTML(logr_perf_df.to_html())
AccuracyNPVPPVSensitivitySpecificity
Value1.01.01.01.01.0
# Display the confusion matrix
plt.figure(figsize=(3,3))
sns.heatmap(cm, annot=True, fmt=".3f", linewidths=.5, square=True, cmap='Blues')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
all_sample_title = 'Accuracy Score: {0}'.format(100 * logr_perf_df['Accuracy'].values[0])
plt.title(all_sample_title)
plt.show()

png

Linear Regression (Predict Age)

We use the same clean_data as the logistic regression analysis above.

Wrangle data

# Wrangle `clean_data` into a simpler form
linr_x_df = (
    clean_data
        .pick_meta(axis=1, names=['SeqId', 'TargetFullName'])
        .reset_index(drop=True)
)
linr_y = (
    [float(age) for age in clean_data.index.get_level_values('Age')]
)

# Split the dataset into train and test, holding back 25 samples for testing
linr_x_train, linr_x_test, linr_y_train, linr_y_test = train_test_split(linr_x_df, linr_y, test_size=25, random_state=5)

Perform univariate linear regression for each aptamer

linr_apt_perf = []
for seq_info in linr_x_df:
    x = sm.add_constant(linr_x_train[seq_info])
    mod = sm.OLS(linr_y_train, x)
    res = mod.fit()
    linr_apt_perf.append(res.summary2().tables[1].loc[[seq_info]])

Wrangle the GLM results of each aptamer into a dataframe and sort them by p-value

linr_res_df = pd.concat(linr_apt_perf).reset_index()
linr_res_df['SeqId'] = [x[0] for x in linr_res_df['index']]
linr_res_df['TargetFullName'] = [x[1] for x in linr_res_df['index']]
linr_res_df = linr_res_df.drop('index', axis=1)
linr_res_df = linr_res_df[['SeqId', 'TargetFullName', 'Coef.', 'Std.Err.', 't', 'P>|t|', '[0.025', '0.975]']].set_index('SeqId')
linr_sorted_res_df = linr_res_df.sort_values('P>|t|')
HTML(linr_sorted_res_df.head(20).to_html())
TargetFullNameCoef.Std.Err.tP>|t|[0.0250.975]
SeqId
3045-72Pleiotrophin6.7133390.8655787.7559061.506400e-125.0023598.424320
4374-45Growth/differentiation factor 156.7665370.9029267.4940116.377086e-124.9817308.551343
3024-18Alpha-2-antiplasmin-6.2587390.895850-6.9863739.854830e-11-8.029558-4.487920
6392-7WNT1-inducible-signaling pathway protein 26.2062030.8954266.9310071.321588e-104.4362227.976185
8480-29EGF-containing fibulin-like extracellular matrix protein 16.1794730.9003706.8632601.889770e-104.3997197.959227
15640-54Transgelin6.1597690.9050436.8060482.552783e-104.3707777.948761
15533-97Macrophage scavenger receptor types I and II5.9867410.9076156.5961277.616175e-104.1926667.780815
15386-7Fatty acid-binding protein, adipocyte6.1305620.9319546.5781828.355679e-104.2883767.972748
16818-200CUB domain-containing protein 15.9199090.9028426.5569709.321408e-104.1352687.704550
4496-60Macrophage metalloelastase6.1499460.9401336.5415701.009072e-094.2915928.008299
3362-61Chordin-like protein 15.7654440.9137036.3099753.287540e-093.9593347.571554
4541-49Cell adhesion molecule-related/down-regulated by oncogenes-5.7031660.906248-6.2931643.578780e-09-7.494540-3.911793
3600-2Chitotriosidase-15.8315900.9511846.1308718.071212e-093.9513917.711789
2609-59Cystatin-C5.5778940.9340725.9715881.773159e-083.7315217.424267
3234-23Coiled-coil domain-containing protein 805.6474870.9487955.9522701.949244e-083.7720107.522963
14133-93Interleukin-1 receptor type 2-5.4893680.926319-5.9260042.216438e-08-7.320415-3.658321
19601-15Ankyrin repeat and SOCS box protein 95.4120740.9303135.8174743.755863e-083.5731317.251018
9793-145Immunoglobulin superfamily DCC subclass member 4-5.2927030.911239-5.8082473.927116e-08-7.093942-3.491463
2677-1Epidermal growth factor receptor-5.3413960.919656-5.8080393.931061e-08-7.159272-3.523520
4968-50Macrophage-capping protein5.3457100.9264585.7700504.721157e-083.5143877.177033

Feed top 8 SOMAmers into statsmodels OLS regression

linr_top_analytes = [(index, row['TargetFullName']) for index, row in linr_sorted_res_df.head(8).iterrows()]
x = sm.add_constant(linr_x_train[linr_top_analytes])
mod = sm.OLS(linr_y_train, x).fit()
mod.summary()
OLS Regression Results
Dep. Variable:y R-squared: 0.501
Model:OLS Adj. R-squared: 0.471
Method:Least Squares F-statistic: 17.05
Date:Fri, 01 Mar 2024 Prob (F-statistic):2.29e-17
Time:13:20:02 Log-Likelihood: -522.29
No. Observations: 145 AIC: 1063.
Df Residuals: 136 BIC: 1089.
Df Model: 8
Covariance Type:nonrobust
coefstd errtP>|t|[0.0250.975]
const 55.5436 0.765 72.602 0.000 54.031 57.057
('3045-72', 'Pleiotrophin') 1.6913 1.197 1.413 0.160 -0.676 4.059
('4374-45', 'Growth/differentiation factor 15') 1.2404 1.258 0.986 0.326 -1.247 3.728
('3024-18', 'Alpha-2-antiplasmin') -2.5113 0.910 -2.758 0.007 -4.312 -0.711
('6392-7', 'WNT1-inducible-signaling pathway protein 2') 1.5143 0.997 1.519 0.131 -0.457 3.486
('8480-29', 'EGF-containing fibulin-like extracellular matrix protein 1') 2.1363 0.972 2.197 0.030 0.214 4.059
('15640-54', 'Transgelin') 1.2006 1.010 1.189 0.237 -0.796 3.198
('15533-97', 'Macrophage scavenger receptor types I and II') 0.8792 1.223 0.719 0.474 -1.540 3.298
('15386-7', 'Fatty acid-binding protein, adipocyte') 1.1453 1.180 0.971 0.333 -1.188 3.479
Omnibus: 2.712 Durbin-Watson: 2.042
Prob(Omnibus): 0.258 Jarque-Bera (JB): 2.501
Skew:-0.322 Prob(JB): 0.286
Kurtosis: 3.008 Cond. No. 4.53


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Compute predictions on test set

x = sm.add_constant(linr_x_test[linr_top_analytes])
linr_predictions = mod.predict(x)
linr_pred_df = pd.DataFrame({
    'Actual Age': linr_y_test,
    'Predicted Age': linr_predictions
})

linr_pred_df['Pred Error'] = linr_pred_df['Predicted Age'] - linr_pred_df['Actual Age']

Compute model performance

# Lin's Concordance Correl. Coef.
# Accounts for location + scale shifts
def linCCC(x, y):
    if len(x) != len(y):
        raise Exception('Arrays are not the same length!')
    a = 2 * pearsonr(x, y)[0] * np.std(x, ddof=1) * np.std(y, ddof=1)
    b = np.var(x, ddof=1) + np.var(y, ddof=1) + (np.mean(x) - np.mean(y))**2
    return a / b

n = linr_x_test.shape[0]
p = len(linr_top_analytes)

# Regression metrics
linr_metrics_df = pd.DataFrame({
    'rss': linr_pred_df['Pred Error'].apply(lambda x: x**2).sum(), # residual sum of squares
    'tss': sum((linr_pred_df['Actual Age'] - linr_pred_df['Actual Age'].mean()) ** 2), # total sum of squares
    'R2': pearsonr(linr_pred_df['Actual Age'], linr_pred_df['Predicted Age'])[0] ** 2, # R-squared Pearson approx.
    'MAE': np.mean(np.abs(linr_pred_df['Pred Error'])), # Mean absolute error
    'RMSE': np.sqrt(np.mean(linr_pred_df['Pred Error'] ** 2)), # Root mean squared error
    'CCC': linCCC(linr_predictions, linr_y_test) # Lins concordance correlation coefficient
}, index=['Value'])

linr_metrics_df['rsq'] = 1 - (linr_metrics_df['rss'] / linr_metrics_df['tss']) # R-squared
linr_metrics_df['rsqadj'] = max(0, 1 - (1 - linr_metrics_df['rsq'][0]) * (n - 1) / (n - p - 1)), # Adjusted R-squared

HTML(linr_metrics_df.to_html())
rsstssR2MAERMSECCCrsqrsqadj
Value989.7682312771.840.6744845.2144346.2921160.7523260.642920.46438

Visualize performance via concordance plot of predicted and actual values

f, ax = plt.subplots(1, figsize=(5, 5), dpi=150)
plot_range = [linr_pred_df[['Actual Age', 'Predicted Age']].min().min() * 0.95, linr_pred_df[['Actual Age', 'Predicted Age']].max().max() * 1.05]
ax.plot(plot_range, plot_range, c='g')
ax.scatter(linr_pred_df['Actual Age'], linr_pred_df['Predicted Age'], alpha=0.5)
ax.set(
    xlim=plot_range,
    xlabel='Actual Age',
    ylim=plot_range,
    ylabel='Predicted Age',
    title='Concordance in Predicted vs. Actual Age'
)
plt.show()

png

Closing Remarks

  • Many variants of above possible.
  • Goal to provide general framework to handle SomaLogic data.
  • Not definitive guide in statistical theory, etc.

MIT LICENSE


return to top

FAQs


Did you know?

Socket

Socket for GitHub automatically highlights issues in each pull request and monitors the health of all your open source dependencies. Discover the contents of your packages and block harmful activity before you install or update your dependencies.

Install

Related posts

SocketSocket SOC 2 Logo

Product

  • Package Alerts
  • Integrations
  • Docs
  • Pricing
  • FAQ
  • Roadmap
  • Changelog

Packages

npm

Stay in touch

Get open source security insights delivered straight into your inbox.


  • Terms
  • Privacy
  • Security

Made with ⚡️ by Socket Inc