Research

My research mainly focuses on algorithms that can operate on “big data” in proteomics: over 400 TB of public mass spectrometry data at MassIVE. I’m especially interested in what currently evades identification in common pipelines and how to effectively understand both the scope and meaning of this “missing” proteome.

Chimeric-aware MS/MS clustering for accurate estimation of the dark proteome

Developed a method for clustering MS/MS spectra that properly accounts for chimeric spectra, with the goal of finding a lower bound on the total number of distinct molecules in a sample. We show that we are able to reduce the number of known peptides occurring in 2+ clusters by over 6x compared to the state-of-the-art approach, while still identifying more peptides. This method reduces the number of clusters in downstream analyses and effectively removes duplicates, allowing for more efficient searching and creation of spectral networks.
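As a rough illustration of the "known peptides occurring in 2+ clusters" measure mentioned above, the sketch below counts peptides whose identified spectra land in more than one cluster. It is not the clustering method itself; the function name and the (cluster_id, peptide) input format are assumptions made for this example.

```python
from collections import defaultdict


def peptides_split_across_clusters(assignments):
    """Return the set of peptides that appear in two or more clusters.

    `assignments` is a hypothetical input format: an iterable of
    (cluster_id, peptide) pairs, one pair per peptide identification.
    A chimeric spectrum may contribute several pairs to the same cluster.
    """
    clusters_by_peptide = defaultdict(set)
    for cluster_id, peptide in assignments:
        clusters_by_peptide[peptide].add(cluster_id)
    # A peptide is "split" when its identifications span 2+ distinct clusters.
    return {p for p, ids in clusters_by_peptide.items() if len(ids) >= 2}


# Toy data: PEPTIDEK is identified in clusters 1 and 2, so it counts as split.
example = [
    (1, "PEPTIDEK"),
    (1, "PEPTIDEK"),
    (2, "PEPTIDEK"),
    (2, "SAMPLER"),
    (3, "ANOTHERK"),
]
split = peptides_split_across_clusters(example)
```

A lower count on the same set of identifications suggests fewer duplicated clusters for the same molecule.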

Fast, modification-tolerant searches of MS/MS spectra over billions of spectra

Developed a method to quickly query repository-scale MS/MS spectrum collections (3.4B proteomics spectra, 500M small molecule spectra) while considering modifications and with cosine guarantees. Querying the 3.4B database takes ~20-30 seconds with all the data on SSD. Developed an online user interface to allow for community use of the tool.
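For readers unfamiliar with modification-tolerant scoring, one common general idea is to let a peak match either directly or after shifting by the precursor mass difference. The sketch below is a brute-force pairwise illustration of that idea only; it is not the indexed search described above and provides none of its speed or guarantees. The function name, the greedy matching, the 0.02 tolerance, and the assumption of singly charged fragments are all choices made for this example.

```python
import math


def shifted_cosine(spec_a, spec_b, precursor_a, precursor_b, tol=0.02):
    """Cosine-like score allowing peaks shifted by the precursor difference.

    spec_a, spec_b: lists of (mz, intensity) tuples.
    Assumes singly charged fragments, so a modification shifts a fragment
    m/z by the same amount as the precursor m/z (an illustrative assumption).
    """
    def normalize(spec):
        norm = math.sqrt(sum(i * i for _, i in spec))
        return [(mz, i / norm) for mz, i in spec] if norm > 0 else []

    a, b = normalize(spec_a), normalize(spec_b)
    delta = precursor_a - precursor_b  # shift a modification would introduce

    # Collect every peak pair that matches directly or after the shift.
    candidates = []
    for ia, (mz_a, int_a) in enumerate(a):
        for ib, (mz_b, int_b) in enumerate(b):
            diff = mz_a - mz_b
            if abs(diff) <= tol or abs(diff - delta) <= tol:
                candidates.append((int_a * int_b, ia, ib))

    # Greedy one-to-one assignment, largest intensity product first.
    # This is a simplification; it is not guaranteed to be the optimal matching.
    candidates.sort(reverse=True)
    used_a, used_b, score = set(), set(), 0.0
    for product, ia, ib in candidates:
        if ia not in used_a and ib not in used_b:
            used_a.add(ia)
            used_b.add(ib)
            score += product
    return score


# Toy spectra: the third peak differs by 16, matching the precursor difference.
unmodified = [(100.0, 1.0), (200.0, 1.0), (300.0, 1.0)]
modified = [(100.0, 1.0), (200.0, 1.0), (316.0, 1.0)]
with_shift = shifted_cosine(unmodified, modified, 500.0, 516.0)
without_shift = shifted_cosine(unmodified, modified, 500.0, 500.0)
```

In this toy case the shifted peak is only matched when the precursor difference is taken into account, so `with_shift` is higher than `without_shift`.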

Automated community-scale validation of novel protein discoveries

Developed systems to help the community validate missing protein (MP) discoveries. Built an online tool to show protein-level coverage of MS/MS spectra in repository-scale MS/MS proteomics libraries. Developed a workflow to automate the process of applying Human Proteome Project (HPP) criteria for validating MPs in proteomics searches, including mapping to the proteome, matching synthetics, and displaying sharable results.
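To make "protein-level coverage" concrete, the sketch below computes the fraction of a protein's residues covered by a set of identified peptides. It is a simplified illustration, not the tool's implementation: it assumes exact substring matching and ignores modifications and isoleucine/leucine ambiguity, and the function name and toy sequence are made up for this example.

```python
def sequence_coverage(protein, peptides):
    """Fraction of residues in `protein` covered by at least one peptide.

    Uses exact substring matching only (an assumption for this sketch).
    Every occurrence of a peptide in the protein is marked as covered.
    """
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for pep in peptides:
        if not pep:
            continue
        start = protein.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            # Continue searching one position later to catch overlapping hits.
            start = protein.find(pep, start + 1)
    return sum(covered) / len(protein)


# Toy example: two peptides cover 7 of 10 residues.
coverage = sequence_coverage("MKTAYIAKQR", ["MKT", "AKQR"])
```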