Newswise — Using a new deep learning approach, a broad spectrum of scientists can now generate a protein database directly from proteomics data gathered from a specific soil sample. A key element of this approach is a digital tool called Kaiko, a deep learning computer model that has significantly improved accuracy compared to currently available digital tools for generating a protein database.
Kaiko was trained to use peptide sequence data from mass spectrometry. To teach Kaiko about proteins, scientists mined the extensive archive of mass spectrometry data maintained by EMSL, the Environmental Molecular Sciences Laboratory, a Department of Energy (DOE) Office of Science user facility. The training data included a set of 5 million sample matches from 55 diverse microorganisms across 9 phyla. Trained on the EMSL data, Kaiko successfully identified organisms directly from the proteomic data of natural and synthetic soil samples. Finally, the research team used Kaiko to create a process that generates a database of all the proteomic data from a sample, known as a metaproteome, and tested the process on native soils collected from a site in Kansas. The process identified all highly abundant microbes and uncovered several additional species. The new digital tool will allow a greater number and variety of scientists around the world to study the soil microbiome.