%0 Journal Article
%A Borry, Maxime
%+ Archaeogenetics, Max Planck Institute for the Science of Human History, Max Planck Society
%T Sourcepredict: Prediction of metagenomic sample sources using dimension reduction followed by machine learning classification
%G eng
%U https://hdl.handle.net/21.11116/0000-0007-2F15-E
%R 10.21105/joss.01540
%F OTHER: shh2440
%7 2019-09-04
%D 2019
%8 04.09.2019
%* Review method: peer-reviewed
%X SourcePredict is a Python package, distributed through Conda, that classifies and predicts the origin of metagenomic samples given a reference dataset of known origins, a problem also known as source tracking. DNA shotgun sequencing of human, animal, and environmental samples has opened new doors for exploring the diversity of life in these different environments, a field known as metagenomics (Hugenholtz & Tyson, 2008). One aspect of metagenomics is investigating the community composition of organisms within a sequencing sample with tools known as taxonomic classifiers, such as Kraken (Wood & Salzberg, 2014). In cases where the origin of a metagenomic sample, its source, is unknown, predicting and/or confirming the source is often part of the research question. For example, in microbial archaeology, it is sometimes necessary to rely on metagenomics to validate the source of paleofaeces. Using samples of known sources, a reference dataset can be established from the taxonomic composition of the samples, i.e., with the organisms identified in the samples as features and the sources of the samples as class labels. With this reference dataset, a machine learning algorithm can be trained to predict the source of unknown samples (sinks) from their taxonomic composition. Other tools for predicting a sample source already exist, such as SourceTracker (Knights et al., 2011), which employs Gibbs sampling. However, the Sourcepredict results are more easily interpreted, since the samples are embedded in a human-observable low-dimensional space. The embedding is produced by a dimension reduction algorithm, and the embedded samples are then classified with K-Nearest-Neighbours (KNN).
%Z Summary
Method
- Prediction of the proportion of unknown sources
- Prediction of the proportion of known sources
- Combining unknown and source proportions
%J The Journal of Open Source Software
%O Journal of Open Source Software JOSS
%] 01540
%@ 2475-9066