Analyzing DeepVariant
To better understand what DeepVariant is learning from its training data, we used a set of simple clustering and visualization techniques to summarize the information captured in the model’s high-dimensional data. In partnership with collaborators on the Google Genomics team, we first loaded examples into the Integrative Genomics Viewer (IGV), a widely used tool for inspecting genomes and sequencing data. Then, we applied Uniform Manifold Approximation and Projection (UMAP) to the embeddings of the mixed5 max-pooling layer of the model, which is roughly in the middle of the network and contains a mix of low- and high-level features. This visualization method allows one to visually inspect any emerging structures. We used different colors to represent known sequencing attributes in the input data (e.g., low-quality sequence reads and regions that are hard to map uniquely in the genome), as well as a combined attribute built from different value combinations of the basic attributes.
The structures that emerged reveal that some of the attributes’ values are mapped close to each other, naturally forming clusters. We observed that these “natural clusters” form at different levels across model layers, and at times get “forgotten” as the network further processes the input. This suggests that different types of information about the input DNA reads are important at different depths of the network.
Based on this first look, we then used additional clustering techniques in the hope of “discovering” previously unknown attributes (clusters). We began by applying k-means clustering to find 10 clusters. K-means is a simple clustering algorithm that groups data points by proximity in vector space, without using labels that might indicate similarity. This results in visual separation between major clusters, some of which are much more populous than others. To control the size of the resulting clusters, we then applied hierarchical clustering by running k-means multiple times: first we run 3-cluster k-means, then for each of the three clusters we apply a second round of k-means to divide it further, where the number of sub-clusters depends on the shape and size of the first-round cluster.
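The two-round procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used in the analysis: the `kmeans` and `two_round_kmeans` helpers, the `target_size` parameter, and the rule that picks the second-round cluster count from cluster size alone are all assumptions made for the example (the text says only that the count depends on the shape and size of the first-round clusters). The input is a synthetic set of 2-D points standing in for layer embeddings.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on a list of equal-length tuples (illustrative helper)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the input points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids


def two_round_kmeans(points, first_k=3, target_size=20, seed=0):
    """Split into first_k clusters, then run k-means again inside each one.

    The second-round k is derived from cluster size only; this rule and
    target_size are assumptions for illustration.
    """
    top_labels, _ = kmeans(points, first_k, seed=seed)
    final = [None] * len(points)
    for c in range(first_k):
        idx = [i for i, lab in enumerate(top_labels) if lab == c]
        if not idx:
            continue
        sub_k = max(1, min(len(idx), round(len(idx) / target_size)))
        sub_labels, _ = kmeans([points[i] for i in idx], sub_k, seed=seed)
        for i, s in zip(idx, sub_labels):
            final[i] = (c, s)  # (first-round cluster, second-round cluster)
    return final


# Synthetic stand-in for embeddings: three 2-D blobs (not real DeepVariant data).
rng = random.Random(1)
points = [
    (cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
    for cx, cy in [(0, 0), (6, 0), (3, 6)]
    for _ in range(40)
]
labels = two_round_kmeans(points)
```

Each point ends up with a pair of labels, so large first-round clusters are broken into more pieces than small ones. In practice one would more likely reach for a library implementation of k-means; the hand-written version here just keeps the example dependency-free.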
