
A high-throughput pipeline to classify eukaryotic taxa and assess deep-sea biodiversity directly from raw environmental DNA, bypassing the limitations of conventional reference databases.

01 Problem Statement

// Incomplete Databases

Deep-sea organisms are critically underrepresented in genetic reference databases. This data void results in misclassification, unassigned reads, and a fundamental underestimation of true biodiversity.

// Computational Bottlenecks

Legacy bioinformatic pipelines are computationally expensive and poorly suited to discovering novel taxa. Their reliance on sequence alignment against incomplete reference databases is a primary limiting factor.

02 Proposed Solution

Our AI-driven pipeline leverages deep learning and unsupervised clustering to analyze eDNA without primary reliance on existing databases. The system is designed to:

  • [+] CLASSIFY TAXA DIRECTLY: A fine-tuned DNA-BERT transformer model interprets raw sequence data, enabling classification without perfect database matches.
  • [+] DISCOVER NOVEL SPECIES: Unsupervised clustering algorithms (DBSCAN, k-means) identify and group unknown sequences, flagging potential new taxa for targeted analysis.
  • [+] GENERATE ECOLOGICAL INSIGHTS: Rapidly produce accurate estimations of species abundance and community structure to inform conservation and research priorities.
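
As a rough illustration of the novelty-discovery idea, the Python sketch below clusters sequences with a minimal, from-scratch DBSCAN. It is not the pipeline's implementation: normalized k-mer frequencies are used here as a simple stand-in for transformer embeddings, and the function names (`kmer_tokens`, `kmer_profile`, `dbscan`, `cluster_sequences`) as well as the values of `k`, `eps`, and `min_samples` are assumptions chosen for the example.

```python
import math
from collections import Counter
from itertools import product


def kmer_tokens(seq, k=3):
    """Split a DNA sequence into overlapping k-mers."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_profile(seq, k=3):
    """Normalized k-mer frequencies over ACGT (assumed stand-in for an embedding)."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(kmer_tokens(seq, k))
    # k-mers containing other symbols (e.g. N) are ignored in the total.
    total = sum(counts[t] for t in vocab) or 1
    return [counts[t] / total for t in vocab]


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN. Returns one label per point; -1 marks noise."""
    NOISE = -1
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force neighbor search; includes the point itself.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: join, but do not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_samples:
                queue.extend(j_nbrs)  # core point: keep growing the cluster
    return labels


def cluster_sequences(seqs, k=3, eps=0.1, min_samples=2):
    """Cluster sequences by k-mer profile; each cluster is a candidate group."""
    return dbscan([kmer_profile(s, k) for s in seqs], eps, min_samples)


# Toy reads, invented for this example.
reads = ["ACGTACGTACGT"] * 3 + ["GGGGCCCCGGGG"] * 3 + ["TTTTTTTTTTTT"]
labels = cluster_sequences(reads)
```

In this toy run the two repeated reads form two clusters and the single poly-T read is labelled `-1`. In a real setting each cluster of unmatched reads would be a candidate group for targeted analysis, and the distance threshold would need tuning on actual embeddings.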

03 System Architecture

SYSTEM INGESTS RAW eDNA DATA
-> PREPROCESSING MODULE EXTRACTS 18S rRNA & COI MARKERS
-> DATA IS VECTORIZED BY A FINE-TUNED DNA-BERT TRANSFORMER
-> EMBEDDINGS ARE PROCESSED VIA DUAL PATHWAYS: [A] DEEP LEARNING FOR CLASSIFICATION, [B] UNSUPERVISED CLUSTERING FOR NOVELTY DETECTION
-> OUTPUT GENERATION: TAXONOMIC GROUPING, ABUNDANCE ESTIMATION, ECOLOGICAL INSIGHTS.
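
The final stage of this flow can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system's actual logic: the rule for combining pathway [A] and pathway [B] (a confidence threshold, `min_confidence`), the use of read counts as an abundance proxy, the Shannon index as an example ecological summary, and all function names and sample values are hypothetical.

```python
import math
from collections import Counter


def merge_pathways(classified, clusters, min_confidence=0.8):
    """Combine classifier output [A] with cluster labels [B] per read.

    classified: {read_id: (taxon, confidence)}
    clusters:   {read_id: cluster_id}, with -1 meaning noise
    The threshold rule is an assumption made for this sketch.
    """
    groups = {}
    for read_id, (taxon, confidence) in classified.items():
        if confidence >= min_confidence:
            groups[read_id] = taxon
        elif clusters.get(read_id, -1) != -1:
            groups[read_id] = f"novel_cluster_{clusters[read_id]}"
        else:
            groups[read_id] = "unassigned"
    return groups


def relative_abundance(groups):
    """Fraction of reads per group (read counts as a rough abundance proxy)."""
    counts = Counter(groups.values())
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def shannon_index(abundance):
    """Shannon diversity H = -sum(p * ln p) over group proportions."""
    return -sum(p * math.log(p) for p in abundance.values() if p > 0)


# Invented example values, for illustration only.
classified = {
    "r1": ("Cnidaria", 0.95),
    "r2": ("Cnidaria", 0.91),
    "r3": ("Annelida", 0.40),
    "r4": ("Annelida", 0.35),
}
clusters = {"r3": 0, "r4": 0}

groups = merge_pathways(classified, clusters)
abundance = relative_abundance(groups)
diversity = shannon_index(abundance)
```

Here the two low-confidence reads fall back to their shared cluster, so the sample splits evenly between one known taxon and one candidate group. Read counts from amplicon data are only a loose indicator of organism abundance, so any real estimate would need additional correction.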