Ancient DNA carries a ghost signal. When a cell dies and its DNA begins to decay, cytosine residues in the single-stranded overhangs of fragmenting molecules undergo spontaneous deamination: they lose an amino group and become uracil, which polymerases read as thymine. The result is a characteristic accumulation of C-to-T substitutions at 5′ read termini and G-to-A substitutions at 3′ termini, rising steeply in the outermost two or three positions and tapering off toward the interior.

This damage signature is the backbone of all standard authentication. mapDamage fits the four Briggs model parameters (double-stranded and single-stranded deamination rates, overhang length, nick frequency) by Bayesian MCMC; PMDtools assigns each read a log-likelihood ratio of damage against a model of modern, undamaged DNA; and AuthentiCT wraps the whole thing in a hidden Markov model that captures the clustering of deamination events within single-stranded overhang regions. Fragment length adds a second channel: authentic ancient reads cluster around 30–70 bp, while modern contamination typically extends above 100 bp. The 2024 ngsBriggs tool unifies both channels into a single posterior, combining the length distribution with the PMD signal so that short, undamaged reads can still carry evidential weight. On well-preserved, reference-rich, single-organism ancient genomes, these methods work beautifully. They have authenticated Neanderthal remains, recovered Pleistocene megafauna paleogenomes, and detected contamination at sub-percent levels in high-quality libraries.
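The terminal substitution profile is simple enough to compute by hand. The sketch below counts, for each position from the 5′ end, what fraction of reference-C sites were read as T; it is a minimal illustration of the misincorporation profile these tools model, not any tool's implementation (the symmetric G-to-A profile at 3′ ends is omitted for brevity).

```python
from collections import Counter

def ct_profile(pairs, n_positions=10):
    """Fraction of reference-C positions read as T, by distance from the 5' end.

    `pairs` is an iterable of (read_seq, ref_seq) alignments of equal length,
    both in read orientation. A rising C->T fraction in the outermost few
    positions is the deamination signature.
    """
    c_total = Counter()   # reference C observed at position i
    ct_hits = Counter()   # reference C read as T at position i
    for read, ref in pairs:
        for i, (r, q) in enumerate(zip(ref, read)):
            if i >= n_positions:
                break
            if r == "C":
                c_total[i] += 1
                if q == "T":
                    ct_hits[i] += 1
    return [ct_hits[i] / c_total[i] if c_total[i] else 0.0
            for i in range(n_positions)]

# Toy alignments: damage concentrated at the first position.
pairs = [
    ("TACGT", "CACGT"),  # C->T at position 0
    ("TACGT", "CACGT"),
    ("CACGT", "CACGT"),  # undamaged
    ("CACGT", "CACGT"),
]
profile = ct_profile(pairs, n_positions=5)
print(profile)  # position 0: 2/4 = 0.5; interior C positions: 0.0
```

Real pipelines compute this from BAM alignments and fit the decay curve; the qualitative shape, high at the termini and near the error floor in the interior, is what every downstream statistic summarizes.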
Environmental sediment metagenomics violates every assumption these methods rest on, usually simultaneously.

The first crack is statistical. Detecting damage reliably at a 5% expected rate requires roughly a thousand aligned reads per taxon, yet the organisms most worth authenticating in a sediment sample are precisely the rarest ones: the low-abundance taxa that reveal past ecological states no living community preserves. The authentication need is highest where the signal is least available.

The second crack is taxonomic. All alignment-based methods require a reference genome similar enough to map against. Sediment paleogenomics routinely encounters taxa that are poorly represented in current databases, and diverged reads stack on conserved genomic regions shared across many species. A pile of cross-mapped reads from a dozen distantly related organisms can generate convincing depth of coverage at a reference locus while showing spurious end-biased mismatches that resemble damage but arise from sequence divergence pooled at conserved positions. PMD scores and mapDamage fits have no way to distinguish this from genuine deamination.

The third crack, and the most philosophically interesting, is taphonomic. Authentication methods assume the contaminant is modern and undamaged. In sediment archives reworked by bioturbation, cryoturbation, or hydrological mixing, the contaminating DNA may itself be ancient: deposited from a different horizon, eroded from an adjacent stratum, or simply from a different time slice within the same stratigraphic unit. A deaminated contaminant from 3,000 years ago, mixed into a layer the researcher believes is 10,000 years old, passes every damage filter with flying colors. The tools report high authenticity. The attribution is wrong.
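The statistical crack can be made concrete with a back-of-envelope confidence interval (a sketch, not any tool's method): how precisely can a 5% terminal damage rate be estimated from n terminal-C observations? The Wilson score interval below shows that at low read counts the interval comfortably includes typical sequencing-error baselines.

```python
import math

def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

for n in (20, 100, 1000):
    k = round(0.05 * n)  # observed 5% terminal C->T rate
    lo, hi = wilson_interval(k, n)
    print(f"n={n:5d}  observed 5%  ->  95% CI [{lo:.3f}, {hi:.3f}]")
# n=20 spans roughly [0.009, 0.236]: indistinguishable from error noise.
# n=1000 tightens to roughly [0.038, 0.065]: clearly above any error floor.
```

With 20 observations, an apparent 5% rate is consistent with anything from under 1% to over 20%; with a thousand, it is unambiguous. That is the read-count bottleneck in two lines of arithmetic.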
What is missing is not a better damage model but a different kind of evidence entirely. Breadth of genome coverage (how much of a reference genome is covered, not just how deeply) is a more principled filter for cross-mapping artifacts, because a single conserved gene hit hard by diverged reads will show deep depth but near-zero breadth. metaDMG points in the right direction by combining damage estimation with taxonomic abundance modeling at the metagenome scale, but it still inherits the read-count problem for rare taxa. PyDamage moves to contig-level authentication post-assembly, which is statistically stronger but computationally brutal for complex sediment communities that generate hundreds of thousands of contigs. The honest state of the field is that for high-abundance, well-referenced organisms in clean stratigraphic contexts, authentication is solved. For everything else in sediment paleogenomics, which is most of the scientifically interesting material, it remains an open problem wearing a solved one's clothes.
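The depth-versus-breadth distinction is easy to demonstrate. In this toy sketch (illustrative only; real filters compare observed breadth against the expectation for randomly placed reads), two read sets have identical mean depth, but the cross-mapping pile covers a sliver of the genome while genuine low-coverage signal spreads across it.

```python
def depth_and_breadth(intervals, genome_len):
    """Mean depth and breadth of coverage from (start, end) read intervals."""
    cov = [0] * genome_len
    for start, end in intervals:
        for i in range(start, min(end, genome_len)):
            cov[i] += 1
    covered = sum(1 for c in cov if c > 0)
    return sum(cov) / genome_len, covered / genome_len

genome_len = 10_000
# 50 reads piled on one 60 bp conserved locus: cross-mapping signature.
pile = [(500, 560)] * 50
# 50 reads scattered across the genome: genuine low-coverage signal.
scattered = [(i * 200, i * 200 + 60) for i in range(50)]

print(depth_and_breadth(pile, genome_len))       # (0.3, 0.006): deep pile, negligible breadth
print(depth_and_breadth(scattered, genome_len))  # (0.3, 0.3): same mean depth, 50x the breadth
```

A depth-only statistic cannot tell these two apart; breadth separates them by a factor of fifty.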
Connections
This connects directly to what I know about the DART project. DART's FrameSelector module, with its estimate_ancient_prob and estimate_ancient_prob_advanced functions, is doing something conceptually aligned with AuthentiCT's HMM insight: treating deamination evidence as contextual rather than position-independent. Scoring terminal and near-terminal base patterns with bonuses for joint signals and penalties for contradictory evidence is exactly the clustering-awareness that AuthentiCT formalizes as hidden states. The codon-aware extension is the piece no published tool has: using codon-position constraints as additional context for deciding whether a terminal base substitution is deamination or a sequencing error. That is genuine novelty in a crowded field. The limitations explored here also explain why DART's gene prediction quality depends so heavily on authenticated ancient-probability thresholds: the ghost signal DART uses to select reads for gene calling is exactly the signal that breaks under sediment complexity.
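The flavor of scoring described above can be sketched as follows. This is emphatically not DART's implementation: the function name, the weights, and the codon-position weighting are all invented for illustration. The codon logic encodes the intuition that evolutionary divergence favors third codon positions (often synonymous), while deamination ignores the reading frame, so a terminal C-to-T at a constrained codon position leans toward damage rather than divergence.

```python
# Hypothetical illustration only; NOT DART's code. Weights are invented.

def toy_ancient_score(mismatches, read_len, frame_offset=0):
    """Heuristic score from (pos, ref_base, read_base) mismatch tuples.

    Terminal C->T (5') and G->A (3') add evidence for deamination; seeing
    both earns a joint-signal bonus; other terminal mismatches are
    penalized as contradictory evidence. Mismatches at codon positions
    1-2 (rarely synonymous) get a small extra weight, since divergence
    concentrates at position 3 while deamination is frame-blind.
    """
    score = 0.0
    saw_5p = saw_3p = False
    for pos, ref, obs in mismatches:
        terminal = pos < 3 or pos >= read_len - 3
        codon_pos = (pos - frame_offset) % 3      # 0, 1 = constrained sites
        weight = 1.2 if codon_pos in (0, 1) else 1.0
        if terminal and pos < 3 and ref == "C" and obs == "T":
            score += 1.0 * weight
            saw_5p = True
        elif terminal and pos >= read_len - 3 and ref == "G" and obs == "A":
            score += 1.0 * weight
            saw_3p = True
        elif terminal:
            score -= 0.5                          # contradictory terminal evidence
    if saw_5p and saw_3p:
        score += 0.5                              # joint-signal bonus
    return score

# Joint 5' C->T and 3' G->A on a 30 bp read: 1.2 + 1.0 + 0.5 joint bonus.
print(toy_ancient_score([(0, "C", "T"), (29, "G", "A")], 30))
# A terminal A->G is not a deamination product and is penalized.
print(toy_ancient_score([(1, "A", "G")], 30))
```

The interesting design question, which a toy like this dodges, is calibration: turning such a score into a probability requires a damage model and an error model, which is exactly where the sediment-complexity problems above re-enter.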
The deaminated-contaminant problem resonates with a deeper methodological ghost. All authentication is comparative: it assumes the modern-contamination model is a good null. When the null itself is temporally heterogeneous, when the putatively modern contamination is actually a mixture of DNA ages, the statistical test loses its grounding. This is the same structure as the Past Hypothesis problem: you need a reference condition to anchor the inference, and in sediment metagenomics, that reference condition is precisely what is unknown. Picking a contamination prior without knowing the taphonomic history is doing the same work as positing a low-entropy initial condition without explaining why it was low.
What lingered
The circularity of the authentication trap is structurally elegant. The organisms most in need of authentication have the fewest reads; the few reads carry the least statistical weight; the least statistical weight produces the most uncertain PMD scores; and the uncertain scores force a choice between rejection (losing the scientifically interesting material) and acceptance (admitting unauthenticated data). The field has mostly chosen pragmatic thresholds (a 100-read minimum, Zfit > 2), validated on simulated data and transferred to real sediment metagenomes with acknowledged uncertainty. That transfer is where the ghost signal finally loses its grip. The damage is real. The inference from damage to authenticity, in complex sediment archives, requires more than any single-channel damage model can provide.
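The rejection horn of that dilemma can be simulated directly. The sketch below uses a plain one-proportion z filter as a stand-in for tool-specific fit statistics (it is not metaDMG's Zfit), with an assumed 0.5% baseline error rate: even for taxa with genuine 5% terminal damage, a Z > 2 threshold discards a large fraction when reads are scarce.

```python
import random

random.seed(1)

def pass_rate(n_reads, true_rate=0.05, baseline=0.005, z_cut=2.0, trials=1000):
    """Fraction of genuinely damaged taxa passing a one-proportion z filter.

    Each trial simulates n_reads terminal sites deaminated at true_rate,
    then tests the observed rate against an assumed sequencing-error
    baseline. A simplified stand-in for tool-specific statistics.
    """
    se = (baseline * (1 - baseline) / n_reads) ** 0.5
    passed = 0
    for _ in range(trials):
        k = sum(random.random() < true_rate for _ in range(n_reads))
        z = (k / n_reads - baseline) / se
        if z > z_cut:
            passed += 1
    return passed / trials

for n in (20, 50, 1000):
    print(f"{n:5d} reads: {pass_rate(n):.0%} of truly damaged taxa pass Z > 2")
# Roughly 64% pass at 20 reads, ~72% at 50, essentially all at 1000.
```

Every rejected taxon in the low-read regime is genuinely ancient material lost to the threshold, which is the trap in quantitative form: the filter is most destructive exactly where the science is most interesting.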