SpeechLLM for AMD: The 8-Layer Architecture Behind 99.7% Accuracy

When most people hear "99.7% accuracy" in answering machine detection, they assume it's marketing hyperbole layered on top of marginal improvements to the same old heuristic-based detection logic.

It's not.

SpeechLLM is fundamentally different. It's not AMD with better tuning. It's not incremental improvement. It's a ground-up redesign of how answering machine detection actually works.

To understand why SpeechLLM achieves 99.7% accuracy where legacy AMD plateaus around 75-85%, you need to understand the architecture. Not at a surface level, but the actual layers that make it work.

This deep dive walks through the 8-layer SpeechLLM architecture that powers answering machine detection at VM Hunter.

Why Architecture Matters

Traditional AMD is essentially one big heuristic: measure timing and silence patterns, apply rules, make a decision. It's a single-layer approach.

SpeechLLM is eight layers working in parallel and sequence, each specializing in a specific aspect of audio analysis. Together, they create redundancy, nuance, and accuracy that single-layer approaches can't achieve.

The layers aren't arbitrary. Each one solves a specific problem that legacy AMD fails at.

The 8-Layer SpeechLLM Architecture

Layer 1: Signal Normalization & Preprocessing

The Problem: Raw telephony audio varies wildly. Mobile networks compress audio. Landline carriers add noise. Different codec configurations produce different amplitude profiles. VoIP jitter introduces timing artifacts.

Legacy AMD applies rules to raw, unnormalized audio. A voicemail greeting might hit 70% of the energy threshold on one carrier but 40% on another, throwing off timing-based classification.

The Solution: Layer 1 normalizes the audio stream before processing:

Volume normalization — scales audio to consistent amplitude across all carriers and codecs
Silence floor detection — identifies the baseline noise level for this specific connection
Codec detection — identifies whether G.711, G.729, or other codecs are in use
Jitter compensation — corrects timing artifacts from VoIP networks
Spectral whitening — removes predictable noise patterns

The output is audio that's consistent regardless of how it got to you. This single layer eliminates 30-40% of false positives just by removing the carrier and network noise variability.
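
As a rough illustration of the first two steps in the list (volume normalization and silence floor detection), here is a minimal sketch. The function names, the 160-sample frame length, the 10th-percentile floor estimate, and the 0.1 target level are all assumptions made for the example, not VM Hunter's actual parameters. Codec detection, jitter compensation, and spectral whitening are not covered.

```python
import math

def rms(samples):
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples)) if samples else 0.0

def estimate_noise_floor(samples, frame_len=160, percentile=0.1):
    """Estimate the silence floor as a low percentile of per-frame RMS."""
    levels = sorted(
        rms(samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    )
    if not levels:
        return 0.0
    return levels[int(percentile * (len(levels) - 1))]

def normalize_rms(samples, target_rms=0.1):
    """Scale samples so the overall RMS equals target_rms (pure silence is left as-is)."""
    level = rms(samples)
    if level == 0.0:
        return list(samples)
    gain = target_rms / level
    return [x * gain for x in samples]

# The same 440 Hz tone arriving at two different carrier levels.
SAMPLE_RATE = 8000
quiet = [0.05 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
loud = [0.60 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
quiet_norm, loud_norm = normalize_rms(quiet), normalize_rms(loud)
```

After normalization the two versions of the tone sit at the same level, so a downstream energy threshold sees the same signal regardless of which path the call took.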

Layer 2: Feature Extraction via Mel-Spectrograms

The Problem: Many classification models learn more efficiently from time-frequency features than from raw waveforms. You need features that capture the meaningful information in audio.

Legacy AMD uses energy and timing features. These are insufficient to distinguish complex patterns.

The Solution: Layer 2 converts normalized audio into mel-spectrograms:

Fast Fourier Transform (FFT) — converts audio into frequency-domain representation
Mel-scale filtering — applies perceptual scaling to frequencies (humans distinguish low frequencies better than high)
Log compression — applies logarithmic scaling to match human loudness perception
Windowing — divides audio into overlapping frames for temporal resolution

The result: a 2D spectrogram where x-axis = time, y-axis = mel frequency, intensity = signal strength. This representation discards phase and some fine frequency detail, but it keeps the perceptually relevant structure in a form optimized for neural network processing.
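
The pipeline above can be sketched in plain Python. In practice the order is windowing first, then FFT, mel filtering, and log compression. The frame size (256), hop (128), band count (20), and 8 kHz sample rate are assumptions chosen for a telephony-style example, and the mel formula is the common 2595 * log10(1 + f/700) variant. A production system would use an optimized DSP library rather than this recursive FFT.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters with centres spaced evenly on the mel scale."""
    max_mel = hz_to_mel(sample_rate / 2)
    mel_points = [i * max_mel / (n_mels + 1) for i in range(n_mels + 2)]
    bin_points = [mel_to_hz(m) * n_fft / sample_rate for m in mel_points]  # fractional bins
    bank = []
    for m in range(1, n_mels + 1):
        left, center, right = bin_points[m - 1], bin_points[m], bin_points[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if left < k <= center:
                row.append((k - left) / (center - left))
            elif center < k < right:
                row.append((right - k) / (right - center))
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def log_mel_spectrogram(samples, sample_rate=8000, n_fft=256, hop=128, n_mels=20):
    """Return frames[t][m]: log mel energy at time frame t and mel band m."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]  # Hann
    bank = mel_filterbank(n_mels, n_fft, sample_rate)
    frames = []
    for start in range(0, len(samples) - n_fft + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(n_fft)]
        power = [abs(c) ** 2 for c in fft(frame)[: n_fft // 2 + 1]]
        energies = [sum(w * p for w, p in zip(row, power)) for row in bank]
        frames.append([math.log(e + 1e-10) for e in energies])
    return frames

# Usage: a 1 kHz test tone sampled at 8 kHz.
tone = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(2048)]
spec = log_mel_spectrogram(tone)
```

Each row of `spec` is one time frame and each column one mel band, which is the 2D layout described above.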

Layer 3: Real-Time Signal Detection

The Problem: You need to detect specific audio events — beeps, silence, tone variations, fax signals — without waiting for the entire audio buffer.

The Solution: Layer 3 runs real-time signal detection in parallel with other analysis:

Voicemail beep detection — listens for the distinctive 2kHz tone at the end of voicemail greetings
Fax tone detection — identifies CNG (calling tone) and CED (called terminal) signals
Silence detection — tracks silence patterns and duration
Carrier intercept tones — detects specific tones that carriers use for disconnected numbers
DTMF detection — identifies touch-tone signaling from IVRs

These signals are strong classifiers on their own. A detected beep = voicemail (0.001% false positive). Detected fax tone = fax machine. But Layer 3 isn't the final decision — it's input to later layers.

Layer 4: Speech Recognition & Transcription

The Problem: Traditional AMD cannot hear what's being said. SpeechLLM needs to understand linguistic content.

The Solution: Layer 4 performs real-time speech recognition:

Streaming ASR — uses streaming automatic speech recognition tuned for short utterances (voicemail greetings are 1-5 seconds)
Confidence scoring — tracks how confident the recognition is (important for noisy audio)
Tokenization — converts recognized speech into tokens for analysis
Language detection — identifies the language being spoken (critical for 65+ language support)

The output isn't full transcription — that would be too slow. Instead, Layer 4 generates a compressed representation of what was said, sufficient for semantic analysis.
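
To make the idea of a compressed representation concrete, here is a hypothetical sketch. It assumes the recognizer emits words with per-word confidence scores. The `RecognizedWord` type, the `compress_hypothesis` function, and the 0.6 cut-off are invented for illustration and do not describe a real SpeechLLM interface.

```python
from dataclasses import dataclass

@dataclass
class RecognizedWord:
    text: str
    confidence: float  # 0.0 to 1.0, as reported by the (hypothetical) recognizer

def compress_hypothesis(words, min_confidence=0.6):
    """Keep confident words as lowercase tokens and summarise overall confidence."""
    kept = [w.text.lower().strip(".,!?") for w in words if w.confidence >= min_confidence]
    tokens = [t for t in kept if t]
    mean_confidence = sum(w.confidence for w in words) / len(words) if words else 0.0
    return {"tokens": tokens, "mean_confidence": mean_confidence}

partial = [
    RecognizedWord("Please", 0.95),
    RecognizedWord("leave", 0.90),
    RecognizedWord("uh", 0.30),
    RecognizedWord("message.", 0.80),
]
summary = compress_hypothesis(partial)
```

The low-confidence filler word is dropped, while the mean confidence is kept so later layers can discount a noisy recognition.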

Layer 5: Linguistic Pattern Analysis

The Problem: Voicemail greetings are formulaic. They contain predictable linguistic patterns that humans can recognize instantly but which legacy AMD completely misses.

The Solution: Layer 5 analyzes linguistic markers:

Keyword detection — looks for voicemail-specific phrases: "leave a message," "after the tone," "please call back," "not available," "press 1," etc.
Semantic analysis — understands the meaning of recognized speech, not just the words
Greeting formula detection — voicemail greetings follow predictable patterns ("You've reached [name]. I'm [status]. [Call-back instruction]")
Prosody analysis — analyzes the rhythm and intonation of speech (scripted voicemail has different prosody than spontaneous human speech)

A human hearing "leave your message after the tone" immediately knows it's voicemail. Layer 5 makes that determination programmatically.
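
The keyword and greeting-formula checks can be sketched with regular expressions over the recognized text. The phrase list comes from the bullets above. The loosened variants ("leave your message", "after the beep", "press one") and the function name are assumptions added for the example. Semantic and prosody analysis are out of scope here.

```python
import re

# Phrases quoted in the article, loosened slightly to catch close variants.
VOICEMAIL_PATTERNS = {
    "leave a message": re.compile(r"\bleave\s+(?:a|your)\s+message\b"),
    "after the tone": re.compile(r"\bafter\s+the\s+(?:tone|beep)\b"),
    "please call back": re.compile(r"\bplease\s+call\s+back\b"),
    "not available": re.compile(r"\bnot\s+available\b"),
    "press 1": re.compile(r"\bpress\s+(?:1|one)\b"),
}
GREETING_FORMULA = re.compile(r"\byou(?:'ve| have)\s+reached\b")

def analyze_transcript(transcript):
    """Return the voicemail phrases found and whether the greeting formula appears."""
    text = transcript.lower().replace("\u2019", "'")  # fold curly apostrophes
    hits = [label for label, pattern in VOICEMAIL_PATTERNS.items() if pattern.search(text)]
    return {"phrases": hits, "greeting_formula": bool(GREETING_FORMULA.search(text))}

greeting = "Hi, you've reached Mike. I'm not available. Please leave your message after the tone."
result = analyze_transcript(greeting)
```

A live answer such as "Hello? Who is this?" produces no phrase hits and no formula match, so this layer contributes no voicemail evidence for it.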

Layer 6: Transformer-Based Context Analysis

The Problem: Individual features and patterns are insufficient. You need holistic understanding that considers the entire audio context.

The Solution: Layer 6 runs a transformer-based neural network:

Self-attention mechanisms — learns which parts of the audio are most informative for classification
Positional encoding — understands that audio is temporal (what comes first vs. last matters)
Multi-head attention — analyzes audio from multiple perspectives simultaneously
Sequence modeling — understands that audio is a sequence with dependencies

The transformer has been trained on millions of labeled real-world calls. It's learned that:

Formal business greetings (even if long) = human
Casual short greetings = human
Repeated greetings with identical prosody = voicemail
Call screening phrase patterns = CALLGUARD (the call-screening class), not machine
Silence after greeting followed by tone = voicemail
Background noise with speech = likely human
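
For readers unfamiliar with the mechanisms listed above, the sketch below shows sinusoidal positional encoding and single-head scaled dot-product self-attention in plain Python. It is a teaching example only: it uses the input directly as queries, keys, and values, with no learned weights, and says nothing about the size or training of the actual model.

```python
import math

def positional_encoding(seq_len, dim):
    """Sinusoidal position vectors: sine on even indices, cosine on odd indices."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product attention with Q = K = V = x (no learned projections)."""
    dim = len(x[0])
    weights = []
    for query in x:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in x]
        weights.append(softmax(scores))
    output = [
        [sum(w * value[d] for w, value in zip(row, x)) for d in range(dim)]
        for row in weights
    ]
    return output, weights

# Four toy feature frames of width 8, with position information added.
frames = [[0.1 * (t + d) for d in range(8)] for t in range(4)]
encoded = [[f + p for f, p in zip(frame, pos)]
           for frame, pos in zip(frames, positional_encoding(4, 8))]
context, attention = self_attention(encoded)
```

Each row of `attention` shows how strongly one frame attends to every other frame. Multi-head attention repeats this with several learned projections in parallel.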

Layer 7: Classification Head with Confidence Scoring

The Problem: After analysis, you need to make a discrete decision. But you also need to know how confident that decision is.

The Solution: Layer 7 outputs multi-class probabilities:

HUMAN:        0.976
VOICEMAIL:    0.018
CALLGUARD:    0.004
IVR:          0.001
DISCONNECT:   0.001

Layer 7 exposes the full distribution, not just the top probability. This lets your dialplan set custom thresholds:

High-compliance campaigns: require >0.99 confidence before connecting
High-volume campaigns: accept >0.85 confidence
Manual review queue: route 0.70-0.85 confidence to supervisors
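
The threshold logic above might look like the following in a routing layer. This is a sketch under assumptions: the `route_call` name, the "accept" / "review" / "fallback" outcomes, and the treatment of anything below 0.70 as a fallback case are not specified in the article. The `softmax` helper shows how a classification head typically turns raw scores into a distribution.

```python
import math

def softmax(logits):
    """Convert raw per-class scores into probabilities that sum to 1."""
    peak = max(logits.values())
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def route_call(probabilities, accept_above=0.85, review_from=0.70):
    """Pick the top class, then decide what to do based on its confidence."""
    label = max(probabilities, key=probabilities.get)
    confidence = probabilities[label]
    if confidence > accept_above:
        return label, "accept"
    if confidence >= review_from:
        return label, "review"
    return label, "fallback"  # assumption: behaviour below the review band is not specified

# The example distribution shown above.
example = {"HUMAN": 0.976, "VOICEMAIL": 0.018, "CALLGUARD": 0.004, "IVR": 0.001, "DISCONNECT": 0.001}
high_volume = route_call(example)                       # threshold 0.85
high_compliance = route_call(example, accept_above=0.99)  # threshold 0.99
```

With the high-volume threshold the example call is accepted as HUMAN. Under the stricter 0.99 threshold the same 0.976 score is not enough, and in this sketch it goes to review instead.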

Layer 8: Post-Processing & Feedback Integration

The Problem: Static models get worse over time as the world changes. Carriers update voicemail systems. New call screening platforms emerge. Without adaptation, accuracy drifts.

The Solution: Layer 8 implements continuous learning:

Production call logging — captures all classification decisions and outcomes
Feedback collection — when agents report misclassifications, those are logged with audio
Drift detection — monitors accuracy metrics and alerts when performance degrades
Periodic retraining — model is retrained monthly on new production data
A/B testing — new model versions are tested against production before deployment

This feedback loop means SpeechLLM gets better every month as it sees more of your specific calling patterns.
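
Drift detection, in its simplest form, is a rolling accuracy check over calls whose true outcome is known (for example from agent feedback). The sketch below assumes a fixed-size window, the 99.7% figure as the baseline, and a 0.5-point tolerance. The class name and all three parameters are illustrative choices, not documented settings.

```python
from collections import deque

class DriftMonitor:
    """Track accuracy over the most recent labelled calls and flag a sustained drop."""

    def __init__(self, window=1000, baseline=0.997, tolerance=0.005):
        self.outcomes = deque(maxlen=window)  # True where prediction matched the outcome
        self.baseline = baseline
        self.tolerance = tolerance

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def drifted(self):
        # Wait for a full window so a few early errors do not trigger an alert.
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.accuracy() < self.baseline - self.tolerance

monitor = DriftMonitor(window=100)
for _ in range(100):
    monitor.record("HUMAN", "HUMAN")
healthy = monitor.drifted()
for _ in range(5):
    monitor.record("HUMAN", "VOICEMAIL")  # agent-reported misclassifications
degraded = monitor.drifted()
```

An alert from a monitor like this would be the trigger for pulling the logged audio into the next retraining set.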

How This Delivers 99.7% Accuracy

The 8-layer architecture doesn't add layers for the sake of it. Each layer solves a category of problem:

Problem	Legacy AMD	SpeechLLM Layer
Carrier/codec variations	❌ Fails	✅ Layer 1 normalization
Requires model-ready features	❌ Timing only	✅ Layer 2 mel-spectrograms
Needs specific event detection	❌ Generic rules	✅ Layer 3 signal detection
Must understand language	❌ Deaf	✅ Layers 4-5 ASR + linguistics
Requires contextual understanding	❌ Single rules	✅ Layer 6 transformer
Needs flexible confidence	❌ Binary	✅ Layer 7 probability distribution
Must adapt to new patterns	❌ Static	✅ Layer 8 continuous learning

The redundancy is intentional. If Layer 3 detects a beep with high confidence, that's a strong signal for voicemail. But if Layer 5 detects human speech patterns, that context modulates the beep signal. If Layer 4 detects a non-English voicemail phrase but Layer 6 classifies overall as human, that discrepancy triggers review.

The result: 99.7% sustained aggregate accuracy across call types, carriers, languages, and calling conditions.

Real-World Performance Across Call Types

Call Type	Legacy AMD	SpeechLLM
Formal business greeting	65%	99.8%
Casual human greeting	92%	99.9%
Short voicemail ("It's Mike")	40%	99.5%
Voicemail with background noise	72%	98.9%
iOS call screening	10%	99.7%
Non-English voicemail	68%	99.2%
Carrier intercept message	35%	99.8%
Multiple languages mixed	20%	97.5%

The improvements aren't marginal; they're categorical. On iOS call screening, legacy AMD achieves 10% accuracy, which is worse than a coin flip on a binary human-or-machine decision. SpeechLLM achieves 99.7%.

Why This Matters for Your Call Center

The 8-layer architecture isn't academic. It translates to operational reality:

Real humans don't get dropped — 0.2% false positive rate
Voicemails don't hit agents — 0.3% false negative rate
CALLGUARD calls get routed to agents — iOS/Android screening detected and handled correctly
Sub-50ms classification — no perceptible delay
Works globally — 65+ languages handled consistently
Gets better over time — continuous learning from your call patterns
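
As a quick consistency check, the two error rates in the list can be combined into an overall accuracy figure. The sketch assumes a simplified two-class view (live human vs. machine), where a false positive is a human treated as a machine and a false negative is a machine passed to an agent. The function name and the human-share values are assumptions for illustration.

```python
def overall_accuracy(human_share, false_positive_rate=0.002, false_negative_rate=0.003):
    """Accuracy for a call mix in which human_share of answered calls are live humans."""
    machine_share = 1.0 - human_share
    error_rate = human_share * false_positive_rate + machine_share * false_negative_rate
    return 1.0 - error_rate

# Accuracy at a few different call mixes.
mixes = {share: overall_accuracy(share) for share in (0.0, 0.25, 0.5, 1.0)}
```

Under these assumptions the overall figure stays between 99.7% (all machines) and 99.8% (all humans) whatever the mix, which lines up with the headline number.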

For a 50-person call center running 50,000 calls daily, this means reaching 4,500+ live humans instead of 2,500. That's not incremental. That's transformational.

Conclusion

SpeechLLM's 8-layer architecture represents a fundamental departure from legacy AMD. It's not better heuristics. It's actual intelligence applied to audio understanding.

The layers work together to solve the complete problem: handle all carrier types, understand all call types, detect all modern phenomena (like call screening), and continuously improve.

The result: 99.7% accuracy. Sustained. In production. At scale.

Start your free trial — deploy SpeechLLM's 8-layer architecture on your dialer today. No credit card required.