Intermediate · Tutorial · Claude API · Source: Anthropic

Audio Transcription with Deepgram + Claude: From Voice to Insight

Minh Tuấn, CTO, Transform Group

26/03/2026 · 5 min read



Combining Deepgram (a leading speech-to-text service) with Claude (text analysis) gives a two-stage pipeline: convert audio to text, then extract insights, summarize, or analyze sentiment automatically. Common use cases include meeting notes, customer call analysis, and subtitle generation.

Deepgram vs. other STT services

Why choose Deepgram?

Accuracy: low word error rate (WER) relative to other STT services
Speed: real-time transcription with latency under 300 ms
Features: diarization (who is speaking?), punctuation, speaker labels
Vietnamese support: Vietnamese is supported by the Nova-2 model
Cost: competitive pricing, with a free tier of 12,000 minutes per year
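
To make the accuracy criterion concrete, the sketch below computes WER as the word-level edit distance between a reference transcript and an STT hypothesis, divided by the number of reference words. The function name `word_error_rate` and the lowercase, whitespace-based tokenization are assumptions for illustration; published benchmarks usually apply heavier text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Running this on a handful of your own recordings with hand-checked reference transcripts is a simple way to compare STT services on the audio you actually care about.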

Installation

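The examples use the v3 interface of the Deepgram Python SDK (deepgram.listen.asyncprerecorded and deepgram.listen.asynclive), so the install command pins that major version:

pip install "deepgram-sdk>=3,<4" anthropic aiofiles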

import os
import asyncio
from deepgram import DeepgramClient, PrerecordedOptions, FileSource
import anthropic

deepgram = DeepgramClient(api_key=os.environ.get("DEEPGRAM_API_KEY"))
claude = anthropic.Anthropic(api_key=os.environ.get("ANTHROPIC_API_KEY"))

print("Deepgram + Claude ready")
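
Because `os.environ.get` returns `None` when a variable is unset, a missing key only surfaces later as an authentication error. A minimal sketch of an early check follows; `require_env` is a hypothetical helper, not part of either SDK.

```python
import os


def require_env(*names: str) -> dict[str, str]:
    """Return the requested environment variables, failing fast if any is unset or empty."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: os.environ[name] for name in names}


if __name__ == "__main__":
    # Raises RuntimeError with a readable message if either key is absent.
    keys = require_env("DEEPGRAM_API_KEY", "ANTHROPIC_API_KEY")
    print("Found keys:", ", ".join(keys))
```

Calling this before constructing the clients turns a confusing 401 response into a clear startup error.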

Transcribing an audio file

Deepgram supports many formats, including MP3, MP4, WAV, FLAC, OGG, and WebM:

async def transcribe_file(audio_path, language="vi"):
    """
    Transcribe an audio file with Deepgram.

    Args:
        audio_path: Path to the audio file
        language: 'vi' for Vietnamese, 'en' for English

    Returns:
        dict with the transcript, paragraphs, speakers, and metadata
    """
    with open(audio_path, "rb") as audio_file:
        audio_data = audio_file.read()

    payload: FileSource = {"buffer": audio_data}

    # Deepgram documents sentiment analysis and summarization as English-only,
    # so they are requested only for English audio.
    is_english = language.startswith("en")

    options = PrerecordedOptions(
        model="nova-2",           # Model recommended by the article at time of writing
        language=language,
        smart_format=True,        # Automatically format numbers and dates
        punctuate=True,           # Add punctuation
        diarize=True,             # Distinguish speakers
        utterances=True,          # Split into utterances
        paragraphs=True,          # Split into paragraphs
        sentiment=True if is_english else None,     # Sentiment analysis
        summarize="v2" if is_english else None,     # Automatic summary
    )

    response = await deepgram.listen.asyncprerecorded.v("1").transcribe_file(
        payload,
        options
    )

    result = response.results
    channel = result.channels[0]
    alternative = channel.alternatives[0]

    # Group utterances by speaker (available when diarization is enabled)
    speakers = {}
    if result.utterances:
        for utterance in result.utterances:
            speaker = utterance.speaker
            if speaker not in speakers:
                speakers[speaker] = []
            speakers[speaker].append(utterance.transcript)

    return {
        "transcript": alternative.transcript,
        "paragraphs": [" ".join(s.text for s in p.sentences)
                       for p in (alternative.paragraphs.paragraphs if alternative.paragraphs else [])],
        "confidence": alternative.confidence,
        "speakers": speakers,
        "summary": result.summary.short if hasattr(result, 'summary') and result.summary else None,
        "duration": response.metadata.duration
    }

# Example usage
result = asyncio.run(transcribe_file("meeting.mp3", language="vi"))
print(f"Transcript: {result['transcript'][:500]}")
print(f"Duration: {result['duration']:.1f}s")
print(f"Confidence: {result['confidence']:.2%}")
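
Grouping utterances by speaker, as `transcribe_file` does, loses the order of the conversation. The sketch below renders utterances as a time-ordered dialogue instead. It assumes each utterance exposes `speaker`, `start` (seconds), and `transcript` attributes, as Deepgram utterances do; the `Utterance` dataclass is only a hypothetical stand-in so the example runs without an API call.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Hypothetical stand-in mirroring the utterance fields used below."""
    speaker: int
    start: float
    transcript: str


def format_dialogue(utterances) -> str:
    """Render utterances chronologically, merging consecutive turns by the same speaker."""
    lines = []
    current_speaker = None
    for utt in sorted(utterances, key=lambda u: u.start):
        if utt.speaker != current_speaker:
            minutes, seconds = divmod(int(utt.start), 60)
            lines.append(f"[{minutes:02d}:{seconds:02d}] Speaker {utt.speaker}: {utt.transcript}")
            current_speaker = utt.speaker
        else:
            # Same speaker keeps talking: extend the current line.
            lines[-1] += " " + utt.transcript
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [
        Utterance(0, 0.0, "Hello."),
        Utterance(0, 2.5, "Shall we start?"),
        Utterance(1, 65.0, "Yes."),
    ]
    print(format_dialogue(demo))
```

A dialogue in this form tends to be easier for a language model to follow than per-speaker buckets, because questions stay next to their answers.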

Transcribing from a URL

async def transcribe_url(audio_url, language="en"):
    """Transcribe audio from a direct, publicly reachable media URL (e.g. a podcast MP3)."""
    options = PrerecordedOptions(
        model="nova-2",
        language=language,
        smart_format=True,
        punctuate=True,
        diarize=True,
        utterances=True
    )

    response = await deepgram.listen.asyncprerecorded.v("1").transcribe_url(
        {"url": audio_url},
        options
    )

    transcript = response.results.channels[0].alternatives[0].transcript
    return transcript

# Transcribe a podcast episode or a hosted video file
url = "https://example.com/podcast-episode.mp3"
transcript = asyncio.run(transcribe_url(url))
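
URL transcription needs a link that serves the media file itself, so a web page link (such as a video watch page) will typically fail. A rough pre-check is sketched below. The helper `looks_like_media_url` and its extension list (taken from the formats named earlier) are assumptions, and the check is only a heuristic: valid media URLs without a file extension would be rejected.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions for the formats listed earlier in this article.
AUDIO_EXTENSIONS = {".mp3", ".mp4", ".wav", ".flac", ".ogg", ".webm"}


def looks_like_media_url(url: str) -> bool:
    """Heuristic: http(s) URL whose path ends in a known audio/video extension."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    # The query string is ignored; only the path component is inspected.
    return PurePosixPath(parsed.path).suffix.lower() in AUDIO_EXTENSIONS


if __name__ == "__main__":
    print(looks_like_media_url("https://example.com/podcast-episode.mp3"))
    print(looks_like_media_url("https://www.youtube.com/watch?v=abc"))
```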

Analyzing a meeting with Claude

Once the transcript is available, Claude can extract several kinds of insights:

def analyze_meeting(transcript, speakers=None):
    """
    Analyze meeting content with Claude.
    Returns: summary, action items, decisions, and key points.
    """

    # Format the transcript with speaker labels when more than one speaker was detected
    if speakers and len(speakers) > 1:
        formatted = "TRANSCRIPT BY SPEAKER:\n\n"
        for speaker_id, texts in speakers.items():
            formatted += f"[Speaker {speaker_id}]:\n"
            formatted += "\n".join(texts[:5])  # first 5 utterances per speaker
            formatted += "\n\n"
    else:
        formatted = transcript

    # Note: only the first 4000 characters of the formatted transcript are sent
    prompt = f"""Analyze the following meeting transcript and provide:

1. **SUMMARY** (3-5 sentences): What was the meeting mainly about?
2. **ACTION ITEMS**: List the tasks to be done and who is responsible
3. **DECISIONS**: Decisions that were made
4. **KEY POINTS**: Key facts, numbers, deadlines
5. **FOLLOW-UP**: Unresolved questions that need further tracking

TRANSCRIPT:
{formatted[:4000]}"""

    response = claude.messages.create(
        model="claude-opus-4-5",
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}]
    )

    return response.content[0].text

# Complete pipeline
async def meeting_pipeline(audio_path):
    """Pipeline from audio file to meeting analysis."""
    print("Step 1: Transcribing audio...")
    result = await transcribe_file(audio_path, language="vi")

    print(f"  Transcribed {result['duration']:.0f}s of audio")
    print(f"  Confidence: {result['confidence']:.2%}")

    print("\nStep 2: Analyzing with Claude...")
    analysis = analyze_meeting(
        transcript=result["transcript"],
        speakers=result["speakers"]
    )

    return {
        "transcript": result["transcript"],
        "analysis": analysis,
        "speakers_count": len(result["speakers"]),
        "duration_minutes": result["duration"] / 60
    }

# Run the pipeline
# meeting_result = asyncio.run(meeting_pipeline("team_meeting.mp3"))
# print(meeting_result["analysis"])
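
`analyze_meeting` sends only the first 4,000 characters, so a long meeting is cut off. One possible workaround is to split the transcript at sentence boundaries and analyze each chunk separately. The sketch below covers only the splitting step; `chunk_transcript` is a hypothetical helper, and the punctuation-based sentence split is a simplification.

```python
import re


def chunk_transcript(text: str, max_chars: int = 4000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking only between sentences."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            # A single sentence longer than max_chars is kept whole rather than cut mid-word.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for part in chunk_transcript("One two. Three four. Five six.", max_chars=20):
        print(repr(part))
```

Each chunk could then be passed to `analyze_meeting`, with a final call that merges the per-chunk results. That merging step is left out here.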

Real-time Transcription

Deepgram also supports streaming transcription for live audio:

import asyncio
from deepgram import LiveTranscriptionEvents, LiveOptions

async def realtime_transcription():
    """Real-time transcription from a microphone."""

    connection = deepgram.listen.asynclive.v("1")

    # Callback invoked when a transcript arrives
    async def on_transcript(self, result, **kwargs):
        sentence = result.channel.alternatives[0].transcript
        if sentence:
            print(f"Transcript: {sentence}")

    connection.on(LiveTranscriptionEvents.Transcript, on_transcript)

    options = LiveOptions(
        model="nova-2",
        language="vi",
        encoding="linear16",
        channels=1,
        sample_rate=16000,
        interim_results=True,  # Show partial results
        smart_format=True
    )

    await connection.start(options)

    # Read audio from the microphone (requires pyaudio)
    # Send chunks with: await connection.send(audio_chunk)

    await asyncio.sleep(30)  # Keep the connection open for 30 seconds
    await connection.finish()
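
With `interim_results=True`, the same stretch of speech arrives several times as it is refined, so printing every message repeats text. The sketch below keeps only finalized segments plus the latest partial one. It assumes each live result carries an `is_final` flag, as Deepgram live results do; `TranscriptAssembler` is a hypothetical helper, written without SDK calls so it can be tested offline.

```python
class TranscriptAssembler:
    """Accumulates finalized segments and tracks the latest interim text."""

    def __init__(self) -> None:
        self.final_segments: list[str] = []
        self.interim = ""

    def add(self, text: str, is_final: bool) -> None:
        text = text.strip()
        if is_final:
            if text:
                self.final_segments.append(text)
            self.interim = ""  # the finalized segment replaces the partial one
        else:
            self.interim = text  # each interim result supersedes the previous one

    def current_text(self) -> str:
        parts = self.final_segments + ([self.interim] if self.interim else [])
        return " ".join(parts)


if __name__ == "__main__":
    assembler = TranscriptAssembler()
    assembler.add("hello wor", is_final=False)
    assembler.add("Hello world.", is_final=True)
    assembler.add("how are", is_final=False)
    print(assembler.current_text())
```

Inside the transcript callback, this would be fed with `result.channel.alternatives[0].transcript` and `result.is_final`.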

Analyzing customer calls

A practical use case: analyzing sentiment and customer service quality:

def analyze_customer_call(transcript):
    """Analyze a customer call for quality assurance."""

    prompt = f"""Analyze this customer service call:

TRANSCRIPT:
{transcript[:3000]}

Evaluate it against these criteria:
1. **CUSTOMER SENTIMENT**: Positive/Neutral/Negative, with reasons
2. **ISSUE**: Why did the customer call?
3. **RESOLUTION**: Was the issue resolved?
4. **AGENT QUALITY**: Score 1-10, with comments on attitude and skill
5. **ESCALATION RISK**: Is the customer at risk of churning?
6. **IMPROVEMENT**: Suggestions for next time

Respond in JSON format."""

    response = claude.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )

    return response.content[0].text
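
`analyze_customer_call` returns raw text, and a model reply that was asked for JSON may still wrap it in a Markdown code fence or add a sentence around it. A tolerant parser is sketched below; `parse_json_reply` is a hypothetical helper, and it still raises `json.JSONDecodeError` when no valid JSON object can be recovered.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker, built here to avoid a literal fence in the source
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def parse_json_reply(text: str) -> dict:
    """Extract and parse a JSON object from a model reply."""
    cleaned = text.strip()
    fenced = FENCED_BLOCK.search(cleaned)
    if fenced:
        # Prefer the contents of the first fenced block.
        cleaned = fenced.group(1).strip()
    else:
        # Otherwise take the span from the first "{" to the last "}".
        start, end = cleaned.find("{"), cleaned.rfind("}")
        if start != -1 and end > start:
            cleaned = cleaned[start:end + 1]
    return json.loads(cleaned)


if __name__ == "__main__":
    reply = 'Here is the result: {"sentiment": "Negative", "agent_score": 7}'
    print(parse_json_reply(reply))
```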

Processing multiple files at once

import asyncio
from pathlib import Path

async def batch_transcribe(audio_dir, language="vi"):
    """Transcribe all audio files in a directory."""
    audio_files = (list(Path(audio_dir).glob("*.mp3"))
                   + list(Path(audio_dir).glob("*.wav")))

    print(f"Found {len(audio_files)} audio files")

    # Run in parallel (at most 5 files at a time)
    semaphore = asyncio.Semaphore(5)

    async def transcribe_with_semaphore(file_path):
        async with semaphore:
            result = await transcribe_file(str(file_path), language)
            return {"file": file_path.name, **result}

    tasks = [transcribe_with_semaphore(f) for f in audio_files]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    # Filter errors
    successful = [r for r in results if not isinstance(r, Exception)]
    print(f"Successfully transcribed: {len(successful)}/{len(audio_files)}")

    return successful
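
`batch_transcribe` drops failed files silently, which makes it hard to tell which ones to retry. The sketch below shows the same semaphore pattern while keeping failures alongside successes. `run_limited` and the `fake_transcribe` worker are hypothetical names; in practice the worker would wrap `transcribe_file`.

```python
import asyncio


async def run_limited(items, worker, limit=5):
    """Run worker(item) for every item with bounded concurrency.

    Returns (succeeded, failed): two dicts keyed by item.
    """
    items = list(items)
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    results = await asyncio.gather(*(guarded(i) for i in items), return_exceptions=True)

    succeeded, failed = {}, {}
    for item, result in zip(items, results):
        if isinstance(result, Exception):
            failed[item] = repr(result)  # keep the reason so the file can be retried or reported
        else:
            succeeded[item] = result
    return succeeded, failed


async def fake_transcribe(name):
    """Stand-in worker that fails for one file, used only for the demonstration."""
    await asyncio.sleep(0)
    if name == "bad.mp3":
        raise ValueError("unsupported audio")
    return f"transcript of {name}"


if __name__ == "__main__":
    ok, bad = asyncio.run(run_limited(["a.mp3", "bad.mp3", "b.wav"], fake_transcribe, limit=2))
    print("succeeded:", ok)
    print("failed:", bad)
```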

Conclusion

The Deepgram + Claude pipeline turns audio into structured insights automatically. From meeting recordings to customer calls, and from podcasts to voice memos, any audio can become searchable, analyzable knowledge.

Next step: read about Voice Assistant with ElevenLabs + Claude to add text-to-speech and complete the voice conversation loop.
