Advanced · Tutorial · Claude API · Source: Anthropic

Computer Use Demo: Claude controls your computer

Minh Tuấn, CTO, Transform Group

26/03/2026 · 6 min read




Computer Use is one of Claude's most impressive capabilities: it can look at a computer screen and drive the mouse and keyboard the way a person would. Instead of only generating text, Claude can open apps, navigate websites, fill in forms, and carry out complex desktop tasks.

This article walks you through setting up a safe environment with Docker and building your first computer use demo.

How does Computer Use work?

Technically, computer use rests on three concepts:

Screenshot: your code captures the screen and sends it to Claude as a base64-encoded image through the vision API
Action Tools: Claude calls tools to click, type, scroll, or press keyboard shortcuts
Feedback Loop: after each action, a fresh screenshot is taken to confirm the result

Claude looks at the screen
      |
      v
[Analyze: what needs to happen next?]
      |
      v
[Call a tool: click(x, y) / type(text) / screenshot()]
      |
      v
[Receive the result + a new screenshot]
      |
      v
[Repeat until the task is done]

Important: Claude has no direct access to the OS. It only sees screenshots and issues commands through tools. You, the developer, are the one who implements those tools.
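The loop in the diagram can be sketched without any API call. The snippet below is only an illustration: `run_loop`, the fake tools, and `scripted_decide` are hypothetical stand-ins (a hard-coded policy plays the role of Claude, and a dictionary plays the role of the screen), so it shows the observe, decide, act cycle rather than real computer control.

```python
from typing import Callable

def run_loop(decide: Callable[[str], dict], tools: dict, max_steps: int = 10) -> list:
    """Observe -> decide -> act, until the policy answers {"tool": "done"}."""
    trace = []
    observation = tools["screenshot"]()          # initial look at the screen
    for _ in range(max_steps):
        action = decide(observation)             # the model's job in the real system
        if action["tool"] == "done":
            break
        result = tools[action["tool"]](**action.get("args", {}))
        trace.append((action["tool"], result))
        observation = tools["screenshot"]()      # feedback: look again after acting
    return trace

# Fake environment: a string stands in for the screenshot.
state = {"screen": "blank"}

def fake_screenshot() -> str:
    return state["screen"]

def fake_click(x: int, y: int) -> str:
    state["screen"] = "focused"
    return f"clicked ({x}, {y})"

def fake_type(text: str) -> str:
    state["screen"] = f"typed:{text}"
    return f"typed {text!r}"

def scripted_decide(observation: str) -> dict:
    """Hard-coded policy standing in for Claude."""
    if observation == "blank":
        return {"tool": "click", "args": {"x": 100, "y": 200}}
    if observation == "focused":
        return {"tool": "type", "args": {"text": "hello"}}
    return {"tool": "done"}

trace = run_loop(
    scripted_decide,
    {"screenshot": fake_screenshot, "click": fake_click, "type": fake_type},
)
for name, result in trace:
    print(name, "->", result)
```

The point of the sketch is the separation of roles: the policy only ever receives observations and returns tool requests, while the tools are ordinary functions owned by the developer.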

Setting up a safe Docker environment

Run computer use inside Docker to keep it isolated from the host machine:

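# Dockerfile
FROM ubuntu:22.04

# Avoid interactive prompts (for example tzdata) during package installation
ENV DEBIAN_FRONTEND=noninteractive

# firefox-esr is not in Ubuntu 22.04's default repositories;
# this assumes the mozillateam PPA as its source.
RUN apt-get update \
    && apt-get install -y software-properties-common \
    && add-apt-repository -y ppa:mozillateam/ppa

# Install the X11 virtual display, input and screenshot tools, Python, and Firefox
RUN apt-get update && apt-get install -y --no-install-recommends \
        xvfb \
        x11vnc \
        xdotool \
        scrot \
        python3 \
        python3-pip \
        firefox-esr \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
RUN pip3 install anthropic pillow

# Create a non-root user for better security
RUN useradd -m -s /bin/bash claudeuser
USER claudeuser
WORKDIR /home/claudeuser

COPY demo.py .

CMD ["bash", "-c", "Xvfb :99 -screen 0 1366x768x24 & sleep 1 && DISPLAY=:99 python3 demo.py"]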
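The CMD above gives Xvfb a fixed one-second head start. A slightly more robust option is to have `demo.py` wait until the display is actually there. The helper below is a sketch under one assumption: that Xvfb exposes its Unix socket at `/tmp/.X11-unix/X<display number>`, which is the usual default. `wait_for_display` is a hypothetical name, and a present socket does not strictly guarantee the server is already accepting connections.

```python
import os
import time

def wait_for_display(display: str = ":99", timeout: float = 10.0,
                     socket_dir: str = "/tmp/.X11-unix") -> bool:
    """Poll for the X server's Unix socket; return True once it exists."""
    # ":99" or ":99.0" -> "99"
    number = display.lstrip(":").split(".")[0]
    socket_path = os.path.join(socket_dir, f"X{number}")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(socket_path):
            return True
        time.sleep(0.1)
    return False

if __name__ == "__main__":
    display = os.environ.get("DISPLAY", ":99")
    if not wait_for_display(display, timeout=2.0):
        print(f"X display {display} did not come up in time")
```

Calling this at the top of `demo.py` and exiting on `False` turns a silent screenshot failure into an explicit startup error.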

Build and run:

docker build -t computer-use-demo .
docker run -e ANTHROPIC_API_KEY=your_key_here computer-use-demo
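The run command above can be tightened with a few standard `docker run` options. The sketch below only builds the argument list in Python so it can be passed to `subprocess.run`. `docker_run_command` is a hypothetical helper, the memory and process limits are illustrative values, and it has not been verified that Firefox runs comfortably under them in this image.

```python
import shlex

def docker_run_command(image: str, api_key_var: str = "ANTHROPIC_API_KEY",
                       memory: str = "2g", pids_limit: int = 256) -> list:
    """Build a `docker run` argument list with basic hardening flags."""
    return [
        "docker", "run", "--rm",
        # "-e NAME" without "=value" forwards the variable from the host
        # environment, so the key does not appear on the command line.
        "-e", api_key_var,
        "--cap-drop=ALL",                      # drop all Linux capabilities
        "--security-opt=no-new-privileges",    # block privilege escalation via setuid
        f"--memory={memory}",                  # cap container memory
        f"--pids-limit={pids_limit}",          # cap the number of processes
        image,
    ]

print(shlex.join(docker_run_command("computer-use-demo")))
```

Building the list instead of a shell string avoids quoting problems and makes the flags easy to review or unit-test.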

Defining the Computer Use tools

Anthropic provides a standard tool schema for computer use. You need to implement the backend side yourself:

import base64
import subprocess

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Tool 1: take a screenshot
def take_screenshot() -> str:
    """Capture the screen and return it as a base64-encoded PNG."""
    # Relies on scrot writing the image to stdout when the file name is "-"
    # (supported by recent scrot versions); -z suppresses the beep.
    result = subprocess.run(
        ["scrot", "-", "-z"],
        capture_output=True,
        check=True,
    )
    return base64.b64encode(result.stdout).decode()

# Tool 2: mouse click
def mouse_click(x: int, y: int, button: str = "left") -> str:
    button_map = {"left": "1", "middle": "2", "right": "3"}
    btn = button_map.get(button, "1")
    subprocess.run(["xdotool", "mousemove", str(x), str(y)], check=True)
    subprocess.run(["xdotool", "click", btn], check=True)
    return f"Clicked {button} at ({x}, {y})"

# Tool 3: type text
def type_text(text: str) -> str:
    subprocess.run(["xdotool", "type", "--clearmodifiers", text], check=True)
    preview = text if len(text) <= 50 else text[:50] + "..."
    return f"Typed: {preview}"

# Tool 4: press a key or key combination
def key_press(key: str) -> str:
    subprocess.run(["xdotool", "key", key], check=True)
    return f"Pressed key: {key}"

# Tool 5: scroll
def scroll(x: int, y: int, direction: str, amount: int = 3) -> str:
    # X11 maps the scroll wheel to buttons 4 (up) and 5 (down)
    btn = "4" if direction == "up" else "5"
    # Move the pointer first so the scroll lands on the intended element
    subprocess.run(["xdotool", "mousemove", str(x), str(y)], check=True)
    subprocess.run(["xdotool", "click", "--repeat", str(amount), btn], check=True)
    return f"Scrolled {direction} {amount} times at ({x}, {y})"
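The tools above pass whatever coordinates they receive straight to `xdotool`. A small guard can reject points outside the virtual display before any command runs. This is a sketch, not part of the original listing: `validate_point` is a hypothetical helper, and the default bounds assume the 1366x768 Xvfb screen configured in the Dockerfile.

```python
DISPLAY_WIDTH = 1366   # assumed to match the Xvfb -screen setting
DISPLAY_HEIGHT = 768

def validate_point(x: int, y: int,
                   width: int = DISPLAY_WIDTH,
                   height: int = DISPLAY_HEIGHT) -> tuple:
    """Return (x, y) if the point lies on the display, otherwise raise."""
    for name, value in (("x", x), ("y", y)):
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{name} must be an integer, got {type(value).__name__}")
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError(f"({x}, {y}) is outside the {width}x{height} display")
    return x, y

print(validate_point(683, 384))
```

Calling `validate_point(x, y)` at the top of `mouse_click` and `scroll` would turn an off-screen request into an error message that can be sent back to Claude as the tool result.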

Tool schemas in Anthropic's format

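# Anthropic's built-in computer tool. It is a beta feature: the version string
# in "type" must be one the chosen model supports, and the request must go
# through the beta Messages endpoint with the matching beta flag.
computer_tools = [
    {
        "type": "computer_20241022",
        "name": "computer",
        "display_width_px": 1366,
        "display_height_px": 768,
        "display_number": 99  # matches the Xvfb :99 display from the Dockerfile
    }
]

# Or define your own, more detailed tools:
custom_tools = [
    {
        "name": "screenshot",
        "description": "Capture the current screen and return it as a base64 PNG",
        "input_schema": {
            "type": "object",
            "properties": {},
            "required": []
        }
    },
    {
        "name": "mouse_click",
        "description": "Click the mouse at coordinates (x, y)",
        "input_schema": {
            "type": "object",
            "properties": {
                "x": {"type": "integer", "description": "X coordinate (pixels)"},
                "y": {"type": "integer", "description": "Y coordinate (pixels)"},
                "button": {
                    "type": "string",
                    "enum": ["left", "middle", "right"],
                    "description": "Mouse button, defaults to left"
                }
            },
            "required": ["x", "y"]
        }
    },
    {
        "name": "type_text",
        "description": "Type text at the current cursor position",
        "input_schema": {
            "type": "object",
            "properties": {
                "text": {"type": "string", "description": "Text to type"}
            },
            "required": ["text"]
        }
    },
    {
        "name": "key_press",
        "description": "Press a key or shortcut, e.g. Return, ctrl+c, alt+Tab",
        "input_schema": {
            "type": "object",
            "properties": {
                "key": {"type": "string", "description": "Key name in xdotool format"}
            },
            "required": ["key"]
        }
    },
    {
        # Without this entry Claude could never call the scroll() function defined earlier
        "name": "scroll",
        "description": "Scroll the mouse wheel at coordinates (x, y)",
        "input_schema": {
            "type": "object",
            "properties": {
                "x": {"type": "integer", "description": "X coordinate (pixels)"},
                "y": {"type": "integer", "description": "Y coordinate (pixels)"},
                "direction": {"type": "string", "enum": ["up", "down"]},
                "amount": {"type": "integer", "description": "Wheel steps, defaults to 3"}
            },
            "required": ["x", "y", "direction"]
        }
    }
]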
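The `input_schema` entries describe what Claude is expected to send, but the backend can still check the arguments before executing anything. The sketch below is a deliberately small validator covering only `required`, `type`, and `enum`; it is not a full JSON Schema implementation. `validate_input` and `JSON_TYPES` are hypothetical names, and `CLICK_SCHEMA` repeats the `mouse_click` schema for the sake of a self-contained example.

```python
# Map JSON Schema type names to the Python types used to check them
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}

def validate_input(schema: dict, data: dict) -> list:
    """Return a list of error strings; an empty list means the input looks valid."""
    errors = []
    properties = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in data:
            errors.append(f"missing required field: {name}")
    for name, value in data.items():
        spec = properties.get(name)
        if spec is None:
            errors.append(f"unexpected field: {name}")
            continue
        type_name = spec.get("type")
        expected = JSON_TYPES.get(type_name)
        # bool is a subclass of int in Python, so reject it for numeric fields
        bool_as_number = isinstance(value, bool) and type_name in ("integer", "number")
        if expected is not None and (bool_as_number or not isinstance(value, expected)):
            errors.append(f"{name}: expected {type_name}")
            continue
        if "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name}: not one of {spec['enum']}")
    return errors

CLICK_SCHEMA = {
    "type": "object",
    "properties": {
        "x": {"type": "integer"},
        "y": {"type": "integer"},
        "button": {"type": "string", "enum": ["left", "middle", "right"]},
    },
    "required": ["x", "y"],
}

print(validate_input(CLICK_SCHEMA, {"x": 10, "y": 20}))
print(validate_input(CLICK_SCHEMA, {"x": "10"}))
```

Returning the error list as the tool result (rather than raising) gives Claude a chance to correct the call on the next turn.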

Agent loop with vision

The key difference: when you send a screenshot, you use an image content block, not text:

def run_computer_agent(task: str) -> str:
    """
    Run the computer use agent on the given task.
    """
    # Take the initial screenshot
    screenshot_b64 = take_screenshot()

    # Build the first message, including the image
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": screenshot_b64
                    }
                },
                {
                    "type": "text",
                    "text": f"This is the current screen. Your task: {task}"
                }
            ]
        }
    ]

    system = """You are an AI that controls a computer.
    Before each action, observe the screen carefully.
    After each action, take a new screenshot to confirm the result.
    If something fails, try again with a different approach.
    Report back when the task is complete."""

    tool_map = {
        "screenshot": take_screenshot,
        "mouse_click": mouse_click,
        "type_text": type_text,
        "key_press": key_press,
        "scroll": scroll
    }

    for _ in range(50):  # At most 50 model turns
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=4096,
            system=system,
            tools=custom_tools,
            messages=messages
        )

        messages.append({
            "role": "assistant",
            "content": response.content
        })

        # Anything other than a tool request (end_turn, max_tokens, ...) ends
        # the loop; sending an empty tool_results message would be rejected.
        if response.stop_reason != "tool_use":
            return next(
                (b.text for b in response.content if b.type == "text"), ""
            )

        tool_results = []
        for block in response.content:
            if block.type != "tool_use":
                continue
            print(f"Action: {block.name}({block.input})")
            try:
                result = tool_map[block.name](**block.input)
            except Exception as exc:  # unknown tool, bad arguments, failed command
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": f"Error: {exc}",
                    "is_error": True
                })
                continue

            # A screenshot goes back as an image block, everything else as text
            if block.name == "screenshot":
                content = [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": result
                        }
                    }
                ]
            else:
                content = str(result)
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": content
            })

        messages.append({
            "role": "user",
            "content": tool_results
        })

    return "Timeout: reached the maximum number of actions"
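Every screenshot stays in `messages` for the rest of the run, so a long session keeps resending old images. One possible mitigation, sketched below, is to replace all but the most recent screenshots with a short text placeholder before each API call. `trim_old_screenshots` is a hypothetical helper, not part of the original listing; it only touches dictionary-style blocks and leaves SDK objects in assistant turns untouched.

```python
import copy

def trim_old_screenshots(messages: list, keep_last: int = 2,
                         placeholder: str = "[older screenshot removed]") -> list:
    """Return a copy of messages in which only the newest keep_last images remain."""
    trimmed = copy.deepcopy(messages)   # never mutate the caller's history
    seen = 0

    def prune(blocks: list) -> None:
        """Walk one content list from newest to oldest, replacing surplus images."""
        nonlocal seen
        for i in range(len(blocks) - 1, -1, -1):
            block = blocks[i]
            if not isinstance(block, dict):
                continue                # e.g. SDK objects in assistant turns
            if block.get("type") == "image":
                seen += 1
                if seen > keep_last:
                    blocks[i] = {"type": "text", "text": placeholder}
            elif block.get("type") == "tool_result" and isinstance(block.get("content"), list):
                prune(block["content"])  # screenshots returned by the screenshot tool

    for message in reversed(trimmed):
        content = message.get("content")
        if isinstance(content, list):
            prune(content)
    return trimmed
```

In the loop this would be used as `messages=trim_old_screenshots(messages)` in the `client.messages.create` call, while the full history stays available locally for logging.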

Demo: automatically filling in a web form

# Open Firefox and fill in the search form
result = run_computer_agent(
    "Open Firefox, go to google.com, "
    "search for 'anthropic claude api', "
    "and take a screenshot of the first result"
)

print(result)

Claude will typically:

Observe the screen and determine that Firefox is not open yet
Double-click the Firefox icon (or use a terminal)
Take a screenshot once Firefox has opened
Click the address bar
Type google.com and press Enter
Click the search box and type the query
Take a screenshot of the final result

Safety considerations (important!)

Computer use is a powerful feature, but it carries risk. Anthropic recommends:

Always use a sandbox: Docker or a virtual machine. Do NOT run directly on the host machine.
Limit privileges: a non-root user, with no sudo rights inside the container
Monitor actions: log every action before it is executed and allow human review
Confirm sensitive actions: deleting files, sending email, making purchases, and similar actions need human confirmation
Network isolation: restrict network access inside the container

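SENSITIVE_PATTERNS = [
    "rm -rf", "delete", "format",
    "send email", "purchase", "payment"
]

def safe_action(action_name: str, action_input: dict, tool_map: dict) -> str:
    """Run a tool from tool_map, asking a human first if the input looks sensitive."""
    # Substring matching only sees the tool input (e.g. typed text);
    # it cannot tell what a click on the screen will trigger.
    input_str = str(action_input).lower()
    matched = [p for p in SENSITIVE_PATTERNS if p in input_str]
    if matched:
        # Ask once, even when several patterns match
        confirm = input(
            f"WARNING: sensitive action '{action_name}' "
            f"with input '{input_str[:50]}' (matched: {', '.join(matched)}). "
            f"Confirm? (y/n): "
        )
        if confirm.strip().lower() != "y":
            return "Action cancelled by the user"

    # tool_map is passed in because it is local to run_computer_agent
    return tool_map[action_name](**action_input)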
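The "monitor actions" recommendation can be covered with an append-only log that is written before the tool runs, so that a crash or a destructive action still leaves a record. The sketch below uses a JSON Lines file; `logged_call` and the `actions.jsonl` path are illustrative choices, not part of the original article.

```python
import json
import time
from typing import Any, Callable

def logged_call(name: str, args: dict, func: Callable[..., Any],
                log_path: str = "actions.jsonl") -> Any:
    """Append the action to a JSON Lines log, then execute it."""
    entry = {"ts": time.time(), "action": name, "input": args}
    # Write and close the file BEFORE running the tool, so the record
    # exists even if the action itself fails or hangs.
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return func(**args)

if __name__ == "__main__":
    # Harmless stand-in for a real tool function
    print(logged_call("echo", {"text": "hello"}, lambda text: text.upper()))
```

One JSON object per line keeps the log easy to tail during a run and easy to load afterwards for review.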

Summary

Component	Role	Technology
Screenshot	Lets Claude see the screen	scrot + base64
Mouse control	Click, scroll	xdotool
Keyboard	Typing, hotkeys	xdotool type/key
Sandbox	Safe isolation	Docker + Xvfb
Vision API	Claude analyzes the image	Claude vision + base64

Computer Use makes it possible to automate a wide range of desktop tasks, from filling in forms and handling email to automated UI testing. Next steps: see Browser Use Demo for more in-depth web automation with Puppeteer, or go back to LLM Agent from Scratch to understand the underlying agent architecture.
