SWE-bench Verified Is Dead: Why OpenAI Retired Its Own Benchmark

OpenAI has stopped reporting SWE-bench Verified scores and is calling on the rest of the AI industry to do the same. The benchmark that once defined progress in AI coding has been officially declared dead: 59.4% of the audited problems had serious flaws, and every major frontier model tested was contaminated by training data.
When a Benchmark Outlives Its Usefulness
After its launch in August 2024, SWE-bench Verified became the gold standard for measuring how well AI models can autonomously fix real-world software bugs. Every major model launch cited it. Then OpenAI took a closer look, and what it found made continued use of the benchmark indefensible.
OpenAI's decision to stop reporting scores, and its recommendation that the industry follow, rests on more than saturation: the benchmark is broken in two distinct and serious ways.
Problem #1: Flawed Test Cases
OpenAI audited 138 problems that its strongest model, o3, consistently failed across 64 independent runs. The results were alarming: 59.4% of the audited problems had serious flaws in test design or in the problem description, making them extremely hard or impossible to solve correctly, even for the best model or engineer.
Two categories of broken tests stand out:
- Too-narrow tests (35.5%): the test case enforces one specific implementation approach and rejects solutions that are functionally correct but solve the problem a different way.
- Too-wide tests (18.8%): the test case checks functionality that the original problem statement never described, penalizing models for failing to read the test author's mind.
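The two categories can be made concrete with a small illustrative sketch. Everything here is hypothetical and not taken from SWE-bench: the task description, the `dedupe_reference` and `dedupe_candidate` functions, and the three test helpers are invented purely to show how a valid solution can fail a flawed test.

```python
# Hypothetical task description given to the model:
#   "dedupe(items) returns the distinct items of a list."

def dedupe_reference(items):
    """Hypothetical reference patch: sorts the output and tolerates None."""
    if items is None:
        return []
    return sorted(set(items))

def dedupe_candidate(items):
    """A valid alternative: distinct items, kept in first-seen order."""
    return list(dict.fromkeys(items))

def fair_test(fn):
    # Checks only what the description asks for: distinct items, no duplicates.
    result = fn([3, 1, 3, 2])
    return set(result) == {1, 2, 3} and len(result) == 3

def too_narrow_test(fn):
    # Demands the reference's ordering, which the description never requires.
    return fn([3, 1, 3, 2]) == [1, 2, 3]

def too_wide_test(fn):
    # Demands None handling, which the description never mentions.
    try:
        return fn(None) == []
    except TypeError:
        return False

for name, fn in [("reference", dedupe_reference), ("candidate", dedupe_candidate)]:
    print(name, fair_test(fn), too_narrow_test(fn), too_wide_test(fn))
```

The candidate satisfies the written description and passes the fair test, yet both flawed tests reject it. Only a solution that happens to mirror the reference patch passes all three.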
In other words, when frontier models fail a SWE-bench Verified task, it is often not because their answer is wrong but because the test is. That fundamentally undermines the benchmark's ability to measure what it claims to measure.
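For a sense of scale, the reported percentages can be converted into approximate problem counts. The counts below are derived by rounding, not quoted from OpenAI, and the "other" figure assumes the too-narrow and too-wide categories do not overlap, which the source does not state.

```python
AUDITED = 138  # problems in the audit, as reported

# Percentages as reported for the audited set.
shares = {"flawed": 59.4, "too_narrow": 35.5, "too_wide": 18.8}

# Round each share to the nearest whole problem.
counts = {name: round(AUDITED * pct / 100) for name, pct in shares.items()}

# Flawed problems not covered by the two named categories
# (assumption: the two categories are disjoint).
other = counts["flawed"] - counts["too_narrow"] - counts["too_wide"]

print(counts, other)
```

Under that assumption, roughly 82 of the 138 audited problems were flawed, about 49 with too-narrow tests and about 26 with too-wide tests, leaving a handful with other issues.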
Problem #2: Contaminated Data
The second problem is even more alarming. SWE-bench problems are drawn from open-source GitHub repositories, the same repositories that major frontier models are trained on. OpenAI's analysis found that every frontier model it tested could reproduce the exact human-written fix used as the reference solution, along with verbatim problem details from the benchmark.
This is not a theoretical contamination risk; it is confirmed exposure. Models that saw the benchmark solutions during training score higher not because they are more capable, but because they essentially already had the answers. The affected frontier models confirmed in OpenAI's analysis include GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash.
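One simple way to flag verbatim reproduction of a reference patch is to measure the longest contiguous run of text a model's output shares with it. The sketch below is an illustration only and is not OpenAI's methodology; the function name `longest_shared_run` and the sample patches are hypothetical.

```python
from difflib import SequenceMatcher

def longest_shared_run(generated: str, reference: str) -> float:
    """Fraction of the reference covered by its longest verbatim run in generated."""
    if not reference:
        return 0.0
    # autojunk=False keeps difflib from discarding frequent characters.
    matcher = SequenceMatcher(None, generated, reference, autojunk=False)
    match = matcher.find_longest_match(0, len(generated), 0, len(reference))
    return match.size / len(reference)

# Hypothetical reference patch and two hypothetical model outputs.
reference_patch = "-    return x + 1\n+    return x + 2\n"
memorized = "Here is the fix:\n" + reference_patch
independent = "+    return compute(x)\n"

print(longest_shared_run(memorized, reference_patch))    # whole patch reproduced
print(longest_shared_run(independent, reference_patch))  # only boilerplate shared
```

A value near 1.0 means the output contains the reference patch almost character for character. Short patches can match by coincidence, so any cutoff would need calibrating against outputs from problems the model cannot have seen.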
Next Step: SWE-bench Pro
OpenAI recommends moving to SWE-bench Pro, which shows significantly lower contamination and uses a standardized evaluation harness that prevents agent-scaffolding tricks from inflating scores. The trade-off is immediate and obvious: expect scores to drop by 20 to 30 percentage points as the industry transitions. That is not a regression in capability; it is a correction in measurement.
In the longer term, OpenAI is calling on the broader research community to invest in privately authored, uncontaminated benchmarks: evaluations whose problems have never been published, which makes training-data exposure structurally impossible.
The Uncomfortable Truth About Benchmark Culture
This episode exposes a systemic problem in how the AI industry tracks progress. A benchmark becomes the standard. The standard attracts optimization pressure. Optimization pressure leads to overfitting. Once a benchmark is published at scale, its signal begins to degrade.
OpenAI's decision to publicly retire SWE-bench Verified, a benchmark it helped create and one that reflected well on its own models, is a meaningful act of intellectual honesty. The harder question is whether the rest of the industry will follow, or whether labs will keep citing Verified scores when they are favorable and quietly move on when they are not.
For developers and technical decision-makers evaluating AI coding agents: treat any SWE-bench Verified score reported after mid-2025 with skepticism. The number may be real. The benchmark it was measured on is not.