As of August 4th, 2024, the BIRD team will stop using the Valid Efficiency Score (VES) as the efficiency metric for submission evaluation. VES places no upper bound on the time ratio, which can produce misleading evaluations, especially when most predicted SQL queries are inherently slower than the ground truth but a few are extremely fast. You can review the VES results for previously submitted models below.
Date | Model | Code | Size | Oracle Knowledge | Dev | Test |
---|---|---|---|---|---|---|
– | Human Performance | Data Engineers + DB Students | – | ✔️ | – | 90.27 |
May 14, 2024 | ExSL + granite-20b-code | IBM Research AI | 20B | ✔️ | 75.75 | 80.40 |
Jul 22, 2024 | Distillery + GPT-4o | Distyl AI Research | UNK | ✔️ | 72.94 | 77.74 |
Jul 14, 2024 | RECAP + Gemini | Google Cloud | UNK | ✔️ | – | 76.11 |
Jul 2, 2024 | ByteBrain | ByteDance Infra Lab | 33B | ✔️ | 65.80 | 73.24 |
May 24, 2024 | ExSL + granite-20b-code | IBM Research AI | 20B | | 66.34 | 72.78 |
May 21, 2024 | CHESS | link Talaei et al.’24 | UNK | ✔️ | 65.43 | 72.63 |
Jan 14, 2024 | MCS-SQL + GPT-4 | Dunamu | UNK | ✔️ | 64.82 | 71.35 |
Apr 10, 2024 | GRA-SQL | Tencent CDP-youpu | UNK | ✔️ | 67.55 | 69.56 |
Feb 27, 2024 | PB-SQL | Seoul National University | UNK | ✔️ | 71.31 | 68.90 |
Jul 5, 2024 | Insights AI | Uber Freight | UNK | ✔️ | – | 68.82 |
Apr 08, 2024 | OpenSearch-SQL,v1 + GPT-4 | Alibaba Cloud | UNK | ✔️ | 68.38 | 68.80 |
Nov 21, 2023 | MAC-SQL + GPT-4 | Wang et al. ’23 BUAA & Tencent | UNK | ✔️ | 58.76 | 67.68 |
Jun 1, 2024 | SuperSQL | link Li et al. ’24 | UNK | ✔️ | 61.99 | 67.66 |
Jun 7, 2024 | SFT CodeS-15B + SQLFixAgent | Soochow University | UNK | ✔️ | – | 67.24 |
Feb 27, 2024 | DTS-SQL + DeepSeek 7B | link Pourreza et al. ’24 | 7B | ✔️ | 60.31 | 64.52 |
Oct 12, 2023 | SFT CodeS-15B | link Li et al. SIGMOD’24 | 15B | ✔️ | 59.87 | 64.22 |
Mar 27, 2024 | {Chat2Query} (GPT-4 + data entity modeling) (PingCAP) | link PingCAP | UNK | ✔️ | – | 63.89 |
Oct 12, 2023 | SFT CodeS-7B | link Li et al. SIGMOD’24 | 7B | ✔️ | 58.80 | 63.62 |
Nov 16, 2023 | Dubo-SQL, v1 | Mercator Technologies | UNK | ✔️ | 66.01 | 63.00 |
Nov 9, 2023 | DAIL-SQL + GPT-4 | link Gao and Wang et al. VLDB’24 | UNK | ✔️ | 56.08 | 61.95 |
Jul 1, 2023 | GPT-4 | link Baseline | UNK | ✔️ | 49.77 | 60.77 |
Aug 15, 2023 | DIN-SQL + GPT-4 | link Pourreza et al. ’23 | UNK | ✔️ | 58.79 | 59.44 |
Mar 17, 2023 | ChatGPT + CoT | link Li et al. NeurIPS’23 | UNK | ✔️ | 42.30 | 56.56 |
Mar 17, 2023 | ChatGPT | Baseline | UNK | ✔️ | 43.81 | 51.40 |
Mar 17, 2023 | ChatGPT + CoT | link Li et al. NeurIPS’23 | UNK | | 32.33 | 49.69 |
Nov 23, 2023 | OPEN-SQL | Anonymous | 7B | ✔️ | 41.56 | 48.08 |
Feb 17, 2023 | Codex | Baseline | 175B | ✔️ | 43.41 | 41.60 |
Mar 17, 2023 | ChatGPT | Baseline | UNK | | 27.97 | 36.68 |
Feb 17, 2023 | Codex | Baseline | 175B | | 33.37 | 35.40 |
Feb 5, 2023 | T5-3B | Baseline | 3B | ✔️ | 25.57 | 27.80 |
Feb 3, 2023 | T5-Large | Baseline | 770M | ✔️ | 22.74 | 25.00 |
Feb 5, 2023 | T5-3B | Baseline | 3B | | 13.62 | 15.17 |
Feb 3, 2023 | T5-Base | Baseline | 220M | ✔️ | 12.90 | 14.70 |
Feb 3, 2023 | T5-Large | Baseline | 770M | | 9.90 | 12.25 |
Feb 3, 2023 | T5-Base | Baseline | 220M | | 7.78 | 8.97 |
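
To make the unbounded time ratio mentioned in the notice above concrete, here is a minimal sketch. It assumes VES averages, over all examples, the square root of ground-truth execution time divided by predicted execution time for each correct prediction (and 0 for each incorrect one), scaled to a percentage. That form is an assumption based on the metric's published description, not the official evaluation script. The function name `ves` and all timing values are hypothetical and chosen only for illustration.

```python
import math


def ves(results):
    """Sketch of a VES-style score (assumed form, not the official script).

    results: list of (gt_time, pred_time, is_correct) tuples, where the
    times are execution times in the same unit and is_correct says whether
    the predicted SQL returned the same result as the ground truth.
    """
    total = 0.0
    for gt_time, pred_time, is_correct in results:
        if is_correct:
            # Relative efficiency: above 1 when the prediction is faster,
            # with no upper limit on how large the ratio can get.
            total += math.sqrt(gt_time / pred_time)
    return 100.0 * total / len(results)


# Hypothetical timings: every prediction is correct but twice as slow.
all_slow = [(1.0, 2.0, True)] * 100

# Same set, except one prediction happens to be 10,000x faster.
one_outlier = [(1.0, 2.0, True)] * 99 + [(1.0, 0.0001, True)]

print(f"all slow:    {ves(all_slow):.2f}")
print(f"one outlier: {ves(one_outlier):.2f}")
```

Under these assumed numbers, the first set scores below 100, while the second scores above 100 even though 99 of its 100 predictions are slower than the ground truth. A single extreme ratio dominates the average, which is the kind of misleading result the notice describes.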