K-MetBench Leaderboard

Public leaderboard for the K-MetBench camera-ready release on expert meteorological reasoning, geo-cultural alignment, and multimodal weather understanding.


K-MetBench evaluates more than 50 language models and vision-language models on 1,774 questions drawn from the Korean National Meteorological Engineer Examination. All models are evaluated under a zero-shot protocol.

The benchmark supports fine-grained analysis through three diagnostic subsets: 82 multimodal questions, 141 reasoning questions with expert-verified rationales, and 73 Korean-specific questions. It also spans five official subject areas: Weather Analysis and Forecast Theory (P1), Meteorological Observation Methods (P2), Atmospheric Dynamics (P3), Climatology (P4), and Atmospheric Physics (P5). Together, these subsets help diagnose gaps in modality understanding, expert reasoning, geo-cultural knowledge, and topic-specific performance in weather-domain evaluation.
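Subset and subject-area scores of the kind reported in the leaderboard can be obtained by grouping graded answers by their labels. The sketch below is a minimal, hypothetical aggregator; the field names (`subject`, `tags`, `is_correct`) are assumptions for illustration, not the benchmark's actual data schema.

```python
# Hypothetical per-subset accuracy aggregation for K-MetBench-style records.
# Field names ("subject", "tags", "is_correct") are assumed, not official.
from collections import defaultdict

def subset_accuracy(records):
    """Return {subset_label: accuracy} over graded question records.

    Each record is counted once under its subject area (P1-P5) and once
    under every diagnostic tag it carries (e.g. multimodal, korean).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        for label in (rec["subject"], *rec.get("tags", [])):
            totals[label] += 1
            correct[label] += int(rec["is_correct"])
    return {label: correct[label] / totals[label] for label in totals}

# Made-up example records:
records = [
    {"subject": "P1", "tags": ["multimodal"], "is_correct": True},
    {"subject": "P1", "tags": [], "is_correct": False},
    {"subject": "P3", "tags": ["korean"], "is_correct": True},
]
acc = subset_accuracy(records)
# acc["P1"] == 0.5, acc["P3"] == 1.0, acc["multimodal"] == 1.0
```

A question may appear in several columns at once (e.g. a multimodal P1 question contributes to both Multi and P1), which is why labels are tallied independently rather than partitioned.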


Legend: icons in the table mark proprietary, Korean, vision-language, and reasoning models. Size = parameter count in billions (B). Acc = accuracy. Reasoning = reasoning score (4-20). Geo = geo-cultural questions. Text = text-only questions. Multi = multimodal questions. P1 = Weather Analysis & Forecast Theory. P2 = Meteorological Observation Methods. P3 = Atmospheric Dynamics. P4 = Climatology. P5 = Atmospheric Physics.

| # | Model | Size | Type | Acc | Reasoning | Geo | Text | Multi | P1 | P2 | P3 | P4 | P5 |
|---|-------|------|------|-----|-----------|-----|------|-------|----|----|----|----|----|