We have 8 categories of tasks, all multiple-choice questions (MCQs). FigQA, TableQA, and ProtocolQA are reasoning tasks that don’t require tools. LitQA, SeqQA, dbQA, and suppQA are tool-use benchmarks that call for literature search, database access, and similar tools. Cloning Scenarios are non-trivial “real-world” challenges. 2/