

<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <!-- Meta tags for social media banners, these should be filled in appropriately as they are your "business card" --> <!-- Replace the content tag with appropriate information --> <meta name="description" content="MAmmoTH: a series of open-source large language models tailored for general math problem-solving, trained on MathInstruct, a hybrid chain-of-thought and program-of-thought instruction-tuning dataset."> <meta property="og:title" content="MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning" /> <meta property="og:description" content="Open-source math generalist LLMs trained on MathInstruct, a hybrid chain-of-thought and program-of-thought instruction-tuning dataset." /> <meta property="og:url" content="URL OF THE WEBSITE" /> <!-- Path to banner image, should be in the path listed below. Optimal dimensions are 1200x630 --> <meta property="og:image" content="static/image/your_banner_image.png" /> <meta property="og:image:width" content="1200" /> <meta property="og:image:height" content="630" /> <meta name="twitter:title" content="MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning"> <meta name="twitter:description" content="Open-source math generalist LLMs trained on MathInstruct, a hybrid chain-of-thought and program-of-thought instruction-tuning dataset."> <!-- Path to banner image, should be in the path listed below. Optimal dimensions are 1200x600 --> <meta name="twitter:image" content="static/images/your_twitter_banner_image.png"> <meta name="twitter:card" content="summary_large_image"> <!-- Keywords for your paper to be indexed by --> <meta name="keywords" content="MAmmoTH, MathInstruct, math reasoning, large language models, instruction tuning, chain-of-thought, program-of-thought"> <meta name="viewport" content="width=device-width, initial-scale=1"> <title>MAmmoTH</title> <link rel="icon" type="image/x-icon" href="static/images/mammoth_icon.png"> <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet"> <link rel="stylesheet" href="static/css/bulma.min.css"> <link rel="stylesheet" href="static/css/bulma-carousel.min.css"> <link rel="stylesheet" href="static/css/bulma-slider.min.css"> <link rel="stylesheet" href="static/css/fontawesome.all.min.css"> <link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css"> <link rel="stylesheet" href="static/css/index.css"> <script src="static/js/jquery.min.js"></script> <script src="static/js/main.js"></script> <script defer src="static/js/fontawesome.all.min.js"></script> <script src="static/js/bulma-carousel.min.js"></script> <script src="static/js/bulma-slider.min.js"></script> <script src="static/js/index.js"></script> <link rel="stylesheet" type="text/css" href="static/css/jquery.dataTables.css"> <script type="text/javascript" charset="utf8" src="static/js/jquery-3.5.1.js"></script> <script type="text/javascript" charset="utf8" src="static/js/jquery.dataTables.js"></script> </head> <body> <section class="hero"> <div class="hero-body"> <div class="container is-max-desktop"> <div class="columns is-centered"> <div class="column has-text-centered"> <h1 class="title is-1 publication-title">🦣 MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning</h1> <div class="is-size-5 publication-authors"> <!-- Paper authors --> <span class="author-block"> <sup>♣</sup><a href="https://xiangyue9607.github.io/" target="_blank">Xiang Yue</a><sup>*</sup>,</span> <span class="author-block"> <sup>†</sup><a href="https://scholar.google.com/citations?hl=zh-CN&user=hty-MWIAAAAJ" target="_blank">Xingwei Qu</a>,</span> <span class="author-block"> <sup>†</sup><a href="https://scholar.google.com/citations?user=qyTrq4kAAAAJ&hl=en" target="_blank">Ge Zhang</a>, </span> <span class="author-block"> <sup>○</sup><a href="https://franxyao.github.io/" target="_blank">Yao Fu</a>, </span> <span class="author-block"> <sup>§</sup><a href="https://scholar.google.com/citations?hl=en&user=OdE3MsQAAAAJ&view_op=list_works&sortby=pubdate"
target="_blank">Wenhao Huang</a>, </span> <span class="author-block"> <sup>♣</sup><a href="http://web.cse.ohio-state.edu/~sun.397/" target="_blank">Huan Sun</a>, </span> <span class="author-block"> <sup>♣</sup><a href="https://ysu1989.github.io/" target="_blank">Yu Su</a>, </span> <span class="author-block"> <sup>†</sup><a href="https://wenhuchen.github.io/" target="_blank">Wenhu Chen</a><sup>*</sup> </span> </div> <div class="is-size-5 publication-authors"> <span class="author-block"> <sup>†</sup>University of Waterloo, <sup>♣</sup>The Ohio State University, <sup>‡</sup>HKUST, <sup>○</sup>University of Edinburg, <sup>§</sup>01.AI <span class="eql-cntrb"><small><br><sup>*</sup>Xiang Yue and Wenhu Chen are the leading authors of the project. They contributed equally to this project.</small></span> <span class="author-block"><a href="mailto:yue.149@osu.edu">yue.149@osu.edu</a> , <a href="mailto:wenhuchen@uwaterloo.ca">wenhuchen@uwaterloo.ca</a> </span> </div> <div class="column has-text-centered"> <div class="publication-links"> <span class="link-block"> <a href="https://huggingface.co/datasets/TIGER-Lab/MathInstruct" target="_blank" class="external-link button is-normal is-rounded is-dark"> <span class="icon"> 🤗 </span> <span>Dataset</span> </a> </span> <!-- Supplementary PDF link --> <span class="link-block"> <a href="https://huggingface.co/TIGER-Lab/MAmmoTH-Coder-7B" target="_blank" class="external-link button is-normal is-rounded is-dark"> <span class="icon"> 🤗 </span> <span>Models</span> </a> </span> <!-- Github link --> <span class="link-block"> <a href="https://github.com/TIGER-AI-Lab/MAmmoTH" target="_blank" class="external-link button is-normal is-rounded is-dark"> <span class="icon"> <i class="fab fa-github"></i> </span> <span>Code</span> </a> </span> <!-- ArXiv abstract Link --> <span class="link-block"> <a href="https://arxiv.org/pdf/2309.05653.pdf" target="_blank" class="external-link button is-normal is-rounded is-dark"> <span class="icon"> <i class="ai ai-arxiv"></i> </span> <span>arXiv</span> </a> </span> </div> </div> </div> </div> </div> </div> </section> <!-- Paper abstract --> <section class="hero"> <div class="hero-body"> <div class="container is-max-desktop"> <div class="columns is-centered"> <div class="column has-text-centered"> <h2 class="title is-3">Abstract</h2> <div class="content has-text-justified"> <p> We introduce 🦣MAmmoTH, a series of open-source large language models (LLMs) specifically tailored for general math problem-solving. The MAmmoTH models are trained on MathInstruct, our meticulously curated instruction tuning dataset. MathInstruct is compiled from 13 math datasets with intermediate rationales, six of which have rationales newly curated by us. It boasts the hybrid of chain-of- thought (CoT) and program-of-thought (PoT) rationales, and also ensures exten- sive coverage of diverse fields in math. The hybrid of CoT and PoT can not only unleash the potential of tool use but also allow different thought processes for dif- ferent math problems. As a result, the MAmmoTH series substantially outperform existing open-source models on nine mathematical reasoning datasets across all scales with an average accuracy gain ranging from 12% to 29%. Remarkably, our MAmmoTH-7B model reaches 35% on MATH (a competition-level dataset), which exceeds the best open-source 7B model (WizardMath) by 25%, and the MAmmoTH-34B model achieves 46% accuracy on MATH, even surpassing GPT- 4’s CoT result. 
Our work underscores the importance of diverse problem coverage and the use of hybrid rationales in developing superior math generalist models. </p> </div> </div> </div> </div> </div> </section> <!-- End paper abstract --> <!-- Image carousel --> <section class="hero"> <div class="hero-body"> <div class="container is-max-desktop"> <div class="columns is-centered"> <div class="column is-four-fifths"> <div class="item"> <!-- Your image here --> <img src="static/images/mammoth_github.png" alt="The hybrid instruction tuning of 🦣MAmmoTH" /> <h2 class="subtitle"> Figure 1: The hybrid instruction tuning of 🦣MAmmoTH, a series of models trained to solve a diverse set of mathematical problems using CoTs and PoTs. </h2> </div> </div> </div> </div> </div> </section> <!-- End image carousel --> <section class="hero"> <div class="hero-body"> <div class="container is-max-desktop"> <div class="columns is-centered"> <div class="column has-text-centered is-fifths-fifths"> <h2 class="title is-3">Our Dataset: MathInstruct</h2> <div class="content has-text-justified"> <table id="myTable"> <thead> <tr> <th>Training Dataset</th> <th>Type</th> <th>Annotation</th> <th>Samples</th> <th>Characteristics</th> <th>Fields</th> </tr> </thead> <tbody> <tr> <td>GSM8K</td> <td>CoT</td> <td>Human</td> <td>7K</td> <td>Grade School Exam</td> <td>Pre-Algebra</td> </tr> <tr> <td>GSM8K-RFT</td> <td>CoT</td> <td>Llama</td> <td>28K</td> <td>Llama + Validated</td> <td>Pre-Algebra</td> </tr> <tr> <td>AQuA-RAT</td> <td>CoT</td> <td>Human</td> <td>90K</td> <td>GRE/GMAT Exam</td> <td>Inter-Algebra</td> </tr> <tr> <td>MATH</td> <td>CoT</td> <td>Human</td> <td>7K</td> <td>Math Competition</td> <td>Pre-Algebra, Inter-Algebra, Algebra, Probability, NumTheory, Calculus, Geometry</td> </tr> <tr> <td>TheoremQA</td> <td>CoT</td> <td>GPT-4</td> <td>600</td> <td>GPT-4 + Validated</td> <td>Algebra, Probability, NumTheory, Calculus, Geometry</td> </tr> <tr> <td>Camel-Math</td> <td>CoT</td> <td>GPT-4</td> <td>50K</td> <td>GPT-4 (Unvalidated)</td> <td>Algebra, Probability, NumTheory, Calculus, Geometry</td> </tr> <tr> <td>College-Math</td> <td>CoT</td> <td>GPT-4</td> <td>1.8K</td> <td>GPT-4 (Unvalidated)</td> <td>Algebra</td> </tr> <tr> <td>GSM8K</td> <td>PoT</td> <td>GPT-4</td> <td>14K</td> <td>GPT-4 + Validated</td> <td>Pre-Algebra</td> </tr> <tr> <td>AQuA-RAT</td> <td>PoT</td> <td>GPT-4</td> <td>9.7K</td> <td>GPT-4 + Validated</td> <td>Inter-Algebra</td> </tr> <tr> <td>MATH</td> <td>PoT</td> <td>GPT-4</td> <td>7K</td> <td>GPT-4 + Validated</td> <td>Pre-Algebra, Inter-Algebra, Algebra, Probability</td> </tr> <tr> <td>TheoremQA</td> <td>PoT</td> <td>GPT-4</td> <td>700</td> <td>GPT-4 + Validated</td> <td>Algebra, Probability, NumTheory, Calculus, Geometry</td> </tr> <tr> <td>MathQA</td> <td>PoT</td> <td>Human</td> <td>25K</td> <td>AQuA-RAT Subset</td> <td>Inter-Algebra</td> </tr> <tr> <td>NumG</td> <td>PoT</td> <td>Human</td> <td>13K</td> <td>Lila Annotated</td> <td>Pre-Algebra</td> </tr> </tbody> </table> </div> </div> </div> </div> </div> </section>
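<!-- Dataset usage example -->
<section class="section">
  <div class="container is-max-desktop content">
    <h2 class="title is-4">Loading MathInstruct</h2>
    <p>
      The MathInstruct mixture summarized above is hosted on Hugging Face (see the Dataset link).
      The snippet below is a minimal, illustrative sketch of loading it with the <code>datasets</code>
      library and separating CoT from PoT samples. The column names (<code>source</code>,
      <code>instruction</code>, <code>output</code>) and the convention that PoT subsets carry
      "pot" in their source tag are assumptions; inspect the dataset features first if they differ.
    </p>
    <pre><code># Minimal sketch: load MathInstruct and split CoT from PoT samples.
# Assumes each record has "source", "instruction", and "output" columns
# (check ds.features if the schema differs).
from datasets import load_dataset

ds = load_dataset("TIGER-Lab/MathInstruct", split="train")
print(ds)  # prints the actual columns and row count

# Heuristic split: assume PoT subsets are tagged with "pot" in their source name.
pot = ds.filter(lambda ex: "pot" in ex["source"].lower())
cot = ds.filter(lambda ex: "pot" not in ex["source"].lower())
print(f"PoT samples: {len(pot)}, CoT samples: {len(cot)}")

# Peek at one question/rationale pair from each style.
print(cot[0]["instruction"])
print(pot[0]["output"])</code></pre>
  </div>
</section>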
<section class="hero is-small"> <div class="hero-body"> <div class="container is-max-desktop"> <div class="columns is-centered has-text-centered"> <div class="column is-four-fifths"> <h2 class="title is-3">Overall Results</h2> <div class="item"> <!-- Your image here --> <img src="static/images/overall_results.png" alt="Overall results of 🦣MAmmoTH on the in-domain and out-of-domain datasets" /> <h2 class="subtitle has-text-centered"> Figure 2: Overall results of 🦣MAmmoTH on the in-domain and out-of-domain datasets. </h2> <p> Overall, MAmmoTH and MAmmoTH-Coder outperform the SoTA models at every scale. In general, the performance gain on out-of-domain (OOD) datasets is larger than on in-domain (IND) datasets. These results demonstrate the potential of our models as mathematical generalists. On several datasets, MAmmoTH-Coder-34B and MAmmoTH-70B even surpass closed-source LLMs (see the detailed breakdown below). </p> </div> </div> </div> </div> </div> </section> <section class="hero is-small"> <div class="hero-body"> <div class="container is-max-desktop"> <div class="columns is-centered has-text-centered"> <div class="column is-four-fifths"> <h2 class="title is-3">Where does the gain come from?</h2> <div class="item"> <!-- Your image here --> <img src="static/images/ablation_results.png" alt="Ablation of CoT &amp; PoT hybrid training on the 7B Llama-2 model" /> <h2 class="subtitle"> Figure 3: Investigation of the influence of CoT &amp; PoT hybrid training on the 7B Llama-2 model. Key insights include: 1) the SoTA model, which uses dataset-specific CoT fine-tuning on GSM and MATH, performs strongly within its domains but struggles in OOD scenarios; 2) the diverse data sources in MathInstruct enable a better math generalist model; 3) fine-tuning on the PoT subsets generally outperforms fine-tuning on the CoT subsets; 4) hybrid training yields the best-performing model. </h2> <p> To better understand which factors contribute to the large gain of 🦣MAmmoTH over existing baselines, we set up a group of controlled experiments in Figure 3. We study the following setups: <ol> <li>🦣<b>MAmmoTH (MathInstruct - CoT):</b> This experiment aims to understand how much our curated CoT data improves generalization over the SoTA model WizardMath, which is trained specifically on GSM + MATH. While sacrificing 3% accuracy on GSM + MATH, our CoT-subset fine-tuning improves the overall nine-dataset accuracy from 27% to 32%. </li> <li>🦣<b>MAmmoTH (MathInstruct - PoT):</b> This experiment aims to understand the advantage of our PoT subset. As can be observed, fine-tuning on our PoT subset significantly improves the overall accuracy from 27% to 37.5%. This ablation reflects the importance of unlocking the program generation capabilities of our model.</li> <li>🦣<b>MAmmoTH (MathInstruct - Hybrid):</b> We further combine CoT and PoT as hybrid training data to achieve the best overall performance of 45.4%. This combined gain comes from two aspects: <ul style="list-style-type: disc;"> <li> The CoT subset helps maintain generic language-based reasoning skills for scenarios that PoT cannot handle well, e.g., the multiple-choice questions in AQuA, SAT, and MMLU. </li> <li> The PoT subset teaches the model how to utilize Python APIs to solve complex math problems with high precision, e.g., the MATH problems requiring complex computation (see the execution sketch after this section). </li> </ul> </li> </ol> <!-- We put some case studies in Appendix \ref{sec:case_study} to demonstrate the respective advantages of PoT and CoT in solving different types of math problems. To summarize, we attribute our substantial gain to: 1) diverse data sources covering different math fields and complexity levels and 2) a hybrid of CoT \& PoT instruction tuning strategy. --> </p> </div> </div> </div> </div> </div> </section>
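<!-- PoT execution example -->
<section class="section">
  <div class="container is-max-desktop content">
    <h2 class="title is-4">Executing a program-of-thought rationale</h2>
    <p>
      As discussed above, the PoT subset teaches the model to write Python programs for precise
      computation, while the CoT subset covers cases a program handles poorly. The sketch below
      illustrates, under stated assumptions, how a generated PoT program could be executed and how
      one might fall back to a CoT answer when the program fails. The variable name <code>ans</code>,
      the regex-based answer extraction, and the fallback logic are illustrative assumptions rather
      than the exact pipeline used in our evaluation.
    </p>
    <pre><code># Illustrative sketch of hybrid PoT/CoT answer extraction.
# Assumptions: the generated program stores its result in a variable named `ans`,
# and a CoT fallback answer is taken as the last number in the rationale text.
import re

def run_pot(program: str):
    """Execute a model-generated Python program and return its `ans` value."""
    scope = {}
    try:
        exec(program, scope)  # caution: run untrusted model output in a sandbox
        return scope.get("ans")
    except Exception:
        return None  # program crashed; the caller falls back to the CoT rationale

def extract_cot_answer(text: str):
    """Take the last number mentioned in a chain-of-thought rationale."""
    numbers = re.findall(r"-?\d+\.?\d*", text)
    return float(numbers[-1]) if numbers else None

pot_output = "from sympy import sqrt\nans = sqrt(2) ** 4 + 3"  # requires sympy
cot_output = "sqrt(2)^4 equals 4, and 4 + 3 = 7. The answer is 7."

answer = run_pot(pot_output)
if answer is None:  # hybrid behaviour: fall back to the CoT rationale
    answer = extract_cot_answer(cot_output)
print(answer)  # 7</code></pre>
  </div>
</section>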
<section class="hero"> <div class="hero-body"> <div class="container is-max-desktop"> <div class="columns is-centered"> <div class="column has-text-centered is-fifths-fifths"> <h2 class="title is-3">Comprehensive Results: Breakdown</h2> <div class="content has-text-justified"> <div class="buttonGroup"> <button value="ALL">All</button> <button value="CS">Closed-source</button> <button value="7B">7B Parameter</button> <button value="13_15B">13-15B Parameter</button> <button value="30_34B">30-34B Parameter</button> <button value="65_70B">65-70B Parameter</button> </div> <table> <thead> <tr> <th>Model</th> <th>Base</th> <th>Training Data</th> <th>GSM8K</th> <th>MATH</th> <th>AQuA</th> <th>NumG</th> <th>IND Avg</th> <th>SVAMP</th> <th>Mathematics</th> <th>SimulEq</th> <th>SAT</th> <th>MMLU</th> <th>OOD Avg</th> </tr> <tr> </tr> </thead> <tbody id="tabResults"> <tr class="th"> <td id="CS" colspan="14" style="text-align: center; font-weight: bold;">Closed-source Model</td> </tr> <tr> </tr> <tr> <td>GPT-4</td> <td>-</td> <td>Unknown</td> <td>92</td> <td>42.5</td> <td>72.6</td> <td>-</td> <td>-</td> <td>97</td> <td>-</td> <td>-</td> <td>95</td> <td>-</td> <td>-</td> </tr> <tr> </tr> <tr> <td>Code-Interpreter</td> <td>-</td> <td>Unknown</td> <td>97</td> <td>69.7</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>PaLM-2</td> <td>-</td> <td>Unknown</td> <td>80.7</td> <td>34.3</td> <td>64.1</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Claude-2</td> <td>-</td> <td>Unknown</td> <td>85.2</td> <td>32.5</td> <td>60.9</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Codex (PoT)</td> <td>-</td> <td>No</td> <td>71.6</td> <td>36.8</td> <td>54.1</td> <td>-</td> <td>-</td> <td>85.2</td> <td>-</td> <td>-</td> <td>68</td> <td>-</td> <td>-</td> </tr> <tr> </tr> <tr class="th"> <td id="7B" colspan="14" style="text-align: center; font-weight: bold;">7B Parameter Model</td> </tr> <tr> </tr> <tr> <td>Llama-1</td> <td>-</td> <td>No</td> <td>10.7</td> <td>2.9</td> <td>22.6</td> <td>24.7</td> <td>15.5</td> <td>24.5</td> <td>6.2</td> <td>4.6</td> <td>22.7</td> <td>30.6</td> <td>17.7</td> </tr> <tr> </tr> <tr> <td>Llama-2</td> <td>-</td> <td>No</td> <td>14.6</td> <td>2.5</td> <td>30.3</td> <td>29.9</td> <td>19.3</td> <td>34.5</td> <td>6</td> <td>5</td> <td>26.8</td> <td>29.8</td> <td>20.4</td> </tr> <tr> </tr> <tr> <td>Galactica-6.7B</td> <td>GAL</td> <td>GAL-Instruct</td> <td>10.2</td> <td>2.2</td> <td>25.6</td> <td>25.8</td> <td>15.9</td> <td>25.6</td> <td>4.6</td> <td>4.2</td> <td>17.5</td> <td>28</td> <td>16</td> </tr> <tr> </tr> <tr> <td>Code-Llama (PoT)</td> <td>-</td> <td>No</td> <td>25.2</td> <td>14.2</td> <td>24</td> <td>26.8</td> <td>22.3</td> <td>49.4</td> <td>21.7</td> <td>3.5</td> <td>28.6</td> <td>26.9</td> <td>26</td> </tr> <tr> </tr> <tr> <td>AQuA-SFT</td> <td>Llama-2</td> <td>AQuA</td> <td>11.2</td> <td>3.6</td> <td>35.6</td> <td>12.2</td> <td>15.6</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Llama-1 RFT</td> <td>Llama-1</td> <td>GSM8K</td> <td>46.5</td> <td>5.2</td> <td>18.8</td> <td>21.1</td> <td>22.9</td> <td>21.1</td> <td>5.1</td> <td>11</td> <td>12.5</td> <td>21.7</td> <td>14.3</td> </tr> <tr> </tr> <tr> <td>WizardMath</td> <td>Llama-2</td> <td>GSM8K+MATH</td> <td>54.9</td> <td>10.7</td> <td>26.3</td> <td>36.1</td> <td>32</td> <td>36.1</td> <td>9.3</td> <td>12.8</td> <td>25.4</td>
<td>31.1</td> <td>28.6 </td> </tr> <tr> </tr> <tr> <td>MAmmoTH</td> <td>Llama-2</td> <td>MathInstruct</td> <td>53.6</td> <td>31.5</td> <td>44.5</td> <td>61.2</td> <td>61.2 </td> <td>67.7</td> <td>46.3</td> <td>41.2</td> <td>42.7</td> <td>42.6</td> <td>48.1 </td> </tr> <tr> </tr> <tr> <td>MAmmoTHc</td> <td>Code-Llama</td> <td>MathInstruct</td> <td>59.4</td> <td>33.4</td> <td>47.2</td> <td>66.4</td> <td>51.6 </td> <td>71.4</td> <td>55.4</td> <td>45.9</td> <td>40.5</td> <td>48.3</td> <td>52.3 </td> </tr> <tr> </tr> <tr> <td>$\Delta$</td> <td></td> <td></td> <td>+5</td> <td>+21</td> <td>+12</td> <td>+30</td> <td>+20 </td> <td>+22</td> <td>+34</td> <td>+33</td> <td>+14</td> <td>+17</td> <td>+24 </td> </tr> <tr> </tr> <tr class="th"> <td id="13_15B" colspan="14" style="text-align: center; font-weight: bold;">13-15B Parameter Model</td> </tr> <tr> </tr> <tr> <td>Llama-1</td> <td>-</td> <td>No</td> <td>17.8</td> <td>3.9</td> <td>26</td> <td>24.8</td> <td>18.1 </td> <td>34.7</td> <td>6.9</td> <td>5.4</td> <td>27.7</td> <td>30.7</td> <td>21 </td> </tr> <tr> </tr> <tr> <td>Llama-2</td> <td>-</td> <td>No</td> <td>28.7</td> <td>3.9</td> <td>25.1</td> <td>8.8</td> <td>16.6 </td> <td>35.1</td> <td>11.5</td> <td>5.8</td> <td>32.7</td> <td>34.4</td> <td>23.9 </td> </tr> <tr> </tr> <tr> <td>Code-Llama (PoT)</td> <td>-</td> <td>No</td> <td>36.1</td> <td>18.1</td> <td>28.7</td> <td>29.2</td> <td>28 </td> <td>60</td> <td>21.3</td> <td>3.8</td> <td>25.9</td> <td>27.7</td> <td>27.7 </td> </tr> <tr> </tr> <tr> <td>CodeT5+ (PoT)</td> <td>-</td> <td>No</td> <td>12.5</td> <td>2.4</td> <td>20.5</td> <td>19.4</td> <td>13.7 </td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>CodeGen+ (PoT)</td> <td>-</td> <td>No</td> <td>12.7</td> <td>3.4</td> <td>24.5</td> <td>22.5</td> <td>15.7 </td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> <td>-</td> </tr> <tr> <td>Vicuna-1.5</td> <td>Llama-2</td> <td>No</td> <td>28.4</td> <td>5.8</td> <td>24.8</td> <td>36.9</td> <td>23.9 </td> <td>55.7</td> <td>10</td> <td>6.6</td> <td>34</td> <td>34.1</td> <td>28.1 </td> </tr> <tr> </tr> <tr> <td>Llama-1 RFT</td> <td>Llama-1</td> <td>GSM8K</td> <td>52.1</td> <td>5.1</td> <td>16.1</td> <td>24.5</td> <td>24.4 </td> <td>46.5</td> <td>6.7</td> <td>10.1</td> <td>13.2</td> <td>21.6</td> <td>19.6 </td> </tr> <tr> </tr> <tr> <td>Orca-Platypus</td> <td>Llama-2</td> <td>Platypus</td> <td>38.4</td> <td>3</td> <td>18.9</td> <td>35.3</td> <td>23.9 </td> <td>56.8</td> <td>12.6</td> <td>7.9</td> <td>29.5</td> <td>41.6</td> <td>29.7 </td> </tr> <tr> </tr> <tr> <td>Platypus</td> <td>Llama-2</td> <td>Platypus</td> <td>25.7</td> <td>2.5</td> <td>33.4</td> <td>42.3</td> <td>25.9 </td> <td>55.4</td> <td>11.4</td> <td>7.4</td> <td>36.8</td> <td>35.5</td> <td>29.3 </td> </tr> <tr> </tr> <tr> <td>WizardMath</td> <td>Llama-2</td> <td>GSM8K+MATH</td> <td>63.9</td> <td>14</td> <td>21.2</td> <td>40.8</td> <td>34.9 </td> <td>51.9</td> <td>14.1</td> <td>14.9</td> <td>24.5</td> <td>32.1</td> <td>27.5 </td> </tr> <tr> </tr> <tr> <td>MAmmoTH</td> <td>Llama-2</td> <td>MathInstruct</td> <td>62.0</td> <td>34.2</td> <td>51.6</td> <td>68.7</td> <td>54.1 </td> <td>72.4</td> <td>49.2</td> <td>43.2</td> <td>46.8</td> <td>47.6</td> <td>51.8 </td> </tr> <tr> </tr> <tr> <td>MAmmoTHc</td> <td>Code-Llama</td> <td>MathInstruct</td> <td>64.7</td> <td>36.3</td> <td>46.9</td> <td>66.8</td> <td>53.7 </td> <td>73.7</td> <td>61.5</td> <td>47.1</td> <td>48.6</td> <td>48.3</td> <td>55.8 </td> </tr> <tr> </tr> <tr> <td>$\Delta$</td> <td></td> <td></td> 
<td>+1</td> <td>+20</td> <td>+18</td> <td>+26</td> <td>+19 </td> <td>+14</td> <td>+40</td> <td>+33</td> <td>+12</td> <td>+7</td> <td>+26 </td> </tr> <tr> </tr> <tr class="th"> <td id="30_34B" colspan="14" style="text-align: center; font-weight: bold;">30-34B Parameter Model</td> </tr> <tr> </tr> <tr> <td>Llama-1</td> <td>-</td> <td>No</td> <td>35.6</td> <td>7.1</td> <td>33.4</td> <td>28.4</td> <td>26.1 </td> <td>48.8</td> <td>12.8</td> <td>11.2</td> <td>33.4</td> <td>39</td> <td>29 </td> </tr> <tr> </tr> <tr> <td>Code-Llama (PoT)</td> <td>-</td> <td>No</td> <td>44</td> <td>25</td> <td>25.2</td> <td>29.3</td> <td>30.8 </td> <td>69.1</td> <td>34.5</td> <td>6.8</td> <td>26.8</td> <td>21.6</td> <td>31.7 </td> </tr> <tr> </tr> <tr> <td>Llama-1 RFT</td> <td>Llama-1</td> <td>GSM8K</td> <td>56.5</td> <td>7.4</td> <td>18.5</td> <td>24.3</td> <td>26.6 </td> <td>55.4</td> <td>7.6</td> <td>12.8</td> <td>20.4</td> <td>37.9</td> <td>26.8 </td> </tr> <tr> </tr> <tr> <td>Galactica-30B</td> <td>GAL</td> <td>GAL-Instruct</td> <td>41.7</td> <td>12.7</td> <td>28.7</td> <td>34.7</td> <td>29.4 </td> <td>41.6</td> <td>11.8</td> <td>13.2</td> <td>37.7</td> <td>37.9</td> <td>28.4 </td> </tr> <tr> </tr> <tr> <td>Platypus</td> <td>Llama-1</td> <td>Platypus</td> <td>37.8</td> <td>10.1</td> <td>27.9</td> <td>40.5</td> <td>29.1 </td> <td>51.7</td> <td>13.8</td> <td>13.6</td> <td>38.6</td> <td>41</td> <td>31.7 </td> </tr> <tr> </tr> <tr> <td>Tulu</td> <td>Llama-2</td> <td>Tulu</td> <td>51</td> <td>10.8</td> <td>25.5</td> <td>43.4</td> <td>32.6 </td> <td>59</td> <td>10.7</td> <td>10.3</td> <td>31.3</td> <td>39.8</td> <td>30.2 </td> </tr> <tr> </tr> <tr> <td>MAmmoTHc</td> <td>Code-Llama</td> <td>MathInstruct</td> <td>72.7</td> <td>43.6</td> <td>54.7</td> <td>71.6</td> <td>60.7 </td> <td>84.3</td> <td>65.4</td> <td>51.8</td> <td>60.9</td> <td>53.8</td> <td>63.2 </td> </tr> <tr> </tr> <tr> <td>$\Delta$</td> <td></td> <td></td> <td>+16</td> <td>+21</td> <td>+21</td> <td>+28</td> <td>+28 </td> <td>+15</td> <td>+31</td> <td>+38</td> <td>+22</td> <td>+13</td> <td>+32 </td> </tr> <tr> </tr> <tr class="th"> <td id="65_70B" colspan="14" style="text-align: center; font-weight: bold;">65-70B Parameter Model</td> </tr> <tr> </tr> <tr> <td>Llama-1</td> <td>-</td> <td>No</td> <td>50.9</td> <td>10.6</td> <td>35</td> <td>50.2</td> <td>36.6 </td> <td>55.3</td> <td>14.2</td> <td>15.2</td> <td>37.4</td> <td>44.1</td> <td>33.2 </td> </tr> <tr> </tr> <tr> <td>Llama-2</td> <td>-</td> <td>No</td> <td>56.8</td> <td>13.5</td> <td>40.9</td> <td>50.4</td> <td>40.4 </td> <td>63.8</td> <td>20.5</td> <td>14</td> <td>51.3</td> <td>47.1</td> <td>39.3 </td> </tr> <tr> </tr> <tr> <td>Llama-2-Chat</td> <td>Llama-2</td> <td>No</td> <td>54.9</td> <td>18.6</td> <td>37</td> <td>51.6</td> <td>40.5 </td> <td>71.5</td> <td>19.2</td> <td>21.7</td> <td>44.1</td> <td>46.9</td> <td>40.6 </td> </tr> <tr> </tr> <tr> <td>Guanaco</td> <td>Llama-2</td> <td>No</td> <td>59.2</td> <td>4.1</td> <td>45.2</td> <td>53.5</td> <td>40.5 </td> <td>66.8</td> <td>17.8</td> <td>20.2</td> <td>50</td> <td>47.3</td> <td>40.4 </td> </tr> <tr> </tr> <tr> <td>WizardMath</td> <td>Llama-2</td> <td>GSM8K+MATH</td> <td>81.6</td> <td>22.7</td> <td>20</td> <td>48.9</td> <td>43.3 </td> <td>71.8</td> <td>17.1</td> <td>37.9</td> <td>13.2</td> <td>27.4</td> <td>33.4 </td> </tr> <tr> </tr> <tr> <td>Platypus</td> <td>Llama-2</td> <td>Platypus</td> <td>70.6</td> <td>18.6</td> <td>51.2</td> <td>55.4</td> <td>48.9 </td> <td>51.8</td> <td>26.3</td> <td>21.7</td> <td>55.9</td> <td>52.5</td> <td>41.6 </td> 
</tr> <tr> </tr> <tr> <td>MAmmoTH</td> <td>Llama-2</td> <td>MathInstruct</td> <td>76.9</td> <td>41.8</td> <td>65.0</td> <td>74.4</td> <td>64.5</td> <td>82.4</td> <td>55.6</td> <td>51.4</td> <td>66.4</td> <td>56.7</td> <td>62.5</td> </tr> <tr> </tr> <tr> <td>$\Delta$</td> <td></td> <td></td> <td>-5</td> <td>+19</td> <td>+14</td> <td>+19</td> <td>+16</td> <td>+11</td> <td>+29</td> <td>+14</td> <td>+11</td> <td>+4</td> <td>+21</td> </tr> <tr> </tr> </tbody> </table> </div> </div> </div> </div> </div> </section> <!-- BibTex citation --> <section class="section" id="BibTeX"> <div class="container is-max-desktop content"> <h2 class="title">Reference</h2> Please kindly cite our paper if you use our code, data, models or results: <br><br> <pre><code>@article{yue2023mammoth,
  title={MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning},
  author={Xiang Yue and Xingwei Qu and Ge Zhang and Yao Fu and Wenhao Huang and Huan Sun and Yu Su and Wenhu Chen},
  journal={arXiv preprint arXiv:2309.05653},
  year={2023}
}</code></pre> </div> </section> <footer class="footer"> <div class="container"> <div class="columns is-centered"> <div class="column is-8"> <div class="content"> <p> This page was built using the <a href="https://github.com/eliahuhorwitz/Academic-project-page-template" target="_blank">Academic Project Page Template</a>, which was adopted from the <a href="https://nerfies.github.io" target="_blank">Nerfies</a> project page. You are free to borrow the source code of this website; we just ask that you link back to this page in the footer. <br> This website is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/" target="_blank">Creative Commons Attribution-ShareAlike 4.0 International License</a>. </p> </div> </div> </div> </div> </footer> </body> <style> .buttonGroup { text-align: center; } .buttonGroup>button { padding: 15px; color: white; background-color: #363636; border-radius: 5px; } .buttonGroup>button:hover { box-shadow: 0 0 5px rgba(0, 0, 0, 0.3); } </style> </html>
