TravelPlanner: A Benchmark for Real-World Planning with Language Agents

<!DOCTYPE html> <html> <head> <meta charset="utf-8"> <meta name="description" content="A Benchmark for Real-World Planning with Language Agents"> <meta name="keywords" content="TravelPlanner, Travel Bench"> <meta name="viewport" content="width=device-width, initial-scale=1"> <title> TravelPlanner: A Benchmark for Real-World Planning with Language Agents</title>    <link rel="icon" href="static/images/icon.png"> <link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet"> <link rel="stylesheet" href="./static/css/bulma.min.css"> <link rel="stylesheet" href="./static/css/bulma-carousel.min.css"> <link rel="stylesheet" href="./static/css/bulma-slider.min.css"> <link rel="stylesheet" href="./static/css/fontawesome.all.min.css"> <link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/jpswalsh/academicons@1/css/academicons.min.css"> <link rel="stylesheet" href="./static/css/index.css"> <link rel="stylesheet" href="./static/css/leaderboard.css"> <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.5.1/styles/default.min.css"> <script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/11.5.1/highlight.min.js"></script> <script>hljs.initHighlightingOnLoad();</script>  <script type="text/javascript" src="static/js/sort-table.js" defer></script> <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script> <script defer src="./static/js/fontawesome.all.min.js"></script> <script src="./static/js/bulma-carousel.min.js"></script> <script src="./static/js/bulma-slider.min.js"></script> <script src="./static/js/explorer-index.js"></script> <script src="./static/js/question_card.js"></script> <script src="./static/js/leaderboard_testmini.js"></script> <script src="./data/results/output_folders.js" defer></script> <script src="./data/results/model_scores.js" defer></script> <script src="./visualizer/data/data_public.js" defer></script> <style> pre { max-height: 400px; overflow-x: 
auto; overflow-y: auto; background-color: #f0f0f0; padding: 10px; border: 1px solid #ccc; text-align: left; } </style> <script> function loadJSON(file, elementId) { var xhr = new XMLHttpRequest(); xhr.onreadystatechange = function() { if (xhr.readyState === 4 && xhr.status === 200) { document.getElementById(elementId).textContent = xhr.responseText; hljs.highlightElement(document.getElementById(elementId)); } }; xhr.open('GET', file, true); xhr.send(); } loadJSON('./static/example/plan_1.json', 'json1'); loadJSON('./static/example/plan_6.json', 'json2'); loadJSON('./static/example/plan_11.json', 'json3'); loadJSON('./static/example/plan_16.json', 'json4'); loadJSON('./static/example/plan_21.json', 'json5'); loadJSON('./static/example/plan_26.json', 'json6'); loadJSON('./static/example/plan_31.json', 'json7'); loadJSON('./static/example/plan_36.json', 'json8'); loadJSON('./static/example/plan_41.json', 'json9'); </script> </head> <body> <nav class="navbar" role="navigation" aria-label="main navigation"> <div class="navbar-brand"> <a role="button" class="navbar-burger" aria-label="menu" aria-expanded="false"> <span aria-hidden="true"></span> <span aria-hidden="true"></span> <span aria-hidden="true"></span> </a> </div> <div class="navbar-menu"> <div class="navbar-start" style="flex-grow: 1; justify-content: center;">   <div class="navbar-item has-dropdown is-hoverable"> <a class="navbar-link"> More Research </a> <div class="navbar-dropdown"> <a class="navbar-item" href="https://osu-nlp-group.github.io/SeeAct/"> <b>SeeAct🔥🔥🔥</b> <p style="font-size:18px; display: inline; margin-left: 5px;"></p> </a> <a class="navbar-item" href="https://mmmu-benchmark.github.io/"> <b>MMMU🔥🔥🔥</b> <p style="font-size:18px; display: inline; margin-left: 5px;"></p> </a> <a class="navbar-item" href="https://llmbench.ai/agent"> <b>AgentBench</b> <p style="font-size:18px; display: inline; margin-left: 5px;"></p> </a> <a class="navbar-item" href="https://dki-lab.github.io/LLM-Planner/"> 
<b>LLM-Planner</b> <p style="font-size:18px; display: inline; margin-left: 5px;"></p> </a> <a class="navbar-item" href="https://osu-nlp-group.github.io/Mind2Web/"> <b>Mind2Web</b> <p style="font-size:18px; display: inline; margin-left: 5px;"></p> </a> <a class="navbar-item" href="https://osu-nlp-group.github.io/MagicBrush/"> <b>MagicBrush</b> <p style="font-size:18px; display: inline; margin-left: 5px;"></p> </a> <a class="navbar-item" href="https://github.com/dki-lab/Pangu?tab=readme-ov-file"> <b>Pangu</b> <p style="font-size:18px; display: inline; margin-left: 5px;"></p> </a> </div> </div> </div> </div> </nav> <section class="hero"> <div class="hero-body"> <div class="container is-max-desktop"> <div class="columns is-centered"> <div class="column has-text-centered"> <h1 class="title is-1 publication-title is-bold"> <img src="static/images/icon.png" style="width:2em;vertical-align: middle" alt="Logo" /> <span class="mathvista" style="vertical-align: middle">TravelPlanner</span> </h1> <h2 class="subtitle is-3 publication-subtitle"> A Benchmark for Real-World Planning with Language Agents   </h2> <div class="is-size-5 publication-authors"> <span class="author-block"> <a href="https://hsaest.github.io/">Jian Xie*</a><sup style="color:#6fbf73;">1</sup>,</span> <span class="author-block"> <a href="https://drogozhang.github.io/">Kai Zhang*</a><sup style="color:#ed4b82;">2</sup>,</span> <span class="author-block"> <a href="https://jiangjiechen.github.io/">Jiangjie Chen</a><sup style="color:#6fbf73;">1</sup>, </span> <span class="author-block"> <a href="https://darthzhu.github.io/">Tinghui Zhu</a><sup style="color:#6fbf73">1</sup>, </span> <span class="author-block"> <a href="https://renzelou.github.io/">Renze Lou</a><sup style="color:#ffac33">3</sup>, </span> <br> <span class="author-block"> <a href="https://yuandong-tian.com/">Yuandong Tian</a><sup style="color:#7781bc">4</sup>, </span> <span class="author-block"> <a 
href="https://scholar.google.com/citations?user=odFW4FoAAAAJ">Yanghua Xiao</a><sup style="color:#6fbf73;">1</sup>, </span> <span class="author-block"> <a href="https://ysu1989.github.io/">Yu Su</a><sup style="color:#ed4b82;">2</sup> </span> </div> <div class="is-size-5 publication-authors"> <span class="author-block"><sup style="color:#6fbf73;">1</sup>Fudan University</span> <span class="author-block"><sup style="color:#ed4b82">2</sup>The Ohio State University</span><br> <span class="author-block"><sup style="color:#ffac33">3</sup>The Pennsylvania State University</span> <span class="author-block"><sup style="color:#7781bc">4</sup>Meta AI</span><br> <span class="author-block">* Equal Contribution</span><br> <span class="author-block">† Corresponding to <a href="mailto:jianx0321@gmail.com">jianx0321@gmail.com</a>, <a href="mailto:zhang.13253@osu.edu">zhang.13253@osu.edu</a>, <a href="mailto:su.809@osu.edu">su.809@osu.edu</a></span> </div>    <div class="column has-text-centered"> <div class="publication-links">  <span class="link-block">  <a href="https://arxiv.org/pdf/2402.01622.pdf" class="external-link button is-normal is-rounded is-dark"> <span class="icon"> <i class="fas fa-file-pdf"></i> </span> <span>Paper</span> </a> </span> <span class="link-block"> <a href="https://arxiv.org/abs/2402.01622" class="external-link button is-normal is-rounded is-dark">  <span class="icon"> <i class="ai ai-arxiv"></i> </span> <span>arXiv</span> </a> </span>    <span class="link-block"> <a href="https://github.com/OSU-NLP-Group/TravelPlanner" class="external-link button is-normal is-rounded is-dark"> <span class="icon"> <i class="fab fa-github"></i> </span> <span>Code</span> </a> </span>  <span class="link-block"> <a href="https://huggingface.co/datasets/osunlp/TravelPlanner" class="external-link button is-normal is-rounded is-dark"> <span class="icon">  <p style="font-size:18px">🤗</p>  </span> <span>Dataset</span> </a> </span>    <span class="link-block"> <a 
href="https://huggingface.co/spaces/osunlp/TravelPlannerLeaderboard" class="external-link button is-normal is-rounded is-dark"> <span class="icon"> <p style="font-size:18px">🏆</p> </span> <span>Leaderboard</span> </a> </span> <span class="link-block"> <a href="https://huggingface.co/spaces/osunlp/TravelPlannerEnvironment" class="external-link button is-normal is-rounded is-dark"> <span class="icon">   <p style="font-size:18px">🌏</p> </span> <span>Environment</span> </a> </span>  <span class="link-block"> <a href="https://twitter.com/ysu_nlp/status/1754365367294562680" class="external-link button is-normal is-rounded is-dark"> <span class="icon">   <p style="font-size:18px">🌐</p> </span> <span>Twitter</span> </a> </span> </div> </div> </div> </div> </div> </div> </section> <section class="hero teaser"> <div class="container is-max-desktop"> <div class="content has-text-centered"> <img src="static/images/main.png" alt="geometric reasoning" width="100%" /> <p> Overview of <img src="static/images/icon.png" style="width:1.0em;vertical-align: middle" alt="Logo" /> TravelPlanner. Given a query, language agents are tasked with employing various search tools to gather information. Based on the collected information, language agents are expected to deliver a plan that not only meets the user's needs specified in the query but also adheres to commonsense constraints. </p> </div> </div> </section>  <section class="section"> <div class="container" style="margin-bottom: 2vh;">  <div class="columns is-centered has-text-centered"> <div class="column is-four-fifths"> <h2 class="title is-3">Introduction</h2> <div class="content has-text-justified"> <p> We introduce TravelPlanner: a comprehensive benchmark designed to evaluate the planning abilities of language agents in real-world scenarios across multiple dimensions. 
Without loss of generality, TravelPlanner casts travel planning as its test environment, with all relevant information meticulously crafted to minimize data contamination. TravelPlanner does not have a single ground truth for each query. Instead, the benchmark employs several pre-defined evaluation scripts to assess each tested plan, determining whether the language agent can effectively use tools to create a plan that aligns with both the implicit commonsense and explicit user needs outlined in the query (i.e., commonsense constraint and hard constraint). Every query in TravelPlanner has undergone thorough human verification to guarantee that feasible solutions exist. Additionally, TravelPlanner evaluates the language agent's capability by varying the breadth and depth of planning, controlled through the number of travel days and the quantity of hard constraints. </p> </div> <div> <video controls> <source src="static/images/TravelPlanner_video.mp4" type="video/mp4"> </video> </div> </div> </div>  </div> </section>  <section class="hero is-light is-small"> <div class="hero-body has-text-centered">  <h1 class="title is-1 mathvista"> <img src="static/images/icon.png" style="width:1em;vertical-align: middle" alt="Logo" /> <span class="mathvista" style="vertical-align: middle">TravelPlanner Dataset</span> </h1> </div> </section>  <section class="section"> <div class="container"> <div class="columns is-centered has-text-centered">  <div class="column is-four-fifths"> <h2 class="title is-3">Overview</h2> <div class="content has-text-justified"> <p> We introduce <img src="static/images/icon.png" style="width:1.0em;vertical-align: middle" alt="Logo" />TravelPlanner, a benchmark crafted for evaluating language agents in tool-use and complex planning under multiple constraints. 
Grounded in travel planning, a real-world use case that naturally includes diverse constraints such as user needs and commonsense constraints in the environment, TravelPlanner evaluates whether language agents can develop reasonable travel plans by collecting information via diverse tools and making decisions, while satisfying the constraints. For a given query, language agents are expected to formulate a comprehensive plan that includes transportation, daily meals, attractions, and accommodation for each day. From the perspective of real-world applications, we design three types of constraints: Environment Constraint, Commonsense Constraint, and Hard Constraint. TravelPlanner comprises 1,225 queries in total. The number of travel days and the number of hard constraints are varied to test agents' abilities across both the breadth and depth of complex planning. </p>  <p> The benchmark is divided into training, validation, and test sets.  <ul> <li><b>Train Set</b>: 5 queries with corresponding human-annotated plans from each group, resulting in a total of 45 query-plan pairs. <li><b>Validation Set</b>: 20 queries from each group, amounting to 180 queries in total. <li><b>Test Set</b>: 1,000 randomly distributed queries. </ul> Download the dataset on <a href="https://huggingface.co/datasets/osunlp/TravelPlanner" target="_blank">Hugging Face Dataset</a>. </p> <div class="content has-text-centered"> <img src="static/images/statistics/dataset.png" alt="data-overview" style="max-height: 50%; max-width: 50%;" /> <p> Dataset distribution of TravelPlanner. 
</p> </div> <p> Examples in train set: </p> <div id="results-carousel" class="carousel results-carousel"> <div class="box m-5"> <div class="content has-text-centered"> <pre><code class="json" id="json1"></code></pre> <p> Easy Level & 3-day</p> </div> </div> <div class="box m-5"> <div class="content has-text-centered"> <pre><code class="json" id="json2"></code></pre> <p> Easy Level & 5-day</p> </div> </div> <div class="box m-5"> <div class="content has-text-centered"> <pre><code class="json" id="json3"></code></pre> <p> Easy Level & 7-day</p> </div> </div> <div class="box m-5"> <div class="content has-text-centered"> <pre><code class="json" id="json4"></code></pre> <p> Medium Level & 3-day</p> </div> </div> <div class="box m-5"> <div class="content has-text-centered"> <pre><code class="json" id="json5"></code></pre> <p> Medium Level & 5-day</p> </div> </div> <div class="box m-5"> <div class="content has-text-centered"> <pre><code class="json" id="json6"></code></pre> <p> Medium Level & 7-day</p> </div> </div> <div class="box m-5"> <div class="content has-text-centered"> <pre><code class="json" id="json7"></code></pre> <p> Hard Level & 3-day</p> </div> </div> <div class="box m-5"> <div class="content has-text-centered"> <pre><code class="json" id="json8"></code></pre> <p> Hard Level & 5-day</p> </div> </div> <div class="box m-5"> <div class="content has-text-centered"> <pre><code class="json" id="json9"></code></pre> <p> Hard Level & 7-day</p> </div> </div> </div> </div> </div> </div> <div class="columns is-centered m-6"> <div class="column is-full has-text-centered content"> <h2 class="title is-3">Constraint</h2> <div class="container is-max-desktop"> <div class="content has-text-centered"> <img src="static/images/constraint_description.png" alt="data-overview" style="max-width: 100%;" /> <p><img src="static/images/icon.png" style="width:1.0em;vertical-align: middle" alt="Logo" />TravelPlanner constraint description. 
The environment constraint is manifested through the feedback received from the environment, assessing whether the language agent can adjust its plan appropriately. The commonsense constraint and hard constraint are evaluated based on how well the language agent's plan aligns with these specific criteria. </p> </div> </div> </div> </div> <div class="columns is-centered m-6"> <div class="column is-full has-text-centered content"> <h2 class="title is-3">Tool</h2> <div class="container is-max-desktop"> <div class="content has-text-centered"> <img src="static/images/tool_description.png" alt="data-overview" style="max-width: 100%;" /> <p><img src="static/images/icon.png" style="width:1.0em;vertical-align: middle" alt="Logo" />Tool description and the number of items in the database. The original data for each tool is sourced from publicly available internet data. We then modify this data by adding, deleting, and altering certain keys and values to suit our requirements. In this way, we effectively avoid the problem of data contamination. </p> </div> </div> </div> </div> </div> </section>  <section class="hero is-light is-small"> <div class="hero-body has-text-centered"> <h1 class="title is-1 mathvista">Experiment Results</h1> </div> </section> <section class="section"> <div class="container"> <div class="columns is-centered m-6"> <div class="column is-full has-text-centered content"> <h2 class="title is-3">Results on Existing Large Language Models and Planning Strategies</h2>  <div id="results-carousel" class="carousel results-carousel"> <div class="box m-5"> <div class="content has-text-centered"> <img src="static/images/results-figures/main_results.png" alt="grade-lv" height="50%" /> <p>Main results of different LLMs and planning strategies on the TravelPlanner validation and test sets. The best results are marked in bold. 
</p> </div> </div> <div class="box m-5"> <div class="content has-text-centered"> <img src="static/images/results-figures/tool_use_error.png" alt="grade-lv" width="80%" /> <p> Tool-use error distribution on the test set. We cap the tool-use process at 30 steps. An agent triggers an early stop if it either makes three consecutive failed attempts or repeats the same action three times in a row, which indicates a dead loop. </p> </div> </div> <div class="box m-5"> <div class="content has-text-centered"> <img src="static/images/results-figures/constraint_pass_rate.png" alt="grade-lv" width="50%" /> <p>Constraint pass rate of GPT-4-Turbo on the test set. The results of the sole-planning mode are based on the Direct strategy. Note that plans failing to meet the "Within Sandbox" or "No Missed Key Information" criteria are excluded from the hard constraint pass rate calculation, because information beyond the sandbox's scope or key details that are missing cannot be effectively searched or evaluated.</p> </div> </div> <div class="box m-5"> <div class="content has-text-centered"> <img src="static/images/results-figures/information_collection_comparison.png" alt="contexts" width="90%" /> <p>Comparison of the number of uses of each tool between the agent (GPT-4-Turbo) and the reference. The agent's results are based on the number of entries written into the "Notebook".</p> </div> </div> </div> </div> </div> <div class="columns is-centered m-6"> <div class="column is-full has-text-centered content"> <h2 class="title is-3">Case Study</h2>  <div id="results-carousel" class="carousel results-carousel"> <div class="box m-5"> <div class="content has-text-centered"> <img src="static/images/results-examples/1.png" alt="grade-lv" width="40%" /> <p>GPT-4-Turbo + ReAct in tool-use scenario. 
</p> </div> </div> <div class="box m-5"> <div class="content has-text-centered"> <img src="static/images/results-examples/4.png" alt="grade-lv" width="40%" /> <p>GPT-4-Turbo + ReAct in tool-use scenario. </p> </div> </div> <div class="box m-5"> <div class="content has-text-centered"> <img src="static/images/results-examples/5.png" alt="grade-lv" width="40%" /> <p>GPT-4-Turbo + ReAct in tool-use scenario. </p> </div> </div> <div class="box m-5"> <div class="content has-text-centered"> <img src="static/images/results-examples/6.png" alt="grade-lv" width="40%" /> <p>GPT-4-Turbo + ReAct in tool-use scenario. </p> </div> </div> <div class="box m-5"> <div class="content has-text-centered"> <img src="static/images/results-examples/2.png" alt="grade-lv" width="40%" /> <p> GPT-4-Turbo + Direct Planning in sole-planning scenario. </div> </div> <div class="box m-5"> <div class="content has-text-centered"> <img src="static/images/results-examples/7.png" alt="grade-lv" width="40%" /> <p> GPT-4-Turbo + Direct Planning in sole-planning scenario. </div> </div> <div class="box m-5"> <div class="content has-text-centered"> <img src="static/images/results-examples/8.png" alt="grade-lv" width="40%" /> <p> GPT-4-Turbo + Direct Planning in sole-planning scenario. </div> </div> <div class="box m-5"> <div class="content has-text-centered"> <img src="static/images/results-examples/9.png" alt="grade-lv" width="40%" /> <p> GPT-4-Turbo + Direct Planning in sole-planning scenario. 
</div> </div> <div class="box m-5"> <div class="content has-text-centered"> <img src="static/images/results-examples/3.png" alt="grade-lv" width="40%" /> <p>GPT-4-Turbo + Reflexion Planning in sole-planning scenario.</p> </div> </div> <div class="box m-5"> <div class="content has-text-centered"> <img src="static/images/results-examples/10.png" alt="grade-lv" width="40%" /> <p>GPT-4-Turbo + Reflexion Planning in sole-planning scenario.</p> </div> </div> </div> </div> </div> </div> </section>  <section class="section" id="BibTeX"> <div class="container is-max-desktop content"> <h2 class="title is-3 has-text-centered">BibTeX</h2> <pre><code>@inproceedings{xie2024travelplanner, title={TravelPlanner: A Benchmark for Real-World Planning with Language Agents}, author={Xie, Jian and Zhang, Kai and Chen, Jiangjie and Zhu, Tinghui and Lou, Renze and Tian, Yuandong and Xiao, Yanghua and Su, Yu}, booktitle={Forty-first International Conference on Machine Learning} }</code></pre> </div> </section> <section> <div class="section" id="org-banners" style="display:flex"> <a href="https://www.fudan.edu.cn/en/" target="_blank" rel="external"> <img class="center-block org-banner" src="static/images/fdu.png"> </a> <a href="https://www.osu.edu/" target="_blank" class="ext-link"> <img class="center-block org-banner" src="static/images/osu.svg"> </a> <a href="https://www.psu.edu/" target="_blank" rel="external"> <img class="center-block org-banner" src="static/images/psu.png"> </a> <a href="https://ai.meta.com/" target="_blank" rel="external"> <img class="center-block org-banner" src="static/images/meta.png"> </a> </div> </section> <footer class="footer">  <div class="content has-text-centered"> </div> <div class="columns is-centered"> <div class="column is-8"> <div class="content"> <p> This website is adapted from <a href="https://nerfies.github.io/">Nerfies</a> and <a href="https://mathvista.github.io/">MathVista</a>, licensed under a <a rel="license" 
href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. </p> </div> </div> </div>  </footer> </body> </html>
