{"id":114398,"date":"2024-10-15T11:59:46","date_gmt":"2024-10-15T04:59:46","guid":{"rendered":"https:\/\/hotvideos24.online\/?p=114398"},"modified":"2024-10-15T11:59:46","modified_gmt":"2024-10-15T04:59:46","slug":"apple-study-exposes-deep-cracks-in-llms-reasoning-capabilities","status":"publish","type":"post","link":"https:\/\/hotvideos24.online\/?p=114398","title":{"rendered":"Apple study exposes deep cracks in LLMs\u2019 \u201creasoning\u201d capabilities"},"content":{"rendered":"<p> <script async src=\"https:\/\/pagead2.googlesyndication.com\/pagead\/js\/adsbygoogle.js?client=ca-pub-3711241968723425\"\r\n     crossorigin=\"anonymous\"><\/script>\r\n<ins class=\"adsbygoogle\"\r\n     style=\"display:block\"\r\n     data-ad-format=\"fluid\"\r\n     data-ad-layout-key=\"-fb+5w+4e-db+86\"\r\n     data-ad-client=\"ca-pub-3711241968723425\"\r\n     data-ad-slot=\"7910942971\"><\/ins>\r\n<script>\r\n     (adsbygoogle = window.adsbygoogle || []).push({});\r\n<\/script><br \/>\n<\/p>\n<div>\n<p>This kind of variance\u2014both within different GSM-Symbolic runs and compared to GSM8K results\u2014is more than a little surprising since, as the researchers point out, &#8220;the overall reasoning steps needed to solve a question remain the same.&#8221; The fact that such small changes lead to such variable results suggests to the researchers that these models are not doing any &#8220;formal&#8221; reasoning but are instead &#8220;attempt[ing] to perform a kind of in-distribution pattern-matching, aligning given questions and solution steps with similar ones seen in the training data.&#8221;<\/p>\n<h2>Don\u2019t get distracted<\/h2>\n<p>Still, the overall variance shown for the GSM-Symbolic tests was often relatively small in the grand scheme of things. OpenAI&#8217;s ChatGPT-4o, for instance, dropped from 95.2 percent accuracy on GSM8K to a still-impressive 94.9 percent on GSM-Symbolic. 
That&#8217;s a pretty high success rate using either benchmark, regardless of whether or not the model itself is using &#8220;formal&#8221; reasoning behind the scenes (though total accuracy for many models dropped precipitously when the researchers added just one or two additional logical steps to the problems).<\/p>\n<figure class=\"ars-img-shortcode id-2056422 align-center\">\n<div>\n<div class=\"ars-lightbox\">\n<div class=\"ars-lightbox-item\">\n              <a data-pswp-width=\"1440\" data-pswp-height=\"745\" data-pswp-srcset=\"https:\/\/cdn.arstechnica.net\/wp-content\/uploads\/2024\/10\/gsm-study3-300x155.png 300w, https:\/\/cdn.arstechnica.net\/wp-content\/uploads\/2024\/10\/gsm-study3-640x331.png 640w, https:\/\/cdn.arstechnica.net\/wp-content\/uploads\/2024\/10\/gsm-study3-768x398.png 768w, https:\/\/cdn.arstechnica.net\/wp-content\/uploads\/2024\/10\/gsm-study3-1536x795.png 1536w, https:\/\/cdn.arstechnica.net\/wp-content\/uploads\/2024\/10\/gsm-study3-980x507.png 980w, https:\/\/cdn.arstechnica.net\/wp-content\/uploads\/2024\/10\/gsm-study3-1440x745.png 1440w, https:\/\/cdn.arstechnica.net\/wp-content\/uploads\/2024\/10\/gsm-study3.png 1762w\" data-cropped=\"true\" href=\"https:\/\/cdn.arstechnica.net\/wp-content\/uploads\/2024\/10\/gsm-study3-1440x745.png\" target=\"_blank\" class=\"cursor-zoom-in\" rel=\"noopener\"><br \/>\n                <img loading=\"lazy\" decoding=\"async\" width=\"1762\" height=\"912\" src=\"https:\/\/cdn.arstechnica.net\/wp-content\/uploads\/2024\/10\/gsm-study3.png\" class=\"attachment-full size-full\" alt=\"\" srcset=\"https:\/\/cdn.arstechnica.net\/wp-content\/uploads\/2024\/10\/gsm-study3.png 1762w, https:\/\/cdn.arstechnica.net\/wp-content\/uploads\/2024\/10\/gsm-study3-300x155.png 300w, https:\/\/cdn.arstechnica.net\/wp-content\/uploads\/2024\/10\/gsm-study3-640x331.png 640w, https:\/\/cdn.arstechnica.net\/wp-content\/uploads\/2024\/10\/gsm-study3-768x398.png 768w, 
https:\/\/cdn.arstechnica.net\/wp-content\/uploads\/2024\/10\/gsm-study3-1536x795.png 1536w, https:\/\/cdn.arstechnica.net\/wp-content\/uploads\/2024\/10\/gsm-study3-980x507.png 980w, https:\/\/cdn.arstechnica.net\/wp-content\/uploads\/2024\/10\/gsm-study3-1440x745.png 1440w\" sizes=\"auto, (max-width: 1762px) 100vw, 1762px\"\/><br \/>\n              <\/a><\/p>\n<div class=\"pswp-caption-content\" id=\"caption-2056422\">\n                An example showing how some models get misled by irrelevant information added to the GSM8K benchmark suite.<\/p><\/div>\n<\/p><\/div>\n<\/p><\/div>\n<\/p><\/div><figcaption>\n<div class=\"caption mt-1 inline-flex flex-row items-stretch gap-1 text-lg leading-tight text-gray-300\">\n<div class=\"caption-content\">\n      An example showing how some models get misled by irrelevant information added to the GSM8K benchmark suite.<\/p>\n<p>              <span class=\"caption-credit mt-2 whitespace-nowrap text-xs\"><br \/>\n          Credit:<\/p>\n<p>                      <a class=\"caption-credit-link text-gray-400 hover:text-gray-300\" href=\"https:\/\/arxiv.org\/pdf\/2410.05229\"><\/p>\n<p>          Apple Research<\/p>\n<p>                      <\/a><br \/>\n                  <\/span>\n          <\/div>\n<\/p><\/div>\n<\/figcaption><\/figure>\n<p>The tested LLMs fared much worse, though, when the Apple researchers modified the GSM-Symbolic benchmark by adding &#8220;seemingly relevant but ultimately inconsequential statements&#8221; to the questions. 
For this &#8220;GSM-NoOp&#8221; benchmark set (short for &#8220;no operation&#8221;), a question about how many kiwis someone picks across multiple days might be modified to include the incidental detail that &#8220;five of them [the kiwis] were a bit smaller than average.&#8221;<\/p>\n<p>Adding in these red herrings led to what the researchers termed &#8220;catastrophic performance drops&#8221; in accuracy compared to GSM8K, ranging from 17.5 percent to a whopping 65.7 percent, depending on the model tested. These massive drops in accuracy highlight the inherent limits in using simple &#8220;pattern matching&#8221; to &#8220;convert statements to operations without truly understanding their meaning,&#8221; the researchers write.<\/p>\n<figure class=\"ars-img-shortcode id-2056423 align-right\">\n<div>\n<div class=\"ars-lightbox\">\n<div class=\"ars-lightbox-item\">\n              <a data-pswp-width=\"696\" data-pswp-height=\"953\" data-pswp-srcset=\"https:\/\/cdn.arstechnica.net\/wp-content\/uploads\/2024\/10\/gsm-study2-300x411.png 300w, https:\/\/cdn.arstechnica.net\/wp-content\/uploads\/2024\/10\/gsm-study2-640x876.png 640w, https:\/\/cdn.arstechnica.net\/wp-content\/uploads\/2024\/10\/gsm-study2.png 696w\" data-cropped=\"true\" href=\"https:\/\/cdn.arstechnica.net\/wp-content\/uploads\/2024\/10\/gsm-study2.png\" target=\"_blank\" class=\"cursor-zoom-in\" rel=\"noopener\"><br \/>\n                <img loading=\"lazy\" decoding=\"async\" width=\"696\" height=\"953\" src=\"https:\/\/cdn.arstechnica.net\/wp-content\/uploads\/2024\/10\/gsm-study2.png\" class=\"attachment-full size-full\" alt=\"\" srcset=\"https:\/\/cdn.arstechnica.net\/wp-content\/uploads\/2024\/10\/gsm-study2.png 696w, https:\/\/cdn.arstechnica.net\/wp-content\/uploads\/2024\/10\/gsm-study2-300x411.png 300w, https:\/\/cdn.arstechnica.net\/wp-content\/uploads\/2024\/10\/gsm-study2-640x876.png 640w\" sizes=\"auto, (max-width: 696px) 100vw, 696px\"\/><br \/>\n              <\/a><\/p>\n<div 
class=\"pswp-caption-content\" id=\"caption-2056423\">\n                Introducing irrelevant information to the prompts often led to &#8220;catastrophic&#8221; failure for most &#8220;reasoning&#8221; LLMs<\/p><\/div>\n<\/p><\/div>\n<\/p><\/div>\n<\/p><\/div><figcaption>\n<div class=\"caption mt-1 inline-flex flex-row items-stretch gap-1 text-lg leading-tight text-gray-300\">\n<div class=\"caption-content\">\n      Introducing irrelevant information to the prompts often led to &#8220;catastrophic&#8221; failure for most &#8220;reasoning&#8221; LLMs<\/p>\n<p>              <span class=\"caption-credit mt-2 whitespace-nowrap text-xs\"><br \/>\n          Credit:<\/p>\n<p>                      <a class=\"caption-credit-link text-gray-400 hover:text-gray-300\" href=\"https:\/\/arxiv.org\/pdf\/2410.05229\"><\/p>\n<p>          Apple Research<\/p>\n<p>                      <\/a><br \/>\n                  <\/span>\n          <\/div>\n<\/p><\/div>\n<\/figcaption><\/figure>\n<p>In the example with the smaller kiwis, for instance, most models try to subtract the smaller fruits from the final total because, the researchers surmise, &#8220;their training datasets included similar examples that required conversion to subtraction operations.&#8221; This is the kind of &#8220;critical flaw&#8221; that the researchers say &#8220;suggests deeper issues in [the models&#8217;] reasoning processes&#8221; that can&#8217;t be helped with fine-tuning or other refinements.<\/p>\n<\/p><\/div>\n<p><script async src=\"https:\/\/pagead2.googlesyndication.com\/pagead\/js\/adsbygoogle.js?client=ca-pub-3711241968723425\"\r\n     crossorigin=\"anonymous\"><\/script>\r\n<ins class=\"adsbygoogle\"\r\n     style=\"display:block\"\r\n     data-ad-format=\"fluid\"\r\n     data-ad-layout-key=\"-fb+5w+4e-db+86\"\r\n     data-ad-client=\"ca-pub-3711241968723425\"\r\n     data-ad-slot=\"7910942971\"><\/ins>\r\n<script>\r\n     (adsbygoogle = window.adsbygoogle || []).push({});\r\n<\/script><br \/>\n<br 
\/><div data-type=\"_mgwidget\" data-widget-id=\"1660802\">\r\n<\/div>\r\n<script>(function(w,q){w[q]=w[q]||[];w[q].push([\"_mgc.load\"])})(window,\"_mgq\");\r\n<\/script>\r\n<br \/>\n<br \/><a href=\"https:\/\/arstechnica.com\/ai\/2024\/10\/llms-cant-perform-genuine-logical-reasoning-apple-researchers-suggest\/\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This kind of variance\u2014both within different GSM-Symbolic runs and compared to GSM8K results\u2014is more than a little surprising since, as the researchers point out, &#8220;the overall reasoning steps needed to &hellip; <a href=\"https:\/\/hotvideos24.online\/?p=114398\" class=\"more-link\">Read More<\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8630],"tags":[],"class_list":["post-114398","post","type-post","status-publish","format-standard","hentry","category-technology","entry"],"_links":{"self":[{"href":"https:\/\/hotvideos24.online\/index.php?rest_route=\/wp\/v2\/posts\/114398","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/hotvideos24.online\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/hotvideos24.online\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/hotvideos24.online\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/hotvideos24.online\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=114398"}],"version-history":[{"count":0,"href":"https:\/\/hotvideos24.online\/index.php?rest_route=\/wp\/v2\/posts\/114398\/revisions"}],"wp:attachment":[{"href":"https:\/\/hotvideos24.online\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=114398"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/hotvideos24.online\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=114398"},{"taxonomy":"post_tag","
embeddable":true,"href":"https:\/\/hotvideos24.online\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=114398"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}