Sebastian Raschka, PhD - Founder | Principal AI & LLM Research Engineer - RAIR Lab | LinkedIn
ML/AI research engineer. Author of Build a Large Language Model (From Scratch) (amzn.to/4fqvn0D) and Ahead of AI (magazine.sebastianraschka.com), on how LLMs work and the latest developments in the field.

Sebastian Raschka, PhD, is an LLM Research Engineer with over a decade of experience in artificial intelligence. His work spans industry and academia, including implementing LLM solutions as a senior engineer at Lightning AI and teaching as a statistics professor at the University of Wisconsin–Madison. Sebastian collaborates with Fortune 500 companies on AI solutions and serves on the Open Source Board at the University of Wisconsin–Madison. He specializes in LLMs and the development of high-performance AI systems, with a deep focus on practical, code-driven implementations. He is the author of the bestselling books Machine Learning with PyTorch and Scikit-Learn, Machine Learning Q and AI, and Build a Large Language Model (From Scratch). For more information about specific projects, please see the resources below.

Website: https://sebastianraschka.com
Publications: https://scholar.google.com/citations?user=X4RCC0IAAAAJ&hl=en
Open-source projects: https://github.com/rasbt

Experience: RAIR Lab · Education: Michigan State University · Location: Madison · 500+ connections on LinkedIn

# Understanding Reasoning LLMs

Published February 5, 2025

**Methods and Strategies for Building and Refining Reasoning Models**

In this article, I will describe the four main approaches to building reasoning models, or how we can enhance LLMs with reasoning capabilities. I hope this provides valuable insights and helps you navigate the rapidly evolving literature and hype surrounding this topic.

In 2024, the LLM field saw increasing specialization. Beyond pre-training and fine-tuning, we witnessed the rise of specialized applications, from RAGs to code assistants. I expect this trend to accelerate in 2025, with an even greater emphasis on domain- and application-specific optimizations (i.e., "specializations").

[Figure: Stages 1-3 are the common steps to developing LLMs. Stage 4 specializes LLMs for specific use cases.]
The development of reasoning models is one of these specializations. This means we refine LLMs to excel at complex tasks that are best solved with intermediate steps, such as puzzles, advanced math, and coding challenges. However, this specialization does not replace other LLM applications, because transforming an LLM into a reasoning model also introduces certain drawbacks, which I will discuss later.

To give you a brief glimpse of what's covered below, in this article, I will:

1. Explain the meaning of "reasoning model"
2. Discuss the advantages and disadvantages of reasoning models
3. Outline the methodology behind DeepSeek R1
4. Describe the four main approaches to building and improving reasoning models
5. Share thoughts on the LLM landscape following the DeepSeek V3 and R1 releases
6. Provide tips for developing reasoning models on a tight budget

I hope you find this article useful as AI continues its rapid development this year!

## How do we define a "reasoning model"?

If you work in AI (or machine learning in general), you are probably familiar with vague and hotly debated definitions. The term "reasoning model" is no exception. Eventually, someone will define it formally in a paper, only for it to be redefined in the next, and so on.

In this article, I define "reasoning" as the process of answering questions that require complex, multi-step generation with intermediate steps. For example, factual question-answering like "What is the capital of France?" does not involve reasoning. In contrast, a question like "If a train is moving at 60 mph and travels for 3 hours, how far does it go?" requires some simple reasoning; for instance, it requires recognizing the relationship between distance, speed, and time before arriving at the answer.

[Figure: A regular LLM may only provide a short answer, whereas reasoning models typically include intermediate steps that reveal part of the thought process. (Note that many LLMs that have not been specifically developed for reasoning tasks can also provide intermediate reasoning steps in their answers.)]

Most modern LLMs are capable of basic reasoning and can answer questions like, "If a train is moving at 60 mph and travels for 3 hours, how far does it go?" So, today, when we refer to reasoning models, we typically mean LLMs that excel at more complex reasoning tasks, such as solving puzzles, riddles, and mathematical proofs.

Additionally, most LLMs branded as reasoning models today include a "thought" or "thinking" process as part of their response. Whether and how an LLM actually "thinks" is a separate discussion.

Intermediate steps in reasoning models can appear in two ways. First, they may be explicitly included in the response, as shown in the previous figure. Second, some reasoning LLMs, such as OpenAI's o1, run multiple iterations with intermediate steps that are not shown to the user.

[Figure: "Reasoning" is used at two different levels: 1) processing the input and generating via multiple intermediate steps and 2) providing some sort of reasoning as part of the response to the user.]

## When should we use reasoning models?

Now that we have defined reasoning models, we can move on to the more interesting part: how to build and improve LLMs for reasoning tasks. However, before diving into the technical details, it is important to consider when reasoning models are actually needed.

**When do we need a reasoning model?** Reasoning models are designed to be good at complex tasks such as solving puzzles, advanced math problems, and challenging coding tasks. However, they are not necessary for simpler tasks like summarization, translation, or knowledge-based question answering. In fact, using reasoning models for everything can be inefficient and expensive. For instance, reasoning models are typically more expensive to use, more verbose, and sometimes more prone to errors due to "overthinking." Here, too, the simple rule applies: use the right tool (or type of LLM) for the task.
The key strengths and limitations of reasoning models are summarized in the figure below.

[Figure: The key strengths and weaknesses of reasoning models.]

## A brief look at the DeepSeek training pipeline

Before discussing the four main approaches to building and improving reasoning models in the next section, I want to briefly outline the DeepSeek R1 pipeline, as described in the DeepSeek R1 technical report (https://arxiv.org/abs/2501.12948). This report serves as both an interesting case study and a blueprint for developing reasoning LLMs.

Note that DeepSeek did not release a single R1 reasoning model but instead introduced three distinct variants: DeepSeek-R1-Zero, DeepSeek-R1, and DeepSeek-R1-Distill.

Based on the descriptions in the technical report, I have summarized the development process of these models in the diagram below.

[Figure: Development process of DeepSeek's three different reasoning models discussed in the DeepSeek R1 technical report.]

Next, let's briefly go over the process shown in the diagram above. More details will be covered in the next section, where we discuss the four main approaches to building and improving reasoning models.

**(1) DeepSeek-R1-Zero:** This model is based on the 671B pre-trained DeepSeek-V3 base model released in December 2024. The research team trained it using reinforcement learning (RL) with two types of rewards. This approach is referred to as "cold start" training because it did not include a supervised fine-tuning (SFT) step, which is typically part of reinforcement learning with human feedback (RLHF).

**(2) DeepSeek-R1:** This is DeepSeek's flagship reasoning model, built upon DeepSeek-R1-Zero. The team further refined it with additional SFT stages and further RL training, improving upon the "cold-started" R1-Zero model.

**(3) DeepSeek-R1-Distill:** Using the SFT data generated in the previous steps, the DeepSeek team fine-tuned Qwen and Llama models to enhance their reasoning abilities. While not distillation in the traditional sense, this process involved training smaller models (Llama 8B and 70B, and Qwen 1.5B–30B) on outputs from the larger DeepSeek-R1 671B model.
---

*If you like this article, consider subscribing to my blog at https://magazine.sebastianraschka.com, where I post articles more regularly.*

[Figure: My research blog at magazine.sebastianraschka.com.]

---

## The 4 main ways to build and improve reasoning models

In this section, I will outline the key techniques currently used to enhance the reasoning capabilities of LLMs and to build specialized reasoning models such as DeepSeek-R1, OpenAI's o1 & o3, and others.

Note: The exact workings of o1 and o3 remain unknown outside of OpenAI. However, they are rumored to leverage a combination of both inference and training techniques.

### 1) Inference-time scaling

One way to improve an LLM's reasoning capabilities (or any capability in general) is inference-time scaling. This term can have multiple meanings, but in this context, it refers to increasing computational resources during inference to improve output quality.

A rough analogy is how humans tend to generate better responses when given more time to think through complex problems. Similarly, we can apply techniques that encourage the LLM to "think" more while generating an answer. (Although, whether LLMs actually "think" is a different discussion.)

One straightforward approach to inference-time scaling is clever prompt engineering. A classic example is *chain-of-thought (CoT) prompting*, where phrases like "think step by step" are included in the input prompt. This encourages the model to generate intermediate reasoning steps rather than jumping directly to the final answer, which can often (but not always) lead to more accurate results on more complex problems. (Note that it doesn't make sense to employ this strategy for simpler knowledge-based questions, like "What is the capital of France?"; this is again a good rule of thumb for deciding whether a reasoning model makes sense for a given input query.)
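To make this concrete, here is a minimal sketch of zero-shot CoT prompting; `generate` is a placeholder for whatever LLM inference function or API client you use, not a specific library call.

```python
# Minimal sketch of zero-shot chain-of-thought (CoT) prompting (illustrative only).
# `generate` is a placeholder for your own LLM inference function or API client.

question = "If a train is moving at 60 mph and travels for 3 hours, how far does it go?"

plain_prompt = question
cot_prompt = question + "\nLet's think step by step."  # the classic zero-shot CoT suffix

# answer = generate(cot_prompt)  # placeholder call; the CoT variant typically elicits
#                                # intermediate reasoning steps before the final answer
print(cot_prompt)
```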
[Figure: An example of classic CoT prompting from the 2022 "Large Language Models are Zero-Shot Reasoners" paper.]

The aforementioned CoT approach can be seen as inference-time scaling because it makes inference more expensive through generating more output tokens.

Another approach to inference-time scaling is the use of voting and search strategies. One simple example is majority voting, where we have the LLM generate multiple answers and select the final answer by majority vote. Similarly, we can use beam search and other search algorithms to generate better responses.

I highly recommend the paper *Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters* (https://arxiv.org/abs/2408.03314), which I described in my previous "Noteworthy AI Research Papers of 2024 (Part Two)" article (https://magazine.sebastianraschka.com/p/ai-research-papers-2024-part-2), for more details on these different strategies.

[Figure: Different search-based methods rely on a process-reward-based model to select the best answer. Annotated figure from the LLM Test-Time Compute paper.]

The DeepSeek R1 technical report states that its models do not use inference-time scaling. However, this technique is often implemented at the application layer on top of the LLM, so it is possible that DeepSeek applies it within their app.

I suspect that OpenAI's o1 and o3 models use inference-time scaling, which would explain why they are relatively expensive compared to models like GPT-4o. In addition to inference-time scaling, o1 and o3 were likely trained using RL pipelines similar to those used for DeepSeek R1. More on reinforcement learning in the next two sections below.
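To wrap up this subsection, here is a minimal sketch of the majority-voting idea mentioned above (sometimes called self-consistency); `sample_answer` is a placeholder for drawing one sampled answer from an LLM, not a specific API.

```python
from collections import Counter

# Minimal sketch of majority voting over multiple sampled answers (illustrative only).
# `sample_answer` is a placeholder for one LLM sample drawn with temperature > 0.

def majority_vote(question, sample_answer, n_samples=8):
    answers = [sample_answer(question) for _ in range(n_samples)]  # independent samples
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer

# Usage (with your own sampler):
# final_answer = majority_vote("What is 17 * 24?", sample_answer=my_llm_sampler)
```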
### 2) Pure reinforcement learning (RL)

One of my personal highlights from the DeepSeek R1 paper (https://arxiv.org/abs/2501.12948) is their discovery that reasoning emerges as a behavior from pure reinforcement learning (RL). Let's explore what this means in more detail.

As outlined earlier, DeepSeek developed three types of R1 models. The first, **DeepSeek-R1-Zero**, was built on top of the DeepSeek-V3 base model, a standard pre-trained LLM they released in December 2024. Unlike typical RL pipelines, where supervised fine-tuning (SFT) is applied before RL, DeepSeek-R1-Zero was trained **exclusively** with reinforcement learning, without an initial SFT stage, as highlighted in the diagram below.

[Figure: The development process of the DeepSeek-R1-Zero model.]

Still, this RL process is similar to the commonly used RLHF approach, which is typically applied to preference-tune LLMs. (I covered RLHF in more detail in my article *LLM Training: RLHF and Its Alternatives*, https://magazine.sebastianraschka.com/p/llm-training-rlhf-and-its-alternatives.) However, as mentioned above, the key difference in DeepSeek-R1-Zero is that they skipped the supervised fine-tuning (SFT) stage for instruction tuning. This is why they refer to it as "pure" RL. (Although, RL in the context of LLMs differs significantly from traditional RL, which is a topic for another time.)

For rewards, instead of using a reward model trained on human preferences, they employed two types of rewards: an accuracy reward and a format reward.

- The **accuracy reward** uses the LeetCode compiler to verify coding answers and a deterministic system to evaluate mathematical responses.
- The **format reward** relies on an LLM judge to ensure responses follow the expected format, such as placing reasoning steps inside `<think>` tags.
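As a rough illustration of what such rule-based rewards can look like (a simplification on my part; the actual pipeline relies on verifiers such as compilers and deterministic math checkers, plus an LLM judge for formatting), a toy version might be:

```python
import re

# Toy sketch of rule-based rewards for reasoning-focused RL (illustrative only).

def format_reward(response: str) -> float:
    # Reward responses that wrap their reasoning in <think>...</think> tags.
    has_think_block = re.search(r"<think>.+?</think>", response, flags=re.DOTALL)
    return 1.0 if has_think_block else 0.0

def accuracy_reward(response: str, reference_answer: str) -> float:
    # For math-style questions, compare the final answer given after the reasoning block.
    final_part = response.split("</think>")[-1].strip()
    return 1.0 if final_part == reference_answer else 0.0

example = "<think>60 mph * 3 h = 180 miles</think>180 miles"
print(format_reward(example), accuracy_reward(example, "180 miles"))
```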
Surprisingly, this approach was enough for the LLM to develop basic reasoning skills. The researchers observed an "Aha!" moment, where the model began generating reasoning traces as part of its responses despite not being explicitly trained to do so, as shown in the figure below.

[Figure: A figure from the DeepSeek R1 technical report.]

While R1-Zero is not a top-performing reasoning model, it does demonstrate reasoning capabilities by generating intermediate "thinking" steps, as shown in the figure above. This confirms that it is possible to develop a reasoning model using pure RL, and the DeepSeek team was the first to demonstrate (or at least publish) this approach.

### 3) Supervised finetuning and reinforcement learning (SFT + RL)

Next, let's look at the development of DeepSeek-R1, DeepSeek's flagship reasoning model, which serves as a blueprint for building reasoning models. This model improves upon DeepSeek-R1-Zero by incorporating additional supervised fine-tuning (SFT) and reinforcement learning (RL) to improve its reasoning performance.

Note that it is actually common to include an SFT stage before RL, as seen in the standard RLHF pipeline. OpenAI's o1 was likely developed using a similar approach.

[Figure: The development process of the DeepSeek-R1 model.]

As shown in the diagram above, the DeepSeek team used DeepSeek-R1-Zero to generate what they call "cold-start" SFT data. The term "cold start" refers to the fact that this data was produced by DeepSeek-R1-Zero, which itself had not been trained on any supervised fine-tuning (SFT) data.

Using this cold-start SFT data, DeepSeek then trained the model via instruction fine-tuning, followed by another reinforcement learning (RL) stage. This RL stage retained the same accuracy and format rewards used in DeepSeek-R1-Zero's RL process. However, they added a consistency reward to prevent language mixing, which occurs when the model switches between multiple languages within a response.

The RL stage was followed by another round of SFT data collection. In this phase, the most recent model checkpoint was used to generate 600K Chain-of-Thought (CoT) SFT examples, while an additional 200K knowledge-based SFT examples were created using the DeepSeek-V3 base model.

These 600K + 200K SFT samples were then used for another round of RL. In this stage, they again used rule-based methods for accuracy rewards on math and coding questions, while human preference labels were used for other question types.

The final model, DeepSeek-R1, has a noticeable performance boost over DeepSeek-R1-Zero thanks to the additional SFT and RL stages, as shown in the table below.

[Figure: Benchmark comparison of OpenAI o1 and DeepSeek R1 models. Annotated figure from the DeepSeek-R1 technical report.]

### 4) Pure supervised finetuning (SFT) and distillation

So far, we have covered three key approaches to building and improving reasoning models:
1. Inference-time scaling, a technique that improves reasoning capabilities without training or otherwise modifying the underlying model.

2. Pure reinforcement learning (RL), as in DeepSeek-R1-Zero, which showed that reasoning can emerge as a learned behavior without supervised fine-tuning.

3. Supervised fine-tuning (SFT) plus RL, which led to DeepSeek-R1, DeepSeek's flagship reasoning model.

**So, what's left? Model "distillation."**

Surprisingly, DeepSeek also released smaller models trained via a process they call *distillation*. However, in the context of LLMs, distillation does not necessarily follow the classical knowledge distillation approach used in deep learning. Traditionally, in knowledge distillation (as briefly described in Chapter 6 of my Machine Learning Q and AI book, https://amzn.to/40YYowg), a smaller student model is trained on both the logits of a larger teacher model and a target dataset.

Instead, here distillation refers to instruction fine-tuning smaller LLMs, such as Llama 8B and 70B and Qwen 2.5 models (0.5B to 32B), on an SFT dataset generated by larger LLMs. Specifically, these larger LLMs are DeepSeek-V3 and an intermediate checkpoint of DeepSeek-R1. In fact, the SFT data used for this distillation process is the same dataset that was used to train DeepSeek-R1, as described in the previous section.

To clarify this process, I have highlighted the distillation portion in the diagram below.

[Figure: The development process of the distilled DeepSeek R1 models.]
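To make the distillation-via-SFT recipe above concrete, here is a minimal sketch of the data-generation half; `teacher_generate`, the record fields, and the file name are illustrative placeholders rather than DeepSeek's actual setup.

```python
import json

# Minimal sketch: build an SFT ("distillation") dataset from a stronger teacher model.
# `teacher_generate` is a placeholder for querying the large reasoning model.

def build_distillation_dataset(prompts, teacher_generate, out_path="distill_sft.jsonl"):
    with open(out_path, "w") as f:
        for prompt in prompts:
            response = teacher_generate(prompt)  # teacher's reasoning trace + final answer
            record = {"instruction": prompt, "output": response}
            f.write(json.dumps(record) + "\n")
    return out_path

# The resulting JSONL file can then be used for ordinary instruction fine-tuning of a
# smaller student model; note that no teacher logits are involved, unlike classic
# knowledge distillation.
```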
Why did they develop these distilled models? In my opinion, there are two key reasons:

1. Smaller models are more efficient. This means they are cheaper to run, but they can also run on lower-end hardware, which makes them especially interesting for many researchers and tinkerers like me.

2. A case study in pure SFT. These distilled models serve as an interesting benchmark, showing how far pure supervised fine-tuning (SFT) can take a model without reinforcement learning.

The table below compares the performance of these distilled models against other popular models, as well as DeepSeek-R1-Zero and DeepSeek-R1.

[Figure: Benchmark comparison of distilled versus non-distilled models. Annotated figure from the DeepSeek-R1 technical report.]

As we can see, the distilled models are noticeably weaker than DeepSeek-R1, but they are surprisingly strong relative to DeepSeek-R1-Zero, despite being orders of magnitude smaller. It's also interesting to note how well these models perform compared to o1-mini (I suspect o1-mini itself might be a similarly distilled version of o1).

Before wrapping up this section with a conclusion, there's one more interesting comparison worth mentioning. The DeepSeek team tested whether the emergent reasoning behavior seen in DeepSeek-R1-Zero could also appear in smaller models. To investigate this, they applied the same pure RL approach from DeepSeek-R1-Zero directly to Qwen-32B.

The results of this experiment are summarized in the table below, where QwQ-32B-Preview serves as a reference reasoning model based on Qwen 2.5 32B developed by the Qwen team (I think the training details were never disclosed). This comparison provides some additional insights into whether pure RL alone can induce reasoning capabilities in models much smaller than DeepSeek-R1-Zero.

[Figure: Benchmark comparison of distillation and RL on a smaller 32B model. Annotated figure from the DeepSeek-R1 technical report.]

Interestingly, the results suggest that distillation is far more effective than pure RL for smaller models. This aligns with the idea that RL alone may not be sufficient to induce strong reasoning abilities in models of this scale, whereas SFT on high-quality reasoning data can be a more effective strategy when working with small models.

For completeness, it would have been useful to see additional comparisons in the table:

1. Qwen-32B trained with SFT + RL, similar to how DeepSeek-R1 was developed. This would help determine how much improvement can be made, compared to pure RL and pure SFT, when RL is combined with SFT.

2. DeepSeek-V3 trained with pure SFT, similar to how the distilled models were created. This would allow for a direct comparison to see how effective RL + SFT is over pure SFT.

### Conclusion

In this section, we explored four different strategies for building and improving reasoning models:

1. Inference-time scaling requires no additional training but increases inference costs, making large-scale deployment more expensive as the number of users or the query volume grows. Still, it remains a no-brainer for improving the performance of already strong models. I strongly suspect that o1 leverages inference-time scaling, which helps explain why it is more expensive on a per-token basis compared to DeepSeek-R1.

2. Pure RL is interesting for research purposes because it provides insights into reasoning as an emergent behavior. However, in practical model development, RL + SFT is the preferred approach, as it leads to stronger reasoning models. I strongly suspect that o1 was trained using RL + SFT as well. More precisely, I believe o1 starts from a weaker, smaller base model than DeepSeek-R1 but compensates with RL + SFT and inference-time scaling.
3. As mentioned above, RL + SFT is the key approach for building high-performance reasoning models. DeepSeek-R1 is a nice blueprint showing how this can be done.

4. Distillation is an attractive approach, especially for creating smaller, more efficient models. However, the limitation is that distillation does not drive innovation or produce the next generation of reasoning models. For instance, distillation always depends on an existing, stronger model to generate the supervised fine-tuning (SFT) data.

One interesting aspect I expect to see next is combining RL + SFT (approach 3) with inference-time scaling (approach 1). This is likely what OpenAI's o1 is doing, except it is probably based on a weaker base model than DeepSeek-R1, which explains why DeepSeek-R1 performs so well while remaining relatively cheap at inference time.

---

*If you like this article, consider subscribing to my blog at https://magazine.sebastianraschka.com, where I post articles more regularly.*

---

## Thoughts about DeepSeek R1

In recent weeks, many people have asked for my thoughts on the DeepSeek-R1 models. In short, I think they are an awesome achievement. As a research engineer, I particularly appreciate the detailed technical report, which provides insights into their methodology that I can learn from.

One of the most fascinating takeaways is how reasoning emerged as a behavior from pure RL. And it's impressive that DeepSeek has open-sourced their models under a permissive MIT license, which has even fewer restrictions than Meta's Llama models.

**How does it compare to o1?**

Is DeepSeek-R1 better than o1? I'd say it's roughly in the same ballpark. However, what stands out is that DeepSeek-R1 is more efficient at inference time. This suggests that DeepSeek likely invested more heavily in the training process, while OpenAI may have relied more on inference-time scaling for o1.

That said, it's difficult to compare o1 and DeepSeek-R1 directly because OpenAI has not disclosed much about o1.
For instance, we don't know:

- Is o1 also a Mixture of Experts (MoE)?
- How large is o1?
- Could o1 just be a slightly refined version of GPT-4o with minimal RL + SFT and only extensive inference-time scaling?

Without knowing these details, a direct comparison remains an apples-to-oranges comparison.

**The cost of training DeepSeek-R1**

Another point of discussion has been the cost of developing DeepSeek-R1. Some have mentioned a ~$6 million training cost, but they likely conflated DeepSeek-V3 (the base model released in December last year) and DeepSeek-R1.

The $6 million estimate is based on an assumed $2 per GPU hour and the number of GPU hours required for the final training run of DeepSeek-V3, which was originally discussed back in December 2024.

However, the DeepSeek team has never disclosed the exact GPU hours or development cost for R1, so any cost estimates remain pure speculation.

Either way, ultimately, DeepSeek-R1 is a major milestone in open-weight reasoning models, and its efficiency at inference time makes it an interesting alternative to OpenAI's o1.

## Developing reasoning models on a limited budget

Developing a DeepSeek-R1-level reasoning model likely requires hundreds of thousands to millions of dollars, even when starting with an open-weight base model like DeepSeek-V3. This can feel discouraging for researchers or engineers working with limited budgets.

**The good news: Distillation can go a long way**

Fortunately, model distillation offers a more cost-effective alternative. The DeepSeek team demonstrated this with their R1-distilled models, which achieve surprisingly strong reasoning performance despite being significantly smaller than DeepSeek-R1. However, even this approach isn't entirely cheap. Their distillation process used 800K SFT samples, which requires substantial compute.

Interestingly, just a few days before DeepSeek-R1 was released, I came across an article about Sky-T1 (https://novasky-ai.github.io/posts/sky-t1/), a fascinating project where a small team trained an open-weight 32B model using only 17K SFT samples.
The total cost? Just $450, which is less than the registration fee for most AI conferences.

This example highlights that while large-scale training remains expensive, smaller, targeted fine-tuning efforts can still yield impressive results at a fraction of the cost.

[Figure: Figure from the "Sky-T1: Train your own O1 preview model within $450" article.]

According to their benchmarks, Sky-T1 performs roughly on par with o1, which is impressive given its low training cost.

**Pure RL on a budget: TinyZero**

While Sky-T1 focused on model distillation, I also came across some interesting work in the "pure RL" space. One notable example is TinyZero (https://github.com/Jiayi-Pan/TinyZero/), a 3B parameter model that replicates the DeepSeek-R1-Zero approach (side note: it costs less than $30 to train).

Surprisingly, even at just 3B parameters, TinyZero exhibits some emergent self-verification abilities, which supports the idea that reasoning can emerge through pure RL, even in small models.

The TinyZero repository mentions that a research report is still a work in progress, and I'll definitely be keeping an eye out for further details.

[Figure: A figure from the TinyZero repository.]

The two projects mentioned above demonstrate that interesting work on reasoning models is possible even with limited budgets. While both approaches replicate methods from DeepSeek-R1, one focusing on pure RL (TinyZero) and the other on pure SFT (Sky-T1), it would be fascinating to explore how these ideas can be extended further.

**Beyond Traditional SFT: Journey Learning**

One particularly interesting approach I came across last year is described in the paper *O1 Replication Journey: A Strategic Progress Report – Part 1* (https://arxiv.org/abs/2410.18982). Despite its title, the paper does not actually replicate o1. Instead, it introduces a different way to improve the distillation (pure SFT) process.
The key idea in the paper is "journey learning" as an alternative to "shortcut learning."

- Shortcut learning refers to the traditional approach in instruction fine-tuning, where models are trained using only correct solution paths.
- Journey learning, on the other hand, also includes incorrect solution paths, allowing the model to learn from mistakes.

This approach is somewhat related to the self-verification abilities observed in TinyZero's pure RL training, but it focuses on improving the model entirely through SFT. By exposing the model to incorrect reasoning paths and their corrections, journey learning may also reinforce self-correction abilities, potentially making reasoning models more reliable.

[Figure: Journey learning, as opposed to traditional shortcut learning, includes wrong solution paths in the SFT data. Annotated figure from O1 Replication Journey: A Strategic Progress Report – Part 1.]

This could be an exciting direction for future work, particularly for low-budget reasoning model development, where RL-based approaches may be computationally impractical.
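To make the idea more tangible, a journey-learning-style training sample might keep a flawed intermediate attempt together with its correction rather than only the clean solution path; the record below is a hypothetical illustration, not the format used in the paper.

```python
# Hypothetical journey-learning-style SFT sample (illustrative only, not the paper's format):
# the reasoning trace retains a wrong intermediate step plus its correction, instead of
# only the clean "shortcut" solution path.
journey_sample = {
    "question": "If a train is moving at 60 mph and travels for 3 hours, how far does it go?",
    "reasoning": (
        "Attempt: distance = speed / time = 60 / 3 = 20 miles. "
        "Wait, that is wrong: distance = speed * time. "
        "Corrected: 60 mph * 3 h = 180 miles."
    ),
    "answer": "180 miles",
}
```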
Anyways, a lot of interesting work is currently happening on the reasoning model front, and I'm sure we will see a lot more exciting work in the upcoming months!

---

*This magazine is a personal passion project. For those who wish to support me, please consider purchasing a copy of my Build a Large Language Model (From Scratch) book (https://amzn.to/4fqvn0D). (I am confident that you'll get lots out of this book, as it explains how LLMs work in a level of detail that is not found anywhere else.)*

[Figure: Build a Large Language Model (From Scratch).]

*If you read the book and have a few minutes to spare, I'd really appreciate a brief review (https://www.amazon.com/Build-Large-Language-Model-Scratch/dp/1633437167). It helps us authors a lot!*

**Your support means a great deal! Thank you!**

# Understanding Multimodal LLMs

Published November 3, 2024

It was a wild two months. There have once again been many developments in AI research, with two Nobel Prizes awarded to AI and several interesting research papers published.

Among others, Meta AI released their latest Llama 3.2 models, which include open-weight versions for the 1B and 3B large language models and two multimodal models.

In this article, I aim to explain how multimodal LLMs function. Additionally, I will review and summarize roughly a dozen other recent multimodal papers and models published in recent weeks (including Llama 3.2) to compare their approaches.

[Figure: An illustration of a multimodal LLM that can accept different input modalities (audio, text, images, and videos) and returns text as the output modality.]

**But before we begin, I also have some exciting news to share on the personal front! My book, *Build a Large Language Model (From Scratch)*, is now finally available on Amazon (https://amazon.com/Build-Large-Language-Model-Scratch/dp/1633437167)!**

[Figure: Build a Large Language Model (From Scratch), now available on Amazon.]

Writing this book was a tremendous effort, and I'm incredibly grateful for all the support and motivating feedback over the past two years, especially in these last couple of months, as so many kind readers have shared their feedback. Thank you all, and as an author, there is nothing more motivating than to hear that the book makes a difference in your careers!

For those who have finished the book and are eager for more, stay tuned! I'll be adding some bonus content to the GitHub repository in the coming months.
**P.S. If you have read the book, I'd really appreciate it if you could leave a brief review (https://www.amazon.com/Build-Large-Language-Model-Scratch/dp/1633437167/); it truly helps us authors!**

## 1. Use cases of multimodal LLMs

What are multimodal LLMs? As hinted at in the introduction, multimodal LLMs are large language models capable of processing multiple types of inputs, where each "modality" refers to a specific type of data, such as text (like in traditional LLMs), sound, images, videos, and more. For simplicity, we will primarily focus on the image modality alongside text inputs.

A classic and intuitive application of multimodal LLMs is image captioning: you provide an input image, and the model generates a description of the image, as shown in the figure below.

[Figure: Example use of a multimodal LLM explaining a meme.]

Of course, there are many other use cases. For example, one of my favorites is extracting information from a PDF table and converting it into LaTeX or Markdown.

## 2. Common approaches to building multimodal LLMs

There are two main approaches to building multimodal LLMs:

A. a Unified Embedding Decoder Architecture approach;
B. a Cross-Modality Attention Architecture approach.

(By the way, I don't believe official terms for these techniques exist yet, but let me know if you've come across any. For instance, briefer descriptions may be "decoder-only" and "cross-attention-based" approaches.)

[Figure: The two main approaches to developing multimodal LLM architectures.]

As shown in the figure above, the **Unified Embedding Decoder Architecture** utilizes a single decoder model, much like an unmodified LLM architecture such as GPT-2 or Llama 3.2. In this approach, images are converted into tokens with the same embedding size as the original text tokens, allowing the LLM to process both text and image input tokens together after concatenation.
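As a rough PyTorch sketch of that idea (the vocabulary size, sequence lengths, and embedding dimension are illustrative assumptions, not tied to a specific model), the projected image embeddings are simply concatenated with the text token embeddings before being passed to the decoder:

```python
import torch
import torch.nn as nn

# Rough sketch of the unified embedding-decoder input (all sizes are illustrative).
embed_dim = 768
text_embedding = nn.Embedding(num_embeddings=50_000, embedding_dim=embed_dim)

text_token_ids = torch.randint(0, 50_000, (1, 12))       # 12 text tokens
image_token_embeddings = torch.randn(1, 16, embed_dim)   # 16 image "tokens" from an encoder + projector

text_token_embeddings = text_embedding(text_token_ids)   # (1, 12, 768)
decoder_inputs = torch.cat([image_token_embeddings, text_token_embeddings], dim=1)

# `decoder_inputs` (shape: 1 x 28 x 768) can now be processed by an unmodified
# decoder-style LLM that accepts input embeddings directly.
print(decoder_inputs.shape)  # torch.Size([1, 28, 768])
```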
In this approach, images are converted into tokens with the same embedding size as the original text tokens, allowing the LLM to process both text and image input tokens together after concatenation.\u003C/p\u003E\u003Cp\u003EThe \u003Cstrong\u003E\u003Cem\u003ECross-Modality Attention Architecture\u003C/em\u003E\u003C/strong\u003E employs a cross-attention mechanism to integrate image and text embeddings directly within the attention layer.\u003C/p\u003E\u003Cp\u003EIn the following sections, we will explore how these methods work on a conceptual level. Then, we will look at recent research papers on multimodal LLMs to see how they are applied in practice.\u003C/p\u003E\u003Ch3\u003E2.1 Method A: Unified Embedding Decoder Architecture\u003C/h3\u003E\u003Cp\u003ELet’s begin with the unified embedding decoder architecture, illustrated again in the figure below.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEvgE-B1D6SFA/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1730576161806?e=1748476800&v=beta&t=S--VmlRuIVjQJHLa-jrwOW6D8yM1v9S1HE1CgIFFfg8\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EIllustration of the unified embedding decoder architecture, which is an unmodified decoder-style LLM (like GPT-2, Phi-3, Gemma, or Llama 3.2) that receives inputs consisting of image token and text token embeddings.\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EIn the unified embedding-decoder architecture, an image is converted into embedding vectors, similar to how input text is converted into embeddings in a standard text-only LLM.\u003C/p\u003E\u003Cp\u003EFor a typical text-only LLM that processes text, the text input is usually tokenized (e.g., using Byte-Pair Encoding) and then passed through an embedding layer, as shown in the figure below.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQGHqyIdRYLKqA/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1730576181573?e=1748476800&v=beta&t=ysbZlNlJab-1fHsjqza1NUx6uQYSJPZ5nFbyQtuozHw\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EIllustration of the standard process for tokenizing text and converting it into token embedding vectors, which are subsequently passed to an LLM during training and inference.\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E2.1.1 Understanding Image encoders\u003C/h3\u003E\u003Cp\u003EAnalogous to the tokenization and embedding of text, image embeddings are generated using an image encoder module (instead of a tokenizer), as shown in the figure below.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEyFfz-mnylZw/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1730576207502?e=1748476800&v=beta&t=F1ta1w1lrSs-m-GqdRHhfJz7KEx_i3yNsjihuIH-xEs\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EIllustration of the process for encoding an image into image patch embeddings.\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EWhat happens inside the image encoder shown above? To process an image, we first divide it into smaller patches, much like breaking words into subwords during tokenization. 
These patches are then encoded by a pretrained vision transformer (ViT), as shown in the figure below.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQFLU6GFgtAq_w/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1730576341080?e=1748476800&v=beta&t=ZditLmXY4AYJ46zgfhE85iyvuhJF8bBJXNEafu0GgX8\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EIllustration of a classic vision transformer (ViT) setup, similar to the model proposed in \u003C/em\u003E\u003Ca href=\"https://arxiv.org/abs/2010.11929\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003EAn Image is Worth 16x16 Words: Transformers for Image Recognition at Scale\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E (2020).\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003ENote that ViTs are often used for classification tasks, so I included the classification head in the figure above. However, in this case, we only need the image encoder part.\u003C/p\u003E\u003Ch3\u003E2.1.2 The role of the linear projection module\u003C/h3\u003E\u003Cp\u003EThe \"linear projection\" shown in the previous figure consists of a single linear layer (i.e., a fully connected layer). The purpose of this layer is to project the image patches, which are flattened into a vector, into an embedding size compatible with the transformer encoder. This linear projection is illustrated in the figure below. An image patch, flattened into a 256-dimensional vector, is up-projected to a 768-dimensional vector.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQHoFZstO_bOXg/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1730576357807?e=1748476800&v=beta&t=LVBHQ_CuM4c3NT-FH5SyLQTDVLTGBqXVycXhUijJOII\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EIllustration of a linear projection layer that projects flattened image patches from a 256-dimensional into a 768-dimensional embedding space\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EFor those who prefer seeing a code example: in PyTorch, we could implement the linear projection for the image patches as follows:\u003C/p\u003E\u003Cpre\u003E\u003C/pre\u003E\u003Cp\u003EIf you have read my \u003Ca href=\"https://www.amazon.com/Machine-Learning-AI-Essential-Questions/dp/1718503768/\" target=\"_blank\" rel=\"nofollow noopener\"\u003EMachine Learning Q and AI\u003C/a\u003E book by chance, you may know there are ways to replace linear layers with convolution operations that can be implemented to be mathematically equivalent.
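\u003Cp\u003ETo make this concrete, below is a rough PyTorch sketch of both variants: the plain linear projection described above, and a convolution that folds patch extraction and projection into a single step. (The 256- and 768-dimensional sizes follow the example from the figure; the single-channel 16×16 patch shape, the small 48×48 example image, and the variable names are illustrative assumptions rather than the exact code from the book.)\u003C/p\u003E\u003Cpre\u003Eimport torch
import torch.nn as nn

torch.manual_seed(123)

# Variant 1: flatten a single 16x16 grayscale patch (16 * 16 = 256 values)
# and project it from 256 to 768 dimensions with a linear layer
patch = torch.randn(1, 1, 16, 16)             # (batch, channels, height, width)
flattened_patch = patch.flatten(start_dim=1)  # shape: (1, 256)

linear_projection = nn.Linear(in_features=256, out_features=768)
print(linear_projection(flattened_patch).shape)  # torch.Size([1, 768])

# Variant 2: a convolution whose kernel size and stride equal the patch size
# extracts the patches and projects them in one step
image = torch.randn(1, 1, 48, 48)             # a tiny example image (3x3 grid of patches)
conv_projection = nn.Conv2d(in_channels=1, out_channels=768,
                            kernel_size=16, stride=16)

patch_embeddings = conv_projection(image)                      # shape: (1, 768, 3, 3)
patch_embeddings = patch_embeddings.flatten(2).transpose(1, 2)
print(patch_embeddings.shape)                                  # torch.Size([1, 9, 768])\u003C/pre\u003E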
Here, this can be especially handy as we can combine the creation of patches and projection into two lines of code:\u003C/p\u003E\u003Cpre\u003E\u003C/pre\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E2.1.3 Image vs text tokenization\u003C/h3\u003E\u003Cp\u003ENow that we briefly discussed the purpose of the image encoder (and the linear projection that is part of the encoder), let's return to the text tokenization analogy from earlier and look at text and image tokenization and embedding side by side, as depicted in the figure below.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQHTv-Viuov8HA/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1730576375259?e=1748476800&v=beta&t=WqR18NIYLE4ZoN7aZ2yMhDCBRp1TDpCljtYzUNF5Qe0\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EImage tokenization and embedding (left) and text tokenization and embedding (right) side by side.\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EAs you can see in the figure above, I included an additional \u003Cstrong\u003E\u003Cem\u003Eprojector\u003C/em\u003E\u003C/strong\u003E module that follows the image encoder. This \u003Cem\u003Eprojector\u003C/em\u003E is usually just another \u003Cstrong\u003E\u003Cem\u003Elinear projection\u003C/em\u003E\u003C/strong\u003E layer that is similar to the one explained earlier. The purpose is to project the image encoder outputs into a dimension that matches the dimensions of the embedded text tokens, as illustrated in the figure below. (As we will see later, the projector is sometimes also called adapter, adaptor, or connector.)\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQETbap8nrylgA/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1730576390416?e=1748476800&v=beta&t=dBdHzmIniIOX7HFC5CXg4f0rK7c11tnMGUmxNViaWF0\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EAnother side-by-side comparison between image tokenization and text tokenization, where the role of the projector is to match the text token embedding dimensions.\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003ENow that the image patch embeddings have the same embedding dimension as the text token embeddings, we can simply concatenate them as input to the LLM, as shown in the figure at the beginning of this section. Below is the same figure again for easier reference.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQF6wKTPch4OgQ/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1730576503104?e=1748476800&v=beta&t=DPj4aw0Q_ocmeiN7rE9-HacZva2OCiUoANEQdN4RRsw\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EAfter projecting the image patch tokens into the same dimension as the text token embeddings, we can simply concatenate them as input to a standard LLM\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EBy the way, the image encoder we discussed in this section is usually a pretrained vision transformer. 
A popular choice is \u003Ca href=\"https://github.com/openai/CLIP\" target=\"_blank\" rel=\"nofollow noopener\"\u003ECLIP\u003C/a\u003E or \u003Ca href=\"https://github.com/mlfoundations/open_clip\" target=\"_blank\" rel=\"nofollow noopener\"\u003EOpenCLIP\u003C/a\u003E.\u003C/p\u003E\u003Cp\u003EHowever, there are also versions of Method A that operate directly on patches, such as \u003Ca href=\"https://www.adept.ai/blog/fuyu-8b\" target=\"_blank\" rel=\"nofollow noopener\"\u003EFuyu\u003C/a\u003E, which is shown in the figure below.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEyZmZqfUmJaA/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1730576521194?e=1748476800&v=beta&t=Tflv1xJPiM6t5FM9wBzzvLebPj4OiGoTys4EbJfeFz8\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EAnnotated figure of the Fuyu multimodal LLM that operates directly on the image patches without image encoder. (Annotated figure from \u003C/em\u003E\u003Ca href=\"https://www.adept.ai/blog/fuyu-8b\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehttps://www.adept.ai/blog/fuyu-8b\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E.)\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EAs illustrated in the figure above, Fuyu passes the input patches directly into a linear projection (or embedding layer) to learn its own image patch embeddings rather than relying on an additional pretrained image encoder like other models and methods do. This greatly simplifies the architecture and training setup.\u003C/p\u003E\u003Ch3\u003E2.2 Method B: Cross-Modality Attention Architecture\u003C/h3\u003E\u003Cp\u003ENow that we have discussed the unified embedding decoder architecture approach to building multimodal LLMs and understand the basic concept behind image encoding, let's talk about an alternative way of implementing multimodal LLMs via cross-attention, as summarized in the figure below.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQFR4emabxtNDg/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1730576578761?e=1748476800&v=beta&t=ByMyIHiFA9-W2ClRVFUIKLr7UohJJ1RBsFOxIC-lzog\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EAn illustration of the Cross-Modality Attention Architecture approach to building multimodal LLMs.\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EIn the Cross-Modality Attention Architecture method depicted in the figure above, we still use the same image encoder setup we discussed previously. 
However, instead of encoding the patches as input to the LLM, we connect the input patches in the multi-head attention layer via a cross-attention mechanism.\u003C/p\u003E\u003Cp\u003EThe idea is related and goes back to the original transformer architecture from the 2017 \u003Ca href=\"https://arxiv.org/abs/1706.03762\" target=\"_blank\" rel=\"nofollow noopener\"\u003EAttention Is All You Need\u003C/a\u003E paper, highlighted in the figure below.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQFARiFgRDt9WQ/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1730576637677?e=1748476800&v=beta&t=aozkLnLH7Ul5RihAtUPpxb2F_jBZI3EGE6HQiDFuF_Y\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EHigh-level illustration of the cross-attention mechanism used in the original transformer architecture. (Annotated figure from the \"Attention Is All You Need\" paper: \u003C/em\u003E\u003Ca href=\"https://arxiv.org/abs/1706.03762\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehttps://arxiv.org/abs/1706.03762\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E.)\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003ENote that the original \"Attention Is All You Need\" transformer depicted in the figure above was originally developed for language translation. So, it consists of a text \u003Cstrong\u003Een\u003C/strong\u003Ecoder (left part of the figure) that takes the sentence to be translated and generates the translation via a text \u003Cstrong\u003Ede\u003C/strong\u003Ecoder (right part of the figure). In the context of multimodal LLM, the encoder is an image encoder instead of a text encoder, but the same idea applies.\u003C/p\u003E\u003Cp\u003EHow does cross-attention work? Let's have a look at a conceptual drawing of what happens inside the regular self-attention mechanism.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQGWz78RSZmqVw/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1730576654357?e=1748476800&v=beta&t=Q7LLsFvSpRNlEsR-PAOHoSJ7FLKnfU-YphCtiQXgU2E\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EOutline of the regular self-attention mechanism. (This flow depicts one of the heads in a regular multi-head attention module.)\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EIn the figure above, x is the input, and \u003Cem\u003EWq\u003C/em\u003E is a weight matrix used to generate the queries (\u003Cem\u003EQ\u003C/em\u003E). Similarly, \u003Cem\u003EK\u003C/em\u003E stands for keys, and \u003Cem\u003EV\u003C/em\u003E stands for values. A represents the attention scores matrix, and \u003Cem\u003EZ\u003C/em\u003E are the inputs (x) transformed into the output context vectors. 
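\u003Cp\u003EFor readers who prefer code over symbols, here is a minimal single-head sketch of this computation (the dimensions, sequence lengths, and variable names are made up for illustration, and masking as well as multi-head logic are omitted). It accepts two inputs so that it also covers the cross-attention case discussed next: the queries come from the first input, while the keys and values come from the second.\u003C/p\u003E\u003Cpre\u003Eimport torch
import torch.nn as nn

torch.manual_seed(123)

d_in, d_out = 768, 64   # illustrative embedding and head dimensions

W_query = nn.Linear(d_in, d_out, bias=False)
W_key = nn.Linear(d_in, d_out, bias=False)
W_value = nn.Linear(d_in, d_out, bias=False)

def attention(x_1, x_2):
    # Queries come from x_1; keys and values come from x_2.
    # Setting x_2 equal to x_1 recovers regular self-attention.
    queries = W_query(x_1)
    keys = W_key(x_2)
    values = W_value(x_2)
    attn_scores = queries @ keys.transpose(-2, -1)                 # attention scores (A)
    attn_weights = torch.softmax(attn_scores / d_out**0.5, dim=-1)
    return attn_weights @ values                                   # context vectors (Z)

text_tokens = torch.randn(1, 6, d_in)    # e.g., 6 embedded text tokens
image_tokens = torch.randn(1, 9, d_in)   # e.g., 9 embedded image patches

print(attention(text_tokens, text_tokens).shape)   # self-attention:  (1, 6, 64)
print(attention(text_tokens, image_tokens).shape)  # cross-attention: (1, 6, 64)\u003C/pre\u003E\u003Cp\u003ENote that the output of the cross-attention call has as many rows as the first (query) input, regardless of how many image tokens are attended to.\u003C/p\u003E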
(If this seems confusing, you may find a comprehensive introduction in Chapter 3 of my \u003Ca href=\"https://www.amazon.com/Build-Large-Language-Model-Scratch/dp/1633437167/\" target=\"_blank\" rel=\"nofollow noopener\"\u003EBuild a Large Language Model from Scratch book\u003C/a\u003E helpful; alternatively, you may also find my article, \u003Ca href=\"http://d/\" target=\"_blank\" rel=\"nofollow noopener\"\u003EUnderstanding and Coding Self-Attention, Multi-Head Attention, Cross-Attention, and Causal-Attention in LLMs\u003C/a\u003E helpful here.)\u003C/p\u003E\u003Cp\u003EIn cross-attention, in contrast to self-attention, we have two different input sources, as illustrated in the following figure.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEdvb0kMCfmNQ/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1730576672170?e=1748476800&v=beta&t=3vpHHte3FLWLOTcqnPaMUiPCNiaegBeOWqdjk8i4cE4\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EIllustration of cross attention, where there can be two different inputs x1 and x2\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EAs illustrated in the previous two figures, in self-attention, we work with the same input sequence. In cross-attention, we mix or combine two different input sequences. \u003C/p\u003E\u003Cp\u003EIn the case of the original transformer architecture in the \u003Cem\u003EAttention Is All You Need\u003C/em\u003E paper, the two inputs \u003Cem\u003Ex1\u003C/em\u003E and \u003Cem\u003Ex2\u003C/em\u003E correspond to the sequence returned by the encoder module on the left (\u003Cem\u003Ex2\u003C/em\u003E) and the input sequence being processed by the decoder part on the right (\u003Cem\u003Ex1\u003C/em\u003E). In the context of a multimodal LLM, \u003Cem\u003Ex2\u003C/em\u003E is the output of an image encoder. (Note that the queries usually come from the decoder, and the keys and values typically come from the encoder.)\u003C/p\u003E\u003Cp\u003ENote that in cross-attention, the two input sequences \u003Cem\u003Ex1\u003C/em\u003E and \u003Cem\u003Ex2\u003C/em\u003E can have different numbers of elements. However, their embedding dimensions must match. If we set \u003Cem\u003Ex1 = x2\u003C/em\u003E, this is equivalent to self-attention.\u003C/p\u003E\u003Ch2\u003E3. Unified decoder and cross-attention model training\u003C/h2\u003E\u003Cp\u003ENow that we have talked a bit about the two major multimodal design choices, let's briefly talk about how we deal with the three major components during model training, which are summarized in the figure below.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQHPVdrBeSuu-w/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1730576708989?e=1748476800&v=beta&t=5F02dMgsNw-dc_JMS2s7Ds1FDHncUhLJcDW657I1v74\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EAn overview of the different components in a multimodal LLM. The components numbered 1-3 can be frozen or unfrozen during the multimodal training process.\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003ESimilar to the development of traditional text-only LLMs, the training of multimodal LLMs also involves two phases: pretraining and instruction finetuning. 
However, unlike starting from scratch, multimodal LLM training typically begins with a pretrained, instruction-finetuned text-only LLM as the base model.\u003C/p\u003E\u003Cp\u003EFor the image encoder, CLIP is commonly used and often remains unchanged during the entire training process, though there are exceptions, as we will explore later. Keeping the LLM part frozen during the pretraining phase is also usual, focusing only on training the projector—a linear layer or a small multi-layer perceptron. Given the projector's limited learning capacity, usually comprising just one or two layers, the LLM is often unfrozen during multimodal instruction finetuning (stage 2) to allow for more comprehensive updates. However, note that in the cross-attention-based models (Method B), the cross-attention layers are unfrozen throughout the entire training process.\u003C/p\u003E\u003Cp\u003EAfter introducing the two primary approaches (Method A: Unified Embedding Decoder Architecture and Method B: Cross-modality Attention Architecture), you might be wondering which is more effective. The answer depends on specific trade-offs.\u003C/p\u003E\u003Cp\u003EThe Unified Embedding Decoder Architecture (Method A) is typically easier to implement since it doesn't require any modifications to the LLM architecture itself.\u003C/p\u003E\u003Cp\u003EThe Cross-modality Attention Architecture (Method B) is often considered more computationally efficient because it doesn't overload the input context with additional image tokens, introducing them later in the cross-attention layers instead. Additionally, this approach maintains the text-only performance of the original LLM if the LLM parameters are kept frozen during training.\u003C/p\u003E\u003Cp\u003EWe will revisit the discussion on modeling performance and response quality in a later section, where we will discuss NVIDIA's NVLM paper.\u003C/p\u003E\u003Cp\u003EThis marks the end of what turned out to be a rather extensive introduction to multimodal LLMs. As I write this, I realize that the discussion has become lengthier than initially planned, which probably makes this a good place to conclude the article. \u003C/p\u003E\u003Cp\u003EHowever, to provide a practical perspective, it would be nice to examine a few recent research papers that implement these approaches. So, we will explore these papers in the remaining sections of this article.\u003C/p\u003E\u003Ch2\u003E4. Recent multimodal models and methods\u003C/h2\u003E\u003Cp\u003EFor the remainder of this article, I will review recent literature concerning multimodal LLMs, focusing specifically on works published in the last few weeks to maintain a reasonable scope.\u003C/p\u003E\u003Cp\u003EThus, this is not a historical overview or comprehensive review of multimodal LLMs but rather a brief look at the latest developments. I will also try to keep these summaries short and without too much fluff as there are 10 of them. \u003C/p\u003E\u003Cp\u003EThe conclusion section at the end of this has an overview that compares the methods used in these papers.\u003C/p\u003E\u003Ch3\u003E4.1 The Llama 3 Herd of Models\u003C/h3\u003E\u003Cp\u003E\u003Ca href=\"https://arxiv.org/abs/2407.21783\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003EThe Llama 3 Herd of Models\u003C/em\u003E\u003C/a\u003E paper (July 31, 2024) by Meta AI came out earlier this summer, which feels like ages ago in LLM terms. 
However, given that they only described but did not release their multimodal models until much later, I think it's fair to include Llama 3 in this list. (Llama 3.2 models were officially announced and made available on September 25.)\u003C/p\u003E\u003Cp\u003EThe multimodal Llama 3.2 models, which come in 11-billion- and 90-billion-parameter versions, are image-text models that use the previously described cross-attention-based approach, which is illustrated in the figure below.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEvSpVCRWqGpQ/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1730576730288?e=1748476800&v=beta&t=5Gg4Jcn-FaDbZg_m8Gqk9p4rIh8ePwRqTCVYCEJuTwk\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EIllustration of the multimodal LLM approach used by Llama 3.2. (Annotated figure from the Llama 3 paper: \u003C/em\u003E\u003Ca href=\"https://arxiv.org/abs/2407.21783\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehttps://arxiv.org/abs/2407.21783\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E. The video and speech parts are visually occluded to focus the attention on the image part.)\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003ENote that while the figure also depicts video and speech as possible modalities, the models that were released as of this writing focus only on image and text.\u003C/p\u003E\u003Cp\u003ELlama 3.2 uses the cross-attention-based approach. However, it differs a bit from what I wrote about earlier, namely that in multimodal LLM development, we usually freeze the image encoder and only update the LLM parameters during pretraining.\u003C/p\u003E\u003Cp\u003EHere, the researchers almost take the opposite approach: they update the image encoder but do not update the language model's parameters. They write that this is intentional and done to preserve the text-only capabilities so that the 11B and 90B multimodal models can be used as drop-in replacements for the Llama 3.1 8B and 70B text-only models on text tasks.\u003C/p\u003E\u003Cp\u003EThe training itself is done in multiple iterations, starting with the Llama 3.1 text models. After adding the image encoder and projection (here called \"adapter\") layers, they pretrain the model on image-text data. Then, similar to the Llama 3 model text-only training (I wrote about it in an earlier article), they follow up with instruction and preference finetuning.\u003C/p\u003E\u003Cp\u003EInstead of adopting a pretrained model such as CLIP as an image encoder, the researchers used a vision transformer that they pretrained from scratch. Specifically, they adopted the ViT-H/14 variant (630 million parameters) of the classic vision transformer architecture (\u003Ca href=\"https://arxiv.org/abs/2010.11929\" target=\"_blank\" rel=\"nofollow noopener\"\u003EDosovitskiy et al., 2020\u003C/a\u003E). They then pretrained the ViT on a dataset of 2.5 billion image-text pairs over five epochs; this was done before connecting the image encoder to the LLM. (The image encoder operates on images with a resolution of 224×224, dividing them into 16×16 patches of uniform size.)\u003C/p\u003E\u003Cp\u003EAs the cross-attention layers add a substantial number of parameters, they are only added in every fourth transformer block.
(For the 8B model, this adds 3B parameters, and for the 70B model, this adds 20B parameters.)\u003C/p\u003E\u003Ch3\u003E4.2 Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models\u003C/h3\u003E\u003Cp\u003E\u003Ca href=\"https://www.arxiv.org/abs/2409.17146\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003EThe Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models\u003C/em\u003E\u003C/a\u003E paper (September 25, 2024) is notable because it promises to open source not only the model weights but also the dataset and source code, similar to the language-only OLMo LLM. (This is great for LLM research as it allows us to take a look at the exact training procedure and code and also lets us run ablation studies and reproduce results on the same dataset.)\u003C/p\u003E\u003Cp\u003EIf you are wondering why there are two names in the paper title, Molmo refers to the model (Multimodal Open Language Model), and PixMo (Pixels for Molmo) is the dataset.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQH9WfUUFljlQw/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1730576749807?e=1748476800&v=beta&t=Q7tdyhAv_aaab-IoNTdfUQGJA8KYhVm8rwW0uy1SicY\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EIllustration of the Molmo decoder-only approach (Method A). Annotated figure adapted from the Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models paper: \u003C/em\u003E\u003Ca href=\"https://www.arxiv.org/abs/2409.17146\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehttps://www.arxiv.org/abs/2409.17146\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E.\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EAs illustrated in the figure above, the image encoder employs an off-the-shelf vision transformer, specifically CLIP.
The term \"connector\" here refers to a \"projector\" that aligns image features with the language model.\u003C/p\u003E\u003Cp\u003EMolmo streamlines the training process by avoiding multiple pretraining stages, choosing instead a simple pipeline that updates all parameters in a unified approach—including those of the base LLM, the connector, and the image encoder.\u003C/p\u003E\u003Cp\u003EThe Molmo team offers several options for the base LLM:\u003C/p\u003E\u003Cul\u003E\u003Cli\u003E\u003Cp\u003EOLMo-7B-1024 (a fully open model backbone),\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EOLMoE-1B-7B (a mixture-of-experts architecture; the most efficient model),\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EQwen2 7B (an open-weight model that performs better than OLMo-7B-1024),\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EQwen2 72B (an open-weight model and the best-performing model)\u003C/p\u003E\u003C/li\u003E\u003C/ul\u003E\u003Ch3\u003E4.3 NVLM: Open Frontier-Class Multimodal LLMs\u003C/h3\u003E\u003Cp\u003ENVIDIA's \u003Ca href=\"https://arxiv.org/abs/2409.11402\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003ENVLM: Open Frontier-Class Multimodal LLMs\u003C/em\u003E\u003C/a\u003E paper (September 17, 2024) is particularly interesting because, rather than focusing on a single approach, it explores both methods: \u003C/p\u003E\u003Cul\u003E\u003Cli\u003E\u003Cp\u003EMethod A, the Unified Embedding Decoder Architecture (\"decoder-only architecture,\" NVLM-D), and \u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EMethod B, the Cross-Modality Attention Architecture (\"cross-attention-based architecture,\" NVLM-X). \u003C/p\u003E\u003C/li\u003E\u003C/ul\u003E\u003Cp\u003EAdditionally, they develop a hybrid approach (NVLM-H) and provide an apples-to-apples comparison of all three methods.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQFIe3AiIeXjUQ/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1730576776482?e=1748476800&v=beta&t=jbpcpyUe03hhVTt6n03fxdNnedJqkfSZA64u8CLzQZA\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EOverview of the three multimodal approaches. (Annotated figure from the NVLM: Open Frontier-Class Multimodal LLMs paper: \u003C/em\u003E\u003Ca href=\"https://arxiv.org/abs/2409.11402\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehttps://arxiv.org/abs/2409.11402\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E)\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EAs summarized in the figure below, NVLM-D corresponds to Method A, and NVLM-X corresponds to Method B, as discussed earlier. 
The concept behind the hybrid model (NVLM-H) is to combine the strengths of both methods: an image thumbnail is provided as input, followed by a dynamic number of patches passed through cross-attention to capture finer high-resolution details.\u003C/p\u003E\u003Cp\u003EIn short, the research team find that:\u003C/p\u003E\u003Cul\u003E\u003Cli\u003E\u003Cp\u003ENVLM-X demonstrates superior computational efficiency for high-resolution images.\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003ENVLM-D achieves higher accuracy in OCR-related tasks.\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003ENVLM-H combines the advantages of both methods.\u003C/p\u003E\u003C/li\u003E\u003C/ul\u003E\u003Cp\u003ESimilar to Molmo and other approaches, they begin with a text-only LLM rather than pretraining a multimodal model from scratch (as this generally performs better). Additionally, they use an instruction-tuned LLM instead of a base LLM. Specifically, the backbone LLM is Qwen2-72B-Instruct (to my knowledge, Molmo used the Qwen2-72B base model).\u003C/p\u003E\u003Cp\u003EWhile training all LLM parameters in the NVLM-D approach, they found that for NVLM-X, it works well to freeze the original LLM parameters and train only the cross-attention layers during both pretraining and instruction finetuning.\u003C/p\u003E\u003Cp\u003EFor the image encoder, instead of using a typical CLIP model, they use \u003Ca href=\"https://arxiv.org/abs/2312.14238\" target=\"_blank\" rel=\"nofollow noopener\"\u003EInternViT-6B\u003C/a\u003E, which remains frozen throughout all stages.\u003C/p\u003E\u003Cp\u003EThe projector is a multilayer perceptron rather than a single linear layer.\u003C/p\u003E\u003Ch3\u003E4.4 Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution\u003C/h3\u003E\u003Cp\u003EThe previous two papers and models, Molmo and NVLM, were based on Qwen2-72B LLM. In this paper, the Qwen research team itself announces a multimodal LLM, \u003Ca href=\"https://arxiv.org/abs/2409.12191\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003EQwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution\u003C/em\u003E\u003C/a\u003E (October 3rd, 2024).\u003C/p\u003E\u003Cp\u003EAt the core of this work is their so-called \"Naive Dynamic Resolution\" mechanism (the term \"naive\" is intentional and not a typo for \"native,\" though \"native\" could also be fitting). This mechanism allows the model to handle images of varying resolutions without simple downsampling, enabling the input of images in their original resolution.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQH9JQnQdWRbbQ/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1730576796924?e=1748476800&v=beta&t=H8gUOYGUEN_CCyJ7ZOSxziqThk_VCvs22SwKskjxKvU\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EAn overview of the multimodal Qwen model, which can process input images with various different resolutions natively. 
(Annotated figure from the Qwen2-VL paper: \u003C/em\u003E\u003Ca href=\"https://arxiv.org/abs/2409.12191\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehttps://arxiv.org/abs/2409.12191\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E)\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EThe native resolution input is implemented via a modified ViT by removing the original absolute position embeddings and introducing 2D-RoPE.\u003C/p\u003E\u003Cp\u003EThey used a classic vision encoder with 675M parameters and LLM backbones of varying sizes, as shown in the table below.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQFfFZ_mMITcRg/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1730576819375?e=1748476800&v=beta&t=aijtBJBKjAVwrd85NhJ5q3ApfZypx4-lT8eClIcWn_Q\" src=\"//:0\"\u003E\u003Cfigcaption\u003EThe components of the different Qwen2-VL models. (Annotated figure from the Qwen2-VL paper: \u003Ca href=\"https://arxiv.org/abs/2409.12191\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehttps://arxiv.org/abs/2409.12191\u003C/a\u003E)\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EThe training itself consists of 3 stages: (1) pretraining only the image encoder, (2) unfreezing all parameters (including LLM), and (3) freezing the image encoder and instruction-finetuning only the LLM.\u003C/p\u003E\u003Ch3\u003E4.5 Pixtral 12B\u003C/h3\u003E\u003Cp\u003E\u003Ca href=\"https://mistral.ai/news/pixtral-12b/\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003EPixtral 12B\u003C/em\u003E\u003C/a\u003E (September 17, 2024), which uses the Method A: Unified Embedding Decoder Architecture approach, is the first multimodal model from Mistral AI. Unfortunately, there is no technical paper or report available, but the Mistral team shared a few interesting tidbits in their \u003Ca href=\"https://mistral.ai/news/pixtral-12b/\" target=\"_blank\" rel=\"nofollow noopener\"\u003Eblog post\u003C/a\u003E.\u003C/p\u003E\u003Cp\u003EInterestingly, they chose not to use a pretrained image encoder, instead training one with 400 million parameters from scratch. For the LLM backbone, they used the 12-billion-parameter \u003Ca href=\"https://mistral.ai/news/mistral-nemo/\" target=\"_blank\" rel=\"nofollow noopener\"\u003EMistral NeMo\u003C/a\u003E model.\u003C/p\u003E\u003Cp\u003ESimilar to Qwen2-VL, Pixtral also supports variable image sizes natively, as illustrated in the figure below.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQE5TGtj2SIC5A/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1730576831405?e=1748476800&v=beta&t=rqS2pCiLDQk48ZyOX_9JiW6_6bmWFS7d9A6ISbA2qdU\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EIllustration of how Pixtral processes images of different sizes. 
(Annotated figure from the Pixtral blog post: \u003C/em\u003E\u003Ca href=\"https://mistral.ai/news/pixtral-12b/\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehttps://mistral.ai/news/pixtral-12b/\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E)\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E4.6 MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning\u003C/h3\u003E\u003Cp\u003EThe \u003Ca href=\"https://arxiv.org/abs/2409.20566\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003EMM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning\u003C/em\u003E\u003C/a\u003E paper (September 30, 2024) provides practical tips and introduces a mixture-of-experts multimodal model alongside a dense model similar to Molmo. The models span a wide size range, from 1 billion to 30 billion parameters.\u003C/p\u003E\u003Cp\u003EThe models described in this paper focus on Method A, the Unified Embedding Decoder Architecture, which structures inputs effectively for multimodal learning.\u003C/p\u003E\u003Cp\u003EIn addition, the paper has a series of interesting ablation studies looking into data mixtures and the effects of using coordinate tokens. \u003C/p\u003E\u003Ch2\u003E\u003C/h2\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQF8ZjcYc9FcuQ/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1730576845541?e=1748476800&v=beta&t=yqEjP46YGhxxiWQyVvZfexI2eV_9CZDTa2o14xD_mRU\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EIllustration of the MM1.5 approach, which includes additional coordinate tokens to denote bounding boxes. (Annotated figure from the MM1.5 paper: \u003C/em\u003E\u003Ca href=\"https://arxiv.org/abs/2409.20566\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehttps://arxiv.org/abs/2409.20566\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E.)\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E4.7 Aria: An Open Multimodal Native Mixture-of-Experts Model\u003C/h3\u003E\u003Cp\u003EThe \u003Ca href=\"https://arxiv.org/abs/2410.05993\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003EAria: An Open Multimodal Native Mixture-of-Experts Model\u003C/em\u003E\u003C/a\u003E paper (October 8, 2024) introduces another mixture-of-experts model approach, similar to one of the variants in the Molmo and MM1.5 lineups. \u003C/p\u003E\u003Cp\u003EThe Aria model has 24.9 billion parameters, with 3.5 billion parameters allocated per text token.
The image encoder (\u003Ca href=\"https://arxiv.org/abs/2303.15343\" target=\"_blank\" rel=\"nofollow noopener\"\u003ESigLIP\u003C/a\u003E) has 438 million parameters.\u003C/p\u003E\u003Cp\u003EThis model is based on a cross-attention approach with the following overall training procedure:\u003C/p\u003E\u003Col\u003E\u003Cli\u003E\u003Cp\u003ETraining the LLM backbone entirely from scratch.\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EPretraining both the LLM backbone and the vision encoder.\u003C/p\u003E\u003C/li\u003E\u003C/ol\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E4.8 Baichuan-Omni\u003C/h3\u003E\u003Cp\u003EThe \u003Ca href=\"https://arxiv.org/abs/2410.08565\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003EBaichuan-Omni Technical Report\u003C/em\u003E\u003C/a\u003E (October 11, 2024) introduces Baichuan-Omni, a 7-billion-parameter multimodal LLM based on Method A: the Unified Embedding Decoder Architecture approach, as shown in the figure below.\u003C/p\u003E\u003Ch2\u003E\u003C/h2\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQGTkAy63bke-w/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1730576864708?e=1748476800&v=beta&t=1hBjrMIr2BQg_MHq7JjbOcRSzWyc6TV8XCImv94CxzE\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EAn overview of the Baichuan-Omni model, which can handle various input modalities. (Annotated figure from the Baichuan-Omni paper: \u003C/em\u003E\u003Ca href=\"https://arxiv.org/abs/2410.08565\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehttps://arxiv.org/abs/2410.08565\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E)\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EThe training process for Baichuan-Omni involves a three-stage approach:\u003C/p\u003E\u003Col\u003E\u003Cli\u003E\u003Cp\u003E\u003Cstrong\u003EProjector training\u003C/strong\u003E: Initially, only the projector is trained, while both the vision encoder and the language model (LLM) remain frozen.\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003E\u003Cstrong\u003EVision encoder training\u003C/strong\u003E: Next, the vision encoder is unfrozen and trained, with the LLM still frozen.\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003E\u003Cstrong\u003EFull model training\u003C/strong\u003E: Finally, the LLM is unfrozen, allowing the entire model to be trained end-to-end.\u003C/p\u003E\u003C/li\u003E\u003C/ol\u003E\u003Cp\u003EThe model utilizes the SigLIP vision encoder and incorporates the \u003Ca href=\"https://arxiv.org/abs/2204.07156\" target=\"_blank\" rel=\"nofollow noopener\"\u003EAnyRes\u003C/a\u003E module to handle high-resolution images through down-sampling techniques.\u003C/p\u003E\u003Cp\u003EWhile the report does not explicitly specify the LLM backbone, it is likely based on the Baichuan 7B LLM, given the model's parameter size and the naming convention.\u003C/p\u003E\u003Ch3\u003E4.9 Emu3: Next-Token Prediction is All You Need\u003C/h3\u003E\u003Cp\u003EThe \u003Cem\u003EEmu3: Next-Token Prediction is All You Need\u003C/em\u003E paper (September 27, 2024) presents a compelling alternative to diffusion models for image generation, which is solely based on a transformer-based decoder architecture.
Although it's not a multimodal LLM in the classic sense (i.e., models focused on image understanding rather than generation), Emu3 is super interesting as it demonstrates that it's possible to use transformer decoders for image generation, which is a task typically dominated by diffusion methods. (However, note that there have been other similar approaches before, such as \u003Ca href=\"https://arxiv.org/abs/2406.06525\" target=\"_blank\" rel=\"nofollow noopener\"\u003EAutoregressive Model Beats Diffusion: Llama for Scalable Image Generation\u003C/a\u003E.)\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQGOSheoVRcEeQ/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1730576881770?e=1748476800&v=beta&t=VlXoZDxsKrWzKhEcgMXrhra-MVylgGYCQpTXH-GNjYY\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EEmu3 is primarily an LLM for image generation as an alternative to diffusion models. (Annotated figure from the Emu3 paper: \u003C/em\u003E\u003Ca href=\"https://arxiv.org/abs/2409.18869\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehttps://arxiv.org/abs/2409.18869\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E)\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EThe researchers trained Emu3 from scratch and then used \u003Ca href=\"https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/04_preference-tuning-with-dpo/dpo-from-scratch.ipynb\" target=\"_blank\" rel=\"nofollow noopener\"\u003EDirect Preference Optimization\u003C/a\u003E (DPO) to align the model with human preferences. \u003C/p\u003E\u003Cp\u003EThe architecture includes a vision tokenizer inspired by \u003Ca href=\"https://arxiv.org/abs/2209.09002\" target=\"_blank\" rel=\"nofollow noopener\"\u003ESBER-MoVQGAN\u003C/a\u003E. The core LLM architecture is based on Llama 2, yet it is trained entirely from scratch.\u003C/p\u003E\u003Ch3\u003E4.10 Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation\u003C/h3\u003E\u003Cp\u003EWe previously focused on multimodal LLMs for image understanding and just saw one example for image generation with Emu 3 above. Now, the \u003Ca href=\"https://arxiv.org/abs/2410.13848\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003EJanus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation\u003C/em\u003E\u003C/a\u003E paper (October 17, 2024) introduces a framework that unifies multimodal understanding and generation tasks within a single LLM backbone. \u003C/p\u003E\u003Cp\u003EA key feature of Janus is the decoupling of visual encoding pathways to address the distinct requirements of understanding and generation tasks. The researchers argue that image understanding tasks require high-dimensional semantic representations, while generation tasks require detailed local information and global consistency in images. By separating these pathways, Janus effectively manages these differing needs. \u003C/p\u003E\u003Cp\u003EThe model employs the SigLIP vision encoder, similar to that used in Baichuan-Omni, for processing visual inputs. For image generation, it utilizes a \u003Ca href=\"https://arxiv.org/abs/2406.06525\" target=\"_blank\" rel=\"nofollow noopener\"\u003EVector Quantized (VQ)\u003C/a\u003E tokenizer to handle the generation process. 
The base LLM in Janus is the \u003Ca href=\"https://arxiv.org/abs/2401.02954\" target=\"_blank\" rel=\"nofollow noopener\"\u003EDeepSeek-LLM\u003C/a\u003E with 1.3 billion parameters.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQHyjseTJyKjNg/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1730576902923?e=1748476800&v=beta&t=PSYyu4CyyaW3Oyqyj8yjjGNNErtMdObcJfiADG9q5Do\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EAn overview of the unified decoder-only framework used in Janus. (Annotated figure from the Janus paper: \u003C/em\u003E\u003Ca href=\"https://arxiv.org/abs/2410.13848\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehttps://arxiv.org/abs/2410.13848\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E.)\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EThe training process for the model in this image follows three stages, as shown in the figure below.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQELvKv70n8uwA/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1730576919126?e=1748476800&v=beta&t=ijjqe_xkmz6Nnk4ia1nDRIBCyBrgpZpM9Ju6veO6wHo\" src=\"//:0\"\u003E\u003Cfigcaption\u003EIllustration of the 3-stage training process of the Janus model. (Annotated figure from the Janus paper: \u003Ca href=\"https://arxiv.org/abs/2410.13848\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehttps://arxiv.org/abs/2410.13848\u003C/a\u003E)\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EIn Stage I, only the projector layers and image output layer are trained while the LLM, understanding, and generation encoders remain frozen. In Stage II, the LLM backbone and text output layer are unfrozen, allowing for unified pretraining across understanding and generation tasks. Finally, in Stage III, the entire model, including the SigLIP image encoder, is unfrozen for supervised fine-tuning, enabling the model to fully integrate and refine its multimodal capabilities.\u003C/p\u003E\u003Ch2\u003EConclusion\u003C/h2\u003E\u003Cp\u003EAs you may have noticed, I almost entirely skipped both the modeling and the computational performance comparisons. First, comparing the performance of LLMs and multimodal LLMs on public benchmarks is challenging due to prevalent data contamination, meaning that the test data may have been included in the training data.\u003C/p\u003E\u003Cp\u003EAdditionally, the architectural components vary so much that making an apples-to-apples comparison is difficult. So, big kudos to the NVIDIA team for developing NVLM in different flavors, which allowed for a comparison between the decoder-only and cross-attention approaches at least.\u003C/p\u003E\u003Cp\u003EIn any case, the main takeaway from this article is that multimodal LLMs can be built successfully in many different ways. 
Below is a figure that summarizes the different components of the models covered in this article.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQGljUEdgIxlnw/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1730576943289?e=1748476800&v=beta&t=5xMzN8WZzll8TYyXZGNlSD3D4WvK2I_XluJ1F59K6pY\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EAn overview of the different models covered in this article along with their subcomponents and training approaches.\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EI hope you found reading this article educational and now have a better understanding of how multimodal LLMs work!\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Chr\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003Cem\u003EThis magazine is a personal passion project that does not offer direct compensation. However, for those who wish to support me, please consider purchasing a copy of my \u003C/em\u003E\u003Ca href=\"https://amzn.to/4fqvn0D\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003EBuild a Large Language Model (From Scratch) book\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E. (I am confident that you'll get lots out of this book as it explains how LLMs work in a level of detail that is not found anywhere else.) \u003C/em\u003E\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQHpDJwQCsxelw/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1730576968271?e=1748476800&v=beta&t=uI5mL7XJ6bqOHKl4c92YXR3-SM64NZ8djlJUAzGX4WQ\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Ca href=\"https://amazon.com/Build-Large-Language-Model-Scratch/dp/1633437167\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003EBuild a Large Language Model (From Scratch)\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E now available on Amazon\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003Cem\u003EIf you read the book and have a few minutes to spare, I'd really appreciate a \u003C/em\u003E\u003Ca href=\"https://www.amazon.com/Build-Large-Language-Model-Scratch/dp/1633437167\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ebrief review\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E. It helps us authors a lot!\u003C/em\u003E\u003C/p\u003E","url":"https://www.linkedin.com/pulse/understanding-multimodal-llms-sebastian-raschka-phd-t7h5c"},{"@type":"Article","author":{"@type":"Person","name":"Sebastian Raschka, PhD","url":"https://www.linkedin.com/in/sebastianraschka"},"dateModified":"","datePublished":"2024-09-21T12:07:01.000+00:00","headline":"Building a GPT-Style LLM Classifier From Scratch","image":{"@type":"ImageObject","url":"https://media.licdn.com/dms/image/v2/D5612AQGAiGhSgAww9Q/article-cover_image-shrink_720_1280/article-cover_image-shrink_720_1280/0/1725887252064?e=2147483647&v=beta&t=CeSTupa_F3hXHyI-vloNIg28s3X2u5c_zJlAEwq82lE"},"articleBody":"\u003Cp\u003EIn this article, I want to show you how to transform pretrained large language models (LLMs) into strong text classifiers. \u003C/p\u003E\u003Cp\u003EBut why focus on classification? First, finetuning a pretrained model for classification offers a gentle yet effective introduction to model finetuning. 
Second, many real-world and business challenges revolve around text classification: spam detection, sentiment analysis, customer feedback categorization, topic labeling, and more.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEYUgv4B7gcOQ/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1725887583709?e=1749081600&v=beta&t=z-6PIpDKkG1taaBpH1kHo5Zaok93875XQJkb4culOyc\" src=\"//:0\"\u003E\u003Cfigcaption\u003ETurning a GPT model into a text classifier\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch2\u003EAnnouncing My New Book\u003C/h2\u003E\u003Cp\u003EI’m thrilled to announce the release of my new book, \u003Ca href=\"http://mng.bz/orYv\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003EBuild a Large Language Model From Scratch\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E,\u003C/em\u003E published by Manning. This book, which has been nearly two years in the making, is now finally available as ebook and print version on the Manning website (with preorders also available on \u003Ca href=\"https://www.amazon.com/Build-Large-Language-Model-Scratch/dp/1633437167/\" target=\"_blank\" rel=\"nofollow noopener\"\u003EAmazon\u003C/a\u003E).\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQFWsmuWC_QLhA/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1725887377810?e=1749081600&v=beta&t=CIY6BhU1vD2itxgevY0jQGZNwUhF8JXSw2Z1r7jb7Qw\" src=\"//:0\"\u003E\u003Cfigcaption\u003EBuild a Large Language Model From Scratch\u003C/figcaption\u003E\u003C/figure\u003E\u003Cul\u003E\u003Cli\u003E\u003Cp\u003E\u003Ca href=\"http://mng.bz/orYv\" target=\"_blank\" rel=\"nofollow noopener\"\u003EManning link\u003C/a\u003E\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003E\u003Ca href=\"https://www.amazon.com/Build-Large-Language-Model-Scratch/dp/1633437167/\" target=\"_blank\" rel=\"nofollow noopener\"\u003EAmazon link\u003C/a\u003E (preorder)\u003C/p\u003E\u003C/li\u003E\u003C/ul\u003E\u003Cp\u003EFrom my experience, the best way to deeply understand a concept is to build it from scratch. And this book guides you through the entire process of building a GPT-like LLM—from implementing data inputs to finetuning with instruction data. My goal is that, after reading this book, you’ll have a deep, detailed and thorough understanding of how LLMs work.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQE1wS1w7JwX2Q/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1725887616764?e=1749081600&v=beta&t=-s94uVa5rMu2gQzth0TXDwRQit_sk1bnw6k7gr4EXLM\" src=\"//:0\"\u003E\u003Cfigcaption\u003EAn overview of the topics covered in \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003Cstrong\u003EWhat You’ll Learn in This Article\u003C/strong\u003E\u003C/p\u003E\u003Cp\u003ETo celebrate the book’s release, I’m sharing an excerpt from one of the chapters that walks you through how to finetune a pretrained LLM as a spam classifier. \u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003Cstrong\u003EImportant Note\u003C/strong\u003E\u003C/p\u003E\u003Cp\u003EThe chapter on classification finetuning is 35 pages long—too long for a single article. 
So, in this post, I’ll focus on a ~10-page subset that introduces the context and core concepts behind classification finetuning.\u003C/p\u003E\u003Cp\u003EAdditionally, I’ll share insights from some extra experiments that aren’t included in the book and address common questions readers might have. (Please note that the excerpt below is based on my personal draft before Manning’s professional text editing and final figure design.)\u003C/p\u003E\u003Cp\u003EThe full code for this excerpt can be found \u003Ca href=\"https://github.com/rasbt/LLMs-from-scratch/blob/main/ch06/01_main-chapter-code/ch06.ipynb\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehere on GitHub\u003C/a\u003E.\u003C/p\u003E\u003Cp\u003EIn addition, I'll also answer 7 questions you might have regarding training LLM classifiers:\u003C/p\u003E\u003Cp\u003E1) Do we need to train all layers?\u003C/p\u003E\u003Cp\u003E2) Why finetuning the last token, not the first token?\u003C/p\u003E\u003Cp\u003E3) How does BERT compare to GPT performance-wise?\u003C/p\u003E\u003Cp\u003E4) Should we disable the causal mask?\u003C/p\u003E\u003Cp\u003E5) What impact does increasing the model size have?\u003C/p\u003E\u003Cp\u003E6) What improvements can we expect from LoRA?\u003C/p\u003E\u003Cp\u003E7) Padding or no padding?\u003C/p\u003E\u003Cp\u003E\u003Cstrong\u003EHappy reading!\u003C/strong\u003E\u003C/p\u003E\u003Ch2\u003EDifferent categories of finetuning\u003C/h2\u003E\u003Cp\u003EThe most common ways to finetune language models are \u003Cem\u003Einstruction finetuning\u003C/em\u003E and \u003Cem\u003Eclassification finetuning\u003C/em\u003E. Instruction finetuning involves training a language model on a set of tasks using specific instructions to improve its ability to understand and execute tasks described in natural language prompts, as illustrated in Figure 1 below.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQHzvpTycXZMNQ/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1725887643546?e=1749081600&v=beta&t=SUszGvsntDlT4M1ke6ReJSv0mboJ0MtS_7rWlX4lADs\" src=\"//:0\"\u003E\u003Cfigcaption\u003EFigure 1: Illustration of two different instruction finetuning scenarios. At the top, the model is tasked with determining whether a given text is spam. At the bottom, the model is given an instruction on how to translate an English sentence into German.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EThe next chapter will discuss instruction finetuning, as illustrated in Figure 1 above. 
Meanwhile, this chapter is centered on classification finetuning, a concept you might already be acquainted with if you have a background in machine learning.\u003C/p\u003E\u003Cp\u003EIn classification finetuning, the model is trained to recognize a specific set of class labels, such as \"spam\" and \"not spam.\" Examples of classification tasks extend beyond large language models and email filtering; they include identifying different species of plants from images, categorizing news articles into topics like sports, politics, or technology, and distinguishing between benign and malignant tumors in medical imaging.\u003C/p\u003E\u003Cp\u003EThe key point is that a classification-finetuned model is restricted to predicting classes it has encountered during its training—for instance, it can determine whether something is \"spam\" or \"not spam,\" as illustrated in Figure 2 below, but it can't say anything else about the input text. \u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQHakqdiw8hXng/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1725887680548?e=1749081600&v=beta&t=gSFvMlkdI6hc5Suo13M-ITL6yLu38lYxIe8QtulGbEE\" src=\"//:0\"\u003E\u003Cfigcaption\u003EFigure 2: Illustration of a text classification scenario using an LLM. A model finetuned for spam classification does not require additional instructions alongside the input. However, in contrast to an instruction-finetuned model, it can only respond with \"spam\" and \"not spam.\"\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EIn contrast to the classification-finetuned model depicted in Figure 2, an instruction-finetuned model typically has the capability to undertake a broader range of tasks. We can view a classification-finetuned model as highly specialized, and generally, it is easier to develop a specialized model than a generalist model that works well across various tasks.\u003Cstrong\u003E \u003C/strong\u003E\u003C/p\u003E\u003Cblockquote\u003E\u003Cp\u003E\u003Cstrong\u003E\u003Cem\u003EChoosing the right approach\u003C/em\u003E\u003C/strong\u003E\u003C/p\u003E\u003C/blockquote\u003E\u003Cp\u003EWhile instruction finetuning is more versatile, it demands larger datasets and greater computational resources to develop models proficient in various tasks. In contrast, classification finetuning requires less data and compute power, but its use is confined to the specific classes on which the model has been trained.\u003C/p\u003E\u003Ch2\u003EInitializing a model with pretrained weights\u003C/h2\u003E\u003Cp\u003ESince this is an excerpt, we'll skip over the data preparation parts and the model's initialization, which were implemented and pretrained in previous chapters. From my experience, maintaining focus while reading lengthy digital articles can be challenging compared to physical books. 
So, I'll try to keep this excerpt/article tightly focused on one of the key takeaways from this chapter.\u003C/p\u003E\u003Cp\u003ETo provide some context on which part of the chapter this excerpt is from: it centers on the modification required to transform a general pretrained LLM into a specialized LLM for classification tasks, as shown in Figure 3 below.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQH_xPMZw_SmBA/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1725887705725?e=1749081600&v=beta&t=wOcbhsupB6fxdHkIoId4m1S5FE3q_Zu6ZuQLd7TcJg4\" src=\"//:0\"\u003E\u003Cfigcaption\u003EFigure 3: In this excerpt, we skip over steps 1-5 and directly jump to step 6 (starting in the next section).\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EBut before we jump to the modification of the LLM mentioned in Figure 3 above, let's have a brief look at the pretrained LLM we are working with.\u003C/p\u003E\u003Cp\u003ESo, for simplicity, let's assume that we set up the code to load the model as follows:\u003C/p\u003E\u003Cpre\u003E\u003C/pre\u003E\u003Cp\u003EAfter loading the model weights into the GPTModel, we use the text generation utility function from the previous chapters to ensure that the model generates coherent text:\u003C/p\u003E\u003Cpre\u003E\u003C/pre\u003E\u003Cp\u003EAs we can see based on the following output, the model generates coherent text, which is an indicator that the model weights have been loaded correctly:\u003C/p\u003E\u003Cpre\u003E\u003C/pre\u003E\u003Cp\u003ENow, before we start finetuning the model as a spam classifier, let's see if the model can perhaps already classify spam messages by prompting it with instructions:\u003C/p\u003E\u003Cpre\u003E\u003C/pre\u003E\u003Cp\u003EThe model output is as follows:\u003C/p\u003E\u003Cpre\u003E\u003C/pre\u003E\u003Cp\u003EBased on the output, it's apparent that the model struggles with following instructions.\u003C/p\u003E\u003Cp\u003EThis is anticipated, as it has undergone only pretraining and lacks instruction finetuning, which we will explore in the upcoming chapter.\u003C/p\u003E\u003Cp\u003EThe next section prepares the model for classification finetuning.\u003C/p\u003E\u003Ch2\u003EAdding a classification head\u003C/h2\u003E\u003Cp\u003EIn this section, we modify the pretrained large language model to prepare it for classification finetuning. To do this, we replace the original output layer, which maps the hidden representation to a vocabulary of 50,257 unique tokens, with a smaller output layer that maps to two classes: 0 (\"not spam\") and 1 (\"spam\"), as shown in Figure 4 below.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQFwsHsfnxwUfg/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1725887730659?e=1749081600&v=beta&t=jn30JfxdgOs4CIBO69yrjFI7vsxAFEbSINIWTqkVYsU\" src=\"//:0\"\u003E\u003Cfigcaption\u003EFigure 4: This figure illustrates adapting a GPT model for spam classification by altering its architecture. Initially, the model's linear output layer mapped 768 hidden units to a vocabulary of 50,257 tokens. 
For spam detection, this layer is replaced with a new output layer that maps the same 768 hidden units to just two classes, representing \"spam\" and \"not spam.\"\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EAs shown in Figure 4 above, we use the same model as in previous chapters except for replacing the output layer.\u003Cstrong\u003E \u003C/strong\u003E\u003C/p\u003E\u003Cblockquote\u003E\u003Cp\u003E\u003Cstrong\u003E\u003Cem\u003EOutput layer nodes\u003C/em\u003E\u003C/strong\u003E\u003Cem\u003E. We could technically use a single output node since we are dealing with a binary classification task. However, this would require modifying the loss function, as discussed in an article in the Reference section in Appendix B. Therefore, we choose a more general approach where the number of output nodes matches the number of classes. For example, for a 3-class problem, such as classifying news articles as \"Technology\", \"Sports\", or \"Politics\", we would use three output nodes, and so forth.\u003C/em\u003E\u003C/p\u003E\u003C/blockquote\u003E\u003Cp\u003EBefore we attempt the modification illustrated in Figure 4, let's print the model architecture via print(model), which outputs the following:\u003C/p\u003E\u003Cpre\u003E\u003C/pre\u003E\u003Cp\u003EAbove, we can see the architecture we implemented in Chapter 4 neatly laid out. As discussed in Chapter 4, the GPTModel consists of embedding layers followed by 12 identical \u003Cem\u003Etransformer blocks\u003C/em\u003E (only the last block is shown for brevity), followed by a final LayerNorm and the output layer, out_head. \u003C/p\u003E\u003Cp\u003ENext, we replace the out_head with a new output layer, as illustrated in Figure 4, that we will finetune.\u003C/p\u003E\u003Ch3\u003E\u003C/h3\u003E\u003Cblockquote\u003E\u003Cp\u003E\u003Cstrong\u003EFinetuning selected layers versus all layers. \u003C/strong\u003ESince we start with a pretrained model, it's not necessary to finetune all model layers. This is because, in neural network-based language models, the lower layers generally capture basic language structures and semantics that are applicable across a wide range of tasks and datasets. So, finetuning only the last layers (layers near the output), which are more specific to nuanced linguistic patterns and task-specific features, can often be sufficient to adapt the model to new tasks. A nice side effect is that it is computationally more efficient to finetune only a small number of layers. Interested readers can find more information, including experiments, on which layers to finetune in the References section for this chapter in Appendix B.\u003C/p\u003E\u003C/blockquote\u003E\u003Cp\u003ETo get the model ready for classification finetuning, we first \u003Cem\u003Efreeze\u003C/em\u003E the model, meaning that we make all layers non-trainable:\u003C/p\u003E\u003Cpre\u003E\u003C/pre\u003E\u003Cp\u003EThen, as shown in Figure 4 earlier, we replace the output layer (model.out_head), which originally maps the layer inputs to 50,257 dimensions (the size of the vocabulary):\u003C/p\u003E\u003Cpre\u003E\u003C/pre\u003E\u003Cp\u003ENote that in the preceding code, we use BASE_CONFIG[\"emb_dim\"], which is equal to 768 in the \"gpt2-small (124M)\" model, to keep the code below more general. 
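\u003C/p\u003E\u003Cp\u003ETo make the two steps above concrete, here is a minimal sketch of what the freezing and output-layer replacement might look like. This is an illustrative sketch rather than the chapter's verbatim listing, and it assumes a GPTModel instance named model and the BASE_CONFIG dictionary from the earlier chapters:\u003C/p\u003E\u003Cpre\u003E
import torch

# Freeze the pretrained model: make all existing parameters non-trainable
for param in model.parameters():
    param.requires_grad = False

# Replace the 50,257-class output head with a new 2-class output layer
# (BASE_CONFIG[\"emb_dim\"] is 768 for the \"gpt2-small (124M)\" model)
torch.manual_seed(123)  # for reproducible initialization of the new head
num_classes = 2
model.out_head = torch.nn.Linear(
    in_features=BASE_CONFIG[\"emb_dim\"],
    out_features=num_classes
)
\u003C/pre\u003E\u003Cp\u003E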
This means we can also use the same code to work with the larger GPT-2 model variants.\u003C/p\u003E\u003Cp\u003EThis new model.out_head output layer has its requires_grad attribute set to True by default, which means that it's the only layer in the model that will be updated during training.\u003C/p\u003E\u003Cp\u003ETechnically, training the output layer we just added is sufficient. However, as I found in experiments, finetuning additional layers can noticeably improve the predictive performance of the finetuned model. (For more details, refer to the References in Appendix C.)\u003C/p\u003E\u003Cp\u003EAdditionally, we configure the last transformer block and the final LayerNorm module, which connects this block to the output layer, to be trainable, as depicted in Figure 5 below.\u003C/p\u003E\u003Ch3\u003E\u003C/h3\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQFyr9bOfeykvw/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1725887760046?e=1749081600&v=beta&t=gbk8QYK3LXnKv4YPD8TCYYtL6lIaDr7zYRqRo7SIsEM\" src=\"//:0\"\u003E\u003Cfigcaption\u003EFigure 5: The GPT model we developed in earlier chapters, which we loaded previously, includes 12 repeated transformer blocks. Alongside the output layer, we set the final LayerNorm and the last transformer block as trainable, while the remaining 11 transformer blocks and the embedding layers are kept non-trainable.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003ETo make the final LayerNorm and last transformer block trainable, as illustrated in Figure 5 above, we set their respective requires_grad to True:\u003C/p\u003E\u003Cpre\u003E\u003C/pre\u003E\u003Cp\u003EEven though we added a new output layer and marked certain layers as trainable or non-trainable, we can still use this model in a similar way to previous chapters. For instance, we can feed it an example text just as we have done in earlier chapters. Consider the following example text:\u003C/p\u003E\u003Cpre\u003E\u003C/pre\u003E\u003Cp\u003EAs the print output shows, the preceding code encodes the inputs into a tensor consisting of 4 input tokens:\u003C/p\u003E\u003Cpre\u003E\u003C/pre\u003E\u003Cp\u003EThen, we can pass the encoded token IDs to the model as usual:\u003C/p\u003E\u003Cpre\u003E\u003C/pre\u003E\u003Cp\u003EThe output tensor looks as follows:\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cpre\u003E\u003C/pre\u003E\u003Cp\u003EIn Chapters 4 and 5, a similar input would have produced an output tensor of shape [1, 4, 50257], where 50,257 represents the vocabulary size. As in previous chapters, the number of output rows corresponds to the number of input tokens (in this case, 4). However, each output's embedding dimension (the number of columns) is now reduced to 2 instead of 50,257 since we replaced the output layer of the model.\u003C/p\u003E\u003Cp\u003ERemember that we are interested in finetuning this model so that it returns a class label that indicates whether a model input is \u003Cem\u003Espam\u003C/em\u003E or \u003Cem\u003Enot spam\u003C/em\u003E. To achieve this, we don't need to finetune all 4 output rows but can focus on a single output token. 
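\u003C/p\u003E\u003Cp\u003EAs a rough, hedged sketch of the steps just described (assuming the attribute names trf_blocks and final_norm from the book's GPTModel implementation; adjust these if your variable names differ), the trainability settings and the example forward pass might look as follows:\u003C/p\u003E\u003Cpre\u003E
import torch
import tiktoken

# Make the last transformer block and the final LayerNorm trainable
for param in model.trf_blocks[-1].parameters():
    param.requires_grad = True
for param in model.final_norm.parameters():
    param.requires_grad = True

# Encode a 4-token example text and add a batch dimension
tokenizer = tiktoken.get_encoding(\"gpt2\")
inputs = torch.tensor(tokenizer.encode(\"Do you have time\")).unsqueeze(0)
print(inputs.shape)   # torch.Size([1, 4])

# Forward pass through the modified model
with torch.no_grad():
    outputs = model(inputs)
print(outputs.shape)  # torch.Size([1, 4, 2]) instead of [1, 4, 50257]
\u003C/pre\u003E\u003Cp\u003E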
In particular, we will focus on the last row corresponding to the last output token, as illustrated in Figure 6 below.\u003C/p\u003E\u003Ch3\u003E\u003C/h3\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQFTvs-P3G1yIg/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1725887788280?e=1749081600&v=beta&t=h3taIL1dRRB8OPICRwOGxeVLk5gtoG0wZty2XuEWq4E\" src=\"//:0\"\u003E\u003Cfigcaption\u003EFigure 6: An illustration of the GPT model with a 4-token example input and output. The output tensor consists of 2 columns due to the modified output layer. We are only focusing on the last row corresponding to the last token when finetuning the model for spam classification.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003ETo extract the last output token, illustrated in Figure 6 above, from the output tensor, we use the following code:\u003C/p\u003E\u003Cpre\u003E\u003C/pre\u003E\u003Cp\u003EThis prints the following:\u003C/p\u003E\u003Cpre\u003E\u003C/pre\u003E\u003Cp\u003EBefore we proceed to the next section, let's recap our discussion. We will focus on converting the values into a class label prediction. But first, let's understand why we are particularly interested in the last output token, and not the 1st, 2nd, or 3rd output token.\u003C/p\u003E\u003Cp\u003EIn Chapter 3, we explored the attention mechanism, which establishes a relationship between each input token and every other input token. Subsequently, we introduced the concept of a \u003Cem\u003Ecausal attention mask\u003C/em\u003E, commonly used in GPT-like models. This mask restricts a token's focus to only its current position and those before it, ensuring that each token can only be influenced by itself and preceding tokens, as illustrated in Figure 7 below.\u003C/p\u003E\u003Ch3\u003E\u003C/h3\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQHTSROqX3a74g/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1725887831318?e=1749081600&v=beta&t=08w6X-YNffhseJM7C1Fkm41L5Jw5wLl8bf9zINDQB10\" src=\"//:0\"\u003E\u003Cfigcaption\u003EFigure 7: Illustration of the causal attention mechanism as discussed in Chapter 3, where the attention scores between input tokens are displayed in a matrix format. The empty cells indicate masked positions due to the causal attention mask, preventing tokens from attending to future tokens. The values in the cells represent attention scores, with the last token, \"time,\" being the only one that computes attention scores for all preceding tokens.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EGiven the causal attention mask setup shown in Figure 7 above, the last token in a sequence accumulates the most information since it is the only token with access to data from all the previous tokens. Therefore, in our spam classification task, we focus on this last token during the finetuning process.\u003C/p\u003E\u003Cp\u003EHaving modified the model, the next section will detail the process of transforming the last token into class label predictions and calculate the model's initial prediction accuracy. Following this, we will finetune the model for the spam classification task in the subsequent section.\u003C/p\u003E\u003Ch2\u003EEvaluating the model performance\u003C/h2\u003E\u003Cp\u003ESince this excerpt was already long, I won't go into the details on the model evaluation. 
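\u003C/p\u003E\u003Cp\u003EThat said, to make the last-token idea concrete before looking at the results, here is a minimal sketch (not the chapter's exact listing) of how the last output row can be extracted and, after finetuning, mapped to a class label:\u003C/p\u003E\u003Cpre\u003E
import torch

# Keep only the logits of the last output token (shape: [1, 2])
last_token_logits = outputs[:, -1, :]
print(last_token_logits)

# After finetuning, the predicted class is simply the argmax over the 2 logits
predicted_label = torch.argmax(last_token_logits, dim=-1).item()
print(\"spam\" if predicted_label == 1 else \"not spam\")
\u003C/pre\u003E\u003Cp\u003E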
However, I wanted to share at least the plot showing the classification accuracy on the training and validation sets during training to show you that the model indeed learns really well.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQETjOQxj46tTA/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1725887878397?e=1749081600&v=beta&t=sNy6124NxRiC_TRIyeWkW0DZK_BKRC5g8BYcHPPDgdU\" src=\"//:0\"\u003E\u003Cfigcaption\u003EFigure 8: Both the training accuracy (solid line) and the validation accuracy (dashed line) increase substantially in the early epochs and then plateau, achieving almost perfect accuracy scores of 1.0, corresponding to 100%. The close proximity of the two lines throughout the epochs suggests that the model does not overfit the training data much.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EAs we can see in Figure 8 above, the model achieves a validation accuracy of approximately 97%. The test accuracy (not shown) is approximately 96%. Furthermore, we can see that the model slightly overfits, as indicated by the slightly higher training set accuracy. Overall, though, this model performed really well: a 96% test set accuracy means that it correctly identifies 96 out of 100 messages as spam or not spam. (We didn't discuss the dataset in this excerpt, but it was a balanced dataset with 50% spam and 50% non-spam messages, which means that a random or badly trained classifier would achieve approximately 50% classification accuracy.)\u003C/p\u003E\u003Ch2\u003EInsights from additional experiments\u003C/h2\u003E\u003Cp\u003EYou may have many questions about certain design choices at this point, so I wanted to share a few results from some additional experiments I ran, which may address one or more questions or concerns you might have. The code to reproduce these experiments is available \u003Ca href=\"https://github.com/rasbt/LLMs-from-scratch/tree/main/ch06/02_bonus_additional-experiments\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehere on GitHub\u003C/a\u003E.\u003C/p\u003E\u003Cp\u003E\u003Cem\u003EDisclaimer:\u003C/em\u003E the experiments were mostly only run on 1 dataset, and should be repeated on other datasets in the future to test whether these findings generalize.\u003C/p\u003E\u003Ch3\u003E1) Do we need to train all layers?\u003C/h3\u003E\u003Cp\u003EIn the chapter excerpt above, we only trained the output layer and the last transformer block for efficiency reasons. As explained earlier, for classification finetuning, it is not necessary to update all layers in an LLM. (The fewer weights we update, the faster the training will be because we don't need to compute the gradients for these weights during backpropagation.)\u003C/p\u003E\u003Cp\u003EHowever, you may wonder how much predictive performance we are leaving on the table by not updating all layers. 
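\u003C/p\u003E\u003Cp\u003EAs a rough sketch (again assuming the trf_blocks, final_norm, and out_head attribute names used above), the three settings compared next differ only in which parameters are left trainable:\u003C/p\u003E\u003Cpre\u003E
# Setting 1: finetune all layers
for param in model.parameters():
    param.requires_grad = True

# Setting 2: finetune only the last transformer block, final LayerNorm, and output head
for param in model.parameters():
    param.requires_grad = False
for block in (model.trf_blocks[-1], model.final_norm, model.out_head):
    for param in block.parameters():
        param.requires_grad = True

# Setting 3: finetune only the new output head
for param in model.parameters():
    param.requires_grad = False
for param in model.out_head.parameters():
    param.requires_grad = True
\u003C/pre\u003E\u003Cp\u003E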
So, in the table below, I ran a comparison between finetuning all layers, only the last transformer block (plus the last layer), and the last layer only.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEwHTxXHKu0lQ/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1725887906754?e=1749081600&v=beta&t=hLWTMXTiTw3PrDlKNdg_BK1O6gYX5OzA2ajKR4waK_Q\" src=\"//:0\"\u003E\u003Cfigcaption\u003ETable 1: Training all layers versus only the last transformer block (including the last layer) versus only the last layer.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EAs shown in Table 1 above, training all layers results in a slightly better performance: 96.67% versus 95.00%. (This increased the runtime by about 2.5-fold, though.)\u003C/p\u003E\u003Cp\u003EThe complete set of experiments can be found \u003Ca href=\"https://github.com/rasbt/LLMs-from-scratch/tree/main/ch06/02_bonus_additional-experiments\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehere on GitHub\u003C/a\u003E.\u003C/p\u003E\u003Ch3\u003E2) Why finetuning the last token, not the first token?\u003C/h3\u003E\u003Cp\u003EIf you are familiar with encoder-style language models like BERT (\u003Ca href=\"https://arxiv.org/abs/1810.04805\" target=\"_blank\" rel=\"nofollow noopener\"\u003EBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding\u003C/a\u003E by Devlin et al. 2018), you may know that these have a designated classification token as their first token, as shown in the figure below.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQFPi4M9oWyKLA/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1725887959313?e=1749081600&v=beta&t=OkE7uLjkoKgpz1TwJ8YH9eqDR_We4z6Mg-v1FU-2T-Y\" src=\"//:0\"\u003E\u003Cfigcaption\u003EAnnotated figure from the original BERT paper, https://arxiv.org/abs/1810.04805\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EIn contrast to BERT, GPT is a decoder-style model with a causal attention mask (shown in Figure 7 earlier). This means the first token has no context information of any other token in the input. Only the last token has information about all other tokens. \u003C/p\u003E\u003Cp\u003EHence, if we want to use models like GPT for classification finetuning, we should focus on the last token to capture contextual information of all other input tokens.\u003C/p\u003E\u003Cp\u003EBelow is additional experimental evidence, where we can see that using the first token to finetune a GPT model for classification results in much worse performance.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEsTjpncJycRg/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1725887980411?e=1749081600&v=beta&t=_bdGsrYM1b232qs9QEfhua4K0mvXlorAxyzrVJo404g\" src=\"//:0\"\u003E\u003Cfigcaption\u003ETable 2: Finetuning the last versus first token in a GPT model.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EOverall, though, I still find it surprising that the first token contains enough information to determine whether a message is spam or not with 75% accuracy. 
(Note that this is a balanced dataset, so a random classifier yields 50% accuracy.)\u003C/p\u003E\u003Ch3\u003E3) How does BERT compare to GPT performance-wise?\u003C/h3\u003E\u003Cp\u003ESpeaking of BERT, you may wonder how it compares to a GPT-style model on classification tasks. \u003C/p\u003E\u003Cp\u003EIn short, the small GPT-2 model from the previous section and BERT performed similarly well on the spam classification dataset, as shown in the table below.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQG8GQSS1btJYw/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1725888034487?e=1749081600&v=beta&t=IomU23CY68sGy3grny9JtWxstJcPU26NYJPhR3-9ths\" src=\"//:0\"\u003E\u003Cfigcaption\u003ETable 3: GPT-2 versus BERT for spam classification.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003ENote that the BERT model performs slightly better (1% higher test accuracy), but it is also almost 3x larger. Furthermore, the dataset may be too small and simple, so I also tried the \u003Ca href=\"https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews\" target=\"_blank\" rel=\"nofollow noopener\"\u003EIMDB Movie Review dataset\u003C/a\u003E for sentiment classification (that is, predicting whether a reviewer liked the movie or not).\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQECQqrFxrwS_Q/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1725888063256?e=1749081600&v=beta&t=DZbeuSqq-UvYj_MlkFWFexvUuNT5Zw8gHyE_dqyT-fE\" src=\"//:0\"\u003E\u003Cfigcaption\u003ETable 4: GPT-2 versus BERT for movie review classification.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EAs we can see, the two models, GPT-2 and BERT, also have relatively similar predictive performances on this larger dataset (consisting of 25k training and 25k test set records).\u003C/p\u003E\u003Cp\u003EGenerally, BERT and other encoder-style models were considered superior to decoder-style models for classification tasks. 
However, as the experiments above showed, there's not a large difference between the encoder-style BERT and the decoder-style GPT model.\u003C/p\u003E\u003Cp\u003EFurthermore, if you are interested in more benchmark comparisons and tips for improving decoder-style models for classification further, you might like these two recent papers:\u003C/p\u003E\u003Cul\u003E\u003Cli\u003E\u003Cp\u003E\u003Ca href=\"https://arxiv.org/abs/2310.01208\" target=\"_blank\" rel=\"nofollow noopener\"\u003ELabel Supervised LLaMA Finetuning (2023)\u003C/a\u003E by Li et al.\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003E\u003Ca href=\"https://arxiv.org/abs/2404.05961\" target=\"_blank\" rel=\"nofollow noopener\"\u003ELLM2Vec: Large Language Models Are Secretly Powerful Text Encoders (2024)\u003C/a\u003E by BehnamGhader et al.\u003C/p\u003E\u003C/li\u003E\u003C/ul\u003E\u003Cp\u003EFor instance, as the papers above discuss, one can improve the classification performance of decoder-style models further by removing the causal mask during classification finetuning.\u003C/p\u003E\u003Ch3\u003E4) Should we disable the causal mask?\u003C/h3\u003E\u003Cp\u003ESince we train GPT-like models on a next-word prediction task, a core feature of the GPT architecture is the causal attention mask (different from BERT models or the original transformer architecture). \u003C/p\u003E\u003Cp\u003EHowever, we could actually remove the causal mask during classification finetuning, which would allow us to finetune the first rather than the last token since future tokens will no longer be masked, and the first token can see all other tokens.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEzYNZOEG30fQ/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1725888082751?e=1749081600&v=beta&t=ox412W8NxAfhzJjM-liaLPvLMqBspyytyT-iLGn_cV4\" src=\"//:0\"\u003E\u003Cfigcaption\u003EAttention weight matrix with and without a causal mask.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EDisabling a causal attention mask in a GPT-like LLM fortunately requires changing only 2 lines of code:\u003C/p\u003E\u003Cpre\u003E\u003C/pre\u003E\u003Cp\u003ETable 5 below shows how this modification affects the performance of the spam classification task.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQE-cKf4xXPcuQ/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1725888110998?e=1749081600&v=beta&t=fXwPb51HAYf365No_fa1zcNgriEt27TaRrEdcUbJdBk\" src=\"//:0\"\u003E\u003Cfigcaption\u003ETable 5: GPT-2 classifier finetuned with and without causal attention mask.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EAs we can see based on the results in Table 5, we can get a small improvement when we disable the causal mask during finetuning.\u003C/p\u003E\u003Ch3\u003E5) What impact does increasing the model size have?\u003C/h3\u003E\u003Cp\u003ESo far, we've only looked at the performance of the smallest GPT-2 model, the 124 million-parameter version. How does it compare to the larger variants, which have 355 million, 774 million, and 1.5 billion parameters? 
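\u003C/p\u003E\u003Cp\u003E(Before turning to the model-size results in Table 6 below, here is a self-contained toy illustration of the causal-mask question above. It is not the book's MultiHeadAttention code; it only shows that the \"two-line change\" boils down to skipping the masked_fill step before the softmax:)\u003C/p\u003E\u003Cpre\u003E
import torch

torch.manual_seed(123)
num_tokens, d = 4, 8
queries = torch.randn(1, num_tokens, d)
keys = torch.randn(1, num_tokens, d)
attn_scores = queries @ keys.transpose(1, 2) / d**0.5

# Causal version: mask out future positions before the softmax
mask = torch.triu(torch.ones(num_tokens, num_tokens), diagonal=1).bool()
causal_weights = torch.softmax(attn_scores.masked_fill(mask, -torch.inf), dim=-1)

# \"Disabled mask\" version: simply skip the masked_fill step
bidirectional_weights = torch.softmax(attn_scores, dim=-1)

print(causal_weights[0])         # upper triangle is zero
print(bidirectional_weights[0])  # every token attends to every other token
\u003C/pre\u003E\u003Cp\u003E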
The results are summarized in Table 6.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQG5yw1nqg-CmQ/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1725888130457?e=1749081600&v=beta&t=SauzWGUycoUTVsC3vAirBxDzcuMxpmfinB0IDdZnqSI\" src=\"//:0\"\u003E\u003Cfigcaption\u003ETable 6: Different GPT-2 model sizes for classification finetuning.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EAs we can see, the prediction accuracy improves significantly with larger models. (However, GPT-2 medium is an outlier here; I have noticed poor performance of this model on other datasets, too, and I suspect that it has not been pretrained very well.)\u003C/p\u003E\u003Cp\u003EHowever, while the GPT-2 XL model shows a noticeably better classification accuracy than the smallest model, it also took 7x longer to finetune.\u003C/p\u003E\u003Ch3\u003E6) What improvements can we expect from LoRA?\u003C/h3\u003E\u003Cp\u003EIn the very first question, \"1) Do we need to train all layers?\" we found that we could (almost) match the classification performance when finetuning only the last transformer block instead of finetuning the whole model. The advantage of only finetuning the last block is that the training is faster since not all weight parameters are being updated.\u003C/p\u003E\u003Cp\u003EA follow-up question is how this compares to Low-Rank Adaptation (LoRA), a parameter-efficient finetuning technique. (LoRA is covered in Appendix E.)\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQHokTfpRsg5gA/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1725888160886?e=1749081600&v=beta&t=aYIx_27AJTy2Y5pUUNFYqwZTdF6HHYEtcDseL-LjuDg\" src=\"//:0\"\u003E\u003Cfigcaption\u003ETable 7: Full finetuning (all layers) versus parameter-efficient finetuning with LoRA.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EAs we can see in Table 7 above, both full finetuning (all layers) and LoRA result in the same test set performance on this dataset. 
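\u003C/p\u003E\u003Cp\u003EFor readers who haven't seen LoRA before, the core idea can be sketched in a few lines. This is a generic illustration of the technique (not the book's Appendix E listing): a frozen Linear layer is augmented with a small trainable low-rank update, and the rank and alpha values below are just example hyperparameters:\u003C/p\u003E\u003Cpre\u003E
import torch

class LoRALayer(torch.nn.Module):
    # Trainable low-rank update: alpha * (x @ A @ B)
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        self.A = torch.nn.Parameter(torch.randn(in_dim, rank) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)

class LinearWithLoRA(torch.nn.Module):
    # Wraps a (frozen) pretrained Linear layer and adds the LoRA update to its output
    def __init__(self, linear, rank=8, alpha=16):
        super().__init__()
        self.linear = linear
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        return self.linear(x) + self.lora(x)
\u003C/pre\u003E\u003Cp\u003EDuring finetuning, only the A and B matrices receive gradient updates; wrapping the model's existing Linear layers in such a module is what introduces the small overhead discussed next.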
\u003C/p\u003E\u003Cp\u003EOn the small model, LoRA is slightly slower since the additional overhead from adding LoRA layers may outweigh the benefits, but when training the larger 1.5 billion parameter model, LoRA trains 1.53x faster.\u003C/p\u003E\u003Ch3\u003E7) Padding or no padding?\u003C/h3\u003E\u003Cp\u003EIf we want to process data in batches during training or inference (this involves processing more than one input sequence at a time), we need to insert padding tokens to ensure that the training examples are of equal length.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQF_Cf3u1--fow/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1725888250814?e=1749081600&v=beta&t=-0oKu98QwvwXGxr6IKwgUezP422m8qm0b8n3GCC54r4\" src=\"//:0\"\u003E\u003Cfigcaption\u003EAn illustration of how the input texts in a given batch are padded to an equal length.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EIn regular text generation tasks, padding doesn't affect the model response since padding tokens are usually added to the right side, and due to the causal mask discussed earlier, these padding tokens don't influence the other tokens.\u003C/p\u003E\u003Cp\u003EHowever, remember that we finetuned the last token, as discussed earlier. Since the padding tokens are to the left of this last token, they may affect the result. \u003C/p\u003E\u003Cp\u003EIf we use a batch size of 1, we actually don't need to pad the inputs. This is, of course, less efficient from a computational standpoint (since we only process one input example at a time). Still, a batch size of 1 can be used as a workaround to test whether or not using padding can improve the results. (An alternative solution is to add a custom mask to ignore padding tokens in the attention score computation, but since this would require changing the GPT implementation, that's a topic for another time.)\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQHQrFIXHHqnWA/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1725888308188?e=1749081600&v=beta&t=rLQ-fWV8eK6T49iM3T51eHjnr6hmjbs8FhKr6lshTMM\" src=\"//:0\"\u003E\u003Cfigcaption\u003ETable 8: Training a model with padding (regular) and no padding of inputs to similar lengths.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EAs we can see, avoiding padding tokens can indeed give the model a noticeable boost! (Note that I used gradient accumulation to simulate a batch size of 8 to match the batch size of the default experiment and make it a fair comparison.)\u003C/p\u003E\u003Cp\u003EI hope you found these additional experiments interesting. I have a few more as bonus material \u003Ca href=\"https://github.com/rasbt/LLMs-from-scratch/tree/main/ch06/02_bonus_additional-experiments\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehere on GitHub\u003C/a\u003E.\u003C/p\u003E\u003Ch2\u003EBuild A Large Language Model From Scratch\u003C/h2\u003E\u003Cp\u003EThis article presents a 10-page snippet from Chapter 6 of my new book, \"Build a Large Language Model from Scratch.\" \u003C/p\u003E\u003Cp\u003EWhat you've read is just a small part of the entire 365-page journey of building a GPT-like LLM from scratch to understand how LLMs really work. 
\u003C/p\u003E\u003Cp\u003EIf this excerpt resonated with you, you might find the rest of the book equally insightful and helpful.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQF5eHIkCdnLBg/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1726489445373?e=1749081600&v=beta&t=KuJPozI6-NImW6DYZT03f_J1x6PYfhw2QWDY-dtDgEI\" src=\"//:0\"\u003E\u003Cfigcaption\u003EBuild a Large Language Model From Scratch\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cul\u003E\u003Cli\u003E\u003Cp\u003E\u003Ca href=\"http://mng.bz/orYv\" target=\"_blank\" rel=\"nofollow noopener\"\u003EManning link\u003C/a\u003E\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003E\u003Ca href=\"https://www.amazon.com/Build-Large-Language-Model-Scratch/dp/1633437167/\" target=\"_blank\" rel=\"nofollow noopener\"\u003EAmazon link\u003C/a\u003E (preorder)\u003C/p\u003E\u003C/li\u003E\u003C/ul\u003E\u003Cp\u003E\u003Cstrong\u003EYour support means a great deal and is tremendously helpful in continuing this journey. Thank you!\u003C/strong\u003E\u003C/p\u003E","url":"https://www.linkedin.com/pulse/building-gpt-style-llm-classifier-from-scratch-sebastian-raschka-phd-itp5c"},{"@type":"Article","author":{"@type":"Person","name":"Sebastian Raschka, PhD","url":"https://www.linkedin.com/in/sebastianraschka"},"dateModified":"","datePublished":"2024-08-17T12:55:01.000+00:00","headline":"New LLM Pre-training and Post-training Paradigms","image":{"@type":"ImageObject","url":"https://media.licdn.com/dms/image/v2/D5612AQGdto_dMucAMA/article-cover_image-shrink_720_1280/article-cover_image-shrink_720_1280/0/1723740451128?e=2147483647&v=beta&t=bZuKacN5K0bmj3mrsNIU86914DJdSq9n_gnH1L6uiiI"},"articleBody":"\u003Cp\u003EThe development of large language models (LLMs) has come a long way, from the early GPT models to the sophisticated open-weight LLMs we have today. Initially, the LLM training process focused solely on pre-training, but it has since expanded to include both pre-training and post-training. Post-training typically encompasses supervised instruction fine-tuning and alignment, which was popularized by ChatGPT.\u003C/p\u003E\u003Cp\u003ETraining methodologies have evolved since ChatGPT was first released. In this article, I review the latest advancements in both pre-training and post-training methodologies, particularly those made in recent months.\u003C/p\u003E\u003Cp\u003EThere are hundreds of LLM papers each month proposing new techniques and approaches. However, one of the best ways to see what actually works well in practice is to look at the pre-training and post-training pipelines of the most recent state-of-the-art models. 
Luckily, four major new LLMs have been released in the last months, accompanied by relatively detailed technical reports: Meta AI's Llama 3.1, Google's Gemma 2, Alibaba's Qwen 2, and Apple's foundation models.\u003C/p\u003E\u003Cp\u003EIn this article, I focus on the pre-training and post-training pipelines of the following models:\u003C/p\u003E\u003Col\u003E\u003Cli\u003E\u003Cp\u003EAlibaba's Qwen 2\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EApple Intelligence Foundation Language Models\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EGoogle's Gemma 2\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EMeta AI's Llama 3.1\u003C/p\u003E\u003C/li\u003E\u003C/ol\u003E\u003Cp\u003EThese models are presented in order based on the publication dates of their respective technical papers on arXiv.org, which also happens to align with their alphabetical order.\u003C/p\u003E\u003Cp\u003E\u003Cstrong\u003EThis article is a passion project that I created in my free time and over the weekends. If you find it valuable and would like to support my work, please consider purchasing a copy of my books and recommending them to your colleagues. Your review on Amazon would also be greatly appreciated!\u003C/strong\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQFejOfxHZsiAA/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1723739409151?e=1749081600&v=beta&t=ZGe-AYiZ29bnpKJk-Hn0VRQjKta0SV3G0A7hekstmrE\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cul\u003E\u003Cli\u003E\u003Cp\u003E\u003Ca href=\"http://mng.bz/M96o\" target=\"_blank\" rel=\"nofollow noopener\"\u003EBuild a Large Language Model (from Scratch)\u003C/a\u003E is a highly focused book dedicated to coding LLMs from the ground up in PyTorch, covering everything from pre-training to post-training—arguably the best way to truly understand LLMs.\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003E\u003Ca href=\"https://nostarch.com/machine-learning-and-ai-beyond-basics\" target=\"_blank\" rel=\"nofollow noopener\"\u003EMachine Learning Q and AI\u003C/a\u003E is a great book for those who are already familiar with the basics; it dives into intermediate and advanced concepts covering deep neural networks, vision transformers, multi-GPU training paradigms, LLMs, and many more.\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003E\u003Ca href=\"https://www.amazon.com/Machine-Learning-PyTorch-Scikit-Learn-scikit-learn-ebook-dp-B09NW48MR1/dp/B09NW48MR1/\" target=\"_blank\" rel=\"nofollow noopener\"\u003EMachine Learning with PyTorch and Scikit-Learn\u003C/a\u003E is a comprehensive guide to machine learning, deep learning, and AI, offering a well-balanced mix of theory and practical code. It's the ideal starting point for anyone new to the field. \u003C/p\u003E\u003C/li\u003E\u003C/ul\u003E\u003Ch2\u003E1. Alibaba's Qwen 2\u003C/h2\u003E\u003Cp\u003ELet's begin with \u003Ca href=\"https://arxiv.org/abs/2407.10671\" target=\"_blank\" rel=\"nofollow noopener\"\u003EQwen 2\u003C/a\u003E, a really strong LLM model family that is competitive with other major LLMs. 
However, for some reason, it's less popular than the open-weight models from Meta AI, Microsoft, and Google.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E1.1 Qwen 2 Overview\u003C/h3\u003E\u003Cp\u003EBefore looking at the pre-training and post-training methods discussed in the \u003Ca href=\"https://arxiv.org/abs/2407.10671\" target=\"_blank\" rel=\"nofollow noopener\"\u003EQwen 2 Technical Report\u003C/a\u003E, let's briefly summarize some core specifications.\u003C/p\u003E\u003Cp\u003EQwen 2 models come in 5 flavors. There are 4 regular (dense) LLMs with sizes 0.5 billion, 1.5 billion, 7 billion, and 72 billion parameters. In addition, there is a Mixture-of-Experts model with 57 billion parameters, where 14 billion parameters are activated at the same time. (Since architecture details are not the focus this time, I won't go too much into the Mixture-of-Experts model; however, in a nutshell, this is similar to Mixtral by Mistral AI, except that it has more active experts. For a high-level overview, see the \u003Ca href=\"https://magazine.sebastianraschka.com/i/141130005/mixtral-architecture\" target=\"_blank\" rel=\"nofollow noopener\"\u003EMixtral Architecture\u003C/a\u003E section in my \u003Ca href=\"https://magazine.sebastianraschka.com/p/research-papers-in-january-2024\" target=\"_blank\" rel=\"nofollow noopener\"\u003EModel Merging, Mixtures of Experts, and Towards Smaller LLMs article\u003C/a\u003E.)\u003C/p\u003E\u003Cp\u003EOne of the stand-out features of the Qwen 2 LLMs is their good multilingual capability across 30 languages. They also have a surprisingly large 151,642 token vocabulary (for reference, Llama 2 uses a 32k vocabulary, and Llama 3.1 uses a 128k token vocabulary); as a rule of thumb, increasing the vocab size by 2x reduces the number of input tokens by 2x, so the LLM can fit more text into the same input. A larger vocabulary also especially helps with multilingual data and coding by covering words outside the standard English vocabulary.\u003C/p\u003E\u003Cp\u003EBelow is a brief MMLU benchmark comparison with other LLMs covered later. (Note that MMLU is a multiple-choice benchmark and thus has its limitations; however, it is still one of the most popular methods for reporting LLM performance.)\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEnQkpjqDe8Hg/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1723739618721?e=1749081600&v=beta&t=Nj1zsBQnR174bh_CUIVkHjumlsO1MLta-ZVGViHhhX8\" src=\"//:0\"\u003E\u003Cfigcaption\u003EMMLU benchmark scores for the latest open-weight models (higher values are better). I collected the scores for this plot from the official research papers of each model.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E(If you are new to MMLU, I briefly discussed it in my \u003Ca href=\"https://www.youtube.com/watch?v=kPGTx4wcm_w\" target=\"_blank\"\u003Erecent talk at minute 46:05\u003C/a\u003E.)\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E1.2 Qwen 2 Pre-training\u003C/h3\u003E\u003Cp\u003EThe Qwen 2 team trained the 1.5 billion, 7 billion, and 72 billion parameter models on 7 trillion training tokens, which is a reasonable size. For comparison, Llama 2 models were trained on 2 trillion tokens, and Llama 3.1 models were trained on 15 trillion tokens.\u003C/p\u003E\u003Cp\u003EInterestingly, the 0.5 billion parameter model was trained on 12 trillion tokens. 
However, the researchers did not train the other models on the larger 12 trillion token dataset because they did not observe any improvements during training, and the additional computational costs were not justified.\u003C/p\u003E\u003Cp\u003EOne of the focus areas has been improving the data filtering pipeline to remove low-quality data and enhancing data mixing to increase data diversity— a theme we will revisit when examining other models later.\u003C/p\u003E\u003Cp\u003EInterestingly, they also used Qwen models (although they didn't specify details, I assume they mean previous generation Qwen models) to synthesize additional pre-training data.\u003C/p\u003E\u003Cp\u003EFurthermore, they performed training in two stages: regular pre-training followed by long-context training. The latter increased the context length from 4,096 to 32,768 tokens at the end phase of pre-training using \"high-quality, lengthy data.\"\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E(Unfortunately, another theme of the technical reports is that details about the dataset are scarce, so if my write-up does not appear very detailed, it's due to the lack of publicly available information.)\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E1.3 Qwen 2 Post-training\u003C/h3\u003E\u003Cp\u003EThe Qwen 2 team employed the popular two-phase post-training methodology, starting with supervised instruction fine-tuning (SFT), which was applied across 500,000 examples for 2 epochs. This phase aimed to refine the model’s response accuracy in predetermined scenarios.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQHFXd7qjn08ww/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1723739707635?e=1749081600&v=beta&t=-uKyDfvcdZamnJFCIkO1pXca-AcYm7KqbybrEH9xilY\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EAfter SFT, they used direct preference optimization (DPO) to align the LLM with human preferences. (Interestingly referred to in their terminology as reinforcement learning from human feedback, RLHF.) As I discussed in my Tips for LLM Pretraining and Evaluating Reward Models article a few weeks ago, the SFT+DPO approach seems to be the most popular preference tuning strategy at the moment due to the ease of use compared to other methods, such as RLHF with PPO. (If you want to learn how DPO works, I recently implemented it from scratch here.)\u003C/p\u003E\u003Cp\u003EThe alignment phase itself was also done in 2 stages. First using DPO on an existing dataset (offline stage). Second, using a reward model to form the preference pair (online). Here, the model generates multiple responses during training, and a reward model selects the preferred response for the optimization step in \"real-time\" (that is, during training). This is also often referred to as \"rejection sampling.\"\u003C/p\u003E\u003Cp\u003EFor the construction of the dataset, they used existing corpora complemented by human labeling to determine target responses for SFT and identify preferred and rejected responses essential for DPO. The researchers also synthesized artificially annotated data. 
\u003C/p\u003E\u003Cp\u003EMoreover, the team used LLMs to generate instruction-response pairs specifically tailored for \"high-quality literary data,\" to create high-quality Q&A pairs for training.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEfzeDz_NDvEw/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1723739742407?e=1749081600&v=beta&t=Skvv3BG_EpVpbrK-aEBFbkdSoW7UzLnw2CkWeWh0m1g\" src=\"//:0\"\u003E\u003Cfigcaption\u003ESummary of techniques for Qwen 2 post-training.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E1.4 Conclusion\u003C/h3\u003E\u003Cp\u003EQwen 2 is a relatively capable model, just like earlier generations of Qwen. When I attended the NeurIPS LLM efficiency challenge in December 2023, I remember that most of the winning approaches involved a Qwen model. \u003C/p\u003E\u003Cp\u003ERegarding the training pipeline of Qwen 2, what stands out is that synthetic data has been used for both pre-training and post-training. Also, the focus on dataset filtering (rather than collecting as much data as possible) is one of the notable trends in LLM training. Here, I would say that more data is better, but only if it meets certain quality standards.\u003C/p\u003E\u003Ch2\u003EAligning LLMs with Direct Preference Optimization from Scratch\u003C/h2\u003E\u003Cp\u003EDirect Preference Optimization (DPO) has become one of the go-to methods to align LLMs more closely with user preferences, and it's something you will read a lot about in this article. If you want to learn how it works, I coded it from scratch here: \u003Ca href=\"https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/04_preference-tuning-with-dpo/dpo-from-scratch.ipynb\" target=\"_blank\" rel=\"nofollow noopener\"\u003EDirect Preference Optimization (DPO) for LLM Alignment (From Scratch)\u003C/a\u003E.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEDmkHyU6MXzw/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1723739760519?e=1749081600&v=beta&t=1Ni6kVjXPLynfz3qsCNytthq80mQudWoVDhhZ_1TAO4\" src=\"//:0\"\u003E\u003Cfigcaption\u003EAn overview of DPO.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch2\u003E2. Apple's Apple Intelligence Foundation Language Models (AFM)\u003C/h2\u003E\u003Cp\u003EI was really delighted to see another technical paper by Apple on arXiv.org that outlines their model training. An unexpected but definitely positive surprise!\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E2.1 AFM Overview\u003C/h3\u003E\u003Cp\u003EIn the \u003Ca href=\"https://arxiv.org/abs/2407.21075\" target=\"_blank\" rel=\"nofollow noopener\"\u003EApple Intelligence Foundation Language Models\u003C/a\u003E paper, the research team outlines the development of two primary models designed for use in the \"Apple Intelligence\" context on Apple devices. For brevity, these models will be abbreviated as AFM for \"Apple Foundation Models\" throughout this section.\u003C/p\u003E\u003Cp\u003ESpecifically, the paper describes two versions of the AFM: a 3-billion-parameter on-device model intended for deployment on phones, tablets, or laptops, and a more capable server model of unspecified size. 
\u003C/p\u003E\u003Cp\u003EThese models are developed for chat, math, and coding tasks, although the paper does not discuss any of the coding-specific training and capabilities.\u003C/p\u003E\u003Cp\u003ELike the Qwen 2 models, the AFMs are dense LLMs and do not utilize a mixture-of-experts approach.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E2.2 AFM Pre-training\u003C/h3\u003E\u003Cp\u003EI'd like to extend two big kudos to the researchers. First, besides using publicly available data and data licensed by publishers, they respected the robots.txt files on websites and refrained from crawling these. Second, they also mentioned that they performed decontamination with benchmark data.\u003C/p\u003E\u003Cp\u003ETo reinforce one of the takeaways of the Qwen 2 paper, the researchers mentioned that quality was much more important than quantity. (With a vocabulary size of 49k tokens for the device model and 100k tokens for the server model, the vocabulary sizes were noticeably smaller than those of the Qwen 2 models, which use a 150k token vocabulary.)\u003C/p\u003E\u003Cp\u003EInterestingly, the pre-training was not done in 2 but 3 stages!\u003C/p\u003E\u003Col\u003E\u003Cli\u003E\u003Cp\u003ECore (regular) pre-training\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EContinued pre-training where web-crawl (lower-quality) data was down-weighted; math and code were up-weighted\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EContext-lengthening with longer sequence data and synthetic data\u003C/p\u003E\u003C/li\u003E\u003C/ol\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQHrOaAD95jszg/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1723739788741?e=1749081600&v=beta&t=VZPGqgHs1SE4DTtprSWuxEK-5JxhgykZkkaw-zA5Nwk\" src=\"//:0\"\u003E\u003Cfigcaption\u003EOverview of the 3-step pre-training process that the AFM models underwent.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003ELet's take a look at these 3 steps in a bit more detail.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E2.2.1 Pre-training I: Core Pre-training\u003C/h3\u003E\u003Cp\u003ECore pre-training describes the first stage in Apple's pre-training pipeline. This is akin to regular pre-training, where the AFM-server model was trained on 6.3 trillion tokens with a batch size of 4096 and a 4096-token sequence length. This is very similar to the Qwen 2 models, which were trained on 7 trillion tokens.\u003C/p\u003E\u003Cp\u003EHowever, it gets more interesting for the AFM-on-device model, which is distilled and pruned from a larger 6.4-billion-parameter model (trained from scratch like the AFM-server model described in the previous paragraph). (Note that both AFM-server and AFM-on-device are 3-billion parameter models.)\u003C/p\u003E\u003Cp\u003EThere's not much detail on the distillation process besides \"a distillation loss is used by replacing the target labels with a convex combination of the true labels and the teacher model's top-1 predictions (with 0.9 weight assigned to the teacher labels).\"\u003C/p\u003E\u003Cp\u003EI feel that knowledge distillation is becoming increasingly prevalent and useful for LLM pre-training (Gemma-2 uses it, too). I plan to cover it in more detail one day. 
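\u003C/p\u003E\u003Cp\u003E(As a rough illustration of the convex-combination idea quoted above: the per-token target could be formed roughly as in the sketch below. This is only my reading of the paper's one-sentence description, not Apple's implementation, and the helper function name is made up for the example.)\u003C/p\u003E\u003Cpre\u003E
import torch
import torch.nn.functional as F

def distillation_targets(true_labels, teacher_logits, vocab_size, teacher_weight=0.9):
    # One-hot targets for the ground-truth next tokens
    true_one_hot = F.one_hot(true_labels, vocab_size).float()
    # One-hot targets for the teacher model's top-1 predicted tokens
    teacher_top1 = teacher_logits.argmax(dim=-1)
    teacher_one_hot = F.one_hot(teacher_top1, vocab_size).float()
    # Convex combination: 0.9 weight on the teacher, 0.1 on the true labels
    return teacher_weight * teacher_one_hot + (1.0 - teacher_weight) * true_one_hot

# The student would then be trained with a cross-entropy-style loss against
# these soft targets, e.g.:
# loss = torch.sum(-targets * F.log_softmax(student_logits, dim=-1), dim=-1).mean()
\u003C/pre\u003E\u003Cp\u003E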
For now, here's a brief overview of how this process would work on a high level.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQF158e3eojOjg/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1723739807435?e=1749081600&v=beta&t=J2L4ZTOmn35D7zTKfaiJhc5WtyByGyQynrieJBWGH6M\" src=\"//:0\"\u003E\u003Cfigcaption\u003EAn overview of knowledge distillation, where a small model (here, the AFM-device 3B model) is trained on the original training tokens plus the outputs from a larger teacher model (here, a 6.4B model). Note that the cross entropy loss in a) is the regular training loss used for pre-training LLMs (see chapter 5 in my \"\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EKnowledge distillation, as illustrated above, still involves training on the original dataset. However, in addition to the training tokens in the dataset, the model to be trained (referred to as the student) receives information from the larger (teacher) model, which provides a richer signal compared to training without knowledge distillation. The downside is that you must: 1) train the larger teacher model first, and 2) compute predictions on all training tokens using the larger teacher model. These predictions can be computed ahead of time (which requires substantial storage space) or during training (which may slow down the training process).\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E2.2.2 Pre-training II: Continued Pre-training \u003C/h3\u003E\u003Cp\u003EThe continued pre-training stage includes a small context lengthening step from 4,096 to 8,192 tokens on a dataset consisting of 1 trillion tokens (the core pre-training set was five times larger). The primary focus, however, is on training with a high-quality data mix, with an emphasis on math and code. \u003C/p\u003E\u003Cp\u003EInterestingly, the researchers found that the distillation loss was not beneficial in this context.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E2.2.3 Pre-training III: Context Lengthening\u003C/h3\u003E\u003Cp\u003EThe third pre-training stage involves only 100 billion tokens (10% of the tokens used in the second stage) but represents a more significant context lengthening to 32,768 tokens. To achieve this, the researchers augmented the dataset with synthetic long-context Q&A data.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQHnkphfE14WdA/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1723739827723?e=1749081600&v=beta&t=pT-HNHuOMh3X1VlDoobpNxKu50OXmfqw1GtMAYcvLgE\" src=\"//:0\"\u003E\u003Cfigcaption\u003ESummary of techniques for AFM pre-training.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E2.3 AFM Post-training\u003C/h3\u003E\u003Cp\u003EApple appears to have taken a similarly comprehensive approach to their post-training process as they did with pre-training. They leveraged both human-annotated and synthetic data, emphasizing that data quality was prioritized over quantity. 
Interestingly, they did not rely on predetermined data ratios; instead, they fine-tuned the data mixture through multiple experiments to achieve the optimal balance.\u003C/p\u003E\u003Cp\u003EThe post-training phase involved a two-step process: supervised instruction fine-tuning followed by several rounds of reinforcement learning with human feedback (RLHF).\u003C/p\u003E\u003Cp\u003EA particularly noteworthy aspect of this process is Apple’s introduction of two new algorithms for the RLHF stage:\u003C/p\u003E\u003Col\u003E\u003Cli\u003E\u003Cp\u003ERejection Sampling Fine-tuning with Teacher Committee (iTeC)\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003ERLHF with Mirror Descent Policy Optimization\u003C/p\u003E\u003C/li\u003E\u003C/ol\u003E\u003Cp\u003EGiven the length of this article, I won’t go into the technical details of these methods, but here’s a brief overview:\u003C/p\u003E\u003Cp\u003EThe iTeC algorithm combines rejection sampling with multiple preference tuning techniques—specifically, SFT, DPO, IPO, and online RL. Rather than relying on a single algorithm, Apple trained models using each approach independently. These models then generated responses, which were evaluated by humans who provided preference labels. This preference data was used to iteratively train a reward model in an RLHF framework. During the rejection sampling phase, a committee of models generated multiple responses, with the reward model selecting the best one.\u003C/p\u003E\u003Cp\u003EThis committee-based approach is quite complex but should be relatively feasible, particularly given the relatively small size of the models involved (around 3 billion parameters). Implementing such a committee with much larger models, like the 70B or 405B parameter models in Llama 3.1, would definitely be more challenging.\u003C/p\u003E\u003Cp\u003EAs for the second algorithm, RLHF with Mirror Descent, it was chosen because it proved more effective than the commonly used PPO (Proximal Policy Optimization).\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQF7Z1qR9alNMA/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1723743281948?e=1749081600&v=beta&t=QWcXbSr5WC_uPncDAXe6ZNaXIUvkXf-sowy_37V-_bE\" src=\"//:0\"\u003E\u003Cfigcaption\u003ESummary of techniques for AFM post-training.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E2.4 Conclusion\u003C/h3\u003E\u003Cp\u003EApple's approach to pre-training and post-training is relatively comprehensive, likely because the stakes are very high (the model is deployed on millions, if not billions, of devices). 
However, given the small nature of these models, a vast array of techniques also becomes feasible, since a 3B model is less than half the size of the smallest Llama 3.1 model.\u003C/p\u003E\u003Cp\u003EOne of the highlights is that it’s not a simple choice between RLHF and DPO; instead, they used multiple preference-tuning algorithms in the form of a committee.\u003C/p\u003E\u003Cp\u003EIt’s also interesting that they explicitly used Q&A data as part of the pre-training—something I discussed in my previous article, \u003Ca href=\"https://magazine.sebastianraschka.com/p/instruction-pretraining-llms\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003EInstruction Pretraining LLMs\u003C/em\u003E\u003C/a\u003E.\u003C/p\u003E\u003Cp\u003EAll in all, it's a refreshing and delightful technical report.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch2\u003E3. Google's Gemma 2\u003C/h2\u003E\u003Cp\u003EGoogle's Gemma models were recently described in Gemma 2: Improving Open Language Models at a Practical Size (https://arxiv.org/abs/2408.00118).\u003C/p\u003E\u003Cp\u003E I'll provide an overview of some of key facts in the following overview section before discussing the pre-training and post-training processes.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E3.1 Gemma 2 Overview\u003C/h3\u003E\u003Cp\u003EThe Gemma 2 models are available in three sizes: 2 billion, 9 billion, and 27 billion parameters. The primary focus is on exploring techniques that do not necessarily require increasing the size of training datasets but rather on developing relatively small and efficient LLMs.\u003C/p\u003E\u003Cp\u003ENotably, Gemma 2 features a substantial vocabulary size of 256k tokens. For comparison, Llama 2 uses a 32k token vocabulary, and Llama 3 has a 128k token vocabulary.\u003C/p\u003E\u003Cp\u003EAdditionally, Gemma 2 employs sliding window attention, similar to Mistral's early models, likely to reduce memory costs. For more details on the Gemma 2 architecture, please refer to the \u003Ca href=\"https://magazine.sebastianraschka.com/i/146761957/gemma\" target=\"_blank\" rel=\"nofollow noopener\"\u003EGemma 2 section in my previous article\u003C/a\u003E.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E3.2 Gemma 2 Pre-training\u003C/h3\u003E\u003Cp\u003EThe Gemma researchers argue that even small models are often undertrained. However, rather than simply increasing the size of the training dataset, they focus on maintaining quality and achieve improvements through alternative methods, such as knowledge distillation, similar to Apple's approach. \u003C/p\u003E\u003Cp\u003EWhile the 27B Gemma 2 model was trained from scratch, the smaller models were trained using knowledge distillation similar to Apple's approach explained previously. \u003C/p\u003E\u003Cp\u003EThe 27B model was trained on 13 trillion tokens, the 9B model on 8 trillion tokens, and the 2B model on 2 trillion tokens. Additionally, similar to Apple's approach, the Gemma team optimized the data mixture to improve performance.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E3.3 Gemma 2 Post-training\u003C/h3\u003E\u003Cp\u003EThe post-training process for the Gemma models involved the typical supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) steps. \u003C/p\u003E\u003Cp\u003EThe instruction data involved using English-only prompt pairs, which were a mix of human-generated and synthetic-generated content. 
Specifically, and interestingly, the responses were primarily generated by teacher models, and knowledge distillation was also applied during the SFT phase.\u003C/p\u003E\u003Cp\u003EAn interesting aspect of their RLHF approach, following SFT, is that the reward model used for RLHF is ten times larger than the policy (target) model.\u003C/p\u003E\u003Cp\u003EThe RLHF algorithm employed by Gemma is fairly standard, but with a unique twist: they average the policy models through a method called WARP, a successor to WARM (weight-averaged reward models). I previously discussed this method in detail in my article \"Model Merging, Mixtures of Experts, and Towards Smaller LLMs\" (\u003Ca href=\"https://magazine.sebastianraschka.com/i/141130005/warm-on-the-benefits-of-weight-averaged-reward-models\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehttps://magazine.sebastianraschka.com/i/141130005/warm-on-the-benefits-of-weight-averaged-reward-models\u003C/a\u003E).\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQGFJhMtIPYkGA/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1723740080693?e=1749081600&v=beta&t=5jO89ajQelRD2njKr6LWuepC3AWtAH45_Pv9L96Dm9c\" src=\"//:0\"\u003E\u003Cfigcaption\u003ESummary of techniques for Gemma 2 post-training.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E3.4 Conclusion\u003C/h3\u003E\u003Cp\u003EThe Gemma team seems to really double down on knowledge distillation, which they use during both pre-training and post-training similar to Apple. Interestingly, they didn't use a multi-stage pre-training approach though, or at least, they didn't detail it in their paper.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQGLft1xD95YpA/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1723740093616?e=1749081600&v=beta&t=vRd9CQvLfAOhfZCT7xFKHdPMkQqocCHOqtciIezLj5w\" src=\"//:0\"\u003E\u003Cfigcaption\u003EI am excited to be invited to give a keynote talk at the upcoming \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch2\u003E4. Meta AI's Llama 3.1\u003C/h2\u003E\u003Cp\u003ENew releases of Meta's Llama LLMs are always a big thing. This time, the release was accompanied by a 92-page technical report: \u003Ca href=\"https://arxiv.org/abs/2407.21783\" target=\"_blank\" rel=\"nofollow noopener\"\u003EThe Llama 3 Herd of Models\u003C/a\u003E. Last but not least, in this section, we will look at the fourth big model paper released last month.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E4.1 Llama 3.1 Overview\u003C/h3\u003E\u003Cp\u003EAlong with releasing a huge 405 billion parameter model, Meta updated their previous 8 billion and 70 billion parameter models, giving them a slight MMLU performance boost.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EWhile Llama 3 uses group query attention like other recent LLMs, surprisingly, Meta AI said no to sliding window attention and Mixture-of-Experts approaches. In other words, the Llama 3.1 looks very traditional, and the focus was clearly on the pre-training and post-training rather than architecture innovations.\u003C/p\u003E\u003Cp\u003ESimilar to previous Llama releases, the weights are openly available. 
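To make the weight averaging behind WARM/WARP (section 3.3) more tangible, here is a minimal PyTorch sketch that uniformly averages the parameters of several checkpoints sharing one architecture. The real WARP recipe involves additional steps (e.g., interpolation toward the initialization), so treat this purely as the core idea.

```python
import copy
import torch

def average_checkpoints(models):
    """Return a new model whose parameters are the element-wise mean
    of the given models' parameters (all models share one architecture)."""
    avg_model = copy.deepcopy(models[0])
    avg_state = avg_model.state_dict()
    for key in avg_state:
        stacked = torch.stack([m.state_dict()[key].float() for m in models])
        avg_state[key] = stacked.mean(dim=0)
    avg_model.load_state_dict(avg_state)
    return avg_model

# Toy usage: average three small "policy" networks
models = [torch.nn.Linear(4, 2) for _ in range(3)]
merged = average_checkpoints(models)
print(merged.weight)
```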
Moreover, Meta said that they updated the Llama 3 license so that it's now finally possible (allowed) to use Llama 3 for synthetic data generation or knowledge distillation to improve other models.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E4.2 Llama 3.1 Pre-training\u003C/h3\u003E\u003Cp\u003ELlama 3 was trained on a massive 15.6 trillion token dataset, which is a substantial increase from Llama 2's 1.8 trillion tokens. The researchers say that it supports at least eight languages (whereas Qwen 2 is capable of handling 20). \u003C/p\u003E\u003Cp\u003EAn interesting aspect of Llama 3 is its vocabulary size of 128,000, which was developed using OpenAI's tiktoken tokenizer. (For those interested in tokenizer performance, I did a simple benchmark comparison \u003Ca href=\"https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/02_bonus_bytepair-encoder/compare-bpe-tiktoken.ipynb\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehere\u003C/a\u003E.)\u003C/p\u003E\u003Cp\u003EIn terms of pre-training data quality control, Llama 3 employs heuristic-based filtering alongside model-based quality filtering, utilizing fast classifiers like Meta AI's fastText and RoBERTa-based classifiers. These classifiers also help in determining the context categories for the data mix used during training.\u003C/p\u003E\u003Cp\u003EThe pre-training for Llama 3 is divided into three stages. The first stage involves standard initial pre-training using the 15.6 trillion tokens with an 8k context window. The second stage continues with the pre-training but extends the context length to 128k. The final stage involves annealing, which further enhances the model's performance. Let's look into these stages in more detail below.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E4.2.1 Pre-training I: Standard (Initial) Pre-training\u003C/h3\u003E\u003Cp\u003EIn their training setup, they began with batches consisting of 4 million tokens, each with a sequence length of 4096. This implies a batch size of approximately 1,024 sequences (4 million tokens divided by the 4,096-token sequence length), assuming that the 4 million figure is rounded. After processing the first 252 million tokens, they doubled the sequence length to 8192. Further into the training process, after 2.87 trillion tokens, they doubled the batch size again.\u003C/p\u003E\u003Cp\u003EAdditionally, the researchers did not keep the data mix constant throughout the training. Instead, they adjusted the mix of data being used during the training process to optimize model learning and performance. This dynamic approach to data handling likely helped in improving the model's ability to generalize across different types of data.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E4.2.2 Pre-training II: Continued Pre-training for Context Lengthening\u003C/h3\u003E\u003Cp\u003ECompared to other models that increased their context window in a single step, the Llama 3.1 context lengthening was a more gradual approach: here, the researchers increased the context length through six distinct stages from 8,000 to 128,000 tokens. This stepwise increment likely allowed the model to adapt more smoothly to larger contexts. 
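As a small aside to the tiktoken-based tokenizer mentioned in section 4.2, the snippet below shows how one might compare token counts across BPE vocabularies with the tiktoken library. The Llama 3 vocabulary itself is not bundled with tiktoken, so the GPT-2 and cl100k_base encodings are used here purely for illustration.

```python
import tiktoken  # pip install tiktoken

text = "Llama 3 was trained on a massive 15.6 trillion token dataset."

for name in ("gpt2", "cl100k_base"):
    enc = tiktoken.get_encoding(name)
    tokens = enc.encode(text)
    # Larger vocabularies typically encode the same text in fewer tokens
    print(f"{name}: vocab size {enc.n_vocab:,}, {len(tokens)} tokens")
```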
\u003C/p\u003E\u003Cp\u003EThe training set utilized for this process involved 800 billion tokens, about 5% of the total dataset size.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E4.2.3 Pre-training III: Annealing on High-quality Data\u003C/h3\u003E\u003Cp\u003EFor the third pre-training stage, the researchers trained the model on a small but high-quality mix, which they found helps improve the performance on benchmark datasets. For example, annealing on the GSM8K and MATH training sets provided a significant boost on the respective GSM8K and MATH validation sets.\u003C/p\u003E\u003Cp\u003EIn section 3.1.3 of the paper, the researchers stated that the annealing dataset size was 40 billion tokens (0.02% of the total dataset size). However, in section 3.4.3, they state that the annealing was done only on 40 million tokens (0.1% of the annealing data).\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQF9pc_ifCM_eQ/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1723740184720?e=1749081600&v=beta&t=YYoTxkfVCaN_UzxAMF16p-9JGScsvn4LW1HG9QwZGbE\" src=\"//:0\"\u003E\u003Cfigcaption\u003ESummary of techniques for Llama 3.1 pre-training.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E4.3 Llama 3.1 Post-training\u003C/h3\u003E\u003Cp\u003EFor their post-training process, the Meta AI team employed a relatively straightforward method that included supervised fine-tuning (SFT), rejection sampling, and direct preference optimization (DPO). \u003C/p\u003E\u003Cp\u003EThey observed that reinforcement learning algorithms like RLHF with PPO were less stable and more challenging to scale compared to these techniques. It's worth noting that the SFT and DPO steps were iteratively repeated over multiple rounds, incorporating both human-generated and synthetic data.\u003C/p\u003E\u003Cp\u003EBefore describing further details, their workflow is illustrated in the figure below.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQGU1-vlsXq2Ew/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1723740204542?e=1749081600&v=beta&t=hLxuKiCmt7f2oJQAKGmvV8j4spaE2MlA4zCbcERKDmM\" src=\"//:0\"\u003E\u003Cfigcaption\u003EAnnotated figure from the Llama 3.1 paper describing the post-training procedure\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003ENote that even though they used DPO, they also developed a reward model as you'd do in RLHF. Initially, they trained the reward model using a checkpoint from the pre-training phase, utilizing human-annotated data. This reward model was then used for the rejection sampling process, helping to select appropriate prompts for further training. \u003C/p\u003E\u003Cp\u003EIn each training round, they applied model averaging techniques not only to the reward model but also to the SFT and DPO models. 
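For reference, the DPO objective used in the Llama 3.1 post-training stage (section 4.3) can be sketched in a few lines of PyTorch. The function below operates on summed per-sequence log-probabilities; the variable names and the beta value are generic placeholders, not Meta's implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss on summed per-sequence log-probabilities."""
    # Log-ratios of the trainable policy vs. the frozen reference model
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Encourage the policy to prefer the chosen response over the rejected one
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

# Toy usage with random per-sequence log-probabilities (batch of 4)
torch.manual_seed(0)
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss)
```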
This averaging involved merging the parameters from recent and previous models to stabilize (and improve) performance over time.\u003C/p\u003E\u003Cp\u003EFor those interested in the technical specifics of model averaging, I discussed this topic in the section \"Understanding Model Merging and Weight Averaging\" of my earlier article \u003Ca href=\"https://magazine.sebastianraschka.com/i/141130005/understanding-model-merging-and-weight-averaging\" target=\"_blank\" rel=\"nofollow noopener\"\u003EModel Merging, Mixtures of Experts, and Towards Smaller LLMs\u003C/a\u003E.\u003C/p\u003E\u003Cp\u003ETo sum it up, at the core, it's a relatively standard SFT + DPO stage. However, this stage is repeated over multiple rounds. Then, they sprinkled in a reward model for rejection sampling (like Qwen 2 and AFM). They also used model averaging like Gemma; however, it's not just for the reward models but all models involved.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQFMisqtXd1I3A/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1723740233861?e=1749081600&v=beta&t=Q7JjuMFfDUX8VhDMA0iX5ClUhjdtHs_2YcUdQQAQf0c\" src=\"//:0\"\u003E\u003Cfigcaption\u003ESummary of techniques for Llama 3.1 post-training.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E4.4 Conclusion\u003C/h3\u003E\u003Cp\u003EThe Llama 3 models remain fairly standard and similar to the earlier Llama 2 models but with some interesting approaches. Notably, the large 15 trillion token training set distinguishes Llama 3 from other models. Interestingly, like Apple's AFM model, Llama 3 also implemented a 3-stage pre-training process.\u003C/p\u003E\u003Cp\u003EIn contrast to other recent large language models, Llama 3 did not employ knowledge distillation techniques, opting instead for a more straightforward model development path. For post-training, the model utilized Direct Preference Optimization (DPO) instead of the more complex reinforcement learning strategies that have been popular in other models. Overall, this choice is interesting as it indicates a focus on refining LLM performance through simpler (but proven) methods.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch2\u003E5. Main Takeaways\u003C/h2\u003E\u003Cp\u003EWhat can we learn from these four models discussed in this article: Alibaba's Qwen 2, Apple's foundational models (AFM), Google's Gemma 2, and Meta's Llama 3?\u003C/p\u003E\u003Cp\u003EAll four models take somewhat different approaches to pre-training and post-training. Of course, methodologies overlap, but no training pipeline is quite the same. For pre-training, a shared feature seems to be that all methods use a multi-stage pre-training pipeline, where a general core pre-training is followed by a context lengthening and sometimes high-quality annealing step. 
The figure below shows again the different methods employed in pre-training at a glance.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQF_Rmh9ld7lOA/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1723901804445?e=1749081600&v=beta&t=iZ99NGnI0zPwxT3iL9lve9hyZqqm7DQ1zeWbYTSq5Cg\" src=\"//:0\"\u003E\u003Cfigcaption\u003EOverview of the techniques used for pre-training\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EWhen it comes to post-training, also none of the pipelines was exactly the same. It seems that rejection sampling is now a common staple in the post-training process. However, when it comes to DPO or RLHF, there's no consensus or preference (no pun intended) yet.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQGN8qF0i3Cwpg/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1723740274523?e=1749081600&v=beta&t=6hKPDdc_bxFsMtrAKYikSqjz1qmqD4dZ4gm-krm28I0\" src=\"//:0\"\u003E\u003Cfigcaption\u003EOverview of the techniques used for post-training\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003ESo, in all, there is no single recipe but many paths to developing highly-performant LLMs.\u003C/p\u003E\u003Cp\u003ELastly, the four models perform in the same ballpark. Unfortunately, several of these models have not made it into the LMSYS and AlpacaEval leaderboards, so we have no direct comparison yet, except for the scores on multiple-choice benchmarks like MMLU and others.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch2\u003ESupporting Ahead of AI\u003C/h2\u003E\u003Cp\u003EAhead of AI is a personal passion project that does not offer direct compensation. However, for those who wish to support me, please consider purchasing a copy of \u003Ca href=\"https://sebastianraschka.com/books/\" target=\"_blank\" rel=\"nofollow noopener\"\u003Emy books\u003C/a\u003E. If you find them insightful and beneficial, please feel free to recommend them to your friends and colleagues.\u003C/p\u003E\u003Cp\u003EIf you have a few moments, a review of \u003Ca href=\"https://www.amazon.com/Machine-Learning-AI-Essential-Questions/dp/1718503768\" target=\"_blank\" rel=\"nofollow noopener\"\u003EMachine Learning Q and AI\u003C/a\u003E or \u003Ca href=\"https://www.amazon.com/Machine-Learning-PyTorch-Scikit-Learn-learning/dp/1801819319/ref=sr_1_1?crid=27NRKE8510ZCW&dib=eyJ2IjoiMSJ9.TkNrEFi6U3A1XzQPR1fFxtkV_dPIIK-OXf9buQc9DOIfhEkMakIHB-Vc2_jmSqnevK3jzIMIRVz0VAChRIhNFKXAmxnmCjXHn563FpaCIcpyCvQHafM2tctHm9PSVX11UtMM1pndhNgx9p0K8g7ExA7s-13q5LVOtKoJ-TWB7xswfc-tsL81s1mANhwQmAgDJGJwiewdxKdSEEr1iuoe_8wL9bMLbwxgBl6AuMBLDl0.u59naLiDQN-WVJ7ZGIUS6KEpMOnzd9XpWVMixwa4sjk&dib_tag=se&keywords=raschka+machine+learning&qid=1721327500&sprefix=raschka+machine+learnin%2Caps%2C204&sr=8-1\" target=\"_blank\" rel=\"nofollow noopener\"\u003EMachine Learning with PyTorch and Scikit-Learn\u003C/a\u003E on Amazon would really help, too!\u003C/p\u003E\u003Cp\u003EYour support means a great deal and is tremendously helpful in continuing this journey. 
Thank you!\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQGnw1nlGb8NjA/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1723740292561?e=1749081600&v=beta&t=9vnFSnLUa9PG8suleL-USGSXhyTdnPulbMT3j89HF94\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cul\u003E\u003Cli\u003E\u003Cp\u003E\u003Ca href=\"http://mng.bz/M96o\" target=\"_blank\" rel=\"nofollow noopener\"\u003EBuild a Large Language Model (from Scratch)\u003C/a\u003E is a highly focused book dedicated to coding LLMs from the ground up in PyTorch, covering everything from pre-training to post-training—arguably the best way to truly understand LLMs.\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003E\u003Ca href=\"https://nostarch.com/machine-learning-and-ai-beyond-basics\" target=\"_blank\" rel=\"nofollow noopener\"\u003EMachine Learning Q and AI\u003C/a\u003E is a great book for those who are already familiar with the basics; it dives into intermediate and advanced concepts covering deep neural networks, vision transformers, multi-GPU training paradigms, LLMs, and many more.\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003E\u003Ca href=\"https://www.amazon.com/Machine-Learning-PyTorch-Scikit-Learn-scikit-learn-ebook-dp-B09NW48MR1/dp/B09NW48MR1/\" target=\"_blank\" rel=\"nofollow noopener\"\u003EMachine Learning with PyTorch and Scikit-Learn\u003C/a\u003E is a comprehensive guide to machine learning, deep learning, and AI, offering a well-balanced mix of theory and practical code. It's the ideal starting point for anyone new to the field. \u003C/p\u003E\u003C/li\u003E\u003C/ul\u003E\u003Cp\u003E\u003C/p\u003E","url":"https://www.linkedin.com/pulse/new-llm-pre-training-post-training-paradigms-sebastian-raschka-phd-l53zc"},{"@type":"Article","author":{"@type":"Person","name":"Sebastian Raschka, PhD","url":"https://www.linkedin.com/in/sebastianraschka"},"dateModified":"","datePublished":"2024-07-20T12:55:01.000+00:00","headline":"Instruction Pretraining LLMs","image":{"@type":"ImageObject","url":"https://media.licdn.com/dms/image/v2/D5612AQHbmz6yfVIGxA/article-cover_image-shrink_720_1280/article-cover_image-shrink_720_1280/0/1721330358225?e=2147483647&v=beta&t=rmnfhvwfvC6BMNBlAcut56cUPuZOalRA8syQaylqZhg"},"articleBody":"\u003Cp\u003EA lot has happened last month: Apple announced the integration of on-device LLMs, Nvidia shared their large Nemotron model, FlashAttention-3 was announced, Google's Gemma 2 came out, and much more. \u003C/p\u003E\u003Cp\u003EYou've probably already read about it all in various news outlets. 
So, in this article, I want to focus on recent research centered on instruction finetuning, a fundamental technique for training LLMs.\u003C/p\u003E\u003Cp\u003EWhat I am going to cover in this article:\u003C/p\u003E\u003Col\u003E\u003Cli\u003E\u003Cp\u003EA new, cost-effective method for generating data for instruction finetuning\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EInstruction finetuning from scratch\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EPretraining LLMs with instruction data\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EAn overview of what's new in Gemma 2\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EAn overview of all the other interesting research papers that came out in June\u003C/p\u003E\u003C/li\u003E\u003C/ol\u003E\u003Cp\u003EHappy reading!\u003C/p\u003E\u003Ch2\u003E1. Creating Alignment Data from Scratch\u003C/h2\u003E\u003Cp\u003EThe \u003Ca href=\"https://arxiv.org/abs/2406.08464\" target=\"_blank\" rel=\"nofollow noopener\"\u003EMagpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing\u003C/a\u003E paper shares a fascinating hack to generate a high-quality dataset for LLM instruction finetuning. While this doesn't offer any particularly recent research insights, it's one of those interesting, practical exploits that seems super useful.\u003C/p\u003E\u003Ch3\u003E1.1 Generating An Instruction Dataset From Nothing\u003C/h3\u003E\u003Cp\u003EWhat distinguishes this instruction-data-generating method from others is that it can be fully automated and doesn't require any initial questions or instructions. As the paper title suggests, it enables the creation of an instruction dataset from \"Nothing\" – the only thing we need is a locally running Llama 3 8B model. The figure below summarizes how this method works.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQHFErElThN0ew/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1721331107714?e=1749081600&v=beta&t=4vcifTXQPd4ksnnNTuYaAts2EbpNyU-hCMjUJ8l0-gs\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EAnnotated illustration of the Magpie method for generating a synthetic dataset for instruction finetuning. The figure is based on illustrations from the Magpie paper: \u003C/em\u003E\u003Ca href=\"https://arxiv.org/abs/2406.08464\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehttps://arxiv.org/abs/2406.08464\u003C/em\u003E\u003C/a\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EEssentially, as shown in the figure above, we just have to prompt the \u003Cem\u003ELlama 3 8B Instruct\u003C/em\u003E model with a pre-query template, and it will generate an instruction for us. Then, we feed that instruction back to the LLM, and it will generate a response. If we repeat this procedure a couple of thousand times, we obtain a dataset for instruction finetuning. 
(Optionally, we can apply an LLM to filter the instruction-response pairs by quality.)\u003C/p\u003E\u003Ch3\u003E1.2 Dataset quality\u003C/h3\u003E\u003Cp\u003EWhat's fascinating is that with the resulting instruction dataset, the authors found that finetuning a Llama 3 8B base model with just instruction finetuning (no preference finetuning via RLHF and DPO) beats the original Llama 3 8B Instruct model by Meta AI, as shown in the figure below.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQFeSwK3h3Hnuw/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1721331142361?e=1749081600&v=beta&t=dMk_n-ba9QztxBIBmvW3_5T8tELx4FrHuS4qRXYT2ms\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EA Llama 3 8B base model finetuned on the Magpie-generated instruction dataset beats the original Llama 3 8B Instruct model. Based on an annotated illustration from the Magpie paper: \u003C/em\u003E\u003Ca href=\"https://arxiv.org/abs/2406.08464\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehttps://arxiv.org/abs/2406.08464\u003C/em\u003E\u003C/a\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EThe Magpie results shown in the figure above were achieved with only 300 thousand samples. In comparison, the original Llama 3 Instruct model was finetuned and aligned on 100 million samples!\u003C/p\u003E\u003Ch3\u003E1.3 Running the Dataset Generation Locally\u003C/h3\u003E\u003Cp\u003EI was skeptical at first, so I tried to implement this myself. It really works! \u003Ca href=\"https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama2-ollama.ipynb\" target=\"_blank\" rel=\"nofollow noopener\"\u003EHere\u003C/a\u003E, you can find my reimplementation using Ollama, which even runs fine locally on a MacBook Air.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEE62rIo1h00w/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1721331173399?e=1749081600&v=beta&t=FNflToNQiGUcha-o0I5krzLxWgb6-hc4clQspzMdKtM\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003ECode screenshot from a reimplementation of the Magpie method that runs locally. The code is available \u003C/em\u003E\u003Ca href=\"https://github.com/rasbt/LLMs-from-scratch/blob/main/ch07/05_dataset-generation/llama2-ollama.ipynb\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehere\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E.\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E1.4 Additional Details\u003C/h3\u003E\u003Cp\u003EThe authors created two sets of datasets: a \"Pro\" version using the Llama 3 70B Instruct model and an \"Air\" version using the Llama 3 8B Instruct model. 
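To give a flavor of how the Magpie trick from section 1.1 can be reproduced locally (as in the Ollama notebook linked in section 1.3), here is a hedged sketch against Ollama's REST API. It assumes a local Ollama server with a Llama 3 instruct model pulled, and the pre-query template shown is only an approximation of the Llama 3 chat format; see the linked notebook for the exact details.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL = "llama3"  # assumes `ollama pull llama3` was run beforehand

def ollama_generate(prompt, raw=False):
    payload = {"model": MODEL, "prompt": prompt, "raw": raw, "stream": False}
    req = urllib.request.Request(
        OLLAMA_URL, data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as response:
        return json.loads(response.read())["response"]

# Step 1: send only the pre-query template (raw mode, no chat templating),
# so the model "completes" it with a plausible user instruction.
pre_query_template = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
instruction = ollama_generate(pre_query_template, raw=True).strip()

# Step 2: feed the generated instruction back to obtain a response.
response = ollama_generate(instruction)
print({"instruction": instruction, "output": response})
```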
As an earlier figure showed, the Magpie-Pro-generated dataset results in slightly stronger models compared to the Magpie-Air dataset when using it to instruction-finetune a Llama 3 8B base model.\u003C/p\u003E\u003Cp\u003EThe figure below shows an additional comparison of the dataset qualities and difficulties as rated via an LLM.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQExu_LdcwPbVg/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1721331195419?e=1749081600&v=beta&t=71rVXsJGTvX11-n6EK8ZvQAagzqR29nyKpWl4yEhvIo\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EAnnotated plots from the Magpie paper showing the dataset quality and difficulty of the Air and Pro datasets relative to each other.\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EAs the figure above shows, the quality of the Air and Pro datasets is roughly on par. In addition, it would have been interesting to see how the Alpaca dataset compares to these. (The assumption is that the Magpie data is of much higher quality than Alpaca, but a reference point would be interesting.)\u003C/p\u003E\u003Cp\u003EFurthermore, the paper contains an analysis showing that the breadth or diversity in this dataset is much larger than that of other popular datasets for instruction finetuning, such as Alpaca, Evol Instruct, and UltraChat. In addition, when compared to models trained with other instruction finetuning datasets, the Magpie-Pro finetuned model also compares very favorably.\u003C/p\u003E\u003Ch3\u003E1.5 Conclusion\u003C/h3\u003E\u003Cp\u003EOverall, I think that Magpie is an interesting exploit that is, on the one hand, fascinating in its effectiveness and, on the other hand, has a lot of practical utility. I will certainly consider it as an interesting, simple, and cost-effective candidate for constructing general-purpose instruction datasets in the future.\u003C/p\u003E\u003Ch2\u003E2. Instruction Finetuning from Scratch\u003C/h2\u003E\u003Cp\u003EIf you are looking for a resource to understand the instruction finetuning process in LLMs, I am happy to share that Chapter 7 on instruction finetuning LLMs is now finally \u003Ca href=\"https://mng.bz/M96o\" target=\"_blank\" rel=\"nofollow noopener\"\u003Elive on the Manning website\u003C/a\u003E.\u003C/p\u003E\u003Cp\u003EThis is the longest chapter in the book and takes a from-scratch approach to implementing the instruction finetuning pipeline. This includes everything from input formatting to batching with a custom collate function, masking padding tokens, the training loop itself, and scoring the response quality of the finetuned LLM on a custom test set.\u003C/p\u003E\u003Cp\u003E(The exercises include changing prompt styles, instruction masking, and adding LoRA.)\u003C/p\u003E\u003Cp\u003EHappy coding!\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQHD0cWstYxwTg/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1721331224100?e=1749081600&v=beta&t=SDAlRUNtKiLk8T3UcsqqFJqdf25WHa4A0gWoTRT2s_0\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EAn overview of chapter 7 in my \u003C/em\u003E\u003Ca href=\"https://mng.bz/M96o\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003E\"Build a Large Language Model From Scratch\"\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E book. 
Supplementary code materials are available \u003C/em\u003E\u003Ca href=\"https://github.com/rasbt/LLMs-from-scratch\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehere on GitHub\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E.\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EPS: it's also the last chapter, and the publisher is currently preparing the layouts for the print version.\u003C/p\u003E\u003Ch2\u003E3. Instruction Pretraining LLMs\u003C/h2\u003E\u003Cp\u003EIn the paper \"Instruction Pre-Training: Language Models are Supervised Multitask Learners\" (\u003Ca href=\"https://arxiv.org/abs/2406.14491\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehttps://arxiv.org/abs/2406.14491\u003C/a\u003E), researchers investigate whether LLM pretraining can be made more efficient by including synthetic instruction-response pairs instead of just raw text. (Here, \"raw text\" means text from books, websites, papers, and so forth that has not been reprocessed into a specific format.)\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQHoPM9dV27-5Q/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1721331267575?e=1749081600&v=beta&t=n4ljI46IEwu-bV6CKgbKkgXNlVA2E4lPD-blzjG7QJs\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EA comparison between regular pretraining (top) and the proposed instruction pretraining approach (bottom) via an annotated figure from \u003C/em\u003E\u003Ca href=\"https://arxiv.org/abs/2406.14491\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehttps://arxiv.org/abs/2406.14491\u003C/em\u003E\u003C/a\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003ESpecifically, the researchers experiment with generating instruction-response data from the raw training corpus itself via an \"instruction synthesizer,\" an LLM specifically finetuned for this task.\u003C/p\u003E\u003Cp\u003E(Note that this is not the first paper proposing the formatting of raw text as instruction data. Another work that comes to mind is \"Genie: Achieving Human Parity in Content-Grounded Datasets Generation\" (\u003Ca href=\"https://arxiv.org/abs/2401.14367\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehttps://arxiv.org/abs/2401.14367\u003C/a\u003E). I also recall seeing another paper or blog post using instruction data during pretraining a few months ago—I discussed this method with some of my colleagues—but unfortunately, I couldn't find the reference. Nonetheless, the paper discussed here is particularly intriguing since it builds on openly available LLMs that run locally and covers both pretraining and continual pretraining.)\u003C/p\u003E\u003Ch3\u003E3.1 Instruction Synthesizer\u003C/h3\u003E\u003Cp\u003EBefore we dive into the pretraining and continual pretraining results, let's talk about the core component of this method: the instruction synthesizer. 
This is an openly available Mistral 7B v0.1 LLM (which I wrote about last year here: \u003Ca href=\"https://magazine.sebastianraschka.com/i/138555764/mistral-b\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehttps://magazine.sebastianraschka.com/i/138555764/mistral-b\u003C/a\u003E) that has been finetuned to generate instruction-response pairs from raw text.\u003C/p\u003E\u003Cp\u003ETo finetune this synthesizer, the researchers use datasets such as HotpotQA (\u003Ca href=\"https://arxiv.org/abs/1809.09600\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehttps://arxiv.org/abs/1809.09600\u003C/a\u003E), which consists of passages from Wikipedia associated with questions and answers. For this, the authors also ensure that a variety of tasks, like commonsense reasoning, sentiment analysis, math problems, etc., are covered.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQHbewircJRdLQ/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1721331289542?e=1749081600&v=beta&t=fwV3PA9i6mfPaRWmrcBcq_qgYNiR4Fv8D--DkvxGBZM\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EThe input and output data of the instruction synthesizer via an annotated figure from \u003C/em\u003E\u003Ca href=\"https://arxiv.org/abs/2406.14491\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehttps://arxiv.org/abs/2406.14491\u003C/em\u003E\u003C/a\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EOnce this instruction synthesizer is developed (i.e., finetuned), it can be used to generate the input data for pretraining the target LLMs.\u003C/p\u003E\u003Cp\u003EOne last noteworthy detail regarding the instruction synthesizer is that multiple raw texts (Tn) and instruction-response pairs (In ⊕ Rn) are concatenated as few-shot examples, as shown in the figure below.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQHqWZk1UNOxoQ/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1721331307374?e=1749081600&v=beta&t=xq54CUl20gFJ_Sy-RsQCM6cchZfvPfHdyATORfUTaZc\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EThe formatting of the instruction-data for finetuning (and using) the instruction synthesizer via an annotated figure from \u003C/em\u003E\u003Ca href=\"https://arxiv.org/abs/2406.14491\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehttps://arxiv.org/abs/2406.14491\u003C/em\u003E\u003C/a\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E3.2 Pretraining with Instruction Data\u003C/h3\u003E\u003Cp\u003ENow that we have discussed the method to generate the instruction-response pairs, let's get to the interesting part: how well do models train on this augmented dataset. 
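To illustrate the few-shot concatenation of raw texts (Tn) and instruction-response pairs (In ⊕ Rn) described in section 3.1, here is a small, hypothetical formatting helper. The delimiter strings are made up for illustration; the actual template used to fine-tune the instruction synthesizer is specified in the paper.

```python
def format_synthesizer_prompt(examples, new_raw_text):
    """Concatenate few-shot (raw text, instruction, response) triples,
    then append the new raw text for which pairs should be synthesized."""
    parts = []
    for raw_text, instruction, response in examples:
        parts.append(
            f"### Raw text:\n{raw_text}\n"
            f"### Instruction:\n{instruction}\n"
            f"### Response:\n{response}\n"
        )
    parts.append(f"### Raw text:\n{new_raw_text}\n### Instruction:\n")
    return "\n".join(parts)

few_shot = [
    ("Paris is the capital of France.",
     "What is the capital of France?",
     "The capital of France is Paris."),
]
print(format_synthesizer_prompt(few_shot, "Water boils at 100 degrees Celsius."))
```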
The first set of results looks at two small models trained from scratch: 500M parameters and 1.3B parameters (both are based on the Mistral architecture).\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEJg9uf6aGGMA/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1721331331139?e=1749081600&v=beta&t=7tKmpJeczSlfcAlYp46rr0KMIGQEkTvW7uS5jrxDIsU\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EA comparison of 3 different pretraining approaches used to train models from scratch (annotated table from \u003C/em\u003E\u003Ca href=\"https://arxiv.org/abs/2406.14491\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehttps://arxiv.org/abs/2406.14491\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E)\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EAs we can see in the table above, the model trained via the proposed instruction pretraining approach (\u003Cstrong\u003E\u003Cem\u003EInstruct PT\u003C/em\u003E\u003C/strong\u003E) performs best on most benchmark tasks (higher values are better). \u003C/p\u003E\u003Cp\u003ENote, though, that it has seen more tokens than the \u003Cstrong\u003E\u003Cem\u003EVanilla PT\u003C/em\u003E\u003C/strong\u003E approach since it included the synthesized instruction-response pairs. Hence, the authors included the \u003Cstrong\u003E\u003Cem\u003EMix PT\u003C/em\u003E\u003C/strong\u003E comparison, which is a model that has been trained on a data mix containing both the raw text and the instruction data used to train the synthesizer. From this comparison, we can see that it's not simply having more data that makes the difference. The fact that \u003Cstrong\u003E\u003Cem\u003EInstruct PT\u003C/em\u003E\u003C/strong\u003E performs better than \u003Cstrong\u003E\u003Cem\u003EMix PT\u003C/em\u003E\u003C/strong\u003E on most tasks illustrates that the nature of the instruction-response data (i.e., instruction-response data related to the raw data) makes the difference.\u003C/p\u003E\u003Cp\u003EIn addition, it's worth noting that the Instruct PT pretrained models have another advantage: They improve more when they are instruction-finetuned afterwards, as the figure below shows.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQG9Srat5RnpJA/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1721331511559?e=1749081600&v=beta&t=Ojld5H6Z2Db3yIPIO1sB4MeQv-KdFTEZdo3S9vcXkiA\" src=\"//:0\"\u003E\u003Cfigcaption\u003EFinetuning LLMs that have been pretrained with either the traditional pretraining paradigm (Vanilla PT) or instruction pretraining (annotated figure from https://arxiv.org/abs/2406.14491)\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E3.3 Continual Pretraining with Instruction Data\u003C/h3\u003E\u003Cp\u003EPretraining from scratch is interesting because that's how LLMs are created in the first place. However, I'd say that practitioners care more about continual pretraining and finetuning. \u003C/p\u003E\u003Cp\u003EContinual pretraining here means that we take an existing pretrained model and pretrain it further on new domain data. 
For instance, think of a Llama 3 8B base model that has been trained on a general text corpus and that you want to adapt for finance, medical, legal, or other domains.\u003C/p\u003E\u003Cp\u003EThe table below summarizes the results the researchers obtained when applying the instruction pretraining method to a pretrained Llama 3 8B base model. Specifically, they conducted continual pretraining with both biomedical texts and finance texts.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQG9Y-PMO8WNhA/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1721331575319?e=1749081600&v=beta&t=ZcNNIFxvpV0-yP3fnfIC40oFHpYPqD95DdUWzC_KbRY\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EA comparison of 3 different pretraining approaches used for continual pretraining (annotated table from \u003C/em\u003E\u003Ca href=\"https://arxiv.org/abs/2406.14491\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehttps://arxiv.org/abs/2406.14491\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E)\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003ELooking at the table above, we can see that the instruction pretraining approach (\u003Cstrong\u003E\u003Cem\u003EInstruct PT\u003C/em\u003E\u003C/strong\u003E) clearly outperforms the vanilla pretraining (\u003Cstrong\u003E\u003Cem\u003EVanilla PT\u003C/em\u003E\u003C/strong\u003E) approach (here, this means regular continual pretraining of the base model). \u003C/p\u003E\u003Cp\u003EThe Llama 3 70B base model is included as a reference; I suppose to showcase that small specialized models can beat larger general models.\u003C/p\u003E\u003Ch3\u003E3.4 Conclusion\u003C/h3\u003E\u003Cp\u003EAlmost every time I explain the LLM pretraining pipeline to someone, they are surprised by its simplicity and the fact that this is still what's commonly used to train LLMs today. The instruction pretraining approach is quite refreshing in that sense. \u003C/p\u003E\u003Cp\u003EOne caveat is that for large pretraining corpora, it might still be expensive to create the instruction-augmented corpora. However, the nice thing about generated data is that it can be reused in many different projects once created.\u003C/p\u003E\u003Ch2\u003E4. Gemma 2\u003C/h2\u003E\u003Cp\u003EI cannot write this article without mentioning Google's new \u003Ca href=\"https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf\" target=\"_blank\" rel=\"nofollow noopener\"\u003EGemma 2 models\u003C/a\u003E, which are arguably the biggest model release last month. However, when it comes to pure size, Nvidia's Nemotron-4 340B takes the crown (\u003Ca href=\"https://arxiv.org/abs/2406.11704\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehttps://arxiv.org/abs/2406.11704\u003C/a\u003E). The Gemma 2 models come in 2.6B, 9B, and 27B parameter versions.\u003C/p\u003E\u003Cp\u003ESince this article is already quite lengthy, and you're likely familiar with Gemma 2 from other sources, let's cut to the chase. What are the main highlights and noteworthy updates in Google's newly released Gemma 2 LLMs? 
The main theme is exploring techniques without necessarily increasing the size of training datasets but rather focusing on developing relatively small and efficient LLMs.\u003C/p\u003E\u003Cp\u003ESpecifically, they blend three main architectural and training choices to create the 2.6B and 9B parameter models: sliding window attention, grouped-query attention, and knowledge distillation.\u003C/p\u003E\u003Ch3\u003E4.1 Sliding window attention\u003C/h3\u003E\u003Cp\u003E\u003Cstrong\u003ESliding window attention\u003C/strong\u003E (e.g., as popularized by Mistral) is a technique using a fixed-sized attention block that allows a current token to attend to only a specific number of previous tokens instead of all previous tokens, as illustrated in the figure below.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQGMrmQqfxEfFg/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1721331597903?e=1749081600&v=beta&t=peorWGD8-w_XI34s3s3kqdRVaXO65pkTlLvyBlEN0AU\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EA comparison of 3 different pretraining approaches used for continual pretraining (annotated table from \u003C/em\u003E\u003Ca href=\"https://arxiv.org/abs/2406.14491\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehttps://arxiv.org/abs/2406.14491\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E)\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EIn the case of Gemma 2, the authors alternated between regular attention and sliding window attention layers. The sliding attention block size was 4096 tokens, spanning a total block size of 8192 tokens. \u003C/p\u003E\u003Cp\u003ESliding window attention is mainly used to improve computational performance, and the researchers also included a small ablation study showing that there's a barely noticeable difference in perplexity when shrinking the block size during inference.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQHdyWuzgMZYcA/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1721331621608?e=1749081600&v=beta&t=-4WkRBhcI6Z5XFlW-41DSwIbCbUqNL2AMLlTcoup0gQ\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EAn ablation study from the \u003C/em\u003E\u003Ca href=\"https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003EGemma 2 technical report\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E showing that a decreased block size for the sliding window barely impacts the modeling performance of the 9B parameter model during inference.\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E(It would have been interesting to see the GPU memory improvement side-by-side.)\u003C/p\u003E\u003Ch3\u003E4.2 Group-query attention\u003C/h3\u003E\u003Cp\u003E\u003Cstrong\u003EGroup-query attention\u003C/strong\u003E (like in Llama 2 and 3) can be regarded as a more generalized form of multi-query attention. 
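As a concrete illustration of the sliding window attention described in section 4.1, the PyTorch snippet below builds a causal mask in which each token can attend only to itself and the previous window_size - 1 tokens. It is a toy mask construction, not Gemma 2's actual implementation.

```python
import torch

def sliding_window_causal_mask(seq_len, window_size):
    """True = attention allowed. Each query position i may attend to
    key positions j with i - window_size < j <= i (causal + windowed)."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions (column vector)
    j = torch.arange(seq_len).unsqueeze(0)   # key positions (row vector)
    causal = j <= i
    within_window = j > (i - window_size)
    return causal & within_window

# Toy example: 8 tokens, window of 4 (Gemma 2 uses a 4096-token window)
print(sliding_window_causal_mask(seq_len=8, window_size=4).int())
```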
The motivation behind this is to reduce the number of trainable parameters by sharing the same Keys and Values heads for multiple Query heads, thereby lowering computational requirements.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQF1uT_aJzoGdw/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1721331644101?e=1749081600&v=beta&t=C_m964JsR-HUgDF1xszZfLpEZXL5f5sxmXN6wDoiRHk\" src=\"//:0\"\u003E\u003Cfigcaption\u003EAnnotated figure from \u003Ca href=\"https://arxiv.org/abs/2305.13245\" target=\"_blank\" rel=\"nofollow noopener\"\u003EAinslie \u003Cem\u003Eet al.\u003C/em\u003E 2023\u003C/a\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E4.3 Knowledge distillation\u003C/h3\u003E\u003Cp\u003EThe general idea of \u003Cstrong\u003EKnowledge distillation\u003C/strong\u003E (as in MiniLLM, \u003Ca href=\"https://arxiv.org/abs/2306.08543\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehttps://arxiv.org/abs/2306.08543\u003C/a\u003E) is to transfer knowledge from a larger model (the teacher) to a smaller model (the student). Here, they trained a 27B (teacher) model from scratch and then trained the smaller 2B and 9B (student) models on the outputs of the larger teacher model. The 27B model doesn't use knowledge distillation but was trained from scratch to serve as a \"teacher\" for the smaller models.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQH5dQevh340iw/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1721331677177?e=1749081600&v=beta&t=pV8D4q37RVjeEXH8xhLInWxvd-KvwV5dOjUW7T7lSE4\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EAn overview of knowledge distillation from \u003C/em\u003E\u003Ca href=\"https://www.amazon.com/Machine-Learning-AI-Essential-Questions/dp/1718503768/ref=sr_1_1\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Emy Machine Learning Q and AI book\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E in the context of computer vision. In an LLM-context, think of text instead of images and predicted tokens instead of class labels.\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E4.4 Other interesting architecture details\u003C/h3\u003E\u003Cp\u003EThe paper contains many other interesting tidbits. For instance, one hallmark of Gemma 2 is its relatively large vocabulary size: 256,000 tokens. This is similar to the first Gemma model, but it's still worth noting since it's twice the size of the Llama 3 vocabulary (128,000) and eight times the size of the Phi-3 vocabulary (32,000).\u003C/p\u003E\u003Cp\u003EThe vocabulary size of an LLM refers to the number of unique tokens (words, subwords, or characters) that the model can recognize and generate. \u003C/p\u003E\u003Cp\u003EA large vocabulary size in LLMs allows for better coverage of words and concepts, improved handling of multilingual content, and reduced tokenization artifacts. However, a large vocabulary size also comes with trade-offs, such as increased model size and potentially slower inference due to the larger embedding and output layers. 
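To make the grouped-query attention idea from section 4.2 more tangible, here is a minimal PyTorch sketch in which a small number of key/value heads is shared across a larger number of query heads by repeating them. It is a simplified illustration (no projections, RoPE, or masking), not the Gemma 2 or Llama code.

```python
import torch

def grouped_query_attention(q, k, v):
    """q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim).
    n_q_heads must be a multiple of n_kv_heads."""
    n_q_heads, n_kv_heads = q.size(1), k.size(1)
    group_size = n_q_heads // n_kv_heads
    # Each group of query heads reuses the same key/value head
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = (q @ k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

# Toy shapes: 8 query heads sharing 2 key/value heads
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 2, 16, 64)
v = torch.randn(1, 2, 16, 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 8, 16, 64])
```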
(That's where the sliding window attention and multi-query attention mechanism are important to offset this.)\u003C/p\u003E\u003Cp\u003EThere's also an interesting section on \"logit capping,\" a technique I haven't seen used before. Essentially, it is a form of min-max normalizing and clipping of the logit values to keep them within a certain range. I presume this is to improve stability and gradient flow during training.\u003C/p\u003E\u003Cp\u003Elogits ← soft_cap ∗ tanh(logits/soft_cap).\u003C/p\u003E\u003Cp\u003EAdditionally, they leverage model merging techniques to combine models from multiple runs with different hyperparameters, although the paper doesn't provide much detail about that. (However, interested readers can read more about this in \u003Ca href=\"https://arxiv.org/abs/2406.16768\" target=\"_blank\" rel=\"nofollow noopener\"\u003EWARP: On the Benefits of Weight Averaged Rewarded Policies\u003C/a\u003E, which Gemma 2 uses for this.) \u003C/p\u003E\u003Cp\u003EIn terms of modeling performance, Gemma 2 is almost as good as the 3x larger Llama 3 70B, and it beats the old Qwen 1.5 32B model. It would be interesting to see a comparison with the more recent Qwen 2 model.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQFdyHiiV7L7zQ/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1721331701299?e=1749081600&v=beta&t=MpbOwEnJbnz5mr7W5-IQ3mFd2mSmW-NZhglejnE2uMA\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EA comparison between two other popular models with openly available weights: Llama 3 and Qwen 1.5. (Annotated table from the \u003C/em\u003E\u003Ca href=\"https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003EGemma 2 technical report\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E).\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EPersonally, a highlight is that the Gemma 2 report includes ablation studies for some of its architectural choices. This was once a given in academic research but is increasingly rare for LLM research.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEqWQU11e21tg/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1721331724853?e=1749081600&v=beta&t=STyy7b4d2Rv27_LU3U1mXgBwWBWTtOfNrNfCzFOYX5c\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EAn example of one of the ablation studies included in the \u003C/em\u003E\u003Ca href=\"https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003EGemma 2 technical report\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E. Here \"wide\" refers to a model with 28 layers and an intermediate size of 24,576, and \"deep\" refers to a architecture with 42 layers with an intermediate size of 14,336.\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E4.5 Conclusion\u003C/h3\u003E\u003Cp\u003EIt's refreshing to see such a relatively detailed technical report from Google. When it comes to the model itself, based on public consensus, Gemma 2 is likely the most capable model for single-GPU use cases today. 
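The logit soft-capping formula quoted in section 4.4 is a one-liner in practice; here is a minimal PyTorch version. The cap value of 30.0 is only an illustrative placeholder (the report applies capping with separate values for attention scores and final logits).

```python
import torch

def soft_cap(logits, cap=30.0):
    # logits <- cap * tanh(logits / cap): keeps values within (-cap, cap)
    # while staying smooth and differentiable
    return cap * torch.tanh(logits / cap)

x = torch.tensor([-100.0, -10.0, 0.0, 10.0, 100.0])
print(soft_cap(x))
```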
For larger models, Llama 3 70B and Qwen 2 72B remain strong contenders.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Chr\u003E\u003Ch2\u003ESupporting Ahead of AI\u003C/h2\u003E\u003Cp\u003E\u003Cem\u003EThis magazine is personal passion project that does not offer direct compensation. However, for those who wish to support me, please consider purchasing a copy of \u003C/em\u003E\u003Ca href=\"https://sebastianraschka.com/books\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Eone of my books\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E. If you find them insightful and beneficial, please feel free to recommend them to your friends and colleagues. (Sharing your feedback with others \u003C/em\u003E\u003Ca href=\"https://www.amazon.com/Machine-Learning-AI-Essential-Questions/dp/1718503768/ref=sr_1_1?crid=1566EI5BQC9U0&dib=eyJ2IjoiMSJ9.4oCd5DaBraiVbzZDag-sX4dJQTIguc2mCbDGm1UCKmulcZsTRWmz--_y1AwHt5OmSFglsDpUXQO6FJ_fhs3n9qizrIqlU4STsWxFGor7WdW0QRtPtWgyzz8w0C3PHW8uwsDMJLN4VxjnFIkMizRHBoiHZjLGslLzmiLpzLTTlhbS7bSkhGnUcb1wKkartwWqVtq8c8KbnTdEJ34G6dOlf_YIfsYyG2XOGXneQuNmWh8.AZ3qpJ95F8x9PhDbClERH4iibU3U4r4CTLdtf2r1Nyc&dib_tag=se&keywords=Machine+Learning+Q+and+AI&qid=1717088426&sprefix=%2Caps%2C275&sr=8-1\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Evia a review on Amazon\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E helps a lot, too!)\u003C/em\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQFD1Wr3d8swog/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1721330899076?e=1749081600&v=beta&t=TOkC5SsZ_5gU7161fe70sd2a7uhUemINHSzfe2nBLU0\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Ca href=\"https://www.amazon.com/Machine-Learning-PyTorch-Scikit-Learn-scikit-learn-ebook-dp-B09NW48MR1/dp/B09NW48MR1/\" target=\"_blank\" rel=\"nofollow noopener\"\u003EMachine Learning with PyTorch and Scikit-Learn\u003C/a\u003E, \u003Ca href=\"https://nostarch.com/machine-learning-and-ai-beyond-basics\" target=\"_blank\" rel=\"nofollow noopener\"\u003EMachine Learning Q and AI\u003C/a\u003E, and \u003Ca href=\"http://mng.bz/M96o\" target=\"_blank\" rel=\"nofollow noopener\"\u003EBuild a Large Language Model (from Scratch)\u003C/a\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003Cstrong\u003EYour support means a great deal! Thank you!\u003C/strong\u003E\u003C/p\u003E","url":"https://www.linkedin.com/pulse/instruction-pretraining-llms-sebastian-raschka-phd-x6zoc"},{"@type":"Article","author":{"@type":"Person","name":"Sebastian Raschka, PhD","url":"https://www.linkedin.com/in/sebastianraschka"},"dateModified":"","datePublished":"2024-06-02T12:45:01.000+00:00","headline":"LLM Research Insights: Instruction Masking and New LoRA Finetuning Experiments","image":{"@type":"ImageObject","url":"https://media.licdn.com/dms/image/v2/D5612AQHrFO7oCsjB8g/article-cover_image-shrink_720_1280/article-cover_image-shrink_720_1280/0/1721183355165?e=2147483647&v=beta&t=pESTVCk__yC_9wzIYhg1AfmiUxE69pV-tXjhZoxMO9Y"},"articleBody":"\u003Cp\u003EThis month, I am covering three new papers related to instruction finetuning and parameter-efficient finetuning with LoRA in large language models (LLMs). 
I work with these methods on a daily basis, so it's always exciting to see new research that provides practical insights.\u003C/p\u003E\u003Cp\u003EThis article might be a bit shorter than usual since I am currently finishing the final chapter of my book, \"\u003Ca href=\"http://mng.bz/orYv\" target=\"_blank\" rel=\"nofollow noopener\"\u003EBuild a Large Language Model From Scratch\u003C/a\u003E.\" Additionally, I am preparing for a virtual \u003Ca href=\"https://events.zoom.us/ev/ArzwACAJCGWLB-pPrWeIwszDr8WDhlEcFLL4VMCb1SJVU4fzrQo9~AvhUziSVIoulf4yLm7f0hhyew2qRK0ZEIE6Xztz4yEDudekpMeEd9L_UXVY_6lWFFOTqXHlF-N_xKaCrP5aP07ZzFQ\" target=\"_blank\" rel=\"nofollow noopener\"\u003EACM Tech Talk on LLMs\u003C/a\u003E this Wednesday. The talk is free and open to all, and you are very welcome to join if you're interested!\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQFDpIqTpsj0Bg/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1721201287105?e=1749081600&v=beta&t=7L0q8ck8zy3EkegAyHmrP029FBoXgaBulhJJxjREy-c\" src=\"//:0\"\u003E\u003Cfigcaption\u003EUnderstanding the \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch2\u003E1. Instruction Tuning With Loss Over Instructions\u003C/h2\u003E\u003Cp\u003EOne paper that caught my eye this month is \u003Ca href=\"https://arxiv.org/abs/2405.14394\" target=\"_blank\" rel=\"nofollow noopener\"\u003EInstruction Tuning With Loss Over Instructions.\u003C/a\u003E\u003C/p\u003E\u003Cp\u003EIn this paper, the authors question a widely accepted practice in instruction finetuning: masking the instruction when calculating the loss. But before we discuss the findings, let's begin with a general overview.\u003C/p\u003E\u003Ch3\u003E1.1 Instruction Masking During Instruction Finetuning\u003C/h3\u003E\u003Cp\u003EInstruction finetuning (or instruction tuning for short) is the task of improving the responses of a pretrained LLM to follow instructions (\"\u003Cem\u003ESummarize this article\u003C/em\u003E,\" \"\u003Cem\u003ETranslate this sentence\u003C/em\u003E,\" etc.).\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQGeua_ZHQ05IA/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1721191531545?e=1749081600&v=beta&t=B5GmGOV7pP6yfiKMEDMaCMyiKKM8H4egbzVMXOn4geI\" src=\"//:0\"\u003E\u003Cfigcaption\u003EAn illustration of a dataset example for instruction finetuning\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EWhen instruction finetuning LLMs, it is common to mask out the instruction itself when calculating the loss. For instance, this is done by default in our \u003Ca href=\"https://github.com/Lightning-AI/litgpt\" target=\"_blank\" rel=\"nofollow noopener\"\u003ELitGPT\u003C/a\u003E library, and I use it also in Chapter 7 of my \u003Ca href=\"https://github.com/rasbt/LLMs-from-scratch\" target=\"_blank\" rel=\"nofollow noopener\"\u003EBuild a Large Language Model From Scratch\u003C/a\u003E book (however, I am now considering moving the masking to a reader exercise). \u003C/p\u003E\u003Cp\u003EIn other popular LLM libraries like \u003Ca href=\"https://github.com/OpenAccess-AI-Collective/axolotl\" target=\"_blank\" rel=\"nofollow noopener\"\u003EAxolotl\u003C/a\u003E, this is also done automatically via the train_on_inputs: false default setting in the config.yaml. 
In Hugging Face, this is not done by default, but one can implement this through a DataCollatorForCompletionOnlyLM dataset collator, as detailed in \u003Ca href=\"https://huggingface.co/docs/trl/en/sft_trainer#train-on-completions-only\" target=\"_blank\" rel=\"nofollow noopener\"\u003Etheir documentation.\u003C/a\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEGJ-RK2cfNrw/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1721196493005?e=1749081600&v=beta&t=-k4SJFydHilg7tIVI3VEpy7iPusKZEUAs3cvVzOlo58\" src=\"//:0\"\u003E\u003Cfigcaption\u003EAn illustration of input masking where the highlighted text is still fed to the LLM, but it is not used when calculating the loss during training.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EMasking the input prompts, as described above, is a routine task, and some papers may include comparisons with and without this masking. For example, the QLoRA paper included a comparison in its appendix, finding that masking performs better.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQFgmqqhMlTH3Q/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1721193533100?e=1749081600&v=beta&t=dquTs8DdImATIi2dEwOPI1wNs35BMiniwBVJOkngkA0\" src=\"//:0\"\u003E\u003Cfigcaption\u003EAnnotated table from \"QLoRA: Efficient Finetuning of Quantized LLMs\", \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003ENote that MMLU is a benchmark focused on measuring the performance on multiple-choice questions, and the authors did not measure or study how it affects conversational performance when the finetuned model is used as a chatbot.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E1.2 Instruction Modeling\u003C/h3\u003E\u003Cp\u003EAfter the brief introduction to the topic in the previous section, let's examine the \u003Ca href=\"https://arxiv.org/abs/2405.14394\" target=\"_blank\" rel=\"nofollow noopener\"\u003EInstruction Tuning With Loss Over Instructions\u003C/a\u003E paper. In this study, the authors systematically investigate the differences in LLM performance when instructions are masked versus unmasked. \u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQFyb7aI6O4_2g/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1721192208744?e=1749081600&v=beta&t=AsmLdHDq8S5ol-Fc3RmF4tZW-GAQk9dia8NxLcMSaxk\" src=\"//:0\"\u003E\u003Cfigcaption\u003EIllustration of (1) no instruction masking, (2) instruction masking, and (3) boilerplate masking.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EApproach number 1 in the figure above is the default approach when implementing an LLM because it does not require any extra work or modification of the loss function. It's the method that the authors refer to in the paper as \"instruction modeling.\" (In the paper, they additionally mask special prompt tokens like <|user|>, <|assistant|>, and <|system|>that may occur in non-Alpaca prompt templates.)\u003C/p\u003E\u003Cp\u003EApproach number 2 is currently the most common approach in practice, where everything except for the response is masked when computing the loss. 
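\u003Cp\u003ETo make this more concrete, here is a minimal sketch of how approach 2 is commonly implemented: the instruction tokens are still fed to the model, but their positions in the target tensor are set to an ignore index so that the cross-entropy loss skips them. The -100 ignore index and the tensor layout below follow common PyTorch conventions and are meant as an illustration, not as code from any of the libraries mentioned above.\u003C/p\u003E\u003Cpre\u003E
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # positions with this label are skipped by F.cross_entropy

def build_targets(token_ids, instruction_len):
    # token_ids: 1D tensor containing the instruction tokens followed by the response tokens
    # Standard next-token setup: inputs are token_ids[:-1], targets are token_ids[1:]
    targets = token_ids[1:].clone()
    # Mask the (shifted) instruction span so that only response tokens contribute to the loss
    targets[: instruction_len - 1] = IGNORE_INDEX
    return token_ids[:-1], targets

# toy example: 5 instruction tokens followed by 4 response tokens
inputs, targets = build_targets(torch.arange(10, 19), instruction_len=5)
logits = torch.randn(inputs.shape[0], 50_000)  # would normally come from the LLM
loss = F.cross_entropy(logits, targets, ignore_index=IGNORE_INDEX)
\u003C/pre\u003E\u003Cp\u003EApproach 1 (\"instruction modeling\") simply skips the masking step, and approach 3 would mask a different, smaller span.\u003C/p\u003E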
In the paper, they refer to this approach as \"instruction tuning\" (the naming is a bit unfortunate in this context, because we are not \"tuning\" the instruction but excluding the instructions from the loss, compared to approach 1).\u003C/p\u003E\u003Cp\u003EWhile drawing the figure above, I thought of an interesting third approach that is not explored in the paper: Masking prompt-specific boilerplate text. For instance, in Alpaca-style prompts, all examples begin with \"Below is an instruction...\". Compared to the actual instruction and input, this text is constant and could thus be excluded from the loss. This is not studied in the paper but might be an interesting additional experiment, which I am planning to add as another reader exercise (plus solution) to my \u003Ca href=\"https://github.com/rasbt/LLMs-from-scratch\" target=\"_blank\" rel=\"nofollow noopener\"\u003ELLMs from Scratch book\u003C/a\u003E.\u003C/p\u003E\u003Cp\u003EIt turns out that instruction modeling (that is, not masking the instructions) seems to outperform the masking approach, as shown in the figure below.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQFQEFShlqQPWw/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1721195821309?e=1749081600&v=beta&t=1wAIBzSMIz_GATd_x6nbxuhqFKVhBbRkbd29UcaHqhU\" src=\"//:0\"\u003E\u003Cfigcaption\u003ENot masking the instructions performs better than masking the instructions. Annotated from Instruction Tuning With Loss Over Instructions, (\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EHowever, the performance of \"instruction modeling\" depends on (a) the ratio between the instruction and response length and (b) the size of the dataset (in terms of the number of training examples).\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQG34DWZfJhTSA/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1721199549505?e=1749081600&v=beta&t=DQl1aPgSM9jB8bI4HI8FPcBirKrwsBWebpLOVZE1XkE\" src=\"//:0\"\u003E\u003Cfigcaption\u003EThe benefit of not masking instructions depends on the dataset length and size. Annotated from Instruction Tuning With Loss Over Instructions, (\u003Ca href=\"https://arxiv.org/abs/2405.14394\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehttps://arxiv.org/abs/2405.14394\u003C/a\u003E)\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EA plausible explanation for the dataset length and size dependence shown above is that if the responses are short and there are only a few training examples, it becomes easier for the model to memorize the responses, so including the instruction modeling, that is, modeling more of the model output in the loss, can help reduce overfitting. \u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E1.3 Conclusion\u003C/h3\u003E\u003Cp\u003EIn short, the authors found that going back to the basics, not masking the instructions, can benefit model performance.\u003C/p\u003E\u003Cp\u003EIt's positively surprising that the simpler approach of not masking the instruction (except for mask special prompt tokens like <|user|>, <|assistant|>, and <|system|>) performs better. 
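\u003Cp\u003E(For readers who want to try the boilerplate-masking idea described above as an exercise: relative to the masking sketch shown earlier, the only change is which span receives the ignore index. The snippet below is a hypothetical illustration; in practice, the number of boilerplate tokens would be obtained by running the fixed Alpaca preamble through the tokenizer.)\u003C/p\u003E\u003Cpre\u003E
import torch

IGNORE_INDEX = -100

def mask_boilerplate_only(token_ids, n_boilerplate_tokens):
    # token_ids: 1D tensor for one example laid out as [boilerplate, instruction, input, response].
    # Only the constant preamble (e.g., 'Below is an instruction ...') is excluded from the loss;
    # the actual instruction, input, and response all contribute.
    targets = token_ids[1:].clone()
    targets[: n_boilerplate_tokens - 1] = IGNORE_INDEX
    return token_ids[:-1], targets

# toy example: 3 boilerplate tokens, 4 instruction tokens, 3 response tokens
inputs, targets = mask_boilerplate_only(torch.arange(10, 20), n_boilerplate_tokens=3)
print(targets)  # the first two target positions are -100; everything else is kept
\u003C/pre\u003E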
\u003C/p\u003E\u003Cp\u003EIn my personal experience, when I tried both approaches in the past and didn't see a clear winner, I usually didn't experiment with masking much, though, and focused more on changing other settings if things didn't work well. In hindsight, it may make sense to consider experimenting with the (not) masking approach a bit more, as it seems to depend on the dataset size and length.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch2\u003E2. LoRA learns less and forgets less\u003C/h2\u003E\u003Cp\u003E\u003Ca href=\"https://arxiv.org/abs/2405.09673\" target=\"_blank\" rel=\"nofollow noopener\"\u003ELoRA Learns Less and Forgets Less\u003C/a\u003E is a comprehensive empirical study of low-rank adaptation (LoRA) for finetuning LLMs, comparing LoRA to full finetuning across two target domains: programming and mathematics. In addition to these domains, the comparison extends to two target tasks: instruction finetuning and continued pretraining.\u003C/p\u003E\u003Cp\u003EIf you are looking for a LoRA refresher before reading on, I recently covered in \u003Ca href=\"https://magazine.sebastianraschka.com/p/lora-and-dora-from-scratch\" target=\"_blank\" rel=\"nofollow noopener\"\u003EImproving LoRA: Implementing Weight-Decomposed Low-Rank Adaptation (DoRA) from Scratch\u003C/a\u003E.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E2.1 LoRA Learns Less\u003C/h3\u003E\u003Cp\u003EBased on the first set of results, LoRA learns significantly less than full finetuning, as shown in the figure below. This is expected, in my view, because updating fewer parameters limits learning capacity. Learning new knowledge generally requires more capacity than converting a pretrained base model into an instruction-following model.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEIX5PZPNU4Dw/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1721192937074?e=1749081600&v=beta&t=-EcL_YQEFCoVcxcbUjAq38qF01d6jZqR362eeMRqcPY\" src=\"//:0\"\u003E\u003Cfigcaption\u003EFull finetuning vs LoRA. The performance is measured on HumanEval, which is a dataset consisting of 164 coding challenges. Annotated figures from LoRA Learns Less and Forgets Less, \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EIn the figure above, it is also interesting to note that the gap between LoRA and full finetuning is larger for continued pretraining than for instruction finetuning. This is consistent with the common intuition that pretraining primarily teaches an LLM new knowledge, whereas instruction finetuning primarily changes the behavior of an LLM.\u003C/p\u003E\u003Cp\u003ENext, examining the same set of experiments in the domain of math instead of coding, we observe that the gap between full finetuning and LoRA shrinks, as illustrated in the figure below.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQGcPka77VguRw/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1721196151731?e=1749081600&v=beta&t=catyCATO6xkbz1Q6Jm1QpEYKdnnVHFX9ISLTQ5u5GcM\" src=\"//:0\"\u003E\u003Cfigcaption\u003EThe full finetuning vs. LoRA comparison for math tasks. Annotated figures from LoRA Learns Less and Forgets Less, \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EOne could argue that solving math problems is more closely related to the source domain of the LLM than coding is. 
In other words, the LLM may have encountered more math problems during pretraining than programming tasks. Furthermore, math problems are usually described in words, whereas programming requires an entirely new set of terms.\u003C/p\u003E\u003Cp\u003EAs a takeaway, so far we can conclude that the further the new task deviates from the pretraining data, the more advantageous it is to use full finetuning over LoRA when it comes to acquiring new knowledge, such as through continued pretraining.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E2.2 LoRA Forgets Less\u003C/h3\u003E\u003Cp\u003EAbove, we examined how LoRA compares to full finetuning in terms of knowledge updates. The next set of experiments assesses these two approaches in terms of forgetting information after additional training, via continued pretraining and instruction finetuning. The difference to the previous results is that here they measure the performance on the original source tasks.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQHTxeKCX5LU8w/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1721192286937?e=1749081600&v=beta&t=ajXANJp1MMZ9LqEPgML-qCE8GcorhWZw9bC_G5fLEvo\" src=\"//:0\"\u003E\u003Cfigcaption\u003EFull finetuning vs LoRA on the original source tasks after training on programming data. \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EAs we can see from the figure above, full finetuning forgets much more knowledge than LoRA when these methods are applied to datasets that are farther from the source domain (here, coding). The gaps are smaller for the math dataset, as shown below.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQG3RCGqFI1rQA/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1721193706777?e=1749081600&v=beta&t=e9AiOeS5kemm64aFMMxYgSr2j2X7Awc-_qUJo-kY9Rs\" src=\"//:0\"\u003E\u003Cfigcaption\u003EFull finetuning vs LoRA on the original source tasks after training on math data. \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E2.3 Conclusion\u003C/h3\u003E\u003Cp\u003ELoRA versus full finetuning? Perhaps as expected, it all boils down to a learning-forgetting trade-off. Full finetuning results in stronger performance in the new target domain, whereas LoRA maintains better performance in the original source domain *.\u003C/p\u003E\u003Cp\u003EIntuitively, I suspect this is merely a side effect of LoRA changing fewer parameters in the model—the goal of LoRA, as its name suggests, is low-rank adaptation, which means not substantially modifying all model parameters. \u003C/p\u003E\u003Cp\u003EMoreover, in practice, it often isn't a question of whether to use full finetuning or LoRA, as the latter might be the only feasible option due to its memory savings and lower storage footprint.\u003C/p\u003E\u003Cp\u003ENonetheless, it's very interesting to see this thoroughly laid out and analyzed in great experimental detail. (The experiments were conducted with 7B and 13B Llama 2 models).\u003C/p\u003E\u003Cp\u003E* (One caveat that Mariano Kamp pointed out to me was that they did not update the embedding layers in the LoRA experiments, which is crucial when adapting a model to a new task.)\u003C/p\u003E\u003Ch2\u003E3. 
MoRA: High-Rank Updating for Parameter-Efficient Finetuning\u003C/h2\u003E\u003Cp\u003EIt's always exciting when a new paper with a LoRA-like method for efficient LLM finetuning is published. In \u003Ca href=\"https://arxiv.org/abs/2405.12130\" target=\"_blank\" rel=\"nofollow noopener\"\u003EMoRA: High-Rank Updating for Parameter-Efficient Finetuning\u003C/a\u003E the authors take a related yet opposite approach to low-rank adaptation by replacing the LoRA adapters with a square matrix.\u003C/p\u003E\u003Cp\u003EAlso, Appendix E of my \u003Ca href=\"http://mng.bz/orYv\" target=\"_blank\" rel=\"nofollow noopener\"\u003EBuild a Large Language Model From Scratch book\u003C/a\u003E implements LoRA from scratch for a GPT model trained to classify spam messages.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQF-ZklUqYnlaA/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1721192555068?e=1749081600&v=beta&t=sasFKO-DNXIaeaWhPC0bxqcroh11L_Gw6CBSXMYrjMU\" src=\"//:0\"\u003E\u003Cfigcaption\u003EVisualizations from the LoRA sections of my \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E3.1 Low Vs High Ranks\u003C/h3\u003E\u003Cp\u003EAs shown in the figure below, MoRA uses a small square matrix (M) instead of the two small LoRA matrices, A and B. We will return to discussing how this works in the next section.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQFfUbmj2U3RpA/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1721192570463?e=1749081600&v=beta&t=WgOPWHOYUxmoK3XAyV34aHn6K0z-e2a3y3-RJ6BSapM\" src=\"//:0\"\u003E\u003Cfigcaption\u003ELoRA and MoRA side by side, where W is the weight in a neural network layer. Figure source: MoRA: High-Rank Updating for Parameter-Efficient Finetuning, \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EWhy introduce another LoRA alternative, now with \"high\" ranks? The reason is that LoRA updates the original weights in a relatively limited way. In my view, this is by design, because we don't want to perturb or change the original model capabilities too much when we finetune a model with LoRA. However, while this low-rank updating is effective and sufficient for tasks like instruction finetuning, one downside of the low-rank updating is that it is relatively ineffective for incorporating new knowledge, such as via continued pretraining (also referred to as finetuning on regular text data, not instruction finetuning).\u003C/p\u003E\u003Cp\u003EIn the MoRA paper, the authors seek to develop a parameter-efficient finetuning method that can perform well for \u003Cstrong\u003E\u003Cem\u003Eboth\u003C/em\u003E\u003C/strong\u003Einstruction finetuning and absorbing new knowledge in continued pretraining. 
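\u003Cp\u003ETo see what replacing the two low-rank matrices with one square matrix means in terms of tensor shapes and trainable parameters, here is a small, self-contained sketch for a hypothetical 4096×4096 weight layer (the same parameter arithmetic is spelled out in section 3.2 below). It only illustrates the bookkeeping, not MoRA's compression and decompression operators.\u003C/p\u003E\u003Cpre\u003E
import torch
import torch.nn as nn

d = 4096  # assumed size of the original, frozen weight matrix W (d × d)

# LoRA: two thin matrices, A (r × d) and B (d × r); the weight update is B @ A (up to scaling)
r_lora = 8
A = nn.Parameter(torch.randn(r_lora, d) * 0.01)
B = nn.Parameter(torch.zeros(d, r_lora))
lora_params = A.numel() + B.numel()   # 4096*8 + 8*4096 = 65,536

# MoRA-style: a single square matrix M (r × r) with a much larger r
r_mora = 256
M = nn.Parameter(torch.zeros(r_mora, r_mora))
mora_params = M.numel()               # 256*256 = 65,536

print(lora_params, mora_params)       # identical trainable-parameter budgets
\u003C/pre\u003E\u003Cp\u003EThe difference is not the parameter count but the rank of the update: B @ A has rank at most 8 here, whereas M can have rank up to 256, which is what the \"high-rank updating\" in the paper's title refers to.\u003C/p\u003E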
\u003C/p\u003E\u003Cp\u003EBelow is a side-by-side comparison between LoRA, MoRA, and regular full finetuning (FFT) based on a synthetic dataset that requires the LLM to memorize specific identifier codes.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQG7Sq6-FcX6cw/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1721199609699?e=1749081600&v=beta&t=04ajtTX6eE8kBdSg2YpSk3TzB18jrKFaz_TsdsvOE9w\" src=\"//:0\"\u003E\u003Cfigcaption\u003EPerformance of LoRA, MoRA and full finetuning (FFT) for memorizing identifier codes. Figure source: MoRA: High-Rank Updating for Parameter-Efficient Finetuning, \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EIn the synthetic benchmark shown in the figure above, MoRA has similar knowledge-uptake (or memorization) capabilities as full fine-tuning. LoRA with high ranks (r=256) can eventually memorize the synthetic identifier codes as well, but it requires more steps. LoRA with a small rank (r=8) fails to memorize.\u003C/p\u003E\u003Cp\u003ENote that memorization in LLMs is not necessarily a good thing (except for historical dates and facts, of course). However, the fact that LoRA does not tend to easily memorize can also be advantageous in reducing the tendency to overfit the training data. That being said, this benchmark primarily probes whether LoRA adds enough capacity to learn new knowledge. Think of it as similar to batch overfitting, a popular debugging technique where we try to get the model to overfit on a small portion of the training set to ensure that we have implemented the architecture correctly. (We will look at a benchmark on real data in the next section.)\u003C/p\u003E\u003Ch3\u003E3.2 MoRA in a Nutshell\u003C/h3\u003E\u003Cp\u003ESo, how is MoRA implemented?\u003C/p\u003E\u003Cp\u003EHere, the authors use a trainable square matrix that is applied to the original weights W instead of the matrices AB in LoRA. This square matrix is usually much smaller than the original weight matrix. In fact, the number of trainable parameters in LoRA and MoRA can even be the same.\u003C/p\u003E\u003Cp\u003EFor example, if the original weight layer has 4096 × 4096 = 16,777,216 parameters, LoRA with r=8 has 4096×8 + 8×4096 = 65,536 parameters. With MoRA, we can match the number of parameters with r=256, i.e., 256 × 256 = 65,536.\u003C/p\u003E\u003Cp\u003EHow do you apply this 256×256 matrix to the original 1024×1024 weight matrix? They define several non-parametric compression and decompression methods (the details are a bit out of scope for this post, but I tried to summarize them with PyTorch code in the figure).\u003C/p\u003E\u003Cp\u003ELeft: Visualization of the matrix dimensions for LoRA and DoRA when the total number of parameters is the same. 
Right: Example code illustrating the compression and decompression of the MoRA matrix.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQGkDpIZtn8DBQ/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1721192134633?e=1749081600&v=beta&t=Wbsu47BO-1J2QunARK0Xi7dktWbOyv6hCrFEpFe18EI\" src=\"//:0\"\u003E\u003Cfigcaption\u003EFigure source for the left subfigure: MoRA: High-Rank Updating for Parameter-Efficient Finetuning, \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EHow do LoRA and MoRA compare?\u003C/p\u003E\u003Cp\u003ELooking at real dataset benchmarks (see the table below), MoRA and LoRA are in relatively similar ballparks. However, in the case of continued pretraining on biomedical and finance data, MoRA outperforms all LoRA variants; only full fine-tuning (FFT) performs better. Additionally, an interesting observation is that LoRA ties with or outperforms DoRA, a LoRA variant I covered a few months ago in \u003Ca href=\"https://magazine.sebastianraschka.com/p/lora-and-dora-from-scratch\" target=\"_blank\" rel=\"nofollow noopener\"\u003EImproving LoRA: Implementing Weight-Decomposed Low-Rank Adaptation (DoRA) from Scratch\u003C/a\u003E.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEAdnGQZ7urYg/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1721196576343?e=1749081600&v=beta&t=EUKvv_yUG4wYOlW8QJCpYD1Z1Tq99I6uNI-OSySWBEw\" src=\"//:0\"\u003E\u003Cfigcaption\u003EComparison of several LoRA variants with MoRA on real datasets. Annotated table from MoRA: High-Rank Updating for Parameter-Efficient Finetuning, \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E3.3 Conclusion\u003C/h3\u003E\u003Cp\u003EIt turns out that the relatively simple MoRA method works surprisingly well. And it indeed slightly outperforms LoRA in continued pretraining. However, it does lag behind LoRA a bit in instruction fine-tuning and mathematical reasoning with small ranks. For me, the results are not compelling enough to replace LoRA with MoRA. Nonetheless, the paper presents an interesting method with an interesting set of experiments.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E \u003C/p\u003E\u003Chr\u003E\u003Cp\u003E\u003Cem\u003EThis magazine is personal passion project that does not offer direct compensation. However, for those who wish to support me, please consider purchasing a copy of \u003C/em\u003E\u003Ca href=\"https://sebastianraschka.com/books\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Eone of my books\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E. If you find them insightful and beneficial, please feel free to recommend them to your friends and colleagues. 
(Sharing your feedback with others \u003C/em\u003E\u003Ca href=\"https://www.amazon.com/Machine-Learning-AI-Essential-Questions/dp/1718503768/ref=sr_1_1?crid=1566EI5BQC9U0&dib=eyJ2IjoiMSJ9.4oCd5DaBraiVbzZDag-sX4dJQTIguc2mCbDGm1UCKmulcZsTRWmz--_y1AwHt5OmSFglsDpUXQO6FJ_fhs3n9qizrIqlU4STsWxFGor7WdW0QRtPtWgyzz8w0C3PHW8uwsDMJLN4VxjnFIkMizRHBoiHZjLGslLzmiLpzLTTlhbS7bSkhGnUcb1wKkartwWqVtq8c8KbnTdEJ34G6dOlf_YIfsYyG2XOGXneQuNmWh8.AZ3qpJ95F8x9PhDbClERH4iibU3U4r4CTLdtf2r1Nyc&dib_tag=se&keywords=Machine+Learning+Q+and+AI&qid=1717088426&sprefix=%2Caps%2C275&sr=8-1\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Evia a review on Amazon\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E helps a lot, too!)\u003C/em\u003E\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQG6HzR0-zsliw/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1721200388837?e=1749081600&v=beta&t=MqgFFW_U0D9UQLeUkadugSW14MXFg02G-MEUNwZjGDg\" src=\"//:0\"\u003E\u003Cfigcaption\u003EMachine Learning with PyTorch and Scikit-Learn\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003Cstrong\u003EYour support means a great deal! Thank you!\u003C/strong\u003E\u003C/p\u003E","url":"https://www.linkedin.com/pulse/llm-research-insights-instruction-masking-new-lora-raschka-phd-7p1oc"},{"@type":"Article","author":{"@type":"Person","name":"Sebastian Raschka, PhD","url":"https://www.linkedin.com/in/sebastianraschka"},"dateModified":"","datePublished":"2024-05-12T12:27:31.000+00:00","headline":"How Good Are the Latest Open LLMs? And Is DPO Better Than PPO?","image":{"@type":"ImageObject","url":"https://media.licdn.com/dms/image/v2/D5612AQESXYeiXIXGfw/article-cover_image-shrink_720_1280/article-cover_image-shrink_720_1280/0/1715381661557?e=2147483647&v=beta&t=2yjudmlWm5wd9jsd4Vpz94sAbp8NvBagcV1mSRp8xyw"},"articleBody":"\u003Cp\u003EApril 2024, what a month! My birthday, a \u003Ca href=\"https://www.amazon.com/Machine-Learning-AI-Essential-Questions/dp/1718503768\" target=\"_blank\" rel=\"nofollow noopener\"\u003Enew book release\u003C/a\u003E, spring is finally here, and four major open LLM releases: Mixtral, Meta AI's Llama 3, Microsoft's Phi-3, and Apple's OpenELM.\u003C/p\u003E\u003Cp\u003EThis article reviews and discusses all four major transformer-based LLM model releases that have been happening in the last few weeks, followed by new research on reinforcement learning with human feedback methods for instruction finetuning using PPO and DPO algorithms.\u003C/p\u003E\u003Cp\u003E1. How Good are Mixtral, Llama 3, and Phi-3?\u003C/p\u003E\u003Cp\u003E2. OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework\u003C/p\u003E\u003Cp\u003E3. Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch2\u003E1. Mixtral, Llama 3, and Phi-3: What's New?\u003C/h2\u003E\u003Cp\u003EFirst, let's start with the most prominent topic: the new major LLM releases this month. This section will briefly cover Mixtral, Llama 3, and Phi-3, which have been accompanied by short blog posts or short technical papers. 
The next section will cover Apple's OpenELM in a bit more detail, which thankfully comes with a research paper that shares lots of interesting details.\u003C/p\u003E\u003Ch3\u003E1.1 Mixtral 8x22B: Larger models are better!\u003C/h3\u003E\u003Cp\u003E\u003Ca href=\"https://mistral.ai/news/mixtral-8x22b/\" target=\"_blank\" rel=\"nofollow noopener\"\u003EMixtral 8x22B\u003C/a\u003E is the latest mixture-of-experts (MoE) model by Mistral AI, which has been released under a permissive Apache 2.0 open-source license.\u003C/p\u003E\u003Cp\u003ESimilar to the Mixtral 8x7B released in January 2024, the key idea behind this model is to replace each feed-forward module in a transformer architecture with 8 expert layers. It's going to be a relatively long article, so I am skipping the MoE explanations, but if you are interested, the \u003Ca href=\"https://magazine.sebastianraschka.com/p/research-papers-in-january-2024\" target=\"_blank\" rel=\"nofollow noopener\"\u003EMixtral 8x7B section in an article I shared a few months ago is a bit more detailed.\u003C/a\u003E\u003C/p\u003E\u003Cp\u003EThe perhaps most interesting plot from the \u003Ca href=\"https://mistral.ai/news/mixtral-8x22b/\" target=\"_blank\" rel=\"nofollow noopener\"\u003EMixtral blog post\u003C/a\u003E, which compares Mixtral 8x22B to several LLMs on two axes: modeling performance on the popular \u003Ca href=\"https://arxiv.org/abs/2009.03300\" target=\"_blank\" rel=\"nofollow noopener\"\u003EMeasuring Massive Multitask Language Understanding\u003C/a\u003E (MMLU) benchmark and active parameters (related to computational resource requirements).\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQGedEA97c-6AA/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1715381113169?e=1748476800&v=beta&t=XsQr6jL5nzDqwQ56o7F0DHa3aGdBr4-kf-Zepv3nksY\" src=\"//:0\"\u003E\u003Cfigcaption\u003EA comparison between Mixtral 8x22B and other LLMs. (Annotated figure based on a plot from \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E1.2 Llama 3: Larger data is better!\u003C/h3\u003E\u003Cp\u003E\u003Ca href=\"https://arxiv.org/abs/2302.13971\" target=\"_blank\" rel=\"nofollow noopener\"\u003EMeta AI's first Llama model release\u003C/a\u003E in February 2023 was a big breakthrough for openly available LLM and was a pivotal moment for open(-source) LLMs. So, naturally, everyone was excited about the \u003Ca href=\"https://arxiv.org/abs/2307.09288\" target=\"_blank\" rel=\"nofollow noopener\"\u003ELlama 2 release\u003C/a\u003E last year. Now, the Llama 3 models, which \u003Ca href=\"https://ai.meta.com/blog/meta-llama-3/\" target=\"_blank\" rel=\"nofollow noopener\"\u003EMeta AI has started to roll out\u003C/a\u003E, are similarly exciting.\u003C/p\u003E\u003Cp\u003EWhile Meta is still training some of their largest models (e.g., the 400B variant), they released models in the familiar 8B and 70B size ranges. And they are good! 
Below, I added the MMLU scores from the official \u003Ca href=\"https://ai.meta.com/blog/meta-llama-3/\" target=\"_blank\" rel=\"nofollow noopener\"\u003ELlama 3 blog article\u003C/a\u003E to the Mixtral plot I shared earlier.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEduqUGNtdWGQ/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1715381135009?e=1748476800&v=beta&t=xIUMMMXf74VZH2McrdEWaBDR6EMl788dbeDtnxoFQF4\" src=\"//:0\"\u003E\u003Cfigcaption\u003EA comparison between Llama 3, Mixtral, and other LLMs. (Annotated figure based on a plot from \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EOverall, the Llama 3 architecture is almost identical to Llama 2. The main differences are the increased vocabulary size and the fact that Llama 3 also uses grouped-query attention for the smaller-sized model. If you are looking for a grouped-query attention explainer, \u003Ca href=\"https://magazine.sebastianraschka.com/p/ahead-of-ai-11-new-foundation-models\" target=\"_blank\" rel=\"nofollow noopener\"\u003EI've written about it here\u003C/a\u003E.\u003C/p\u003E\u003Cp\u003EBelow are the configuration files used for implementing Llama 2 and Llama 3 in LitGPT, which help show the main differences at a glance.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEDLdYguqzqhA/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1715381177877?e=1748476800&v=beta&t=I43g-m08yoqdm-MRqUHNmtQ1tz-Wi11smI9_h_pJnV0\" src=\"//:0\"\u003E\u003Cfigcaption\u003EA comparison of Llama 2 and Llama 3 configurations via LitGPT, \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003Cstrong\u003ETraining data size\u003C/strong\u003E\u003C/p\u003E\u003Cp\u003EThe main contributor to the substantially better performance compared to Llama 2 is the much larger dataset. Llama 3 was trained on 15 trillion tokens, as opposed to \"only\" 2 trillion for Llama 2.\u003C/p\u003E\u003Cp\u003EThis is a very interesting finding because, as the \u003Ca href=\"https://ai.meta.com/blog/meta-llama-3/\" target=\"_blank\" rel=\"nofollow noopener\"\u003ELlama 3 blog post\u003C/a\u003E notes, according to the Chinchilla scaling laws, the optimal amount of training data for an 8 billion parameter model is much smaller, approximately 200 billion tokens. Moreover, the authors of Llama 3 observed that both the 8 billion and 70 billion parameter models demonstrated log-linear improvements even at the 15 trillion scale. This suggests that we (that is, researchers in general) could further enhance the model with more training data beyond 15 trillion tokens.\u003C/p\u003E\u003Cp\u003E\u003Cstrong\u003EInstruction finetuning and alignment\u003C/strong\u003E\u003C/p\u003E\u003Cp\u003EFor instruction finetuning and alignment, researchers usually choose between using reinforcement learning with human feedback (RLHF) via proximal policy optimization (PPO) or the reward-model-free direct preference optimization (DPO). Interestingly, the Llama 3 researchers did not favor one over the other; they used both! 
(More on PPO and DPO in a later section.)\u003C/p\u003E\u003Cp\u003EThe \u003Ca href=\"https://ai.meta.com/blog/meta-llama-3/\" target=\"_blank\" rel=\"nofollow noopener\"\u003ELlama 3 blog post\u003C/a\u003E stated that a Llama 3 research paper would follow in the coming month, and I am looking forward to the additional details that will hopefully be shared in this article.\u003C/p\u003E\u003Ch3\u003E1.3 Phi-3: Higher-quality data is better!\u003C/h3\u003E\u003Cp\u003EJust one week after the big Llama 2 release, Microsoft shared their new Phi-3 LLM. According to the benchmarks in the \u003Ca href=\"https://arxiv.org/abs/2404.14219\" target=\"_blank\" rel=\"nofollow noopener\"\u003Etechnical report\u003C/a\u003E, even the smallest Phi-3 model outperforms the Llama 3 8B model despite being less than half its size.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEE_zsyH-mZnw/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1715381198015?e=1748476800&v=beta&t=du2yysdDvmzZK3V9JHbWuKV0n44zDXyPjt_OdKQl7ns\" src=\"//:0\"\u003E\u003Cfigcaption\u003EA comparison between Phi-3, Llama 3, Mixtral, and other LLMs. (Annotated figure based on a plot from \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003ENotably, Phi-3, which is based on the Llama architecture, has been trained on 5x fewer tokens than Llama 3 (3.3 trillion instead of 15 trillion). Phi-3 even uses the same tokenizer with a vocabulary size of 32,064 as Llama 2, which is much smaller than the Llama 3 vocabulary size.\u003C/p\u003E\u003Cp\u003EAlso, Phi-3-mini has \"only\" 3.8 billion parameters, which is less than half the size of Llama 3 8B.\u003C/p\u003E\u003Cp\u003ESo, What is the secret sauce? According to the technical report, it's dataset quality over quantity: \"heavily filtered web data and synthetic data\".\u003C/p\u003E\u003Cp\u003EThe paper didn't go into too much detail regarding the data curation, but it largely follows the recipe used for previous Phi models. \u003Ca href=\"https://magazine.sebastianraschka.com/p/ahead-of-ai-12-llm-businesses\" target=\"_blank\" rel=\"nofollow noopener\"\u003EI wrote more about Phi models a few months ago here\u003C/a\u003E.\u003C/p\u003E\u003Cp\u003EAs of this writing, people are still unsure whether Phi-3 is really as good as promised. For instance, many people I talked to noted that Phi-3 is much worse than Llama 3 for non-benchmark tasks.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E1.4 Conclusion\u003C/h3\u003E\u003Cp\u003EBased on the three major releases described above, this has been an exceptional month for openly available LLMs. And I haven't even talked about my favorite model, OpenELM, which is discussed in the next section.\u003C/p\u003E\u003Cp\u003EWhich model should we use in practice? I think all three models above are attractive for different reasons. Mixtral has a lower active-parameter count than Llama 3 70B but still maintains a pretty good performance level. Phi-3 3.8B may be very appealing for mobile devices; according to the authors, a quantized version of it can run on an iPhone 14. And Llama 3 8B might be the most interesting all-rounder for fine-tuning since it can be comfortably fine-tuned on a single GPU when using LoRA.\u003C/p\u003E\u003Ch2\u003E2. 
OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework\u003C/h2\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003Ca href=\"https://arxiv.org/abs/2404.14619\" target=\"_blank\" rel=\"nofollow noopener\"\u003EOpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework\u003C/a\u003E is the latest LLM model suite and paper shared by researchers at Apple, aiming to provide small LLMs for deployment on mobile devices.\u003C/p\u003E\u003Cp\u003ESimilar to the \u003Ca href=\"https://magazine.sebastianraschka.com/p/research-papers-in-february-2024?utm_source=profile&utm_medium=reader2\" target=\"_blank\" rel=\"nofollow noopener\"\u003EOLMo\u003C/a\u003E, it's refreshing to see an LLM paper that shares details discussing the architecture, training methods, and training data. \u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQGit9C9D1aX6A/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1715381283976?e=1748476800&v=beta&t=NnkBWFbLwR7nOQK1NFYSZi4F0z-_YqWV5xh4QR-2OAI\" src=\"//:0\"\u003E\u003Cfigcaption\u003EComparison between OpenELM and other open-source LLMs that share datasets, code, and weights (there are not many out there that are similarly open). Annotated table from OpenELM paper, \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003ELet's start with the most interesting tidbits:\u003C/p\u003E\u003Cul\u003E\u003Cli\u003E\u003Cp\u003EOpenELM comes in 4 relatively small and convenient sizes: 270M, 450M, 1.1B, and 3B\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EFor each size, there's also an instruct-version available trained with \u003Ca href=\"https://arxiv.org/abs/2309.06657\" target=\"_blank\" rel=\"nofollow noopener\"\u003Erejection sampling\u003C/a\u003E and \u003Ca href=\"https://magazine.sebastianraschka.com/i/142924793/rlhf-vs-direct-preference-optimization-dpo\" target=\"_blank\" rel=\"nofollow noopener\"\u003Edirect preference optimization\u003C/a\u003E\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EOpenELM performs slightly better than OLMo even though it's trained on 2x fewer tokens\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EThe main architecture tweak is a layer-wise scaling strategy\u003C/p\u003E\u003C/li\u003E\u003C/ul\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E2.1 Architecture details\u003C/h3\u003E\u003Cp\u003EBesides the layer-wise scaling strategy (more details later), the overall architecture settings and hyperparameter configuration are relatively similar to other LLMs like OLMo and Llama, as summarized in the figure below.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQHlKIL_WWXFsQ/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1715381302942?e=1748476800&v=beta&t=am1MNz5ActLGU4yBPAAeLC-IxX5KAigWPPm7_4DFsgs\" src=\"//:0\"\u003E\u003Cfigcaption\u003EArchitecture and hyperparameter comparison between OpenELM, the smallest OLMo model, and the smallest Llama 2 model. \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E2.2 Training dataset\u003C/h3\u003E\u003Cp\u003ESharing details is different from explaining them as research papers aimed to do when I was a student. 
For instance, they sampled a relatively small subset of 1.8T tokens from various public datasets (\u003Ca href=\"https://arxiv.org/abs/2306.01116\" target=\"_blank\" rel=\"nofollow noopener\"\u003ERefinedWeb\u003C/a\u003E, \u003Ca href=\"https://github.com/togethercomputer/RedPajama-Data\" target=\"_blank\" rel=\"nofollow noopener\"\u003ERedPajama\u003C/a\u003E, \u003Ca href=\"https://arxiv.org/abs/2101.00027\" target=\"_blank\" rel=\"nofollow noopener\"\u003EThe PILE\u003C/a\u003E, and \u003Ca href=\"https://arxiv.org/abs/2402.00159\" target=\"_blank\" rel=\"nofollow noopener\"\u003EDolma\u003C/a\u003E). This subset was 2x smaller than Dolma, which was used for training OLMo. But what was the rationale for this subsampling, and what were the sampling criteria?\u003C/p\u003E\u003Cp\u003EOne of the authors kindly followed up with me on that saying \"Regarding dataset: We did not have any rationale behind dataset sampling, except we wanted to use public datasets of about 2T tokens (following LLama2).\"\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQFVHxF4lkTo6w/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1715381432848?e=1748476800&v=beta&t=JzMF1_Te-yprkUXXa8gnMoyatKCkfdUxSMwyiiryZgk\" src=\"//:0\"\u003E\u003Cfigcaption\u003ENumber of tokens used for training OpenELM vs the original number of tokens in the dataset (note that the precise token number depends on the tokenizer used). Annotated table from OpenELM paper, \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E2.3 Layer-wise scaling\u003C/h3\u003E\u003Cp\u003EThe layer-wise scaling strategy (adopted from the \u003Ca href=\"https://arxiv.org/abs/2008.00623\" target=\"_blank\" rel=\"nofollow noopener\"\u003EDeLighT: Deep and Light-weight Transformer\u003C/a\u003E paper) is very interesting. Essentially, the researchers gradually widen the layers from the early to the later transformer blocks. In particular, keeping the head size constant, the researchers increase the number of heads in the attention module. They also scale the hidden dimension of the feed-forward module, as illustrated in the figure below.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEmgdav_w_vTg/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1715381450660?e=1748476800&v=beta&t=VK51OQHIkhOXTP_ksNupQuDSDmc_V4mBX0cB6hhojVY\" src=\"//:0\"\u003E\u003Cfigcaption\u003ELLM architecture based on my \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EI wish there was an ablation study training an LLM with and without the layer-wise scaling strategy on the same dataset. 
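\u003Cp\u003ETo make the layer-wise scaling idea a bit more concrete, here is a rough sketch that linearly interpolates the number of attention heads and the feed-forward hidden dimension from the first to the last transformer block while keeping the per-head dimension constant. The interpolation scheme and all concrete numbers are simplifying assumptions for illustration, not the exact parameterization used in OpenELM or DeLighT.\u003C/p\u003E\u003Cpre\u003E
# Illustrative layer-wise scaling: later transformer blocks get more attention heads
# and a wider feed-forward hidden layer, while the head size stays constant.
head_dim = 64                      # assumed constant per-head dimension
embed_dim = 1024                   # assumed model (residual stream) dimension
n_layers = 16                      # assumed number of transformer blocks
min_heads, max_heads = 8, 16
min_ffn_mult, max_ffn_mult = 2.0, 4.0

def interpolate(lo, hi, layer_idx):
    frac = layer_idx / (n_layers - 1)
    return lo + frac * (hi - lo)

for layer_idx in range(n_layers):
    n_heads = round(interpolate(min_heads, max_heads, layer_idx))
    ffn_hidden = round(interpolate(min_ffn_mult, max_ffn_mult, layer_idx) * embed_dim)
    print(f'block {layer_idx:2d}: heads={n_heads:2d}, ffn_hidden={ffn_hidden}')
\u003C/pre\u003E\u003Cp\u003EIn a standard transformer, every block would use the same two numbers; the scaling strategy redistributes the parameter budget toward the later blocks.\u003C/p\u003E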
But those experiments are expensive, and I can understand why it wasn't done.\u003C/p\u003E\u003Cp\u003EHowever, we can find ablation studies in the \u003Ca href=\"https://arxiv.org/abs/2008.00623\" target=\"_blank\" rel=\"nofollow noopener\"\u003EDeLighT: Deep and Light-weight Transformer\u003C/a\u003E paper that first introduced the layer-wise scaling on a smaller dataset based on the original encoder-decoder architecture, as shown below.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQFxcJLTYGsxBw/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1715381472453?e=1748476800&v=beta&t=0VxUDgwzUkcxtVlN1IBBlzulQeLJt-OiMrMnoWZJLbg\" src=\"//:0\"\u003E\u003Cfigcaption\u003EA comparison between standard transformer blocks and transformer-blocks with layer-wise (block-wise) scaling from the DeLighT paper, \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E2.4 LoRA vs DoRA\u003C/h3\u003E\u003Cp\u003EAn interesting bonus I didn't expect was that the researchers compared LoRA and DoRA (\u003Ca href=\"https://magazine.sebastianraschka.com/p/lora-and-dora-from-scratch\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ewhich I discussed a few weeks ago\u003C/a\u003E) for parameter-efficient finetuning! It turns out that there wasn't a noticeable difference between the two methods, though.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQGB1Fn7fl2fQQ/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1715381488064?e=1748476800&v=beta&t=P-DfU0O4avxw6KALG85RrOFahE6NaYeOKQNocgRZ2yI\" src=\"//:0\"\u003E\u003Cfigcaption\u003EModeling performance comparison between two parameter-efficient finetuning methods: LoRA and DoRA. Annotated table from OpenELM paper, \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E2.5 Conclusion\u003C/h3\u003E\u003Cp\u003EWhile the paper doesn't answer any research questions, it's a great, transparent write-up of the LLM implementation details. The layer-wise scaling strategy might be something that we could see more often in LLMs from now on. Also, the paper is only one part of the release. For further details, Apple also shared the \u003Ca href=\"https://github.com/apple/corenet/tree/main/mlx_examples/open_elm\" target=\"_blank\" rel=\"nofollow noopener\"\u003EOpenELM code on GitHub\u003C/a\u003E.\u003C/p\u003E\u003Cp\u003EAnyways, great work, and big kudos to the researchers (and Apple) for sharing!\u003C/p\u003E\u003Ch2\u003E3. Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study\u003C/h2\u003E\u003Cp\u003E\u003Ca href=\"https://arxiv.org/abs/2404.10719\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003EIs DPO Superior to PPO for LLM Alignment? A Comprehensive Study\u003C/em\u003E\u003C/a\u003E finally answers one of the key questions I've been raising in previous months. \u003C/p\u003E\u003Cp\u003ELet's start with a brief overview before diving into the results: Both PPO (proximal policy optimization) and DPO (direct preference optimization) are popular methods for aligning LLMs via reinforcement learning with human feedback (RLHF). \u003C/p\u003E\u003Cp\u003ERLHF is a key component of LLM development, and it's used to align LLMs with human preferences, for example, to improve the safety and helpfulness of LLM-generated responses. 
\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQF--7F8YCAYgg/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1715381508663?e=1748476800&v=beta&t=Igp8UlB9vFlbU2XtqPmIDS7cBGj70UKZEyUjXRhOAuc\" src=\"//:0\"\u003E\u003Cfigcaption\u003EThe typical LLM training lifecycle\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EFor a more detailed explanation and comparison, also see the \u003Cem\u003EEvaluating Reward Modeling for Language Modeling section\u003C/em\u003E in my \u003Ca href=\"https://magazine.sebastianraschka.com/p/tips-for-llm-pretraining-and-evaluating-rms\" target=\"_blank\" rel=\"nofollow noopener\"\u003ETips for LLM Pretraining and Evaluating Reward Models\u003C/a\u003E article that I shared last month.\u003C/p\u003E\u003Ch3\u003E3.1 What are RLHF-PPO and DPO?\u003C/h3\u003E\u003Cp\u003ERLHF-PPO, the original LLM alignment method, has been the backbone of OpenAI's \u003Ca href=\"https://arxiv.org/abs/2203.02155\" target=\"_blank\" rel=\"nofollow noopener\"\u003EInstructGPT\u003C/a\u003E and the LLMs deployed in ChatGPT. However, the landscape has shifted in recent months with the emergence of DPO-finetuned LLMs, which have made a significant impact on public leaderboards. This surge in popularity can be attributed to DPO's reward-free alternative, which is notably easier to use: Unlike PPO, DPO doesn't require training a separate reward model but uses a classification-like objective to update the LLM directly.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQFpCL6c44qW_w/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1715381526740?e=1748476800&v=beta&t=UI1V68OVPnH4lvQtqyUN61-s1NblC_qlPLUnImO8Sw0\" src=\"//:0\"\u003E\u003Cfigcaption\u003EExcerpt from \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EToday, most LLMs on top of public leaderboards have been trained with DPO rather than PPO. Unfortunately, though, there have not been any direct head-to-head comparisons where the same model was trained with either PPO or DPO using the same dataset until this new paper came along. \u003C/p\u003E\u003Ch3\u003E3.2 PPO is generally better than DPO\u003C/h3\u003E\u003Cp\u003E\u003Ca href=\"https://arxiv.org/abs/2404.10719\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003EIs DPO Superior to PPO for LLM Alignment? A Comprehensive Study\u003C/em\u003E\u003C/a\u003E is a well-written paper with lots of experiments and results, but the main takeaways are that PPO is generally better than DPO, and DPO suffers more heavily from out-of-distribution data.\u003C/p\u003E\u003Cp\u003EHere, out-of-distribution data means that the LLM has been previously trained on instruction data (using supervised finetuning) that is different from the preference data for DPO. For example, an LLM has been trained on the general Alpaca dataset before being DPO-finetuned on a different dataset with preference labels. 
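\u003Cp\u003E(For reference, since DPO's \"classification-like objective\" was mentioned above: below is a minimal sketch of that loss as it is usually written, computed from the policy's and the frozen reference model's sequence-level log-probabilities of a preferred and a dispreferred response. The tensor values and the beta setting are placeholders for illustration.)\u003C/p\u003E\u003Cpre\u003E
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards: beta * (policy log-prob minus reference log-prob) per response
    chosen_rewards = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_rewards = beta * (policy_rejected_logp - ref_rejected_logp)
    # Push the margin between the chosen and rejected response up, logistic-regression style
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# toy example with a batch of two preference pairs
loss = dpo_loss(torch.tensor([-12.0, -20.0]), torch.tensor([-15.0, -19.0]),
                torch.tensor([-13.0, -21.0]), torch.tensor([-14.0, -20.0]))
print(loss)
\u003C/pre\u003E\u003Cp\u003ENote that no reward model appears anywhere in this function, which is exactly why DPO is so much easier to set up than RLHF-PPO.\u003C/p\u003E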
(One way to improve DPO on out-of-distribution data is to add a supervised instruction-finetuning round on the preference dataset before following up with DPO finetuning).\u003C/p\u003E\u003Cp\u003EThe main findings are summarized in the figure below.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEkjPAILqHSrw/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1715381539609?e=1748476800&v=beta&t=O0ON63r_6J5gOWLuRwngo_fk4-eBh6kEWUTvtqPwf7c\" src=\"//:0\"\u003E\u003Cfigcaption\u003EAnnotated table from the \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EIn addition to the main results above, the paper includes several additional experiments and ablation studies that I recommend checking out if you are interested in this topic. \u003C/p\u003E\u003Ch3\u003E3.3 Best practices\u003C/h3\u003E\u003Cp\u003EFurthermore, interesting takeaways from this paper include best-practice recommendations when using DPO and PPO.\u003C/p\u003E\u003Cp\u003EFor instance, if you use DPO, make sure to perform supervised finetuning on the preference data first. Also, iterative DPO, which involves labeling additional data with an existing reward model, is better than DPO on the existing preference data.\u003C/p\u003E\u003Cp\u003EIf you use PPO, the key success factors are large batch sizes, advantage normalization, and parameter updates via an exponential moving average.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQHI1tesZKY3dg/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1715381556705?e=1748476800&v=beta&t=YS95BRrb_zBACM6QyMxfS_LEDyrJseSEb37VPC30nYY\" src=\"//:0\"\u003E\u003Cfigcaption\u003EAn excerpt of a preference dataset (examples taken from the \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003E3.4 Conclusion\u003C/h3\u003E\u003Cp\u003EBased on this paper's results, PPO seems superior to DPO if used correctly. However, given that DPO is more straightforward to use and implement, I expect DPO to remain a popular go-to method. \u003C/p\u003E\u003Cp\u003EA good practical recommendation may be to use PPO if you have ground truth reward labels (so you don't have to pretrain your own reward model) or if you can download an in-domain reward model. Otherwise, use DPO for simplicity.\u003C/p\u003E\u003Cp\u003EAlso, based on what we know from the LLama 3 blog post, we don't have to decide whether to use PPO or DPO, but we can use both! For instance, the recipe behind Llama 3 has been the following pipeline: Pretraining → supervised finetuning → rejection sampling → PPO → DPO. (I am hoping the Llama 3 developers will share a paper with more details soon!)\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Chr\u003E\u003Ch3\u003EUnderstanding LLMs (really well)\u003C/h3\u003E\u003Cp\u003EOne of the best ways to understand LLMs is to code one from scratch!\u003C/p\u003E\u003Cp\u003EIf you are interested in learning more about LLMs, I am covering, implementing, and explaining the whole LLM lifecycle in my \u003Ca href=\"https://www.manning.com/books/build-a-large-language-model-from-scratch\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\"Build a Large Language Model from Scratch\" book\u003C/a\u003E, which is currently available at a discounted price before it is published in Summer 2024. 
\u003C/p\u003E\u003Cp\u003EChapter 5, which covers the pretraining, was just released 2 weeks ago. If this sounds interesting and useful to you, you can take a sneak peak at the code on GitHub \u003Ca href=\"https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/01_main-chapter-code/ch05.ipynb\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehere\u003C/a\u003E.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQG0NSZdEjiVig/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1715381579884?e=1748476800&v=beta&t=d6AM7Qf7uhUtAw3Q4ZuTsCJpTyOyoqeAqc2gifC3FlY\" src=\"//:0\"\u003E\u003Cfigcaption\u003E“Build a Large Language Model from Scratch” book\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E","url":"https://www.linkedin.com/pulse/how-good-latest-open-llms-dpo-better-than-ppo-sebastian-raschka-phd-tjl2c"},{"@type":"Article","author":{"@type":"Person","name":"Sebastian Raschka, PhD","url":"https://www.linkedin.com/in/sebastianraschka"},"dateModified":"","datePublished":"2024-04-20T12:46:01.000+00:00","headline":"Using and Finetuning Pretrained Transformers","image":{"@type":"ImageObject","url":"https://media.licdn.com/dms/image/v2/D4D12AQE9u2TSkhrLGw/article-cover_image-shrink_720_1280/article-cover_image-shrink_720_1280/0/1713458042630?e=2147483647&v=beta&t=1HsR4VaXK7sMf19jM1mxLUuiepjCmzXtY1HiPaR8dGM"},"articleBody":"\u003Cp\u003EThis week has been filled with developments, including exciting new AI research that I’ll be discussing in \u003Ca href=\"https://magazine.sebastianraschka.com/archive\" target=\"_blank\" rel=\"nofollow noopener\"\u003Emy usual end-of-month write-ups\u003C/a\u003E.\u003C/p\u003E\u003Cp\u003EAdditionally, I am excited to announce the release of my new book, \u003Ca href=\"https://www.amazon.com/Machine-Learning-AI-Essential-Questions/dp/1718503768\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003EMachine Learning Q and AI\u003C/em\u003E\u003C/a\u003E, published by No Starch Press.\u003C/p\u003E\u003Cp\u003EIf you’ve been searching for a resource following an introductory machine learning course, this might be the one. 
I’m covering 30 concepts that were slightly out of scope for the previous books and courses I’ve taught, and I’ve compiled them here in a concise question-and-answer format (including exercises).\u003C/p\u003E\u003Cp\u003EI believe it will also serve as a useful companion for preparing for machine learning interviews.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEkPR29t2r5kg/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1713386938223?e=1749081600&v=beta&t=bMDmQHnacq-IBhW3cu0lXl_UprwPDyJpmCLar_AUUTU\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EMachine Learning Q and AI is available on the \u003C/em\u003E\u003Ca href=\"https://nostarch.com/machine-learning-q-and-ai\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Epublisher's website\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E, \u003C/em\u003E\u003Ca href=\"https://www.amazon.com/Machine-Learning-AI-Essential-Questions/dp/1718503768\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003EAmazon\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E, and many other book stores.\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003ESince the different ways to use and finetune pretrained large language models are currently one of the most frequently discussed topics, I wanted to share an excerpt from the book, in hope that it might be useful for your current projects.\u003C/p\u003E\u003Cp\u003E\u003Cem\u003EHappy reading!\u003C/em\u003E\u003C/p\u003E\u003Ch3\u003EWhat are the different ways to use and finetune pretrained large language models?\u003C/h3\u003E\u003Cp\u003EWhat are the different ways to use and finetune pretrained large language models (LLMs)? The three most common ways to use and finetune pretrained LLMs include a feature-based approach, in-context prompting, and updating a subset of the model parameters.\u003C/p\u003E\u003Cp\u003EFirst, most pretrained LLMs or language transformers can be utilized without the need for further finetuning. For instance, we can employ a feature-based method to train a new downstream model, such as a linear classifier, using embeddings generated by a pretrained transformer. Second, we can showcase examples of a new task within the input itself, which means we can directly exhibit the expected outcomes without requiring any updates or learning from the model. This concept is also known as prompting. Finally, it’s also possible to finetune all or just a small number of parameters to achieve the desired outcomes.\u003C/p\u003E\u003Cp\u003EThe following sections discuss these types of approaches in greater depth.\u003C/p\u003E\u003Ch3\u003EUsing Transformers for Classification Tasks\u003C/h3\u003E\u003Cp\u003ELet’s start with the conventional methods for utilizing pretrained transformers: training another model on feature embeddings, finetuning output layers, and finetuning all layers. We’ll discuss these in the context of classification. (We will revisit prompting later in the section “In-Context Learning, Indexing, and Prompt Tuning”.)\u003C/p\u003E\u003Cp\u003E\u003Cstrong\u003EThe feature-based approach\u003C/strong\u003E\u003C/p\u003E\u003Cp\u003EIn the feature-based approach, we load the pretrained model and keep it “frozen,” meaning we do not update any parameters of the pretrained model. Instead, we treat the model as a feature extractor that we apply to our new dataset. We then train a downstream model on these embeddings. 
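For concreteness, here is a minimal sketch of this feature-based workflow; the model name, example texts, and pooling choice are illustrative placeholders rather than the exact code from the book's supplementary repository:\u003Cpre\u003E
# Minimal sketch of the feature-based approach (illustrative, not the book's exact code)
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
model = AutoModel.from_pretrained('distilbert-base-uncased')  # kept frozen
model.eval()

texts = ['the movie was great', 'terrible acting']   # placeholder training texts
labels = [1, 0]                                      # placeholder class labels

@torch.no_grad()
def embed(batch):
    # Mean-pool the last hidden states into one fixed-size feature vector per text.
    tokens = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')
    hidden = model(**tokens).last_hidden_state        # (batch, seq_len, hidden_dim)
    return hidden.mean(dim=1).numpy()                 # (batch, hidden_dim)

features = embed(texts)   # embeddings can be precomputed once and reused across epochs
classifier = LogisticRegression(max_iter=1000).fit(features, labels)
print(classifier.predict(embed(['what a fantastic film'])))
\u003C/pre\u003E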
This downstream model can be any model we like (random forests, XGBoost, and so on), but linear classifiers typically perform best. This is likely because pretrained transformers like BERT, GPT, Llama, Mistral, and so on already extract high-quality, informative features from the input data. These feature embeddings often capture complex relationships and patterns, making it easy for a linear classifier to effectively separate the data into different classes.\u003C/p\u003E\u003Cp\u003EFurthermore, linear classifiers, such as logistic regression models and support vector machines, tend to have strong regularization properties. These regularization properties help prevent overfitting when working with high-dimensional feature spaces generated by pretrained transformers. This feature-based approach is the most efficient method since it doesn’t require updating the transformer model at all. Finally, the embeddings can be precomputed for a given training dataset (since they don’t change) when training a classifier for multiple training epochs.\u003C/p\u003E\u003Cp\u003EFigure 1 illustrates how LLMs are typically created and adopted for downstream tasks using finetuning. Here, a pretrained model, trained on a general text corpus, is finetuned to perform tasks like German-to-English translation.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEBzeMprwvpwQ/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1713386969053?e=1749081600&v=beta&t=TZqnHM18VthI-3-1dQ6ld3y_dUlpFTiGq7YYGtlOqfI\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EFigure 1: The general finetuning workflow of large language models\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003Cstrong\u003EFinetuning\u003C/strong\u003E\u003C/p\u003E\u003Cp\u003EThe conventional methods for finetuning pretrained LLMs include updating only the output layers, a method we’ll refer to as finetuning I, and updating all layers, which we’ll call finetuning II.\u003C/p\u003E\u003Cp\u003EFinetuning I is similar to the feature-based approach described earlier, but it adds one or more output layers to the LLM itself. The backbone of the LLM remains frozen, and we update only the model parameters in these new layers. Since we don’t need to backpropagate through the whole network, this approach is relatively efficient regarding throughput and memory requirements. In finetuning II, we load the model and add one or more output layers, similarly to finetuning I. However, instead of backpropagating only through the last layers, we update all layers via backpropagation, making this the most expensive approach. While this method is computationally more expensive than the feature-based approach and finetuning I, it typically leads to better modeling or predictive performance. This is especially true for more specialized domain-specific datasets.\u003C/p\u003E\u003Cp\u003EFigure 2 summarizes the three approaches described in this section so far.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEeeY5cKW6ZVg/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1713386987990?e=1749081600&v=beta&t=Fmdps1tOkrfqrX3ihevqVODlBd9kHJ244-X2yqvMUNg\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EFigure 2: The three conventional approaches for utilizing pretrained LLMs. 
You can find code examples of all three methods \u003C/em\u003E\u003Ca href=\"https://github.com/rasbt/MachineLearning-QandAI-book/tree/main/supplementary/q18-using-llms/01_classifier-finetuning\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehere\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E.\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EIn addition to the conceptual summary of the three finetuning methods described in this section, Figure 2 also provides a rule-of-thumb guideline for these methods regarding training efficiency. Since finetuning II involves updating more layers and parameters than finetuning I, backpropagation is costlier for finetuning II. For similar reasons, finetuning II is costlier than a simpler feature-based approach.\u003C/p\u003E\u003Cp\u003EInterested readers can find code examples illustrating the feature-based approach, finetuning one or more layers, and finetuning the complete transformer for classification \u003Ca href=\"https://github.com/rasbt/MachineLearning-QandAI-book/tree/main/supplementary/q18-using-llms/01_classifier-finetuning\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehere\u003C/a\u003E.\u003C/p\u003E\u003Ch3\u003EIn-Context Learning, Indexing, and Prompt Tuning\u003C/h3\u003E\u003Cp\u003ELLMs like GPT-2 and GPT-3 popularized the concept of in-context learning, often called zero-shot or few-shot learning in this context, which is illustrated in Figure 3.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQFz3PgUbG4WDw/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1713387013917?e=1749081600&v=beta&t=BpdpCN8jNdSD7zpqgUahuZEjUF0WRym1fuVxqZADL50\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EFigure 3: Prompting an LLM for in-context learning. You can find code for in-context learning \u003C/em\u003E\u003Ca href=\"https://github.com/rasbt/MachineLearning-QandAI-book/blob/main/supplementary/q18-using-llms/02_prompting\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehere\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E.\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EAs Figure 3 shows, in-context learning aims to provide context or examples of the task within the input or prompt, allowing the model to infer the desired behavior and generate appropriate responses. This approach takes advantage of the model’s ability to learn from vast amounts of data during pretraining, which includes diverse tasks and contexts.\u003C/p\u003E\u003Cblockquote\u003E\u003Cp\u003ENote: The definition of few-shot learning, considered synonymous with in-context learning-based methods, differs from the conventional approach to few-shot learning discussed in Chapter 3.\u003C/p\u003E\u003C/blockquote\u003E\u003Cp\u003EFor example, suppose we want to use in-context learning for few-shot German–English translation using a large-scale pretrained language model like GPT-3. 
To do so, we provide a few examples of German–English translations to help the model understand the desired task, as follows:\u003C/p\u003E\u003Cpre\u003E\u003C/pre\u003E\u003Cp\u003EGenerally, in-context learning does not perform as well as finetuning for certain tasks or specific datasets since it relies on the pretrained model’s ability to generalize from its training data without further adapting its parameters for the particular task at hand.\u003C/p\u003E\u003Cp\u003EHowever, in-context learning has its advantages. It can be particularly useful when labeled data for finetuning is limited or unavailable. It also enables rapid experimentation with different tasks without finetuning the model parameters in cases where we don’t have direct access to the model or where we interact only with the model through a UI or API (for example, ChatGPT).\u003C/p\u003E\u003Cp\u003ERelated to in-context learning is the concept of hard prompt tuning, where hard refers to the non-differentiable nature of the input tokens. Where the previously described finetuning methods update the model parameters to better perform the task at hand, hard prompt tuning aims to optimize the prompt itself to achieve better performance. Prompt tuning does not modify the model parameters, but it may involve using a smaller labeled dataset to identify the best prompt formulation for the specific task. For example, to improve the prompts for the previous German–English translation task, we might try the following three prompting variations:\u003C/p\u003E\u003Cul\u003E\u003Cli\u003E\u003Cp\u003E\"Translate the German sentence '{german_sentence}' into English: {english_translation}\"\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003E\"German: '{german_sentence}' | English: {english_translation}\"\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003E\"From German to English: '{german_sentence}' -> {english_translation}\"\u003C/p\u003E\u003C/li\u003E\u003C/ul\u003E\u003Cp\u003EPrompt tuning is a resource-efficient alternative to parameter finetuning. However, its performance is usually not as good as full model finetuning, as it does not update the model’s parameters for a specific task, potentially limiting its ability to adapt to task-specific nuances. Furthermore, prompt tuning can be labor intensive since it requires either human involvement comparing the quality of the different prompts or another similar method to do so. This is often known as hard prompting since, again, the input tokens are not differentiable. In addition, other methods exist that propose to use another LLM for automatic prompt generation and evaluation.\u003C/p\u003E\u003Cblockquote\u003E\u003Cp\u003EYou can find code examples demonstrating prompting and in-context learning \u003Ca href=\"https://github.com/rasbt/MachineLearning-QandAI-book/blob/main/supplementary/q18-using-llms/02_prompting\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehere\u003C/a\u003E.\u003C/p\u003E\u003C/blockquote\u003E\u003Cp\u003EYet another way to leverage a purely in-context learning-based approach is LLM indexing, illustrated in Figure 4.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQGp4pPvpsKkNw/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1713387053809?e=1749081600&v=beta&t=KzJF7WcNfwcF6yhpAAP0zS-NxgisQZNRX0F9i8fVdhM\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EFigure 4: LLM indexing to retrieve information from external documents. 
You can find a indexing code example \u003C/em\u003E\u003Ca href=\"https://github.com/rasbt/MachineLearning-QandAI-book/blob/main/supplementary/q18-using-llms/03_retrieval-augmented-generation\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehere\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E.\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EIn the context of LLMs, we can think of indexing as a workaround based on in-context learning that allows us to turn LLMs into information retrieval systems to extract information from external resources and websites. In Figure 4, an indexing module parses a document or website into smaller chunks. These chunks are embedded into vectors that can be stored in a vector database. When a user submits a query, the indexing module computes the vector similarity between the embedded query and each vector stored in the database. Finally, the indexing module retrieves the top k most similar embeddings to synthesize the response.\u003C/p\u003E\u003Cblockquote\u003E\u003Cp\u003ELLM indexing is often used as an umbrella term that describes a framework or process of connecting an LLM to an existing data source. A specific example of this is retrieval augmented generation (RAG). RAG involves combining an LLM with a retrieval system to enhance the model’s ability to generate responses.\u003C/p\u003E\u003C/blockquote\u003E\u003Cp\u003EInterested readers can find code examples illustrating LLM Indexing and retrieval augmented generation \u003Ca href=\"https://github.com/rasbt/MachineLearning-QandAI-book/blob/main/supplementary/q18-using-llms/04_retrieval-augmented-generation\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehere\u003C/a\u003E.\u003C/p\u003E\u003Ch3\u003EParameter-Efficient Finetuning\u003C/h3\u003E\u003Cp\u003EIn recent years, many methods have been developed to adapt pretrained transformers more efficiently for new target tasks. These methods are commonly referred to as parameter-efficient finetuning, with the most popular methods at the time of writing summarized in Figure 5.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQENMQFhvOfkEA/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1713387188304?e=1749081600&v=beta&t=hlSVZNXpwHeleyAST7q83Dm3hx-pffrmrSWFoSlG65I\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EFigure 5: The main categories of parameter-efficient finetuning techniques, with popular examples\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EIn contrast to the hard prompting approach discussed in the previous section, soft prompting strategies optimize embedded versions of the prompts. While in hard prompt tuning we modify the discrete input tokens, in soft prompt tuning we utilize trainable parameter tensors instead.\u003C/p\u003E\u003Cp\u003E\u003Cstrong\u003ESoft Prompt Tuning\u003C/strong\u003E\u003C/p\u003E\u003Cp\u003EThe idea behind soft prompt tuning is to prepend a trainable parameter tensor (the “soft prompt”) to the embedded query tokens. The prepended tensor is then tuned to improve the modeling performance on a target dataset using gradient descent. In Python-like pseudocode, soft prompt tuning can be described as\u003C/p\u003E\u003Cpre\u003E\u003C/pre\u003E\u003Cp\u003Ewhere the soft_prompt_tensor has the same feature dimension as the embedded inputs produced by the embedding layer. 
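As a minimal sketch (with illustrative names, assuming an existing embedding layer and a frozen base model), the idea can be expressed as follows:\u003Cpre\u003E
# Rough sketch of soft prompt tuning (illustrative names, not a specific library API)
import torch

num_prompt_tokens, embed_dim = 20, 768
soft_prompt_tensor = torch.nn.Parameter(            # the only trainable parameters
    torch.randn(num_prompt_tokens, embed_dim))

def inputs_with_soft_prompt(input_ids, embedding_layer):
    x = embedding_layer(input_ids)                   # (seq_len, embed_dim), frozen weights
    # Prepend the soft prompt so the sequence grows by num_prompt_tokens rows.
    return torch.cat([soft_prompt_tensor, x], dim=0)
\u003C/pre\u003E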
Consequently, the modified input matrix has additional rows (as if it extended the original input sequence with additional tokens, making it longer).\u003C/p\u003E\u003Cp\u003E\u003Cstrong\u003EPrefix Tuning\u003C/strong\u003E\u003C/p\u003E\u003Cp\u003EAnother popular prompt tuning method is prefix tuning. Prefix tuning is similar to soft prompt tuning, except that in prefix tuning, we prepend trainable tensors (soft prompts) to each transformer block instead of only the embedded inputs, which can stabilize the training. The implementation of prefix tuning is illustrated in the following pseudocode:\u003C/p\u003E\u003Cpre\u003E\u003C/pre\u003E\u003Cp\u003E\u003Cem\u003EListing 1: A transformer block modified for prefix tuning\u003C/em\u003E\u003C/p\u003E\u003Cp\u003ELet’s break Listing 1 into three main parts: implementing the soft prompt, concatenating the soft prompt (prefix) with the input, and implementing the rest of the transformer block. First, the soft_prompt, a tensor, is processed through a set of fully connected layers ➊. Second, the transformed soft prompt is concatenated with the main input, x ➋. The dimension along which they are concatenated is denoted by seq_len, referring to the sequence length dimension. Third, the subsequent lines of code ➌ describe the standard operations in a transformer block, including self-attention, layer normalization, and feed-forward neural network layers, wrapped around residual connections.\u003C/p\u003E\u003Cp\u003EAs shown in Listing 1, prefix tuning modifies a transformer block by adding a trainable soft prompt. Figure 6 further illustrates the difference between a regular transformer block and a prefix tuning transformer block.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQGe034gTMGR2Q/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1713387213812?e=1749081600&v=beta&t=ft1ivu9V0y7xtKb5RRmhEsk4361a_iKnFfnq4v55qN8\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EFigure 6: A regular transformer compared with prefix tuning\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EBoth soft prompt tuning and prefix tuning are considered parameter efficient since they require training only the prepended parameter tensors and not the LLM parameters themselves.\u003C/p\u003E\u003Cp\u003E\u003Cstrong\u003EAdapter Methods\u003C/strong\u003E\u003C/p\u003E\u003Cp\u003EAdapter methods are related to prefix tuning in that they add additional parameters to the transformer layers. In the original adapter method, additional fully connected layers were added after the multi-head self-attention and existing fully connected layers in each transformer block, as illustrated in Figure 7.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQG0EaWiWui3qg/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1713387250399?e=1749081600&v=beta&t=EZIkqHIL9NATytpeY56ffVZq_y4dnDUidOZByC_7_4g\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Cem\u003EFigure 7: Comparison of a regular transformer block (left) and a transformer block with adapter layers. 
You can find a code example illustrating adapter layers \u003C/em\u003E\u003Ca href=\"https://github.com/rasbt/MachineLearning-QandAI-book/tree/main/supplementary/q18-using-llms/04_adapter\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehere\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E.\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003EOnly the new adapter layers are updated when training the LLM using the original adapter method, while the remaining transformer layers remain frozen. Since the adapter layers are usually small—the first fully connected layer in an adapter block projects its input into a low-dimensional representation, while the second layer projects it back into the original input dimension—this adapter method is usually considered parameter efficient.\u003C/p\u003E\u003Cp\u003EIn pseudocode, the original adapter method can be written as follows:\u003C/p\u003E\u003Cpre\u003E\u003C/pre\u003E\u003Cp\u003E\u003Cstrong\u003ELow-rank adaptation\u003C/strong\u003E\u003C/p\u003E\u003Cp\u003ELow-rank adaptation (LoRA), another popular parameter-efficient finetuning method worth considering, refers to reparameterizing pretrained LLM weights using low-rank transformations. LoRA is related to the concept of low-rank transformation, a technique to approximate a high-dimensional matrix or dataset using a lower-dimensional representation. The lower-dimensional representation (or low-rank approximation) is achieved by finding a combination of fewer dimensions that can effectively capture most of the information in the original data. Popular low-rank transformation techniques include principal component analysis and singular value decomposition.\u003C/p\u003E\u003Cp\u003EFor example, suppose \u003Cem\u003EΔW\u003C/em\u003E represents the parameter update for a weight matrix of the LLM with dimension \u003Cem\u003EℝA×B\u003C/em\u003E. We can decompose the weight update matrix into two smaller matrices: \u003Cem\u003EΔW=WAWB\u003C/em\u003E, where \u003Cem\u003EWA ∈ ℝA×h\u003C/em\u003E and \u003Cem\u003EWB ∈ ℝh×B\u003C/em\u003E. Here, we keep the original weight frozen and train only the new matrices WA and WB.\u003C/p\u003E\u003Cp\u003EHow is this method parameter efficient if we introduce new weight matrices? These new matrices can be very small. For example, if A=25 and B=50, then the size of \u003Cem\u003EΔW\u003C/em\u003E is \u003Cem\u003E25×50=1,250\u003C/em\u003E. If \u003Cem\u003Eh=5\u003C/em\u003E, then \u003Cem\u003EWA\u003C/em\u003E has 125 parameters, WB has 250 parameters, and the two matrices combined have only \u003Cem\u003E125+250=375\u003C/em\u003E parameters in total.\u003C/p\u003E\u003Cp\u003EAfter learning the weight update matrix, we can then write the matrix multiplication of a fully connected layer, as shown in this pseudocode:\u003C/p\u003E\u003Cpre\u003E\u003C/pre\u003E\u003Cp\u003E\u003Cem\u003EListing 2: Matrix multiplication with LoRA. You can find code examples illustrating LoRA \u003C/em\u003E\u003Ca href=\"https://github.com/rasbt/MachineLearning-QandAI-book/tree/main/supplementary/q18-using-llms/05_lora\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003Ehere\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E.\u003C/em\u003E\u003C/p\u003E\u003Cp\u003EIn Listing 2, scalar is a scaling factor that adjusts the magnitude of the combined result (original model output plus low-rank adaptation). This balances the pretrained model’s knowledge and the new task-specific adaptation. 
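As a minimal sketch, a LoRA-augmented fully connected layer following the ΔW = WAWB decomposition described above could look like this (the dimensions reuse the toy example, and all names are illustrative):\u003Cpre\u003E
# Sketch of a LoRA forward pass for one fully connected layer (illustrative only)
import torch

in_dim, out_dim, rank = 25, 50, 5          # A, B, and h from the example above
W = torch.randn(in_dim, out_dim)           # pretrained weight matrix, kept frozen
W_A = torch.nn.Parameter(torch.randn(in_dim, rank) * 0.01)   # trainable
W_B = torch.nn.Parameter(torch.zeros(rank, out_dim))         # trainable, initialized to zero
scalar = 0.5                               # scaling factor ('scalar' in Listing 2)

def lora_forward(x):
    # Frozen original output plus the scaled low-rank update x @ (W_A W_B).
    return x @ W + scalar * (x @ W_A @ W_B)
\u003C/pre\u003E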
According to the original paper introducing the LoRA method, models using LoRA perform slightly better than models using the adapter method across several task-specific benchmarks. Often, LoRA performs even better than models finetuned using the finetuning II method described earlier.\u003C/p\u003E\u003Ch3\u003EReinforcement Learning with Human Feedback\u003C/h3\u003E\u003Cp\u003EThe previous section focused on ways to make finetuning more efficient. Switching gears, how can we improve the modeling performance of LLMs via finetuning?\u003C/p\u003E\u003Cp\u003EThe conventional way to adapt or finetune an LLM for a new target domain or task is to use a supervised approach with labeled target data. For instance, the finetuning II approach allows us to adapt a pretrained LLM and finetune it on a target task such as sentiment classification, using a dataset that contains texts with sentiment labels like positive, neutral, and negative.\u003C/p\u003E\u003Cp\u003ESupervised finetuning is a foundational step in training an LLM. An additional, more advanced step is reinforcement learning with human feedback (RLHF), which can be used to further improve the model’s alignment with human preferences. For example, ChatGPT and its predecessor, InstructGPT, are two popular examples of pretrained LLMs (GPT-3) finetuned using RLHF.\u003C/p\u003E\u003Cp\u003EIn RLHF, a pretrained model is finetuned using a combination of supervised learning and reinforcement learning. This approach was popularized by the original ChatGPT model, which was in turn based on InstructGPT. Human feedback is collected by having humans rank or rate different model outputs, providing a reward signal. The collected reward labels can be used to train a reward model that is then used to guide the LLMs’ adaptation to human preferences. The reward model is learned via supervised learning, typically using a pretrained LLM as the base model, and is then used to adapt the pretrained LLM to human preferences via additional finetuning. The training in this additional finetuning stage uses a flavor of reinforcement learning called proximal policy optimization.\u003C/p\u003E\u003Cp\u003ERLHF uses a reward model instead of training the pretrained model on the human feedback directly because involving humans in the learning process would create a bottleneck since we cannot obtain feedback in real time.\u003C/p\u003E\u003Ch3\u003EAdapting Pretrained Language Models\u003C/h3\u003E\u003Cp\u003EWhile finetuning all layers of a pretrained LLM remains the gold standard for adaption to new target tasks, several efficient alternatives exist for leveraging pretrained transformers. For instance, we can effectively apply LLMs to new tasks while minimizing computational costs and resources by utilizing feature-based methods, in-context learning, or parameter-efficient finetuning techniques.\u003C/p\u003E\u003Cp\u003EThe three conventional methods—feature-based approach, finetuning I, and finetuning II—provide different computational efficiency and performance trade-offs. Parameter-efficient finetuning methods like soft prompt tuning, prefix tuning, and adapter methods further optimize the adaptation process, reducing the number of parameters to be updated. 
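To make the reward-modeling step of RLHF described above a bit more concrete, the reward model is commonly trained with a pairwise ranking loss on human-preferred versus rejected responses; the following is an illustrative sketch rather than the exact InstructGPT implementation:\u003Cpre\u003E
# Sketch of the pairwise ranking loss commonly used to train an RLHF reward model
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    # reward_chosen / reward_rejected: scores the reward model assigns to the
    # human-preferred and the rejected response for the same prompt
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# The loss shrinks as the chosen response is scored higher than the rejected one.
print(reward_model_loss(torch.tensor([1.2]), torch.tensor([0.3])))
\u003C/pre\u003E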
Meanwhile, RLHF presents an alternative approach to supervised finetuning, potentially improving modeling performance.\u003C/p\u003E\u003Cp\u003EIn sum, the versatility and efficiency of pretrained LLMs continue to advance, offering new opportunities and strategies for effectively adapting these models to a wide array of tasks and domains. As research in this area progresses, we can expect further improvements and innovations in using pretrained language models.\u003C/p\u003E\u003Ch3\u003EReferences and Further Reading\u003C/h3\u003E\u003Col\u003E\u003Cli\u003E\u003Cp\u003EThe paper introducing the GPT-2 model: Alec Radford \u003Cem\u003Eet al.\u003C/em\u003E, “Language Models Are Unsupervised Multitask Learners” (2019), \u003Ca href=\"https://www.semanticscholar.org/paper/Language-Models-are-Unsupervised-Multitask-Learners-Radford-Wu/9405cc0d6169988371b2755e573cc28650d14dfe\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehttps://www.semanticscholar.org/paper/Language-Models-are-Unsupervised-Multitask-Learners-Radford-Wu/9405cc0d6169988371b2755e573cc28650d14dfe\u003C/a\u003E.\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EThe paper introducing the GPT-3 model: Tom B. Brown \u003Cem\u003Eet al.\u003C/em\u003E, “Language Models Are Few-Shot Learners” (2020), \u003Ca href=\"https://arxiv.org/abs/2005.14165\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehttps://arxiv.org/abs/2005.14165\u003C/a\u003E.\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EThe automatic prompt engineering method, which proposes using another LLM for automatic prompt generation and evaluation: Yongchao Zhou \u003Cem\u003Eet al.\u003C/em\u003E, “Large Language Models Are Human-Level Prompt Engineers” (2023), \u003Ca href=\"https://arxiv.org/abs/2211.01910\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehttps://arxiv.org/abs/2211.01910\u003C/a\u003E.\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003ELlamaIndex is an example of an indexing approach that leverages in-context learning: \u003Ca href=\"https://github.com/jerryjliu/llama_index\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehttps://github.com/jerryjliu/llama_index\u003C/a\u003E.\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EDSPy is another popular open source library for LLM applications, such as retrieval augmentation and indexing: \u003Ca href=\"https://github.com/stanfordnlp/dspy\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehttps://github.com/stanfordnlp/dspy\u003C/a\u003E.\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EA first instance of soft prompting: Brian Lester, Rami Al-Rfou, and Noah Constant, “The Power of Scale for Parameter-Efficient Prompt Tuning” (2021), \u003Ca href=\"https://arxiv.org/abs/2104.08691\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehttps://arxiv.org/abs/2104.08691\u003C/a\u003E.\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EThe paper that first described prefix tuning: Xiang Lisa Li and Percy Liang, “Prefix-Tuning: Optimizing Continuous Prompts for Generation” (2021), \u003Ca href=\"https://arxiv.org/abs/2101.00190\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehttps://arxiv.org/abs/2101.00190\u003C/a\u003E.\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EThe paper introducing the original adapter method: Neil Houlsby \u003Cem\u003Eet al.\u003C/em\u003E, “Parameter-Efficient Transfer Learning for NLP” (2019), \u003Ca href=\"https://arxiv.org/abs/1902.00751\" target=\"_blank\" rel=\"nofollow 
noopener\"\u003Ehttps://arxiv.org/abs/1902.00751\u003C/a\u003E.\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EThe paper introducing the LoRA method: Edward J. Hu \u003Cem\u003Eet al.\u003C/em\u003E, “LoRA: Low-Rank Adaptation of Large Language Models” (2021), \u003Ca href=\"https://arxiv.org/abs/2106.09685\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehttps://arxiv.org/abs/2106.09685\u003C/a\u003E.\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EA survey of more than 40 research papers covering parameterefficient finetuning methods: Vladislav Lialin, Vijeta Deshpande, and Anna Rumshisky, “Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning” (2023), \u003Ca href=\"https://arxiv.org/abs/2303.15647\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehttps://arxiv.org/abs/2303.15647\u003C/a\u003E.\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EThe InstructGPT paper: Long Ouyang \u003Cem\u003Eet al.\u003C/em\u003E, “Training Language Models to Follow Instructions with Human Feedback” (2022), \u003Ca href=\"https://arxiv.org/abs/2203.02155\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehttps://arxiv.org/abs/2203.02155\u003C/a\u003E.\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EProximal policy optimization, which is used for reinforcement learning with human feedback: John Schulman \u003Cem\u003Eet al.\u003C/em\u003E, “Proximal Policy Optimization Algorithms” (2017), \u003Ca href=\"https://arxiv.org/abs/1707.06347\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehttps://arxiv.org/abs/1707.06347\u003C/a\u003E.\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EAn article by the author describing reinforcement learning with human feedback (RLHF) and its alternatives, \u003Ca href=\"https://magazine.sebastianraschka.com/p/llm-training-rlhf-and-its-alternatives\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ehttps://magazine.sebastianraschka.com/p/llm-training-rlhf-and-its-alternatives\u003C/a\u003E\u003C/p\u003E\u003C/li\u003E\u003C/ol\u003E\u003Ch3\u003ECode Examples\u003C/h3\u003E\u003Cul\u003E\u003Cli\u003E\u003Cp\u003E\u003Ca href=\"https://github.com/rasbt/MachineLearning-QandAI-book/tree/main/supplementary/q18-using-llms/01_classifier-finetuning\" target=\"_blank\" rel=\"nofollow noopener\"\u003EFeature-extractor vs. 
finetuning different output vs finetuning all layers\u003C/a\u003E\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003E\u003Ca href=\"https://github.com/rasbt/MachineLearning-QandAI-book/blob/main/supplementary/q18-using-llms/02_prompting\" target=\"_blank\" rel=\"nofollow noopener\"\u003EIn-context learning and prompt tuning\u003C/a\u003E\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003E\u003Ca href=\"https://github.com/rasbt/MachineLearning-QandAI-book/blob/main/supplementary/q18-using-llms/03_retrieval-augmented-generation\" target=\"_blank\" rel=\"nofollow noopener\"\u003EIndexing and retrieval augmented generation (RAG)\u003C/a\u003E\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003E\u003Ca href=\"https://github.com/rasbt/MachineLearning-QandAI-book/blob/main/supplementary/q18-using-llms/04_adapter\" target=\"_blank\" rel=\"nofollow noopener\"\u003EAdapter finetuning\u003C/a\u003E\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003E\u003Ca href=\"https://github.com/rasbt/MachineLearning-QandAI-book/tree/main/supplementary/q18-using-llms/05_lora\" target=\"_blank\" rel=\"nofollow noopener\"\u003ELow-rank adaptation (LoRA) finetuning\u003C/a\u003E\u003C/p\u003E\u003C/li\u003E\u003C/ul\u003E\u003Ch3\u003EExercises\u003C/h3\u003E\u003Col\u003E\u003Cli\u003E\u003Cp\u003EWhen does it make more sense to use in-context learning rather than finetuning, and vice versa?\u003C/p\u003E\u003C/li\u003E\u003Cli\u003E\u003Cp\u003EIn prefix tuning, adapters, and LoRA, how can we ensure that the model preserves (and does not forget) the original knowledge?\u003C/p\u003E\u003C/li\u003E\u003C/ol\u003E\u003Ch3\u003EMachine Learning Q and AI\u003C/h3\u003E\u003Cp\u003EI hope you liked this excerpt! If you are interested in 29 other topics related to machine learning and AI, you can find the book on the \u003Ca href=\"https://nostarch.com/machine-learning-and-ai-beyond-basics\" target=\"_blank\" rel=\"nofollow noopener\"\u003Epublisher’s website\u003C/a\u003E, \u003Ca href=\"https://www.amazon.com/Machine-Learning-AI-Beyond-Basics/dp/1718503768\" target=\"_blank\" rel=\"nofollow noopener\"\u003EAmazon\u003C/a\u003E, and many other book stores.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEIVT9nMzUuFw/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1713388981412?e=1749081600&v=beta&t=pR23XuKuXVjL-M3czvItZKMnybgImNs6DkrzRzOpoUU\" src=\"//:0\"\u003E\u003Cfigcaption\u003E\u003Ca href=\"https://www.amazon.com/Machine-Learning-AI-Essential-Questions/dp/1718503768\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003EMachine Learning Q and AI\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E covers 30 essential topics related to machine learning and AI.\u003C/em\u003E\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EAhead of AI is a personal passion project that does not offer direct compensation, and I’d appreciate it if you’d consider purchasing a copy of my book to support these endeavors.\u003C/p\u003E","url":"https://www.linkedin.com/pulse/using-finetuning-pretrained-transformers-sebastian-raschka-phd-08yff"},{"@type":"Article","author":{"@type":"Person","name":"Sebastian Raschka, PhD","url":"https://www.linkedin.com/in/sebastianraschka"},"dateModified":"","datePublished":"2023-10-08T10:55:01.000+00:00","headline":"Ahead of AI #12: LLM Businesses and 
Busyness","image":{"@type":"ImageObject","url":"https://media.licdn.com/dms/image/v2/D5612AQGtYjf4FHtrSQ/article-cover_image-shrink_720_1280/article-cover_image-shrink_720_1280/0/1696691152180?e=2147483647&v=beta&t=gY8slhdJ92S27SezJGaXYIsIfjG3xO5DjgFLLTkRZf4"},"articleBody":"\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EIn \u003Cem\u003EAhead of AI\u003C/em\u003E, I try to strike a balance between discussing recent research, explaining AI-related concepts, and delving into general AI-relevant news and developments. Given that the previous issues leaned heavily towards research, I aim to address the latest trends in this issue. \u003C/p\u003E\u003Cp\u003ESpecifically, I'll explore the current endeavors of major tech companies. It appears that every one of these entities is either training or developing LLMs, with a noticeable shift of their core operations towards AI -- thus the title \"LLM Businesses and Busyness.\"\u003C/p\u003E\u003Cp\u003EIt's worth noting that the initial segment of this article will focus on discussions related to these companies. None of this content is sponsored. It's 100% based on my thoughts and what I found interesting.\u003C/p\u003E\u003Cp\u003EFor those particularly interested in open-source and freely accessible LLMs, fear not. This edition will also spotlight some innovative LLMs that have garnered attention due to their small nature and exceptional benchmark results.\u003C/p\u003E\u003Cp\u003EGiven that LoRA is a favored technique in research for efficient finetuning, I will also delve into two new intriguing variants that were proposed last month. To conclude, I'll highlight the recent launches of some major open-source initiatives.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003E\u003Cstrong\u003EDisclaimer: \u003C/strong\u003ELast month, I announced that I would be moving this newsletter to Substack due to technical issues on the LinkedIn newsletter platform. After I made this announcement, the LinkedIn Newsletter Staff kindly reached out to me to discuss several improvements they've made to the platform. Since the last \"Ahead of AI\" issue on LinkedIn, the newsletter team has overhauled the interface and addressed several issues, including missing drafts and image resizing, among others. Consequently, I've decided to give the newsletter tool another chance and crosspost this month's Ahead of AI issue.\u003C/p\u003E\u003Chr\u003E\u003Cp\u003E\u003Cem\u003EAhead of AI\u003C/em\u003E focuses on the reader experience, not ads. If you've found value in this newsletter, consider supporting it through a \u003Ca href=\"https://magazine.sebastianraschka.com/subscribe?utm_medium=web&utm_source=magaziney-home-page\" target=\"_blank\" rel=\"nofollow noopener\"\u003Epaid subscription option\u003C/a\u003E on Substack. And I'd be truly grateful if you could share the newsletter with your colleagues on your social channels. 
\u003C/p\u003E\u003Chr\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003EMultimodal ChatGPT\u003C/h3\u003E\u003Cp\u003EOpenAI started rolling out GPT-4V to ChatGPT Pro users with a \u003Ca href=\"https://openai.com/blog/chatgpt-can-now-see-hear-and-speak\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ebold announcement\u003C/a\u003E that ChatGPT could now \"see, hear and speak.\" This is the multimodal ChatGPT version that I alluded to in previous issues.\u003C/p\u003E\u003Cp\u003EThe release is accompanied by a 166-page technical report from researchers at Microsoft, who illustrate and analyze the different use cases with hands-on examples: \u003Ca href=\"https://arxiv.org/abs/2309.17421\" target=\"_blank\" rel=\"nofollow noopener\"\u003E\u003Cem\u003EThe Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)\u003C/em\u003E\u003C/a\u003E\u003Cem\u003E.\u003C/em\u003E\u003C/p\u003E\u003Cp\u003EIn a nutshell, GPT-4V allows you to upload images and audio to the ChatGPT interface and ask questions about them. I tried one of my favorite use cases, extracting the LaTeX code for equations (very helpful when writing papers), and it seems to work quite well.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEEQZENO-4Nuw/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1696691305803?e=1748476800&v=beta&t=6pOusfsBDi0rNsBpSJcBgNAeugjG9Jg1fTrUMVJ6pQs\" src=\"//:0\"\u003E\u003Cfigcaption\u003EEquation to LaTeX conversion with GPT-4V\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EBy the way, if you are interested in the equation to LaTeX conversion, I recommend checking out \u003Ca href=\"https://mathpix.com/\" target=\"_blank\" rel=\"nofollow noopener\"\u003EMathpix\u003C/a\u003E, which has been doing the same thing more accurately for many years. I am a long-term subscriber since I use it almost daily for extracting equations from books and papers for my personal notes in Markdown. (I am not affiliated and have not been sponsored to say this.)\u003C/p\u003E\u003Cp\u003EBack to GPT-4V: while I currently don't have a use case that requires describing general pictures to me, I can see this being useful in certain automation workflows, brainstorming, and so forth. \u003C/p\u003E\u003Cp\u003EBut note that while this is impressive, Google's Bard has supported images, including the above LaTeX extraction capabilities, for a while now. While GPT-4V is a nice improvement to the ChatGPT interface, Google's less popular Bard chatbot was ahead of OpenAI this time (although, according to rumors, Google is using two separate models/API calls for this whereas OpenAI may be using a unified LLM).\u003C/p\u003E\u003Cp\u003EIf you are interested in outfitting open-source LLMs like Falcon or Llama 2 with multimodal capabilities, I recommend checking out the Llama-Adapter v2 method (despite its name, it's not just for Llama models), which I covered a few months ago in \u003Ca href=\"https://magazine.sebastianraschka.com/i/119566166/finetuning-multimodal-llms-with-llama-adapter-v\" target=\"_blank\" rel=\"nofollow noopener\"\u003EAhead of AI #8\u003C/a\u003E.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003EData Regains Its Title as the 'New Oil'\u003C/h3\u003E\u003Cp\u003EI usually try not to pay attention to ads, but I recently stumbled upon one on LinkedIn that emphasizes how everyone is going all-in on the development of custom and next-generation AI models. 
But as always, the bottleneck is the data (or the lack thereof). \u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQF8CQ-nQcWdqA/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1696691367136?e=1748476800&v=beta&t=g9k_LrAa3lISmyxXEif0RvhwyCzvPoqKEUJTDEo1jlc\" src=\"//:0\"\u003E\u003Cfigcaption\u003EData labeling is a bottleneck ... and a job, and a business.\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003ERemember the term \u003Cem\u003EBig Data\u003C/em\u003E that was commonly used >10 years ago? It would not surprise me if data becomes an even more hotly traded commodity in the upcoming months or years, similar to how oil has been in the past few decades, to use a worn-out analogy.\u003C/p\u003E\u003Cp\u003EData, or data usage rights, might indeed become scarcer (or at least costlier) in the future as more and more platforms are closing up their free API access. Examples of that we have seen earlier this year include Reddit and Twitter/X. The latter has also just \u003Ca href=\"https://techcrunch.com/2023/09/08/x-updates-its-terms-to-ban-crawling-and-scraping\" target=\"_blank\" rel=\"nofollow noopener\"\u003Eupdated its usage terms\u003C/a\u003E to prevent crawling and scraping last month.\u003C/p\u003E\u003Cp\u003EThis summer, major online publishers such as CNN, ABC, and the New York Times \u003Ca href=\"https://www.theguardian.com/technology/2023/aug/25/new-york-times-cnn-and-abc-block-openais-gptbot-web-crawler-from-scraping-content\" target=\"_blank\" rel=\"nofollow noopener\"\u003Etook steps to prevent OpenAI's crawlers\u003C/a\u003E from freely accessing their websites' content.\u003C/p\u003E\u003Cp\u003EAnd last week, Google introduced a new \u003Ca href=\"https://www.theverge.com/2023/9/28/23894779/google-ai-extended-training-data-toggle-bard-vertex\" target=\"_blank\" rel=\"nofollow noopener\"\u003Eflag for a website's robots.txt file\u003C/a\u003E, allowing publishers to opt out of having their articles used as AI training data.\u003C/p\u003E\u003Cp\u003ESay what you may about the impact of AI, but one undeniable effect has been its role in ushering in a restructuring of the internet. \u003C/p\u003E\u003Cp\u003EWhile companies are working on improved search capabilities and helpful chatbots, the content these tools are supposed to find for a user will be harder to access. One silver lining might be that this will all lead to hopefully stronger cybersecurity and authenticity verification systems. \u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003EDALL-E 3: The Next Iteration of Text-to-Image AI\u003C/h3\u003E\u003Cp\u003EIt's hard to believe that it's already 1 1/2 years since DALL-E 2 was launched and popularized text-to-image generative AI. 
Last month, OpenAI \u003Ca href=\"https://openai.com/dall-e-3\" target=\"_blank\" rel=\"nofollow noopener\"\u003Eannounced the new version DALL-E 3\u003C/a\u003E and made it available on the \u003Ca href=\"https://www.bing.com/images/create\" target=\"_blank\" rel=\"nofollow noopener\"\u003EBing Create website\u003C/a\u003E where OpenAI's DALL-E 3 model is offered.\u003C/p\u003E\u003Cp\u003EI must say that the original announcement was a bit misleading as it first looked like DALL-E 3 is capable of generating image annotations, as shown in the screenshot below:\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQGMf04j0qNEYw/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1696691396028?e=1748476800&v=beta&t=rifXfPe2GQWD76B0w3W3CzTl6-HstQLFhy1B20JYCjo\" src=\"//:0\"\u003E\u003Cfigcaption\u003EA figure that appeared in the \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EHowever, this annotation is just added by the author for illustration purposes.\u003C/p\u003E\u003Cp\u003EIn either case, I am not a heavy text-to-image AI user, but I must say that after playing around with different prompts on the Bing Create website, the results are quite stunning.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003ENvidia's French Branch Raided in Cloud-Computing Antitrust Probe\u003C/h3\u003E\u003Cp\u003EThe Wall Street Journal reported, \"Nvidia’s French Offices Are Raided in Cloud Inquiry.\" Further, the newspaper stated, \"Typically, such raids are hours-long exercises where officials arrive early in the morning, search the company's premises, seize physical and digital materials, and interview employees as they come in.\"\u003C/p\u003E\u003Cp\u003EWhile the French authorities have not disclosed the reason for the raid, it might be related to \u003Ca href=\"https://www.investing.com/news/stock-market-news/citi-says-nvidia-will-have-a-90-market-share-in-ai-chips-market-432SI-3122461\" target=\"_blank\" rel=\"nofollow noopener\"\u003ENvidia's 90% market share\u003C/a\u003E in the AI chip sector.\u003C/p\u003E\u003Cp\u003EHaving trained deep learning models on Nvidia hardware for over a decade, I believe its dominant market share is attributed to its superior software support through CUDA and cuDNN. While there are competent AMD GPUs, the software support is still seen as experimental, as mentioned by colleagues who've used AMD GPUs with deep learning frameworks. (Although, I recently read \u003Ca href=\"https://news.ycombinator.com/item?id=37793635\" target=\"_blank\" rel=\"nofollow noopener\"\u003Ereports from users who had good experiences with AMD GPUs\u003C/a\u003E \"after pulling some hair.\")\u003C/p\u003E\u003Cp\u003EAdditionally, Google's TPUs and Amazon's Trainium chips are, to the best of my knowledge, not currently available for purchase. Furthermore, TPU support via XLA is primarily optimized for Google's proprietary software, like TensorFlow and Jax, making it less versatile than CUDA. (For instance, I utilized CUDA for molecular dynamics simulations during my undergraduate studies even before deep learning became popular.)\u003C/p\u003E\u003Cp\u003EIn my opinion, Nvidia's large market share is partly owed to their more mature software support. 
Of course, it's a good thing for the consumer to have more alternatives, and as the saying goes, \"competition is good for business.\"\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003EOpenAI Explores Developing Its Own AI Chips\u003C/h3\u003E\u003Cp\u003EIn a surprising news article, Reuters reported that \u003Ca href=\"https://www.reuters.com/technology/chatgpt-owner-openai-is-exploring-making-its-own-ai-chips-sources-2023-10-06/\" target=\"_blank\" rel=\"nofollow noopener\"\u003EOpenAI is exploring the possibility of manufacturing its own AI chips\u003C/a\u003E.\u003C/p\u003E\u003Cp\u003EIn recent years, there's been a growing trend toward vertical integration, a business strategy where a company controls multiple stages of its production process. A notable example is Apple producing its own processors for the iPhone and Mac. Within the realm of AI, Google has developed TPUs and Amazon has created \u003Ca href=\"https://aws.amazon.com/machine-learning/trainium/\" target=\"_blank\" rel=\"nofollow noopener\"\u003ETrainium chips\u003C/a\u003E. \u003C/p\u003E\u003Cp\u003EThis is not discussed within the Reuters article above, but I find it interesting that currently, Microsoft stands out as the only major cloud provider that hasn't developed its own custom AI chips. Given Microsoft's substantial investment in OpenAI, it's possible that Microsoft Azure might benefit from custom AI inference and/or training chips in the future. As always, a significant challenge will be ensuring software support and seamless integration with major deep learning libraries like PyTorch.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch3\u003EMicrosoft to Unveil Data Center Chip for LLM Training\u003C/h3\u003E\u003Cp\u003EShortly after writing the previous section on OpenAI's chip plans and the speculation about Microsofts potential interest in catching up with Google and Amazon when it comes to developing custom AI chips for their data centers, a new article came out that mentions exactly that. According to The Information, \u003Ca href=\"https://www.theinformation.com/articles/microsoft-to-debut-ai-chip-next-month-that-could-cut-nvidia-gpu-costs\" target=\"_blank\" rel=\"nofollow noopener\"\u003EMicrosoft plans to unveil a chip designed for data centers that train and run LLMs in November\u003C/a\u003E.\u003C/p\u003E\u003Ch3\u003EAmazon Invests in Anthropic and Google Also Shows Interest\u003C/h3\u003E\u003Cp\u003E\u003Ca href=\"https://www.reuters.com/markets/deals/amazon-steps-up-ai-race-with-up-4-billion-deal-invest-anthropic-2023-09-25/\" target=\"_blank\" rel=\"nofollow noopener\"\u003EAmazon has invested $1.25 billion in Anthropic\u003C/a\u003E, the developer of a ChatGPT-like chatbot called Claude. \u003Ca href=\"https://www.semianalysis.com/p/amazon-anthropic-poison-pill-or-empire\" target=\"_blank\" rel=\"nofollow noopener\"\u003EThis investment allegedly constitutes approximately 23% of Anthropic's equity\u003C/a\u003E. There's also an option for an additional $4 billion investment at a later stage. 
Interestingly, the investment is not in cash but in cloud credits for model training using Amazon's in-house Trainium chips.\u003C/p\u003E\u003Cp\u003EThis also means that Amazon now also has access to a general-purpose chatbot, Claude, which they can offer to customers, following Google's Bard and Microsoft's association with ChatGPT.\u003C/p\u003E\u003Cp\u003EA few days after the announcement above, it was reported that \u003Ca href=\"https://www.theinformation.com/articles/openai-rival-anthropic-in-talks-to-raise-2-billion-from-google-others-as-ai-arms-race-accelerates\" target=\"_blank\" rel=\"nofollow noopener\"\u003EAnthropic in Talks to Raise $2 Billion From Google and Others Just Days After Amazon Investment\u003C/a\u003E, which is an interesting development.\u003C/p\u003E\u003Cp\u003E\u003C/p\u003E\u003Ch2\u003ENew Openly Available Large Language Models\u003C/h2\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EMany new LLMs are hitting the public benchmark leaderboards each week. The two most interesting and notable ones I've been toying around in the last couple of weeks have been phi-1.5 and Mistral, which I wanted to highlight in this newsletter.\u003C/p\u003E\u003Cp\u003EBoth phi-1.5 and Mistral are very capable yet relatively small (1.3 and 7 billion parameters versus GPT-3's 175 billion parameters) and come with openly available weights that can be used. On that note, there's an \u003Ca href=\"https://x.com/percyliang/status/1708560401754202621?s=20\" target=\"_blank\" rel=\"nofollow noopener\"\u003Einteresting and accurate quote\u003C/a\u003E by Percy Liang making the distinction between open LLMs and open-source LLMs:\u003C/p\u003E\u003Cp\u003E\u003Cem\u003E> \"Many 'open' language models only come with released weights. In software, this is analogous to releasing a binary without code (you wouldn't call this open-source). To get the full benefits of transparency, you need the training data. GPT-J, GPT-NeoX, BLOOM, RedPajama do this.\"\u003C/em\u003E\u003C/p\u003E\u003Cp\u003EIf you are interested in using or finetuning phi-1.5 or Mistral, both LLMs are available via the \u003Ca href=\"https://github.com/Lightning-AI/lit-gpt\" target=\"_blank\" rel=\"nofollow noopener\"\u003Eopen-source Lit-GPT repository\u003C/a\u003E, which I help maintain.\u003C/p\u003E\u003Ch3\u003EPhi-1.5\u003C/h3\u003E\u003Cp\u003EPhi-1.5 is a \"small\" 1.3 billion parameter LLM with an impressive performance for its size.\u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQHqk2_7nx53Pg/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1696691421103?e=1748476800&v=beta&t=zhmoCZwpTx5Uu6LK2RutbX7uBWPHmK8I59Z0NKx166M\" src=\"//:0\"\u003E\u003Cfigcaption\u003EAnnotated figures from the \u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EHow does this small model accomplish such a good performance? The secret ingredient seems to be the high-quality data. 
\u003C/p\u003E\u003Cp\u003EThe pretraining is based on the \u003Ca href=\"https://arxiv.org/abs/2306.11644\" target=\"_blank\" rel=\"nofollow noopener\"\u003ETextbooks Are All You Need\u003C/a\u003E approach that I briefly covered in the \u003Ca href=\"https://magazine.sebastianraschka.com/p/ai-research-highlights-in-3-sentences-738\" target=\"_blank\" rel=\"nofollow noopener\"\u003EAhead of AI research highlights\u003C/a\u003E earlier this summer, where researchers trained the original 1.3B phi model on a relatively small 6B token \"textbook quality\" dataset from the web plus 1B tokens of exercises synthesized via GPT-3.5. \u003C/p\u003E\u003Cp\u003EThe 1.3B phi-1.5 successor model was trained on a similar \"high-quality\" dataset, this time using ~20B tokens of textbook-quality synthetic text on top of 7B tokens of code from \u003Ca href=\"https://arxiv.org/abs/2211.15533\" target=\"_blank\" rel=\"nofollow noopener\"\u003EThe Stack\u003C/a\u003E, StackOverflow, and others. It's worth noting that phi-1.5 has only undergone pretraining, not instruction finetuning -- the latter could potentially improve its performance even further. \u003C/p\u003E\u003Cfigure\u003E\u003Cimg data-media-urn=\"\" data-li-src=\"https://media.licdn.com/dms/image/v2/D5612AQEemOE4lFwGtA/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1696691452505?e=1748476800&v=beta&t=icjL0ogCEJxryS_-crEtiGnyyRlhBALtiuyr_N2d23k\" src=\"//:0\"\u003E\u003Cfigcaption\u003EDataset and performance of phi-1, the phi-1.5 predecessor\u003C/figcaption\u003E\u003C/figure\u003E\u003Cp\u003E\u003C/p\u003E\u003Cp\u003EThe authors hypothesize that the model gains instruction-following capabilities without instruction finetuning, which is an interesting observation. However, phi-1.5 has been pretrained on a dataset that has been generated by instruction-prompting another LLM. So, in my opinion, it should not be totally surprising that the model gains certain instruction-following capabilities from the pretraining stage. (Supervised instruction finetuning is a next-word prediction task similar to pretraining, as I described in the \u003Ca href=\"https://magazine.sebastianraschka.com/p/llm-training-rlhf-and-its-alternatives\" target=\"_blank\" rel=\"nofollow noopener\"\u003Eprevious issue\u003C/a\u003E of Ahead of AI.)\u003C/p\u003E\u003Cp\u003EYou can read more about the phi-1.5 training details in \u003Ca href=\"https://arxiv.org/abs/2309.05463\" target=\"_blank\" rel=\"nofollow noopener\"\u003ETextbooks Are All You Need II: phi-1.5 technical report\u003C/a\u003E.\u003C/p\u003E\u003Cp\u003E\u003Ca href=\"https://twitter.com/suchenzang/status/1701615026648605095?s=20\" target=\"_blank\" rel=\"nofollow noopener\"\u003ESusan Zhang's\u003C/a\u003E posts on X started debates regarding the performance metrics of phi-1.5, speculating that the model may have unintentionally been trained using benchmark datasets. She showcases examples where phi-1.5 is very sensitive to formatting. For instance, it perfectly answers math questions formatted similarly to the benchmark datasets but starts hallucinating when the format slightly changes. 
You can read more about the phi-1.5 training details in [Textbooks Are All You Need II: phi-1.5 technical report](https://arxiv.org/abs/2309.05463).

[Susan Zhang's](https://twitter.com/suchenzang/status/1701615026648605095?s=20) posts on X started debates regarding the performance metrics of phi-1.5, speculating that the model may have unintentionally been trained on benchmark datasets. Zhang showcases examples where phi-1.5 is very sensitive to formatting. For instance, it perfectly answers math questions formatted similarly to the benchmark datasets but starts hallucinating when the format changes slightly. Zhang believes that this indicates that the model is merely memorizing the test dataset.

On that note, in the satirical [Pretraining on the Test Set Is All You Need](https://arxiv.org/abs/2309.08632) paper, the author trains a small 1M-parameter LLM that outperforms all other models, including the 1.3B phi-1.5 model. This is achieved by training the model on all downstream academic benchmarks. It appears to be a subtle criticism underlining how easily benchmarks can be "cheated," intentionally or unintentionally (due to data contamination).

The bottom line is that we currently don't have a good way of telling whether a model has been trained on benchmark or test data. This is especially true since most companies no longer share or disclose their training data, which is likely intended to prevent lawsuits (e.g., see "['Game of Thrones' creator and other authors sue ChatGPT-maker OpenAI for copyright infringement](https://apnews.com/article/openai-lawsuit-authors-grisham-george-rr-martin-37f9073ab67ab25b7e6b2975b2a63bfe)").

### Mistral

In its first public announcement, the new AI company Mistral AI shared its first openly available LLM. According to the company's benchmarks, the 7B Mistral model outperforms the larger 13B Llama 2 model on all benchmarks and even approaches CodeLlama 7B performance on code at the same time. Hence, it has been the talk of the (social media) town this past week.

I tried Mistral on a randomly selected subset of [Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) tasks (a sketch of how to run such an evaluation follows below). It does perform (almost suspiciously) well on arithmetic benchmarks. It does not, however, beat the 13B Llama 2 model on *all* tasks.

[Figure: a comparison of various base models (not finetuned models) on a random subset of Evaluation Harness tasks]

If you are interested in giving them a try, Mistral and phi-1.5 have also been added to the [Lit-GPT](https://github.com/Lightning-AI/lit-gpt) repository, which I regularly contribute to and which is part of the NeurIPS LLM efficiency challenge.
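Here is a rough sketch of the kind of evaluation run I mean. It assumes a recent version of EleutherAI's lm-evaluation-harness (installable via `pip install lm-eval`); the exact function signature, task names, and checkpoint identifier are assumptions and may differ across harness versions, so treat this as a hypothetical outline rather than the exact commands I used:

```python
# Hypothetical Evaluation Harness run for Mistral 7B (argument and task
# names are assumptions and may vary across harness versions).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mistral-7B-v0.1",
    tasks=["arithmetic_2da", "hellaswag", "winogrande"],
    batch_size=8,
)
print(results["results"])
```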
Now, some people have raised concerns similar to those about phi-1.5. That is, people are [discussing potential data contamination](https://www.reddit.com/r/LocalLLaMA/comments/16twtfn/llm_chatrp_comparisontest_mistral_7b_base_instruct/k2hr0yl/), i.e., that the model may have been trained on test data, which could explain some of the high benchmark scores. Overall, it does seem to be a very interesting and capable model, though.

Moreover, it also uses an interesting self-attention variant, sliding window attention, to save memory and improve computational throughput for faster training. (Sliding window attention was previously proposed in [Child et al. 2019](https://arxiv.org/abs/1904.10509) and [Beltagy et al. 2020](https://arxiv.org/abs/2004.05150).)

The sliding window attention mechanism is essentially a fixed-size attention block that allows a current token to attend to only a specific number of previous tokens (instead of all previous tokens), which is illustrated in the figure below and in the short code sketch at the end of this section.

[Figure: comparing regular attention to sliding window attention]

In this specific case, the attention block size is 4096 tokens, and the researchers trained the model with context sizes of up to 100k tokens.

To provide a more concrete example: in regular self-attention, the model at the 50,000th token can attend to all 49,999 previous tokens. In sliding window self-attention, the Mistral model can only attend to tokens 45,904 to 50,000.

Does this mean that the model can't access the whole context at once? Not quite. The Mistral developers argue that the model can still access all input tokens in a hierarchical, indirect way, as illustrated in the figure below.

[Figure: how sliding window attention can still pass information from earlier tokens indirectly across layers]

Since there are no research papers or ablation studies for GPT-like LLMs trained with sliding window attention, we don't know yet whether it positively or negatively affects modeling performance. However, it surely has a positive effect on computational throughput, and the Mistral team mentioned that it led to a 2x speed improvement.
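To make the mechanism concrete, below is a minimal sketch of a causal sliding window attention mask in PyTorch. This is my own simplified illustration of the general idea, not Mistral's actual implementation (which, among other things, also uses a rolling key-value cache):

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where attention is allowed: token i may attend to token j
    # if j <= i (causal) and j > i - window (within the sliding window).
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

mask = sliding_window_causal_mask(seq_len=8, window=4)
print(mask.int())  # each row has at most 4 allowed positions

# The boolean mask can be plugged into PyTorch's attention kernel, e.g.:
# torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```

With `window=4`, row i of the mask allows positions max(0, i-3) through i, which mirrors the 4096-token window described above at a toy scale.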
## New LoRA Variants

[LoRA](https://arxiv.org/abs/2106.09685) (short for "low-rank adaptation") is the single most popular method for finetuning LLMs efficiently, and researchers developed two new variants in recent weeks: [QA-LoRA](https://arxiv.org/abs/2309.14717) and [LongLoRA](https://arxiv.org/abs/2309.12307).

(In short, LoRA involves finetuning a pretrained model by adapting a low-rank projection of its parameters. Most readers are probably familiar with how LoRA works, but if there's interest, I can share a from-scratch implementation and detailed explanation in a future article. Please let me know in the comments.)
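In the meantime, here is a deliberately minimal from-scratch sketch of the core idea (my own simplified illustration, not the reference implementation from the LoRA paper): the pretrained weight matrix stays frozen, and only the two small low-rank matrices A and B are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 8, alpha: int = 16):
        super().__init__()
        # Frozen "pretrained" weight (in practice, loaded from the base model)
        self.linear = nn.Linear(in_features, out_features, bias=False)
        self.linear.weight.requires_grad = False
        # Trainable low-rank update, applied as (x A) B and scaled by alpha / rank
        self.A = nn.Parameter(torch.randn(in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, out_features))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) + self.scaling * (x @ self.A @ self.B)

layer = LoRALinear(512, 512, rank=8)
print(layer(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```

Because B is initialized to zeros, the layer initially behaves exactly like the frozen pretrained layer; during finetuning, only A and B (a small fraction of the parameters) receive gradients.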
### LongLoRA

In [LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models](https://arxiv.org/abs/2309.12307), researchers propose LongLoRA, an efficient method for finetuning LLMs that expands their context sizes without the usual high computational costs associated with longer contexts. The approach uses sparse local attention, allowing computation savings during finetuning. One can then still use regular, dense attention at inference time.

[Figure: annotated figures from the LongLoRA paper]

### QA-LoRA

[QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2309.14717) is essentially a small modification or improvement over QLoRA (quantized LoRA) that addresses its runtime cost: QLoRA saves GPU memory but increases the runtime because the quantized weights of the base model have to be dequantized in each forward pass.

QA-LoRA, in contrast to QLoRA, quantizes the LoRA (adapter) weights, avoiding a costly conversion of the quantized base model weights back into 16-bit when adding the adapter weights. This concept is summarized in the annotated figure below.

[Figure: annotated figures from the QA-LoRA paper, https://arxiv.org/abs/2309.14717]

A little nitpick: in Table 2, QA-LoRA is about 2x faster than QLoRA for finetuning. However, a much smaller number of parameters was used for the adapter weights. I believe it would have been interesting to use the same number of parameters for both when comparing their speeds.

## Open Source Highlights

Lastly, I want to close this article by highlighting some notable open-source releases that don't happen too often: Python 3.12, PyTorch 2.1, and XGBoost 2.0.

### Python 3.12

The newest version of Python comes with many performance improvements; for example, the asyncio standard library package is up to 75% faster, as mentioned in the official [What's New](https://docs.python.org/3/whatsnew/3.12.html) release notes.

[Python Weekly](https://www.pythonweekly.com/), my favorite Python newsletter, summarized some of the major changes as follows:

- More flexible f-string parsing (see the short example after this list)
- Support for the buffer protocol in Python code
- A new debugging/profiling API
- Support for isolated subinterpreters with separate Global Interpreter Locks
- Further improved error messages
- Support for the Linux perf profiler to report Python function names in traces
- Many large and small performance improvements
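For instance, the more flexible f-string parsing (PEP 701) means that quotes can now be reused inside replacement fields, which previously raised a SyntaxError:

```python
# Valid in Python 3.12+ thanks to the new f-string grammar (PEP 701);
# reusing double quotes inside the replacement field fails on 3.11 and earlier.
names = ["Ada", "Grace", "Linus"]
print(f"Hello {", ".join(names)}!")
```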
### PyTorch 2.1

Last year was a particularly big year for PyTorch, with its 2.0 release featuring torch.compile as a new option to convert dynamic PyTorch graphs into full or partial static graphs for performance boosts. This year, [the PyTorch team released PyTorch 2.1](https://pytorch.org/blog/pytorch-2-1/). As far as I can tell, no breaking changes were introduced, and the focus was more on performance enhancements.

[Figure: PyTorch 2.1 release]

A notable feature (although still in beta) is that torch.compile can now also compile NumPy operations by translating them into PyTorch-equivalent operations. This means that we can now GPU-accelerate NumPy code as well.
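Here is a rough sketch of what this looks like, following the pattern described in the PyTorch 2.1 announcement (this assumes PyTorch 2.1+ and an available CUDA device; for real workloads you would also want to benchmark it against plain NumPy):

```python
import numpy as np
import torch

@torch.compile
def numpy_fn(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    # Plain NumPy code; torch.compile traces it into PyTorch operations
    return np.sum(x * y, axis=-1)

x = np.random.randn(1024, 64).astype(np.float32)
y = np.random.randn(1024, 64).astype(np.float32)

# Running the compiled function under a CUDA device context executes
# the traced operations on the GPU.
with torch.device("cuda"):
    out = numpy_fn(x, y)
print(out.shape)  # (1024,)
```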
The associated domain libraries, such as torchvision, also got some attention. For example, the popular data augmentations [CutMix and MixUp have now finally been integrated](https://pytorch.org/vision/stable/auto_examples/transforms/plot_cutmix_mixup.html#sphx-glr-auto-examples-transforms-plot-cutmix-mixup-py).

### XGBoost 2.0

Yes, machine learning without deep neural networks and LLMs is still relevant! XGBoost is one of the most popular flavors and software implementations of gradient boosting, which is still considered [state-of-the-art for tabular datasets](https://arxiv.org/abs/2207.08815).

[Figure: XGBoost 2.0 release]

XGBoost has been around for almost 10 years and just came out with a big [XGBoost 2.0 release](https://github.com/dmlc/xgboost/releases/tag/v2.0.0) that features better memory efficiency, support for large datasets that don't fit into memory, multi-target trees, and more.

*Ahead of AI* focuses on the reader experience, not ads. If you've found value in this newsletter, consider supporting it through a [paid subscription option](https://magazine.sebastianraschka.com/subscribe?utm_medium=web&utm_source=magaziney-home-page) on Substack. And I'd be truly grateful if you could share the newsletter with your colleagues on your social channels.

---

## Ahead of AI #11: New Foundation Models

*Published August 26, 2023*

Dear readers,

The latest issue of Ahead of AI covers the recent and noteworthy developments around LLMs this summer: Llama 2, CodeLlama, and leaked GPT-4 details.

I've made the decision to pause this newsletter on LinkedIn. But don't worry, Ahead of AI isn't ending; I am just moving it to Substack. You can easily subscribe with any email.

I've been sharing content on both platforms for a year, but Substack offers a more efficient process and avoids the challenges I've experienced with LinkedIn's editor, like lost drafts. Prioritizing Substack will allow me to spend more time on content creation rather than wrestling with formatting.

You can read (and optionally subscribe to) Ahead of AI on Substack here: https://magazine.sebastianraschka.com/p/ahead-of-ai-11-new-foundation-models

Best regards,

Sebastian

---

## About

Madison, Wisconsin, United States · 155,688 followers

Sebastian collaborates with Fortune 500 companies on AI solutions and serves on the Open Source Board at the University of Wisconsin–Madison. He specializes in LLMs and the development of high-performance AI systems, with a deep focus on practical, code-driven implementations.

He is the author of the bestselling books Machine Learning with PyTorch and Scikit-Learn, Machine Learning Q and AI, and Build a Large Language Model (From Scratch).

For more information about specific projects, please see the resources below.

- Website: https://sebastianraschka.com
- Publications: https://scholar.google.com/citations?user=X4RCC0IAAAAJ&hl=en
- Open-source projects: https://github.com/rasbt

## Experience

- RAIR Lab, United States -- Founder | Principal AI & LLM Research Engineer (January 2025 - present): Working on AI research and LLM engineering with a practical and code-driven focus.
- University of Wisconsin-Madison -- Advisory Board Member (February 2023 - present): Advisory board member for the Open Source Program Office at the UW–Madison Data Science Institute.
- Lightning AI, New York City Metropolitan Area -- Senior Staff Research Engineer (October 2024 - February 2025): Implementing, training, and finetuning large language models (LLMs).

## Education

- Michigan State University (2012-2017). PhD thesis: Uncovering Hidden Patterns of Molecular Recognition -- use of statistical data mining approaches and development of new software protocols to uncover novel patterns of preferential interactions between proteins and their small-molecule binding partners (the Protein Recognition Index and Screenlamp data mining toolkits), for use in enhancing the design of proteins and small molecules for protein engineering and drug discovery.
- Heinrich-Heine-Universität Düsseldorf (2008-2012)
- Gymnasium Rheinkamp Europaschule Moers (1998-2007)

## Honors & Awards

- Best Paper Award at ICB 2018 (Semi-Adversarial Neural Networks paper)
- ACM Computing Reviews' Best of 2016
- Outstanding Graduate Student Award

## Languages

- German
- English

## Publications

For a complete list of publications, please see my Google Scholar profile: https://scholar.google.com/citations?user=X4RCC0IAAAAJ&hl=en
href="https://www.linkedin.com/learning/search?trk=public_profile_guest_nav_menu_learning" data-tracking-control-name="public_profile_guest_nav_menu_learning" data-tracking-will-navigate class="top-nav-link flex justify-center items-center min-h-[52px] hover:text-color-text visited:hover:text-color-text hover:no-underline w-8 flex-col mx-1 babybear:mx-0 text-color-text-secondary visited:text-color-text-secondary"> <icon class="top-nav-link__icon flex h-3 w-3 flex-shrink-0 justify-center " data-delayed-url="https://static.licdn.com/aero-v1/sc/h/8wykgzgbqy0t3fnkgborvz54u"> </icon> <span class="top-nav-link__label-text font-sans text-xs leading-regular text-center font-regular"> Learning </span> </a> </li> <li class> <a href="https://www.linkedin.com/jobs/jobs-in-singapore?trk=public_profile_guest_nav_menu_jobs" data-tracking-control-name="public_profile_guest_nav_menu_jobs" data-tracking-will-navigate class="top-nav-link flex justify-center items-center min-h-[52px] hover:text-color-text visited:hover:text-color-text hover:no-underline w-8 flex-col mx-1 babybear:mx-0 text-color-text-secondary visited:text-color-text-secondary"> <icon class="top-nav-link__icon flex h-3 w-3 flex-shrink-0 justify-center " data-delayed-url="https://static.licdn.com/aero-v1/sc/h/92eb1xekc34eklevj0io6x4ki"> </icon> <span class="top-nav-link__label-text font-sans text-xs leading-regular text-center font-regular"> Jobs </span> </a> </li> <li class> <a href="https://www.linkedin.com/games?trk=public_profile_guest_nav_menu_games" data-tracking-control-name="public_profile_guest_nav_menu_games" data-tracking-will-navigate class="top-nav-link flex justify-center items-center min-h-[52px] hover:text-color-text visited:hover:text-color-text hover:no-underline w-8 flex-col mx-1 babybear:mx-0 text-color-text-secondary visited:text-color-text-secondary"> <icon class="top-nav-link__icon flex h-3 w-3 flex-shrink-0 justify-center " data-delayed-url="https://static.licdn.com/aero-v1/sc/h/29h8hsjuomfp50lam5ipnc3uh"> </icon> <span class="top-nav-link__label-text font-sans text-xs leading-regular text-center font-regular"> Games </span> </a> </li> <li class> <a href="ms-windows-store://pdp/?ProductId=9WZDNCRFJ4Q7&mode=mini&cid=guest_nav_upsell&trk=public_profile_guest_nav_menu_windows" data-tracking-control-name="public_profile_guest_nav_menu_windows" data-tracking-will-navigate class="top-nav-link flex justify-center items-center min-h-[52px] hover:text-color-text visited:hover:text-color-text hover:no-underline w-[96px] px-1 border-solid border-l-1 border-r-1 babybear:border-r-0 border-color-border-faint flex-col mx-1 babybear:mx-0 text-color-text-secondary visited:text-color-text-secondary"> <icon class="top-nav-link__icon flex h-3 w-3 flex-shrink-0 justify-center " data-delayed-url="https://static.licdn.com/aero-v1/sc/h/admayac2rnonsqhz9v3rzwcyu"> </icon> <span class="top-nav-link__label-text font-sans text-xs leading-regular text-center font-regular"> Get the app </span> </a> </li> </ul> <div class="nav__cta-container order-3 flex gap-x-1 justify-end min-w-[100px] flex-nowrap flex-shrink-0 babybear:flex-wrap flex-2 babymamabear:min-w-[50px] "> <!----> <a class="nav__button-tertiary btn-md btn-tertiary" href="https://www.linkedin.com/signup/public-profile-join?vieweeVanityName=sebastianraschka&session_redirect=https%3A%2F%2Fwww.linkedin.com%2Fin%2Fsebastianraschka&trk=public_profile_nav-header-join" data-tracking-control-name="public_profile_nav-header-join" data-test-live-nav-primary-cta data-tracking-will-navigate 
data-tracking-client-ingraph> Join now </a> <a class="nav__button-secondary btn-secondary-emphasis btn-md" href="https://www.linkedin.com/login?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fin%2Fsebastianraschka&fromSignIn=true&trk=public_profile_nav-header-signin" data-tracking-control-name="public_profile_nav-header-signin" data-tracking-will-navigate data-tracking-client-ingraph> Sign in </a> <a aria-label="Sign in" class="nav__link-person papabear:hidden mamabear:hidden" data-tracking-control-name="public_profile_nav-header-signin" data-tracking-will-navigate href="https://www.linkedin.com/login?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fin%2Fsebastianraschka&fromSignIn=true&trk=public_profile_nav-header-signin"> <img class="inline-block relative rounded-[50%] w-4 h-4 bg-color-entity-ghost-background" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" alt> </a> </div> <!----> <!----> </nav> </header> <!----> <!----> <main class="main papabear:flex papabear:w-content-max-w papabear:mx-auto papabear:pt-desktop-content-top-margin mamabear:pt-desktop-content-top-margin " id="main-content" role="main"> <section class="core-rail mx-auto papabear:w-core-rail-width mamabear:max-w-[790px] babybear:max-w-[790px]"> <div class="details mx-details-container-padding"> <section class="profile"> <section class="top-card-layout container-lined overflow-hidden babybear:rounded-[0px]"> <figure class="cover-img min-h-[87px] papbear:min-h-[100px] rounded-t-[2px] babybear:rounded-[0px] -z-1"> <!----> <div class="cover-img__image-frame relative w-full overflow-hidden pb-[calc((134/782)*100%)]"> <div class="cover-img__image-position absolute top-0 right-0 bottom-0 left-0 "> <img class="cover-img__image relative w-full h-full object-cover" src="https://media.licdn.com/dms/image/v2/D5616AQFfrF_qxhNiOA/profile-displaybackgroundimage-shrink_200_800/B56ZWftPSvGQAU-/0/1742141191756?e=2147483647&v=beta&t=pOhuhaInlnGJXo313ngPIsR91gh-__rV3jxlh5o96Cg" fetchpriority="auto" data-embed-id="cover-image" alt tabindex="0"> </div> </div> <!----> </figure> <div class="top-card-layout__card relative p-2 papabear:p-details-container-padding"> <div class="top-card__profile-image-container top-card__profile-image-container--cvw-fix flex top-card-layout__entity-image-container flex" data-section="picture"> <button class="cursor-pointer" aria-label="Sebastian Raschka, PhD" data-modal="public_profile_logo_contextual-sign-in-info_modal" data-test-id="logo-button" type="button"> <div class="bg-white rounded-full h-[142px] mt-[-100px] p-[7px] w-[142px]"> <div class="top-card__entity-logo-gradient-ring rounded-full h-[134px] ml-[-3px] mt-[-3px] p-[6px] w-[134px]"> <img class="inline-block relative rounded-[50%] w-16 h-16 top-card-layout__entity-image top-card__profile-image top-card__profile-image--real-image top-card__entity-inner-ring onload top-card-layout__entity-image shadow-color-shadow shadow-[0_4px_12px] border-2 border-solid border-color-surface mt-[-70px] mb-[14px] papabear:border-4 papabear:mt-[-100px] papabear:mb-[18px]" data-delayed-url="https://media.licdn.com/dms/image/v2/C4E03AQHVyH-IfD1KbQ/profile-displayphoto-shrink_200_200/profile-displayphoto-shrink_200_200/0/1629877234109?e=2147483647&v=beta&t=OTOiJBYLX5LT7qC2dke5DvYyo_GnlKJRKC45xXeY_7o" data-ghost-classes="bg-color-entity-ghost-background" 
data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" alt="Sebastian Raschka, PhD"> </div> </div> </button> <div class="contextual-sign-in-modal top-card__logo-modal" data-impression-id="public_profile_logo_cta_contextual-sign-in-modal"> <!----> <div class> <!----> <div id="public_profile_logo_contextual-sign-in-info_modal" class="modal modal--contextual-sign-in" data-outlet="public_profile_logo_contextual-sign-in-info_modal"> <!----> <div class="modal__overlay flex items-center bg-color-background-scrim justify-center fixed bottom-0 left-0 right-0 top-0 opacity-0 invisible pointer-events-none z-[1000] transition-[opacity] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.17s] py-4 " aria-hidden="true"> <section aria-modal="true" role="dialog" aria-labelledby="public_profile_logo_contextual-sign-in-info_modal-modal-header" tabindex="-1" class="max-h-full modal__wrapper overflow-auto p-0 bg-color-surface max-w-[1128px] min-h-[160px] relative scale-[0.25] shadow-sm shadow-color-border-faint transition-[transform] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.33s] focus:outline-0 w-[1128px] mamabear:w-[744px] babybear:w-[360px] rounded-md"> <button class="modal__dismiss btn-tertiary h-[40px] w-[40px] p-0 rounded-full indent-0 contextual-sign-in-modal__modal-dismiss absolute right-0 m-[20px] cursor-pointer" aria-label="Dismiss" data-tracking-control-name="public_profile_logo_cta_contextual-sign-in-modal_modal_dismiss"> <icon class="contextual-sign-in-modal__modal-dismiss-icon" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/gs508lg3t2o81tq7pmcgn6m2"></icon> </button> <div class="modal__main w-full"> <div class="contextual-sign-in-modal__screen contextual-sign-in-modal__context-screen flex flex-col my-4 mx-3"> <img class="inline-block relative rounded-[50%] w-16 h-16 contextual-sign-in-modal__img m-auto" data-delayed-url="https://media.licdn.com/dms/image/v2/C4E03AQHVyH-IfD1KbQ/profile-displayphoto-shrink_200_200/profile-displayphoto-shrink_200_200/0/1629877234109?e=2147483647&v=beta&t=OTOiJBYLX5LT7qC2dke5DvYyo_GnlKJRKC45xXeY_7o" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" alt> <h2 class="contextual-sign-in-modal__context-screen-title font-sans text-xl text-color-text my-2 mx-4 text-center" id="public_profile_logo_contextual-sign-in-info_modal-modal-header"> Sign in to view Sebastian’s full profile </h2> <!----><!----> <div class="contextual-sign-in-modal__btn-container m-auto w-[320px] babybear:w-full"> <!----> <div class="w-full max-w-[400px] mx-auto"> <div class="google-auth-button"> <!----> <div class="google-auth-button__placeholder mx-auto " data-theme="filled_blue" data-logo-alignment="center" data-locale="en_US" role="button" aria-label="Continue with google"></div> <!----> </div> </div> <div class="sign-in-modal" data-impression-id="public_profile_logo_cta_contextual-sign-in-modal_sign-in-modal"> <button class="sign-in-modal__outlet-btn cursor-pointer btn-md btn-primary btn-secondary" data-tracking-client-ingraph data-tracking-control-name="public_profile_logo_cta_contextual-sign-in-modal_sign-in-modal_outlet-button" data-modal="public_profile_logo_contextual-sign-in-info_modal_sign-in-modal"> <!----> Sign in </button> <div class> <!----> <div id="public_profile_logo_contextual-sign-in-info_modal_sign-in-modal" class="modal modal--sign-in" data-outlet="public_profile_logo_contextual-sign-in-info_modal_sign-in-modal"> <!----> <div class="modal__overlay flex 
items-center bg-color-background-scrim justify-center fixed bottom-0 left-0 right-0 top-0 opacity-0 invisible pointer-events-none z-[1000] transition-[opacity] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.17s] py-4 " aria-hidden="true"> <section aria-modal="true" role="dialog" aria-labelledby="public_profile_logo_contextual-sign-in-info_modal_sign-in-modal-modal-header" tabindex="-1" class="max-h-full modal__wrapper overflow-auto p-0 bg-color-surface max-w-[1128px] min-h-[160px] relative scale-[0.25] shadow-sm shadow-color-border-faint transition-[transform] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.33s] focus:outline-0 w-[1128px] mamabear:w-[744px] babybear:w-[360px] rounded-md"> <button class="modal__dismiss btn-tertiary h-[40px] w-[40px] p-0 rounded-full indent-0 sign-in-modal__dismiss absolute right-0 cursor-pointer m-[20px]" aria-label="Dismiss" data-tracking-control-name="public_profile_logo_cta_contextual-sign-in-modal_sign-in-modal_dismiss"> <icon class="sign-in-modal__dismiss-icon" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/gs508lg3t2o81tq7pmcgn6m2"></icon> </button> <div class="modal__main w-full"> <div class="sign-in-modal__screen flex flex-col py-4 w-[513px] babybear:w-full px-3"> <h2 class="sign-in-modal__header font-sans text-display-md text-color-text "> Welcome back </h2> <code id="i18n_sign_in_form_show_text" style="display: none"><!--"Show"--></code> <code id="i18n_sign_in_form_show_label" style="display: none"><!--"Show your LinkedIn password"--></code> <code id="i18n_sign_in_form_hide_text" style="display: none"><!--"Hide"--></code> <code id="i18n_sign_in_form_hide_label" style="display: none"><!--"Hide your LinkedIn password"--></code> <code id="i18n_username_error_empty" style="display: none"><!--"Please enter an email address or phone number"--></code> <code id="i18n_username_error_too_long" style="display: none"><!--"Email or phone number must be between 3 to 128 characters"--></code> <code id="i18n_username_error_too_short" style="display: none"><!--"Email or phone number must be between 3 to 128 characters"--></code> <code id="i18n_password_error_empty" style="display: none"><!--"Please enter a password"--></code> <code id="i18n_password_error_too_short" style="display: none"><!--"The password you provided must have at least 6 characters"--></code> <code id="i18n_password_error_too_long" style="display: none"><!--"The password you provided must have at most 400 characters"--></code> <!----> <form data-id="sign-in-form" action="https://www.linkedin.com/uas/login-submit" method="post" novalidate class="mt-1.5 mb-2"> <input name="loginCsrfParam" value="bfb3b82d-2e82-4a0c-8244-5b7e6275c22d" type="hidden"> <div class="flex flex-col"> <div class="mt-1.5" data-js-module-id="guest-input"> <div class="flex flex-col"> <label class="input-label mb-1" for="public_profile_logo_contextual-sign-in-info_modal_sign-in-modal_session_key"> Email or phone </label> <div class="text-input flex"> <input class="text-color-text font-sans text-md outline-0 bg-color-transparent w-full" autocomplete="username" id="public_profile_logo_contextual-sign-in-info_modal_sign-in-modal_session_key" name="session_key" required data-tracking-control-name="public_profile_logo_cta_contextual-sign-in-modal_sign-in-modal_sign-in-session-key" data-tracking-client-ingraph type="text"> </div> </div> <p class="input-helper mt-1.5" for="public_profile_logo_contextual-sign-in-info_modal_sign-in-modal_session_key" role="alert" data-js-module-id="guest-input__message"></p> </div> <div 
class="mt-1.5" data-js-module-id="guest-input"> <div class="flex flex-col"> <label class="input-label mb-1" for="public_profile_logo_contextual-sign-in-info_modal_sign-in-modal_session_password"> Password </label> <div class="text-input flex"> <input class="text-color-text font-sans text-md outline-0 bg-color-transparent w-full" autocomplete="current-password" id="public_profile_logo_contextual-sign-in-info_modal_sign-in-modal_session_password" name="session_password" required data-tracking-control-name="public_profile_logo_cta_contextual-sign-in-modal_sign-in-modal_sign-in-password" data-tracking-client-ingraph type="password"> <button aria-live="assertive" aria-relevant="text" data-id="sign-in-form__password-visibility-toggle" class="font-sans text-md font-bold text-color-action z-10 ml-[12px] hover:cursor-pointer" aria-label="Show your LinkedIn password" data-tracking-control-name="public_profile_logo_cta_contextual-sign-in-modal_sign-in-modal_sign-in-password-visibility-toggle-btn" type="button">Show</button> </div> </div> <p class="input-helper mt-1.5" for="public_profile_logo_contextual-sign-in-info_modal_sign-in-modal_session_password" role="alert" data-js-module-id="guest-input__message"></p> </div> <input name="session_redirect" type="hidden"> <!----> </div> <div data-id="sign-in-form__footer" class="flex justify-between sign-in-form__footer--full-width"> <a data-id="sign-in-form__forgot-password" class="font-sans text-md font-bold link leading-regular sign-in-form__forgot-password--full-width" href="https://www.linkedin.com/uas/request-password-reset?trk=public_profile_logo_cta_contextual-sign-in-modal_sign-in-modal_forgot_password" data-tracking-control-name="public_profile_logo_cta_contextual-sign-in-modal_sign-in-modal_forgot_password" data-tracking-will-navigate>Forgot password?</a> <!----> <input name="trk" value="public_profile_logo_cta_contextual-sign-in-modal_sign-in-modal_sign-in-submit" type="hidden"> <button class="btn-md btn-primary flex-shrink-0 cursor-pointer sign-in-form__submit-btn--full-width" data-id="sign-in-form__submit-btn" data-tracking-control-name="public_profile_logo_cta_contextual-sign-in-modal_sign-in-modal_sign-in-submit-btn" data-tracking-client-ingraph data-tracking-litms type="submit"> Sign in </button> </div> <div class="sign-in-form__divider left-right-divider pt-2 pb-3"> <p class="sign-in-form__divider-text font-sans text-sm text-color-text px-2"> or </p> </div> </form> <div class="w-full max-w-[400px] mx-auto"> <div class="google-auth-button" data-tracking-control-name="public_profile_logo_cta_contextual-sign-in-modal_sign-in-modal_google-auth-button" data-tracking-client-ingraph> <p class="linkedin-tc__text text-color-text-low-emphasis text-xs pb-2" data-impression-id="public_profile_logo_cta_contextual-sign-in-modal_sign-in-modal__button-skip-tc-text"> By clicking Continue to join or sign in, you agree to LinkedIn’s <a href="/legal/user-agreement?trk=public_profile_logo_cta_contextual-sign-in-modal_sign-in-modal_auth-button_user-agreement" target="_blank" data-tracking-control-name="public_profile_logo_cta_contextual-sign-in-modal_sign-in-modal_auth-button_user-agreement" data-tracking-will-navigate="true">User Agreement</a>, <a href="/legal/privacy-policy?trk=public_profile_logo_cta_contextual-sign-in-modal_sign-in-modal_auth-button_privacy-policy" target="_blank" data-tracking-control-name="public_profile_logo_cta_contextual-sign-in-modal_sign-in-modal_auth-button_privacy-policy" data-tracking-will-navigate="true">Privacy Policy</a>, and <a 
href="/legal/cookie-policy?trk=public_profile_logo_cta_contextual-sign-in-modal_sign-in-modal_auth-button_cookie-policy" target="_blank" data-tracking-control-name="public_profile_logo_cta_contextual-sign-in-modal_sign-in-modal_auth-button_cookie-policy" data-tracking-will-navigate="true">Cookie Policy</a>. </p> <div class="google-auth-button__placeholder mx-auto google-auth-button__placeholder--black-border" data-theme="outline" data-logo-alignment="center" data-locale="en_US" role="button" aria-label="Continue with google" data-safe-to-skip-tnc-redirect></div> <!----> </div> </div> <!----> <p class="sign-in-modal__join-now m-auto font-sans text-md text-color-text mt-2"> New to LinkedIn? <a href="https://www.linkedin.com/signup/public-profile-join?vieweeVanityName=sebastianraschka&trk=public_profile_logo_cta_contextual-sign-in-modal_sign-in-modal_join-link" data-tracking-control-name="public_profile_logo_cta_contextual-sign-in-modal_sign-in-modal_join-link" data-tracking-will-navigate="true" class="sign-in-modal__join-link">Join now</a> </p> </div> </div> <!----> </section> </div> </div> </div> </div> <div class="contextual-sign-in-modal__divider left-right-divider"> <p class="contextual-sign-in-modal__divider-text font-sans text-sm text-color-text px-2"> or </p> </div> </div> <p class="contextual-sign-in-modal__join-now m-auto font-sans text-md text-color-text my-1"> New to LinkedIn? <a href="https://www.linkedin.com/signup/public-profile-join?vieweeVanityName=sebastianraschka&trk=public_profile_logo_cta_contextual-sign-in-modal_join-link" data-tracking-control-name="public_profile_logo_cta_contextual-sign-in-modal_join-link" data-tracking-will-navigate="true" class="contextual-sign-in-modal__join-link">Join now</a> </p> <p class="linkedin-tc__text text-color-text-low-emphasis text-xs pb-2 contextual-sign-in-modal__terms-and-conditions m-auto w-[320px] pt-2 babybear:w-full" data-impression-id="linkedin-tc__button-skip-tc-text"> By clicking Continue to join or sign in, you agree to LinkedIn’s <a href="/legal/user-agreement?trk=linkedin-tc_auth-button_user-agreement" target="_blank" data-tracking-control-name="linkedin-tc_auth-button_user-agreement" data-tracking-will-navigate="true">User Agreement</a>, <a href="/legal/privacy-policy?trk=linkedin-tc_auth-button_privacy-policy" target="_blank" data-tracking-control-name="linkedin-tc_auth-button_privacy-policy" data-tracking-will-navigate="true">Privacy Policy</a>, and <a href="/legal/cookie-policy?trk=linkedin-tc_auth-button_cookie-policy" target="_blank" data-tracking-control-name="linkedin-tc_auth-button_cookie-policy" data-tracking-will-navigate="true">Cookie Policy</a>. 
</p> </div> </div> <!----> </section> </div> </div> </div> </div> <!----> </div> <div class="top-card-layout__entity-info-container flex flex-wrap papabear:flex-nowrap"> <div class="top-card-layout__entity-info flex-grow flex-shrink-0 basis-0 babybear:flex-none babybear:w-full babybear:flex-none babybear:w-full"> <button class="cursor-pointer hover:bg-color-background-none-tint-hover" data-modal="public_profile_top-card_title-modal-id" data-tracking-control-name="public_profile_top-card_title-modal-outlet-button"> <h1 class="top-card-layout__title font-sans text-lg papabear:text-xl font-bold leading-open text-color-text mb-0"> Sebastian Raschka, PhD </h1> </button> <img data-delayed-url="https://static.licdn.com/aero-v1/sc/h/5ufqwtxp4bgp1054vtveqepqe" alt="Sebastian Raschka, PhD is an influencer" class="top-card__influencer-icon ml-1"> <!----> <div class="contextual-sign-in-modal top-card-layout__title-modal" data-impression-id="public_profile_top-card_title-modal_contextual-sign-in-modal"> <!----> <div class> <!----> <div id="public_profile_top-card_title-modal-id" class="modal modal--contextual-sign-in" data-outlet="public_profile_top-card_title-modal-id"> <!----> <div class="modal__overlay flex items-center bg-color-background-scrim justify-center fixed bottom-0 left-0 right-0 top-0 opacity-0 invisible pointer-events-none z-[1000] transition-[opacity] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.17s] py-4 " aria-hidden="true"> <section aria-modal="true" role="dialog" aria-labelledby="public_profile_top-card_title-modal-id-modal-header" tabindex="-1" class="max-h-full modal__wrapper overflow-auto p-0 bg-color-surface max-w-[1128px] min-h-[160px] relative scale-[0.25] shadow-sm shadow-color-border-faint transition-[transform] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.33s] focus:outline-0 w-[1128px] mamabear:w-[744px] babybear:w-[360px] rounded-md"> <button class="modal__dismiss btn-tertiary h-[40px] w-[40px] p-0 rounded-full indent-0 contextual-sign-in-modal__modal-dismiss absolute right-0 m-[20px] cursor-pointer" aria-label="Dismiss" data-tracking-control-name="public_profile_top-card_title-modal_contextual-sign-in-modal_modal_dismiss"> <icon class="contextual-sign-in-modal__modal-dismiss-icon" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/gs508lg3t2o81tq7pmcgn6m2"></icon> </button> <div class="modal__main w-full"> <div class="contextual-sign-in-modal__screen contextual-sign-in-modal__context-screen flex flex-col my-4 mx-3"> <img class="inline-block relative rounded-[50%] w-16 h-16 contextual-sign-in-modal__img m-auto" data-delayed-url="https://media.licdn.com/dms/image/v2/C4E03AQHVyH-IfD1KbQ/profile-displayphoto-shrink_200_200/profile-displayphoto-shrink_200_200/0/1629877234109?e=2147483647&v=beta&t=OTOiJBYLX5LT7qC2dke5DvYyo_GnlKJRKC45xXeY_7o" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" alt> <h2 class="contextual-sign-in-modal__context-screen-title font-sans text-xl text-color-text my-2 mx-4 text-center" id="public_profile_top-card_title-modal-id-modal-header"> Sign in to view Sebastian’s full profile </h2> <!----><!----> <div class="contextual-sign-in-modal__btn-container m-auto w-[320px] babybear:w-full"> <!----> <div class="w-full max-w-[400px] mx-auto"> <div class="google-auth-button"> <!----> <div class="google-auth-button__placeholder mx-auto " data-theme="filled_blue" data-logo-alignment="center" data-locale="en_US" role="button" aria-label="Continue with 
google"></div> <!----> </div> </div> <div class="sign-in-modal" data-impression-id="public_profile_top-card_title-modal_contextual-sign-in-modal_sign-in-modal"> <button class="sign-in-modal__outlet-btn cursor-pointer btn-md btn-primary btn-secondary" data-tracking-client-ingraph data-tracking-control-name="public_profile_top-card_title-modal_contextual-sign-in-modal_sign-in-modal_outlet-button" data-modal="public_profile_top-card_title-modal-id_sign-in-modal"> <!----> Sign in </button> <div class> <!----> <div id="public_profile_top-card_title-modal-id_sign-in-modal" class="modal modal--sign-in" data-outlet="public_profile_top-card_title-modal-id_sign-in-modal"> <!----> <div class="modal__overlay flex items-center bg-color-background-scrim justify-center fixed bottom-0 left-0 right-0 top-0 opacity-0 invisible pointer-events-none z-[1000] transition-[opacity] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.17s] py-4 " aria-hidden="true"> <section aria-modal="true" role="dialog" aria-labelledby="public_profile_top-card_title-modal-id_sign-in-modal-modal-header" tabindex="-1" class="max-h-full modal__wrapper overflow-auto p-0 bg-color-surface max-w-[1128px] min-h-[160px] relative scale-[0.25] shadow-sm shadow-color-border-faint transition-[transform] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.33s] focus:outline-0 w-[1128px] mamabear:w-[744px] babybear:w-[360px] rounded-md"> <button class="modal__dismiss btn-tertiary h-[40px] w-[40px] p-0 rounded-full indent-0 sign-in-modal__dismiss absolute right-0 cursor-pointer m-[20px]" aria-label="Dismiss" data-tracking-control-name="public_profile_top-card_title-modal_contextual-sign-in-modal_sign-in-modal_dismiss"> <icon class="sign-in-modal__dismiss-icon" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/gs508lg3t2o81tq7pmcgn6m2"></icon> </button> <div class="modal__main w-full"> <div class="sign-in-modal__screen flex flex-col py-4 w-[513px] babybear:w-full px-3"> <h2 class="sign-in-modal__header font-sans text-display-md text-color-text "> Welcome back </h2> <code id="i18n_sign_in_form_show_text" style="display: none"><!--"Show"--></code> <code id="i18n_sign_in_form_show_label" style="display: none"><!--"Show your LinkedIn password"--></code> <code id="i18n_sign_in_form_hide_text" style="display: none"><!--"Hide"--></code> <code id="i18n_sign_in_form_hide_label" style="display: none"><!--"Hide your LinkedIn password"--></code> <code id="i18n_username_error_empty" style="display: none"><!--"Please enter an email address or phone number"--></code> <code id="i18n_username_error_too_long" style="display: none"><!--"Email or phone number must be between 3 to 128 characters"--></code> <code id="i18n_username_error_too_short" style="display: none"><!--"Email or phone number must be between 3 to 128 characters"--></code> <code id="i18n_password_error_empty" style="display: none"><!--"Please enter a password"--></code> <code id="i18n_password_error_too_short" style="display: none"><!--"The password you provided must have at least 6 characters"--></code> <code id="i18n_password_error_too_long" style="display: none"><!--"The password you provided must have at most 400 characters"--></code> <!----> <form data-id="sign-in-form" action="https://www.linkedin.com/uas/login-submit" method="post" novalidate class="mt-1.5 mb-2"> <input name="loginCsrfParam" value="bfb3b82d-2e82-4a0c-8244-5b7e6275c22d" type="hidden"> <div class="flex flex-col"> <div class="mt-1.5" data-js-module-id="guest-input"> <div class="flex flex-col"> <label class="input-label 
mb-1" for="public_profile_top-card_title-modal-id_sign-in-modal_session_key"> Email or phone </label> <div class="text-input flex"> <input class="text-color-text font-sans text-md outline-0 bg-color-transparent w-full" autocomplete="username" id="public_profile_top-card_title-modal-id_sign-in-modal_session_key" name="session_key" required data-tracking-control-name="public_profile_top-card_title-modal_contextual-sign-in-modal_sign-in-modal_sign-in-session-key" data-tracking-client-ingraph type="text"> </div> </div> <p class="input-helper mt-1.5" for="public_profile_top-card_title-modal-id_sign-in-modal_session_key" role="alert" data-js-module-id="guest-input__message"></p> </div> <div class="mt-1.5" data-js-module-id="guest-input"> <div class="flex flex-col"> <label class="input-label mb-1" for="public_profile_top-card_title-modal-id_sign-in-modal_session_password"> Password </label> <div class="text-input flex"> <input class="text-color-text font-sans text-md outline-0 bg-color-transparent w-full" autocomplete="current-password" id="public_profile_top-card_title-modal-id_sign-in-modal_session_password" name="session_password" required data-tracking-control-name="public_profile_top-card_title-modal_contextual-sign-in-modal_sign-in-modal_sign-in-password" data-tracking-client-ingraph type="password"> <button aria-live="assertive" aria-relevant="text" data-id="sign-in-form__password-visibility-toggle" class="font-sans text-md font-bold text-color-action z-10 ml-[12px] hover:cursor-pointer" aria-label="Show your LinkedIn password" data-tracking-control-name="public_profile_top-card_title-modal_contextual-sign-in-modal_sign-in-modal_sign-in-password-visibility-toggle-btn" type="button">Show</button> </div> </div> <p class="input-helper mt-1.5" for="public_profile_top-card_title-modal-id_sign-in-modal_session_password" role="alert" data-js-module-id="guest-input__message"></p> </div> <input name="session_redirect" value="https://www.linkedin.com/in/sebastianraschka" type="hidden"> <!----> </div> <div data-id="sign-in-form__footer" class="flex justify-between sign-in-form__footer--full-width"> <a data-id="sign-in-form__forgot-password" class="font-sans text-md font-bold link leading-regular sign-in-form__forgot-password--full-width" href="https://www.linkedin.com/uas/request-password-reset?trk=public_profile_top-card_title-modal_contextual-sign-in-modal_sign-in-modal_forgot_password" data-tracking-control-name="public_profile_top-card_title-modal_contextual-sign-in-modal_sign-in-modal_forgot_password" data-tracking-will-navigate>Forgot password?</a> <!----> <input name="trk" value="public_profile_top-card_title-modal_contextual-sign-in-modal_sign-in-modal_sign-in-submit" type="hidden"> <button class="btn-md btn-primary flex-shrink-0 cursor-pointer sign-in-form__submit-btn--full-width" data-id="sign-in-form__submit-btn" data-tracking-control-name="public_profile_top-card_title-modal_contextual-sign-in-modal_sign-in-modal_sign-in-submit-btn" data-tracking-client-ingraph data-tracking-litms type="submit"> Sign in </button> </div> <div class="sign-in-form__divider left-right-divider pt-2 pb-3"> <p class="sign-in-form__divider-text font-sans text-sm text-color-text px-2"> or </p> </div> </form> <div class="w-full max-w-[400px] mx-auto"> <div class="google-auth-button" data-tracking-control-name="public_profile_top-card_title-modal_contextual-sign-in-modal_sign-in-modal_google-auth-button" data-tracking-client-ingraph> <p class="linkedin-tc__text text-color-text-low-emphasis text-xs pb-2" 
data-impression-id="public_profile_top-card_title-modal_contextual-sign-in-modal_sign-in-modal__button-skip-tc-text"> By clicking Continue to join or sign in, you agree to LinkedIn’s <a href="/legal/user-agreement?trk=public_profile_top-card_title-modal_contextual-sign-in-modal_sign-in-modal_auth-button_user-agreement" target="_blank" data-tracking-control-name="public_profile_top-card_title-modal_contextual-sign-in-modal_sign-in-modal_auth-button_user-agreement" data-tracking-will-navigate="true">User Agreement</a>, <a href="/legal/privacy-policy?trk=public_profile_top-card_title-modal_contextual-sign-in-modal_sign-in-modal_auth-button_privacy-policy" target="_blank" data-tracking-control-name="public_profile_top-card_title-modal_contextual-sign-in-modal_sign-in-modal_auth-button_privacy-policy" data-tracking-will-navigate="true">Privacy Policy</a>, and <a href="/legal/cookie-policy?trk=public_profile_top-card_title-modal_contextual-sign-in-modal_sign-in-modal_auth-button_cookie-policy" target="_blank" data-tracking-control-name="public_profile_top-card_title-modal_contextual-sign-in-modal_sign-in-modal_auth-button_cookie-policy" data-tracking-will-navigate="true">Cookie Policy</a>. </p> <div class="google-auth-button__placeholder mx-auto google-auth-button__placeholder--black-border" data-theme="outline" data-logo-alignment="center" data-locale="en_US" role="button" aria-label="Continue with google" data-safe-to-skip-tnc-redirect></div> <!----> </div> </div> <!----> <p class="sign-in-modal__join-now m-auto font-sans text-md text-color-text mt-2"> New to LinkedIn? <a href="https://www.linkedin.com/signup/public-profile-join?vieweeVanityName=sebastianraschka&trk=public_profile_top-card_title-modal_contextual-sign-in-modal_sign-in-modal_join-link" data-tracking-control-name="public_profile_top-card_title-modal_contextual-sign-in-modal_sign-in-modal_join-link" data-tracking-will-navigate="true" class="sign-in-modal__join-link">Join now</a> </p> </div> </div> <!----> </section> </div> </div> </div> </div> <div class="contextual-sign-in-modal__divider left-right-divider"> <p class="contextual-sign-in-modal__divider-text font-sans text-sm text-color-text px-2"> or </p> </div> </div> <p class="contextual-sign-in-modal__join-now m-auto font-sans text-md text-color-text my-1"> New to LinkedIn? 
<a href="https://www.linkedin.com/signup/public-profile-join?vieweeVanityName=sebastianraschka&trk=public_profile_top-card_title-modal_contextual-sign-in-modal_join-link" data-tracking-control-name="public_profile_top-card_title-modal_contextual-sign-in-modal_join-link" data-tracking-will-navigate="true" class="contextual-sign-in-modal__join-link">Join now</a> </p> <p class="linkedin-tc__text text-color-text-low-emphasis text-xs pb-2 contextual-sign-in-modal__terms-and-conditions m-auto w-[320px] pt-2 babybear:w-full" data-impression-id="linkedin-tc__button-skip-tc-text"> By clicking Continue to join or sign in, you agree to LinkedIn’s <a href="/legal/user-agreement?trk=linkedin-tc_auth-button_user-agreement" target="_blank" data-tracking-control-name="linkedin-tc_auth-button_user-agreement" data-tracking-will-navigate="true">User Agreement</a>, <a href="/legal/privacy-policy?trk=linkedin-tc_auth-button_privacy-policy" target="_blank" data-tracking-control-name="linkedin-tc_auth-button_privacy-policy" data-tracking-will-navigate="true">Privacy Policy</a>, and <a href="/legal/cookie-policy?trk=linkedin-tc_auth-button_cookie-policy" target="_blank" data-tracking-control-name="linkedin-tc_auth-button_cookie-policy" data-tracking-will-navigate="true">Cookie Policy</a>. </p> </div> </div> <!----> </section> </div> </div> </div> </div> <h2 class="top-card-layout__headline break-words font-sans text-md leading-open text-color-text"> ML/AI research engineer. Author of Build a Large Language Model From Scratch (amzn.to/4fqvn0D) and Ahead of AI (magazine.sebastianraschka.com), on how LLMs work and the latest developments in the field. </h2> <h3 class="top-card-layout__first-subline font-sans text-md leading-open text-color-text-low-emphasis"> <div class="profile-info-subheader"> <span>Madison, Wisconsin, United States</span> <span class="before:middot"></span> <button class="link cursor-pointer text-md focus:outline focus:outline-2 focus:outline-color-action" data-tracking-client-ingraph data-tracking-control-name="public_profile_profile-info-subheader_contact-info_modal-trigger" data-modal="public_profile_profile-info-subheader__contact-info_modal" data-no-cool-off="true" tabindex="0"> Contact Info </button> <div class="contextual-sign-in-modal contact-info-modal" data-impression-id="public_profile_profile-info-subheader_contact-info_modal_contextual-sign-in-modal"> <!----> <div class> <!----> <div id="public_profile_profile-info-subheader__contact-info_modal" class="modal modal--contextual-sign-in" data-outlet="public_profile_profile-info-subheader__contact-info_modal"> <!----> <div class="modal__overlay flex items-center bg-color-background-scrim justify-center fixed bottom-0 left-0 right-0 top-0 opacity-0 invisible pointer-events-none z-[1000] transition-[opacity] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.17s] py-4 " aria-hidden="true"> <section aria-modal="true" role="dialog" aria-labelledby="public_profile_profile-info-subheader__contact-info_modal-modal-header" tabindex="-1" class="max-h-full modal__wrapper overflow-auto p-0 bg-color-surface max-w-[1128px] min-h-[160px] relative scale-[0.25] shadow-sm shadow-color-border-faint transition-[transform] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.33s] focus:outline-0 w-[1128px] mamabear:w-[744px] babybear:w-[360px] rounded-md"> <button class="modal__dismiss btn-tertiary h-[40px] w-[40px] p-0 rounded-full indent-0 contextual-sign-in-modal__modal-dismiss absolute right-0 m-[20px] cursor-pointer" aria-label="Dismiss" 
data-tracking-control-name="public_profile_profile-info-subheader_contact-info_modal_contextual-sign-in-modal_modal_dismiss"> <icon class="contextual-sign-in-modal__modal-dismiss-icon" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/gs508lg3t2o81tq7pmcgn6m2"></icon> </button> <div class="modal__main w-full"> <div class="contextual-sign-in-modal__screen contextual-sign-in-modal__context-screen flex flex-col my-4 mx-3"> <img class="inline-block relative rounded-[50%] w-16 h-16 contextual-sign-in-modal__img m-auto" data-delayed-url="https://media.licdn.com/dms/image/v2/C4E03AQHVyH-IfD1KbQ/profile-displayphoto-shrink_200_200/profile-displayphoto-shrink_200_200/0/1629877234109?e=2147483647&v=beta&t=OTOiJBYLX5LT7qC2dke5DvYyo_GnlKJRKC45xXeY_7o" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" alt> <h2 class="contextual-sign-in-modal__context-screen-title font-sans text-xl text-color-text my-2 mx-4 text-center" id="public_profile_profile-info-subheader__contact-info_modal-modal-header"> Sign in to view Sebastian’s full profile </h2> <!----><!----> <div class="contextual-sign-in-modal__btn-container m-auto w-[320px] babybear:w-full"> <!----> <div class="w-full max-w-[400px] mx-auto"> <div class="google-auth-button"> <!----> <div class="google-auth-button__placeholder mx-auto " data-theme="filled_blue" data-logo-alignment="center" data-locale="en_US" role="button" aria-label="Continue with google"></div> <!----> </div> </div> <div class="sign-in-modal" data-impression-id="public_profile_profile-info-subheader_contact-info_modal_contextual-sign-in-modal_sign-in-modal"> <button class="sign-in-modal__outlet-btn cursor-pointer btn-md btn-primary btn-secondary" data-tracking-client-ingraph data-tracking-control-name="public_profile_profile-info-subheader_contact-info_modal_contextual-sign-in-modal_sign-in-modal_outlet-button" data-modal="public_profile_profile-info-subheader_sign-in-modal"> <!----> Sign in </button> <div class> <!----> <div id="public_profile_profile-info-subheader_sign-in-modal" class="modal modal--sign-in" data-outlet="public_profile_profile-info-subheader_sign-in-modal"> <!----> <div class="modal__overlay flex items-center bg-color-background-scrim justify-center fixed bottom-0 left-0 right-0 top-0 opacity-0 invisible pointer-events-none z-[1000] transition-[opacity] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.17s] py-4 " aria-hidden="true"> <section aria-modal="true" role="dialog" aria-labelledby="public_profile_profile-info-subheader_sign-in-modal-modal-header" tabindex="-1" class="max-h-full modal__wrapper overflow-auto p-0 bg-color-surface max-w-[1128px] min-h-[160px] relative scale-[0.25] shadow-sm shadow-color-border-faint transition-[transform] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.33s] focus:outline-0 w-[1128px] mamabear:w-[744px] babybear:w-[360px] rounded-md"> <button class="modal__dismiss btn-tertiary h-[40px] w-[40px] p-0 rounded-full indent-0 sign-in-modal__dismiss absolute right-0 cursor-pointer m-[20px]" aria-label="Dismiss" data-tracking-control-name="public_profile_profile-info-subheader_contact-info_modal_contextual-sign-in-modal_sign-in-modal_dismiss"> <icon class="sign-in-modal__dismiss-icon" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/gs508lg3t2o81tq7pmcgn6m2"></icon> </button> <div class="modal__main w-full"> <div class="sign-in-modal__screen flex flex-col py-4 w-[513px] babybear:w-full px-3"> <h2 class="sign-in-modal__header font-sans 
text-display-md text-color-text "> Welcome back </h2> <code id="i18n_sign_in_form_show_text" style="display: none"><!--"Show"--></code> <code id="i18n_sign_in_form_show_label" style="display: none"><!--"Show your LinkedIn password"--></code> <code id="i18n_sign_in_form_hide_text" style="display: none"><!--"Hide"--></code> <code id="i18n_sign_in_form_hide_label" style="display: none"><!--"Hide your LinkedIn password"--></code> <code id="i18n_username_error_empty" style="display: none"><!--"Please enter an email address or phone number"--></code> <code id="i18n_username_error_too_long" style="display: none"><!--"Email or phone number must be between 3 to 128 characters"--></code> <code id="i18n_username_error_too_short" style="display: none"><!--"Email or phone number must be between 3 to 128 characters"--></code> <code id="i18n_password_error_empty" style="display: none"><!--"Please enter a password"--></code> <code id="i18n_password_error_too_short" style="display: none"><!--"The password you provided must have at least 6 characters"--></code> <code id="i18n_password_error_too_long" style="display: none"><!--"The password you provided must have at most 400 characters"--></code> <!----> <form data-id="sign-in-form" action="https://www.linkedin.com/uas/login-submit" method="post" novalidate class="mt-1.5 mb-2"> <input name="loginCsrfParam" value="bfb3b82d-2e82-4a0c-8244-5b7e6275c22d" type="hidden"> <div class="flex flex-col"> <div class="mt-1.5" data-js-module-id="guest-input"> <div class="flex flex-col"> <label class="input-label mb-1" for="public_profile_profile-info-subheader_sign-in-modal_session_key"> Email or phone </label> <div class="text-input flex"> <input class="text-color-text font-sans text-md outline-0 bg-color-transparent w-full" autocomplete="username" id="public_profile_profile-info-subheader_sign-in-modal_session_key" name="session_key" required data-tracking-control-name="public_profile_profile-info-subheader_contact-info_modal_contextual-sign-in-modal_sign-in-modal_sign-in-session-key" data-tracking-client-ingraph type="text"> </div> </div> <p class="input-helper mt-1.5" for="public_profile_profile-info-subheader_sign-in-modal_session_key" role="alert" data-js-module-id="guest-input__message"></p> </div> <div class="mt-1.5" data-js-module-id="guest-input"> <div class="flex flex-col"> <label class="input-label mb-1" for="public_profile_profile-info-subheader_sign-in-modal_session_password"> Password </label> <div class="text-input flex"> <input class="text-color-text font-sans text-md outline-0 bg-color-transparent w-full" autocomplete="current-password" id="public_profile_profile-info-subheader_sign-in-modal_session_password" name="session_password" required data-tracking-control-name="public_profile_profile-info-subheader_contact-info_modal_contextual-sign-in-modal_sign-in-modal_sign-in-password" data-tracking-client-ingraph type="password"> <button aria-live="assertive" aria-relevant="text" data-id="sign-in-form__password-visibility-toggle" class="font-sans text-md font-bold text-color-action z-10 ml-[12px] hover:cursor-pointer" aria-label="Show your LinkedIn password" data-tracking-control-name="public_profile_profile-info-subheader_contact-info_modal_contextual-sign-in-modal_sign-in-modal_sign-in-password-visibility-toggle-btn" type="button">Show</button> </div> </div> <p class="input-helper mt-1.5" for="public_profile_profile-info-subheader_sign-in-modal_session_password" role="alert" data-js-module-id="guest-input__message"></p> </div> <input name="session_redirect" 
value="https://www.linkedin.com/in/sebastianraschka" type="hidden"> <!----> </div> <div data-id="sign-in-form__footer" class="flex justify-between sign-in-form__footer--full-width"> <a data-id="sign-in-form__forgot-password" class="font-sans text-md font-bold link leading-regular sign-in-form__forgot-password--full-width" href="https://www.linkedin.com/uas/request-password-reset?trk=public_profile_profile-info-subheader_contact-info_modal_contextual-sign-in-modal_sign-in-modal_forgot_password" data-tracking-control-name="public_profile_profile-info-subheader_contact-info_modal_contextual-sign-in-modal_sign-in-modal_forgot_password" data-tracking-will-navigate>Forgot password?</a> <!----> <input name="trk" value="public_profile_profile-info-subheader_contact-info_modal_contextual-sign-in-modal_sign-in-modal_sign-in-submit" type="hidden"> <button class="btn-md btn-primary flex-shrink-0 cursor-pointer sign-in-form__submit-btn--full-width" data-id="sign-in-form__submit-btn" data-tracking-control-name="public_profile_profile-info-subheader_contact-info_modal_contextual-sign-in-modal_sign-in-modal_sign-in-submit-btn" data-tracking-client-ingraph data-tracking-litms type="submit"> Sign in </button> </div> <div class="sign-in-form__divider left-right-divider pt-2 pb-3"> <p class="sign-in-form__divider-text font-sans text-sm text-color-text px-2"> or </p> </div> </form> <div class="w-full max-w-[400px] mx-auto"> <div class="google-auth-button" data-tracking-control-name="public_profile_profile-info-subheader_contact-info_modal_contextual-sign-in-modal_sign-in-modal_google-auth-button" data-tracking-client-ingraph> <p class="linkedin-tc__text text-color-text-low-emphasis text-xs pb-2" data-impression-id="public_profile_profile-info-subheader_contact-info_modal_contextual-sign-in-modal_sign-in-modal__button-skip-tc-text"> By clicking Continue to join or sign in, you agree to LinkedIn’s <a href="/legal/user-agreement?trk=public_profile_profile-info-subheader_contact-info_modal_contextual-sign-in-modal_sign-in-modal_auth-button_user-agreement" target="_blank" data-tracking-control-name="public_profile_profile-info-subheader_contact-info_modal_contextual-sign-in-modal_sign-in-modal_auth-button_user-agreement" data-tracking-will-navigate="true">User Agreement</a>, <a href="/legal/privacy-policy?trk=public_profile_profile-info-subheader_contact-info_modal_contextual-sign-in-modal_sign-in-modal_auth-button_privacy-policy" target="_blank" data-tracking-control-name="public_profile_profile-info-subheader_contact-info_modal_contextual-sign-in-modal_sign-in-modal_auth-button_privacy-policy" data-tracking-will-navigate="true">Privacy Policy</a>, and <a href="/legal/cookie-policy?trk=public_profile_profile-info-subheader_contact-info_modal_contextual-sign-in-modal_sign-in-modal_auth-button_cookie-policy" target="_blank" data-tracking-control-name="public_profile_profile-info-subheader_contact-info_modal_contextual-sign-in-modal_sign-in-modal_auth-button_cookie-policy" data-tracking-will-navigate="true">Cookie Policy</a>. </p> <div class="google-auth-button__placeholder mx-auto google-auth-button__placeholder--black-border" data-theme="outline" data-logo-alignment="center" data-locale="en_US" role="button" aria-label="Continue with google" data-safe-to-skip-tnc-redirect></div> <!----> </div> </div> <!----> <p class="sign-in-modal__join-now m-auto font-sans text-md text-color-text mt-2"> New to LinkedIn? 
<a href="https://www.linkedin.com/signup/public-profile-join?vieweeVanityName=sebastianraschka&trk=public_profile_profile-info-subheader_contact-info_modal_contextual-sign-in-modal_sign-in-modal_join-link" data-tracking-control-name="public_profile_profile-info-subheader_contact-info_modal_contextual-sign-in-modal_sign-in-modal_join-link" data-tracking-will-navigate="true" class="sign-in-modal__join-link">Join now</a> </p> </div> </div> <!----> </section> </div> </div> </div> </div> <div class="contextual-sign-in-modal__divider left-right-divider"> <p class="contextual-sign-in-modal__divider-text font-sans text-sm text-color-text px-2"> or </p> </div> </div> <p class="contextual-sign-in-modal__join-now m-auto font-sans text-md text-color-text my-1"> New to LinkedIn? <a href="https://www.linkedin.com/signup/public-profile-join?vieweeVanityName=sebastianraschka&trk=public_profile_profile-info-subheader_contact-info_modal_contextual-sign-in-modal_join-link" data-tracking-control-name="public_profile_profile-info-subheader_contact-info_modal_contextual-sign-in-modal_join-link" data-tracking-will-navigate="true" class="contextual-sign-in-modal__join-link">Join now</a> </p> <p class="linkedin-tc__text text-color-text-low-emphasis text-xs pb-2 contextual-sign-in-modal__terms-and-conditions m-auto w-[320px] pt-2 babybear:w-full" data-impression-id="linkedin-tc__button-skip-tc-text"> By clicking Continue to join or sign in, you agree to LinkedIn’s <a href="/legal/user-agreement?trk=linkedin-tc_auth-button_user-agreement" target="_blank" data-tracking-control-name="linkedin-tc_auth-button_user-agreement" data-tracking-will-navigate="true">User Agreement</a>, <a href="/legal/privacy-policy?trk=linkedin-tc_auth-button_privacy-policy" target="_blank" data-tracking-control-name="linkedin-tc_auth-button_privacy-policy" data-tracking-will-navigate="true">Privacy Policy</a>, and <a href="/legal/cookie-policy?trk=linkedin-tc_auth-button_cookie-policy" target="_blank" data-tracking-control-name="linkedin-tc_auth-button_cookie-policy" data-tracking-will-navigate="true">Cookie Policy</a>. 
</p> </div> </div> <!----> </section> </div> </div> </div> </div> <div class="not-first-middot"> <span> 156K followers </span> <span> 500+ connections </span> </div> </div> </h3> <h4 class="top-card-layout__second-subline font-sans text-sm leading-open text-color-text-low-emphasis mt-0.5"> <!----> <div class="mutual-connections mt-1"> <div class="face-pile flex !no-underline"> <div class="face-pile__images-container self-start flex-shrink-0 mr-1 leading-[1]"> <img class="inline-block relative rounded-[50%] w-4 h-4 face-pile__image border-1 border-solid border-color-transparent -ml-2 first:ml-0" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/8w44rdi2q9j581bh0jvajry6x" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" alt> <img class="inline-block relative rounded-[50%] w-4 h-4 face-pile__image border-1 border-solid border-color-transparent -ml-2 first:ml-0" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/4gybc9qd9imal0s1aw11rym5e" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" alt> <img class="inline-block relative rounded-[50%] w-4 h-4 face-pile__image border-1 border-solid border-color-transparent -ml-2 first:ml-0" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/pwqt08yz6sj59esncfc133u8" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" alt> </div> <button aria-label="See your mutual connections with Sebastian" class="font-semibold text-color-link cursor-pointer" data-tracking-client-ingraph data-tracking-control-name="public_profile_mutual-connections_modal-trigger" data-modal="public_profile_mutual-connections_contextual-sign-in-modal" data-no-cool-off="true">See your mutual connections</button> </div> <div class="contextual-sign-in-modal mutual-connections-modal" data-impression-id="public_profile_mutual-connections_contextual-sign-in-modal"> <!----> <div class> <!----> <div id="public_profile_mutual-connections_contextual-sign-in-modal" class="modal modal--contextual-sign-in" data-outlet="public_profile_mutual-connections_contextual-sign-in-modal"> <!----> <div class="modal__overlay flex items-center bg-color-background-scrim justify-center fixed bottom-0 left-0 right-0 top-0 opacity-0 invisible pointer-events-none z-[1000] transition-[opacity] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.17s] py-4 " aria-hidden="true"> <section aria-modal="true" role="dialog" aria-labelledby="public_profile_mutual-connections_contextual-sign-in-modal-modal-header" tabindex="-1" class="max-h-full modal__wrapper overflow-auto p-0 bg-color-surface max-w-[1128px] min-h-[160px] relative scale-[0.25] shadow-sm shadow-color-border-faint transition-[transform] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.33s] focus:outline-0 w-[1128px] mamabear:w-[744px] babybear:w-[360px] rounded-md"> <button class="modal__dismiss btn-tertiary h-[40px] w-[40px] p-0 rounded-full indent-0 contextual-sign-in-modal__modal-dismiss absolute right-0 m-[20px] cursor-pointer" aria-label="Dismiss" data-tracking-control-name="public_profile_mutual-connections_contextual-sign-in-modal_modal_dismiss"> <icon class="contextual-sign-in-modal__modal-dismiss-icon" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/gs508lg3t2o81tq7pmcgn6m2"></icon> </button> <div class="modal__main w-full"> <div class="contextual-sign-in-modal__screen 
contextual-sign-in-modal__context-screen flex flex-col my-4 mx-3"> <img class="inline-block relative rounded-[50%] w-16 h-16 contextual-sign-in-modal__img m-auto" data-delayed-url="https://media.licdn.com/dms/image/v2/C4E03AQHVyH-IfD1KbQ/profile-displayphoto-shrink_200_200/profile-displayphoto-shrink_200_200/0/1629877234109?e=2147483647&v=beta&t=OTOiJBYLX5LT7qC2dke5DvYyo_GnlKJRKC45xXeY_7o" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" alt> <h2 class="contextual-sign-in-modal__context-screen-title font-sans text-xl text-color-text my-2 mx-4 text-center" id="public_profile_mutual-connections_contextual-sign-in-modal-modal-header"> View mutual connections with Sebastian </h2> <!----><!----> <div class="contextual-sign-in-modal__btn-container m-auto w-[320px] babybear:w-full"> <!----> <div class="w-full max-w-[400px] mx-auto"> <div class="google-auth-button"> <!----> <div class="google-auth-button__placeholder mx-auto " data-theme="filled_blue" data-logo-alignment="center" data-locale="en_US" role="button" aria-label="Continue with google"></div> <!----> </div> </div> <div class="sign-in-modal" data-impression-id="public_profile_mutual-connections_contextual-sign-in-modal_sign-in-modal"> <button class="sign-in-modal__outlet-btn cursor-pointer btn-md btn-primary btn-secondary" data-tracking-client-ingraph data-tracking-control-name="public_profile_mutual-connections_contextual-sign-in-modal_sign-in-modal_outlet-button" data-modal="public_profile_mutual-connections_sign-in-modal"> <!----> Sign in </button> <div class> <!----> <div id="public_profile_mutual-connections_sign-in-modal" class="modal modal--sign-in" data-outlet="public_profile_mutual-connections_sign-in-modal"> <!----> <div class="modal__overlay flex items-center bg-color-background-scrim justify-center fixed bottom-0 left-0 right-0 top-0 opacity-0 invisible pointer-events-none z-[1000] transition-[opacity] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.17s] py-4 " aria-hidden="true"> <section aria-modal="true" role="dialog" aria-labelledby="public_profile_mutual-connections_sign-in-modal-modal-header" tabindex="-1" class="max-h-full modal__wrapper overflow-auto p-0 bg-color-surface max-w-[1128px] min-h-[160px] relative scale-[0.25] shadow-sm shadow-color-border-faint transition-[transform] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.33s] focus:outline-0 w-[1128px] mamabear:w-[744px] babybear:w-[360px] rounded-md"> <button class="modal__dismiss btn-tertiary h-[40px] w-[40px] p-0 rounded-full indent-0 sign-in-modal__dismiss absolute right-0 cursor-pointer m-[20px]" aria-label="Dismiss" data-tracking-control-name="public_profile_mutual-connections_contextual-sign-in-modal_sign-in-modal_dismiss"> <icon class="sign-in-modal__dismiss-icon" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/gs508lg3t2o81tq7pmcgn6m2"></icon> </button> <div class="modal__main w-full"> <div class="sign-in-modal__screen flex flex-col py-4 w-[513px] babybear:w-full px-3"> <h2 class="sign-in-modal__header font-sans text-display-md text-color-text "> Welcome back </h2> <code id="i18n_sign_in_form_show_text" style="display: none"><!--"Show"--></code> <code id="i18n_sign_in_form_show_label" style="display: none"><!--"Show your LinkedIn password"--></code> <code id="i18n_sign_in_form_hide_text" style="display: none"><!--"Hide"--></code> <code id="i18n_sign_in_form_hide_label" style="display: none"><!--"Hide your LinkedIn password"--></code> <code 
id="i18n_username_error_empty" style="display: none"><!--"Please enter an email address or phone number"--></code> <code id="i18n_username_error_too_long" style="display: none"><!--"Email or phone number must be between 3 to 128 characters"--></code> <code id="i18n_username_error_too_short" style="display: none"><!--"Email or phone number must be between 3 to 128 characters"--></code> <code id="i18n_password_error_empty" style="display: none"><!--"Please enter a password"--></code> <code id="i18n_password_error_too_short" style="display: none"><!--"The password you provided must have at least 6 characters"--></code> <code id="i18n_password_error_too_long" style="display: none"><!--"The password you provided must have at most 400 characters"--></code> <!----> <form data-id="sign-in-form" action="https://www.linkedin.com/uas/login-submit" method="post" novalidate class="mt-1.5 mb-2"> <input name="loginCsrfParam" value="bfb3b82d-2e82-4a0c-8244-5b7e6275c22d" type="hidden"> <div class="flex flex-col"> <div class="mt-1.5" data-js-module-id="guest-input"> <div class="flex flex-col"> <label class="input-label mb-1" for="public_profile_mutual-connections_sign-in-modal_session_key"> Email or phone </label> <div class="text-input flex"> <input class="text-color-text font-sans text-md outline-0 bg-color-transparent w-full" autocomplete="username" id="public_profile_mutual-connections_sign-in-modal_session_key" name="session_key" required data-tracking-control-name="public_profile_mutual-connections_contextual-sign-in-modal_sign-in-modal_sign-in-session-key" data-tracking-client-ingraph type="text"> </div> </div> <p class="input-helper mt-1.5" for="public_profile_mutual-connections_sign-in-modal_session_key" role="alert" data-js-module-id="guest-input__message"></p> </div> <div class="mt-1.5" data-js-module-id="guest-input"> <div class="flex flex-col"> <label class="input-label mb-1" for="public_profile_mutual-connections_sign-in-modal_session_password"> Password </label> <div class="text-input flex"> <input class="text-color-text font-sans text-md outline-0 bg-color-transparent w-full" autocomplete="current-password" id="public_profile_mutual-connections_sign-in-modal_session_password" name="session_password" required data-tracking-control-name="public_profile_mutual-connections_contextual-sign-in-modal_sign-in-modal_sign-in-password" data-tracking-client-ingraph type="password"> <button aria-live="assertive" aria-relevant="text" data-id="sign-in-form__password-visibility-toggle" class="font-sans text-md font-bold text-color-action z-10 ml-[12px] hover:cursor-pointer" aria-label="Show your LinkedIn password" data-tracking-control-name="public_profile_mutual-connections_contextual-sign-in-modal_sign-in-modal_sign-in-password-visibility-toggle-btn" type="button">Show</button> </div> </div> <p class="input-helper mt-1.5" for="public_profile_mutual-connections_sign-in-modal_session_password" role="alert" data-js-module-id="guest-input__message"></p> </div> <input name="session_redirect" value="https://www.linkedin.com/in/sebastianraschka" type="hidden"> <!----> </div> <div data-id="sign-in-form__footer" class="flex justify-between sign-in-form__footer--full-width"> <a data-id="sign-in-form__forgot-password" class="font-sans text-md font-bold link leading-regular sign-in-form__forgot-password--full-width" href="https://www.linkedin.com/uas/request-password-reset?trk=public_profile_mutual-connections_contextual-sign-in-modal_sign-in-modal_forgot_password" 
data-tracking-control-name="public_profile_mutual-connections_contextual-sign-in-modal_sign-in-modal_forgot_password" data-tracking-will-navigate>Forgot password?</a> <!----> <input name="trk" value="public_profile_mutual-connections_contextual-sign-in-modal_sign-in-modal_sign-in-submit" type="hidden"> <button class="btn-md btn-primary flex-shrink-0 cursor-pointer sign-in-form__submit-btn--full-width" data-id="sign-in-form__submit-btn" data-tracking-control-name="public_profile_mutual-connections_contextual-sign-in-modal_sign-in-modal_sign-in-submit-btn" data-tracking-client-ingraph data-tracking-litms type="submit"> Sign in </button> </div> <div class="sign-in-form__divider left-right-divider pt-2 pb-3"> <p class="sign-in-form__divider-text font-sans text-sm text-color-text px-2"> or </p> </div> </form> <div class="w-full max-w-[400px] mx-auto"> <div class="google-auth-button" data-tracking-control-name="public_profile_mutual-connections_contextual-sign-in-modal_sign-in-modal_google-auth-button" data-tracking-client-ingraph> <p class="linkedin-tc__text text-color-text-low-emphasis text-xs pb-2" data-impression-id="public_profile_mutual-connections_contextual-sign-in-modal_sign-in-modal__button-skip-tc-text"> By clicking Continue to join or sign in, you agree to LinkedIn’s <a href="/legal/user-agreement?trk=public_profile_mutual-connections_contextual-sign-in-modal_sign-in-modal_auth-button_user-agreement" target="_blank" data-tracking-control-name="public_profile_mutual-connections_contextual-sign-in-modal_sign-in-modal_auth-button_user-agreement" data-tracking-will-navigate="true">User Agreement</a>, <a href="/legal/privacy-policy?trk=public_profile_mutual-connections_contextual-sign-in-modal_sign-in-modal_auth-button_privacy-policy" target="_blank" data-tracking-control-name="public_profile_mutual-connections_contextual-sign-in-modal_sign-in-modal_auth-button_privacy-policy" data-tracking-will-navigate="true">Privacy Policy</a>, and <a href="/legal/cookie-policy?trk=public_profile_mutual-connections_contextual-sign-in-modal_sign-in-modal_auth-button_cookie-policy" target="_blank" data-tracking-control-name="public_profile_mutual-connections_contextual-sign-in-modal_sign-in-modal_auth-button_cookie-policy" data-tracking-will-navigate="true">Cookie Policy</a>. </p> <div class="google-auth-button__placeholder mx-auto google-auth-button__placeholder--black-border" data-theme="outline" data-logo-alignment="center" data-locale="en_US" role="button" aria-label="Continue with google" data-safe-to-skip-tnc-redirect></div> <!----> </div> </div> <!----> <p class="sign-in-modal__join-now m-auto font-sans text-md text-color-text mt-2"> New to LinkedIn? <a href="https://www.linkedin.com/signup/public-profile-join?vieweeVanityName=sebastianraschka&trk=public_profile_mutual-connections_contextual-sign-in-modal_sign-in-modal_join-link" data-tracking-control-name="public_profile_mutual-connections_contextual-sign-in-modal_sign-in-modal_join-link" data-tracking-will-navigate="true" class="sign-in-modal__join-link">Join now</a> </p> </div> </div> <!----> </section> </div> </div> </div> </div> <div class="contextual-sign-in-modal__divider left-right-divider"> <p class="contextual-sign-in-modal__divider-text font-sans text-sm text-color-text px-2"> or </p> </div> </div> <p class="contextual-sign-in-modal__join-now m-auto font-sans text-md text-color-text my-1"> New to LinkedIn? 
<a href="https://www.linkedin.com/signup/public-profile-join?vieweeVanityName=sebastianraschka&trk=public_profile_mutual-connections_contextual-sign-in-modal_join-link" data-tracking-control-name="public_profile_mutual-connections_contextual-sign-in-modal_join-link" data-tracking-will-navigate="true" class="contextual-sign-in-modal__join-link">Join now</a> </p> <p class="linkedin-tc__text text-color-text-low-emphasis text-xs pb-2 contextual-sign-in-modal__terms-and-conditions m-auto w-[320px] pt-2 babybear:w-full" data-impression-id="linkedin-tc__button-skip-tc-text"> By clicking Continue to join or sign in, you agree to LinkedIn’s <a href="/legal/user-agreement?trk=linkedin-tc_auth-button_user-agreement" target="_blank" data-tracking-control-name="linkedin-tc_auth-button_user-agreement" data-tracking-will-navigate="true">User Agreement</a>, <a href="/legal/privacy-policy?trk=linkedin-tc_auth-button_privacy-policy" target="_blank" data-tracking-control-name="linkedin-tc_auth-button_privacy-policy" data-tracking-will-navigate="true">Privacy Policy</a>, and <a href="/legal/cookie-policy?trk=linkedin-tc_auth-button_cookie-policy" target="_blank" data-tracking-control-name="linkedin-tc_auth-button_cookie-policy" data-tracking-will-navigate="true">Cookie Policy</a>. </p> </div> </div> <!----> </section> </div> </div> </div> </div> </div> </h4> <div class="top-card-layout__cta-container flex flex-wrap mt-0.5 papabear:mt-0 ml-[-12px]"> <a class="top-card-layout__cta mt-2 ml-1.5 h-auto babybear:flex-auto top-card-layout__cta--primary btn-md btn-primary" href="https://www.linkedin.com/signup/public-profile-join?vieweeVanityName=sebastianraschka&trk=public_profile_top-card-primary-button-join-to-view-profile" data-tracking-client-ingraph data-tracking-control-name="public_profile_top-card-primary-button-join-to-view-profile" data-tracking-will-navigate> Join to view profile </a> <div class="contextual-sign-in-modal top-card-layout__secondary-cta-modal flex babybear:flex-auto" data-impression-id="public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal"> <button class="contextual-sign-in-modal__outlet-btn cursor-pointer btn-md btn-primary top-card-layout__cta mt-2 ml-1.5 h-auto babybear:flex-auto top-card-layout__cta--secondary btn-md btn-secondary-emphasis" data-tracking-client-ingraph data-tracking-control-name="public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal_outlet-button" data-modal="public_profile_top-card_secondary-cta-modal-id"> <icon class="w-2 h-2 align-middle mr-1" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/3ljz3g2ipmq0j47d44m25e1ug"></icon> Message </button> <div class> <!----> <div id="public_profile_top-card_secondary-cta-modal-id" class="modal modal--contextual-sign-in" data-outlet="public_profile_top-card_secondary-cta-modal-id"> <!----> <div class="modal__overlay flex items-center bg-color-background-scrim justify-center fixed bottom-0 left-0 right-0 top-0 opacity-0 invisible pointer-events-none z-[1000] transition-[opacity] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.17s] py-4 " aria-hidden="true"> <section aria-modal="true" role="dialog" aria-labelledby="public_profile_top-card_secondary-cta-modal-id-modal-header" tabindex="-1" class="max-h-full modal__wrapper overflow-auto p-0 bg-color-surface max-w-[1128px] min-h-[160px] relative scale-[0.25] shadow-sm shadow-color-border-faint transition-[transform] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.33s] focus:outline-0 w-[1128px] mamabear:w-[744px] babybear:w-[360px] 
rounded-md"> <button class="modal__dismiss btn-tertiary h-[40px] w-[40px] p-0 rounded-full indent-0 contextual-sign-in-modal__modal-dismiss absolute right-0 m-[20px] cursor-pointer" aria-label="Dismiss" data-tracking-control-name="public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal_modal_dismiss"> <icon class="contextual-sign-in-modal__modal-dismiss-icon" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/gs508lg3t2o81tq7pmcgn6m2"></icon> </button> <div class="modal__main w-full"> <div class="contextual-sign-in-modal__screen contextual-sign-in-modal__context-screen flex flex-col my-4 mx-3"> <img class="inline-block relative rounded-[50%] w-16 h-16 contextual-sign-in-modal__img m-auto" data-delayed-url="https://media.licdn.com/dms/image/v2/C4E03AQHVyH-IfD1KbQ/profile-displayphoto-shrink_200_200/profile-displayphoto-shrink_200_200/0/1629877234109?e=2147483647&v=beta&t=OTOiJBYLX5LT7qC2dke5DvYyo_GnlKJRKC45xXeY_7o" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" alt> <h2 class="contextual-sign-in-modal__context-screen-title font-sans text-xl text-color-text my-2 mx-4 text-center" id="public_profile_top-card_secondary-cta-modal-id-modal-header"> Sign in to view Sebastian’s full profile </h2> <!----><!----> <div class="contextual-sign-in-modal__btn-container m-auto w-[320px] babybear:w-full"> <!----> <div class="w-full max-w-[400px] mx-auto"> <div class="google-auth-button"> <!----> <div class="google-auth-button__placeholder mx-auto " data-theme="filled_blue" data-logo-alignment="center" data-locale="en_US" role="button" aria-label="Continue with google"></div> <!----> </div> </div> <div class="sign-in-modal" data-impression-id="public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal_sign-in-modal"> <button class="sign-in-modal__outlet-btn cursor-pointer btn-md btn-primary btn-secondary" data-tracking-client-ingraph data-tracking-control-name="public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal_sign-in-modal_outlet-button" data-modal="public_profile_top-card_secondary-cta-modal-id_sign-in-modal"> <!----> Sign in </button> <div class> <!----> <div id="public_profile_top-card_secondary-cta-modal-id_sign-in-modal" class="modal modal--sign-in" data-outlet="public_profile_top-card_secondary-cta-modal-id_sign-in-modal"> <!----> <div class="modal__overlay flex items-center bg-color-background-scrim justify-center fixed bottom-0 left-0 right-0 top-0 opacity-0 invisible pointer-events-none z-[1000] transition-[opacity] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.17s] py-4 " aria-hidden="true"> <section aria-modal="true" role="dialog" aria-labelledby="public_profile_top-card_secondary-cta-modal-id_sign-in-modal-modal-header" tabindex="-1" class="max-h-full modal__wrapper overflow-auto p-0 bg-color-surface max-w-[1128px] min-h-[160px] relative scale-[0.25] shadow-sm shadow-color-border-faint transition-[transform] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.33s] focus:outline-0 w-[1128px] mamabear:w-[744px] babybear:w-[360px] rounded-md"> <button class="modal__dismiss btn-tertiary h-[40px] w-[40px] p-0 rounded-full indent-0 sign-in-modal__dismiss absolute right-0 cursor-pointer m-[20px]" aria-label="Dismiss" data-tracking-control-name="public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal_sign-in-modal_dismiss"> <icon class="sign-in-modal__dismiss-icon" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/gs508lg3t2o81tq7pmcgn6m2"></icon> 
</button> <div class="modal__main w-full"> <div class="sign-in-modal__screen flex flex-col py-4 w-[513px] babybear:w-full px-3"> <h2 class="sign-in-modal__header font-sans text-display-md text-color-text "> Welcome back </h2> <code id="i18n_sign_in_form_show_text" style="display: none"><!--"Show"--></code> <code id="i18n_sign_in_form_show_label" style="display: none"><!--"Show your LinkedIn password"--></code> <code id="i18n_sign_in_form_hide_text" style="display: none"><!--"Hide"--></code> <code id="i18n_sign_in_form_hide_label" style="display: none"><!--"Hide your LinkedIn password"--></code> <code id="i18n_username_error_empty" style="display: none"><!--"Please enter an email address or phone number"--></code> <code id="i18n_username_error_too_long" style="display: none"><!--"Email or phone number must be between 3 to 128 characters"--></code> <code id="i18n_username_error_too_short" style="display: none"><!--"Email or phone number must be between 3 to 128 characters"--></code> <code id="i18n_password_error_empty" style="display: none"><!--"Please enter a password"--></code> <code id="i18n_password_error_too_short" style="display: none"><!--"The password you provided must have at least 6 characters"--></code> <code id="i18n_password_error_too_long" style="display: none"><!--"The password you provided must have at most 400 characters"--></code> <!----> <form data-id="sign-in-form" action="https://www.linkedin.com/uas/login-submit" method="post" novalidate class="mt-1.5 mb-2"> <input name="loginCsrfParam" value="bfb3b82d-2e82-4a0c-8244-5b7e6275c22d" type="hidden"> <div class="flex flex-col"> <div class="mt-1.5" data-js-module-id="guest-input"> <div class="flex flex-col"> <label class="input-label mb-1" for="public_profile_top-card_secondary-cta-modal-id_sign-in-modal_session_key"> Email or phone </label> <div class="text-input flex"> <input class="text-color-text font-sans text-md outline-0 bg-color-transparent w-full" autocomplete="username" id="public_profile_top-card_secondary-cta-modal-id_sign-in-modal_session_key" name="session_key" required data-tracking-control-name="public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal_sign-in-modal_sign-in-session-key" data-tracking-client-ingraph type="text"> </div> </div> <p class="input-helper mt-1.5" for="public_profile_top-card_secondary-cta-modal-id_sign-in-modal_session_key" role="alert" data-js-module-id="guest-input__message"></p> </div> <div class="mt-1.5" data-js-module-id="guest-input"> <div class="flex flex-col"> <label class="input-label mb-1" for="public_profile_top-card_secondary-cta-modal-id_sign-in-modal_session_password"> Password </label> <div class="text-input flex"> <input class="text-color-text font-sans text-md outline-0 bg-color-transparent w-full" autocomplete="current-password" id="public_profile_top-card_secondary-cta-modal-id_sign-in-modal_session_password" name="session_password" required data-tracking-control-name="public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal_sign-in-modal_sign-in-password" data-tracking-client-ingraph type="password"> <button aria-live="assertive" aria-relevant="text" data-id="sign-in-form__password-visibility-toggle" class="font-sans text-md font-bold text-color-action z-10 ml-[12px] hover:cursor-pointer" aria-label="Show your LinkedIn password" data-tracking-control-name="public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal_sign-in-modal_sign-in-password-visibility-toggle-btn" type="button">Show</button> </div> </div> <p class="input-helper 
mt-1.5" for="public_profile_top-card_secondary-cta-modal-id_sign-in-modal_session_password" role="alert" data-js-module-id="guest-input__message"></p> </div> <input name="session_redirect" value="https://www.linkedin.com/in/sebastianraschka" type="hidden"> <!----> </div> <div data-id="sign-in-form__footer" class="flex justify-between sign-in-form__footer--full-width"> <a data-id="sign-in-form__forgot-password" class="font-sans text-md font-bold link leading-regular sign-in-form__forgot-password--full-width" href="https://www.linkedin.com/uas/request-password-reset?trk=public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal_sign-in-modal_forgot_password" data-tracking-control-name="public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal_sign-in-modal_forgot_password" data-tracking-will-navigate>Forgot password?</a> <!----> <input name="trk" value="public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal_sign-in-modal_sign-in-submit" type="hidden"> <button class="btn-md btn-primary flex-shrink-0 cursor-pointer sign-in-form__submit-btn--full-width" data-id="sign-in-form__submit-btn" data-tracking-control-name="public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal_sign-in-modal_sign-in-submit-btn" data-tracking-client-ingraph data-tracking-litms type="submit"> Sign in </button> </div> <div class="sign-in-form__divider left-right-divider pt-2 pb-3"> <p class="sign-in-form__divider-text font-sans text-sm text-color-text px-2"> or </p> </div> </form> <div class="w-full max-w-[400px] mx-auto"> <div class="google-auth-button" data-tracking-control-name="public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal_sign-in-modal_google-auth-button" data-tracking-client-ingraph> <p class="linkedin-tc__text text-color-text-low-emphasis text-xs pb-2" data-impression-id="public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal_sign-in-modal__button-skip-tc-text"> By clicking Continue to join or sign in, you agree to LinkedIn’s <a href="/legal/user-agreement?trk=public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal_sign-in-modal_auth-button_user-agreement" target="_blank" data-tracking-control-name="public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal_sign-in-modal_auth-button_user-agreement" data-tracking-will-navigate="true">User Agreement</a>, <a href="/legal/privacy-policy?trk=public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal_sign-in-modal_auth-button_privacy-policy" target="_blank" data-tracking-control-name="public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal_sign-in-modal_auth-button_privacy-policy" data-tracking-will-navigate="true">Privacy Policy</a>, and <a href="/legal/cookie-policy?trk=public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal_sign-in-modal_auth-button_cookie-policy" target="_blank" data-tracking-control-name="public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal_sign-in-modal_auth-button_cookie-policy" data-tracking-will-navigate="true">Cookie Policy</a>. </p> <div class="google-auth-button__placeholder mx-auto google-auth-button__placeholder--black-border" data-theme="outline" data-logo-alignment="center" data-locale="en_US" role="button" aria-label="Continue with google" data-safe-to-skip-tnc-redirect></div> <!----> </div> </div> <!----> <p class="sign-in-modal__join-now m-auto font-sans text-md text-color-text mt-2"> New to LinkedIn? 
<a href="https://www.linkedin.com/signup/public-profile-join?vieweeVanityName=sebastianraschka&trk=public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal_sign-in-modal_join-link" data-tracking-control-name="public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal_sign-in-modal_join-link" data-tracking-will-navigate="true" class="sign-in-modal__join-link">Join now</a> </p> </div> </div> <!----> </section> </div> </div> </div> </div> <div class="contextual-sign-in-modal__divider left-right-divider"> <p class="contextual-sign-in-modal__divider-text font-sans text-sm text-color-text px-2"> or </p> </div> </div> <p class="contextual-sign-in-modal__join-now m-auto font-sans text-md text-color-text my-1"> New to LinkedIn? <a href="https://www.linkedin.com/signup/public-profile-join?vieweeVanityName=sebastianraschka&trk=public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal_join-link" data-tracking-control-name="public_profile_top-card_secondary-cta-modal_contextual-sign-in-modal_join-link" data-tracking-will-navigate="true" class="contextual-sign-in-modal__join-link">Join now</a> </p> <p class="linkedin-tc__text text-color-text-low-emphasis text-xs pb-2 contextual-sign-in-modal__terms-and-conditions m-auto w-[320px] pt-2 babybear:w-full" data-impression-id="linkedin-tc__button-skip-tc-text"> By clicking Continue to join or sign in, you agree to LinkedIn’s <a href="/legal/user-agreement?trk=linkedin-tc_auth-button_user-agreement" target="_blank" data-tracking-control-name="linkedin-tc_auth-button_user-agreement" data-tracking-will-navigate="true">User Agreement</a>, <a href="/legal/privacy-policy?trk=linkedin-tc_auth-button_privacy-policy" target="_blank" data-tracking-control-name="linkedin-tc_auth-button_privacy-policy" data-tracking-will-navigate="true">Privacy Policy</a>, and <a href="/legal/cookie-policy?trk=linkedin-tc_auth-button_cookie-policy" target="_blank" data-tracking-control-name="linkedin-tc_auth-button_cookie-policy" data-tracking-will-navigate="true">Cookie Policy</a>. 
</p> </div> </div> <!----> </section> </div> </div> </div> </div> </div> </div> <div class="top-card-layout__entity-info flex-grow flex-shrink-0 basis-0 babybear:flex-none babybear:w-full top-card-layout__entity-info--right-column ml-details-container-padding max-w-[288px] babybear:my-2 babybear:ml-0"> <div class="top-card__links-container"> <div data-section="currentPositionsDetails"> <a href="https://www.linkedin.com/company/rair-lab?trk=public_profile_topcard-current-company" target="_self" data-tracking-control-name="public_profile_topcard-current-company" data-tracking-will-navigate class="flex text-md mb-1.5 font-sans font-bold leading-open items-center link" data-test-id="top-card-link"> <img class="inline-block relative w-4 h-4 mr-1 shrink-0 border-4 border-color-transparent border-solid rounded-[6px] bg-clip-content" data-delayed-url="https://media.licdn.com/dms/image/v2/D560BAQHFKhONBX4BQw/company-logo_100_100/B56ZUg2rekHEAQ-/0/1740012960464/raschka_ai_research_rair_lab_logo?e=2147483647&v=beta&t=gqfxBsMwrR25krp4APIACSqzXNnzRq6z66wLduqk33g" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/cs8pjfgyw96g44ln9r7tct85f" alt> <span class="top-card-link__description line-clamp-2"> RAIR Lab <!----> </span> </a> </div> <div data-section="educationsDetails"> <a href="https://www.linkedin.com/school/michigan-state-university/?trk=public_profile_topcard-school" target="_self" data-tracking-control-name="public_profile_topcard-school" data-tracking-will-navigate class="flex text-md mb-1.5 font-sans font-bold leading-open items-center link" data-test-id="top-card-link"> <img class="inline-block relative w-4 h-4 mr-1 shrink-0 border-4 border-color-transparent border-solid rounded-[6px] bg-clip-content" data-delayed-url="https://media.licdn.com/dms/image/v2/C560BAQFx0VIjYYSXBw/company-logo_100_100/company-logo_100_100/0/1645471386372/michigan_state_university_logo?e=2147483647&v=beta&t=8iz7-UPGdaClJsAC_04mKsA62Z8AhHjLFq4grYc7FbI" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/6qpnald1ddva78jx4bnnl3vw" alt> <span class="top-card-link__description line-clamp-2"> Michigan State University <!----> </span> </a> </div> <div data-section="websites"> <div class> <button class="modal__outlet " data-tracking-control-name="public_profile_websites-modal-outlet" data-modal="default-outlet" aria-hidden="false" tabindex="0"> <div class="flex text-md mb-1.5 font-sans font-bold leading-open items-center link-no-visited-state" data-test-id="top-card-link" data-tracking-client-ingraph data-tracking-control-name="public_profile-website"> <img alt class="mr-1 shrink-0 bg-transparent" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/9o8qqup6da04vhqijz8ft1j5g"> <span class="top-card-link__description line-clamp-2"> Websites <!----> </span> </div> </button> <div id="websites" class="modal " data-outlet="default-outlet"> <!----> <div class="modal__overlay flex items-center bg-color-background-scrim justify-center fixed bottom-0 left-0 right-0 top-0 opacity-0 invisible pointer-events-none z-[1000] transition-[opacity] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.17s] py-4 " aria-hidden="true"> <section aria-modal="true" role="dialog" aria-labelledby="websites-modal-header" tabindex="-1" class="max-h-full modal__wrapper overflow-auto p-0 bg-color-surface max-w-[1128px] min-h-[160px] relative scale-[0.25] shadow-sm shadow-color-border-faint transition-[transform] 
ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.33s] focus:outline-0 w-[1128px] mamabear:w-[744px] babybear:w-[360px] max-w-[774px] rounded-md"> <header class="modal__header flex items-center justify-between py-1.5 px-3 "> <h2 id="websites-modal-header" class="modal__title font-normal leading-open text-color-text text-lg">Websites</h2> <button class="modal__dismiss modal__dismiss--with-icon btn-tertiary h-[40px] w-[40px] p-0 rounded-full indent-0 " aria-label="Dismiss" data-tracking-control-name="public_profile_websites-modal-dismiss" type="button"> <icon class="modal__dismiss-icon relative top-[2px]" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/gs508lg3t2o81tq7pmcgn6m2"></icon> </button> <!----> </header> <div class="modal__main w-full"> <dl class="websites-list"> <div class="websites-list__row"> <dt class="grow shrink-0 basis-1/2 websites__name mw-[50%] leading-regular pr-3"> Personal Website </dt> <dd class="break-all"> <a href="https://www.linkedin.com/redir/redirect?url=https%3A%2F%2Fsebastianraschka%2Ecom&urlhash=OPu1&trk=public_profile_website" target="_blank" data-tracking-control-name="public_profile_website" data-tracking-will-navigate> <span class="font-semibold"> https://sebastianraschka.com </span><img alt class="h-2 w-2 ml-0.5 align-baseline" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/8w0vew433o9nluoruq9k5eqy"> </a> </dd> </div> <div class="websites-list__row"> <dt class="grow shrink-0 basis-1/2 websites__name mw-[50%] leading-regular pr-3"> Portfolio </dt> <dd class="break-all"> <a href="https://www.linkedin.com/redir/redirect?url=https%3A%2F%2Fgithub%2Ecom%2Frasbt&urlhash=ED71&trk=public_profile_website" target="_blank" data-tracking-control-name="public_profile_website" data-tracking-will-navigate> <span class="font-semibold"> https://github.com/rasbt </span><img alt class="h-2 w-2 ml-0.5 align-baseline" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/8w0vew433o9nluoruq9k5eqy"> </a> </dd> </div> <div class="websites-list__row"> <dt class="grow shrink-0 basis-1/2 websites__name mw-[50%] leading-regular pr-3"> Blog </dt> <dd class="break-all"> <a href="https://www.linkedin.com/redir/redirect?url=https%3A%2F%2Fmagazine%2Esebastianraschka%2Ecom&urlhash=eBB_&trk=public_profile_website" target="_blank" data-tracking-control-name="public_profile_website" data-tracking-will-navigate> <span class="font-semibold"> https://magazine.sebastianraschka.com </span><img alt class="h-2 w-2 ml-0.5 align-baseline" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/8w0vew433o9nluoruq9k5eqy"> </a> </dd> </div> </dl> </div> <!----> </section> </div> </div> </div> </div> </div> </div> </div> <div class="ellipsis-menu absolute right-0 top-0 top-card-layout__ellipsis-menu mr-1 papabear:mt-0.5 papabear:mr-2"> <div class="collapsible-dropdown flex items-center relative hyphens-auto"> <button class="ellipsis-menu__trigger collapsible-dropdown__button btn-md btn-tertiary cursor-pointer !py-[6px] !px-1 flex items-center rounded-[50%] " aria-expanded="false" aria-label="Open menu" data-tracking-control-name="public_profile_ellipsis-menu-trigger" tabindex="0"> <icon class="ellipsis-menu__trigger-icon m-0 p-0 centered-icon" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/671xosfpvk4c0kqtyl87hashi"></icon> </button> <ul class="collapsible-dropdown__list hidden container-raised absolute w-auto overflow-y-auto flex-col items-stretch z-1 bottom-auto top-[100%]" role="menu" tabindex="-1"> <li class="ellipsis-menu__item border-t-1 border-solid 
border-color-border-low-emphasis first-of-type:border-none flex" role="presentation"> <a href="/uas/login?fromSignIn=true&session_redirect=https%3A%2F%2Fwww.linkedin.com%2Fin%2Fsebastianraschka&trk=public_profile_ellipsis-menu-semaphore-sign-in-redirect&guestReportContentType=PROFILE&_f=guest-reporting" data-tracking-control-name="public_profile_ellipsis-menu-semaphore-sign-in-redirect" data-tracking-will-navigate data-item-type="semaphore" data-semaphore-content-type="PROFILE" data-semaphore-content-urn="urn:li:member:208280997" data-semaphore-tracking-prefix="public_profile_ellipsis-menu-semaphore" data-is-logged-in="false" data-modal="semaphore__toggle" class="semaphore__toggle visited:text-color-text-secondary ellipsis-menu__semaphore ellipsis-menu__item-button flex items-center w-full p-1 cursor-pointer font-sans text-sm font-bold link-styled focus:link-styled link:no-underline active:bg-color-background-container-tint focus:bg-color-background-container-tint hover:bg-color-background-container-tint outline-offset-[-2px]" role="menuitem"> <!----> <icon class="ellipsis-menu__item-icon text-color-text h-[24px] w-[24px] mr-1" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/iq0x9q37wj214o129ai1yjut"> </icon> Report this profile </a> <!----> </li> <!----> </ul> <!----> </div> </div> <!----> </div> </section> <!----> <!----> <section class="core-section-container core-section-container--with-border border-b-1 border-solid border-color-border-faint py-4 pp-section summary" data-section="summary"> <!----> <h2 class="core-section-container__title section-title"> About </h2> <!----> <div class="core-section-container__content break-words"> <p>Sebastian Raschka, PhD, is an LLM Research Engineer with over a decade of experience in artificial intelligence. His work spans industry and academia, including implementing LLM solutions as a senior engineer at Lightning AI and teaching as a statistics professor at the University of Wisconsin–Madison.<br><br>Sebastian collaborates with Fortune 500 companies on AI solutions and serves on the Open Source Board at University of Wisconsin–Madison. He specializes in LLMs and the development of high-performance AI systems, with a deep focus on practical, code-driven implementations. 
He is the author of the bestselling books Machine Learning with PyTorch and Scikit-Learn, Machine Learning Q and AI, and Build a Large Language Model (From Scratch).

For more information about specific projects, please see the resources below.

Website: https://sebastianraschka.com
Publications: https://scholar.google.com/citations?user=X4RCC0IAAAAJ&hl=en
Open-source projects: https://github.com/rasbt

Articles by Sebastian

Understanding Reasoning LLMs (Feb 5, 2025)
Methods and Strategies for Building and Refining Reasoning Models. In this article, I will describe the four main…
1,866 reactions · 94 comments
https://www.linkedin.com/pulse/understanding-reasoning-llms-sebastian-raschka-phd-1tshc

Understanding Multimodal LLMs (Nov 3, 2024)
It was a wild two months. There have once again been many developments in AI research, with two Nobel Prizes awarded to…
1,772 reactions · 53 comments
https://www.linkedin.com/pulse/understanding-multimodal-llms-sebastian-raschka-phd-t7h5c

Building a GPT-Style LLM Classifier From Scratch (Sep 21, 2024)
In this article, I want to show you how to transform pretrained large language models (LLMs) into strong text…
1,760 reactions · 54 comments
https://www.linkedin.com/pulse/building-gpt-style-llm-classifier-from-scratch-sebastian-raschka-phd-itp5c
<!----><!----> <div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis before:middot my-1" data-separate-ctas="false" data-test-id="social-actions__comments" data-id="social-actions__comments" data-num-comments="54" data-singular="%numComments% Comment" data-plural="%numComments% Comments" data-entity-urn="urn:li:linkedInArticle:7238895426014224385"> 54 Comments </div> <!----> </div> </div> <!----> </div> </li> <li> <div class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-main-card flex flex-wrap py-2 pr-2 babybear:pr-0 base-main-card--link babybear:py-1 main-article-card"> <a class="base-card__full-link absolute top-0 right-0 bottom-0 left-0 p-0 z-[2]" href="https://www.linkedin.com/pulse/new-llm-pre-training-post-training-paradigms-sebastian-raschka-phd-l53zc" data-tracking-control-name="public_profile_article_view" data-tracking-will-navigate> <span class="sr-only"> New LLM Pre-training and Post-training Paradigms </span> </a> <div class="base-main-card__media relative w-[228px] block overflow-hidden flex-shrink-0 rounded-md h-[129px] babybear:w-[105px] babybear:h-[59px]"> <img class="h-full" data-delayed-url="https://media.licdn.com/dms/image/v2/D5612AQGdto_dMucAMA/article-cover_image-shrink_720_1280/article-cover_image-shrink_720_1280/0/1723740451128?e=2147483647&v=beta&t=bZuKacN5K0bmj3mrsNIU86914DJdSq9n_gnH1L6uiiI" alt> <!----> </div> <div class="base-main-card__info self-center ml-1 flex-1 relative break-words papabear:min-w-0 mamabear:min-w-0 babybear:w-full article-card-social-actions"> <div class="body-text text-color-text-low-emphasis base-main-card__metadata mb-0 babybear:mb-0 babybear:-mt-0.5"> <span class="body-text text-color-text-low-emphasis base-main-card__metadata-item babybear:text-xs">Aug 17, 2024</span> <!----> </div> <h3 class="base-main-card__title font-sans text-[18px] font-bold text-color-text overflow-hidden babybear:text-sm"> New LLM Pre-training and Post-training Paradigms </h3> <!----> <p class="base-main-card__description body-text text-color-text -mt-0.5"> The development of large language models (LLMs) has come a long way, from the early GPT models to the sophisticated… </p> <!----> <code id="i18n_reaction_singular" style="display: none"><!--"%numReactions% Reaction"--></code> <code id="i18n_reactions_plural" style="display: none"><!--"%numReactions% Reactions"--></code> <div class="flex items-center font-sans text-sm babybear:text-xs -mt-1"> <div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis my-1" aria-label="1,391 Reactions" data-separate-ctas="false" data-test-id="social-actions__reactions" data-id="social-actions__reactions" data-num-reactions="1391" data-singular="%numReactions%" data-plural="%numReactions%" data-entity-urn="urn:li:linkedInArticle:7229885092033142784" tabindex="0"> <img alt data-reaction-type="LIKE" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671" height="16px" width="16px"> <img alt data-reaction-type="INTEREST" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu" height="16px" width="16px"> <img alt data-reaction-type="PRAISE" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/2tzoeodxy0zug4455msr0oq0v" height="16px" width="16px"> <span aria-hidden="true" class="font-normal ml-0.5" data-test-id="social-actions__reaction-count"> 1,391 </span> </div> <code 
id="social-actions__reaction-image-APPRECIATION" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cyfai5zw4nrqhyyhl0p7so58v"--></code> <!----><!----> <code id="social-actions__reaction-image-EMPATHY" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/asiqslyf4ooq7ggllg4fyo4o2"--></code> <code id="social-actions__reaction-image-ENTERTAINMENT" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/22ifp2etz8kb9tgjqn65s9ics"--></code> <!----> <code id="social-actions__reaction-image-INTEREST" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu"--></code> <code id="social-actions__reaction-image-LIKE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671"--></code> <code id="social-actions__reaction-image-MAYBE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cryzkreqrh52ja5bc6njlrupa"--></code> <code id="social-actions__reaction-image-PRAISE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/2tzoeodxy0zug4455msr0oq0v"--></code> <!----><!----> <div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis before:middot my-1" data-separate-ctas="false" data-test-id="social-actions__comments" data-id="social-actions__comments" data-num-comments="35" data-singular="%numComments% Comment" data-plural="%numComments% Comments" data-entity-urn="urn:li:linkedInArticle:7229885092033142784"> 35 Comments </div> <!----> </div> </div> <!----> </div> </li> <li> <div class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-main-card flex flex-wrap py-2 pr-2 babybear:pr-0 base-main-card--link babybear:py-1 main-article-card"> <a class="base-card__full-link absolute top-0 right-0 bottom-0 left-0 p-0 z-[2]" href="https://www.linkedin.com/pulse/instruction-pretraining-llms-sebastian-raschka-phd-x6zoc" data-tracking-control-name="public_profile_article_view" data-tracking-will-navigate> <span class="sr-only"> Instruction Pretraining LLMs </span> </a> <div class="base-main-card__media relative w-[228px] block overflow-hidden flex-shrink-0 rounded-md h-[129px] babybear:w-[105px] babybear:h-[59px]"> <img class="h-full" data-delayed-url="https://media.licdn.com/dms/image/v2/D5612AQHbmz6yfVIGxA/article-cover_image-shrink_720_1280/article-cover_image-shrink_720_1280/0/1721330358225?e=2147483647&v=beta&t=rmnfhvwfvC6BMNBlAcut56cUPuZOalRA8syQaylqZhg" alt> <!----> </div> <div class="base-main-card__info self-center ml-1 flex-1 relative break-words papabear:min-w-0 mamabear:min-w-0 babybear:w-full article-card-social-actions"> <div class="body-text text-color-text-low-emphasis base-main-card__metadata mb-0 babybear:mb-0 babybear:-mt-0.5"> <span class="body-text text-color-text-low-emphasis base-main-card__metadata-item babybear:text-xs">Jul 20, 2024</span> <!----> </div> <h3 class="base-main-card__title font-sans text-[18px] font-bold text-color-text overflow-hidden babybear:text-sm"> Instruction Pretraining LLMs </h3> <!----> <p class="base-main-card__description body-text text-color-text -mt-0.5"> A lot has happened last month: Apple announced the integration of on-device LLMs, Nvidia shared their large Nemotron… </p> <!----> <code id="i18n_reaction_singular" style="display: none"><!--"%numReactions% Reaction"--></code> <code id="i18n_reactions_plural" style="display: none"><!--"%numReactions% Reactions"--></code> <div class="flex items-center font-sans text-sm babybear:text-xs -mt-1"> 
<div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis my-1" aria-label="952 Reactions" data-separate-ctas="false" data-test-id="social-actions__reactions" data-id="social-actions__reactions" data-num-reactions="952" data-singular="%numReactions%" data-plural="%numReactions%" data-entity-urn="urn:li:linkedInArticle:7219782809446436865" tabindex="0"> <img alt data-reaction-type="LIKE" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671" height="16px" width="16px"> <img alt data-reaction-type="INTEREST" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu" height="16px" width="16px"> <img alt data-reaction-type="EMPATHY" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/asiqslyf4ooq7ggllg4fyo4o2" height="16px" width="16px"> <span aria-hidden="true" class="font-normal ml-0.5" data-test-id="social-actions__reaction-count"> 952 </span> </div> <code id="social-actions__reaction-image-APPRECIATION" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cyfai5zw4nrqhyyhl0p7so58v"--></code> <!----><!----> <code id="social-actions__reaction-image-EMPATHY" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/asiqslyf4ooq7ggllg4fyo4o2"--></code> <code id="social-actions__reaction-image-ENTERTAINMENT" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/22ifp2etz8kb9tgjqn65s9ics"--></code> <!----> <code id="social-actions__reaction-image-INTEREST" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu"--></code> <code id="social-actions__reaction-image-LIKE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671"--></code> <code id="social-actions__reaction-image-MAYBE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cryzkreqrh52ja5bc6njlrupa"--></code> <code id="social-actions__reaction-image-PRAISE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/2tzoeodxy0zug4455msr0oq0v"--></code> <!----><!----> <div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis before:middot my-1" data-separate-ctas="false" data-test-id="social-actions__comments" data-id="social-actions__comments" data-num-comments="39" data-singular="%numComments% Comment" data-plural="%numComments% Comments" data-entity-urn="urn:li:linkedInArticle:7219782809446436865"> 39 Comments </div> <!----> </div> </div> <!----> </div> </li> <li> <div class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-main-card flex flex-wrap py-2 pr-2 babybear:pr-0 base-main-card--link babybear:py-1 main-article-card"> <a class="base-card__full-link absolute top-0 right-0 bottom-0 left-0 p-0 z-[2]" href="https://www.linkedin.com/pulse/llm-research-insights-instruction-masking-new-lora-raschka-phd-7p1oc" data-tracking-control-name="public_profile_article_view" data-tracking-will-navigate> <span class="sr-only"> LLM Research Insights: Instruction Masking and New LoRA Finetuning Experiments </span> </a> <div class="base-main-card__media relative w-[228px] block overflow-hidden flex-shrink-0 rounded-md h-[129px] babybear:w-[105px] babybear:h-[59px]"> <img class="h-full" data-delayed-url="https://media.licdn.com/dms/image/v2/D5612AQHrFO7oCsjB8g/article-cover_image-shrink_720_1280/article-cover_image-shrink_720_1280/0/1721183355165?e=2147483647&v=beta&t=pESTVCk__yC_9wzIYhg1AfmiUxE69pV-tXjhZoxMO9Y" alt> 
<!----> </div> <div class="base-main-card__info self-center ml-1 flex-1 relative break-words papabear:min-w-0 mamabear:min-w-0 babybear:w-full article-card-social-actions"> <div class="body-text text-color-text-low-emphasis base-main-card__metadata mb-0 babybear:mb-0 babybear:-mt-0.5"> <span class="body-text text-color-text-low-emphasis base-main-card__metadata-item babybear:text-xs">Jun 2, 2024</span> <!----> </div> <h3 class="base-main-card__title font-sans text-[18px] font-bold text-color-text overflow-hidden babybear:text-sm"> LLM Research Insights: Instruction Masking and New LoRA Finetuning Experiments </h3> <!----> <p class="base-main-card__description body-text text-color-text -mt-0.5"> This month, I am covering three new papers related to instruction finetuning and parameter-efficient finetuning with… </p> <!----> <code id="i18n_reaction_singular" style="display: none"><!--"%numReactions% Reaction"--></code> <code id="i18n_reactions_plural" style="display: none"><!--"%numReactions% Reactions"--></code> <div class="flex items-center font-sans text-sm babybear:text-xs -mt-1"> <div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis my-1" aria-label="687 Reactions" data-separate-ctas="false" data-test-id="social-actions__reactions" data-id="social-actions__reactions" data-num-reactions="687" data-singular="%numReactions%" data-plural="%numReactions%" data-entity-urn="urn:li:linkedInArticle:7202108954154344448" tabindex="0"> <img alt data-reaction-type="LIKE" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671" height="16px" width="16px"> <img alt data-reaction-type="INTEREST" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu" height="16px" width="16px"> <img alt data-reaction-type="EMPATHY" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/asiqslyf4ooq7ggllg4fyo4o2" height="16px" width="16px"> <span aria-hidden="true" class="font-normal ml-0.5" data-test-id="social-actions__reaction-count"> 687 </span> </div> <code id="social-actions__reaction-image-APPRECIATION" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cyfai5zw4nrqhyyhl0p7so58v"--></code> <!----><!----> <code id="social-actions__reaction-image-EMPATHY" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/asiqslyf4ooq7ggllg4fyo4o2"--></code> <code id="social-actions__reaction-image-ENTERTAINMENT" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/22ifp2etz8kb9tgjqn65s9ics"--></code> <!----> <code id="social-actions__reaction-image-INTEREST" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu"--></code> <code id="social-actions__reaction-image-LIKE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671"--></code> <code id="social-actions__reaction-image-MAYBE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cryzkreqrh52ja5bc6njlrupa"--></code> <code id="social-actions__reaction-image-PRAISE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/2tzoeodxy0zug4455msr0oq0v"--></code> <!----><!----> <div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis before:middot my-1" data-separate-ctas="false" data-test-id="social-actions__comments" data-id="social-actions__comments" data-num-comments="15" data-singular="%numComments% Comment" data-plural="%numComments% Comments" 
data-entity-urn="urn:li:linkedInArticle:7202108954154344448"> 15 Comments </div> <!----> </div> </div> <!----> </div> </li> <li> <div class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-main-card flex flex-wrap py-2 pr-2 babybear:pr-0 base-main-card--link babybear:py-1 main-article-card"> <a class="base-card__full-link absolute top-0 right-0 bottom-0 left-0 p-0 z-[2]" href="https://www.linkedin.com/pulse/how-good-latest-open-llms-dpo-better-than-ppo-sebastian-raschka-phd-tjl2c" data-tracking-control-name="public_profile_article_view" data-tracking-will-navigate> <span class="sr-only"> How Good Are the Latest Open LLMs? And Is DPO Better Than PPO? </span> </a> <div class="base-main-card__media relative w-[228px] block overflow-hidden flex-shrink-0 rounded-md h-[129px] babybear:w-[105px] babybear:h-[59px]"> <img class="h-full" data-delayed-url="https://media.licdn.com/dms/image/v2/D5612AQESXYeiXIXGfw/article-cover_image-shrink_720_1280/article-cover_image-shrink_720_1280/0/1715381661557?e=2147483647&v=beta&t=2yjudmlWm5wd9jsd4Vpz94sAbp8NvBagcV1mSRp8xyw" alt> <!----> </div> <div class="base-main-card__info self-center ml-1 flex-1 relative break-words papabear:min-w-0 mamabear:min-w-0 babybear:w-full article-card-social-actions"> <div class="body-text text-color-text-low-emphasis base-main-card__metadata mb-0 babybear:mb-0 babybear:-mt-0.5"> <span class="body-text text-color-text-low-emphasis base-main-card__metadata-item babybear:text-xs">May 12, 2024</span> <!----> </div> <h3 class="base-main-card__title font-sans text-[18px] font-bold text-color-text overflow-hidden babybear:text-sm"> How Good Are the Latest Open LLMs? And Is DPO Better Than PPO? </h3> <!----> <p class="base-main-card__description body-text text-color-text -mt-0.5"> April 2024, what a month! 
My birthday, a new book release, spring is finally here, and four major open LLM releases:… </p> <!----> <code id="i18n_reaction_singular" style="display: none"><!--"%numReactions% Reaction"--></code> <code id="i18n_reactions_plural" style="display: none"><!--"%numReactions% Reactions"--></code> <div class="flex items-center font-sans text-sm babybear:text-xs -mt-1"> <div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis my-1" aria-label="817 Reactions" data-separate-ctas="false" data-test-id="social-actions__reactions" data-id="social-actions__reactions" data-num-reactions="817" data-singular="%numReactions%" data-plural="%numReactions%" data-entity-urn="urn:li:linkedInArticle:7194827172937699329" tabindex="0"> <img alt data-reaction-type="LIKE" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671" height="16px" width="16px"> <img alt data-reaction-type="INTEREST" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu" height="16px" width="16px"> <img alt data-reaction-type="PRAISE" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/2tzoeodxy0zug4455msr0oq0v" height="16px" width="16px"> <span aria-hidden="true" class="font-normal ml-0.5" data-test-id="social-actions__reaction-count"> 817 </span> </div> <code id="social-actions__reaction-image-APPRECIATION" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cyfai5zw4nrqhyyhl0p7so58v"--></code> <!----><!----> <code id="social-actions__reaction-image-EMPATHY" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/asiqslyf4ooq7ggllg4fyo4o2"--></code> <code id="social-actions__reaction-image-ENTERTAINMENT" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/22ifp2etz8kb9tgjqn65s9ics"--></code> <!----> <code id="social-actions__reaction-image-INTEREST" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu"--></code> <code id="social-actions__reaction-image-LIKE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671"--></code> <code id="social-actions__reaction-image-MAYBE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cryzkreqrh52ja5bc6njlrupa"--></code> <code id="social-actions__reaction-image-PRAISE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/2tzoeodxy0zug4455msr0oq0v"--></code> <!----><!----> <div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis before:middot my-1" data-separate-ctas="false" data-test-id="social-actions__comments" data-id="social-actions__comments" data-num-comments="26" data-singular="%numComments% Comment" data-plural="%numComments% Comments" data-entity-urn="urn:li:linkedInArticle:7194827172937699329"> 26 Comments </div> <!----> </div> </div> <!----> </div> </li> <li> <div class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-main-card flex flex-wrap py-2 pr-2 babybear:pr-0 base-main-card--link babybear:py-1 main-article-card"> <a class="base-card__full-link absolute top-0 right-0 bottom-0 left-0 p-0 z-[2]" href="https://www.linkedin.com/pulse/using-finetuning-pretrained-transformers-sebastian-raschka-phd-08yff" data-tracking-control-name="public_profile_article_view" data-tracking-will-navigate> <span class="sr-only"> Using and Finetuning Pretrained Transformers </span> </a> <div class="base-main-card__media relative w-[228px] block 
overflow-hidden flex-shrink-0 rounded-md h-[129px] babybear:w-[105px] babybear:h-[59px]"> <img class="h-full" data-delayed-url="https://media.licdn.com/dms/image/v2/D4D12AQE9u2TSkhrLGw/article-cover_image-shrink_720_1280/article-cover_image-shrink_720_1280/0/1713458042630?e=2147483647&v=beta&t=1HsR4VaXK7sMf19jM1mxLUuiepjCmzXtY1HiPaR8dGM" alt> <!----> </div> <div class="base-main-card__info self-center ml-1 flex-1 relative break-words papabear:min-w-0 mamabear:min-w-0 babybear:w-full article-card-social-actions"> <div class="body-text text-color-text-low-emphasis base-main-card__metadata mb-0 babybear:mb-0 babybear:-mt-0.5"> <span class="body-text text-color-text-low-emphasis base-main-card__metadata-item babybear:text-xs">Apr 20, 2024</span> <!----> </div> <h3 class="base-main-card__title font-sans text-[18px] font-bold text-color-text overflow-hidden babybear:text-sm"> Using and Finetuning Pretrained Transformers </h3> <!----> <p class="base-main-card__description body-text text-color-text -mt-0.5"> This week has been filled with developments, including exciting new AI research that I’ll be discussing in my usual… </p> <!----> <code id="i18n_reaction_singular" style="display: none"><!--"%numReactions% Reaction"--></code> <code id="i18n_reactions_plural" style="display: none"><!--"%numReactions% Reactions"--></code> <div class="flex items-center font-sans text-sm babybear:text-xs -mt-1"> <div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis my-1" aria-label="710 Reactions" data-separate-ctas="false" data-test-id="social-actions__reactions" data-id="social-actions__reactions" data-num-reactions="710" data-singular="%numReactions%" data-plural="%numReactions%" data-entity-urn="urn:li:linkedInArticle:7186465002520772608" tabindex="0"> <img alt data-reaction-type="LIKE" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671" height="16px" width="16px"> <img alt data-reaction-type="INTEREST" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu" height="16px" width="16px"> <img alt data-reaction-type="EMPATHY" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/asiqslyf4ooq7ggllg4fyo4o2" height="16px" width="16px"> <span aria-hidden="true" class="font-normal ml-0.5" data-test-id="social-actions__reaction-count"> 710 </span> </div> <code id="social-actions__reaction-image-APPRECIATION" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cyfai5zw4nrqhyyhl0p7so58v"--></code> <!----><!----> <code id="social-actions__reaction-image-EMPATHY" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/asiqslyf4ooq7ggllg4fyo4o2"--></code> <code id="social-actions__reaction-image-ENTERTAINMENT" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/22ifp2etz8kb9tgjqn65s9ics"--></code> <!----> <code id="social-actions__reaction-image-INTEREST" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu"--></code> <code id="social-actions__reaction-image-LIKE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671"--></code> <code id="social-actions__reaction-image-MAYBE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cryzkreqrh52ja5bc6njlrupa"--></code> <code id="social-actions__reaction-image-PRAISE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/2tzoeodxy0zug4455msr0oq0v"--></code> <!----><!----> <div class="flex items-center font-normal 
text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis before:middot my-1" data-separate-ctas="false" data-test-id="social-actions__comments" data-id="social-actions__comments" data-num-comments="19" data-singular="%numComments% Comment" data-plural="%numComments% Comments" data-entity-urn="urn:li:linkedInArticle:7186465002520772608"> 19 Comments </div> <!----> </div> </div> <!----> </div> </li> <li> <div class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-main-card flex flex-wrap py-2 pr-2 babybear:pr-0 base-main-card--link babybear:py-1 main-article-card"> <a class="base-card__full-link absolute top-0 right-0 bottom-0 left-0 p-0 z-[2]" href="https://www.linkedin.com/pulse/ahead-ai-12-llm-businesses-busyness-sebastian-raschka-phd" data-tracking-control-name="public_profile_article_view" data-tracking-will-navigate> <span class="sr-only"> Ahead of AI #12: LLM Businesses and Busyness </span> </a> <div class="base-main-card__media relative w-[228px] block overflow-hidden flex-shrink-0 rounded-md h-[129px] babybear:w-[105px] babybear:h-[59px]"> <img class="h-full" data-delayed-url="https://media.licdn.com/dms/image/v2/D5612AQGtYjf4FHtrSQ/article-cover_image-shrink_720_1280/article-cover_image-shrink_720_1280/0/1696691152180?e=2147483647&v=beta&t=gY8slhdJ92S27SezJGaXYIsIfjG3xO5DjgFLLTkRZf4" alt> <!----> </div> <div class="base-main-card__info self-center ml-1 flex-1 relative break-words papabear:min-w-0 mamabear:min-w-0 babybear:w-full article-card-social-actions"> <div class="body-text text-color-text-low-emphasis base-main-card__metadata mb-0 babybear:mb-0 babybear:-mt-0.5"> <span class="body-text text-color-text-low-emphasis base-main-card__metadata-item babybear:text-xs">Oct 8, 2023</span> <!----> </div> <h3 class="base-main-card__title font-sans text-[18px] font-bold text-color-text overflow-hidden babybear:text-sm"> Ahead of AI #12: LLM Businesses and Busyness </h3> <!----> <p class="base-main-card__description body-text text-color-text -mt-0.5"> In Ahead of AI, I try to strike a balance between discussing recent research, explaining AI-related concepts, and… </p> <!----> <code id="i18n_reaction_singular" style="display: none"><!--"%numReactions% Reaction"--></code> <code id="i18n_reactions_plural" style="display: none"><!--"%numReactions% Reactions"--></code> <div class="flex items-center font-sans text-sm babybear:text-xs -mt-1"> <div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis my-1" aria-label="230 Reactions" data-separate-ctas="false" data-test-id="social-actions__reactions" data-id="social-actions__reactions" data-num-reactions="230" data-singular="%numReactions%" data-plural="%numReactions%" data-entity-urn="urn:li:linkedInArticle:7116435796466810880" tabindex="0"> <img alt data-reaction-type="LIKE" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671" height="16px" width="16px"> <img alt data-reaction-type="INTEREST" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu" height="16px" width="16px"> <img alt data-reaction-type="PRAISE" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/2tzoeodxy0zug4455msr0oq0v" height="16px" width="16px"> <span aria-hidden="true" class="font-normal ml-0.5" data-test-id="social-actions__reaction-count"> 230 </span> </div> <code id="social-actions__reaction-image-APPRECIATION" style="display: 
none"><!--"https://static.licdn.com/aero-v1/sc/h/cyfai5zw4nrqhyyhl0p7so58v"--></code> <!----><!----> <code id="social-actions__reaction-image-EMPATHY" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/asiqslyf4ooq7ggllg4fyo4o2"--></code> <code id="social-actions__reaction-image-ENTERTAINMENT" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/22ifp2etz8kb9tgjqn65s9ics"--></code> <!----> <code id="social-actions__reaction-image-INTEREST" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu"--></code> <code id="social-actions__reaction-image-LIKE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671"--></code> <code id="social-actions__reaction-image-MAYBE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cryzkreqrh52ja5bc6njlrupa"--></code> <code id="social-actions__reaction-image-PRAISE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/2tzoeodxy0zug4455msr0oq0v"--></code> <!----><!----> <div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis before:middot my-1" data-separate-ctas="false" data-test-id="social-actions__comments" data-id="social-actions__comments" data-num-comments="13" data-singular="%numComments% Comment" data-plural="%numComments% Comments" data-entity-urn="urn:li:linkedInArticle:7116435796466810880"> 13 Comments </div> <!----> </div> </div> <!----> </div> </li> <li> <div class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-main-card flex flex-wrap py-2 pr-2 babybear:pr-0 base-main-card--link babybear:py-1 main-article-card"> <a class="base-card__full-link absolute top-0 right-0 bottom-0 left-0 p-0 z-[2]" href="https://www.linkedin.com/pulse/ahead-ai-11-new-foundation-models-sebastian-raschka-phd" data-tracking-control-name="public_profile_article_view" data-tracking-will-navigate> <span class="sr-only"> Ahead of AI #11: New Foundation Models </span> </a> <div class="base-main-card__media relative w-[228px] block overflow-hidden flex-shrink-0 rounded-md h-[129px] babybear:w-[105px] babybear:h-[59px]"> <img class="h-full" data-delayed-url="https://media.licdn.com/dms/image/v2/D5612AQENk7XRprLa1Q/article-cover_image-shrink_720_1280/article-cover_image-shrink_720_1280/0/1693051970547?e=2147483647&v=beta&t=ecuQcBlKcd-dLstbt81DO_6N6YaLkwSA5iD9i6oSX_M" alt> <!----> </div> <div class="base-main-card__info self-center ml-1 flex-1 relative break-words papabear:min-w-0 mamabear:min-w-0 babybear:w-full article-card-social-actions"> <div class="body-text text-color-text-low-emphasis base-main-card__metadata mb-0 babybear:mb-0 babybear:-mt-0.5"> <span class="body-text text-color-text-low-emphasis base-main-card__metadata-item babybear:text-xs">Aug 26, 2023</span> <!----> </div> <h3 class="base-main-card__title font-sans text-[18px] font-bold text-color-text overflow-hidden babybear:text-sm"> Ahead of AI #11: New Foundation Models </h3> <!----> <p class="base-main-card__description body-text text-color-text -mt-0.5"> Dear readers, The latest issue of Ahead of AI covers the recent and noteworthy developments around LLMs this summer:… </p> <!----> <code id="i18n_reaction_singular" style="display: none"><!--"%numReactions% Reaction"--></code> <code id="i18n_reactions_plural" style="display: none"><!--"%numReactions% Reactions"--></code> <div class="flex items-center font-sans text-sm babybear:text-xs -mt-1"> <div class="flex items-center font-normal 
text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis my-1" aria-label="180 Reactions" data-separate-ctas="false" data-test-id="social-actions__reactions" data-id="social-actions__reactions" data-num-reactions="180" data-singular="%numReactions%" data-plural="%numReactions%" data-entity-urn="urn:li:linkedInArticle:7101173894924947456" tabindex="0"> <img alt data-reaction-type="LIKE" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671" height="16px" width="16px"> <img alt data-reaction-type="INTEREST" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu" height="16px" width="16px"> <img alt data-reaction-type="PRAISE" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/2tzoeodxy0zug4455msr0oq0v" height="16px" width="16px"> <span aria-hidden="true" class="font-normal ml-0.5" data-test-id="social-actions__reaction-count"> 180 </span> </div> <code id="social-actions__reaction-image-APPRECIATION" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cyfai5zw4nrqhyyhl0p7so58v"--></code> <!----><!----> <code id="social-actions__reaction-image-EMPATHY" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/asiqslyf4ooq7ggllg4fyo4o2"--></code> <code id="social-actions__reaction-image-ENTERTAINMENT" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/22ifp2etz8kb9tgjqn65s9ics"--></code> <!----> <code id="social-actions__reaction-image-INTEREST" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu"--></code> <code id="social-actions__reaction-image-LIKE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671"--></code> <code id="social-actions__reaction-image-MAYBE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cryzkreqrh52ja5bc6njlrupa"--></code> <code id="social-actions__reaction-image-PRAISE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/2tzoeodxy0zug4455msr0oq0v"--></code> <!----><!----> <div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis before:middot my-1" data-separate-ctas="false" data-test-id="social-actions__comments" data-id="social-actions__comments" data-num-comments="2" data-singular="%numComments% Comment" data-plural="%numComments% Comments" data-entity-urn="urn:li:linkedInArticle:7101173894924947456"> 2 Comments </div> <!----> </div> </div> <!----> </div> </li> </ul> <button class="show-more-less-button show-more-less__button show-more-less__more-button !py-1.5" data-tracking-control-name="_show_more" aria-label="Show more articles"> Show more <icon class="show-more-less-button-icon" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/cyolgscd0imw2ldqppkrb84vo"></icon></button> <div class="pt-1"> <a data-tracking-control-name="_see_all_articles" data-tracking-will-navigate href="https://www.linkedin.com/in/sebastianraschka/recent-activity/articles/" class="show-more-less-button see-more-link show-more-less__less-button show-more-less__button--hide !text-color-link !border-color-link !shadow-none border-[1px] border-solid !py-1.5"> See all articles </a> </div> </div> <!----> </div> </section> <section class="core-section-container core-section-container--with-border border-b-1 border-solid border-color-border-faint py-4 contributions" data-section="contributions"> <!----> <h2 class="core-section-container__title section-title"> Contributions </h2> <!----> <div 
class="core-section-container__content break-words"> <div class="show-more-less"> <!----> <ul data-max-num-to-show="4" class="show-more-less__list show-more-less__list--hide-after-4" data-impression-id="public_profile_contributions_show-more-less"> <li class="contributions__list-item-wrapper"> <div class="contribution-item my-2"> <h4 class="font-semibold text-lg leading-regular"> <a class="contribution-title-link text-color-text hover:text-color-text focus:text-color-text active:text-color-text" data-tracking-control-name="public_profile_contributions_contribution-title" data-tracking-will-navigate href="https://www.linkedin.com/advice/0/youre-considering-career-machine-learning-which-tkqte"> You're considering a career in machine learning. Which in-demand skills should you focus on? </a> </h4> <a class="hover:no-underline active:no-underline visited:no-underline focus:no-underline" data-tracking-control-name="public_profile_contributions_contribution-body" data-tracking-will-navigate href="https://www.linkedin.com/advice/0/youre-considering-career-machine-learning-which-tkqte?contributionUrn=urn%3Ali%3Acomment%3A%28urn%3Ali%3AarticleSegment%3A%28urn%3Ali%3AlinkedInArticle%3A7232129409992867841%2C7232129412060585984%29%2C7300519292880527360%29"> <p class="text-md leading-regular my-1 line-clamp-3"> I exceeded the character limit for my comment above, but there is something I really need to add: I think core skills like programming, math, statistics, and software engineering will remain useful along with learning machine learning, deep learning, and AI. Having a good understanding of what you are doing is essential. Ultimately, while the barrier for writing code and developing machine learning models may become lower, but understanding the fundamentals is what will make the difference. </p> </a> <div class="contribution-list flex justify-between flex-col items-start gap-y-2"> <time class="contribution__time font-sans text-sm text-color-text-low-emphasis"> Sebastian Raschka, PhD contributed 1 month ago </time> <a class="insightful-button rounded-xl border-solid py-0.5 px-2 flex text-sm gap-1 hover:no-underline btn-secondary-emphasis" href="https://www.linkedin.com/advice/0/youre-considering-career-machine-learning-which-tkqte?contributionUrn=urn%3Ali%3Acomment%3A%28urn%3Ali%3AarticleSegment%3A%28urn%3Ali%3AlinkedInArticle%3A7232129409992867841%2C7232129412060585984%29%2C7300519292880527360%29" data-tracking-control-name="public_profile_contributions_contribution-insightful-link" data-tracking-will-navigate> <img alt="Insightful" class="lazy-load w-2 h-2 bg-white rounded-full" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/byjfih68cazgghr6pv616zma1"> <span> 1 Upvote</span> </a> </div> </div> </li> <li class="contributions__list-item-wrapper"> <div class="contribution-item my-2"> <h4 class="font-semibold text-lg leading-regular"> <a class="contribution-title-link text-color-text hover:text-color-text focus:text-color-text active:text-color-text" data-tracking-control-name="public_profile_contributions_contribution-title" data-tracking-will-navigate href="https://www.linkedin.com/advice/0/youre-considering-career-machine-learning-which-tkqte"> You're considering a career in machine learning. Which in-demand skills should you focus on? 
</a> </h4> <a class="hover:no-underline active:no-underline visited:no-underline focus:no-underline" data-tracking-control-name="public_profile_contributions_contribution-body" data-tracking-will-navigate href="https://www.linkedin.com/advice/0/youre-considering-career-machine-learning-which-tkqte?contributionUrn=urn%3Ali%3Acomment%3A%28urn%3Ali%3AarticleSegment%3A%28urn%3Ali%3AlinkedInArticle%3A7232129409992867841%2C7232129412060585984%29%2C7300518903888125953%29"> <p class="text-md leading-regular my-1 line-clamp-3"> One of the most important things in today's world is learning how to work with AI to improve AI. In other words, it will become more and more important to integrate LLMs and coding assistants into your workflow. This approach is not just about generating code and solutions faster but also about getting suggestions for improvements. It isn't always about writing more code, but writing better code. Also, while machine learning is moving from classic methods to deep neural nets, it's still important to learn about the classic methods as well. Many business problems can be solved more effectively with simpler models. And even if a simpler model is not the final solution, it often serves as a good baseline to start with. </p> </a> <div class="contribution-list flex justify-between flex-col items-start gap-y-2"> <time class="contribution__time font-sans text-sm text-color-text-low-emphasis"> Sebastian Raschka, PhD contributed 1 month ago </time> <a class="insightful-button rounded-xl border-solid py-0.5 px-2 flex text-sm gap-1 hover:no-underline btn-secondary-emphasis" href="https://www.linkedin.com/advice/0/youre-considering-career-machine-learning-which-tkqte?contributionUrn=urn%3Ali%3Acomment%3A%28urn%3Ali%3AarticleSegment%3A%28urn%3Ali%3AlinkedInArticle%3A7232129409992867841%2C7232129412060585984%29%2C7300518903888125953%29" data-tracking-control-name="public_profile_contributions_contribution-insightful-link" data-tracking-will-navigate> <img alt="Insightful" class="lazy-load w-2 h-2 bg-white rounded-full" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/byjfih68cazgghr6pv616zma1"> <span> 1 Upvote</span> </a> </div> </div> </li> <li class="contributions__list-item-wrapper"> <div class="contribution-item my-2"> <h4 class="font-semibold text-lg leading-regular"> <a class="contribution-title-link text-color-text hover:text-color-text focus:text-color-text active:text-color-text" data-tracking-control-name="public_profile_contributions_contribution-title" data-tracking-will-navigate href="https://www.linkedin.com/advice/0/youre-aiming-excel-ai-careers-which-programming-uf6we"> You're aiming to excel in AI careers. Which programming languages are essential for your success? </a> </h4> <a class="hover:no-underline active:no-underline visited:no-underline focus:no-underline" data-tracking-control-name="public_profile_contributions_contribution-body" data-tracking-will-navigate href="https://www.linkedin.com/advice/0/youre-aiming-excel-ai-careers-which-programming-uf6we?contributionUrn=urn%3Ali%3Acomment%3A%28urn%3Ali%3AarticleSegment%3A%28urn%3Ali%3AlinkedInArticle%3A7242587359819501570%2C7242587387791294464%29%2C7257369758700863490%29"> <p class="text-md leading-regular my-1 line-clamp-3"> Regarding questions about Mojo, I think it's still too early to say what role it will play. As far as I know, it doesn’t support GPUs yet, which limits its real-world applicability for now. 

You're aiming to excel in AI careers. Which programming languages are essential for your success?
Regarding questions about Mojo, I think it's still too early to say what role it will play. As far as I know, it doesn’t support GPUs yet, which limits its real-world applicability for now. If you're interested in implementing GPU-targeted code and optimizations but don’t want to dive into full-blown CUDA, I’d say Triton is currently the best and most commonly used option. For instance, PyTorch uses it directly.
Sebastian Raschka, PhD contributed 5 months ago · 4 Upvotes
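
To give a feel for the Triton option mentioned above, here is the canonical vector-addition kernel as a minimal sketch; it assumes the triton package and a CUDA-capable GPU, and the block and tensor sizes are arbitrary. Triton kernels are written in Python but compiled to GPU code, which is also why PyTorch's compiler stack can target Triton for fused GPU kernels.

    # Minimal Triton kernel: element-wise addition of two vectors.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
        pid = tl.program_id(axis=0)                        # one program per block of elements
        offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n_elements                        # guard against out-of-bounds accesses
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    x = torch.randn(10_000, device="cuda")
    y = torch.randn(10_000, device="cuda")
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)                 # number of blocks to launch
    add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
    assert torch.allclose(out, x + y)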
</p> </a> <div class="contribution-list flex justify-between flex-col items-start gap-y-2"> <time class="contribution__time font-sans text-sm text-color-text-low-emphasis"> Sebastian Raschka, PhD contributed 5 months ago </time> <a class="insightful-button rounded-xl border-solid py-0.5 px-2 flex text-sm gap-1 hover:no-underline btn-secondary-emphasis" href="https://www.linkedin.com/advice/0/youre-aiming-excel-ai-careers-which-programming-uf6we?contributionUrn=urn%3Ali%3Acomment%3A%28urn%3Ali%3AarticleSegment%3A%28urn%3Ali%3AlinkedInArticle%3A7242587359819501570%2C7242587387791294464%29%2C7257195521079943169%29" data-tracking-control-name="public_profile_contributions_contribution-insightful-link" data-tracking-will-navigate> <img alt="Insightful" class="lazy-load w-2 h-2 bg-white rounded-full" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/byjfih68cazgghr6pv616zma1"> <span> 18 Upvotes</span> </a> </div> </div> </li> <li class="contributions__list-item-wrapper"> <div class="contribution-item my-2"> <h4 class="font-semibold text-lg leading-regular"> <a class="contribution-title-link text-color-text hover:text-color-text focus:text-color-text active:text-color-text" data-tracking-control-name="public_profile_contributions_contribution-title" data-tracking-will-navigate href="https://www.linkedin.com/advice/1/youre-aiming-advance-ai-expertise-how-can-regtf"> You're aiming to advance in AI expertise. How can you navigate the maze of machine learning career paths? </a> </h4> <a class="hover:no-underline active:no-underline visited:no-underline focus:no-underline" data-tracking-control-name="public_profile_contributions_contribution-body" data-tracking-will-navigate href="https://www.linkedin.com/advice/1/youre-aiming-advance-ai-expertise-how-can-regtf?contributionUrn=urn%3Ali%3Acomment%3A%28urn%3Ali%3AarticleSegment%3A%28urn%3Ali%3AlinkedInArticle%3A7229264269635125248%2C7229264271338086401%29%2C7247253897809981441%29"> <p class="text-md leading-regular my-1 line-clamp-3"> Mathematical foundations like linear algebra and calculus remain crucial, especially with modern architectures (e.g., transformer-based LLMs) relying heavily on them. However, AI in industry is largely applied, so computational skills are just as important. Learning PyTorch is essential if you want to work with the latest models, whether for language or vision tasks. There's also been a shift from single-GPU to multi-GPU and multi-node training, which extends beyond basic PyTorch concepts. To stay current, it's a good investment to start or keep up with evolving tools for scaling PyTorch (multi-GPU, multi-node, mixed-precision, quantization, etc.). 
</p> </a> <div class="contribution-list flex justify-between flex-col items-start gap-y-2"> <time class="contribution__time font-sans text-sm text-color-text-low-emphasis"> Sebastian Raschka, PhD contributed 6 months ago </time> <a class="insightful-button rounded-xl border-solid py-0.5 px-2 flex text-sm gap-1 hover:no-underline btn-secondary-emphasis" href="https://www.linkedin.com/advice/1/youre-aiming-advance-ai-expertise-how-can-regtf?contributionUrn=urn%3Ali%3Acomment%3A%28urn%3Ali%3AarticleSegment%3A%28urn%3Ali%3AlinkedInArticle%3A7229264269635125248%2C7229264271338086401%29%2C7247253897809981441%29" data-tracking-control-name="public_profile_contributions_contribution-insightful-link" data-tracking-will-navigate> <img alt="Insightful" class="lazy-load w-2 h-2 bg-white rounded-full" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/byjfih68cazgghr6pv616zma1"> <span> 18 Upvotes</span> </a> </div> </div> </li> <li class="contributions__list-item-wrapper"> <div class="contribution-item my-2"> <h4 class="font-semibold text-lg leading-regular"> <a class="contribution-title-link text-color-text hover:text-color-text focus:text-color-text active:text-color-text" data-tracking-control-name="public_profile_contributions_contribution-title" data-tracking-will-navigate href="https://www.linkedin.com/advice/3/youre-aiming-advance-ai-how-can-you-shape-iykaf"> You're aiming to advance in AI. How can you shape your education to match your career goals? </a> </h4> <a class="hover:no-underline active:no-underline visited:no-underline focus:no-underline" data-tracking-control-name="public_profile_contributions_contribution-body" data-tracking-will-navigate href="https://www.linkedin.com/advice/3/youre-aiming-advance-ai-how-can-you-shape-iykaf?contributionUrn=urn%3Ali%3Acomment%3A%28urn%3Ali%3AarticleSegment%3A%28urn%3Ali%3AlinkedInArticle%3A7231356064212897792%2C7231356066133929985%29%2C7237760941663133697%29"> <p class="text-md leading-regular my-1 line-clamp-3"> Even if you’re not working in a specific industry, I recommend focusing on a particular use case or application after mastering the fundamentals. For example, working on an end-to-end project (whether it's a bird detector for your garden or solutions for clients) is a great way to expand your expertise as you implement the entire pipeline and seek improvements, such as better algorithms, data curation, and performance optimizations. 
</p> </a> <div class="contribution-list flex justify-between flex-col items-start gap-y-2"> <time class="contribution__time font-sans text-sm text-color-text-low-emphasis"> Sebastian Raschka, PhD contributed 6 months ago </time> <a class="insightful-button rounded-xl border-solid py-0.5 px-2 flex text-sm gap-1 hover:no-underline btn-secondary-emphasis" href="https://www.linkedin.com/advice/3/youre-aiming-advance-ai-how-can-you-shape-iykaf?contributionUrn=urn%3Ali%3Acomment%3A%28urn%3Ali%3AarticleSegment%3A%28urn%3Ali%3AlinkedInArticle%3A7231356064212897792%2C7231356066133929985%29%2C7237760941663133697%29" data-tracking-control-name="public_profile_contributions_contribution-insightful-link" data-tracking-will-navigate> <img alt="Insightful" class="lazy-load w-2 h-2 bg-white rounded-full" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/byjfih68cazgghr6pv616zma1"> <span> 6 Upvotes</span> </a> </div> </div> </li> </ul> <button class="show-more-less-button show-more-less__button show-more-less__more-button !btn-tertiary !py-1.5" data-tracking-control-name="public_profile_contributions_show_more"> Show more <icon class="show-more-less-button-icon" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/cyolgscd0imw2ldqppkrb84vo"></icon></button> <div class="pt-1"> <a data-tracking-control-name="public_profile_contributions_see_all_contributions" data-tracking-will-navigate href="https://www.linkedin.com/signup/cold-join?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fin%2Fsebastianraschka%2Frecent-activity%2Fcomments%2F" class="show-more-less-button see-more-link show-more-less__less-button show-more-less__button--hide !text-color-link !border-color-link !shadow-none border-[1px] border-solid !py-1.5"> Join now to see all contributions </a> </div> </div> </div> </section> <section class="core-section-container core-section-container--with-border border-b-1 border-solid border-color-border-faint py-4 activities" data-nosnippet="true" data-section="posts"> <!----> <div class="flex justify-between"> <h2 class="core-section-container__title section-title"> Activity </h2> <button class="btn-sm btn-secondary cursor-pointer flex" data-tracking-control-name="public_profile_follow" data-modal="public_profile_activities-follow-modal" data-no-cool-off="true"> <img class="lazy-load" alt data-delayed-url="https://static.licdn.com/aero-v1/sc/h/9ewdf4qtu7uo5bvtrz3r6z5pn"> <span class="ml-1"> Follow </span> </button> </div> <div class="contextual-sign-in-modal activities_follow-modal" data-impression-id="public_profile_follow_contextual-sign-in-modal"> <!----> <div class> <!----> <div id="public_profile_activities-follow-modal" class="modal modal--contextual-sign-in" data-outlet="public_profile_activities-follow-modal"> <!----> <div class="modal__overlay flex items-center bg-color-background-scrim justify-center fixed bottom-0 left-0 right-0 top-0 opacity-0 invisible pointer-events-none z-[1000] transition-[opacity] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.17s] py-4 " aria-hidden="true"> <section aria-modal="true" role="dialog" aria-labelledby="public_profile_activities-follow-modal-modal-header" tabindex="-1" class="max-h-full modal__wrapper overflow-auto p-0 bg-color-surface max-w-[1128px] min-h-[160px] relative scale-[0.25] shadow-sm shadow-color-border-faint transition-[transform] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.33s] focus:outline-0 w-[1128px] mamabear:w-[744px] babybear:w-[360px] rounded-md"> <button class="modal__dismiss btn-tertiary h-[40px] w-[40px] p-0 
rounded-full indent-0 contextual-sign-in-modal__modal-dismiss absolute right-0 m-[20px] cursor-pointer" aria-label="Dismiss" data-tracking-control-name="public_profile_follow_contextual-sign-in-modal_modal_dismiss"> <icon class="contextual-sign-in-modal__modal-dismiss-icon" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/gs508lg3t2o81tq7pmcgn6m2"></icon> </button> <div class="modal__main w-full"> <div class="contextual-sign-in-modal__screen contextual-sign-in-modal__context-screen flex flex-col my-4 mx-3"> <img class="inline-block relative rounded-[50%] w-16 h-16 contextual-sign-in-modal__img m-auto" data-delayed-url="https://media.licdn.com/dms/image/v2/C4E03AQHVyH-IfD1KbQ/profile-displayphoto-shrink_200_200/profile-displayphoto-shrink_200_200/0/1629877234109?e=2147483647&v=beta&t=OTOiJBYLX5LT7qC2dke5DvYyo_GnlKJRKC45xXeY_7o" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" alt> <h2 class="contextual-sign-in-modal__context-screen-title font-sans text-xl text-color-text my-2 mx-4 text-center" id="public_profile_activities-follow-modal-modal-header"> Sign in to view Sebastian’s full profile </h2> <!----><!----> <div class="contextual-sign-in-modal__btn-container m-auto w-[320px] babybear:w-full"> <!----> <div class="w-full max-w-[400px] mx-auto"> <div class="google-auth-button"> <!----> <div class="google-auth-button__placeholder mx-auto " data-theme="filled_blue" data-logo-alignment="center" data-locale="en_US" role="button" aria-label="Continue with google"></div> <!----> </div> </div> <div class="sign-in-modal" data-impression-id="public_profile_follow_contextual-sign-in-modal_sign-in-modal"> <button class="sign-in-modal__outlet-btn cursor-pointer btn-md btn-primary btn-secondary" data-tracking-client-ingraph data-tracking-control-name="public_profile_follow_contextual-sign-in-modal_sign-in-modal_outlet-button" data-modal="_sign-in-modal"> <!----> Sign in </button> <div class> <!----> <div id="_sign-in-modal" class="modal modal--sign-in" data-outlet="_sign-in-modal"> <!----> <div class="modal__overlay flex items-center bg-color-background-scrim justify-center fixed bottom-0 left-0 right-0 top-0 opacity-0 invisible pointer-events-none z-[1000] transition-[opacity] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.17s] py-4 " aria-hidden="true"> <section aria-modal="true" role="dialog" aria-labelledby="_sign-in-modal-modal-header" tabindex="-1" class="max-h-full modal__wrapper overflow-auto p-0 bg-color-surface max-w-[1128px] min-h-[160px] relative scale-[0.25] shadow-sm shadow-color-border-faint transition-[transform] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.33s] focus:outline-0 w-[1128px] mamabear:w-[744px] babybear:w-[360px] rounded-md"> <button class="modal__dismiss btn-tertiary h-[40px] w-[40px] p-0 rounded-full indent-0 sign-in-modal__dismiss absolute right-0 cursor-pointer m-[20px]" aria-label="Dismiss" data-tracking-control-name="public_profile_follow_contextual-sign-in-modal_sign-in-modal_dismiss"> <icon class="sign-in-modal__dismiss-icon" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/gs508lg3t2o81tq7pmcgn6m2"></icon> </button> <div class="modal__main w-full"> <div class="sign-in-modal__screen flex flex-col py-4 w-[513px] babybear:w-full px-3"> <h2 class="sign-in-modal__header font-sans text-display-md text-color-text "> Welcome back </h2> <code id="i18n_sign_in_form_show_text" style="display: none"><!--"Show"--></code> <code id="i18n_sign_in_form_show_label" 
style="display: none"><!--"Show your LinkedIn password"--></code> <code id="i18n_sign_in_form_hide_text" style="display: none"><!--"Hide"--></code> <code id="i18n_sign_in_form_hide_label" style="display: none"><!--"Hide your LinkedIn password"--></code> <code id="i18n_username_error_empty" style="display: none"><!--"Please enter an email address or phone number"--></code> <code id="i18n_username_error_too_long" style="display: none"><!--"Email or phone number must be between 3 to 128 characters"--></code> <code id="i18n_username_error_too_short" style="display: none"><!--"Email or phone number must be between 3 to 128 characters"--></code> <code id="i18n_password_error_empty" style="display: none"><!--"Please enter a password"--></code> <code id="i18n_password_error_too_short" style="display: none"><!--"The password you provided must have at least 6 characters"--></code> <code id="i18n_password_error_too_long" style="display: none"><!--"The password you provided must have at most 400 characters"--></code> <!----> <form data-id="sign-in-form" action="https://www.linkedin.com/uas/login-submit" method="post" novalidate class="mt-1.5 mb-2"> <input name="loginCsrfParam" value="bfb3b82d-2e82-4a0c-8244-5b7e6275c22d" type="hidden"> <div class="flex flex-col"> <div class="mt-1.5" data-js-module-id="guest-input"> <div class="flex flex-col"> <label class="input-label mb-1" for="_sign-in-modal_session_key"> Email or phone </label> <div class="text-input flex"> <input class="text-color-text font-sans text-md outline-0 bg-color-transparent w-full" autocomplete="username" id="_sign-in-modal_session_key" name="session_key" required data-tracking-control-name="public_profile_follow_contextual-sign-in-modal_sign-in-modal_sign-in-session-key" data-tracking-client-ingraph type="text"> </div> </div> <p class="input-helper mt-1.5" for="_sign-in-modal_session_key" role="alert" data-js-module-id="guest-input__message"></p> </div> <div class="mt-1.5" data-js-module-id="guest-input"> <div class="flex flex-col"> <label class="input-label mb-1" for="_sign-in-modal_session_password"> Password </label> <div class="text-input flex"> <input class="text-color-text font-sans text-md outline-0 bg-color-transparent w-full" autocomplete="current-password" id="_sign-in-modal_session_password" name="session_password" required data-tracking-control-name="public_profile_follow_contextual-sign-in-modal_sign-in-modal_sign-in-password" data-tracking-client-ingraph type="password"> <button aria-live="assertive" aria-relevant="text" data-id="sign-in-form__password-visibility-toggle" class="font-sans text-md font-bold text-color-action z-10 ml-[12px] hover:cursor-pointer" aria-label="Show your LinkedIn password" data-tracking-control-name="public_profile_follow_contextual-sign-in-modal_sign-in-modal_sign-in-password-visibility-toggle-btn" type="button">Show</button> </div> </div> <p class="input-helper mt-1.5" for="_sign-in-modal_session_password" role="alert" data-js-module-id="guest-input__message"></p> </div> <input name="session_redirect" value="https://www.linkedin.com/in/sebastianraschka" type="hidden"> <!----> </div> <div data-id="sign-in-form__footer" class="flex justify-between sign-in-form__footer--full-width"> <a data-id="sign-in-form__forgot-password" class="font-sans text-md font-bold link leading-regular sign-in-form__forgot-password--full-width" href="https://www.linkedin.com/uas/request-password-reset?trk=public_profile_follow_contextual-sign-in-modal_sign-in-modal_forgot_password" 
data-tracking-control-name="public_profile_follow_contextual-sign-in-modal_sign-in-modal_forgot_password" data-tracking-will-navigate>Forgot password?</a> <!----> <input name="trk" value="public_profile_follow_contextual-sign-in-modal_sign-in-modal_sign-in-submit" type="hidden"> <button class="btn-md btn-primary flex-shrink-0 cursor-pointer sign-in-form__submit-btn--full-width" data-id="sign-in-form__submit-btn" data-tracking-control-name="public_profile_follow_contextual-sign-in-modal_sign-in-modal_sign-in-submit-btn" data-tracking-client-ingraph data-tracking-litms type="submit"> Sign in </button> </div> <div class="sign-in-form__divider left-right-divider pt-2 pb-3"> <p class="sign-in-form__divider-text font-sans text-sm text-color-text px-2"> or </p> </div> </form> <div class="w-full max-w-[400px] mx-auto"> <div class="google-auth-button" data-tracking-control-name="public_profile_follow_contextual-sign-in-modal_sign-in-modal_google-auth-button" data-tracking-client-ingraph> <p class="linkedin-tc__text text-color-text-low-emphasis text-xs pb-2" data-impression-id="public_profile_follow_contextual-sign-in-modal_sign-in-modal__button-skip-tc-text"> By clicking Continue to join or sign in, you agree to LinkedIn’s <a href="/legal/user-agreement?trk=public_profile_follow_contextual-sign-in-modal_sign-in-modal_auth-button_user-agreement" target="_blank" data-tracking-control-name="public_profile_follow_contextual-sign-in-modal_sign-in-modal_auth-button_user-agreement" data-tracking-will-navigate="true">User Agreement</a>, <a href="/legal/privacy-policy?trk=public_profile_follow_contextual-sign-in-modal_sign-in-modal_auth-button_privacy-policy" target="_blank" data-tracking-control-name="public_profile_follow_contextual-sign-in-modal_sign-in-modal_auth-button_privacy-policy" data-tracking-will-navigate="true">Privacy Policy</a>, and <a href="/legal/cookie-policy?trk=public_profile_follow_contextual-sign-in-modal_sign-in-modal_auth-button_cookie-policy" target="_blank" data-tracking-control-name="public_profile_follow_contextual-sign-in-modal_sign-in-modal_auth-button_cookie-policy" data-tracking-will-navigate="true">Cookie Policy</a>. </p> <div class="google-auth-button__placeholder mx-auto google-auth-button__placeholder--black-border" data-theme="outline" data-logo-alignment="center" data-locale="en_US" role="button" aria-label="Continue with google" data-safe-to-skip-tnc-redirect></div> <!----> </div> </div> <!----> <p class="sign-in-modal__join-now m-auto font-sans text-md text-color-text mt-2"> New to LinkedIn? <a href="https://www.linkedin.com/signup/public-profile-join?vieweeVanityName=sebastianraschka&trk=public_profile_follow_contextual-sign-in-modal_sign-in-modal_join-link" data-tracking-control-name="public_profile_follow_contextual-sign-in-modal_sign-in-modal_join-link" data-tracking-will-navigate="true" class="sign-in-modal__join-link">Join now</a> </p> </div> </div> <!----> </section> </div> </div> </div> </div> <div class="contextual-sign-in-modal__divider left-right-divider"> <p class="contextual-sign-in-modal__divider-text font-sans text-sm text-color-text px-2"> or </p> </div> </div> <p class="contextual-sign-in-modal__join-now m-auto font-sans text-md text-color-text my-1"> New to LinkedIn? 
<a href="https://www.linkedin.com/signup/public-profile-join?vieweeVanityName=sebastianraschka&trk=public_profile_follow_contextual-sign-in-modal_join-link" data-tracking-control-name="public_profile_follow_contextual-sign-in-modal_join-link" data-tracking-will-navigate="true" class="contextual-sign-in-modal__join-link">Join now</a> </p> <p class="linkedin-tc__text text-color-text-low-emphasis text-xs pb-2 contextual-sign-in-modal__terms-and-conditions m-auto w-[320px] pt-2 babybear:w-full" data-impression-id="linkedin-tc__button-skip-tc-text"> By clicking Continue to join or sign in, you agree to LinkedIn’s <a href="/legal/user-agreement?trk=linkedin-tc_auth-button_user-agreement" target="_blank" data-tracking-control-name="linkedin-tc_auth-button_user-agreement" data-tracking-will-navigate="true">User Agreement</a>, <a href="/legal/privacy-policy?trk=linkedin-tc_auth-button_privacy-policy" target="_blank" data-tracking-control-name="linkedin-tc_auth-button_privacy-policy" data-tracking-will-navigate="true">Privacy Policy</a>, and <a href="/legal/cookie-policy?trk=linkedin-tc_auth-button_cookie-policy" target="_blank" data-tracking-control-name="linkedin-tc_auth-button_cookie-policy" data-tracking-will-navigate="true">Cookie Policy</a>. </p> </div> </div> <!----> </section> </div> </div> </div> </div> <!----> <div class="core-section-container__content break-words"> <ul data-test-id="activities__list"> <li> <div class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-main-card flex flex-wrap py-2 pr-2 babybear:pr-0 base-main-card--link main-activity-card"> <a class="base-card__full-link absolute top-0 right-0 bottom-0 left-0 p-0 z-[2]" href="https://www.linkedin.com/posts/sidharthmahotra_llm-machinelearning-pytorch-activity-7312594550332538880-lXyZ" data-tracking-control-name="public_profile" data-tracking-will-navigate> <span class="sr-only"> Sharing this amazing resource: 𝑨 𝒍𝒍𝒂𝒎𝒂-3.2-𝒇𝒓𝒐𝒎-𝒔𝒄𝒓𝒂𝒕𝒄𝒉 by my favourite author Sebastian Raschka, PhD showing implementation of… </span> </a> <div class="base-main-card__media relative w-[228px] block overflow-hidden flex-shrink-0 rounded-md h-[134px] babybear:w-full babybear:h-auto babybear:min-h-[134px] babybear:max-h-[250px]"> <img class="main-activity-card__img h-full " data-delayed-url="https://media.licdn.com/dms/image/v2/D5622AQEDOkkP1vYXRA/feedshare-shrink_800/B56ZXtNHCqHoAo-/0/1743441394112?e=2147483647&v=beta&t=20Pw9QeM2lFAId8PPqlP61uaYy8fcIWZJWJx7z6IEbQ" alt> <!----> </div> <div class="base-main-card__info self-center ml-1 flex-1 relative break-words papabear:min-w-0 mamabear:min-w-0 babybear:w-full "> <!----> <h3 class="base-main-card__title font-sans text-[18px] font-bold text-color-text overflow-hidden "> Sharing this amazing resource: 𝑨 𝒍𝒍𝒂𝒎𝒂-3.2-𝒇𝒓𝒐𝒎-𝒔𝒄𝒓𝒂𝒕𝒄𝒉 by my favourite author Sebastian Raschka, PhD showing implementation of… </h3> <h4 class="base-main-card__subtitle body-text text-color-text overflow-hidden "> Liked by <a href="https://www.linkedin.com/in/sebastianraschka?trk=public_profile_actor-name" data-tracking-control-name="public_profile_actor-name" data-tracking-will-navigate="true" class="hidden-nested-link">Sebastian Raschka, PhD</a> </h4> <!----> <!----><!----> </div> <!----> </div> </li> </ul> <!----> </div> </section> <section class="core-section-container core-section-container--with-border border-b-1 border-solid border-color-border-faint py-4 pp-section experience" data-section="experience"> <!----> <h2 class="core-section-container__title section-title"> Experience </h2> 
<!----> <div class="core-section-container__content break-words"> <ul class="experience__list"> <li class="profile-section-card relative flex w-full list-none py-1.5 pr-2 pl-1 experience-item" data-section="currentPositionsDetails"> <!----> <a class="profile-section-card__image-link" href="https://www.linkedin.com/company/rair-lab?trk=public_profile_experience-item_profile-section-card_image-click" data-tracking-control-name="public_profile_experience-item_profile-section-card_image-click" data-tracking-will-navigate> <img class="inline-block relative w-6 h-6 shrink-0 mr-0.5 border-4 border-color-transparent border-solid box-content rounded-[6px] profile-section-card__image" data-delayed-url="https://media.licdn.com/dms/image/v2/D560BAQHFKhONBX4BQw/company-logo_100_100/B56ZUg2rekHEAQ-/0/1740012960464/raschka_ai_research_rair_lab_logo?e=2147483647&v=beta&t=gqfxBsMwrR25krp4APIACSqzXNnzRq6z66wLduqk33g" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/cs8pjfgyw96g44ln9r7tct85f" alt="RAIR Lab Graphic"> </a> <div class="pl-0.5 grow break-words"> <h3 class="[&>*]:mb-0 text-[18px] text-color-text leading-regular group-hover:underline font-semibold"> <span class="experience-item__title"> Founder | Principal AI & LLM Research Engineer </span> </h3> <h4 class="text-color-text text-md [&>*]:mb-0 not-first-middot leading-[1.75]"> <a class="relative hover:underline link-styled hover:!text-color-text active:!text-color-text !font-normal" href="https://www.linkedin.com/company/rair-lab?trk=public_profile_experience-item_profile-section-card_subtitle-click" data-tracking-control-name="public_profile_experience-item_profile-section-card_subtitle-click" data-tracking-client-ingraph data-tracking-will-navigate> <span class="experience-item__subtitle"> RAIR Lab </span> </a> </h4> <div class="text-color-text-low-emphasis text-md [&>*]:mb-0 [&>*]:text-md [&>*]:text-color-text-low-emphasis"> <p class="experience-item__meta-item"> <span class="date-range text-color-text-secondary font-sans text-md leading-open font-regular"> <time>Jan 2025</time> - Present <span class="before:middot">4 months</span> </span> </p> <p class="experience-item__meta-item"> United States </p> <div class="experience-item__meta-item" data-section="currentPositions"> <div class="show-more-less-text"> <p class="show-more-less-text__text--less"> Working on AI research and LLM engineering with a practical and code-driven focus. 
University of Wisconsin-Madison (7 years)

Advisory Board Member
University of Wisconsin-Madison
Feb 2023 - Present (2 years 3 months)
data-section="currentPositions"> <div class="show-more-less-text"> <p class="show-more-less-text__text--less"> Advisory board member for the Open Source Program Office at the UW–Madison Data Science Institute <!----> </p> <!----> </div> </div> </div> </div> </li> <li class="profile-section-card relative flex w-full list-none py-1.5 pr-2 pl-1 experience-group-position py-0" data-section="currentPositionsDetails"> <!----> <a class="profile-section-card__image-link" href="https://www.linkedin.com/school/uwmadison/?trk=public_profile_experience-item_profile-section-card_image-click" data-tracking-control-name="public_profile_experience-item_profile-section-card_image-click" data-tracking-will-navigate> <img class="inline-block relative w-6 h-6 shrink-0 mr-0.5 border-4 border-color-transparent border-solid box-content rounded-[6px] profile-section-card__image" data-delayed-url="https://media.licdn.com/dms/image/v2/C560BAQFyYW54n5mwAw/company-logo_100_100/company-logo_100_100/0/1630601304331/university_of_wisconsin_madison_logo?e=2147483647&v=beta&t=yXQHCZ5iL7OeeAfCfYqLmpIshlgcs15N8QyTTzH08iQ" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/cs8pjfgyw96g44ln9r7tct85f" alt="University of Wisconsin-Madison Graphic"> </a> <div class="pl-0.5 grow break-words"> <h3 class="[&>*]:mb-0 text-[18px] text-color-text leading-regular group-hover:underline font-semibold"> <span class="experience-item__title"> Assistant Professor of Statistics </span> </h3> <h4 class="text-color-text text-md [&>*]:mb-0 not-first-middot leading-[1.75]"> <a class="relative hover:underline link-styled hover:!text-color-text active:!text-color-text !font-normal" href="https://www.linkedin.com/school/uwmadison/?trk=public_profile_experience-item_profile-section-card_subtitle-click" data-tracking-control-name="public_profile_experience-item_profile-section-card_subtitle-click" data-tracking-client-ingraph data-tracking-will-navigate> <span class="experience-item__subtitle"> University of Wisconsin-Madison </span> </a> </h4> <div class="text-color-text-low-emphasis text-md [&>*]:mb-0 [&>*]:text-md [&>*]:text-color-text-low-emphasis"> <p class="experience-item__meta-item"> <span class="date-range text-color-text-secondary font-sans text-md leading-open font-regular"> <time>2018</time> - <time>Jan 2023</time> <span class="before:middot">5 years</span> </span> </p> <p class="experience-item__meta-item"> Madison, Wisconsin Area </p> <div class="experience-item__meta-item" data-section="currentPositions"> <div class="show-more-less-text"> <p class="show-more-less-text__text--less"> Assistant Professor of Statistics (tenure track 2018-2025) focusing on machine learning and deep learning methods development. 
Focused on model evaluation, ordinal regression with deep neural networks, graph neural networks, and applications of predictive modeling in biomedicine and biology.
On leave since January 2022 to focus on start-up experience.
You can find a list of my research publications on Google Scholar: https://scholar.google.com/citations?user=X4RCC0IAAAAJ&hl=en

Lightning AI (3 years 2 months)

Senior Staff Research Engineer
Lightning AI
[&>*]:text-color-text-low-emphasis"> <p class="experience-item__meta-item"> <span class="date-range text-color-text-secondary font-sans text-md leading-open font-regular"> <time>Oct 2024</time> - <time>Feb 2025</time> <span class="before:middot">5 months</span> </span> </p> <p class="experience-item__meta-item"> New York City Metropolitan Area </p> <div class="experience-item__meta-item" data-section="currentPositions"> <div class="show-more-less-text"> <p class="show-more-less-text__text--less"> Sebastian is a Senior Staff Research Engineer implementing, training, and finetuning large language models (LLMs). <!----> </p> <!----> </div> </div> </div> </div> </li> <li class="profile-section-card relative flex w-full list-none py-1.5 pr-2 pl-1 experience-group-position py-0" data-section="currentPositionsDetails"> <!----> <a class="profile-section-card__image-link" href="https://www.linkedin.com/company/pytorch-lightning?trk=public_profile_experience-item_profile-section-card_image-click" data-tracking-control-name="public_profile_experience-item_profile-section-card_image-click" data-tracking-will-navigate> <img class="inline-block relative w-6 h-6 shrink-0 mr-0.5 border-4 border-color-transparent border-solid box-content rounded-[6px] profile-section-card__image" data-delayed-url="https://media.licdn.com/dms/image/v2/D4E0BAQGrZH40MJLAaw/company-logo_100_100/company-logo_100_100/0/1702484831807/pytorch_lightning_logo?e=2147483647&v=beta&t=RuHLKDzmeGnGfEWs76KYX42o_ARa5zkQ1pnDwQFkpBI" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/cs8pjfgyw96g44ln9r7tct85f" alt="Lightning AI Graphic"> </a> <div class="pl-0.5 grow break-words"> <h3 class="[&>*]:mb-0 text-[18px] text-color-text leading-regular group-hover:underline font-semibold"> <span class="experience-item__title"> Staff Research Engineer </span> </h3> <h4 class="text-color-text text-md [&>*]:mb-0 not-first-middot leading-[1.75]"> <a class="relative hover:underline link-styled hover:!text-color-text active:!text-color-text !font-normal" href="https://www.linkedin.com/company/pytorch-lightning?trk=public_profile_experience-item_profile-section-card_subtitle-click" data-tracking-control-name="public_profile_experience-item_profile-section-card_subtitle-click" data-tracking-client-ingraph data-tracking-will-navigate> <span class="experience-item__subtitle"> Lightning AI </span> </a> </h4> <div class="text-color-text-low-emphasis text-md [&>*]:mb-0 [&>*]:text-md [&>*]:text-color-text-low-emphasis"> <p class="experience-item__meta-item"> <span class="date-range text-color-text-secondary font-sans text-md leading-open font-regular"> <time>Feb 2024</time> - <time>Nov 2024</time> <span class="before:middot">10 months</span> </span> </p> <p class="experience-item__meta-item"> New York City Metropolitan Area </p> <div class="experience-item__meta-item" data-section="currentPositions"> <div class="show-more-less-text"> <p class="show-more-less-text__text--less"> Sebastian is a Staff Research Engineer focusing on the intersection of AI research, software development, and large language models (LLMs). 
Staff AI Educator
Lightning AI
Sep 2023 - Feb 2024 (6 months)
New York City Metropolitan Area
Sebastian is an AI researcher and educator who works on developing large language models (LLMs).
*Promoted from Lead to Staff AI Educator.
Lead AI Educator
Lightning AI
Jan 2022 - Sep 2023 (1 year 9 months)
New York City Metropolitan Area
Sebastian specializes in Deep Learning & AI research and develops materials to help people utilize AI & deep learning at scale.
Education

Michigan State University
Doctor of Philosophy (PhD)
2012 - 2017
PhD thesis: Uncovering Hidden Patterns of Molecular Recognition
Use of statistical data mining approaches and development of new software protocols to uncover novel patterns of preferential interactions between proteins and their small molecule binding partners (the Protein Recognition Index and Screenlamp data mining toolkits), for use in enhancing the design of proteins and small molecules for protein engineering and drug discovery.
...
Heinrich-Heine-Universität Düsseldorf
Bachelor of Science (BSc)
2008 - 2012

Gymnasium Rheinkamp Europaschule Moers
High School Degree
1998 - 2007

Skills
class="core-section-container__content break-words"> <div class="show-more-less"> <!----> <ul data-max-num-to-show="12" class="show-more-less__list show-more-less__list--hide-after-12" data-impression-id="public_profile_skills_show-more-less"> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/search?keywords=Open-Source+Development&trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Open-Source Development </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/search?keywords=Large-scale+Data+Analysis&trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Large-scale Data Analysis </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/search?keywords=Large+Language+Models+%28LLM%29&trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Large Language Models (LLM) </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/search?keywords=Natural+Language+Processing+%28NLP%29&trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Natural Language Processing (NLP) </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/search?keywords=PyTorch&trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> PyTorch </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/search?keywords=GPU+Computing&trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> GPU Computing </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/search?keywords=Open-Source+Software&trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Open-Source Software </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/search?keywords=Educational+Leadership&trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Educational Leadership </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/search?keywords=Artificial+Intelligence+%28AI%29&trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Artificial Intelligence (AI) </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/topics/python?trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Python (Programming Language) </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/topics/machine-learning?trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Machine Learning </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/topics/python?trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Python </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/search?keywords=Deep+Learning&trk=public_profile_skill" 
data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Deep Learning </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/topics/statistics?trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Statistics </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/search?keywords=Data+Mining&trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Data Mining </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/topics/data-analysis?trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Data Analysis </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/search?keywords=Data+Science&trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Data Science </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/search?keywords=Programming&trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Programming </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/search?keywords=Bioinformatics&trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Bioinformatics </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/search?keywords=Molecular+Biology&trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Molecular Biology </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/search?keywords=Biochemistry&trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Biochemistry </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/topics/data-visualization?trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Data Visualization </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/search?keywords=Research&trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Research </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/topics/version-control?trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Version Control </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/topics/r?trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> R </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/topics/linux-2?trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Linux </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/topics/java?trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Java </a> </li> <li class="skills__item skills__item--link"> <a 
href="https://www.linkedin.com/learning/topics/javascript?trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> JavaScript </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/topics/4-45?trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> SQLite </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/topics/c-plus-plus?trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> C++ </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/topics/github?trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Github </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/topics/git?trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Git </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/search?keywords=Probability+Theory&trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Probability Theory </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/search?keywords=Pattern+Recognition&trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Pattern Recognition </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/search?keywords=High+Performance+Computing&trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> High Performance Computing </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/search?keywords=Protein+Structure&trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Protein Structure </a> </li> <li class="skills__item skills__item--link"> <a href="https://www.linkedin.com/learning/search?keywords=Scientific+Computing&trk=public_profile_skill" data-tracking-control-name="public_profile_skill" data-tracking-will-navigate> Scientific Computing </a> </li> </ul> <button class="show-more-less__button show-more-less__more-button show-more-less-button " aria-expanded="false" data-tracking-control-name="public_profile_skills_show_more"> Show 25 more skills <icon class="show-more-less__button--chevron show-more-less-button-icon" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/cyolgscd0imw2ldqppkrb84vo"></icon> </button> <button class="show-more-less__button show-more-less__less-button show-more-less-button show-more-less__button--hide" aria-expanded="false" data-tracking-control-name="public_profile_skills_show_more"> Show fewer skills <icon class="show-more-less__button--chevron show-more-less-button-icon" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/4chtt12k98xwnba1nimld2oyg"></icon> </button> </div> </div> </section> <section class="core-section-container core-section-container--with-border border-b-1 border-solid border-color-border-faint py-4 publications" data-section="publications"> <!----> <h2 class="core-section-container__title section-title"> Publications </h2> <!----> <div class="core-section-container__content break-words"> <ul> <li class="profile-section-card relative 
flex w-full list-none py-1.5 pr-2 pl-1 personal-project"> <!----><!----> <div class="pl-0.5 grow break-words"> <h3 class="[&>*]:mb-0 text-[18px] text-color-text leading-regular group-hover:underline font-semibold"> <a class="text-color-text text-[18px] link-styled link-no-visited-state hover:!text-color-text active:!text-color-text" data-tracking-control-name="public_profile_publication-title" data-tracking-will-navigate href="https://www.linkedin.com/redir/redirect?url=https%3A%2F%2Fscholar%2Egoogle%2Ecom%2Fcitations%3Fuser%3DX4RCC0IAAAAJ%26hl%3Den&urlhash=b1RK&trk=public_profile_publication-title" title="For a complete list of publications, please see my Google Scholar profile (link below)" rel="nofollow"> For a complete list of publications, please see my Google Scholar profile (link below) </a> </h3> <h4 class="text-color-text text-md [&>*]:mb-0">-</h4> <div class="text-color-text-low-emphasis text-md [&>*]:mb-0 [&>*]:text-md [&>*]:text-color-text-low-emphasis"> <!----><!----> <a class="rounded-sm min-h-[32px] mt-2 -ml-1 px-1 py-0 btn-md btn-tertiary-emphasis link-no-visited-state inline-flex items-center" data-tracking-control-name="public_profile_publication-button" data-tracking-will-navigate href="https://www.linkedin.com/redir/redirect?url=https%3A%2F%2Fscholar%2Egoogle%2Ecom%2Fcitations%3Fuser%3DX4RCC0IAAAAJ%26hl%3Den&urlhash=b1RK&trk=public_profile_publication-button" rel="nofollow" target="_blank" title="For a complete list of publications, please see my Google Scholar profile (link below)"> See publication <img alt class="h-2 w-2 ml-0.5 align-baseline" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/8w0vew433o9nluoruq9k5eqy" data-tracking-will-navigate> </a> </div> </div> </li> </ul> <!----> </div> </section> <!----> <!----> <!----> <section class="core-section-container core-section-container--with-border border-b-1 border-solid border-color-border-faint py-4 honors-and-awards" data-section="honors-and-awards"> <!----> <h2 class="core-section-container__title section-title"> Honors & Awards </h2> <!----> <div class="core-section-container__content break-words"> <ul class="awards__list"> <li class="profile-section-card relative flex w-full list-none py-1.5 pr-2 pl-1"> <!----><!----> <div class="pl-0.5 grow break-words"> <h3 class="[&>*]:mb-0 text-[18px] text-color-text leading-regular group-hover:underline font-semibold"> Best Paper Award at ICB 2018 (Semi-Adversarial Neural Networks paper) </h3> <h4 class="text-color-text text-md [&>*]:mb-0 not-first-middot leading-[1.75]"> International Conference on Biometrics (ICB 2018) </h4> <div class="text-color-text-low-emphasis text-md [&>*]:mb-0 [&>*]:text-md [&>*]:text-color-text-low-emphasis"> <span class="date-range text-color-text-secondary font-sans text-md leading-open font-regular"> <time> Feb 2018 </time> </span> <!----> </div> </div> </li> <li class="profile-section-card relative flex w-full list-none py-1.5 pr-2 pl-1"> <!----><!----> <div class="pl-0.5 grow break-words"> <h3 class="[&>*]:mb-0 text-[18px] text-color-text leading-regular group-hover:underline font-semibold"> ACM Computing Reviews’ Best of 2016 </h3> <h4 class="text-color-text text-md [&>*]:mb-0 not-first-middot leading-[1.75]"> ACM Computing Reviews </h4> <div class="text-color-text-low-emphasis text-md [&>*]:mb-0 [&>*]:text-md [&>*]:text-color-text-low-emphasis"> <span class="date-range text-color-text-secondary font-sans text-md leading-open font-regular"> <time> Dec 2016 </time> </span> <!----> </div> </div> </li> <li class="profile-section-card 
relative flex w-full list-none py-1.5 pr-2 pl-1"> <!----><!----> <div class="pl-0.5 grow break-words"> <h3 class="[&>*]:mb-0 text-[18px] text-color-text leading-regular group-hover:underline font-semibold"> Outstanding Graduate Student Award </h3> <h4 class="text-color-text text-md [&>*]:mb-0 not-first-middot leading-[1.75]"> BMB Department at MSU </h4> <div class="text-color-text-low-emphasis text-md [&>*]:mb-0 [&>*]:text-md [&>*]:text-color-text-low-emphasis"> <span class="date-range text-color-text-secondary font-sans text-md leading-open font-regular"> <time> Oct 2016 </time> </span> <!----> </div> </div> </li> </ul> </div> </section> <!----> <section class="core-section-container core-section-container--with-border border-b-1 border-solid border-color-border-faint py-4 languages" data-section="languages"> <!----> <h2 class="core-section-container__title section-title"> Languages </h2> <!----> <div class="core-section-container__content break-words"> <ul> <li class="profile-section-card relative flex w-full list-none py-1.5 pr-2 pl-1"> <!----><!----> <div class="pl-0.5 grow break-words"> <h3 class="[&>*]:mb-0 text-[18px] text-color-text leading-regular group-hover:underline font-semibold"> German </h3> <h4 class="text-color-text text-md [&>*]:mb-0">-</h4> <div class="text-color-text-low-emphasis text-md [&>*]:mb-0 [&>*]:text-md [&>*]:text-color-text-low-emphasis"> </div> </div> </li> <li class="profile-section-card relative flex w-full list-none py-1.5 pr-2 pl-1"> <!----><!----> <div class="pl-0.5 grow break-words"> <h3 class="[&>*]:mb-0 text-[18px] text-color-text leading-regular group-hover:underline font-semibold"> English </h3> <h4 class="text-color-text text-md [&>*]:mb-0">-</h4> <div class="text-color-text-low-emphasis text-md [&>*]:mb-0 [&>*]:text-md [&>*]:text-color-text-low-emphasis"> </div> </div> </li> </ul> </div> </section> <!----> <!----> <!----> <!----> <section class="core-section-container my-3 bottom-cta-banner"> <!----> <!----> <!----> <div class="core-section-container__content break-words"> <section class="hidden-summary container-lined p-3 overflow-hidden babybear:p-2" data-impression-id="public_profile_bottom-cta-banner_guest_hidden_summary"> <h2 class="hidden-summary__title text-xl text-color-text overflow-hidden break-words mb-2 leading-regular font-normal"> View Sebastian’s full profile </h2> <ul class="hidden-summary__summary-items"> <li class="hidden-summary__summary-item flex text-md text-color-text font-normal leading-open items-center mb-1.5 last:mb-0"> <div class="hidden-summary__summary-item-icon-container flex items-center justify-center shrink-0 mr-1"> <icon class="hidden-summary__summary-item-icon h-2 w-2 " alt data-delayed-url="https://static.licdn.com/aero-v1/sc/h/au8rc359lanmyfaah39izyss1"></icon> </div> <span class="hidden-summary__summary-item-text overflow-hidden break-words"> See who you know in common </span> </li> <li class="hidden-summary__summary-item flex text-md text-color-text font-normal leading-open items-center mb-1.5 last:mb-0"> <div class="hidden-summary__summary-item-icon-container flex items-center justify-center shrink-0 mr-1"> <icon class="hidden-summary__summary-item-icon h-2 w-2 " alt data-delayed-url="https://static.licdn.com/aero-v1/sc/h/bk9h057z1lch588recizysfdc"></icon> </div> <span class="hidden-summary__summary-item-text overflow-hidden break-words"> Get introduced </span> </li> <li class="hidden-summary__summary-item flex text-md text-color-text font-normal leading-open items-center mb-1.5 last:mb-0"> <div 
class="hidden-summary__summary-item-icon-container flex items-center justify-center shrink-0 mr-1"> <icon class="hidden-summary__summary-item-icon h-2 w-2 " alt data-delayed-url="https://static.licdn.com/aero-v1/sc/h/engl6kavv3716laqjpfbilqqt"></icon> </div> <span class="hidden-summary__summary-item-text overflow-hidden break-words"> Contact Sebastian directly </span> </li> </ul> <a class="hidden-summary__cta hidden-summary__cta--secondary btn-sm !text-[16px] btn-secondary-emphasis inline-block mt-3 mr-1.5" href="https://www.linkedin.com/signup/public-profile-join?vieweeVanityName=sebastianraschka&trk=public_profile_bottom-cta-banner" data-tracking-control-name="public_profile_bottom-cta-banner" data-tracking-will-navigate> Join to view full profile </a> </section> </div> </section> </section> </div> </section> <section class="right-rail papabear:w-right-rail-width papabear:ml-column-gutter mamabear:max-w-[790px] mamabear:px-mobile-container-padding babybear:max-w-[790px] babybear:px-mobile-container-padding"> <!----> <section class="aside-section-container mb-4 browsemap" data-nosnippet="true"> <h2 class="aside-section-container__title section-title"> Other similar profiles </h2> <!----> <div class="aside-section-container__content break-words"> <ul class="show-more-less__list show-more-less__list--no-hidden-elems aside-profiles-list" data-impression-id="public_profile_browsemap_show-more-less"> <li> <!----> <a href="https://www.linkedin.com/in/semhar-michael-87694451?trk=public_profile_browsemap-profile" target="_self" data-impression-id="public_profile_browsemap-0" data-tracking-control-name="public_profile_browsemap-profile" data-tracking-will-navigate class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-aside-card flex my-1.5 base-aside-card--link"> <!----> <!----> <div class="bg-clip-content bg-contain border-2 border-color-container-neutral-border border-solid box-border h-[56px] rounded-[49.9%] shrink-0 w-[56px] " data-delayed-url="https://media.licdn.com/dms/image/v2/D5603AQGLZoHr26TEKQ/profile-displayphoto-shrink_400_400/profile-displayphoto-shrink_400_400/0/1702067682657?e=2147483647&v=beta&t=epyv0M-Kkqn6y6pV5_kddKvbPT1jxOLQnvQY3pcZiIY" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" data-delayed-background data-impression-id="public_profile_browsemap-0"></div> <div class="base-aside-card__info self-center pl-0.5 flex flex-col flex-1"> <h3 class="base-aside-card__title font-sans text-md font-bold text-color-text relative"> Semhar Michael <!----> </h3> <!----> <!----> <!----> <div class="base-aside-card__metadata font-sans text-sm leading-open font-regular text-color-text-low-emphasis mt-0.5"> Brookings, SD<!----> </div> <!----> </div> <!----> </a> <a href="https://www.linkedin.com/in/semhar-michael-87694451?trk=public_profile_browsemap_browse-map_connect-button" aria-label="Connect with Semhar Michael" class="relative -top-0.5 btn-sm btn-secondary inline-flex items-center ml-[52px]" data-tracking-client-ingraph data-tracking-control-name="public_profile_browsemap_browse-map_connect-button" data-tracking-will-navigate> <icon class="w-2 h-2 align-middle mr-1" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/82mk0tliwd4tdyyne68xrnb2a"></icon> Connect </a> </li> <li> <!----> <a href="https://www.linkedin.com/in/tungisu?trk=public_profile_browsemap-profile" target="_self" data-impression-id="public_profile_browsemap-1" 
data-tracking-control-name="public_profile_browsemap-profile" data-tracking-will-navigate class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-aside-card flex my-1.5 base-aside-card--link"> <!----> <!----> <div class="bg-clip-content bg-contain border-2 border-color-container-neutral-border border-solid box-border h-[56px] rounded-[49.9%] shrink-0 w-[56px] " data-delayed-url="https://media.licdn.com/dms/image/v2/C4E03AQHy2qcA8W5yNg/profile-displayphoto-shrink_400_400/profile-displayphoto-shrink_400_400/0/1516341514699?e=2147483647&v=beta&t=Gy8oRo7bwShpmKhCbiPpeazghUbV8a0t9YQ6S_UZP4c" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" data-delayed-background data-impression-id="public_profile_browsemap-1"></div> <div class="base-aside-card__info self-center pl-0.5 flex flex-col flex-1"> <h3 class="base-aside-card__title font-sans text-md font-bold text-color-text relative"> Tung Nguyen <!----> </h3> <p class="base-aside-card__subtitle font-sans text-sm text-color-text leading-open mt-0.5 break-words line-clamp-2"> Assistant Professor at Auburn University </p> <!----> <!----> <div class="base-aside-card__metadata font-sans text-sm leading-open font-regular text-color-text-low-emphasis mt-0.5"> Auburn, AL<!----> </div> <!----> </div> <!----> </a> <a href="https://www.linkedin.com/in/tungisu?trk=public_profile_browsemap_browse-map_connect-button" aria-label="Connect with Tung Nguyen" class="relative -top-0.5 btn-sm btn-secondary inline-flex items-center ml-[52px]" data-tracking-client-ingraph data-tracking-control-name="public_profile_browsemap_browse-map_connect-button" data-tracking-will-navigate> <icon class="w-2 h-2 align-middle mr-1" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/82mk0tliwd4tdyyne68xrnb2a"></icon> Connect </a> </li> <li> <!----> <a href="https://www.linkedin.com/in/ebisa-wollega-ph-d-9aab412?trk=public_profile_browsemap-profile" target="_self" data-impression-id="public_profile_browsemap-2" data-tracking-control-name="public_profile_browsemap-profile" data-tracking-will-navigate class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-aside-card flex my-1.5 base-aside-card--link"> <!----> <!----> <div class="bg-clip-content bg-contain border-2 border-color-container-neutral-border border-solid box-border h-[56px] rounded-[49.9%] shrink-0 w-[56px] " data-delayed-url="https://media.licdn.com/dms/image/v2/C4E03AQH8LxHgw3VoXA/profile-displayphoto-shrink_200_200/profile-displayphoto-shrink_200_200/0/1516279934981?e=2147483647&v=beta&t=pCr4SF9HxrKRFicccCBUU-cLopLiQJkv6B7WssL9WY0" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" data-delayed-background data-impression-id="public_profile_browsemap-2"></div> <div class="base-aside-card__info self-center pl-0.5 flex flex-col flex-1"> <h3 class="base-aside-card__title font-sans text-md font-bold text-color-text relative"> Ebisa Wollega, Ph.D. <!----> </h3> <!----> <!----> <!----> <div class="base-aside-card__metadata font-sans text-sm leading-open font-regular text-color-text-low-emphasis mt-0.5"> Lakeland, FL<!----> </div> <!----> </div> <!----> </a> <a href="https://www.linkedin.com/in/ebisa-wollega-ph-d-9aab412?trk=public_profile_browsemap_browse-map_connect-button" aria-label="Connect with Ebisa Wollega, Ph.D." 
class="relative -top-0.5 btn-sm btn-secondary inline-flex items-center ml-[52px]" data-tracking-client-ingraph data-tracking-control-name="public_profile_browsemap_browse-map_connect-button" data-tracking-will-navigate> <icon class="w-2 h-2 align-middle mr-1" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/82mk0tliwd4tdyyne68xrnb2a"></icon> Connect </a> </li> <li> <!----> <a href="https://www.linkedin.com/in/yanjie-fu-5ba40358?trk=public_profile_browsemap-profile" target="_self" data-impression-id="public_profile_browsemap-3" data-tracking-control-name="public_profile_browsemap-profile" data-tracking-will-navigate class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-aside-card flex my-1.5 base-aside-card--link"> <!----> <!----> <div class="bg-clip-content bg-contain border-2 border-color-container-neutral-border border-solid box-border h-[56px] rounded-[49.9%] shrink-0 w-[56px] " data-delayed-url="https://media.licdn.com/dms/image/v2/C5103AQGv-cDro52L2g/profile-displayphoto-shrink_400_400/profile-displayphoto-shrink_400_400/0/1517042842279?e=2147483647&v=beta&t=bQl0IKDDruXaYrrwNN9O_zJq3lHafRQtCO_WT5zbE18" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" data-delayed-background data-impression-id="public_profile_browsemap-3"></div> <div class="base-aside-card__info self-center pl-0.5 flex flex-col flex-1"> <h3 class="base-aside-card__title font-sans text-md font-bold text-color-text relative"> Yanjie Fu <!----> </h3> <!----> <!----> <!----> <div class="base-aside-card__metadata font-sans text-sm leading-open font-regular text-color-text-low-emphasis mt-0.5"> Harrison, NJ<!----> </div> <!----> </div> <!----> </a> <a href="https://www.linkedin.com/in/yanjie-fu-5ba40358?trk=public_profile_browsemap_browse-map_connect-button" aria-label="Connect with Yanjie Fu" class="relative -top-0.5 btn-sm btn-secondary inline-flex items-center ml-[52px]" data-tracking-client-ingraph data-tracking-control-name="public_profile_browsemap_browse-map_connect-button" data-tracking-will-navigate> <icon class="w-2 h-2 align-middle mr-1" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/82mk0tliwd4tdyyne68xrnb2a"></icon> Connect </a> </li> <li> <!----> <a href="https://www.linkedin.com/in/xiali-sharon-hei-3a284a24?trk=public_profile_browsemap-profile" target="_self" data-impression-id="public_profile_browsemap-4" data-tracking-control-name="public_profile_browsemap-profile" data-tracking-will-navigate class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-aside-card flex my-1.5 base-aside-card--link"> <!----> <!----> <div class="bg-clip-content bg-contain border-2 border-color-container-neutral-border border-solid box-border h-[56px] rounded-[49.9%] shrink-0 w-[56px] " data-delayed-url="https://media.licdn.com/dms/image/v2/C4D03AQEg01Zf1Al79g/profile-displayphoto-shrink_400_400/profile-displayphoto-shrink_400_400/0/1517232968144?e=2147483647&v=beta&t=91FGz3Tcfp-Lw3qucICsQdwjfbSAI-zJ97Q78xDf6S0" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" data-delayed-background data-impression-id="public_profile_browsemap-4"></div> <div class="base-aside-card__info self-center pl-0.5 flex flex-col flex-1"> <h3 class="base-aside-card__title font-sans text-md font-bold text-color-text relative"> Xiali (Sharon) Hei <!----> </h3> <p class="base-aside-card__subtitle font-sans 
text-sm text-color-text leading-open mt-0.5 break-words line-clamp-2"> Alfred and Helen M. Lamson Endowed Associate Professor in Computer Science </p> <!----> <!----> <div class="base-aside-card__metadata font-sans text-sm leading-open font-regular text-color-text-low-emphasis mt-0.5"> Lafayette, LA<!----> </div> <!----> </div> <!----> </a> <a href="https://www.linkedin.com/in/xiali-sharon-hei-3a284a24?trk=public_profile_browsemap_browse-map_connect-button" aria-label="Connect with Xiali (Sharon) Hei" class="relative -top-0.5 btn-sm btn-secondary inline-flex items-center ml-[52px]" data-tracking-client-ingraph data-tracking-control-name="public_profile_browsemap_browse-map_connect-button" data-tracking-will-navigate> <icon class="w-2 h-2 align-middle mr-1" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/82mk0tliwd4tdyyne68xrnb2a"></icon> Connect </a> </li> <li> <!----> <a href="https://www.linkedin.com/in/wenchangzhang?trk=public_profile_browsemap-profile" target="_self" data-impression-id="public_profile_browsemap-5" data-tracking-control-name="public_profile_browsemap-profile" data-tracking-will-navigate class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-aside-card flex my-1.5 base-aside-card--link"> <!----> <!----> <div class="bg-clip-content bg-contain border-2 border-color-container-neutral-border border-solid box-border h-[56px] rounded-[49.9%] shrink-0 w-[56px] " data-delayed-url="https://media.licdn.com/dms/image/v2/C4D03AQEBVYmMp2Qp9g/profile-displayphoto-shrink_400_400/profile-displayphoto-shrink_400_400/0/1517022976515?e=2147483647&v=beta&t=XU6s-VG8TRv0EQlDjrjfki0BmYE48qnXLDAn5k_Gviw" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" data-delayed-background data-impression-id="public_profile_browsemap-5"></div> <div class="base-aside-card__info self-center pl-0.5 flex flex-col flex-1"> <h3 class="base-aside-card__title font-sans text-md font-bold text-color-text relative"> Wenchang Zhang <!----> </h3> <p class="base-aside-card__subtitle font-sans text-sm text-color-text leading-open mt-0.5 break-words line-clamp-2"> Assistant Professor at Indiana University - Kelley School of Business </p> <!----> <!----> <div class="base-aside-card__metadata font-sans text-sm leading-open font-regular text-color-text-low-emphasis mt-0.5"> Bloomington, IN<!----> </div> <!----> </div> <!----> </a> <a href="https://www.linkedin.com/in/wenchangzhang?trk=public_profile_browsemap_browse-map_connect-button" aria-label="Connect with Wenchang Zhang" class="relative -top-0.5 btn-sm btn-secondary inline-flex items-center ml-[52px]" data-tracking-client-ingraph data-tracking-control-name="public_profile_browsemap_browse-map_connect-button" data-tracking-will-navigate> <icon class="w-2 h-2 align-middle mr-1" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/82mk0tliwd4tdyyne68xrnb2a"></icon> Connect </a> </li> <li> <!----> <a href="https://www.linkedin.com/in/omer-san-62585897?trk=public_profile_browsemap-profile" target="_self" data-impression-id="public_profile_browsemap-6" data-tracking-control-name="public_profile_browsemap-profile" data-tracking-will-navigate class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-aside-card flex my-1.5 base-aside-card--link"> <!----> <!----> <div class="bg-clip-content bg-contain border-2 border-color-container-neutral-border border-solid box-border h-[56px] rounded-[49.9%] shrink-0 w-[56px] " 
data-delayed-url="https://media.licdn.com/dms/image/v2/D4E03AQFTMrVzoTBBNQ/profile-displayphoto-shrink_400_400/B4EZT9aqJ5HUAg-/0/1739418412710?e=2147483647&v=beta&t=hYYJ5HJtomQyZqmm1T3n-syiijaEZyKYqQaV1QoIF7g" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" data-delayed-background data-impression-id="public_profile_browsemap-6"></div> <div class="base-aside-card__info self-center pl-0.5 flex flex-col flex-1"> <h3 class="base-aside-card__title font-sans text-md font-bold text-color-text relative"> Omer San <!----> </h3> <p class="base-aside-card__subtitle font-sans text-sm text-color-text leading-open mt-0.5 break-words line-clamp-2"> Associate Professor at the University of Tennessee, Knoxville </p> <!----> <!----> <div class="base-aside-card__metadata font-sans text-sm leading-open font-regular text-color-text-low-emphasis mt-0.5"> Knoxville, TN<!----> </div> <!----> </div> <!----> </a> <a href="https://www.linkedin.com/in/omer-san-62585897?trk=public_profile_browsemap_browse-map_connect-button" aria-label="Connect with Omer San" class="relative -top-0.5 btn-sm btn-secondary inline-flex items-center ml-[52px]" data-tracking-client-ingraph data-tracking-control-name="public_profile_browsemap_browse-map_connect-button" data-tracking-will-navigate> <icon class="w-2 h-2 align-middle mr-1" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/82mk0tliwd4tdyyne68xrnb2a"></icon> Connect </a> </li> <li> <!----> <a href="https://www.linkedin.com/in/mahdi-khodayar?trk=public_profile_browsemap-profile" target="_self" data-impression-id="public_profile_browsemap-7" data-tracking-control-name="public_profile_browsemap-profile" data-tracking-will-navigate class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-aside-card flex my-1.5 base-aside-card--link"> <!----> <!----> <div class="bg-clip-content bg-contain border-2 border-color-container-neutral-border border-solid box-border h-[56px] rounded-[49.9%] shrink-0 w-[56px] " data-delayed-url="https://media.licdn.com/dms/image/v2/D5603AQHXxS_ZtufzFA/profile-displayphoto-shrink_400_400/profile-displayphoto-shrink_400_400/0/1695614672617?e=2147483647&v=beta&t=gd4mw9bFlgJh8GPwY4v19wRfN2pGeVqmw8BPC7998Xw" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" data-delayed-background data-impression-id="public_profile_browsemap-7"></div> <div class="base-aside-card__info self-center pl-0.5 flex flex-col flex-1"> <h3 class="base-aside-card__title font-sans text-md font-bold text-color-text relative"> Mahdi Khodayar <!----> </h3> <p class="base-aside-card__subtitle font-sans text-sm text-color-text leading-open mt-0.5 break-words line-clamp-2"> Assistant Professor </p> <!----> <!----> <div class="base-aside-card__metadata font-sans text-sm leading-open font-regular text-color-text-low-emphasis mt-0.5"> United States<!----> </div> <!----> </div> <!----> </a> <a href="https://www.linkedin.com/in/mahdi-khodayar?trk=public_profile_browsemap_browse-map_connect-button" aria-label="Connect with Mahdi Khodayar" class="relative -top-0.5 btn-sm btn-secondary inline-flex items-center ml-[52px]" data-tracking-client-ingraph data-tracking-control-name="public_profile_browsemap_browse-map_connect-button" data-tracking-will-navigate> <icon class="w-2 h-2 align-middle mr-1" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/82mk0tliwd4tdyyne68xrnb2a"></icon> Connect </a> 
</li> <li> <!----> <a href="https://www.linkedin.com/in/xiaochen-guo-1456b620?trk=public_profile_browsemap-profile" target="_self" data-impression-id="public_profile_browsemap-8" data-tracking-control-name="public_profile_browsemap-profile" data-tracking-will-navigate class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-aside-card flex my-1.5 base-aside-card--link"> <!----> <!----> <div class="bg-clip-content bg-contain border-2 border-color-container-neutral-border border-solid box-border h-[56px] rounded-[49.9%] shrink-0 w-[56px] " data-delayed-url="https://media.licdn.com/dms/image/v2/C4D03AQHJIitTrsn3yw/profile-displayphoto-shrink_400_400/profile-displayphoto-shrink_400_400/0/1631373445650?e=2147483647&v=beta&t=qk6_cgqbgkWkzkprhNXnyWRKf82qkO78PLa27r499Z8" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" data-delayed-background data-impression-id="public_profile_browsemap-8"></div> <div class="base-aside-card__info self-center pl-0.5 flex flex-col flex-1"> <h3 class="base-aside-card__title font-sans text-md font-bold text-color-text relative"> Xiaochen Guo <!----> </h3> <!----> <!----> <!----> <div class="base-aside-card__metadata font-sans text-sm leading-open font-regular text-color-text-low-emphasis mt-0.5"> San Diego, CA<!----> </div> <!----> </div> <!----> </a> <a href="https://www.linkedin.com/in/xiaochen-guo-1456b620?trk=public_profile_browsemap_browse-map_connect-button" aria-label="Connect with Xiaochen Guo" class="relative -top-0.5 btn-sm btn-secondary inline-flex items-center ml-[52px]" data-tracking-client-ingraph data-tracking-control-name="public_profile_browsemap_browse-map_connect-button" data-tracking-will-navigate> <icon class="w-2 h-2 align-middle mr-1" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/82mk0tliwd4tdyyne68xrnb2a"></icon> Connect </a> </li> <li> <!----> <a href="https://www.linkedin.com/in/jacqueline-j-harris-ph-d-a5594133?trk=public_profile_browsemap-profile" target="_self" data-impression-id="public_profile_browsemap-9" data-tracking-control-name="public_profile_browsemap-profile" data-tracking-will-navigate class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-aside-card flex my-1.5 base-aside-card--link"> <!----> <!----> <div class="bg-clip-content bg-contain border-2 border-color-container-neutral-border border-solid box-border h-[56px] rounded-[49.9%] shrink-0 w-[56px] " data-delayed-url="https://media.licdn.com/dms/image/v2/D4E03AQHplDHZze31qw/profile-displayphoto-shrink_400_400/profile-displayphoto-shrink_400_400/0/1704205602088?e=2147483647&v=beta&t=7SRCV1pUTaS38BL9Z2uM3tiBVDiJmRzr_epSOHJ8umc" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" data-delayed-background data-impression-id="public_profile_browsemap-9"></div> <div class="base-aside-card__info self-center pl-0.5 flex flex-col flex-1"> <h3 class="base-aside-card__title font-sans text-md font-bold text-color-text relative"> Jacqueline J Harris Ph.D. 
<!----> </h3> <p class="base-aside-card__subtitle font-sans text-sm text-color-text leading-open mt-0.5 break-words line-clamp-2"> Assistant Professor of Chemistry </p> <!----> <!----> <div class="base-aside-card__metadata font-sans text-sm leading-open font-regular text-color-text-low-emphasis mt-0.5"> United States<!----> </div> <!----> </div> <!----> </a> <a href="https://www.linkedin.com/in/jacqueline-j-harris-ph-d-a5594133?trk=public_profile_browsemap_browse-map_connect-button" aria-label="Connect with Jacqueline J Harris Ph.D." class="relative -top-0.5 btn-sm btn-secondary inline-flex items-center ml-[52px]" data-tracking-client-ingraph data-tracking-control-name="public_profile_browsemap_browse-map_connect-button" data-tracking-will-navigate> <icon class="w-2 h-2 align-middle mr-1" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/82mk0tliwd4tdyyne68xrnb2a"></icon> Connect </a> </li> </ul> </div> </section> <section class="aside-section-container mb-4 related-posts" data-nosnippet="true"> <h2 class="aside-section-container__title section-title"> Explore more posts </h2> <!----> <div class="aside-section-container__content break-words"> <div class="show-more-less aside-entities-list"> <!----> <ul data-max-num-to-show="4" class="show-more-less__list show-more-less__list--hide-after-4" data-impression-id="public_profile_relatedPosts_show-more-less"> <li class="border-b-1 border-solid border-color-divider max-w-[360px] mb-1"> <div class="aside-activity-card "> <a href="https://www.linkedin.com/in/malan?trk=public_profile_relatedPosts_face-pile-cta" target="_self" data-tracking-control-name="full-link" data-tracking-will-navigate class="face-pile flex !no-underline"> <div class="face-pile__images-container self-start flex-shrink-0 mr-1 leading-[1]"> <img class="inline-block relative rounded-[50%] w-4 h-4 face-pile__image border-1 border-solid border-color-transparent -ml-2 first:ml-0" data-delayed-url="https://media.licdn.com/dms/image/v2/C4E03AQGiWpTUewQ76Q/profile-displayphoto-shrink_200_200/profile-displayphoto-shrink_200_200/0/1613923672703?e=2147483647&v=beta&t=c9uMoxicUpxVBKB0HisVW6Kt6FY2M9_8cszgRFjHBK8" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" alt> </div> <p class="face-pile__text self-center font-sans text-sm link-styled hover:underline"> David J. Malan </p> </a> <a href="https://www.linkedin.com/posts/malan_cs50r-introduction-activity-7214078913776357376-nsZx?trk=public_profile_relatedPosts" target="_self" data-tracking-control-name="public_profile_relatedPosts" data-tracking-will-navigate class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-aside-card flex my-1.5 base-aside-card--link mt-0 mb-0.5"> <!----> <!----> <div class="base-aside-card__info self-center pl-0.5 flex flex-col flex-1"> <h3 class="base-aside-card__title font-sans text-md font-bold text-color-text relative"> <!----> </h3> <p class="base-aside-card__subtitle font-sans text-sm text-color-text leading-open mt-0.5 break-words overflow-anywhere line-clamp-2"> Now available on edX, CS50's new Introduction to Programming with R, a popular language for statistical computing and graphics in data science and other domains, with CS50's own Carter Zenke. Overview at https://lnkd.in/eY-Rw4hk. Enroll for free at https://cs50.edx.org/r. 
#education #community #r #statistics </p> <!----> <!----> <div class="base-aside-card__metadata font-sans text-sm leading-open font-regular text-color-text-low-emphasis mt-0.5"> <code id="i18n_reaction_singular" style="display: none"><!--"%numReactions% Reaction"--></code> <code id="i18n_reactions_plural" style="display: none"><!--"%numReactions% Reactions"--></code> <div class="flex items-center font-sans text-sm babybear:text-xs aside-activity-card__social-actions -mt-0.5"> <div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis my-1" aria-label="1,575 Reactions" data-test-id="social-actions__reactions" data-id="social-actions__reactions" data-num-reactions="1575" data-singular="%numReactions%" data-plural="%numReactions%" tabindex="0"> <img alt data-reaction-type="LIKE" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671" height="16px" width="16px"> <img alt data-reaction-type="EMPATHY" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/asiqslyf4ooq7ggllg4fyo4o2" height="16px" width="16px"> <img alt data-reaction-type="PRAISE" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/2tzoeodxy0zug4455msr0oq0v" height="16px" width="16px"> <span aria-hidden="true" class="font-normal ml-0.5" data-test-id="social-actions__reaction-count"> 1,575 </span> </div> <code id="social-actions__reaction-image-APPRECIATION" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cyfai5zw4nrqhyyhl0p7so58v"--></code> <!----><!----> <code id="social-actions__reaction-image-EMPATHY" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/asiqslyf4ooq7ggllg4fyo4o2"--></code> <code id="social-actions__reaction-image-ENTERTAINMENT" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/22ifp2etz8kb9tgjqn65s9ics"--></code> <!----> <code id="social-actions__reaction-image-INTEREST" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu"--></code> <code id="social-actions__reaction-image-LIKE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671"--></code> <code id="social-actions__reaction-image-MAYBE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cryzkreqrh52ja5bc6njlrupa"--></code> <code id="social-actions__reaction-image-PRAISE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/2tzoeodxy0zug4455msr0oq0v"--></code> <!----><!----> <div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis before:middot my-1" data-test-id="social-actions__comments" data-id="social-actions__comments" data-num-comments="34" data-singular="%numComments% Comment" data-plural="%numComments% Comments"> 34 Comments </div> <!----> </div> </div> <!----> </div> <div class="base-aside-card__media flex-shrink-0 mr-0.5 overflow-hidden relative rounded-md h-[75px] w-[75px] base-aside-card__media--right"> <img alt="Now available on edX, CS50's new Introduction to Programming with R, a popular language for statistical computing and graphics in data science and other domains, with CS50's own Carter Zenke. Overview at https://lnkd.in/eY-Rw4hk. Enroll for free at https://cs50.edx.org/r. 
#education #community #r #statistics " aria-hidden="true" data-delayed-url="https://media.licdn.com/dms/image/sync/v2/D4E27AQEoFPUvqmWBbQ/articleshare-shrink_800/articleshare-shrink_800/0/1719970430861?e=2147483647&v=beta&t=ZW-EliPLRYm2BtXCADfYp956_FKQSfSLiojoWSnvL7M"> <!----> </div> </a> </div> </li> <li class="border-b-1 border-solid border-color-divider max-w-[360px] mb-1"> <div class="aside-activity-card "> <a href="https://www.linkedin.com/in/opencontent?trk=public_profile_relatedPosts_face-pile-cta" target="_self" data-tracking-control-name="full-link" data-tracking-will-navigate class="face-pile flex !no-underline"> <div class="face-pile__images-container self-start flex-shrink-0 mr-1 leading-[1]"> <img class="inline-block relative rounded-[50%] w-4 h-4 face-pile__image border-1 border-solid border-color-transparent -ml-2 first:ml-0" data-delayed-url="https://media.licdn.com/dms/image/v2/D5603AQHigAEMVeSv4w/profile-displayphoto-shrink_200_200/profile-displayphoto-shrink_200_200/0/1674493172315?e=2147483647&v=beta&t=8EKUFI-OY5oD1zGlV9eNCr1Ed5qweFeHllddbI4d-QM" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" alt> </div> <p class="face-pile__text self-center font-sans text-sm link-styled hover:underline"> David Wiley </p> </a> <a href="https://www.linkedin.com/posts/opencontent_pedagogical-alignment-of-large-language-models-activity-7253040910496673792-HJ4l?trk=public_profile_relatedPosts" target="_self" data-tracking-control-name="public_profile_relatedPosts" data-tracking-will-navigate class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-aside-card flex my-1.5 base-aside-card--link mt-0 mb-0.5"> <!----> <!----> <div class="base-aside-card__info self-center pl-0.5 flex flex-col flex-1"> <h3 class="base-aside-card__title font-sans text-md font-bold text-color-text relative"> <!----> </h3> <p class="base-aside-card__subtitle font-sans text-sm text-color-text leading-open mt-0.5 break-words overflow-anywhere line-clamp-2"> Shashank Sonkar and colleagues are back with another great paper and matching GitHub repo, "Pedagogical Alignment of Large Language Models". You can read the paper on arXiv here - https://lnkd.in/enZfCJgi, and browse the code here - https://lnkd.in/gyW5hSzp. They demonstrate an interesting approach to generating synthetic data to train open weights models to behave more pedagogically, as well as developing some new evaluation metrics. I think there's lots here for those of us interested in this kind of work to learn from and build on. (If synthetic data generation is new do you, check out this post from answer.ai earlier this week - https://lnkd.in/ecsG7R4b.) 
</p> <!----> <!----> <div class="base-aside-card__metadata font-sans text-sm leading-open font-regular text-color-text-low-emphasis mt-0.5"> <code id="i18n_reaction_singular" style="display: none"><!--"%numReactions% Reaction"--></code> <code id="i18n_reactions_plural" style="display: none"><!--"%numReactions% Reactions"--></code> <div class="flex items-center font-sans text-sm babybear:text-xs aside-activity-card__social-actions -mt-0.5"> <div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis my-1" aria-label="38 Reactions" data-test-id="social-actions__reactions" data-id="social-actions__reactions" data-num-reactions="38" data-singular="%numReactions%" data-plural="%numReactions%" tabindex="0"> <img alt data-reaction-type="LIKE" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671" height="16px" width="16px"> <img alt data-reaction-type="PRAISE" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/2tzoeodxy0zug4455msr0oq0v" height="16px" width="16px"> <img alt data-reaction-type="APPRECIATION" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/cyfai5zw4nrqhyyhl0p7so58v" height="16px" width="16px"> <span aria-hidden="true" class="font-normal ml-0.5" data-test-id="social-actions__reaction-count"> 38 </span> </div> <code id="social-actions__reaction-image-APPRECIATION" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cyfai5zw4nrqhyyhl0p7so58v"--></code> <!----><!----> <code id="social-actions__reaction-image-EMPATHY" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/asiqslyf4ooq7ggllg4fyo4o2"--></code> <code id="social-actions__reaction-image-ENTERTAINMENT" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/22ifp2etz8kb9tgjqn65s9ics"--></code> <!----> <code id="social-actions__reaction-image-INTEREST" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu"--></code> <code id="social-actions__reaction-image-LIKE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671"--></code> <code id="social-actions__reaction-image-MAYBE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cryzkreqrh52ja5bc6njlrupa"--></code> <code id="social-actions__reaction-image-PRAISE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/2tzoeodxy0zug4455msr0oq0v"--></code> <!----><!----> <div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis before:middot my-1" data-test-id="social-actions__comments" data-id="social-actions__comments" data-num-comments="4" data-singular="%numComments% Comment" data-plural="%numComments% Comments"> 4 Comments </div> <!----> </div> </div> <!----> </div> <div class="base-aside-card__media flex-shrink-0 mr-0.5 overflow-hidden relative rounded-md h-[75px] w-[75px] base-aside-card__media--right"> <img alt="Shashank Sonkar and colleagues are back with another great paper and matching GitHub repo, "Pedagogical Alignment of Large Language Models". You can read the paper on arXiv here - https://lnkd.in/enZfCJgi, and browse the code here - https://lnkd.in/gyW5hSzp. They demonstrate an interesting approach to generating synthetic data to train open weights models to behave more pedagogically, as well as developing some new evaluation metrics. I think there's lots here for those of us interested in this kind of work to learn from and build on. 
(If synthetic data generation is new do you, check out this post from answer.ai earlier this week - https://lnkd.in/ecsG7R4b.) " aria-hidden="true" data-delayed-url="https://media.licdn.com/dms/image/sync/v2/D4E27AQEo69vQkm3Vtw/articleshare-shrink_800/articleshare-shrink_800/0/1724962274795?e=2147483647&v=beta&t=8_1TJcF6HHsaNqoypoKHhnH5-ONe5wcGLTSnwdf2SE4"> <!----> </div> </a> </div> </li> <li class="border-b-1 border-solid border-color-divider max-w-[360px] mb-1"> <div class="aside-activity-card "> <a href="https://www.linkedin.com/in/edward-y-chang-218b182?trk=public_profile_relatedPosts_face-pile-cta" target="_self" data-tracking-control-name="full-link" data-tracking-will-navigate class="face-pile flex !no-underline"> <div class="face-pile__images-container self-start flex-shrink-0 mr-1 leading-[1]"> <img class="inline-block relative rounded-[50%] w-4 h-4 bg-color-entity-ghost-background face-pile__image border-1 border-solid border-color-transparent -ml-2 first:ml-0" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" alt> </div> <p class="face-pile__text self-center font-sans text-sm link-styled hover:underline"> Edward Y. Chang </p> </a> <a href="https://www.linkedin.com/posts/edward-y-chang-218b182_the-path-to-artificial-general-intelligence-activity-7188284347928440832-jAho?trk=public_profile_relatedPosts" target="_self" data-tracking-control-name="public_profile_relatedPosts" data-tracking-will-navigate class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-aside-card flex my-1.5 base-aside-card--link mt-0 mb-0.5"> <!----> <!----> <div class="base-aside-card__info self-center pl-0.5 flex flex-col flex-1"> <h3 class="base-aside-card__title font-sans text-md font-bold text-color-text relative"> <!----> </h3> <p class="base-aside-card__subtitle font-sans text-sm text-color-text leading-open mt-0.5 break-words overflow-anywhere line-clamp-2"> Predictions based on Machine Learning and Statistics often rely on maximal likelihood, which can trap LLMs in biases inherited from the majority opinions of training data. This problem has been substantially addressed through conditional statistics, allowing for mitigation of biases and preventing the cancellation of minority perspectives. 
https://lnkd.in/gMXKqwyB </p> <!----> <!----> <div class="base-aside-card__metadata font-sans text-sm leading-open font-regular text-color-text-low-emphasis mt-0.5"> <code id="i18n_reaction_singular" style="display: none"><!--"%numReactions% Reaction"--></code> <code id="i18n_reactions_plural" style="display: none"><!--"%numReactions% Reactions"--></code> <div class="flex items-center font-sans text-sm babybear:text-xs aside-activity-card__social-actions -mt-0.5"> <div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis my-1" aria-label="7 Reactions" data-test-id="social-actions__reactions" data-id="social-actions__reactions" data-num-reactions="7" data-singular="%numReactions%" data-plural="%numReactions%" tabindex="0"> <img alt data-reaction-type="LIKE" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671" height="16px" width="16px"> <img alt data-reaction-type="APPRECIATION" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/cyfai5zw4nrqhyyhl0p7so58v" height="16px" width="16px"> <span aria-hidden="true" class="font-normal ml-0.5" data-test-id="social-actions__reaction-count"> 7 </span> </div> <code id="social-actions__reaction-image-APPRECIATION" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cyfai5zw4nrqhyyhl0p7so58v"--></code> <!----><!----> <code id="social-actions__reaction-image-EMPATHY" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/asiqslyf4ooq7ggllg4fyo4o2"--></code> <code id="social-actions__reaction-image-ENTERTAINMENT" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/22ifp2etz8kb9tgjqn65s9ics"--></code> <!----> <code id="social-actions__reaction-image-INTEREST" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu"--></code> <code id="social-actions__reaction-image-LIKE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671"--></code> <code id="social-actions__reaction-image-MAYBE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cryzkreqrh52ja5bc6njlrupa"--></code> <code id="social-actions__reaction-image-PRAISE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/2tzoeodxy0zug4455msr0oq0v"--></code> <!----><!----><!----><!----> </div> </div> <!----> </div> <div class="base-aside-card__media flex-shrink-0 mr-0.5 overflow-hidden relative rounded-md h-[75px] w-[75px] base-aside-card__media--right"> <img alt="Predictions based on Machine Learning and Statistics often rely on maximal likelihood, which can trap LLMs in biases inherited from the majority opinions of training data. This problem has been substantially addressed through conditional statistics, allowing for mitigation of biases and preventing the cancellation of minority perspectives. 
https://lnkd.in/gMXKqwyB" aria-hidden="true" data-delayed-url="https://media.licdn.com/dms/image/sync/v2/D5627AQGb15x162QgRA/articleshare-shrink_800/articleshare-shrink_800/0/1723572090342?e=2147483647&v=beta&t=FkzqfdWVoKPTvWbivNq46Z07y1o4lqKECGjrXEu7txM"> <!----> </div> </a> </div> </li> <li class="border-b-1 border-solid border-color-divider max-w-[360px] mb-1"> <div class="aside-activity-card "> <a href="https://uk.linkedin.com/in/mupshall?trk=public_profile_relatedPosts_face-pile-cta" target="_self" data-tracking-control-name="full-link" data-tracking-will-navigate class="face-pile flex !no-underline"> <div class="face-pile__images-container self-start flex-shrink-0 mr-1 leading-[1]"> <img class="inline-block relative rounded-[50%] w-4 h-4 bg-color-entity-ghost-background face-pile__image border-1 border-solid border-color-transparent -ml-2 first:ml-0" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" alt> </div> <p class="face-pile__text self-center font-sans text-sm link-styled hover:underline"> Michael Upshall </p> </a> <a href="https://www.linkedin.com/posts/mupshall_the-problem-of-peer-review-httpslnkdin-activity-7231348601476534273-Ny8S?trk=public_profile_relatedPosts" target="_self" data-tracking-control-name="public_profile_relatedPosts" data-tracking-will-navigate class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-aside-card flex my-1.5 base-aside-card--link mt-0 mb-0.5"> <!----> <!----> <div class="base-aside-card__info self-center pl-0.5 flex flex-col flex-1"> <h3 class="base-aside-card__title font-sans text-md font-bold text-color-text relative"> <!----> </h3> <p class="base-aside-card__subtitle font-sans text-sm text-color-text leading-open mt-0.5 break-words overflow-anywhere line-clamp-2"> The Problem of Peer Review https://lnkd.in/dsbQxmwv It’s been stated before, but an excellent paper in arXiv (“what can Natural Language Processing do for Peer Review?” spells out the problem of peer review very clearly, by describing what happens with papers submitted for a major conference: a vast amount of human effort. “For a concrete example, the ACL-2023 Conference alone generated over 12,000 reviews for over 4,500 full-text manuscripts … yet none of this data is public.” Clearly, when we think about open access, we should think about open reviews as well. NLP can’t fix everything connected with peer review, but this paper shows there is a lot that it could be doing right now. Incidentally, the authors, more than 20 of them, deserve praise for writing a wholly intelligible article (the outcome of a seminar), that doesn’t need a plain-text summary. 
</p> <!----> <!----> <div class="base-aside-card__metadata font-sans text-sm leading-open font-regular text-color-text-low-emphasis mt-0.5"> <code id="i18n_reaction_singular" style="display: none"><!--"%numReactions% Reaction"--></code> <code id="i18n_reactions_plural" style="display: none"><!--"%numReactions% Reactions"--></code> <div class="flex items-center font-sans text-sm babybear:text-xs aside-activity-card__social-actions -mt-0.5"> <div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis my-1" aria-label="13 Reactions" data-test-id="social-actions__reactions" data-id="social-actions__reactions" data-num-reactions="13" data-singular="%numReactions%" data-plural="%numReactions%" tabindex="0"> <img alt data-reaction-type="LIKE" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671" height="16px" width="16px"> <img alt data-reaction-type="INTEREST" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu" height="16px" width="16px"> <span aria-hidden="true" class="font-normal ml-0.5" data-test-id="social-actions__reaction-count"> 13 </span> </div> <code id="social-actions__reaction-image-APPRECIATION" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cyfai5zw4nrqhyyhl0p7so58v"--></code> <!----><!----> <code id="social-actions__reaction-image-EMPATHY" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/asiqslyf4ooq7ggllg4fyo4o2"--></code> <code id="social-actions__reaction-image-ENTERTAINMENT" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/22ifp2etz8kb9tgjqn65s9ics"--></code> <!----> <code id="social-actions__reaction-image-INTEREST" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu"--></code> <code id="social-actions__reaction-image-LIKE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671"--></code> <code id="social-actions__reaction-image-MAYBE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cryzkreqrh52ja5bc6njlrupa"--></code> <code id="social-actions__reaction-image-PRAISE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/2tzoeodxy0zug4455msr0oq0v"--></code> <!----><!----><!----><!----> </div> </div> <!----> </div> <!----> </a> </div> </li> <li class="border-b-1 border-solid border-color-divider max-w-[360px] mb-1"> <div class="aside-activity-card "> <a href="https://www.linkedin.com/in/edward-y-chang-218b182?trk=public_profile_relatedPosts_face-pile-cta" target="_self" data-tracking-control-name="full-link" data-tracking-will-navigate class="face-pile flex !no-underline"> <div class="face-pile__images-container self-start flex-shrink-0 mr-1 leading-[1]"> <img class="inline-block relative rounded-[50%] w-4 h-4 bg-color-entity-ghost-background face-pile__image border-1 border-solid border-color-transparent -ml-2 first:ml-0" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" alt> </div> <p class="face-pile__text self-center font-sans text-sm link-styled hover:underline"> Edward Y. 
Chang </p> </a> <a href="https://www.linkedin.com/posts/edward-y-chang-218b182_the-path-to-artificial-general-intelligence-activity-7186480098366177280-lYtj?trk=public_profile_relatedPosts" target="_self" data-tracking-control-name="public_profile_relatedPosts" data-tracking-will-navigate class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-aside-card flex my-1.5 base-aside-card--link mt-0 mb-0.5"> <!----> <!----> <div class="base-aside-card__info self-center pl-0.5 flex flex-col flex-1"> <h3 class="base-aside-card__title font-sans text-md font-bold text-color-text relative"> <!----> </h3> <p class="base-aside-card__subtitle font-sans text-sm text-color-text leading-open mt-0.5 break-words overflow-anywhere line-clamp-2"> New GAI Textbook Announcement (The textbook is being used by Stanford CS372: AI for precision medicine and psychological disorders) The past two years have been a significant period in the field of AI, during which we have observed substantial strides in generative AI technologies. Having dedicated over 5,000 hours to research and analysis, I have compiled this book to share a collection of findings and hypotheses regarding Large Language Models (LLMs)—detailing their operational mechanisms, and strategies to mitigate biases and reduce instances of hallucination. The central premise is to enhance context awareness and define human objectives with greater precision. By doing so, LLMs can more effectively mirror human linguistic behaviors, utilizing context-specific linguistic features to fulfill intended goals. SocraSynth: Socratic Synthesis with Multiple Large Language Models. https://lnkd.in/giC4474p </p> <!----> <!----> <div class="base-aside-card__metadata font-sans text-sm leading-open font-regular text-color-text-low-emphasis mt-0.5"> <code id="i18n_reaction_singular" style="display: none"><!--"%numReactions% Reaction"--></code> <code id="i18n_reactions_plural" style="display: none"><!--"%numReactions% Reactions"--></code> <div class="flex items-center font-sans text-sm babybear:text-xs aside-activity-card__social-actions -mt-0.5"> <div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis my-1" aria-label="25 Reactions" data-test-id="social-actions__reactions" data-id="social-actions__reactions" data-num-reactions="25" data-singular="%numReactions%" data-plural="%numReactions%" tabindex="0"> <img alt data-reaction-type="LIKE" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671" height="16px" width="16px"> <span aria-hidden="true" class="font-normal ml-0.5" data-test-id="social-actions__reaction-count"> 25 </span> </div> <code id="social-actions__reaction-image-APPRECIATION" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cyfai5zw4nrqhyyhl0p7so58v"--></code> <!----><!----> <code id="social-actions__reaction-image-EMPATHY" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/asiqslyf4ooq7ggllg4fyo4o2"--></code> <code id="social-actions__reaction-image-ENTERTAINMENT" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/22ifp2etz8kb9tgjqn65s9ics"--></code> <!----> <code id="social-actions__reaction-image-INTEREST" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu"--></code> <code id="social-actions__reaction-image-LIKE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671"--></code> <code 
id="social-actions__reaction-image-MAYBE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cryzkreqrh52ja5bc6njlrupa"--></code> <code id="social-actions__reaction-image-PRAISE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/2tzoeodxy0zug4455msr0oq0v"--></code> <!----><!----><!----><!----> </div> </div> <!----> </div> <div class="base-aside-card__media flex-shrink-0 mr-0.5 overflow-hidden relative rounded-md h-[75px] w-[75px] base-aside-card__media--right"> <img alt="New GAI Textbook Announcement (The textbook is being used by Stanford CS372: AI for precision medicine and psychological disorders) The past two years have been a significant period in the field of AI, during which we have observed substantial strides in generative AI technologies. Having dedicated over 5,000 hours to research and analysis, I have compiled this book to share a collection of findings and hypotheses regarding Large Language Models (LLMs)—detailing their operational mechanisms, and strategies to mitigate biases and reduce instances of hallucination. The central premise is to enhance context awareness and define human objectives with greater precision. By doing so, LLMs can more effectively mirror human linguistic behaviors, utilizing context-specific linguistic features to fulfill intended goals. SocraSynth: Socratic Synthesis with Multiple Large Language Models. https://lnkd.in/giC4474p " aria-hidden="true" data-delayed-url="https://media.licdn.com/dms/image/sync/v2/D5627AQGb15x162QgRA/articleshare-shrink_800/articleshare-shrink_800/0/1723572090342?e=2147483647&v=beta&t=FkzqfdWVoKPTvWbivNq46Z07y1o4lqKECGjrXEu7txM"> <!----> </div> </a> </div> </li> <li class="border-b-1 border-solid border-color-divider max-w-[360px] mb-1"> <div class="aside-activity-card "> <a href="https://www.linkedin.com/in/joe-faith-554b89165?trk=public_profile_relatedPosts_face-pile-cta" target="_self" data-tracking-control-name="full-link" data-tracking-will-navigate class="face-pile flex !no-underline"> <div class="face-pile__images-container self-start flex-shrink-0 mr-1 leading-[1]"> <img class="inline-block relative rounded-[50%] w-4 h-4 face-pile__image border-1 border-solid border-color-transparent -ml-2 first:ml-0" data-delayed-url="https://media.licdn.com/dms/image/v2/D4E03AQEqiZtD3sfsVw/profile-displayphoto-shrink_200_200/B4EZTN2Ii.HgAc-/0/1738620309898?e=2147483647&v=beta&t=ZZGMxRjxjyJsCFYVLugOQYyLOnacRBCSKyAOHDgw9qc" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" alt> </div> <p class="face-pile__text self-center font-sans text-sm link-styled hover:underline"> Joe Faith </p> </a> <a href="https://www.linkedin.com/posts/joe-faith-554b89165_gpu-mode-irl-2024-keynotes-activity-7248127002505883649-NQyg?trk=public_profile_relatedPosts" target="_self" data-tracking-control-name="public_profile_relatedPosts" data-tracking-will-navigate class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-aside-card flex my-1.5 base-aside-card--link mt-0 mb-0.5"> <!----> <!----> <div class="base-aside-card__info self-center pl-0.5 flex flex-col flex-1"> <h3 class="base-aside-card__title font-sans text-md font-bold text-color-text relative"> <!----> </h3> <p class="base-aside-card__subtitle font-sans text-sm text-color-text leading-open mt-0.5 break-words overflow-anywhere line-clamp-2"> I thought it was very interesting to hear Andrej Karpathy talk at GPU MODE IRL 2024 about llm.c and the 
conversion of PyTorch code into C/CUDA, as well as the potential implications of this work for future LLM progress. The idea of converting high-level deep learning frameworks like PyTorch into low-level, highly efficient C/CUDA code could drastically improve model performance and execution speed, especially in specialized environments. It will be exciting to see how software engineering (SWE) practices evolve over the coming years as this becomes more widely adopted. One of the most intriguing aspects is the potential for a model to generate a specific kernel for a use case at someone's company in code that has no dependencies and could be fully optimized for that specific task. This capability could not only make models more efficient by reducing overhead but also streamline the workflow for developers, allowing them to focus on higher-level design and problem-solving. Essentially, it introduces a whole new paradigm for how we think about integrating AI into SWE roles. Instead of relying on pre-built libraries or packages, engineers might one day leverage AI models to dynamically create specialized, hyper-optimized code that meets their unique needs, reducing both the time and complexity of development. This shift could also lead to a greater level of automation, where AI models help developers by taking over tasks like performance optimization or low-level code generation, potentially reshaping the way we approach software development as a whole. You can listen to the keynote here: https://lnkd.in/eqJYS8Gb #AI #MachineLearning #DeepLearning #PyTorch #CUDA #SoftwareEngineering #C #LLM </p> <!----> <!----> <div class="base-aside-card__metadata font-sans text-sm leading-open font-regular text-color-text-low-emphasis mt-0.5"> <code id="i18n_reaction_singular" style="display: none"><!--"%numReactions% Reaction"--></code> <code id="i18n_reactions_plural" style="display: none"><!--"%numReactions% Reactions"--></code> <div class="flex items-center font-sans text-sm babybear:text-xs aside-activity-card__social-actions -mt-0.5"> <div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis my-1" aria-label="2 Reactions" data-test-id="social-actions__reactions" data-id="social-actions__reactions" data-num-reactions="2" data-singular="%numReactions%" data-plural="%numReactions%" tabindex="0"> <img alt data-reaction-type="LIKE" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671" height="16px" width="16px"> <span aria-hidden="true" class="font-normal ml-0.5" data-test-id="social-actions__reaction-count"> 2 </span> </div> <code id="social-actions__reaction-image-APPRECIATION" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cyfai5zw4nrqhyyhl0p7so58v"--></code> <!----><!----> <code id="social-actions__reaction-image-EMPATHY" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/asiqslyf4ooq7ggllg4fyo4o2"--></code> <code id="social-actions__reaction-image-ENTERTAINMENT" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/22ifp2etz8kb9tgjqn65s9ics"--></code> <!----> <code id="social-actions__reaction-image-INTEREST" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu"--></code> <code id="social-actions__reaction-image-LIKE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671"--></code> <code id="social-actions__reaction-image-MAYBE" style="display: 
none"><!--"https://static.licdn.com/aero-v1/sc/h/cryzkreqrh52ja5bc6njlrupa"--></code> <code id="social-actions__reaction-image-PRAISE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/2tzoeodxy0zug4455msr0oq0v"--></code> <!----><!----><!----><!----> </div> </div> <!----> </div> <div class="base-aside-card__media flex-shrink-0 mr-0.5 overflow-hidden relative rounded-md h-[75px] w-[75px] base-aside-card__media--right"> <img alt="I thought it was very interesting to hear Andrej Karpathy talk at GPU MODE IRL 2024 about llm.c and the conversion of PyTorch code into C/CUDA, as well as the potential implications of this work for future LLM progress. The idea of converting high-level deep learning frameworks like PyTorch into low-level, highly efficient C/CUDA code could drastically improve model performance and execution speed, especially in specialized environments. It will be exciting to see how software engineering (SWE) practices evolve over the coming years as this becomes more widely adopted. One of the most intriguing aspects is the potential for a model to generate a specific kernel for a use case at someone's company in code that has no dependencies and could be fully optimized for that specific task. This capability could not only make models more efficient by reducing overhead but also streamline the workflow for developers, allowing them to focus on higher-level design and problem-solving. Essentially, it introduces a whole new paradigm for how we think about integrating AI into SWE roles. Instead of relying on pre-built libraries or packages, engineers might one day leverage AI models to dynamically create specialized, hyper-optimized code that meets their unique needs, reducing both the time and complexity of development. This shift could also lead to a greater level of automation, where AI models help developers by taking over tasks like performance optimization or low-level code generation, potentially reshaping the way we approach software development as a whole. You can listen to the keynote here: https://lnkd.in/eqJYS8Gb #AI #MachineLearning #DeepLearning #PyTorch #CUDA #SoftwareEngineering #C #LLM" aria-hidden="true" data-delayed-url="https://media.licdn.com/dms/image/sync/v2/D4D27AQHOuQ1SiWQlPA/articleshare-shrink_800/articleshare-shrink_800/0/1728025735473?e=2147483647&v=beta&t=Rk58y9TaNWpWKzbCLCak3IVhMq6LbWzyjE9tztWCtb4"> <!----> </div> </a> </div> </li> <li class="border-b-1 border-solid border-color-divider max-w-[360px] mb-1"> <div class="aside-activity-card "> <a href="https://www.linkedin.com/in/ramon-m-0a085623b?trk=public_profile_relatedPosts_face-pile-cta" target="_self" data-tracking-control-name="full-link" data-tracking-will-navigate class="face-pile flex !no-underline"> <div class="face-pile__images-container self-start flex-shrink-0 mr-1 leading-[1]"> <img class="inline-block relative rounded-[50%] w-4 h-4 face-pile__image border-1 border-solid border-color-transparent -ml-2 first:ml-0" data-delayed-url="https://media.licdn.com/dms/image/v2/C4E03AQEnoq3GSDdiTg/profile-displayphoto-shrink_200_200/profile-displayphoto-shrink_200_200/0/1653863495317?e=2147483647&v=beta&t=QGVeRqw0-HAcv9zZJ_GPAjaAOI1WOj0hvzWtfQ0nkhI" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" alt> </div> <p class="face-pile__text self-center font-sans text-sm link-styled hover:underline"> Ramon M. 
</p> </a> <a href="https://www.linkedin.com/posts/ramon-m-0a085623b_github-mqcomplabprime-protein-structure-activity-7191908401407250432-835F?trk=public_profile_relatedPosts" target="_self" data-tracking-control-name="public_profile_relatedPosts" data-tracking-will-navigate class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-aside-card flex my-1.5 base-aside-card--link mt-0 mb-0.5"> <!----> <!----> <div class="base-aside-card__info self-center pl-0.5 flex flex-col flex-1"> <h3 class="base-aside-card__title font-sans text-md font-bold text-color-text relative"> <!----> </h3> <p class="base-aside-card__subtitle font-sans text-sm text-color-text leading-open mt-0.5 break-words overflow-anywhere line-clamp-2"> Looking how to select better representatives from your clusters? Take a look at Lexin Chen's PRIME repo: https://lnkd.in/enM_Btbx Now with updated functionality and tutorials! Great collaboration with Arup Mondal and Alberto Perez! #moleculardynamics #clustering #machinelearning </p> <!----> <!----> <div class="base-aside-card__metadata font-sans text-sm leading-open font-regular text-color-text-low-emphasis mt-0.5"> <code id="i18n_reaction_singular" style="display: none"><!--"%numReactions% Reaction"--></code> <code id="i18n_reactions_plural" style="display: none"><!--"%numReactions% Reactions"--></code> <div class="flex items-center font-sans text-sm babybear:text-xs aside-activity-card__social-actions -mt-0.5"> <div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis my-1" aria-label="65 Reactions" data-test-id="social-actions__reactions" data-id="social-actions__reactions" data-num-reactions="65" data-singular="%numReactions%" data-plural="%numReactions%" tabindex="0"> <img alt data-reaction-type="LIKE" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671" height="16px" width="16px"> <img alt data-reaction-type="INTEREST" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu" height="16px" width="16px"> <span aria-hidden="true" class="font-normal ml-0.5" data-test-id="social-actions__reaction-count"> 65 </span> </div> <code id="social-actions__reaction-image-APPRECIATION" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cyfai5zw4nrqhyyhl0p7so58v"--></code> <!----><!----> <code id="social-actions__reaction-image-EMPATHY" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/asiqslyf4ooq7ggllg4fyo4o2"--></code> <code id="social-actions__reaction-image-ENTERTAINMENT" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/22ifp2etz8kb9tgjqn65s9ics"--></code> <!----> <code id="social-actions__reaction-image-INTEREST" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu"--></code> <code id="social-actions__reaction-image-LIKE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671"--></code> <code id="social-actions__reaction-image-MAYBE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cryzkreqrh52ja5bc6njlrupa"--></code> <code id="social-actions__reaction-image-PRAISE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/2tzoeodxy0zug4455msr0oq0v"--></code> <!----><!----><!----><!----> </div> </div> <!----> </div> <div class="base-aside-card__media flex-shrink-0 mr-0.5 overflow-hidden relative rounded-md h-[75px] w-[75px] base-aside-card__media--right"> <img alt="Looking how to select better 
representatives from your clusters? Take a look at Lexin Chen's PRIME repo: https://lnkd.in/enM_Btbx Now with updated functionality and tutorials! Great collaboration with Arup Mondal and Alberto Perez! #moleculardynamics #clustering #machinelearning" aria-hidden="true" data-delayed-url="https://media.licdn.com/dms/image/sync/v2/D4E27AQGgLMQ8H6Vi0A/articleshare-shrink_800/articleshare-shrink_800/0/1731651217091?e=2147483647&v=beta&t=UdY4U7596v-CGYbyW5IO7OpNsKhBGgS0Z6CMNsJSd7M"> <!----> </div> </a> </div> </li> <li class="border-b-1 border-solid border-color-divider max-w-[360px] mb-1"> <div class="aside-activity-card "> <a href="https://es.linkedin.com/in/eordax?trk=public_profile_relatedPosts_face-pile-cta" target="_self" data-tracking-control-name="full-link" data-tracking-will-navigate class="face-pile flex !no-underline"> <div class="face-pile__images-container self-start flex-shrink-0 mr-1 leading-[1]"> <img class="inline-block relative rounded-[50%] w-4 h-4 bg-color-entity-ghost-background face-pile__image border-1 border-solid border-color-transparent -ml-2 first:ml-0" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" alt> </div> <p class="face-pile__text self-center font-sans text-sm link-styled hover:underline"> Eduardo Ordax </p> </a> <a href="https://www.linkedin.com/posts/eordax_ai-genai-finetuning-activity-7210377788069969920-GL6e?trk=public_profile_relatedPosts" target="_self" data-tracking-control-name="public_profile_relatedPosts" data-tracking-will-navigate class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-aside-card flex my-1.5 base-aside-card--link mt-0 mb-0.5"> <!----> <!----> <div class="base-aside-card__info self-center pl-0.5 flex flex-col flex-1"> <h3 class="base-aside-card__title font-sans text-md font-bold text-color-text relative"> <!----> </h3> <p class="base-aside-card__subtitle font-sans text-sm text-color-text leading-open mt-0.5 break-words overflow-anywhere line-clamp-2"> 𝗧𝗵𝗲 𝗲𝘃𝗼𝗹𝘂𝘁𝗶𝗼𝗻𝗮𝗿𝘆 𝗱𝗲𝘃𝗲𝗹𝗼𝗽𝗺𝗲𝗻𝘁 𝗼𝗳 𝗣𝗘𝗙𝗧 To perform task-specific full fine-tuning with LLMs like LLaMA3 70B, a minimum of 5120GB of computational resources may be required. The enormous computational resource requirements are prohibitive for anyone but the superpower players to utilize LLMs for task-specific fine-tuning. To address this challenge, PEFT has emerged as a viable solution to compensate for the tremendous computational cost of full parameter fine-tuning. PEFT updates only a small number of additional parameters or updates a subset of the pre-trained parameters, preserving the knowledge and reducing the risk of catastrophic forgetting. As a result, there has been an increase on PEFT techniques during the last few months. 
Some of the most popular techniques where many customers are purring lot of effort to improve LLMs for theirs specific purposes are LoRA, QLoRA or Prompt-Tuning; If you want to start easily testing these techniques, I recommend to follow this couple of repositories: 👉 https://lnkd.in/dtvHP_Cn 👉 https://lnkd.in/daKDY4f5 #ai #genai #finetuning </p> <!----> <!----> <div class="base-aside-card__metadata font-sans text-sm leading-open font-regular text-color-text-low-emphasis mt-0.5"> <code id="i18n_reaction_singular" style="display: none"><!--"%numReactions% Reaction"--></code> <code id="i18n_reactions_plural" style="display: none"><!--"%numReactions% Reactions"--></code> <div class="flex items-center font-sans text-sm babybear:text-xs aside-activity-card__social-actions -mt-0.5"> <div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis my-1" aria-label="72 Reactions" data-test-id="social-actions__reactions" data-id="social-actions__reactions" data-num-reactions="72" data-singular="%numReactions%" data-plural="%numReactions%" tabindex="0"> <img alt data-reaction-type="LIKE" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671" height="16px" width="16px"> <img alt data-reaction-type="INTEREST" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu" height="16px" width="16px"> <img alt data-reaction-type="EMPATHY" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/asiqslyf4ooq7ggllg4fyo4o2" height="16px" width="16px"> <span aria-hidden="true" class="font-normal ml-0.5" data-test-id="social-actions__reaction-count"> 72 </span> </div> <code id="social-actions__reaction-image-APPRECIATION" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cyfai5zw4nrqhyyhl0p7so58v"--></code> <!----><!----> <code id="social-actions__reaction-image-EMPATHY" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/asiqslyf4ooq7ggllg4fyo4o2"--></code> <code id="social-actions__reaction-image-ENTERTAINMENT" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/22ifp2etz8kb9tgjqn65s9ics"--></code> <!----> <code id="social-actions__reaction-image-INTEREST" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu"--></code> <code id="social-actions__reaction-image-LIKE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671"--></code> <code id="social-actions__reaction-image-MAYBE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cryzkreqrh52ja5bc6njlrupa"--></code> <code id="social-actions__reaction-image-PRAISE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/2tzoeodxy0zug4455msr0oq0v"--></code> <!----><!----> <div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis before:middot my-1" data-test-id="social-actions__comments" data-id="social-actions__comments" data-num-comments="3" data-singular="%numComments% Comment" data-plural="%numComments% Comments"> 3 Comments </div> <!----> </div> </div> <!----> </div> <div class="base-aside-card__media flex-shrink-0 mr-0.5 overflow-hidden relative rounded-md h-[75px] w-[75px] base-aside-card__media--right"> <img alt="𝗧𝗵𝗲 𝗲𝘃𝗼𝗹𝘂𝘁𝗶𝗼𝗻𝗮𝗿𝘆 𝗱𝗲𝘃𝗲𝗹𝗼𝗽𝗺𝗲𝗻𝘁 𝗼𝗳 𝗣𝗘𝗙𝗧 To perform task-specific full fine-tuning with LLMs like LLaMA3 70B, a minimum of 5120GB of computational resources may be required. 
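As a rough sketch of why PEFT methods such as LoRA cut the memory bill so dramatically, the snippet below adds a low-rank update to a frozen linear layer in plain PyTorch. It is an illustrative toy under stated assumptions, not code from either repository linked in the post; the class and parameter names are invented.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA layer: keep the pretrained weight frozen and learn only a
    low-rank update B @ A, i.e. r * (d_in + d_out) extra parameters."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():            # freeze the pretrained layer
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Frozen base projection plus the scaled low-rank update.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 65,536 trainable parameters vs. ~16.8M in the frozen base layer
```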
Prof. Nikos Paragios (https://fr.linkedin.com/in/prof-nikos-paragios-20777869), 107 reactions, 3 comments:
🚫 Top 5 reasons why you should NOT apply for a Machine Learning / Artificial Intelligence position at TheraPanacea 🚫 Application portal: https://lnkd.in/ePDdTgGv
1️⃣ You're just not into the latest trends in Artificial Intelligence 🤖... too futuristic for you, right?
2️⃣ Helping advance precision medicine and providing the best treatment to patients? Nah, you prefer things to stay... imprecise 🧑⚕️❌.
3️⃣ The thought of collaborating with the best clinical and academic centers worldwide doesn't give you any goosebumps 🌍💼...
4️⃣ You're allergic to international, interdisciplinary, and gender-balanced teams 🌐👩💻👨💻.
5️⃣ Paris? ... who likes croissants, the Eiffel Tower, and vibrant culture anyway 🥐🇫🇷?
But hey, if none of these sound like you, maybe you're just the person we're looking for 😉! #AI #MachineLearning #Hiring #JoinUs
Post: https://www.linkedin.com/posts/prof-nikos-paragios-20777869_ai-machinelearning-hiring-activity-7244277916291731456-9H_s
Pranab Ghosh (https://www.linkedin.com/in/pkghosh), 35 reactions, 5 comments:
I tried to explain LLM hallucination intuitively using the concept of semantic entropy, treating the two scenarios of short and long responses separately. The hallucination mechanism for long responses is more complex, since the output typically weaves through multiple concepts. #ai #llm #hallucination #semantic #entropy https://lnkd.in/gF2_2H6w
Post: https://www.linkedin.com/posts/pkghosh_intuitive-explanation-of-llm-hallucination-activity-7268500522498703360-GjcB
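The article behind the short link above is not reproduced here; the snippet below is only a back-of-the-envelope sketch of the semantic-entropy idea for short answers: sample several responses, group the ones that mean the same thing, and compute entropy over those meaning clusters. Real implementations cluster paraphrases with an NLI model or embeddings; a lowercased exact match stands in for that step.

```python
import math
from collections import Counter

def semantic_entropy(sampled_answers):
    """Entropy over meaning clusters of sampled answers. 'Same meaning' is
    approximated by a lowercased exact match, a stand-in for NLI-based clustering."""
    clusters = Counter(a.strip().lower() for a in sampled_answers)
    n = len(sampled_answers)
    return -sum((c / n) * math.log2(c / n) for c in clusters.values())

# Low entropy: the samples agree, so hallucination is less likely.
# High entropy: the samples disagree semantically, a warning sign.
print(semantic_entropy(["Paris", "paris", "Paris", "Paris"]))       # 0.0
print(semantic_entropy(["Paris", "Lyon", "Marseille", "Toulouse"])) # 2.0
```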
Adrien Besson (https://fr.linkedin.com/in/adrien-besson-828bbb47), 32 reactions, 1 comment:
🚀✨ Paper alert! ✨🚀 Absolutely thrilled to share the publication of our latest paper, "Enhancement of Time-Dependent Coherent Backscattering: Leveraging Phase Information", in Physical Review B. In this work, we show how the angular framework can be leveraged to characterize multiple scattering in a way that has never been done before. Instead of measuring the width of the incoherent cone, as state-of-the-art techniques do, we exploit the coherence of complex signals backscattered from different directions: we enhance the coherent cone and measure its width. This significantly facilitates the measurement of transport properties and paves the way to in vivo applications, e.g., characterization of lungs or bones. Kudos to Baptiste HERIARD DUBREUIL for the amazing work!
Post: https://www.linkedin.com/posts/adrien-besson-828bbb47_enhancement-of-time-dependent-coherent-backscattering-activity-7258183439038738432-yyZy
data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" alt> </div> <p class="face-pile__text self-center font-sans text-sm link-styled hover:underline"> Edgar Bermudez, PhD </p> </a> <a href="https://www.linkedin.com/posts/edgarbermudezcontreras_i-recently-read-towards-monosemanticity-activity-7202071919590875136-vLsv?trk=public_profile_relatedPosts" target="_self" data-tracking-control-name="public_profile_relatedPosts" data-tracking-will-navigate class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-aside-card flex my-1.5 base-aside-card--link mt-0 mb-0.5"> <!----> <!----> <div class="base-aside-card__info self-center pl-0.5 flex flex-col flex-1"> <h3 class="base-aside-card__title font-sans text-md font-bold text-color-text relative"> <!----> </h3> <p class="base-aside-card__subtitle font-sans text-sm text-color-text leading-open mt-0.5 break-words overflow-anywhere line-clamp-2"> I recently read “Towards Monosemanticity: Decomposing Language Models with Dictionary Learning” by Bricken et al 2023. Published at transformer-circuits.pub In this paper the authors discuss the use of autoencoders to extract interpretable features from a one-layer transformer model. The approach aims to improve the understanding and transparency of artificial neural networks. To do this they use the concept of dictionary learning which is a method used to decompose the activations of the transformer model into a set of basis features that are sparse and interpretable. This involves training sparse autoencoders to identify a dictionary of monosemantic features (or features with a single meaning). These monosemantic features in single units (artificial neurons) reminds me of the “grandmother neuron” theory in neuroscience which proposes that specific neurons respond to specific highly complex stimuli. However, in real brains, it is widely believed that complex representations are encoded by the collective activity of groups of neurons (and not monosemantic). But going back to the paper, I really like the visualization tools and illustration to make such complex networks more transparent (as opposed to black boxes). 
Paper: https://lnkd.in/gR87pJcs </p> <!----> <!----> <div class="base-aside-card__metadata font-sans text-sm leading-open font-regular text-color-text-low-emphasis mt-0.5"> <code id="i18n_reaction_singular" style="display: none"><!--"%numReactions% Reaction"--></code> <code id="i18n_reactions_plural" style="display: none"><!--"%numReactions% Reactions"--></code> <div class="flex items-center font-sans text-sm babybear:text-xs aside-activity-card__social-actions -mt-0.5"> <div class="flex items-center font-normal text-color-text-low-emphasis no-underline visited:text-color-text-low-emphasis my-1" aria-label="3 Reactions" data-test-id="social-actions__reactions" data-id="social-actions__reactions" data-num-reactions="3" data-singular="%numReactions%" data-plural="%numReactions%" tabindex="0"> <img alt data-reaction-type="LIKE" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671" height="16px" width="16px"> <span aria-hidden="true" class="font-normal ml-0.5" data-test-id="social-actions__reaction-count"> 3 </span> </div> <code id="social-actions__reaction-image-APPRECIATION" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cyfai5zw4nrqhyyhl0p7so58v"--></code> <!----><!----> <code id="social-actions__reaction-image-EMPATHY" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/asiqslyf4ooq7ggllg4fyo4o2"--></code> <code id="social-actions__reaction-image-ENTERTAINMENT" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/22ifp2etz8kb9tgjqn65s9ics"--></code> <!----> <code id="social-actions__reaction-image-INTEREST" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/a0e8rff6djeoq8iympcysuqfu"--></code> <code id="social-actions__reaction-image-LIKE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/bn39hirwzjqj18ej1fkz55671"--></code> <code id="social-actions__reaction-image-MAYBE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/cryzkreqrh52ja5bc6njlrupa"--></code> <code id="social-actions__reaction-image-PRAISE" style="display: none"><!--"https://static.licdn.com/aero-v1/sc/h/2tzoeodxy0zug4455msr0oq0v"--></code> <!----><!----><!----><!----> </div> </div> <!----> </div> <div class="base-aside-card__media flex-shrink-0 mr-0.5 overflow-hidden relative rounded-md h-[75px] w-[75px] base-aside-card__media--right"> <img alt="I recently read “Towards Monosemanticity: Decomposing Language Models with Dictionary Learning” by Bricken et al 2023. Published at transformer-circuits.pub In this paper the authors discuss the use of autoencoders to extract interpretable features from a one-layer transformer model. The approach aims to improve the understanding and transparency of artificial neural networks. To do this they use the concept of dictionary learning which is a method used to decompose the activations of the transformer model into a set of basis features that are sparse and interpretable. This involves training sparse autoencoders to identify a dictionary of monosemantic features (or features with a single meaning). These monosemantic features in single units (artificial neurons) reminds me of the “grandmother neuron” theory in neuroscience which proposes that specific neurons respond to specific highly complex stimuli. However, in real brains, it is widely believed that complex representations are encoded by the collective activity of groups of neurons (and not monosemantic). 
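For readers who want the mechanics of the dictionary-learning step in one screen, the sketch below is a minimal sparse autoencoder over transformer activations: an overcomplete encoder with an L1 penalty, so each activation vector is reconstructed from only a few dictionary features. It is a toy illustration of the idea, not the Bricken et al. implementation; the dimensions and coefficients are made up.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder on model activations; the L1 penalty on the
    feature activations encourages a sparse, interpretable dictionary."""
    def __init__(self, d_model: int = 512, n_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))   # sparse feature activations
        return self.decoder(features), features

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(64, 512)                          # stand-in for real transformer activations
recon, feats = sae(acts)
loss = ((recon - acts) ** 2).mean() + 1e-3 * feats.abs().mean()  # reconstruction + L1 sparsity
loss.backward()
opt.step()
```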
Yubin Park, PhD (https://www.linkedin.com/in/yubin-park-phd), 64 reactions, 9 comments:
Synthetic data for LLM applications. Synthetic healthcare datasets have been around for some time. Whenever we discuss synthetic data, we often find ourselves trapped in one question: how realistic is the synthetic data? We create synthetic data because we cannot share the original data for obvious reasons, e.g., privacy and security, yet we want it to keep all the nice characteristics of the original data. Challenges aside, many researchers have adopted synthetic data and tried to prove its value. This morning, I came across a well-summarized survey article, "Synthetic data in healthcare: A narrative review," by Aldren Gonzales, Guruprabha Guruswamy, and Scott Smith. According to the paper, synthetic data has been used in seven categories:
1. Simulation and prediction research
2. Hypothesis, methods, and algorithm testing
3. Epidemiological study / public health research
4. Health IT development and testing
5. Education and training
6. Public release of datasets
7. Linking data
I have been thinking about synthetic data quite a bit these days because of Large Language Models (LLMs) like ChatGPT by OpenAI and Claude by Anthropic. Having played around with those LLMs, I realize that, even if you are an expert, it is almost impossible to predict their outputs. Since we cannot logically bound the output behavior, testing an LLM application with all possible (or available) cases is the only way. Here, I think, is the new application of synthetic data: LLM application testing. If we were to build many LLM applications, we would need a lot of synthetic healthcare data. In this case, the synthetic data does not need to be hyper-realistic; it just needs to capture the "quirkiness" of real-life use cases, so weird synthetic data should be fine. Healthcare-focused LLM applications would need to be tested against all sorts of synthetic data to see whether they produce weird outputs, and we may need a mechanism to do so. I think this new use case of synthetic data will be critical in healthcare. Let's see. [1] https://lnkd.in/e8DiEH9j
Post: https://www.linkedin.com/posts/yubin-park-phd_synthetic-data-in-health-care-a-narrative-activity-7218357178263367681-qGZb
Marco Ramilli, PhD (https://it.linkedin.com/in/marcoramilli), 236 reactions, 3 comments:
Anthropic has unveiled groundbreaking insights into the inner workings of Large Language Models (LLMs) in a recent paper. This pioneering research employs "dictionary learning" to map and manipulate the internal states of these models, offering unprecedented transparency and control. This breakthrough not only enhances our understanding of neural networks but also opens new avenues for ensuring the safety and reliability of AI systems. Read more: https://lnkd.in/eAzYgnKj IdentifAI - Find Origin - #artificialintelligence
Post: https://www.linkedin.com/posts/marcoramilli_anthropic-unveils-groundbreaking-insights-activity-7200004985214902272-rZee
Read more: https://lnkd.in/eAzYgnKj IdentifAI - Find Origin - #artificialintelligence " aria-hidden="true" data-delayed-url="https://media.licdn.com/dms/image/sync/v2/D4E27AQEYzV8l4wYldg/articleshare-shrink_800/articleshare-shrink_800/0/1716614865585?e=2147483647&v=beta&t=u8iwL1VAU23RR1DK6CeaJYfq2OEisdY_iKEHRyuriZc"> <!----> </div> </a> </div> </li> <li class="border-b-1 border-solid border-color-divider max-w-[360px] mb-1"> <div class="aside-activity-card "> <a href="https://www.linkedin.com/in/brianbuntz?trk=public_profile_relatedPosts_face-pile-cta" target="_self" data-tracking-control-name="full-link" data-tracking-will-navigate class="face-pile flex !no-underline"> <div class="face-pile__images-container self-start flex-shrink-0 mr-1 leading-[1]"> <img class="inline-block relative rounded-[50%] w-4 h-4 face-pile__image border-1 border-solid border-color-transparent -ml-2 first:ml-0" data-delayed-url="https://media.licdn.com/dms/image/v2/D5603AQFoTwoUQv_s-Q/profile-displayphoto-shrink_200_200/B56ZPHV2swGwAc-/0/1734216215617?e=2147483647&v=beta&t=ISNfJ7lAn9hU52RJVcRsxEp1PTbCUv08YbmJjjpc1HU" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" alt> </div> <p class="face-pile__text self-center font-sans text-sm link-styled hover:underline"> Brian K. Buntz </p> </a> <a href="https://www.linkedin.com/posts/brianbuntz_the-information-explores-how-openais-reasoning-activity-7262223492580823040-xVRP?trk=public_profile_relatedPosts" target="_self" data-tracking-control-name="public_profile_relatedPosts" data-tracking-will-navigate class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-aside-card flex my-1.5 base-aside-card--link mt-0 mb-0.5"> <!----> <!----> <div class="base-aside-card__info self-center pl-0.5 flex flex-col flex-1"> <h3 class="base-aside-card__title font-sans text-md font-bold text-color-text relative"> <!----> </h3> <p class="base-aside-card__subtitle font-sans text-sm text-color-text leading-open mt-0.5 break-words overflow-anywhere line-clamp-2"> The Information explores how OpenAI's reasoning models could help counteract potentially slowing progress from pretraining of larger more sophisticated models with more incremental gains over earlier models. It also notes that "relatively few developers" are using o1 models. While I haven't seen the full-fledged o1 models (and am certainly not a developer), the current models sometimes think hard about how to break code -- have seen them replace functional authentication with placeholders that you have to go back and update later, or swap out the correct variables with wrong ones -- for instance upper- or lowercasing something you didn't ask for, or adding an underscore where you didn't want one. At present, I think it still is better to have tighter feedback loops with the user -- otherwise, the genAI is thinking long and hard -- while a game of telephone can ensue where mistakes sometimes increase in the process of reflection -- about something that might not be what you wanted. A snippet from the paywalled article. The Information has done a lot of unique reporting on the subject. 
Arnab Chakraborty (https://www.linkedin.com/posts/arnab-chakraborty13_github-chakraborty-arnabinstructionbacktranslation-activity-7226688575730253824-KSmb): Self-Alignment using Instruction Backtranslation. Recently, it has become common practice to fine-tune LLMs on proprietary or domain-specific data. However, this requires labeled data, often referred to as ground truth, and getting humans to label it is both time-consuming and expensive. A recent paper, "Self Alignment by Instruction Backtranslation", proposes an innovative approach to generating high-quality training data. Check it out here: https://lnkd.in/dfx5yyha. In simple terms, the process starts with a model fine-tuned on a small amount of seed data. This seed model constructs training examples by generating instruction prompts for documents from a web corpus; the highest-quality candidates are then selected and used to fine-tune a better-aligned model. The cool part? When the authors applied this method to LLaMA, the resulting model outperformed all other LLaMA-based models on the Alpaca leaderboard without relying on distillation data. I've put together a code implementation in my GitHub repo.
Feel free to check it out and let me know what you think! #GenAI #LLMs #Llama #Fine-tuning #MachineLearning #NLP #DataScience
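For readers who want the shape of the procedure rather than the paper's exact recipe, here is a minimal sketch of the self-augmentation and self-curation loop described above. `seed_model_generate`, `fine_tune`, and `load_web_corpus` are hypothetical placeholders for whatever inference and training stack you use, and the prompts and 1-5 rating threshold are illustrative assumptions, not the paper's or the linked repo's code.

```python
# Minimal sketch of an instruction-backtranslation loop (illustrative only).
from dataclasses import dataclass


@dataclass
class Example:
    instruction: str
    response: str
    score: float = 0.0


def seed_model_generate(prompt: str) -> str:
    """Placeholder for a call to the seed model (assumption, not a real API)."""
    raise NotImplementedError


def fine_tune(examples: list[Example]) -> None:
    """Placeholder for supervised fine-tuning on the curated examples."""
    raise NotImplementedError


def backtranslate(web_documents: list[str], keep_threshold: float = 4.5) -> list[Example]:
    # Self-augmentation: treat each unlabeled web document as a *response*
    # and ask the seed model to guess the instruction that would produce it.
    candidates = [
        Example(
            instruction=seed_model_generate(
                f"Write the instruction that the following text answers:\n\n{doc}"
            ),
            response=doc,
        )
        for doc in web_documents
    ]

    # Self-curation: have the model rate each (instruction, response) pair
    # and keep only the high-quality candidates for fine-tuning.
    curated = []
    for ex in candidates:
        rating = seed_model_generate(
            "Rate from 1 to 5 how well this response follows the instruction.\n"
            f"Instruction: {ex.instruction}\nResponse: {ex.response}\nRating:"
        )
        ex.score = float(rating.strip().split()[0])
        if ex.score >= keep_threshold:
            curated.append(ex)
    return curated


# fine_tune(backtranslate(load_web_corpus()))  # repeat for further self-alignment rounds
```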
Professor Aaron Quigley (https://www.linkedin.com/posts/aaronquigley_john-lions-distinguished-lecture-2024-activity-7247898649303662592--ytJ): Now online - very cool! Making Systems Research More Relevant: Lessons and Experiences with HarmonyOS NEXT. https://lnkd.in/g6WK-VNK The relevance of systems research is often measured by its impact on industry practice. This talk revisits the lessons and experiences of conducting systems research across academia and industry, drawing on the speaker's roughly eight years on systems projects such as HarmonyOS NEXT and its open-source counterpart, OpenHarmony. It then uses HarmonyOS NEXT as a case study to illustrate the potential of systems research to drive tangible innovation at the intersection of academia and industry. HarmonyOS NEXT puts many brave research ideas into practice, including a microkernel-based architecture, distributed mobile computing, and AI integration. By examining its architecture, features, and development process, the talk aims to identify key factors contributing to its success and to extract lessons for future collaborations between researchers and practitioners.
By showcasing the potential of HarmonyOS NEXT, the speaker hopes to encourage further exploration of how academic insights can be translated into innovative approaches for complex systems challenges.
Patrick Harvey (https://www.linkedin.com/posts/patrick-l-harvey1_github-p-harveympmapthdemo-a-review-activity-7261099030619869186-tzyp): In case anyone is looking for an easy-to-use Python package for working with complex-valued functions or arbitrary-precision arithmetic, mpmath (https://lnkd.in/eRGHrd2K) has some good things going for it!
Demonstration of some capabilities below 👇 https://lnkd.in/eUHQBUW3
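Independently of the linked demo, a few lines like the following show the two capabilities the post mentions, arbitrary-precision arithmetic and complex-valued special functions; the specific values are arbitrary choices for illustration.

```python
# Small taste of mpmath: high-precision arithmetic and complex special functions.
from mpmath import mp, mpf, mpc, zeta, gamma, quad, exp, inf, pi, sqrt

mp.dps = 50                                   # work with 50 significant decimal digits

print(mpf(2) ** mpf("0.5"))                   # sqrt(2) to 50 digits
print(zeta(mpc(0.5, 14.134725141734693)))     # zeta at a complex point near its first nontrivial zero
print(gamma(mpc(1, 1)))                       # gamma function at a complex argument

# High-precision numerical quadrature: integral of exp(-x^2) over [0, inf)
print(quad(lambda x: exp(-x**2), [0, inf]))
print(sqrt(pi) / 2)                           # closed form, for comparison
```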
Vincent Hall PhD (https://www.linkedin.com/posts/vincent-hall-consulting_feature-selection-for-supervised-learning-activity-7205887671519379456-4KeA): Can't believe I hadn't put this paper on ResearchGate or Google Scholar until today! Data compression and feature selection: "Feature Selection for Supervised Learning and Compression", https://lnkd.in/e_USPuea. My first paper from this work still hasn't been published; I finished it in 2019.
aria-hidden="true" data-delayed-url="https://media.licdn.com/dms/image/sync/v2/D4E27AQEqqBd5ueDQIg/articleshare-shrink_1280_800/articleshare-shrink_1280_800/0/1718017428449?e=2147483647&v=beta&t=W-_EOpNUbqPwSxw91qmYjyzRKxoX1l9XApTcA7DD69Q"> <!----> </div> </a> </div> </li> <li class=" max-w-[360px] mb-1"> <div class="aside-activity-card "> <a href="https://in.linkedin.com/in/gaurav-mantri-1b75b563?trk=public_profile_relatedPosts_face-pile-cta" target="_self" data-tracking-control-name="full-link" data-tracking-will-navigate class="face-pile flex !no-underline"> <div class="face-pile__images-container self-start flex-shrink-0 mr-1 leading-[1]"> <img class="inline-block relative rounded-[50%] w-4 h-4 face-pile__image border-1 border-solid border-color-transparent -ml-2 first:ml-0" data-delayed-url="https://media.licdn.com/dms/image/v2/C5603AQFz_vnX6U4mEQ/profile-displayphoto-shrink_200_200/profile-displayphoto-shrink_200_200/0/1517547611670?e=2147483647&v=beta&t=LaJyBv7-A7Pz2XCMeWIaYIIwiZYbYhHd42oXjx7ADuE" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" alt> </div> <p class="face-pile__text self-center font-sans text-sm link-styled hover:underline"> Gaurav Mantri </p> </a> <a href="https://www.linkedin.com/posts/gaurav-mantri-1b75b563_mixture-of-experts-explained-activity-7200749458043592705-C9Ka?trk=public_profile_relatedPosts" target="_self" data-tracking-control-name="public_profile_relatedPosts" data-tracking-will-navigate class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-aside-card flex my-1.5 base-aside-card--link mt-0 mb-0.5"> <!----> <!----> <div class="base-aside-card__info self-center pl-0.5 flex flex-col flex-1"> <h3 class="base-aside-card__title font-sans text-md font-bold text-color-text relative"> <!----> </h3> <p class="base-aside-card__subtitle font-sans text-sm text-color-text leading-open mt-0.5 break-words overflow-anywhere line-clamp-2"> Conditional Computation in LLMs (MoEs): Understanding the Concept of MoEs The Mixture of Experts (MoE) works on the idea of conditional computation, allowing the model to only run through some parts (an Expert) for a given input compared to a dense model where all parameters are used. It consists of a gating mechanism that selects the experts for the given input, enabling the model to scale without increasing the required computation. Popular LLM architectures like Mixtral and Grok use the same concept. What is MoEs? The Mixture of Experts (MoE) makes a simple modification to the usual architecture by replacing all or some of the Feed Forward layers with an MoE layer. The MoE layer consists of: 1. Router: A parameter gating mechanism that selects the experts used to process the given input by producing a probability distribution over experts. 2. Experts: Standalone NN modules with independent sets of parameters trained jointly with routers. It can use more complex architecture to create hierarchical MoE modules by implementing each expert as another MoE. - Implementation in Switch Transformers : Training was found to be 7x faster than training the base T5, keeping computation and training time constant. - Gshard and MeshTransformer : These frameworks propose efficient methods to train the MoE architectures using the parallelisation of models across GPUs/TPUs. Challenges with MoEs and Solutions 1. Inconsistent Batch Size for Experts: Can be overcome with the use of "soft" constraints in training loss. 2. 
2. Load balancing across experts: the tendency to route to the same few experts repeatedly during training can be addressed with better balancing mechanisms.
3. Model instability: a router z-loss can be used as an auxiliary loss to encourage smaller values within the router.
4. Overfitting: MoE models require different hyperparameter settings for effective fine-tuning; they were found to benefit from high learning rates and smaller batch sizes.

Recommended reading/references:
1. Hugging Face blog on MoE: https://lnkd.in/g36gQJTx
2. Cameron Wolfe's Substack on conditional computation: https://lnkd.in/gtpjMHnJ
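To make the router/expert split concrete, here is a minimal PyTorch sketch of an MoE layer with top-k routing. The dimensions, expert count, and top-k value are illustrative assumptions, and it deliberately omits the load-balancing and router z-loss terms discussed above; it is not the layer from any particular model.

```python
# Minimal MoE layer: a router scores experts per token, and each token is
# processed only by its top-k experts (conditional computation).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)          # gating mechanism
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                    # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)            # (tokens, n_experts)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(x)
        # Each expert only sees the tokens routed to it; outputs are combined
        # with the (renormalized) router weights.
        for e, expert in enumerate(self.experts):
            token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            weight = topk_probs[token_ids, slot].unsqueeze(-1)
            out[token_ids] += weight * expert(x[token_ids])
        return out


tokens = torch.randn(10, 64)            # 10 token activations
print(MoELayer()(tokens).shape)         # torch.Size([10, 64])
```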
base-aside-card__media--right"> <img alt="Conditional Computation in LLMs (MoEs): Understanding the Concept of MoEs The Mixture of Experts (MoE) works on the idea of conditional computation, allowing the model to only run through some parts (an Expert) for a given input compared to a dense model where all parameters are used. It consists of a gating mechanism that selects the experts for the given input, enabling the model to scale without increasing the required computation. Popular LLM architectures like Mixtral and Grok use the same concept. What is MoEs? The Mixture of Experts (MoE) makes a simple modification to the usual architecture by replacing all or some of the Feed Forward layers with an MoE layer. The MoE layer consists of: 1. Router: A parameter gating mechanism that selects the experts used to process the given input by producing a probability distribution over experts. 2. Experts: Standalone NN modules with independent sets of parameters trained jointly with routers. It can use more complex architecture to create hierarchical MoE modules by implementing each expert as another MoE. - Implementation in Switch Transformers : Training was found to be 7x faster than training the base T5, keeping computation and training time constant. - Gshard and MeshTransformer : These frameworks propose efficient methods to train the MoE architectures using the parallelisation of models across GPUs/TPUs. Challenges with MoEs and Solutions 1. Inconsistent Batch Size for Experts: Can be overcome with the use of "soft" constraints in training loss. 2. Load Balancing Across Experts: The issue of repeatedly using the same few experts while training can be overcome by including better balancing mechanisms. 3. Model Instability Issues: Use of router-z Loss as an auxiliary loss to encourage smaller values within the router mechanism. 4. Model Overfitting: Requires different hyperparameter settings for effective fine-tuning. It was found that models benefit from high learning rates and smaller batch sizes. Recommended Reading/References 1. Hugging Face Blog on MoE - https://lnkd.in/g36gQJTx 2. 
</p> <a href="https://www.linkedin.com/pulse/topics/home/" class="content-hub-cta__btn btn-md btn-secondary-emphasis text-md font-bold mt-1 block box-border" data-tracking-control-name="public_profile_content-hub-cta" data-tracking-will-navigate> Explore More </a> </section> </div> </section> <!----> <section class="aside-section-container mb-4 course-recommendations"> <h2 class="aside-section-container__title section-title"> Add new skills with these courses </h2> <!----> <div class="aside-section-container__content break-words"> <ul class="course-recommendations__courses"> <li class="course-recommendations__course"> <!----> <a href="https://www.linkedin.com/learning/learning-full-stack-javascript-development-mongodb-node-and-react-15581237?trk=public_profile_recommended-course" target="_self" data-tracking-control-name="public_profile_recommended-course" data-tracking-will-navigate class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-aside-card flex my-1.5 hover:show-play-button focus:show-play-button base-aside-card--link aside-learning-course-card"> <!----> <div class="base-aside-card__media flex-shrink-0 mr-0.5 overflow-hidden relative rounded-md h-[54px] w-[95px]"> <img class="base-aside-card__media-element w-[100px] h-full object-cover" alt data-delayed-url="https://media.licdn.com/dms/image/v2/C560DAQGx_lsYcLsXOA/learning-public-crop_288_512/learning-public-crop_288_512/0/1671474208254?e=2147483647&v=beta&t=oTF9fEGc88rb9fDiSNYYl25fecfXfcjRzYlgSe8vqdk"> <div class="aside-learning-course-card__duration duration">3h 35m</div> <icon class="base-aside-card__play-button w-auto play-button overlay-center" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/9n9raq7fmdu241tpsxwodsmcd" data-svg-class-name="base-aside-card__play-button-svg"></icon> </div> <div class="base-aside-card__info self-center pl-0.5 flex flex-col flex-1"> <h3 class="base-aside-card__title font-sans text-md font-bold text-color-text relative"> Learning Full-Stack JavaScript Development: MongoDB, Node, and React <!----> </h3> <!----> <!----> <!----> <!----> <!----> </div> <!----> </a> </li> <li class="course-recommendations__course"> <!----> <a href="https://www.linkedin.com/learning/deep-learning-and-generative-ai-data-prep-analysis-and-visualization-with-python?trk=public_profile_recommended-course" target="_self" data-tracking-control-name="public_profile_recommended-course" data-tracking-will-navigate class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-aside-card flex my-1.5 hover:show-play-button focus:show-play-button base-aside-card--link aside-learning-course-card"> <!----> <div class="base-aside-card__media flex-shrink-0 mr-0.5 overflow-hidden relative rounded-md h-[54px] w-[95px]"> <img class="base-aside-card__media-element w-[100px] h-full object-cover" alt data-delayed-url="https://media.licdn.com/dms/image/v2/D4E0DAQHhXac0lVBFBQ/learning-public-crop_288_512/learning-public-crop_288_512/0/1726775536871?e=2147483647&v=beta&t=iHS_SRiJJITXy05Ev6wyjcAcESXXQgFSDXpLKm78NMM"> <div class="aside-learning-course-card__duration duration">1h 56m</div> <icon class="base-aside-card__play-button w-auto play-button overlay-center" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/9n9raq7fmdu241tpsxwodsmcd" data-svg-class-name="base-aside-card__play-button-svg"></icon> </div> <div class="base-aside-card__info self-center pl-0.5 flex flex-col flex-1"> <h3 class="base-aside-card__title font-sans text-md font-bold text-color-text relative"> 
Deep Learning and Generative AI: Data Prep, Analysis, and Visualization with Python <!----> </h3> <!----> <!----> <!----> <!----> <!----> </div> <!----> </a> </li> <li class="course-recommendations__course"> <!----> <a href="https://www.linkedin.com/learning/choose-the-right-tool-for-your-data-python-r-or-sql?trk=public_profile_recommended-course" target="_self" data-tracking-control-name="public_profile_recommended-course" data-tracking-will-navigate class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-aside-card flex my-1.5 hover:show-play-button focus:show-play-button base-aside-card--link aside-learning-course-card"> <!----> <div class="base-aside-card__media flex-shrink-0 mr-0.5 overflow-hidden relative rounded-md h-[54px] w-[95px]"> <img class="base-aside-card__media-element w-[100px] h-full object-cover" alt data-delayed-url="https://media.licdn.com/dms/image/v2/C560DAQG7_LXxHrUZ4w/learning-public-crop_288_512/learning-public-crop_288_512/0/1663350562254?e=2147483647&v=beta&t=UZLl3-krQgAh4ooKiosmbwMFP7GvGmikwEpd7XmMX5M"> <div class="aside-learning-course-card__duration duration">1h 7m</div> <icon class="base-aside-card__play-button w-auto play-button overlay-center" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/9n9raq7fmdu241tpsxwodsmcd" data-svg-class-name="base-aside-card__play-button-svg"></icon> </div> <div class="base-aside-card__info self-center pl-0.5 flex flex-col flex-1"> <h3 class="base-aside-card__title font-sans text-md font-bold text-color-text relative"> Choose the Right Tool for Your Data: Python, R, or SQL <!----> </h3> <!----> <!----> <!----> <!----> <!----> </div> <!----> </a> </li> </ul> <a href="https://www.linkedin.com/learning/?trk=seo_pp_d_cymbii_more_m015_learning" class="course-recommendations__view-all-link btn-md btn-secondary-emphasis inline-block" data-tracking-control-name="seo_pp_d_cymbii_more_m015_learning" data-tracking-will-navigate> See all courses </a> </div> </section> <!----> </section> </main> <!----> <footer class="li-footer bg-transparent w-full "> <ul class="li-footer__list flex flex-wrap flex-row items-start justify-start w-full h-auto min-h-[50px] my-[0px] mx-auto py-3 px-2 papabear:w-[1128px] papabear:p-0"> <li class="li-footer__item font-sans text-xs text-color-text-solid-secondary flex flex-shrink-0 justify-start p-1 relative w-50% papabear:justify-center papabear:w-auto"> <span class="sr-only">LinkedIn</span> <icon class="li-footer__copy-logo text-color-logo-brand-alt inline-block self-center h-[14px] w-[56px] mr-1" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/5mebydpuuijm3uhv1q375inqh"></icon> <span class="li-footer__copy-text flex items-center">© 2025</span> </li> <li class="li-footer__item font-sans text-xs text-color-text-solid-secondary flex flex-shrink-0 justify-start p-1 relative w-50% papabear:justify-center papabear:w-auto"> <a class="li-footer__item-link flex items-center font-sans text-xs font-bold text-color-text-solid-secondary hover:text-color-link-hover focus:text-color-link-focus" href="https://about.linkedin.com?trk=public_profile_v3_desktop_footer-about" data-tracking-control-name="public_profile_v3_desktop_footer-about" data-tracking-will-navigate> About </a> </li> <li class="li-footer__item font-sans text-xs text-color-text-solid-secondary flex flex-shrink-0 justify-start p-1 relative w-50% papabear:justify-center papabear:w-auto"> <a class="li-footer__item-link flex items-center font-sans text-xs font-bold text-color-text-solid-secondary 
role="presentation"> <!-- Adding aria-label to both the li and the button because screen reader focus goes to button on desktop and li on mobile--> <button aria-label="ਪੰਜਾਬੀ (Punjabi)" class="font-sans text-xs link block py-[5px] px-2 w-full hover:cursor-pointer hover:bg-color-action hover:text-color-text-on-dark focus:bg-color-action focus:text-color-text-on-dark language-selector__link !font-regular" data-tracking-control-name="language-selector-pa_IN" data-locale="pa_IN" role="menuitem" lang="pa_IN"> ਪੰਜਾਬੀ (Punjabi) </button> </li> <li class="language-selector__item" role="presentation"> <!-- Adding aria-label to both the li and the button because screen reader focus goes to button on desktop and li on mobile--> <button aria-label="Polski (Polish)" class="font-sans text-xs link block py-[5px] px-2 w-full hover:cursor-pointer hover:bg-color-action hover:text-color-text-on-dark focus:bg-color-action focus:text-color-text-on-dark language-selector__link !font-regular" data-tracking-control-name="language-selector-pl_PL" data-locale="pl_PL" role="menuitem" lang="pl_PL"> Polski (Polish) </button> </li> <li class="language-selector__item" role="presentation"> <!-- Adding aria-label to both the li and the button because screen reader focus goes to button on desktop and li on mobile--> <button aria-label="Português (Portuguese)" class="font-sans text-xs link block py-[5px] px-2 w-full hover:cursor-pointer hover:bg-color-action hover:text-color-text-on-dark focus:bg-color-action focus:text-color-text-on-dark language-selector__link !font-regular" data-tracking-control-name="language-selector-pt_BR" data-locale="pt_BR" role="menuitem" lang="pt_BR"> Português (Portuguese) </button> </li> <li class="language-selector__item" role="presentation"> <!-- Adding aria-label to both the li and the button because screen reader focus goes to button on desktop and li on mobile--> <button aria-label="Română (Romanian)" class="font-sans text-xs link block py-[5px] px-2 w-full hover:cursor-pointer hover:bg-color-action hover:text-color-text-on-dark focus:bg-color-action focus:text-color-text-on-dark language-selector__link !font-regular" data-tracking-control-name="language-selector-ro_RO" data-locale="ro_RO" role="menuitem" lang="ro_RO"> Română (Romanian) </button> </li> <li class="language-selector__item" role="presentation"> <!-- Adding aria-label to both the li and the button because screen reader focus goes to button on desktop and li on mobile--> <button aria-label="Русский (Russian)" class="font-sans text-xs link block py-[5px] px-2 w-full hover:cursor-pointer hover:bg-color-action hover:text-color-text-on-dark focus:bg-color-action focus:text-color-text-on-dark language-selector__link !font-regular" data-tracking-control-name="language-selector-ru_RU" data-locale="ru_RU" role="menuitem" lang="ru_RU"> Русский (Russian) </button> </li> <li class="language-selector__item" role="presentation"> <!-- Adding aria-label to both the li and the button because screen reader focus goes to button on desktop and li on mobile--> <button aria-label="Svenska (Swedish)" class="font-sans text-xs link block py-[5px] px-2 w-full hover:cursor-pointer hover:bg-color-action hover:text-color-text-on-dark focus:bg-color-action focus:text-color-text-on-dark language-selector__link !font-regular" data-tracking-control-name="language-selector-sv_SE" data-locale="sv_SE" role="menuitem" lang="sv_SE"> Svenska (Swedish) </button> </li> <li class="language-selector__item" role="presentation"> <!-- Adding aria-label to both the li and 
the button because screen reader focus goes to button on desktop and li on mobile--> <button aria-label="తెలుగు (Telugu)" class="font-sans text-xs link block py-[5px] px-2 w-full hover:cursor-pointer hover:bg-color-action hover:text-color-text-on-dark focus:bg-color-action focus:text-color-text-on-dark language-selector__link !font-regular" data-tracking-control-name="language-selector-te_IN" data-locale="te_IN" role="menuitem" lang="te_IN"> తెలుగు (Telugu) </button> </li> <li class="language-selector__item" role="presentation"> <!-- Adding aria-label to both the li and the button because screen reader focus goes to button on desktop and li on mobile--> <button aria-label="ภาษาไทย (Thai)" class="font-sans text-xs link block py-[5px] px-2 w-full hover:cursor-pointer hover:bg-color-action hover:text-color-text-on-dark focus:bg-color-action focus:text-color-text-on-dark language-selector__link !font-regular" data-tracking-control-name="language-selector-th_TH" data-locale="th_TH" role="menuitem" lang="th_TH"> ภาษาไทย (Thai) </button> </li> <li class="language-selector__item" role="presentation"> <!-- Adding aria-label to both the li and the button because screen reader focus goes to button on desktop and li on mobile--> <button aria-label="Tagalog (Tagalog)" class="font-sans text-xs link block py-[5px] px-2 w-full hover:cursor-pointer hover:bg-color-action hover:text-color-text-on-dark focus:bg-color-action focus:text-color-text-on-dark language-selector__link !font-regular" data-tracking-control-name="language-selector-tl_PH" data-locale="tl_PH" role="menuitem" lang="tl_PH"> Tagalog (Tagalog) </button> </li> <li class="language-selector__item" role="presentation"> <!-- Adding aria-label to both the li and the button because screen reader focus goes to button on desktop and li on mobile--> <button aria-label="Türkçe (Turkish)" class="font-sans text-xs link block py-[5px] px-2 w-full hover:cursor-pointer hover:bg-color-action hover:text-color-text-on-dark focus:bg-color-action focus:text-color-text-on-dark language-selector__link !font-regular" data-tracking-control-name="language-selector-tr_TR" data-locale="tr_TR" role="menuitem" lang="tr_TR"> Türkçe (Turkish) </button> </li> <li class="language-selector__item" role="presentation"> <!-- Adding aria-label to both the li and the button because screen reader focus goes to button on desktop and li on mobile--> <button aria-label="Українська (Ukrainian)" class="font-sans text-xs link block py-[5px] px-2 w-full hover:cursor-pointer hover:bg-color-action hover:text-color-text-on-dark focus:bg-color-action focus:text-color-text-on-dark language-selector__link !font-regular" data-tracking-control-name="language-selector-uk_UA" data-locale="uk_UA" role="menuitem" lang="uk_UA"> Українська (Ukrainian) </button> </li> <li class="language-selector__item" role="presentation"> <!-- Adding aria-label to both the li and the button because screen reader focus goes to button on desktop and li on mobile--> <button aria-label="Tiếng Việt (Vietnamese)" class="font-sans text-xs link block py-[5px] px-2 w-full hover:cursor-pointer hover:bg-color-action hover:text-color-text-on-dark focus:bg-color-action focus:text-color-text-on-dark language-selector__link !font-regular" data-tracking-control-name="language-selector-vi_VN" data-locale="vi_VN" role="menuitem" lang="vi_VN"> Tiếng Việt (Vietnamese) </button> </li> <li class="language-selector__item" role="presentation"> <!-- Adding aria-label to both the li and the button because screen reader focus goes to button on 
desktop and li on mobile--> <button aria-label="简体中文 (Chinese (Simplified))" class="font-sans text-xs link block py-[5px] px-2 w-full hover:cursor-pointer hover:bg-color-action hover:text-color-text-on-dark focus:bg-color-action focus:text-color-text-on-dark language-selector__link !font-regular" data-tracking-control-name="language-selector-zh_CN" data-locale="zh_CN" role="menuitem" lang="zh_CN"> 简体中文 (Chinese (Simplified)) </button> </li> <li class="language-selector__item" role="presentation"> <!-- Adding aria-label to both the li and the button because screen reader focus goes to button on desktop and li on mobile--> <button aria-label="正體中文 (Chinese (Traditional))" class="font-sans text-xs link block py-[5px] px-2 w-full hover:cursor-pointer hover:bg-color-action hover:text-color-text-on-dark focus:bg-color-action focus:text-color-text-on-dark language-selector__link !font-regular" data-tracking-control-name="language-selector-zh_TW" data-locale="zh_TW" role="menuitem" lang="zh_TW"> 正體中文 (Chinese (Traditional)) </button> </li> <!----> </ul> <button class="language-selector__button select-none relative pr-2 font-sans text-xs font-bold text-color-text-low-emphasis hover:text-color-link-hover hover:cursor-pointer focus:text-color-link-focus focus:outline-dotted focus:outline-1" aria-expanded="false" data-tracking-control-name="footer-lang-dropdown_trigger"> <span class="language-selector__label-text mr-0.5 break-words"> Language </span> <icon class="language-selector__label-chevron w-2 h-2 absolute top-0 right-0" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/cyolgscd0imw2ldqppkrb84vo"></icon> </button> </div> </li> </ul> <!----> </footer> <div class="guest-upsells"> <form class="google-auth base-google-auth" action="https://www.linkedin.com/uas/login-submit" method="post"> <input name="loginCsrfParam" value="bfb3b82d-2e82-4a0c-8244-5b7e6275c22d" type="hidden"> <input name="session_redirect" value="https://www.linkedin.com/in/sebastianraschka" type="hidden"> <input name="trk" value="public_profile_google-one-tap-submit" type="hidden"> <div class="google-one-tap__module hidden fixed flex flex-col items-center top-[20px] right-[20px] z-[9999]"> <div class="google-auth__tnc-container hidden relative top-2 bg-color-background-container-tint pl-2 pr-1 pt-2 pb-3 w-[375px] rounded-md shadow-2xl"> <p class="text-md font-bold text-color-text"> Agree & Join LinkedIn </p> <p class="linkedin-tc__text text-color-text-low-emphasis text-xs pb-2 !text-sm !text-color-text" data-impression-id="public_profile_one-tap-skip-tc-text"> By clicking Continue to join or sign in, you agree to LinkedIn’s <a href="/legal/user-agreement?trk=linkedin-tc_auth-button_user-agreement" target="_blank" data-tracking-control-name="linkedin-tc_auth-button_user-agreement" data-tracking-will-navigate="true">User Agreement</a>, <a href="/legal/privacy-policy?trk=linkedin-tc_auth-button_privacy-policy" target="_blank" data-tracking-control-name="linkedin-tc_auth-button_privacy-policy" data-tracking-will-navigate="true">Privacy Policy</a>, and <a href="/legal/cookie-policy?trk=linkedin-tc_auth-button_cookie-policy" target="_blank" data-tracking-control-name="linkedin-tc_auth-button_cookie-policy" data-tracking-will-navigate="true">Cookie Policy</a>. 
</p> </div> <div data-tracking-control-name="public_profile_google-one-tap" id="google-one-tap__container"></div> </div> <div class="loader loader--full-screen"> <div class="loader__container mb-2 overflow-hidden"> <icon class="loader__icon inline-block loader__icon--default text-color-progress-loading" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/ddi43qwelxeqjxdd45pe3fvs1" data-svg-class-name="loader__icon-svg--large fill-currentColor h-[60px] min-h-[60px] w-[60px] min-w-[60px]"></icon> </div> </div> </form> <script data-delayed-url="https://static.licdn.com/aero-v1/sc/h/29rdkxlvag0d3cpj96fiilbju" data-module-id="google-gsi-lib"></script> <code id="isLinkedInAppWebView" style="display: none"><!--false--></code> <code id="shouldRemoveUndefinedValues" style="display: none"><!--false--></code> <code id="isItpSupportEnabled" style="display: none"><!--false--></code> <code id="tncFlow" style="display: none"><!--"control"--></code> <code id="isGoogleAuthButtonLocaleSupportEnabled" style="display: none"><!--true--></code> <code id="gsiLocale" style="display: none"><!--"en_US"--></code> <div class="contextual-sign-in-modal base-contextual-sign-in-modal" data-impression-id="public_profile_contextual-sign-in-modal" data-cool-off-enabled data-show-on-page-load> <!----> <div class> <!----> <div id="base-contextual-sign-in-modal" class="modal modal--contextual-sign-in" data-outlet="base-contextual-sign-in-modal"> <!----> <div class="modal__overlay flex items-center bg-color-background-scrim justify-center fixed bottom-0 left-0 right-0 top-0 opacity-0 invisible pointer-events-none z-[1000] transition-[opacity] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.17s] py-4 " aria-hidden="true"> <section aria-modal="true" role="dialog" aria-labelledby="base-contextual-sign-in-modal-modal-header" tabindex="-1" class="max-h-full modal__wrapper overflow-auto p-0 bg-color-surface max-w-[1128px] min-h-[160px] relative scale-[0.25] shadow-sm shadow-color-border-faint transition-[transform] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.33s] focus:outline-0 w-[1128px] mamabear:w-[744px] babybear:w-[360px] rounded-md"> <button class="modal__dismiss btn-tertiary h-[40px] w-[40px] p-0 rounded-full indent-0 contextual-sign-in-modal__modal-dismiss absolute right-0 m-[20px] cursor-pointer" aria-label="Dismiss" data-tracking-control-name="public_profile_contextual-sign-in-modal_modal_dismiss"> <icon class="contextual-sign-in-modal__modal-dismiss-icon" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/gs508lg3t2o81tq7pmcgn6m2"></icon> </button> <div class="modal__main w-full"> <div class="contextual-sign-in-modal__screen contextual-sign-in-modal__context-screen flex flex-col my-4 mx-3"> <img class="inline-block relative rounded-[50%] w-16 h-16 contextual-sign-in-modal__img m-auto" data-delayed-url="https://media.licdn.com/dms/image/v2/C4E03AQHVyH-IfD1KbQ/profile-displayphoto-shrink_200_200/profile-displayphoto-shrink_200_200/0/1629877234109?e=2147483647&v=beta&t=OTOiJBYLX5LT7qC2dke5DvYyo_GnlKJRKC45xXeY_7o" data-ghost-classes="bg-color-entity-ghost-background" data-ghost-url="https://static.licdn.com/aero-v1/sc/h/9c8pery4andzj6ohjkjp54ma2" alt> <h2 class="contextual-sign-in-modal__context-screen-title font-sans text-xl text-color-text my-2 mx-4 text-center" id="base-contextual-sign-in-modal-modal-header"> View Sebastian’s full profile </h2> <!----><!----> <div class="contextual-sign-in-modal__btn-container m-auto w-[320px] babybear:w-full"> <!----> <div class="w-full max-w-[400px] mx-auto"> <div 
class="google-auth-button"> <!----> <div class="google-auth-button__placeholder mx-auto " data-theme="filled_blue" data-logo-alignment="center" data-locale="en_US" role="button" aria-label="Continue with google"></div> <!----> </div> </div> <div class="sign-in-modal" data-impression-id="public_profile_contextual-sign-in-modal_sign-in-modal"> <button class="sign-in-modal__outlet-btn cursor-pointer btn-md btn-primary btn-secondary" data-tracking-client-ingraph data-tracking-control-name="public_profile_contextual-sign-in-modal_sign-in-modal_outlet-button" data-modal="base-sign-in-modal"> <!----> Sign in </button> <div class> <!----> <div id="base-sign-in-modal" class="modal modal--sign-in" data-outlet="base-sign-in-modal"> <!----> <div class="modal__overlay flex items-center bg-color-background-scrim justify-center fixed bottom-0 left-0 right-0 top-0 opacity-0 invisible pointer-events-none z-[1000] transition-[opacity] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.17s] py-4 " aria-hidden="true"> <section aria-modal="true" role="dialog" aria-labelledby="base-sign-in-modal-modal-header" tabindex="-1" class="max-h-full modal__wrapper overflow-auto p-0 bg-color-surface max-w-[1128px] min-h-[160px] relative scale-[0.25] shadow-sm shadow-color-border-faint transition-[transform] ease-[cubic-bezier(0.25,0.1,0.25,1.0)] duration-[0.33s] focus:outline-0 w-[1128px] mamabear:w-[744px] babybear:w-[360px] rounded-md"> <button class="modal__dismiss btn-tertiary h-[40px] w-[40px] p-0 rounded-full indent-0 sign-in-modal__dismiss absolute right-0 cursor-pointer m-[20px]" aria-label="Dismiss" data-tracking-control-name="public_profile_contextual-sign-in-modal_sign-in-modal_dismiss"> <icon class="sign-in-modal__dismiss-icon" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/gs508lg3t2o81tq7pmcgn6m2"></icon> </button> <div class="modal__main w-full"> <div class="sign-in-modal__screen flex flex-col py-4 w-[513px] babybear:w-full px-3"> <h2 class="sign-in-modal__header font-sans text-display-md text-color-text "> Welcome back </h2> <code id="i18n_sign_in_form_show_text" style="display: none"><!--"Show"--></code> <code id="i18n_sign_in_form_show_label" style="display: none"><!--"Show your LinkedIn password"--></code> <code id="i18n_sign_in_form_hide_text" style="display: none"><!--"Hide"--></code> <code id="i18n_sign_in_form_hide_label" style="display: none"><!--"Hide your LinkedIn password"--></code> <code id="i18n_username_error_empty" style="display: none"><!--"Please enter an email address or phone number"--></code> <code id="i18n_username_error_too_long" style="display: none"><!--"Email or phone number must be between 3 to 128 characters"--></code> <code id="i18n_username_error_too_short" style="display: none"><!--"Email or phone number must be between 3 to 128 characters"--></code> <code id="i18n_password_error_empty" style="display: none"><!--"Please enter a password"--></code> <code id="i18n_password_error_too_short" style="display: none"><!--"The password you provided must have at least 6 characters"--></code> <code id="i18n_password_error_too_long" style="display: none"><!--"The password you provided must have at most 400 characters"--></code> <!----> <form data-id="sign-in-form" action="https://www.linkedin.com/uas/login-submit" method="post" novalidate class="mt-1.5 mb-2"> <input name="loginCsrfParam" value="bfb3b82d-2e82-4a0c-8244-5b7e6275c22d" type="hidden"> <div class="flex flex-col"> <div class="mt-1.5" data-js-module-id="guest-input"> <div class="flex flex-col"> <label class="input-label 
mb-1" for="base-sign-in-modal_session_key"> Email or phone </label> <div class="text-input flex"> <input class="text-color-text font-sans text-md outline-0 bg-color-transparent w-full" autocomplete="username" id="base-sign-in-modal_session_key" name="session_key" required data-tracking-control-name="public_profile_contextual-sign-in-modal_sign-in-modal_sign-in-session-key" data-tracking-client-ingraph type="text"> </div> </div> <p class="input-helper mt-1.5" for="base-sign-in-modal_session_key" role="alert" data-js-module-id="guest-input__message"></p> </div> <div class="mt-1.5" data-js-module-id="guest-input"> <div class="flex flex-col"> <label class="input-label mb-1" for="base-sign-in-modal_session_password"> Password </label> <div class="text-input flex"> <input class="text-color-text font-sans text-md outline-0 bg-color-transparent w-full" autocomplete="current-password" id="base-sign-in-modal_session_password" name="session_password" required data-tracking-control-name="public_profile_contextual-sign-in-modal_sign-in-modal_sign-in-password" data-tracking-client-ingraph type="password"> <button aria-live="assertive" aria-relevant="text" data-id="sign-in-form__password-visibility-toggle" class="font-sans text-md font-bold text-color-action z-10 ml-[12px] hover:cursor-pointer" aria-label="Show your LinkedIn password" data-tracking-control-name="public_profile_contextual-sign-in-modal_sign-in-modal_sign-in-password-visibility-toggle-btn" type="button">Show</button> </div> </div> <p class="input-helper mt-1.5" for="base-sign-in-modal_session_password" role="alert" data-js-module-id="guest-input__message"></p> </div> <input name="session_redirect" value="https://www.linkedin.com/in/sebastianraschka" type="hidden"> <!----> </div> <div data-id="sign-in-form__footer" class="flex justify-between sign-in-form__footer--full-width"> <a data-id="sign-in-form__forgot-password" class="font-sans text-md font-bold link leading-regular sign-in-form__forgot-password--full-width" href="https://www.linkedin.com/uas/request-password-reset?trk=public_profile_contextual-sign-in-modal_sign-in-modal_forgot_password" data-tracking-control-name="public_profile_contextual-sign-in-modal_sign-in-modal_forgot_password" data-tracking-will-navigate>Forgot password?</a> <!----> <input name="trk" value="public_profile_contextual-sign-in-modal_sign-in-modal_sign-in-submit" type="hidden"> <button class="btn-md btn-primary flex-shrink-0 cursor-pointer sign-in-form__submit-btn--full-width" data-id="sign-in-form__submit-btn" data-tracking-control-name="public_profile_contextual-sign-in-modal_sign-in-modal_sign-in-submit-btn" data-tracking-client-ingraph data-tracking-litms type="submit"> Sign in </button> </div> <div class="sign-in-form__divider left-right-divider pt-2 pb-3"> <p class="sign-in-form__divider-text font-sans text-sm text-color-text px-2"> or </p> </div> </form> <div class="w-full max-w-[400px] mx-auto"> <div class="google-auth-button" data-tracking-control-name="public_profile_contextual-sign-in-modal_sign-in-modal_google-auth-button" data-tracking-client-ingraph> <p class="linkedin-tc__text text-color-text-low-emphasis text-xs pb-2" data-impression-id="public_profile_contextual-sign-in-modal_sign-in-modal__button-skip-tc-text"> By clicking Continue to join or sign in, you agree to LinkedIn’s <a href="/legal/user-agreement?trk=public_profile_contextual-sign-in-modal_sign-in-modal_auth-button_user-agreement" target="_blank" 
data-tracking-control-name="public_profile_contextual-sign-in-modal_sign-in-modal_auth-button_user-agreement" data-tracking-will-navigate="true">User Agreement</a>, <a href="/legal/privacy-policy?trk=public_profile_contextual-sign-in-modal_sign-in-modal_auth-button_privacy-policy" target="_blank" data-tracking-control-name="public_profile_contextual-sign-in-modal_sign-in-modal_auth-button_privacy-policy" data-tracking-will-navigate="true">Privacy Policy</a>, and <a href="/legal/cookie-policy?trk=public_profile_contextual-sign-in-modal_sign-in-modal_auth-button_cookie-policy" target="_blank" data-tracking-control-name="public_profile_contextual-sign-in-modal_sign-in-modal_auth-button_cookie-policy" data-tracking-will-navigate="true">Cookie Policy</a>. </p> <div class="google-auth-button__placeholder mx-auto google-auth-button__placeholder--black-border" data-theme="outline" data-logo-alignment="center" data-locale="en_US" role="button" aria-label="Continue with google" data-safe-to-skip-tnc-redirect></div> <!----> </div> </div> <!----> <p class="sign-in-modal__join-now m-auto font-sans text-md text-color-text mt-2"> New to LinkedIn? <a href="https://www.linkedin.com/signup/public-profile-join?vieweeVanityName=sebastianraschka&trk=public_profile_contextual-sign-in-modal_sign-in-modal_join-link" data-tracking-control-name="public_profile_contextual-sign-in-modal_sign-in-modal_join-link" data-tracking-will-navigate="true" class="sign-in-modal__join-link">Join now</a> </p> </div> </div> <!----> </section> </div> </div> </div> </div> <div class="contextual-sign-in-modal__divider left-right-divider"> <p class="contextual-sign-in-modal__divider-text font-sans text-sm text-color-text px-2"> or </p> </div> </div> <p class="contextual-sign-in-modal__join-now m-auto font-sans text-md text-color-text my-1"> New to LinkedIn? <a href="https://www.linkedin.com/signup/public-profile-join?vieweeVanityName=sebastianraschka&trk=public_profile_contextual-sign-in-modal_join-link" data-tracking-control-name="public_profile_contextual-sign-in-modal_join-link" data-tracking-will-navigate="true" class="contextual-sign-in-modal__join-link">Join now</a> </p> <p class="linkedin-tc__text text-color-text-low-emphasis text-xs pb-2 contextual-sign-in-modal__terms-and-conditions m-auto w-[320px] pt-2 babybear:w-full" data-impression-id="linkedin-tc__button-skip-tc-text"> By clicking Continue to join or sign in, you agree to LinkedIn’s <a href="/legal/user-agreement?trk=linkedin-tc_auth-button_user-agreement" target="_blank" data-tracking-control-name="linkedin-tc_auth-button_user-agreement" data-tracking-will-navigate="true">User Agreement</a>, <a href="/legal/privacy-policy?trk=linkedin-tc_auth-button_privacy-policy" target="_blank" data-tracking-control-name="linkedin-tc_auth-button_privacy-policy" data-tracking-will-navigate="true">Privacy Policy</a>, and <a href="/legal/cookie-policy?trk=linkedin-tc_auth-button_cookie-policy" target="_blank" data-tracking-control-name="linkedin-tc_auth-button_cookie-policy" data-tracking-will-navigate="true">Cookie Policy</a>. 
</p> </div> </div> <!----> </section> </div> </div> </div> </div> <div class="cta-modal overflow-hidden container-raised z-10 fixed bottom-3 right-3 min-h-[56px] p-2 babybear:hidden windows-app-upsell windows-app-upsell--msft flex flex-col p-2 w-[359px] !bg-[#F1F8FA] opacity-90 backdrop-blur-[2px] z-1" data-impression-id="public_profile_windows-app-upsell_cta-modal" role="dialog" aria-labelledby="cta-modal-header" aria-describedby="cta-modal-subheader"> <div class="windows-app-upsell__linkedin-title-container pt-[6px] mb-1.5 flex align-center"> <icon class="windows-app-upsell__linkedin-bug-icon block w-[21px] h-[21px]" data-svg-class-name="windows-app-upsell__linkedin-bug-icon-svg w-[21px] h-[21px]" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/euqjj7tf5wvr33frd3x1jj9s"></icon> <p class="windows-app-upsell__linkedin-title uppercase text-xs text-color-text-secondary leading-[21px] ml-1"> LinkedIn </p> </div> <p class="windows-app-upsell__title font-sans text-md text-color-text-accent-4-hover font-semibold leading-regular mb-1"> LinkedIn is better on the app </p> <p class="windows-app-upsell__body font-sans text-sm text-color-text-secondary leading-regular"> Don’t have the app? Get it in the Microsoft Store. </p> <a class="windows-app-upsell__cta btn-sm btn-secondary-emphasis mt-2 mb-[6px] w-fit" href="ms-windows-store://pdp/?ProductId=9WZDNCRFJ4Q7&mode=mini&cid=guest_desktop_upsell" data-tracking-client-ingraph data-tracking-control-name="public_profile_windows-app-upsell_cta" data-tracking-will-navigate> Open the app </a> <button class="cta-modal__dismiss-btn absolute h-4 w-4 p-1 top-2 right-2 hover:cursor-pointer focus:outline focus:outline-2 focus:outline-color-action" data-tracking-control-name="public_profile_windows-app-upsell_dismiss" aria-label="Dismiss"> <icon class="cta-modal__dismiss-icon block h-2 w-2 onload" data-delayed-url="https://static.licdn.com/aero-v1/sc/h/adzjokfylbe8pvjr9h8iv96mw"></icon> </button> </div> </div> <code id="disableOneTapOnInitIfCsm" style="display: none"><!--false--></code> <!----> <code id="enableFingerprintingJS" style="display: none"><!--true--></code> <code id="fingerprintingUrlPath" style="display: none"><!--"/platform-telemetry/li/collect"--></code> <code id="fingerprintingWaitTime" style="display: none"><!--2000--></code> <script src="https://static.licdn.com/aero-v1/sc/h/5c81icanok4a9if4xo1qkuq7c" async></script> <!----> <script src="https://static.licdn.com/aero-v1/sc/h/b42mogpzwnwuzsp3ikwd8bza0" async defer></script> <!----> <!----><!----> </body> </html>