<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" > <channel> <title>Engineering at Meta</title> <atom:link href="https://engineering.fb.com/feed/" rel="self" type="application/rss+xml" /> <link>https://engineering.fb.com/</link> <description>Engineering at Meta Blog</description> <lastBuildDate>Tue, 18 Feb 2025 20:03:42 +0000</lastBuildDate> <language>en-US</language> <sy:updatePeriod> hourly </sy:updatePeriod> <sy:updateFrequency> 1 </sy:updateFrequency> <generator>https://wordpress.org/?v=6.7.2</generator> <site xmlns="com-wordpress:feed-additions:1">147945108</site> <item> <title>Protecting user data through source code analysis at scale</title> <link>https://engineering.fb.com/2025/02/18/security/protecting-user-data-through-source-code-analysis/</link> <dc:creator><![CDATA[]]></dc:creator> <pubDate>Tue, 18 Feb 2025 20:30:49 +0000</pubDate> <category><![CDATA[Security]]></category> <guid isPermaLink="false">https://engineering.fb.com/?p=22267</guid> <description><![CDATA[<p>Meta’s Anti Scraping team focuses on preventing unauthorized scraping as part of our ongoing work to combat data misuse. In order to protect Meta’s changing codebase from scraping attacks, we have introduced static analysis tools into our workflow. 
These tools allow us to detect potential scraping vectors at scale across our Facebook, Instagram, and even [...]</p> <p><a class="btn btn-secondary understrap-read-more-link" href="https://engineering.fb.com/2025/02/18/security/protecting-user-data-through-source-code-analysis/">Read More...</a></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/02/18/security/protecting-user-data-through-source-code-analysis/">Protecting user data through source code analysis at scale</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></description> <content:encoded><![CDATA[<p><span style="font-weight: 400;">Meta’s Anti Scraping team focuses on preventing </span><a href="https://about.fb.com/news/2021/04/how-we-combat-scraping/" target="_blank" rel="noopener"><span style="font-weight: 400;">unauthorized scraping</span></a><span style="font-weight: 400;"> as part of our ongoing work to combat data misuse. In order to protect Meta’s </span><a href="https://engineering.fb.com/2022/11/16/culture/meta-code-review-time-improving/" target="_blank" rel="noopener"><span style="font-weight: 400;">changing codebase</span></a><span style="font-weight: 400;"> from scraping attacks, we have introduced static analysis tools into our workflow. These tools allow us to </span><b>detect potential scraping vectors</b><span style="font-weight: 400;"> at scale across our Facebook, Instagram, and even parts of our Reality Labs codebases. </span></p> <h2><span style="font-weight: 400;">What is scraping? </span></h2> <p><a href="https://about.fb.com/news/2021/04/how-we-combat-scraping/" target="_blank" rel="noopener"><span style="font-weight: 400;">Scraping</span></a><span style="font-weight: 400;"> is the automated collection of data from a website or app and can be either authorized or unauthorized. Unauthorized scrapers commonly hide themselves by mimicking the ways users would normally use a product. 
As a result, unauthorized scraping can be difficult to detect. At Meta, we take a number of steps to </span><a href="https://about.fb.com/news/2021/04/how-we-combat-scraping/" target="_blank" rel="noopener"><span style="font-weight: 400;">combat scraping</span></a><span style="font-weight: 400;"> and have a number of methods to distinguish unauthorized automated activity from legitimate usage. </span></p> <h2><span style="font-weight: 400;">Proactive detection</span></h2> <p><span style="font-weight: 400;">Meta’s Anti-Scraping team learns about scrapers (entities attempting to scrape our systems) through many different sources. For example, we </span><a href="https://about.fb.com/news/2021/04/how-we-combat-scraping/" target="_blank" rel="noopener"><span style="font-weight: 400;">investigate suspected unauthorized scraping activity</span></a><span style="font-weight: 400;"> and take actions against such entities, including sending cease-and-desist letters and disabling accounts.</span></p> <p><span style="font-weight: 400;">Part of our strategy is to further develop proactive measures to mitigate the risk of scraping over and above our reactive approaches. One way we do this is by </span><b>turning our attack vector criteria into static analysis rules</b><span style="font-weight: 400;"> that run automatically on our entire code base. Those static analysis tools, which include</span> <a href="https://engineering.fb.com/2019/08/15/security/zoncolan/" target="_blank" rel="noopener"><span style="font-weight: 400;">Zoncolan</span></a><span style="font-weight: 400;"> for Hack and</span> <a href="https://engineering.fb.com/2020/08/07/security/pysa/" target="_blank" rel="noopener"><span style="font-weight: 400;">Pysa</span></a><span style="font-weight: 400;"> for Python, run automatically for their respective codebases and are built in-house, allowing us to customize them for Anti-Scraping purposes. 
This approach can identify potential issues early and ensure product development teams have an opportunity to remediate prior to launch.</span></p> <p><span style="font-weight: 400;">Static analysis tools enable us to apply learnings across events to systematically prevent similar issues from existing in our codebase. They also help us create best practices when developing code to combat unauthorized scraping.</span></p> <h2><span style="font-weight: 400;">Developing static analysis rules</span></h2> <p><span style="font-weight: 400;">Our static analysis tools (like</span> <a href="https://engineering.fb.com/2019/08/15/security/zoncolan/" target="_blank" rel="noopener"><span style="font-weight: 400;">Zoncolan</span></a><span style="font-weight: 400;"> and</span> <a href="https://engineering.fb.com/2020/08/07/security/pysa/" target="_blank" rel="noopener"><span style="font-weight: 400;">Pysa</span></a><span style="font-weight: 400;">) focus on tracking data flow through a program.</span></p> <p><span style="font-weight: 400;">Engineers define classes of issues using the following:</span></p> <ul> <li style="font-weight: 400;" aria-level="1"><i><span style="font-weight: 400;">Sources</span></i><span style="font-weight: 400;"> are where the data originates. For potential scraping issues, these are mostly user-controlled parameters, as these are the avenues in which scrapers control the data they could receive.</span></li> <li style="font-weight: 400;" aria-level="1"><i><span style="font-weight: 400;">Sinks</span></i><span style="font-weight: 400;"> are where the data flows to. 
For scraping, the sink is usually the point where data flows back to the user.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">An </span><i><span style="font-weight: 400;">Issue</span></i><span style="font-weight: 400;"> is found when our tools detect a possible data flow from a source to a sink.</span></li> </ul> <p><span style="font-weight: 400;">For example, assume the “source” to be the user-controlled “count” parameter that determines the number of results loaded, and the “sink” to be the data that is returned to the user. Here, the user-controlled “count” parameter is an entry point for a scraper, who can manipulate its value to extract more data than the application intended. When our tools suspect that there is a code flow between such sources and sinks, they alert the team for further triage.</span></p> <h2><span style="font-weight: 400;">An example of static analysis</span></h2> <p><span style="font-weight: 400;">Building on the example above, see the mock code excerpt below, which loads the number of followers for a page:</span></p> <pre class="line-numbers"><code class="language-none"># views/followers.py
async def get_followers(request: HttpRequest) -> HttpResponse:
    viewer = request.GET['viewer_id']
    target = request.GET['target_id']
    count = request.GET['count']
    if can_see(viewer, target):
        followers = load_followers(target, count)
        return followers

# controller/followers.py
async def load_followers(target_id: int, count: int):
    ...</code></pre> <p><span style="font-weight: 400;">In the example above, the mock endpoint backed by </span><span style="font-weight: 400; font-family: 'courier new', courier;">get_followers</span><span style="font-weight: 400;"> is a potential scraping attack vector, since the “target” and “count” parameters control whose information is loaded and how many followers are returned. 
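</span></p> <p><span style="font-weight: 400;">The source-to-sink tracking described above can be sketched as a toy checker. The code below is purely illustrative — the names and the simple assignment-propagation model are invented for this sketch; Zoncolan and Pysa are far more sophisticated:</span></p> <pre class="line-numbers"><code class="language-none"># Toy source-to-sink taint tracking (illustrative only, not Zoncolan/Pysa).
SOURCES = {"request.GET"}    # where user-controlled data originates
SINKS = {"return_to_user"}   # where data flows back to the user

def find_issues(assignments, sink_calls):
    """Report sink calls whose argument derives from a source.

    assignments: list of (lhs, rhs) pairs, e.g. ("count", "request.GET")
    sink_calls:  list of (sink_name, arg) pairs
    """
    tainted = set()
    for lhs, rhs in assignments:
        if rhs in SOURCES or rhs in tainted:
            tainted.add(lhs)  # taint propagates through assignments
    return [(sink, arg) for sink, arg in sink_calls
            if sink in SINKS and arg in tainted]

# Modeling the mock endpoint: "count" comes from request.GET, and the
# follower data derived from it flows back to the user.
issues = find_issues(
    assignments=[("count", "request.GET"), ("followers", "count")],
    sink_calls=[("return_to_user", "followers")],
)</code></pre> <p><span style="font-weight: 400;">Here the checker would report one issue, because data derived from a source reaches a sink; a value that never derives from a source would produce no report.</span></p> <p><span style="font-weight: 400;">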
Under usual circumstances, the endpoint would be called with suitable parameters that match what the user is browsing on screen. However, scrapers can abuse such an endpoint by specifying arbitrary users and large counts, which can result in entire follower lists being returned in a single request. By doing so, scrapers can try to evade rate-limiting systems, which limit how many requests a user can send to our systems in a defined timeframe. These systems are set in place to stop scraping attempts at a high level.</span></p> <p><span style="font-weight: 400;">Since our static analysis systems run automatically on our codebase, the Anti-Scraping team can identify such scraping vectors proactively and remediate them before the code is introduced to our production systems. For example, the recommended fix for the code above is to cap the maximum number of results that can be returned at a time:</span></p> <pre class="line-numbers"><code class="language-none"># views/followers.py
async def get_followers(request: HttpRequest) -> HttpResponse:
    viewer = request.GET['viewer_id']
    target = request.GET['target_id']
    count = min(request.GET['count'], MAX_FOLLOWERS_RESULTS)
    if can_see(viewer, target):
        followers = load_followers(target, count)
        return followers

# controller/followers.py
async def load_followers(target_id: int, count: int):
    ...</code></pre> <p><span style="font-weight: 400;">Following the fix, the maximum number of results retrieved by each request is limited to </span><span style="font-weight: 400; font-family: 'courier new', courier;">MAX_FOLLOWERS_RESULTS</span><span style="font-weight: 400;">. 
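</span></p> <p><span style="font-weight: 400;">The capping behavior can be sketched as a small helper. This is hypothetical code, not Meta’s implementation: the limit of 100 is an assumed value, and note that a request parameter arrives as a string and must be parsed before comparison:</span></p> <pre class="line-numbers"><code class="language-none"># Hypothetical helper illustrating the cap on user-supplied counts.
MAX_FOLLOWERS_RESULTS = 100  # assumed server-side limit for illustration

def clamp_count(raw_count: str) -> int:
    """Parse a user-supplied count and cap it at the server-side maximum."""
    try:
        requested = int(raw_count)
    except ValueError:
        return 0  # malformed input yields no results
    # Negative requests return nothing; oversized requests are capped.
    return max(0, min(requested, MAX_FOLLOWERS_RESULTS))

clamp_count("25")       # a normal request is unchanged: 25
clamp_count("1000000")  # a scraper-sized request is capped: 100</code></pre> <p><span style="font-weight: 400;">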
Such a change would not affect regular users and would only interfere with scrapers, forcing them to send orders of magnitude more requests that would then trigger our rate-limiting systems.</span></p> <h2><span style="font-weight: 400;">The limitations of static analysis in combating unauthorized scraping </span></h2> <p><span style="font-weight: 400;">Static analysis tools are not designed to catch all possible unauthorized scraping issues. Because unauthorized scrapers can mimic the legitimate ways that people use Meta’s products, we cannot fully prevent all unauthorized scraping without affecting people’s ability to use our apps and websites the way they enjoy. Since unauthorized scraping is both a common and complex challenge to solve, </span><a href="https://about.fb.com/news/2021/04/how-we-combat-scraping/" target="_blank" rel="noopener"><span style="font-weight: 400;">we combat scraping by taking a more holistic approach</span></a><span style="font-weight: 400;"> to staying ahead of scraping actors. </span></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/02/18/security/protecting-user-data-through-source-code-analysis/">Protecting user data through source code analysis at scale</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></content:encoded> <post-id xmlns="com-wordpress:feed-additions:1">22267</post-id> </item> <item> <title>Unlocking global AI potential with next-generation subsea infrastructure</title> <link>https://engineering.fb.com/2025/02/14/connectivity/project-waterworth-ai-subsea-infrastructure/</link> <dc:creator><![CDATA[]]></dc:creator> <pubDate>Fri, 14 Feb 2025 16:28:06 +0000</pubDate> <category><![CDATA[Connectivity]]></category> <category><![CDATA[Networking & Traffic]]></category> <guid isPermaLink="false">https://engineering.fb.com/?p=22278</guid> <description><![CDATA[<p>Today, we’re announcing our most ambitious subsea cable endeavor yet: Project Waterworth. 
Once complete, the project will reach five major continents and span over 50,000 km (longer than the Earth’s circumference), making it the world’s longest subsea cable project using the highest-capacity technology available. Project Waterworth will bring industry-leading connectivity to the U.S., India, Brazil, [...]</p> <p><a class="btn btn-secondary understrap-read-more-link" href="https://engineering.fb.com/2025/02/14/connectivity/project-waterworth-ai-subsea-infrastructure/">Read More...</a></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/02/14/connectivity/project-waterworth-ai-subsea-infrastructure/">Unlocking global AI potential with next-generation subsea infrastructure</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></description> <content:encoded><![CDATA[<p><span style="font-weight: 400;">Today, we’re announcing our most ambitious subsea cable endeavor yet: Project Waterworth. Once complete, the project will reach five major continents and span over 50,000 km (longer than the Earth’s circumference), making it the world’s longest subsea cable project using the highest-capacity technology available. </span></p> <p><span style="font-weight: 400;">Project Waterworth will bring industry-leading connectivity to the U.S., India, Brazil, South Africa, and other key regions. This project will enable greater economic cooperation, facilitate digital inclusion, and open opportunities for technological development in these regions. 
For example, in India, where we’ve already seen significant growth and investment in digital infrastructure, Waterworth will help accelerate this progress and support the country’s ambitious plans for its digital economy.</span></p> <p><span style="font-weight: 400;">Subsea cable projects, such as Project Waterworth, are the backbone of global digital infrastructure, accounting for more than </span><a href="https://globaldigitalinclusion.org/wp-content/uploads/2024/01/GDIP-Good-Practices-for-Subsea-Cables-Policy-Investing-in-Digital-Inclusion.pdf"><span style="font-weight: 400;">95% of intercontinental traffic</span></a><span style="font-weight: 400;"> across the world’s oceans, seamlessly enabling digital communication, video experiences, online transactions, and more. Project Waterworth will be a multi-billion-dollar, multi-year investment to strengthen the scale and reliability of the world’s digital highways by opening three new oceanic corridors with the abundant, high-speed connectivity needed to drive AI innovation around the world. 
</span></p> <p><img fetchpriority="high" decoding="async" class="alignnone size-large wp-image-22288" src="https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png?w=1024" alt="" width="1024" height="576" srcset="https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png 1920w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png?resize=916,515 916w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png?resize=1536,864 1536w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png?resize=192,108 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p> <p><span style="font-weight: 400;">We’ve driven infrastructure innovation with various partners over the past decade, </span><a href="https://engineering.fb.com/2021/03/28/connectivity/echo-bifrost/"><span style="font-weight: 400;">developing</span></a><span style="font-weight: 400;"> more than 20 </span><a href="https://engineering.fb.com/2021/09/28/connectivity/2africa-pearls/"><span style="font-weight: 400;">subsea cables</span></a><span style="font-weight: 400;">. This includes multiple deployments of industry-leading subsea cables of 24 fiber pairs – compared to the typical 8 to 16 fiber pairs of other new systems. These investments enable unmatched connectivity for our world’s increasing digital needs. 
</span></p> <p><span style="font-weight: 400;">With Project Waterworth, we continue to advance engineering design to maintain cable resilience, enabling us to build the longest 24 fiber pair cable project in the world and enhance overall speed of deployment. We are also deploying first-of-its-kind routing, maximizing the cable laid in deep water — at depths up to 7,000 meters — and using enhanced burial techniques in high-risk fault areas, such as shallow waters near the coast, to avoid damage from ship anchors and other hazards.</span></p> <p><span style="font-weight: 400;">AI is revolutionizing every aspect of our lives, from how we interact with each other to how we think about infrastructure – and Meta is at the forefront of building these innovative technologies. As AI continues to transform industries and societies around the world, it’s clear that capacity, resilience, and global reach are more important than ever to support leading infrastructure. With Project Waterworth we can help ensure that the benefits of AI and other emerging technologies are available to everyone, regardless of where they live or work.</span></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/02/14/connectivity/project-waterworth-ai-subsea-infrastructure/">Unlocking global AI potential with next-generation subsea infrastructure</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></content:encoded> <post-id xmlns="com-wordpress:feed-additions:1">22278</post-id> </item> <item> <title>Looking back at our Bug Bounty program in 2024</title> <link>https://engineering.fb.com/2025/02/13/security/looking-back-at-our-bug-bounty-program-in-2024/</link> <dc:creator><![CDATA[]]></dc:creator> <pubDate>Thu, 13 Feb 2025 17:00:46 +0000</pubDate> <category><![CDATA[Security]]></category> <guid isPermaLink="false">https://engineering.fb.com/?p=22260</guid> <description><![CDATA[<p>In 2024, our bug bounty program awarded more 
than $2.3 million in bounties, bringing our total bounties since the creation of our program in 2011 to over $20 million. As part of our defense-in-depth strategy, we continued to collaborate with the security research community in the areas of GenAI, AR/VR, ads tools, and more. We [...]</p> <p><a class="btn btn-secondary understrap-read-more-link" href="https://engineering.fb.com/2025/02/13/security/looking-back-at-our-bug-bounty-program-in-2024/">Read More...</a></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/02/13/security/looking-back-at-our-bug-bounty-program-in-2024/">Looking back at our Bug Bounty program in 2024</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></description> <content:encoded><![CDATA[<ul> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">In 2024, our bug bounty program awarded more than $2.3 million in bounties, bringing our total bounties since the creation of our program in 2011 to over $20 million. </span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">As part of our </span><a href="https://about.fb.com/news/2019/01/designing-security-for-billions/" target="_blank" rel="noopener"><span style="font-weight: 400;">defense-in-depth strategy</span></a><span style="font-weight: 400;">, we continued to collaborate with the security research community in the areas of GenAI, AR/VR, ads tools, and more. </span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">We also celebrated the security research done by our bug bounty community as part of our annual bug bounty summit and many other industry events. 
</span></li> </ul> <p><span style="font-weight: 400;">As we embark on a new year, we’re sharing several </span><span style="font-weight: 400;">updates</span><span style="font-weight: 400;"> on our work with external bug bounty security researchers to help protect our global community and platforms. This includes new payout stats, details on what’s in scope for GenAI-related bug reports, and a recap of some of our engagements throughout last year with bug bounty researchers. </span></p> <h2><span style="font-weight: 400;">Highlights from Meta’s bug bounty program in 2024</span></h2> <p><span style="font-weight: 400;">In 2024, we received nearly 10,000 bug reports and paid out more than $2.3 million in bounty awards to researchers around the world who helped make our platforms safer.</span></p> <ul> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Since 2011, we have paid out more than $20 million in bug bounties. </span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Last year, we received nearly 10,000 reports and paid out awards on nearly 600 valid reports.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">In 2024, we awarded more than $2.3 million to nearly 200 researchers from more than 45 countries. 
</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">The top three countries based on bounties awarded last year are India, Nepal, and the United States.</span></li> </ul> <h2><span style="font-weight: 400;">Engaging researchers in bug hunting in GenAI </span></h2> <p><span style="font-weight: 400;">After </span><a href="https://about.fb.com/news/2023/09/building-generative-ai-features-responsibly/" target="_blank" rel="noopener"><span style="font-weight: 400;">making our generative AI features available to security researchers</span></a><span style="font-weight: 400;"> through our long-running bug bounty program in 2023, Meta has continued to roll out new GenAI products and tools. In 2024, we provided more details to our research community on </span><a href="https://bugbounty.meta.com/scope/" target="_blank" rel="noopener"><span style="font-weight: 400;">what’s in scope for bug bounty reports related to our large language models (LLMs)</span></a><span style="font-weight: 400;">. We now welcome reports that demonstrate integral privacy or security issues associated with Meta’s LLMs, including being able to extract training data through tactics like model inversion or extraction attacks. 
</span></p> <p><span style="font-weight: 400;">We have already received several impactful reports focused on our GenAI tools, and </span><span style="font-weight: 400;">we look forward to continuing this important work with our community of researchers to help ensure the security and integrity of our GenAI tools.</span></p> <h2><span style="font-weight: 400;">Encouraging security research in ads audience and hardware products </span></h2> <p><span style="font-weight: 400;">This year, we prioritized our efforts to steer security research by the bug bounty community towards a number of product surfaces, including:</span></p> <p><b>Ads audience tools designed to help people choose a target audience for their ads: </b><a href="https://bugbounty.meta.com/payout-guidelines/ads-audience/" target="_blank" rel="noopener"><span style="font-weight: 400;">We introduced new payout guidelines</span></a><span style="font-weight: 400;"> to provide transparency to our security researchers on how we assess the impact of the report we receive for potential security bugs in Meta’s </span><a href="https://www.facebook.com/business/help/717368264947302" target="_blank" rel="noopener"><span style="font-weight: 400;">ads audience tools</span></a><span style="font-weight: 400;">. We cap the maximum base payout for discovering PII (name, email, phone number, state, ZIP, gender) for an ads audience at $30,000 and then apply any applicable deduction based on the required user interaction, prerequisites, and any other mitigation factors to arrive at the final awarded bounty amount. 
More details </span><a href="https://bugbounty.meta.com/payout-guidelines/ads-audience/" target="_blank" rel="noopener"><span style="font-weight: 400;">here</span></a><span style="font-weight: 400;">.</span></p> <p><b>Mixed reality hardware products:</b><span style="font-weight: 400;"> As Meta continues to roll out <a href="https://engineering.fb.com/2023/09/12/security/meta-quest-2-defense-through-offense/" target="_blank" rel="noopener">mixed reality products</a>, we work to encourage security research into these hardware and AI-driven technologies to help us find and fix potential bugs as quickly as possible. In 2024, our bug bounty researchers contributed reports on potential issues in Quest that could have impacted safety settings or lead to memory corruption. We also brought our Quest 3 and Ray-Ban Meta glasses to </span><a href="http://hardwear.io" target="_blank" rel="noopener"><span style="font-weight: 400;">hardwear.io USA 2024</span></a><span style="font-weight: 400;">, a leading conference that brings together top hardware hackers to test new hardware products and help uncover potential vulnerabilities. </span></p> <h2><span style="font-weight: 400;">Building and celebrating the global bug bounty community</span></h2> <p><span style="font-weight: 400;">As part of our continuous commitment to security research – both inside and outside Meta – we invested in enabling open collaboration with our bug bounty community by:</span></p> <p><b>Organizing community events and presenting joint research:</b><span style="font-weight: 400;"> We hosted our annual Meta Bug Bounty Researcher Conference (MBBRC) in Johannesburg, South Africa, bringing together 60 of our top researchers from all over the world. We received more than 100 bug reports and awarded over $320,000 in total. We also co-presented talks at EkoParty, DEF CON, Hardwear.io, Pwn2own, and other security research summits. 
This year, we’re pleased to share that the 2025 MBBRC will be hosted in Tokyo, Japan, May 12-15. Stay tuned for more details in 2025.</span></p> <p><b>Celebrating long-time researchers:</b><span style="font-weight: 400;"> One of our most long-standing and prolific researchers, </span><a href="https://philippeharewood.com/" target="_blank" rel="noopener"><span style="font-weight: 400;">Philippe Harewood</span></a><span style="font-weight: 400;">, reached a 10-year milestone with over 500 valid reports paid out by our bug bounty program. Noteworthy contributions over the years include Philippe’s groundbreaking research on </span><a href="https://www.youtube.com/watch?v=vwUxRCmgwSw" target="_blank" rel="noopener"><span style="font-weight: 400;">Instagram access token leak</span></a><span style="font-weight: 400;">, </span><a href="https://philippeharewood.com/bypass-video-capture-limit-on-ray-ban-stories/" target="_blank" rel="noopener"><span style="font-weight: 400;">video capture limit bypass on Ray-Ban stories</span></a><span style="font-weight: 400;">, and </span><span style="font-weight: 400;">more</span><span style="font-weight: 400;">. </span></p> <p><b>Providing resources and timely updates for the research community:</b><span style="font-weight: 400;"> The <a href="http://bugbounty.meta.com" target="_blank" rel="noopener">Meta Bug Bounty website</a> serves as a centralized hub for all bug bounty news and updates. Researchers can also follow the program on <a href="http://www.instagram.com/metabugbounty" target="_blank" rel="noopener">Instagram</a>, <a href="http://www.facebook.com/BugBounty" target="_blank" rel="noopener">Facebook</a>, and <a href="http://www.x.com/metabugbounty" target="_blank" rel="noopener">X</a> for quick updates. 
</span></p> <h2><span style="font-weight: 400;">Looking ahead</span></h2> <p><span style="font-weight: 400;">Meta’s bug bounty team looks forward to introducing new initiatives and continuing to engage with our existing community and new researchers who are just getting started. Additionally, we will continue to provide seasoned experts with unique opportunities to test unreleased features through our private bug bounty tracks.</span></p> <p><span style="font-weight: 400;">For the past 14 years, our bug bounty program has fostered a collaborative relationship with external researchers that has helped keep our platforms safer and more secure. We would like to extend a heartfelt thanks to everyone who contributed to the growth of our program in 2024.</span></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/02/13/security/looking-back-at-our-bug-bounty-program-in-2024/">Looking back at our Bug Bounty program in 2024</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></content:encoded> <post-id xmlns="com-wordpress:feed-additions:1">22260</post-id> </item> <item> <title>Revolutionizing software testing: Introducing LLM-powered bug catchers</title> <link>https://engineering.fb.com/2025/02/05/security/revolutionizing-software-testing-llm-powered-bug-catchers-meta-ach/</link> <dc:creator><![CDATA[]]></dc:creator> <pubDate>Wed, 05 Feb 2025 18:30:51 +0000</pubDate> <category><![CDATA[ML Applications]]></category> <category><![CDATA[Security]]></category> <guid isPermaLink="false">https://engineering.fb.com/?p=22242</guid> <description><![CDATA[<p>WHAT IT IS Meta’s Automated Compliance Hardening (ACH) tool is a system for mutation-guided, LLM-based test generation. ACH hardens platforms against regressions by generating undetected faults (mutants) in source code that are specific to a given area of concern and using those same mutants to generate tests. 
When applied to privacy, for example, ACH automates [...]</p> <p><a class="btn btn-secondary understrap-read-more-link" href="https://engineering.fb.com/2025/02/05/security/revolutionizing-software-testing-llm-powered-bug-catchers-meta-ach/">Read More...</a></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/02/05/security/revolutionizing-software-testing-llm-powered-bug-catchers-meta-ach/">Revolutionizing software testing: Introducing LLM-powered bug catchers</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></description> <content:encoded><![CDATA[<h2><span style="font-weight: 400;">WHAT IT IS</span></h2> <p><a href="https://arxiv.org/pdf/2501.12862" target="_blank" rel="noopener"><span style="font-weight: 400;">Meta’s Automated Compliance Hardening (ACH) tool</span></a><span style="font-weight: 400;"> is a system for mutation-guided, LLM-based test generation. ACH hardens platforms against regressions by generating undetected faults (mutants) in source code that are specific to a given area of concern and using those same mutants to generate tests. When applied to privacy, for example, ACH automates the process of searching for privacy-related faults and preventing them from entering our systems in the future, </span><span style="font-weight: 400;">ultimately hardening our code bases to reduce risk of any privacy regression.</span></p> <p><span style="font-weight: 400;">ACH automatically generates unit tests that target a particular kind of fault. We describe the faults we care about to ACH in plain text. The description can be incomplete, and even self-contradictory, yet ACH still generates tests that it proves will catch bugs of the kind described.</span></p> <p><span style="font-weight: 400;">Traditionally, automated test generation techniques sought merely to increase code coverage. 
As every tester knows, this is only part of the solution because increasing coverage doesn’t necessarily find faults. </span><span style="font-weight: 400;"><br /> </span><span style="font-weight: 400;"><br /> </span><span style="font-weight: 400;">ACH is a radical departure from this tradition, because it targets specific faults, rather than uncovered code, although it often also increases coverage in the process of targeting faults. Furthermore, because ACH is founded on the principles of </span><a href="https://arxiv.org/abs/2402.04380" target="_blank" rel="noopener"><span style="font-weight: 400;">Assured LLM-based Software Engineering</span></a><span style="font-weight: 400;">, it maintains verifiable assurances that its tests catch the kind of faults described.</span></p> <p><span style="font-weight: 400;">Our new research paper, “</span><a href="https://arxiv.org/pdf/2501.12862" target="_blank" rel="noopener"><span style="font-weight: 400;">Mutation-Guided LLM-based Test Generation at Meta</span></a><span style="font-weight: 400;">,” gives details of the underlying scientific foundations for ACH and how we apply ACH to privacy testing, but this approach can be applied to any sort of regression testing.</span></p> <h2><span style="font-weight: 400;">HOW IT WORKS</span></h2> <p><span style="font-weight: 400;">Mutation testing, where faults (mutants) are deliberately introduced into source code (using version control to keep them away from production) to assess how well an existing testing framework can detect these changes, has been</span><a href="https://web.eecs.umich.edu/~weimerw/2022-481F/readings/mutation-testing.pdf" target="_blank" rel="noopener"> <span style="font-weight: 400;">researched for decades</span></a><span style="font-weight: 400;">. But despite this, mutation testing has remained difficult to deploy.
</span></p> <p><span style="font-weight: 400;"><a href="http://crest.cs.ucl.ac.uk/fileadmin/crest/sebasepaper/JiaH10.pdf" target="_blank" rel="noopener">In earlier approaches</a>, mutants themselves would be automatically generated (most often using a rule-based approach). But this method often produced mutants that weren’t realistic representations of the faults engineers actually worry about.</span></p> <p><span style="font-weight: 400;">On top of that, even with the mutants being automatically generated, humans would still have to manually write the tests that would kill the mutants (catch the faults).</span></p> <p><span style="font-weight: 400;">Writing these tests is a painstaking and laborious process, so engineers faced a two-pronged issue: Not only did every test have to be written by hand, but even after all of that work, there was no guarantee the test would actually catch the automatically generated mutant. </span></p> <p><span style="font-weight: 400;">By leveraging LLMs, we can generate mutants that represent realistic concerns and also save on human labor by generating tests to catch the faults automatically as well.
ACH marries automated test generation techniques with the capabilities of large language models (LLMs) to generate mutants that are highly relevant to an area of testing concern as well as tests that are guaranteed to catch bugs that really matter.</span></p> <p><span style="font-weight: 400;">Broadly, ACH works in three steps:</span></p> <ol> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">An engineer describes the kind of bugs they’re concerned about.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">ACH uses that description to automatically generate lots of bugs.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">ACH uses the generated bugs to automatically generate lots of tests that catch them.</span></li> </ol> <p><span style="font-weight: 400;">At Meta we’ve </span><a href="https://arxiv.org/pdf/2501.12862" target="_blank" rel="noopener"><span style="font-weight: 400;">applied ACH-assisted testing to several of our platforms</span></a><span style="font-weight: 400;">, including Facebook Feed, Instagram, Messenger, and WhatsApp. 
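</span></p>

<p><span style="font-weight: 400;">The three steps above can be sketched as a small loop. This is an illustrative sketch only, with stand-in functions for the LLM calls rather than Meta’s actual implementation; it keeps a generated test only if the test passes on the original code and fails on the mutant, one concrete way to realize the assurance that a test catches the described kind of fault:</span></p>

```python
# Illustrative sketch of a mutation-guided test generation loop.
# `generate_mutants` and `generate_tests` stand in for LLM calls.

def run_test(test, code):
    """Run a test against a version of the code; True means it passes."""
    return test(code)

def ach_loop(original_code, concern, generate_mutants, generate_tests):
    kept_tests = []
    for mutant in generate_mutants(original_code, concern):
        for candidate in generate_tests(original_code, mutant):
            # Keep a test only if it passes on the original code AND
            # fails on the mutant, i.e., it "kills" the injected fault.
            if run_test(candidate, original_code) and not run_test(candidate, mutant):
                kept_tests.append(candidate)
                break  # one killing test per mutant is enough here
    return kept_tests
```

<p><span style="font-weight: 400;">Keeping only mutant-killing tests is what separates this loop from coverage-driven generation: every retained test is a witness to a specific fault it demonstrably detects.</span></p>

<p><span style="font-weight: 400;">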
Based on our own testing, we’ve concluded that engineers found ACH useful for hardening code against specific concerns and found other benefits even when tests generated by ACH don’t directly tackle a specific concern.</span></p> <figure id="attachment_22243" aria-describedby="caption-attachment-22243" style="width: 1024px" class="wp-caption alignnone"><img decoding="async" class="size-large wp-image-22243" src="https://engineering.fb.com/wp-content/uploads/2025/02/Meta-ACH-system-chart.png?w=1024" alt="" width="1024" height="576" srcset="https://engineering.fb.com/wp-content/uploads/2025/02/Meta-ACH-system-chart.png 1534w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-ACH-system-chart.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-ACH-system-chart.png?resize=916,515 916w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-ACH-system-chart.png?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-ACH-system-chart.png?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-ACH-system-chart.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-ACH-system-chart.png?resize=192,108 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22243" class="wp-caption-text">A top-level overview of the architecture of the ACH system. 
The system leverages LLMs to generate faults, check them against possible equivalents, and then generate tests to catch those faults.</figcaption></figure> <h2><span style="font-weight: 400;">WHY IT MATTERS</span></h2> <p><span style="font-weight: 400;">Meta has a very large number of data systems and uses </span><a href="https://engineering.fb.com/2022/07/27/developer-tools/programming-languages-endorsed-for-server-side-use-at-meta/" target="_blank" rel="noopener"><span style="font-weight: 400;">many different programming languages</span></a><span style="font-weight: 400;">, frameworks, and services to power our family of apps and products. But how can our thousands of engineers across the world ensure that their code is reliable and won’t introduce bugs that degrade application performance or create privacy risk? The answer lies with LLMs. </span></p> <p><span style="font-weight: 400;">LLM-based test generation and LLM-based mutant generation are not new, but this is the first time they’ve been combined and deployed in large-scale industrial systems. Generating mutants and the tests to kill them has traditionally been difficult to scale. Since LLMs are probabilistic and don’t need to rely on rigidly defined rules to make decisions, they allow us to tackle both sides of this equation – generating mutations and tests to kill them – very efficiently and with a high level of accuracy. </span></p> <p><span style="font-weight: 400;">This new approach significantly modernizes this form of automated test generation and helps software engineers take in concerns from a variety of sources (previous faults, colleagues, user requirements, regulatory requirements, etc.)
and efficiently convert them from freeform text into actionable tests – with the guarantee that the test will catch the fault they’re looking for.</span></p> <p><span style="font-weight: 400;">ACH can be applied to any class of faults and can have a significant impact on hardening against future regressions and optimizing testing itself.</span></p> <h2><span style="font-weight: 400;">WHAT’S NEXT</span></h2> <p><span style="font-weight: 400;">Our novel approach combines LLM-based test generation and mutant generation to help automate complex technical organizational workflows in this space. This innovation has the potential to simplify risk assessments, reduce cognitive load for developers, and ultimately create a safer online ecosystem. We’re committed to expanding deployment areas, developing methods to measure mutant relevance, and detecting existing faults to drive industry-wide adoption of automated test generation in compliance. </span></p> <p><span style="font-weight: 400;">We will be sharing more developments and encourage you to watch this space.</span></p> <h2><span style="font-weight: 400;">READ THE PAPER</span></h2> <p><a href="https://arxiv.org/pdf/2501.12862" target="_blank" rel="noopener"><span style="font-weight: 400;">Mutation-Guided LLM-based Test Generation at Meta</span></a></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/02/05/security/revolutionizing-software-testing-llm-powered-bug-catchers-meta-ach/">Revolutionizing software testing: Introducing LLM-powered bug catchers</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></content:encoded> <post-id xmlns="com-wordpress:feed-additions:1">22242</post-id> </item> <item> <title>Data logs: The latest evolution in Meta’s access tools</title> <link>https://engineering.fb.com/2025/02/04/security/data-logs-the-latest-evolution-in-metas-access-tools/</link> <dc:creator><![CDATA[]]></dc:creator> <pubDate>Tue, 04 Feb 2025 20:00:18
+0000</pubDate> <category><![CDATA[Security]]></category> <guid isPermaLink="false">https://engineering.fb.com/?p=22222</guid> <description><![CDATA[<p>We’re sharing how Meta built support for data logs, which provide people with additional data about how they use our products. Here we explore initial system designs we considered, an overview of the current architecture, and some important principles Meta takes into account in making data accessible and easy to understand. Users have a variety [...]</p> <p><a class="btn btn-secondary understrap-read-more-link" href="https://engineering.fb.com/2025/02/04/security/data-logs-the-latest-evolution-in-metas-access-tools/">Read More...</a></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/02/04/security/data-logs-the-latest-evolution-in-metas-access-tools/">Data logs: The latest evolution in Meta’s access tools</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></description> <content:encoded><![CDATA[<ul> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">We’re sharing how Meta built support for data logs, which provide people with additional data about how they use our products.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Here we explore initial system designs we considered, an overview of the current architecture, and some important principles Meta takes into account in making data accessible and easy to understand. </span></li> </ul> <p><span style="font-weight: 400;">Users have a variety of tools they can use to manage and access their information on Meta platforms. 
Meta is always looking for ways to enhance its access tools in line with technological advances, and in February 2024 we began including </span><a href="https://www.facebook.com/help/384437594328726#data-logs"><span style="font-weight: 400;">data logs</span></a><span style="font-weight: 400;"> in the </span><a href="https://www.facebook.com/help/212802592074644"><span style="font-weight: 400;">Download Your Information</span></a><span style="font-weight: 400;"> (DYI) tool. Data logs include things such as information about content you’ve viewed on Facebook. Some of this data can be unique, but it can also include additional details about information that we already make available elsewhere, such as through a user’s profile, products like </span><a href="https://www.facebook.com/help/1700142396915814"><span style="font-weight: 400;">Access Your Information</span></a><span style="font-weight: 400;"> or </span><a href="https://www.facebook.com/help/256333951065527"><span style="font-weight: 400;">Activity Log</span></a><span style="font-weight: 400;">, or account downloads. This update is the result of significant investments over a number of years by a large cross-functional team at Meta, and consultations with experts on how to continue enhancing our access tools.</span></p> <p><span style="font-weight: 400;">Data logs are just the most recent example of how Meta gives users the power to access their data on our platforms. We have a long history of giving users transparency and control over their data:</span></p> <ul> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">2010: Users can retrieve a copy of their information through DYI. 
</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">2011: Users can easily review actions taken on Facebook through </span><a href="https://www.facebook.com/help/256333951065527"><span style="font-weight: 400;">Activity Log</span></a><span style="font-weight: 400;">.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">2014: Users have more transparency and control over ads they see with the “</span><a href="https://www.facebook.com/help/794535777607370#advertiser-choices"><span style="font-weight: 400;">Why Am I Seeing This Ad?</span></a><span style="font-weight: 400;">” feature on Facebook.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">2018: Users have a curated experience to find information about them through </span><a href="https://about.fb.com/news/2018/03/privacy-shortcuts/"><span style="font-weight: 400;">Access Your Information</span></a><span style="font-weight: 400;">. Users can retrieve a copy of their information on Instagram through </span><a href="https://help.instagram.com/181231772500920"><span style="font-weight: 400;">Download Your Data</span></a><span style="font-weight: 400;"> and on WhatsApp through </span><a href="https://faq.whatsapp.com/526463418847093/"><span style="font-weight: 400;">Request Account Information</span></a><span style="font-weight: 400;">.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">2019: Users can view their activity off Meta-technologies and clear their history. Meta joins the </span><a href="https://engineering.fb.com/2019/12/02/security/data-transfer-project/"><span style="font-weight: 400;">Data Transfer Project</span></a><span style="font-weight: 400;"> and has continuously led the development of shared technologies that enable users to port their data from one platform to another. 
</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">2020: Users continue to </span><a href="https://about.fb.com/news/2020/03/data-access-tools/"><span style="font-weight: 400;">receive more information</span></a><span style="font-weight: 400;"> in DYI such as additional information about their interactions on Facebook and Instagram.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">2021: Users can more easily navigate categories of information in </span><a href="https://about.fb.com/news/2021/01/introducing-the-new-access-your-information/"><span style="font-weight: 400;">Access Your Information</span></a></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">2023: Users can more easily use our tools as access features are consolidated within </span><a href="https://www.facebook.com/help/943858526073065"><span style="font-weight: 400;">Accounts Center</span></a><span style="font-weight: 400;">.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">2024: Users can access data logs in Download Your Information.</span></li> </ul> <h2><span style="font-weight: 400;">What are data logs?</span></h2> <p><span style="font-weight: 400;">In contrast to our production systems, which can be queried billions of times per second thanks to techniques like caching, Meta’s data warehouse, powered by Hive, is designed to support low volumes of large queries for things like analytics and cannot scale to the query rates needed to power real-time data access.</span></p> <p><span style="font-weight: 400;">We created data logs as a solution to provide users who want more granular information with access to data stored in Hive. 
In this context, an individual data log entry is a formatted version of a single row of data from Hive that has been processed to make the underlying data transparent and easy to understand.</span></p> <p><span style="font-weight: 400;">Obtaining this data from Hive in a format that can be presented to users is not straightforward. Hive tables are partitioned, typically by date and time, so retrieving all the data for a specific user requires scanning through every row of every partition to check whether it corresponds to that user. Facebook has over 3 billion monthly active users, meaning that, assuming an even distribution of data, ~99.999999967% of the rows in a given Hive table might be processed for such a query even though they won’t be relevant. </span></p> <p><span style="font-weight: 400;">Overcoming this fundamental limitation was challenging, and adapting our infrastructure to enable it has taken multiple years of concerted effort. Data warehouses are commonly used in a range of industry sectors, so we hope that this solution should be of interest to other companies seeking to provide access to the data in their data warehouses.</span></p> <h3><span style="font-weight: 400;">Initial designs</span></h3> <p><span style="font-weight: 400;">When we started designing a solution to make data logs available, we first considered whether it would be feasible to simply run queries for each individual as they requested their data, despite the fact that these queries would spend almost all of their time processing irrelevant data. Unfortunately, as we highlighted above, the distribution of data at Meta’s scale makes this approach infeasible and incredibly wasteful: It would require scanning entire tables once per DYI request, scaling linearly with the number of individual users that initiate DYI requests. These performance characteristics were infeasible to work around. 
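</span></p>

<p><span style="font-weight: 400;">The scale of that waste is easy to check with back-of-envelope arithmetic (a sketch using the ~3 billion monthly-active-user figure above and the even-distribution assumption):</span></p>

```python
# Back-of-envelope check of why per-user full-table scans don't scale,
# assuming rows are evenly distributed across ~3 billion users.
MONTHLY_ACTIVE_USERS = 3_000_000_000

def irrelevant_fraction(users_in_scan: int) -> float:
    """Fraction of scanned rows that belong to no user in the scan."""
    return 1 - users_in_scan / MONTHLY_ACTIVE_USERS

# A single-user scan wastes ~99.999999967% of the rows it touches,
# and one scan per request multiplies that full-table cost linearly.
waste_for_one_user = irrelevant_fraction(1)
```

<p><span style="font-weight: 400;">Batching many requests into one scan leaves the cost of the full-table scan essentially constant while shrinking the wasted fraction, which motivates the design described next.</span></p>

<p><span style="font-weight: 400;">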
</span></p> <p><span style="font-weight: 400;">We also considered caching data logs in an online system capable of supporting a range of indexed per-user queries. This would make the per-user queries relatively efficient. However, copying and storing data from the warehouse in these other systems presented material computational and storage costs that were not offset by the overall effectiveness of the cache, making this infeasible as well. </span></p> <h3><span style="font-weight: 400;">Current design</span></h3> <p><span style="font-weight: 400;">Finally, we considered whether it would be possible to build a system that relies on amortizing the cost of expensive full table scans by batching individual users’ requests into a single scan. After significant engineering investigation and prototyping, we determined that such a system offers infrastructure teams sufficiently predictable performance characteristics to make this possible. Even with this batching over short periods of time, given the relatively small size of the batches of requests for this information compared to the overall user base, most of the rows considered in a given table are filtered and scoped out as they are not relevant to the users whose data has been requested. This is a necessary trade-off to enable this information to be made accessible to our users.</span></p> <p><span style="font-weight: 400;">In more detail, following a pre-defined schedule, a job is triggered using Meta’s internal task-scheduling service to organize the most recent requests for users’ data logs, collected over a short time period, into a single batch. This batch is submitted to a system built on top of Meta’s </span><a href="https://atscaleconference.com/workflowsfacebook-powering-developer-productivity-and-automation-at-facebook-scale/"><span style="font-weight: 400;">Core Workflow Service</span></a><span style="font-weight: 400;"> (CWS).
CWS provides a useful set of guarantees that enable long-running tasks to be executed with predictable performance characteristics and reliability guarantees that are critical for complex multi-step workflows.</span></p> <p><span style="font-weight: 400;">Once the batch has been queued for processing, we copy the list of user IDs who have made requests in that batch into a new Hive table. For each data logs table, we initiate a new worker task that fetches the relevant metadata describing how to correctly query the data. Once we know what to query for a specific table, we create a task for each partition that executes a job in </span><a href="https://www.youtube.com/watch?v=4T-MCYWrrOw"><span style="font-weight: 400;">Dataswarm</span></a><span style="font-weight: 400;"> (our data pipeline system). This job performs an INNER JOIN between the table containing requesters’ IDs and the column in each table that identifies the owner of the data in that row. As tables in Hive may leverage security mechanisms like access control lists (ACLs) and privacy protections built on top of Meta’s </span><a href="https://engineering.fb.com/2024/08/27/security/privacy-aware-infrastructure-purpose-limitation-meta/"><span style="font-weight: 400;">Privacy Aware Infrastructure</span></a><span style="font-weight: 400;">, the jobs are configured with appropriate security and privacy policies that govern access to the data.</span></p> <figure id="attachment_22235" aria-describedby="caption-attachment-22235" style="width: 1024px" class="wp-caption alignnone"><img decoding="async" class="size-large wp-image-22235" src="https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png?w=1024" alt="" width="1024" height="417" srcset="https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png 4833w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png?resize=916,373 916w, 
https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png?resize=768,312 768w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png?resize=1024,417 1024w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png?resize=1536,625 1536w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png?resize=2048,833 2048w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png?resize=96,39 96w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png?resize=192,78 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22235" class="wp-caption-text">A diagram showing how a workflow gathers inputs and runs sub-workflows for each table and partition to reliably gather data logs. The data logs workflow will gather metadata, then prepare requester IDs, then run parallel processes for each table. Each table workflow prepares the table, then runs approximately sequential jobs for each partition. Some partitions across tables may be processed in parallel.</figcaption></figure> <p><span style="font-weight: 400;">Once this job is completed, it outputs its results to an intermediate Hive table containing a combination of the data logs for all users in the current batch. This processing is expensive, as the INNER JOIN requires a full table scan across all relevant partitions of the Hive table, an operation which may consume significant computational resources. The output table is then processed using PySpark to identify the relevant data and split it into individual files for each user’s data in a given partition. 
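</span></p>

<p><span style="font-weight: 400;">In miniature, the join-then-split flow looks like the following sketch, with plain Python lists standing in for Hive partitions and the PySpark splitter, and an illustrative user_id column name:</span></p>

```python
# Toy model of the batched extraction: one scan of a partition keeps
# only requesters' rows (the INNER JOIN), then the combined output is
# split into per-user buckets (the PySpark step).

def scan_partition(partition_rows, requester_ids):
    """Full pass over one partition, keeping rows owned by the batch."""
    return [row for row in partition_rows if row["user_id"] in requester_ids]

def split_by_user(joined_rows):
    """Split the combined batch output into one bucket per user."""
    buckets = {}
    for row in joined_rows:
        buckets.setdefault(row["user_id"], []).append(row)
    return buckets
```

<p><span style="font-weight: 400;">However many users are in the batch, the scan itself is performed once per partition, which is the amortization the design depends on.</span></p>

<p><span style="font-weight: 400;">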
</span></p> <figure id="attachment_22234" aria-describedby="caption-attachment-22234" style="width: 1024px" class="wp-caption alignnone"><img loading="lazy" decoding="async" class="size-large wp-image-22234" src="https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png?w=1024" alt="" width="1024" height="430" srcset="https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png 3166w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png?resize=916,385 916w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png?resize=768,322 768w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png?resize=1024,430 1024w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png?resize=1536,645 1536w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png?resize=2048,860 2048w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png?resize=96,40 96w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png?resize=192,81 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22234" class="wp-caption-text">A diagram showing that the processing of each partition splits the data into one file per user per partition. Both users A and B will have a file representing Table 1 Partition 1, and so on.</figcaption></figure> <p><span style="font-weight: 400;">The result of these batch operations in the data warehouse is a set of comma delimited text files containing the unfiltered raw data logs for each user. This raw data is not yet explained or made intelligible to users, so we run a post-processing step in Meta’s Hack language to apply privacy rules and filters and render the raw data into meaningful, well-explained HTML files. We do this by passing the raw data through various renderers, discussed in more detail in the next section. 
Finally, once all of the processing is completed, the results are aggregated into a ZIP file and made available to the requestor through the DYI tool.</span></p> <h2>Lessons learned from building data logs</h2> <p><span style="font-weight: 400;">Throughout the development of this system we found it critical to develop robust checkpointing mechanisms that enable incremental progress and resilience in the face of errors and temporary failures. While processing everything in a single pass may reduce latency, the risk is that a single issue will cause all of the previous work to be wasted. For example, in addition to jobs timing out and failing to complete, we also experienced errors where full-table-scan queries would run out of memory and fail partway through processing. The capability to resume work piecemeal increases resiliency and optimizes the overall throughput of the system.</span></p> <figure id="attachment_22233" aria-describedby="caption-attachment-22233" style="width: 1024px" class="wp-caption alignnone"><img loading="lazy" decoding="async" class="size-large wp-image-22233" src="https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-checkpoints.png?w=1024" alt="" width="1024" height="432" srcset="https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-checkpoints.png 1787w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-checkpoints.png?resize=916,386 916w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-checkpoints.png?resize=768,324 768w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-checkpoints.png?resize=1024,432 1024w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-checkpoints.png?resize=1536,648 1536w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-checkpoints.png?resize=96,41 96w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-checkpoints.png?resize=192,81 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption 
id="caption-attachment-22233" class="wp-caption-text">A diagram explaining that checkpointing after incremental task completions can optimize system throughput. Approach 1 shows one task failure on step 2, which is recomputed successfully. Approach 2 combines everything into one step, but when it fails, it winds up taking longer overall.</figcaption></figure> <p><span style="font-weight: 400;">Ensuring data correctness is also very important. As we built the component that splits combined results into individual files for each user, we encountered an issue that affected this correctness guarantee and could have led to data being returned to the wrong user. The root cause of the issue was a Spark concurrency bug that partitioned data incorrectly across the parallel Spark workers. To prevent this issue, we built verification in the post-processing stage to ensure that the user ID column in the data matches the identifier for the user whose logs we are generating. This means that even if similar bugs were to occur in the core data processing infrastructure we would prevent any incorrect data from being shown to users. </span></p> <p><span style="font-weight: 400;">Finally, we learned that complex data workflows require advanced tools and the capability to iterate on code changes quickly without re-processing everything. To this end, we built an experimentation platform that enables running modified versions of the workflows to quickly test changes, with the ability to independently execute phases of the process to expedite our work. For example, we found that innocent-looking changes such as altering which column is being fetched can lead to complex failures in the data fetching jobs. 
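</span></p>

<p><span style="font-weight: 400;">The per-user verification described above can be sketched as follows (a simplified Python illustration; the production check runs in the Hack post-processing step):</span></p>

```python
# Simplified sketch of the correctness check applied before rendering:
# every row in a user's split file must carry that user's ID, so a
# mis-partitioned row aborts processing instead of reaching the wrong user.

class DataCorrectnessError(Exception):
    pass

def verify_user_rows(expected_user_id, rows, id_column="user_id"):
    for row in rows:
        if row[id_column] != expected_user_id:
            raise DataCorrectnessError(
                f"row owned by {row[id_column]} found in file for {expected_user_id}"
            )
    return rows
```

<p><span style="font-weight: 400;">Failing closed in this way means that even a repeat of the Spark partitioning bug would halt the workflow rather than show one user’s data to another.</span></p>

<p><span style="font-weight: 400;">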
We now have the ability to run a test job under the new configuration when making a change to a table fetcher.</span></p> <h2>Making data consistently understandable and explainable</h2> <p><span style="font-weight: 400;">Meta cares about ensuring that the information we provide is meaningful to end-users. A key challenge in providing transparent access is presenting often highly technical information in a way that is approachable and easy to understand, even for those with little expertise in technology. </span></p> <p><span style="font-weight: 400;">Providing people access to their data involves working across numerous products and surfaces that our users interact with every day. The way that information is stored on our back-end systems is not always directly intelligible to end-users, and it takes both understanding of the individual products and features as well as the needs of users to make it user-friendly. This is a large, collaborative undertaking, leveraging many forms of expertise. Our process entails working with product teams familiar with the data from their respective products, applying our historical expertise in access surfaces, using innovative tools we have developed, and consulting with experts.</span></p> <p><span style="font-weight: 400;">In more detail, a cross-functional team of access experts works with specialist teams to review these tables, taking care to avoid exposing information that could adversely affect the rights and freedoms of other users. For example, if you block another user on Facebook, this information would not be provided to the person that you have blocked. Similarly, when you view another user’s profile, this information will be available to you, but not to the person whose profile you viewed. This is a key principle Meta upholds to respect the rights of everyone who engages with our platforms. It also means that we need a rigorous process to ensure that the data made available is never shared incorrectly.
Many of the datasets that power a social network will reference more than one person, but that does not imply everyone referenced should always have equal access to that information. </span></p> <p><span style="font-weight: 400;">Additionally, Meta must take care not to disclose information that may compromise our integrity or safety systems, or our intellectual property rights. For instance, Meta sends </span><a href="https://transparency.meta.com/en-gb/ncmec-q2-2023/"><span style="font-weight: 400;">millions of NCMEC Cybertip reports per year</span></a><span style="font-weight: 400;"> to help protect children on our platforms. Disclosing this information, or the data signals used to detect apparent violations of laws protecting children, may undermine the sophisticated techniques we have developed to proactively seek out and report these types of content and interactions. </span></p> <p><span style="font-weight: 400;">One particularly time-consuming and challenging task is ensuring that Meta-internal text strings that describe our systems and products are translated into more easily human-readable terms. For instance, a </span><a href="https://docs.hhvm.com/hack/built-in-types/enum"><span style="font-weight: 400;">Hack enum</span></a><span style="font-weight: 400;"> could define a set of user interface element references. Exposing the jargon-heavy internal versions of these enums would definitely not be meaningful to an end-user — they may not be meaningful at first glance to other employees without sufficient context! In this case, user-friendly labels are created to replace these internal-facing strings. The resulting content is reviewed for explainability, simplicity, and consistency, with product experts also helping to verify that the final version is accurate.</span></p> <p><span style="font-weight: 400;">This process also makes information more useful by reducing duplication.
When engineers build and iterate on a product for our platforms, they may log slightly different versions of the same information with the goal of better understanding how people use the product. For example, when users select an option from a list of actions, each part of the system may use slightly different values that represent the same underlying option, such as an option to move content to trash as part of </span><a href="https://about.fb.com/news/2020/06/introducing-manage-activity/"><span style="font-weight: 400;">Manage Activity</span></a><span style="font-weight: 400;">. As a concrete example, we found this action stored with different values: In the first instance it was entered as </span><span style="font-weight: 400;">MOVE_TO_TRASH</span><span style="font-weight: 400;">, in the second as </span><span style="font-weight: 400;">StoryTrashPostMenuItem</span><span style="font-weight: 400;">, and in the third as </span><span style="font-weight: 400;">FBFeedMoveToTrashOption</span><span style="font-weight: 400;">. These differences stemmed from the fact that the logging in question was coming from different parts of the system with different conventions. Through a series of cross-functional reviews with support from product experts, Meta determines an appropriate column header (e.g., “Which option you interacted with”), and the best label for the option (e.g., “Move to trash”).</span></p> <p><span style="font-weight: 400;">Finally, once content has been reviewed, it can be implemented in code using the renderers we described above. These are responsible for reading the raw values and transforming them into user-friendly representations. This includes transforming raw integer values into meaningful references to entities and converting raw enum values into user-friendly representations. 
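In code, a renderer of this kind can be as small as a reviewed lookup table. The sketch below is illustrative only (the function and table names are assumptions, not Meta's actual renderer API), using the internal "Move to trash" values discussed above:

```python
# Reviewed mapping from internal logging values to one user-facing label.
# The three keys are the real-world variants mentioned in the text; the
# structure of the renderer itself is a hypothetical sketch.
FRIENDLY_LABELS = {
    "MOVE_TO_TRASH": "Move to trash",
    "StoryTrashPostMenuItem": "Move to trash",
    "FBFeedMoveToTrashOption": "Move to trash",
}

def render_option(raw_value: str) -> str:
    """Map an internal enum string to its reviewed, user-friendly label.

    Values without a reviewed label fall through unchanged so they can be
    flagged for review rather than silently dropped.
    """
    return FRIENDLY_LABELS.get(raw_value, raw_value)

print(render_option("StoryTrashPostMenuItem"))  # → Move to trash
```

A lookup table like this also performs the de-duplication described above: three logging conventions collapse to a single label.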
An ID like 1786022095521328 might become “John Doe”; enums with integer values 0, 1, 2 can be converted into text like “Disabled,” “Active,” or “Hidden;” and columns with string enums can remove jargon that is not understandable to end-users and de-duplicate (as in our “Move to trash” example above). </span></p> <p><span style="font-weight: 400;">Together, these culminate in a much friendlier representation of the data that might look like this:</span></p> <figure id="attachment_22228" aria-describedby="caption-attachment-22228" style="width: 1024px" class="wp-caption alignnone"><img loading="lazy" decoding="async" class="size-large wp-image-22228" src="https://engineering.fb.com/wp-content/uploads/2025/01/Meta-data-logs-image-4.png?w=1024" alt="" width="1024" height="197" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/Meta-data-logs-image-4.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/01/Meta-data-logs-image-4.png?resize=916,176 916w, https://engineering.fb.com/wp-content/uploads/2025/01/Meta-data-logs-image-4.png?resize=768,148 768w, https://engineering.fb.com/wp-content/uploads/2025/01/Meta-data-logs-image-4.png?resize=1024,197 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/Meta-data-logs-image-4.png?resize=1536,295 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/Meta-data-logs-image-4.png?resize=96,18 96w, https://engineering.fb.com/wp-content/uploads/2025/01/Meta-data-logs-image-4.png?resize=192,37 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22228" class="wp-caption-text">An example of what a data log might look like, demonstrating meaningful and intelligible output. 
It shows column headers with descriptions and a row of data.</figcaption></figure> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/02/04/security/data-logs-the-latest-evolution-in-metas-access-tools/">Data logs: The latest evolution in Meta’s access tools</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></content:encoded> <post-id xmlns="com-wordpress:feed-additions:1">22222</post-id> </item> <item> <title>How Precision Time Protocol handles leap seconds</title> <link>https://engineering.fb.com/2025/02/03/production-engineering/how-precision-time-protocol-ptp-handles-leap-seconds/</link> <dc:creator><![CDATA[]]></dc:creator> <pubDate>Mon, 03 Feb 2025 17:00:34 +0000</pubDate> <category><![CDATA[Networking & Traffic]]></category> <category><![CDATA[Production Engineering]]></category> <guid isPermaLink="false">https://engineering.fb.com/?p=22212</guid> <description><![CDATA[<p>We’ve previously described why we think it’s time to leave the leap second in the past. In today’s rapidly evolving digital landscape, introducing new leap seconds to account for the long-term slowdown of the Earth’s rotation is a risky practice that, frankly, does more harm than good. 
This is particularly true in the data center [...]</p> <p><a class="btn btn-secondary understrap-read-more-link" href="https://engineering.fb.com/2025/02/03/production-engineering/how-precision-time-protocol-ptp-handles-leap-seconds/">Read More...</a></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/02/03/production-engineering/how-precision-time-protocol-ptp-handles-leap-seconds/">How Precision Time Protocol handles leap seconds</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></description> <content:encoded><![CDATA[<p><span style="font-weight: 400;">We’ve previously described why we think</span> <a href="https://engineering.fb.com/2022/07/25/production-engineering/its-time-to-leave-the-leap-second-in-the-past/"><span style="font-weight: 400;">it’s time to leave the leap second in the past</span></a><span style="font-weight: 400;">. In today’s rapidly evolving digital landscape, introducing new leap seconds to account for the long-term slowdown of the Earth’s rotation is a risky practice that, frankly,</span> <span style="font-weight: 400;">does more harm than good</span><span style="font-weight: 400;">. This is particularly true in the data center space, where new protocols like</span><a href="https://engineering.fb.com/2022/11/21/production-engineering/precision-time-protocol-at-meta/"> <span style="font-weight: 400;">Precision Time Protocol (PTP)</span></a><span style="font-weight: 400;"> are allowing systems to be synchronized down to nanosecond precision. 
</span></p> <p><span style="font-weight: 400;">With the ever-growing demand for higher precision time distribution, and</span> <a href="https://engineering.fb.com/2022/11/21/production-engineering/precision-time-protocol-at-meta/" target="_blank" rel="noopener"><span style="font-weight: 400;">the larger role of PTP for time synchronization</span></a><span style="font-weight: 400;"> in data centers, we need to consider how to address leap seconds within systems that use PTP and are thus much more time-sensitive.</span></p> <h2><span style="font-weight: 400;">Leap second smearing – a solution past its time</span></h2> <p><span style="font-weight: 400;">Leap second smearing, the process of gradually adjusting clock speeds to spread out the correction, has been a common method for handling leap seconds. At Meta, we’ve traditionally focused our smearing effort on <a href="https://engineering.fb.com/2020/03/18/production-engineering/ntp-service/" target="_blank" rel="noopener">NTP</a> since it has been the de facto standard for time synchronization in data centers.</span></p> <p><span style="font-weight: 400;">In large NTP deployments, leap second smearing is generally performed at the</span> <a href="https://engineering.fb.com/2020/03/18/production-engineering/ntp-service/" target="_blank" rel="noopener"><span style="font-weight: 400;">Stratum 2 layer</span></a><span style="font-weight: 400;">, which consists of NTP servers that directly interact with NTP clients (</span><span style="font-weight: 400;">Stratum 3</span><span style="font-weight: 400;">), the downstream users of the NTP service.</span></p> <p><span style="font-weight: 400;"> <img loading="lazy" decoding="async" class="alignnone size-large" src="https://engineering.fb.com/wp-content/uploads/2020/03/Time-Infra-NTP-Service.jpg" width="2000" height="1125" /></span></p> <p><span style="font-weight: 400;">There are multiple approaches to smearing. 
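For intuition, smearing can be sketched as an offset function over a smear window: the fraction of the leap second already applied at a given point in the window. The formulas below are illustrative only, not Meta's exact implementation; the "quadratic" variant shown here is a smoothstep-style ramp that starts and ends with zero slew rate.

```python
# Hypothetical smear profiles spreading a 1-second correction over a window.

def linear_smear(elapsed_s: float, window_s: float) -> float:
    """Fraction of the leap second applied after elapsed_s; constant rate."""
    t = min(max(elapsed_s / window_s, 0.0), 1.0)
    return t

def quadratic_smear(elapsed_s: float, window_s: float) -> float:
    """Smoothstep-style ramp: zero slew rate at both ends of the window."""
    t = min(max(elapsed_s / window_s, 0.0), 1.0)
    return 3 * t**2 - 2 * t**3

# Halfway through a 24-hour window, both profiles have applied half the
# second, but their instantaneous slew rates differ.
window = 24 * 3600.0
print(linear_smear(window / 2, window))     # 0.5
print(quadratic_smear(window / 2, window))  # 0.5
```

The differing instantaneous rates are exactly why fleets mixing the two profiles see transient offsets between hosts during the smearing period.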
In the case of NTP, linear or quadratic smearing formulas can be applied.</span></p> <p><span style="font-weight: 400;">Quadratic smearing is often preferred due to the layered nature of the NTP protocol, where clients are encouraged to dynamically adjust their polling interval as the value of pending correction increases. This solution has its own tradeoffs, such as inconsistent adjustments, which can lead to different offset values across a large server fleet. </span></p> <p><span style="font-weight: 400;">Linear smearing may be superior if an entire fleet is relying on the same time sources and performs smearing at the same time. In combination with more frequent sync cycles of typically once per second, this is a more predictable, precise and reliable approach.</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22216" src="https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?w=1024" alt="" width="1024" height="576" srcset="https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png 2240w, https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?resize=916,516 916w, https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?resize=1536,865 1536w, https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?resize=2048,1153 2048w, https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?resize=192,108 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p> <h2><span style="font-weight: 
400;">Handling leap seconds in PTP</span></h2> <p><span style="font-weight: 400;">In contrast to NTP, which synchronizes at the millisecond level, PTP provides a level of precision typically in the range of nanoseconds. At this level of precision even periodic linear smearing would create too much delta across the fleet and violate guarantees provided to the customers.</span></p> <p><span style="font-weight: 400;">To handle leap seconds in a PTP environment we take an algorithmic approach that shifts time automatically for systems that use PTP and combine this with an emphasis on using</span> <a href="https://en.wikipedia.org/wiki/Coordinated_Universal_Time" target="_blank" rel="noopener"><span style="font-weight: 400;">Coordinated Universal Time (UTC)</span></a><span style="font-weight: 400;"> over International Atomic Time (TAI).</span></p> <h3><span style="font-weight: 400;">Self-smearing</span></h3> <p><span style="font-weight: 400;">At Meta, users interact with the PTP service via the</span> <a href="https://engineering.fb.com/2022/11/21/production-engineering/precision-time-protocol-at-meta/#client" target="_blank" rel="noopener"><span style="font-weight: 400;">fbclock</span></a><span style="font-weight: 400;"> library, which provides a tuple of values, </span><i><span style="font-weight: 400;">{<span style="font-family: 'courier new', courier;">earliest_ns, latest_ns}</span>,</span></i><span style="font-weight: 400;"> which represents a time interval referred to as the</span> <span style="font-weight: 400;">Window of Uncertainty (WOU). Each time the library is called during the smearing period we adjust the return values based on the smearing algorithm, which shifts the time values 1 nanosecond every 62.5 microseconds.</span></p> <p><span style="font-weight: 400;">This approach has a number of advantages, including being completely stateless and reproducible. 
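A minimal sketch of this self-smearing computation might look like the following. The function name and structure are assumptions for illustration, not fbclock's actual API; only the 1-nanosecond-per-62.5-microsecond rate comes from the description above.

```python
# Stateless self-smearing: the applied shift is purely a function of time
# elapsed since the smear start, so any host (even after a reboot) computes
# the same position.

STEP_NS = 62_500          # 62.5 microseconds, expressed in nanoseconds
FULL_SECOND_NS = 1_000_000_000

def smear_offset_ns(now_ns: int, smear_start_ns: int) -> int:
    """Nanoseconds already smeared at now_ns, capped at one full second."""
    if now_ns <= smear_start_ns:
        return 0
    elapsed = now_ns - smear_start_ns
    return min(elapsed // STEP_NS, FULL_SECOND_NS)

# At this rate, smearing a full second takes 1e9 steps of 62.5 us each,
# i.e., 62,500 seconds (roughly 17.4 hours), and 60 seconds of elapsed
# time corresponds to a shift of 960,000 ns, just under 1 ms.
```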
The service continues to utilize TAI timestamps but can return UTC timestamps to clients via the API. And, as the start time is determined by</span> <a href="https://engineering.fb.com/2020/03/18/production-engineering/ntp-service/" target="_blank" rel="noopener"><span style="font-weight: 400;">tzdata</span></a><span style="font-weight: 400;"> timestamps, the current smearing position can be determined even after a server is rebooted.</span></p> <p><span style="font-weight: 400;">This approach does come with some tradeoffs. For example, as the leap smearing strategy differs between the NTP (quadratic) and PTP (linear) ecosystems, services may struggle to match timestamps acquired from different sources during the smearing period. </span></p> <p><span style="font-weight: 400;">The difference between the two approaches can exceed 100 microseconds, creating challenges for services that consume time from both systems.</span></p> <h3><span style="font-weight: 400;">TAI over UTC</span></h3> <p><span style="font-weight: 400;">The smearing strategy we implemented in our fbclock library shows good performance. However, it still introduces significant time deltas between multiple hosts during the smearing period, despite being fully stateless and using small (1 nanosecond) and fixed step sizes. </span></p> <p><span style="font-weight: 400;">Another significant drawback comes from periodically running jobs. Smearing time means our scheduling is off by close to 1 millisecond after 60 seconds for services that run at precise intervals. 
</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22215" src="https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png?w=1024" alt="" width="1024" height="617" srcset="https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png 2240w, https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png?resize=916,552 916w, https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png?resize=768,463 768w, https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png?resize=1024,617 1024w, https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png?resize=1536,925 1536w, https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png?resize=2048,1233 2048w, https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png?resize=96,58 96w, https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png?resize=192,116 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p> <p><span style="font-weight: 400;">This is not ideal for a service that guarantees nanosecond-level accuracy and precision.</span></p> <p><span style="font-weight: 400;">As a result, we recommend that customers use TAI over UTC and thus avoid having to deal with the leap seconds. Unfortunately, though, in most cases, the conversion to UTC is still required and eventually has to be performed somewhere.</span></p> <h2><span style="font-weight: 400;">PTP without leap seconds</span></h2> <p><span style="font-weight: 400;">At Meta, we support the recent push to</span><a href="https://www.nytimes.com/2022/11/19/science/time-leap-second-bipm.html"> <span style="font-weight: 400;">freeze any new leap seconds after 2035</span></a><span style="font-weight: 400;">. If we can cease the introduction of new leap seconds, then the entire industry can rely on UTC instead of TAI for higher precision timekeeping. 
This will simplify infrastructure and remove the need for different smearing solutions.</span></p> <p><span style="font-weight: 400;">Ultimately, a future without leap seconds is one where we can push systems to greater levels of timekeeping precision more easily and efficiently.</span></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/02/03/production-engineering/how-precision-time-protocol-ptp-handles-leap-seconds/">How Precision Time Protocol handles leap seconds</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></content:encoded> <post-id xmlns="com-wordpress:feed-additions:1">22212</post-id> </item> <item> <title>Bringing Jetpack Compose to Instagram for Android</title> <link>https://engineering.fb.com/2025/01/24/android/bringing-jetpack-compose-to-instagram-for-android/</link> <dc:creator><![CDATA[]]></dc:creator> <pubDate>Fri, 24 Jan 2025 17:30:53 +0000</pubDate> <category><![CDATA[Android]]></category> <category><![CDATA[Culture]]></category> <category><![CDATA[DevInfra]]></category> <category><![CDATA[Instagram]]></category> <category><![CDATA[Meta Tech Podcast]]></category> <guid isPermaLink="false">https://engineering.fb.com/?p=22155</guid> <description><![CDATA[<p>Introducing a new Android UI framework like Jetpack Compose into an existing app is more complicated than importing some AARs and coding away. What if your app has specific performance goals to meet? What about existing design components, integrations with navigation, and logging frameworks? 
On this episode of the Meta Tech Podcast, Pascal Hartig is [...]</p> <p><a class="btn btn-secondary understrap-read-more-link" href="https://engineering.fb.com/2025/01/24/android/bringing-jetpack-compose-to-instagram-for-android/">Read More...</a></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/01/24/android/bringing-jetpack-compose-to-instagram-for-android/">Bringing Jetpack Compose to Instagram for Android</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></description> <content:encoded><![CDATA[<p>Introducing a new Android UI framework like Jetpack Compose into an existing app is more complicated than importing some AARs and coding away. What if your app has specific performance goals to meet? What about existing design components, integrations with navigation, and logging frameworks?</p> <p>On this episode of the Meta Tech Podcast, <a href="https://www.threads.net/@passy_" target="_blank" rel="noopener">Pascal Hartig</a> is joined by Summer, a software engineer whose team handles large-scale migrations for Instagram. Summer walks through the various thoughtful and intricate phases that Instagram goes through to ensure that developers have the best possible experience when working on our codebases. 
She also discusses balancing all of this with Meta’s infrastructure teams, who have to maintain multiple implementations at once.</p> <p>Learn how Meta approaches the rollout of a new framework and more!</p> <p>Download or listen to the podcast episode below:</p> <p><iframe loading="lazy" style="border: none;" title="Libsyn Player" src="//html5-player.libsyn.com/embed/episode/id/34599290/height/90/theme/custom/thumbnail/yes/direction/forward/render-playlist/no/custom-color/000000/" width="100%" height="90" scrolling="no" allowfullscreen="allowfullscreen"></iframe><br /> You can also find the episode wherever you get your podcasts, including:</p> <ul> <li><a href="https://open.spotify.com/episode/3tKuh8iNENKMugNdZ0jTqU" target="_blank" rel="noopener">Spotify</a></li> <li><a href="https://podcasts.apple.com/us/podcast/jetpack-compose-at-meta/id1370910331?i=1000681564997" target="_blank" rel="noopener">Apple Podcasts</a></li> <li><a href="https://pca.st/as8y2yqo" target="_blank" rel="noopener">Pocket Casts</a></li> <li><a href="https://overcast.fm/login" target="_blank" rel="noopener">Overcast</a></li> </ul> <p>The <a href="https://insidefacebookmobile.libsyn.com/" target="_blank" rel="noopener">Meta Tech Podcast</a> is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.</p> <p>Send us feedback on <a href="https://instagram.com/metatechpod" target="_blank" rel="noopener">Instagram</a>, <a href="https://threads.net/@metatechpod" target="_blank" rel="noopener">Threads</a>, or <a href="https://twitter.com/metatechpod" target="_blank" rel="noopener">X</a>.</p> <p>And if you’re interested in learning more about career opportunities at Meta visit the <a href="https://www.metacareers.com/?ref=engineering.fb.com" target="_blank" rel="noopener">Meta Careers</a> page.</p> <p>The post <a rel="nofollow" 
href="https://engineering.fb.com/2025/01/24/android/bringing-jetpack-compose-to-instagram-for-android/">Bringing Jetpack Compose to Instagram for Android</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></content:encoded> <post-id xmlns="com-wordpress:feed-additions:1">22155</post-id> </item> <item> <title>How Meta discovers data flows via lineage at scale</title> <link>https://engineering.fb.com/2025/01/22/security/how-meta-discovers-data-flows-via-lineage-at-scale/</link> <dc:creator><![CDATA[]]></dc:creator> <pubDate>Thu, 23 Jan 2025 05:00:45 +0000</pubDate> <category><![CDATA[Security]]></category> <guid isPermaLink="false">https://engineering.fb.com/?p=22168</guid> <description><![CDATA[<p>Data lineage is an instrumental part of Meta’s Privacy Aware Infrastructure (PAI) initiative, a suite of technologies that efficiently protect user privacy. It is a critical and powerful tool for scalable discovery of relevant data and data flows, which supports privacy controls across Meta’s systems. This allows us to verify that our users’ everyday interactions [...]</p> <p><a class="btn btn-secondary understrap-read-more-link" href="https://engineering.fb.com/2025/01/22/security/how-meta-discovers-data-flows-via-lineage-at-scale/">Read More...</a></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/01/22/security/how-meta-discovers-data-flows-via-lineage-at-scale/">How Meta discovers data flows via lineage at scale</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></description> <content:encoded><![CDATA[<ul> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Data lineage is an instrumental part of Meta’s Privacy Aware Infrastructure (PAI) initiative, a suite of technologies that efficiently protect user privacy. 
It is a critical and powerful tool for scalable discovery of relevant data and data flows, which supports privacy controls across Meta’s systems. This allows us to verify that our users’ everyday interactions are protected across our family of apps, such as their religious views in the Facebook Dating app, the example we’ll walk through in this post.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">In order to build high-quality data lineage, we developed different techniques to collect data flow signals across different technology stacks: static code analysis for different languages, runtime instrumentation, and input and output data matching, etc. We then built an intuitive UX into our tooling that enables developers to effectively consume all of this lineage data in a systematic way, saving significant engineering time for building privacy controls. </span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">As we expanded PAI across Meta, </span><a href="#learnings"><span style="font-weight: 400;">we gained valuable insights</span></a><span style="font-weight: 400;"> about the data lineage space. Our understanding of the privacy space evolved, revealing the need for early focus on data lineage, tooling, a cohesive ecosystem of libraries, and more. These initiatives have assisted in accelerating the development of data lineage and implementing purpose limitation controls more quickly and efficiently.</span></li> </ul> <p><span style="font-weight: 400;">At Meta, we believe that privacy enables product innovation. 
This belief has led us to developing </span><a href="https://engineering.fb.com/2024/08/27/security/privacy-aware-infrastructure-purpose-limitation-meta/" target="_blank" rel="noopener"><span style="font-weight: 400;">Privacy Aware Infrastructure (PAI)</span></a><span style="font-weight: 400;">, which offers efficient and reliable first-class privacy constructs embedded in Meta infrastructure to address different privacy requirements, such as </span><a href="https://engineering.fb.com/2024/08/27/security/privacy-aware-infrastructure-purpose-limitation-meta/" target="_blank" rel="noopener"><span style="font-weight: 400;">purpose limitation</span></a><span style="font-weight: 400;">, which restricts the purposes for which data can be processed and used. </span></p> <p><span style="font-weight: 400;">In this blog, we will delve into an early stage in PAI implementation: </span><i><span style="font-weight: 400;">data lineage</span></i><span style="font-weight: 400;">. Data lineage refers to the process of tracing the journey of data as it moves through various systems, illustrating how data transitions from one data asset, such as a database table (the </span><i><span style="font-weight: 400;">source</span></i><span style="font-weight: 400;"> asset), to another (the </span><i><span style="font-weight: 400;">sink</span></i><span style="font-weight: 400;"> asset). We’ll also walk through how we track the lineage of users’ “religion” information in our Facebook Dating app.</span></p> <p><span style="font-weight: 400;">Millions of data assets are vital for supporting our product ecosystem, ensuring the functionality our users anticipate, maintaining high product quality, and safeguarding user safety and integrity. Data lineage enables us to efficiently navigate these assets and protect user data. 
It enhances the traceability of data flows within systems, ultimately empowering developers to swiftly implement privacy controls and create innovative products.</span></p> <p><span style="font-weight: 400;">Note that data lineage is dependent on having already completed important and complex preliminary steps to inventory, schematize, and annotate data assets into a unified asset catalog. This took Meta multiple years to complete across our millions of disparate data assets, and we’ll cover each of these more deeply in future blog posts:</span></p> <ul> <li><strong>Inventorying</strong> involves collecting various code and data assets (e.g., web endpoints, data tables, AI models) used across Meta.</li> <li><strong>Schematization</strong> expresses data assets in structural detail (e.g., indicating that a data asset has a field called “religion”).</li> <li><strong>Annotation</strong> labels data to describe its content (e.g., specifying that the identity column contains religion data).</li> </ul> <h2><span style="font-weight: 400;">Understanding data lineage at Meta</span></h2> <p><span style="font-weight: 400;">To establish robust privacy controls, an essential part of our PAI initiative is to understand how data flows across different systems. 
Data lineage is part of this discovery step in the PAI workflow, as shown in the following diagram:</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22176" src="https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png?w=1024" alt="" width="1024" height="217" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png 2710w, https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png?resize=916,194 916w, https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png?resize=768,163 768w, https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png?resize=1024,217 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png?resize=1536,326 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png?resize=2048,435 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png?resize=96,20 96w, https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png?resize=192,41 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p> <p><span style="font-weight: 400;">Data lineage is a key precursor to implementing Policy Zones, our information flow control technology, because it answers the question, “</span><span style="font-weight: 400;">Where does my data come from and where does it go?” – helping inform the right places to apply privacy controls.</span><span style="font-weight: 400;"> In conjunction with Policy Zones, data lineage provides the following key benefits to thousands of developers at Meta: </span></p> <ul> <li style="font-weight: 400;" aria-level="1"><b>Scalable data flow discovery</b><span style="font-weight: 400;">: Data lineage answers the question above by providing an end-to-end, scalable graph of relevant data flows. 
We can leverage the lineage graphs to visualize and explain the flow of relevant data from the point where it is collected to all the places where it is processed.</span></li> <li style="font-weight: 400;" aria-level="1"><b>Efficient rollout of privacy controls</b><span style="font-weight: 400;">: By leveraging data lineage to track data flows, we can easily pinpoint the optimal integration points for privacy controls like Policy Zones within the codebase, streamlining the rollout process. Thus we have developed a powerful flow discovery tool as part of our PAI tool suite, Policy Zone Manager (PZM), based on data lineage. PZM enables developers to rapidly identify multiple downstream assets from a set of sources simultaneously, thereby accelerating the rollout process of privacy controls.</span></li> <li style="font-weight: 400;" aria-level="1"><b>Continuous compliance verification</b><span style="font-weight: 400;">: Once the privacy requirement has been fully implemented, data lineage plays a vital role in monitoring and validating data flows continuously, in addition to the enforcement mechanisms such as Policy Zones.</span></li> </ul> <p><span style="font-weight: 400;">Traditionally, data lineage has been collected via code inspection using manually authored data flow diagrams and spreadsheets. However, this approach does not scale in large and dynamic environments, such as Meta, with billions of lines of continuously evolving code. 
To tackle this challenge, we’ve developed a robust and scalable lineage solution that uses static code analysis signals as well as runtime signals.</span></p> <h2><span style="font-weight: 400;">Walkthrough: Implementing data lineage for religion data</span></h2> <p><span style="font-weight: 400;">We’ll share how we have automated lineage tracking to identify religion data flows through our core systems, eventually creating an end-to-end, precise view of downstream religion assets being protected, via the following two key stages:</span></p> <ol> <li style="font-weight: 400;" aria-level="1"><b>Collecting data flow signals</b><span style="font-weight: 400;">: a process to capture data flow signals from many processing activities across different systems, not only for religion, but for all other types of data, to create an end-to-end lineage graph. </span></li> <li style="font-weight: 400;" aria-level="1"><b>Identifying relevant data flows</b><span style="font-weight: 400;">: a process to identify the specific subset of data flows (“subgraph”) within the lineage graph that pertains to religion. </span></li> </ol> <p><span style="font-weight: 400;">These stages propagate through various systems including </span><i><span style="font-weight: 400;">function-based systems</span></i><span style="font-weight: 400;"> that load, process, and propagate data through stacks of function calls in different programming languages (e.g., Hack, C++, Python, etc.) such as web systems and backend services, and </span><i><span style="font-weight: 400;">batch-processing systems</span></i><span style="font-weight: 400;"> that process data rows in batch (mainly via SQL) such as data warehouse and AI systems. 
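The two stages above can be illustrated with a small sketch: collected data flow signals form a directed graph of asset-to-asset edges, and the religion-relevant subgraph is simply everything reachable from the annotated source assets. The asset names and edge list below are invented for illustration, not Meta's actual asset catalog.

```python
# Hypothetical lineage graph built from collected data flow signals.
from collections import defaultdict, deque

edges = [
    ("dating_profile_endpoint", "dating_profiles_table"),
    ("dating_profiles_table", "match_candidates_table"),
    ("match_candidates_table", "match_ranking_model"),
    ("unrelated_table", "other_pipeline"),  # not connected to religion data
]

graph = defaultdict(list)
for src, sink in edges:
    graph[src].append(sink)

def downstream_assets(sources):
    """BFS over the lineage graph: every asset reachable from the sources."""
    seen, queue = set(sources), deque(sources)
    while queue:
        asset = queue.popleft()
        for nxt in graph[asset]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Stage 2: the religion subgraph is whatever is reachable from the
# annotated source asset(s).
religion_flows = downstream_assets({"dating_profile_endpoint"})
```

In this toy graph, `religion_flows` contains the endpoint, both tables, and the model, while the unrelated assets stay outside the subgraph, which is what lets privacy controls target only the relevant flows.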
</span></p> <p><span style="font-weight: 400;">For simplicity, we will demonstrate these for the web, the data warehouse, and AI, per the diagram below.</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22187" src="https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png?w=1024" alt="" width="1024" height="559" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png 2743w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png?resize=916,500 916w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png?resize=768,419 768w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png?resize=1024,559 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png?resize=1536,838 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png?resize=2048,1118 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png?resize=96,52 96w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png?resize=192,105 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p> <h3><span style="font-weight: 400;">Collecting data flow signals for the web system</span></h3> <p><span style="font-weight: 400;">When setting up a profile on the Facebook Dating app, people can populate their religious views. This information is then utilized to identify relevant matches with other people who have specified matched values in their dating preferences. On Dating, religious views are subject to purpose limitation requirements, for example, </span><a href="https://about.fb.com/news/2020/10/privacy-matters-facebook-dating/"><span style="font-weight: 400;">they will not be used to personalize experiences on other Facebook Products</span></a><span style="font-weight: 400;">. 
</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22169" src="https://engineering.fb.com/wp-content/uploads/2025/01/dating_religion_s.png?w=1024" alt="" width="1024" height="651" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/dating_religion_s.png 1964w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_religion_s.png?resize=916,582 916w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_religion_s.png?resize=768,488 768w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_religion_s.png?resize=1024,651 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_religion_s.png?resize=1536,976 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_religion_s.png?resize=96,61 96w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_religion_s.png?resize=192,122 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p> <p><span style="font-weight: 400;">We start with someone entering their religion information on their dating media profile using their mobile device, which is then transmitted to a web endpoint. 
The web endpoint subsequently logs the data into a logging table and stores it in a database, as depicted in the following code snippet:</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22191" src="https://engineering.fb.com/wp-content/uploads/2025/01/dating_profile_endpoint_code.png?w=895" alt="" width="895" height="1024" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/dating_profile_endpoint_code.png 1504w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_profile_endpoint_code.png?resize=801,916 801w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_profile_endpoint_code.png?resize=768,878 768w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_profile_endpoint_code.png?resize=895,1024 895w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_profile_endpoint_code.png?resize=1343,1536 1343w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_profile_endpoint_code.png?resize=96,110 96w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_profile_endpoint_code.png?resize=192,220 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p> <p><span style="font-weight: 400;">Now let’s see how we collect lineage signals. To do this, we employ both static and runtime analysis tools to discover data flows, focusing in particular on where religion is logged and stored. </span><span style="font-weight: 400;">Combining the two enhances our ability to accurately track and manage data flows.</span></p> <p><a href="https://engineering.fb.com/2021/10/20/security/static-analysis-award/"><span style="font-weight: 400;">Static analysis tools</span></a><span style="font-weight: 400;"> simulate code execution to map out data flows within our systems. They also emit quality signals that indicate how confident we are that a given data flow signal is a true positive.
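</span></p> <p><span style="font-weight: 400;">As a deliberately tiny illustration of the source-to-sink reasoning such a tool performs, consider propagating a “taint” through straight-line assignments; the mini-language, variable names, and program below are hypothetical and not Meta code:</span></p>

```python
# Toy flavor of static data-flow analysis: propagate "taint" from a source
# variable through assignments and report which sinks it reaches.
# The mini-program and every name in it are hypothetical.
PROGRAM = [
    ("assign", "religion", "request.religion"),  # source of sensitive data
    ("assign", "payload", "religion"),           # copy keeps the taint
    ("assign", "count", "num_matches"),          # unrelated value, untainted
    ("sink", "logger.log", "payload"),           # tainted value reaches a sink
    ("sink", "logger.log", "count"),             # untainted value, no flow
]

def tainted_sinks(source_var="request.religion"):
    """Return (sink, variable) pairs that receive data derived from the source."""
    tainted = {source_var}
    flows = []
    for op, lhs, rhs in PROGRAM:
        if op == "assign" and rhs in tainted:
            tainted.add(lhs)
        elif op == "sink" and rhs in tainted:
            flows.append((lhs, rhs))
    return flows
```

<p><span style="font-weight: 400;">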
However, these tools are limited by their lack of access to runtime data, which can lead to false positives from unexecuted code.</span></p> <p><span style="font-weight: 400;">To address this limitation, we utilize </span><b>Privacy Probes</b><span style="font-weight: 400;">, a key component of our PAI lineage technologies. Privacy Probes automate data flow discovery by collecting runtime signals. These signals are gathered in real time during the execution of requests, allowing us to trace the flow of data into loggers, databases, and other services. </span></p> <p><span style="font-weight: 400;">We have instrumented Meta’s core data frameworks and libraries at both the data origin points (sources) and their eventual outputs (sinks), such as logging framework, which allows for comprehensive data flow tracking. This approach is exemplified in the following code snippet:</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22192" src="https://engineering.fb.com/wp-content/uploads/2025/01/endpoint_controller_base_code.png?w=1024" alt="" width="1024" height="791" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/endpoint_controller_base_code.png 1404w, https://engineering.fb.com/wp-content/uploads/2025/01/endpoint_controller_base_code.png?resize=916,707 916w, https://engineering.fb.com/wp-content/uploads/2025/01/endpoint_controller_base_code.png?resize=768,593 768w, https://engineering.fb.com/wp-content/uploads/2025/01/endpoint_controller_base_code.png?resize=1024,791 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/endpoint_controller_base_code.png?resize=96,74 96w, https://engineering.fb.com/wp-content/uploads/2025/01/endpoint_controller_base_code.png?resize=192,148 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><br /> <span style="font-weight: 400;">During runtime execution, Privacy Probes does the following:</span></p> <ol> <li style="font-weight: 400;" aria-level="1"><b>Capturing payloads</b><span 
style="font-weight: 400;">: It captures source and sink payloads in memory on a sampled basis, along with supplementary metadata such as event timestamps, asset identifiers, and stack traces as evidence for the data flow. </span></li> <li style="font-weight: 400;" aria-level="1"><b>Comparing payloads</b><span style="font-weight: 400;">: It then compares the source and sink payloads within a request to identify data matches, which helps in understanding how data flows through the system. </span></li> <li style="font-weight: 400;" aria-level="1"><b>Categorizing results</b><span style="font-weight: 400;">: It categorizes results into two sets. The </span><i><span style="font-weight: 400;">match-set</span></i><span style="font-weight: 400;"> includes pairs of source and sink assets where the data matches exactly or one value is contained by the other, providing high-confidence evidence of data flow between the assets. The </span><i><span style="font-weight: 400;">full-set</span></i><span style="font-weight: 400;"> includes all source and sink pairs within a request, regardless of whether the sink is tainted by the source. The full-set is a superset of the match-set; it is noisier, but still important to send to human reviewers, since it may contain transformed data flows.
</span></li> </ol> <p><span style="font-weight: 400;">The above procedure is depicted in the diagram below:</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22177" src="https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png?w=1024" alt="" width="1024" height="378" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png 2216w, https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png?resize=916,338 916w, https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png?resize=768,283 768w, https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png?resize=1024,378 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png?resize=1536,566 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png?resize=2048,755 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png?resize=96,35 96w, https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png?resize=192,71 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p> <p><span style="font-weight: 400;">Let’s look at the following examples, where various religions are received in an endpoint and various values (copied or transformed) are logged in three different loggers:</span></p> <table border="1"> <tbody> <tr> <td bgcolor="#a0d9f7"><b>Input Value (source)</b></td> <td bgcolor="#a0d9f7"><b>Output Value (sink)</b></td> <td bgcolor="#a0d9f7"><b>Data Operation</b></td> <td bgcolor="#a0d9f7"><b>Match Result</b></td> <td bgcolor="#a0d9f7"><b>Flow Confidence</b></td> </tr> <tr> <td><span style="font-weight: 400;">“Atheist”</span></td> <td><span style="font-weight: 400;">“Atheist”</span></td> <td><span style="font-weight: 400;">Data Copy</span></td> <td><span style="font-weight: 400;">EXACT_MATCH</span></td> <td><span style="font-weight: 400;">HIGH</span></td> </tr> <tr> <td><span style="font-weight: 400;">“Buddhist”</span></td>
<td><span style="font-weight: 400;">{metadata: {religion: Buddhist}}</span></td> <td><span style="font-weight: 400;">Substring</span></td> <td><span style="font-weight: 400;">CONTAINS</span></td> <td><span style="font-weight: 400;">HIGH</span></td> </tr> <tr> <td><span style="font-weight: 400;">{religions:<br /> </span><span style="font-weight: 400;">[“Catholic”, “Christian”]}</span></td> <td><span style="font-weight: 400;">{count : 2}</span></td> <td><span style="font-weight: 400;">Transformed</span></td> <td><span style="font-weight: 400;">NO_MATCH</span></td> <td><span style="font-weight: 400;">LOW</span></td> </tr> </tbody> </table> <p><span style="font-weight: 400;"><br /> In the examples above, the first row shows an exact copy of the source value in the sink, and the second shows the source value contained within the sink payload; both therefore belong to the high-confidence match-set. The third row depicts a transformed data flow, where the input list is reduced to a count of values before being logged, so it belongs only to the full-set.
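</span></p> <p><span style="font-weight: 400;">The matching logic behind the table above can be sketched roughly as follows (a simplification: the production system also handles sampling, nested payloads, and normalization):</span></p>

```python
import json

# Sketch of Privacy Probes' payload comparison: classify how a source payload
# relates to a sink payload. Simplified; the production system also handles
# sampling, nested structures, and normalization.
def match_result(source, sink):
    src, snk = json.dumps(source), json.dumps(sink)
    if src == snk:
        return "EXACT_MATCH"  # data copied verbatim: high-confidence match-set
    if src.strip('"') in snk:
        return "CONTAINS"     # source embedded in sink: high-confidence match-set
    return "NO_MATCH"         # possibly a transformed flow: full-set only
```

<p><span style="font-weight: 400;">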
</span></p> <p><span style="font-weight: 400;">These signals together are used to construct a lineage graph to understand the flow of data through our web system as shown in the following diagram:</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22174" src="https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png?w=1024" alt="" width="1024" height="312" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png 2437w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png?resize=916,279 916w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png?resize=768,234 768w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png?resize=1024,312 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png?resize=1536,468 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png?resize=2048,624 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png?resize=96,29 96w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png?resize=192,59 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p> <h3><span style="font-weight: 400;">Collecting data flow signals for the data warehouse system</span></h3> <p><span style="font-weight: 400;">With the user’s religion logged in our web system, it can propagate to the data warehouse for offline processing. To gather data flow signals, we employ a combination of both runtime instrumentation and static code analysis in a different way from the web system. 
The involved SQL queries are logged for data processing activities by the </span><a href="https://research.facebook.com/publications/presto-sql-on-everything/"><span style="font-weight: 400;">Presto</span></a><span style="font-weight: 400;"> and </span><a href="https://spark.apache.org/"><span style="font-weight: 400;">Spark</span></a><span style="font-weight: 400;"> compute engines (among others). Static analysis is then performed for the logged SQL queries and job configs in order to extract data flow signals.</span></p> <p><span style="font-weight: 400;">Let’s examine a simple SQL query example that processes data for the data warehouse as the following:</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22190" src="https://engineering.fb.com/wp-content/uploads/2025/01/dating_data_sql_code.png?w=1024" alt="" width="1024" height="292" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/dating_data_sql_code.png 1404w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_data_sql_code.png?resize=916,261 916w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_data_sql_code.png?resize=768,219 768w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_data_sql_code.png?resize=1024,292 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_data_sql_code.png?resize=96,27 96w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_data_sql_code.png?resize=192,55 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><br /> <span style="font-weight: 400;">We’ve developed a </span><a href="https://engineering.fb.com/2022/11/30/data-infrastructure/static-analysis-sql-queries/" target="_blank" rel="noopener"><span style="font-weight: 400;">SQL analyzer</span></a><span style="font-weight: 400;"> to extract data flow signals between the input table, “safety_log_tbl” and the output table, “safety_training_tbl” as shown in the following diagram. 
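</span></p> <p><span style="font-weight: 400;">As a toy version of this kind of extraction, the sketch below pulls table-level lineage out of a SQL string; a production analyzer parses the full SQL grammar, whereas this regex-based sketch (with illustrative table names) handles only the simplest INSERT INTO ... SELECT shape:</span></p>

```python
import re

# Toy table-level lineage extraction from a SQL string. A real analyzer parses
# the full SQL grammar; this only handles "INSERT INTO ... SELECT ... FROM ...".
def table_lineage(sql):
    output = re.search(r"INSERT\s+INTO\s+(\w+)", sql, re.IGNORECASE)
    inputs = re.findall(r"(?:FROM|JOIN)\s+(\w+)", sql, re.IGNORECASE)
    return {
        "output": output.group(1) if output else None,  # table being written
        "inputs": set(inputs),                          # tables being read
    }
```

<p><span style="font-weight: 400;">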
In practice, we also collect more granular lineage, such as column-level lineage (e.g., “user_id” -> “target_user_id”, “religion” -> “target_religion”).</span></p> <p><span style="font-weight: 400;">There are instances where data is not fully processed by SQL queries, resulting in logs that contain data flow signals for either reads or writes, but not both.</span><span style="font-weight: 400;"> To ensure we have complete lineage data, we leverage contextual information (such as the execution environment and job or trace IDs) collected at runtime to connect these reads and writes together. </span></p> <p><span style="font-weight: 400;">The following diagram illustrates how the lineage graph has expanded:</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22173" src="https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png?w=1024" alt="" width="1024" height="593" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png 2436w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png?resize=916,530 916w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png?resize=768,445 768w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png?resize=1024,593 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png?resize=1536,889 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png?resize=2048,1185 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png?resize=96,56 96w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png?resize=192,111 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p> <h3><span style="font-weight: 400;">Collecting data flow signals for the AI system</span></h3> <p><span style="font-weight: 400;">For our AI systems, we collect lineage signals by
tracking relationships between various assets, such as input datasets, features, models, workflows, and inferences. A common approach is to extract data flows from job configurations used for different AI activities such as model training.</span><span style="font-weight: 400;"><br /> </span><span style="font-weight: 400;"><br /> </span><span style="font-weight: 400;">For instance, in order to improve the relevance of dating matches, we use an AI model to recommend potential matches based on shared religious views from users. Let’s take a look at the following training config example for this model that uses religion data:</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22193" src="https://engineering.fb.com/wp-content/uploads/2025/01/training_config_code.png?w=1024" alt="" width="1024" height="829" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/training_config_code.png 1404w, https://engineering.fb.com/wp-content/uploads/2025/01/training_config_code.png?resize=916,741 916w, https://engineering.fb.com/wp-content/uploads/2025/01/training_config_code.png?resize=768,621 768w, https://engineering.fb.com/wp-content/uploads/2025/01/training_config_code.png?resize=1024,829 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/training_config_code.png?resize=96,78 96w, https://engineering.fb.com/wp-content/uploads/2025/01/training_config_code.png?resize=192,155 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p> <p><span style="font-weight: 400;">By parsing this config obtained from the model training service, we can track the data flow from the input dataset (with asset ID asset://hive.table/dating_training_tbl) and feature (with asset ID asset://ai.feature/DATING_USER_RELIGION_SCORE) to the model (with asset ID asset://ai.model/dating_ranking_model).</span></p> <p><span style="font-weight: 400;">Our AI systems are also instrumented so that asset relationships and data flow signals are captured at various 
points at runtime, including data-loading layers (e.g., </span><a href="https://engineering.fb.com/2022/09/19/ml-applications/data-ingestion-machine-learning-training-meta/"><span style="font-weight: 400;">DPP</span></a><span style="font-weight: 400;">) and libraries (e.g., </span><a href="https://pytorch.org/"><span style="font-weight: 400;">PyTorch</span></a><span style="font-weight: 400;">), workflow engines (e.g., </span><a href="https://engineering.fb.com/2016/05/09/core-infra/introducing-fblearner-flow-facebook-s-ai-backbone/"><span style="font-weight: 400;">FBLearner Flow</span></a><span style="font-weight: 400;">), training frameworks, inference systems (as backend services), etc. Lineage collection for backend services utilizes the approach for function-based systems described above. By matching the source and sink assets for different data flow signals, we are able to capture a holistic lineage graph at the desired granularities:</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22172" src="https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png?w=1024" alt="" width="1024" height="600" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png 2416w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png?resize=916,537 916w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png?resize=768,450 768w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png?resize=1024,600 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png?resize=1536,900 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png?resize=2048,1200 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png?resize=96,56 96w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png?resize=192,113 192w" 
sizes="auto, (max-width: 992px) 100vw, 62vw" /></p> <h2><span style="font-weight: 400;">Identifying relevant data flows from a lineage graph</span></h2> <p><span style="font-weight: 400;">Now that we have the lineage graph at our disposal, how can we effectively distill the subset of data flows pertinent to a specific privacy requirement for religion data? </span><span style="font-weight: 400;">To address this question, we have developed an iterative analysis tool that enables developers to pinpoint precise data flows and systematically filter out irrelevant ones</span><span style="font-weight: 400;">. The tool kicks off an iterative discovery process, aided by the lineage graph and privacy controls from Policy Zones, to narrow down the most relevant flows. This refined data allows developers to make a final determination about the flows they would like to use, producing an optimal path for traversing the lineage graph. The following are the major steps involved, captured holistically in the diagram below:</span></p> <ol> <li style="font-weight: 400;" aria-level="1"><b>Discover data flows: </b><span style="font-weight: 400;">identify data flows from source assets and stop at downstream assets with low-confidence flows (yellow nodes). </span></li> <li style="font-weight: 400;" aria-level="1"><b>Exclude and include candidates: </b><span style="font-weight: 400;">Developers or automated heuristics exclude candidates (red nodes) that don’t carry religion data and include the remaining ones (green nodes). Excluding the red nodes early prunes all of their downstream assets in a cascaded manner, significantly reducing developer effort.
As an additional safeguard, developers also implement privacy controls via Policy Zones, so all relevant data flows can be captured.</span></li> <li style="font-weight: 400;" aria-level="1"><b>Repeat discovery cycle: </b><span style="font-weight: 400;">use the green nodes as new sources and repeat the cycle until no more green nodes are confirmed. </span></li> </ol> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22170" src="https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png?w=1024" alt="" width="1024" height="711" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png 2232w, https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png?resize=916,636 916w, https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png?resize=768,533 768w, https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png?resize=1024,711 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png?resize=1536,1066 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png?resize=2048,1421 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png?resize=96,67 96w, https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png?resize=192,133 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p> <p><span style="font-weight: 400;">With the collection and data flow identification steps complete, developers are able to successfully locate granular data flows that contain religion across Meta’s complex systems, allowing them to move forward in the PAI workflow to apply necessary privacy controls to safeguard the data. This once-intimidating task has been completed efficiently. </span></p> <p><span style="font-weight: 400;">Our data lineage technology has provided developers with an unprecedented ability to quickly understand and protect religion and similar sensitive data flows. 
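</span></p> <p><span style="font-weight: 400;">The discovery loop described above can be sketched as follows; the graph, confidence labels, and reviewer decisions are hypothetical stand-ins for what the real tool derives from lineage signals and Policy Zones:</span></p>

```python
# Sketch of the iterative flow-discovery loop: expand from confirmed sources,
# auto-include high-confidence flows, and ask a reviewer about the rest.
# The graph, confidence set, and reviewer decisions are hypothetical.
EDGES = {"religion_src": ["tbl_a", "tbl_b"], "tbl_a": ["tbl_c"], "tbl_b": ["tbl_d"]}
HIGH_CONFIDENCE = {"tbl_a"}  # flows trusted without human review

def discover(sources, reviewer):
    """Return downstream assets confirmed to carry the sources' data."""
    confirmed, frontier = set(sources), set(sources)
    while frontier:
        candidates = {n for f in frontier for n in EDGES.get(f, [])} - confirmed
        # Include high-confidence nodes automatically; others need a decision.
        included = {n for n in candidates if n in HIGH_CONFIDENCE or reviewer(n)}
        confirmed |= included
        frontier = included  # excluded nodes prune their entire downstream
    return confirmed - set(sources)
```

<p><span style="font-weight: 400;">Excluding a node (the reviewer returning False) automatically drops everything reachable only through it, which is what makes the cascaded pruning cheap.</span></p> <p><span style="font-weight: 400;">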
It enables Meta to scalably and efficiently implement privacy controls via PAI to protect our users’ privacy and deliver products safely.</span></p> <h2 id="learnings"><span style="font-weight: 400;">Learnings and challenges</span></h2> <p><span style="font-weight: 400;">As we’ve worked to develop and implement lineage as a core PAI technology, we’ve gained valuable insights and overcome significant challenges, yielding some important lessons:</span></p> <ul> <li style="font-weight: 400;" aria-level="1"><b>Focus on lineage early and reap the rewards</b><span style="font-weight: 400;">: As we developed privacy technologies like Policy Zones, it became clear that gaining a deep understanding of data flows across various systems is essential for scaling the implementation of privacy controls. By investing in lineage, we not only accelerated the adoption of Policy Zones but also uncovered new opportunities for applying the technology. Lineage can also be extended to other use cases such as security and integrity.</span></li> <li style="font-weight: 400;" aria-level="1"><b>Build lineage consumption tools to gain engineering efficiency</b><span style="font-weight: 400;">: We initially focused on building a lineage solution but didn’t give sufficient attention to consumption tools for developers. As a result, owners had to use raw lineage signals to discover relevant data flows, which was overwhelmingly complex. We addressed this issue by developing the iterative tooling to guide engineers in discovering relevant data flows, significantly reducing engineering efforts by orders of magnitude.</span></li> <li style="font-weight: 400;" aria-level="1"><b>Integrate lineage with systems to scale the coverage</b><span style="font-weight: 400;">: Collecting lineage from diverse Meta systems was a significant challenge. Initially, we tried to ask every system to collect lineage signals to ingest into the centralized lineage service, but the progress was slow. 
We overcame this by developing reliable, computationally efficient, and widely applicable PAI libraries with built-in lineage collection logic in various programming languages (Hack, C++, Python, etc.). This enabled much smoother integration with a broad range of Meta’s systems.</span></li> <li style="font-weight: 400;" aria-level="1"><b>Measurement improves our outcomes</b><span style="font-weight: 400;">:</span> <span style="font-weight: 400;">By incorporating the measurement of coverage, we’ve been able to evolve our data lineage so that we stay ahead of the ever-changing landscape of data and code at Meta. By enhancing our signals and adapting to new technologies, we can maintain a strong focus on privacy outcomes and drive ongoing improvements in lineage coverage across our tech stacks.</span></li> </ul> <h2><span style="font-weight: 400;">The future of data lineage</span></h2> <p><span style="font-weight: 400;">Data lineage is a vital component of Meta’s PAI initiative, providing a comprehensive view of how data flows across different systems. While we’ve made significant progress in establishing a strong foundation, our journey is ongoing. 
We’re committed to:</span></p> <ul> <li style="font-weight: 400;" aria-level="1"><b>Expanding coverage</b><span style="font-weight: 400;">: continuously enhance the coverage of our data lineage capabilities to ensure a comprehensive understanding of data flows.</span></li> <li style="font-weight: 400;" aria-level="1"><b>Improving consumption experience</b><span style="font-weight: 400;">: streamline the consumption experience to make it easier for developers and stakeholders to access and utilize data lineage information.</span></li> <li style="font-weight: 400;" aria-level="1"><b>Exploring new frontiers</b><span style="font-weight: 400;">: investigate new applications and use cases for data lineage, driving innovation and collaboration across the industry.</span></li> </ul> <p><span style="font-weight: 400;">By advancing data lineage, we aim to foster a culture of privacy awareness and drive progress in the broader fields of study. Together, we can create a more transparent and accountable data ecosystem.</span></p> <h2><span style="font-weight: 400;">Acknowledgements</span></h2> <p><i><span style="font-weight: 400;">The authors would like to acknowledge the contributions of many current and former Meta employees who have played a crucial role in developing data lineage technologies over the years. In particular, we would like to extend special thanks to (in alphabetical order) Amit Jain, Aygun Aydin, Ben Zhang, Brian Romanko, Brian Spanton, Daniel Ramagem, David Molnar, Dzmitry Charnahalau, Gayathri Aiyer, George Stasa, Guoqiang Jerry Chen, Graham Bleaney, Haiyang Han, Howard Cheng, Ian Carmichael, Ibrahim Mohamed, Jerry Pan, Jiang Wu, Jonathan Bergeron, Joanna Jiang, Jun Fang, Kiran Badam, Komal Mangtani, Kyle Huang, Maharshi Jha, Manuel Fahndrich, Marc Celani, Lei Zhang, Mark Vismonte, Perry Stoll, Pritesh Shah, Qi Zhou, Rajesh Nishtala, Rituraj Kirti, Seth Silverman, Shelton Jiang, Sushaant Mujoo, Vlad Fedorov, Yi Huang, Xinbo Gao, and Zhaohui Zhang. 
We would also like to express our gratitude to all reviewers of this post, including (in alphabetical order) Aleksandar Ilic, Avtar Brar, Benjamin Renard, Bogdan Shubravyi, Brianna O’Steen, Chris Wiltz, Daniel Chamberlain, Hannes Roth, Imogen Barnes, Jason Hendrickson, Koosh Orandi, Rituraj Kirti, and Xenia Habekoss. We would like to especially thank Jonathan Bergeron for overseeing the effort and providing all of the guidance and valuable feedback, Supriya Anand for leading the editorial effort to shape the blog content, and Katherine Bates for pulling all required support together to make this blog post happen.</span></i></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/01/22/security/how-meta-discovers-data-flows-via-lineage-at-scale/">How Meta discovers data flows via lineage at scale</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></content:encoded> <post-id xmlns="com-wordpress:feed-additions:1">22168</post-id> </item> <item> <title>Strobelight: A profiling service built on open source technology</title> <link>https://engineering.fb.com/2025/01/21/production-engineering/strobelight-a-profiling-service-built-on-open-source-technology/</link> <dc:creator><![CDATA[]]></dc:creator> <pubDate>Tue, 21 Jan 2025 17:00:54 +0000</pubDate> <category><![CDATA[Open Source]]></category> <category><![CDATA[Production Engineering]]></category> <guid isPermaLink="false">https://engineering.fb.com/?p=22157</guid> <description><![CDATA[<p>We’re sharing details about Strobelight, Meta’s profiling orchestrator. Strobelight combines several technologies, many open source, into a single service that helps engineers at Meta improve efficiency and utilization across our fleet. Using Strobelight, we’ve seen significant efficiency wins, including one that has resulted in an estimated 15,000 servers’ worth of annual capacity savings. 
Strobelight, Meta’s [...]</p> <p><a class="btn btn-secondary understrap-read-more-link" href="https://engineering.fb.com/2025/01/21/production-engineering/strobelight-a-profiling-service-built-on-open-source-technology/">Read More...</a></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/01/21/production-engineering/strobelight-a-profiling-service-built-on-open-source-technology/">Strobelight: A profiling service built on open source technology</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></description> <content:encoded><![CDATA[<ul> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">We’re sharing details about Strobelight, Meta’s profiling orchestrator.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Strobelight combines several technologies, many open source, into a single service that helps engineers at Meta improve efficiency and utilization across our fleet.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Using Strobelight, we’ve seen significant efficiency wins, including one that has resulted in an estimated 15,000 servers’ worth of annual capacity savings.</span></li> </ul> <p><span style="font-weight: 400;">Strobelight, Meta’s profiling orchestrator, is not really one technology. It’s several (many open source) combined to make something that unlocks truly amazing efficiency wins. Strobelight is also not a single profiler but an orchestrator of many different profilers (even ad-hoc ones) that runs on all production hosts at Meta, collecting detailed information about CPU usage, memory allocations, and other performance metrics from running processes. 
Engineers and developers can use this information to identify performance and resource bottlenecks, optimize their code, and improve utilization.</span></p> <p><span style="font-weight: 400;">When you combine talented engineers with rich performance data you can get efficiency wins by both creating tooling to identify issues before they reach production and finding opportunities in already running code. Let’s say an engineer makes a code change that introduces an unintended copy of some large object on a service’s critical path. Meta’s existing tools can identify the issue and query Strobelight data to estimate the impact on compute cost. Then Meta’s code review tool can notify the engineer that they’re about to waste, say, 20,000 servers.</span></p> <p><span style="font-weight: 400;">Of course, static analysis tools can pick up on these sorts of issues, but they are unaware of global compute cost and oftentimes these inefficiencies aren’t a problem until they’re gradually serving millions of requests per minute. The frog can boil slowly.</span></p> <h2><span style="font-weight: 400;">Why do we use profilers?</span></h2> <p><span style="font-weight: 400;">Profilers operate by sampling data to perform statistical analysis. For example, a profiler takes a sample every N events (or milliseconds in the case of time profilers) to understand where that event occurs or what is happening at the moment of that event. With a CPU-cycles event, for example, the profile will be CPU time spent in functions or function call stacks executing on the CPU. This can give an engineer a high-level understanding of the code execution of a service or binary.</span></p> <h2><span style="font-weight: 400;">Choosing your own adventure with Strobelight</span></h2> <p><span style="font-weight: 400;">There are other daemons at Meta that collect observability metrics, but Strobelight’s wheelhouse is software profiling. It connects resource usage to source code (what developers understand best). 
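</span></p>
<p><span style="font-weight: 400;">To make the sampling approach described above concrete, here is a minimal sketch (not Strobelight code; every name in it is invented) of how periodic "what function is on-CPU right now?" samples become a statistical profile:</span></p>

```python
from collections import Counter

def profile(timeline, period):
    """Sample a synthetic execution timeline every `period` ticks.

    `timeline` maps each tick to the function executing at that moment;
    a real profiler gets this from a perf-event interrupt instead.
    """
    samples = Counter(timeline[t] for t in range(0, len(timeline), period))
    total = sum(samples.values())
    # Each sample stands in for `period` ticks, so relative sample counts
    # estimate the share of CPU time spent in each function.
    return {fn: count / total for fn, count in samples.items()}

# 60 ticks of simulated execution: parse() is the hot function.
timeline = ["parse"] * 40 + ["render"] * 15 + ["flush"] * 5
print(profile(timeline, period=4))  # parse gets roughly 2/3 of the samples
```

<p><span style="font-weight: 400;">Sampling every Nth event rather than tracing every call is what keeps the overhead low enough to run on production hosts.</span></p> <p><span style="font-weight: 400;">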
Strobelight’s profilers are often, but not exclusively, built using</span> <a href="https://docs.ebpf.io/" target="_blank" rel="noopener"><span style="font-weight: 400;">eBPF</span></a><span style="font-weight: 400;">, which is a Linux kernel technology. eBPF allows the safe injection of custom code into the kernel, which enables very low overhead collection of different types of data and unlocks so many possibilities in the observability space that it’s hard to imagine how Strobelight would work without it.</span></p> <p><span style="font-weight: 400;">As of the time of writing this, Strobelight has 42 different profilers, including:</span></p> <ul> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Memory profilers powered by</span> <a href="https://github.com/jemalloc/jemalloc" target="_blank" rel="noopener"><span style="font-weight: 400;">jemalloc.</span></a></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Function call count profilers.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Event-based profilers for both native and non-native languages (e.g., Python, Java, and Erlang).</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">AI/GPU profilers.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Profilers that track off-CPU time.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Profilers that track service request latency.</span></li> </ul> <p><span style="font-weight: 400;">Engineers can utilize any one of these to collect data from servers on demand via Strobelight’s command line tool or web UI.</span></p> <figure id="attachment_22158" aria-describedby="caption-attachment-22158" style="width: 1024px" class="wp-caption alignnone"><img loading="lazy" decoding="async" class="size-large wp-image-22158" 
src="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?w=1024" alt="" width="1024" height="579" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?resize=916,518 916w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?resize=768,435 768w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?resize=1024,579 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?resize=1536,869 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?resize=192,109 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22158" class="wp-caption-text">The Strobelight web UI.</figcaption></figure> <p><span style="font-weight: 400;">Users also have the ability to set up continuous or “triggered” profiling for any of these profilers by updating a configuration file in Meta’s</span> <a href="https://research.facebook.com/publications/holistic-configuration-management-at-facebook/" target="_blank" rel="noopener"><span style="font-weight: 400;">Configerator</span></a><span style="font-weight: 400;">, allowing them to target their entire service or, for example, only hosts that run in certain regions. 
Users can specify how often these profilers should run, the run duration, the symbolization strategy, the process they want to target, and a lot more.</span></p> <p><span style="font-weight: 400;">Here is an example of a simple configuration for one of these profilers:</span></p> <pre class="line-numbers"><code class="language-none">add_continuous_override_for_offcpu_data(
  "my_awesome_team", // the team that owns this service
  Type.SERVICE_ID,
  "my_awesome_service",
  30_000, // desired samples per hour
)
</code></pre> <p><span style="font-weight: 400;">Why does Strobelight have so many profilers? Because there are so many different things happening in these systems powered by so many different technologies.</span></p> <p><span style="font-weight: 400;">This is also why Strobelight provides ad-hoc profilers. Since the kind of data that can be gathered from a binary is so varied, engineers often need something that Strobelight doesn’t provide out of the box. Adding a new profiler from scratch to Strobelight involves several code changes and could take several weeks to get reviewed and rolled out.</span></p> <p><span style="font-weight: 400;">However, engineers can write a single</span> <a href="https://github.com/bpftrace/bpftrace" target="_blank" rel="noopener"><i><span style="font-weight: 400;">bpftrace</span></i></a><span style="font-weight: 400;"> script (a simple language/tool that allows you to easily write eBPF programs) and tell Strobelight to run it like it would any other profiler. An engineer who really cares about the latency of a particular C++ function, for example, could write up a little bpftrace script, commit it, and have Strobelight run it on any number of hosts throughout Meta’s fleet – all within a matter of hours, if needed.</span></p> <p><span style="font-weight: 400;">If all of this sounds powerfully dangerous, that’s because it is.
However, Strobelight has several safeguards in place to prevent users from causing performance degradation for the targeted workloads and retention issues for the databases Strobelight writes to. Strobelight also has enough awareness to ensure that different profilers don’t conflict with each other. For example, if a profiler is tracking CPU cycles, Strobelight ensures another profiler can’t use another PMU counter at the same time (as there are other services that also use them).</span></p> <p><span style="font-weight: 400;">Strobelight also has concurrency rules and a profiler queuing system. Of course, service owners still have the flexibility to really hammer their machines if they want to extract a lot of data to debug.</span></p> <h2><span style="font-weight: 400;">Default data for everyone</span></h2> <p><span style="font-weight: 400;">Since its inception, one of Strobelight’s core principles has been to provide automatic, regularly collected profiling data for all of Meta’s services. It’s like a flight recorder – something that doesn’t have to be thought about until it’s needed. What’s worse than waking up to an alert that a service is unhealthy and there is no data as to why?</span></p> <p><span style="font-weight: 400;">For that reason, Strobelight has a handful of curated profilers that are configured to run automatically on every Meta host. They’re not running all the time; that would be “bad” and not really “profiling.” Instead, they have custom run intervals and sampling rates specific to the workloads running on the host. This provides just the right amount of data without impacting the profiled services or overburdening the systems that store Strobelight data.</span></p> <p><span style="font-weight: 400;">Here is an example:</span></p> <p><span style="font-weight: 400;">A service named Soft Server runs on 1,000 hosts, and let’s say we want profiler A to gather 40,000 CPU-cycles samples per hour for this service (remember the config above).
Strobelight, knowing how many hosts Soft Server runs on, but not how CPU intensive it is, will start with a conservative run probability, which is a sampling mechanism to prevent bias (e.g., profiling these hosts at noon every day would hide traffic patterns).</span></p> <p><span style="font-weight: 400;">The next day Strobelight will look at how many samples it was able to gather for this service and then automatically tune the run probability (with some very simple math) to try to hit 40,000 samples per hour. We call this dynamic sampling and Strobelight does this readjustment every day for every service at Meta.</span></p> <p><span style="font-weight: 400;">And if there is more than one service running on the host (excluding daemons like systemd or Strobelight) then Strobelight will default to using the configuration that will yield more samples for both.</span></p> <p><span style="font-weight: 400;">Hang on, hang on. If the run probability or sampling rate is different depending on the host for a service, then how can the data be aggregated or compared across the hosts? And how can profiling data for multiple services be compared?</span></p> <p><span style="font-weight: 400;">Since Strobelight is aware of all these different knobs for profile tuning, it adjusts the “weight” of a profile sample when it’s logged. A sample’s weight is used to normalize the data and prevent bias when analyzing or viewing this data in aggregate. So even if Strobelight is profiling Soft Server less often on one host than on another, the samples can be accurately compared and grouped. 
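</span></p>
<p><span style="font-weight: 400;">As a rough illustration (an invented helper, not Strobelight’s actual tuning code), the daily readjustment and the sample weighting could look like this:</span></p>

```python
def retune(run_probability, samples_collected, target_per_hour):
    """Daily dynamic-sampling adjustment: scale the run probability
    toward the per-hour sample target, clamped to a valid probability."""
    if samples_collected == 0:
        return min(1.0, run_probability * 2)  # creep up when no samples arrived
    return min(1.0, run_probability * target_per_hour / samples_collected)

def sample_weight(run_probability):
    """Rarely profiled hosts produce heavier samples, so aggregates stay
    unbiased when hosts with different probabilities are combined."""
    return 1.0 / run_probability

# Day 1: a conservative 10% run probability only yielded 4,000 samples/hour,
# so the next day's probability is scaled up toward the 40,000-sample target.
print(retune(0.10, samples_collected=4_000, target_per_hour=40_000))  # 1.0
print(sample_weight(0.10))  # each day-1 sample counts as 10 ordinary samples
```

<p><span style="font-weight: 400;">The point is that the weight travels with each logged sample, so later aggregation stays honest no matter how the knobs were set.</span></p> <p><span style="font-weight: 400;">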
This also works for comparing two different services since Strobelight is used both by service owners looking at their specific service as well as efficiency experts who look for “horizontal” wins across the fleet in shared libraries.</span></p> <h2><span style="font-weight: 400;">How Strobelight saves capacity</span></h2> <p><span style="font-weight: 400;">There are two default continuous profilers that should be called out because of how much they end up saving in capacity.</span></p> <h3><span style="font-weight: 400;">The last branch record (LBR) profiler </span></h3> <p><span style="font-weight: 400;">The LBR profiler, true to its name, is used to sample</span> <a href="https://lwn.net/Articles/680985/" target="_blank" rel="noopener"><span style="font-weight: 400;">last branch records</span></a><span style="font-weight: 400;"> (a hardware feature that started on Intel). The data from this profiler doesn’t get visualized but instead is fed into Meta’s feedback directed optimization (FDO) pipeline. This data is used to create FDO profiles that are consumed at compile time (</span><a href="https://ieeexplore.ieee.org/document/10444807" target="_blank" rel="noopener"><span style="font-weight: 400;">CSSPGO</span></a><span style="font-weight: 400;">) and post-compile time (</span><a href="https://research.facebook.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/"><span style="font-weight: 400;">BOLT</span></a><span style="font-weight: 400;">) to speed up binaries through the added knowledge of runtime behavior. Meta’s top 200 largest services all have FDO profiles from the LBR data gathered continuously across the fleet. 
Some of these services see up to 20% reduction in CPU cycles, which equates to a 10-20% reduction in the number of servers needed to run these services at Meta.</span></p> <h3><span style="font-weight: 400;">The event profiler</span></h3> <p><span style="font-weight: 400;">The second profiler is Strobelight’s event profiler. This is Strobelight’s version of the Linux perf tool. Its primary job is to collect user and kernel stack traces from multiple performance (perf) events e.g., CPU-cycles, L3 cache misses, instructions, etc. Not only is this data looked at by individual engineers to understand what the hottest functions and call paths are, but this data is also fed into monitoring and testing tools to identify regressions; ideally </span><i><span style="font-weight: 400;">before</span></i><span style="font-weight: 400;"> they hit production.</span></p> <h2><span style="font-weight: 400;">Did someone say Meta…data?</span></h2> <p><span style="font-weight: 400;">Looking at function call stacks with</span> <a href="https://www.brendangregg.com/flamegraphs.html" target="_blank" rel="noopener"><span style="font-weight: 400;">flame graphs</span></a><span style="font-weight: 400;"> is great, nothing against it. But a service owner looking at call stacks from their service, which imports many libraries and utilizes Meta’s software frameworks, will see a lot of “foreign” functions. Also, what about finding just the stacks for p99 latency requests? Or how about all the places where a service is making an unintended string copy?</span></p> <h3><span style="font-weight: 400;">Stack schemas</span></h3> <p><span style="font-weight: 400;">Strobelight has multiple mechanisms for enhancing the data it produces according to the needs of its users. 
One such mechanism is called Stack Schemas (inspired by</span> <a href="https://learn.microsoft.com/en-us/windows-hardware/test/wpt/stack-tags" target="_blank" rel="noopener"><span style="font-weight: 400;">Microsoft’s stack tags</span></a><span style="font-weight: 400;">), which is a small DSL that operates on call stacks and can be used to add tags (strings) to entire call stacks or individual frames/functions. These tags can then be utilized in our visualization tool. Stack Schemas can also remove functions users don’t care about with regex matching. Any number of schemas can be applied on a per-service or even per-profile basis to customize the data.</span></p> <p><span style="font-weight: 400;">There are even folks who create dashboards from this metadata to help other engineers identify expensive copying, use of inefficient or inappropriate C++ containers, overuse of smart pointers, and much more. Static analysis tools that can do this have been around for a long time, but they can’t pinpoint the really painful or computationally expensive instances of these issues across a large fleet of machines.</span></p> <h3><span style="font-weight: 400;">Strobemeta</span></h3> <p><span style="font-weight: 400;">Strobemeta is another mechanism, which utilizes thread local storage, to attach bits of dynamic metadata at runtime to call stacks that we gather in the event profiler (and others). This is one of the biggest advantages of building profilers using eBPF: complex and customized actions taken at sample time. Collected Strobemeta is used to attribute call stacks to specific service endpoints, or request latency metrics, or request identifiers. 
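</span></p>
<p><span style="font-weight: 400;">As a toy illustration in the spirit of Stack Schemas (a hypothetical rule format, not the real DSL), tagging and pruning a call stack with regex rules might look like this:</span></p>

```python
import re

def apply_schema(stack, tag_rules, strip_patterns):
    """Tag and prune one call stack (outermost frame first).

    tag_rules:      {tag: regex}; the tag applies if any frame matches.
    strip_patterns: frames matching any of these regexes are dropped.
    """
    tags = {tag for tag, pattern in tag_rules.items()
            if any(re.search(pattern, frame) for frame in stack)}
    pruned = [frame for frame in stack
              if not any(re.search(p, frame) for p in strip_patterns)]
    return pruned, tags

stack = ["main", "folly::detail::run", "MyService::handleRequest",
         "std::string::string"]
pruned, tags = apply_schema(
    stack,
    tag_rules={"string-copy": r"std::string::string"},
    strip_patterns=[r"^folly::detail::"],  # hide framework internals
)
print(pruned)  # ['main', 'MyService::handleRequest', 'std::string::string']
print(tags)    # {'string-copy'}
```

<p><span style="font-weight: 400;">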
Again, this allows engineers and tools to do more complex filtering to focus the vast amounts of data that Strobelight profilers produce.</span></p> <h2><span style="font-weight: 400;">Symbolization</span></h2> <p><span style="font-weight: 400;">Now is a good time to talk about symbolization: taking the virtual address of an instruction, converting it into an actual symbol (function) name, and, depending on the symbolization strategy, also getting the function’s source file, line number, and type information.</span></p> <p><span style="font-weight: 400;">Most of the time getting the whole enchilada means using a binary’s DWARF debug info. But this can be many megabytes (or even gigabytes) in size because DWARF debug data contains much more than the symbol information.</span></p> <p><span style="font-weight: 400;">This data needs to be downloaded then parsed. But attempting this while profiling, or even afterwards on the same host where the profile is gathered, is far too computationally expensive. Even with optimal caching strategies it can cause memory issues for the host’s workloads.</span></p> <p><span style="font-weight: 400;">Strobelight gets around this problem via a symbolization service that utilizes several open source technologies including DWARF, ELF,</span> <a href="https://github.com/YtnbFirewings/gsym" target="_blank" rel="noopener"><span style="font-weight: 400;">gsym</span></a><span style="font-weight: 400;">, and</span> <a href="https://github.com/libbpf/blazesym" target="_blank" rel="noopener"><span style="font-weight: 400;">blazesym</span></a><span style="font-weight: 400;">. 
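</span></p>
<p><span style="font-weight: 400;">At its core, the lookup such a service performs can be sketched as a binary search over pre-parsed address ranges (a simplification; all addresses and names below are invented):</span></p>

```python
import bisect

# Pre-parsed symbol table: (start, end, function, file, line), sorted by
# start address. The heavy DWARF parsing happened offline, once per binary.
SYMBOLS = [
    (0x1000, 0x1200, "main", "server.cpp", 12),
    (0x1200, 0x1800, "handle_request", "server.cpp", 40),
    (0x2000, 0x2400, "parse_header", "http.cpp", 7),
]
STARTS = [sym[0] for sym in SYMBOLS]

def symbolize(address):
    """Map an instruction address to (function, file, line), or None."""
    i = bisect.bisect_right(STARTS, address) - 1
    if i >= 0:
        start, end, function, path, line = SYMBOLS[i]
        if start <= address < end:
            return (function, path, line)
    return None  # address falls outside every known function

print(symbolize(0x1234))  # ('handle_request', 'server.cpp', 40)
print(symbolize(0x1900))  # None: the address sits in a gap
```

<p><span style="font-weight: 400;">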
At the end of a profile Strobelight sends stacks of binary addresses to a service that sends back symbolized stacks with file, line, type info, and even inline information.</span></p> <p><span style="font-weight: 400;">It can do this because it has already done all the heavy lifting of downloading and parsing the DWARF data for each of Meta’s binaries (specifically, production binaries) and stores what it needs in a database. Then it can serve multiple symbolization requests coming from different instances of Strobelight running throughout the fleet.</span></p> <p><span style="font-weight: 400;">To add to that enchilada (hungry yet?), Strobelight also delays symbolization until after profiling and stores raw data to disk to prevent memory thrash on the host. This has the added benefit of not letting the consumer impact the producer – meaning if Strobelight’s user space code can’t handle the speed at which the eBPF kernel code is producing samples (because it’s spending time symbolizing or doing some other processing) it results in dropped samples.</span></p> <p><span style="font-weight: 400;">All of this is made possible with the inclusion of</span> <a href="https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointers.html" target="_blank" rel="noopener"><span style="font-weight: 400;">frame pointers</span></a><span style="font-weight: 400;"> in all of Meta’s user space binaries, otherwise we couldn’t walk the stack to get all these addresses (or we’d have to do some other complicated/expensive thing which wouldn’t be as efficient). 
</span></p> <figure id="attachment_22159" aria-describedby="caption-attachment-22159" style="width: 1024px" class="wp-caption alignnone"><img loading="lazy" decoding="async" class="size-large wp-image-22159" src="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?w=1024" alt="" width="1024" height="633" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png 1607w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?resize=916,566 916w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?resize=768,475 768w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?resize=1024,633 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?resize=1536,949 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?resize=96,59 96w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?resize=192,119 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22159" class="wp-caption-text">A simplified Strobelight service graph.</figcaption></figure> <h2><span style="font-weight: 400;">Show me the data (and make it nice)!</span></h2> <p><span style="font-weight: 400;">The primary tool Strobelight customers use is</span> <a href="https://research.facebook.com/publications/scuba-diving-into-data-at-facebook/" target="_blank" rel="noopener"><span style="font-weight: 400;">Scuba</span></a><span style="font-weight: 400;"> – a query language (like SQL), database, and UI. The Scuba UI has a large suite of visualizations for the queries people construct (e.g., flame graphs, pie charts, time series graphs, distributions, etc).</span></p> <p><span style="font-weight: 400;">Strobelight, for the most part, produces Scuba data and, generally, it’s a happy marriage. 
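</span></p>
<p><span style="font-weight: 400;">For intuition, flame graph views like the one below are built by folding weighted stack samples into per-path totals. Here is a minimal sketch, using the folded "a;b;c" text format popularized by open source flame-graph tooling rather than Scuba’s actual storage:</span></p>

```python
from collections import Counter

def fold(samples):
    """Collapse (call_stack, weight) samples into per-path weight totals.

    Each stack lists frames outermost first; the folded 'main;handle;parse'
    keys are what flame-graph tooling consumes.
    """
    folded = Counter()
    for stack, weight in samples:
        folded[";".join(stack)] += weight
    return dict(folded)

samples = [
    (["main", "handle", "parse"], 10.0),   # weight = 1 / run probability
    (["main", "handle", "parse"], 10.0),
    (["main", "handle", "render"], 2.0),
]
print(fold(samples))
# {'main;handle;parse': 20.0, 'main;handle;render': 2.0}
```

<p><span style="font-weight: 400;">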
If someone runs an on-demand profile, it’s just a few seconds before they can visualize this data in the Scuba UI (and send people links to it). Even tools like</span> <a href="https://perfetto.dev/" target="_blank" rel="noopener"><span style="font-weight: 400;">Perfetto</span></a><span style="font-weight: 400;"> expose the ability to query the underlying data because they know it’s impossible to try to come up with enough dropdowns and buttons that can express everything you want to do in a query language – though the Scuba UI comes close.</span></p> <figure id="attachment_22160" aria-describedby="caption-attachment-22160" style="width: 1024px" class="wp-caption alignnone"><img loading="lazy" decoding="async" class="wp-image-22160 size-large" src="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?w=1024" alt="" width="1024" height="554" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?resize=916,495 916w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?resize=768,415 768w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?resize=1024,554 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?resize=1536,831 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?resize=96,52 96w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?resize=192,104 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22160" class="wp-caption-text">An example flamegraph/icicle of function call stacks of the CPU cycles event for the mononoke service for one hour.</figcaption></figure> <p><span style="font-weight: 400;">The other tool is a trace visualization tool used at Meta named</span><a 
href="https://www.facebook.com/atscaleevents/videos/996197807391867/"> <span style="font-weight: 400;">Tracery</span></a><span style="font-weight: 400;">. We use this tool when we want to combine correlated but different streams of profile data on one screen. This data is also a natural fit for viewing on a timeline. Tracery allows users to make custom visualizations and curated workspaces to share with other engineers to pinpoint the important parts of that data. It’s also powered by a client-side columnar database (written in JavaScript!), which makes it very fast when it comes to zooming and filtering. Strobelight’s Crochet profiler combines service request spans, CPU-cycles stacks, and off-CPU data to give users a detailed snapshot of their service.</span></p> <figure id="attachment_22161" aria-describedby="caption-attachment-22161" style="width: 975px" class="wp-caption alignnone"><img loading="lazy" decoding="async" class="size-large wp-image-22161" src="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-4.png?w=975" alt="" width="975" height="552" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-4.png 975w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-4.png?resize=916,519 916w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-4.png?resize=768,435 768w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-4.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-4.png?resize=192,109 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22161" class="wp-caption-text">An example trace in Tracery.</figcaption></figure> <h2><span style="font-weight: 400;">The Biggest Ampersand</span></h2> <p><span style="font-weight: 400;">Strobelight has helped engineers at Meta realize countless efficiency and latency wins, ranging from increases in the 
number of requests served, to large reductions in heap allocations, to regressions caught in pre-prod analysis tools.</span></p> <p><span style="font-weight: 400;">But one of the most significant wins is one we call, “The Biggest Ampersand.”</span></p> <p><span style="font-weight: 400;">A seasoned performance engineer was looking through Strobelight data and discovered that by filtering on a particular std::vector function call (using the symbolized file and line number) he could identify computationally expensive array copies that happen unintentionally with the ‘auto’ keyword in C++.</span></p> <p><span style="font-weight: 400;">The engineer turned a few knobs, adjusted his Scuba query, and happened to notice one of these copies in a particularly hot call path in one of Meta’s largest ads services. He then cracked open his code editor to investigate whether this particular vector copy was intentional… it wasn’t.</span></p> <p><span style="font-weight: 400;">It was a simple mistake that any engineer working in C++ has made a hundred times.</span></p> <p><span style="font-weight: 400;">So, the engineer typed an “&” after the auto keyword to indicate we want a reference instead of a copy. It was a one-character commit, which, after it was shipped to production, equated to an estimated 15,000 servers</span> <span style="font-weight: 400;">in capacity savings per year!</span></p> <p><span style="font-weight: 400;">Go back and re-read that sentence. One ampersand! </span></p> <h2><span style="font-weight: 400;">An open ending</span></h2> <p><span style="font-weight: 400;">This only scratches the surface of everything Strobelight can do. 
The Strobelight team works closely with Meta’s performance engineers on new features that can better analyze code to help pinpoint where things are slow, computationally expensive, and why.</span></p> <p><span style="font-weight: 400;">We’re currently working on</span> <a href="https://github.com/facebookincubator/strobelight" target="_blank" rel="noopener"><span style="font-weight: 400;">open-sourcing</span></a><span style="font-weight: 400;"> Strobelight’s profilers and libraries, which will no doubt make them more robust and useful. Most of the technologies Strobelight uses are already public or open source, so please use and contribute to them!</span></p> <h2><span style="font-weight: 400;">Acknowledgements</span></h2> <p><i><span style="font-weight: 400;">Special thanks to Wenlei He, Andrii Nakryiko, Giuseppe Ottaviano, Mark Santaniello, Nathan Slingerland, Anita Zhang, and the Profilers Team at Meta. </span></i></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/01/21/production-engineering/strobelight-a-profiling-service-built-on-open-source-technology/">Strobelight: A profiling service built on open source technology</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></content:encoded> <post-id xmlns="com-wordpress:feed-additions:1">22157</post-id> </item> </channel> </rss>