<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" > <channel> <title>Engineering at Meta</title> <atom:link href="https://engineering.fb.com/feed/" rel="self" type="application/rss+xml" /> <link>https://engineering.fb.com/</link> <description>Engineering at Meta Blog</description> <lastBuildDate>Tue, 18 Feb 2025 20:03:42 +0000</lastBuildDate> <language>en-US</language> <sy:updatePeriod> hourly </sy:updatePeriod> <sy:updateFrequency> 1 </sy:updateFrequency> <generator>https://wordpress.org/?v=6.7.2</generator> <site xmlns="com-wordpress:feed-additions:1">147945108</site> <item> <title>Protecting user data through source code analysis at scale</title> <link>https://engineering.fb.com/2025/02/18/security/protecting-user-data-through-source-code-analysis/</link> <dc:creator><![CDATA[]]></dc:creator> <pubDate>Tue, 18 Feb 2025 20:30:49 +0000</pubDate> <category><![CDATA[Security]]></category> <guid isPermaLink="false">https://engineering.fb.com/?p=22267</guid> <description><![CDATA[<p>Meta’s Anti Scraping team focuses on preventing unauthorized scraping as part of our ongoing work to combat data misuse. In order to protect Meta’s changing codebase from scraping attacks, we have introduced static analysis tools into our workflow. 
These tools allow us to detect potential scraping vectors at scale across our Facebook, Instagram, and even [...]</p> <p><a class="btn btn-secondary understrap-read-more-link" href="https://engineering.fb.com/2025/02/18/security/protecting-user-data-through-source-code-analysis/">Read More...</a></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/02/18/security/protecting-user-data-through-source-code-analysis/">Protecting user data through source code analysis at scale</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></description> <content:encoded><![CDATA[<p><span style="font-weight: 400;">Meta’s Anti Scraping team focuses on preventing </span><a href="https://about.fb.com/news/2021/04/how-we-combat-scraping/" target="_blank" rel="noopener"><span style="font-weight: 400;">unauthorized scraping</span></a><span style="font-weight: 400;"> as part of our ongoing work to combat data misuse. In order to protect Meta’s </span><a href="https://engineering.fb.com/2022/11/16/culture/meta-code-review-time-improving/" target="_blank" rel="noopener"><span style="font-weight: 400;">changing codebase</span></a><span style="font-weight: 400;"> from scraping attacks, we have introduced static analysis tools into our workflow. These tools allow us to </span><b>detect potential scraping vectors</b><span style="font-weight: 400;"> at scale across our Facebook, Instagram, and even parts of our Reality Labs codebases. </span></p> <h2><span style="font-weight: 400;">What is scraping? </span></h2> <p><a href="https://about.fb.com/news/2021/04/how-we-combat-scraping/" target="_blank" rel="noopener"><span style="font-weight: 400;">Scraping</span></a><span style="font-weight: 400;"> is the automated collection of data from a website or app and can be either authorized or unauthorized. Unauthorized scrapers commonly hide themselves by mimicking the ways users would normally use a product. 
As a result, unauthorized scraping can be difficult to detect. At Meta, we take a number of steps to </span><a href="https://about.fb.com/news/2021/04/how-we-combat-scraping/" target="_blank" rel="noopener"><span style="font-weight: 400;">combat scraping</span></a><span style="font-weight: 400;"> and have a number of methods to distinguish unauthorized automated activity from legitimate usage. </span></p> <h2><span style="font-weight: 400;">Proactive detection</span></h2> <p><span style="font-weight: 400;">Meta’s Anti-Scraping team learns about scrapers (entities attempting to scrape our systems) through many different sources. For example, we </span><a href="https://about.fb.com/news/2021/04/how-we-combat-scraping/" target="_blank" rel="noopener"><span style="font-weight: 400;">investigate suspected unauthorized scraping activity</span></a><span style="font-weight: 400;"> and take actions against such entities, including sending cease-and-desist letters and disabling accounts.</span></p> <p><span style="font-weight: 400;">Part of our strategy is to further develop proactive measures to mitigate the risk of scraping over and above our reactive approaches. One way we do this is by </span><b>turning our attack vector criteria into static analysis rules</b><span style="font-weight: 400;"> that run automatically on our entire code base. Those static analysis tools, which include</span> <a href="https://engineering.fb.com/2019/08/15/security/zoncolan/" target="_blank" rel="noopener"><span style="font-weight: 400;">Zoncolan</span></a><span style="font-weight: 400;"> for Hack and</span> <a href="https://engineering.fb.com/2020/08/07/security/pysa/" target="_blank" rel="noopener"><span style="font-weight: 400;">Pysa</span></a><span style="font-weight: 400;"> for Python, run automatically for their respective codebases and are built in-house, allowing us to customize them for Anti-Scraping purposes. 
This approach can identify potential issues early and ensure product development teams have an opportunity to remediate prior to launch.</span></p> <p><span style="font-weight: 400;">Static analysis tools enable us to apply learnings across events to systematically prevent similar issues from existing in our codebase. They also help us create best practices when developing code to combat unauthorized scraping.</span></p> <h2><span style="font-weight: 400;">Developing static analysis rules</span></h2> <p><span style="font-weight: 400;">Our static analysis tools (like</span> <a href="https://engineering.fb.com/2019/08/15/security/zoncolan/" target="_blank" rel="noopener"><span style="font-weight: 400;">Zoncolan</span></a><span style="font-weight: 400;"> and</span> <a href="https://engineering.fb.com/2020/08/07/security/pysa/" target="_blank" rel="noopener"><span style="font-weight: 400;">Pysa</span></a><span style="font-weight: 400;">) focus on tracking data flow through a program.</span></p> <p><span style="font-weight: 400;">Engineers define classes of issues using the following:</span></p> <ul> <li style="font-weight: 400;" aria-level="1"><i><span style="font-weight: 400;">Sources</span></i><span style="font-weight: 400;"> are where the data originates. For potential scraping issues, these are mostly user-controlled parameters, as these are the avenues in which scrapers control the data they could receive.</span></li> <li style="font-weight: 400;" aria-level="1"><i><span style="font-weight: 400;">Sinks</span></i><span style="font-weight: 400;"> are where the data flows to. 
For scraping, the sink is usually the point where data flows back to the user.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">An </span><i><span style="font-weight: 400;">Issue</span></i><span style="font-weight: 400;"> is found when our tools detect a possible data flow from a source to a sink.</span></li> </ul> <p><span style="font-weight: 400;">For example, assume the “source” to be the user-controlled “count” parameter that determines the number of results loaded, and the “sink” to be the data that is returned to the user. Here, the user-controlled “count” parameter is an entry point for a scraper, who can manipulate its value to extract more data than the application intended. When our tools suspect that there is a code flow between such sources and sinks, they alert the team for further triage.</span></p> <h2><span style="font-weight: 400;">An example of static analysis</span></h2> <p><span style="font-weight: 400;">Building on the example above, see the mock code excerpt below, which loads the number of followers for a page:</span></p> <pre class="line-numbers"><code class="language-none"># views/followers.py
async def get_followers(request: HttpRequest) -> HttpResponse:
    viewer = request.GET['viewer_id']
    target = request.GET['target_id']
    count = request.GET['count']
    if can_see(viewer, target):
        followers = load_followers(target, count)
        return followers

# controller/followers.py
async def load_followers(target_id: int, count: int):
    ...</code></pre> <p><span style="font-weight: 400;">In the example above, the mock endpoint backed by </span><span style="font-weight: 400; font-family: 'courier new', courier;">get_followers</span><span style="font-weight: 400;"> is a potential scraping attack vector, since the “target” and “count” parameters control whose information is loaded and how many followers are returned. 
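</span></p> <p><span style="font-weight: 400;">The source-to-sink tracking described above can be sketched as a toy checker. The code below is purely illustrative — the names and the simple assignment-propagation model are invented for this sketch; Zoncolan and Pysa are far more sophisticated:</span></p> <pre class="line-numbers"><code class="language-none"># Toy source-to-sink taint tracking (illustrative only, not Zoncolan/Pysa).
SOURCES = {"request.GET"}    # where user-controlled data originates
SINKS = {"return_to_user"}   # where data flows back to the user

def find_issues(assignments, sink_calls):
    """Report sink calls whose argument derives from a source.

    assignments: list of (lhs, rhs) pairs, e.g. ("count", "request.GET")
    sink_calls:  list of (sink_name, arg) pairs
    """
    tainted = set()
    for lhs, rhs in assignments:
        if rhs in SOURCES or rhs in tainted:
            tainted.add(lhs)  # taint propagates through assignments
    return [(sink, arg) for sink, arg in sink_calls
            if sink in SINKS and arg in tainted]

# Modeling the mock endpoint: "count" comes from request.GET, and the
# follower data derived from it flows back to the user.
issues = find_issues(
    assignments=[("count", "request.GET"), ("followers", "count")],
    sink_calls=[("return_to_user", "followers")],
)</code></pre> <p><span style="font-weight: 400;">Here the checker would report one issue, because data derived from a source reaches a sink; a value that never derives from a source would produce no report.</span></p> <p><span style="font-weight: 400;">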
Under usual circumstances, the endpoint would be called with suitable parameters that match what the user is browsing on screen. However, scrapers can abuse such an endpoint by specifying arbitrary users and large counts, which can result in entire follower lists being returned in a single request. By doing so, scrapers can try to evade rate-limiting systems, which limit how many requests a user can send to our systems in a defined timeframe. These systems are set in place to stop scraping attempts at a high level.</span></p> <p><span style="font-weight: 400;">Since our static analysis systems run automatically on our codebase, the Anti-Scraping team can identify such scraping vectors proactively and remediate them before the code is introduced to our production systems. For example, the recommended fix for the code above is to cap the maximum number of results that can be returned at a time:</span></p> <pre class="line-numbers"><code class="language-none"># views/followers.py
async def get_followers(request: HttpRequest) -> HttpResponse:
    viewer = request.GET['viewer_id']
    target = request.GET['target_id']
    count = min(request.GET['count'], MAX_FOLLOWERS_RESULTS)
    if can_see(viewer, target):
        followers = load_followers(target, count)
        return followers

# controller/followers.py
async def load_followers(target_id: int, count: int):
    ...</code></pre> <p><span style="font-weight: 400;">Following the fix, the maximum number of results retrieved by each request is limited to </span><span style="font-weight: 400; font-family: 'courier new', courier;">MAX_FOLLOWERS_RESULTS</span><span style="font-weight: 400;">. 
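</span></p> <p><span style="font-weight: 400;">The capping behavior can be sketched as a small helper. This is hypothetical code, not Meta’s implementation: the limit of 100 is an assumed value, and note that a request parameter arrives as a string and must be parsed before comparison:</span></p> <pre class="line-numbers"><code class="language-none"># Hypothetical helper illustrating the cap on user-supplied counts.
MAX_FOLLOWERS_RESULTS = 100  # assumed server-side limit for illustration

def clamp_count(raw_count: str) -> int:
    """Parse a user-supplied count and cap it at the server-side maximum."""
    try:
        requested = int(raw_count)
    except ValueError:
        return 0  # malformed input yields no results
    # Negative requests return nothing; oversized requests are capped.
    return max(0, min(requested, MAX_FOLLOWERS_RESULTS))

clamp_count("25")       # a normal request is unchanged: 25
clamp_count("1000000")  # a scraper-sized request is capped: 100</code></pre> <p><span style="font-weight: 400;">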
Such a change would not affect regular users and would only interfere with scrapers, forcing them to send orders of magnitude more requests that would then trigger our rate-limiting systems.</span></p> <h2><span style="font-weight: 400;">The limitations of static analysis in combating unauthorized scraping </span></h2> <p><span style="font-weight: 400;">Static analysis tools are not designed to catch all possible unauthorized scraping issues. Because unauthorized scrapers can mimic the legitimate ways that people use Meta’s products, we cannot fully prevent all unauthorized scraping without affecting people’s ability to use our apps and websites the way they enjoy. Since unauthorized scraping is both a common and complex challenge to solve, </span><a href="https://about.fb.com/news/2021/04/how-we-combat-scraping/" target="_blank" rel="noopener"><span style="font-weight: 400;">we combat scraping by taking a more holistic approach</span></a><span style="font-weight: 400;"> to staying ahead of scraping actors. </span></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/02/18/security/protecting-user-data-through-source-code-analysis/">Protecting user data through source code analysis at scale</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></content:encoded> <post-id xmlns="com-wordpress:feed-additions:1">22267</post-id> </item> <item> <title>Unlocking global AI potential with next-generation subsea infrastructure</title> <link>https://engineering.fb.com/2025/02/14/connectivity/project-waterworth-ai-subsea-infrastructure/</link> <dc:creator><![CDATA[]]></dc:creator> <pubDate>Fri, 14 Feb 2025 16:28:06 +0000</pubDate> <category><![CDATA[Connectivity]]></category> <category><![CDATA[Networking & Traffic]]></category> <guid isPermaLink="false">https://engineering.fb.com/?p=22278</guid> <description><![CDATA[<p>Today, we’re announcing our most ambitious subsea cable endeavor yet: Project Waterworth. 
Once complete, the project will reach five major continents and span over 50,000 km (longer than the Earth’s circumference), making it the world’s longest subsea cable project using the highest-capacity technology available. Project Waterworth will bring industry-leading connectivity to the U.S., India, Brazil, [...]</p> <p><a class="btn btn-secondary understrap-read-more-link" href="https://engineering.fb.com/2025/02/14/connectivity/project-waterworth-ai-subsea-infrastructure/">Read More...</a></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/02/14/connectivity/project-waterworth-ai-subsea-infrastructure/">Unlocking global AI potential with next-generation subsea infrastructure</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></description> <content:encoded><![CDATA[<p><span style="font-weight: 400;">Today, we’re announcing our most ambitious subsea cable endeavor yet: Project Waterworth. Once complete, the project will reach five major continents and span over 50,000 km (longer than the Earth’s circumference), making it the world’s longest subsea cable project using the highest-capacity technology available. </span></p> <p><span style="font-weight: 400;">Project Waterworth will bring industry-leading connectivity to the U.S., India, Brazil, South Africa, and other key regions. This project will enable greater economic cooperation, facilitate digital inclusion, and open opportunities for technological development in these regions. 
For example, in India, where we’ve already seen significant growth and investment in digital infrastructure, Waterworth will help accelerate this progress and support the country’s ambitious plans for its digital economy.</span></p> <p><span style="font-weight: 400;">Subsea cable projects, such as Project Waterworth, are the backbone of global digital infrastructure, accounting for more than </span><a href="https://globaldigitalinclusion.org/wp-content/uploads/2024/01/GDIP-Good-Practices-for-Subsea-Cables-Policy-Investing-in-Digital-Inclusion.pdf"><span style="font-weight: 400;">95% of intercontinental traffic</span></a><span style="font-weight: 400;"> across the world’s oceans, seamlessly enabling digital communication, video experiences, online transactions, and more. Project Waterworth will be a multi-billion-dollar, multi-year investment to strengthen the scale and reliability of the world’s digital highways by opening three new oceanic corridors with the abundant, high-speed connectivity needed to drive AI innovation around the world. 
</span></p> <p><img fetchpriority="high" decoding="async" class="alignnone size-large wp-image-22288" src="https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png?w=1024" alt="" width="1024" height="576" srcset="https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png 1920w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png?resize=916,515 916w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png?resize=1536,864 1536w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-Project-Waterworth-map.png?resize=192,108 192w" sizes="(max-width: 992px) 100vw, 62vw" /></p> <p><span style="font-weight: 400;">We’ve driven infrastructure innovation with various partners over the past decade, </span><a href="https://engineering.fb.com/2021/03/28/connectivity/echo-bifrost/"><span style="font-weight: 400;">developing</span></a><span style="font-weight: 400;"> more than 20 </span><a href="https://engineering.fb.com/2021/09/28/connectivity/2africa-pearls/"><span style="font-weight: 400;">subsea cables</span></a><span style="font-weight: 400;">. This includes multiple deployments of industry-leading subsea cables of 24 fiber pairs – compared to the typical 8 to 16 fiber pairs of other new systems. These investments enable unmatched connectivity for our world’s increasing digital needs. 
</span></p> <p><span style="font-weight: 400;">With Project Waterworth, we continue to advance engineering design to maintain cable resilience, enabling us to build the longest 24 fiber pair cable project in the world and enhance overall speed of deployment. We are also deploying first-of-its-kind routing, maximizing the cable laid in deep water — at depths up to 7,000 meters — and using enhanced burial techniques in high-risk fault areas, such as shallow waters near the coast, to avoid damage from ship anchors and other hazards.</span></p> <p><span style="font-weight: 400;">AI is revolutionizing every aspect of our lives, from how we interact with each other to how we think about infrastructure – and Meta is at the forefront of building these innovative technologies. As AI continues to transform industries and societies around the world, it’s clear that capacity, resilience, and global reach are more important than ever to support leading infrastructure. With Project Waterworth we can help ensure that the benefits of AI and other emerging technologies are available to everyone, regardless of where they live or work.</span></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/02/14/connectivity/project-waterworth-ai-subsea-infrastructure/">Unlocking global AI potential with next-generation subsea infrastructure</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></content:encoded> <post-id xmlns="com-wordpress:feed-additions:1">22278</post-id> </item> <item> <title>Looking back at our Bug Bounty program in 2024</title> <link>https://engineering.fb.com/2025/02/13/security/looking-back-at-our-bug-bounty-program-in-2024/</link> <dc:creator><![CDATA[]]></dc:creator> <pubDate>Thu, 13 Feb 2025 17:00:46 +0000</pubDate> <category><![CDATA[Security]]></category> <guid isPermaLink="false">https://engineering.fb.com/?p=22260</guid> <description><![CDATA[<p>In 2024, our bug bounty program awarded more 
than $2.3 million in bounties, bringing our total bounties since the creation of our program in 2011 to over $20 million. As part of our defense-in-depth strategy, we continued to collaborate with the security research community in the areas of GenAI, AR/VR, ads tools, and more. We [...]</p> <p><a class="btn btn-secondary understrap-read-more-link" href="https://engineering.fb.com/2025/02/13/security/looking-back-at-our-bug-bounty-program-in-2024/">Read More...</a></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/02/13/security/looking-back-at-our-bug-bounty-program-in-2024/">Looking back at our Bug Bounty program in 2024</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></description> <content:encoded><![CDATA[<ul> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">In 2024, our bug bounty program awarded more than $2.3 million in bounties, bringing our total bounties since the creation of our program in 2011 to over $20 million. </span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">As part of our </span><a href="https://about.fb.com/news/2019/01/designing-security-for-billions/" target="_blank" rel="noopener"><span style="font-weight: 400;">defense-in-depth strategy</span></a><span style="font-weight: 400;">, we continued to collaborate with the security research community in the areas of GenAI, AR/VR, ads tools, and more. </span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">We also celebrated the security research done by our bug bounty community as part of our annual bug bounty summit and many other industry events. 
</span></li> </ul> <p><span style="font-weight: 400;">As we embark on a new year, we’re sharing several </span><span style="font-weight: 400;">updates</span><span style="font-weight: 400;"> on our work with external bug bounty security researchers to help protect our global community and platforms. This includes new payout stats, details on what’s in scope for GenAI-related bug reports, and a recap of some of our engagements throughout last year with bug bounty researchers. </span></p> <h2><span style="font-weight: 400;">Highlights from Meta’s bug bounty program in 2024</span></h2> <p><span style="font-weight: 400;">In 2024, we received nearly 10,000 bug reports and paid out more than $2.3 million in bounty awards to researchers around the world who helped make our platforms safer.</span></p> <ul> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Since 2011, we have paid out more than $20 million in bug bounties. </span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Last year, we received nearly 10,000 reports and paid out awards on nearly 600 valid reports.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">In 2024, we awarded more than $2.3 million to nearly 200 researchers from more than 45 countries. 
</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">The top three countries based on bounties awarded last year are India, Nepal, and the United States.</span></li> </ul> <h2><span style="font-weight: 400;">Engaging researchers in bug hunting in GenAI </span></h2> <p><span style="font-weight: 400;">After </span><a href="https://about.fb.com/news/2023/09/building-generative-ai-features-responsibly/" target="_blank" rel="noopener"><span style="font-weight: 400;">making our generative AI features available to security researchers</span></a><span style="font-weight: 400;"> through our long-running bug bounty program in 2023, Meta has continued to roll out new GenAI products and tools. In 2024, we provided more details to our research community on </span><a href="https://bugbounty.meta.com/scope/" target="_blank" rel="noopener"><span style="font-weight: 400;">what’s in scope for bug bounty reports related to our large language models (LLMs)</span></a><span style="font-weight: 400;">. We now welcome reports that demonstrate integral privacy or security issues associated with Meta’s LLMs, including being able to extract training data through tactics like model inversion or extraction attacks. 
</span></p> <p><span style="font-weight: 400;">We have already received several impactful reports focused on our GenAI tools, and </span><span style="font-weight: 400;">we look forward to continuing this important work with our community of researchers to help ensure the security and integrity of our GenAI tools.</span></p> <h2><span style="font-weight: 400;">Encouraging security research in ads audience and hardware products </span></h2> <p><span style="font-weight: 400;">This year, we prioritized our efforts to steer security research by the bug bounty community towards a number of product surfaces, including:</span></p> <p><b>Ads audience tools designed to help people choose a target audience for their ads: </b><a href="https://bugbounty.meta.com/payout-guidelines/ads-audience/" target="_blank" rel="noopener"><span style="font-weight: 400;">We introduced new payout guidelines</span></a><span style="font-weight: 400;"> to provide transparency to our security researchers on how we assess the impact of the report we receive for potential security bugs in Meta’s </span><a href="https://www.facebook.com/business/help/717368264947302" target="_blank" rel="noopener"><span style="font-weight: 400;">ads audience tools</span></a><span style="font-weight: 400;">. We cap the maximum base payout for discovering PII (name, email, phone number, state, ZIP, gender) for an ads audience at $30,000 and then apply any applicable deduction based on the required user interaction, prerequisites, and any other mitigation factors to arrive at the final awarded bounty amount. 
More details </span><a href="https://bugbounty.meta.com/payout-guidelines/ads-audience/" target="_blank" rel="noopener"><span style="font-weight: 400;">here</span></a><span style="font-weight: 400;">.</span></p> <p><b>Mixed reality hardware products:</b><span style="font-weight: 400;"> As Meta continues to roll out <a href="https://engineering.fb.com/2023/09/12/security/meta-quest-2-defense-through-offense/" target="_blank" rel="noopener">mixed reality products</a>, we work to encourage security research into these hardware and AI-driven technologies to help us find and fix potential bugs as quickly as possible. In 2024, our bug bounty researchers contributed reports on potential issues in Quest that could have impacted safety settings or lead to memory corruption. We also brought our Quest 3 and Ray-Ban Meta glasses to </span><a href="http://hardwear.io" target="_blank" rel="noopener"><span style="font-weight: 400;">hardwear.io USA 2024</span></a><span style="font-weight: 400;">, a leading conference that brings together top hardware hackers to test new hardware products and help uncover potential vulnerabilities. </span></p> <h2><span style="font-weight: 400;">Building and celebrating the global bug bounty community</span></h2> <p><span style="font-weight: 400;">As part of our continuous commitment to security research – both inside and outside Meta – we invested in enabling open collaboration with our bug bounty community by:</span></p> <p><b>Organizing community events and presenting joint research:</b><span style="font-weight: 400;"> We hosted our annual Meta Bug Bounty Researcher Conference (MBBRC) in Johannesburg, South Africa, bringing together 60 of our top researchers from all over the world. We received more than 100 bug reports and awarded over $320,000 in total. We also co-presented talks at EkoParty, DEF CON, Hardwear.io, Pwn2own, and other security research summits. 
This year, we’re pleased to share that the 2025 MBBRC will be hosted in Tokyo, Japan, May 12-15. Stay tuned for more details in 2025.</span></p> <p><b>Celebrating long-time researchers:</b><span style="font-weight: 400;"> One of our most long-standing and prolific researchers, </span><a href="https://philippeharewood.com/" target="_blank" rel="noopener"><span style="font-weight: 400;">Philippe Harewood</span></a><span style="font-weight: 400;">, reached a 10-year milestone with over 500 valid reports paid out by our bug bounty program. Noteworthy contributions over the years include Philippe’s groundbreaking research on </span><a href="https://www.youtube.com/watch?v=vwUxRCmgwSw" target="_blank" rel="noopener"><span style="font-weight: 400;">Instagram access token leak</span></a><span style="font-weight: 400;">, </span><a href="https://philippeharewood.com/bypass-video-capture-limit-on-ray-ban-stories/" target="_blank" rel="noopener"><span style="font-weight: 400;">video capture limit bypass on Ray-Ban stories</span></a><span style="font-weight: 400;">, and </span><span style="font-weight: 400;">more</span><span style="font-weight: 400;">. </span></p> <p><b>Providing resources and timely updates for the research community:</b><span style="font-weight: 400;"> The <a href="http://bugbounty.meta.com" target="_blank" rel="noopener">Meta Bug Bounty website</a> serves as a centralized hub for all bug bounty news and updates. Researchers can also follow the program on <a href="http://www.instagram.com/metabugbounty" target="_blank" rel="noopener">Instagram</a>, <a href="http://www.facebook.com/BugBounty" target="_blank" rel="noopener">Facebook</a>, and <a href="http://www.x.com/metabugbounty" target="_blank" rel="noopener">X</a> for quick updates. 
</span></p> <h2><span style="font-weight: 400;">Looking ahead</span></h2> <p><span style="font-weight: 400;">Meta’s bug bounty team looks forward to introducing new initiatives and continuing to engage with our existing community and new researchers who are just getting started. Additionally, we will continue to provide seasoned experts with unique opportunities to test unreleased features through our private bug bounty tracks.</span></p> <p><span style="font-weight: 400;">For the past 14 years, our bug bounty program has fostered a collaborative relationship with external researchers that has helped keep our platforms safer and more secure. We would like to extend a heartfelt thanks to everyone who contributed to the growth of our program in 2024.</span></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/02/13/security/looking-back-at-our-bug-bounty-program-in-2024/">Looking back at our Bug Bounty program in 2024</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></content:encoded> <post-id xmlns="com-wordpress:feed-additions:1">22260</post-id> </item> <item> <title>Revolutionizing software testing: Introducing LLM-powered bug catchers</title> <link>https://engineering.fb.com/2025/02/05/security/revolutionizing-software-testing-llm-powered-bug-catchers-meta-ach/</link> <dc:creator><![CDATA[]]></dc:creator> <pubDate>Wed, 05 Feb 2025 18:30:51 +0000</pubDate> <category><![CDATA[ML Applications]]></category> <category><![CDATA[Security]]></category> <guid isPermaLink="false">https://engineering.fb.com/?p=22242</guid> <description><![CDATA[<p>WHAT IT IS Meta’s Automated Compliance Hardening (ACH) tool is a system for mutation-guided, LLM-based test generation. ACH hardens platforms against regressions by generating undetected faults (mutants) in source code that are specific to a given area of concern and using those same mutants to generate tests. 
When applied to privacy, for example, ACH automates [...]</p> <p><a class="btn btn-secondary understrap-read-more-link" href="https://engineering.fb.com/2025/02/05/security/revolutionizing-software-testing-llm-powered-bug-catchers-meta-ach/">Read More...</a></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/02/05/security/revolutionizing-software-testing-llm-powered-bug-catchers-meta-ach/">Revolutionizing software testing: Introducing LLM-powered bug catchers</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></description> <content:encoded><![CDATA[<h2><span style="font-weight: 400;">WHAT IT IS</span></h2> <p><a href="https://arxiv.org/pdf/2501.12862" target="_blank" rel="noopener"><span style="font-weight: 400;">Meta’s Automated Compliance Hardening (ACH) tool</span></a><span style="font-weight: 400;"> is a system for mutation-guided, LLM-based test generation. ACH hardens platforms against regressions by generating undetected faults (mutants) in source code that are specific to a given area of concern and using those same mutants to generate tests. When applied to privacy, for example, ACH automates the process of searching for privacy-related faults and preventing them from entering our systems in the future, </span><span style="font-weight: 400;">ultimately hardening our code bases to reduce risk of any privacy regression.</span></p> <p><span style="font-weight: 400;">ACH automatically generates unit tests that target a particular kind of fault. We describe the faults we care about to ACH in plain text. The description can be incomplete, and even self-contradictory, yet ACH still generates tests that it proves will catch bugs of the kind described.</span></p> <p><span style="font-weight: 400;">Traditionally, automated test generation techniques sought merely to increase code coverage. 
As every tester knows, this is only part of the solution because increasing coverage doesn’t necessarily find faults. </span><span style="font-weight: 400;"><br /> </span><span style="font-weight: 400;"><br /> </span><span style="font-weight: 400;">ACH is a radical departure from this tradition, because it targets specific faults, rather than uncovered code, although it often also increases coverage in the process of targeting faults. Furthermore, because ACH is founded on the principles of </span><a href="https://arxiv.org/abs/2402.04380" target="_blank" rel="noopener"><span style="font-weight: 400;">Assured LLM-based Software Engineering</span></a><span style="font-weight: 400;">, it maintains verifiable assurances that its tests catch the kind of faults described.</span></p> <p><span style="font-weight: 400;">Our new research paper, “</span><a href="https://arxiv.org/pdf/2501.12862" target="_blank" rel="noopener"><span style="font-weight: 400;">Mutation-Guided LLM-based Test Generation at Meta</span></a><span style="font-weight: 400;">,” gives details of the underlying scientific foundations for ACH and how we apply ACH to privacy testing, but this approach can be applied to any sort of regression testing.</span></p> <h2><span style="font-weight: 400;">HOW IT WORKS</span></h2> <p><span style="font-weight: 400;">Mutation testing, where faults (mutants) are deliberately introduced into source code (using version control to keep them away from production) to assess how well an existing testing framework can detect these changes, has been</span><a href="https://web.eecs.umich.edu/~weimerw/2022-481F/readings/mutation-testing.pdf" target="_blank" rel="noopener"> <span style="font-weight: 400;">researched for decades</span></a><span style="font-weight: 400;">. But despite this, mutation testing has remained difficult to deploy.
</span></p> <p><span style="font-weight: 400;"><a href="http://crest.cs.ucl.ac.uk/fileadmin/crest/sebasepaper/JiaH10.pdf" target="_blank" rel="noopener">In earlier approaches</a>, mutants themselves would be automatically generated (most often using a rule-based approach). But this method often produced mutants that weren’t realistic representations of the faults engineers actually worry about.</span></p> <p><span style="font-weight: 400;">On top of that, even with the mutants being automatically generated, humans would still have to manually write the tests that would kill the mutants (catch the faults).</span></p> <p><span style="font-weight: 400;">Writing these tests is a painstaking and laborious process, so engineers faced a two-pronged issue: Not only did every test have to be written by hand, but even after all of that work, there was no guarantee the test would actually catch the automatically generated mutant. </span></p> <p><span style="font-weight: 400;">By leveraging LLMs, we can generate mutants that represent realistic concerns and also save on human labor by generating tests to catch the faults automatically as well.
ACH marries automated test generation techniques with the capabilities of large language models (LLMs) to generate mutants that are highly relevant to an area of testing concern as well as tests that are guaranteed to catch bugs that really matter.</span></p> <p><span style="font-weight: 400;">Broadly, ACH works in three steps:</span></p> <ol> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">An engineer describes the kind of bugs they’re concerned about.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">ACH uses that description to automatically generate lots of bugs.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">ACH uses the generated bugs to automatically generate lots of tests that catch them.</span></li> </ol> <p><span style="font-weight: 400;">At Meta we’ve </span><a href="https://arxiv.org/pdf/2501.12862" target="_blank" rel="noopener"><span style="font-weight: 400;">applied ACH-assisted testing to several of our platforms</span></a><span style="font-weight: 400;">, including Facebook Feed, Instagram, Messenger, and WhatsApp. 
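</span></p>

<p><span style="font-weight: 400;">The three steps above can be sketched as a small loop. This is an illustrative sketch only, with stand-in functions for the LLM calls rather than Meta’s actual implementation; it keeps a generated test only if the test passes on the original code and fails on the mutant, one concrete way to realize the assurance that a test catches the described kind of fault:</span></p>

```python
# Illustrative sketch of a mutation-guided test generation loop.
# `generate_mutants` and `generate_tests` stand in for LLM calls.

def run_test(test, code):
    """Run a test against a version of the code; True means it passes."""
    return test(code)

def ach_loop(original_code, concern, generate_mutants, generate_tests):
    kept_tests = []
    for mutant in generate_mutants(original_code, concern):
        for candidate in generate_tests(original_code, mutant):
            # Keep a test only if it passes on the original code AND
            # fails on the mutant, i.e., it "kills" the injected fault.
            if run_test(candidate, original_code) and not run_test(candidate, mutant):
                kept_tests.append(candidate)
                break  # one killing test per mutant is enough here
    return kept_tests
```

<p><span style="font-weight: 400;">Keeping only mutant-killing tests is what separates this loop from coverage-driven generation: every retained test is a witness to a specific fault it demonstrably detects.</span></p>

<p><span style="font-weight: 400;">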
Based on our own testing, we’ve concluded that engineers found ACH useful for hardening code against specific concerns and found other benefits even when tests generated by ACH don’t directly tackle a specific concern.</span></p> <figure id="attachment_22243" aria-describedby="caption-attachment-22243" style="width: 1024px" class="wp-caption alignnone"><img decoding="async" class="size-large wp-image-22243" src="https://engineering.fb.com/wp-content/uploads/2025/02/Meta-ACH-system-chart.png?w=1024" alt="" width="1024" height="576" srcset="https://engineering.fb.com/wp-content/uploads/2025/02/Meta-ACH-system-chart.png 1534w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-ACH-system-chart.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-ACH-system-chart.png?resize=916,515 916w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-ACH-system-chart.png?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-ACH-system-chart.png?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-ACH-system-chart.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/02/Meta-ACH-system-chart.png?resize=192,108 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22243" class="wp-caption-text">A top-level overview of the architecture of the ACH system. 
The system leverages LLMs to generate faults, check them against possible equivalents, and then generate tests to catch those faults.</figcaption></figure> <h2><span style="font-weight: 400;">WHY IT MATTERS</span></h2> <p><span style="font-weight: 400;">Meta has a very large number of data systems and uses </span><a href="https://engineering.fb.com/2022/07/27/developer-tools/programming-languages-endorsed-for-server-side-use-at-meta/" target="_blank" rel="noopener"><span style="font-weight: 400;">many different programming languages</span></a><span style="font-weight: 400;">, frameworks, and services to power our family of apps and products. But how can our thousands of engineers across the world ensure that their code is reliable and won’t introduce bugs that degrade application performance or create privacy risk? The answer lies with LLMs. </span></p> <p><span style="font-weight: 400;">LLM-based test generation and LLM-based mutant generation are not new, but this is the first time they’ve been combined and deployed in large-scale industrial systems. Generating mutants and the tests to kill them has traditionally been difficult to scale. Since LLMs are probabilistic and don’t need to rely on rigidly defined rules to make decisions, they allow us to tackle both sides of this equation – generating mutations and tests to kill them – very efficiently and with a high level of accuracy. </span></p> <p><span style="font-weight: 400;">This new approach significantly modernizes this form of automated test generation and helps software engineers take in concerns from a variety of sources (previous faults, colleagues, user requirements, regulatory requirements, etc.)
and efficiently convert them from freeform text into actionable tests – with the guarantee that the test will catch the fault they’re looking for.</span></p> <p><span style="font-weight: 400;">ACH can be applied to any class of faults and can have a significant impact on hardening against future regressions and optimizing testing itself.</span></p> <h2><span style="font-weight: 400;">WHAT’S NEXT</span></h2> <p><span style="font-weight: 400;">Our novel approach combines LLM-based test generation and mutant generation to help automate complex technical organizational workflows in this space. This innovation has the potential to simplify risk assessments, reduce cognitive load for developers, and ultimately create a safer online ecosystem. We’re committed to expanding deployment areas, developing methods to measure mutant relevance, and detecting existing faults to drive industry-wide adoption of automated test generation in compliance. </span></p> <p><span style="font-weight: 400;">We will be sharing more developments and encourage you to watch this space.</span></p> <h2><span style="font-weight: 400;">READ THE PAPER</span></h2> <p><a href="https://arxiv.org/pdf/2501.12862" target="_blank" rel="noopener"><span style="font-weight: 400;">Mutation-Guided LLM-based Test Generation at Meta</span></a></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/02/05/security/revolutionizing-software-testing-llm-powered-bug-catchers-meta-ach/">Revolutionizing software testing: Introducing LLM-powered bug catchers</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></content:encoded> <post-id xmlns="com-wordpress:feed-additions:1">22242</post-id> </item> <item> <title>Data logs: The latest evolution in Meta’s access tools</title> <link>https://engineering.fb.com/2025/02/04/security/data-logs-the-latest-evolution-in-metas-access-tools/</link> <dc:creator><![CDATA[]]></dc:creator> <pubDate>Tue, 04 Feb 2025 20:00:18
+0000</pubDate> <category><![CDATA[Security]]></category> <guid isPermaLink="false">https://engineering.fb.com/?p=22222</guid> <description><![CDATA[<p>We’re sharing how Meta built support for data logs, which provide people with additional data about how they use our products. Here we explore initial system designs we considered, an overview of the current architecture, and some important principles Meta takes into account in making data accessible and easy to understand. Users have a variety [...]</p> <p><a class="btn btn-secondary understrap-read-more-link" href="https://engineering.fb.com/2025/02/04/security/data-logs-the-latest-evolution-in-metas-access-tools/">Read More...</a></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/02/04/security/data-logs-the-latest-evolution-in-metas-access-tools/">Data logs: The latest evolution in Meta’s access tools</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></description> <content:encoded><![CDATA[<ul> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">We’re sharing how Meta built support for data logs, which provide people with additional data about how they use our products.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Here we explore initial system designs we considered, an overview of the current architecture, and some important principles Meta takes into account in making data accessible and easy to understand. </span></li> </ul> <p><span style="font-weight: 400;">Users have a variety of tools they can use to manage and access their information on Meta platforms. 
Meta is always looking for ways to enhance its access tools in line with technological advances, and in February 2024 we began including </span><a href="https://www.facebook.com/help/384437594328726#data-logs"><span style="font-weight: 400;">data logs</span></a><span style="font-weight: 400;"> in the </span><a href="https://www.facebook.com/help/212802592074644"><span style="font-weight: 400;">Download Your Information</span></a><span style="font-weight: 400;"> (DYI) tool. Data logs include things such as information about content you’ve viewed on Facebook. Some of this data can be unique, but it can also include additional details about information that we already make available elsewhere, such as through a user’s profile, products like </span><a href="https://www.facebook.com/help/1700142396915814"><span style="font-weight: 400;">Access Your Information</span></a><span style="font-weight: 400;"> or </span><a href="https://www.facebook.com/help/256333951065527"><span style="font-weight: 400;">Activity Log</span></a><span style="font-weight: 400;">, or account downloads. This update is the result of significant investments over a number of years by a large cross-functional team at Meta, and consultations with experts on how to continue enhancing our access tools.</span></p> <p><span style="font-weight: 400;">Data logs are just the most recent example of how Meta gives users the power to access their data on our platforms. We have a long history of giving users transparency and control over their data:</span></p> <ul> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">2010: Users can retrieve a copy of their information through DYI. 
</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">2011: Users can easily review actions taken on Facebook through </span><a href="https://www.facebook.com/help/256333951065527"><span style="font-weight: 400;">Activity Log</span></a><span style="font-weight: 400;">.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">2014: Users have more transparency and control over ads they see with the “</span><a href="https://www.facebook.com/help/794535777607370#advertiser-choices"><span style="font-weight: 400;">Why Am I Seeing This Ad?</span></a><span style="font-weight: 400;">” feature on Facebook.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">2018: Users have a curated experience to find information about them through </span><a href="https://about.fb.com/news/2018/03/privacy-shortcuts/"><span style="font-weight: 400;">Access Your Information</span></a><span style="font-weight: 400;">. Users can retrieve a copy of their information on Instagram through </span><a href="https://help.instagram.com/181231772500920"><span style="font-weight: 400;">Download Your Data</span></a><span style="font-weight: 400;"> and on WhatsApp through </span><a href="https://faq.whatsapp.com/526463418847093/"><span style="font-weight: 400;">Request Account Information</span></a><span style="font-weight: 400;">.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">2019: Users can view their activity off Meta-technologies and clear their history. Meta joins the </span><a href="https://engineering.fb.com/2019/12/02/security/data-transfer-project/"><span style="font-weight: 400;">Data Transfer Project</span></a><span style="font-weight: 400;"> and has continuously led the development of shared technologies that enable users to port their data from one platform to another. 
</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">2020: Users continue to </span><a href="https://about.fb.com/news/2020/03/data-access-tools/"><span style="font-weight: 400;">receive more information</span></a><span style="font-weight: 400;"> in DYI such as additional information about their interactions on Facebook and Instagram.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">2021: Users can more easily navigate categories of information in </span><a href="https://about.fb.com/news/2021/01/introducing-the-new-access-your-information/"><span style="font-weight: 400;">Access Your Information</span></a></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">2023: Users can more easily use our tools as access features are consolidated within </span><a href="https://www.facebook.com/help/943858526073065"><span style="font-weight: 400;">Accounts Center</span></a><span style="font-weight: 400;">.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">2024: Users can access data logs in Download Your Information.</span></li> </ul> <h2><span style="font-weight: 400;">What are data logs?</span></h2> <p><span style="font-weight: 400;">In contrast to our production systems, which can be queried billions of times per second thanks to techniques like caching, Meta’s data warehouse, powered by Hive, is designed to support low volumes of large queries for things like analytics and cannot scale to the query rates needed to power real-time data access.</span></p> <p><span style="font-weight: 400;">We created data logs as a solution to provide users who want more granular information with access to data stored in Hive. 
In this context, an individual data log entry is a formatted version of a single row of data from Hive that has been processed to make the underlying data transparent and easy to understand.</span></p> <p><span style="font-weight: 400;">Obtaining this data from Hive in a format that can be presented to users is not straightforward. Hive tables are partitioned, typically by date and time, so retrieving all the data for a specific user requires scanning through every row of every partition to check whether it corresponds to that user. Facebook has over 3 billion monthly active users, meaning that, assuming an even distribution of data, ~99.999999967% of the rows in a given Hive table might be processed for such a query even though they won’t be relevant. </span></p> <p><span style="font-weight: 400;">Overcoming this fundamental limitation was challenging, and adapting our infrastructure to enable it has taken multiple years of concerted effort. Data warehouses are commonly used in a range of industry sectors, so we hope that this solution should be of interest to other companies seeking to provide access to the data in their data warehouses.</span></p> <h3><span style="font-weight: 400;">Initial designs</span></h3> <p><span style="font-weight: 400;">When we started designing a solution to make data logs available, we first considered whether it would be feasible to simply run queries for each individual as they requested their data, despite the fact that these queries would spend almost all of their time processing irrelevant data. Unfortunately, as we highlighted above, the distribution of data at Meta’s scale makes this approach infeasible and incredibly wasteful: It would require scanning entire tables once per DYI request, scaling linearly with the number of individual users that initiate DYI requests. These performance characteristics were infeasible to work around. 
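</span></p>

<p><span style="font-weight: 400;">The scale of that waste is easy to check with back-of-envelope arithmetic (a sketch using the ~3 billion monthly-active-user figure above and the even-distribution assumption):</span></p>

```python
# Back-of-envelope check of why per-user full-table scans don't scale,
# assuming rows are evenly distributed across ~3 billion users.
MONTHLY_ACTIVE_USERS = 3_000_000_000

def irrelevant_fraction(users_in_scan: int) -> float:
    """Fraction of scanned rows that belong to no user in the scan."""
    return 1 - users_in_scan / MONTHLY_ACTIVE_USERS

# A single-user scan wastes ~99.999999967% of the rows it touches,
# and one scan per request multiplies that full-table cost linearly.
waste_for_one_user = irrelevant_fraction(1)
```

<p><span style="font-weight: 400;">Batching many requests into one scan leaves the cost of the full-table scan essentially constant while shrinking the wasted fraction, which motivates the design described next.</span></p>

<p><span style="font-weight: 400;">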
</span></p> <p><span style="font-weight: 400;">We also considered caching data logs in an online system capable of supporting a range of indexed per-user queries. This would make the per-user queries relatively efficient. However, copying and storing data from the warehouse in these other systems presented material computational and storage costs that were not offset by the overall effectiveness of the cache, making this infeasible as well. </span></p> <h3><span style="font-weight: 400;">Current design</span></h3> <p><span style="font-weight: 400;">Finally, we considered whether it would be possible to build a system that relies on amortizing the cost of expensive full table scans by batching individual users’ requests into a single scan. After significant engineering investigation and prototyping, we determined that such a system offers infrastructure teams sufficiently predictable performance characteristics to make this possible. Even with this batching over short periods of time, given the relatively small size of the batches of requests for this information compared to the overall user base, most of the rows considered in a given table are filtered and scoped out as they are not relevant to the users whose data has been requested. This is a necessary trade-off to enable this information to be made accessible to our users.</span></p> <p><span style="font-weight: 400;">In more detail, following a pre-defined schedule, a job is triggered using Meta’s internal task-scheduling service to organize the most recent requests for users’ data logs, collected over a short time period, into a single batch. This batch is submitted to a system built on top of Meta’s </span><a href="https://atscaleconference.com/workflowsfacebook-powering-developer-productivity-and-automation-at-facebook-scale/"><span style="font-weight: 400;">Core Workflow Service</span></a><span style="font-weight: 400;"> (CWS).
CWS provides a useful set of guarantees that enable long-running tasks to be executed with predictable performance characteristics and reliability guarantees that are critical for complex multi-step workflows.</span></p> <p><span style="font-weight: 400;">Once the batch has been queued for processing, we copy the list of user IDs who have made requests in that batch into a new Hive table. For each data logs table, we initiate a new worker task that fetches the relevant metadata describing how to correctly query the data. Once we know what to query for a specific table, we create a task for each partition that executes a job in </span><a href="https://www.youtube.com/watch?v=4T-MCYWrrOw"><span style="font-weight: 400;">Dataswarm</span></a><span style="font-weight: 400;"> (our data pipeline system). This job performs an INNER JOIN between the table containing requesters’ IDs and the column in each table that identifies the owner of the data in that row. As tables in Hive may leverage security mechanisms like access control lists (ACLs) and privacy protections built on top of Meta’s </span><a href="https://engineering.fb.com/2024/08/27/security/privacy-aware-infrastructure-purpose-limitation-meta/"><span style="font-weight: 400;">Privacy Aware Infrastructure</span></a><span style="font-weight: 400;">, the jobs are configured with appropriate security and privacy policies that govern access to the data.</span></p> <figure id="attachment_22235" aria-describedby="caption-attachment-22235" style="width: 1024px" class="wp-caption alignnone"><img decoding="async" class="size-large wp-image-22235" src="https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png?w=1024" alt="" width="1024" height="417" srcset="https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png 4833w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png?resize=916,373 916w, 
https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png?resize=768,312 768w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png?resize=1024,417 1024w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png?resize=1536,625 1536w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png?resize=2048,833 2048w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png?resize=96,39 96w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-workflow.png?resize=192,78 192w" sizes="(max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22235" class="wp-caption-text">A diagram showing how a workflow gathers inputs and runs sub-workflows for each table and partition to reliably gather data logs. The data logs workflow will gather metadata, then prepare requester IDs, then run parallel processes for each table. Each table workflow prepares the table, then runs approximately sequential jobs for each partition. Some partitions across tables may be processed in parallel.</figcaption></figure> <p><span style="font-weight: 400;">Once this job is completed, it outputs its results to an intermediate Hive table containing a combination of the data logs for all users in the current batch. This processing is expensive, as the INNER JOIN requires a full table scan across all relevant partitions of the Hive table, an operation which may consume significant computational resources. The output table is then processed using PySpark to identify the relevant data and split it into individual files for each user’s data in a given partition. 
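</span></p>

<p><span style="font-weight: 400;">In miniature, the join-then-split flow looks like the following sketch, with plain Python lists standing in for Hive partitions and the PySpark splitter, and an illustrative user_id column name:</span></p>

```python
# Toy model of the batched extraction: one scan of a partition keeps
# only requesters' rows (the INNER JOIN), then the combined output is
# split into per-user buckets (the PySpark step).

def scan_partition(partition_rows, requester_ids):
    """Full pass over one partition, keeping rows owned by the batch."""
    return [row for row in partition_rows if row["user_id"] in requester_ids]

def split_by_user(joined_rows):
    """Split the combined batch output into one bucket per user."""
    buckets = {}
    for row in joined_rows:
        buckets.setdefault(row["user_id"], []).append(row)
    return buckets
```

<p><span style="font-weight: 400;">However many users are in the batch, the scan itself is performed once per partition, which is the amortization the design depends on.</span></p>

<p><span style="font-weight: 400;">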
</span></p> <figure id="attachment_22234" aria-describedby="caption-attachment-22234" style="width: 1024px" class="wp-caption alignnone"><img loading="lazy" decoding="async" class="size-large wp-image-22234" src="https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png?w=1024" alt="" width="1024" height="430" srcset="https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png 3166w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png?resize=916,385 916w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png?resize=768,322 768w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png?resize=1024,430 1024w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png?resize=1536,645 1536w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png?resize=2048,860 2048w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png?resize=96,40 96w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-splitter.png?resize=192,81 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22234" class="wp-caption-text">A diagram showing that the processing of each partition splits the data into one file per user per partition. Both users A and B will have a file representing Table 1 Partition 1, and so on.</figcaption></figure> <p><span style="font-weight: 400;">The result of these batch operations in the data warehouse is a set of comma delimited text files containing the unfiltered raw data logs for each user. This raw data is not yet explained or made intelligible to users, so we run a post-processing step in Meta’s Hack language to apply privacy rules and filters and render the raw data into meaningful, well-explained HTML files. We do this by passing the raw data through various renderers, discussed in more detail in the next section. 
Finally, once all of the processing is completed, the results are aggregated into a ZIP file and made available to the requestor through the DYI tool.</span></p> <h2>Lessons learned from building data logs</h2> <p><span style="font-weight: 400;">Throughout the development of this system we found it critical to develop robust checkpointing mechanisms that enable incremental progress and resilience in the face of errors and temporary failures. While processing everything in a single pass may reduce latency, the risk is that a single issue will cause all of the previous work to be wasted. For example, in addition to jobs timing out and failing to complete, we also experienced errors where full-table-scan queries would run out of memory and fail partway through processing. The capability to resume work piecemeal increases resiliency and optimizes the overall throughput of the system.</span></p> <figure id="attachment_22233" aria-describedby="caption-attachment-22233" style="width: 1024px" class="wp-caption alignnone"><img loading="lazy" decoding="async" class="size-large wp-image-22233" src="https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-checkpoints.png?w=1024" alt="" width="1024" height="432" srcset="https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-checkpoints.png 1787w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-checkpoints.png?resize=916,386 916w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-checkpoints.png?resize=768,324 768w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-checkpoints.png?resize=1024,432 1024w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-checkpoints.png?resize=1536,648 1536w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-checkpoints.png?resize=96,41 96w, https://engineering.fb.com/wp-content/uploads/2025/02/data-logs-checkpoints.png?resize=192,81 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption 
id="caption-attachment-22233" class="wp-caption-text">A diagram explaining that checkpointing after incremental task completions can optimize system throughput. Approach 1 shows one task failure on step 2, which is recomputed successfully. Approach 2 combines everything into one step, but when it fails, it winds up taking longer overall.</figcaption></figure> <p><span style="font-weight: 400;">Ensuring data correctness is also very important. As we built the component that splits combined results into individual files for each user, we encountered an issue that affected this correctness guarantee and could have led to data being returned to the wrong user. The root cause of the issue was a Spark concurrency bug that partitioned data incorrectly across the parallel Spark workers. To prevent this issue, we built verification in the post-processing stage to ensure that the user ID column in the data matches the identifier for the user whose logs we are generating. This means that even if similar bugs were to occur in the core data processing infrastructure we would prevent any incorrect data from being shown to users. </span></p> <p><span style="font-weight: 400;">Finally, we learned that complex data workflows require advanced tools and the capability to iterate on code changes quickly without re-processing everything. To this end, we built an experimentation platform that enables running modified versions of the workflows to quickly test changes, with the ability to independently execute phases of the process to expedite our work. For example, we found that innocent-looking changes such as altering which column is being fetched can lead to complex failures in the data fetching jobs. 
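</span></p>

<p><span style="font-weight: 400;">The per-user verification described above can be sketched as follows (a simplified Python illustration; the production check runs in the Hack post-processing step):</span></p>

```python
# Simplified sketch of the correctness check applied before rendering:
# every row in a user's split file must carry that user's ID, so a
# mis-partitioned row aborts processing instead of reaching the wrong user.

class DataCorrectnessError(Exception):
    pass

def verify_user_rows(expected_user_id, rows, id_column="user_id"):
    for row in rows:
        if row[id_column] != expected_user_id:
            raise DataCorrectnessError(
                f"row owned by {row[id_column]} found in file for {expected_user_id}"
            )
    return rows
```

<p><span style="font-weight: 400;">Failing closed in this way means that even a repeat of the Spark partitioning bug would halt the workflow rather than show one user’s data to another.</span></p>

<p><span style="font-weight: 400;">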
We now have the ability to run a test job under the new configuration when making a change to a table fetcher.</span></p> <h2>Making data consistently understandable and explainable</h2> <p><span style="font-weight: 400;">Meta cares about ensuring that the information we provide is meaningful to end-users. A key challenge in providing transparent access is presenting often highly technical information in a way that is approachable and easy to understand, even for those with little expertise in technology. </span></p> <p><span style="font-weight: 400;">Providing people access to their data involves working across numerous products and surfaces that our users interact with every day. The way that information is stored on our back-end systems is not always directly intelligible to end-users, and it takes both understanding of the individual products and features as well as the needs of users to make it user-friendly. This is a large, collaborative undertaking, leveraging many forms of expertise. Our process entails working with product teams familiar with the data from their respective products, applying our historical expertise in access surfaces, using innovative tools we have developed, and consulting with experts.</span></p> <p><span style="font-weight: 400;">In more detail, a cross-functional team of access experts works with specialist teams to review these tables, taking care to avoid exposing information that could adversely affect the rights and freedoms of other users. For example, if you block another user on Facebook, this information would not be provided to the person that you have blocked. Similarly, when you view another user’s profile, this information will be available to you, but not to the person whose profile you viewed. This is a key principle Meta upholds to respect the rights of everyone who engages with our platforms. It also means that we need a rigorous process to ensure that the data made available is never shared incorrectly.
Many of the datasets that power a social network will reference more than one person, but that does not imply everyone referenced should always have equal access to that information. </span></p> <p><span style="font-weight: 400;">Additionally, Meta must take care not to disclose information that may compromise our integrity or safety systems, or our intellectual property rights. For instance, Meta sends </span><a href="https://transparency.meta.com/en-gb/ncmec-q2-2023/"><span style="font-weight: 400;">millions of NCMEC Cybertip reports per year</span></a><span style="font-weight: 400;"> to help protect children on our platforms. Disclosing this information, or the data signals used to detect apparent violations of laws protecting children, may undermine the sophisticated techniques we have developed to proactively seek out and report these types of content and interactions. </span></p> <p><span style="font-weight: 400;">One particularly time-consuming and challenging task is ensuring that Meta-internal text strings that describe our systems and products are translated into more easily human-readable terms. For instance, a </span><a href="https://docs.hhvm.com/hack/built-in-types/enum"><span style="font-weight: 400;">Hack enum</span></a><span style="font-weight: 400;"> could define a set of user interface element references. Exposing the jargon-heavy internal versions of these enums would definitely not be meaningful to an end-user — they may not be meaningful at first glance to other employees without sufficient context! In this case, user-friendly labels are created to replace these internal-facing strings. The resulting content is reviewed for explainability, simplicity, and consistency, with product experts also helping to verify that the final version is accurate.</span></p> <p><span style="font-weight: 400;">This process also makes information more useful by reducing duplication.
When engineers build and iterate on a product for our platforms, they may log slightly different versions of the same information with the goal of better understanding how people use the product. For example, when users select an option from a list of actions, each part of the system may use slightly different values that represent the same underlying option, such as an option to move content to trash as part of </span><a href="https://about.fb.com/news/2020/06/introducing-manage-activity/"><span style="font-weight: 400;">Manage Activity</span></a><span style="font-weight: 400;">. As a concrete example, we found this action stored with different values: In the first instance it was entered as </span><span style="font-weight: 400;">MOVE_TO_TRASH</span><span style="font-weight: 400;">, in the second as </span><span style="font-weight: 400;">StoryTrashPostMenuItem</span><span style="font-weight: 400;">, and in the third as </span><span style="font-weight: 400;">FBFeedMoveToTrashOption</span><span style="font-weight: 400;">. These differences stemmed from the fact that the logging in question was coming from different parts of the system with different conventions. Through a series of cross-functional reviews with support from product experts, Meta determines an appropriate column header (e.g., “Which option you interacted with”), and the best label for the option (e.g., “Move to trash”).</span></p> <p><span style="font-weight: 400;">Finally, once content has been reviewed, it can be implemented in code using the renderers we described above. These are responsible for reading the raw values and transforming them into user-friendly representations. This includes transforming raw integer values into meaningful references to entities and converting raw enum values into user-friendly representations. 
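In code, a renderer of this kind can be as small as a reviewed lookup table. The sketch below is illustrative only (the function and table names are assumptions, not Meta's actual renderer API), using the internal "Move to trash" values discussed above:

```python
# Reviewed mapping from internal logging values to one user-facing label.
# The three keys are the real-world variants mentioned in the text; the
# structure of the renderer itself is a hypothetical sketch.
FRIENDLY_LABELS = {
    "MOVE_TO_TRASH": "Move to trash",
    "StoryTrashPostMenuItem": "Move to trash",
    "FBFeedMoveToTrashOption": "Move to trash",
}

def render_option(raw_value: str) -> str:
    """Map an internal enum string to its reviewed, user-friendly label.

    Values without a reviewed label fall through unchanged so they can be
    flagged for review rather than silently dropped.
    """
    return FRIENDLY_LABELS.get(raw_value, raw_value)

print(render_option("StoryTrashPostMenuItem"))  # → Move to trash
```

A lookup table like this also performs the de-duplication described above: three logging conventions collapse to a single label.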
An ID like 1786022095521328 might become “John Doe”; enums with integer values 0, 1, 2 can be converted into text like “Disabled,” “Active,” or “Hidden;” and columns with string enums can remove jargon that is not understandable to end-users and de-duplicate (as in our “Move to trash” example above). </span></p> <p><span style="font-weight: 400;">Together, these culminate in a much friendlier representation of the data that might look like this:</span></p> <figure id="attachment_22228" aria-describedby="caption-attachment-22228" style="width: 1024px" class="wp-caption alignnone"><img loading="lazy" decoding="async" class="size-large wp-image-22228" src="https://engineering.fb.com/wp-content/uploads/2025/01/Meta-data-logs-image-4.png?w=1024" alt="" width="1024" height="197" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/Meta-data-logs-image-4.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/01/Meta-data-logs-image-4.png?resize=916,176 916w, https://engineering.fb.com/wp-content/uploads/2025/01/Meta-data-logs-image-4.png?resize=768,148 768w, https://engineering.fb.com/wp-content/uploads/2025/01/Meta-data-logs-image-4.png?resize=1024,197 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/Meta-data-logs-image-4.png?resize=1536,295 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/Meta-data-logs-image-4.png?resize=96,18 96w, https://engineering.fb.com/wp-content/uploads/2025/01/Meta-data-logs-image-4.png?resize=192,37 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22228" class="wp-caption-text">An example of what a data log might look like, demonstrating meaningful and intelligible output. 
It shows column headers with descriptions and a row of data.</figcaption></figure> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/02/04/security/data-logs-the-latest-evolution-in-metas-access-tools/">Data logs: The latest evolution in Meta’s access tools</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></content:encoded> <post-id xmlns="com-wordpress:feed-additions:1">22222</post-id> </item> <item> <title>How Precision Time Protocol handles leap seconds</title> <link>https://engineering.fb.com/2025/02/03/production-engineering/how-precision-time-protocol-ptp-handles-leap-seconds/</link> <dc:creator><![CDATA[]]></dc:creator> <pubDate>Mon, 03 Feb 2025 17:00:34 +0000</pubDate> <category><![CDATA[Networking & Traffic]]></category> <category><![CDATA[Production Engineering]]></category> <guid isPermaLink="false">https://engineering.fb.com/?p=22212</guid> <description><![CDATA[<p>We’ve previously described why we think it’s time to leave the leap second in the past. In today’s rapidly evolving digital landscape, introducing new leap seconds to account for the long-term slowdown of the Earth’s rotation is a risky practice that, frankly, does more harm than good. 
This is particularly true in the data center [...]</p> <p><a class="btn btn-secondary understrap-read-more-link" href="https://engineering.fb.com/2025/02/03/production-engineering/how-precision-time-protocol-ptp-handles-leap-seconds/">Read More...</a></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/02/03/production-engineering/how-precision-time-protocol-ptp-handles-leap-seconds/">How Precision Time Protocol handles leap seconds</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></description> <content:encoded><![CDATA[<p><span style="font-weight: 400;">We’ve previously described why we think</span> <a href="https://engineering.fb.com/2022/07/25/production-engineering/its-time-to-leave-the-leap-second-in-the-past/"><span style="font-weight: 400;">it’s time to leave the leap second in the past</span></a><span style="font-weight: 400;">. In today’s rapidly evolving digital landscape, introducing new leap seconds to account for the long-term slowdown of the Earth’s rotation is a risky practice that, frankly,</span> <span style="font-weight: 400;">does more harm than good</span><span style="font-weight: 400;">. This is particularly true in the data center space, where new protocols like</span><a href="https://engineering.fb.com/2022/11/21/production-engineering/precision-time-protocol-at-meta/"> <span style="font-weight: 400;">Precision Time Protocol (PTP)</span></a><span style="font-weight: 400;"> are allowing systems to be synchronized down to nanosecond precision. 
</span></p> <p><span style="font-weight: 400;">With the ever-growing demand for higher precision time distribution, and</span> <a href="https://engineering.fb.com/2022/11/21/production-engineering/precision-time-protocol-at-meta/" target="_blank" rel="noopener"><span style="font-weight: 400;">the larger role of PTP for time synchronization</span></a><span style="font-weight: 400;"> in data centers, we need to consider how to address leap seconds within systems that use PTP and are thus much more time-sensitive.</span></p> <h2><span style="font-weight: 400;">Leap second smearing – a solution past its time</span></h2> <p><span style="font-weight: 400;">Leap second smearing, the process of gradually adjusting clock speeds to spread out the correction, has been a common method for handling leap seconds. At Meta, we’ve traditionally focused our smearing effort on <a href="https://engineering.fb.com/2020/03/18/production-engineering/ntp-service/" target="_blank" rel="noopener">NTP</a> since it has been the de facto standard for time synchronization in data centers.</span></p> <p><span style="font-weight: 400;">In large NTP deployments, leap second smearing is generally performed at the</span> <a href="https://engineering.fb.com/2020/03/18/production-engineering/ntp-service/" target="_blank" rel="noopener"><span style="font-weight: 400;">Stratum 2 layer</span></a><span style="font-weight: 400;">, which consists of NTP servers that directly interact with NTP clients (</span><span style="font-weight: 400;">Stratum 3</span><span style="font-weight: 400;">), the downstream users of the NTP service.</span></p> <p><span style="font-weight: 400;"> <img loading="lazy" decoding="async" class="alignnone size-large" src="https://engineering.fb.com/wp-content/uploads/2020/03/Time-Infra-NTP-Service.jpg" width="2000" height="1125" /></span></p> <p><span style="font-weight: 400;">There are multiple approaches to smearing. 
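For intuition, smearing can be sketched as an offset function over a smear window: the fraction of the leap second already applied at a given point in the window. The formulas below are illustrative only, not Meta's exact implementation; the "quadratic" variant shown here is a smoothstep-style ramp that starts and ends with zero slew rate.

```python
# Hypothetical smear profiles spreading a 1-second correction over a window.

def linear_smear(elapsed_s: float, window_s: float) -> float:
    """Fraction of the leap second applied after elapsed_s; constant rate."""
    t = min(max(elapsed_s / window_s, 0.0), 1.0)
    return t

def quadratic_smear(elapsed_s: float, window_s: float) -> float:
    """Smoothstep-style ramp: zero slew rate at both ends of the window."""
    t = min(max(elapsed_s / window_s, 0.0), 1.0)
    return 3 * t**2 - 2 * t**3

# Halfway through a 24-hour window, both profiles have applied half the
# second, but their instantaneous slew rates differ.
window = 24 * 3600.0
print(linear_smear(window / 2, window))     # 0.5
print(quadratic_smear(window / 2, window))  # 0.5
```

The differing instantaneous rates are exactly why fleets mixing the two profiles see transient offsets between hosts during the smearing period.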
In the case of NTP, linear or quadratic smearing formulas can be applied.</span></p> <p><span style="font-weight: 400;">Quadratic smearing is often preferred due to the layered nature of the NTP protocol, where clients are encouraged to dynamically adjust their polling interval as the value of pending correction increases. This solution has its own tradeoffs, such as inconsistent adjustments, which can lead to different offset values across a large server fleet. </span></p> <p><span style="font-weight: 400;">Linear smearing may be superior if an entire fleet is relying on the same time sources and performs smearing at the same time. In combination with more frequent sync cycles of typically once per second, this is a more predictable, precise and reliable approach.</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22216" src="https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?w=1024" alt="" width="1024" height="576" srcset="https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png 2240w, https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?resize=580,326 580w, https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?resize=916,516 916w, https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?resize=768,432 768w, https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?resize=1024,576 1024w, https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?resize=1536,865 1536w, https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?resize=2048,1153 2048w, https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/02/Quadratic-Graph-White.png?resize=192,108 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p> <h2><span style="font-weight: 
400;">Handling leap seconds in PTP</span></h2> <p><span style="font-weight: 400;">In contrast to NTP, which synchronizes at the millisecond level, PTP provides a level of precision typically in the range of nanoseconds. At this level of precision even periodic linear smearing would create too much delta across the fleet and violate guarantees provided to the customers.</span></p> <p><span style="font-weight: 400;">To handle leap seconds in a PTP environment we take an algorithmic approach that shifts time automatically for systems that use PTP and combine this with an emphasis on using</span> <a href="https://en.wikipedia.org/wiki/Coordinated_Universal_Time" target="_blank" rel="noopener"><span style="font-weight: 400;">Coordinated Universal Time (UTC)</span></a><span style="font-weight: 400;"> over International Atomic Time (TAI).</span></p> <h3><span style="font-weight: 400;">Self-smearing</span></h3> <p><span style="font-weight: 400;">At Meta, users interact with the PTP service via the</span> <a href="https://engineering.fb.com/2022/11/21/production-engineering/precision-time-protocol-at-meta/#client" target="_blank" rel="noopener"><span style="font-weight: 400;">fbclock</span></a><span style="font-weight: 400;"> library, which provides a tuple of values, </span><i><span style="font-weight: 400;">{<span style="font-family: 'courier new', courier;">earliest_ns, latest_ns}</span>,</span></i><span style="font-weight: 400;"> which represents a time interval referred to as the</span> <span style="font-weight: 400;">Window of Uncertainty (WOU). Each time the library is called during the smearing period we adjust the return values based on the smearing algorithm, which shifts the time values 1 nanosecond every 62.5 microseconds.</span></p> <p><span style="font-weight: 400;">This approach has a number of advantages, including being completely stateless and reproducible. 
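A minimal sketch of this self-smearing computation might look like the following. The function name and structure are assumptions for illustration, not fbclock's actual API; only the 1-nanosecond-per-62.5-microsecond rate comes from the description above.

```python
# Stateless self-smearing: the applied shift is purely a function of time
# elapsed since the smear start, so any host (even after a reboot) computes
# the same position.

STEP_NS = 62_500          # 62.5 microseconds, expressed in nanoseconds
FULL_SECOND_NS = 1_000_000_000

def smear_offset_ns(now_ns: int, smear_start_ns: int) -> int:
    """Nanoseconds already smeared at now_ns, capped at one full second."""
    if now_ns <= smear_start_ns:
        return 0
    elapsed = now_ns - smear_start_ns
    return min(elapsed // STEP_NS, FULL_SECOND_NS)

# At this rate, smearing a full second takes 1e9 steps of 62.5 us each,
# i.e., 62,500 seconds (roughly 17.4 hours), and 60 seconds of elapsed
# time corresponds to a shift of 960,000 ns, just under 1 ms.
```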
The service continues to utilize TAI timestamps but can return UTC timestamps to clients via the API. And, as the start time is determined by</span> <a href="https://engineering.fb.com/2020/03/18/production-engineering/ntp-service/" target="_blank" rel="noopener"><span style="font-weight: 400;">tzdata</span></a><span style="font-weight: 400;"> timestamps, the current smearing position can be determined even after a server is rebooted.</span></p> <p><span style="font-weight: 400;">This approach does come with some tradeoffs. For example, as the leap smearing strategy differs between the NTP (quadratic) and PTP (linear) ecosystems, services may struggle to match timestamps acquired from different sources during the smearing period. </span></p> <p><span style="font-weight: 400;">The difference between the two approaches can exceed 100 microseconds, creating challenges for services that consume time from both systems.</span></p> <h3><span style="font-weight: 400;">TAI over UTC</span></h3> <p><span style="font-weight: 400;">The smearing strategy we implemented in our fbclock library shows good performance. However, it still introduces significant time deltas between multiple hosts during the smearing period, despite being fully stateless and using small (1 nanosecond) and fixed step sizes. </span></p> <p><span style="font-weight: 400;">Another significant drawback comes from periodically running jobs. Smearing time means our scheduling is off by close to 1 millisecond after 60 seconds for services that run at precise intervals. 
</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22215" src="https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png?w=1024" alt="" width="1024" height="617" srcset="https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png 2240w, https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png?resize=916,552 916w, https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png?resize=768,463 768w, https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png?resize=1024,617 1024w, https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png?resize=1536,925 1536w, https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png?resize=2048,1233 2048w, https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png?resize=96,58 96w, https://engineering.fb.com/wp-content/uploads/2025/02/Intended-Graph-White.png?resize=192,116 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p> <p><span style="font-weight: 400;">This is not ideal for a service that guarantees nanosecond-level accuracy and precision.</span></p> <p><span style="font-weight: 400;">As a result, we recommend that customers use TAI over UTC and thus avoid having to deal with the leap seconds. Unfortunately, though, in most cases, the conversion to UTC is still required and eventually has to be performed somewhere.</span></p> <h2><span style="font-weight: 400;">PTP without leap seconds</span></h2> <p><span style="font-weight: 400;">At Meta, we support the recent push to</span><a href="https://www.nytimes.com/2022/11/19/science/time-leap-second-bipm.html"> <span style="font-weight: 400;">freeze any new leap seconds after 2035</span></a><span style="font-weight: 400;">. If we can cease the introduction of new leap seconds, then the entire industry can rely on UTC instead of TAI for higher precision timekeeping. 
This will simplify infrastructure and remove the need for different smearing solutions.</span></p> <p><span style="font-weight: 400;">Ultimately, a future without leap seconds is one where we can push systems to greater levels of timekeeping precision more easily and efficiently.</span></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/02/03/production-engineering/how-precision-time-protocol-ptp-handles-leap-seconds/">How Precision Time Protocol handles leap seconds</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></content:encoded> <post-id xmlns="com-wordpress:feed-additions:1">22212</post-id> </item> <item> <title>Bringing Jetpack Compose to Instagram for Android</title> <link>https://engineering.fb.com/2025/01/24/android/bringing-jetpack-compose-to-instagram-for-android/</link> <dc:creator><![CDATA[]]></dc:creator> <pubDate>Fri, 24 Jan 2025 17:30:53 +0000</pubDate> <category><![CDATA[Android]]></category> <category><![CDATA[Culture]]></category> <category><![CDATA[DevInfra]]></category> <category><![CDATA[Instagram]]></category> <category><![CDATA[Meta Tech Podcast]]></category> <guid isPermaLink="false">https://engineering.fb.com/?p=22155</guid> <description><![CDATA[<p>Introducing a new Android UI framework like Jetpack Compose into an existing app is more complicated than importing some AARs and coding away. What if your app has specific performance goals to meet? What about existing design components, integrations with navigation, and logging frameworks? 
On this episode of the Meta Tech Podcast, Pascal Hartig is [...]</p> <p><a class="btn btn-secondary understrap-read-more-link" href="https://engineering.fb.com/2025/01/24/android/bringing-jetpack-compose-to-instagram-for-android/">Read More...</a></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/01/24/android/bringing-jetpack-compose-to-instagram-for-android/">Bringing Jetpack Compose to Instagram for Android</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></description> <content:encoded><![CDATA[<p>Introducing a new Android UI framework like Jetpack Compose into an existing app is more complicated than importing some AARs and coding away. What if your app has specific performance goals to meet? What about existing design components, integrations with navigation, and logging frameworks?</p> <p>On this episode of the Meta Tech Podcast, <a href="https://www.threads.net/@passy_" target="_blank" rel="noopener">Pascal Hartig</a> is joined by Summer, a software engineer whose team handles large-scale migrations for Instagram. Summer walks through the various thoughtful and intricate phases that Instagram goes through to ensure that developers have the best possible experience when working on our codebases. 
She also discusses balancing all of this with Meta’s infrastructure teams, who have to maintain multiple implementations at once.</p> <p>Learn how Meta approaches the rollout of a new framework and more!</p> <p>Download or listen to the podcast episode below:</p> <p><iframe loading="lazy" style="border: none;" title="Libsyn Player" src="//html5-player.libsyn.com/embed/episode/id/34599290/height/90/theme/custom/thumbnail/yes/direction/forward/render-playlist/no/custom-color/000000/" width="100%" height="90" scrolling="no" allowfullscreen="allowfullscreen"></iframe><br /> You can also find the episode wherever you get your podcasts, including:</p> <ul> <li><a href="https://open.spotify.com/episode/3tKuh8iNENKMugNdZ0jTqU" target="_blank" rel="noopener">Spotify</a></li> <li><a href="https://podcasts.apple.com/us/podcast/jetpack-compose-at-meta/id1370910331?i=1000681564997" target="_blank" rel="noopener">Apple Podcasts</a></li> <li><a href="https://pca.st/as8y2yqo" target="_blank" rel="noopener">Pocket Casts</a></li> <li><a href="https://overcast.fm/login" target="_blank" rel="noopener">Overcast</a></li> </ul> <p>The <a href="https://insidefacebookmobile.libsyn.com/" target="_blank" rel="noopener">Meta Tech Podcast</a> is a podcast, brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.</p> <p>Send us feedback on <a href="https://instagram.com/metatechpod" target="_blank" rel="noopener">Instagram</a>, <a href="https://threads.net/@metatechpod" target="_blank" rel="noopener">Threads</a>, or <a href="https://twitter.com/metatechpod" target="_blank" rel="noopener">X</a>.</p> <p>And if you’re interested in learning more about career opportunities at Meta visit the <a href="https://www.metacareers.com/?ref=engineering.fb.com" target="_blank" rel="noopener">Meta Careers</a> page.</p> <p>The post <a rel="nofollow" 
href="https://engineering.fb.com/2025/01/24/android/bringing-jetpack-compose-to-instagram-for-android/">Bringing Jetpack Compose to Instagram for Android</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></content:encoded> <post-id xmlns="com-wordpress:feed-additions:1">22155</post-id> </item> <item> <title>How Meta discovers data flows via lineage at scale</title> <link>https://engineering.fb.com/2025/01/22/security/how-meta-discovers-data-flows-via-lineage-at-scale/</link> <dc:creator><![CDATA[]]></dc:creator> <pubDate>Thu, 23 Jan 2025 05:00:45 +0000</pubDate> <category><![CDATA[Security]]></category> <guid isPermaLink="false">https://engineering.fb.com/?p=22168</guid> <description><![CDATA[<p>Data lineage is an instrumental part of Meta’s Privacy Aware Infrastructure (PAI) initiative, a suite of technologies that efficiently protect user privacy. It is a critical and powerful tool for scalable discovery of relevant data and data flows, which supports privacy controls across Meta’s systems. This allows us to verify that our users’ everyday interactions [...]</p> <p><a class="btn btn-secondary understrap-read-more-link" href="https://engineering.fb.com/2025/01/22/security/how-meta-discovers-data-flows-via-lineage-at-scale/">Read More...</a></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/01/22/security/how-meta-discovers-data-flows-via-lineage-at-scale/">How Meta discovers data flows via lineage at scale</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></description> <content:encoded><![CDATA[<ul> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Data lineage is an instrumental part of Meta’s Privacy Aware Infrastructure (PAI) initiative, a suite of technologies that efficiently protect user privacy. 
It is a critical and powerful tool for scalable discovery of relevant data and data flows, which supports privacy controls across Meta’s systems. This allows us to verify that our users’ everyday interactions are protected across our family of apps, such as their religious views in the Facebook Dating app, the example we’ll walk through in this post.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">In order to build high-quality data lineage, we developed different techniques to collect data flow signals across different technology stacks: static code analysis for different languages, runtime instrumentation, and input and output data matching, etc. We then built an intuitive UX into our tooling that enables developers to effectively consume all of this lineage data in a systematic way, saving significant engineering time for building privacy controls. </span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">As we expanded PAI across Meta, </span><a href="#learnings"><span style="font-weight: 400;">we gained valuable insights</span></a><span style="font-weight: 400;"> about the data lineage space. Our understanding of the privacy space evolved, revealing the need for early focus on data lineage, tooling, a cohesive ecosystem of libraries, and more. These initiatives have assisted in accelerating the development of data lineage and implementing purpose limitation controls more quickly and efficiently.</span></li> </ul> <p><span style="font-weight: 400;">At Meta, we believe that privacy enables product innovation. 
This belief has led us to developing </span><a href="https://engineering.fb.com/2024/08/27/security/privacy-aware-infrastructure-purpose-limitation-meta/" target="_blank" rel="noopener"><span style="font-weight: 400;">Privacy Aware Infrastructure (PAI)</span></a><span style="font-weight: 400;">, which offers efficient and reliable first-class privacy constructs embedded in Meta infrastructure to address different privacy requirements, such as </span><a href="https://engineering.fb.com/2024/08/27/security/privacy-aware-infrastructure-purpose-limitation-meta/" target="_blank" rel="noopener"><span style="font-weight: 400;">purpose limitation</span></a><span style="font-weight: 400;">, which restricts the purposes for which data can be processed and used. </span></p> <p><span style="font-weight: 400;">In this blog, we will delve into an early stage in PAI implementation: </span><i><span style="font-weight: 400;">data lineage</span></i><span style="font-weight: 400;">. Data lineage refers to the process of tracing the journey of data as it moves through various systems, illustrating how data transitions from one data asset, such as a database table (the </span><i><span style="font-weight: 400;">source</span></i><span style="font-weight: 400;"> asset), to another (the </span><i><span style="font-weight: 400;">sink</span></i><span style="font-weight: 400;"> asset). We’ll also walk through how we track the lineage of users’ “religion” information in our Facebook Dating app.</span></p> <p><span style="font-weight: 400;">Millions of data assets are vital for supporting our product ecosystem, ensuring the functionality our users anticipate, maintaining high product quality, and safeguarding user safety and integrity. Data lineage enables us to efficiently navigate these assets and protect user data. 
It enhances the traceability of data flows within systems, ultimately empowering developers to swiftly implement privacy controls and create innovative products.</span></p> <p><span style="font-weight: 400;">Note that data lineage is dependent on having already completed important and complex preliminary steps to inventory, schematize, and annotate data assets into a unified asset catalog. This took Meta multiple years to complete across our millions of disparate data assets, and we’ll cover each of these more deeply in future blog posts:</span></p> <ul> <li><strong>Inventorying</strong> involves collecting various code and data assets (e.g., web endpoints, data tables, AI models) used across Meta.</li> <li><strong>Schematization</strong> expresses data assets in structural detail (e.g., indicating that a data asset has a field called “religion”).</li> <li><strong>Annotation</strong> labels data to describe its content (e.g., specifying that the identity column contains religion data).</li> </ul> <h2><span style="font-weight: 400;">Understanding data lineage at Meta</span></h2> <p><span style="font-weight: 400;">To establish robust privacy controls, an essential part of our PAI initiative is to understand how data flows across different systems. 
Data lineage is part of this discovery step in the PAI workflow, as shown in the following diagram:</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22176" src="https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png?w=1024" alt="" width="1024" height="217" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png 2710w, https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png?resize=916,194 916w, https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png?resize=768,163 768w, https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png?resize=1024,217 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png?resize=1536,326 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png?resize=2048,435 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png?resize=96,20 96w, https://engineering.fb.com/wp-content/uploads/2025/01/pai_workflow_lineage_s.png?resize=192,41 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p> <p><span style="font-weight: 400;">Data lineage is a key precursor to implementing Policy Zones, our information flow control technology, because it answers the question, “</span><span style="font-weight: 400;">Where does my data come from and where does it go?” – helping inform the right places to apply privacy controls.</span><span style="font-weight: 400;"> In conjunction with Policy Zones, data lineage provides the following key benefits to thousands of developers at Meta: </span></p> <ul> <li style="font-weight: 400;" aria-level="1"><b>Scalable data flow discovery</b><span style="font-weight: 400;">: Data lineage answers the question above by providing an end-to-end, scalable graph of relevant data flows. 
We can leverage the lineage graphs to visualize and explain the flow of relevant data from the point where it is collected to all the places where it is processed.</span></li> <li style="font-weight: 400;" aria-level="1"><b>Efficient rollout of privacy controls</b><span style="font-weight: 400;">: By leveraging data lineage to track data flows, we can easily pinpoint the optimal integration points for privacy controls like Policy Zones within the codebase, streamlining the rollout process. Thus we have developed a powerful flow discovery tool as part of our PAI tool suite, Policy Zone Manager (PZM), based on data lineage. PZM enables developers to rapidly identify multiple downstream assets from a set of sources simultaneously, thereby accelerating the rollout process of privacy controls.</span></li> <li style="font-weight: 400;" aria-level="1"><b>Continuous compliance verification</b><span style="font-weight: 400;">: Once the privacy requirement has been fully implemented, data lineage plays a vital role in monitoring and validating data flows continuously, in addition to the enforcement mechanisms such as Policy Zones.</span></li> </ul> <p><span style="font-weight: 400;">Traditionally, data lineage has been collected via code inspection using manually authored data flow diagrams and spreadsheets. However, this approach does not scale in large and dynamic environments, such as Meta, with billions of lines of continuously evolving code. 
To tackle this challenge, we’ve developed a robust and scalable lineage solution that uses static code analysis signals as well as runtime signals.</span></p> <h2><span style="font-weight: 400;">Walkthrough: Implementing data lineage for religion data</span></h2> <p><span style="font-weight: 400;">We’ll share how we have automated lineage tracking to identify religion data flows through our core systems, eventually creating an end-to-end, precise view of downstream religion assets being protected, via the following two key stages:</span></p> <ol> <li style="font-weight: 400;" aria-level="1"><b>Collecting data flow signals</b><span style="font-weight: 400;">: a process to capture data flow signals from many processing activities across different systems, not only for religion, but for all other types of data, to create an end-to-end lineage graph. </span></li> <li style="font-weight: 400;" aria-level="1"><b>Identifying relevant data flows</b><span style="font-weight: 400;">: a process to identify the specific subset of data flows (“subgraph”) within the lineage graph that pertains to religion. </span></li> </ol> <p><span style="font-weight: 400;">These stages propagate through various systems including </span><i><span style="font-weight: 400;">function-based systems</span></i><span style="font-weight: 400;"> that load, process, and propagate data through stacks of function calls in different programming languages (e.g., Hack, C++, Python, etc.) such as web systems and backend services, and </span><i><span style="font-weight: 400;">batch-processing systems</span></i><span style="font-weight: 400;"> that process data rows in batch (mainly via SQL) such as data warehouse and AI systems. 
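The two stages above can be illustrated with a small sketch: collected data flow signals form a directed graph of asset-to-asset edges, and the religion-relevant subgraph is simply everything reachable from the annotated source assets. The asset names and edge list below are invented for illustration, not Meta's actual asset catalog.

```python
# Hypothetical lineage graph built from collected data flow signals.
from collections import defaultdict, deque

edges = [
    ("dating_profile_endpoint", "dating_profiles_table"),
    ("dating_profiles_table", "match_candidates_table"),
    ("match_candidates_table", "match_ranking_model"),
    ("unrelated_table", "other_pipeline"),  # not connected to religion data
]

graph = defaultdict(list)
for src, sink in edges:
    graph[src].append(sink)

def downstream_assets(sources):
    """BFS over the lineage graph: every asset reachable from the sources."""
    seen, queue = set(sources), deque(sources)
    while queue:
        asset = queue.popleft()
        for nxt in graph[asset]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Stage 2: the religion subgraph is whatever is reachable from the
# annotated source asset(s).
religion_flows = downstream_assets({"dating_profile_endpoint"})
```

In this toy graph, `religion_flows` contains the endpoint, both tables, and the model, while the unrelated assets stay outside the subgraph, which is what lets privacy controls target only the relevant flows.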
</span></p> <p><span style="font-weight: 400;">For simplicity, we will demonstrate these for the web, the data warehouse, and AI, per the diagram below.</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22187" src="https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png?w=1024" alt="" width="1024" height="559" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png 2743w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png?resize=916,500 916w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png?resize=768,419 768w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png?resize=1024,559 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png?resize=1536,838 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png?resize=2048,1118 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png?resize=96,52 96w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_e2e_s_updated.png?resize=192,105 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p> <h3><span style="font-weight: 400;">Collecting data flow signals for the web system</span></h3> <p><span style="font-weight: 400;">When setting up a profile on the Facebook Dating app, people can populate their religious views. This information is then utilized to identify relevant matches with other people who have specified matched values in their dating preferences. On Dating, religious views are subject to purpose limitation requirements, for example, </span><a href="https://about.fb.com/news/2020/10/privacy-matters-facebook-dating/"><span style="font-weight: 400;">they will not be used to personalize experiences on other Facebook Products</span></a><span style="font-weight: 400;">. 
</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22169" src="https://engineering.fb.com/wp-content/uploads/2025/01/dating_religion_s.png?w=1024" alt="" width="1024" height="651" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/dating_religion_s.png 1964w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_religion_s.png?resize=916,582 916w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_religion_s.png?resize=768,488 768w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_religion_s.png?resize=1024,651 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_religion_s.png?resize=1536,976 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_religion_s.png?resize=96,61 96w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_religion_s.png?resize=192,122 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p> <p><span style="font-weight: 400;">We start with someone entering their religion information on their dating media profile using their mobile device, which is then transmitted to a web endpoint. 
The web endpoint subsequently logs the data into a logging table and stores it in a database, as depicted in the following code snippet:</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22191" src="https://engineering.fb.com/wp-content/uploads/2025/01/dating_profile_endpoint_code.png?w=895" alt="" width="895" height="1024" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/dating_profile_endpoint_code.png 1504w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_profile_endpoint_code.png?resize=801,916 801w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_profile_endpoint_code.png?resize=768,878 768w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_profile_endpoint_code.png?resize=895,1024 895w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_profile_endpoint_code.png?resize=1343,1536 1343w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_profile_endpoint_code.png?resize=96,110 96w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_profile_endpoint_code.png?resize=192,220 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p> <p><span style="font-weight: 400;">Now let’s see how we collect lineage signals. To do this, we employ both static and runtime analysis tools to discover data flows, focusing in particular on where religion is logged and stored. </span><span style="font-weight: 400;">Combining the two enhances our ability to accurately track and manage data flows.</span></p> <p><a href="https://engineering.fb.com/2021/10/20/security/static-analysis-award/"><span style="font-weight: 400;">Static analysis tools</span></a><span style="font-weight: 400;"> simulate code execution to map out data flows within our systems. They also emit quality signals that indicate how confident we are that a given data flow signal is a true positive.
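</span></p> <p><span style="font-weight: 400;">As a deliberately tiny illustration of the source-to-sink reasoning such a tool performs, consider propagating a “taint” through straight-line assignments; the mini-language, variable names, and program below are hypothetical and not Meta code:</span></p>

```python
# Toy flavor of static data-flow analysis: propagate "taint" from a source
# variable through assignments and report which sinks it reaches.
# The mini-program and every name in it are hypothetical.
PROGRAM = [
    ("assign", "religion", "request.religion"),  # source of sensitive data
    ("assign", "payload", "religion"),           # copy keeps the taint
    ("assign", "count", "num_matches"),          # unrelated value, untainted
    ("sink", "logger.log", "payload"),           # tainted value reaches a sink
    ("sink", "logger.log", "count"),             # untainted value, no flow
]

def tainted_sinks(source_var="request.religion"):
    """Return (sink, variable) pairs that receive data derived from the source."""
    tainted = {source_var}
    flows = []
    for op, lhs, rhs in PROGRAM:
        if op == "assign" and rhs in tainted:
            tainted.add(lhs)
        elif op == "sink" and rhs in tainted:
            flows.append((lhs, rhs))
    return flows
```

<p><span style="font-weight: 400;">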
However, these tools are limited by their lack of access to runtime data, which can lead to false positives from unexecuted code.</span></p> <p><span style="font-weight: 400;">To address this limitation, we utilize </span><b>Privacy Probes</b><span style="font-weight: 400;">, a key component of our PAI lineage technologies. Privacy Probes automate data flow discovery by collecting runtime signals. These signals are gathered in real time during the execution of requests, allowing us to trace the flow of data into loggers, databases, and other services. </span></p> <p><span style="font-weight: 400;">We have instrumented Meta’s core data frameworks and libraries at both the data origin points (sources) and their eventual outputs (sinks), such as logging framework, which allows for comprehensive data flow tracking. This approach is exemplified in the following code snippet:</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22192" src="https://engineering.fb.com/wp-content/uploads/2025/01/endpoint_controller_base_code.png?w=1024" alt="" width="1024" height="791" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/endpoint_controller_base_code.png 1404w, https://engineering.fb.com/wp-content/uploads/2025/01/endpoint_controller_base_code.png?resize=916,707 916w, https://engineering.fb.com/wp-content/uploads/2025/01/endpoint_controller_base_code.png?resize=768,593 768w, https://engineering.fb.com/wp-content/uploads/2025/01/endpoint_controller_base_code.png?resize=1024,791 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/endpoint_controller_base_code.png?resize=96,74 96w, https://engineering.fb.com/wp-content/uploads/2025/01/endpoint_controller_base_code.png?resize=192,148 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><br /> <span style="font-weight: 400;">During runtime execution, Privacy Probes does the following:</span></p> <ol> <li style="font-weight: 400;" aria-level="1"><b>Capturing payloads</b><span 
style="font-weight: 400;">: It captures source and sink payloads in memory on a sampled basis, along with supplementary metadata such as event timestamps, asset identifiers, and stack traces as evidence for the data flow. </span></li> <li style="font-weight: 400;" aria-level="1"><b>Comparing payloads</b><span style="font-weight: 400;">: It then compares the source and sink payloads within a request to identify data matches, which helps in understanding how data flows through the system. </span></li> <li style="font-weight: 400;" aria-level="1"><b>Categorizing results</b><span style="font-weight: 400;">: It categorizes results into two sets. The </span><i><span style="font-weight: 400;">match-set</span></i><span style="font-weight: 400;"> includes pairs of source and sink assets where the data matches exactly or one value is contained by the other, providing high-confidence evidence of data flow between the assets. The </span><i><span style="font-weight: 400;">full-set</span></i><span style="font-weight: 400;"> includes all source and sink pairs within a request, regardless of whether the sink is tainted by the source. The full-set is a superset of the match-set; it is noisier, but still important to send to human reviewers, since it may contain transformed data flows.
</span></li> </ol> <p><span style="font-weight: 400;">The above procedure is depicted in the diagram below:</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22177" src="https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png?w=1024" alt="" width="1024" height="378" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png 2216w, https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png?resize=916,338 916w, https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png?resize=768,283 768w, https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png?resize=1024,378 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png?resize=1536,566 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png?resize=2048,755 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png?resize=96,35 96w, https://engineering.fb.com/wp-content/uploads/2025/01/probes_match_s.png?resize=192,71 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p> <p><span style="font-weight: 400;">Let’s look at the following examples, where various religions are received in an endpoint and various values (copied or transformed) are logged in three different loggers:</span></p> <table border="1"> <tbody> <tr> <td bgcolor="#a0d9f7"><b>Input Value (source)</b></td> <td bgcolor="#a0d9f7"><b>Output Value (sink)</b></td> <td bgcolor="#a0d9f7"><b>Data Operation</b></td> <td bgcolor="#a0d9f7"><b>Match Result</b></td> <td bgcolor="#a0d9f7"><b>Flow Confidence</b></td> </tr> <tr> <td><span style="font-weight: 400;">“Atheist”</span></td> <td><span style="font-weight: 400;">“Atheist”</span></td> <td><span style="font-weight: 400;">Data Copy</span></td> <td><span style="font-weight: 400;">EXACT_MATCH</span></td> <td><span style="font-weight: 400;">HIGH</span></td> </tr> <tr> <td><span style="font-weight: 400;">“Buddhist”</span></td>
<td><span style="font-weight: 400;">{metadata: {religion: Buddhist}}</span></td> <td><span style="font-weight: 400;">Substring</span></td> <td><span style="font-weight: 400;">CONTAINS</span></td> <td><span style="font-weight: 400;">HIGH</span></td> </tr> <tr> <td><span style="font-weight: 400;">{religions:<br /> </span><span style="font-weight: 400;">[“Catholic”, “Christian”]}</span></td> <td><span style="font-weight: 400;">{count : 2}</span></td> <td><span style="font-weight: 400;">Transformed</span></td> <td><span style="font-weight: 400;">NO_MATCH</span></td> <td><span style="font-weight: 400;">LOW</span></td> </tr> </tbody> </table> <p><span style="font-weight: 400;"><br /> In the examples above, the first row shows an exact copy of the source value in the sink, and the second shows the source value contained within the sink payload; both therefore belong to the high-confidence match-set. The third row depicts a transformed data flow, where the input list is reduced to a count of values before being logged, so it belongs only to the full-set.
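</span></p> <p><span style="font-weight: 400;">The matching logic behind the table above can be sketched roughly as follows (a simplification: the production system also handles sampling, nested payloads, and normalization):</span></p>

```python
import json

# Sketch of Privacy Probes' payload comparison: classify how a source payload
# relates to a sink payload. Simplified; the production system also handles
# sampling, nested structures, and normalization.
def match_result(source, sink):
    src, snk = json.dumps(source), json.dumps(sink)
    if src == snk:
        return "EXACT_MATCH"  # data copied verbatim: high-confidence match-set
    if src.strip('"') in snk:
        return "CONTAINS"     # source embedded in sink: high-confidence match-set
    return "NO_MATCH"         # possibly a transformed flow: full-set only
```

<p><span style="font-weight: 400;">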
</span></p> <p><span style="font-weight: 400;">These signals together are used to construct a lineage graph to understand the flow of data through our web system as shown in the following diagram:</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22174" src="https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png?w=1024" alt="" width="1024" height="312" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png 2437w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png?resize=916,279 916w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png?resize=768,234 768w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png?resize=1024,312 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png?resize=1536,468 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png?resize=2048,624 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png?resize=96,29 96w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_s.png?resize=192,59 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p> <h3><span style="font-weight: 400;">Collecting data flow signals for the data warehouse system</span></h3> <p><span style="font-weight: 400;">With the user’s religion logged in our web system, it can propagate to the data warehouse for offline processing. To gather data flow signals, we employ a combination of both runtime instrumentation and static code analysis in a different way from the web system. 
The involved SQL queries are logged for data processing activities by the </span><a href="https://research.facebook.com/publications/presto-sql-on-everything/"><span style="font-weight: 400;">Presto</span></a><span style="font-weight: 400;"> and </span><a href="https://spark.apache.org/"><span style="font-weight: 400;">Spark</span></a><span style="font-weight: 400;"> compute engines (among others). Static analysis is then performed for the logged SQL queries and job configs in order to extract data flow signals.</span></p> <p><span style="font-weight: 400;">Let’s examine a simple SQL query example that processes data for the data warehouse as the following:</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22190" src="https://engineering.fb.com/wp-content/uploads/2025/01/dating_data_sql_code.png?w=1024" alt="" width="1024" height="292" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/dating_data_sql_code.png 1404w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_data_sql_code.png?resize=916,261 916w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_data_sql_code.png?resize=768,219 768w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_data_sql_code.png?resize=1024,292 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_data_sql_code.png?resize=96,27 96w, https://engineering.fb.com/wp-content/uploads/2025/01/dating_data_sql_code.png?resize=192,55 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><br /> <span style="font-weight: 400;">We’ve developed a </span><a href="https://engineering.fb.com/2022/11/30/data-infrastructure/static-analysis-sql-queries/" target="_blank" rel="noopener"><span style="font-weight: 400;">SQL analyzer</span></a><span style="font-weight: 400;"> to extract data flow signals between the input table, “safety_log_tbl” and the output table, “safety_training_tbl” as shown in the following diagram. 
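</span></p> <p><span style="font-weight: 400;">As a toy version of this kind of extraction, the sketch below pulls table-level lineage out of a SQL string; a production analyzer parses the full SQL grammar, whereas this regex-based sketch (with illustrative table names) handles only the simplest INSERT INTO ... SELECT shape:</span></p>

```python
import re

# Toy table-level lineage extraction from a SQL string. A real analyzer parses
# the full SQL grammar; this only handles "INSERT INTO ... SELECT ... FROM ...".
def table_lineage(sql):
    output = re.search(r"INSERT\s+INTO\s+(\w+)", sql, re.IGNORECASE)
    inputs = re.findall(r"(?:FROM|JOIN)\s+(\w+)", sql, re.IGNORECASE)
    return {
        "output": output.group(1) if output else None,  # table being written
        "inputs": set(inputs),                          # tables being read
    }
```

<p><span style="font-weight: 400;">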
In practice, we also collect more granular lineage, such as column-level lineage (e.g., “user_id” -> “target_user_id”, “religion” -> “target_religion”).</span></p> <p><span style="font-weight: 400;">There are instances where data is not fully processed by SQL queries, resulting in logs that contain data flow signals for either reads or writes, but not both.</span><span style="font-weight: 400;"> To ensure we have complete lineage data, we leverage contextual information (such as the execution environment and job or trace IDs) collected at runtime to connect these reads and writes together. </span></p> <p><span style="font-weight: 400;">The following diagram illustrates how the lineage graph has expanded:</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22173" src="https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png?w=1024" alt="" width="1024" height="593" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png 2436w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png?resize=916,530 916w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png?resize=768,445 768w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png?resize=1024,593 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png?resize=1536,889 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png?resize=2048,1185 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png?resize=96,56 96w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_s.png?resize=192,111 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p> <h3><span style="font-weight: 400;">Collecting data flow signals for the AI system</span></h3> <p><span style="font-weight: 400;">For our AI systems, we collect lineage signals by
tracking relationships between various assets, such as input datasets, features, models, workflows, and inferences. A common approach is to extract data flows from job configurations used for different AI activities such as model training.</span><span style="font-weight: 400;"><br /> </span><span style="font-weight: 400;"><br /> </span><span style="font-weight: 400;">For instance, in order to improve the relevance of dating matches, we use an AI model to recommend potential matches based on shared religious views from users. Let’s take a look at the following training config example for this model that uses religion data:</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22193" src="https://engineering.fb.com/wp-content/uploads/2025/01/training_config_code.png?w=1024" alt="" width="1024" height="829" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/training_config_code.png 1404w, https://engineering.fb.com/wp-content/uploads/2025/01/training_config_code.png?resize=916,741 916w, https://engineering.fb.com/wp-content/uploads/2025/01/training_config_code.png?resize=768,621 768w, https://engineering.fb.com/wp-content/uploads/2025/01/training_config_code.png?resize=1024,829 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/training_config_code.png?resize=96,78 96w, https://engineering.fb.com/wp-content/uploads/2025/01/training_config_code.png?resize=192,155 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p> <p><span style="font-weight: 400;">By parsing this config obtained from the model training service, we can track the data flow from the input dataset (with asset ID asset://hive.table/dating_training_tbl) and feature (with asset ID asset://ai.feature/DATING_USER_RELIGION_SCORE) to the model (with asset ID asset://ai.model/dating_ranking_model).</span></p> <p><span style="font-weight: 400;">Our AI systems are also instrumented so that asset relationships and data flow signals are captured at various 
points at runtime, including data-loading layers (e.g., </span><a href="https://engineering.fb.com/2022/09/19/ml-applications/data-ingestion-machine-learning-training-meta/"><span style="font-weight: 400;">DPP</span></a><span style="font-weight: 400;">) and libraries (e.g., </span><a href="https://pytorch.org/"><span style="font-weight: 400;">PyTorch</span></a><span style="font-weight: 400;">), workflow engines (e.g., </span><a href="https://engineering.fb.com/2016/05/09/core-infra/introducing-fblearner-flow-facebook-s-ai-backbone/"><span style="font-weight: 400;">FBLearner Flow</span></a><span style="font-weight: 400;">), training frameworks, inference systems (as backend services), etc. Lineage collection for backend services utilizes the approach for function-based systems described above. By matching the source and sink assets for different data flow signals, we are able to capture a holistic lineage graph at the desired granularities:</span></p> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22172" src="https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png?w=1024" alt="" width="1024" height="600" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png 2416w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png?resize=916,537 916w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png?resize=768,450 768w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png?resize=1024,600 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png?resize=1536,900 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png?resize=2048,1200 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png?resize=96,56 96w, https://engineering.fb.com/wp-content/uploads/2025/01/lineage_exp_web_dw_ai_s.png?resize=192,113 192w" 
sizes="auto, (max-width: 992px) 100vw, 62vw" /></p> <h2><span style="font-weight: 400;">Identifying relevant data flows from a lineage graph</span></h2> <p><span style="font-weight: 400;">Now that we have the lineage graph at our disposal, how can we effectively distill the subset of data flows pertinent to a specific privacy requirement for religion data? </span><span style="font-weight: 400;">To address this question, we have developed an iterative analysis tool that enables developers to pinpoint precise data flows and systematically filter out irrelevant ones</span><span style="font-weight: 400;">. The tool kicks off an iterative discovery process, aided by the lineage graph and privacy controls from Policy Zones, to narrow down the most relevant flows. This refined data allows developers to make a final determination about the flows they would like to use, producing an optimal path for traversing the lineage graph. The following are the major steps involved, captured holistically in the diagram below:</span></p> <ol> <li style="font-weight: 400;" aria-level="1"><b>Discover data flows: </b><span style="font-weight: 400;">identify data flows from source assets and stop at downstream assets with low-confidence flows (yellow nodes). </span></li> <li style="font-weight: 400;" aria-level="1"><b>Exclude and include candidates: </b><span style="font-weight: 400;">Developers or automated heuristics exclude candidates (red nodes) that don’t carry religion data and include the remaining ones (green nodes). Excluding the red nodes early prunes all of their downstream assets in a cascaded manner, significantly reducing developer effort.
As an additional safeguard, developers also implement privacy controls via Policy Zones, so all relevant data flows can be captured.</span></li> <li style="font-weight: 400;" aria-level="1"><b>Repeat discovery cycle: </b><span style="font-weight: 400;">use the green nodes as new sources and repeat the cycle until no more green nodes are confirmed. </span></li> </ol> <p><img loading="lazy" decoding="async" class="alignnone size-large wp-image-22170" src="https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png?w=1024" alt="" width="1024" height="711" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png 2232w, https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png?resize=916,636 916w, https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png?resize=768,533 768w, https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png?resize=1024,711 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png?resize=1536,1066 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png?resize=2048,1421 2048w, https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png?resize=96,67 96w, https://engineering.fb.com/wp-content/uploads/2025/01/iterative_tool_s.png?resize=192,133 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /></p> <p><span style="font-weight: 400;">With the collection and data flow identification steps complete, developers are able to successfully locate granular data flows that contain religion across Meta’s complex systems, allowing them to move forward in the PAI workflow to apply necessary privacy controls to safeguard the data. This once-intimidating task has been completed efficiently. </span></p> <p><span style="font-weight: 400;">Our data lineage technology has provided developers with an unprecedented ability to quickly understand and protect religion and similar sensitive data flows. 
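</span></p> <p><span style="font-weight: 400;">The discovery loop described above can be sketched as follows; the graph, confidence labels, and reviewer decisions are hypothetical stand-ins for what the real tool derives from lineage signals and Policy Zones:</span></p>

```python
# Sketch of the iterative flow-discovery loop: expand from confirmed sources,
# auto-include high-confidence flows, and ask a reviewer about the rest.
# The graph, confidence set, and reviewer decisions are hypothetical.
EDGES = {"religion_src": ["tbl_a", "tbl_b"], "tbl_a": ["tbl_c"], "tbl_b": ["tbl_d"]}
HIGH_CONFIDENCE = {"tbl_a"}  # flows trusted without human review

def discover(sources, reviewer):
    """Return downstream assets confirmed to carry the sources' data."""
    confirmed, frontier = set(sources), set(sources)
    while frontier:
        candidates = {n for f in frontier for n in EDGES.get(f, [])} - confirmed
        # Include high-confidence nodes automatically; others need a decision.
        included = {n for n in candidates if n in HIGH_CONFIDENCE or reviewer(n)}
        confirmed |= included
        frontier = included  # excluded nodes prune their entire downstream
    return confirmed - set(sources)
```

<p><span style="font-weight: 400;">Excluding a node (the reviewer returning False) automatically drops everything reachable only through it, which is what makes the cascaded pruning cheap.</span></p> <p><span style="font-weight: 400;">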
It enables Meta to scalably and efficiently implement privacy controls via PAI to protect our users’ privacy and deliver products safely.</span></p> <h2 id="learnings"><span style="font-weight: 400;">Learnings and challenges</span></h2> <p><span style="font-weight: 400;">As we’ve worked to develop and implement lineage as a core PAI technology, we’ve gained valuable insights and overcome significant challenges, yielding some important lessons:</span></p> <ul> <li style="font-weight: 400;" aria-level="1"><b>Focus on lineage early and reap the rewards</b><span style="font-weight: 400;">: As we developed privacy technologies like Policy Zones, it became clear that gaining a deep understanding of data flows across various systems is essential for scaling the implementation of privacy controls. By investing in lineage, we not only accelerated the adoption of Policy Zones but also uncovered new opportunities for applying the technology. Lineage can also be extended to other use cases such as security and integrity.</span></li> <li style="font-weight: 400;" aria-level="1"><b>Build lineage consumption tools to gain engineering efficiency</b><span style="font-weight: 400;">: We initially focused on building a lineage solution but didn’t give sufficient attention to consumption tools for developers. As a result, owners had to use raw lineage signals to discover relevant data flows, which was overwhelmingly complex. We addressed this issue by developing the iterative tooling to guide engineers in discovering relevant data flows, significantly reducing engineering efforts by orders of magnitude.</span></li> <li style="font-weight: 400;" aria-level="1"><b>Integrate lineage with systems to scale the coverage</b><span style="font-weight: 400;">: Collecting lineage from diverse Meta systems was a significant challenge. Initially, we tried to ask every system to collect lineage signals to ingest into the centralized lineage service, but the progress was slow. 
We overcame this by developing reliable, computationally efficient, and widely applicable PAI libraries with built-in lineage collection logic in various programming languages (Hack, C++, Python, etc.). This enabled much smoother integration with a broad range of Meta’s systems.</span></li> <li style="font-weight: 400;" aria-level="1"><b>Measurement improves our outcomes</b><span style="font-weight: 400;">:</span> <span style="font-weight: 400;">By incorporating the measurement of coverage, we’ve been able to evolve our data lineage so that we stay ahead of the ever-changing landscape of data and code at Meta. By enhancing our signals and adapting to new technologies, we can maintain a strong focus on privacy outcomes and drive ongoing improvements in lineage coverage across our tech stacks.</span></li> </ul> <h2><span style="font-weight: 400;">The future of data lineage</span></h2> <p><span style="font-weight: 400;">Data lineage is a vital component of Meta’s PAI initiative, providing a comprehensive view of how data flows across different systems. While we’ve made significant progress in establishing a strong foundation, our journey is ongoing. 
We’re committed to:</span></p> <ul> <li style="font-weight: 400;" aria-level="1"><b>Expanding coverage</b><span style="font-weight: 400;">: continuously enhance the coverage of our data lineage capabilities to ensure a comprehensive understanding of data flows.</span></li> <li style="font-weight: 400;" aria-level="1"><b>Improving consumption experience</b><span style="font-weight: 400;">: streamline the consumption experience to make it easier for developers and stakeholders to access and utilize data lineage information.</span></li> <li style="font-weight: 400;" aria-level="1"><b>Exploring new frontiers</b><span style="font-weight: 400;">: investigate new applications and use cases for data lineage, driving innovation and collaboration across the industry.</span></li> </ul> <p><span style="font-weight: 400;">By advancing data lineage, we aim to foster a culture of privacy awareness and drive progress in the broader fields of study. Together, we can create a more transparent and accountable data ecosystem.</span></p> <h2><span style="font-weight: 400;">Acknowledgements</span></h2> <p><i><span style="font-weight: 400;">The authors would like to acknowledge the contributions of many current and former Meta employees who have played a crucial role in developing data lineage technologies over the years. In particular, we would like to extend special thanks to (in alphabetical order) Amit Jain, Aygun Aydin, Ben Zhang, Brian Romanko, Brian Spanton, Daniel Ramagem, David Molnar, Dzmitry Charnahalau, Gayathri Aiyer, George Stasa, Guoqiang Jerry Chen, Graham Bleaney, Haiyang Han, Howard Cheng, Ian Carmichael, Ibrahim Mohamed, Jerry Pan, Jiang Wu, Jonathan Bergeron, Joanna Jiang, Jun Fang, Kiran Badam, Komal Mangtani, Kyle Huang, Maharshi Jha, Manuel Fahndrich, Marc Celani, Lei Zhang, Mark Vismonte, Perry Stoll, Pritesh Shah, Qi Zhou, Rajesh Nishtala, Rituraj Kirti, Seth Silverman, Shelton Jiang, Sushaant Mujoo, Vlad Fedorov, Yi Huang, Xinbo Gao, and Zhaohui Zhang. 
We would also like to express our gratitude to all reviewers of this post, including (in alphabetical order) Aleksandar Ilic, Avtar Brar, Benjamin Renard, Bogdan Shubravyi, Brianna O’Steen, Chris Wiltz, Daniel Chamberlain, Hannes Roth, Imogen Barnes, Jason Hendrickson, Koosh Orandi, Rituraj Kirti, and Xenia Habekoss. We would like to especially thank Jonathan Bergeron for overseeing the effort and providing all of the guidance and valuable feedback, Supriya Anand for leading the editorial effort to shape the blog content, and Katherine Bates for pulling all required support together to make this blog post happen.</span></i></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/01/22/security/how-meta-discovers-data-flows-via-lineage-at-scale/">How Meta discovers data flows via lineage at scale</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></content:encoded> <post-id xmlns="com-wordpress:feed-additions:1">22168</post-id> </item> <item> <title>Strobelight: A profiling service built on open source technology</title> <link>https://engineering.fb.com/2025/01/21/production-engineering/strobelight-a-profiling-service-built-on-open-source-technology/</link> <dc:creator><![CDATA[]]></dc:creator> <pubDate>Tue, 21 Jan 2025 17:00:54 +0000</pubDate> <category><![CDATA[Open Source]]></category> <category><![CDATA[Production Engineering]]></category> <guid isPermaLink="false">https://engineering.fb.com/?p=22157</guid> <description><![CDATA[<p>We’re sharing details about Strobelight, Meta’s profiling orchestrator. Strobelight combines several technologies, many open source, into a single service that helps engineers at Meta improve efficiency and utilization across our fleet. Using Strobelight, we’ve seen significant efficiency wins, including one that has resulted in an estimated 15,000 servers’ worth of annual capacity savings. 
Strobelight, Meta’s [...]</p> <p><a class="btn btn-secondary understrap-read-more-link" href="https://engineering.fb.com/2025/01/21/production-engineering/strobelight-a-profiling-service-built-on-open-source-technology/">Read More...</a></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/01/21/production-engineering/strobelight-a-profiling-service-built-on-open-source-technology/">Strobelight: A profiling service built on open source technology</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></description> <content:encoded><![CDATA[<ul> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">We’re sharing details about Strobelight, Meta’s profiling orchestrator.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Strobelight combines several technologies, many open source, into a single service that helps engineers at Meta improve efficiency and utilization across our fleet.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Using Strobelight, we’ve seen significant efficiency wins, including one that has resulted in an estimated 15,000 servers’ worth of annual capacity savings.</span></li> </ul> <p><span style="font-weight: 400;">Strobelight, Meta’s profiling orchestrator, is not really one technology. It’s several (many open source) combined to make something that unlocks truly amazing efficiency wins. Strobelight is also not a single profiler but an orchestrator of many different profilers (even ad-hoc ones) that runs on all production hosts at Meta, collecting detailed information about CPU usage, memory allocations, and other performance metrics from running processes. 
Engineers and developers can use this information to identify performance and resource bottlenecks, optimize their code, and improve utilization.</span></p> <p><span style="font-weight: 400;">When you combine talented engineers with rich performance data you can get efficiency wins by both creating tooling to identify issues before they reach production and finding opportunities in already running code. Let’s say an engineer makes a code change that introduces an unintended copy of some large object on a service’s critical path. Meta’s existing tools can identify the issue and query Strobelight data to estimate the impact on compute cost. Then Meta’s code review tool can notify the engineer that they’re about to waste, say, 20,000 servers.</span></p> <p><span style="font-weight: 400;">Of course, static analysis tools can pick up on these sorts of issues, but they are unaware of global compute cost and oftentimes these inefficiencies aren’t a problem until they’re gradually serving millions of requests per minute. The frog can boil slowly.</span></p> <h2><span style="font-weight: 400;">Why do we use profilers?</span></h2> <p><span style="font-weight: 400;">Profilers operate by sampling data to perform statistical analysis. For example, a profiler takes a sample every N events (or milliseconds in the case of time profilers) to understand where that event occurs or what is happening at the moment of that event. With a CPU-cycles event, for example, the profile will be CPU time spent in functions or function call stacks executing on the CPU. This can give an engineer a high-level understanding of the code execution of a service or binary.</span></p> <h2><span style="font-weight: 400;">Choosing your own adventure with Strobelight</span></h2> <p><span style="font-weight: 400;">There are other daemons at Meta that collect observability metrics, but Strobelight’s wheelhouse is software profiling. It connects resource usage to source code (what developers understand best). 
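</span></p>
<p><span style="font-weight: 400;">To make the sampling approach described above concrete, here is a minimal sketch (not Strobelight code; every name in it is invented) of how periodic "what function is on-CPU right now?" samples become a statistical profile:</span></p>

```python
from collections import Counter

def profile(timeline, period):
    """Sample a synthetic execution timeline every `period` ticks.

    `timeline` maps each tick to the function executing at that moment;
    a real profiler gets this from a perf-event interrupt instead.
    """
    samples = Counter(timeline[t] for t in range(0, len(timeline), period))
    total = sum(samples.values())
    # Each sample stands in for `period` ticks, so relative sample counts
    # estimate the share of CPU time spent in each function.
    return {fn: count / total for fn, count in samples.items()}

# 60 ticks of simulated execution: parse() is the hot function.
timeline = ["parse"] * 40 + ["render"] * 15 + ["flush"] * 5
print(profile(timeline, period=4))  # parse gets roughly 2/3 of the samples
```

<p><span style="font-weight: 400;">Sampling every Nth event rather than tracing every call is what keeps the overhead low enough to run on production hosts.</span></p> <p><span style="font-weight: 400;">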
Strobelight’s profilers are often, but not exclusively, built using</span> <a href="https://docs.ebpf.io/" target="_blank" rel="noopener"><span style="font-weight: 400;">eBPF</span></a><span style="font-weight: 400;">, which is a Linux kernel technology. eBPF allows the safe injection of custom code into the kernel, which enables very low overhead collection of different types of data and unlocks so many possibilities in the observability space that it’s hard to imagine how Strobelight would work without it.</span></p> <p><span style="font-weight: 400;">As of the time of writing this, Strobelight has 42 different profilers, including:</span></p> <ul> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Memory profilers powered by</span> <a href="https://github.com/jemalloc/jemalloc" target="_blank" rel="noopener"><span style="font-weight: 400;">jemalloc.</span></a></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Function call count profilers.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Event-based profilers for both native and non-native languages (e.g., Python, Java, and Erlang).</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">AI/GPU profilers.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Profilers that track off-CPU time.</span></li> <li style="font-weight: 400;" aria-level="1"><span style="font-weight: 400;">Profilers that track service request latency.</span></li> </ul> <p><span style="font-weight: 400;">Engineers can utilize any one of these to collect data from servers on demand via Strobelight’s command line tool or web UI.</span></p> <figure id="attachment_22158" aria-describedby="caption-attachment-22158" style="width: 1024px" class="wp-caption alignnone"><img loading="lazy" decoding="async" class="size-large wp-image-22158" 
src="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?w=1024" alt="" width="1024" height="579" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?resize=916,518 916w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?resize=768,435 768w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?resize=1024,579 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?resize=1536,869 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-1.png?resize=192,109 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22158" class="wp-caption-text">The Strobelight web UI.</figcaption></figure> <p><span style="font-weight: 400;">Users also have the ability to set up continuous or “triggered” profiling for any of these profilers by updating a configuration file in Meta’s</span> <a href="https://research.facebook.com/publications/holistic-configuration-management-at-facebook/" target="_blank" rel="noopener"><span style="font-weight: 400;">Configerator</span></a><span style="font-weight: 400;">, allowing them to target their entire service or, for example, only hosts that run in certain regions. 
Users can specify how often these profilers should run, the run duration, the symbolization strategy, the process they want to target, and a lot more.</span></p> <p><span style="font-weight: 400;">Here is an example of a simple configuration for one of these profilers:</span></p> <pre class="line-numbers"><code class="language-none">add_continuous_override_for_offcpu_data(
  "my_awesome_team", // the team that owns this service
  Type.SERVICE_ID,
  "my_awesome_service",
  30_000, // desired samples per hour
)
</code></pre> <p><span style="font-weight: 400;">Why does Strobelight have so many profilers? Because there are so many different things happening in these systems powered by so many different technologies.</span></p> <p><span style="font-weight: 400;">This is also why Strobelight provides ad-hoc profilers. Since the kind of data that can be gathered from a binary is so varied, engineers often need something that Strobelight doesn’t provide out of the box. Adding a new profiler from scratch to Strobelight involves several code changes and could take several weeks to get reviewed and rolled out.</span></p> <p><span style="font-weight: 400;">However, engineers can write a single</span> <a href="https://github.com/bpftrace/bpftrace" target="_blank" rel="noopener"><i><span style="font-weight: 400;">bpftrace</span></i></a><span style="font-weight: 400;"> script (a simple language/tool that allows you to easily write eBPF programs) and tell Strobelight to run it like it would any other profiler. An engineer who really cares about the latency of a particular C++ function, for example, could write up a little bpftrace script, commit it, and have Strobelight run it on any number of hosts throughout Meta’s fleet – all within a matter of hours, if needed.</span></p> <p><span style="font-weight: 400;">If all of this sounds powerfully dangerous, that’s because it is.
However, Strobelight has several safeguards in place to prevent users from causing performance degradation for the targeted workloads and retention issues for the databases Strobelight writes to. Strobelight also has enough awareness to ensure that different profilers don’t conflict with each other. For example, if a profiler is tracking CPU cycles, Strobelight ensures another profiler can’t use another PMU counter at the same time (as there are other services that also use them).</span></p> <p><span style="font-weight: 400;">Strobelight also has concurrency rules and a profiler queuing system. Of course, service owners still have the flexibility to really hammer their machines if they want to extract a lot of data to debug.</span></p> <h2><span style="font-weight: 400;">Default data for everyone</span></h2> <p><span style="font-weight: 400;">Since its inception, one of Strobelight’s core principles has been to provide automatic, regularly collected profiling data for all of Meta’s services. It’s like a flight recorder – something that doesn’t have to be thought about until it’s needed. What’s worse than waking up to an alert that a service is unhealthy and there is no data as to why?</span></p> <p><span style="font-weight: 400;">For that reason, Strobelight has a handful of curated profilers that are configured to run automatically on every Meta host. They’re not running all the time; that would be “bad” and not really “profiling.” Instead, they have custom run intervals and sampling rates specific to the workloads running on the host. This provides just the right amount of data without impacting the profiled services or overburdening the systems that store Strobelight data.</span></p> <p><span style="font-weight: 400;">Here is an example:</span></p> <p><span style="font-weight: 400;">A service named Soft Server runs on 1,000 hosts, and let’s say we want profiler A to gather 40,000 CPU-cycles samples per hour for this service (remember the config above).
Strobelight, knowing how many hosts Soft Server runs on, but not how CPU intensive it is, will start with a conservative run probability, which is a sampling mechanism to prevent bias (e.g., profiling these hosts at noon every day would hide traffic patterns).</span></p> <p><span style="font-weight: 400;">The next day Strobelight will look at how many samples it was able to gather for this service and then automatically tune the run probability (with some very simple math) to try to hit 40,000 samples per hour. We call this dynamic sampling and Strobelight does this readjustment every day for every service at Meta.</span></p> <p><span style="font-weight: 400;">And if there is more than one service running on the host (excluding daemons like systemd or Strobelight) then Strobelight will default to using the configuration that will yield more samples for both.</span></p> <p><span style="font-weight: 400;">Hang on, hang on. If the run probability or sampling rate is different depending on the host for a service, then how can the data be aggregated or compared across the hosts? And how can profiling data for multiple services be compared?</span></p> <p><span style="font-weight: 400;">Since Strobelight is aware of all these different knobs for profile tuning, it adjusts the “weight” of a profile sample when it’s logged. A sample’s weight is used to normalize the data and prevent bias when analyzing or viewing this data in aggregate. So even if Strobelight is profiling Soft Server less often on one host than on another, the samples can be accurately compared and grouped. 
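</span></p>
<p><span style="font-weight: 400;">As a rough illustration (an invented helper, not Strobelight’s actual tuning code), the daily readjustment and the sample weighting could look like this:</span></p>

```python
def retune(run_probability, samples_collected, target_per_hour):
    """Daily dynamic-sampling adjustment: scale the run probability
    toward the per-hour sample target, clamped to a valid probability."""
    if samples_collected == 0:
        return min(1.0, run_probability * 2)  # creep up when no samples arrived
    return min(1.0, run_probability * target_per_hour / samples_collected)

def sample_weight(run_probability):
    """Rarely profiled hosts produce heavier samples, so aggregates stay
    unbiased when hosts with different probabilities are combined."""
    return 1.0 / run_probability

# Day 1: a conservative 10% run probability only yielded 4,000 samples/hour,
# so the next day's probability is scaled up toward the 40,000-sample target.
print(retune(0.10, samples_collected=4_000, target_per_hour=40_000))  # 1.0
print(sample_weight(0.10))  # each day-1 sample counts as 10 ordinary samples
```

<p><span style="font-weight: 400;">The point is that the weight travels with each logged sample, so later aggregation stays honest no matter how the knobs were set.</span></p> <p><span style="font-weight: 400;">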
This also works for comparing two different services since Strobelight is used both by service owners looking at their specific service as well as efficiency experts who look for “horizontal” wins across the fleet in shared libraries.</span></p> <h2><span style="font-weight: 400;">How Strobelight saves capacity</span></h2> <p><span style="font-weight: 400;">There are two default continuous profilers that should be called out because of how much they end up saving in capacity.</span></p> <h3><span style="font-weight: 400;">The last branch record (LBR) profiler </span></h3> <p><span style="font-weight: 400;">The LBR profiler, true to its name, is used to sample</span> <a href="https://lwn.net/Articles/680985/" target="_blank" rel="noopener"><span style="font-weight: 400;">last branch records</span></a><span style="font-weight: 400;"> (a hardware feature that started on Intel). The data from this profiler doesn’t get visualized but instead is fed into Meta’s feedback directed optimization (FDO) pipeline. This data is used to create FDO profiles that are consumed at compile time (</span><a href="https://ieeexplore.ieee.org/document/10444807" target="_blank" rel="noopener"><span style="font-weight: 400;">CSSPGO</span></a><span style="font-weight: 400;">) and post-compile time (</span><a href="https://research.facebook.com/publications/bolt-a-practical-binary-optimizer-for-data-centers-and-beyond/"><span style="font-weight: 400;">BOLT</span></a><span style="font-weight: 400;">) to speed up binaries through the added knowledge of runtime behavior. Meta’s top 200 largest services all have FDO profiles from the LBR data gathered continuously across the fleet. 
Some of these services see up to 20% reduction in CPU cycles, which equates to a 10-20% reduction in the number of servers needed to run these services at Meta.</span></p> <h3><span style="font-weight: 400;">The event profiler</span></h3> <p><span style="font-weight: 400;">The second profiler is Strobelight’s event profiler. This is Strobelight’s version of the Linux perf tool. Its primary job is to collect user and kernel stack traces from multiple performance (perf) events e.g., CPU-cycles, L3 cache misses, instructions, etc. Not only is this data looked at by individual engineers to understand what the hottest functions and call paths are, but this data is also fed into monitoring and testing tools to identify regressions; ideally </span><i><span style="font-weight: 400;">before</span></i><span style="font-weight: 400;"> they hit production.</span></p> <h2><span style="font-weight: 400;">Did someone say Meta…data?</span></h2> <p><span style="font-weight: 400;">Looking at function call stacks with</span> <a href="https://www.brendangregg.com/flamegraphs.html" target="_blank" rel="noopener"><span style="font-weight: 400;">flame graphs</span></a><span style="font-weight: 400;"> is great, nothing against it. But a service owner looking at call stacks from their service, which imports many libraries and utilizes Meta’s software frameworks, will see a lot of “foreign” functions. Also, what about finding just the stacks for p99 latency requests? Or how about all the places where a service is making an unintended string copy?</span></p> <h3><span style="font-weight: 400;">Stack schemas</span></h3> <p><span style="font-weight: 400;">Strobelight has multiple mechanisms for enhancing the data it produces according to the needs of its users. 
One such mechanism is called Stack Schemas (inspired by</span> <a href="https://learn.microsoft.com/en-us/windows-hardware/test/wpt/stack-tags" target="_blank" rel="noopener"><span style="font-weight: 400;">Microsoft’s stack tags</span></a><span style="font-weight: 400;">), which is a small DSL that operates on call stacks and can be used to add tags (strings) to entire call stacks or individual frames/functions. These tags can then be utilized in our visualization tool. Stack Schemas can also remove functions users don’t care about with regex matching. Any number of schemas can be applied on a per-service or even per-profile basis to customize the data.</span></p> <p><span style="font-weight: 400;">There are even folks who create dashboards from this metadata to help other engineers identify expensive copying, use of inefficient or inappropriate C++ containers, overuse of smart pointers, and much more. Static analysis tools that can do this have been around for a long time, but they can’t pinpoint the really painful or computationally expensive instances of these issues across a large fleet of machines.</span></p> <h3><span style="font-weight: 400;">Strobemeta</span></h3> <p><span style="font-weight: 400;">Strobemeta is another mechanism, which utilizes thread local storage, to attach bits of dynamic metadata at runtime to call stacks that we gather in the event profiler (and others). This is one of the biggest advantages of building profilers using eBPF: complex and customized actions taken at sample time. Collected Strobemeta is used to attribute call stacks to specific service endpoints, or request latency metrics, or request identifiers. 
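</span></p>
<p><span style="font-weight: 400;">As a toy illustration in the spirit of Stack Schemas (a hypothetical rule format, not the real DSL), tagging and pruning a call stack with regex rules might look like this:</span></p>

```python
import re

def apply_schema(stack, tag_rules, strip_patterns):
    """Tag and prune one call stack (outermost frame first).

    tag_rules:      {tag: regex}; the tag applies if any frame matches.
    strip_patterns: frames matching any of these regexes are dropped.
    """
    tags = {tag for tag, pattern in tag_rules.items()
            if any(re.search(pattern, frame) for frame in stack)}
    pruned = [frame for frame in stack
              if not any(re.search(p, frame) for p in strip_patterns)]
    return pruned, tags

stack = ["main", "folly::detail::run", "MyService::handleRequest",
         "std::string::string"]
pruned, tags = apply_schema(
    stack,
    tag_rules={"string-copy": r"std::string::string"},
    strip_patterns=[r"^folly::detail::"],  # hide framework internals
)
print(pruned)  # ['main', 'MyService::handleRequest', 'std::string::string']
print(tags)    # {'string-copy'}
```

<p><span style="font-weight: 400;">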
Again, this allows engineers and tools to do more complex filtering to focus the vast amounts of data that Strobelight profilers produce.</span></p> <h2><span style="font-weight: 400;">Symbolization</span></h2> <p><span style="font-weight: 400;">Now is a good time to talk about symbolization: taking the virtual address of an instruction, converting it into an actual symbol (function) name, and, depending on the symbolization strategy, also getting the function’s source file, line number, and type information.</span></p> <p><span style="font-weight: 400;">Most of the time getting the whole enchilada means using a binary’s DWARF debug info. But this can be many megabytes (or even gigabytes) in size because DWARF debug data contains much more than the symbol information.</span></p> <p><span style="font-weight: 400;">This data needs to be downloaded then parsed. But attempting this while profiling, or even afterwards on the same host where the profile is gathered, is far too computationally expensive. Even with optimal caching strategies it can cause memory issues for the host’s workloads.</span></p> <p><span style="font-weight: 400;">Strobelight gets around this problem via a symbolization service that utilizes several open source technologies including DWARF, ELF,</span> <a href="https://github.com/YtnbFirewings/gsym" target="_blank" rel="noopener"><span style="font-weight: 400;">gsym</span></a><span style="font-weight: 400;">, and</span> <a href="https://github.com/libbpf/blazesym" target="_blank" rel="noopener"><span style="font-weight: 400;">blazesym</span></a><span style="font-weight: 400;">. 
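</span></p>
<p><span style="font-weight: 400;">At its core, the lookup such a service performs can be sketched as a binary search over pre-parsed address ranges (a simplification; all addresses and names below are invented):</span></p>

```python
import bisect

# Pre-parsed symbol table: (start, end, function, file, line), sorted by
# start address. The heavy DWARF parsing happened offline, once per binary.
SYMBOLS = [
    (0x1000, 0x1200, "main", "server.cpp", 12),
    (0x1200, 0x1800, "handle_request", "server.cpp", 40),
    (0x2000, 0x2400, "parse_header", "http.cpp", 7),
]
STARTS = [sym[0] for sym in SYMBOLS]

def symbolize(address):
    """Map an instruction address to (function, file, line), or None."""
    i = bisect.bisect_right(STARTS, address) - 1
    if i >= 0:
        start, end, function, path, line = SYMBOLS[i]
        if start <= address < end:
            return (function, path, line)
    return None  # address falls outside every known function

print(symbolize(0x1234))  # ('handle_request', 'server.cpp', 40)
print(symbolize(0x1900))  # None: the address sits in a gap
```

<p><span style="font-weight: 400;">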
At the end of a profile Strobelight sends stacks of binary addresses to a service that sends back symbolized stacks with file, line, type info, and even inline information.</span></p> <p><span style="font-weight: 400;">It can do this because it has already done all the heavy lifting of downloading and parsing the DWARF data for each of Meta’s binaries (specifically, production binaries) and stores what it needs in a database. Then it can serve multiple symbolization requests coming from different instances of Strobelight running throughout the fleet.</span></p> <p><span style="font-weight: 400;">To add to that enchilada (hungry yet?), Strobelight also delays symbolization until after profiling and stores raw data to disk to prevent memory thrash on the host. This has the added benefit of not letting the consumer impact the producer – meaning if Strobelight’s user space code can’t handle the speed at which the eBPF kernel code is producing samples (because it’s spending time symbolizing or doing some other processing) it results in dropped samples.</span></p> <p><span style="font-weight: 400;">All of this is made possible with the inclusion of</span> <a href="https://www.brendangregg.com/blog/2024-03-17/the-return-of-the-frame-pointers.html" target="_blank" rel="noopener"><span style="font-weight: 400;">frame pointers</span></a><span style="font-weight: 400;"> in all of Meta’s user space binaries, otherwise we couldn’t walk the stack to get all these addresses (or we’d have to do some other complicated/expensive thing which wouldn’t be as efficient). 
</span></p> <figure id="attachment_22159" aria-describedby="caption-attachment-22159" style="width: 1024px" class="wp-caption alignnone"><img loading="lazy" decoding="async" class="size-large wp-image-22159" src="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?w=1024" alt="" width="1024" height="633" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png 1607w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?resize=916,566 916w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?resize=768,475 768w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?resize=1024,633 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?resize=1536,949 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?resize=96,59 96w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-2.png?resize=192,119 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22159" class="wp-caption-text">A simplified Strobelight service graph.</figcaption></figure> <h2><span style="font-weight: 400;">Show me the data (and make it nice)!</span></h2> <p><span style="font-weight: 400;">The primary tool Strobelight customers use is</span> <a href="https://research.facebook.com/publications/scuba-diving-into-data-at-facebook/" target="_blank" rel="noopener"><span style="font-weight: 400;">Scuba</span></a><span style="font-weight: 400;"> – a query language (like SQL), database, and UI. The Scuba UI has a large suite of visualizations for the queries people construct (e.g., flame graphs, pie charts, time series graphs, distributions, etc).</span></p> <p><span style="font-weight: 400;">Strobelight, for the most part, produces Scuba data and, generally, it’s a happy marriage. 
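</span></p>
<p><span style="font-weight: 400;">For intuition, flame graph views like the one below are built by folding weighted stack samples into per-path totals. Here is a minimal sketch, using the folded "a;b;c" text format popularized by open source flame-graph tooling rather than Scuba’s actual storage:</span></p>

```python
from collections import Counter

def fold(samples):
    """Collapse (call_stack, weight) samples into per-path weight totals.

    Each stack lists frames outermost first; the folded 'main;handle;parse'
    keys are what flame-graph tooling consumes.
    """
    folded = Counter()
    for stack, weight in samples:
        folded[";".join(stack)] += weight
    return dict(folded)

samples = [
    (["main", "handle", "parse"], 10.0),   # weight = 1 / run probability
    (["main", "handle", "parse"], 10.0),
    (["main", "handle", "render"], 2.0),
]
print(fold(samples))
# {'main;handle;parse': 20.0, 'main;handle;render': 2.0}
```

<p><span style="font-weight: 400;">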
If someone runs an on-demand profile, it’s just a few seconds before they can visualize this data in the Scuba UI (and send people links to it). Even tools like</span> <a href="https://perfetto.dev/" target="_blank" rel="noopener"><span style="font-weight: 400;">Perfetto</span></a><span style="font-weight: 400;"> expose the ability to query the underlying data because they know it’s impossible to try to come up with enough dropdowns and buttons that can express everything you want to do in a query language – though the Scuba UI comes close.</span></p> <figure id="attachment_22160" aria-describedby="caption-attachment-22160" style="width: 1024px" class="wp-caption alignnone"><img loading="lazy" decoding="async" class="wp-image-22160 size-large" src="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?w=1024" alt="" width="1024" height="554" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png 1999w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?resize=916,495 916w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?resize=768,415 768w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?resize=1024,554 1024w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?resize=1536,831 1536w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?resize=96,52 96w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-3.png?resize=192,104 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22160" class="wp-caption-text">An example flamegraph/icicle of function call stacks of the CPU cycles event for the mononoke service for one hour.</figcaption></figure> <p><span style="font-weight: 400;">The other tool is a trace visualization tool used at Meta named</span><a 
href="https://www.facebook.com/atscaleevents/videos/996197807391867/"> <span style="font-weight: 400;">Tracery</span></a><span style="font-weight: 400;">. We use this tool when we want to combine correlated but different streams of profile data on one screen. This data is also a natural fit for viewing on a timeline. Tracery allows users to make custom visualizations and curated workspaces to share with other engineers to pinpoint the important parts of that data. It’s also powered by a client-side columnar database (written in JavaScript!), which makes it very fast when it comes to zooming and filtering. Strobelight’s Crochet profiler combines service request spans, CPU-cycles stacks, and off-CPU data to give users a detailed snapshot of their service.</span></p> <figure id="attachment_22161" aria-describedby="caption-attachment-22161" style="width: 975px" class="wp-caption alignnone"><img loading="lazy" decoding="async" class="size-large wp-image-22161" src="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-4.png?w=975" alt="" width="975" height="552" srcset="https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-4.png 975w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-4.png?resize=916,519 916w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-4.png?resize=768,435 768w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-4.png?resize=96,54 96w, https://engineering.fb.com/wp-content/uploads/2025/01/Strobelight-Meta-image-4.png?resize=192,109 192w" sizes="auto, (max-width: 992px) 100vw, 62vw" /><figcaption id="caption-attachment-22161" class="wp-caption-text">An example trace in Tracery.</figcaption></figure> <h2><span style="font-weight: 400;">The Biggest Ampersand</span></h2> <p><span style="font-weight: 400;">Strobelight has helped engineers at Meta realize countless efficiency and latency wins, ranging from increases in the 
number of requests served, to large reductions in heap allocations, to regressions caught in pre-prod analysis tools.</span></p> <p><span style="font-weight: 400;">But one of the most significant wins is one we call, “The Biggest Ampersand.”</span></p> <p><span style="font-weight: 400;">A seasoned performance engineer was looking through Strobelight data and discovered that by filtering on a particular std::vector function call (using the symbolized file and line number) he could identify computationally expensive array copies that happen unintentionally with the ‘auto’ keyword in C++.</span></p> <p><span style="font-weight: 400;">The engineer turned a few knobs, adjusted his Scuba query, and happened to notice one of these copies in a particularly hot call path in one of Meta’s largest ads services. He then cracked open his code editor to investigate whether this particular vector copy was intentional… it wasn’t.</span></p> <p><span style="font-weight: 400;">It was a simple mistake that any engineer working in C++ has made a hundred times.</span></p> <p><span style="font-weight: 400;">So, the engineer typed an “&” after the auto keyword to indicate we want a reference instead of a copy. It was a one-character commit, which, after it was shipped to production, equated to an estimated 15,000 servers</span> <span style="font-weight: 400;">in capacity savings per year!</span></p> <p><span style="font-weight: 400;">Go back and re-read that sentence. One ampersand! </span></p> <h2><span style="font-weight: 400;">An open ending</span></h2> <p><span style="font-weight: 400;">This only scratches the surface of everything Strobelight can do. 
The Strobelight team works closely with Meta’s performance engineers on new features that can better analyze code to help pinpoint where things are slow, computationally expensive, and why.</span></p> <p><span style="font-weight: 400;">We’re currently working on</span> <a href="https://github.com/facebookincubator/strobelight" target="_blank" rel="noopener"><span style="font-weight: 400;">open-sourcing</span></a><span style="font-weight: 400;"> Strobelight’s profilers and libraries, which will no doubt make them more robust and useful. Most of the technologies Strobelight uses are already public or open source, so please use and contribute to them!</span></p> <h2><span style="font-weight: 400;">Acknowledgements</span></h2> <p><i><span style="font-weight: 400;">Special thanks to Wenlei He, Andrii Nakryiko, Giuseppe Ottaviano, Mark Santaniello, Nathan Slingerland, Anita Zhang, and the Profilers Team at Meta. </span></i></p> <p>The post <a rel="nofollow" href="https://engineering.fb.com/2025/01/21/production-engineering/strobelight-a-profiling-service-built-on-open-source-technology/">Strobelight: A profiling service built on open source technology</a> appeared first on <a rel="nofollow" href="https://engineering.fb.com">Engineering at Meta</a>.</p> ]]></content:encoded> <post-id xmlns="com-wordpress:feed-additions:1">22157</post-id> </item> </channel> </rss>