CINXE.COM

Orchestrating queries with Airflow - Apache Drill

<!DOCTYPE html> <html> <head> <meta charset="UTF-8"> <meta name=viewport content="width=device-width, initial-scale=1"> <title>Orchestrating queries with Airflow - Apache Drill</title> <link href="https://maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css" rel="stylesheet" type="text/css"/> <link href="/css/site.css" rel="stylesheet" type="text/css"/> <link rel="shortcut icon" href="/favicon.ico" type="image/x-icon"/> <link rel="icon" href="/favicon.ico" type="image/x-icon"/> <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/1.11.1/jquery.min.js" language="javascript" type="text/javascript"></script> <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery-easing/1.3/jquery.easing.min.js" language="javascript" type="text/javascript"></script> <script language="javascript" type="text/javascript" src="/js/modernizr.custom.js"></script> <script language="javascript" type="text/javascript" src="/js/script.js"></script> <script language="javascript" type="text/javascript" src="/js/drill.js"></script> </head> <body onResize="resized();"> <div class="page-wrap"> <div class="bui"></div> <div id="menu" class="mw"> <ul> <li class='toc-categories'> <a class="expand-toc-icon" href="javascript:void(0);"><i class="fa fa-bars"></i></a> </li> <li class="logo"><a href="/"></a></li> <li class='expand-menu'> <a href="javascript:void(0);"><span class='menu-text'>Menu</span><span class='expand-icon'><i class="fa fa-bars"></i></span></a> </li> <li class="clear-float"></li> <li class="nav"> <a>Language</a> <ul> <li> <a style="font-weight: bold;" href="/docs/orchestrating-queries-with-airflow/" >en</a> </li> <li> <a href="/zh/docs/orchestrating-queries-with-airflow/" >zh</a> </li> </ul> </li> <li class="apache-link"> <a href="/apacheASF/">Apache</a> </li> <li class="poweredby"> <a href="/poweredBy">Powered By</a> </li> <li class="documentation-menu"> <a href="/docs/">Documentation</a> <ul> <li><a href="/docs/getting-started/">Getting Started</a></li> <li><a href="/docs/architecture/">Architecture</a></li> <li><a href="/docs/tutorials/">Tutorials</a></li> <li><a href="/docs/drill-on-yarn/">Drill-on-YARN</a></li> <li><a href="/docs/install-drill/">Install Drill</a></li> <li><a href="/docs/configure-drill/">Configure Drill</a></li> <li><a href="/docs/connect-a-data-source/">Connect a Data Source</a></li> <li><a href="/docs/odbc-jdbc-interfaces/">ODBC/JDBC Interfaces</a></li> <li><a href="/docs/query-data/">Query Data</a></li> <li><a href="/docs/performance-tuning/">Performance Tuning</a></li> <li><a href="/docs/log-and-debug/">Log and Debug</a></li> <li><a href="/docs/sql-reference/">SQL Reference</a></li> <li><a href="/docs/data-sources-and-file-formats/">Data Sources and File Formats</a></li> <li><a href="/docs/develop-custom-functions/">Develop Custom Functions</a></li> <li><a href="/docs/troubleshooting/">Troubleshooting</a></li> <li><a href="/docs/developer-information/">Developer Information</a></li> <li><a href="/docs/release-notes/">Release Notes</a></li> <li><a href="/docs/sample-datasets/">Sample Datasets</a></li> <li><a href="/docs/project-bylaws/">Project Bylaws</a></li> <li><a href="/docs/ecosystem/">Ecosystem</a></li> </ul> </li> <li class='nav'> <a href="/community-resources/">Community</a> <ul> <li><a href="/team/">Team</a></li> <li><a href="/mailinglists/">Mailing Lists</a></li> <li><a href="/community-resources/">Community Resources</a></li> </ul> </li> <li class='nav'><a href="/faq/">FAQ</a></li> <li class='nav'><a href="/blog/">Blog</a></li> <li class="social-menu-item"><a href="https://twitter.com/apachedrill" title="apachedrill on twitter" target="_blank"><img src="/images/twitter_32_26_white.png" alt="twitter logo" align="center"></a> </li> <li class="social-menu-item"><a href="https://join.slack.com/t/apache-drill/shared_invite/enQtNTQ4MjM1MDA3MzQ2LTJlYmUxMTRkMmUwYmQ2NTllYmFmMjU4MDk0NjYwZjBmYjg0MDZmOTE2ZDg0ZjBlYmI3Yjc4Y2I2NTQyNGVlZTc" title="Apache Drill Slack channels" target="_blank"><img src="/images/slack-logo.svg" alt="Slack logo" align="center"></a> </li> <li class='search-bar'> <form id="drill-search-form"> <input type="text" placeholder="Search Apache Drill" id="drill-search-term" /> <button type="submit"> <i class="fa fa-search"></i> </button> </form> </li> <li class="d"> <a href="/download/"> <i class="fa fa-cloud-download"></i> Download </a> </li> </ul> </div> <link href="/css/content.css" rel="stylesheet" type="text/css"> <aside class="sidebar"> <div class="docsidebar"> <div class="docsidebarwrapper"> <ul style="display: block;"> <li class="toctree-l1"><a href="javascript: void(0);">Getting Started</a></li> <ul style="display: none"> <li class="toctree-l2"><a class="reference internal" href="/docs/drill-introduction/">Drill Introduction</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/why-drill/">Why Drill</a></li> </ul> <li class="toctree-l1"><a href="javascript: void(0);">Architecture</a></li> <ul style="display: none"> <li class="toctree-l2"><a class="reference internal" href="/docs/architecture-introduction/">Architecture Introduction</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/drill-query-execution/">Drill Query Execution</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/core-modules/">Core Modules</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/performance/">Performance</a></li> </ul> <li class="toctree-l1 current_section "><a href="javascript: void(0);">Tutorials</a></li> <ul class="current_section"> <li class="toctree-l2"><a class="reference internal" href="/docs/tutorials-introduction/">Tutorials Introduction</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/drill-in-10-minutes/">Drill in 10 Minutes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/analyzing-the-yelp-academic-dataset/">Analyzing the Yelp Academic Dataset</a></li> <li class="toctree-l2"><a href="javascript: void(0);">Learn Drill with the MapR Sandbox</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/about-the-mapr-sandbox/">About the MapR Sandbox</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/installing-the-apache-drill-sandbox/">Installing the Apache Drill Sandbox</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/getting-to-know-the-drill-sandbox/">Getting to Know the Drill Sandbox</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/lesson-1-learn-about-the-data-set/">Lesson 1: Learn about the Data Set</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/lesson-2-run-queries-with-ansi-sql/">Lesson 2: Run Queries with ANSI SQL</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/lesson-3-run-queries-on-complex-data-types/">Lesson 3: Run Queries on Complex Data Types</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/summary/">Summary</a></li> </ul> <li class="toctree-l2"><a class="reference internal" href="/docs/analyzing-highly-dynamic-datasets/">Analyzing Highly Dynamic Datasets</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/analyzing-social-media/">Analyzing Social Media</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/analyzing-data-using-window-functions/">Analyzing Data Using Window Functions</a></li> <li class="toctree-l2 current"><a class="reference internal" href="/docs/orchestrating-queries-with-airflow/">Orchestrating queries with Airflow</a></li> </ul> <li class="toctree-l1"><a href="javascript: void(0);">Drill-on-YARN</a></li> <ul style="display: none"> <li class="toctree-l2"><a class="reference internal" href="/docs/drill-on-yarn-introduction/">Drill-on-YARN Introduction</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/creating-a-basic-drill-cluster/">Creating a Basic Drill Cluster</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/launch-drill-under-yarn/">Launch Drill Under YARN</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/configuration-reference/">Configuration Reference</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/drill-on-yarn-command-line-tool/">Drill-on-YARN Command-Line Tool</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/using-the-drill-on-yarn-web-ui/">Using the Drill-on-YARN Web UI</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/multiple-drill-clusters/">Multiple Drill Clusters</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/enabling-web-ui-security/">Enabling Web UI Security</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/appendix-a-release-note-issues/">Appendix A: Release Note Issues</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/appendix-b-drill-env-sh-settings/">Appendix B: drill-env.sh Settings</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/appendix-c-troubleshooting/">Appendix C: Troubleshooting</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/appendix-d-recreate-the-drill-archive/">Appendix D: Recreate the Drill Archive</a></li> </ul> <li class="toctree-l1"><a href="javascript: void(0);">Install Drill</a></li> <ul style="display: none"> <li class="toctree-l2"><a class="reference internal" href="/docs/install-drill-introduction/">Install Drill Introduction</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/migrating-parquet-data/">Migrating Parquet Data</a></li> <li class="toctree-l2"><a href="javascript: void(0);">Installing Drill in Embedded Mode</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/embedded-mode-prerequisites/">Embedded Mode Prerequisites</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/running-drill-on-docker/">Running Drill on Docker</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/installing-drill-on-linux-and-mac-os-x/">Installing Drill on Linux and Mac OS X</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/starting-drill-on-linux-and-mac-os-x/">Starting Drill on Linux and Mac OS X</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/installing-drill-on-windows/">Installing Drill on Windows</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/starting-drill-on-windows/">Starting Drill on Windows</a></li> </ul> <li class="toctree-l2"><a href="javascript: void(0);">Installing Drill in Distributed Mode</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/distributed-mode-prerequisites/">Distributed Mode Prerequisites</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/installing-drill-on-the-cluster/">Installing Drill on the Cluster</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/starting-drill-in-distributed-mode/">Starting Drill in Distributed Mode</a></li> </ul> <li class="toctree-l2"><a class="reference internal" href="/docs/starting-the-web-ui/">Starting the Web UI</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/stopping-drill/">Stopping Drill</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/identifying-multiple-drill-versions-in-a-cluster/">Identifying Multiple Drill Versions in a Cluster</a></li> <li class="toctree-l2"><a href="javascript: void(0);">Extended</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/installing-drill-in-distributed-mode-with-gcp-dataproc/">Installing Drill in Distributed Mode with GCP Dataproc</a></li> </ul> </ul> <li class="toctree-l1"><a href="javascript: void(0);">Configure Drill</a></li> <ul style="display: none"> <li class="toctree-l2"><a class="reference internal" href="/docs/configure-drill-introduction/">Configure Drill Introduction</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/configuring-drill-memory/">Configuring Drill Memory</a></li> <li class="toctree-l2"><a href="javascript: void(0);">Securing Drill</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/securing-drill-introduction/">Securing Drill Introduction</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/secure-communication-paths/">Secure Communication Paths</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/roles-and-privileges/">Roles and Privileges</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/configuring-user-impersonation/">Configuring User Impersonation</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/configuring-inbound-impersonation/">Configuring Inbound Impersonation</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/configuring-user-impersonation-with-hive-authorization/">Configuring User Impersonation with Hive Authorization</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/configuring-user-security/">Configuring User Security</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/configuring-plain-security/">Configuring Plain Security</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/configuring-ssl-tls-for-encryption/">Configuring SSL/TLS for Encryption</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/using-libpam4j-as-the-pam-authenticator/">Using libpam4j as the PAM Authenticator</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/using-jpam-as-the-pam-authenticator/">Using jpam as the PAM Authenticator</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/configuring-htpasswd-file-authentication/">Configuring htpasswd file authentication</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/configuring-hashicorp-vault-authentication/">Configuring HashiCorp Vault authentication</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/configuring-kerberos-security/">Configuring Kerberos Security</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/configuring-drill-to-use-spnego-for-http-authentication/">Configuring Drill to use SPNEGO for HTTP Authentication</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/configuring-web-ui-and-rest-api-security/">Configuring Web UI and REST API Security</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/configuring-custom-acls-to-secure-znodes/">Configuring Custom ACLs to Secure znodes</a></li> </ul> <li class="toctree-l2"><a href="javascript: void(0);">Configuring a Multitenant Cluster</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/configuring-a-multitenant-cluster-introduction/">Configuring a Multitenant Cluster Introduction</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/configuring-multitenant-resources/">Configuring Multitenant Resources</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/configuring-resources-for-a-shared-drillbit/">Configuring Resources for a Shared Drillbit</a></li> </ul> <li class="toctree-l2"><a href="javascript: void(0);">Configuration Options</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/configuration-options-introduction/">Configuration Options Introduction</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/start-up-options/">Start-Up Options</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/planning-and-execution-options/">Planning and Execution Options</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/persistent-configuration-storage/">Persistent Configuration Storage</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/global-query-list/">Global Query List</a></li> </ul> <li class="toctree-l2"><a class="reference internal" href="/docs/ports-and-bind-addresses-used-by-drill/">Ports and Bind Addresses Used by Drill</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/configuring-the-drill-shell/">Configuring the Drill Shell</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/configuring-cgroups-to-control-cpu-usage/">Configuring cgroups to Control CPU Usage</a></li> </ul> <li class="toctree-l1"><a href="javascript: void(0);">Connect a Data Source</a></li> <ul style="display: none"> <li class="toctree-l2"><a class="reference internal" href="/docs/connect-a-data-source-introduction/">Connect a Data Source Introduction</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/storage-plugin-registration/">Storage Plugin Registration</a></li> <li class="toctree-l2"><a href="javascript: void(0);">Storage Plugin Configuration</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/plugin-configuration-basics/">Plugin Configuration Basics</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/configuring-storage-plugins/">Configuring Storage Plugins</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/storage-plugin-authentication-modes/">Storage plugin authentication modes</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/workspaces/">Workspaces</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/drill-default-input-format/">Drill Default Input Format</a></li> </ul> <li class="toctree-l2"><a class="reference internal" href="/docs/file-system-storage-plugin/">File System Storage Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/hbase-storage-plugin/">HBase Storage Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/hive-storage-plugin/">Hive Storage Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/rdbms-storage-plugin/">RDBMS Storage Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/mongodb-storage-plugin/">MongoDB Storage Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/mapr-db-format/">MapR-DB Format</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/s3-storage-plugin/">S3 Storage Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/oci-os-storage-plugin/">OCI OS Storage Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/opentsdb-storage-plugin/">OpenTSDB Storage Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/kafka-storage-plugin/">Kafka Storage Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/azure-blob-storage-plugin/">Azure Blob Storage Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/druid-storage-plugin/">Druid Storage Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/http-storage-plugin/">HTTP Storage Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/elasticsearch-storage-plugin/">ElasticSearch Storage Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/splunk-storage-plugin/">Splunk Storage Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/cassandra-storage-plugin/">Cassandra Storage Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/dropbox-storage-plugin/">Dropbox Storage Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/google-sheets-storage-plugin/">Google Sheets Storage Plugin</a></li> </ul> <li class="toctree-l1"><a href="javascript: void(0);">ODBC/JDBC Interfaces</a></li> <ul style="display: none"> <li class="toctree-l2"><a class="reference internal" href="/docs/interfaces-introduction/">Interfaces Introduction</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/using-the-jdbc-driver/">Using the JDBC Driver</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/using-jdbc-with-squirrel-on-windows/">Using JDBC with SQuirreL on Windows</a></li> <li class="toctree-l2"><a href="javascript: void(0);">Installing the ODBC Driver</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/installing-the-driver-on-linux/">Installing the Driver on Linux</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/installing-the-driver-on-mac-os-x/">Installing the Driver on Mac OS X</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/installing-the-driver-on-windows/">Installing the Driver on Windows</a></li> </ul> <li class="toctree-l2"><a href="javascript: void(0);">Configuring ODBC</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/odbc-configuration-reference/">ODBC Configuration Reference</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/logging-and-tracing/">Logging and Tracing</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/configuring-odbc-on-linux/">Configuring ODBC on Linux</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/configuring-odbc-on-mac-os-x/">Configuring ODBC on Mac OS X</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/configuring-odbc-on-windows/">Configuring ODBC on Windows</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/testing-the-odbc-connection/">Testing the ODBC Connection</a></li> </ul> <li class="toctree-l2"><a href="javascript: void(0);">Using Drill Explorer</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/drill-explorer-introduction/">Drill Explorer Introduction</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/connecting-drill-explorer-to-data/">Connecting Drill Explorer to Data</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/browsing-data-and-defining-views/">Browsing Data and Defining Views</a></li> </ul> <li class="toctree-l2"><a href="javascript: void(0);">Using Drill with BI Tools</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/using-drill-with-bi-tools-introduction/">Using Drill with BI Tools Introduction</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/tableau-examples/">Tableau Examples</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/using-microstrategy-analytics-with-apache-drill/">Using MicroStrategy Analytics with Apache Drill</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/using-tibco-spotfire-desktop-with-drill/">Using Tibco Spotfire Desktop with Drill</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/configuring-tibco-spotfire-server-with-drill/">Configuring Tibco Spotfire Server with Drill</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/using-qlik-sense-with-drill/">Using Qlik Sense with Drill</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/using-apache-drill-with-tableau-10-2/">Using Apache Drill with Tableau 10.2</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/using-apache-drill-with-tableau-9-desktop/">Using Apache Drill with Tableau 9 Desktop</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/using-apache-drill-with-tableau-9-server/">Using Apache Drill with Tableau 9 Server</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/using-information-builders-webfocus-with-apache-drill/">Using Information Builders’ WebFOCUS with Apache Drill</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/configuring-jreport-with-drill/">Configuring JReport with Drill</a></li> </ul> </ul> <li class="toctree-l1"><a href="javascript: void(0);">Query Data</a></li> <ul style="display: none"> <li class="toctree-l2"><a class="reference internal" href="/docs/query-data-introduction/">Query Data Introduction</a></li> <li class="toctree-l2"><a href="javascript: void(0);">Querying a File System</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/querying-a-file-system-introduction/">Querying a File System Introduction</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/querying-avro-files/">Querying Avro Files</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/querying-json-files/">Querying JSON Files</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/querying-parquet-files/">Querying Parquet Files</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/querying-plain-text-files/">Querying Plain Text Files</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/querying-directories/">Querying Directories</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/querying-sequence-files/">Querying Sequence Files</a></li> </ul> <li class="toctree-l2"><a class="reference internal" href="/docs/querying-hbase/">Querying HBase</a></li> <li class="toctree-l2"><a href="javascript: void(0);">Querying Complex Data</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/querying-complex-data-introduction/">Querying Complex Data Introduction</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/sample-data-donuts/">Sample Data: Donuts</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/selecting-flat-data/">Selecting Flat Data</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/using-sql-functions-clauses-and-joins/">Using SQL Functions, Clauses, and Joins</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/selecting-nested-data-for-a-column/">Selecting Nested Data for a Column</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/selecting-multiple-columns-within-nested-data/">Selecting Multiple Columns Within Nested Data</a></li> </ul> <li class="toctree-l2"><a class="reference internal" href="/docs/querying-hive/">Querying Hive</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/querying-the-information-schema/">Querying the INFORMATION SCHEMA</a></li> <li class="toctree-l2"><a href="javascript: void(0);">Querying Indexes</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/querying-indexes-introduction/">Querying Indexes Introduction</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/queries-that-qualify-for-index-based-query-plans/">Queries that Qualify for Index-Based Query Plans</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/types-of-indexes/">Types of Indexes</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/index-selection/">Index Selection</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/designing-indexes-for-your-queries/">Designing Indexes for Your Queries</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/configuring-index-planning/">Configuring Index Planning</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/verifying-index-use/">Verifying Index Use</a></li> </ul> <li class="toctree-l2"><a class="reference internal" href="/docs/querying-system-tables/">Querying System Tables</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/monitoring-and-canceling-queries-in-the-drill-web-ui/">Monitoring and Canceling Queries in the Drill Web UI</a></li> </ul> <li class="toctree-l1"><a href="javascript: void(0);">Performance Tuning</a></li> <ul style="display: none"> <li class="toctree-l2"><a href="javascript: void(0);">Drill Metastore</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/using-drill-metastore/">Using Drill Metastore</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/drill-iceberg-metastore/">Drill Iceberg Metastore</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/rdbms-metastore/">RDBMS Metastore</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/mongo-metastore/">Mongo Metastore</a></li> </ul> <li class="toctree-l2"><a class="reference internal" href="/docs/performance-tuning-introduction/">Performance Tuning Introduction</a></li> <li class="toctree-l2"><a href="javascript: void(0);">Partition Pruning</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/partition-pruning-introduction/">Partition Pruning Introduction</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/how-to-partition-data/">How to Partition Data</a></li> </ul> <li class="toctree-l2"><a class="reference internal" href="/docs/asynchronous-parquet-reader/">Asynchronous Parquet Reader</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/optimizing-parquet-metadata-reading/">Optimizing Parquet Metadata Reading</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/parquet-filter-pushdown/">Parquet Filter Pushdown</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/hive-metadata-caching/">Hive Metadata Caching</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/choosing-a-storage-format/">Choosing a Storage Format</a></li> <li class="toctree-l2"><a href="javascript: void(0);">Query Plans and Tuning</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/query-plans-and-tuning-introduction/">Query Plans and Tuning Introduction</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/join-planning-guidelines/">Join Planning Guidelines</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/guidelines-for-optimizing-aggregation/">Guidelines for Optimizing Aggregation</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/modifying-query-planning-options/">Modifying Query Planning Options</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/sort-based-and-hash-based-memory-constrained-operators/">Sort-Based and Hash-Based Memory-Constrained Operators</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/enabling-query-queuing/">Enabling Query Queuing</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/throttling/">Throttling</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/controlling-parallelization-to-balance-performance-with-multi-tenancy/">Controlling Parallelization to Balance Performance with Multi-Tenancy</a></li> </ul> <li class="toctree-l2"><a href="javascript: void(0);">Identifying Performance Issues</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/query-plans/">Query Plans</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/query-profiles/">Query Profiles</a></li> </ul> <li class="toctree-l2"><a href="javascript: void(0);">Performance Tuning Reference</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/query-profile-column-descriptions/">Query Profile Column Descriptions</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/physical-operators/">Physical Operators</a></li> </ul> <li class="toctree-l2"><a class="reference internal" href="/docs/monitoring-metrics/">Monitoring Metrics</a></li> </ul> <li class="toctree-l1"><a href="javascript: void(0);">Log and Debug</a></li> <ul style="display: none"> <li class="toctree-l2"><a class="reference internal" href="/docs/log-and-debug-introduction/">Log and Debug Introduction</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/error-messages/">Error Messages</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/modify-logback-xml/">Modify logback.xml</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/review-the-java-stack-trace/">Review the Java Stack Trace</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/query-audit-logging/">Query Audit Logging</a></li> </ul> <li class="toctree-l1"><a href="javascript: void(0);">SQL Reference</a></li> <ul style="display: none"> <li class="toctree-l2"><a class="reference internal" href="/docs/sql-reference-introduction/">SQL Reference Introduction</a></li> <li class="toctree-l2"><a href="javascript: void(0);">Data Types</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/supported-data-types/">Supported Data Types</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/date-time-and-timestamp/">Date, Time, and Timestamp</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/handling-different-data-types/">Handling Different Data Types</a></li> </ul> <li class="toctree-l2"><a class="reference internal" href="/docs/lexical-structure/">Lexical Structure</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/operators/">Operators</a></li> <li class="toctree-l2"><a href="javascript: void(0);">SQL Functions</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/about-sql-function-examples/">About SQL Function Examples</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/math-and-trig/">Math and Trig</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/statistical/">Statistical</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/data-type-conversion/">Data Type Conversion</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/data-type-functions/">Data Type Functions</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/date-time-functions-and-arithmetic/">Date/Time Functions and Arithmetic</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/string-manipulation/">String Manipulation</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/aggregate-and-aggregate-statistical/">Aggregate and Aggregate Statistical</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/functions-for-handling-nulls/">Functions for Handling Nulls</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/phonetic-functions/">Phonetic Functions</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/string-distance-functions/">String Distance Functions</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/cryptography-functions/">Cryptography Functions</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/sql-dialect-compatibility-functions/">SQL dialect compatibility functions</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/gis-functions/">GIS functions</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/time-series-analysis-functions/">Time Series Analysis Functions</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/ip-networking-functions/">IP Networking functions</a></li> </ul> <li class="toctree-l2"><a href="javascript: void(0);">SQL Window Functions</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/sql-window-functions-introduction/">SQL Window Functions Introduction</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/aggregate-window-functions/">Aggregate Window Functions</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/ranking-window-functions/">Ranking Window Functions</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/value-window-functions/">Value Window Functions</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/sql-window-functions-examples/">SQL Window Functions Examples</a></li> </ul> <li class="toctree-l2"><a href="javascript: void(0);">Nested Data Functions</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/nested-data-limitations/">Nested Data Limitations</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/flatten/">FLATTEN</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/kvgen/">KVGEN</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/repeated-count/">REPEATED_COUNT</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/repeated-contains/">REPEATED_CONTAINS</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/list-creation-functions/">COLLECT_LIST</a></li> </ul> <li class="toctree-l2"><a class="reference internal" href="/docs/query-directory-functions/">Query Directory Functions</a></li> <li class="toctree-l2"><a href="javascript: void(0);">SQL Commands</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/supported-sql-commands/">Supported SQL Commands</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/analyze-table-refresh-metadata/">ANALYZE TABLE REFRESH METADATA</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/analyze-table-compute-statistics/">ANALYZE TABLE COMPUTE STATISTICS</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/set/">SET</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/refresh-table-metadata/">REFRESH TABLE METADATA</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/reset/">RESET</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/alter-system/">ALTER SYSTEM</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/create-or-replace-schema/">CREATE OR REPLACE SCHEMA</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/create-table-as-ctas/">CREATE TABLE AS (CTAS)</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/create-temporary-table-as-cttas/">CREATE TEMPORARY TABLE AS (CTTAS)</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/create-function-using-jar/">CREATE FUNCTION USING JAR</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/partition-by-clause/">PARTITION BY Clause</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/create-view/">CREATE VIEW</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/describe/">DESCRIBE</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/drop-function-using-jar/">DROP FUNCTION USING JAR</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/drop-table/">DROP TABLE</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/drop-view/">DROP VIEW</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/explain/">EXPLAIN</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/lateral-join/">LATERAL Join</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/select/">SELECT</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/select-list/">SELECT List</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/from-clause/">FROM Clause</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/group-by-clause/">GROUP BY Clause</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/having-clause/">HAVING Clause</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/limit-clause/">LIMIT Clause</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/offset-clause/">OFFSET Clause</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/order-by-clause/">ORDER BY Clause</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/set-operators/">Set Operators</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/where-clause/">WHERE Clause</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/with-clause/">WITH Clause</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/show-databases-and-show-schemas/">SHOW DATABASES and SHOW SCHEMAS</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/show-files/">SHOW FILES</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/show-tables/">SHOW TABLES</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/use/">USE</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/pivot-operators/">Pivot Operators</a></li> </ul> <li class="toctree-l2"><a href="javascript: void(0);">SQL Conditional Expressions</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/case/">CASE</a></li> </ul> <li class="toctree-l2"><a class="reference internal" href="/docs/reserved-keywords/">Reserved Keywords</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/sql-extensions/">SQL Extensions</a></li> </ul> <li class="toctree-l1"><a href="javascript: void(0);">Data Sources and File Formats</a></li> <ul style="display: none"> <li class="toctree-l2"><a class="reference internal" href="/docs/data-sources-and-file-formats-introduction/">Data Sources and File Formats Introduction</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/hive-to-drill-data-type-mapping/">Hive-to-Drill Data Type Mapping</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/deploying-and-using-a-hive-udf/">Deploying and Using a Hive UDF</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/parquet-format/">Parquet Format</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/logfile-plugin/">Logfile Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/json-data-model/">JSON Data Model</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/text-files-csv-tsv-psv/">Text Files: CSV, TSV, PSV</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/sequence-files/">Sequence Files</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/httpd-format-plugin/">HTTPD Format Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/image-metadata-format-plugin/">Image Metadata Format Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/syslog-format-plugin/">Syslog Format Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/ltsv-format-plugin/">LTSV Format Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/spss-format-plugin/">SPSS Format Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/esri-shapefile-format-plugin/">ESRI Shapefile Format Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/excel-format-plugin/">Excel Format Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/hdf5-format-plugin/">HDF5 Format Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/ms-access-format-plugin/">Microsoft Access Format Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/xml-format-plugin/">XML Format Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/iceberg-format-plugin/">Iceberg Format Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/delta-lake-format-plugin/">Delta Lake Format Plugin</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/sas-format-plugin/">SAS Format Plugin</a></li> </ul> <li class="toctree-l1"><a href="javascript: void(0);">Develop Custom Functions</a></li> <ul style="display: none"> <li class="toctree-l2"><a class="reference internal" href="/docs/develop-custom-functions-introduction/">Develop Custom Functions Introduction</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/developing-a-simple-function/">Developing a Simple Function</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/tutorial-develop-a-simple-function/">Tutorial: Develop a Simple Function</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/developing-an-aggregate-function/">Developing an Aggregate Function</a></li> <li class="toctree-l2"><a href="javascript: void(0);">Adding Custom Functions to Drill</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/adding-custom-functions-to-drill-introduction/">Adding Custom Functions to Drill Introduction</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/manually-adding-custom-functions-to-drill/">Manually Adding Custom Functions to Drill</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/creating-custom-authenticators/">Creating Custom Authenticators</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/dynamic-udfs/">Dynamic UDFs</a></li> </ul> <li class="toctree-l2"><a class="reference internal" href="/docs/using-custom-functions-in-queries/">Using Custom Functions in Queries</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/custom-function-interfaces/">Custom Function Interfaces</a></li> </ul> <li class="toctree-l1"><a class="reference internal" href="/docs/troubleshooting/">Troubleshooting</a></li> <li class="toctree-l1"><a href="javascript: void(0);">Developer Information</a></li> <ul style="display: none"> <li class="toctree-l2"><a href="javascript: void(0);">REST API</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/rest-api-introduction/">REST API Introduction</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/submitting-queries-from-the-rest-api-when-impersonation-is-enabled-and-authentication-is-disabled/">Submitting Queries from the REST API when Impersonation is Enabled and Authentication is Disabled</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/use-postman-to-run-sql-queries-on-drill-data-sources/">Use Postman to Run SQL Queries on Drill Data Sources</a></li> </ul> <li class="toctree-l2"><a href="javascript: void(0);">Develop Drill</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/compiling-drill-from-source/">Compiling Drill from Source</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/useful-information-for-drill-developers/">Useful Information for Drill Developers</a></li> </ul> <li class="toctree-l2"><a href="javascript: void(0);">Contribute to Drill</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/apache-drill-contribution-guidelines/">Apache Drill Contribution Guidelines</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/apache-drill-contribution-ideas/">Apache Drill Contribution Ideas</a></li> </ul> <li class="toctree-l2"><a href="javascript: void(0);">Design Docs</a></li> <ul style="display: none"> <li class="toctree-l3"><a class="reference internal" href="/docs/drill-plan-syntax/">Drill Plan Syntax</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/rpc-overview/">RPC Overview</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/query-stages/">Query Stages</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/useful-research/">Useful Research</a></li> <li class="toctree-l3"><a class="reference internal" href="/docs/value-vectors/">Value Vectors</a></li> </ul> </ul> <li class="toctree-l1"><a href="javascript: void(0);">Release Notes</a></li> <ul style="display: none"> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-21-2-release-notes/">Apache Drill 1.21.2 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-21-1-release-notes/">Apache Drill 1.21.1 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-21-0-release-notes/">Apache Drill 1.21.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-20-3-release-notes/">Apache Drill 1.20.3 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-20-2-release-notes/">Apache Drill 1.20.2 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-20-1-release-notes/">Apache Drill 1.20.1 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-20-0-release-notes/">Apache Drill 1.20.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-19-0-release-notes/">Apache Drill 1.19.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-18-0-release-notes/">Apache Drill 1.18.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-17-0-release-notes/">Apache Drill 1.17.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-16-0-release-notes/">Apache Drill 1.16.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-15-0-release-notes/">Apache Drill 1.15.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-14-0-release-notes/">Apache Drill 1.14.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-13-0-release-notes/">Apache Drill 1.13.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-12-0-release-notes/">Apache Drill 1.12.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-11-0-release-notes/">Apache Drill 1.11.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-10-0-release-notes/">Apache Drill 1.10.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-9-0-release-notes/">Apache Drill 1.9.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-8-0-release-notes/">Apache Drill 1.8.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-7-0-release-notes/">Apache Drill 1.7.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-6-0-release-notes/">Apache Drill 1.6.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-5-0-release-notes/">Apache Drill 1.5.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-4-0-release-notes/">Apache Drill 1.4.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-3-0-release-notes/">Apache Drill 1.3.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-2-0-release-notes/">Apache Drill 1.2.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-1-0-release-notes/">Apache Drill 1.1.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-1-0-0-release-notes/">Apache Drill 1.0.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-0-9-0-release-notes/">Apache Drill 0.9.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-0-8-0-release-notes/">Apache Drill 0.8.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-0-7-0-release-notes/">Apache Drill 0.7.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-0-6-0-release-notes/">Apache Drill 0.6.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-0-5-0-release-notes/">Apache Drill 0.5.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-0-4-0-release-notes/">Apache Drill 0.4.0 Release Notes</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/apache-drill-m1-release-notes-apache-drill-alpha/">Apache Drill M1 Release Notes (Apache Drill Alpha)</a></li> </ul> <li class="toctree-l1"><a href="javascript: void(0);">Sample Datasets</a></li> <ul style="display: none"> <li class="toctree-l2"><a class="reference internal" href="/docs/enron-emails/">Enron Emails</a></li> <li class="toctree-l2"><a class="reference internal" href="/docs/wikipedia-edit-history/">Wikipedia Edit History</a></li> </ul> <li class="toctree-l1"><a class="reference internal" href="/docs/project-bylaws/">Project Bylaws</a></li> <li class="toctree-l1"><a href="javascript: void(0);">Ecosystem</a></li> <ul style="display: none"> <li class="toctree-l2"><a class="reference internal" href="/docs/using-saiku-analytics-with-apache-drill/">Using Saiku Analytics with Apache Drill</a></li> </ul> </ul> </div> </div> </aside> <nav class="breadcrumbs"> <li><a href="/docs/">Docs</a></li> <li><a href="/docs/tutorials/">Tutorials</a></li> <li>Orchestrating queries with Airflow</li> </nav> <div class="main-content-wrapper"> <div class="main-content"> <a class="edit-link" href="https://github.com/apache/drill-site/blob/master/_docs/en/tutorials/080-orchestrating-queries-with-airflow.md" target="_blank"><i class="fa fa-pencil-square-o"></i></a> <div class="int_title left"> <h1>Orchestrating queries with Airflow</h1> </div> <!-- jt: we don't need to display a last-modified date on each page to users --> <div class="int_text" align="left"> <p>This tutorial walks through the development of an Apache Airflow DAG that implements a basic ETL process using Apache Drill. We’ll install Airflow into a Python virtualenv using pip before writing and testing our new DAG. Consult the <a href="https://airflow.apache.org/docs/apache-airflow/stable/installation.html">Airflow installation documentation</a> for more information about installing Airflow.</p> <p>I’ll be issuing commands using a shell on a Debian Linux machine in this tutorial but it should be possible with a little translation to follow along on other platforms.</p> <h2 id="prerequisites">Prerequisites</h2> <ol> <li>A Python &gt;= 3.6 installation, including pip and optionally virtualenv.</li> <li>A Drill installation where you have access to run queries and add new storage providers. I’ll be running an embedded mode Drill 1.19.</li> </ol> <h2 id="optional-set-up-a-virtualenv">(Optional) Set up a virtualenv</h2> <p>Create and activate a new virtualenv called “airflow”. If needed, adjust the Python interpreter path and virtualenv target path arguments for your environment.</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">VIRT_ENV_HOME</span><span class="o">=</span>~/.local/lib/virtualenv virtualenv <span class="nt">-p</span> /usr/bin/python3 <span class="nv">$VIRT_ENV_HOME</span>/airflow <span class="nb">.</span> <span class="nv">$VIRT_ENV_HOME</span>/airflow/activate </code></pre></div></div> <h2 id="install-airflow">Install Airflow</h2> <p>If you’ve read their installation guide, you’ll have seen that the Airflow project provides constraints files that pin its Python package dependencies to known-good versions. In many cases things work fine without constraints but, for the sake of reproducibility, we’ll apply the constraints file applicable to our Python version using the script they provide for the purpose.</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">AIRFLOW_VERSION</span><span class="o">=</span>2.1.2 <span class="nv">PYTHON_VERSION</span><span class="o">=</span><span class="s2">"</span><span class="si">$(</span>python <span class="nt">--version</span> | <span class="nb">cut</span> <span class="nt">-d</span> <span class="s2">" "</span> <span class="nt">-f</span> 2 | <span class="nb">cut</span> <span class="nt">-d</span> <span class="s2">"."</span> <span class="nt">-f</span> 1-2<span class="si">)</span><span class="s2">"</span> <span class="nv">CONSTRAINT_URL</span><span class="o">=</span><span class="s2">"https://raw.githubusercontent.com/apache/airflow/constraints-</span><span class="k">${</span><span class="nv">AIRFLOW_VERSION</span><span class="k">}</span><span class="s2">/constraints-</span><span class="k">${</span><span class="nv">PYTHON_VERSION</span><span class="k">}</span><span class="s2">.txt"</span> pip <span class="nb">install</span> <span class="s2">"apache-airflow==</span><span class="k">${</span><span class="nv">AIRFLOW_VERSION</span><span class="k">}</span><span class="s2">"</span> <span class="nt">--constraint</span> <span class="s2">"</span><span class="k">${</span><span class="nv">CONSTRAINT_URL</span><span class="k">}</span><span class="s2">"</span> pip <span class="nb">install </span>apache-airflow-providers-apache-drill </code></pre></div></div> <h2 id="initialise-airflow">Initialise Airflow</h2> <p>We’re just experimenting here so we’ll have Airflow set up a local SQLite database and add an admin user for ourselves.</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Optional: change Airflow's data dir from the default of ~/airflow</span> <span class="nb">export </span><span class="nv">AIRFLOW_HOME</span><span class="o">=</span>~/Development/airflow <span class="nb">mkdir</span> <span class="nt">-p</span> ~/Development/airflow <span class="c"># Create a new SQLite database for Airflow</span> airflow db init <span class="c"># Add an admin user</span> airflow <span class="nb">users </span>create <span class="se">\</span> <span class="nt">--username</span> admin <span class="se">\</span> <span class="nt">--firstname</span> FIRST_NAME <span class="se">\</span> <span class="nt">--lastname</span> LAST_NAME <span class="se">\</span> <span class="nt">--role</span> Admin <span class="se">\</span> <span class="nt">--email</span> admin@example.org <span class="se">\</span> <span class="nt">--password</span> admin </code></pre></div></div> <h2 id="configure-a-drill-connection">Configure a Drill connection</h2> <p>At this point we should have a working Airflow installation. Fire up the web UI with <code class="language-plaintext highlighter-rouge">airflow webserver</code> and browse to http://localhost:8080. Click on Admin -&gt; Connections and add a new Drill connection called <code class="language-plaintext highlighter-rouge">drill_tutorial</code>, setting configuration according to your Drill environment. If you’re using embedded mode Drill locally like I am, then you’ll want the following config.</p> <table> <thead> <tr> <th>Setting</th> <th>Value</th> </tr> </thead> <tbody> <tr> <td>Conn Id</td> <td>drill_tutorial</td> </tr> <tr> <td>Conn Type</td> <td>Drill</td> </tr> <tr> <td>Host</td> <td>localhost</td> </tr> <tr> <td>Port</td> <td>8047</td> </tr> <tr> <td>Extra</td> <td>{“dialect_driver”: “drill+sadrill”, “storage_plugin”: “dfs”}</td> </tr> </tbody> </table> <p>Note that the sqlalchemy-drill dialect and driver information must be specified in the <code class="language-plaintext highlighter-rouge">Extra</code> field. See <a href="https://github.com/JohnOmernik/sqlalchemy-drill">the sqlalchemy-drill documentation</a> for more information about its configuration.</p> <p>After you’ve saved the new connection you can shut the Airflow web UI down with ctrl+c.</p> <h2 id="explore-the-source-data">Explore the source data</h2> <p>If you’ve developed ETLs before you know that you can’t build anything until you’ve come to grips with the source data. Let’s obtain a sample of the first 1m rows from the source take a look.</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-s</span> https://data.cdc.gov/api/views/vbim-akqf/rows.csv<span class="se">\?</span>accessType<span class="se">\=</span>DOWNLOAD | pv <span class="nt">-lSs</span> 1000000 <span class="o">&gt;</span> /tmp/cdc_covid_cases.csvh </code></pre></div></div> <p>You can replace <code class="language-plaintext highlighter-rouge">pv -lSs 1000000</code> above with <code class="language-plaintext highlighter-rouge">head -n1000000</code>, or just drop it if you don’t mind fetching the whole file. Downloading the CSV file with a web browser will also get the job done. Note that for a default Drill installation, saving with the file extension <code class="language-plaintext highlighter-rouge">.csvh</code> does matter for what follows because it will set the option <code class="language-plaintext highlighter-rouge">extractHeader = true</code> when this CSV file is queried, something that the SQL code to come relies on.</p> <p>It’s time to break out Drill. Instead of dumping my entire interactive SQL session here, I’ll just list relevant queries that I ran and the corresponding observations that I made.</p> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">select</span> <span class="o">*</span> <span class="k">from</span> <span class="n">dfs</span><span class="p">.</span><span class="n">tmp</span><span class="p">.</span><span class="nv">`cdc_covid_case.csvh`</span><span class="p">;</span> <span class="c1">-- 1. In date fields, the empty string '' can be converted to SQL NULL</span> <span class="c1">-- 2. Age groups can be split into two numerical fields, with the final</span> <span class="c1">-- group being unbounded above.</span> <span class="k">select</span> <span class="n">age_group</span><span class="p">,</span> <span class="k">count</span><span class="p">()</span> <span class="k">from</span> <span class="n">dfs</span><span class="p">.</span><span class="n">tmp</span><span class="p">.</span><span class="nv">`cdc_covid_case.csvh`</span> <span class="k">group</span> <span class="k">by</span> <span class="n">age_group</span><span class="p">;</span> <span class="k">select</span> <span class="n">sex</span><span class="p">,</span> <span class="k">count</span><span class="p">()</span> <span class="k">from</span> <span class="n">dfs</span><span class="p">.</span><span class="n">tmp</span><span class="p">.</span><span class="nv">`cdc_covid_case.csvh`</span> <span class="k">group</span> <span class="k">by</span> <span class="n">sex</span><span class="p">;</span> <span class="k">select</span> <span class="n">race_ethnicity_combined</span><span class="p">,</span> <span class="k">count</span><span class="p">()</span> <span class="k">from</span> <span class="n">dfs</span><span class="p">.</span><span class="n">tmp</span><span class="p">.</span><span class="nv">`cdc_covid_case.csvh`</span> <span class="k">group</span> <span class="k">by</span> <span class="n">race_ethnicity_combined</span><span class="p">;</span> <span class="c1">-- 3. The string 'Missing' can be transformed to SQL NULL</span> <span class="c1">-- 4. I should really uncover what the difference between 'NA' and 'Missing' is</span> <span class="c1">-- but for this tutorial 'NA' is going to transformed to NULL too</span> <span class="c1">-- 5. race_ethnicity_combined could possibly be split into two fields but we'll</span> <span class="c1">-- leave it as is for this tutorial.</span> <span class="k">select</span> <span class="n">hosp_yn</span><span class="p">,</span> <span class="k">count</span><span class="p">()</span> <span class="k">from</span> <span class="n">dfs</span><span class="p">.</span><span class="n">tmp</span><span class="p">.</span><span class="nv">`cdc_covid_case.csvh`</span> <span class="k">group</span> <span class="k">by</span> <span class="n">hosp_yn</span><span class="p">;</span> <span class="c1">-- 6. In addition to 'Missing, indicator variables have three possible values</span> <span class="c1">-- so they cannot be transformed to nullable booleans</span> </code></pre></div></div> <p>So… this is what it feels like to be a data scientist 😆! Jokes aside, we learned a lot of neccesary stuff pretty quickly there and it’s easy to see that we could have carried on for a long way, testing ranges, casts and regexps and even creating reports if we didn’t reign ourselves in. Let’s skip forward to the ETL statement I ended up creating after exploring.</p> <h2 id="develop-a-ctas-create-table-as-select-etl">Develop a CTAS (Create Table As Select) ETL</h2> <div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">drop</span> <span class="k">table</span> <span class="n">if</span> <span class="k">exists</span> <span class="n">dfs</span><span class="p">.</span><span class="n">tmp</span><span class="p">.</span><span class="n">cdc_covid_cases</span><span class="p">;</span> <span class="k">create</span> <span class="k">table</span> <span class="n">dfs</span><span class="p">.</span><span class="n">tmp</span><span class="p">.</span><span class="n">cdc_covid_cases</span> <span class="k">as</span> <span class="k">with</span> <span class="n">missing2null</span> <span class="k">as</span> <span class="p">(</span> <span class="k">select</span> <span class="k">nullif</span><span class="p">(</span><span class="n">cdc_case_earliest_dt</span><span class="p">,</span> <span class="s1">''</span><span class="p">)</span> <span class="n">cdc_case_earliest_dt</span><span class="p">,</span> <span class="k">nullif</span><span class="p">(</span><span class="n">cdc_report_dt</span><span class="p">,</span> <span class="s1">''</span><span class="p">)</span> <span class="n">cdc_report_dt</span><span class="p">,</span> <span class="k">nullif</span><span class="p">(</span><span class="n">pos_spec_dt</span><span class="p">,</span> <span class="s1">''</span><span class="p">)</span> <span class="n">pos_spec_dt</span><span class="p">,</span> <span class="k">nullif</span><span class="p">(</span><span class="n">onset_dt</span><span class="p">,</span> <span class="s1">''</span><span class="p">)</span> <span class="n">onset_dt</span><span class="p">,</span> <span class="k">case</span> <span class="k">when</span> <span class="n">current_status</span> <span class="k">not</span> <span class="k">in</span> <span class="p">(</span><span class="s1">'Missing'</span><span class="p">,</span> <span class="s1">'NA'</span><span class="p">)</span> <span class="k">then</span> <span class="n">current_status</span> <span class="k">end</span> <span class="n">current_status</span><span class="p">,</span> <span class="k">case</span> <span class="k">when</span> <span class="n">sex</span> <span class="k">not</span> <span class="k">in</span> <span class="p">(</span><span class="s1">'Missing'</span><span class="p">,</span> <span class="s1">'NA'</span><span class="p">)</span> <span class="k">then</span> <span class="n">sex</span> <span class="k">end</span> <span class="n">sex</span><span class="p">,</span> <span class="k">case</span> <span class="k">when</span> <span class="n">age_group</span> <span class="k">not</span> <span class="k">in</span> <span class="p">(</span><span class="s1">'Missing'</span><span class="p">,</span> <span class="s1">'NA'</span><span class="p">)</span> <span class="k">then</span> <span class="n">age_group</span> <span class="k">end</span> <span class="n">age_group</span><span class="p">,</span> <span class="k">case</span> <span class="k">when</span> <span class="n">race_ethnicity_combined</span> <span class="k">not</span> <span class="k">in</span> <span class="p">(</span><span class="s1">'Missing'</span><span class="p">,</span> <span class="s1">'NA'</span><span class="p">)</span> <span class="k">then</span> <span class="n">race_ethnicity_combined</span> <span class="k">end</span> <span class="n">race_ethnicity_combined</span><span class="p">,</span> <span class="k">case</span> <span class="k">when</span> <span class="n">hosp_yn</span> <span class="k">not</span> <span class="k">in</span> <span class="p">(</span><span class="s1">'Missing'</span><span class="p">,</span> <span class="s1">'NA'</span><span class="p">)</span> <span class="k">then</span> <span class="n">hosp_yn</span> <span class="k">end</span> <span class="n">hosp_yn</span><span class="p">,</span> <span class="k">case</span> <span class="k">when</span> <span class="n">icu_yn</span> <span class="k">not</span> <span class="k">in</span> <span class="p">(</span><span class="s1">'Missing'</span><span class="p">,</span> <span class="s1">'NA'</span><span class="p">)</span> <span class="k">then</span> <span class="n">icu_yn</span> <span class="k">end</span> <span class="n">icu_yn</span><span class="p">,</span> <span class="k">case</span> <span class="k">when</span> <span class="n">death_yn</span> <span class="k">not</span> <span class="k">in</span> <span class="p">(</span><span class="s1">'Missing'</span><span class="p">,</span> <span class="s1">'NA'</span><span class="p">)</span> <span class="k">then</span> <span class="n">death_yn</span> <span class="k">end</span> <span class="n">death_yn</span><span class="p">,</span> <span class="k">case</span> <span class="k">when</span> <span class="n">medcond_yn</span> <span class="k">not</span> <span class="k">in</span> <span class="p">(</span><span class="s1">'Missing'</span><span class="p">,</span> <span class="s1">'NA'</span><span class="p">)</span> <span class="k">then</span> <span class="n">medcond_yn</span> <span class="k">end</span> <span class="n">medcond_yn</span> <span class="k">from</span> <span class="n">dfs</span><span class="p">.</span><span class="n">tmp</span><span class="p">.</span><span class="nv">`cdc_covid_cases.csvh`</span><span class="p">),</span> <span class="n">age_parse</span> <span class="k">as</span> <span class="p">(</span> <span class="k">select</span> <span class="o">*</span><span class="p">,</span> <span class="n">regexp_replace</span><span class="p">(</span><span class="n">age_group</span><span class="p">,</span> <span class="s1">'([0-9]+)[ </span><span class="se">\-\+</span><span class="s1">]+([0-9]*) Years'</span><span class="p">,</span> <span class="s1">'$1'</span><span class="p">)</span> <span class="n">age_min_incl</span><span class="p">,</span> <span class="n">regexp_replace</span><span class="p">(</span><span class="n">age_group</span><span class="p">,</span> <span class="s1">'([0-9]+)[ </span><span class="se">\-\+</span><span class="s1">]+([0-9]*) Years'</span><span class="p">,</span> <span class="s1">'$2'</span><span class="p">)</span> <span class="n">age_max_excl</span> <span class="k">from</span> <span class="n">missing2null</span><span class="p">)</span> <span class="k">select</span> <span class="k">cast</span><span class="p">(</span><span class="n">cdc_case_earliest_dt</span> <span class="k">as</span> <span class="nb">date</span><span class="p">)</span> <span class="n">cdc_case_earliest_dt</span><span class="p">,</span> <span class="k">cast</span><span class="p">(</span><span class="n">cdc_report_dt</span> <span class="k">as</span> <span class="nb">date</span><span class="p">)</span> <span class="n">cdc_report_dt</span><span class="p">,</span> <span class="k">cast</span><span class="p">(</span><span class="n">pos_spec_dt</span> <span class="k">as</span> <span class="nb">date</span><span class="p">)</span> <span class="n">pos_spec_dt</span><span class="p">,</span> <span class="k">cast</span><span class="p">(</span><span class="n">onset_dt</span> <span class="k">as</span> <span class="nb">date</span><span class="p">)</span> <span class="n">onset_dt</span><span class="p">,</span> <span class="n">current_status</span><span class="p">,</span> <span class="n">sex</span><span class="p">,</span> <span class="n">age_group</span><span class="p">,</span> <span class="k">cast</span><span class="p">(</span><span class="n">age_min_incl</span> <span class="k">as</span> <span class="nb">float</span><span class="p">)</span> <span class="n">age_min_incl</span><span class="p">,</span> <span class="mi">1</span> <span class="o">+</span> <span class="k">cast</span><span class="p">(</span><span class="k">case</span> <span class="k">when</span> <span class="n">age_max_excl</span> <span class="o">=</span> <span class="s1">''</span> <span class="k">then</span> <span class="s1">'Infinity'</span> <span class="k">else</span> <span class="n">age_max_excl</span> <span class="k">end</span> <span class="k">as</span> <span class="nb">float</span><span class="p">)</span> <span class="n">age_max_excl</span><span class="p">,</span> <span class="n">race_ethnicity_combined</span><span class="p">,</span> <span class="n">hosp_yn</span><span class="p">,</span> <span class="n">icu_yn</span><span class="p">,</span> <span class="n">death_yn</span><span class="p">,</span> <span class="n">medcond_yn</span> <span class="k">from</span> <span class="n">age_parse</span><span class="p">;</span> </code></pre></div></div> <p>That’s a substantial SQL statement but it covers a fair amount of transformation work and takes us all the way to an output of one (or more) Parquet files, efficient and clean representations of our dataset that are well suited for analytical or ML work. Consider what we have <em>not</em> done to get this far.</p> <ul> <li>We have no configuration hidden in the checkboxes and wizards of an ETL package,</li> <li>we have not had to add another language to the SQL we used to explore and test trasformations at the outset and</li> <li>we have not worried about performance or how to parallelise our data flow because we’ve left that aspect to Drill.</li> </ul> <p>In addition, while I’ve yet to hear of SQL winning a language beauty contest, our ETL code feels obvious, self-contained and maintainable. I’d have no qualms with reviewing a line-by-line diff of this code to isolate a change after a hiatus of months or years, nor any with pointing a SQL-conversant colleague at it with little or even no introduction. The veteran coder knows that these mundane advantages can swing an extended campaign.</p> <p>To complete this step, save the CTAS script above into a new file at <code class="language-plaintext highlighter-rouge">$AIRFLOW_HOME/dags/cdc_covid_cases.drill.sql</code>. The double file extension is just a little convention I use to indicate both the dialect and the language of my SQL scripts, and entirely optional if it’s not to your taste.</p> <h2 id="develop-an-airflow-dag">Develop an Airflow DAG</h2> <p>The definition of our DAG will reside in a single Python script. The complete listing of that script follows immediately, with my commentary continuing as inline source code comments. You should save this script to a new file at <code class="language-plaintext highlighter-rouge">$AIRFLOW_HOME/dags/drill-tutorial.py</code>.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">''' Uses the Apache Drill provider to transform, load and report from COVID case data downloaded from the website of the CDC. Data source citation. Centers for Disease Control and Prevention, COVID-19 Response. COVID-19 Case Surveillance Public Data Access, Summary, and Limitations. https://data.cdc.gov/Case-Surveillance/COVID-19-Case-Surveillance-Public-Use-Data/vbim-akqf '''</span> <span class="kn">from</span> <span class="nn">datetime</span> <span class="kn">import</span> <span class="n">timedelta</span> <span class="kn">from</span> <span class="nn">airflow</span> <span class="kn">import</span> <span class="n">DAG</span> <span class="c1"># We'll use a PythonOperator to stage COVID-19 CSV file from the CDC web site </span><span class="kn">from</span> <span class="nn">airflow.operators.python</span> <span class="kn">import</span> <span class="n">PythonOperator</span> <span class="c1"># We'll use DrillOperators to kick off queries against the COVID-19 data </span><span class="kn">from</span> <span class="nn">airflow.providers.apache.drill.operators.drill</span> <span class="kn">import</span> <span class="n">DrillOperator</span> <span class="kn">from</span> <span class="nn">airflow.utils.dates</span> <span class="kn">import</span> <span class="n">days_ago</span> <span class="c1"># We can assume requests is present because sqlalchemy-drill requires it </span><span class="kn">import</span> <span class="nn">requests</span> <span class="c1"># These args will get passed on to each operator # You can override them on a per-task basis during operator initialization </span><span class="n">default_args</span> <span class="o">=</span> <span class="p">{</span> <span class="s">'owner'</span><span class="p">:</span> <span class="s">'Joe Public'</span><span class="p">,</span> <span class="s">'depends_on_past'</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span> <span class="s">'email'</span><span class="p">:</span> <span class="p">[</span><span class="s">'joe@public.com'</span><span class="p">],</span> <span class="s">'email_on_failure'</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span> <span class="s">'email_on_retry'</span><span class="p">:</span> <span class="bp">False</span><span class="p">,</span> <span class="s">'retries'</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="s">'retry_delay'</span><span class="p">:</span> <span class="n">timedelta</span><span class="p">(</span><span class="n">minutes</span><span class="o">=</span><span class="mi">5</span><span class="p">),</span> <span class="p">}</span> <span class="k">def</span> <span class="nf">stage_from_www</span><span class="p">(</span><span class="n">src_url</span><span class="p">,</span> <span class="n">tgt_path</span><span class="p">):</span> <span class="s">''' Uses the Requests lib to GET case surveillance data from CDC to a local path. If you're in a distributed environment you'll want to replace the local filesystem with HDFS, S3, etc. Another option is to configure Drill's HTTP storage plugin to fetch the data directly from the source. '''</span> <span class="n">resp</span> <span class="o">=</span> <span class="n">requests</span><span class="p">.</span><span class="n">get</span><span class="p">(</span> <span class="n">src_url</span><span class="p">,</span> <span class="n">stream</span><span class="o">=</span><span class="bp">True</span> <span class="c1"># don't buffer big datasets in memory </span> <span class="p">)</span> <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">tgt_path</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span> <span class="n">f</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">resp</span><span class="p">.</span><span class="n">content</span><span class="p">)</span> <span class="k">with</span> <span class="n">DAG</span><span class="p">(</span> <span class="s">'drill_tutorial'</span><span class="p">,</span> <span class="n">default_args</span><span class="o">=</span><span class="n">default_args</span><span class="p">,</span> <span class="n">description</span><span class="o">=</span><span class="s">'Drill tutorial that loads COVID-19 case data from the CDC.'</span><span class="p">,</span> <span class="n">schedule_interval</span><span class="o">=</span><span class="n">timedelta</span><span class="p">(</span><span class="n">weeks</span><span class="o">=</span><span class="mi">2</span><span class="p">),</span> <span class="c1"># source is updated every two weeks </span> <span class="n">start_date</span><span class="o">=</span><span class="n">days_ago</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="p">)</span> <span class="k">as</span> <span class="n">dag</span><span class="p">:</span> <span class="c1"># Use this module's docstring for DAG's documentation (visible in the web UI) </span> <span class="n">dag</span><span class="p">.</span><span class="n">doc_md</span> <span class="o">=</span> <span class="n">__doc__</span> <span class="c1"># First task is a PythonOperator to GET the CSV data from the CDC website </span> <span class="n">stage_from_www_task</span> <span class="o">=</span> <span class="n">PythonOperator</span><span class="p">(</span> <span class="n">task_id</span><span class="o">=</span><span class="s">'stage_from_www'</span><span class="p">,</span> <span class="n">python_callable</span><span class="o">=</span><span class="n">stage_from_www</span><span class="p">,</span> <span class="n">op_kwargs</span><span class="o">=</span> <span class="p">{</span> <span class="s">'src_url'</span><span class="p">:</span> <span class="s">'https://data.cdc.gov/api/views/vbim-akqf/rows.csv?accessType=DOWNLOAD'</span><span class="p">,</span> <span class="s">'tgt_path'</span><span class="p">:</span> <span class="s">'/tmp/cdc_covid_cases.csvh'</span> <span class="p">}</span> <span class="p">)</span> <span class="n">stage_from_www</span><span class="p">.</span><span class="n">doc</span> <span class="o">=</span> <span class="s">'Download COVID case CSV data from the CDC using '</span> \ <span class="s">'an HTTP GET'</span> <span class="c1"># Second task is a DrillOperator the executes our CTAS ETL from an external </span> <span class="c1"># script. It's also possible to specify inline SQL, and to split this </span> <span class="c1"># multi-statement SQL script across tasks e.g. if you prefer to have </span> <span class="c1"># the inital DROP TABLE be a separate task. </span> <span class="n">ctas_etl_task</span> <span class="o">=</span> <span class="n">DrillOperator</span><span class="p">(</span> <span class="n">drill_conn_id</span><span class="o">=</span><span class="s">'drill_tutorial'</span><span class="p">,</span> <span class="n">task_id</span><span class="o">=</span><span class="s">'ctas_etl'</span><span class="p">,</span> <span class="n">sql</span><span class="o">=</span><span class="s">'cdc_covid_cases.drill.sql'</span> <span class="p">)</span> <span class="n">ctas_etl_task</span><span class="p">.</span><span class="n">doc</span> <span class="o">=</span> <span class="s">'Recreate dfs.tmp.cdc_covid_cases using CTAS'</span> <span class="c1"># Third task is a DrillOperator that produces a daily case count report. </span> <span class="c1"># We just write the report back out to dfs.tmp as human-readable CSV, but </span> <span class="c1"># you should imagine using Airflow to route and deliver it in any number </span> <span class="c1"># of ways. </span> <span class="n">daily_count_report_task</span> <span class="o">=</span> <span class="n">DrillOperator</span><span class="p">(</span> <span class="n">drill_conn_id</span><span class="o">=</span><span class="s">'drill_tutorial'</span><span class="p">,</span> <span class="n">task_id</span><span class="o">=</span><span class="s">'drill_report'</span><span class="p">,</span> <span class="n">sql</span><span class="o">=</span><span class="s">''' set `store.format` = 'csv'; drop table if exists dfs.tmp.cdc_daily_counts; create table dfs.tmp.cdc_daily_counts as select cdc_case_earliest_dt, count(*) as case_count from dfs.tmp.cdc_covid_cases group by cdc_case_earliest_dt order by cdc_case_earliest_dt; '''</span> <span class="p">)</span> <span class="n">daily_count_report_task</span><span class="p">.</span><span class="n">doc</span> <span class="o">=</span> <span class="s">'Report daily case counts to CSV'</span> <span class="c1"># Specify the edges of the DAG, i.e. the task dependencies </span> <span class="n">stage_from_www_task</span> <span class="o">&gt;&gt;</span> <span class="n">ctas_etl_task</span> <span class="o">&gt;&gt;</span> <span class="n">daily_count_report_task</span> <span class="n">age_parse</span><span class="p">;</span> </code></pre></div></div> <h2 id="manually-launch-the-airflow-dag">Manually launch the Airflow DAG</h2> <p>You can harmlessly test the Python syntax of a DAG script by running it through the interpreter.</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python3 <span class="nv">$AIRFLOW_HOME</span>/dags/drill-tutorial.py </code></pre></div></div> <p>If all is well Python will exit without errors and you can proceed to ensure that your Drillbit is running, then launch a test run of you DAG using airflow.</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>airflow dags <span class="nb">test </span>drill_tutorial <span class="si">$(</span><span class="nb">date</span> +%Y-%m-%d<span class="si">)</span> </code></pre></div></div> <p>After a delay while the COVID case dataset is downloaded to your machine you should start to see all of the queries executed on Drill logged to your console by sqlalchemy-drill. The DAG execution should have produced two outputs.</p> <ol> <li>A Parquet dataset at <code class="language-plaintext highlighter-rouge">$TMPDIR/cdc_covid_cases</code> at the individual case grain.</li> <li>A CSV daily surveilled case count report at <code class="language-plaintext highlighter-rouge">$TMPDIR/cdc_daily_counts</code>.</li> </ol> <p>Try some OLAP in Drill with the first and take a look at the second in a spreadsheet or text editor.</p> <p>Congratulations, you built an ETL using Apache Airflow and Apache Drill!</p> <h2 id="next-steps">Next steps</h2> <ul> <li><a href="https://airflow.apache.org/docs/apache-airflow/1.10.1/scheduler.html">Read about Airflow scheduling</a> and run the scheduler as a daemon to have your job run automatically.</li> <li>Try adapting the DAG here to work with other data sources. If you have databases, files and web services in your own environment those will be natural choices, otherwise you can look around online for more public datasets and APIs.</li> <li>Instead of replacing the target dataset, try adding new partitions to an existing dataset by aiming CTAS at date-labelled subdirectories.</li> <li>Keep an eye out for data crunching steps in existing workflows, including those which are not strictly ETL pipelines, where Drill could shoulder some of the load.</li> </ul> <p>Thanks for joining us for this tutorial and happy Drilling!</p> <div class="doc-nav"> <span class="previous-toc"><a href="/docs/analyzing-data-using-window-functions/">← Analyzing Data Using Window Functions</a></span><span class="next-toc"><a href="/docs/about-the-mapr-sandbox/">About the MapR Sandbox →</a></span> </div> </div> </div> </div> </div> <p class="push"></p> <div id="footer" class="mw"> <div class="wrapper"> Copyright © 2012-2025 The Apache Software Foundation, licensed under the Apache License, Version 2.0.<br> Apache and the Apache feather logo are trademarks of The Apache Software Foundation. Other names appearing on the site may be trademarks of their respective owners.<br/><br/> </div> </div> <script type="text/javascript" src="https://s7.addthis.com/js/300/addthis_widget.js#pubid=ra-548b2caa33765e8d" async="async"></script> </body> </html>

Pages: 1 2 3 4 5 6 7 8 9 10