# Apache Airflow

Euno connects to Apache Airflow to discover workflow orchestration metadata, including Airflow instances, DAGs, tasks, and Airflow datasets or Airflow 3 assets.

Euno's Airflow integration supports auto-discovery of:

* Airflow instances
* DAGs
* Tasks
* Datasets and assets

## Prerequisites

* Airflow 2 with the stable REST API, or Airflow 3 with REST API v2.
* An Airflow API token, or an Airflow username and password.
* Network access from Euno to the Airflow webserver.
* The configured identity must be able to read DAGs and tasks. Dataset or asset permissions are needed to discover dataset-aware scheduling metadata.

Euno detects the supported Airflow API version automatically during validation. If the dataset or asset endpoint is not present or the identity cannot read it, Euno continues without those resources.
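
The detection described above can be pictured with a small sketch. It is not Euno's implementation: it assumes the version endpoints `/api/v2/version` (Airflow 3) and `/api/v1/version` (Airflow 2), the list endpoints `/api/v2/assets` and `/api/v1/datasets`, and a hypothetical `get_status` callable that performs an authenticated GET and returns the HTTP status code.

```python
from typing import Callable, Optional

# Assumed version endpoints: Airflow 3 serves /api/v2/version,
# Airflow 2 serves /api/v1/version. The newest API is probed first.
CANDIDATE_PATHS = [("v2", "/api/v2/version"), ("v1", "/api/v1/version")]


def detect_api_version(get_status: Callable[[str], int]) -> Optional[str]:
    """Return "v2", "v1", or None if neither version endpoint answers 200.

    get_status is a hypothetical callable: path -> HTTP status code.
    """
    for version, path in CANDIDATE_PATHS:
        if get_status(path) == 200:
            return version
    return None


def datasets_available(get_status: Callable[[str], int], version: str) -> bool:
    """Report whether datasets (v1) or assets (v2) can be read.

    A 403 or 404 means "skip these resources", not "fail the run".
    """
    path = "/api/v2/assets" if version == "v2" else "/api/v1/datasets"
    return get_status(path) == 200
```

Injecting `get_status` keeps the probing logic independent of any HTTP client, so the same check can be run by hand with whatever client you already use.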

## Stage 1: Configure Airflow

### Step 1: Create an API token or service account

Euno supports two authentication methods: a bearer token (recommended) or username/password basic auth.

**Option A — API token (recommended)**

1. Log in to your Airflow webserver and go to **Admin → Users**.
2. Open or create the user you want Euno to authenticate as. To create a dedicated service user from the Airflow CLI instead:

   ```bash
   airflow users create --username euno-service --role Viewer \
     --email euno@example.com --firstname Euno --lastname Service --password <password>
   ```
3. Scroll to the **Extra** section of that user and generate a new token.
4. Copy the token. It is shown only once.

The Airflow user must have at minimum the built-in **Viewer** role so it can read DAGs, tasks, and dataset metadata. If you also want Euno to collect Airflow connections for lineage resolution (the **Resolve warehouse lineage** option), the user additionally needs the **Op** role or a custom role that includes `can_read on Connections`.

{% hint style="info" %}
Airflow 2 requires the stable REST API to be enabled. In `airflow.cfg`, set `[api] auth_backends = airflow.api.auth.backend.basic_auth` (or `jwt_auth` for token-based auth). Airflow 3 enables the REST API v2 by default.
{% endhint %}

**Option B — Username and password**

If your Airflow deployment does not support token authentication, prepare a username and password for a dedicated service account with the permissions described above. Euno will use HTTP Basic auth.
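
As a rough illustration of how the two options translate into HTTP headers, the sketch below builds an `Authorization` header, preferring the bearer token when one is set. The function name and argument names are hypothetical; only the `Bearer` and `Basic` header formats are standard.

```python
import base64
from typing import Dict, Optional


def build_auth_header(api_token: Optional[str] = None,
                      username: Optional[str] = None,
                      password: Optional[str] = None) -> Dict[str, str]:
    """Return the Authorization header for an Airflow API request.

    The token wins when present; username and password are the fallback.
    """
    if api_token:
        return {"Authorization": f"Bearer {api_token}"}
    if username and password is not None:
        raw = f"{username}:{password}".encode("utf-8")
        encoded = base64.b64encode(raw).decode("ascii")
        return {"Authorization": f"Basic {encoded}"}
    raise ValueError("Provide an API token, or a username and password")
```

You can pass the resulting header to any HTTP client to confirm that the credentials are accepted before saving them in Euno.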

### Step 2: Verify network access

Ensure Euno's servers can reach the Airflow webserver URL over HTTPS. If your Airflow instance uses a private CA or a self-signed certificate that Euno cannot validate, disable the **Verify SSL certificates** option in the source configuration (see Stage 2).

{% hint style="warning" %}
Do not expose your Airflow webserver directly to the public internet if it is not already. Prefer network-level controls (VPN, VPC peering, IP allowlist) to allow only Euno's egress IPs to reach your Airflow instance.
{% endhint %}

## Stage 2: Configure New Airflow Source in Euno

### Step 1: Access the Sources Page

Navigate to **Settings → Sources** and click **Add New Source**. Select **Airflow** from the integration list.

### Step 2: General Configuration

An asterisk (\*) marks a mandatory field.

| Configuration                 | Description                                                                                          |
| ----------------------------- | ---------------------------------------------------------------------------------------------------- |
| **Base URL**\*                | Airflow webserver URL. Host-only values are canonicalized to HTTPS.                                  |
| **API token**                 | Bearer token for Airflow API access. If present, Euno uses token auth.                               |
| **Username**                  | Basic auth username. Used only when no API token is configured.                                      |
| **Password**                  | Basic auth password. Used only when no API token is configured.                                      |
| **Verify SSL certificates**   | Keep enabled for production. Disable only for local labs or private certificates. Default: enabled.  |
| **DAG pattern**               | Optional allow/deny regex pattern for DAG IDs. Use this to limit discovery to selected DAGs.         |
| **Connection pattern**        | Optional allow/deny regex pattern for Airflow connection IDs to include in lineage resolution.       |
| **Observe execution history** | Collect bounded DAG run and task instance events for metrics and operator lineage. Default: enabled. |
| **Execution history days**    | Lookback window for execution history. Range: 1–60 days. Default: 30.                                |
| **Observe datasets**          | Collect Airflow datasets or Airflow 3 assets when the endpoint is available. Default: enabled.       |
| **Resolve warehouse lineage** | Collect raw connection and operator evidence for warehouse lineage resolution. Default: enabled.     |
| **Connection mapping**        | Optional manual mapping from Airflow connection IDs to Euno resource URIs.                           |
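
To make two of the fields above concrete, here is a minimal sketch of how a host-only **Base URL** might be canonicalized and how an allow/deny **DAG pattern** might be applied. The exact pattern syntax Euno accepts is not specified on this page, so the `allow` and `deny` lists, full-ID matching, deny-wins precedence, and trailing-slash trimming are all assumptions.

```python
import re
from typing import Iterable, List


def canonicalize_base_url(value: str) -> str:
    """Prefix host-only values with https:// (trailing-slash trimming is assumed)."""
    value = value.strip()
    if "://" not in value:
        value = "https://" + value
    return value.rstrip("/")


def filter_dag_ids(dag_ids: Iterable[str],
                   allow: List[str],
                   deny: List[str]) -> List[str]:
    """Keep IDs that match an allow regex (or any ID if allow is empty)
    and match no deny regex. Patterns must match the whole DAG ID."""
    kept = []
    for dag_id in dag_ids:
        allowed = not allow or any(re.fullmatch(p, dag_id) for p in allow)
        denied = any(re.fullmatch(p, dag_id) for p in deny)
        if allowed and not denied:
            kept.append(dag_id)
    return kept
```

Testing a pattern this way against a list of your DAG IDs is a quick check that it selects what you expect before a crawl runs.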

Large Airflow deployments are protected by internal safety limits; if a limit is reached, the run report explains what stopped or was truncated.
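
The idea of a safety limit with a reported truncation can be sketched as bounded pagination. Euno's actual limits are not documented here, so `max_items` and `page_size` are illustrative values, and `fetch_page` is a hypothetical callable that wraps an Airflow list endpoint using its `limit` and `offset` parameters and returns the decoded JSON body, including `total_entries`.

```python
from typing import Callable, Dict, List, Tuple


def collect_bounded(fetch_page: Callable[[int, int], Dict],
                    key: str,
                    page_size: int = 100,
                    max_items: int = 10_000) -> Tuple[List[dict], bool]:
    """Page through a list endpoint until it is exhausted or max_items is hit.

    fetch_page(limit, offset) returns a body such as
    {"dags": [...], "total_entries": 250}.
    Returns (items, truncated).
    """
    items: List[dict] = []
    offset = 0
    while True:
        body = fetch_page(page_size, offset)
        page = body[key]
        total = body["total_entries"]
        items.extend(page)
        offset += len(page)
        if len(items) >= max_items:
            # The cap was reached; report truncation only if more items exist.
            return items[:max_items], total > max_items
        if not page or offset >= total:
            return items, False
```

Returning the `truncated` flag alongside the items is what lets a run report say that a limit was reached instead of silently dropping data.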

### Step 3: Schedule

Enable the **Schedule** option and choose how often Euno crawls the Airflow source:

* **Hourly**: Set the interval in hours (e.g., every 4 hours). Recommended for active pipelines where DAG and task metadata changes frequently.
* **Weekly**: Set specific days and times for a lighter crawl cadence.

{% hint style="info" %}
**Recommended**: Schedule the Airflow integration to run every 1–4 hours if you have execution history collection enabled. This keeps DAG run metrics and task lineage at most a few hours behind Airflow. You can also trigger a manual run at any time.
{% endhint %}

### Step 4: Resource Cleanup

Configure the cleanup policy to control how Euno handles resources that disappear from Airflow:

* **Immediate Cleanup**: Resources not detected in the most recent successful run are removed immediately. Use this to keep the catalog tightly in sync with your Airflow deployment.
* **TTL-based Cleanup**: Resources are retained for a configurable number of days after they were last seen, then removed. Useful when DAGs are temporarily disabled or when Airflow is redeployed.
* **No Cleanup**: Resources are retained indefinitely, even if they are no longer detected in Airflow.

{% hint style="info" %}
**Recommended**: Use **Immediate Cleanup** for most deployments. This ensures that retired or deleted DAGs and tasks are promptly removed from the catalog.
{% endhint %}
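
The difference between the three policies can be sketched as one selection function. This is an illustration under stated assumptions rather than Euno's cleanup code: the policy names `"immediate"`, `"ttl"`, and `"none"`, the `last_seen` mapping, and the URI strings are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def resources_to_remove(last_seen: Dict[str, datetime],
                        last_successful_run: datetime,
                        policy: str,
                        ttl_days: Optional[int] = None,
                        now: Optional[datetime] = None) -> List[str]:
    """Return the resource URIs a cleanup pass would remove.

    last_seen maps each resource URI to the time of the last run that saw it.
    """
    now = now or datetime.now(timezone.utc)
    if policy == "none":
        return []
    if policy == "immediate":
        # Anything the most recent successful run did not see is removed.
        return sorted(u for u, seen in last_seen.items()
                      if seen < last_successful_run)
    if policy == "ttl":
        if ttl_days is None:
            raise ValueError("ttl_days is required for the TTL policy")
        cutoff = now - timedelta(days=ttl_days)
        return sorted(u for u, seen in last_seen.items() if seen < cutoff)
    raise ValueError(f"unknown policy: {policy}")
```

With a 7-day TTL, a DAG that was paused for three days survives the cleanup pass, while the immediate policy would already have removed it.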

## What Euno Discovers

* Airflow instances
* DAGs
* Tasks
* Datasets and assets

Airflow dataset and asset payloads are kept as raw source evidence. When a dataset or asset URI is a supported warehouse table URI, Euno also emits non-authoritative database, schema, and table observations from that explicit Airflow metadata. Global processors then derive DAG and task lineage from the same evidence. When Airflow defines an output table, Euno also applies the upstream table lineage of the defining DAG or task to that output table. When the lineage processor reprocesses an output that Airflow previously defined and finds that Airflow no longer defines it, Euno clears the stale Airflow-owned output lineage.
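
The step from a dataset or asset URI to database, schema, and table observations might look like the sketch below. Which URI shapes Euno supports is not listed on this page, so both the `<scheme>://<host>/<database>/<schema>/<table>` layout and the `SUPPORTED_SCHEMES` set are assumptions made purely for illustration.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlparse

# Assumed for illustration; the real list of supported schemes may differ.
SUPPORTED_SCHEMES = {"snowflake", "bigquery", "postgres"}


class TableRef(NamedTuple):
    scheme: str
    database: str
    schema: str
    table: str


def parse_table_uri(uri: str) -> Optional[TableRef]:
    """Return a TableRef for a complete table URI, else None.

    Partial URIs (for example, missing the schema) yield None so that
    no observation is guessed from incomplete evidence.
    """
    parsed = urlparse(uri)
    if parsed.scheme not in SUPPORTED_SCHEMES:
        return None
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) != 3:
        return None
    return TableRef(parsed.scheme, *parts)
```

Returning `None` for anything that is not a complete, recognized table URI mirrors the non-authoritative stance described above: the raw payload is still kept, but no table observation is derived from it.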

Operator SQL lineage is more heuristic. When execution history and connection metadata provide a complete, supported input or output target, Euno emits non-authoritative warehouse database, schema, and table observations for that SQL or operator table. Temporary tables created inside the SQL are skipped as warehouse observations. MySQL and Postgres operator lineage still enriches only warehouse tables that already exist in Euno.

When execution history collection is enabled, Euno stores Airflow DAG runs and task instances as DAMA events:

* `airflow_dag_run` stores the raw Airflow DAG run payload on the DAG URI.
* `airflow_task_instance` stores the raw Airflow task instance payload on the task URI.

Euno derives DAG and task execution metrics from these events. These include latest run status, failing task IDs, run counts, success rates, duration averages, retry rates, and failure streaks. Operator lineage processors also use task instance events when rendered operator fields are needed.
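
A few of these metrics can be sketched directly from DAG run payloads. The sketch assumes each payload carries the Airflow REST API fields `state`, `start_date`, and `end_date`, and it counts only runs that finished as `success` or `failed`. How Euno treats other states is not described here, so that choice is an assumption.

```python
from datetime import datetime
from typing import Dict, List, Optional


def _parse(ts: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp, accepting a trailing "Z"."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00")) if ts else None


def dag_run_metrics(runs: List[dict]) -> Dict[str, object]:
    """Summarize finished DAG runs, ordered by start time."""
    finished = [r for r in runs
                if r.get("state") in ("success", "failed") and r.get("start_date")]
    finished.sort(key=lambda r: _parse(r["start_date"]))
    durations = [
        (_parse(r["end_date"]) - _parse(r["start_date"])).total_seconds()
        for r in finished if r.get("end_date")
    ]
    # Count consecutive failures backwards from the most recent run.
    streak = 0
    for run in reversed(finished):
        if run["state"] != "failed":
            break
        streak += 1
    successes = sum(1 for r in finished if r["state"] == "success")
    return {
        "run_count": len(finished),
        "latest_status": finished[-1]["state"] if finished else None,
        "success_rate": successes / len(finished) if finished else None,
        "avg_duration_seconds": sum(durations) / len(durations) if durations else None,
        "failure_streak": streak,
    }
```

Because the metrics are recomputed from stored events, widening or narrowing **Execution history days** changes the window they summarize.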

Current operator lineage heuristics cover generic SQL operators, Snowflake operators, `MySqlOperator`, `PostgresOperator`, `DatabricksSubmitRunOperator`, `S3ToSnowflakeOperator`, `BigQueryInsertJobOperator`, and `BigQueryExecuteQueryOperator`. Unsupported operators are reported as unresolved lineage evidence rather than guessed.

SQL operator lineage uses Airflow connection metadata or the **Connection mapping** setting to resolve warehouse URI namespaces. If Euno cannot read Airflow connections, or if an input or output target is partial or ambiguous, Euno does not create warehouse observations from that SQL evidence. MySQL, Postgres, and Databricks need a single matching mapped warehouse prefix when rendered fields omit `conn_id`. Databricks submit-run lineage reads only SQL task `query_text` from rendered JSON.
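
The "single matching mapped warehouse prefix" rule can be illustrated with a short sketch. The mapping shape (connection ID to URI prefix) and the URI strings are hypothetical, and the sketch covers only the manual mapping, not connection metadata read from Airflow.

```python
from typing import Dict, Optional


def resolve_warehouse_prefix(conn_id: Optional[str],
                             mapping: Dict[str, str],
                             scheme: str) -> Optional[str]:
    """Return the warehouse URI prefix for an operator, or None if unresolved.

    mapping is a hypothetical Connection mapping: conn_id -> URI prefix.
    When conn_id is missing, exactly one mapped prefix of the given scheme
    must exist; zero or several candidates are treated as ambiguous.
    """
    if conn_id is not None:
        return mapping.get(conn_id)
    candidates = {p for p in mapping.values() if p.startswith(scheme + "://")}
    return candidates.pop() if len(candidates) == 1 else None
```

If you run several Postgres or MySQL warehouses, operators whose rendered fields omit `conn_id` stay unresolved under this rule, so the ambiguity is visible rather than being settled by a guess.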

For the full list of discovered resource types, properties, and relationships, see [Airflow Integration Discovered Resources](/sources/transformation-etl/airflow-integration/airflow-integration-discovered-resources.md).

