
Commit 162b057

Use openlineage-airflow in airflow example and fix debugging step (#1635)
* Use openlineage-airflow in airflow example
  Signed-off-by: wslulciuc <[email protected]>
* Fix alter column for step 5 in airflow example
  Signed-off-by: wslulciuc <[email protected]>
1 parent 278dbcf · commit 162b057

File tree: 3 files changed (+68, −53 lines)

* .gitignore
* examples/airflow/README.md
* examples/airflow/docker-compose.yml

.gitignore

Lines changed: 3 additions & 7 deletions
```diff
@@ -20,11 +20,11 @@ bin/
 build/
 clients/java/out/
 dist/
-examples/airflow/marquez.env
-examples/airflow/requirements.txt
-examples/airflow/whl
 integrations/airflow/tests/integration/integration-requirements.txt
 integrations/airflow/tests/integration/requirements.txt
+examples/airflow/whl
+examples/airflow/requirements.txt
+examples/airflow/openlineage.env
 out/*
 venv
 
@@ -36,7 +36,3 @@ marquez.yml
 
 # Dependent Helm charts
 chart/charts/
-
-integrations/spark/src/test/resources/test_data/test_output
-integrations/spark/src/test/resources/test_data/rdd_to_csv_output
-integrations/spark/src/test/resources/test_data/rdd_to_table
```

examples/airflow/README.md

Lines changed: 62 additions & 43 deletions
````diff
@@ -1,14 +1,14 @@
 # [Airflow](https://airflow.apache.org) Example
 
-In this example, we'll walk you through how to enable an **Airflow DAG** to send lineage metadata to **Marquez**. The example will help demonstrate some of the features of Marquez.
+In this example, we'll walk you through how to enable Airflow DAGs to send lineage metadata to Marquez using [OpenLineage](https://openlineage.io/). The example will help demonstrate some of the features of Marquez.
 
 ### What you’ll learn:
 
-* Enable Marquez in Airflow
-* Write your very first Marquez enabled DAG
+* Enable OpenLineage in Airflow
+* Write your very first OpenLineage enabled DAG
 * Troubleshoot a failing DAG using Marquez
 
-## Prerequisites
+# Prerequisites
 
 Before you begin, make sure you have installed:
 
@@ -17,29 +17,28 @@ Before you begin, make sure you have installed:
 
 > **Note:** We recommend that you have allocated at least **2 CPUs** and **8 GB** of memory to Docker.
 
-## Step 1: Prepare the Environment
+# Step 1: Setup
 
-First, if you haven't already, clone the Marquez repository and enter the `examples/airflow` directory.
+First, if you haven't already, clone the Marquez repository and change into the `examples/airflow` directory:
 
 ```bash
 $ git clone https://github.com/MarquezProject/marquez.git
 $ cd examples/airflow
 ```
 
-To make sure the latest [`marquez-airflow`](https://pypi.org/project/marquez-airflow) is downloaded when starting Airflow, you'll need to create a `requirements.txt` file with the following content:
+To make sure the latest [`openlineage-airflow`](https://pypi.org/project/openlineage-airflow) library is downloaded and installed when starting Airflow, you'll need to create a `requirements.txt` file with the following content:
 
 ```
-marquez-airflow
+openlineage-airflow
 ```
 
-Next, we'll need to specify where we want Airflow to send DAG metadata. To do so, create a config file named `marquez.env` with the following environment variables and values:
+Next, we'll need to specify where we want Airflow to send DAG metadata. To do so, create a config file named `openlineage.env` with the following environment variables and values:
 
 ```bash
-MARQUEZ_BACKEND=http # Collect metadata using HTTP backend
-MARQUEZ_URL=http://marquez:5000 # The URL of the HTTP backend
-MARQUEZ_NAMESPACE=example # The namespace associated with the collected metadata
+OPENLINEAGE_URL=http://marquez:5000 # The URL of the HTTP backend
+OPENLINEAGE_NAMESPACE=example # The namespace associated with the DAG collected metadata
 ```
-> **Note:** The `marquez.env` config file will be used by the `airflow`, `airflow_scheduler`, and `airflow_worker` containers to send lineage metadata to Marquez.
+> **Note:** The `openlineage.env` config file will be used by the `airflow`, `airflow_scheduler`, and `airflow_worker` containers to send lineage metadata to Marquez.
 
 Your `examples/airflow/` directory should now contain the following:
 
@@ -49,36 +48,36 @@ Your `examples/airflow/` directory should now contain the following:
 ├── docker
 ├── docker-compose.yml
 ├── docs
-├── marquez.env
+├── openlineage.env
 └── requirements.txt
 
 ```
 
-## Step 2: Write Airflow DAGs using Marquez
+# Step 2: Write Airflow DAGs using OpenLineage
 
-In this step, we will create two new Airflow DAGs that perform simple tasks. The `counter` DAG will generate a random number every minute, while the `sum` DAG calculates a sum every five minutes. This will result in a simple pipeline containing two jobs and two datasets.
+In this step, we'll create two new Airflow DAGs that perform simple tasks. The `counter` DAG will generate a random number every minute, while the `sum` DAG calculates a sum every five minutes. This will result in a simple pipeline containing two jobs and two datasets.
 
 First, let's create the `dags/` folder where our example DAGs will be located:
 
 ```bash
 $ mkdir dags
 ```
 
-When writing our DAGs, we'll use [`marquez-airflow`](https://pypi.org/project/marquez-airflow), enabling Marquez to observe the DAG and automatically collect task-level metadata. Notice that the only change required to begin collecting DAG metadata is to use `marquez-airflow` instead of `airflow`:
+When writing our DAGs, we'll use [`openlineage-airflow`](https://pypi.org/project/openlineage-airflow), enabling OpenLineage to observe the DAG and automatically collect task-level metadata. Notice that the only change required to begin collecting DAG metadata is to use `openlineage.airflow` instead of `airflow`:
 
 ```diff
 - from airflow import DAG
-+ from marquez_airflow import DAG
++ from openlineage.airflow import DAG
 ```
 
-## Step 2.1: Create DAG `counter`
+# Step 2.1: Create DAG `counter`
 
 Under `dags/`, create a file named `counter.py` and add the following code:
 
 ```python
 import random
 
-from marquez_airflow import DAG
+from openlineage.airflow import DAG
 from airflow.operators.postgres_operator import PostgresOperator
 from airflow.utils.dates import days_ago
 
@@ -124,15 +123,14 @@ t2 = PostgresOperator(
 )
 
 t1 >> t2
-
 ```
 
-## Step 2.2: Create DAG `sum`
+# Step 2.2: Create DAG `sum`
 
 Under `dags/`, create a file named `sum.py` and add the following code:
 
 ```python
-from marquez_airflow import DAG
+from openlineage.airflow import DAG
 from airflow.operators.postgres_operator import PostgresOperator
 from airflow.utils.dates import days_ago
 
@@ -175,7 +173,6 @@ t2 = PostgresOperator(
 )
 
 t1 >> t2
-
 ```
 
 At this point, you should have the following under your `examples/airflow/` directory:
@@ -189,13 +186,13 @@ At this point, you should have the following under your `examples/airflow/` directory:
 ├── docker/
 ├── docker-compose.yml
 ├── docs/
-├── marquez.env
+├── openlineage.env
 └── requirements.txt
 ```
 
-## Step 3: Start Airflow with Marquez
+# Step 3: Start Airflow with Marquez
 
-Now that we have our DAGs defined and Marquez is enabled in Airflow, we can run the example! To start Airflow, run:
+Now that we have our DAGs defined and OpenLineage is enabled in Airflow, we can run the example! To start Airflow, run:
 
 ```bash
 $ docker-compose up
@@ -205,20 +202,19 @@ $ docker-compose up
 
 **The above command will:**
 
-* Start Airflow and install `marquez-airflow`
+* Start Airflow and install `openlineage-airflow`
 * Start Marquez
 * Start Postgres
 
-To view the Airflow UI and verify it's running, open http://localhost:8080. Then, login using the username and password: `airflow` / `airflow`. You can also browse to http://localhost:3000 to view the Marquez UI.
-
+To view the Airflow UI and verify it's running, open [http://localhost:8080](http://localhost:8080). Then, login using the username and password: `airflow` / `airflow`. You can also browse to [http://localhost:3000](http://localhost:3000) to view the Marquez UI.
 
-## Step 4: View Collected Metadata
+# Step 4: View Collected Metadata
 
-To ensure that Airflow is executing `counter` and `sum`, navigate to the DAGs tab in Airflow and verify that they are both enabled and have a timestamp in the Last Run column.
+To ensure that Airflow is executing `counter` and `sum`, navigate to the DAGs tab in Airflow and verify that they are both enabled and are in a _running_ state:
 
 ![](./docs/airflow-view-dag.png)
 
-To view DAG metadata collected by Marquez from Airflow, browse to the Marquez UI by visiting http://localhost:3000. Then, use the _search_ bar in the upper right-side of the page and search for the `counter.inc` job. To view lineage metadata for `counter.inc`, click on the job from the drop-down list:
+To view DAG metadata collected by Marquez from Airflow, browse to the Marquez UI by visiting [http://localhost:3000](http://localhost:3000). Then, use the _search_ bar in the upper right-side of the page and search for the `counter.inc` job. To view lineage metadata for `counter.inc`, click on the job from the drop-down list:
 
 > **Note:** If the `counter.inc` job is not in the drop-down list, check to see if Airflow has successfully executed the DAG.
 
@@ -228,7 +224,7 @@ If you take a quick look at the lineage graph for `counter.inc`, you should see
 
 ![](./docs/lineage-view-job.png)
 
-## Step 5: Troubleshoot a Failing DAG with Marquez
+# Step 5: Troubleshoot a Failing DAG with Marquez
 
 In this step, let's quickly walk through a simple troubleshooting scenario where DAG `sum` begins to fail as the result of an upstream schema change for table `counts`. So, let's get to it!
 
@@ -241,13 +237,36 @@ t1 = PostgresOperator(
 -    task_id='if_not_exists',
 +    task_id='alter_name_of_column',
      postgres_conn_id='example_db',
--    sql='''
+    sql='''
 -    CREATE TABLE IF NOT EXISTS counts (
 -      value INTEGER
 -    );''',
-+    sql='''
-+    ALTER TABLE counts RENAME COLUMN value TO value_1_to_10;
-+    ''',
++    DO $$
++    BEGIN
++      IF EXISTS(SELECT *
++        FROM information_schema.columns
++        WHERE table_name='counts' and column_name='value')
++    THEN
++      ALTER TABLE "counts" RENAME COLUMN "value" TO "value_1_to_10";
++    END IF;
++    END $$;
+    ''',
+    dag=dag
+)
+```
+
+```diff
+t2 = PostgresOperator(
+    task_id='inc',
+    postgres_conn_id='example_db',
+    sql='''
+-    INSERT INTO counts (value)
++    INSERT INTO counts (value_1_to_10)
+    VALUES (%(value)s)
+    ''',
+    parameters={
+      'value': random.randint(1, 10)
+    },
     dag=dag
 )
 ```
@@ -281,11 +300,11 @@ With the code change, the DAG `sum` begins to run successfully:
 
 _Congrats_! You successfully step through a troubleshooting scenario of a failing DAG using metadata collected with Marquez! You can now add your own DAGs to `dags/` to build more complex data lineage graphs.
 
-## Next Steps
+# Next Steps
 
-* Review the Marquez [HTTP API](https://marquezproject.github.io/marquez/openapi.html) used to collect Airflow DAG metadata and learn how to build your own integrations
-* Take a look at our [`marquez-spark`](https://github.com/MarquezProject/marquez/tree/main/integrations/spark) integration that can be used with Airflow
+* Review the Marquez [HTTP API](https://marquezproject.github.io/marquez/openapi.html) used to collect Airflow DAG metadata and learn how to build your own integrations using OpenLineage
+* Take a look at [`openlineage-spark`](https://openlineage.io/integration/apache-spark) integration that can be used with Airflow
 
-## Feedback
+# Feedback
 
-What did you think of this example? You can reach out to us on [slack](http://bit.ly/MarquezSlack) and leave us feedback, or [open a pull request](https://github.com/MarquezProject/marquez/blob/main/CONTRIBUTING.md#submitting-a-pull-request) with your suggestions!
+What did you think of this example? You can reach out to us on [slack](http://bit.ly/MarquezSlack) and leave us feedback, or [open a pull request](https://github.com/MarquezProject/marquez/blob/main/CONTRIBUTING.md#submitting-a-pull-request) with your suggestions!
````
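
For reference, here is a minimal sketch of what `dags/counter.py` could look like after this change, assembled from the snippets shown in the diff above. The `default_args`, DAG id, schedule, and description are illustrative assumptions and are not part of this commit:

```python
import random

from openlineage.airflow import DAG  # the one-line swap that enables lineage collection
from airflow.operators.postgres_operator import PostgresOperator
from airflow.utils.dates import days_ago

# Illustrative DAG configuration (assumed values, not taken from the commit).
default_args = {
    'owner': 'datascience',
    'depends_on_past': False,
    'start_date': days_ago(1),
}

dag = DAG(
    'counter',                        # assumed dag_id
    schedule_interval='*/1 * * * *',  # the README says counter runs every minute
    default_args=default_args,
    description='Generates a random count value every minute.',
)

# Create the counts table if it does not exist (the pre-change t1 shown in Step 5).
t1 = PostgresOperator(
    task_id='if_not_exists',
    postgres_conn_id='example_db',
    sql='''
    CREATE TABLE IF NOT EXISTS counts (
      value INTEGER
    );''',
    dag=dag
)

# Insert a random value between 1 and 10 (as shown in the diff).
t2 = PostgresOperator(
    task_id='inc',
    postgres_conn_id='example_db',
    sql='''
    INSERT INTO counts (value)
    VALUES (%(value)s)
    ''',
    parameters={
        'value': random.randint(1, 10)
    },
    dag=dag
)

t1 >> t2
```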

examples/airflow/docker-compose.yml

Lines changed: 3 additions & 3 deletions
```diff
@@ -5,7 +5,7 @@ services:
     ports:
       - "8080:8080"
     env_file:
-      - marquez.env
+      - openlineage.env
     environment:
       - AIRFLOW_USERNAME=airflow
       - AIRFLOW_PASSWORD=airflow
@@ -28,7 +28,7 @@ services:
   airflow_scheduler:
     image: bitnami/airflow-scheduler:1.10.13
     env_file:
-      - marquez.env
+      - openlineage.env
     environment:
       - AIRFLOW_FERNET_KEY=Z2uDm0ZL60fXNkEXG8LW99Ki2zf8wkmIltaTz1iQPDU=
       - AIRFLOW_DATABASE_HOST=postgres
@@ -48,7 +48,7 @@ services:
   airflow_worker:
     image: bitnami/airflow-worker:1.10.13
     env_file:
-      - marquez.env
+      - openlineage.env
     environment:
       - AIRFLOW_FERNET_KEY=Z2uDm0ZL60fXNkEXG8LW99Ki2zf8wkmIltaTz1iQPDU=
       - AIRFLOW_DATABASE_HOST=postgres
```
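
The `env_file` entries above mean Docker Compose injects the key/value pairs from `openlineage.env` into the `airflow`, `airflow_scheduler`, and `airflow_worker` containers as plain environment variables. As a rough illustration of that mechanism (this is not the `openlineage-airflow` implementation, and the fallback values are assumptions), code running inside one of those containers could read the settings like this:

```python
import os

# Values come from openlineage.env via the docker-compose `env_file` option.
openlineage_url = os.environ.get('OPENLINEAGE_URL', 'http://marquez:5000')  # assumed fallback
namespace = os.environ.get('OPENLINEAGE_NAMESPACE', 'example')              # assumed fallback

print(f'Lineage metadata would be sent to {openlineage_url} '
      f'under the {namespace!r} namespace.')
```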
