T295360 datapipeline scaffolding

This merge request adds a cookiecutter template to scaffold new data pipelines as described in https://phabricator.wikimedia.org/T295360.

This template provides

  • Integration with our tox config (mypy/flake8/pytest)
  • A PySpark job template
  • A pytest template for pyspark code
  • An Airflow DAG template to help users get started.

Structure changes

The project directory largely follows image-matching's structure. Notable changes are:

  • Python code has been moved under pyspark
  • Python code is pip-installable. This lets us package dependencies at build time and eases Spark deployment: we no longer need to pass each module explicitly (e.g. --files schema.py), because imports are resolved from the venv.
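
To illustrate the deployment difference described above, here is a minimal sketch of how the spark-submit invocation might change once the code ships inside a packaged venv. The function name, the archive name venv.tar.gz, the "#venv" alias and job.py are hypothetical placeholders, not files or helpers from this repository. The description mentions --files; the sketch uses --py-files, the spark-submit flag intended for Python modules. The exact interpreter settings depend on the cluster and deploy mode.

```python
from typing import List, Optional, Sequence


def spark_submit_command(
    entrypoint: str,
    venv_archive: Optional[str] = None,
    loose_modules: Sequence[str] = (),
) -> List[str]:
    """Build a spark-submit argument list for a PySpark job.

    All paths passed in are hypothetical; this only assembles the arguments.
    """
    cmd = ["spark-submit"]
    if venv_archive:
        # Packaged venv: ship a single archive, unpacked on the cluster as ./venv.
        cmd += ["--archives", f"{venv_archive}#venv"]
        # Point PySpark at the interpreter inside the unpacked archive
        # (assumption: the right setting may differ per deploy mode).
        cmd += ["--conf", "spark.pyspark.python=./venv/bin/python"]
    if loose_modules:
        # Without packaging, every helper module has to be listed by hand.
        cmd += ["--py-files", ",".join(loose_modules)]
    cmd.append(entrypoint)
    return cmd


if __name__ == "__main__":
    # Before: each module is passed explicitly.
    print(" ".join(spark_submit_command("job.py", loose_modules=["schema.py"])))
    # After: imports resolve from the packaged venv.
    print(" ".join(spark_submit_command("job.py", venv_archive="venv.tar.gz")))
```

With the packaged venv, the list of modules no longer has to be kept in sync with the code: adding a new module to the package does not change the submit command.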

How to test

Check out the T295360-datapipeline-scaffolding branch.

A new data pipeline can then be created with:

make datapipeline                                                                                                       

This will generate a new directory for pipeline code under:

your_data_pipeline                                                                                                      

And install an Airflow dag template under

dags/your_data_pipeline_dag.py                                                                                          

From the top-level directory, you can now run make test-dags. The command checks that dags/your_data_pipeline_dag.py is a valid Airflow DAG. The output should look like this:

make test-dags

---------- coverage: platform linux, python 3.7.11-final-0 -----------
Name                                    Stmts   Miss  Cover
-----------------------------------------------------------
dags/factory/sequence.py                   70      3    96%
dags/ima.py                                49      5    90%
dags/similarusers-train-and-ingest.py      20      0   100%
dags/your_data_pipeline_dag.py             19      0   100%
-----------------------------------------------------------
TOTAL                                     158      8    95%

=========================== 8 passed, 8 warnings in 12.75s ===========================
______________________________________ summary ____________
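
For readers unfamiliar with what "a valid Airflow DAG" means in practice, a common way to check it is to load the dags folder through Airflow's DagBag and assert that no file failed to import. The sketch below assumes that approach. It is not necessarily how make test-dags is implemented in this repository, and the function names are hypothetical.

```python
from typing import Dict


def dag_import_errors(dag_folder: str = "dags") -> Dict[str, str]:
    """Return {file path: error message} for DAG files that fail to load."""
    # Imported lazily so this module can be imported without Airflow installed.
    from airflow.models import DagBag

    # include_examples=False keeps Airflow's bundled example DAGs out of the bag.
    bag = DagBag(dag_folder=dag_folder, include_examples=False)
    return dict(bag.import_errors)


def assert_dags_load(dag_folder: str = "dags") -> None:
    """Fail with the offending files listed if any DAG cannot be imported."""
    errors = dag_import_errors(dag_folder)
    assert not errors, f"DAG import failures: {errors}"
```

Calling assert_dags_load("dags") from a pytest test would fail as soon as a generated DAG file has a syntax error, a missing import or an invalid DAG definition.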
Edited by Gmodena