Use an intermediate table when backfilling wmf_dumps.wikitext_raw_rc1.

Xcollazo requested to merge use-intermediate-table-dumps into main

(Depends on https://gitlab.wikimedia.org/repos/data-engineering/dumps/mediawiki-content-dump/-/merge_requests/13)

This MR replaces the 4 DAGs dumps_merge_backfill_to_wikitext_raw_* with a single one that is much more efficient in time and resource usage.

From T346281#9170772:

The current backfill takes ~7.5h per group per year for recent years. As stated before, most of that time is wasted reading wmf.mediawiki_wikitext_history over and over. That comes to ~7.5h * 22 years * 4 groups = 660 hours = 27.5 days if run sequentially. Because we run each group in parallel, it's actually ~6.8 days using 75% of cluster resources.

But if we have an intermediate table (with schema as in T346281#9170438), we get the following: ~19h (to create the intermediate table) + 2.4h * 22 years = ~72 hours = ~3 days! 🎉 All of this was done using only ~18.4% of the cluster resources (100 executors with 24GB RAM and 2 cores each, and in this case spark.sql.adaptive.coalescePartitions.enabled=true).
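For reference, the wall-clock estimates quoted above can be reproduced with a quick back-of-the-envelope calculation. This is just a sketch using the numbers from this description; the constant names are illustrative and do not come from the DAG code:

```python
# Sanity-check the time estimates quoted in this MR description.
# All figures come from the description itself; names are illustrative.

HOURS_PER_GROUP_YEAR = 7.5   # old backfill: one group, one year
YEARS = 22
GROUPS = 4

# Old approach, fully sequential over groups and years:
old_sequential_hours = HOURS_PER_GROUP_YEAR * YEARS * GROUPS   # 660 h
old_sequential_days = old_sequential_hours / 24                # 27.5 d

# Old approach as actually run: 4 groups in parallel, each group
# still sequential over years:
old_parallel_days = (HOURS_PER_GROUP_YEAR * YEARS) / 24        # ~6.9 d

# New approach: build the intermediate table once, then backfill per year:
INTERMEDIATE_BUILD_HOURS = 19.0
HOURS_PER_YEAR_NEW = 2.4
new_total_hours = INTERMEDIATE_BUILD_HOURS + HOURS_PER_YEAR_NEW * YEARS  # ~71.8 h
new_total_days = new_total_hours / 24                                    # ~3 d

print(old_sequential_hours, old_sequential_days,
      round(old_parallel_days, 1), round(new_total_days, 1))
```

The ~6.8 days quoted above presumably also accounts for some per-run overhead; the pure per-year arithmetic lands at ~6.9 days, which is consistent.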

Bug: T346281

Edited by Xcollazo
