This MR replaces the 4 DAGs
dumps_merge_backfill_to_wikitext_raw_* with a single one that is much more efficient in time and resource usage.
The current backfill takes ~7.5h per group per year on recent years. As stated before, most of that time is wasted re-reading
wmf.mediawiki_wikitext_history over and over. This comes to ~7.5h * 22 years * 4 groups = 660 hours = 27.5 days if run sequentially. Because we run the 4 groups in parallel, it's actually ~6.8 days using 75% of cluster resources.
But if we have an intermediate table (with schema as in T346281#9170438), we have the following: ~19h (create intermediate table) + 2.4h * 22 years = ~72 hours = ~3 days!
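For anyone who wants to re-check the estimates, here is a minimal sketch of the arithmetic in Python. The function and variable names are made up for illustration and are not part of the DAG code; the inputs are just the rough timings quoted above, and the parallel estimate assumes the 4 groups overlap perfectly.

```python
# Back-of-the-envelope check of the backfill estimates quoted above.
# All names here are hypothetical; the numbers are the approximate timings from this MR.

HOURS_PER_DAY = 24


def old_backfill_hours(hours_per_group_year: float, years: int, groups: int) -> float:
    """Total hours if every (group, year) backfill ran one after another."""
    return hours_per_group_year * years * groups


def new_backfill_hours(build_hours: float, hours_per_year: float, years: int) -> float:
    """Total hours: build the intermediate table once, then backfill each year."""
    return build_hours + hours_per_year * years


old_sequential = old_backfill_hours(7.5, years=22, groups=4)
# Assumes the 4 groups run fully in parallel with no contention.
old_parallel = old_sequential / 4
new_total = new_backfill_hours(19, 2.4, years=22)

print(f"old, sequential: {old_sequential:.0f}h = {old_sequential / HOURS_PER_DAY:.1f} days")
print(f"old, 4 groups in parallel: {old_parallel:.0f}h = {old_parallel / HOURS_PER_DAY:.1f} days")
print(f"new, with intermediate table: {new_total:.1f}h = {new_total / HOURS_PER_DAY:.1f} days")
```

The parallel figure works out to 165h, so the ~6.8 days above is a slight round-down of the same number.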
🎉 All of this was done using ~18.4% of the cluster resources (100 executors with 24GB RAM, 2 cores each, and in this case