Update pagelinks query

Marco Fossati requested to merge T350007 into main

This MR addresses a breaking change to the wmf_raw.mediawiki_pagelinks table: the pl_title column was dropped, so the query now joins wmf_raw.mediawiki_private_linktarget on pl_target_id = lt_id. See the root ticket at https://phabricator.wikimedia.org/T299947.
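For reviewers unfamiliar with the new schema, here is a minimal sketch of the join shape described above. It is not the query from this MR: the function name, the selected columns, the partition filters on `snapshot` and `wiki_db`, and the example values are assumptions for illustration. The `lt_namespace` and `lt_title` column names follow the MediaWiki core linktarget schema.

```python
def pagelinks_with_titles_query(snapshot: str, wiki_db: str) -> str:
    """Build a Spark SQL query that recovers link target titles.

    Hypothetical helper. Assumes both tables are partitioned by
    `snapshot` and `wiki_db`, and that lt_id values are only unique
    within a single wiki, hence the extra wiki_db join condition.
    """
    return f"""
SELECT pl.pl_from, lt.lt_namespace, lt.lt_title
FROM wmf_raw.mediawiki_pagelinks AS pl
JOIN wmf_raw.mediawiki_private_linktarget AS lt
  ON pl.pl_target_id = lt.lt_id
 AND pl.wiki_db = lt.wiki_db
WHERE pl.snapshot = '{snapshot}'
  AND lt.snapshot = '{snapshot}'
  AND pl.wiki_db = '{wiki_db}'
""".strip()


# Example values are illustrative only.
print(pagelinks_with_titles_query("2024-04", "enwiki"))
```

The resulting string could be passed to `spark.sql(...)`. Plain string interpolation is only acceptable here because the inputs are trusted job parameters.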

The dropped pl_title column only affects the link counts in the lead image data (i.e., the image_suggestions_lead_image_data output Hive table); see https://gitlab.wikimedia.org/repos/structured-data/image-suggestions/-/blob/a8ecba225f612480d64cd0d423c2b6443ee76990/image_suggestions/commonswiki_file.py#L369

Test Airflow run result

Lead image data row counts:

from wmfdata.spark import create_session

spark = create_session(app_name='is-pagelinks-change', type='yarn-large')
# Same snapshot in both tables: production output vs. the test run from this branch
prod = spark.read.table('analytics_platform_eng.image_suggestions_lead_image_data').where('snapshot="2024-04-01"')
dev = spark.read.table('is_pagelinks.image_suggestions_lead_image_data').where('snapshot="2024-04-01"')

prod.count(), dev.count()
(8085464, 7999312)

The counts are not identical: dev has 86,152 fewer rows than prod (about 1.07%). I quickly inspected the data and couldn't spot anything obviously wrong.
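One possible way to dig into the gap is a left anti join, which lists the prod rows that have no counterpart in dev. This is a sketch only: the helper name is made up, and the key columns have to be chosen from the actual table schema, which is not shown here.

```python
def rows_missing_from_dev(prod, dev, key_columns):
    """Return the rows of `prod` with no match in `dev` on `key_columns`.

    Hypothetical helper. `prod` and `dev` are the Spark DataFrames
    loaded for the row count comparison; `key_columns` is a list of
    column names identifying a row.
    """
    # A left anti join keeps only the left-side rows without a match
    return prod.join(dev, on=key_columns, how="left_anti")


# Usage, with placeholder column names to be replaced by the real ones:
# missing = rows_missing_from_dev(prod, dev, ["<key_col_1>", "<key_col_2>"])
# missing.show(20, truncate=False)
```

Swapping the two arguments gives the rows that only exist in dev, which helps tell a pure loss from a shift in values.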

@cparle could you please double-check it?

Bug: T350007