Alter scripts/*.py to write their outputs to parquets

Matthias Mullie requested to merge T339129_1 into main

The output formats of our scripts are inconsistent: some already store a parquet (check_bad_parsing.py); others (collect_media_prefixes.py, fetch_qids_from_wikidata.py) just write plain lines; detect_html_tables.py writes jsonl (which is ingested as parquet later on); and gather_section_titles_denylist.py writes json.

Some of these outputs end up being bundled as static files within section_topics/data/, which makes them inconvenient to keep up to date: every refresh requires new commits & builds.

This updates all scripts to write to a parquet instead.

Note that these new outputs are not yet being used; those changes are coming in a separate merge request, once we've confirmed that the scripts run fine and their output has become available for consumption.

Note: this also includes functional changes in one script. detect_html_tables.py no longer includes normalized_section_title in its output, as it was not used. It also no longer ingests the denylist to omit rows we likely don't care about anyway. This makes things simpler (less coordination of scripts/outputs) and safer (no need to remember that the output is only partial, i.e. doesn't include denylisted entries). The code that consumes this output filters out denylisted rows anyway, so this has no other functional impact.
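
As a rough sketch of why dropping the denylist from the script is safe, the consumer-side filtering could look something like the following. This is not the actual consuming code: the function name `drop_denylisted`, the `section_title` key and the trim/lowercase normalization are all assumptions made for the example.

```python
def drop_denylisted(rows, denylist, title_key="section_title"):
    """Return only the rows whose section title is not denylisted.

    Hypothetical consumer-side filter: the key name and the
    normalization below are assumptions, not the real pipeline.
    """

    def normalize(title):
        # Assumed normalization: trim whitespace and lowercase.
        return title.strip().lower()

    denied = {normalize(title) for title in denylist}
    return [row for row in rows if normalize(row[title_key]) not in denied]


rows = [
    {"section_title": "References", "page_id": 1},
    {"section_title": "History", "page_id": 1},
]
kept = drop_denylisted(rows, denylist=["references"])
```

Because the filter runs on the full output, it does not matter whether the script itself already skipped denylisted rows.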

Bug: T339129
