Section Topics merge requests
https://gitlab-replica.wikimedia.org/repos/structured-data/section-topics/-/merge_requests
2024-03-13T13:13:52Z

https://gitlab-replica.wikimedia.org/repos/structured-data/section-topics/-/merge_requests/31
Remove reference tags from section headings
2024-03-13T13:13:52Z
Marco Fossati

Script run:
```py
# Run in a PySpark shell, where `spark` is already defined.
# Rows whose section title still contains a `<ref` tag, in the prod and dev outputs.
prod = spark.read.parquet('/user/analytics-platform-eng/structured-data/section_topics/2024-02-19').where("section_title like '%<ref%'")
dev = spark.read.parquet('section_topics/2024-02-19').where("section_title like '%<ref%'")

prod.count(), dev.count()
# (255572, 51)

# Distinct affected sections in prod
prod.select('wiki_db', 'page_id', 'section_title').distinct().count()
# 17054

# Distinct affected sections in dev
devref = dev.select('wiki_db', 'page_id', 'section_title').distinct()
devref.count()
# 2

devref.show()
# +-------+-------+--------------------------+
# |wiki_db|page_id|section_title             |
# +-------+-------+--------------------------+
# |srwiki |41595  |=_Бивши_корисници<ref_name|
# |kowiki |259315 |==_남자부<ref_name        |
# +-------+-------+--------------------------+
```
:shrug: :shrug: :shrug:
[srwiki](https://sr.wikipedia.org/wiki/?curid=41595#%D0%9A%D0%BE%D1%80%D0%B8%D1%81%D0%BD%D0%B8%D1%86%D0%B8[14]) is broken in the real world!
![Screen_Shot_2024-03-07_at_19.38.06](/uploads/3aac27db7cf32a3a60b61f2e1897d889/Screen_Shot_2024-03-07_at_19.38.06.png)
[kowiki](https://ko.wikipedia.org/wiki/?curid=259315#%EB%82%A8%EC%9E%90%EB%B6%80[14]) is correct; perhaps it slipped through because of the `</br>`?
```html
=== 남자부<ref name="드래프트">[http://www.cbs.co.kr/Nocut/Show.asp?IDX=976377 문성민, 신인드래프트 1순위로 한국전력에 지명] <노컷뉴스> 2008년 11월 3일</br>
[http://www.mydaily.co.kr/news/read.html?newsid=200810201459462275&ext=na 女배구 세터 염혜선, 드래프트 1순위 현대건설 행(종합)] <마이데일리> 2008년 10월 20일 보도</ref> ===
```
Shall we fix them?
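
For discussion, here is a minimal sketch of how both leftovers could be handled. It is not the code in this merge request: `strip_ref_tags` and the three patterns are hypothetical names, and it assumes the heading text is available as a plain string that may span several lines.

```python
import re

# Self-closing tags such as <ref name="x" />
REF_SELF_CLOSING = re.compile(r"<ref\b[^>]*/\s*>", re.IGNORECASE)
# Paired tags; DOTALL lets the body span line breaks (the kowiki case)
REF_PAIRED = re.compile(r"<ref\b[^>]*>.*?</ref\s*>", re.IGNORECASE | re.DOTALL)
# An opening tag that is never closed (the srwiki case): drop it and what follows
REF_DANGLING = re.compile(r"<ref\b.*", re.IGNORECASE | re.DOTALL)


def strip_ref_tags(heading: str) -> str:
    """Return the heading text with <ref> markup removed (hypothetical helper)."""
    # Self-closing tags go first so the paired pattern cannot start at one of them.
    heading = REF_SELF_CLOSING.sub("", heading)
    heading = REF_PAIRED.sub("", heading)
    heading = REF_DANGLING.sub("", heading)
    return heading.strip()
```

The dangling pattern is the aggressive part: it discards everything after an unclosed `<ref`, which matches what we want for a heading but would be too lossy for body text.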
Bug: T341113
Marco Fossati

https://gitlab-replica.wikimedia.org/repos/structured-data/section-topics/-/merge_requests/29
Alter scripts/*.py to write their outputs to parquets
2024-02-26T13:05:33Z
Matthias Mullie

The output formats of our scripts are inconsistent:
some already store a parquet (check_bad_parsing.py),
others (collect_media_prefixes.py,
fetch_qids_from_wikidata.py) just write lines, while
others (detect_html_tables.py) write jsonl (which is
actually ingested as parquet later on) or
(gather_section_titles_denylist.py) json.
Some end up being bundled as static files within
section_topics/data/, so it's rather inconvenient to
keep them up to date, as it requires new commits & builds.
This updates all scripts to write to a parquet instead.
Note that these new outputs are not yet being used;
those changes are coming in a separate merge request,
once we've made sure the scripts run fine and their output
has become available for consumption.
Note: this also includes functional changes in one script:
detect_html_tables.py no longer includes
`normalized_section_title` in its output as it was not
used. It also no longer ingests the denylist to omit
rows we likely don't care about anyway. This makes things
simpler (less coordination of scripts/outputs) and safer
(no need to remember that the output is only partial; i.e.
doesn't include denylisted entries). The code that
consumes this output also filters out denylisted rows
anyway, so this has no other functional impact.
Bug: T339129
Matthias Mullie