Body text extraction
The TX collector captures the listing entry (title, date, source URL) on the first pass; a second pass downloads the actual content and extracts body text. Three paths, depending on the source URL:
- PDF (~52% of records): downloaded and parsed with pdfplumber(x_tolerance=2, y_tolerance=3). pdfplumber correctly handles TX templates that separate words by x-coordinate position rather than with space characters; pypdf was concatenating those as “membercommitteeappointmentsforthe”. Residual whitespace artifacts (line-broken words from the column layout) are normalized: a newline between two lowercase letters is a wrap, not a break. See the first sketch after this list.
- HTML press.php (~48% of records): fetched and parsed with BeautifulSoup. Body text is the contents of <main>, with the predictable navigation preamble (“Press Items: Senator X — District N « Return to the home page printer-friendly”) stripped. See the second sketch after this list.
- videoplayer.php (~3% of records): video press conferences. Bodies live off-platform; we link out and classify as content_type='other'.
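A minimal sketch of the PDF path, assuming the tolerances are passed to pdfplumber's per-page extract_text; the function name extract_pdf_body and the exact normalization regex are illustrative, not the collector's actual code:

```python
import re

import pdfplumber


def extract_pdf_body(path: str) -> str:
    """Extract and normalize body text from a TX press-release PDF."""
    pages = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # Tight tolerances make pdfplumber split words that the TX
            # templates position by x-coordinate instead of separating
            # with space characters (pypdf concatenated these).
            pages.append(page.extract_text(x_tolerance=2, y_tolerance=3) or "")
    body = "\n".join(pages)
    # A newline between two lowercase letters is a column-layout wrap,
    # not a break. Rejoined with a space here, assuming the wrap falls
    # between words; if the template breaks mid-word, "" would be the
    # right replacement instead.
    return re.sub(r"(?<=[a-z])\n(?=[a-z])", " ", body)
```

The lowercase-only lookarounds leave real breaks alone: headings and new sentences start with a capital letter, so their newlines survive.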
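And the HTML path, under the same caveat; the preamble pattern is a hypothetical approximation of the navigation text quoted above:

```python
import re

import requests
from bs4 import BeautifulSoup

# Hypothetical: matches the navigation preamble from "Press Items:"
# through "printer-friendly". DOTALL lets it span the newlines that
# get_text(separator="\n") inserts between elements.
PREAMBLE_RE = re.compile(r"^Press Items:.*?printer-friendly\s*", re.DOTALL)


def extract_html_body(url: str) -> str:
    """Fetch a press.php page and return the text inside <main>."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    main = BeautifulSoup(resp.text, "html.parser").find("main")
    if main is None:
        return ""
    text = main.get_text(separator="\n", strip=True)
    return PREAMBLE_RE.sub("", text).strip()
```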
Every body is hashed (SHA-256) at extraction time. On future re-fetches we hash the fresh body and compare; a mismatch indicates the source PDF or HTML was edited after publication, which we surface as edit history on the per-release page.
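A sketch of the hash-and-compare step. The function names are hypothetical, and whether the body is normalized (e.g. whitespace-collapsed) before hashing is an assumption:

```python
import hashlib


def content_hash(body: str) -> str:
    """Hex-encoded SHA-256 of the extracted body text."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def was_edited(stored_hash: str, refetched_body: str) -> bool:
    """True when a re-fetched body no longer matches the hash stored
    at first extraction, i.e. the source changed after publication."""
    return content_hash(refetched_body) != stored_hash
```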
As of the most recent extraction, 304 of 304 press-release records have body text and a content_hash. The 10 video records do not (their content lives off-platform).