Body text extraction
The TX collector captures the listing entry (title, date, source URL) on the first pass; a second pass downloads the actual content and extracts body text. Three paths, depending on the source URL:
- PDF (~52% of records): downloaded and parsed with pdfplumber(x_tolerance=2, y_tolerance=3). pdfplumber correctly handles TX templates that separate words by x-coordinate position rather than with space characters; pypdf was concatenating those as “membercommitteeappointmentsforthe”. Residual whitespace artifacts (line-broken words from the column layout) are normalized: a newline between two lowercase letters is a wrap, not a break. See the first sketch after this list.
- HTML press.php (~48% of records): fetched and parsed with BeautifulSoup. Body text is the contents of <main>, with the predictable navigation preamble (“Press Items: Senator X — District N « Return to the home page printer-friendly”) stripped. See the second sketch after this list.
- videoplayer.php (~3% of records): video press conferences. Bodies live off-platform; we link out and classify as content_type='other'.
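A minimal sketch of the PDF path, assuming the tolerances are passed to pdfplumber's per-page extract_text; the function name extract_pdf_body and the exact normalization regex are illustrative, not the collector's actual code:

```python
import re

import pdfplumber


def extract_pdf_body(path: str) -> str:
    """Extract and normalize body text from a TX press-release PDF."""
    pages = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # Tight tolerances make pdfplumber split words that the TX
            # templates position by x-coordinate instead of separating
            # with space characters (pypdf concatenated these).
            pages.append(page.extract_text(x_tolerance=2, y_tolerance=3) or "")
    body = "\n".join(pages)
    # A newline between two lowercase letters is a column-layout wrap,
    # not a break. Rejoined with a space here, assuming the wrap falls
    # between words; if the template breaks mid-word, "" would be the
    # right replacement instead.
    return re.sub(r"(?<=[a-z])\n(?=[a-z])", " ", body)
```

The lowercase-only lookarounds leave real breaks alone: headings and new sentences start with a capital letter, so their newlines survive.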
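And the HTML path, under the same caveat; the preamble pattern is a hypothetical approximation of the navigation text quoted above:

```python
import re

import requests
from bs4 import BeautifulSoup

# Hypothetical: matches the navigation preamble from "Press Items:"
# through "printer-friendly". DOTALL lets it span the newlines that
# get_text(separator="\n") inserts between elements.
PREAMBLE_RE = re.compile(r"^Press Items:.*?printer-friendly\s*", re.DOTALL)


def extract_html_body(url: str) -> str:
    """Fetch a press.php page and return the text inside <main>."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    main = BeautifulSoup(resp.text, "html.parser").find("main")
    if main is None:
        return ""
    text = main.get_text(separator="\n", strip=True)
    return PREAMBLE_RE.sub("", text).strip()
```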
Every body is hashed (SHA-256) at extraction time. On future re-fetches we hash the fresh body and compare; a mismatch indicates the source PDF or HTML was edited after publication, which we surface as edit history on the per-release page.
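A sketch of the hash-and-compare step. The function names are hypothetical, and whether the body is normalized (e.g. whitespace-collapsed) before hashing is an assumption:

```python
import hashlib


def content_hash(body: str) -> str:
    """Hex-encoded SHA-256 of the extracted body text."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def was_edited(stored_hash: str, refetched_body: str) -> bool:
    """True when a re-fetched body no longer matches the hash stored
    at first extraction, i.e. the source changed after publication."""
    return content_hash(refetched_body) != stored_hash
```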
As of the most recent extraction, 304 of 304 press-release records have body text and a content_hash. The 10 video records do not (their content lives off-platform).