EPA SDWIS Tier-1 violations × public-notice required — pre-walk findings
Walked: 2026-05-17 08:15-09:00 UTC (~45min)
Status: KILLED AT GATE. Single decisive kill-gate: GAO-documented data unreliability on the precise variable the anti-join needs, with no replacement empirical measurement since EPA discontinued the audits in 2010.
What the pre-walk asked
The anti-join from the wider-survey memo (anti-join-survey-2026-05-16, #4 ranked): "Health-based drinking-water violation; required customer notice not issued." EPA's SDWIS Federal publishes both violation rows and (separately) public-notice rows; the question was whether the negative-space cut is clean.
Memo-predicted kill-gate: "Public notice issued but not entered into SDWIS is a known compliance gap that's a documentation issue, not a non-notice. Need to read EPA's PN compliance memo."
Memo-predicted cost: ~1 hour to walk the regulatory framework + decide.
The framework — clean
40 CFR 141 Subpart Q ("Public Notification of Drinking Water Violations") establishes a three-tier system based on health-severity. Tier 1 requires public notice as soon as practical but no later than 24 hours after the system learns of the violation (§141.202). Within 10 days of completing notification, the public water system must submit certification to its primacy agency that it has complied (§141.31). Appendix A to Subpart Q assigns each NPDWR violation type to a tier.
The regulatory anchor is unambiguous: Tier-1 violation → 24-hour clock → PN issued → certification filed. The framework would support a clean anti-join — if the data layer captured what we'd need.
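The clock described above can be sketched as a small calculation. This is a minimal illustration, not a compliance tool: `tier1_clock`, `learned_at`, and `notice_completed_at` are hypothetical names, and the timestamps are assumed inputs that a water system or primacy agency might hold. Only the 24-hour and 10-day windows come from the rule text cited above.

```python
from datetime import datetime, timedelta, timezone

TIER1_NOTICE_WINDOW = timedelta(hours=24)   # §141.202: no later than 24 hours
CERTIFICATION_WINDOW = timedelta(days=10)   # §141.31: within 10 days of completing notification


def tier1_clock(learned_at, notice_completed_at=None):
    """Return the Tier-1 deadlines implied by two (hypothetical) timestamps.

    learned_at: when the system learned of the violation.
    notice_completed_at: when public notification was completed, if known.
    """
    notice_deadline = learned_at + TIER1_NOTICE_WINDOW
    result = {
        "notice_deadline": notice_deadline,
        "notice_on_time": None,            # unknown until a completion time is supplied
        "certification_deadline": None,    # runs from completion of notification
    }
    if notice_completed_at is not None:
        result["notice_on_time"] = notice_completed_at <= notice_deadline
        result["certification_deadline"] = notice_completed_at + CERTIFICATION_WINDOW
    return result


# Invented timestamps, used only to exercise the function.
learned = datetime(2026, 3, 1, 9, 0, tzinfo=timezone.utc)
on_time = tier1_clock(learned, learned + timedelta(hours=20))
late = tier1_clock(learned, learned + timedelta(hours=30))
```

Evaluating the clock needs both timestamps; with only `learned_at`, the function can state the deadline but not whether it was met.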
The data architecture — partially clean
The relevant SDWIS Federal tables (per ECHO's SDWA Data Download Summary):
- SDWA_VIOLATIONS_ENFORCEMENT.csv: PUBLIC_NOTIFICATION_TIER and CALCULATED_PUB_NOTIF_TIER fields encode the tier assignment; NON_COMPL_PER_BEGIN_DATE / VIOL_FIRST_REPORTED_DATE / VIOL_LAST_REPORTED_DATE for violation timing.
- SDWA_PN_VIOLATION_ASSOC.csv: PN_VIOLATION_ID, RELATED_VIOLATION_ID (foreign key back to the underlying violation), NON_COMPL_PER_BEGIN_DATE / NON_COMPL_PER_END_DATE, VIOLATION_CODE, CONTAMINATION_CODE, FIRST_REPORTED_DATE, LAST_REPORTED_DATE.
What's structurally absent: no PN-issued-on date, no PN-deadline, no PN-compliance flag. The PN_VIOLATION_ASSOC table records that a PN-related violation/association exists, not when the PN was actually delivered to consumers. The 24-hour Tier-1 clock can't be evaluated against federal data; the certification with copy-of-notice is filed to the primacy state and doesn't propagate up.
Two anti-join shapes survive at this layer:
- Shape A (no PN row exists): Tier-1 violation in VIOLATIONS_ENFORCEMENT × no row in PN_VIOLATION_ASSOC joined via RELATED_VIOLATION_ID. Cohort: "PN-required violation with no corresponding PN record."
- Shape B (PN-violation citation rate by state): Code-75 ("Public Notification Violation for NPDWR Violation") and Code-7500 ("Public Notice - Other") citation density per Tier-1 violation, by state. The Discretion Map shape applied to drinking water: which states cite PN failures aggressively vs. rarely?
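Shape A can be sketched as a LEFT JOIN ... IS NULL anti-join. The sketch below uses an in-memory SQLite database with invented toy rows. The table names `violations` and `pn_assoc` are stand-ins for the two CSVs, and the `PWSID` and `VIOLATION_ID` key columns are assumptions about how the download is keyed. Only PUBLIC_NOTIFICATION_TIER, PN_VIOLATION_ID, and RELATED_VIOLATION_ID are taken from the field list above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE violations (          -- stand-in for SDWA_VIOLATIONS_ENFORCEMENT.csv
    PWSID TEXT,                    -- assumed key column
    VIOLATION_ID TEXT,             -- assumed key column
    PUBLIC_NOTIFICATION_TIER INTEGER
);
CREATE TABLE pn_assoc (            -- stand-in for SDWA_PN_VIOLATION_ASSOC.csv
    PWSID TEXT,                    -- assumed key column
    PN_VIOLATION_ID TEXT,
    RELATED_VIOLATION_ID TEXT
);
""")

# Toy rows, invented purely to exercise the query.
conn.executemany("INSERT INTO violations VALUES (?, ?, ?)", [
    ("XX0000001", "V1", 1),   # Tier 1, has a PN association row
    ("XX0000001", "V2", 1),   # Tier 1, no PN association row
    ("XX0000002", "V3", 2),   # Tier 2, outside the cohort
    ("XX0000003", "V4", 1),   # Tier 1, no PN association row
])
conn.executemany("INSERT INTO pn_assoc VALUES (?, ?, ?)", [
    ("XX0000001", "P1", "V1"),
])

# Anti-join: keep Tier-1 violations for which the left join found no PN row.
SHAPE_A = """
SELECT v.PWSID, v.VIOLATION_ID
FROM violations AS v
LEFT JOIN pn_assoc AS p
       ON p.PWSID = v.PWSID
      AND p.RELATED_VIOLATION_ID = v.VIOLATION_ID
WHERE v.PUBLIC_NOTIFICATION_TIER = 1
  AND p.PN_VIOLATION_ID IS NULL
ORDER BY v.PWSID, v.VIOLATION_ID
"""
shape_a_cohort = conn.execute(SHAPE_A).fetchall()
```

The query returns the rows with no match. It says nothing about why the match is missing, and that is the question the cohort's meaning depends on.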
Both look workable on first read. The kill comes from a separate axis.
The kill: GAO has quantified the data layer's unreliability, and the audits stopped
GAO-11-381 (Drinking Water: Unreliable State Data Limit EPA's Ability to Target Enforcement Priorities and Communicate Water Systems' Performance, published June 17 2011 / publicly released July 19 2011) audited SDWIS/Fed data quality using EPA's own 2007-2009 state audits as the underlying empirical base.
The headline findings:
Using data from the 14 states EPA audited in 2009, GAO estimates that those 14 states did not report or inaccurately reported 26 percent of the health-based violations that should have been reported and 84 percent of the monitoring violations that should have been reported.
Monitoring violations, as the GAO defines them, include situations in which a water system did not issue public notice of a health-based violation. Public-notice compliance is structurally part of the monitoring-violation category, not a separate signal. The 84% unreliability figure lands directly on top of the variable the anti-join would key on.
A second GAO finding sharpens the timing:
EPA conducted audits to assess the quality of state violation data in SDWIS/Fed and developed recommendations for improving data quality. EPA discontinued these audits in 2010 because of funding constraints.
The 2009 audit cycle that GAO-11-381 analyzes is therefore the most recent empirical measurement of SDWIS/Fed reliability that exists. As of GAO's 2022 follow-up:
EPA indicated that it was not resuming data verification audits, instead taking other actions to improve the agency's ability to oversee the quality of drinking water data that states provide to EPA. The agency told us it was evaluating data quality through a three-pronged approach of electronic reporting through the Compliance Monitoring Data Portal (CMDP), automated data quality assurance tools, and state file reviews.
CMDP is a process shift, not a measurement. Automated QA tools and file reviews don't produce a new statistic comparable to "26% / 84%." EPA has, by its own description, replaced the audit-based reliability-measurement regime with one that produces no empirical reliability figure at all. The 2009 numbers are the most current empirical estimate of SDWIS/Fed reliability that the federal apparatus has produced, and the apparatus has explicitly chosen not to produce a replacement.
Why this kills both anti-join shapes
Shape A: "No PN row exists for this Tier-1 violation" can mean either that the PN wasn't issued or that the state didn't transmit the PN data to SDWIS/Fed. GAO's 84% figure says the second explanation plausibly dominates the negative space: if the audited states failed to report, or misreported, 84 percent of the monitoring violations they should have reported, then whatever cohort the SQL produces is mostly state-reporting failure, not actual PN-issuance failure. The story-shape "PWS failed to notify consumers of a serious health hazard" can't be told from a data substrate where most of the negative space may be "Oklahoma's quarterly upload was incomplete."
Shape B: "State A cites PN failures aggressively; State B rarely" is a comparison that requires comparable underlying populations of actually-reported violations. State A reporting 90% of monitoring violations completely and State B reporting 30% will produce wildly different Code-75 citation rates even if their underlying PN-enforcement diligence is identical. The Discretion Map (OSHA SIR) survived because OSHA federal-state data flows are tighter and the citation cohort was both small enough to verify by name and stable enough to compare. SDWIS state-by-state PN-citation analysis collapses into "we're measuring state reporting completeness, not state enforcement variance."
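The Shape B confound can be shown with a few lines of arithmetic. This is a toy model under stated assumptions, not an estimate. Both states are given the same invented number of true Tier-1 violations and the same invented true PN-failure rate; only the share of PN violations that reaches the federal data differs, using the 90% and 30% figures from the paragraph above. The function name and all inputs are hypothetical.

```python
def observed_citation_rate(true_tier1, true_pn_failure_rate,
                           pn_reported_share, tier1_reported_share=1.0):
    """PN citations per reported Tier-1 violation, as the federal data would show it."""
    if true_tier1 <= 0 or tier1_reported_share <= 0:
        raise ValueError("need a positive number of reported Tier-1 violations")
    reported_tier1 = true_tier1 * tier1_reported_share
    reported_pn_citations = true_tier1 * true_pn_failure_rate * pn_reported_share
    return reported_pn_citations / reported_tier1


# Hypothetical inputs: identical underlying behavior in both states.
TRUE_TIER1 = 1000
TRUE_PN_FAILURE_RATE = 0.20

state_a = observed_citation_rate(TRUE_TIER1, TRUE_PN_FAILURE_RATE, pn_reported_share=0.90)
state_b = observed_citation_rate(TRUE_TIER1, TRUE_PN_FAILURE_RATE, pn_reported_share=0.30)

# Ratio of observed rates; in this model it equals the ratio of reporting shares.
apparent_gap = state_a / state_b
```

With these inputs the observed rates come out at about 0.18 and 0.06, a threefold apparent difference in "enforcement" produced entirely by reporting completeness.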
The sixth failure mode
The first five /anti-join-failure-modes are about regulatory framework absorbing the cohort (LEIE WAIVERDATE / OFAC × SAM / HUD conciliation / chronology / OSHA grouping-key). This is a different beast: the substrate's measured unreliability exceeds the signal-to-noise threshold the anti-join needs.
The shape generalizes: any time a dataset is the subject of a published GAO or agency-OIG audit that quantifies reporting-inaccuracy, that audit's percentage has to be incorporated into the cohort-sanity gate. If reporting-inaccuracy ≥ the size of the negative space we'd be naming, the anti-join cannot survive — what we'd be measuring is the reporting noise, not the regulatory gap.
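The rule above can be written as a one-line comparison. This is a minimal sketch of the gate as stated, assuming both quantities are expressed as fractions of the same population. `reliability_gate` is a hypothetical name, the 0.84 is GAO's monitoring-violation figure, and the 0.30 negative-space share and the 0.05 "clean dataset" rate are invented placeholders.

```python
def reliability_gate(misreport_rate, negative_space_share):
    """Return "kill" when published misreporting is at least as large as the
    negative space the anti-join would name, otherwise "proceed"."""
    for name, value in (("misreport_rate", misreport_rate),
                        ("negative_space_share", negative_space_share)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a fraction between 0 and 1")
    return "kill" if misreport_rate >= negative_space_share else "proceed"


GAO_MONITORING_MISREPORT = 0.84        # from GAO-11-381, as quoted in this memo
HYPOTHETICAL_NEGATIVE_SPACE = 0.30     # placeholder; not measured

sdwis_verdict = reliability_gate(GAO_MONITORING_MISREPORT, HYPOTHETICAL_NEGATIVE_SPACE)
clean_verdict = reliability_gate(0.05, HYPOTHETICAL_NEGATIVE_SPACE)  # invented clean case
```

The comparison is deliberately blunt. A finer version would need the audit's rate and the cohort's share measured over the same denominator, and that would have to be checked for each dataset.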
The pre-walk methodology gains a fourth axis: regulatory framework (1) + data architecture (2) + sanity-check top of cohort (3) + published reliability audits of the dataset (4). The fourth was implicit in earlier walks (we knew the data we used was relatively clean) but never named as a gate. SDWIS makes it explicit — and the gate is "search GAO and agency-IG audits for the dataset's name; if a quantified unreliability finding exists, the cohort math has to be discounted against it."
What this kills
- SDWIS PN anti-join as proposed: both Shape A (no PN row) and Shape B (state-variance citation). Filing as status: killed lab entry n=107.
- Pre-walk #4 of tier-2 candidates is now done; pipeline state per the wider-survey memo: RCRA SURVIVES (n=98), OFAC KILLED (n=99), HUD KILLED (n=100), SDWIS KILLED (n=107). Three of the four tier-1+2 candidates walked were killed at gate; one survives. Pattern-tier hit rate ~25%.
What survives
The framework walk wasn't wasted. The Tier-1 24-hour rule and the certification-to-primacy-agency flow are clean regulatory anchors that could support a different shape — for example, a state-level investigation in a primacy state that publishes its own PN certification data, where the federal SDWIS gap is sidestepped. The substrate-noise problem is federal-data-specific; state primacy programs that publish their own certification logs would be a different cohort entirely. Filed as a parking-lot idea, not a tier-2 anti-join candidate.
What this means for the remaining tier-2 candidates
Per the wider-survey memo: FDA Warning Letters × DRLS (#5), OSHA citations × federal contractor awards (#6), SEC bad-actor disqualifications × Form D (#7).
The new fourth pre-walk axis applies to each:
- FDA WL × DRLS: Search for GAO/OIG audits of FDA inspection data + drug establishment registration accuracy. If FDA inspection data has a published unreliability finding ≥ the size of the WL → no-follow-up cohort, kill at gate. Highest-priority follow-up.
- OSHA × federal contractor: OSHA citation data is relatively well-audited; the wider-survey memo's predicted kill (framework abandonment, not enforcement gap) is the structural concern, not data quality.
- SEC bad-actor × Form D: EDGAR data is generally clean; the principal-identity-resolution gap is the structural concern.
Suggested order: FDA next, OSHA after, SEC last. The fourth-axis check on FDA is a 15-minute scan before any further investment.
Provenance
- feedback_anti_join_publication_shape: the template this memo extends.
- feedback_load_bearing_policy_kills_cheap_anti_join: the LEIE-NPI / WAIVERDATE lesson generalized to a data-quality kill rather than a framework kill.
- GAO-11-381 product page: primary source for the 26% / 84% figures and EPA's 2010 audit discontinuation.
- GAO-22-105600 (EPA Priority Recommendations): secondary source confirming EPA's 2022 position (not resuming data verification audits; CMDP/QA-tools/file-reviews as substitute).
- 40 CFR Part 141 Subpart Q: the Public Notification Rule framework.
- ECHO SDWA Data Download Summary: SDWA_PN_VIOLATION_ASSOC.csv field list confirming no PN-issuance date.