Automation and the Living Site

Day 6 was about making themlspulse.com stop needing me. A site that updates itself. Every stat, every standing, every coach, every advanced metric—refreshed automatically, on schedule, from public APIs and Wikipedia.

Three Crons, Three Data Sources

The site now runs three Vercel cron jobs:

Cron               Schedule              Source                    What It Updates
refresh-stats      Daily, 8am UTC        ESPN API                  Goals, assists, appearances, minutes, new players
refresh-standings  Sundays, 10am UTC     ESPN API + Wikipedia      2026 standings (East/West/Overall) + head coaches
refresh-asa        Wednesdays, 12pm UTC  American Soccer Analysis  xG, xA, goals added, passing stats, GK metrics

The two ESPN-backed crons were already there. Today I added the ASA cron and wired the Wikipedia coach scraper into the Sunday standings job.
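For reference, here is a rough sketch of how those three schedules could be expressed as Vercel cron entries. Vercel reads a "crons" array from vercel.json, and its cron expressions are evaluated in UTC. The /api/cron/... paths below are hypothetical route names, not the site's actual routes, and the script only prints the JSON rather than touching any project file.

```python
import json

# Schedules are taken from the table above. The paths are hypothetical
# route names chosen for illustration only.
CRONS = [
    {"path": "/api/cron/refresh-stats", "schedule": "0 8 * * *"},       # daily, 08:00 UTC
    {"path": "/api/cron/refresh-standings", "schedule": "0 10 * * 0"},  # Sundays, 10:00 UTC
    {"path": "/api/cron/refresh-asa", "schedule": "0 12 * * 3"},        # Wednesdays, 12:00 UTC
]


def render_vercel_config(crons):
    """Return the text of a vercel.json that contains only a "crons" array."""
    return json.dumps({"crons": crons}, indent=2)


if __name__ == "__main__":
    print(render_vercel_config(CRONS))
```

In the cron expressions, the fifth field is the day of week, with 0 for Sunday and 3 for Wednesday.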

The Wikipedia Coach Problem

ESPN's soccer API doesn't expose coaches. That's a gap most sports sites would fill manually. We fill it with a Wikipedia scraper that parses MediaWiki infobox wikitext for all 30 teams every Sunday.

Three edge cases made this interesting:

  • LAFC has both | manager (ownership group) and | coach (head coach). Had to prefer coach over manager.
  • NYCFC and San Jose have | coach = (empty) with the actual coach under | manager. Had to fall through empty fields.
  • Orlando City had ''(interim)'' wiki markup that needed stripping.

The parser now handles all 30 teams cleanly and only writes to the database when a coach actually changes. When it does change, it logs the diff: atlanta-united: Ronny Deila → Gerardo Martino.
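A minimal sketch of that field-selection logic, assuming the infobox wikitext has already been fetched as a string. The function names and regexes are mine for illustration, not the site's actual parser, and a real infobox will contain more markup (refs, templates) than this handles.

```python
import re


def infobox_field(wikitext, name):
    """Return the raw value of `| name = value`, or "" if absent or empty."""
    # [ \t]* instead of \s* so an empty field never swallows the next line.
    pattern = rf"^[ \t]*\|[ \t]*{re.escape(name)}[ \t]*=[ \t]*(.*)$"
    m = re.search(pattern, wikitext, re.MULTILINE)
    return m.group(1).strip() if m else ""


def clean_name(raw):
    """Strip wiki links, italic/bold quote markers, and an "(interim)" note."""
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", raw)  # [[Target|Label]] -> Label
    text = re.sub(r"'{2,}", "", text)                             # ''(interim)'' -> (interim)
    text = re.sub(r"\(interim\)", "", text, flags=re.IGNORECASE)
    return " ".join(text.split())


def head_coach(wikitext):
    """Prefer `coach`; fall through to `manager` when `coach` is missing or empty."""
    for field in ("coach", "manager"):
        value = clean_name(infobox_field(wikitext, field))
        if value:
            return value
    return None


def coach_changes(current, scraped):
    """Return one log line per team whose scraped coach differs from the stored one."""
    return [
        f"{team}: {current.get(team)} → {name}"
        for team, name in scraped.items()
        if name and current.get(team) != name
    ]
```

Only the teams returned by coach_changes would need a database write, which keeps the Sunday job a no-op in most weeks.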

ASA Enrichment: The xG Layer

American Soccer Analysis provides the advanced metrics that separate a stats site from a real analytics resource: expected goals, expected assists, goals added above average, passing completion vs. expected, and goalkeeper performance models.

The matching challenge is the same one we hit with youth pathway data—name normalization. ASA uses different name formats than ESPN. "Danny Musovski" vs. "Daniel Musovski." "Matty Longstaff" vs. "Matthew Longstaff." "Nouhou" vs. "Nouhou Tolo."

We maintain a hand-curated alias map (25 entries), plus two fallbacks: matching on last name + team, and matching on first name + team for mononyms. Current match rate: 99.3% of all ASA players who've appeared in 2026.
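The matching order can be sketched roughly like this. The record shape (dicts with "id", "name", "team"), the function names, and the direction of the alias map are assumptions for illustration; only the three name pairs come from the examples above.

```python
import unicodedata

# ASA spelling -> ESPN spelling. The pairs are from the text; which side is
# ASA and which is ESPN is my assumption.
ALIASES = {
    "daniel musovski": "danny musovski",
    "matthew longstaff": "matty longstaff",
    "nouhou tolo": "nouhou",
}


def normalize(name):
    """Lowercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def tokens(player):
    """Normalized name parts of an ESPN player record."""
    return normalize(player["name"]).split()


def match_player(asa_name, asa_team, espn_players):
    """Return the ESPN id for an ASA record, or None if nothing matches safely."""
    name = normalize(asa_name)
    name = ALIASES.get(name, name)
    parts = name.split()
    if not parts:
        return None
    # 1. Exact full-name match (after the alias lookup).
    for p in espn_players:
        if normalize(p["name"]) == name:
            return p["id"]
    # 2. Last name + team, then 3. first name + team (covers mononyms).
    roster = [p for p in espn_players if p["team"] == asa_team and tokens(p)]
    for index in (-1, 0):
        hits = [p for p in roster if tokens(p)[index] == parts[index]]
        if len(hits) == 1:  # accept only unambiguous matches
            return hits[0]["id"]
    return None
```

Requiring exactly one hit in the fallbacks is a deliberate choice in this sketch: two players with the same surname on one team are left unmatched, which is what an alias entry is for.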

Youth Networks v2: The Full Pipeline

The youth pathways data from Day 5 was incomplete—single-source, low confidence. Today an 8-agent swarm rebuilt it from scratch:

  • FotMob: 679 players with career histories (3,833 career entries)
  • Wikipedia: 704 players with 2,961 senior + 529 youth career entries
  • Wikidata: 657 players with structured team/education data
  • TheSportsDB: 589 players with birth locations and nationalities

Merged result: 866 players (100% coverage), 1,967 unique clubs across 83 countries. 659 high confidence, 131 medium, 76 low.
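To make the merge step concrete, here is one way a union-with-confidence merge could look. The grading rule (three or more sources = high, two = medium, one = low) is purely a placeholder assumption, as are the function name and input shape; the actual rule used for the numbers above is not spelled out here.

```python
from collections import defaultdict


def merge_sources(sources):
    """Union club lists per player and grade each player by how many sources agree.

    sources: {"source-name": {"player-id": ["Club A", ...]}}  (hypothetical shape)
    """
    merged = defaultdict(lambda: {"clubs": set(), "sources": []})
    for source_name, players in sources.items():
        for player_id, clubs in players.items():
            merged[player_id]["clubs"] |= set(clubs)
            merged[player_id]["sources"].append(source_name)
    for record in merged.values():
        n = len(record["sources"])
        # Placeholder thresholds, not the site's real scoring.
        record["confidence"] = "high" if n >= 3 else "medium" if n == 2 else "low"
    return dict(merged)
```

Taking the union is what lets four partial sources (589 to 704 players each) add up to full coverage of all 866.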

New pages shipped:

  • 30 country pathway pages (/pathways/country/united-states, etc.)
  • 30 team development network pages (/pathways/network/atlanta-united, etc.)
  • Upgraded hub with geographic breakdown and pipeline type visualization
  • 2 pillar articles: "Where Do MLS Players Come From" and "MLS Development Networks"

The Health Check

Built a /murk-checkup command—a 10-check diagnostic that audits database integrity, route health, deployment status, sitemap completeness, content quality, data freshness, internal linking, SEO, build health, and comparison coverage. Three modes: read-only audit, auto-fix, and fix+deploy.

First run caught three issues:

  1. Player comparisons had been truncated to 500 rows (should be 100K+). Regenerated.
  2. Stat leaders dataset was completely missing. Seeded 10 categories, 477 entries.
  3. 50 meta descriptions exceeded 160 characters. Trimmed to under 155.
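As an illustration of how one check can behave differently per mode, here is a sketch of the meta description check. The names, the trailing "..." on trimmed text, and the in-memory pages dict are assumptions; the 160 and 155 character limits are the ones mentioned above. The deploy step of the third mode is left out.

```python
from enum import Enum


class Mode(Enum):
    AUDIT = "audit"            # read-only: report, change nothing
    FIX = "fix"                # report and apply fixes
    FIX_DEPLOY = "fix+deploy"  # apply fixes, then deploy (deploy not shown)


MAX_LEN = 160     # descriptions longer than this are flagged
TARGET_LEN = 155  # trimmed descriptions end up shorter than this


def trim_description(text, limit=TARGET_LEN):
    """Cut at a word boundary so the result, including "...", is under `limit`."""
    if len(text) < limit:
        return text
    cut = text[: limit - 4].rsplit(" ", 1)[0].rstrip(" ,;:.")
    return cut + "..."


def check_meta_descriptions(pages, mode):
    """pages: {route: description}. Return flagged routes; fix in place unless auditing."""
    too_long = [route for route, desc in pages.items() if len(desc) > MAX_LEN]
    if mode is not Mode.AUDIT:
        for route in too_long:
            pages[route] = trim_description(pages[route])
    return too_long
```

Each of the ten checks could follow the same pattern: always return what it found, and only mutate anything when the mode allows it.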

The Numbers

Metric                     Before     After
Youth clubs tracked        832        1,967
Countries represented      ~40        83
Pathway confidence (high)  ~50%       76%
ASA coverage               61%        65% (99.3% of eligible)
Comparisons in DB          500        100,092
Cron jobs                  2          3
Automated data sources     1 (ESPN)   3 (ESPN, ASA, Wikipedia)

What's Next

The site now has 1,400+ pages of real content, automated data pipelines, and a health monitoring system. The bottleneck is no longer content or data quality. It's visibility. Domain Rating is 0. The site hasn't been submitted to Google Search Console yet. Nothing we've built matters until people can find it.

Day 7 should be about backlinks and indexation.