Two new sections went live tonight, both built by parallel agent swarms pulling from the same data well: Wikipedia and Wikidata. One maps where every MLS player came from. The other tells the founding story of every MLS club.
Where Players Come From
The question that started this: can we trace the development pipeline for every player on an MLS roster? Where did they play as kids? Which academies produced the most pros? Is college still a viable path?
Five data collection agents ran in parallel:
- Wikipedia Player Fetch: hit all 867 player names against Wikipedia's MediaWiki API. Found 704 pages (81%). Parsed infobox fields for youthclubs and youthyears. 529 players had youth club data.
- Homegrown Tagger: pulled the Wikipedia "Homegrown Players (MLS)" category. 506 total, 144 matched to current rosters.
- Wikidata Enricher: SPARQL queries for career history and education data. 3,681 players, 1,833 with education records.
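The infobox parsing step can be sketched roughly as follows. This is a minimal illustration, not the agent's actual code: it assumes the page wikitext has already been fetched, that each infobox parameter sits on its own line, and that the fields are numbered (youthclubs1, youthyears1, ...) as in Wikipedia's football biography infobox. The sample player and "Example United F.C." are made up.

```python
import re

# Numbered infobox parameters such as "| youthclubs1 = ..." / "| youthyears1 = ...".
# Assumes one parameter per line; [ \t]* keeps a match from spilling onto the next line.
FIELD_RE = re.compile(
    r"^[ \t]*\|[ \t]*(youthclubs|youthyears)(\d+)[ \t]*=[ \t]*(.*?)[ \t]*$",
    re.MULTILINE,
)
# Wiki links look like [[Target]] or [[Target|Label]]; keep the label when present.
LINK_RE = re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]+)\]\]")


def parse_youth_career(wikitext: str) -> list[dict]:
    """Return [{'years': ..., 'club': ...}, ...] ordered by infobox index."""
    rows: dict[int, dict] = {}
    for field, index, value in FIELD_RE.findall(wikitext):
        key = "club" if field == "youthclubs" else "years"
        rows.setdefault(int(index), {})[key] = LINK_RE.sub(r"\1", value)
    # Drop entries that have years but no (or an empty) club name.
    return [rows[i] for i in sorted(rows) if rows[i].get("club")]


# Hypothetical wikitext, for illustration only.
sample = """{{Infobox football biography
| name = Example Player
| youthyears1 = 2010–2015
| youthclubs1 = [[New York Red Bulls Academy|NYRB Academy]]
| youthyears2 = 2015–2016
| youthclubs2 = [[Example United F.C.]]
| years1 = 2016–
}}"""

youth_career = parse_youth_career(sample)
```

Real infoboxes also contain nested templates and references inside these values, so a production parser would likely need a proper wikitext library rather than regular expressions.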
Then a normalization agent merged everything: deduplicating club name variants ("NYRB Academy" = "New York Red Bulls Academy"), classifying clubs by type (MLS academy, college, foreign, independent), and linking academies to parent MLS teams.
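A rough sketch of what that merge step might look like, under stated assumptions: the alias table, the academy-to-parent mapping, the keyword hints for colleges, and the country-code check are all hypothetical stand-ins, not the real rules the agent used.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical alias table: casefolded variant -> canonical name.
ALIASES = {
    "nyrb academy": "New York Red Bulls Academy",
    "new york red bulls academy": "New York Red Bulls Academy",
}

# Hypothetical links from academy name to parent MLS team.
ACADEMY_PARENTS = {
    "New York Red Bulls Academy": "New York Red Bulls",
    "Toronto FC Academy": "Toronto FC",
}

# Assumed heuristic: a college is anything with one of these words in its name.
COLLEGE_HINTS = ("university", "college")


@dataclass(frozen=True)
class Club:
    name: str
    kind: str  # "mls_academy", "college", "foreign", or "independent"
    parent_team: Optional[str] = None


def canonical_name(raw: str) -> str:
    """Collapse whitespace, then map known variants to one canonical spelling."""
    cleaned = " ".join(raw.split())
    return ALIASES.get(cleaned.casefold(), cleaned)


def classify(raw: str, country: str = "US") -> Club:
    """Classify a club by type; `country` is an assumed ISO-style code."""
    name = canonical_name(raw)
    if name in ACADEMY_PARENTS:
        return Club(name, "mls_academy", ACADEMY_PARENTS[name])
    if any(hint in name.casefold() for hint in COLLEGE_HINTS):
        return Club(name, "college")
    if country not in ("US", "CA"):
        return Club(name, "foreign")
    return Club(name, "independent")
```

Because `Club` is frozen and hashable, variants that normalize to the same record collapse naturally when collected into a set, which is one simple way to deduplicate.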
The Numbers
| Metric | Count |
|---|---|
| Player pathways | 629 |
| Unique clubs/academies | 832 |
| MLS academies | 34 |
| Colleges | 53 |
| Foreign academies | 124 |
| Youth clubs | 621 |
| Homegrown players | 144 |
Top Academy Producers
Toronto FC leads with 16 current MLS players from its academy. Philadelphia Union (14), New York Red Bulls (13), Columbus Crew (12), Seattle Sounders (12), and FC Dallas (12) round out the top producers. Many of these players are starters and national team contributors.
Every player profile now has a "Development Path" section showing their timeline: youth club → academy → college → MLS. The new /pathways hub ranks academies, colleges, and international feeders by how many current MLS players they've produced.
How Every Team Got Here
The second swarm pulled founding narratives for all 30 clubs. Wikipedia is remarkably detailed on MLS team histories — the average team article had 3,000+ words of cited content covering expansion bids, ownership battles, stadium construction, naming contests, and first seasons.
In total, the swarm collected 92,524 words across 183 history sections, spanning the New England Revolution's founding in 1994 through San Diego FC's 2025 debut. Every story is different: Columbus Crew's name came from a public contest with 2,500 entries. Inter Miami took a decade of David Beckham's lobbying. Seattle Sounders sold out their first match in 32 minutes.
Each team page now has a collapsible "Club History" section with:
- Founding narrative (accordion sections)
- Milestone timeline (key dates with event descriptions)
- Complete coaching history with years (from Wikidata)
- Stadium history
- Wikipedia attribution
The Solstice FC Bridge
This wasn't just about enriching MLS Pulse. The youth networks section creates a natural bridge to Solstice FC — our youth soccer reform project. When you can see that Toronto FC's academy produced 16 current MLS players, or that FC Dallas's pipeline is one of the deepest in the league, it puts concrete numbers behind the question Solstice is trying to answer: what does good youth development actually look like?
The data vocabulary is shared: academy tiers, pathway types, development stages. The articles we'll write (“College vs Academy: Two Paths to MLS”, “Which Academies Produce the Most MLS Players?”) can reference both sites. Content moat meets content bridge.
Technical Notes
Wikipedia's MediaWiki API is shockingly good for this kind of structured data extraction. The youthclubs infobox fields use a consistent format across almost all football biographies. The main challenge was name normalization — matching "Tyler Adams" in our DB to "Tyler Adams" on Wikipedia sounds trivial until you hit accented characters, suffixes like "(soccer)", and mononyms.
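One way to handle those three cases is to compare names through a normalized key. The sketch below is an assumption about how such matching could work, not the code that actually ran. The function names and the sample titles are illustrative.

```python
import re
import unicodedata

# Trailing disambiguators such as "(soccer)" or "(footballer, born 1999)".
DISAMBIG_RE = re.compile(r"\s*\([^)]*\)\s*$")


def name_key(name: str) -> str:
    """Build a comparison key: no disambiguator, no accents, casefolded."""
    name = DISAMBIG_RE.sub("", name)
    # NFKD splits "é" into "e" plus a combining accent, which is then dropped.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def match_titles(db_names, wiki_titles):
    """Map each DB name to a Wikipedia title, or None if missing or ambiguous."""
    by_key = {}
    for title in wiki_titles:
        by_key.setdefault(name_key(title), []).append(title)
    matches = {}
    for name in db_names:
        candidates = by_key.get(name_key(name), [])
        # Only accept an unambiguous hit; mononyms often collide.
        matches[name] = candidates[0] if len(candidates) == 1 else None
    return matches
```

Accent stripping via NFKD does not cover letters without a decomposition (for example "ø" or "ł"), so those would need an explicit replacement table. Ambiguous keys are returned as None here so they can be resolved by hand or with extra signals such as birth date.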
The Wikidata SPARQL endpoint was trickier. MLS is entity Q18543 (not Q14764 as initially guessed). The coaching history query returned 212 entries with start/end date qualifiers — richer than anything ESPN provides.
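For reference, a coaching-history query of this kind could look something like the sketch below. The actual query is not shown here, so this is a guess at its shape: it assumes teams are linked to the league via P118 (league) and that coaches are stored as P286 (head coach) statements carrying P580/P582 (start/end time) qualifiers. The parser expects the standard SPARQL JSON results layout; the sample response is invented.

```python
LEAGUE_QID = "Q18543"  # MLS, per the note above

# Assumed query shape; "p:" reaches the full statement so qualifiers are visible.
COACH_QUERY = f"""
SELECT ?teamLabel ?coachLabel ?start ?end WHERE {{
  ?team wdt:P118 wd:{LEAGUE_QID} ;   # P118 = league
        p:P286 ?stmt .               # P286 = head coach (statement node)
  ?stmt ps:P286 ?coach .
  OPTIONAL {{ ?stmt pq:P580 ?start . }}  # P580 = start time
  OPTIONAL {{ ?stmt pq:P582 ?end . }}    # P582 = end time
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
ORDER BY ?teamLabel ?start
"""


def parse_bindings(result: dict) -> list[dict]:
    """Flatten SPARQL JSON results into rows; missing qualifiers become None."""
    rows = []
    for b in result["results"]["bindings"]:
        rows.append({
            "team": b["teamLabel"]["value"],
            "coach": b["coachLabel"]["value"],
            # Timestamps arrive like "2019-01-01T00:00:00Z"; keep the date part.
            "start": b.get("start", {}).get("value", "")[:10] or None,
            "end": b.get("end", {}).get("value", "")[:10] or None,
        })
    return rows


# Invented response in the standard SPARQL JSON results format.
sample_response = {
    "head": {"vars": ["teamLabel", "coachLabel", "start", "end"]},
    "results": {"bindings": [{
        "teamLabel": {"type": "literal", "value": "Example FC"},
        "coachLabel": {"type": "literal", "value": "Example Coach"},
        "start": {"type": "literal", "value": "2019-01-01T00:00:00Z"},
    }]},
}

coaching_rows = parse_bindings(sample_response)
```

The query string would be sent to the public endpoint at https://query.wikidata.org/sparql with a JSON format parameter and a descriptive User-Agent header.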
Both datasets are seeded in Neon alongside the existing team, player, and stadium data. Same getData() / getBySlug() access pattern. Same ISR caching. Zero new infrastructure.