What I Learned Building for 100M+ Users
Scale changes everything
When you go from thousands to millions of users, every assumption gets tested. Database queries that took 5ms start taking 500ms. Caching strategies that seemed overkill become essential. And the code that "worked fine" in staging starts failing in production at 3am on a Saturday.
I learned this firsthand leading the Game Consoles team at Crunchyroll, a streaming platform serving millions of concurrent anime fans across PlayStation 4/5, Xbox One/Series X|S, and Nintendo Switch. Console streaming is a particularly unforgiving environment: you're running on locked hardware with fixed memory budgets, no ability to push a hotfix without going through Sony, Microsoft, or Nintendo certification, and a global audience that notices a 200ms stutter in playback immediately.
Constraints are your best teachers
Every console generation ships with a fixed memory ceiling. There's no "just add RAM" escape hatch. This forces a discipline that server-side developers rarely encounter: you have to reason carefully about every allocation, every cache layer, and every buffer before you ship, because you can't patch it quietly at runtime.
Adaptive bitrate streaming taught me the same lesson from a different angle. ABR logic that works beautifully on a broadband desktop connection behaves entirely differently on a living-room console with a shared Wi-Fi network and a 4K panel expecting a perfect signal. Tuning the video pipeline meant building a deep understanding of the full chain: manifest parsing, segment buffering, bitrate ladder decisions, decoder handoff, and frame presentation timing, all of it under hardware constraints that vary across three console families simultaneously.
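One link in that chain, the bitrate ladder decision, can be sketched as a simple throughput-based heuristic. This TypeScript is a toy version for illustration; the ladder, safety factors, and names are invented, not the production tuning described above:

```typescript
// Illustrative ABR sketch: pick the highest ladder rung that fits
// within a safety-discounted throughput estimate.
interface Rung {
  bitrateKbps: number;
  height: number;
}

function pickRung(
  ladder: Rung[],         // sorted ascending by bitrate
  throughputKbps: number, // recent measured download throughput
  bufferSeconds: number,  // media currently buffered ahead of playback
): Rung {
  // Be conservative when the buffer is shallow: a stall on a living-room
  // TV is far more visible than a brief drop in resolution.
  const safety = bufferSeconds < 10 ? 0.5 : 0.8;
  const budget = throughputKbps * safety;

  let choice = ladder[0];
  for (const rung of ladder) {
    if (rung.bitrateKbps <= budget) choice = rung;
  }
  return choice;
}
```

Even this toy version shows why console tuning is hard: the right safety factor depends on decoder handoff latency and panel behavior, which differ across all three console families.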
The constraint isn't the enemy. The constraint is the thing that makes you actually understand the system.
The migration that couldn't fail
The most concentrated scaling lesson came from an unavoidable crisis. Following Sony's acquisition of Crunchyroll, You.i Engine, the cross-platform UI framework powering all our console clients, became unavailable. We had a hard one-year deadline to migrate three console codebases to a new in-house stack, with 3M+ active subscribers who could not be allowed to see a single outage.
That project forced every good engineering habit into sharp relief:
- Parallel delivery over big-bang rewrites. We maintained feature parity on the old stack while building the new one in parallel, not because it was comfortable, but because there was no acceptable window to go dark.
- CI/CD as a load-bearing wall. Rebuilding our pipelines from scratch wasn't a nice-to-have. It was what made zero-downtime migration physically possible. Automated builds, console emulator testing, and submission workflows to three certification portals had to be reliable before anything else was.
- Cross-platform abstraction as a design discipline. Across seven repositories and three SDK families (PlayStation SDK, Xbox SDK, Nintendo NDK), the abstraction layer separating platform-specific code from shared business logic was what let us move fast without breaking things that were already certified and live.
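The shape of that abstraction boundary can be sketched in a few lines: shared business logic depends only on an interface, and each console family supplies its own implementation. The interface and names below are hypothetical, not the actual in-house stack:

```typescript
// Illustrative sketch of the platform/shared boundary.
interface PlatformServices {
  name: string;
  maxDecodeHeight(): number; // hardware decoder ceiling for this family
}

class PlaybackSession {
  constructor(private platform: PlatformServices) {}

  // Shared logic: identical across all console builds, so a change here
  // never touches code that is already certified for another platform.
  capResolution(requestedHeight: number): number {
    return Math.min(requestedHeight, this.platform.maxDecodeHeight());
  }
}
```

The payoff is in the certification workflow: platform implementations that have already passed Sony, Microsoft, or Nintendo review stay frozen while the shared layer keeps moving.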
The patterns that held up
Event-driven architectures. Feature flags. Progressive rollouts. These aren't just buzzwords; they're survival tools. At Crunchyroll, we could deploy to 1% of traffic, watch the metrics, and roll back in seconds if something looked off.
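The mechanics behind a 1% rollout are worth seeing once. A minimal sketch, assuming hash-based bucketing (the hash choice and function names are illustrative): hash the user id into a stable bucket so a given user doesn't flap between old and new code paths as the dial turns up.

```typescript
// Illustrative sketch of a percentage rollout with stable per-user buckets.
function bucketOf(userId: string): number {
  // FNV-1a-style hash: deterministic, so the same user always lands in
  // the same bucket across sessions and devices.
  let h = 2166136261;
  for (let i = 0; i < userId.length; i++) {
    h ^= userId.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return (h >>> 0) % 100; // 0..99
}

function isEnabled(userId: string, rolloutPercent: number): boolean {
  // Turning the dial from 1 to 5 only adds users; nobody already in
  // the rollout falls back out.
  return bucketOf(userId) < rolloutPercent;
}
```

Rolling back is then just setting the percentage to zero, which is why "roll back in seconds" is realistic: no deploy is involved.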
A few more that proved their worth under pressure:
| Pattern | Why it matters at scale |
|---|---|
| Crash diagnostics before features | On constrained hardware, an unresponsive app or a hard freeze isn't just bad UX. It can be a certification failure. Investing in deep crash triage tooling early pays back every sprint. |
| Performance profiling as a habit | Profile continuously, not just when something breaks. On consoles, memory pressure shows up as instability long before it shows up as an obvious error. |
| DRM as a first-class citizen | Content protection isn't a feature you bolt on at the end. It touches the entire playback pipeline and varies per platform. The teams that treat it as an afterthought pay for it at certification time. |
| Early AI tooling adoption | Integrating GitHub Copilot at Crunchyroll before it was standard practice gave the team a real productivity edge on repetitive SDK boilerplate and cross-platform abstraction code, freeing focus for the problems that actually required deep thinking. |
What scale really teaches you
Scale doesn't introduce new categories of problems. It just removes your tolerance for the ones you already had. Sloppy abstractions, untested edge cases, deployment processes that relied on someone knowing the right incantation: all of it becomes load-bearing at millions of users in a way it never was at thousands.
The engineers who thrive at scale aren't the ones who know the most tricks. They're the ones who've internalized that systems fail, that failure has to be planned for, and that the goal isn't to prevent every incident. It's to make every incident recoverable.