Foreword

Some background about me: I’ve been in the industry for over 10 years now. I am currently a Site Reliability Engineer for Twitch Interactive working on the live video systems. I have previously built several platforms related to video game security, systems security, and a few other esoteric subject matters. I have also helped secure and operate large public and private clouds: some gaming focused, some general purpose, and some in remarkably complex environments that require a more advanced security posture, like the banking sector. I am also a hacker. Mostly ethical, though less so way back in the past.

I am what happens when you mix a hacker, software developer, systems engineer, and network engineer.

Security

User Security

Users want the least amount of friction. If a task costs them money or takes more than a few minutes, they are unlikely to follow through with it.

2FA

I see a lot of people saying things like “OH, you need to force 2FA on all accounts”. 2FA is a blessing and a curse for a platform. You can force people to use it, but the type of 2FA you implement determines the security of your users. Let’s go through a few ranked from worst to best…

  • SMS is a hard no. Whether through socially engineering the phone company into activating a SIM card in place of the current one (SIM swapping) or through other nefarious means, an attacker can receive the codes, which makes SMS a bad 2FA implementation.
  • A system like Google Authenticator means users can lose access if a phone breaks, gets erased, stolen, etc.
  • Companies like Authy attempt to help resolve the previous option’s issue by backing up the 2FA tokens and tying them to your account. What happens if your users never set up their Authy account? What happens if they change their number and Authy can’t verify they owned the number in the past? What happens if a user account gets taken over because Authy lets an attacker have access to the 2FA account?
  • There are solutions like RSA tokens, but a rotating code from a hard token can be phished and relayed just like any other one-time code, and the tokens are excessively expensive: upwards of $150 for a single fob that has to be registered into an even more expensive system.
  • Yubikeys CAN be good 2FA… Yubikey’s OTP mode can be defeated like all of the above systems with a well performed MITM attack. A Yubikey configured for U2F, however, is the current gold standard. The key produces a public-key signature that covers the origin of the site that asked for it, so the signature is only valid for that site. This breaks MITM attacks, as the real site will reject the auth attempt for carrying the wrong URL. Mobile support for this system has been limited, however, meaning it will require engineering hours to work out a solution that allows mobile users to still authenticate to the platform.
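To make the MITM point in the list above concrete, here is a minimal sketch of the time-based OTP scheme (RFC 6238) that Google Authenticator style apps use. The function names `totp` and `verify` are mine, not from any particular library, and a real deployment would also handle clock drift and rate limiting. Notice what is missing: nothing in the calculation knows which site the code is for.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Compute an RFC 6238 time-based one-time password (HMAC-SHA1 variant)."""
    # The only inputs are the shared secret and the current 30-second window.
    counter = unix_time // step
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): pick 4 bytes based on the last nibble.
    offset = mac[-1] & 0x0F
    truncated = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(truncated % 10 ** digits).zfill(digits)


def verify(secret: bytes, submitted: str, unix_time: int) -> bool:
    """Server-side check: the code is valid no matter who relays it."""
    return hmac.compare_digest(totp(secret, unix_time), submitted)


if __name__ == "__main__":
    secret = b"12345678901234567890"  # RFC 6238 test secret, not a real one
    code = totp(secret, 59)
    # A phishing proxy that collects `code` from the user can replay it to
    # the real site within the same window and pass this check.
    print(code, verify(secret, code, 59))
```

Because the code depends only on the secret and the clock, a phishing page that relays it in real time passes verification. A U2F signature includes the origin, which is exactly the input this scheme lacks.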

2FA will also not keep bad users off a platform. It is actually trivial to automate account setup with forced 2FA so that a bad user can spin up accounts rapidly with zero real friction. Gonna toss some reCAPTCHA in there to prevent bots? reCAPTCHA can be defeated too. (Yes, those are all unique links. It’s that bad of a solution. And yes, hCaptcha too.)

Platform Security

Securing your platform is probably one of the most critical tasks you will have to undertake. Why break into a single user’s account when the site is vulnerable to some generic exploit an attacker can launch from Metasploit?

Operation Security (OpSec)

OpSec is a topic most platforms don’t think about until they have already been compromised. Who has access to production systems? How are these users accessing them? How is the company ensuring that only trusted users can gain access and that the access is justified? A 3FA system is the best option in industry currently: a password, a TOTP or other OTP solution, and a second set of eyes. Access attempts should be logged to a Security Information and Event Management (SIEM) system. These systems are not a panacea, but they ensure that bad actors have a harder time maintaining a foothold.
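As a rough illustration of the “second set of eyes” idea, here is a small sketch of an access decision that requires a password, an OTP, and an approver who is not the requester, and that logs every attempt. The `AccessRequest` type, the `decide` function, and the JSON log format are all hypothetical; a real setup would ship these events to whatever SIEM you run.

```python
import json
import logging
from dataclasses import dataclass
from typing import Optional

log = logging.getLogger("prod-access")  # stand-in for a SIEM forwarder


@dataclass(frozen=True)
class AccessRequest:
    requester: str
    approver: Optional[str]  # the second set of eyes, if any
    password_ok: bool        # result of the password check
    otp_ok: bool             # result of the TOTP/OTP check
    reason: str              # why the access is justified


def decide(req: AccessRequest) -> bool:
    """Grant access only when all three factors are satisfied."""
    granted = (
        req.password_ok
        and req.otp_ok
        and req.approver is not None
        and req.approver != req.requester  # nobody approves themselves
        and bool(req.reason.strip())       # access must be justified
    )
    # Log denied attempts too; those are often the interesting ones.
    log.info(json.dumps({
        "event": "prod_access_attempt",
        "requester": req.requester,
        "approver": req.approver,
        "reason": req.reason,
        "granted": granted,
    }))
    return granted


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(decide(AccessRequest("alice", "bob", True, True, "rollback deploy")))
    print(decide(AccessRequest("alice", "alice", True, True, "self-approved")))
```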

Incident Management (IM)

Incident Management (IM) is a requirement for a platform to survive. These are the people that are concerned with “Why did the site go down for 20 minutes yesterday? What steps are we taking to prevent that in the future?” and “How did the attacker gain a foothold into our systems? How are we patching that out? How are we ensuring they don’t have backdoors back into the network? What data did they obtain? Have we contacted the relevant authorities?” An Incident Management team is a requirement for platform security and stability. Without people focused on this task, your developers will cause your users to flee.

Safety

IP Bans

A lot of end users think IP bans are the BEST way to remove bad users from a platform. It’s to these users that I address this section… an IP Address is NOT a person. Between VPNs, open proxies, general purpose compute, and just plain old resetting an internet connection, ANYONE can go around IP bans. A lot of large Twitch streamers use VPNs so that they don’t leak their IP when they click links in chat. (It’s just good security, really.) Would you have security conscious users unable to use VPNs to access the platform to prevent bad actors? Knowing full well those bad actors can just restart their DSL or cable modem and they’d have a new IP?

“But my IP never changes, so clearly this doesn’t apply to everyone?” Yes, it does apply to everyone or nearly everyone on the open internet. If your IP never changes, unplug your router from your modem and plug a laptop directly in. Reboot the modem. Enjoy your new IP you’ve never had before.

An IP is provided to you from your ISP via a protocol called “Dynamic Host Configuration Protocol” (DHCP). DHCP identifies your system by its hardware address, called a Media Access Control (MAC) address. The laptop presents a different MAC address than your router did, so the ISP will typically hand out a different address. The MAC address is burned into the hardware, and while the burned-in value cannot be changed, the address a device presents can be spoofed. You don’t ban a MAC address BECAUSE it can be spoofed easily, and you don’t ban an IP address because a bad actor can change it easily. As you can see, this avenue isn’t going to work for safety.
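Here is a toy sketch of that failure mode, assuming the simplest possible ban list keyed on IP address. The `ban` and `is_blocked` helpers are made up for illustration, and the addresses come from the reserved documentation ranges.

```python
# Hypothetical ban list keyed on IP address only.
banned_ips = set()


def ban(ip: str) -> None:
    banned_ips.add(ip)


def is_blocked(ip: str) -> bool:
    return ip in banned_ips


ban("203.0.113.7")     # the bad actor's address at the time of the ban
ban("198.51.100.20")   # a shared VPN exit the bad actor used once

print(is_blocked("203.0.113.7"))    # True: the old address is blocked
print(is_blocked("203.0.113.99"))   # False: same person after a modem reboot
print(is_blocked("198.51.100.20"))  # True: so is every other user of that VPN exit
```

The ban misses the person it was aimed at and hits people it was never meant for, which is the whole problem in three lines of output.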

Word Filtering

Did you know there are roughly 6,500 languages in daily use worldwide? What words should be banned? Do they have a common meaning in a different language? A racist word in English can easily be a commonplace word in a different language with a different meaning. Would you have a word that means “to give” banned because it means something horrible in another language? At what point is your action or inaction impacting your worldwide userbase? It takes a steady hand to carefully apply word filtering that doesn’t impact good users but prevents bad behavior. Unless you happen to find someone that can speak all 6,500 languages and dialects while being steeped in the local cultures of each of these languages, you will impact a good user. At what point do you stop; where is the line?

Scale

On a Global Scale

Did you know that there are only a handful of truly global platforms? Facebook, Google (including YouTube, Search, Gmail, GCP, etc.), Steam, AWS (Netflix gets thrown in here as they are well known to be on AWS), Twitch, and Twitter. All of these platforms have multiple Points of Presence (PoPs) on all or nearly all continents. PoPs are typically interconnected via “Dark Fiber” that is lit and used solely by the respective platform, creating an interconnected web that allows your requests to enter the platform’s network at a location geographically close to you. These PoPs often also have direct connections to the local ISPs that serve your area. This is called Peering and is ESSENTIAL to a globally scaled platform. The actual server you are talking to can be on the other side of the world and the data can get back to you in typically 1-3 seconds. New platforms are unable to perform at this level and will have a restricted userbase to show for it. Several studies have shown that if it takes longer than 3 seconds for a webpage to load, your users are probably long gone.

Horizontal and Vertical

Oftentimes some “DevOps” person will come along and configure some auto-scaling group, not by Requests Per Second (RPS), but by trivial values such as CPU usage or RAM usage. While these can be indicators of a need to scale, they are symptoms that a problem is already happening. By the time your new VM or container is spun up, users have likely already been impacted or the problem has already gone away. The proper solution is to load test the platform and scale it to the expected amount of traffic and then scale beyond it to N+X, where N is your normal machine count and X is your “overflow capacity”. You should always be scaled for your peak expected capacity plus overflow. There is no telling when a “Hug of Death” will hit a site. Overflow capacity looks different for each platform, but a reasonable starting point is ~50% of your normal capacity, leaving the final equation of N+(N/2). Auto-scaling should be set up to trigger at about two-thirds of your service’s max RPS.
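The arithmetic above can be sketched in a few lines. This is only an illustration: `plan_capacity` and its parameter names are mine, the per-instance RPS figure is assumed to come from your own load tests, and I am reading “two-thirds of max RPS” as a per-instance threshold, which is one possible interpretation.

```python
import math
from fractions import Fraction


def plan_capacity(peak_rps: int, max_rps_per_instance: int,
                  overflow_ratio: Fraction = Fraction(1, 2),
                  scale_trigger: Fraction = Fraction(2, 3)) -> dict:
    """Size a fleet as N + X and pick an RPS-based auto-scaling trigger."""
    # N: instances needed to serve the expected peak, per load testing.
    normal = math.ceil(Fraction(peak_rps, max_rps_per_instance))
    # X: overflow capacity, ~50% of N by default, giving N + (N/2).
    overflow = math.ceil(normal * overflow_ratio)
    return {
        "normal": normal,
        "overflow": overflow,
        "total": normal + overflow,
        # Scale out when an instance averages this many requests per second.
        "trigger_rps_per_instance": float(max_rps_per_instance * scale_trigger),
    }


if __name__ == "__main__":
    # Assumed numbers: 9,000 RPS at peak, 300 RPS per instance under load test.
    print(plan_capacity(peak_rps=9000, max_rps_per_instance=300))
    # {'normal': 30, 'overflow': 15, 'total': 45, 'trigger_rps_per_instance': 200.0}
```

Triggering on RPS rather than CPU means new capacity is requested while each instance still has a third of its headroom left, instead of after it is already struggling.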

Cost

I recently saw an upcoming platform claim that they would function by doing the following: taking no money from venture capitalists, having no ads, giving you direct ways to support the platform, and taking no cut from the content creators.
I’m willing to bet that platform will die off rather quickly. In order to scale to a size that can compete with industry giants you need a surplus of capital. Having enough money to only pay your developers, safety personnel, and server costs doesn’t allow you to grow a platform. You need POSITIVE CASH FLOW. For that platform to survive, it will need additional flows of revenue. The likely source being that some large corporation will buy them out and then all those promises go out the window when management changes.
If you give users a choice between free and paid, most of them will take the free option. You can force them to pay for the content like Nebula and Floatplane do, but both of those platforms will never reach their true potential scale.

People

Your people cost is higher than you think. Time and time again you hear about Facebook’s blunders with outsourcing their Safety Operations team. The teams are paid minimum wage and have to regularly see horrible atrocities with no breaks between the content. They are expected to go on to the next task like they didn’t just see someone get maimed or worse. Your safety teams will NEED counseling and downtime to recover, and this is still typically a high turnover job even if you focus 100% on their well-being.

Platform

Your cost to scale to a global level with proper interconnects to local ISPs starts to rapidly approach the $1 million mark at a few thousand concurrent viewers, if you scale properly to allow near real-time glass-to-glass latency.