Pave’s technical foundation
11 September 2013 · Posted by justinatpave
Pave was built from the start with three goals from a technical perspective:
- Build something that enables us to work and iterate quickly.
- Build something that scales.
- Build something we’re proud of.
Over the last year, this is the architecture we’ve developed. It works well, but we’re always interested in how we can improve — suggestions are welcome in the comments.
1/ Amazon Web Services (AWS) Virtual Private Cloud (VPC) + OpenVPN
Before getting into other details, we should discuss a framework for securing and controlling access to the network as a whole. The only external points of access to our internal network are through our VPN server (we use OpenVPN, which works for now) or through the Elastic Load Balancer (ELB) that fronts our web servers.
Within that network, we further segregate into several tiers of machines, with different permissions. Broadly speaking, there are three main concepts:
- Application tier. This is where the production and canary web servers live. This tier is segregated so that, even if a web server were compromised, attackers would have almost no access to resources on other machines.
- Services tier. This hosts machines like our databases, development servers, servers for asynchronous processes, etc. Communication within the tier is fairly open, but connections from outside the tier are only allowed from the VPN.
- VPN / auth / Network File System (NFS) / etc. This isn’t really one tier but several, each with a very specific network configuration that only allows traffic to/from the necessary hosts on pre-defined ports. For example, our dev machines can contact our NFS server specifically for NFS, but it would be impossible to Secure Shell (SSH) from the NFS server to the development machines.
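The tier rules above boil down to a small allow-list of (source, destination, port) combinations. Here is a toy model of that policy — the tier names and port numbers are illustrative assumptions, not our actual configuration:

```python
# Simplified model of the network tiers described above.
# Tier names and ports are hypothetical, chosen for clarity.
RULES = {
    # (source tier, destination tier): set of allowed destination ports
    ("vpn", "app"):           {22},            # SSH into app servers from the VPN
    ("vpn", "services"):      {22, 27017},     # SSH and Mongo access from the VPN
    ("elb", "app"):           {80},            # web traffic enters via the ELB
    ("app", "services"):      {27017},         # app tier may reach the databases
    ("services", "services"): {22, 27017},     # fairly open intra-tier traffic
    ("services", "nfs"):      {2049},          # dev boxes can mount NFS...
    # ...but note there is no ("nfs", "services") entry:
    # the NFS server cannot SSH back into the dev machines.
}

def is_allowed(src, dst, port):
    """Return True if a connection from tier `src` to tier `dst`
    on `port` would pass the (simplified) network rules."""
    return port in RULES.get((src, dst), set())
```

In practice this policy is expressed as VPC security group rules rather than application code, but the asymmetry (dev can reach NFS; NFS cannot reach dev) is the important property.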
2/ Web application.
Pave is written in Python and based on the Flask microframework. The web servers themselves are EC2 instances running nginx + uwsgi to handle requests, although requests are first routed through an ELB. This framework should allow us to scale horizontally for a long time before running into problems. Adding new machines is as simple as provisioning an instance in EC2, running a command to set it up, and adding it to the ELB. This could be entirely automated if that were a priority at the moment.
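For a sense of what the nginx-to-uwsgi handoff looks like on one of those instances, here is a hedged sketch — the socket path and settings are placeholders, not our real configuration:

```nginx
# Hypothetical nginx server block for a single web server behind the ELB.
server {
    listen 80;

    location / {
        include uwsgi_params;
        # uwsgi listens on a local Unix socket; nginx proxies requests to it
        uwsgi_pass unix:/tmp/app.sock;
    }
}
```

The matching uwsgi process would be started with something along the lines of `uwsgi --socket /tmp/app.sock --module app:app --processes 4`. Adding a machine to the pool is then just this config plus registering the instance with the ELB.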
3/ Data storage.
We use MongoDB running on EC2 instances for our primary data storage. We run three m1.small instances for the mongo config servers, several m1.large instances for our mongod (and their replicas), and then each application or dev server runs its own mongos. This has a few advantages:
- Easy sharding. This isn’t terribly difficult with PostgreSQL or MySQL, but can involve some application-layer logic once the data gets large enough.
- No need for memcache/elasticache. Best practice with Mongo is to use any additional RAM for the DB machines themselves, rather than attempting to front with another caching layer. It’s nice to entirely eliminate an extra layer of architecture.
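To make the first point concrete: with mongos, shard routing is transparent to the application. The sketch below shows roughly the hash-based routing logic an application would otherwise have to carry itself (e.g. across several PostgreSQL instances) — shard names are made up for illustration:

```python
import hashlib

# What application-layer sharding looks like when the app has to do the
# routing itself. With MongoDB, mongos performs this routing transparently,
# so application code never contains anything like this.
SHARDS = ["shard-a", "shard-b", "shard-c"]  # hypothetical shard names

def shard_for(key):
    """Deterministically map a shard key (e.g. a user id) to a shard."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

Every query path in the application would need to go through a function like this (and re-balancing when adding shards is harder still), which is exactly the logic mongos absorbs.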
4/ Development environments.
All the engineers work on a system that closely mimics the production environment. Right now, everyone follows one of two approaches:
- SSH into a development instance hosted in EC2. This is my workflow, and it has a few advantages: a lost or broken laptop doesn’t hurt productivity or put important data at risk, I don’t need to schlep my work laptop around to be productive, and there are no hardware constraints (e.g. I can create a new dev machine, of any instance size I need, within a few minutes).
- Run Ubuntu locally, and develop there. There’s a little less latency when working (not usually a problem, but occasionally annoying), and you’re not reliant on an internet connection to work.
We also run an internal version of the site, which gets updated to the latest code every five minutes. This has two advantages: first, it gives us a safe place inside our VPN (added security) to run all admin tools; second, it keeps everyone in the company using the latest version of the code. Many bugs are caught here.
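A periodic refresh like that can be as simple as a cron entry — the paths and service name below are made-up placeholders, not our actual setup:

```
# crontab on the internal server: pull the latest code every 5 minutes
*/5 * * * * cd /srv/internal-site && git pull -q && sudo service uwsgi restart
```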
Finally, there’s also a canary tier which fully mimics production. This lets us test things like load balancers, CDN configs, and anything else that’s hard to test without full production replication. We almost always deploy new code here before pushing live.
5/ Hosting photos and other uploaded files
We use AWS S3 for this. I used to manage the Photos product eng team back at Facebook, so I had a pretty good idea of how complicated it is to set up good, efficient, large scale hosting for photos (hint: we even wrote our own file system). S3 abstracts away all that complexity, which makes our lives much much easier.
6/ Static content.
- LESS for styling. LESS is a great way to abstract out a bit of CSS (e.g. it gives you mixins, more obvious scoping, variables). It makes development easier, and when we deploy it just gets compiled into per-page CSS. Each page view only needs 1-2 requests for CSS, which increases performance.
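As a quick illustration of the features mentioned above (variables, mixins, scoping), here is a small made-up LESS snippet — the names and values are examples, not our actual stylesheet:

```less
// Variables keep values consistent across the codebase
@brand-color: #2a6496;

// A mixin with a default argument
.rounded(@radius: 4px) {
  border-radius: @radius;
}

.signup-button {
  color: @brand-color;
  .rounded(6px);            // reuse the mixin
  &:hover {                 // nesting makes scoping obvious
    color: darken(@brand-color, 10%);
  }
}
```

At deploy time this compiles down to plain per-page CSS, so the abstraction costs nothing at runtime.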
7/ Tools.
- Jenkins for Continuous Integration (CI). We run our unit and integration tests against each commit. Any time something breaks, the entire engineering team gets an email.
- Phabricator for code review. Every line of code in Pave’s codebase has been reviewed by another engineer. I think this has led to better overall engineering quality, plus it ensures everyone has an idea of what’s changing in the system.
- Github for repository hosting. Buy, don’t build.
- Asana for tasks, bugs, and sprint tracking. Asana strikes me as a second best solution to almost every problem. It’s not purpose built for development. It’s also not purpose built for sales, interview feedback, or tracking what we want to stock in our kitchen. But it works reasonably well for all of these jobs, and lets us unify a solution across the company. That said, I’m well known for being its biggest internal advocate :)
- Hipchat. Pretty cool for internal communication.
- Optimizely. Great way to give non-engineers the autonomy to set up experiments across the site without taking engineering resources. It isn’t great for non-trivial tests, though.
- Mixpanel. Tracking all our funnels, conversion rates, and experiments. We’re watching data from Mixpanel continuously on “information radiators” via…
- NewRelic. I can’t say enough good things about NewRelic for visibility into the application. It handles server monitoring, watches for errors, lets us know about performance concerns, and allows us to track all of that over time.
- Pingdom. First line of defense against major problems. If the site were to go down or a major page start fataling, half the company would have a text message within a minute or two.
- Loggly. All of our logs are sent to Loggly, where they’re indexed and made searchable. It’s not the best tool in the world (NewRelic is better for log aggregation), but it’s easily searchable and has a decent interface.
- Iterable. A service that we use to help send triggered email. This has enabled non-technical team members to edit email copy and formatting.