The second half of All your base was just as great as the first, although there was so many interesting ideas and new shiny things to try I will be busy through the holidays trying to process it all.
The talks this in the afternoon treated us to a feast of: Git as a NoSQL db, Redis, MySQL at Twitter, LawnChair mobile web persistence, Postgres Demystified and Apache Cassandra. What follows is just some of that data goodness…
Postgres is a really powerful database yet still feels pretty lightweight to use and a world away from the heavy-duty admin required for other “enterprise” db’s.
Postgres is also one of the most innovative relational datbases out there, long having an object-relational model and Multiversion Concurrency Control (MVCC) to keep data really safe. Support was recently added for storing data in the json format and you can even hook Postgrest up directly to Redis.
With so many features it often referred to as the emacs of databases, in that it can really do so much more than most databases.
So its no surprise then that the popularity of Postgres has grown over the last few years and that companies such as Atlassian and Heroku run their businesses on it.
One bug-bear for developers was trying to get postgres installed on a Mac, it was far too painful. So the team at Heroku created an app for that at http://postgresapp.com/. Now its just a matter of a download and dragging the .app to your applications folder.
Otherwise, PSQL is a really easy to use terminal client for Postgres. PSQL enabling you to execute queries interactively and see the query results instantly. PSQL also supports tab completes and
\e allows you to edit the commands directly in what ever is your default shell editor (emacs, vim, etc). PSQL also provides meta-commands and various shell-like features to facilitate writing scripts and automating a wide variety of tasks, so its really flexible.
There are lots of data types supported in Postgres, including the really useful arrays. As thought has been put into time based data then support for Range types and
timestamptz is also a really so nice touch to help you prevent time conflicts.
explain command can help you work out inefficiencies in your database queries. Adding explain to your queries will show you what is going to happen if that query is run and an estimate of how long postgres thinks it will take to run.
explain analyze postgres will run the query and then analyse how long the query actually took. A good target is to aim for 10 miliseconds as a baseline.
PgAdmin can also show you a graphical representation of the explain plan.
A great indexing technique is to create your indexes concurrenty, so you dont need to take you database off for a day to improve performance.
create index concurrently _index_name_ on _table_
However, dont worry about indexes too much up front, wait and see how your uses work with your application and then start to optimise.
We all know that Git is amazing at storing and managing source code, but how is it for data? That was the question posed by Brandon Keepers of Github?
When you ask git what it is,
man git, then it just tells you its a stupid content tracker. That doesnt mean its a bad tracker, its just that it does not care what it is tracking, leading to the idea that Git can be used for more than code.
If you look at the contents of the
.git/objects folder for a project you can see all the files that make up the git data model. So you can see, its not like a “traditional” data store and as its file based will have several limitations.
When a file is added to git,
git commit, it puts that file in a blob and creates a hash of the data that is therefore unique. If the hash is the same on two files it means that their data is exactly the same.
A tree is then created that points to that blob and any other related trees. Finally we get back to a commit, which has a tree that is the root of the project as well as project info such as author and message.
When a new change is added to git it has to point to a point in the existing tree. As anything existing is immutable, a new file is created as a blob with a new tree that points to. However, that new tree can use reference a parent in the existing tree so changes are added efficiently as only the things that have changed are added to the data model.
All the great features that database systems have, git has none of those. What git can do is:
- Versioning - look back at previous changes really easily and create diffs of what has changed.
- Hooks - update caches as an alternative form of indexing for searches.
- No “big design up front” - optimize the storage based on the actual usage patterns and evolve the data structures as more discovery is made.
- Branches - gives long lived transactions
- Replication - a clone is a full copy of all the data and clones can be driven by hooks So Querying data is something you just have to figure out yourself. Concurrency is a challenge as git is file system based, so when multiple people try to update then changes could be lost. Scaling is also a really big challenge, especially when randomly updating records in git - partly this is because we are re-writing the whole tree structure to make an update.
Git also does a really bad job when it comes to handling long running projects that have grown considerable in size. Add, status and commit commands talke longer and longer.
So whilst its interesting to think about git as some kind of simple data store, there are a great many challenges if you want to manage big data sets within one git project.
Brandon left us with the thought “abuse your tools, imagine how to make them better!”.
Redis is one of those great projects that when discovered by developers they really take to it. Redis is short for remote dictionary server although you can think of it as a remote hashing server.
You can also consider redis as a Domain Specific Language (DSL) for abstract data types. Its similar to memcache but it adds a whole lot more functionality.
There are three big uses for Redis:
- A simple to use and work with NoSQL database
- A messaging system, Redis comes with a publish / subscribe API
- A web application cache (an alternative to memcache) - you can store your webpages in redis and they come out really fastSony use redis to manage the scoring from their online Playstation3 games network. Using a sorted set in Redis, millions of scores are managed for each game (as I can attest to on my poor scoring when playing StarDust).
Pintrest also used redis as much of their content, specifically as they presented their content in terms of ranking. All those values of all those users have to be kept somewhere.
Some important characteristics of Redis include:
- Individual actions are atomic so data does not become corrupt
- Single threaded
- Event driven
- Data can be expired
- Replication can be done with the use of a slave and Redis is very good at live replacement.
- Very high performance - everything is stored in memory, so its super fast. It does have persistence though as otherwise it wouldn’t be ultimately useful.
- You can use just about any language you have ever heard of with Redis (except Assembly & Fortran).
- You can also use Redis from the Postgres databaseSo what are you waiting for, go and play with Redis today! Or simply add RedisToGo to your application on Heroku
All your base was a fantastic experience and its great to see how hugely popular conferences are. It was great to mix with people from a wider range of backgrounds that the usual developer conference.
I am looking forward to All Your Base 2013!
This work is licensed under a Creative Commons Attribution 4.0 ShareAlike License, including custom images & stylesheets. Permissions beyond the scope of this license may be available at @jr0cket