Authentication - Part 1: Hashes & Salts

Disclaimer: This is part of a series I’m writing based on the content found in App Academy Open as well as my own research. The writings here are purely an academic exercise and not intended as advice on how to implement security.

Concept #1: Don’t store passwords in the database

Database breaches happen all the time. Why does this matter to us? Because we are about to ask the user for a password in order to verify that they are who they say they are. Best-case scenario, that password is just a random string generated by a password manager. But as responsible, security-minded developers, we can’t assume the best case. Worst-case scenario? It is the same password they use for banking, or for Gmail, with no two-factor authentication. They may have handed us the keys to all their hopes, dreams, and years of effort. Let’s say we just store that password in our database. Then, if and when our database is breached, there go our user’s hopes, dreams, and years of effort.

Concept #2: Protect user data through a data transformation

If we can’t store a password directly in the database, what do we do with it? We’ll need a way to meaningfully transform the password into a state that an outsider can’t identify before we store it. Then, once we transform and store it, we also need a way to look up that transformed password.

If a user signs in to the site after sign-up, we need to be able to take the current password they are providing and verify that the user account they are trying to log in as has the same password.

There are three broad approaches we could consider for this kind of transformation: encoding, encrypting, and hashing. Which one should we use? If you are already familiar with these, feel free to skip to Concept #3.

Let’s take a look at these three in a bit more detail to understand why one solves our problem better than the others.

Option #1: Transforming using Encoding

Encoding, specifically character encoding, is a reversible data transformation that maps an input to an output, based on a given scheme. A common example is Base64 encoding, which uses a scheme of mapping a 6-bit binary sequence (such as 000010) to its corresponding character (in this case, C). Then to encode or decode something in Base64, we just need a table of the mappings between the 6-bit values and the 64 characters used in Base64. If you give a Base64 function C, it will give you back 000010, and vice versa. The utility of Base64 is in converting binary data to text for the purposes of data transmission where text may be the only supported format. The biggest issue with this form of plain encoding is that, given the output (i.e. the thing we want to store on the server), the only thing preventing a malicious user from figuring out our input (i.e. the password) is knowing what scheme we used.
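To see how little protection encoding gives us, here is a minimal sketch using Python’s standard base64 module. The variable names are just for illustration; the point is that no key or secret is involved in either direction.

```python
import base64

password = "password"

# Encode: bytes in, Base64 text out. No key or secret is needed.
encoded = base64.b64encode(password.encode("utf-8")).decode("ascii")

# Decode: anyone who recognizes the scheme can reverse it.
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)  # cGFzc3dvcmQ=
print(decoded)  # password
```

If this encoded value were what we stored, a breach would expose the password to anyone who tried the obvious decoders.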

Option #2: Transforming using Encrypting

Maybe the solution is to just use a less guessable scheme? Encryption is a form of encoding where the scheme depends on a secret key, so the output is intentionally unpredictable to anyone who doesn’t have that key. A very simple example of encryption would be a Caesar cipher. In a Caesar cipher, the cipher takes in an alphabetical character, such as a, and a key, such as 2. Then the cipher moves from a up 2 places in the alphabet, to c. The Caesar cipher applied to password with a key of 2 would be rcuuyqtf. Although this cipher is not something used in modern development, the essence of encryption is still present. More importantly, the issue with using encryption for our solution can still be seen. Let’s say we encrypt our password prior to storing it. Then where and how do we store our encryption key? While encryption could potentially work to protect the user password, we’ve just traded one problem for another. There has to be a more elegant solution.
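The Caesar cipher is small enough to sketch in a few lines of Python. The function name caesar_shift is made up for this example, and it only handles lowercase letters, but it shows the key problem clearly: whoever holds the key can undo the transformation.

```python
import string

ALPHABET = string.ascii_lowercase


def caesar_shift(text: str, key: int) -> str:
    """Shift each lowercase letter `key` places, wrapping around after z."""
    shifted = []
    for ch in text:
        if ch in ALPHABET:
            index = (ALPHABET.index(ch) + key) % len(ALPHABET)
            shifted.append(ALPHABET[index])
        else:
            shifted.append(ch)  # leave anything that isn't a-z untouched
    return "".join(shifted)


ciphertext = caesar_shift("password", 2)
plaintext = caesar_shift(ciphertext, -2)  # the same key reverses it

print(ciphertext)  # rcuuyqtf
print(plaintext)   # password
```

Decrypting is just shifting by the negative of the key, so the key has to be protected as carefully as the passwords themselves.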

Option #3: Transforming using Hashing

One key observation in choosing a password storage solution is that we never need to recover the password from what we stored. The user is required to give us the password to log in, so we will always have the password when we need to do a user lookup. This is important, as it allows us to use a one-way transformation. As long as our transformation is deterministic (i.e. predictably repeatable), we can take the password, transform it, then compare the result with the stored version. This is where a hash function comes into play. A hash function is a deterministic, one-way function, which takes in data and applies a transformation to output a “digest” (i.e. the transformed data).
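Here is a minimal sketch of that idea in Python, assuming SHA-256 from the standard hashlib module as the hash function. That choice is only for illustration, since the text above doesn’t commit to a particular function, and digest_of is a hypothetical helper name.

```python
import hashlib


def digest_of(password: str) -> str:
    """Return the SHA-256 digest of the password as a hex string."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


# At sign-up we store only the digest, never the password itself.
stored = digest_of("password")

# At login we hash whatever the user typed and compare digests.
attempt_ok = digest_of("password") == stored   # same input, same digest
attempt_bad = digest_of("passw0rd") == stored  # different input, different digest

print(attempt_ok, attempt_bad)  # True False
```

Because the function is deterministic, the comparison works without us ever needing a way to turn the digest back into the password.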

Concept #3: Salt your hashes

A hashing function gets us most of the way, but before we start on how to build this thing, there is one additional contingency we need to address. While it would take some serious effort for a malicious user to brute-force a password given only a digest, there are other ways they could use a digest to find a password. They may be able to look the digest up in a precomputed table of common passwords and their digests, such as a rainbow table, or spot popular passwords by how often the same digest appears. Or even more plausibly, they may be able to start from an already compromised user, whose password is known from a separate breach, compute the hash of that known password, and check for any matching hashes in the database. Then any other user with the same password, regardless of their own password best practices, would be compromised as well.
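The cross-referencing attack is easy to picture with a small sketch. Everything here is hypothetical: the users table, the names, and the passwords are invented, and SHA-256 stands in for whatever hash function the site uses.

```python
import hashlib


def unsalted_digest(password: str) -> str:
    """Hash the password on its own, with nothing mixed in."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


# A made-up users table that stores unsalted digests.
users = {
    "alice": unsalted_digest("hunter2"),
    "bob": unsalted_digest("correct horse battery staple"),
    "carol": unsalted_digest("hunter2"),
}

# The attacker already knows alice's password from a separate breach.
known = unsalted_digest("hunter2")

# Every account with a matching digest shares that password.
exposed = sorted(name for name, digest in users.items() if digest == known)

print(exposed)  # ['alice', 'carol']
```

Carol never did anything wrong, but her account falls as soon as alice’s password is known.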

One workaround for these vulnerabilities is to ensure each user’s hash digest is unique within the database. We can do this by generating a bit of random text called a salt, then adding that salt to the password (either prepending or appending), then hashing the result. The salt is stored alongside the digest so that we can repeat the same transformation when the user logs in. As long as the salt is sufficiently random and different for each user, precomputed tables and frequency analysis of the digests no longer help, and there are no matching digests within the database with which to cross-reference compromised passwords.
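A minimal sketch of salting, under a few assumptions: the salt is prepended, SHA-256 is again just a stand-in hash, and make_record and verify are hypothetical helper names. Real systems typically reach for a deliberately slow password-hashing function (bcrypt and scrypt are common examples) rather than a single fast hash, but the salting idea is the same.

```python
import hashlib
import hmac
import secrets
from typing import Tuple


def make_record(password: str) -> Tuple[str, str]:
    """Create the (salt, digest) pair we would store at sign-up."""
    salt = secrets.token_hex(16)  # 16 random bytes, as 32 hex characters
    digest = hashlib.sha256((salt + password).encode("utf-8")).hexdigest()
    return salt, digest


def verify(password: str, salt: str, stored_digest: str) -> bool:
    """Re-hash the login attempt with the stored salt and compare."""
    candidate = hashlib.sha256((salt + password).encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, stored_digest)


# Two users choose the same password but end up with different digests.
salt_a, digest_a = make_record("hunter2")
salt_b, digest_b = make_record("hunter2")

print(digest_a == digest_b)                 # False
print(verify("hunter2", salt_a, digest_a))  # True
print(verify("hunter3", salt_a, digest_a))  # False
```

The salt isn’t a secret. Its job is only to make each stored digest unique, so it can sit in the same row as the digest.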

Zach Grammon