Hash Browns and Cardholder Data
Table of contents
Protecting Credit Card Data
Unless you have been living under a cave, or you happen to be a follower of Dave Ramsey's financial advice, it's highly likely that you own one or more credit cards. They drive the American (if not the world) economy. In 2021, there were 204.5 billion non-cash payments, of which cards (including credit and debit cards) accounted for a whopping 76% of those transactions and three-quarters of US consumers have both credit and debit cards. On the flip side, credit card fraud is at an all-time high, with fraud and identity theft reaching a high of 6.1 million in 2021.
So with all this cardholder data out there, it would be a good thing to protect it. Right? As it turns out, there's a treatise on the protection of cardholder data. It's known as the Payment Card Industry Data Security Standard. It has not one, but two requirements dedicated to protecting cardholder data: protecting stored cardholder data as well as transmission over public, open networks.
Heck, if the 360 pages of the PCI DSS 4.0 are too much to take in they even have a handy, 3 page PCI Data Storage Do's and Dont's.1
But alas, in my growing campaign to receive cease and desist orders, we're going to look at one particular scenario of a failure to adequately protect cardholder data. Let's revist the Marriott data breach, and examine how some recent information just might change the calculus2 of that scenario.
Looking back at the Marriott Data Breach
In 2015, Marriott announced that it was acquiring Starwood in a merger agreement, and the deal was closed in 2016 for a whopping $13 billion dollars. Per CNBC, "30 hotel brands now fall under the Marriott umbrella to create the largest hotel chain in the world with more than 5,800 properties and 1.1 million rooms in more than 110 countries. That’s more than 1 out of every 15 hotel rooms around the globe."
But it was not happily ever after. In 2018, Marriott learned that the Starwood reservation system had been compromised as far back as 2014, with customer's personal information included as part of the breach, including payment card numbers.
For approximately 327 million of these guests, the information includes some combination of name, mailing address, phone number, email address, passport number, Starwood Preferred Guest (“SPG”) account information, date of birth, gender, arrival and departure information, reservation date, and communication preferences. For some, the information also includes payment card numbers and payment card expiration dates, but the payment card numbers were encrypted using Advanced Encryption Standard encryption (AES-128).
In Marriott's original statement, the payment card numbers were encrypted using AES-128. Used correctly with appropriate key management, it would take a sun-exploding amount of time to break the encryption on the card numbers.3
End of story right? Shortest blog post ever. Fast-forward to 2024. In a court disposition, Marriott admitted that instead of AES-128, the payment card information was protected with SHA-1.
Um...oops?
How did Marriott confuse AES-128, an encryption algorithm, with SHA-1, a hashing algorithm?4 What does this mean for the (in)security of the credit card data? Let's dive a little bit deeper.
But first, let's eat.
On the virtues of hash (browns)
As my friends and family are well aware, I love a good breakfast. Whether french toast, pancakes, biscuits, or just a big breakfast sandwich5 I can't turn it down. And what's a good breakfast without the side items - like hash browns! I like mine plain, with a bit of ketchup, but Waffle House will serve em scattered, smothered, covered, and even "all-the-way"6
Tasty diversions aside, a "hash" is also very important in cybersecurity. Cryptographic hash functions form the basis of guaranteeing integrity i.e. ensuring that data or a system has not been modified. More formally, a cryptographic hash function takes an arbitrary input and compresses it to a fixed output. It has an important property, it should be computationally infeasible to find two messages m1 = m2 such that the resultant hash outputs h1 = h2.
This property (known as collision resistance) is important for guaranteeing integrity, which is the assurance that data (or a system) has not been changed. For example, I could take the SHA-1 hash of a file (such as a book) and then send that book to you. I can post the hash of the book somewhere public (like a website). Now, when you download the book you can be confident that the file has not been modified.
$ sha1sum "War and Peace (Leo Tolstoy).pdf"
550fc30716c30775fb25d40d36406ea89bef5776 War and Peace (Leo Tolstoy).pdf
This property of collison resistance makes hash functions useful for file integrity monitoring systems like AIDE. You can create a database of really important files that shouldn't be modified, compute the hashes for those files, and then regularly check the hashes in your database vs. the current hashes of those files. If you have a mismatch - SCREAM LOUDLY!!
Another nice property of hash functions is that they are a one-way function. That is, it's easy to find the hash of a given input, but computationally infeasible to find the original input when given the hash. Just like you'd have a hard time turning your hash browns back into a potato(es), you should not be able to take the output of your hash function and get the original input. This property makes hash functions very useful for protecting passwords, as it ensures that even if the password database is breached the original passwords cannot be recovered.7
So if hash functions are the perfect method for protecting a password, they should be more than enough to protect credit card numbers, right? Right!?!?!
Oh astute reader, we again find ourselves at an impasse. No, hash functions are not sufficient for protecting cardholder data:
- SHA-1 itself has been broken, and no longer provides sufficient collision resistance.
- Hash functions may be one-way, but you can "run them forward" in a brute-force method.
- And Credit card numbers aren't exactly random - many parts of the card are either freely available and/or easily predicatble.
Let's look at each of these in turn.
SHAttered
The SHA-1 acronym literally stands for Secure Hashing Algorithm - 1. It was originally defined in FIPS 180-1 and developed by the NSA back in 1995. Since SHA-1 produces an output of 160 bits, due to the birthday paradox the expected security of SHA-1 was 80 bits. In 2006, it was shown to have an effective strength of 63 bits, forcing NIST to formally deprecate SHA-1 and recommend a transition to SHA-2 by 2010.
On February 23, 2017, SHA-1 was broken in practice by a team of researchers from Google:
It is now practically possible to craft two colliding PDF files and obtain
a SHA-1 digital signature on the first PDF file which can also be abused as
a valid signature on the second PDF file.
SHA-1 is now officialy deprecated for all uses8 and will be removed from the Cryptographic Module Validation Program.
So, Marriott used a broken hash function. That's like, not good. With that said, no general hashing9 function should be used for protecting cardholder data. To understand this issue, let's examine brute-forcing a hash function.
Hash me up some Credit Cards Numbers!
In cryptography, it's not just enough to not have any vulnerabilities. We want to be secure against all adversaries10. This includes not only the bit strength of the scheme, but also protecting against a priori information. This is information this is known beforehand. In a practical sense, this is referring to an attack who can predict qualities about either the plaintext or the ciphertext.
A hash function is a one-way function. This makes it very difficult, if not impossible, to reverse. However, we can try a whole lot of different inputs to the hash function. Let's say we recovered a database of hash values. In this case, we don't have to reverse the hash, we simply need to find an input that equals the hash function.
Since a hash function is designed to map an arbitrary input11 to a fixed output, for an attacker who knows nothing about the plaintext this is an exercise in futility. The input domain is simply too large. But what about credit card numbers? Let's assume that our adversary knows we used a hashing algorithm to encrypt a credit card, and also knows the algorithm used. What information does this give the attacker?
For starters, we know a credit card number is a 16-digit12 decimal number. This information narrows the range of inputs considerably. So how long would it take to crack a SHA-1 hash of a 16 digit decimal number?
import hashlib
import time
def brute_force_cc_hash(target, num_digits):
for i in range(0, 10 ** int(num_digits)):
cc = format(i, '0' + num_digits + 'd')
hash = hashlib.sha1(bytes(cc, "utf-8")).hexdigest()
if hash == target:
print("Found: {}".format(cc))
if __name__ == "__main__":
num = "1234567"
hash_num = hashlib.sha1(bytes(num, "utf-8")).hexdigest()
start = time.time()
brute_force_cc_hash(hash_num, "7")
end = time.time()
print("Time Taken: {}".format(end - start))
In the interest of brevity I've omitted error checking from this code, but it performs the following:
- Function brute_force_cc_hash takes in two inputs, a target (which is a hash output to match) and num_digits which is a string denoting the number of digits.
- It then iterates over all numbers in the range. Credit card numbers are almost always formatted as strings. Thus, we're taking an integer, using the format function to convert it to a string, and then we take the SHA-1 hash of that string.
- Finally, we compare the hash we just took the target. If it matches, we win!
- The main statement simply sets up a target number, takes the hash of this number, and calls brute_force_cc_hash while timing the output.
If you want to run this code yourself, change the num variable to a valid size and be sure to pass the right size into the brute_force_cc_hash.
So how long does it take to brute-force a 16 digit credit card number? Well, only a little over 326 years...
Input Size | Time (seconds) | Time (years) |
---|---|---|
1 | 0.00006 | 1.90258751902588E-12 |
2 | 0.00014 | 4.43937087772704E-12 |
3 | 0.0011 | 3.4880771182141E-11 |
4 | 0.011 | 3.4880771182141E-10 |
5 | 0.11 | 3.4880771182141E-09 |
6 | 1.04 | 3.29781836631152E-08 |
7 | 10.29 | 3.26293759512938E-07 |
8 | 102.9 | 3.26293759512938E-06 |
9 | 1029 | 3.26293759512938E-05 |
10 | 10290 | 0.000326293759512938 |
11 | 102900 | 0.00326293759512938 |
12 | 1029000 | 0.0326293759512938 |
13 | 10290000 | 0.326293759512938 |
14 | 102900000 | 3.26293759512938 |
15 | 1029000000 | 32.6293759512937 |
16 | 10290000000 | 326.293759512938 |
This is actually a nice example of seeing how algorithms grow. At its core, we're just counting. If a human counts one number a second, and never messes up, it would take over 31 years to count to a billion! A computer speeds up that process considerably, but this code still takes 172 minutes to perform this task with an input size of one billion13. And every time we add a single digit to the length the resulting code takes ten times longer to execute.
Now you can understand how cracking a 128-bit AES key takes a sun-exploding amount of time; simply counting to a decimal number with 38 digits is mind boggling.
So, 327 years is a lot of time. Longer than the life span of any human, and much longer than the 4 or so years that most credit cards are valid for. Safe to use to protect cardholder data, right?
Nope. Remember that whole a priori thing? As it turns out, parts of a credit card number are either easily guessed, predictable, or likely stored right alongside the rest of the information. This makes the attacker's job considerably easier...
First 6, Last 4
Ever use your credit card at a gas station, clothing store, or grocery store and looked closely at your receipt? Only the last four digits of your credit card number should be displayed, and the expiration date should be blanked out. This is actually a federal law and is designed to protect consumers; it provides just enough information so that you can remember what card you used, but not enough information for an attacker to steal your card number.
This process, also known as masking, is designed to protect credit card information when displayed on screens or on paper receipts. It's so common that it's abbreviated to "First 6, Last 4" to remember the rules for masking card data. For the full language from PCI DSS Requirement 3.3:
Mask PAN when displayed (the first six and last four digits are the maximum number of digits to be displayed), such that only personnel with a legitimate business need can see more than the first six/last four digits of the PAN.
When this process is designed to protect cardholder data that is electronically stored, it's referred to as truncation.
So what happens if a hash algorithm is used to protect the cardholder data, and it is stored alongside the truncated value of the PAN? Let's assume worst case scenario here, that "First 6, Last 4" are available. This is so damaging that the PCI Data Security Standard warns against it.
Note: It is a relatively trivial effort for a malicious individual to reconstruct original PAN data if they have access to both the truncated and hashed version of a PAN.
Let's look at just how trivial this effort is, using an example for a VISA test card number: 4111 1111 1111 1111. Let's assume we already have the resultant hash. We also know the First 6, Last 4. What do we have to solve for? Just six numbers in the middle.
4111 11** **** 1111
Let's modify our code from above, just a little bit:
import hashlib
import time
def brute_force_cc_hash(target, bin, last_four):
for i in range(0, 10 ** 6):
cc = bin + format(i, '06d') + last_four
hash = hashlib.sha1(bytes(cc, "utf-8")).hexdigest()
if hash == target:
print("Found: {}".format(cc))
if __name__ == "__main__":
full_number = "4111111111111111"
hash_num = hashlib.sha1(bytes(full_number, "utf-8")).hexdigest()
start = time.time()
brute_force_cc_hash(hash_num, "411111", "1111")
end = time.time()
print("Time Taken: {}".format(end - start))
How long does it take to break this one? 1.14 seconds.
But wait you say! SHA-1 is broken! What if we use a hash that's still collision-resistant, like SHA-512?
Well, it does take longer, a whole .29 seconds. SHA-3 adds a whole 'nother .20 seconds...
Okay, so don't store truncated data anywhere near a hash. Problem solved.
Or we can talk more. While it turns out the last four digits of a card number are fairly random, those first six digits are anything but.
Understanding Credit Card BINs
Let's talk a bit more about that First 614, also known as the Bank Identification Number (BIN). The BIN "identifies which particular institution issued a given credit card..it's essentially the bank's calling card; each card-issuing bank has a unique BIN".
Thus, depending on the institution an attacker might be able to predict the entire BIN. Even if the full BIN can't be predicted, some general information is known; all VISA BINs start with a 4. Depending on the card brand, simple knowledge of the type of card may narrow the range of inputs considerably.15 Heck, Capital One is giving one of theirs away for free and a quick Google search shows they may have anywhere between 70 and 90 BINs.
For the sake of keeping this blog PG-13, I'm not going to link to any BIN databases here, but suffice it to say, they are out there. Thus, at this point we have to assume that the knowledge of the BIN is a priori.
So, how long does it take to break a hashed credit card number when only the BIN Is known? It's an attack against 10 digits, so a little under 3 hours from my Python code above.
But let's assume the attacker can do better than some hobbled together Python code. Maybe they have a massively parallel computer just sitting in their laptop and/or desktop computer? That 3 hour cracking time can drop to a whole 35 seconds with a modest laptop and hashcat16.
Moving Forward
Suffice it to say, hashing algorithms, regardless of whether they've been broken or not, are simply no longer sufficient for the protection of predictable data over small domains. We can no longer assume that a truncated portion of the hash isn't available, or that a BIN can't be narrowed down to a small range of possibilities.
Fortunately, it looks like the PCI Software Security Council is wising up to the threat posed by using hash functions to protect cardholder data. In the latest [version (4.0) of the PCI DSS they added a new requirement for keyed cryptographic hashes. This requirement goes into effect March 31, 2025.
3.5.1.1 Hashes used to render PAN unreadable (per the first bullet of Requirement 3.5.1) are keyed cryptographic hashes of the entire PAN, with associated key-management processes
The use of a key requires the attacker to either recover or guess the key, bringing back our standard definition of sun-exploding amounts of time in order to break the cryptography.
So hopefully, moving forward there won't be anymore confusion between encryption and hashing algorithms.
Quick Cheatsheet for Cardholder Data
If you find yourself in a position where you may be around cardholder data:
- If you don't need it, don't store it!!. Consider using tokenization to reduce or eliminate the need to interact with cardhold data entirely,
- If you have to store it, store it securely. Follow the guidance laid out in the PCI Data Security Standard and ask a Qualified Security Assessor for official advice.
- Audit any homegrown cryptography code to ensure cryptograhpic primitives and modes of operation are being used appropriately. Until relatively recently, some crypto libraries defaulted to ECB mode.
- Don't forget about key management! Store cryptographic keys away from data, limit key custodians to the fewest members possible, and deploy dual-knowledge and split control.
-
This is awkward, but this document was written in 2008 and recommends a one-way hash function based on strong cryptography. I think we'll show why that advice is a bit dated...and PCI DSS 4.0 finally corrects it. ↩
-
Let's get our video game reference out of the way, this time Mass Effect 2. In Mordin's loyalty mission, he advocates not performing tests on sapient species. As he puts it No tests on species with members capable of calculus. ↩
-
At the time, Marriot could not confirm that the encryption key wasn't also compromised as part of the data breach. As strong as AES-128 is, if you have the key, it's game over. ↩
-
AES-128 is a block cipher that works with 128 bit inputs and produces a 128 bit output. It uses a 128 bit key. SHA-1 is a hashing algorithm that takes in an arbitrary length input and produces a 160 bit output. 160 > 128. It's simple math. Not to mention that unless your doing something silly (like using ECB mode) the AES ciphertext should also contain an initialization vector, pushing its total size to 256 bits. ↩
-
In fairness, if the breakfast contains bread, eggs, and a meat, I will combine them into a sandwich. It's human nature. ↩
-
Given that Waffle House started in the Atlanta area you might forgive me for assuming it is a southern dish. As it turns out, it was first served in New York City ↩
-
For an example of password protection gone wrong look into the Adobe breach. Not only did Adobe use encryption instead of hashing, it used the same key for all the passwords, and used ECB mode. ↩
-
Generally speaking HMAC-SHA1 remains secure. I suspect the advice is twofold: 1) Attacks only get better and 2) We humans have a limited data retention capability, and thus "all SHA-1 is bad" prevents the misuse of SHA-1. ↩
-
PCI-DSS 4.0 specifies that only a keyed HMAC or CMAC function should be used to protect cardholder data. This changes the security properties significantly, as we're no longer looking for UF-CMA, but PRF security. In less formal terms, you basically need a key to recover the original input, making the security value similar to encryption. ↩
-
In practice this refers to computational security, i.e. time and space. It should be computationally infeasible for an adversary to recover our secrets. ↩
-
So there is technically a limit to the input size for SHA-1, which is 2^64-1. This comes out to two exabytes, 1,000 petabytes, or 1,000,000 terrabytes. Very, very large. ↩
-
American Express uses 15 digits. This only enhances my argument. ↩
-
In fairness, as a high-level language Python isn't exactly known for blistering speed. ↩
-
As it turns out there is a fairly recent effort to move to 8-digit BINs, but we're going to ignore that for the purposes of this discussion. ↩
-
I really don't recommend putting any real BINs into these types of BIN checker sites. There's no telling how they might be protected, if at all. ↩
-
Props where its due: This is a really nice article showing the basics of hashcat and the dangers of hashing functions. ↩