700 million email and password Data Breach

in #privacy, 6 years ago (edited)

Check your email and change your password ASAP: https://haveibeenpwned.com/

Let's start with the raw numbers because that's the headline, then I'll drill down into where it's from and what it's composed of. Collection #1 is a set of email addresses and passwords totalling 2,692,818,238 rows. It's made up of many different individual data breaches from literally thousands of different sources. (And yes, fellow techies, that's a sizeable amount more than a signed 32-bit integer can hold.)
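For readers who want the 32-bit aside spelled out, here is a small arithmetic sketch. The row count comes from the paragraph above; the helper names are made up for illustration. Note that the count overflows a signed 32-bit integer but would still fit in an unsigned one.

```python
# Row count quoted in the post.
ROWS = 2_692_818_238

INT32_MAX = 2**31 - 1   # 2,147,483,647, largest signed 32-bit value
UINT32_MAX = 2**32 - 1  # 4,294,967,295, largest unsigned 32-bit value


def fits_signed_32(n: int) -> bool:
    """True if n can be stored in a signed 32-bit integer."""
    return -(2**31) <= n <= INT32_MAX


def fits_unsigned_32(n: int) -> bool:
    """True if n can be stored in an unsigned 32-bit integer."""
    return 0 <= n <= UINT32_MAX


if __name__ == "__main__":
    print(fits_signed_32(ROWS))    # False
    print(fits_unsigned_32(ROWS))  # True
    print(ROWS - INT32_MAX)        # 545334591 rows past the signed limit
```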

In total, there are 1,160,253,228 unique combinations of email addresses and passwords. This is when treating the password as case sensitive but the email address as not case sensitive. This also includes some junk because hackers being hackers, they don't always neatly format their data dumps into an easily consumable fashion. (I found a combination of different delimiter types including colons, semicolons, spaces and indeed a combination of different file types such as delimited text files, files containing SQL statements and other compressed archives.)
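The uniqueness rule described above (password case sensitive, email address not) can be sketched roughly as follows. This is not the actual processing pipeline, which the post does not show; the function names and the exact splitting logic are assumptions, and only the delimiter types (colon, semicolon, space) come from the text.

```python
import re

# Delimiters named in the post: colon, semicolon, whitespace.
_DELIMITER = re.compile(r"[:;\s]")


def parse_combo(line: str):
    """Return (email, password) from one 'email<delimiter>password' line, or None.

    The email is lowercased (treated as case insensitive); the password is
    kept exactly as found (treated as case sensitive).
    """
    line = line.strip()
    at = line.find("@")
    if at <= 0:
        return None  # no plausible email address on this line
    # Look for the first delimiter *after* the '@' so the address stays whole.
    match = _DELIMITER.search(line, at)
    if match is None:
        return None
    email, password = line[:match.start()], line[match.end():]
    if not password:
        return None
    return email.lower(), password


def unique_combos(lines):
    """Collect the distinct (email, password) pairs from an iterable of lines."""
    seen = set()
    for line in lines:
        combo = parse_combo(line)
        if combo is not None:
            seen.add(combo)
    return seen


if __name__ == "__main__":
    sample = [
        "Alice@Example.com:Secret1",
        "alice@example.com:Secret1",   # same pair once the email is lowercased
        "alice@example.com;secret1",   # different pair: password case differs
        "bob@example.com hunter2",
    ]
    print(len(unique_combos(sample)))  # 3
```

Only the first delimiter after the address is used for the split, so a password that itself contains a colon or semicolon survives intact.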

The unique email addresses totalled 772,904,991. This is the headline you're seeing as this is the volume of data that has now been loaded into Have I Been Pwned (HIBP). It's after as much clean-up as I could reasonably do and per the previous paragraph, the source data was presented in a variety of different formats and levels of "cleanliness". This number makes it the single largest breach ever to be loaded into HIBP.

There are 21,222,975 unique passwords. As with the email addresses, this was after implementing a bunch of rules to do as much clean-up as I could including stripping out passwords that were still in hashed form, ignoring strings that contained control characters and those that were obviously fragments of SQL statements. Regardless of best efforts, the end result is not perfect nor does it need to be. It'll be 99.x% perfect though and that x% has very little bearing on the practical use of this data. And yes, they're all now in Pwned Passwords, more on that soon.
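The clean-up rules listed above (still-hashed values, control characters, SQL fragments) might look something like the heuristic filter below. The actual rules used are not given in the post, so every pattern here is an assumption chosen for illustration, and each one will produce some false positives and false negatives.

```python
import re

# Assumption: bare hex strings of MD5/SHA-1/SHA-256 length are leftover hashes.
_HEX_HASH = re.compile(r"^(?:[0-9a-fA-F]{32}|[0-9a-fA-F]{40}|[0-9a-fA-F]{64})$")
# Assumption: crypt-style prefixes such as "$2a$" or "$1$" indicate a hash.
_CRYPT_HASH = re.compile(r"^\$[0-9a-zA-Z]+\$")
# Assumption: a few common SQL keywords mark a fragment of a SQL statement.
_SQL_HINT = re.compile(r"\b(?:INSERT\s+INTO|CREATE\s+TABLE|VALUES\s*\()", re.IGNORECASE)


def looks_like_junk(value: str) -> bool:
    """Heuristically flag values that are probably not plain-text passwords."""
    # ASCII control characters (0-31 and 127).
    if any(ord(ch) < 32 or ord(ch) == 127 for ch in value):
        return True
    if _HEX_HASH.match(value) or _CRYPT_HASH.match(value):
        return True
    if _SQL_HINT.search(value):
        return True
    return False


def clean_passwords(values):
    """Return the distinct values that survive the junk filter."""
    return {v for v in values if v and not looks_like_junk(v)}
```

A filter like this is deliberately imperfect, which matches the point made above: a small residue of junk has little bearing on how the list is used in practice.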

That's the numbers; let's move on to where the data has actually come from.

Data Origins

Last week, multiple people reached out and directed me to a large collection of files on the popular cloud service, MEGA (the data has since been removed from the service). The collection totalled over 12,000 separate files and more than 87GB of data. One of my contacts pointed me to a popular hacking forum where the data was being socialised, complete with the following image:
image-17.png

As you can see at the top left of the image, the root folder is called "Collection #1" hence the name I've given this breach. The expanded folders and file listing give you a bit of a sense of the nature of the data (I'll come back to the word "combo" later), and as you can see, it's (allegedly) from many different sources. The post on the forum referenced "a collection of 2000+ dehashed databases and Combos stored by topic" and provided a directory listing of 2,890 of the files which I've reproduced here. This gives you a sense of the origins of the data but again, I need to stress "allegedly". I've written before about what's involved in verifying data breaches and it's often a non-trivial exercise. Whilst there are many legitimate breaches that I recognise in that list, that's the extent of my verification efforts and it's entirely possible that some of them refer to services that haven't actually been involved in a data breach at all.

However, what I can say is that my own personal data is in there and it's accurate; right email address and a password I used many years ago. Like many of you reading this, I've been in multiple data breaches before which have resulted in my email addresses and yes, my passwords, circulating in public. Fortunately, only passwords that are no longer in use, but I still feel the same sense of dismay that many people reading this will when I see them pop up again. They're also ones that were stored as cryptographic hashes in the source data breaches (at least the ones that I've personally seen and verified), but per the quoted sentence above, the data contains "dehashed" passwords which have been cracked and converted back to plain text. (There's an entirely different technical discussion about what makes a good hashing algorithm and why the likes of salted SHA1 is as good as useless.) In short, if you're in this breach, one or more passwords you've previously used are floating around for others to see.

So that's where the data has come from; now let me talk about how to assess your own personal exposure.

Checking Email Addresses and Passwords in HIBP

There'll be a significant number of people that'll land here after receiving a notification from HIBP; about 2.2M people presently use the free notification service and 768k of them are in this breach. Many others, over the years to come, will check their address on the site and land on this blog post when clicking in the breach description for more information. These people all know they were in Collection #1 and if they've read this far, hopefully they have a sense of what it is and why they're in there. If you've come here via another channel, checking your email address on HIBP is as simple as going to the site, entering it in then looking at the results (scrolling further down lists the specific data breaches the address was found in):
image-20.png

But what many people will want to know is what password was exposed. HIBP never stores passwords next to email addresses and there are many very good reasons for this. That link explains it in more detail but in short, it poses too big a risk for individuals, too big a risk for me personally and frankly, can't be done without taking the sorts of shortcuts that nobody should be taking with passwords in the first place! But there is another way and that's by using Pwned Passwords.

This is a password search feature I built into HIBP about 18 months ago. The original intention of it was to provide a data set to people building systems so that they could refer to a list of known breached passwords in order to stop people from using them again (or at least advise them of the risk). This provided a means of implementing guidance from government and industry bodies alike, but it also provided individuals with a repository they could check their own passwords against. If you're inclined to lose your mind over that last statement, read about the k-anonymity implementation then continue below.
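As a rough illustration of the k-anonymity idea mentioned above: the client hashes the password with SHA-1 locally, sends only the first five hex characters of the hash, and then looks for the remaining 35 characters in the list of suffixes that comes back. The sketch below assumes the public range endpoint at `https://api.pwnedpasswords.com/range/` and a response made of `SUFFIX:COUNT` lines; the function names and the User-Agent string are made up for this example, so check the current API documentation before relying on it.

```python
import hashlib
import urllib.request

# Assumed endpoint of the Pwned Passwords range search.
RANGE_URL = "https://api.pwnedpasswords.com/range/"


def split_hash(password: str):
    """Return (5-char prefix, 35-char suffix) of the upper-case SHA-1 hex digest."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_response(suffix: str, body: str) -> int:
    """Find our suffix in a 'SUFFIX:COUNT' response body; 0 if it is absent."""
    for line in body.splitlines():
        candidate, _, count = line.partition(":")
        if candidate.strip().upper() == suffix and count.strip().isdigit():
            return int(count.strip())
    return 0


def pwned_count(password: str) -> int:
    """Query the range endpoint; only the 5-character prefix leaves this machine."""
    prefix, suffix = split_hash(password)
    request = urllib.request.Request(
        RANGE_URL + prefix,
        headers={"User-Agent": "k-anonymity-sketch"},  # hypothetical client name
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        body = response.read().decode("utf-8")
    return count_in_response(suffix, body)
```

Calling `pwned_count("P@ssw0rd")` would perform a live lookup. The service only ever sees the prefix, which is shared by many unrelated hashes, so it cannot tell which password was being checked.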

Here's how it works: let's do a search for the word "P@ssw0rd" which incidentally, meets most password strength criteria (upper case, lower case, number and 8 characters long):
image-19.png

Obviously, any password that's been seen over 51k times is terrible and you'd be ill-advised to use it anywhere. When I searched for that password, the data was anonymised first and HIBP never received the actual value of it. Yes, I'm still conscious of the messaging when suggesting to people that they enter their password on another site but in the broader scheme of things, if someone is actually using the same one all over the place (as the vast majority of people still do), then the wakeup call this provides is worth it.
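To make the contrast concrete, here is a minimal sketch of a typical composition check next to a breach-count check. The rules mirror the criteria mentioned above (upper case, lower case, number, 8 characters); the function names are hypothetical and the breach count is passed in by the caller rather than looked up.

```python
def meets_naive_rules(password: str, min_length: int = 8) -> bool:
    """Typical composition rules: length, upper case, lower case and a digit."""
    return (
        len(password) >= min_length
        and any(c.isupper() for c in password)
        and any(c.islower() for c in password)
        and any(c.isdigit() for c in password)
    )


def is_acceptable(password: str, breach_count: int) -> bool:
    """Hypothetical policy: pass the composition rules AND never seen in a breach."""
    return meets_naive_rules(password) and breach_count == 0


if __name__ == "__main__":
    print(meets_naive_rules("P@ssw0rd"))      # True
    print(is_acceptable("P@ssw0rd", 51_000))  # False
```

A password can satisfy every composition rule and still be a poor choice once its appearance in known breaches is taken into account.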

As of now, all 21,222,975 passwords from Collection #1 have been added to Pwned Passwords bringing the total number of unique values in the list to 551,509,767.

Whilst I can't tell you precisely what password was against your own record in the breach, I can tell you if any password you're interested in has appeared in previous breaches Pwned Passwords has indexed. If one of yours shows up there, you really want to stop using it on any service you care about.

Good password managers:

https://keepass.info/ (Windows)
https://keepassxc.org/ (better Linux support)
https://bitwarden.com/ (good multi-platform support)
All of the above are open source.