Identity authentication methods in the digital world rely predominantly on a fragile and unsophisticated combination of username and password. Since passwords are often reused by Internet users, it can be easy to gather information about people by gaining access to email, note-taking services, social media accounts, and many other services. It appears to be a fact of life that online services are hacked and that password databases are copied and subsequently posted on the web. For example, last year saw two prominent breaches: credentials to 177 million LinkedIn accounts were laid bare and 32 million Twitter passwords were made available. You can check whether your credentials have been leaked at https://haveibeenpwned.com/ (mine have been leaked at least 8 times!).
I’m currently looking into the ethics of using these data breaches, or ‘password dumps’, for benevolent research purposes. A fairly sizeable research community exists that takes the leaked datasets and conducts experiments on them. Several conferences have been organized around this topic (Trondheim, Cambridge, Bochum), and papers are published regularly. The stated purposes include estimating the strength of individual passwords, studying the issues that arise from password reuse across websites, and understanding semantic patterns to estimate how easy it is to guess passwords (or pinpoint password cracking tools). This knowledge can be used to build password strength assistance and to raise general awareness of the fragile nature of the username/password combination for authentication.
But is it ethically justifiable to use these leaked datasets for research? Some researchers claim that the data is already publicly available, so they may as well use it. Others simply point at the existing experiments and claim that doing research on password dumps has become widespread practice. Still others argue that, because malevolent hackers have access to the same data, their research will help secure accounts and protect Internet users in the future.
However, several arguments against these justifications exist. First, since the users in the databases can likely be identified individually without too much effort or with the help of other datasets, this type of research becomes ‘human subject research.’ This means that a range of requirements applies to the researcher, including asking for the informed consent of the data subjects, that is, the persons who have had their passwords exposed. While it may be infeasible to contact the millions of persons in each dataset and filter out those who do not respond or do not consent, informed consent is still a starting point for research ethics that should not be neglected just because it’s a nuisance.
Second, privacy rules or data protection laws will come into play. If the persons in the datasets can be identified, their usernames and passwords will likely be considered personal data, or personally identifiable information. Processing this data for purposes other than those for which consent was originally given (i.e., access to a service) would likely violate these rules, which would make the research projects unlawful.
Finally, when researchers benefit from leaked data sources through publications and subsequent promotions, they may be implicitly condoning the hacks. The use of this data may then incentivize others to do the same, and it may even be used to justify publishing password dumps (“for science!”). I am curious what you, fellow Internet users, think about the reuse of your leaked passwords. Are you happy that this research field exists, or should it be more tightly controlled?