Linking Sensitive Data: Methods and Techniques for Practical Privacy-Preserving Information Sharing Synopsis by Kerina Jones
Abstract
Springer, November 2020
ISBN-10: 3030597059, ISBN-13: 978-3030597054, Price: £87.50
Key message from the preface
A key message of this book is that any database that contains sensitive information about individuals in plaintext can be vulnerable to data breaches and attacks by adversaries, both external and internal to an organisation, as well as unintentional revealing or publication due to human or technical mishaps. Encoding personal sensitive information using the techniques and methods discussed in this book can significantly reduce the risks of sensitive data being breached or revealed.
Overview summary from the back cover
This book provides modern technical answers to the legal requirements of pseudonymisation as recommended by privacy legislation. It covers topics such as modern regulatory frameworks for sharing and linking sensitive information, concepts and algorithms for privacy-preserving record linkage and their computational aspects, practical considerations such as dealing with dirty and missing data, as well as privacy, risk, and performance assessment measures. Existing techniques for privacy-preserving record linkage are evaluated empirically and real-world application examples that scale to population sizes are described. The book also includes pointers to freely available software tools, benchmark data sets, and tools to generate synthetic data that can be used to test and evaluate linkage techniques.
This book consists of fourteen chapters grouped into four parts, and two appendices. The first part introduces the reader to the topic of linking sensitive data, the second part covers methods and techniques to link such data, the third part discusses aspects of practical importance, and the fourth part provides an outlook of future challenges and open research problems relevant to linking sensitive databases. The appendices provide pointers and describe freely available, open-source software systems that allow the linkage of sensitive data, and provide further details about the evaluations presented. A companion website at https://dmm.anu.edu.au/lsdbook2020 provides additional material and Python programs used in the book.
Who this book is for
This book is mainly written for applied scientists, researchers, and advanced practitioners in governments, industry, and universities who are concerned with developing, implementing, and deploying systems and tools to share sensitive information in administrative, commercial, or medical databases.
Synopsis by part and chapter
Part I: Introduction
This part begins by setting the scene on the increase in data linkage and asks why data should be linked at all. It introduces sources of personal data, and defines sensitive data and what constitutes direct and indirect identifiers. It provides several case studies from areas including finance, law, and health to illustrate the benefits and importance of data linkage whilst respecting data ethics. Chapter 2 focuses on regulatory frameworks and ethical principles of research. It describes the regulatory landscapes in jurisdictions including the EU and UK, the US, Australia, and Switzerland. It touches on statistical disclosure control (SDC) and on how regulatory frameworks both protect individuals and make novel techniques for linking sensitive data necessary. Chapter 3 provides a background to linking sensitive data, including a short history and an explanation of how data can be linked across databases. It raises challenges such as data quality issues and discusses methods for evaluating linkage quality and complexity. Towards the end of the chapter, it defines privacy-preserving record linkage (PPRL).
Part II: Methods and Techniques
This part begins with a consideration of conceptual protocols for private information sharing, along with the roles of different participants in the linkage process. It explains the separation principle and two-, three-, and multi-party linkage protocols, and describes various adversary models, such as the malicious and the honest-but-curious models. Chapter 5 focuses on assessing privacy and measuring risks when linking sensitive data. It defines and characterises attacks on sensitive data, such as linkage, dictionary, frequency, and collusion attacks, with a consideration of motivations, costs, and gains. It introduces SDC techniques and their evaluation. Chapter 6 deals with the building blocks used in protocols for linking sensitive data. Among those described are random number generation, hashing techniques, anonymisation and pseudonymisation, encryption, and secure multiparty computation, along with guidance on choosing suitable building blocks. Chapter 7 is about encoding and comparing sensitive values. It begins with a taxonomy of linkage techniques organised by privacy, linkage, theoretical, evaluation, and practical aspects. It moves on to describe techniques such as phonetic encoding, hashing, and differential privacy. Chapter 8 focuses specifically on Bloom filter based encoding methods, with hashing and encoding techniques for textual and numerical data, along with guidance on choosing suitable settings for Bloom filter encoding. Chapter 9 discusses attacks and hardening techniques for Bloom filter encoding. Chapter 10 finishes this part of the book with considerations of computational demand and efficiency, describing blocking and indexing techniques and approaches that make use of modern parallel and distributed computing platforms.
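For readers unfamiliar with the Bloom filter based encoding covered in Chapters 8 and 9, the following is a minimal illustrative sketch and not code taken from the book or its companion website. It assumes one common formulation: a string is split into character q-grams, each q-gram is mapped to several bit positions using keyed hashes, and two encodings are compared with the Dice coefficient. The function names, the double-hashing scheme, and the parameter values (`num_bits`, `num_hashes`, `q`) are assumptions chosen for illustration. No hardening is applied, so an encoding like this would remain open to the kinds of attacks Chapter 9 discusses.

```python
import hashlib
import hmac


def qgrams(value, q=2):
    """Return the set of overlapping q-grams of a lower-cased, padded string."""
    padded = "_" * (q - 1) + value.lower().strip() + "_" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def bloom_encode(value, secret_key, num_bits=1000, num_hashes=10, q=2):
    """Encode a string as the set of bit positions set to 1 in a Bloom filter.

    secret_key is a bytes value shared by the database owners (hypothetical
    set-up); all parameter defaults are illustrative assumptions.
    """
    positions = set()
    for gram in qgrams(value, q):
        data = gram.encode("utf-8")
        # Two keyed hashes combined by double hashing:
        # position_i = (h1 + i * h2) mod num_bits
        h1 = int.from_bytes(hmac.new(secret_key, data, hashlib.sha256).digest(), "big")
        h2 = int.from_bytes(hmac.new(secret_key, data, hashlib.sha512).digest(), "big")
        for i in range(num_hashes):
            positions.add((h1 + i * h2) % num_bits)
    return positions


def dice_similarity(bits_a, bits_b):
    """Dice coefficient of two Bloom filters given as sets of 1-bit positions."""
    if not bits_a and not bits_b:
        return 1.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


if __name__ == "__main__":
    key = b"shared-secret"  # placeholder key, for illustration only
    smith = bloom_encode("Smith", key)
    smyth = bloom_encode("Smyth", key)
    jones = bloom_encode("Jones", key)
    print("smith vs smyth:", round(dice_similarity(smith, smyth), 2))
    print("smith vs jones:", round(dice_similarity(smith, jones), 2))
```

Because similar strings share most of their q-grams, their encodings share most of their 1-bits, which is what allows approximate matching on encoded values without exchanging the plaintext names.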
Part III: Practical Aspects, Evaluation, and Applications
This part of the book focuses on practical considerations and starts by highlighting some of the challenges in working with data. It then takes a more in-depth look at data-related issues, such as dirty data, missing values, bias, and lack of ground truth. It highlights the implications of false positive and false negative matches. Chapter 12 provides an empirical evaluation of selected Bloom filter based encoding and hardening techniques. It shows how sensitive databases can be linked using PPRL techniques and how the linkage can be assessed on the dimensions of linkage quality, scalability, and privacy by means of an evaluation framework. The final chapter in this part of the book describes real-world applications of PPRL techniques in a range of countries where differing privacy frameworks and legislation make the use of such methods necessary, or where PPRL was simply chosen to make linkage more secure.
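To make the role of false positive and false negative matches concrete, the sketch below computes precision and recall, two widely used measures of linkage quality, from a set of predicted matches and a set of true matches. It is an illustration only: whether the book's evaluation framework uses exactly these measures is an assumption, and the function name and the record-pair representation are hypothetical.

```python
def linkage_quality(predicted_matches, true_matches):
    """Compare predicted record pairs against ground-truth record pairs.

    Both arguments are iterables of (record_id_a, record_id_b) tuples.
    """
    predicted = set(predicted_matches)
    truth = set(true_matches)
    true_pos = len(predicted & truth)   # correctly linked pairs
    false_pos = len(predicted - truth)  # pairs linked in error
    false_neg = len(truth - predicted)  # true matches that were missed
    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if truth else 0.0
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "false_neg": false_neg,
        "precision": precision,
        "recall": recall,
    }


if __name__ == "__main__":
    predicted = [(1, "a"), (2, "b"), (3, "c")]
    truth = [(1, "a"), (2, "b"), (4, "d")]
    print(linkage_quality(predicted, truth))
```

Measures like these require ground truth, which, as this part of the book notes, is often lacking in practice.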
Part IV: Outlook
This final part of the book looks to future challenges and directions – practical and conceptual. These include discussions on the development of frameworks to enable comparative evaluation of linkage techniques, benchmarking, linking sensitive data in a cloud environment, and how to assess linkage quality and completeness when only encoded/encrypted records are available. It also highlights challenges and opportunities in linking emerging data types, such as biometric and genetic data.
The book also has two appendices: A) Software and datasets; B) Details of the empirical evaluation. It has an extensive glossary on data matching and linkage, over 650 references, and an index. In total the book has over 450 pages.