Full text available here: Nikolaos Tatarakis - Differentially Private Federated Learning.
Federated learning is a decentralized way of training models. The idea is that multiple clients/devices can participate in training without actually sharing their data (i.e., the data stays local). Instead, only local model updates (model parameters) are transferred to a server, which in turn aggregates them to improve the global model and then redistributes it to every client (Federated Learning/Averaging, McMahan et al.).
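To make the aggregation step concrete, here is a minimal sketch of federated averaging over client state dicts in PyTorch. The function name and the equal-weighting assumption are illustrative only, not taken from this repository.

```python
import copy
import torch

def federated_average(client_states):
    """Average a list of client state_dicts into a new global state_dict.

    Assumes all clients hold the same amount of data, so their updates
    are weighted equally (illustrative helper, not this repo's code).
    """
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        stacked = torch.stack([cs[key].float() for cs in client_states])
        global_state[key] = stacked.mean(dim=0).to(global_state[key].dtype)
    return global_state

# Usage: after a communication round the server would do something like
# global_model.load_state_dict(federated_average(collected_client_states))
```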
Despite not sharing raw data directly, federated learning isn't a complete privacy solution on its own. The model updates can still leak information through various attack vectors (e.g., gradient inversion attacks, model inversion attacks, etc.). Combining it with Differential Privacy (Dwork, 2006) provides quantifiable privacy guarantees.
Essentially, differential privacy is a mathematical framework that adds controlled noise (via a mechanism such as the Gaussian mechanism) to a computation, so that its output is nearly indistinguishable whether or not any single data point was included. The privacy loss is quantified by the parameters (ε, δ).
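As a concrete example of such a mechanism, the sketch below applies the textbook Gaussian mechanism to a batch of per-sample gradients: clip each gradient to an L2 norm C, sum, and add Gaussian noise with standard deviation σ·C. The parameter names mirror C and sigma in the config further down; this is a generic DP-SGD-style step, not this repository's exact implementation.

```python
import torch

def gaussian_mechanism(per_sample_grads, clip_norm, sigma):
    """Clip each per-sample gradient to L2 norm `clip_norm`, sum them,
    and add Gaussian noise with std sigma * clip_norm (illustrative)."""
    clipped = []
    for g in per_sample_grads:                       # g: one sample's gradient
        scale = torch.clamp(clip_norm / (g.norm(2) + 1e-12), max=1.0)
        clipped.append(g * scale)
    noisy_sum = torch.stack(clipped).sum(dim=0)
    noisy_sum += torch.normal(0.0, sigma * clip_norm, size=noisy_sum.shape)
    return noisy_sum / len(per_sample_grads)         # averaged, noised gradient
```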
We built upon these two notions and proposed a new algorithm that can scale down the standard deviation (σ) of the added noise, such that:
- It provides strong privacy guarantees at a data-point level.
- It allows us to re-account this noise to provide an additional layer of privacy guarantees at the client level, without explicitly adding more noise to the system.
➡️ For privacy calculations we assume:
- Same fraction of client participation per communication round.
- All clients have the same amount of data and batch size.
- All clients perform the same amount of updates/steps per communication round.
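Under these assumptions, every client sees the same per-step sampling probability and performs the same number of noisy steps, so a single accounting computation covers all of them. The snippet below shows how those two quantities could be derived from the config values; the per-client sample count is a hypothetical example (e.g. 60,000 samples split over 100 clients), not a number taken from the thesis.

```python
# Illustrative only: accounting inputs implied by the assumptions above.
comm_rounds = 635          # communication rounds (config.ini)
local_epochs = 1           # local epochs per round (config.ini)
tr_batch_size = 100        # local batch size (config.ini)
samples_per_client = 600   # hypothetical: 60,000 samples / 100 clients

# Same batch size and data volume everywhere, so these hold for every client:
sample_rate = tr_batch_size / samples_per_client                        # q
steps = comm_rounds * local_epochs * (samples_per_client // tr_batch_size)
print(f"sample_rate={sample_rate:.4f}, total noisy steps={steps}")
```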
This repository implements Algorithm 4 and Algorithm 5, which are the main findings of the thesis.
❗ Legacy code information: This code was written years ago and this is a slightly refactored version of it.
Tested with the following setup:

```
Python == 3.11.3
PyTorch == 2.1.2
Torchvision == 0.16.0
SciPy == 1.11.1
NumPy == 1.25.1
Matplotlib == 3.7.2
```
- Clone the repository:

```bash
git clone https://github.com/ntat/Differentially_Private_Federated_Learning.git
```
- Install dependencies via pip:

```bash
pip install -r requirements.txt
```

- Set `config.ini` according to your training and privacy requirements. Before the actual model training happens, it's advisable to look into the `offline_accounting` folder so that you can precompute your privacy budget according to your settings in the config and confirm you stay within acceptable privacy levels over the course of training (a hedged sanity-check sketch also appears after this list).
📁 Click to expand: config.ini

```ini
[Hyperparams]
Clients = 100
Shards = 200
comm_rounds = 635
local_epochs = 1
learning_rate = 0.02
tr_batch_size = 100
[Privacy]
C = 0.10
sigma = 4.0
target_ep = 1.31
target_ep_client = 8.0
clipThreshold = 4
[Data]
iid = True
```

- Run the `main.py` training script with python:

```bash
python main.py -c config.ini
```
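As a hedged example of the kind of offline check mentioned above, the sketch below uses Opacus's RDP accountant to estimate the sample-level ε implied by a noise multiplier, sampling rate, and step count (e.g. the values derived in the earlier sketch). This is plain DP-SGD accounting, not the thesis's algorithm, so the resulting ε will generally differ from target_ep; treat the repo's own offline_accounting scripts as the reference.

```python
# Illustrative sanity check with Opacus's RDP accountant; NOT the repo's
# offline_accounting code. sample_rate, steps and delta are assumed values.
from opacus.accountants import RDPAccountant

sigma = 4.0               # noise multiplier (sigma in config.ini)
sample_rate = 100 / 600   # assumed per-step sampling probability
steps = 3810              # assumed total noisy steps (635 rounds x 6 steps)
delta = 1e-5              # assumed target delta

accountant = RDPAccountant()
for _ in range(steps):
    accountant.step(noise_multiplier=sigma, sample_rate=sample_rate)

print("estimated sample-level epsilon:", accountant.get_epsilon(delta=delta))
```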
- 10000 Clients: (results plot)
Performing differential privacy for machine learning at a sample level is quite computationally inefficient, mainly because of how auto-differentiation tools are structured. In our approach we use the 'trick' described in Goodfellow's technical report (Efficient Per-Example Gradient Computations) for accessing the individual gradients. Although this is limited to linear layers, it is still relatively efficient. One way to get around this for other types of layers (e.g., LSTMs, ConvNets) is microbatching (i.e., going through the samples in the batch one by one) and then performing the backward pass and clipping manually for each microbatch, which is very inefficient. Libraries like Opacus from Meta provide efficient tools for machine learning with differential privacy.
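To illustrate the idea behind that 'trick': for a linear layer y = xWᵀ + b, the per-sample gradient of the loss with respect to W is the outer product of the backpropagated gradient at the layer's output with the layer's input, both of which are available per sample. The sketch below captures them with hooks on a single nn.Linear; it is a simplified illustration, not this repository's implementation.

```python
import torch
import torch.nn as nn

# Per-example gradients for a linear layer via the outer-product trick
# (simplified illustration, not this repository's code).
layer = nn.Linear(4, 3)
captured = {}

layer.register_forward_hook(lambda m, inp, out: captured.update(a=inp[0].detach()))
layer.register_full_backward_hook(lambda m, gin, gout: captured.update(g=gout[0].detach()))

x = torch.randn(8, 4)          # batch of 8 samples
layer(x).sum().backward()

# Per-sample weight gradient for sample i is the outer product g_i a_i^T.
per_sample_grad_W = torch.einsum("bi,bj->bij", captured["g"], captured["a"])
per_sample_grad_b = captured["g"]

# Sanity check: summing over the batch recovers autograd's aggregated gradient.
assert torch.allclose(per_sample_grad_W.sum(0), layer.weight.grad, atol=1e-6)
```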




