I am currently studying how Differential Privacy (DP) interacts with Reinforcement Learning (RL) and am collecting notes as I go. This document distills the main ideas I have learned so far. If you spot an error or see an opportunity for improvement, please let me know — feedback is very welcome!

Definition of DP

Let us begin with a basic question: What is Differential Privacy?

I will start with a concrete example about medical records, a scenario commonly used to motivate DP.

Running Example – Medication Recommendation in a Hospital

Imagine a hospital that uses an RL‑based system to track patients’ medication histories and recommend future prescriptions.

  • Alice is a diabetic patient who buys medication every week.
  • John is an adversary who knows the system’s outputs for everyone except Alice and also holds aggregate statistics.

If the recommendation mechanism is differentially private, John’s observations will look (almost) the same whether or not the system ever contained Alice’s trajectory. Thus he cannot infer that Alice is a patient—let alone her condition or prescriptions.

Without DP, even tiny output differences (e.g. a new drug suddenly recommended) could reveal Alice’s presence and leak sensitive details over time.

Mathematical Definition

With the main idea in place, here is the mathematical definition of DP (Dwork et al., 2006):

A randomized algorithm $\mathcal{M}$ is said to satisfy $(\varepsilon, \delta)$-Differential Privacy if, for any two datasets $D$ and $D'$ that differ in only one record, and for any measurable set $S$ of possible outputs:

$$\Pr[\mathcal{M}(D) \in S] \leq e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S] + \delta$$
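
To make the definition concrete, here is a minimal Python sketch (assuming NumPy) of the Laplace mechanism applied to a counting query. A count changes by at most 1 when a single record is added or removed, so Laplace noise with scale $1/\varepsilon$ gives $(\varepsilon, 0)$-DP; the record fields below are illustrative, not taken from any real system.

```python
import numpy as np

def private_count(records, predicate, epsilon, rng=np.random.default_rng()):
    """Release a differentially private count of records matching `predicate`.

    Adding or removing one record changes the true count by at most 1
    (sensitivity = 1), so Laplace noise with scale 1/epsilon yields
    (epsilon, 0)-DP.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Toy query: how many patients in the dataset take insulin?
records = [{"name": "Alice", "drug": "insulin"},
           {"name": "Bob", "drug": "metformin"}]
print(private_count(records, lambda r: r["drug"] == "insulin", epsilon=1.0))
```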

Granularities of Differential Privacy

In RL and other sequential settings, we must specify what constitutes “one record”, i.e., which pairs of datasets count as neighbors.

Item‑Level DP

The classical setting: two datasets are neighbors if they differ in exactly one interaction record. This is appropriate when each user contributes at most a handful of records and we care about protecting each individual interaction.

Joint Differential Privacy (JDP)

In multi‑agent systems each user often receives a personalized output.

Let $\mathcal{M}(D)=\bigl(M_1(D),\dots,M_n(D)\bigr)$ be the vector of outputs, where user $i$ sees only $M_i(D)$. JDP requires that for every user $i$ the joint output seen by others—denoted $M_{-i}(D)$—is $(\varepsilon,\delta)$‑differentially private with respect to changes in $i$’s data. The component $M_i(D)$ itself may depend arbitrarily on $i$’s true data, enabling higher utility.

Formally, for all $i$ and for all neighboring datasets $D, D'$ differing only in
user $i$’s trajectory, and for all measurable sets $S$,
$$\Pr \bigl[M_{-i}(D)\in S\bigr] \le
e^{\varepsilon} \Pr \bigl[M_{-i}(D')\in S\bigr] + \delta .$$

Because JDP protects only what other users see, it is strictly weaker than standard DP but often sufficient (and far less noisy) in recommender systems, auctions, or federated RL.
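
One common recipe for achieving JDP (sometimes called the billboard construction) is to publish a single DP aggregate and let each user compute their personalized output locally from that public aggregate plus their own raw data. The sketch below is illustrative only; the clipping rule, noise scale, and recommendation rule are assumptions I made for the example, not a reference implementation.

```python
import numpy as np

def dp_billboard(user_histories, epsilon, rng=np.random.default_rng()):
    """Publish a noisy per-item popularity vector (the public 'billboard').

    Each history is a vector over d items, clipped entrywise to [0, 1], so
    one user's data changes the column sums by at most 1 per item and the
    L1 sensitivity is at most d. Laplace noise with scale d/epsilon per
    coordinate therefore makes the billboard epsilon-DP.
    """
    clipped = np.clip(np.asarray(user_histories, dtype=float), 0.0, 1.0)
    totals = clipped.sum(axis=0)
    d = totals.shape[0]
    return totals + rng.laplace(scale=d / epsilon, size=d)

def personalized_recommendation(billboard, own_history):
    """User i combines the public billboard with their own raw history.

    Other users only ever see the billboard, so the joint view of everyone
    except user i depends on i's data only through a DP quantity, which is
    what JDP requires.
    """
    # Toy rule: recommend the most popular item the user has not tried yet.
    candidates = np.where(np.asarray(own_history) == 0, billboard, -np.inf)
    return int(np.argmax(candidates))
```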

User-level Differential Privacy

In many practical applications (such as Federated Learning or mobile keyboard prediction), a single user contributes multiple data points or a complex history of interactions to the dataset.

Let $\mathcal{M}(D)$ be a randomized mechanism that outputs a model or statistic based on the dataset $D$. User-level DP requires that the output distribution remains statistically indistinguishable whether any single user $i$ participates in the dataset or not. Unlike Item-level DP, which protects each record in isolation, User-level DP treats all records belonging to the same individual as a single unit that must be hidden together.

Formally, for all neighboring datasets $D, D'$ differing by all data associated with a single user $i$ (i.e., adding or removing user $i$’s entire history), and for all measurable sets $S$,

$$\Pr[\mathcal{M}(D) \in S] \leq e^{\varepsilon} \Pr[\mathcal{M}(D') \in S] + \delta.$$

To satisfy this definition, the mechanism must limit the maximum influence (sensitivity) of any single user on the output. This is typically achieved via contribution bounding (e.g., clipping the $L_2$ norm of a user’s gradient update or limiting the number of records per user) before noise is added.
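
As a concrete illustration, here is a minimal sketch of contribution bounding in the style of DP-FedAvg/DP-SGD: each user's whole update for a round is clipped to an $L_2$ norm bound before Gaussian noise calibrated to that bound is added. The function name and the noise multiplier are assumptions for the example; translating the noise multiplier into a concrete $(\varepsilon, \delta)$ requires a separate accounting step.

```python
import numpy as np

def user_level_private_sum(user_updates, clip_norm, noise_multiplier,
                           rng=np.random.default_rng()):
    """Aggregate one update per user with user-level contribution bounding.

    Each user's entire contribution for this round is clipped to
    L2 norm <= clip_norm, so adding or removing a whole user changes the
    sum by at most clip_norm. Gaussian noise with standard deviation
    noise_multiplier * clip_norm then supports a user-level
    (epsilon, delta) guarantee, with the exact values determined by a
    privacy accountant over all rounds.
    """
    clipped = []
    for update in user_updates:
        update = np.asarray(update, dtype=float)
        norm = np.linalg.norm(update)
        clipped.append(update * min(1.0, clip_norm / max(norm, 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=total.shape)
    return total + noise
```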

Because User-level DP hides the presence of the entire user rather than just a single record, it offers a strictly stronger privacy guarantee than Item-level DP, though it often requires adding more noise (or discarding more data) to mask the contribution of highly active users.

Trust Model

While the choice of granularity dictates what is being protected, it does not specify where the protection mechanism is enforced. In practice, the deployment of differential privacy relies heavily on the underlying trust assumptions regarding the data curator. This leads us to the second fundamental dimension of DP systems: the Trust Model, typically categorized into Central and Local Differential Privacy.

Central Differential Privacy (cDP)

Also known as the standard or global model, cDP assumes the existence of a trusted curator. Users send their raw data to a central server, which aggregates the data and adds noise to the output before releasing it.

  • Trust Assumption: Users trust the central server not to leak or misuse raw data.
  • Utility: Because noise is added to the aggregate (rather than to individual inputs), cDP generally achieves higher accuracy (utility) for the same privacy budget $\varepsilon$.

Local Differential Privacy (LDP)

LDP addresses scenarios where the central server is untrusted. In this model, users perturb their own data using a randomized mechanism on their local devices before transmitting it to the server. The server only ever sees noisy data; the randomized-response sketch after the bullets below shows the simplest instance of this idea.

  • Trust Assumption: Users do not need to trust the server or any third party; privacy is guaranteed at the device level.
  • Utility: To maintain privacy, the noise added by each user accumulates, typically resulting in lower utility compared to cDP for the same sample size.
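
Randomized response is the textbook example of a local randomizer: each user reports their true bit with probability $e^{\varepsilon}/(e^{\varepsilon}+1)$ and the flipped bit otherwise, which satisfies $\varepsilon$-LDP for a single binary value. A minimal sketch, with a server-side debiasing step that recovers an unbiased estimate of the population fraction:

```python
import numpy as np

def randomize_bit(true_bit, epsilon, rng=np.random.default_rng()):
    """Local randomizer: keep the true bit w.p. e^eps / (e^eps + 1), else flip.

    The ratio of report probabilities for the two possible inputs is at
    most e^eps, so this satisfies epsilon-LDP for one binary value.
    """
    p_truth = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    return true_bit if rng.random() < p_truth else 1 - true_bit

def estimate_fraction(reports, epsilon):
    """Debias the noisy reports: E[report] = (1 - p) + (2p - 1) * x,
    so solving for x gives an unbiased estimate of the true fraction of 1s."""
    p = np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    return (np.mean(reports) - (1.0 - p)) / (2.0 * p - 1.0)

# Example: each user randomizes locally; the server only sees the reports.
rng = np.random.default_rng(0)
true_bits = rng.integers(0, 2, size=10_000)
reports = [randomize_bit(int(b), epsilon=1.0, rng=rng) for b in true_bits]
print(estimate_fraction(reports, epsilon=1.0), true_bits.mean())
```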

The choice between these architectures forces a fundamental trade-off between trust (cDP requires more trust) and utility (LDP typically yields lower utility).