The core paper is: Improved Bounds for Private and Robust Alignment, accepted to ICML 2026. A detailed arXiv version is available here: arXiv:2512.23816.
I would like to share our recent work on a question that feels increasingly unavoidable in LLM alignment:
In RLHF or preference alignment, we often assume that human preference labels are trustworthy.
But in the real world, these labels may be both private and messy.
For example, suppose a user is asked to choose between two model responses:
- Response A is safer;
- Response B is more helpful;
- the user chooses A or B.
That choice itself may reveal something about the user’s preferences, values, or even sensitive context. So a platform may want to protect user privacy before using the data for training.
At the same time, the label may simply be wrong. It may be noisy, inconsistent, accidentally flipped, or even maliciously corrupted.
So the central question becomes:
If preference labels contain both privacy noise and corruption noise, can alignment still learn well?
The answer, thankfully, is not “everything breaks.”
The answer is more like: things get harder, but in a very structured and quantifiable way.
Which is good news. Theory is at its best when it tells us exactly how much trouble we are in.
Background: Alignment Is Basically Learning Which Response Is Better
In preference alignment, the data often looks like this:
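Each record contains a prompt $s$, two candidate responses $\tau$ and $\tau'$, and a label indicating which response the annotator preferred.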
In theory, a common way to model this is the Bradley–Terry model:
$$
P^\star(\tau \succ \tau' \mid s) = \frac{\exp(r^\star(\tau))}
{\exp(r^\star(\tau))+\exp(r^\star(\tau'))}.
$$
Here, $r^\star(\tau)$ can be understood as the true reward of response $\tau$.
The higher the reward, the more likely humans are to prefer that response.
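As a quick illustration of the Bradley–Terry formula, here is a minimal Python sketch. The function name `bt_preference_prob` and the example reward values are hypothetical, chosen only for this post; the computation itself is just the formula above rewritten as a logistic function of the reward difference.

```python
import math

def bt_preference_prob(reward_a: float, reward_b: float) -> float:
    """Probability that response A is preferred over B under Bradley-Terry."""
    # exp(r_a) / (exp(r_a) + exp(r_b)) equals the logistic sigmoid of r_a - r_b.
    # Branching on the sign keeps exp() from overflowing for large differences.
    diff = reward_a - reward_b
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)

# Hypothetical rewards, for illustration only.
p_equal = bt_preference_prob(1.0, 1.0)   # equal rewards: a coin flip
p_better = bt_preference_prob(2.0, 0.0)  # A has the higher reward
print(p_equal, p_better)
```

Only the reward difference matters: shifting both rewards by the same constant leaves the preference probability unchanged.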
So the goal of alignment is, in essence, to learn a policy $\pi$ that generates responses with higher reward:
$$
J(\pi) = \mathbb{E}_{s,\tau \sim \pi} [r^\star(\tau)].
$$
That is the clean, classical setup.
But reality, as usual, did not sign up for the clean setup.
The First Challenge: Preference Labels May Need Privacy Protection
If a user directly tells us which response they prefer, that preference may reveal private information.
So the paper studies local differential privacy, where the user randomizes their own label locally before sending it to the learner. The most classical mechanism here is randomized response.
Suppose the true label is $y \in \{-1,1\}$.
The learner observes a privatized label $\tilde y$:
$$
\Pr[\tilde y = y] = \frac{e^\epsilon}{1+e^\epsilon},
$$
$$
\Pr[\tilde y \neq y] = \frac{1}{1+e^\epsilon}.
$$
Here, $\epsilon$ is the privacy budget.
Intuitively:
- larger $\epsilon$ means weaker privacy but more accurate labels;
- smaller $\epsilon$ means stronger privacy, but the labels become more randomized.
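The mechanism is short enough to write out. The following is a minimal sketch of randomized response on a binary label, assuming labels in $\{-1,1\}$ as above; the function names and the seeded `random.Random` generator are illustrative choices, not code from the paper.

```python
import math
import random

def keep_probability(epsilon: float) -> float:
    """Pr[privatized label equals true label] = e^eps / (1 + e^eps)."""
    # Written as 1 / (1 + e^{-eps}) so that large eps does not overflow.
    return 1.0 / (1.0 + math.exp(-epsilon))

def randomized_response(y: int, epsilon: float, rng: random.Random) -> int:
    """Return y with probability e^eps / (1 + e^eps), otherwise the flipped label."""
    if y not in (-1, 1):
        raise ValueError("label must be -1 or +1")
    return y if rng.random() < keep_probability(epsilon) else -y

# Illustrative check: the empirical keep rate should be close to the formula.
rng = random.Random(0)
samples = [randomized_response(1, 1.0, rng) for _ in range(20_000)]
empirical_keep_rate = sum(s == 1 for s in samples) / len(samples)
print(empirical_keep_rate, keep_probability(1.0))
```

At $\epsilon = 0$ the keep probability is exactly $1/2$, so the output carries no information about the true label.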
A useful quantity that appears throughout the analysis is
$$
c(\epsilon) = \frac{e^\epsilon+1}{e^\epsilon-1}.
$$
You can think of $c(\epsilon)$ as the statistical “price of privacy.”
The stronger the privacy requirement, the larger this term becomes, and the harder learning gets.
Privacy, unfortunately, is not free. But at least here the receipt is itemized.
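One way to see where $c(\epsilon)$ comes from is to compute the mean of the privatized label given the true one. This is a short side calculation that uses only the randomized-response probabilities above; it is not a step quoted from the paper.

$$
\mathbb{E}[\tilde y \mid y]
= y \cdot \frac{e^\epsilon}{1+e^\epsilon} - y \cdot \frac{1}{1+e^\epsilon}
= \frac{e^\epsilon-1}{e^\epsilon+1}\, y
= \frac{y}{c(\epsilon)}.
$$

So randomized response shrinks the signal by a factor of $1/c(\epsilon)$, and $c(\epsilon)\tilde y$ is an unbiased estimate of $y$ whose magnitude is $c(\epsilon)$ instead of 1. For small $\epsilon$ we have $c(\epsilon) \approx 2/\epsilon$, and as $\epsilon \to \infty$ we recover $c(\epsilon) \to 1$, the non-private case.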
Key Takeaway 1: MLE Is Not as Bad as People Thought
Some prior work suggested that in private preference learning, the ordinary MLE-style log loss may not be good enough, and may fail to achieve near-optimal rates.
Our answer is:
Not quite.
The issue is not that MLE is fundamentally broken.
The issue is that we should not write the likelihood as if the privatized label were an ordinary clean label.
The learner does not observe the original label $y$.
It observes $\tilde y$, after randomized response.
Therefore, the correct probability should be written as
$$
\tilde P_\theta(\tilde y \mid x) = \sigma(\epsilon)P_\theta(\tilde y \mid x) + (1-\sigma(\epsilon))P_\theta(-\tilde y \mid x),
$$
where
$$
\sigma(\epsilon) = \frac{e^\epsilon}{1+e^\epsilon}.
$$
In other words, the model needs to acknowledge the obvious but often overlooked fact:
The label I see may be the true label, or it may have been flipped by the privacy mechanism.
Once we model this correctly, we can still perform MLE:
$$
\hat \theta = \arg\min_{\theta}
\sum_{t=1}^{n}
-\log \tilde P_\theta(\tilde y^{(t)} \mid x^{(t)}).
$$
We prove that this MLE-type estimator can achieve near-optimal rates.
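To make the corrected likelihood concrete, here is a minimal sketch of the privatized negative log-likelihood. It assumes that $P_\theta$ is a Bradley–Terry model, so each sample is summarized by a reward difference between the two responses, and that $P_\theta(-\tilde y \mid x) = 1 - P_\theta(\tilde y \mid x)$. The function names and inputs are hypothetical; this is not the paper's implementation.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def privatized_nll(reward_diffs, private_labels, epsilon: float) -> float:
    """Sum over samples of -log of the privatized label probability.

    reward_diffs[t] is the model's reward gap r(tau) - r(tau') for sample t,
    private_labels[t] in {-1, +1} is the label after randomized response.
    Intended for finite epsilon > 0.
    """
    keep = sigmoid(epsilon)  # sigma(eps) = e^eps / (1 + e^eps)
    total = 0.0
    for diff, y_tilde in zip(reward_diffs, private_labels):
        p_clean = sigmoid(y_tilde * diff)  # P_theta(y_tilde | x) under Bradley-Terry
        # Mixture: the observed label was either kept or flipped by the mechanism.
        p_private = keep * p_clean + (1.0 - keep) * (1.0 - p_clean)
        total -= math.log(p_private)
    return total

# Hypothetical inputs: a confident model that agrees / disagrees with the label.
loss_agree = privatized_nll([50.0], [1], epsilon=1.0)
loss_disagree = privatized_nll([50.0], [-1], epsilon=1.0)
print(loss_agree, loss_disagree)
```

Because the mixture probability always lies between $1-\sigma(\epsilon)$ and $\sigma(\epsilon)$, the per-sample loss is bounded for finite $\epsilon$: a confident model that disagrees with a privatized label is penalized by at most $-\log(1-\sigma(\epsilon))$, since the disagreement may simply be a privacy flip.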
For offline alignment, the error bound roughly takes the form
$$
J(\pi^\star)-J(\hat \pi)
\lesssim
\kappa(\pi^\star)c(\epsilon)
\sqrt{
\frac{\log(|\Pi|/\delta)}{n}
}.
$$
The exact constants are not the main story here. The bound captures several very intuitive relationships:
- more data leads to smaller error;
- a larger policy class is harder to learn;
- stronger privacy makes learning harder;
- worse offline coverage also makes learning harder.
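To get a feel for how these factors trade off, the sketch below evaluates the right-hand side of the bound with all hidden constants set to 1. The function names, the default `kappa=1.0`, and the example values of $n$, $|\Pi|$, and $\delta$ are assumptions made for illustration; the outputs show scaling only and are not guarantees from the paper.

```python
import math

def price_of_privacy(epsilon: float) -> float:
    """c(eps) = (e^eps + 1) / (e^eps - 1), computed as 1 / tanh(eps / 2)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    # The two forms are algebraically identical; tanh avoids overflow in e^eps.
    return 1.0 / math.tanh(epsilon / 2.0)

def private_offline_rate(n: int, policy_class_size: float, delta: float,
                         epsilon: float, kappa: float = 1.0) -> float:
    """kappa * c(eps) * sqrt(log(|Pi| / delta) / n), constants ignored."""
    return kappa * price_of_privacy(epsilon) * math.sqrt(
        math.log(policy_class_size / delta) / n
    )

# Hypothetical problem sizes, chosen only to show the trend in epsilon.
for eps in (0.1, 0.5, 1.0, 4.0):
    rate = private_offline_rate(n=10_000, policy_class_size=1_000,
                                delta=0.05, epsilon=eps)
    print(f"epsilon={eps}: rate term = {rate:.4f}")
```

Quadrupling $n$ halves the term, while shrinking $\epsilon$ toward 0 inflates it roughly like $2/\epsilon$.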
So the first core message of the paper is:
In the private-only setting, we do not necessarily need to replace MLE with a complicated new loss.
If the likelihood is written correctly, an MLE-type log loss can still work very well.
The Second Challenge: Labels May Be Corrupted
Privacy noise is intentional randomization for protecting users.
Corruption is different. It means the data may genuinely be bad.
We consider the classical Huber corruption model:
$$
(1-\alpha)G+\alpha B.
$$
Here:
- $G$ is the clean label distribution;
- $B$ is an arbitrary bad distribution;
- $\alpha$ is the corruption level.
In simple terms, an $\alpha$ fraction of the labels may come from random mistakes, adversarial attacks, data poisoning, or logging errors.
This is a very natural model for messy preference data.
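Sampling from the Huber mixture is a two-step process: with probability $\alpha$ draw from $B$, otherwise keep the clean label. A minimal sketch follows; the name `huber_corrupt` and the callable `bad_label_sampler` standing in for $B$ are hypothetical, and the "always answer +1" adversary is just one example of an arbitrary bad distribution.

```python
import random

def huber_corrupt(y: int, alpha: float, bad_label_sampler, rng: random.Random) -> int:
    """Return a draw from (1 - alpha) * G + alpha * B, where y is the draw from G."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if rng.random() < alpha:
        return bad_label_sampler(rng)  # contaminated sample from B
    return y                           # clean sample passes through

def always_plus_one(rng: random.Random) -> int:
    """One possible choice of B: an adversary that always reports +1."""
    return 1

# Illustrative run: the clean label is always -1, so every +1 is a corruption.
rng = random.Random(0)
observed = [huber_corrupt(-1, 0.2, always_plus_one, rng) for _ in range(20_000)]
corrupted_fraction = sum(z == 1 for z in observed) / len(observed)
print(corrupted_fraction)
```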
The Crucial Distinction: Privacy First, or Corruption First?
One point we want to emphasize is:
The order of privacy and corruption matters.
The paper studies two different orders.
The first one is called CTL, short for corruption-then-LDP:
$$
y
\rightarrow
\text{corruption}
\rightarrow
\text{privacy}
\rightarrow
z.
$$
That is, the label is corrupted first and then passed through the privacy mechanism.
The second one is called LTC, short for LDP-then-corruption:
$$
y
\rightarrow
\text{privacy}
\rightarrow
\text{corruption}
\rightarrow
z.
$$
That is, the label is first randomized for privacy and then corrupted.
At first glance, this looks like a small change in ordering.
But the theoretical consequences are different.
Intuitively, LTC is harder.
The label has already been blurred by privacy noise, and then an adversary gets to corrupt it again. At that point, the learner has a much harder time telling whether the weirdness came from privacy or from malicious corruption.
It is like trying to debug a bug after someone has already encrypted the error message. Very fun, in the same way that stepping on a Lego is fun.
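The two orderings are just the same two steps composed in opposite order. The sketch below simulates both on a toy case; the function names, the fixed adversarial label, and the values of $\epsilon$ and $\alpha$ are assumptions chosen for illustration, not the paper's experimental setup.

```python
import math
import random

def randomized_response(y, epsilon, rng):
    """Keep y with probability e^eps / (1 + e^eps), otherwise flip it."""
    keep = 1.0 / (1.0 + math.exp(-epsilon))
    return y if rng.random() < keep else -y

def huber_corrupt(y, alpha, bad_label, rng):
    """With probability alpha, replace y by a fixed adversarial label."""
    return bad_label if rng.random() < alpha else y

def ctl(y, epsilon, alpha, bad_label, rng):
    """Corruption-then-LDP: the corrupted label is privatized afterwards."""
    return randomized_response(huber_corrupt(y, alpha, bad_label, rng), epsilon, rng)

def ltc(y, epsilon, alpha, bad_label, rng):
    """LDP-then-corruption: the adversary overwrites the privatized label."""
    return huber_corrupt(randomized_response(y, epsilon, rng), alpha, bad_label, rng)

def plus_one_rate(pipeline, n, seed, y=-1, epsilon=1.0, alpha=0.2, bad_label=1):
    """Fraction of observed labels equal to +1 when the true label is y."""
    rng = random.Random(seed)
    hits = sum(pipeline(y, epsilon, alpha, bad_label, rng) == 1 for _ in range(n))
    return hits / n

ctl_rate = plus_one_rate(ctl, n=40_000, seed=0)
ltc_rate = plus_one_rate(ltc, n=40_000, seed=1)
print(f"CTL: {ctl_rate:.3f}  LTC: {ltc_rate:.3f}")
```

In CTL the adversary's label still passes through randomized response, so part of its effect is washed out by the same noise that hides the clean label. In LTC the adversary writes directly into the learner's view, which is why the observed labels drift further toward the adversary's choice in this toy case.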
Key Takeaway 2: The Corruption Cost Can Be Sharper
When privacy and corruption are both present, we analyze the square loss and obtain sharper bounds.
In the offline setting, the results roughly look like this.
CTL
$$
J(\pi^\star)-J(\hat \pi_{\mathrm{CTL}})
\lesssim
\kappa(\pi^\star)
\left(
c(\epsilon)
\sqrt{
\frac{\log(|\Pi|/\delta)}{n}
}
+
\alpha
\right).
$$
LTC
$$
J(\pi^\star)-J(\hat \pi_{\mathrm{LTC}})
\lesssim
\kappa(\pi^\star)
\left(
c(\epsilon)
\sqrt{
\frac{\log(|\Pi|/\delta)}{n}
}
+
c(\epsilon)\alpha
\right).
$$
The most important part is the corruption term:
- in CTL, the corruption cost is $\alpha$;
- in LTC, the corruption cost is $c(\epsilon)\alpha$.
In other words:
If corruption happens before privacy, the corruption cost is $\alpha$.
If corruption happens after privacy, the corruption cost is amplified by the privacy difficulty factor $c(\epsilon)$.
This gives a clean statistical explanation of why the ordering matters.
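A back-of-the-envelope calculation shows the same amplification on a single label. Assume a toy adversary that always reports a fixed label, and a learner that debiases the observed label by multiplying its mean by $c(\epsilon)$. The sketch below computes the resulting bias exactly for both orderings. Everything in it (function names, the fixed adversary, the debiasing step) is an assumption for illustration and not the paper's estimator or proof.

```python
import math

def price_of_privacy(epsilon: float) -> float:
    """c(eps) = (e^eps + 1) / (e^eps - 1) = 1 / tanh(eps / 2)."""
    return 1.0 / math.tanh(epsilon / 2.0)

def debiased_bias(order: str, epsilon: float, alpha: float,
                  y: int = -1, bad_label: int = 1) -> float:
    """Exact value of c(eps) * E[z] - y for a fixed adversarial label."""
    keep = 1.0 / (1.0 + math.exp(-epsilon))
    shrink = 2.0 * keep - 1.0  # E[randomized_response(label)] = shrink * label
    if order == "CTL":
        # Both clean and adversarial labels go through randomized response.
        mean_z = (1.0 - alpha) * shrink * y + alpha * shrink * bad_label
    elif order == "LTC":
        # The adversarial label bypasses randomized response entirely.
        mean_z = (1.0 - alpha) * shrink * y + alpha * bad_label
    else:
        raise ValueError("order must be 'CTL' or 'LTC'")
    return price_of_privacy(epsilon) * mean_z - y

# Hypothetical settings: 10% corruption at several privacy levels.
for eps in (0.25, 0.5, 1.0, 2.0):
    print(eps, debiased_bias("CTL", eps, 0.1), debiased_bias("LTC", eps, 0.1))
```

In this toy case the CTL bias is $2\alpha$ regardless of $\epsilon$, while the LTC bias is $\alpha(1 + c(\epsilon))$, which grows without bound as $\epsilon \to 0$. That matches the $\alpha$ versus $c(\epsilon)\alpha$ pattern in the bounds above.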
Online Alignment: Learning Beyond a Fixed Dataset
Offline alignment has a fundamental limitation: the dataset is already fixed.
If the offline data does not cover a good policy, the learner cannot magically know what good responses look like. Sadly, no theorem has yet discovered a way to learn from data that does not exist.
Online alignment is different.
The model can interact with the environment, generate new responses, and collect fresh preference feedback during training.
In the online setting, the objective often includes KL regularization:
$$
J_\beta(\pi) = \mathbb{E}_{\pi} \left[ r^\star(\tau) - \beta \log \frac{\pi(\tau)}{\pi_{\mathrm{ref}}(\tau)} \right].
$$
This paper extends the private and robust analysis to online alignment as well.
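For intuition about the KL-regularized objective, here is a minimal sketch for a single prompt with a finite set of candidate responses. The function names and the toy rewards are hypothetical. The closed-form maximizer used below, $\pi(\tau) \propto \pi_{\mathrm{ref}}(\tau)\exp(r^\star(\tau)/\beta)$, is the standard solution of this kind of objective and is included only as a sanity check.

```python
import math

def kl_regularized_objective(policy, reference, rewards, beta: float) -> float:
    """E_pi[ r(tau) - beta * log(pi(tau) / pi_ref(tau)) ] over a finite response set."""
    total = 0.0
    for p, q, r in zip(policy, reference, rewards):
        if p > 0.0:  # terms with pi(tau) = 0 contribute nothing
            total += p * (r - beta * math.log(p / q))
    return total

def gibbs_policy(reference, rewards, beta: float):
    """Maximizer of the objective: pi(tau) proportional to pi_ref(tau) * exp(r(tau) / beta)."""
    weights = [q * math.exp(r / beta) for q, r in zip(reference, rewards)]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]

# Toy example with three candidate responses (all values are illustrative).
reference = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5
best = gibbs_policy(reference, rewards, beta)
value_best = kl_regularized_objective(best, reference, rewards, beta)
value_ref = kl_regularized_objective(reference, reference, rewards, beta)
print(best, value_best, value_ref)
```

A larger $\beta$ keeps the maximizer closer to $\pi_{\mathrm{ref}}$, while a smaller $\beta$ lets it concentrate on the highest-reward responses.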
In the private-only case, the result roughly takes the form
$$
J_\beta(\pi_\beta^\star)-J_\beta(\hat \pi)
\lesssim
\kappa_{\mathrm{cov}}(\Pi)c(\epsilon)
\sqrt{
\frac{\log(|\Pi|T/\delta)}{T}
}.
$$
Here, $T$ is the number of online interaction rounds.
A rough interpretation is:
Offline alignment asks whether the existing dataset covers good policies.
Online alignment asks whether the exploration process can gradually cover the policy class.
When both privacy and corruption are present, the online results mirror the offline story.
CTL
$$
J_\beta(\pi_\beta^\star)-J_\beta(\hat \pi_{\mathrm{CTL}})
\lesssim
\kappa_{\mathrm{cov}}(\Pi)
\left(
c(\epsilon)
\sqrt{
\frac{\log(|\Pi|T/\delta)}{T}
}
+
\alpha
\right).
$$
LTC
$$
J_\beta(\pi_\beta^\star)-J_\beta(\hat \pi_{\mathrm{LTC}})
\lesssim
\kappa_{\mathrm{cov}}(\Pi)
\left(
c(\epsilon)
\sqrt{
\frac{\log(|\Pi|T/\delta)}{T}
}
+
c(\epsilon)\alpha
\right).
$$
The story is the same as in the offline case:
LTC is more expensive than CTL because privacy cost and corruption cost become coupled.
Experiments
The paper is mainly theoretical, but we also include experiments for validation.
When the privacy budget is $\epsilon=0.5$, the performance of several losses is quite similar:
| Method | Win rate (%) |
|---|---|
| MLE-type log loss | $65.2 \pm 4.3$ |
| Debiased log loss | $65.8 \pm 5.6$ |
| Square loss | $64.5 \pm 1.2$ |
On the PKU-SafeRLHF dataset, the results are also close:
| Method | Win rate (%) |
|---|---|
| MLE-type log loss | $61.45 \pm 2.81$ |
| Debiased log loss | $61.95 \pm 2.50$ |
| Square loss | $61.64 \pm 2.25$ |
The point of these experiments is not to claim that MLE is always the best choice.
Rather, the message is:
MLE-type log loss does not fail in private alignment.
Both theoretically and empirically, it performs in the same ballpark as specially designed debiased losses and square loss.
And in practice, being in the same ballpark is already a meaningful statement—especially when people were worried you might not even be allowed into the stadium.
One-Sentence Summary
This paper can be understood as follows:
In LLM alignment, human preference labels may need privacy protection and may also be corrupted.
We show that if the privacy mechanism and corruption model are correctly incorporated into the learning problem, several seemingly fragile alignment methods still enjoy strong theoretical guarantees.
In particular, MLE-type log loss does not break down in private alignment.
For researchers working on RLHF or alignment theory, we hope this paper provides a clearer statistical map: it separates the costs of privacy, corruption, offline coverage, and online exploration, and shows how each one affects the final alignment guarantee.
The definitions of randomized response, Huber corruption, and the CTL/LTC orderings follow the paper, and the experimental results are taken from its appendix tables.