CriterAlign

Criterion-Centric Rationale Alignment for Code Preference Judging.

Zhenyu Li · Aleksandar Cvejić · Zehui Chen · Peter Wonka

Figure 1. Learning human preference guidance for code judging. Given the same coding task and two candidate responses, human and LLM judges may disagree. CriterAlign analyses these disagreements on a held-out training split, summarises recurring rationale gaps into HPAG (Human-Preference-Aligned Guidance), and injects HPAG into every stage of a frozen judge at inference time.

Method

CriterAlign converts a single preference judgement into four small, pairwise-aware LLM calls. The same judge model is used at every stage; the only thing injected is HPAG, distilled offline from a 20% training split.

Figure 2. Inference pipeline comparison. Monolithic judges use fixed or implicit criteria; rubric-based methods such as RRD generate criteria but rely on pointwise scoring. CriterAlign synthesises HPAG offline from the training split and injects it into a pairwise rubric-based pipeline. Orange highlights our contributions: BTCR (batched tie-driven criterion refinement), SCF (swap-consistency criterion filtering), and HPAG.
  ① Criterion generation (with HPAG). Given the instruction, both candidate responses, and any execution / visual evidence, the judge proposes ~20 atomic, evidence-grounded, comparative criteria. HPAG steers the generator toward dimensions humans actually care about.
  ② Pairwise criterion judging (with HPAG). For each criterion the judge outputs v ∈ {A, B, tie, insufficient} directly, which yields comparative evidence rather than two independent pointwise scores.
  ③ BTCR (Batched Tie-Driven Criterion Refinement). Coarse tied criteria are decomposed (in one batched call) into up to two finer comparative sub-criteria, filtered for redundancy / conflict, and re-judged.
  ④ SCF (Swap-Consistency Criterion Filtering). Each criterion is re-judged with the candidate order swapped. We keep only the criteria whose verdict survives the swap operator π(A)=B, π(B)=A, π(tie)=tie; order-sensitive criteria are discarded as unreliable evidence.
  ⑤ Final judge (with HPAG). The surviving criterion–verdict pairs are fed as comparative evidence into a final LLM call that synthesises the overall preference. HPAG guides how the evidence is weighted.
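
Stages ③ and ④ can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the released implementation: the `Judged` record, the `decompose` callable standing in for the single batched LLM call, the case-insensitive duplicate check standing in for the redundancy / conflict filter, and the choice to map `insufficient` to itself under the swap (the text defines π only for A, B, and tie).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Swap operator from the text: pi(A)=B, pi(B)=A, pi(tie)=tie.
# Assumption: "insufficient" maps to itself; the text does not define this case.
SWAP: Dict[str, str] = {"A": "B", "B": "A", "tie": "tie", "insufficient": "insufficient"}


@dataclass(frozen=True)
class Judged:
    """Hypothetical record: one criterion judged in both candidate orders."""
    criterion: str
    verdict: str          # verdict with candidates shown as (A, B)
    swapped_verdict: str  # raw verdict with candidates shown as (B, A)


def refine_ties(
    judged: List[Judged],
    decompose: Callable[[List[str]], Dict[str, List[str]]],
    max_subs: int = 2,
) -> List[str]:
    """BTCR sketch: decompose all tied criteria in one batched call."""
    tied = [j.criterion for j in judged if j.verdict == "tie"]
    if not tied:
        return []
    proposals = decompose(tied)  # stand-in for the single batched LLM call
    seen = {j.criterion.lower() for j in judged}
    subs: List[str] = []
    for parent in tied:
        for sub in proposals.get(parent, [])[:max_subs]:
            if sub.lower() not in seen:  # crude redundancy check (assumption)
                seen.add(sub.lower())
                subs.append(sub)
    return subs  # these sub-criteria would then be re-judged pairwise


def scf_filter(judged: List[Judged]) -> List[Judged]:
    """SCF sketch: keep criteria whose verdict survives the order swap."""
    return [j for j in judged if SWAP[j.swapped_verdict] == j.verdict]


# Toy data, invented for this example.
judged = [
    Judged("handles empty input", "A", "B"),       # same winner in both orders
    Judged("readable naming", "A", "A"),           # first slot always wins: order-sensitive
    Judged("overall code quality", "tie", "tie"),  # coarse tie, candidate for BTCR
]


def fake_decompose(tied: List[str]) -> Dict[str, List[str]]:
    # Toy stand-in for the LLM: proposes three sub-criteria per tied parent.
    return {t: ["handles empty input", "uses descriptive names", "has docstrings"] for t in tied}


new_subs = refine_ties(judged, fake_decompose)
kept = [j.criterion for j in scf_filter(judged)]
print(new_subs)
print(kept)
```

In this toy run the tied criterion yields one new sub-criterion (the first proposal duplicates an existing criterion and the third exceeds the cap of two), and the order-sensitive criterion is dropped by the swap check.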

HPAG is synthesised once, offline, from the held-out training split (~728 examples) by comparing human preference labels with monolithic-judge predictions and rationales. It is frozen at inference: every test instance sees the same global + category-level guidance.
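
One way to picture the frozen global + category-level guidance is as an immutable lookup that is prepended to each stage prompt. The sketch below is illustrative only: the `HPAG` class, the `inject` helper, the category keys, and the placeholder guidance strings are all hypothetical, and the actual prompt format is not specified here.

```python
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping


@dataclass(frozen=True)
class HPAG:
    """Hypothetical container for guidance synthesised once, offline."""
    global_guidance: str
    category_guidance: Mapping[str, str]


def inject(stage_prompt: str, hpag: HPAG, category: str) -> str:
    """Prepend global guidance, then category guidance if any, to a stage prompt."""
    parts = [hpag.global_guidance]
    category_text = hpag.category_guidance.get(category)
    if category_text:
        parts.append(category_text)
    parts.append(stage_prompt)
    return "\n\n".join(parts)


# Placeholder strings; real guidance would come from the offline synthesis step.
hpag = HPAG(
    global_guidance="<global guidance>",
    category_guidance=MappingProxyType({"web_design": "<web design guidance>"}),
)

prompt_known = inject("Propose comparative criteria.", hpag, "web_design")
prompt_other = inject("Propose comparative criteria.", hpag, "unseen_category")
print(prompt_known)
```

Because the container is frozen and read-only, every test instance in a given category receives identical guidance, and a category without specific guidance falls back to the global text alone.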

Results

Setup: BigCodeReward validation split (n = 3,785), Qwen2.5-VL-32B as the judge, with execution outputs and screenshots provided.

Table 1. Main results on BigCodeReward validation: CriterAlign vs. the monolithic baseline and reproduced criterion-generation pipelines. CriterAlign reaches 66.3% vs. 60.4% for the monolithic Qwen2.5-VL-32B baseline, outperforming the reproduced criterion-generation baselines by a large margin.
Table 2. Component ablation: cumulative gains from pairwise criterion judging, BTCR, SCF, and HPAG. Pairwise RRD, BTCR, G-HPAG, SCF, and C-HPAG each add to the cumulative gain that takes the pipeline from 60.4% (monolithic) to 66.3% (full CriterAlign).

Before & after HPAG

The interactive demo covers seven real validation cases in which the monolithic Qwen2.5-VL-32B judge picks the wrong solution and CriterAlign (pairwise pipeline + BTCR + SCF + HPAG) picks the human-preferred one. For each case, it contrasts the Without HPAG (monolithic) verdict with the With CriterAlign per-criterion breakdown, which lists each surviving criterion with its verdict and confidence.
