Change of Variables. With an invertible transformation f, the log-density of x is given by

    log p(x) = log p(f(x)) + log |det(df(x)/dx)|.   (3)

Flow-based generative models can be
1. sampled, if (2) can be computed or approximated up to some precision.
2. trained using maximum likelihood, if (3) can be unbiasedly estimated.

Sampling the number of terms n from a distribution p(N) and reweighting the summands gives such an unbiased estimate; we can build a p(N) with an expected compute of 4 terms.

Figure: Bits/dim on CIFAR-10 over training epochs, for i-ResNet (biased train estimate vs. actual test value) and Residual Flow (unbiased train estimate vs. actual test value).

Table: Ablation results across training settings on MNIST, CIFAR-10†, and CIFAR-10. †Uses immediate downsampling before any residual blocks.

Neumann gradient series. For estimating (1), we can either
(i) estimate log p(x), then take the gradient, or
(ii) take the gradient, then estimate ∇ log p(x).

The first option uses a variable amount of memory depending on the sample of n:

    ∂/∂θ log det(df(x)/dx) = E_{n,v}[ Σ_{k=1}^n ((−1)^{k+1}/k) ∂(v^T (J_g(x, θ))^k v)/∂θ ].   (9)
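The reweighting that makes such randomly truncated series estimates unbiased can be sketched in isolation. Below is a minimal, hypothetical scalar example (not from the source): n is drawn from a geometric p(N) and term k is divided by P(N ≥ k), so the expectation over n recovers the full infinite sum.

```python
import math
import random

def russian_roulette_estimate(term, p=0.5, rng=random):
    """Unbiased estimate of sum_{k=1}^inf term(k).

    Samples a truncation n ~ Geometric(p) (support 1, 2, ...) and reweights
    term k by 1 / P(N >= k) = 1 / (1 - p)**(k - 1), so the expectation over
    n equals the full infinite sum.
    """
    # Sample the truncation index n from a geometric distribution.
    n = 1
    while rng.random() > p:
        n += 1
    # Reweighted partial sum: each kept term is divided by its survival probability.
    return sum(term(k) / (1 - p) ** (k - 1) for k in range(1, n + 1))

# The log-det series has terms (-1)^(k+1) tr(J^k) / k with ||J|| < 1.
# As a scalar stand-in, take tr(J^k) -> 0.5^k, whose series sums to ln(1.5);
# the geometric decay of the terms keeps the reweighted variance finite.
rng = random.Random(0)
est = sum(
    russian_roulette_estimate(lambda k: (-1) ** (k + 1) * 0.5 ** k / k, rng=rng)
    for _ in range(100_000)
) / 100_000
print(est, math.log(1.5))  # the two values agree to about 1e-2
```

Note that the distribution p(N) must assign enough mass to large n relative to the decay of the terms, otherwise the 1/P(N ≥ k) weights blow up the variance.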
The second option, using a Neumann series, obtains a constant memory cost:

    ∂/∂θ log det(df(x)/dx) = E_{n,v}[ ( Σ_{k=0}^n ((−1)^k / P(N ≥ k)) v^T (J(x, θ))^k ) ∂(J_g(x, θ))/∂θ v ].   (10)

Qualitative Samples

Figure: Qualitative samples. Real (left) and random samples (right) from a model trained on 5-bit 64×64 CelebA. The most visually appealing samples were picked out of 5 random batches.

Background: Invertible Residual Networks (i-ResNets)

Residual networks are composed of simple transformations

    y = f(x) = x + g(x).   (4)
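A quick numerical sketch of such a block (an illustrative 1-D example, not from the source): when g is a contraction, y = x + g(x) can be inverted by repeatedly applying x ← y − g(x).

```python
import math

def g(x):
    # A contractive residual branch: Lip(g) = 0.5 < 1, since |d/dx tanh| <= 1.
    return 0.5 * math.tanh(x)

def f(x):
    # Residual block y = f(x) = x + g(x), as in eq. (4).
    return x + g(x)

def f_inverse(y, n_iter=50):
    # Fixed-point iteration x_{i+1} = y - g(x_i); because g is a contraction,
    # the error shrinks by at least a factor Lip(g) per step.
    x = y
    for _ in range(n_iter):
        x = y - g(x)
    return x

y = f(1.2345)
print(abs(f_inverse(y) - 1.2345))  # essentially zero
```

With Lip(g) = 0.5, fifty iterations shrink the initial error by at least 0.5^50, far below floating-point precision.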
Behrmann et al. (2019) proved that if g is a contractive mapping, i.e. has Lipschitz constant strictly less than one, then the residual block transformation (4) is invertible.

Sampling. The inverse f^{−1} can be efficiently computed by the fixed-point iteration

    x^(i+1) = y − g(x^(i)),   (5)

which converges superlinearly by the Banach fixed-point theorem.

Log-density. The change of variables can be applied to invertible residual networks:

    log p(x) = log p(f(x)) + Σ_{k=1}^∞ ((−1)^{k+1}/k) tr([J_g(x)]^k).   (6)

+ The trace can be efficiently estimated using the Skilling–Hutchinson estimator.
− The infinite sum can be estimated by truncating to a fixed n; however, this introduces bias equal to the remaining terms.

This results in the biased estimator

    log p(x) ≈ log p(f(x)) + E_{v∼N(0,I)}[ Σ_{k=1}^n ((−1)^{k+1}/k) v^T [J_g(x)]^k v ],   (7)

where Behrmann et al. (2019) chose n = 5, 10.

Backward-in-forward. Since log det(df(x)/dx) is a scalar quantity, we can compute its gradient early and free up memory. This reduces the amount of memory by a factor equal to the number of residual blocks, with negligible cost.

                      MNIST            CIFAR-10†        CIFAR-10
Training Setting      ELU    LipSwish  ELU   LipSwish   ELU    LipSwish   Relative
Naïve Backprop        92.0   192.1     33.3  66.4       120.2  263.5      100%
Neumann Gradient      13.4   31.2      5.5   11.3       17.6   40.8       15.7%
Backward-in-Forward   8.7    19.8      3.8   7.4        11.5   26.1       10.3%
Both Combined         4.9    13.6      3.0   5.9        6.6    18.0       7.1%

Table: Memory usage (GB) per minibatch of 64 samples when computing n=10 terms in the corresponding power series. †Uses immediate downsampling before any residual blocks.

LipSwish Activation for Enforcing Lipschitz Constraint

We motivate smooth and non-monotonic Lipschitz activation functions. This avoids "gradient saturation", which occurs if the 2nd derivative asymptotically approaches zero while the 1st derivative is close to one.

    LipSwish(x) = Swish(x)/1.1 = x · σ(βx)/1.1

Hybrid Modeling

Residual blocks are better building blocks for hybrid models than coupling blocks. Models are trained using a weighted maximum likelihood objective similar to (Nalisnick et al., 2019):

    E_{(x,y)∼p_data}[ λ log p(x) + log p(y|x) ].   (11)

                          MNIST                                  SVHN
                          λ=0     λ=1/D           λ=1            λ=0     λ=1/D           λ=1
Block Type                Acc↑    BPD↓   Acc↑     BPD↓   Acc↑    Acc↑    BPD↓   Acc↑     BPD↓   Acc↑
(Nalisnick et al., 2019)  99.33%  1.26   97.78%   −      −       95.74%  2.40   94.77%   −      −
Coupling                  99.50%  1.18   98.45%   1.04   95.42%  96.27%  2.73   95.15%   2.21   46.22%
 + 1×1 Conv               99.56%  1.15   98.93%   1.03   94.22%  96.72%  2.61   95.49%   2.17   46.58%
Residual                  99.53%  1.01   99.46%   0.99   98.69%  96.72%  2.29   95.79%   2.06   58.52%

Table: Comparison of residual vs. coupling blocks for the hybrid modeling task.

References

• Behrmann et al. "Invertible Residual Networks." (2019)
• Kahn. "Use of Different Monte Carlo Sampling Techniques." (1955)
• Beatson & Adams. "Efficient Optimization of Loops and Limits with Randomized Telescoping Sums." (2019)
• Nalisnick et al. "Hybrid Models with Deep and Invertible Features." (2019)
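As a sanity check on the LipSwish definition above: the maximum slope of Swish (with β = 1) is roughly 1.0998, which is why dividing by 1.1 yields a function that is 1-Lipschitz. A small finite-difference scan (illustrative, β = 1 assumed):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lipswish(x, beta=1.0):
    # LipSwish(x) = Swish(x) / 1.1 = x * sigmoid(beta * x) / 1.1
    return x * sigmoid(beta * x) / 1.1

# Central finite differences over a grid covering the peak of Swish's
# derivative (near x ~ 2.4, where Swish'(x) ~ 1.0998).
h = 1e-5
max_slope = max(
    abs(lipswish(x + h) - lipswish(x - h)) / (2 * h)
    for x in (i / 100 for i in range(-600, 601))
)
print(max_slope)  # just below 1.0
```

The observed maximum slope sits just under 1, confirming that the 1.1 divisor is what enforces the Lipschitz constraint on the activation.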