
StegaStamp: Invisible Hyperlinks in Physical Photographs

Matthew Tancik∗ Ben Mildenhall∗ Ren Ng


University of California, Berkeley
{tancik, bmild, ren}@berkeley.edu

∗ Authors contributed equally to this work.

Abstract

Imagine a world in which each photo, printed or digitally displayed, hides arbitrary digital data that can be accessed through an internet-connected imaging system. Another way to think about this is physical photographs that have unique QR codes invisibly embedded within them. This paper presents an architecture, algorithms, and a prototype implementation addressing this vision. Our key technical contribution is StegaStamp, the first steganographic algorithm to enable robust encoding and decoding of arbitrary hyperlink bitstrings into photos in a manner that approaches perceptual invisibility. StegaStamp comprises a deep neural network that learns an encoding/decoding algorithm robust to image perturbations that approximate the space of distortions resulting from real printing and photography. Our system prototype demonstrates real-time decoding of hyperlinks for photos from in-the-wild video subject to real-world variation in print quality, lighting, shadows, perspective, occlusion and viewing distance. Our prototype system robustly retrieves 56-bit hyperlinks after error correction – sufficient to embed a unique code within every photo on the internet.

1. Introduction

Our vision is a future in which each photo in the real world invisibly encodes a unique hyperlink to arbitrary information. This information is accessed by pointing a camera at the photo and using the system described in this paper to decode and follow the hyperlink. In the future, augmented-reality (AR) systems may perform this task continuously, visually overlaying retrieved information alongside each photo in the user's view.

Our approach is related to the ubiquitous QR code and similar technologies, which are now commonplace for a wide variety of data-transfer tasks, such as sharing web addresses, purchasing goods, and tracking inventory. Our approach can be thought of as a complementary solution that avoids visible, ugly barcodes, and enables digital information to be invisibly and ambiently embedded into the ubiquitous imagery of the modern visual world.

It is worth taking a moment to consider three potential use cases of our system. First, at the farmer's market, a stand owner may add photos of each type of produce alongside the price, encoded with extra information for customers about the source farm, nutrition information, recipes, and seasonal availability. Second, in the lobby of a university department, a photo directory of faculty may be augmented by encoding a unique URL for each person's photo that contains the professor's webpage, office hours, location, and directions. Third, New York City's Times Square is plastered with digital billboards. Each image frame displayed may be encoded with a URL containing further information about the products, company, and promotional deals.

Figure 1 presents an overview of our system, which we call StegaStamp, in the context of a typical usage flow. The inputs are an image and a desired hyperlink. First, we assign the hyperlink a unique bit string (analogous to the process used by URL-shortening services such as tinyurl.com). Second, we use our StegaStamp encoder to embed the bit string into the target image. This produces an encoded image that is ideally perceptually identical to the input image. As described in detail in Section 4, our encoder is implemented as a deep neural network jointly trained with a second network that implements decoding. Third, the encoded image is physically printed (or shown on an electronic display) and presented in the real world. Fourth, a user takes a photo that contains the physical print. Fifth, the system uses an image detector to identify and crop out all images. Sixth, each image is processed with the StegaStamp decoder to retrieve the unique bitstring, which is used to follow the hyperlink and retrieve the information associated with the image.

Steganography [7] is the name of the main technical problem faced in this work – hiding secret data in a non-secret message, in this case an image. The key technical challenge solved in this paper is to dramatically increase the robustness and performance of steganography so that it can retrieve an arbitrary hyperlink in real-world scenarios.
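As a toy illustration of the first step in this usage flow — assigning each hyperlink a fixed-length bit string, which lives entirely outside the learned networks — one could keep a shared registry much like a URL shortener. Everything in the sketch below (class name, payload size, example URL) is a hypothetical illustration, not part of the paper's system:

```python
# Hypothetical registry mapping hyperlinks to fixed-length bit strings,
# analogous to a URL-shortening service. Illustrative only.
class HyperlinkRegistry:
    def __init__(self, num_bits=56):
        self.num_bits = num_bits      # payload size available after error correction
        self.url_to_id = {}
        self.id_to_url = []

    def register(self, url):
        """Assign the next free integer id to a URL and return it as a bit string."""
        if url not in self.url_to_id:
            self.url_to_id[url] = len(self.id_to_url)
            self.id_to_url.append(url)
        ident = self.url_to_id[url]
        return format(ident, f"0{self.num_bits}b")   # e.g. '000...01011'

    def lookup(self, bits):
        """Recover the URL from a decoded, error-corrected bit string."""
        return self.id_to_url[int(bits, 2)]

registry = HyperlinkRegistry()
bits = registry.register("https://example.com/produce/heirloom-tomatoes")
print(registry.lookup(bits))
```

In a deployed system this mapping would live in a shared database; with 56 payload bits there are far more ids than photos on the internet, as discussed below.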

[Figure 1 diagram: the message bitstring (1010...001) and the input image enter the Encoder, producing the StegaStamp; the captured photo passes through Detection to a cropped, corrupted image, and the Decoder outputs the decoded message (1010...001).]

Figure 1: Our deep learning system is trained to hide hyperlinks in images. First, an encoder network processes the input
image and hyperlink bitstring into a StegaStamp (encoded image). The StegaStamp is then printed and captured by a camera.
A detection network localizes and rectifies the StegaStamp before passing it to the decoder network. After the bits are
recovered and error corrected, the user can follow the hyperlink. To train the encoder and decoder networks, we simulate the
corruptions caused by printing, reimaging, and detecting the StegaStamp with a set of differentiable image augmentations.
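To make the flow in Figure 1 concrete, here is a hedged end-to-end sketch in Python. The callables `encoder_net`, `detector_net`, `rectify`, `decoder_net`, and `error_correct` are hypothetical placeholders standing in for the trained networks and post-processing described later; this is an outline of the flow, not the released implementation:

```python
import numpy as np

def embed_hyperlink(image_rgb, bits, encoder_net):
    """Encoder predicts a residual that hides `bits` in the image (values in [0, 1])."""
    residual = encoder_net(image_rgb, bits)            # hypothetical network call
    return np.clip(image_rgb + residual, 0.0, 1.0)     # the StegaStamp to print or display

def read_hyperlinks(photo_rgb, detector_net, rectify, decoder_net, error_correct):
    """Detect, rectify, and decode every StegaStamp visible in a captured photo."""
    payloads = []
    for region in detector_net(photo_rgb):              # proposed StegaStamp regions
        patch = rectify(photo_rgb, region, size=400)    # homography warp to a square crop
        bit_probs = decoder_net(patch)                  # sigmoid outputs in [0, 1]
        bits = (np.asarray(bit_probs) > 0.5).astype(np.uint8)
        payload = error_correct(bits)                   # e.g. BCH decode -> data bits, or None
        if payload is not None:
            payloads.append(payload)                    # each payload indexes a hyperlink
    return payloads
```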

Most previous work on image steganography [4, 15, 33, 34, 36] was designed for perfect digital transmission, and fails completely (see Figure 6, "None") under "physical transmission" – i.e., presenting images physically and using real cameras to capture an image for decoding. The closest work to our own is the HiDDeN system [38], which introduces various types of noise between encoding and decoding to increase robustness, but even this does not improve performance beyond guessing rate after undergoing "physical transmission" (see Figure 6, "Pixelwise").

We show how to achieve robust decoding even under "physical transmission," delivering excellent performance sufficient to encode and retrieve arbitrary hyperlinks for an essentially limitless number of images. Achieving this requires two main technical contributions. First, we extend the traditional steganography framework by adding a set of differentiable image corruptions between the encoder and decoder that successfully approximate the space of distortions resulting from "physical transmission" (i.e., real printing / display and photography). The result is robust retrieval of 95% of encoded bits in real-world conditions. Second, we show how to preserve high image quality – our prototype can encode 100 bits while preserving excellent perceptual image quality. Together, these allow our prototype to uniquely encode hidden hyperlinks for orders of magnitude more images than exist on the internet today (upper bounded by 100 trillion).

2. Related Work

2.1. Steganography

Steganography is the act of hiding data within other data and has a long history that can be traced back to ancient Greece. Our proposed task is a type of steganography where we hide a code within an image. Various methods have been developed for digital image steganography. Data can be hidden in the least significant bit of the image, subtle color variations, and subtle luminosity variations. Often methods are designed to evade steganalysis, the detection of hidden messages [12, 28]. We refer the interested reader to the survey by Cheddad et al. [7] that reviews a wide set of techniques.

The work most relevant to our proposal is the set of methods that utilize deep learning to both encode and decode a message hidden inside an image [4, 15, 33, 34, 36]. Our method differs from existing techniques as we assume that the image will be corrupted by the display-imaging pipeline between the encoding and decoding steps. With the exception of HiDDeN [38], small image manipulations or corruptions would render existing techniques useless, as their target is encoding a large number of bits-per-pixel in a context of perfect digital transmission. HiDDeN introduces various types of noise between encoding and decoding to increase robustness but focuses only on the set of corruptions that would occur through digital image manipulations (e.g., JPEG compression and cropping). For use as a physical barcode, the decoder cannot assume perfect alignment, given the perspective shifts and pixel resampling guaranteed to occur when taking a casual photo.

Similar work exists addressing the digital watermarking problem, where copyright information and other metadata is imperceptibly hidden in a photograph or document, using both traditional [9] and learning-based [18] techniques. Some methods [25] particularly target the StirMark benchmark [27, 26]. StirMark tests watermarking techniques against a set of digital attacks, including geometrical distortions and image compression. However, digital watermarking methods are not typically designed to work after printing and recapturing an image, unlike our method.
2.2. Barcodes

Barcodes are one of the most popular solutions for transmitting a short string of data to a computing device, requiring only simple hardware (a laser reader or camera) and an area for printing or displaying the code. Traditional barcodes are a one dimensional pattern where bars of alternating thickness encode different values. The ubiquity of high quality cellphone cameras has led to the frequent use of two dimensional QR codes to transmit data to and from phones. For example, users can share contact information with one another, pay for goods, track inventory, or retrieve a coupon from an advertisement.

Past research has addressed the issue of robustly decoding existing and new barcode designs using cameras [22, 24]. Some designs particularly take advantage of the increased capabilities of cameras beyond simple laser scanners, such as incorporating color into the barcode [6]. Other work has proposed a method that determines where a barcode should be placed on an image and what color should
be used to improve machine readability [23].

Another special type of barcode is specially designed to transmit both a small identifier and a precise six degree-of-freedom orientation for camera localization or calibration, e.g., ArUco markers [13, 29]. Hu et al. [16] train a deep network to localize and identify ArUco markers in challenging real world conditions using data augmentation similarly to our method. However, their focus is robust detection of highly visible preexisting markers, as opposed to robust decoding of messages hidden in arbitrary natural images.

[Figure 2: example images shown in three columns – Original Image, StegaStamp, Residual.]

Figure 2: Examples of encoded images. The residual is calculated by the encoder network and added back to the original image to produce the encoded StegaStamp. These examples have 100 bit encoded messages and are robust to the image perturbations that occur through the printing and imaging pipelines.
2.3. Robust Adversarial Image Attacks

Adversarial image attacks on object classification CNNs are designed to minimally perturb an image such that the network produces an incorrect classification. Most relevant to our work are the demonstrations of adversarial examples in the physical world [3, 8, 11, 20, 21, 31, 32], where systems are made robust for imaging applications by modeling physically realistic perturbations (i.e., affine image warping, additive noise, and JPEG compression). Jan et al. [20] take a different approach, explicitly training a neural network to replicate the distortions added by an imaging system and showing that applying the attack to the distorted image increases the success rate.

These results demonstrate that networks can still be affected by small perturbations after the image has gone through an imaging pipeline. Our proposed task shares some similarities; however, classification targets 1 of n ≈ 2^10 labels, while we aim to uniquely decode 1 of 2^m messages, where m ≈ 100 is the number of encoded bits. Additionally, adversarial attacks do not assume the ability to modify the decoder network, whereas we explicitly train our decoder to cooperate with our encoder for maximum information transferal.

3. Training for Real World Robustness

During training, we apply a set of differentiable image perturbations outlined in Figure 3 between the encoder and decoder to approximate the possible distortions caused by physically displaying and imaging the StegaStamps. Previous work on synthesizing robust adversarial examples used a similar method to attack classification networks in the wild (termed "Expectation over Transformation"), though they used a more limited set of transformations [3]. HiDDeN [38] used nonspatial perturbations to augment their steganography pipeline against digital perturbations only. Deep ChArUco [16] used both spatial and nonspatial perturbations to train a robust detector for the fixed category of ChArUco fiducial marker boards. We combine ideas from all of these works, training an encoder and decoder that cooperate to robustly transmit hidden messages through a physical display-imaging pipeline.

3.1. Perspective Warp

Assuming a pinhole camera model, any two images of the same planar surface can be related by a homography.
[Figure 3 panels: Input → Perspective warp → Motion/defocus blur → Color manipulation → Noise → JPEG compression (Secs. 3.1–3.5).]

Figure 3: Image perturbation pipeline. During training, we approximate the effects of a physical display-imaging pipeline in
order to make our model robust for use in the real world. We take the output of the encoding network and apply the random
transformations shown here before passing the image through the decoding network (see Section 3 for details).

We generate a random homography to simulate the effect of a camera that is not precisely aligned with the encoded image marker. To sample the space of homographies, we perturb the four corner locations of the marker uniformly at random within a fixed range (up to ±40 pixels from their original coordinates) and solve a least squares problem to recover the homography that maps the original corners to their new locations. We bilinearly resample the original image to create the perspective warped image.

3.2. Motion and Defocus Blur

Blur can result from both camera motion and inaccurate autofocus. To simulate motion blur, we sample a random angle and generate a straight line blur kernel with a width between 3 and 7 pixels. To simulate misfocus, we use a Gaussian blur kernel with its standard deviation randomly sampled between 1 and 3 pixels.

3.3. Noise

Noise introduced by camera systems is well studied and includes photon noise, dark noise, and shot noise [14]. In our system we assume standard non-photon-starved imaging conditions. We employ a Gaussian noise model (sampling the standard deviation σ ∼ U[0, 0.2]) to account for the imaging noise. We have found that printing does not introduce significant pixel-wise noise, since the printer will typically act to reduce the noise as the mixing of the ink acts as a low pass filter.

3.4. Color Manipulation

Printers and displays have a limited gamut compared to the full RGB color space. Cameras modify their output using exposure settings, white balance, and a color correction matrix. We approximate these perturbations with a series of random affine color transformations (constant across the whole image) as follows:

1. Hue shift: adding a random color offset to each of the RGB channels sampled uniformly from [−0.1, 0.1].
2. Desaturation: randomly linearly interpolating between the full RGB image and its grayscale equivalent.
3. Brightness and contrast: affine histogram rescaling mx + b with b ∼ U[−0.3, 0.3] and m ∼ U[0.5, 1.5].

After these transforms, we clip the color channels back to the range [0, 1].
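As an illustration, here is a minimal numpy/OpenCV sketch of how the perturbations in Sections 3.1–3.4 could be sampled and applied to a single image. For clarity this sketch is not differentiable (the training pipeline would need differentiable equivalents, e.g., TensorFlow ops), and the sampling code itself is ours, not the paper's:

```python
import numpy as np
import cv2

def perturb(img, rng=np.random.default_rng()):
    """Apply a randomized warp, blur, noise, and color transform to an RGB image in [0, 1]."""
    img = np.asarray(img, dtype=np.float32)
    h, w = img.shape[:2]

    # Sec. 3.1 Perspective warp: jitter the four corners by up to +/-40 px and fit a homography.
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = src + rng.uniform(-40, 40, size=(4, 2)).astype(np.float32)
    H = cv2.getPerspectiveTransform(src, dst)
    out = cv2.warpPerspective(img, H, (w, h), flags=cv2.INTER_LINEAR)

    # Sec. 3.2 Motion blur (straight-line kernel, width 3-7 px) or defocus blur (Gaussian, sigma 1-3 px).
    if rng.random() < 0.5:
        k = int(rng.integers(3, 8))
        kernel = np.zeros((k, k), np.float32)
        kernel[k // 2, :] = 1.0 / k                              # horizontal line kernel...
        rot = cv2.getRotationMatrix2D((k / 2 - 0.5, k / 2 - 0.5), rng.uniform(0, 360), 1.0)
        kernel = cv2.warpAffine(kernel, rot, (k, k))             # ...rotated to a random angle
        kernel /= max(kernel.sum(), 1e-8)
        out = cv2.filter2D(out, -1, kernel)
    else:
        out = cv2.GaussianBlur(out, (0, 0), sigmaX=rng.uniform(1, 3))

    # Sec. 3.3 Gaussian imaging noise with sigma ~ U[0, 0.2].
    out = out + rng.normal(0.0, rng.uniform(0, 0.2), out.shape)

    # Sec. 3.4 Affine color transforms: hue offset, desaturation, brightness/contrast.
    out = out + rng.uniform(-0.1, 0.1, size=3)                   # per-channel color offset
    gray = out.mean(axis=2, keepdims=True)                       # simple grayscale proxy
    out = out + rng.uniform(0, 1) * (gray - out)                 # interpolate toward grayscale
    out = rng.uniform(0.5, 1.5) * out + rng.uniform(-0.3, 0.3)   # contrast m and brightness b
    return np.clip(out, 0.0, 1.0)
```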
3.5. JPEG Compression

Camera images are usually stored in a lossy format, such as JPEG. JPEG compresses images by computing the discrete cosine transform of each 8 × 8 block in the image and quantizing the resulting coefficients by rounding to the nearest integer (at varying strengths for different frequencies). This rounding step is not differentiable. We use the trick from [31] for approximating the quantization step near zero with the piecewise function

q(x) = x^3 if |x| < 0.5,  and  q(x) = x if |x| ≥ 0.5,    (1)

which has nonzero derivative almost everywhere. We sample the JPEG quality uniformly within [50, 100].

4. Implementation Details

4.1. Encoder

The encoder is trained to embed a message into an image while minimizing perceptual differences between the input and encoded images. We use a U-Net [30] style architecture that receives a four channel 400 × 400 pixel input (RGB channels from the input image plus one for the message) and outputs a three channel RGB residual image. The input message is represented as a 100 bit binary string, processed through a fully connected layer to form a 50 × 50 × 3 tensor, then upsampled to produce a 400 × 400 × 3 tensor. We find that applying this preprocessing to the message aids convergence. To enforce minimal perceptual distortion on the encoded StegaStamp, we use an L2 loss, the LPIPS perceptual loss [37], and a critic loss calculated between the encoded image and the original image. We present examples of encoded images in Figure 2.
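A minimal Keras-style sketch of the message preprocessing and encoder input assembly described above. Choices beyond what the text states (the upsampling mode and the 1×1 convolution used to reconcile the 3-channel message map with the stated four-channel input) are illustrative assumptions, and the U-Net body itself is elided:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_encoder_inputs(image, message):
    """image: (B, 400, 400, 3) in [0, 1]; message: (B, 100) binary bits.
    Returns the 4-channel tensor fed to the U-Net-style encoder."""
    # A fully connected layer reshapes the 100-bit message into a 50x50x3 tensor...
    msg = layers.Dense(50 * 50 * 3, activation="relu")(message)
    msg = layers.Reshape((50, 50, 3))(msg)
    # ...which is then upsampled to the 400x400 image resolution.
    msg = layers.UpSampling2D(size=8, interpolation="nearest")(msg)   # (B, 400, 400, 3)
    # Assumption: reduce the message map to one channel so the concatenated
    # input has the four channels described in the text.
    msg = layers.Conv2D(1, kernel_size=1)(msg)
    return layers.Concatenate(axis=-1)([image, msg])                  # (B, 400, 400, 4)

image_in = tf.keras.Input(shape=(400, 400, 3))
message_in = tf.keras.Input(shape=(100,))
unet_input = build_encoder_inputs(image_in, message_in)
# A U-Net over `unet_input` would predict the 3-channel residual that is added
# back to `image_in` to produce the StegaStamp (cf. Figure 2).
```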

4.2. Decoder

The decoder is a network trained to recover the hidden message from the encoded image. A spatial transformer network [19] is used to develop robustness against small perspective changes that are introduced while capturing and rectifying the encoded image. The transformed image is fed through a series of convolutional and dense layers and a sigmoid to produce a final output with the same length as the message. The decoder network is supervised using cross entropy loss.
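A hedged Keras-style sketch of such a decoder head. The paper does not specify layer counts or widths here, so the numbers below are illustrative assumptions, and the spatial transformer stage that precedes this stack is only indicated in a comment:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_decoder(message_bits=100):
    """Convolutional + dense layers ending in a sigmoid over `message_bits` outputs.
    In the full system, a spatial transformer network first resamples the input
    to undo small residual perspective error before this stack is applied."""
    inp = tf.keras.Input(shape=(400, 400, 3))
    x = inp
    for filters in (32, 64, 64, 128, 128):            # illustrative widths
        x = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(512, activation="relu")(x)
    out = layers.Dense(message_bits, activation="sigmoid")(x)
    return tf.keras.Model(inp, out)

decoder = build_decoder()
# Supervised with cross entropy against the ground-truth bit string:
decoder.compile(optimizer="adam", loss="binary_crossentropy")
```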
[Figure 4: in-the-wild captures with detected StegaStamps outlined; per-image message recovery accuracies of 98–100% are overlaid.]

Figure 4: Examples of our system deployed in-the-wild. We outline the StegaStamps detected and decoded by our system and show the message recovery accuracies. Our method works in the real world, exhibiting robustness to changing camera orientation, lighting, shadows, etc. You can find these examples and more in our supplemental video.

4.3. Detector

For real world use, we must detect and rectify StegaStamps within a wide field of view image before decoding them, since the decoder network alone is not designed to handle full detection within a much larger image. We fine-tune an off-the-shelf semantic segmentation network, BiSeNet [35], to segment areas of the image that are believed to contain StegaStamps. The network is trained using a dataset of randomly transformed StegaStamps embedded into high resolution images sampled from DIV2K [1]. At test time, we fit a quadrilateral to the convex hull of each of the network's proposed regions, then compute a homography to warp each quadrilateral back to a 400 × 400 pixel square for parsing by the decoder.
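The test-time quadrilateral fitting and rectification can be illustrated with OpenCV. This is a sketch under the assumption that the segmentation network outputs a binary mask; the thresholds and simplification schedule are illustrative, and corner ordering is assumed consistent (a deployed version would sort corners explicitly):

```python
import numpy as np
import cv2

def extract_stegastamps(frame, mask, out_size=400, min_area=1000):
    """Fit a quadrilateral to each proposed region and warp it to a square crop.
    frame: HxWx3 uint8 image; mask: HxW uint8 segmentation output (0 or 255)."""
    crops = []
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for contour in contours:
        if cv2.contourArea(contour) < min_area:
            continue
        hull = cv2.convexHull(contour)
        # Simplify the convex hull until only four corners remain.
        quad, eps = hull, 0.01 * cv2.arcLength(hull, True)
        while len(quad) > 4:
            quad = cv2.approxPolyDP(hull, eps, True)
            eps *= 1.5
        if len(quad) != 4:
            continue
        src = quad.reshape(4, 2).astype(np.float32)
        dst = np.float32([[0, 0], [out_size, 0], [out_size, out_size], [0, out_size]])
        H = cv2.getPerspectiveTransform(src, dst)
        crops.append(cv2.warpPerspective(frame, H, (out_size, out_size)))
    return crops
```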

4.4. Encoder/Decoder Training Procedure

Training Data: During training, we use images from the MIRFLICKR dataset [17] combined with randomly sampled binary messages. We resample the images to 400×400 resolution.

Critic: As part of our total loss, we use a critic network that predicts whether a message is encoded in an image; it is used as a perceptual loss for the encoder/decoder pipeline. The network is composed of a series of convolutional layers followed by max pooling. To train the critic, an input image and an encoded image are classified, and the Wasserstein loss [2] is used as a supervisory signal. Training of the critic is interleaved with the training of the encoder/decoder.

Loss Weighting: The training loss is the weighted sum of three image loss terms (residual regularization LR, perceptual loss LP, critic loss LC) and the cross entropy message loss LM:

L = λR LR + λP LP + λC LC + λM LM    (2)

We find three loss function adjustments particularly aid convergence when training the networks:

1. The image loss weights λR,P,C must initially be set to zero while the decoder trains to high accuracy, after which λR,P,C are increased linearly.
2. The image perturbation strengths must also start at zero. The perspective warping is the most sensitive perturbation and is increased at the slowest rate.
3. The model learns to add distracting patterns at the edge of the image (perhaps to assist in localization). We mitigate this effect by increasing the weight of the L2 loss at the edges with a cosine dropoff.
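A small sketch of how the weighted loss of Eq. (2) and the linear ramp-up of the image loss weights and perturbation strength might be implemented. The ramp lengths and target weights below are placeholders for illustration, not values reported in the paper:

```python
def ramp(step, ramp_steps, max_value):
    """Linearly increase a value from 0 to max_value over ramp_steps training steps."""
    return max_value * min(step / float(ramp_steps), 1.0)

def total_loss(L_R, L_P, L_C, L_M, step,
               ramp_steps=10_000, max_R=1.5, max_P=1.5, max_C=0.5, lambda_M=1.0):
    """Weighted sum of Eq. (2): residual, perceptual, critic, and message losses.
    The image-loss weights start at zero so the decoder first reaches high accuracy."""
    lam_R = ramp(step, ramp_steps, max_R)
    lam_P = ramp(step, ramp_steps, max_P)
    lam_C = ramp(step, ramp_steps, max_C)
    return lam_R * L_R + lam_P * L_P + lam_C * L_C + lambda_M * L_M

# The perturbation strengths follow the same pattern; the perspective warp ramps slowest:
def warp_strength(step, warp_ramp_steps=50_000, max_shift_px=40):
    return ramp(step, warp_ramp_steps, max_shift_px)
```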
5. Real-World & Simulation-Based Evaluation

We test our system in both real-world conditions and synthetic approximations of display-imaging pipelines. We show that our system works in-the-wild, recovering messages in uncontrolled indoor and outdoor environments. We evaluate our system in a controlled real world setting with 18 combinations of 6 different displays/printers and 3 different cameras. Across all settings combined (1890 captured images), we achieve a mean bit-accuracy of 98.7%. We conduct real and synthetic ablation studies with four different trained models to verify that our system is robust to each of the image perturbations we apply during training and that omitting these augmentations significantly decreases decoding performance.

Figure 5: Despite not explicitly training the method to be robust to occlusion, we find that our decoder can handle partial erasures gracefully, maintaining high accuracy.

Camera     Display              5th    25th   50th   Mean
Webcam     Consumer printer     90%    98%    99%    98.1%
Webcam     Enterprise printer   88%    94%    98%    95.9%
Webcam     Pro printer          97%    99%    100%   99.2%
Webcam     Monitor screen       94%    98%    99%    98.5%
Webcam     Laptop screen        97%    99%    100%   99.1%
Webcam     Cellphone screen     91%    98%    99%    97.7%
Cellphone  Consumer printer     95%    99%    100%   99.0%
Cellphone  Enterprise printer   88%    96%    98%    96.8%
Cellphone  Pro printer          97%    99%    100%   99.3%
Cellphone  Monitor screen       98%    99%    100%   99.4%
Cellphone  Laptop screen        98%    99%    100%   99.7%
Cellphone  Cellphone screen     96%    99%    100%   99.2%
DSLR       Consumer printer     97%    99%    100%   99.3%
DSLR       Enterprise printer   86%    96%    99%    97.0%
DSLR       Pro printer          98%    99%    100%   99.5%
DSLR       Monitor screen       99%    100%   100%   99.8%
DSLR       Laptop screen        99%    100%   100%   99.8%
DSLR       Cellphone screen     99%    100%   100%   99.8%

Table 1: Real world decoding accuracy (percentage of bits correctly recovered) tested using a combination of six display methods (three printers and three screens) and three cameras. We show the 5th, 25th, and 50th percentiles and mean taken over 105 images chosen randomly from ImageNet [10] with randomly sampled 100 bit messages.

5.1. In-the-Wild Robustness

Our method is tested on handheld cellphone camera videos captured in a variety of real-world environments. The StegaStamps are printed on a consumer printer. Examples of the captured frames with detected quadrilaterals and decoding accuracy are shown in Figure 4. We also demonstrate a surprising level of robustness when portions of the StegaStamp are covered by other objects (Figure 5). Please see our supplemental video for extensive examples of real world StegaStamp decoding, including examples of perfectly recovering 56 bit messages using BCH error correcting codes [5]. In practice, we generally find that if the bounding rectangle is accurately located, decoding accuracy will be high. However, it is not uncommon for the detector to miss the StegaStamp on a subset of video frames. In practice this is not an issue, because the code only needs to be recovered once. We expect future extensions that incorporate temporal information and custom detection networks can further improve the detection consistency.

[Figure 6: "Real-world Ablation Study" bar chart — bit recovery accuracy (0.4–1.0) for networks trained with None, Pixelwise, Spatial, and All perturbations, with the chance distribution marked.]

Figure 6: Real world ablation test of the four networks described in Section 5.3 using the cellphone camera and consumer printer pipeline from Table 1. The y-axis denotes which image perturbations were used to train each network. We show the distribution of random guessing (with its mean of 0.5 indicated by the dotted line) to demonstrate that the no-perturbations ablation performs no better than chance. Previous work on digital steganography focuses on either the no-perturbations case of perfect transmission or on the pixelwise case where the image may be digitally modified but not spatially resampled. Adding spatial perturbations is critical for achieving high real-world performance.

[Figure 7: four panels — (a) All Perturbations, (b) No Perturbations, (c) Pixelwise Perturbations Only, (d) Spatial Perturbations Only — plotting accuracy (0.5–1.0) against test-time perturbation strength (0.0–2.0) for the perturbations All, Warp, Blur, Noise, Color, and JPEG.]

Figure 7: Synthetic ablation tests showing the effect of training with various image perturbation combinations on bit recovery robustness. "Pixelwise" perturbations (c) are noise, color transforms, and JPEG compression, and "spatial" perturbations (d) are perspective warp and blur. To test robustness across a range of possible degradation, we parameterize the strength of each perturbation on a scale from 0 (weakest) to 1 (maximum value seen during training) to 2 (strongest). Models not trained against spatial perturbations (b-c) are highly susceptible to warp and blur, and the model trained only on spatial perturbations (d) is sensitive to color transformations. The lines show the mean accuracies and the shaded regions show the 25th-75th percentile range over 100 random images and messages. See Section 5.3 for details.

5.2. Controlled Real World Experiments

In order to demonstrate that our model generalizes from synthetic perturbations to real physical display-imaging pipelines, we conduct a series of tests where encoded images are printed or displayed and captured by a camera prior to decoding. We randomly select 100 unique images from the ImageNet dataset [10] (which is disjoint from our training set) and embed random 100 bit messages within each image. We generate 5 additional StegaStamps with the same source image but different messages for a total of 105 test images. We conduct the experiments in a darkroom with
fixed lighting. The printed images are fixed in a rig for consistency and captured by a tripod-mounted camera. The resulting photographs are cropped by hand, rectified, and passed through the decoder.

The images are printed using a consumer printer (HP LaserJet Pro M281fdw), an enterprise printer (HP LaserJet Enterprise CP4025), and a commercial printer (Xerox 700i Digital Color Press). The images are also digitally displayed on a matte 1080p monitor (Dell ST2410), a glossy high DPI laptop screen (Macbook Pro 15 inch), and an OLED cellphone screen (iPhone X). To image the StegaStamps, we use an HD webcam (Logitech C920), a cellphone camera (Google Pixel 3), and a DSLR camera (Canon 5D Mark II). All 105 images were captured across all 18 combinations of the 6 media and 3 cameras. The results of these tests are reported in Table 1. We see that our method is highly robust across a variety of different combinations of display or printer and camera; two-thirds of these scenarios yield a median accuracy of 100% and 5th percentile of at least 95% perfect decoding. Our mean accuracy over all 1890 captured images is 98.7%.

5.3. Ablation Tests

We test how training with different subsets of the image perturbations described in Section 3 impacts decoding accuracy in the real world (Figure 6) and in a synthetic experiment (Figure 7). We evaluate both our base model (trained with all perturbations) and three additional models (trained with no perturbations, only pixelwise perturbations, and only spatial perturbations). Most work on image steganography focuses on hiding as much information as possible but assumes that no corruption will occur prior to decoding (similar to our "no perturbations" model). HiDDeN [38] incorporates augmentations into their training pipeline to increase robustness to perturbations. However, they apply only digital corruptions, similar to our pixelwise perturbations, not including any augmentations that spatially resample the image. Work in other domains [3, 16] has explored training deep networks through pixelwise and spatial perturbations, but to the best of our knowledge, we are the first to apply this technique to steganography.

For the real world test, we evaluate all four networks on the same 105 test StegaStamps from Section 5.2 using the cellphone camera and consumer printer combination. We see in Figure 6 that training with no perturbations yields accuracy no better than guessing and that adding pixelwise perturbations alone barely improves performance. This demonstrates that previous deep learning steganography methods (most similar to either the "None" or "Pixelwise" ablations) will fail when the encoded image is corrupted by a display-imaging pipeline. Training with spatial perturbations alone yields performance significantly higher than pixelwise perturbations alone; however, it cannot reliably recover enough data for practical use. Our presented method, combining both pixelwise and spatial perturbations, achieves the most precise and accurate results by a large margin.

We run a more exhaustive synthetic ablation study over 1000 images to separately test the effects of each training-time perturbation on accuracy. The results shown in Figure 7 follow a similar pattern to the real world ablation test. The model trained with no perturbations is surprisingly robust to color warps and noise but immediately fails in the presence of warp, blur, or any level of JPEG compression.
Training with only pixelwise perturbations yields high robustness to those augmentations but still leaves the network vulnerable to any amount of pixel resampling from warping or blur. On the other hand, training with only spatial perturbations also confers increased robustness against JPEG compression (perhaps because it has a similar low-pass filtering effect to blurring). Again, training with both spatial and pixelwise augmentations yields the best result.

[Figure 8 panels: Original, 50 bits, 100 bits, 150 bits, 200 bits.]

Figure 8: Four models trained to encode messages of different lengths. The inset shows the residual relative to the original image. The perceptual quality decreases as more bits are encoded. We find that a message length of 100 bits provides good image quality and is sufficient to encode a virtually unlimited number of distinct hyperlinks using error correcting codes.

              Message length
Metric     50      100     150     200
PSNR ↑     29.88   28.50   26.47   21.79
SSIM ↑     0.930   0.905   0.876   0.793
LPIPS ↓    0.100   0.101   0.128   0.184

Table 2: Image quality for models trained with different message lengths, averaged over 500 images. For PSNR and SSIM, higher is better. LPIPS [37] is a learned perceptual similarity metric, lower is better.

5.4. Practical Message Length

Our model can be trained to store different numbers of bits. In all previous examples, we use a message length of 100. Figure 8 compares encoded images from four separately trained models with different message lengths. Larger messages are more difficult to encode and decode; as a result, there is a trade off between recovery accuracy and perceptual similarity. The associated image metrics are reported in Table 2. When training the models, the image and message losses are tuned such that the bit accuracy converges to at least 95%.

We settle on a message length of 100 bits as it provides a good compromise between image quality and information transfer. Given an estimate of at least 95% recovery accuracy, we can encode at least 56 error corrected bits using BCH codes [5]. As discussed in the introduction, this gives us the ability to uniquely map every recorded image in history to a corresponding StegaStamp. Accounting for error correcting, using only 50 total message bits would drastically reduce the number of possible encoded hyperlinks to under one billion. The image degradation caused by encoding 150 or 200 bits is much more perceptible.
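To make the capacity argument concrete, a small worked example in Python. The figures come from the paper's text; interpreting "under one billion" hyperlinks at 50 total bits as roughly 30 usable data bits is our reading, not a stated parameter:

```python
# Worked example of the capacity argument in Section 5.4.
data_bits = 56                          # error-corrected payload within a 100-bit message
unique_codes = 2 ** data_bits           # 72,057,594,037,927,936 ~= 7.2e16
photos_upper_bound = 100e12             # "100 trillion" photos on the internet (Section 1)

print(f"2^56 = {unique_codes:.2e} unique hyperlinks")
print(f"~{unique_codes / photos_upper_bound:.0f}x more codes than photos on the internet")

# By contrast, "under one billion" possible hyperlinks for a 50-bit total message
# corresponds to roughly 2^30 = 1,073,741,824, i.e. about 30 usable data bits
# once error-correction overhead is accounted for.
print(2 ** 30)
```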
5.5. Limitations

Though our system works with a high rate of success in the real world, it is still many steps from enabling an environment saturated with ubiquitous invisible hyperlinked images. Despite often being very subtle in high frequency textures, the residual added by the encoder network is sometimes perceptible in large low frequency regions of the image. Future work could improve upon our architecture and loss functions to train an encoder/decoder pair that used more subtle encodings.

Additionally, we find our off-the-shelf detection network to be the bottleneck in our decoding performance during real world testing. A custom detection architecture optimized end to end with the encoder/decoder could increase detection performance. The current framework also assumes that the StegaStamps will be single, square images for the purpose of detection. We imagine that embedding multiple codes seamlessly into a single, larger image (such as a poster or billboard) could provide even more flexibility.

6. Conclusion

We have presented StegaStamp, an end-to-end deep learning framework for encoding 56 bit error corrected hyperlinks into arbitrary natural images. Our networks are trained through an image perturbation module that allows them to generalize to real world display-imaging pipelines. We demonstrate robust decoding performance on a variety of printer, screen, and camera combinations in an experimental setting. We also show that our method is stable enough to be deployed in-the-wild as a replacement for existing barcodes that is less intrusive and more aesthetically pleasing.
7. Acknowledgments

We would like to thank Coline Devin and Cecilia Zhang for their assistance on our demos and Utkarsh Singhal and Pratul Srinivasan for their helpful discussions and feedback. Support for this work was provided by the Fannie and John Hertz Foundation.

References

[1] E. Agustsson and R. Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In CVPR Workshops, 2017.
[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
[3] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397, 2017.
[4] S. Baluja. Hiding images in plain sight: Deep steganography. In NeurIPS, 2017.
[5] R. C. Bose and D. K. Ray-Chaudhuri. On a class of error correcting binary group codes. Information and Control, 1960.
[6] O. Bulan, H. Blasinski, and G. Sharma. Color QR codes: Increased capacity via per-channel data encoding and interference cancellation. In Color and Imaging Conference, 2011.
[7] A. Cheddad, J. Condell, K. Curran, and P. Mc Kevitt. Digital image steganography: Survey and analysis of current methods. Signal Processing, 90(3), 2010.
[8] S.-T. Chen, C. Cornelius, J. Martin, and D. H. P. Chau. ShapeShifter: Robust physical adversarial attack on Faster R-CNN object detector. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2018.
[9] I. Cox, M. Miller, J. Bloom, J. Fridrich, and T. Kalker. Digital Watermarking and Steganography. Morgan Kaufmann, 2007.
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[11] I. Evtimov, K. Eykholt, E. Fernandes, T. Kohno, B. Li, A. Prakash, A. Rahmati, and D. Song. Robust physical-world attacks on deep learning models. arXiv preprint arXiv:1707.08945, 2017.
[12] J. Fridrich, T. Pevný, and J. Kodovský. Statistically undetectable JPEG steganography: Dead ends, challenges, and opportunities. In Proceedings of the 9th Workshop on Multimedia & Security, 2007.
[13] S. Garrido-Jurado, R. Muñoz-Salinas, F. Madrid-Cuevas, and R. Medina-Carnicer. Generation of fiducial marker dictionaries using mixed integer linear programming. Pattern Recognition, 2015.
[14] S. W. Hasinoff. Photon, Poisson noise. In Computer Vision: A Reference Guide. 2014.
[15] J. Hayes and G. Danezis. Generating steganographic images via adversarial training. In NeurIPS, 2017.
[16] D. Hu, D. DeTone, V. Chauhan, I. Spivak, and T. Malisiewicz. Deep ChArUco: Dark ChArUco marker pose estimation. arXiv preprint arXiv:1812.03247, 2018.
[17] M. J. Huiskes and M. S. Lew. The MIR Flickr retrieval evaluation. In MIR '08: Proceedings of the 2008 ACM International Conference on Multimedia Information Retrieval. ACM, 2008.
[18] B. Isac and V. Santhi. A study on digital image and video watermarking schemes using neural networks. International Journal of Computer Applications, 12(9):1–6, 2011.
[19] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NeurIPS, 2015.
[20] S. T. K. Jan, J. Messou, Y.-C. Lin, J.-B. Huang, and G. Wang. Connecting the digital and physical world: Improving the robustness of adversarial attacks. In AAAI, 2019.
[21] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
[22] Y. Liu, J. Yang, and M. Liu. Recognition of QR code with mobile phones. In Chinese Control and Decision Conference. IEEE, 2008.
[23] E. Myodo, S. Sakazawa, and Y. Takishima. Method, apparatus and computer program for embedding barcode in color image, 2013. US Patent 8,550,366.
[24] E. Ohbuchi, H. Hanaizumi, and L. A. Hock. Barcode readers using the camera device in mobile phones. In International Conference on Cyberworlds. IEEE, 2004.
[25] S. Pereira and T. Pun. Robust template matching for affine resistant image watermarks. IEEE Transactions on Image Processing, 2000.
[26] F. A. Petitcolas. Watermarking schemes evaluation. IEEE Signal Processing Magazine, 2000.
[27] F. A. Petitcolas, R. J. Anderson, and M. G. Kuhn. Attacks on copyright marking systems. In International Workshop on Information Hiding. Springer, 1998.
[28] T. Pevný, T. Filler, and P. Bas. Using high-dimensional image models to perform highly undetectable steganography. In International Workshop on Information Hiding, 2010.
[29] F. Romero Ramirez, R. Muñoz-Salinas, and R. Medina-Carnicer. Speeded up detection of squared fiducial markers. Image and Vision Computing, 2018.
[30] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI. Springer International Publishing, 2015.
[31] R. Shin and D. Song. JPEG-resistant adversarial images. In NeurIPS Workshop on Machine Learning and Computer Security, 2017.
[32] C. Sitawarin, A. N. Bhagoji, A. Mosenia, M. Chiang, and P. Mittal. DARTS: Deceiving autonomous cars with toxic signs. arXiv preprint arXiv:1802.06430, 2018.
[33] W. Tang, S. Tan, B. Li, and J. Huang. Automatic steganographic distortion learning using a generative adversarial network. IEEE Signal Processing Letters, 2017.
[34] P. Wu, Y. Yang, and X. Li. StegNet: Mega image steganography capacity with deep convolutional network. Future Internet, 2018.
[35] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang. BiSeNet: Bilateral segmentation network for real-time semantic segmentation. In ECCV, 2018.
[36] K. A. Zhang, A. Cuesta-Infante, and K. Veeramachaneni. SteganoGAN: Pushing the limits of image steganography. arXiv preprint arXiv:1901.03892, 2019.
[37] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[38] J. Zhu, R. Kaplan, J. Johnson, and L. Fei-Fei. HiDDeN: Hiding data with deep networks. In ECCV, 2018.