Obfuscated PDF

Copyright © 2024 J. M. Spivey

This document should not be taken as advocating the use of the techniques it describes in any specific situation. The author deprecates plagiarism in all its forms.

These notes document an experiment in using open-source software to create online documents that resist casual attempts to copy text from them and paste it into another document. The goal is to turn a TeX document into a PDF file that looks normal on the screen and prints normally, but from which it is difficult to copy text by cutting and pasting. Such PDF files provide a deterrent to plagiarism, because the text cannot be taken without some effort. Naturally, it is possible to steal the text by retyping it, or by using OCR; and as we shall see, simple obfuscations are not very secure. But obfuscation can still deter casual copying. Obfuscated PDF files also make it harder to violate the copyright of a document's author by incorporating the text into a repository, except in the form of the original, unmodified and complete PDF document. This may provide some reassurance in situations where the author of a document retains the legal and moral right to control what copies of it are made and for what purposes, but unwillingly loses the practical power to control copying.

The conventional approach to these problems is to protect the document with a password. PDF files can have two kinds of password:

  • A password needed to open the document at all. This kind of password is used to encrypt the whole contents of the document, and (short of breaking the cypher) makes it impossible to display or print the document without the password. In many circumstances, it is not acceptable to provide a document that requires a password to read it; and if the password is given, then the document can be decrypted and opened anyway.
  • A password needed to edit, copy or print the document. This kind of password is supposed to be honoured by PDF-reading software, but it is totally insecure, and is simply ignored by some software such as Ghostscript. This does not provide an acceptable solution to the problem of discouraging cut-and-paste copying.

The technique documented here is different, and relies on encrypting the text in the document with a substitution cypher. I know that this form of cypher is not secure, but it is significantly more difficult to evade than password protection, and thus has some value. Certainly there is no widely available software that overcomes this form of encryption automatically.

If the text is encrypted, then how can it be viewed and printed? The answer lies in the way fonts work in PostScript and PDF. By embedding in the document a font whose characters have been permuted with the inverse of the substitution that has been performed on the text, we can arrange that the document appears normal on the screen and when printed. For example, we might decide to replace each A in the document with a B, but then provide a font that made the letter B look like an A, and similarly for all the other characters. The way a human reader identifies each character is by looking at its shape on the screen or on paper, and that shape is not changed by this process of substitution and inverse substitution. When the text is copied and pasted, however, it is the character codes that are copied – and they will be pasted into another environment (a text editor, for example) where it has not been arranged for the B to have the appearance we normally associate with an A. The result is gobbledegook.
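As a rough illustration of the substitution and its inverse, the following shell sketch uses tr with a toy cypher that shifts each letter one place along the alphabet (so A becomes B, as in the example). The function names encrypt and render are invented for this illustration, and a real run would use a random permutation of the whole character set rather than a simple shift.

```shell
# Toy substitution (an assumption for illustration): A->B, B->C, ..., Z->A.
# This stands in for the substitution applied to the character codes.
encrypt() { tr 'A-Za-z' 'B-ZAb-za'; }

# The permuted font applies the inverse mapping when it draws each glyph.
render() { tr 'B-ZAb-za' 'A-Za-z'; }

stored=$(printf '%s' 'A mad dog' | encrypt)
echo "character codes in the file: $stored"
echo "shapes seen by the reader:   $(printf '%s' "$stored" | render)"
```

Copying and pasting yields the first line of output; the eye sees the second.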

To see that this substitution process works properly, it is necessary to understand something about fonts in PostScript, and I will explain that in a moment. To see how to achieve the desired effect with TeX, it is also necessary to understand in some detail how certain TeX-related programs work: specifically, the program dvips that converts the DVI file output by TeX or LaTeX into PostScript. One workable route from TeX or LaTeX to PDF is to use dvips on the TeX output, then use the ps2pdf script of Ghostscript to make a PDF file. That is the route we shall use here. A similar process can be made to work with PDFTeX, except that all the stages are integrated into PDFTeX itself.

Fonts in PostScript and PDF

Here's how fonts work in an ordinary PostScript document: in PDF, they work the same way. PostScript has a notion of the current font, and text is typically set by using the show operator; this expects a string on the evaluation stack, and prints each of the characters in the string in the current font, beginning at the current position on the page. Let's consider the steps involved in deciding what region on the page will be covered with ink, taking as an example the letter A. Usually, the steps are as follows:

  1. The PostScript file contains the command sequence (A) show, which first pushes on the stack a string consisting of a single ASCII letter A, then uses that string as the operand of the show operation.
  2. The letter A has an ASCII code of 65 (decimal). The current font has an encoding vector of 256 entries that map character codes onto symbolic names. Code 65 is mapped to the name /A in this vector; there is a standard list of these symbolic names that is used in all commercial text fonts. (Non-letters have symbolic names like /dollar.)
  3. The current font also has a dictionary that maps symbolic names to drawing instructions. In this dictionary, the name /A is mapped to a specific set of lines and curves, usually together with hints about how to make the curves look good on low-resolution devices.

If we want to hide the fact that our text contains a letter A, then there are two things we must do: use a different character code in the string that is the argument of the show operator (let's say 66, which is the standard code for B), and use a glyph name that is different from /A (let's say /ch1616). Actually, that means three things need to change:

  • the software we use to produce the PostScript file must use character 66 (the ASCII code for B) wherever it wants to print a letter A.
  • the encoding vector in the font must map code 66 to the symbolic name /ch1616.
  • the glyph dictionary of the font must map the symbolic name /ch1616 to the original lines and curves for a letter A.

Happily, all three of these changes are easily made. For the first of them, we can modify the behaviour of dvips as described in the next section. For the others, standard tools make it simple to modify a font in the ways we need:

  • There are an assembler and disassembler for the Type 1 font format that convert between a textual representation and the binary format normally used for fonts. In the textual representation, it's easy to edit the font to use different glyph names.
  • There is a standard way in PostScript of replacing the encoding vector of a font with a different one.

Experience indicates that it is necessary to modify both the encoding vector and the glyph names in order to obfuscate the text for PDF reader software like Acrobat Reader, and presumably other PDF processors. If just the encoding vector is changed, then Acrobat Reader is certainly able to use the glyph names to recover the text for cutting and pasting.
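The two lookup chains described in this section can be mimicked with a toy model in shell. The tables below are illustrative stand-ins, not real font data: the function names and the "outline-of-A" strings are invented, and a real font holds 256 encoding entries and actual drawing instructions.

```shell
# Ordinary font: encoding vector, then glyph dictionary.
encoding_std() { case $1 in 65) echo A ;; 66) echo B ;; esac; }
glyphs_std()   { case $1 in A) echo "outline-of-A" ;; B) echo "outline-of-B" ;; esac; }

# Obfuscated font: code 66 carries an opaque glyph name that leads to A's outline.
encoding_crypt() { case $1 in 66) echo ch1616 ;; esac; }
glyphs_crypt()   { case $1 in ch1616) echo "outline-of-A" ;; esac; }

# code 65 -> /A -> outline of A
glyphs_std "$(encoding_std 65)"
# code 66 -> /ch1616 -> outline of A
glyphs_crypt "$(encoding_crypt 66)"
```

Both chains end at the same outline, but a program that extracts text from the second sees only code 66 and the name ch1616, neither of which suggests the letter A.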

Fonts in TeX and DVIPS

Like PostScript, TeX works with an encoding of the characters in a font as 8-bit numbers. The program dvips that converts from DVI to PostScript is able to deal with the situation where the encoding that is used in TeX is different from the one used in the resulting PostScript. It does this by having a virtual font (VF) file that gives the mapping from the font and character codes used by TeX to fonts and character codes to be used in the PostScript output. These VF files can be created with the help of an accessory program named afm2tfm, which uses the Adobe Font Metric (AFM) file that comes with a PostScript font, together with two encoding vectors in PostScript format, one describing the layout of the font as seen by TeX, and the other describing its layout as seen by PostScript; it is the glyph names in these encoding files that are used to tie character codes together.

In fact, we will use three encoding vectors as inputs to the process:

  • The vector texnansi.enc (or some other standard encoding vector for use with TeX). To continue the example of the letter A, this maps code 65 (ASCII letter A) to the symbolic name /A.
  • The vector scramble.enc uses the same glyph names as appear in texnansi.enc, but in a scrambled order that is generated randomly. In our example, it could be that this encoding vector maps code 66 to the symbolic name /A.
  • The vector crypt.enc has the same scrambled order as scramble.enc, but uses the glyph names that are present in our modified Type 1 font. In our example, this vector would map code 66 to the symbolic name /ch1616.

Now, by using afm2tfm with the encodings texnansi.enc and scramble.enc, we can make a VF file that maps code 65 in font lbr (for Lucida Bright Roman, say) to code 66 in an invented font klbr. We can then make a file psfonts.map that instructs dvips to download our modified Type 1 font klbr.pfb and re-encode it with the encoding crypt.enc.

klbr LucidaBright " CryptEncoding ReEncodeFont " <crypt.enc <klbr.pfb

In summary, here is how a letter A appears on the screen when it is typeset using our set-up:

  1. You write A in your TeX manuscript, and this causes TeX to call for character 65 in font lbr. Note that this part of the process is unmodified from the usual processing of a document: our obfuscation process begins after TeX has done its job.
  2. The dvips program uses a VF file that maps character 65 in font lbr to character 66 in font klbr. This mapping was set up by afm2tfm as follows:
    1. Encoding texnansi.enc maps 65 to /A
    2. Encoding scramble.enc maps 66 to /A
    3. The VF composes one encoding with the inverse of the other.
  3. In the encoding crypt.enc that is downloaded with the font, code 66 maps to the symbolic name /ch1616.
  4. In the glyph dictionary of the font, /ch1616 maps to the lines and curves for a letter A.

It is the last two parts of this process that are done by a PostScript interpreter or, after conversion to PDF, by Acrobat Reader. By then, all hints that the character was originally a letter A have gone, except that the lines and curves have the right appearance for it.
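The composition of one encoding with the inverse of the other, which afm2tfm performs by matching glyph names, can be sketched with a few lines of awk. This is only a model of the idea, not of afm2tfm itself: it assumes stripped-down encoding files with exactly one glyph name per line and no header, comments or repeated names, and the function name map_codes is invented here.

```shell
# map_codes TEXENC PSENC
# Each file holds one glyph name per line; line N (counting from 0) is code N.
# Prints "texcode -> pscode" for every glyph name found in both files.
map_codes() {
    awk 'NR == FNR { pscode[$1] = FNR - 1; next }
         ($1 in pscode) { print FNR - 1, "->", pscode[$1] }' "$2" "$1"
}

# Toy three-character encodings (hypothetical, not real .enc files).
tmp=$(mktemp -d)
printf '/A\n/B\n/C\n' > "$tmp/tex.enc"
printf '/C\n/A\n/B\n' > "$tmp/scramble.enc"
map_codes "$tmp/tex.enc" "$tmp/scramble.enc"
rm -rf "$tmp"
```

In this toy case the name /A sits at code 0 in the first file and code 1 in the second, so the first line printed is "0 -> 1"; the real VF file records the same kind of correspondence for codes 65 and 66.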

The process

This is a proof of concept, so it works only for a document that uses a single font, Lucida Bright Roman, perhaps in several sizes. To make a working system, it would be necessary to perform the same process for each of the fonts used in, say, a LaTeX document. The file names that appear below refer to the locations of files in my own TeX setup: they are sure to be different in yours.

Use a Tcl script scramble.tcl (shown below) to randomly permute the entries in the encoding vector texnansi.enc to obtain scramble.enc.

tclsh scramble.tcl </usr/local/tex/fonts/enc/texnansi.enc >scramble.enc

Use these two encodings to generate a VPL file lbr.vpl.

afm2tfm lbr -t texnansi.enc -p scramble.enc -v lbr klbr

Compile the VPL file into a VF file lbr.vf and a TFM file lbr.tfm. (Note: the TFM file for lbr that results is slightly different from the one that comes with TeX. Why?)

vptovf lbr

Run TeX on our document to produce doc.dvi. This uses the TFM file lbr.tfm that was produced by vptovf a moment ago, but actually the usual TFM file would do just as well.

tex doc.tex

Generate a sed script that substitutes different non-standard glyph names for the ones in the standard glyph list. (Note: these names are always the same whenever the process runs, and that is a serious but avoidable weakness; we should use random numbers here.)

grep -v '^#' /usr/local/tex/fonts/map/glyphlist.txt | \
    awk 'BEGIN { print "/^\\// {" } \
        { printf "s@/%s @/ch%03d @\n", $1, NR } \
        END { print "}" }' FS=';' >remap.sed
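One way to avoid the fixed numbering is to hand out the numbers in a shuffled order. The sketch below is not what was used for the experiment: the function name random_remap is invented, the sed commands it prints use @ as their delimiter, and awk's rand(), seeded from the clock by srand(), is an assumption that suits a demonstration but is not a strong source of randomness.

```shell
# random_remap GLYPHLIST
# Prints a sed script that renames each glyph /name to /chNNNN, where the
# numbers 1..N are assigned in a random order (Fisher-Yates shuffle).
random_remap() {
    grep -v '^#' "$1" | awk -F';' '
        BEGIN { srand(); print "/^\\// {" }
        { name[NR] = $1 }
        END {
            for (i = 1; i <= NR; i++) num[i] = i
            for (i = NR; i > 1; i--) {
                j = 1 + int(rand() * i)      # j is uniform in 1..i
                t = num[i]; num[i] = num[j]; num[j] = t
            }
            for (i = 1; i <= NR; i++)
                printf "s@/%s @/ch%04d @\n", name[i], num[i]
            print "}"
        }'
}
```

It would be called as random_remap /usr/local/tex/fonts/map/glyphlist.txt >remap.sed, with the path adjusted to the local TeX setup.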

Use the sed script on scramble.enc to get crypt.enc.

sed -f remap.sed scramble.enc >crypt.enc

Disassemble the Type 1 font lbr.pfb, apply the sed script, and assemble the result to get klbr.pfb.

t1disasm /usr/local/tex/fonts/outline/lucida/lbr.pfb lbr.raw
sed -e '/^\/UniqueID/d' -f remap.sed lbr.raw >klbr.raw
t1asm -b klbr.raw klbr.pfb

Run dvips to convert the document to PostScript, using all the bits and pieces prepared earlier.

dvips -t a4 doc.dvi

Convert the result to PDF.

ps2pdf doc.ps

Here is the result. Try opening it with Adobe Reader; then try copying and pasting some of the well-known text.

Going further

Clearly, this obfuscation is easy enough to break – by using OCR, or by simply breaking the substitution cypher, a process that could be automated easily enough. So obfuscation is really just an annoyance; however, there are a number of steps that could be taken to increase the level of annoyance:

  • Randomize the glyph names, so they are different from one document to another. Really, I should have done this above.
  • Use several copies of each font, each encrypted differently. Modify TeX, or introduce a post-processing phase, so that two adjacent glyphs in the document are never set in the same font or with the same character encoding.
  • Take each page of text and randomize the order in which the words are set, so that adjacent words in the text are not adjacent in the file.
  • The process will encrypt any copyright notice in the document, and this clearly increases the risk that text will be copied without permission. So before setting text on each page, cover it with many copies of the document's copyright notice, printed white-on-white and not encrypted. This can be managed easily by using both encrypted and non-encrypted copies of the same font under different names.
  • Add to that many random English words, also printed white-on-white and not encrypted.

I don't know how many of these would actually work, but some of them are worth trying.
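To give a sense of how little resistance a plain substitution cypher offers, here is a sketch of the usual first step in breaking one: counting how often each letter occurs in the pasted text, so that the commonest symbols can be matched against the commonest letters of English. The function name letter_freq is invented for this illustration, and the sketch looks only at ASCII letters, whereas a real attack would count every character code in the document.

```shell
# letter_freq: read text on standard input and list its letters,
# most frequent first, each preceded by its count.
letter_freq() {
    tr -cd 'A-Za-z' | fold -w 1 | sort | uniq -c | sort -k1,1nr -k2,2
}

printf '%s' 'B nbe eph' | letter_freq
```

In a long English text the symbol at the head of the list very likely stands for e, and the rest of the substitution tends to unravel from there.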

I'm playing all the right notes, but not necessarily in the right order. [Eric Morecambe]

Scramble

For completeness, here is the script scramble.tcl:

#!/usr/bin/tclsh

# Read the input
set nchars 0
while {[gets stdin line] >= 0} {
    regsub -all "\t" $line " " line
    if {[regexp {^%} $line]} continue
    if {[regexp {\[ *$} $line]} continue
    if {[regexp {^\]} $line]} continue
    set slot($nchars) $line
    incr nchars
}

proc special {i} {
    global slot
    return [regexp {^/\.notdef} $slot($i)]
}

# Scramble the encoding
for {set i [expr {$nchars-1}]} {$i > 0} {incr i -1} {
    set j [expr {int($i * rand())}]
    if {[special $i] || [special $j]} continue
    set t $slot($i); set slot($i) $slot($j); set slot($j) $t
}

# Write the result
puts "/CryptEncoding \["
for {set i 0} {$i < $nchars} {incr i} {
    puts $slot($i)
}
puts "\] def"