Sunday, April 24, 2011

Removing PDF Metadata with PDF Toolkit

A little while back, I wrote Fear the FOCA!, a short write-up on retrieving and analyzing metadata using FOCA Free. If you do not know what metadata is, you can think of it as data that describes data. Metadata can be simple information like the document title and the creation and modification dates of the file. It can also contain more sensitive information including names, usernames, computer names, software versions, operating systems, and email addresses. While this may seem harmless, in the hands of a social engineer this information can be very valuable.

One of the problems with metadata is that many people are not aware that metadata even exists or could pose a threat. Because of this, metadata often leaks onto public web servers despite being easy to remove.

One tool that you can use to manipulate metadata in PDF files is the PDF Toolkit, or pdftk. Pdftk is a command-line tool, which makes it a great choice for scripting. It is available for Linux, Windows, and Mac. This article demonstrates how to use pdftk on Linux to remove metadata from PDF files. I am using Ubuntu Linux for this article, but I have also used pdftk on CentOS. These directions should work on Windows or Mac, but I have not tested those platforms.

I have divided this article into three sections:
Installing The PDF Toolkit
Getting Started with pdftk
Scripting pdftk

Installing The PDF Toolkit

To get started, you will need to install PDF Toolkit (referred to as pdftk). On Ubuntu, simply go to a command prompt and enter
sudo apt-get install pdftk
APT will take care of the dependencies and install pdftk. As of this writing, the Ubuntu package is a little outdated (1.41) but still works for my purposes. If you would like the newest version, you can get it from the PDF Toolkit website at http://www.pdflabs.com/docs/install-pdftk/.

Once you have pdftk installed, you will need a PDF document to analyze. If you do not have a PDF file available, a quick Google search using the "filetype:pdf" keyword should help you get started. My file is named sample.pdf. You will need to substitute your file name in the examples.

Getting Started with pdftk

The first step is to see what metadata is in your file. The command to do this is:
pdftk sample.pdf dump_data
When you enter this command, you will get output similar to the screen shot below.


While this is useful to view the data, we need to do more. We will put the metadata into a file so it can be manipulated. Use the same command with a little modification.
pdftk sample.pdf dump_data output pdf-metadata
This command prints nothing to the terminal. Instead, it creates a file, pdf-metadata, that contains a copy of the metadata from sample.pdf. Open the pdf-metadata file with the editor of your choice and delete the text after each InfoValue: label, leaving the InfoKey lines in place. Also remove any other entries such as bookmarks, page labels, or IDs. The pdf-metadata file should look like the screen shot below.
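If you would rather not edit the file by hand, the blanking step can be approximated with a small filter. This is only a sketch: blank_info_values is a name I made up for illustration, and it assumes the InfoKey/InfoValue layout described above. Some newer pdftk releases also print an InfoBegin line before each record, so the filter passes those through untouched. Everything else (bookmarks, page labels, IDs) is dropped.

```shell
#!/bin/bash
# blank_info_values (hypothetical helper): read `pdftk ... dump_data` text on
# stdin and write only the Info records, each with an empty InfoValue.
blank_info_values() {
  awk '
    /^InfoBegin/ { print; next }                # record marker, if present
    /^InfoKey:/  { print; print "InfoValue:" }  # keep the key, drop its value
  '
}
```

Because dump_data writes to the terminal when no output file is given, you could then build the file in one step with: pdftk sample.pdf dump_data | blank_info_values > pdf-metadata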


Save the pdf-metadata file. Now we are ready to use that data to wipe the metadata from our sample file. The command to do this is:
pdftk sample.pdf update_info pdf-metadata output sample-no-metadata.pdf
This command will also not produce any output. It takes the original sample.pdf file and creates a copy named sample-no-metadata.pdf. The (lack of) metadata from pdf-metadata is used to overwrite the existing metadata. You can test this by using the command from earlier.
pdftk sample-no-metadata.pdf dump_data
You should see much less metadata now. Pdftk adds the Producer and ModDate metadata, but all of the other metadata that dump_data reported earlier is now gone!
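If you want to check this without reading the output by eye, a filter along these lines can list any keys that still carry a value. Treat it as a sketch under assumptions: remaining_info_keys is a hypothetical name, it expects the InfoKey/InfoValue pairs that dump_data prints, and it assumes key names contain no spaces. Producer and ModDate are skipped because pdftk writes those itself.

```shell
#!/bin/bash
# remaining_info_keys (hypothetical helper): read dump_data text on stdin and
# print every InfoKey that still has a non-empty InfoValue, ignoring the two
# keys pdftk adds on its own.
remaining_info_keys() {
  awk '
    /^InfoKey:/   { key = $2 }
    /^InfoValue:/ {
      value = $0
      sub(/^InfoValue:[ \t]*/, "", value)   # strip the label and any padding
      if (value != "" && key != "Producer" && key != "ModDate") print key
    }
  '
}
```

Run it as: pdftk sample-no-metadata.pdf dump_data | remaining_info_keys. If nothing is printed, no other Info values were found.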


Keep reading for tips on using pdftk in scripts for bulk metadata manipulation.

Scripting pdftk

The above directions are useful but there are much simpler ways to remove metadata from a single PDF document. The value of pdftk is in scripting. Below is my simple script to remove metadata from PDF documents in the /var/www/html directory structure.

#!/bin/bash
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")

pdftk_path="/usr/local/bin/pdftk"  # full path to pdftk binary
pdf_infokeys="$HOME/pdf-infokeys"  # full path to file containing new metadata ($HOME, because ~ does not expand inside quotes)
pdf_search_path="/var/www/html"    # path to search for pdf files
pdf_temp_path="/tmp"               # temporary directory


for i in $( find "$pdf_search_path" -type f -name "*.pdf" ); do
  # work from a temporary copy so pdftk can write over the original
  cp "$i" "$pdf_temp_path/temp.pdf"
  "$pdftk_path" "$pdf_temp_path/temp.pdf" update_info "$pdf_infokeys" output "$i"
  rm "$pdf_temp_path/temp.pdf"
done

IFS=$SAVEIFS
You need to modify the four lines that begin with pdf_ to match your environment. If you are familiar with Bash scripting, then you should not have trouble following this script. However, there is one part that requires further explanation. You need to create the pdf-infokeys file. This is a list of all of the InfoKey data that is found in your PDF files.

Here is a simple way to get that data. Start by opening a terminal window and running this command (modified to search your directory structure):

find /var/www/html -type f -name "*.pdf" -exec pdftk {} dump_data \; | \
grep -i infokey | \
sort -u > ~/pdf-infokeys
If you have password-protected PDF documents, this may produce a few errors. These can be safely ignored. You will end up with a pdf-infokeys file that contains something similar to this:

InfoKey: Author
InfoKey: Company
InfoKey: CreationDate
InfoKey: Creator
InfoKey: ModDate
InfoKey: Producer
InfoKey: SourceModified
InfoKey: Title
Open pdf-infokeys with your favorite editor (like vi) and modify it so it looks similar to this:

InfoKey: Author
InfoValue:
InfoKey: Company
InfoValue:
InfoKey: CreationDate
InfoValue:
InfoKey: Creator
InfoValue:
InfoKey: ModDate
InfoValue:
InfoKey: Producer
InfoValue:
InfoKey: SourceModified
InfoValue:
InfoKey: Title
InfoValue:
Now, save the file and you can use the script above.
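One thing to keep in mind is that the script writes straight over each original file. If that worries you, the body of the loop could be sketched as a function that only replaces the original after pdftk reports success. This is an illustration, not a tested replacement for the script: strip_one and the PDFTK override variable are names I invented, and the function assumes pdftk exits with a non-zero status when it fails.

```shell
#!/bin/bash
# strip_one PDF INFOKEYS (hypothetical helper): rewrite PDF in place using the
# metadata listed in INFOKEYS. PDFTK is an assumed override for the binary
# path; it defaults to whatever "pdftk" is on PATH.
strip_one() {
  local pdf=$1 infokeys=$2 workdir status=0
  workdir=$(mktemp -d) || return 1
  # Write the cleaned copy into a private temporary directory first.
  if "${PDFTK:-pdftk}" "$pdf" update_info "$infokeys" output "$workdir/clean.pdf"; then
    # Only now touch the original; cat keeps its owner and permissions.
    cat "$workdir/clean.pdf" > "$pdf" || status=1
  else
    echo "skipped (pdftk failed): $pdf" >&2
    status=1
  fi
  rm -rf "$workdir"
  return "$status"
}
```

Inside the loop you would then call something like: strip_one "$i" "$pdf_infokeys". Files that pdftk cannot process, such as password-protected ones, are left as they were and reported on standard error.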

Hopefully you found this useful. Please feel free to leave comments or questions below. Thanks for visiting.

Note: For an explanation of the $IFS variable from the script above, check out http://www.cyberciti.biz/tips/handling-filenames-with-spaces-in-bash.html. This is a workaround for dealing with spaces found in the path or file name used in Bash script loops.
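If you would rather leave $IFS alone, another common Bash approach is to have find separate the paths with NUL bytes and read them one at a time. The sketch below is not what my script does; for_each_pdf is a hypothetical helper name, and it relies on Bash features (read -d and process substitution), so it will not run under plain sh.

```shell
#!/bin/bash
# for_each_pdf DIR CMD [ARGS...] (hypothetical helper): run CMD once per PDF
# found under DIR, passing the file path as the last argument. Paths are
# NUL-delimited, so spaces in names survive without changing IFS globally.
for_each_pdf() {
  local dir=$1 file
  shift
  while IFS= read -r -d '' file; do
    "$@" "$file"
  done < <(find "$dir" -type f -name '*.pdf' -print0)
}
```

For example, for_each_pdf /var/www/html echo prints each PDF path on its own line, which is a harmless way to preview what a bulk run would touch.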