Filtering and Filetypes


Indexing PDF Files using xpdf

To index PDF files with xpdf, you will need to:

Download and install pdftotext, which is part of the xpdf package (xpdf-3.00;
see the xpdf home page for more information).

Place the provided script, usexpdf.sh, somewhere accessible, and add these lines:

*.pdf /path/to/usexpdf.sh <
*.PDF /path/to/usexpdf.sh <
*.pdf.gz /path/to/usexpdf.sh -z <
*.PDF.gz /path/to/usexpdf.sh -z <

to the .glimpse_filters file in your archive directory.
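
The four entries can also be appended from the shell. What follows is a minimal sketch and not part of the distribution: the function name add_pdf_filters and its two arguments (the archive directory and the location of usexpdf.sh) are made up for illustration. It skips entries that are already present, so running it twice does not duplicate lines.

```shell
#!/bin/sh
# Hypothetical helper: append the PDF filter entries to .glimpse_filters.
# $1 = archive directory, $2 = full path to usexpdf.sh (both are assumptions).
add_pdf_filters() {
    filters="$1/.glimpse_filters"
    script="$2"
    touch "$filters" || return 1
    # Each entry is quoted, so the shell does not expand the leading "*".
    for entry in \
        "*.pdf $script <" \
        "*.PDF $script <" \
        "*.pdf.gz $script -z <" \
        "*.PDF.gz $script -z <"
    do
        # -F: fixed string, -x: whole-line match; append only if missing.
        grep -qxF -e "$entry" "$filters" || printf '%s\n' "$entry" >> "$filters"
    done
}
```

It would be called as: add_pdf_filters /path/to/archive /path/to/usexpdf.sh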

NOTE: usexpdf.sh assumes pdftotext is in your path. If it is not, you will need to edit the script accordingly.
NOTE2: When saving usexpdf.sh from the link above, delete the .txt extension. It is only there
so that you can view the script conveniently in your browser.

We use usexpdf.sh because filters listed in .glimpse_filters receive their input on STDIN, but pdftotext
requires an input file for random access.
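
To illustrate the idea, here is a minimal sketch of such a wrapper. It is not the actual usexpdf.sh, whose contents may differ. The function name pdf_filter, the PDFTOTEXT override variable, and the temporary file prefix are all assumptions made for this example. The sketch copies STDIN to a temporary file (decompressing first when -z is given), runs pdftotext on that file with "-" as the output name so the text goes to STDOUT, and then removes the temporary file. mktemp is assumed to be available.

```shell
#!/bin/sh
# Sketch of a STDIN-to-file wrapper around pdftotext (NOT the real usexpdf.sh).
# PDFTOTEXT is a hypothetical override for the converter's name or path.
PDFTOTEXT="${PDFTOTEXT:-pdftotext}"

pdf_filter() {
    # Private temporary file; the "pdf_filter." prefix is illustrative only.
    tmp=$(mktemp "${TMPDIR:-/tmp}/pdf_filter.XXXXXX") || return 1
    if [ "$1" = "-z" ]; then
        gzip -dc > "$tmp"      # STDIN is gzip-compressed: decompress it
    else
        cat > "$tmp"           # STDIN is a plain PDF: copy it as-is
    fi
    # pdftotext can now seek in the file; "-" writes the text to STDOUT.
    "$PDFTOTEXT" "$tmp" -
    status=$?
    rm -f "$tmp"               # do not leave the temporary file behind
    return $status
}
```

With this definition, "pdf_filter < file.pdf" and "pdf_filter -z < file.pdf.gz" mirror the two ways the filter entries invoke the script.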

On the Manage Archive page, enter

all or pdf PDF

in the field labeled "Prefilter filetypes for speed:"

Prefiltering is recommended for efficiency and speed. However, if you prefer to filter files on the fly in order to save space, then edit the wgreindex file in each archive that needs to access PDF files. You will need to add the -z switch to both glimpseindex command lines.

Make sure .pdf files aren't being excluded from indexing! Check the .wgfilter-index file and delete any line
with .pdf or .PDF in it.
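
A quick way to find such lines is a case-insensitive grep. The helper below is only a sketch; the name list_pdf_excludes and its single argument (the archive directory) are assumptions. It prints each matching line with its line number and deletes nothing, so you can review the matches before editing the file.

```shell
#!/bin/sh
# Hypothetical helper: print lines of .wgfilter-index that mention .pdf.
# $1 = archive directory (assumption). Output format is "line-number:content".
list_pdf_excludes() {
    # -i matches .pdf and .PDF alike; -n prefixes each hit with its line number.
    grep -in '\.pdf' "$1/.wgfilter-index"
}
```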

Important: Add the line

rm /tmp/xpdf*

either to your crontab or to the end of the wgreindex script. The xpdf filter tends to leave temporary files behind, and these
can fill up your hard drive if they are not deleted regularly.
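
A slightly more cautious variant of this cleanup removes only files older than a given age, so that temporary files belonging to an indexing run still in progress are left alone. This is a sketch under assumptions: the function name clean_xpdf_tmp, its arguments, and the 60-minute default are illustrative, and the find options used are GNU/BSD extensions rather than POSIX.

```shell
#!/bin/sh
# Hypothetical helper: delete stale xpdf temporary files.
# $1 = directory to clean (default /tmp)
# $2 = minimum age in minutes (default 60, an arbitrary choice)
clean_xpdf_tmp() {
    dir="${1:-/tmp}"
    age="${2:-60}"
    # -maxdepth, -mmin and -delete are GNU/BSD find extensions, not POSIX.
    # Only regular files named xpdf* directly inside $dir are considered.
    find "$dir" -maxdepth 1 -type f -name 'xpdf*' -mmin +"$age" -delete
}
```

Calling clean_xpdf_tmp with no arguments from the end of wgreindex, or from a cron job, would then take the place of the plain rm line.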

Run ./wgreindex in your archive directory to regenerate the indexes. To search, make sure the "Use Filters" box is
checked in the search form. You may want to have it checked by default, or replace it with a hidden field.

 
