Indexing PDF Documents with Xpdf

Now uses a new shell script contributed by Tong Sun for faster processing and gzip support.
Last updated 5/31/07

To index PDF files with Xpdf, follow these steps:

  1. Download and install pdftotext, which is part of the xpdf package (xpdf-3.01; see the xpdf home page for more information).

  2. Place the provided script, usexpdf.sh, somewhere accessible, and add these lines to the .glimpse_filters file in your archive directory:
    	*.pdf	/path/to/usexpdf.sh   <
    	*.PDF	/path/to/usexpdf.sh   <
    	*.pdf.gz	/path/to/usexpdf.sh -z	<
    	*.PDF.gz	/path/to/usexpdf.sh -z	<

    NOTE: usexpdf.sh assumes pdftotext is in your path; if it is not, you will need to edit the script accordingly.
    NOTE2: when saving usexpdf.sh from the link above, delete the .txt extension. It is there only so that you can view the script conveniently in a browser such as Netscape.

    usexpdf.sh is needed because .glimpse_filters passes each file to the filter on STDIN, but pdftotext requires an input file for random access.

  3. On the Manage Archive page, enter
    	all
    or
    	pdf PDF
    
    in the field labelled "Prefilter filetypes for speed:"

    Prefiltering is recommended for efficiency and speed. However, if you prefer to filter files on the fly in order to save space, then edit the wgreindex file in each archive that needs to access PDF files. You will need to add the -z switch to both glimpseindex command lines.

  4. Make sure .pdf files aren't being excluded from the indexing! Check the .wgfilter-index file and delete any line with .pdf or .PDF in it.
  5. Important: add the line
    	rm /tmp/xpdf*
    
    either to your crontab or to the end of the wgreindex script. The xpdf filter tends to leave temporary files behind, and these can fill up your hard drive if they are not deleted regularly.
  6. Run ./wgreindex in your archive directory to regenerate the indexes. To search, make sure the "Use Filters" box is checked in the search form. You may want to make it checked by default or make it a hidden tag.
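
The STDIN issue described in step 2 can be illustrated with a short sketch. This is not the contents of the real usexpdf.sh; it is a minimal illustration of the general approach, assuming that pdftotext and gzip are in your path and that mktemp is available. The function name pdf_stdin_to_text and the temp-file name pattern /tmp/xpdf.XXXXXX are hypothetical choices made for this example.

```shell
#!/bin/sh
# Hypothetical sketch only; NOT the real usexpdf.sh.
# Reads a PDF on stdin (a gzipped PDF if called with -z) and writes
# the extracted plain text to stdout.

pdf_stdin_to_text() {
    # pdftotext needs a real, seekable file, so spool stdin to a temp file.
    # The name pattern below is an assumption made for this example.
    tmp=$(mktemp /tmp/xpdf.XXXXXX) || return 1

    if [ "$1" = "-z" ]; then
        # Input is gzip-compressed: decompress it while spooling.
        gzip -dc > "$tmp" || { rm -f "$tmp"; return 1; }
    else
        cat > "$tmp" || { rm -f "$tmp"; return 1; }
    fi

    # Passing "-" as the output file tells pdftotext to write to stdout.
    pdftotext "$tmp" -
    status=$?

    # Remove the temp file so nothing is left behind in /tmp.
    rm -f "$tmp"
    return "$status"
}
```

A wrapper script built this way would end with a call such as pdf_stdin_to_text "$@", so that the -z argument from the .glimpse_filters line is passed through. Deleting the temp file inside the function is a design choice for this sketch; the cleanup in step 5 is still worth keeping in case a run is interrupted.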

Reported problems & solutions:

Success with the above steps has been reported by several users. If you needed to do something additional on your system, please let us know at webglimpse-support@iwhome.com so we can add additional notes here.
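
For the temp-file cleanup in step 5, a plain rm /tmp/xpdf* can also delete files that a running indexer is still using. The sketch below is one possible alternative that removes only files older than a given age. The function name clean_xpdf_tmp and the 60-minute default are assumptions made for this example, and the -maxdepth and -mmin options are extensions found in GNU and BSD find rather than in every find implementation.

```shell
#!/bin/sh
# Hypothetical helper: clean_xpdf_tmp [DIR] [MINUTES]
# Deletes regular files named xpdf* directly inside DIR (default /tmp)
# that were last modified more than MINUTES minutes ago (default 60).

clean_xpdf_tmp() {
    dir=${1:-/tmp}
    minutes=${2:-60}

    # -maxdepth 1 : do not descend into subdirectories
    # -type f     : regular files only
    # -mmin +N    : modified more than N minutes ago
    find "$dir" -maxdepth 1 -type f -name 'xpdf*' \
        -mmin +"$minutes" -exec rm -f {} \;
}
```

To use it, save the function in a script that ends with a call to clean_xpdf_tmp, then run that script from your crontab or from the end of the wgreindex script in place of the rm line.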