Indexing PDF Documents using pstotext

Now uses a new shell script, contributed by Tong Sun, for faster processing and gzip support.

last updated 9/13/01

To index PDF files, you will need to:

  1. Install ghostscript, from http://www.cs.wisc.edu/~ghost

  2. Install a PDF-to-text converter, such as pstotext, from http://www.research.digital.com/SRC/virtualpaper/pstotext.html

    Another option would be Prescript, from http://www.nzdl.org/cgi-bin/gw?a=page&p=Prescript

    (Both require ghostscript 4.01 or greater)

  3. Place the provided script, processpdf.sh, somewhere accessible, and add these lines to the .glimpse_filters file in your archive directory:

    	*.pdf	/path/to/processpdf.sh   <
    	*.PDF	/path/to/processpdf.sh   <
    	*.pdf.gz	/path/to/processpdf.sh -z	<
    	*.PDF.gz	/path/to/processpdf.sh -z	<

    NOTE: processpdf.sh assumes pstotext is in your path. If it is not, you will need to edit the script accordingly.
    NOTE2: When saving processpdf.sh from the link above, delete the .txt extension. It is only there so you can view the script conveniently from Netscape.

    We use processpdf.sh because .glimpse_filters works on STDIN, but pstotext requires random access to PDF files.

  4. Edit the wgreindex file in each archive that needs to access PDF files. Change both glimpseindex command lines to add the -z option, like so:
    	/bin/cat /home/WWW/proj/test/.wg_toindex | /usr/local/bin/glimpseindex -n -H /home/WWW/proj/test -o -t -h -X -U -f -C -F -z > /dev/null

    	/bin/cat /home/WWW/proj/test/.wg_toindex | /usr/local/bin/glimpseindex -n -H /home/WWW/proj/test -o -t -h -X -U -f -C -F -z

    
  5. Make sure .pdf files aren't being excluded from the indexing! Check the .wgfilter-index file and delete any line with .pdf or .PDF in it.
  6. Run ./wgreindex in your archive directory to regenerate the indexes. To search, make sure the "Use Filters" box is checked in the search form. You may want to make it checked by default or make it a hidden tag.
    That "should" do it. Please send your results to webglimpse-support@iwhome.com so we can confirm this is a reliable method for indexing PDF files.
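
The wrapper idea from step 3 can be sketched roughly as follows. This is not the processpdf.sh script that ships with WebGlimpse, only a minimal illustration of the technique it relies on: spool STDIN to a temporary file so that the converter gets a real file it can seek in, decompressing first when -z is given. The function name pdf_filter and the PDF_CONVERTER variable are hypothetical names introduced for this sketch, and it assumes that mktemp and gzip are installed and that pstotext accepts a file name as its argument.

```shell
#!/bin/sh
# Sketch of a STDIN-to-temp-file filter wrapper (NOT the shipped processpdf.sh).
# PDF_CONVERTER is a hypothetical override; it defaults to pstotext on the PATH.

pdf_filter() {
    # Create a private temporary file to hold the document read from STDIN.
    tmp=$(mktemp "${TMPDIR:-/tmp}/pdf_filter.XXXXXX") || return 1

    if [ "$1" = "-z" ]; then
        # -z: the input is gzip-compressed, so decompress while spooling.
        gzip -dc > "$tmp"
    else
        cat > "$tmp"
    fi
    status=$?

    if [ "$status" -eq 0 ]; then
        # The converter now reads a seekable file; extracted text goes to STDOUT.
        ${PDF_CONVERTER:-pstotext} "$tmp"
        status=$?
    fi

    # Always clean up the temporary copy.
    rm -f "$tmp"
    return "$status"
}
```

A standalone script built on this sketch would end with a call such as pdf_filter "$@", so that the "<" in .glimpse_filters feeds the document to it on STDIN.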
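
Before running ./wgreindex, steps 3 and 5 can be checked with a small script along these lines. The function check_pdf_setup is a hypothetical helper, not part of WebGlimpse. It only looks for the text "processpdf.sh" in .glimpse_filters and for any line mentioning ".pdf" (in either case) in .wgfilter-index, so treat its output as a hint to inspect those files rather than a definitive verdict.

```shell
#!/bin/sh
# Hypothetical pre-flight check; check_pdf_setup is not part of WebGlimpse.
# Usage: check_pdf_setup /path/to/archive

check_pdf_setup() {
    dir=$1
    problems=0

    # Step 3: .glimpse_filters should contain a rule that calls processpdf.sh.
    if ! grep -q 'processpdf\.sh' "$dir/.glimpse_filters" 2>/dev/null; then
        echo "no processpdf.sh rule found in $dir/.glimpse_filters"
        problems=1
    fi

    # Step 5: any line mentioning .pdf or .PDF in .wgfilter-index is suspect.
    # grep -n prints the matching lines with their line numbers.
    if grep -in '\.pdf' "$dir/.wgfilter-index" 2>/dev/null; then
        echo "the lines above in $dir/.wgfilter-index may exclude PDF files"
        problems=1
    fi

    # Exit status 0 means nothing suspicious was found.
    return "$problems"
}
```

Calling check_pdf_setup with the archive directory prints any findings and returns a non-zero status when something needs attention.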