Cache pdftotext results for glimpse?
Hi,
I have installed glimpse 4.17.1, pdftotext 2.00, and GNU Ghostscript 6.51.
Indexing PDF files works fine.
For more complex queries, glimpse seems to filter the results of the index
scan by running a further search (I think using agrep) over the subset of
files that the index query hit.
For this second pass, glimpse appears to re-apply pdftotext to every file in
that subset, which can be quite slow.
Is there a way to transparently cache the pdftotext information so that
glimpse can find it?
I was thinking of writing a pdftotextfilter.sh script that regenerates the
corresponding .txt file only when the .pdf is newer than the .txt, and
otherwise just returns the previously created .txt file.
Is there a more elegant solution?
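For concreteness, here is a rough sketch of the wrapper I have in mind. It
is untested against glimpse itself and rests on assumptions: that the filter
is handed the PDF path as its first argument (I have not checked how glimpse
actually invokes filters), and that the cache can be written next to the PDF.
The function name cached_pdftotext and the PDFTOTEXT override variable are my
own inventions, not part of glimpse or xpdf.

```shell
#!/bin/sh
# pdftotextfilter.sh: hypothetical caching wrapper around pdftotext.
# Assumption: the PDF path arrives as $1 and the text is expected on stdout.

# Converter to run; override PDFTOTEXT to point at a specific binary.
PDFTOTEXT="${PDFTOTEXT:-pdftotext}"

# cached_pdftotext FILE.pdf
# Prints the text of FILE.pdf, regenerating FILE.txt only when the cache
# is missing or older than the PDF.
cached_pdftotext() {
    pdf=$1
    txt="${pdf%.pdf}.txt"
    tmp="$txt.tmp.$$"

    # find -newer prints the PDF name only if it was modified after the cache.
    if [ ! -f "$txt" ] || [ -n "$(find "$pdf" -newer "$txt")" ]; then
        # Convert into a temporary file first so that a failed or
        # interrupted run never leaves a truncated cache behind.
        if ! "$PDFTOTEXT" "$pdf" "$tmp"; then
            rm -f "$tmp"
            return 1
        fi
        mv "$tmp" "$txt"
    fi

    cat "$txt"
}

# When called with an argument, behave as the filter itself.
if [ $# -gt 0 ]; then
    cached_pdftotext "$1"
fi
```

Called as "pdftotextfilter.sh paper.pdf", this would leave paper.txt beside
the PDF and reuse it on later queries. It does not help when the directory
is read-only, which is one reason I am asking about alternatives.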
With many thanks,
David.
PS: Unrelated pdftotext question: I occasionally get "Floating point error"
messages. Any idea what could be wrong?
---------------------------------------------------------------------------
Dr David Philip Kreil ("`-''-/").___..--''"`-._
Research Fellow `6_ 6 ) `-. ( ).`-.__.`)
University of Cambridge (_Y_.)' ._ ) `._ `. ``-..-'
++44 1223 764107, fax 333992 _..`--'_..-_/ /--'_.' ,'
www.inference.phy.cam.ac.uk/dpk20 (il),-'' (li),' ((!.-'