
Re: Cache pdftotext results for glimpse? (2)



Dear Lee,

>  * Forgot to add: For such a script to work, I would need access to the name of
>  * the file presently being indexed. Is there a way I can request this name to
>  * be passed on to the script? Maybe something along the lines of
>  * 
>  * *.pdf   pdftotextfilter.sh %1 <
>  * 
>  * in .glimpse_filters?
> 
> Don't know, but you could calculate an MD5 of stdin and use that as an
> index into your cache.
Can you recommend a tool for this?
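
For illustration, here is a minimal sketch of the MD5 idea in Python. It assumes that the filter receives the PDF on stdin and must write plain text to stdout, and that a `pdftotext` binary is on the PATH. The cache location and the names `cache_key`, `convert`, `cached_convert` and `main` are hypothetical, not part of Glimpse. On the command line, the `md5sum` tool from GNU coreutils computes the same digest.

```python
import hashlib
import os
import subprocess
import sys
import tempfile

# Hypothetical cache location; any writable directory would do.
CACHE_DIR = os.path.expanduser("~/.pdftotext_cache")


def cache_key(data):
    """Return the MD5 hex digest of the raw input, used as the cache file name."""
    return hashlib.md5(data).hexdigest()


def convert(data):
    """Convert PDF bytes to text by running pdftotext on a temporary file."""
    with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
        tmp.write(data)
        tmp.flush()
        # "-" as the output file asks pdftotext to write to stdout.
        result = subprocess.run(
            ["pdftotext", tmp.name, "-"], stdout=subprocess.PIPE, check=True
        )
    return result.stdout


def cached_convert(data, cache_dir=CACHE_DIR, converter=convert):
    """Return the converted text, calling the converter only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, cache_key(data) + ".txt")
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return fh.read()
    text = converter(data)
    # Write to a temporary name first so a crash never leaves a partial entry.
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as fh:
        fh.write(text)
    os.replace(tmp_path, path)
    return text


def main():
    """Filter entry point: PDF bytes on stdin, cached text on stdout."""
    sys.stdout.buffer.write(cached_convert(sys.stdin.buffer.read()))
```

To use this as a filter, the script would end with a call to `main()`. Because the key depends only on the file contents, the file name is not needed, and renamed or duplicated PDFs share one cache entry.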

> I solved the problem by regularly preconverting new/changed pdf, doc
> etc., only indexing text/html, and recognising 'preconversions' in the
> CGI frontend. A bonus is that you can then present both the pdf and
> the txt file in your results in case people don't have a PDF reader
> installed in their browser.
Yes, that's what I do now, but that basically means maintaining a "text-only
mirror". I've started writing scripts to do that, and my first hack works
fine, but I worry about how it will scale up.
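
As a rough sketch of how such a mirror could stay cheap as it grows, the Python below reconverts a file only when its text copy is missing or older than the source. It is not the script mentioned above. The function names, the `.txt` suffix convention and the directory layout are assumptions made for illustration, and `pdftotext_converter` assumes a `pdftotext` binary on the PATH.

```python
import os
import subprocess


def needs_update(src, dst):
    """True if the text copy is missing or older than its source file."""
    return not os.path.exists(dst) or os.path.getmtime(dst) < os.path.getmtime(src)


def pdftotext_converter(src, dst):
    """Convert one PDF to text with pdftotext (assumed to be installed)."""
    subprocess.run(["pdftotext", src, dst], check=True)


def mirror_tree(src_root, dst_root, converter, extensions=(".pdf",)):
    """Convert new or changed files under src_root into dst_root.

    Returns the list of source paths that were (re)converted.
    """
    converted = []
    for dirpath, _dirs, files in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        for name in files:
            if not name.lower().endswith(extensions):
                continue
            src = os.path.join(dirpath, name)
            # Mirror path: same relative directory, ".txt" appended to the name.
            dst = os.path.normpath(os.path.join(dst_root, rel, name + ".txt"))
            if needs_update(src, dst):
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                converter(src, dst)
                converted.append(src)
    return converted
```

A periodic job could call `mirror_tree("/data/docs", "/data/mirror", pdftotext_converter)` with whatever paths apply. Each run still walks the whole tree, but conversion, the expensive step, happens only for files that changed.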

> Thanks for the URL below BTW.
:)

Many thanks and
best regards,

David.

---------------------------------------------------------------------------
Dr David Philip Kreil                   ("`-''-/").___..--''"`-._
Research Fellow                          `6_ 6  )   `-.  (     ).`-.__.`)
University of Cambridge                  (_Y_.)'  ._   )  `._ `. ``-..-'
++44 1223 764107, fax 333992           _..`--'_..-_/  /--'_.' ,'
www.inference.phy.cam.ac.uk/dpk20     (il),-''  (li),'  ((!.-'