ID: 119
VERSION: Glimpse 4.1
PROGRAM FILE: ??? something in glimpse or glimpseindex
DESCRIPTION: Glimpse record-counting bug.

Glimpse appears to return an incorrect number of hits when a record delimiter is used. Here is the detail again in brief...

The incorrect count, when a record separator is used:

    glimpse -H . -c -y -d ">>>" PLATYPUS
    Platypussies: 36

The plain count:

    glimpse -H . -c -y PLATYPUS
    Platypussies: 18

And the conditions are:

The index was about as simple as possible:

    glimpseindex -H . Platypussies

And the file itself contains only 18 instances of the word:

    grep PLATYPUS Platypussies

returns 18 lines of this:

    OS ORNITHORHYNCHUS ANATINUS (DUCKBILL PLATYPUS).

I hope this helps. It seems quite strange. Note that some other trials return incorrect counts that are not a multiple of 2, but those files are gigabytes, so I doubt you want them emailed :)

File available to repro: ftp://webglimpse.net/pub/wgdev/bug119.dataset

REPORTED-BY: Cath Lawrence cathl@angis.org.au
REPORTED-DATE: 12/21/98
PRIORITY: HIGH
WORKING-ON:
FIXED-BY: Morey Hubin
FIXED-DATE: 9/26/99
FIXED-VERSION: Glimpse 4.12.6 or higher, sgrep.c rev 1.2
FIX-DESCRIPTION: mhubin@mediaone.net Morey Hubin

Hi Cath,

I saw your glimpse bug dating back to late December and took a crack at it. Using '>>>' as a record delimiter, or any other delimiter, in a regular text file gives double the hit count.

    glimpse -H index_dir -c -d '>>>' PLATYPUS

WORKAROUND: The short answer is that it is a problem with delimiter alignment inside agrep. The immediate solution is to add the '-t' option whenever you use -d and -c together. This will properly adjust the alignment internally and give you the correct hit count. -t simply prints the '>>>' delimiter found at the end of the record rather than the beginning, but since you are not printing the record it never shows.

SOLUTION: The technical explanation and proper solution follow.

sgrep.c's function bm(), line 724, calls two functions in succession:

    1) curtextbegin = backward_delimiter(text,......);
    2) curtextend = forward_delimiter(curtextbegin,...);

Call 1) searches backward (to the left) in the file for the leading '>>>' delimiter, and call 2) starts from there and searches forward (to the right) for the next '>>>'.

The problem arises because 1) leaves the '>>>' at the beginning of curtextbegin, so 2) finds that same leading '>>>' immediately and does nothing (i.e. curtextbegin == curtextend). This means that it takes two loops to pass out of the current record (until we get to backward_delimiter again). Both loops increment the number of hits and presto, double the count.

Using -t causes agrep (and glimpse) to take the trailing delimiter. In this case 1) and 2) work properly because 2) does not get stuck on the leading '>>>' in curtextbegin.

The code must be altered so that the leading '>>>' is not left in place when curtextbegin is passed to forward_delimiter(). The following additions do the job nicely for plain text and compressed-filtered files. They are also specific to using -d and -c together, so they cannot break anything else in glimpse. See the attachment for the original.

The better fix would be to change delim.c's forward_delimiter() itself so that it does not get hung up on the leading delimiter (D_length bytes long) when OUTTAIL is not ON, which is the case the patch guards with !OUTTAIL. forward_delimiter is called from a number of other places, so I'm not touching it until I get more experience with the glimpse source.
=) (=

====================================================================================================
# diff sgrep.c sgrep.c119
712c712,716
< curtextend = forward_delimiter(curtextbegin/*text-m*/, textend, tc_D_pattern,tc_D_length, OUTTAIL);
---
> if (!OUTTAIL) {
>     curtextend = forward_delimiter(curtextbegin+D_length/*text-m*/, textend, tc_D_pattern,tc_D_length, OUTTAIL);
> }else{
>     curtextend = forward_delimiter(curtextbegin/*text-m*/, textend, tc_D_pattern,tc_D_length, OUTTAIL);
> }
725c729,733
< curtextend = forward_delimiter(curtextbegin/*text-m*/, textend, D_pattern, D_length,OUTTAIL);
---
> if (!OUTTAIL) {
>     curtextend = forward_delimiter(curtextbegin+D_length/*text-m*/, textend, D_pattern,D_length, OUTTAIL);
> }else{
>     curtextend = forward_delimiter(curtextbegin/*text-m*/, textend, D_pattern, D_length,OUTTAIL);
> }
=============================================================================