~geyaeb/haskell-pdftotext


nondeterministic segfaults and another error

Details
Message ID
<CAHJ505Fv7Be1vyKvDku6Qof4Jf9R1t1+Um+urP8EobNSbD7spg@mail.gmail.com>
hi - i'm getting frequent segfaults/errors that crash my haskell
executable that i can't even catch with Control.Exception.try.  i'm
reading ~30 similarly formatted pdfs; pdftotext reads each with ~70%
success rate, but ~30% of the time will segfault or report another
error, neither of which can be caught afaict.  poppler's command line
pdftotext works fine on them, as does
https://hackage.haskell.org/package/pdf-toolbox-[core|content|document]

code:
          txt <- (\(e::SomeException) -> print e >> ... {- never occurs -}) ||| pure =<<
            try (pdftotext Physical <$> fromJust <$> openFile f)

results:
succeeds ~70% of the time

once crashed with:
Segmentation fault: 11

once i got:
poppler/error: Embedded font file may be invalid
Segmentation fault: 11

often crashes with:
libc++abi.dylib: terminating with uncaught exception of type
std::__1::system_error: recursive_mutex lock failed: Invalid argument

if i don't use try, it crashes with this:
poppler/error: font resource is not a dictionary
poppler/error: font resource is not a dictionary
poppler/error: font resource is not a dictionary
covid-exe(36374,0x10dc82dc0) malloc: *** error for object
0x7f8c29d60027: pointer being freed was not allocated
covid-exe(36374,0x10dc82dc0) malloc: *** set a breakpoint in
malloc_error_break to debug
Abort trap: 6

here are the pdfs i'm running it on:
https://drive.google.com/file/d/15iOoGT3NCdWq9Gw1OavD7rCcuIqwc0eG/view?usp=sharing

osx 10.15.7
stack 2.3.3 (resolver lts-16.13, ghc 8.8.4)
poppler 20.11.0
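
A note on the `try` wrapper in the report above: `Control.Exception.try` only intercepts Haskell exceptions. A segfault or an abort raised inside a C/C++ library ends the whole process, so no handler can see it. Separately, `try` around a lazily built value returns an unevaluated thunk, so even a Haskell exception can slip past it. The sketch below illustrates that second point only; `riskyPure` is a hypothetical stand-in and is not part of `pdftotext`.

```haskell
import Control.Exception (SomeException, evaluate, try)

-- Hypothetical stand-in for a pure function that throws when forced.
riskyPure :: Int -> Int
riskyPure n
  | n < 0     = error "negative input"
  | otherwise = n * 2

-- The value is wrapped with 'pure' and never forced inside 'try',
-- so 'try' returns 'Right <thunk>' and the exception fires later,
-- outside the handler, when the thunk is finally evaluated.
lazyTry :: Int -> IO (Either SomeException Int)
lazyTry n = try (pure (riskyPure n))

-- 'evaluate' forces the value (to weak head normal form) inside 'try',
-- so a Haskell exception is caught and returned as 'Left'.
strictTry :: Int -> IO (Either SomeException Int)
strictTry n = try (evaluate (riskyPure n))
```

Forcing with `evaluate` helps only with Haskell-level exceptions; it would not make a segfault catchable.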
Details
Message ID
<SAXwW2ZtpvAs11fR1xnbh0gZlPVE5LPjBDpvngIjbClHAB8HpxRgUhBVkOcvCJOA4K1dkbeWJ4uFB6s0Dn6ki3WVV_pxFI0lTWxWtx10aKM=@protonmail.com>
In-Reply-To
<CAHJ505Fv7Be1vyKvDku6Qof4Jf9R1t1+Um+urP8EobNSbD7spg@mail.gmail.com> (view parent)
Hello, Erik.

Thanks for reporting.

I tried to read all the PDFs with pdftotext and poppler 20.11 on Linux but have not seen a segfault yet.

Could you please try the following?

1. If you `import Pdftotext.Internal`, you can use the IO version of the `pdftotext` function,
   i.e. `pdftotextIO`. Could you try using that instead of `pdftotext`?

2. Could you try `propertiesIO` or `pagesTotalIO` instead of `pdftotextIO`? This reads the
   PDF file but does not try to extract text, only some metadata.

3. Could you make sure that multiple threads are not involved?

4. `haskell-pdftotext` contains a binary, could you try running the binary
   with the PDFs multiple times and see whether it crashes?

   In `haskell-pdftotext` source code directory, you can run
   `stack exec pdftotext.hs -- text SOME_PDF.pdf` or you can
   `stack install` and then run `pdftotext.hs text SOME_PDF.pdf`.
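
For step 4, a small driver can take the tedium out of repeated runs. This is a minimal sketch, assuming the binary is on the `PATH` after `stack install`; `runRepeatedly` and `countFailures` are hypothetical helper names, and only `readProcessWithExitCode` from the `process` package is a real library call.

```haskell
import Control.Monad (replicateM)
import System.Exit (ExitCode (..))
import System.Process (readProcessWithExitCode)

-- Run a command n times with empty stdin and collect each exit code.
-- A crash in the child shows up as a non-success exit code here
-- instead of taking down the calling program.
runRepeatedly :: Int -> FilePath -> [String] -> IO [ExitCode]
runRepeatedly n cmd args = replicateM n $ do
  (code, _stdout, _stderr) <- readProcessWithExitCode cmd args ""
  pure code

-- Count the runs that did not exit successfully.
countFailures :: [ExitCode] -> Int
countFailures = length . filter (/= ExitSuccess)
```

For example, `countFailures <$> runRepeatedly 20 "pdftotext.hs" ["text", "SOME_PDF.pdf"]` would report how many of 20 runs failed.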

I have several ideas about what could be wrong: `unsafePerformIO` may cause some mess; automatic
deletion of pointers (using `ForeignPtr`) may happen at a bad time; and AFAIK the poppler library
holds some global state, which could be problematic with multiple threads. However, without
being able to reproduce it myself or even try on macOS, it's quite hard to pin down.
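
If global state in the C library turns out to be the culprit, one common precaution is to serialize every call into the library behind a single lock. The following is a rough sketch under that assumption, not what `haskell-pdftotext` actually does; `PopplerLock`, `newPopplerLock`, and `withPopplerLock` are hypothetical names.

```haskell
import Control.Concurrent.MVar (MVar, newMVar, withMVar)

-- A single lock meant to guard every call into a library that is
-- assumed not to be thread-safe.
newtype PopplerLock = PopplerLock (MVar ())

newPopplerLock :: IO PopplerLock
newPopplerLock = PopplerLock <$> newMVar ()

-- Run an action while holding the lock. 'withMVar' releases the lock
-- again even if the action throws a Haskell exception.
withPopplerLock :: PopplerLock -> IO a -> IO a
withPopplerLock (PopplerLock m) action = withMVar m (const action)
```

Note that a lock like this would not cover `ForeignPtr` finalizers, which the garbage collector runs at a time of its own choosing.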

Thanks!

G. Eyaeb
Details
Message ID
<uCTyvGH-uDifXHr5p34XRwLn4UAEQq8PoRDBZ5ecBfh2TZI9XuVpYH66DY2UmdFZdjey7kFt5DcFmv3rPn8ri-ahmYA7FKyKn0h-3dGVcA8=@protonmail.com>
In-Reply-To
<CAHJ505Fv7Be1vyKvDku6Qof4Jf9R1t1+Um+urP8EobNSbD7spg@mail.gmail.com> (view parent)
Hello!

Thanks for all your input, which was very helpful.

Could you please try the newly published version 0.1.0.1? It is on Hackage.
I believe it will solve the issue with segmentation faults.

https://hackage.haskell.org/package/pdftotext-0.1.0.1

Thank you.
Details
Message ID
<CAHJ505GWnHqGoEPYb4HJkPAPM+=X8eHocydwExdcs+WWXvVG=Q@mail.gmail.com>
In-Reply-To
<uCTyvGH-uDifXHr5p34XRwLn4UAEQq8PoRDBZ5ecBfh2TZI9XuVpYH66DY2UmdFZdjey7kFt5DcFmv3rPn8ri-ahmYA7FKyKn0h-3dGVcA8=@protonmail.com> (view parent)
sorry just getting back to this, but yep, your fix seems to have
worked.  thanks!

i couldn't install from stack, still had to make a local copy of the
package.  i don't know enough to know if that's expected cuz i'm on
lts-16.13 or what.

and ran into the iconv linker problem again, same fix worked.  i don't
think i see a note in the readme you mentioned adding...

thanks again!
-e