Adv Software Engineering: PDF Tools and Docs
These are my notes on PDF tools and libraries.
1 PDF Specification
- PDF Specification as developed by Adobe and others: http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf
- http://www.printmyfolders.com/Home/understanding-the-portable-document-format-pdf-part-2
- http://www.pdfa.org/
- Verified against many of the "real world PDF documents" that (slightly) deviate from the spec.
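One way real-world files deviate from the spec is in the file envelope: the `%PDF-` header may be preceded by stray bytes, and the `%%EOF` marker may be followed by trailing data. The sketch below contrasts a strict check with a lenient one. The class and method names are hypothetical, and the 1024-byte search window is an assumption modeled on the tolerance commonly attributed to Acrobat-style readers, not something these notes verified.

```java
import java.nio.charset.StandardCharsets;

/** Sketch: strict vs. lenient checks of the PDF header and end-of-file marker. */
final class PdfEnvelopeCheck {

    // Assumption: readers tolerate the markers anywhere within this many bytes.
    private static final int WINDOW = 1024;

    /** Strict: the file starts with "%PDF-" at byte 0. */
    static boolean hasStrictHeader(byte[] file) {
        return latin1(file, 0, Math.min(file.length, 5)).equals("%PDF-");
    }

    /** Lenient: "%PDF-" appears somewhere in the first WINDOW bytes. */
    static boolean hasLenientHeader(byte[] file) {
        return latin1(file, 0, Math.min(file.length, WINDOW)).contains("%PDF-");
    }

    /** Lenient: "%%EOF" appears somewhere in the last WINDOW bytes. */
    static boolean hasEofMarker(byte[] file) {
        int from = Math.max(0, file.length - WINDOW);
        return latin1(file, from, file.length - from).contains("%%EOF");
    }

    // ISO-8859-1 maps every byte to exactly one char, so binary data decodes safely.
    private static String latin1(byte[] data, int offset, int length) {
        return new String(data, offset, length, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] clean = "%PDF-1.7\n1 0 obj\n<< >>\nendobj\n%%EOF\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] junkFirst = "junk bytes\n%PDF-1.4\n%%EOF\ntrailing"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("clean strict header:      " + hasStrictHeader(clean));
        System.out.println("junk-first strict header: " + hasStrictHeader(junkFirst));
        System.out.println("junk-first lenient header: " + hasLenientHeader(junkFirst));
        System.out.println("junk-first EOF marker:     " + hasEofMarker(junkFirst));
    }
}
```

A file that passes only the lenient checks is a candidate for the "slightly deviating" set when testing a library.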
2 Libraries
- All do rendering, and edits, …
- PDFBox library now with Apache. https://pdfbox.apache.org/. Handles all PDF features: attachments, images, video, sound, text, custom fonts, links, etc. sloccount java: 130233 (98.11%)
- iText - Java PDF library. https://github.com/itext/itext7/releases/tag/7.0.0 "iText 7 Community comes with the AGPL license. To enjoy the full scope of iText 7 support and its powerful add-ons, you need a commercial license." http://itextpdf.com/itext7suite. sloccount java: 112591 (30.82%) xml: 252674 (69.18%)
- http://stackoverflow.com/questions/22340674/performance-itext-vs-pdfbox Reports PDFBox as much slower than iText.
- PDFClown https://pdfclown.org/ is a PDF library that helps to generate, read, and edit PDF documents, and to split and merge them. sloccount: Java tree java=48735, xml=532, sh=2; .NET tree (47993 total) cs=47988, sh=5. Unclear whether the cs (C#) code duplicates the functionality of the Java code.
- PDF Renderer - A Java library that renders PDF documents to the screen using Java2D, in a Swing panel. It can view PDFs, convert them to PNG, display a PDF in a 3D scene, and show a print preview. It does not support creating or manipulating PDFs. https://java.net/projects/pdf-renderer 2011
- jPod - PDF manipulating and rendering framework. https://sourceforge.net/projects/jpodlib/ jPod is our "cleanroom" implementation of the PDF spec. http://opensource.intarsys.de/home/en/index.php?n=OpenSource.HomePage sloccount /usr/local/src/OpenSource/jPod.5.6 java: 45221 (100.00%)
- PDFJet - PDF library for Java and .NET http://pdfjet.com/ Not FOSS.
- The Flying Saucer XHTML renderer project has support for outputting XHTML to PDF.
- JPedal https://www.idrsolutions.com/jpedal/ Not FOSS.
- ICEpdf is an open source Java PDF engine for viewing, printing, and manipulating PDF documents. The ICEpdf API is 100% Java-based, lightweight, fast, efficient, and very easy to use. http://www.icesoft.org/java/projects/ICEpdf/overview.jsf Needed to register before I could download the src. ICEpdf Open Source Releases (no font engine)
- PDFSharp is a .NET library for creating and modifying PDF documents. https://sourceforge.net/projects/pdfsharp/ sloccount cs: 85536 (99.02%)
- https://poppler.freedesktop.org/releases.html
- http://java-source.net/open-source/pdf-libraries
- https://github.com/CeON/CERMINE/ "CERMINE is a Java library and a web service (cermine.ceon.pl) for extracting metadata and content from PDF files containing academic publications. CERMINE is written in Java at Centre for Open Science at Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw."
3 Tools
- Edit a pdf file into another.
- Add/delete paragraphs.
- Split a pdf file into several, e.g., by page.
- Extract text; accuracy is important.
- Annotations/ Scribbles
- Search and position a cursor
- Not expected: editing a PDF file as if it were an MS Word document. E.g., replacing one word with another, or inserting/deleting a word, is not expected to be available.
- http://www.tracker-software.com/product/pdf-xchange-editor Windows. I am a regular user.
- https://www.xodo.com/ Multi-platform, including a web-browser-based version. I am a regular user on Android.
- % sloccount pdfedit-0.4.5/ cpp: 156685 (92.94%)
- % sloccount epub2pdf-0.5-src java: 4703 (100.00%). The source bundle includes an older version of iText; the sloccount figure does not include the iText code. http://epub2pdf.com/ 2010
- http://labs.crossref.org/pdfextract/. Extract bibliographical info. Download and begin to experiment. Binary distribution. Not source code.
- http://www.idrsolutions.com/jpedalfx-viewer/
- http://stackoverflow.com/questions/18207116/displaying-pdf-in-javafx
- https://github.com/torutk/pdfviewer Simple PDF viewer and utilities using JavaFX and PDFBox
- https://github.com/james-d/PdfViewer 2013 Simple PDF Viewer in JavaFX using the PDFRenderer library
- http://www.programcreek.com/java-api-examples/index.php?api=org.apache.pdfbox.pdfviewer.PDFTreeModel
- https://sourceforge.net/projects/jpview/?source=directory JPview - Java PDF Viewer A small PDF Reader, PDF Viewer for Java. JPview is developed using Java, Eclipse SWT, jPod intarsys PDF rendering library and runs on a 32-bit Java Virtual Machine.
- https://vslavik.github.io/diff-pdf/ diff-pdf is a tool for visually comparing two PDFs.
- http://ipdfdev.com/2016/04/13/debug-pdf/ 5 apps to debug PDF files iPDFdev - Tips & Tricks for PDF development
- http://www.pdflite.com/ Source code based on the Sumatra PDF project. PDF lite is a free and open source PDF viewer and PDF printer. You can convert any document or image to a PDF file: doc to PDF and jpg to PDF. sloccount ansic: 718363 (79.33%) cpp: 145996 (16.12%) asm: 19038 (2.10%) python: 12631 (1.39%) sh: 9035 (1.00%) cs: 477 (0.05%)
- https://github.com/sumatrapdfreader/sumatrapdf/wiki/ pages: Contribute-code; Some-notes-for-developers-new-to-Sumatra. If you need a Win32 refresher: http://zetcode.com/gui/winapi/ http://www.winprog.org/tutorial/ http://www.relisoft.com/win32/ http://www.catch22.net/tuts
- https://en.wikipedia.org/wiki/MuPDF written in C http://git.ghostscript.com/?p=mupdf.git;a=summary; https://f-droid.org/repository/browse/?fdid=com.artifex.mupdfdemo
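For the "split a PDF file into several, e.g., by page" feature listed at the top of this section, the library-independent part is deciding which pages go into which output file. The sketch below computes only that plan, using the plain Java standard library. `PageSplitPlan` and `splitRanges` are hypothetical names, and copying the pages would still need one of the libraries from section 2.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: plan how to split a document into parts of a fixed number of pages. */
final class PageSplitPlan {

    /**
     * Returns 1-based, inclusive {firstPage, lastPage} pairs, one per output file.
     * The last part may be shorter than pagesPerPart.
     */
    static List<int[]> splitRanges(int pageCount, int pagesPerPart) {
        if (pageCount < 0 || pagesPerPart < 1) {
            throw new IllegalArgumentException("pageCount >= 0 and pagesPerPart >= 1 required");
        }
        List<int[]> ranges = new ArrayList<>();
        // long arithmetic avoids int overflow for very large arguments
        for (long first = 1; first <= pageCount; first += pagesPerPart) {
            long last = Math.min(first + pagesPerPart - 1, pageCount);
            ranges.add(new int[] {(int) first, (int) last});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Splitting a 5-page document "by page" is pagesPerPart = 1;
        // here pairs of pages are used to show the shorter final part.
        for (int[] r : splitRanges(5, 2)) {
            System.out.println("pages " + r[0] + "-" + r[1]);
        }
    }
}
```

Keeping the plan separate from the copying step makes it easy to compare how different libraries behave on the same split.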
4 Techniques
- From stackoverflow.com (answer to the iText vs PDFBox performance question linked in section 2):
- One major difference is that PDFBox always processes text glyph by glyph while iText normally processes it chunk (i.e. single string parameter of text drawing operation) by chunk; that reduces the required resources in iText quite a lot. Furthermore the event oriented architecture of iText text parsing means a lower burden on resources than that of PDFBox. And PDFBox keeps information not strictly required for plain text extraction available for a longer time, costing more resources.
- The way the libraries initially load the document also makes a difference. Here you can experiment a bit, PDFBox not only offers multiple PDDocument.load overloads but also some PDDocument.loadNonSeq overloads (actually PDDocument.loadNonSeq reads documents correctly while PDDocument.load can be tricked to misinterpret PDFs). All these different variants may have different runtime behavior.
- More about how strategies affect performance. iText brings along a simple and a more advanced text extraction strategy. The simple one assumes text in the page content stream to appear in reading order while the more advanced one sorts. By default the more advanced one is used. Thus, you probably can speed up iText even some more by using the simple strategy. PDFBox always sorts.
- lower level (PDF object model) to the higher (PDF document structure and content streaming).
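The difference between a simple and a sorting text extraction strategy, as described in the notes above, can be sketched without either library. This is only an illustration in plain Java: `Chunk`, `extractInStreamOrder`, and `extractSorted` are hypothetical names, not iText or PDFBox API, and the sort compares exact baseline coordinates where a real implementation would need tolerances.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Sketch: content-stream order vs. position-sorted text extraction. */
final class ExtractionStrategySketch {

    /** Hypothetical positioned text run; in PDF user space y grows upward. */
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Simple strategy: trust the order in which chunks appear in the stream. */
    static String extractInStreamOrder(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.text);
        }
        return sb.toString();
    }

    /** Sorting strategy: top-to-bottom, then left-to-right, at extra cost. */
    static String extractSorted(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y).reversed()
                .thenComparingDouble(c -> c.x));
        return extractInStreamOrder(sorted);
    }

    public static void main(String[] args) {
        // A page whose content stream is not in reading order.
        List<Chunk> page = Arrays.asList(
                new Chunk("world", 120, 700),
                new Chunk("Second line", 72, 680),
                new Chunk("Hello", 72, 700));
        System.out.println(extractInStreamOrder(page)); // world Second line Hello
        System.out.println(extractSorted(page));        // Hello world Second line
    }
}
```

The simple strategy is a single pass, while the sorting one copies and sorts every chunk on the page, which is where the speed difference mentioned above would come from. The simple one is only correct when the producer wrote the text in reading order.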