Spring 2017 Project Work
1 Project for Spring 2017: Batch Rename PDF files.
- From each PDF file, we extract the following:
- The subject classification. Example from the last class: "Software Comprehension"
- The year of publication.
- List of author names (First Middle Last)
- Name of conference or journal of publication
- This sounds like a wicked problem, but it is not.
- Some of this info may be missing in the PDF.
- Rename a file using the extracted elements
- Format control of the resulting string?
- See an example: https://www.id3renamer.com/moreinfo
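One possible way to give the user "format control" over the new file name is a template string with named placeholders. The following is only a sketch under stated assumptions: the class name `RenameFormat`, the `{field}` placeholder syntax, and the "Unknown" fallback for missing info are all hypothetical choices, not decisions made for the project.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: builds a file name from a template such as "{year} - {topic}.pdf". */
final class RenameFormat {
    // A placeholder is a field name in braces, e.g. {year}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    /** Fills each placeholder; a missing or blank field becomes "Unknown" (an assumption). */
    static String format(String template, Map<String, String> fields) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = fields.get(m.group(1));
            if (value == null || value.trim().isEmpty()) {
                value = "Unknown";
            }
            // quoteReplacement keeps '$' and '\' in the value from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(sanitize(value)));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Replaces characters that are not allowed in file names on common file systems. */
    static String sanitize(String s) {
        return s.replaceAll("[\\\\/:*?\"<>|]", "_").trim();
    }
}
```

With the fields `year = "2016"` and `topic = "Software Comprehension"`, the template `"{year} - {topic} - {authors}.pdf"` would yield `2016 - Software Comprehension - Unknown.pdf`, because no authors were supplied.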
1.1 Background
- Overview of PDF tools etc.
- cf. http://labs.crossref.org/pdfextract/. Download it and begin to experiment. It is a binary distribution, not source code.
- Get GitHub accounts. Free. All source code and documents are expected to be on GitHub.
- https://github.com/CeON/CERMINE "is a Java library and a web service (cermine.ceon.pl) for extracting metadata and content from PDF files containing academic publications. CERMINE is written in Java at Centre for Open Science at Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw."
- http://jats.nlm.nih.gov/archiving/tag-library/1.1/ Journal Archiving and Interchange Tag Library NISO JATS Version 1.1 (ANSI/NISO Z39.96-2015) December 2015
1.2 Toward the Development of Requirements
- Immediate Goal: Develop Requirements, and document them as expected in the Projects page. This is team work. Due date: by next Monday's class?
- Important subsystem #1: Develop a "virtual machine" description suitable for our project. The product of this effort is an API. E.g.,
fi = openPDF(pathName); y = extractYear(fi); s = extractSubject(fi); closePDF(fi);
- Suggestion: Each of you should gather, say, 5 academic papers and study where in each paper the info that we wish to extract appears.
- Important subsystem #2: Do a mock-up of a GUI that aids in using the "Rename PDF Files" tool.
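To make the "virtual machine" API of subsystem #1 more concrete, here is one way the `openPDF` / `extractYear` / `extractSubject` / `closePDF` sequence might look in Java. This is a sketch, not the agreed API: the names `PdfHandle` and `TextBackedPdf` are hypothetical, the `Optional` return type is an assumption that reflects that some info may be missing, and the stand-in works on plain text so that it runs without a PDF library. A real implementation would obtain the text from iText or PDFBox. Reading the subject from a "Keywords:" line is likewise only an assumed heuristic.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical Java shape of the API; open/close map onto try-with-resources. */
interface PdfHandle extends AutoCloseable {
    Optional<String> extractYear();
    Optional<String> extractSubject();
    @Override
    void close();
}

/** Stand-in backed by already-extracted text (assumption: no PDF library available here). */
final class TextBackedPdf implements PdfHandle {
    private static final Pattern YEAR = Pattern.compile("\\b(?:19|20)\\d{2}\\b");
    private static final Pattern KEYWORDS =
            Pattern.compile("(?im)^\\s*keywords\\s*[:-]\\s*(.+)$");
    private String text;

    TextBackedPdf(String text) {
        this.text = text;
    }

    /** Returns the first four-digit number that looks like a year (1900-2099), if any. */
    public Optional<String> extractYear() {
        Matcher m = YEAR.matcher(requireOpen());
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    /** Returns the first entry of a "Keywords:" line, if such a line exists. */
    public Optional<String> extractSubject() {
        Matcher m = KEYWORDS.matcher(requireOpen());
        if (!m.find()) {
            return Optional.empty();
        }
        String first = m.group(1).split("[;,]", 2)[0].trim();
        return first.isEmpty() ? Optional.empty() : Optional.of(first);
    }

    public void close() {
        text = null;
    }

    private String requireOpen() {
        if (text == null) {
            throw new IllegalStateException("document is closed");
        }
        return text;
    }
}
```

A caller would write `try (PdfHandle fi = new TextBackedPdf(text)) { ... }`, so the handle is closed even if an extract operation throws.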
1.3 Life Cycle
- Imagine that Dr PM is the client commissioning this software.
- Requirements will (or at least should) become clear as we progress.
- Bottom Line: We will adjust the scope of the project as we go along.
1.4 Example of PDF Annotation
- Annotated file: ./pdfextract-CrossRefLabs.pdf
- File flattened after annotation: ./pdfextract-CrossRefLabs-Flattened.pdf
2 Notes
- Notes on logistics ./note20170318.html
3 Domain Analysis
3.1 pdfRenamer Names of Fields To Extract
This list is incomplete.
- Topic: The subject/topic classification. Ex: "buffer overflow exploit" "software engineering" "President Obama's lectures"
- Year: The year of publication. Ex: "2016"
- Authors: Each author name has the form [First Middle* Last], that is, a first name, zero or more middle names, and a last name. The list of author names is comma-separated. Ex: Umme Ayda Mannan, Iftekhar Ahmed, Rana Abdullah M Almurshed, Danny Dig, Carlos Jensen. This example has 5 authors.
- JournalConf: Name of journal or conference. Ex: "Journal of ACM" "USENIX Symposium on Networked Systems Design and Implementation"
- Pages: Page numbers of the PDF. Ex: "32 - 54"
- Kind: PhD-thesis/ MS-thesis/ Tech-Report/ Manual/ Paper-Regular/ Slides/ Book/ Proceedings/ Unknown.
- Location: City and Country of Publication. Ex: "London, UK" "New York, NY, USA"
- Volume: Usually a number. Sometimes a name. Ex: "25"
- Issue: Usually a number. Sometimes a name. Ex: "3" "March"
- Publisher: Publisher or organization. Ex: "Springer Verlag" "IEEE"
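The Authors field has the most internal structure of the fields above, so a small parsing sketch may help. It assumes the comma-separated [First Middle* Last] form described in the list. The class name `Author` and its methods are hypothetical, and rejecting single-word names is an assumption that the project may want to relax.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical value type for one author name of the form First Middle* Last. */
final class Author {
    final String first;
    final List<String> middle; // zero or more middle names
    final String last;

    private Author(String first, List<String> middle, String last) {
        this.first = first;
        this.middle = middle;
        this.last = last;
    }

    /** Splits one name on whitespace: first token, last token, everything between is middle. */
    static Author parse(String name) {
        String[] parts = name.trim().split("\\s+");
        if (parts.length < 2) {
            throw new IllegalArgumentException("expected at least First Last: '" + name + "'");
        }
        List<String> middle =
                new ArrayList<>(Arrays.asList(parts).subList(1, parts.length - 1));
        return new Author(parts[0], Collections.unmodifiableList(middle),
                parts[parts.length - 1]);
    }

    /** Parses a comma-separated list of names, skipping empty entries. */
    static List<Author> parseList(String commaSeparated) {
        List<Author> authors = new ArrayList<>();
        for (String piece : commaSeparated.split(",")) {
            if (!piece.trim().isEmpty()) {
                authors.add(parse(piece));
            }
        }
        return authors;
    }
}
```

Applied to the example list in the Authors entry, `Author.parseList` returns 5 authors. The third has first name "Rana", middle names "Abdullah" and "M", and last name "Almurshed".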
3.2 Task Assignment Mar 21, 2017
- The number before each field name is its position in the list of Section 3.1.
- Apoorva: 2 Year, 10 Publisher
- Jayanth: 3 Authors, 9 Issue
- Jonathan: 4 JournalConf, 1 Topic
- Sarker: 5 Pages, 7 Location
- Stephanie: 6 Kind, 8 Volume
- Practice test-driven development (TDD), using either the iText or the PDFBox library
- What is due: A jar file containing the extract operations.
- Due Date: Fri Mar 24, 2017 11:59 PM
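As an illustration of the TDD style for one extract operation, the sketch below pairs a few checks for the Pages field with an implementation just large enough to pass them. Everything here is an assumption made for illustration: the names `PagesExtractor` and `PagesExtractorTest` are hypothetical, the input is text already pulled out of the PDF (by iText or PDFBox in the real project), and plain Java checks stand in for a test framework such as JUnit to keep the example dependency-free. The regular expression is a heuristic and can be fooled by other hyphenated numbers.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical extract operation for the Pages field, working on already-extracted text. */
final class PagesExtractor {
    // Two numbers of up to 5 digits joined by a hyphen or an en dash.
    private static final Pattern RANGE =
            Pattern.compile("(?<!\\d)(\\d{1,5})\\s*[-\\u2013]\\s*(\\d{1,5})(?!\\d)");

    /** Returns the first ascending range, normalized as "from - to", if one is found. */
    static Optional<String> extractPages(String text) {
        Matcher m = RANGE.matcher(text);
        while (m.find()) {
            int from = Integer.parseInt(m.group(1));
            int to = Integer.parseInt(m.group(2));
            if (from <= to) {
                return Optional.of(from + " - " + to);
            }
        }
        return Optional.empty();
    }
}

/** The checks that, in TDD, would be written before the extractor itself. */
final class PagesExtractorTest {
    static void check(boolean ok, String message) {
        if (!ok) {
            throw new AssertionError(message);
        }
    }

    static void runAll() {
        check(PagesExtractor.extractPages("pp. 32-54").equals(Optional.of("32 - 54")),
                "plain hyphen");
        check(PagesExtractor.extractPages("Vol. 25, pp. 32 \u2013 54")
                .equals(Optional.of("32 - 54")), "en dash with spaces");
        check(!PagesExtractor.extractPages("no page numbers here").isPresent(),
                "missing pages yield an empty result");
        check(!PagesExtractor.extractPages("pp. 54-32").isPresent(),
                "descending range is rejected");
    }
}
```

Calling `PagesExtractorTest.runAll()` throws an `AssertionError` naming the first failing case and returns normally when all checks pass.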