Opened 10 years ago
Last modified 10 years ago
#47574 new request
port request: 'tabula' and 'tabula-extractor'
Reported by: | KurtPfeifle (Kurt Pfeifle) | Owned by: | macports-tickets@… |
---|---|---|---|
Priority: | Normal | Milestone: | |
Component: | ports | Version: | |
Keywords: | | Cc: | |
Port: | | | |
Description
The Tabula project's self-description is quite telling and appropriate:
"Tabula is a tool for liberating data tables trapped inside PDF files."
Here is the link to the sources:
Extracting tables from PDF pages into a usable spreadsheet format is extremely difficult. Here is some background information:
Given the scope of this task, Tabula works extremely well.
The Tabula family of tools is written in Ruby. In the background they use PDFBox (which is written in Java) and a few other third-party libraries. Running the command line tool tabula, hosted in the Tabula-Extractor repository, requires JRuby 1.7 to be installed. My JRuby is the MacPorts version.
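The JRuby requirement can be checked before trying the tool. The following is only a minimal sketch: is_jruby_17 is a hypothetical helper name, and it assumes that the first line printed by jruby --version begins with "jruby" followed by the version number.

```shell
#!/bin/sh
# is_jruby_17 VERSION_LINE
# Hypothetical helper: succeeds if the given version line reports JRuby 1.7.x.
# Assumes the line has the form "jruby 1.7.N (...)".
is_jruby_17() {
    case "$1" in
        "jruby 1.7."*) return 0 ;;
        *) return 1 ;;
    esac
}

if command -v jruby >/dev/null 2>&1; then
    # Only the first line of the version output is inspected.
    version_line=$(jruby --version 2>/dev/null | head -n 1)
    if is_jruby_17 "$version_line"; then
        echo "OK: $version_line"
    else
        echo "Found JRuby, but not a 1.7.x release: $version_line" >&2
    fi
else
    echo "jruby not found on PATH" >&2
fi
```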
I have been able to run tabula directly from the cloned Git repository:
mkdir ~/svn-stuff
cd ~/svn-stuff
git clone https://github.com/tabulapdf/tabula-extractor.git git.tabula-extractor
The Git clone already includes the required libraries, so there is no need to install PDFBox separately.
The command line tool is in the bin/ subdirectory of the clone.
Exploring the command line options:
~/svn-stuff/git.tabula-extractor/bin/tabula -h
Tabula helps you extract tables from PDFs

Usage:
  tabula [options] <pdf_file>
where [options] are:
  --pages, -p <s>:          Comma separated list of ranges, or all. Examples: --pages 1-3,5-7, --pages 3 or --pages all. Default is --pages 1 (default: 1)
  --area, -a <s>:           Portion of the page to analyze (top,left,bottom,right). Example: --area 269.875,12.75,790.5,561. Default is entire page
  --columns, -c <s>:        X coordinates of column boundaries. Example --columns 10.1,20.2,30.3
  --password, -s <s>:       Password to decrypt document. Default is empty (default: )
  --guess, -g:              Guess the portion of the page to analyze per page.
  --debug, -d:              Print detected table areas instead of processing.
  --format, -f <s>:         Output format (CSV,TSV,HTML,JSON) (default: CSV)
  --outfile, -o <s>:        Write output to <file> instead of STDOUT (default: -)
  --spreadsheet, -r:        Force PDF to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet)
  --no-spreadsheet, -n:     Force PDF not to be extracted using spreadsheet-style extraction (if there are ruling lines separating each cell, as in a PDF of an Excel spreadsheet)
  --silent, -i:             Suppress all stderr output.
  --use-line-returns, -u:   Use embedded line returns in cells. (Only in spreadsheet mode.)
  --version, -v:            Print version and exit
  --help, -h:               Show this message
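As an illustration of how these options combine, here is a small wrapper sketch. It uses only flags listed in the help output (--pages, --format, --outfile) and the clone location from this ticket. The function name extract_tables and the file names in the usage comment are hypothetical, and the wrapper has not been tested against every Tabula release.

```shell
#!/bin/sh
# Location of the tool inside the Git clone described in this ticket.
# Override by exporting TABULA if the repository was cloned elsewhere.
TABULA="${TABULA:-$HOME/svn-stuff/git.tabula-extractor/bin/tabula}"

# extract_tables PDF OUT
# Hypothetical helper: extracts the tables from all pages of PDF and
# writes them to OUT in CSV format.
extract_tables() {
    pdf=$1
    out=$2
    if [ ! -x "$TABULA" ]; then
        echo "tabula not found or not executable: $TABULA" >&2
        return 127
    fi
    if [ ! -f "$pdf" ]; then
        echo "no such PDF file: $pdf" >&2
        return 1
    fi
    "$TABULA" --pages all --format CSV --outfile "$out" "$pdf"
}

# Example call (file names are placeholders):
#   extract_tables report.pdf report.csv
```

Keeping the path in a variable means the same wrapper would keep working if a future port installed tabula somewhere else.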
Change History (1)
comment:1 Changed 10 years ago by mf2k (Frank Schima)
Keywords: | PDF table csv tsv spreadsheet removed |
---|