TODO
- Fix hardcoded temporary directory, clean up after upload
- Test some PDF files
- Add a wrapper graph which takes an object in, enriches then sends out again
- Integrate with AMQP, add as worker in thegrid-apis
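
A minimal sketch of the temporary-directory fix from the first item above. It is written in Python purely for illustration and is not the project's code; `process_with_temp_dir`, `extract`, and `upload` are hypothetical names. The point is that the working directory is created per job instead of being hardcoded, and is removed once the upload finishes, including when it fails.

```python
import tempfile
from pathlib import Path


def process_with_temp_dir(extract, upload):
    """Run extract(workdir), upload every file it produced, then delete workdir.

    `extract` and `upload` are hypothetical callables supplied by the caller:
    extract(workdir) writes the HTML and images into workdir, and
    upload(path) sends one file to storage and returns a result for it.
    """
    # TemporaryDirectory picks a unique path and removes it on exit,
    # even when extract() or upload() raises.
    with tempfile.TemporaryDirectory(prefix="docextract-") as tmp:
        workdir = Path(tmp)
        extract(workdir)
        files = sorted(p for p in workdir.rglob("*") if p.is_file())
        return [upload(p) for p in files]
```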
Later
- Test how much faster the Tika Java API is than the CLI tools at XHTML + image extraction
- Avoid temporary files for images+html output if/when passing to NoFlo
Setup
Configuration is passed as environment variables:
AMAZON_API_ID: Amazon S3 API identifier
AMAZON_API_TOKEN: Amazon S3 API token/secret
AMAZON_API_REGION: Amazon S3 region, e.g. 'us-west-2'
AMAZON_API_BUCKET: Amazon S3 bucket for uploaded files, e.g. 'thegrid-user-content'
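
As a rough sketch of how a worker might read these settings at startup, here is an illustrative Python snippet (not the project's code). The variable names come from the list above; `load_config` and `REQUIRED_VARS` are hypothetical names, and failing fast on a missing variable is an assumption about the desired behaviour.

```python
import os

# Variable names are taken from the list above.
REQUIRED_VARS = (
    "AMAZON_API_ID",
    "AMAZON_API_TOKEN",
    "AMAZON_API_REGION",
    "AMAZON_API_BUCKET",
)


def load_config(env=None):
    """Return the S3 settings as a dict; raise if any variable is unset or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_config()` with no argument reads the process environment; passing a dict makes the function easy to exercise in tests.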
Design
Separate Heroku worker, integrated into TheGrid APIs.
Inputs:
- URL to an S3-backed document (Word, PDF)
Outputs:
- Extracted HTML, with each img src referring to the S3 backend
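
To illustrate the output contract, here is a hedged Python sketch (not the project's code) that rewrites image references in extracted XHTML so they point at S3. Assumptions: the extractor emits well-formed XHTML, each `src` is a relative file name, and the images have already been uploaded under `key_prefix`. `s3_url`, `rewrite_img_sources`, and `key_prefix` are hypothetical names; the URL follows S3's virtual-hosted style.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

XHTML_NS = "http://www.w3.org/1999/xhtml"


def s3_url(bucket, region, key):
    """Build a virtual-hosted-style S3 URL for an object key."""
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"


def rewrite_img_sources(xhtml, bucket, region, key_prefix):
    """Point every relative <img src> in an XHTML string at the S3 bucket."""
    # Serialize XHTML elements without an "ns0:" prefix.
    ET.register_namespace("", XHTML_NS)
    root = ET.fromstring(xhtml)
    for element in root.iter():
        # Match <img> with or without the XHTML namespace.
        if element.tag in ("img", f"{{{XHTML_NS}}}img"):
            src = element.get("src")
            # Assumption: anything containing "://" is already absolute.
            if src and "://" not in src:
                element.set("src", s3_url(bucket, region, key_prefix + src))
    return ET.tostring(root, encoding="unicode")
```

Parsing as XML rather than using a regular expression only works because the input is assumed to be well-formed XHTML; plain HTML would need an HTML parser instead.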
Notes:
- Tika provides full XHTML document, whereas Embed.ly gives only (and we expect)