OpenTaxForms opens and automates US tax forms--it reads PDF tax forms
(currently from IRS.gov only, not state forms), converts them to more
feature-full HTML5, and offers a database and API for developers to
create their own tax applications. The converted forms will be available
to test (and ultimately use) at
OpenTaxForms.org <http://OpenTaxForms.org/>
__.
pdf2svg <https://github.com/dawbarton/pdf2svg>
__
-
Github
code <https://github.com/jsaponara/opentaxforms/>
__- [issue tracker link forthcoming]
- [milestones link forthcoming]
-
Build status
|Build Status|
-
Form status
The script reports a status for each form. Current status categories
are:
- layout means textboxes and checkboxes--they should not overlap.
- refs are references to other forms--they should all be recognized
(ie, in the list of all forms).
- math is the computed fields and their dependencies--each computed
field should have at least one dependency, or else what is it
computed from?
Each status error has a corresponding warning in the log file, so
they're easy to find. Each bugfix will likely reduce errors across
many forms.
1040 form status listing <https://opentaxforms.org/pages/status-form-1040-family-and-immediate-references.html>
__
-
API
The ReSTful API is read-only and provides a complete accounting of
form fields: data type, size and position on page, and role in field
groupings like dollars-and-cents fields, fields on the same line,
fields in the same table, fields on the same page, and fields
involved in the same formula. The API will also provide status
information and tester feedback for each form.
[API docs forthcoming, for now see examples in
test/run_apiclient.sh]
-
How it works
Most of the IRS tax forms embed all the fillable field information in
the XML Forms Architecture <https://en.wikipedia.org/wiki/XFA>
__
(XFA) format. The OpenTaxForms python <https://www.python.org/>
__
script extracts the XFA from each PDF form, and parses out:
- relationships among fields (such as dollar and cent fields; fields
on the same line; columns and rows of a table).
- math formulas, including which fields are computed vs user-entered
(such as "Subtract line 37 from line 35. If line 37 is greaterthan
line 35, enter -0-").
- references to other forms
All this information is stored in a database (optionally
PostgreSQL <https://www.postgresql.org/>
__ or the default
sqlite <https://sqlite.org/>
) and served according to a
ReSTful <https://en.wikipedia.org/wiki/Representational_state_transfer>
API. For each tax form page, an html form (with javascript to express
the formulas) is generated and overlaid on an svg rendering of the
original PDF. The javascript saves all user inputs to local/web
storage in the browser via
basil.js <https://wisembly.github.io/basil.js/>
. When the page is
loaded, those values are retrieved. Values are keyed by tax year,
form number (eg 1040), and XFA field id (and soon taxpayer name now
that I do my kids' taxes too). Testers will annotate the page image
with boxes and comments via
annotorious.js <http://annotorious.github.io/>
. A few of the 900+
IRS forms don't have embedded XFA (such as 2016 Form 1040 Schedule
A). Eventually those forms may be updated to contain XFA, but until
then, the best automated approach is probably
OCR <link:https://en.wikipedia.org/wiki/Optical_character_recognition>
__
(optical character recognition). OCR may be a less fool-proof
approach in general, especially for state (NJ, NY, etc) forms, which
generally are not XFA-based.
-
To do
- Move lower-level ToDo items to github/issues.
- Refactor toward a less script-ish architecture that will scale to
more developers. [architecturePlease]
- Switch to a pdf-to-svg converter that preserves text (rather than
converting text to paths), perhaps pdfjs, so that testers can
easily copy and paste text from forms. [copyableText]
- Should extractFillableFields.py be a separate project called
xfadump? This might provide a cleaner target output interface for
an OCR effort. [xfadump]
- Replace allpdfnames.txt with a more detailed form dictionary via a
preprocess step. [formDictionary]
- Offer entire-form html interface (currently presenting each page
separately). [formAsSingleHtmlPage]
- Incorporate instructions and publications, especially extracting
the worksheets from instructions. [worksheets]
- Add the ability to process US state forms. [stateForms]
- Fix countless bugs, especially in forms that contain tables (see
[issues])
- Don't seek in a separate file a schedule that occurs within a
form. [refsToEmbeddedSchedules]
- Separate dirName command line option into
pdfInputDir,htmlOutputDir. [splitIoDirs]
0.5.7 (2017-02-24)
- Simplify form extraction from PDF [by perimosocordiae].
- cmds: add a 'divide' command.
- cmds: treat constant as 'enter' command on field with 'realunit' of
'ratio'.
- cmds: temporary: treat ratio field pair [integer and decimal fields]
as dollars and cents fields.
- form: temporary: suppress the math for checkbox fields.
- Form: bugfix: fix 'multiple repeats' regex error via ut.compactify.
- improve Form naming, including bugfix in main.indicateProgress.
- bugfixes for py3 compatibility:
- Form: remove unicode calls.
- config: exception not iterable.
- link: seek bytes, not string.
- linting: remove re-import.
- ut: set log level of root logger.
- bugfix: constant [pre-filled] field converted to cents.
- bugfix: restored absolute positioning in transition from background
image to img element.
- refs: add 'section' to nonformcontexts.
- refs: add schedulecontext for more accurate refs to schedules.
- refs: add inSameFamily rule for assigning schedules to forms.
- refs: suppress unicode punctuation characters in variables later used
for filename.
- Re-enable test_1040_xfa
0.5.6 (2017-02-20)
- add py35 testenv to tox setup.
- Simplify form extraction from PDF [by perimosocordiae].
- Make sure os is imported before use [by perimosocordiae].
- Many minor tweaks [by perimosocordiae].
- Merge branch 'avoid-dumppdf' of
git://github.com/perimosocordiae/opentaxforms into
perimosocordiae-avoid-dumppdf
- Merge branch 'minor-tweaks' of
git://github.com/perimosocordiae/opentaxforms into
perimosocordiae-minor-tweaks
- README henceforth will be created .before. final commit of each
release.
0.5.5 (2017-02-18)
- port to python3--thanks perimosocordiae! [by perimosocordiae]
- windows port: if symlinking fails, copy files instead.
- windows port: replace slashes in file paths w/ calls to os.path.join.
- test: removed xfa test because it currently fails due to seemingly
unimportant unicode differences
- test: reorder tests from simplest to more complex.
- bugfix: logging: replace basicConfig with handlers, which are
replaceable between tests, so each test gets its own log file.
- bugfix: os.path.join ignores any directories before a leading slash.
- bugfix: unsetup even after api tests so subsequent ('steps') tests
can configure.
- bugfix: Fixing YAML syntax--travis doesnt like leading tabs [by
perimosocordiae].
0.5.4 (2017-02-14)
- install pdf2svg in travis before_script step.
0.5.3 (2017-02-14)
- test: add non-warning lines from log file to output [convenient to
know what went wrong with the travis build; helpful for new users
too].
- 2 patches from meetup.com/pug-ip sprint on 13feb2017:
- dependency in README,README.md [by polera--thanks James!].
- bugfix in ut.run [by perimosocordiae--thanks CJ!].
0.5.2 (2017-02-13)
0.5.1 (2017-02-13)
- bugfix: tracking: rootForms must be same type in test_opentaxforms
and main.
0.5.0 (2017-02-13)
- css,template: add minimal tester interface.
- extract: make cryptic forms annotatable.
- html: add pdf link to html page.
- main: fix log.name in indicateProgress [still refers to the previous
form].
- commandline: -f now takes multiple forms, comma-delimited.
- commandline: -C now means useCaches [not ignoreCaches], so the
default runs [avoiding the pint registry error].
- make 'setup.py test' work again by adding pytest-runner.
- add PyPI version badge to README.md.
- remove unused db functions.
- bugfix: just import cross-module globals, dont assign.
- add logPrefix to config.
- add unsetup functions to config and ut for teardown between tests.
- pin versions of requirements in setup.py.
0.4.19 (2017-01-24)
- add pylint config file.
- fix or silence all pylint errors.
- remove unused code from ut.py.
- use another project's pylintrc [pylint config file] as a starting
point to reduce false positives.
- make linting more fine-grained [pylintrc-ut for ut.py, pylintrc for
the rest of the code].
- integrate linting with tox.
- fail the build only for linting errors, not warnings.
0.4.18 (2017-01-19)
- test: separate read-only database file from the overwritten one in
test/.
- test: add script to run just the api tests.
- test: make api tests print feedback.
- fixed all linting errors [according to vim/syntastic] [just errors,
not warnings].
- ran pep8ify.
- mv opentaxforms.py -> main.py.
- main.py: remove sys.setdefaultencoding [and reload].
- remove pointless path from tox env.
- README.md fixes/updates. landscape.io suggestions.
0.4.17 (2017-01-17)
- move cmds.computeMath to Form.computeMath.
- refactoring cmds.computeMath to reduce function length:
- move sentence-level (instruction-level) parsing into class Parser.
- create class CommandParser.
0.4.16 (2017-01-13)
- linting via landscape.io, including shortening long lines, removing
unused variables, and refactoring some long functions.
- fixed all broken doctests.
- both cross-module globals, log and cfg, are now importable.
0.4.15 (2017-01-09)
- bugfix: the ignoreCaches option now works.
- bugfix: skip forms setup if dirName is None.
- bugfix: ut.exists now insensitive to trailing slash.
- test: moved from relative to absolute paths so eg py.test works from
outside of test/ dir.
- test: reorganize test/ dir: test_* subdirs are disposable, test-*
arent.
- test/test_opentaxforms.py: command line options changed and
expanded.
- test/test_opentaxforms.py: refactor TestOtfSteps, add
test_run_1040_xfa, and finally add a script test to the [pytest]
automated tests.
- test/forms/f1040.js,test/run_html5.sh: add casperjs script [to test
f1040 html files] and shell script [for context].
- add test/README.md as guide to the test/ directory.
- probe access to phantomjs,casperjs in .travis.yml, hopefully for
future integration of html5 tests.
- opentaxforms.py: in main, move form-summarizing code to try-else
block.
- cache results of Form.pdfInfo for faster test run.
- add internal option to computeOverlap [for layout boxes; currently
done inefficiently, thus costly].
- update 1040 target output file to fix broken test.
- cleanup opentaxforms.py/main.
0.4.14 (2016-12-31)
- mv math.py cmds.py: workaround for tox build failure [tox.Random
imports py.math and gets our math.py instead].
- oops apparently gotta incl README for travis build to succeed.
- setup.cfg: git flow release already tags the release, so we dont want
bumpversion to do so, otherwise git flow refuses to release.
0.4.13 (2016-12-30)
- gitignore release.sh temp files so they dont affect git status.
- move most of 1100-line main into four new files.
- add Form class.
- domain.py->irs.py.
- extractFillableFields.El derives from dict.
- combined rst format files into a single README file.
- noticed markdown readme is not rendered on pypi, fixing [part of fix
is in release.sh].
0.4.12 (2016-12-23) - bind arrow keys to next/prev page links [for demo
video]. - [release.sh remains untracked while it is being tested.]
0.4.11 (2016-12-21) - allow multiple 'rootForms' via call or
commandline. - output form statuses for external processing. - clean up
"import *". - add cleanup script. - merge the missing and spurious
categories into the form status message. - use bumpversion as
cookiecutter does.
.. |PyPI version| image:: https://badge.fury.io/py/opentaxforms.svg
:target: https://badge.fury.io/py/opentaxforms
.. |Build Status| image:: https://travis-ci.org/jsaponara/opentaxforms.svg?branch=0.4.9
:target: https://travis-ci.org/jsaponara/opentaxforms