pip3 install tika-client
from pathlib import Path
from tika_client import TikaClient
test_file = Path("sample.docx")
with TikaClient("http://localhost:9998") as client:
    # Extract a document's metadata
    metadata = client.metadata.from_file(test_file)

    # Get the content of a document as HTML
    data = client.tika.as_html.from_file(test_file)

    # Or as plain text
    text = client.tika.as_text.from_file(test_file)

    # Content and metadata combined
    data = client.rmeta.as_text.from_file(test_file)

    # The mime type can also be given
    # This allows Content-Type to be set most accurately
    text = client.tika.as_text.from_file(
        test_file,
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    )
The Tika REST API documentation can be found here. At the moment, only the metadata, tika and recursive metadata endpoints are implemented.
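To make the three endpoint groups more concrete, here is a rough sketch of the kind of HTTP request a client sends to a Tika server for each of them. It is not how tika-client is implemented. The `ENDPOINTS` table and the `build_request` helper are hypothetical names, and the paths and Accept headers reflect the Tika server REST API as commonly documented, so check them against the Tika version you run.

```python
from __future__ import annotations

from urllib.request import Request

# Assumed mapping of request kind -> (path, Accept header) on a Tika server.
# Verify these against the Tika REST API documentation for your version.
ENDPOINTS = {
    "metadata": ("/meta", "application/json"),
    "text": ("/tika", "text/plain"),
    "html": ("/tika", "text/html"),
    "recursive_text": ("/rmeta/text", "application/json"),
}


def build_request(
    base_url: str,
    kind: str,
    payload: bytes,
    mime_type: str | None = None,
) -> Request:
    """Build (but do not send) a PUT request carrying the document bytes."""
    path, accept = ENDPOINTS[kind]
    headers = {"Accept": accept}
    if mime_type is not None:
        # Telling the server the mime type lets it skip type detection
        headers["Content-Type"] = mime_type
    return Request(
        base_url.rstrip("/") + path,
        data=payload,
        headers=headers,
        method="PUT",
    )
```

With a server running, the resulting object could be passed to `urllib.request.urlopen` to perform the call. A dedicated client adds response parsing and typing on top of this.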
Unfortunately, the set of possible return values of the Tika API is not very well documented. The library makes a best effort to extract relevant fields into typed properties where it understands more about the mime type of the document (as returned by Tika). This includes created/modified information as time zone aware datetime objects. The full JSON response is always available to the user under the .data attribute.
When a particular key is not present in the response, the property returns None instead.
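The pattern described above can be sketched as follows. This is an illustration of the idea only, not the library's actual classes: `ParsedResponse` and its `created` property are hypothetical, and the `dcterms:created` key is assumed to be one of the fields Tika may return.

```python
from __future__ import annotations

from datetime import datetime
from typing import Any


class ParsedResponse:
    """Hypothetical wrapper around a Tika JSON response (illustration only)."""

    def __init__(self, data: dict[str, Any]) -> None:
        # The raw JSON stays available, mirroring the .data attribute
        self.data = data

    @property
    def created(self) -> datetime | None:
        # Assumed key name; a missing key yields None instead of raising
        raw = self.data.get("dcterms:created")
        if raw is None:
            return None
        # Older Python versions reject a trailing "Z", so normalize it
        return datetime.fromisoformat(raw.replace("Z", "+00:00"))


response = ParsedResponse({"dcterms:created": "2023-06-01T12:30:00Z"})
empty = ParsedResponse({})
```

Here `response.created` is a time zone aware datetime, while `empty.created` is None because the key is absent. Either way the untouched dictionary remains reachable through `.data`.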
I know of only one other library for interfacing with Tika. I find it too complicated, as it tries to handle many differing use cases.
The biggest issue I have with that library is that it downloads and runs a jar file if needed. To me, an API client should only interface with the API and not also try to start the API. The user is responsible for providing a server running the Tika version they desire.
That library also provides a lot of knobs to turn, but I argue most developers will not want to configure XML as the response type; they just want the data, already parsed to the maximum extent possible.
This library attempts to provide a simpler interface, minimal lines of code, and typing of the parsed response.
tika-client is distributed under the terms of the Mozilla Public License 2.0.
A modern REST client for Apache Tika server