wagtail-textract
Allow searching for text in Documents in the Wagtail content management system
This package is for replacing Wagtail's Document class with one that allows searching in Document file contents using textract.
Textract can extract text from (among others) PDF, Excel and Word files.
The package was inspired by the "Search: Extract text from documents" issue in Wagtail.
Documents will work as before, except that Document search in Wagtail's admin interface will also find search terms in the files' contents.
Some screenshots to illustrate.
In our fresh Wagtail site with wagtail_textract installed, we uploaded a file called test_document.pdf with handwritten text in it.
It is listed in the admin interface under Documents.
If we now search in Documents for the word "correct", which is one of the handwritten words, the live search finds it.
The assumption is that this search should not only be available in Wagtail's admin interface, but also in a public-facing search view, for which we provide a code example.
We have been using this package in production since August 2018 on https://nuffic.nl.
Add wagtail_textract to your requirements and/or pip install wagtail_textract.
Add wagtail_textract to your Django INSTALLED_APPS.
Put WAGTAILDOCS_DOCUMENT_MODEL = "wagtail_textract.document" in your Django settings.
Note: You'll get an incompatibility warning during installation of wagtail_textract (Wagtail 2.0.1 installed):
requests 2.18.4 has requirement chardet<3.1.0,>=3.0.2, but you'll have chardet 2.3.0 which is incompatible.
textract 1.6.1 has requirement beautifulsoup4==4.5.3, but you'll have beautifulsoup4 4.6.0 which is incompatible.
We haven't seen this leading to problems, but it's something to keep in mind.
In order to make textract use Tesseract, which happens if regular textract finds no text, you need to add the data files that Tesseract can base its word matching on.
Create a tessdata directory in your project directory, and download the languages you want.
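The fallback described above (try regular extraction first, use Tesseract only when no text comes back) can be sketched roughly as follows. This is a minimal illustration, not the package's actual code: the function name transcribe and the two callables passed to it are hypothetical stand-ins. In a real project they would presumably wrap textract calls.

```python
def transcribe(path, extract, extract_with_ocr):
    """Return the text of `path`, using OCR only if plain extraction is empty.

    `extract` and `extract_with_ocr` are hypothetical callables that take a
    file path and return a string; they stand in for the real extractors.
    """
    text = extract(path).strip()
    if not text:
        # Regular extraction found nothing (e.g. a scanned or handwritten
        # page), so fall back to the slower OCR-based extractor.
        text = extract_with_ocr(path).strip()
    return text


# A scanned file: plain extraction yields only whitespace, so OCR is used.
scanned = transcribe("scan.pdf", lambda p: "  ", lambda p: "correct\n")

# A typed file: plain extraction succeeds, so OCR is never called.
typed = transcribe("typed.pdf", lambda p: "hello", lambda p: "unused")
```

Keeping the OCR step as a fallback means the cheaper extraction path handles most files, and the Tesseract data files only matter for documents without an embedded text layer.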
Transcription is done automatically after Document save, in an asyncio executor to prevent blocking the response during processing.
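As an illustration of the general technique (not the package's own handler), the sketch below hands a slow, blocking function to asyncio's default executor so that the caller can carry on before the work finishes. The names extract_text, save_document and main are hypothetical, and time.sleep stands in for the real extraction.

```python
import asyncio
import time


def extract_text(filename):
    """Hypothetical stand-in for a slow, blocking extraction call."""
    time.sleep(0.2)
    return "transcription of " + filename


async def save_document(filename, events):
    """Start the extraction in a worker thread and return without waiting."""
    loop = asyncio.get_running_loop()
    # None selects the loop's default thread pool executor.
    future = loop.run_in_executor(None, extract_text, filename)
    events.append("saved")
    return future


async def main():
    events = []
    future = await save_document("test_document.pdf", events)
    # The "response" can be produced while extraction is still running.
    events.append("response sent")
    text = await future
    events.append("transcribed")
    return events, text


events, text = asyncio.run(main())
```

The point of the design is that the blocking call runs in a separate thread, so the code that triggered the save does not have to wait for the extraction before it continues.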
To transcribe all existing Documents, run the management command:
./manage.py transcribe_documents
This may take a long time, obviously.
Here is a code example for a search view (outside Wagtail's admin interface) that shows both Page and Document results.
from itertools import chain

from django.shortcuts import render
from wagtail.core.models import Page
from wagtail.documents.models import get_document_model
from wagtail.search.models import Query

# Resolve the Document model configured in WAGTAILDOCS_DOCUMENT_MODEL
Document = get_document_model()


def search(request):
    # Search
    search_query = request.GET.get('query', None)
    if search_query:
        page_results = Page.objects.live().search(search_query)
        document_results = Document.objects.search(search_query)
        # Combine both result sets into a single list for the template
        search_results = list(chain(page_results, document_results))

        # Log the query so Wagtail can suggest promoted results
        Query.get(search_query).add_hit()
    else:
        search_results = Page.objects.none()

    # Render template
    return render(request, 'website/search_results.html', {
        'search_query': search_query,
        'search_results': search_results,
    })
Your template should allow for handling Documents differently than Pages, because you can't do {% pageurl result %} on a Document:
{% if result.file %}
    <a href="{{ result.url }}">{{ result }}</a>
{% else %}
    <a href="{% pageurl result %}">{{ result }}</a>
{% endif %}
In order to use wagtail_textract with a custom Document model, your CustomizedDocument model should do the same as wagtail_textract's Document: subclass TranscriptionMixin and alter search_fields to include the transcription field.
from wagtail.search import index
from wagtail_textract.models import TranscriptionMixin


class CustomizedDocument(TranscriptionMixin, ...):
    """Extra fields and methods for Document model."""
    search_fields = ... + [
        index.SearchField(
            'transcription',
            partial_match=False,
        ),
    ]
Note that the first class to subclass should be TranscriptionMixin, so its save() takes precedence over that of the other parent classes.
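The effect of the base class order can be seen in a small, framework-free sketch. The classes below are hypothetical stand-ins (they are not the real Wagtail or wagtail_textract classes) and only assume that the mixin's save() calls super().save() while the other parent's save() does not.

```python
class BaseDocument:
    """Hypothetical stand-in for the other parent class."""

    def __init__(self):
        self.log = []

    def save(self):
        # Like a typical model save(), this does not call super().save().
        self.log.append("BaseDocument.save")


class TranscriptionMixin:
    """Hypothetical stand-in for the mixin; not the real implementation."""

    def save(self):
        super().save()  # continue along the method resolution order
        self.log.append("TranscriptionMixin.save")  # e.g. start transcription


class MixinFirst(TranscriptionMixin, BaseDocument):
    pass


class MixinLast(BaseDocument, TranscriptionMixin):
    pass


first = MixinFirst()
first.save()  # runs the mixin's save(), which also calls BaseDocument.save()

last = MixinLast()
last.save()   # BaseDocument.save() wins; the mixin's save() is never reached
```

With the mixin listed last, Python's method resolution order finds the other parent's save() first, so the transcription step would be skipped silently.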
To run tests, check out this repository and run:
make test
A coverage report will be generated in ./coverage_html_report/.