A perfect AI-powered RAG for document query and summary. Supports ~all LLMs and ~all filetypes (url, pdf, epub, youtube (incl. playlists), audio, anki, md, docx, pptx, or any combination!)
I'm wdoc. I solve RAG problems.
- wdoc, imitating Winston "The Wolf" Wolf
wdoc is a powerful RAG (Retrieval-Augmented Generation) system designed to summarize, search, and query documents across various file types. It's particularly useful for handling large volumes of diverse document types, making it ideal for researchers, students, and professionals dealing with extensive information sources. I was frustrated with all other RAG solutions for querying or summarizing, so I made my perfect solution in a single package.
(The online documentation can be found here)
Goal and project specifications: wdoc uses LangChain to process and analyze documents. It's capable of querying tens of thousands of documents across various file types at the same time. The project also includes a tailored summary feature to help users efficiently keep up with large amounts of information.
Current status: Under active development
Key Features:
Query task example:
link="https://situational-awareness.ai/wp-content/uploads/2024/06/situationalawareness.pdf"
wdoc --path=$link --task=query --filetype="online_pdf" --query="What does it say about alphago?" --query_retrievers='default_multiquery' --top_k=auto_200_500

Summary task example:
link="https://situational-awareness.ai/wp-content/uploads/2024/06/situationalawareness.pdf"
wdoc --path=$link --task=summarize --filetype="online_pdf"
This will produce a summary of the linked PDF. For extra large documents like books, the summary can be recursively fed back to wdoc using, for example, the argument --summary_n_recursion=2.

The two tasks, query and summary, can be combined with --task summarize_then_query, which will summarize the document but give you a prompt at the end to ask questions in case you want to clarify things.
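For example, reusing the $link variable from the commands above (the flag values are illustrative):

```
# Recursive summary of a very large document
wdoc --path=$link --task=summarize --filetype="online_pdf" --summary_n_recursion=2

# Summarize, then get an interactive prompt to ask follow-up questions
wdoc --path=$link --task=summarize_then_query --filetype="online_pdf"
```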
For more, you can jump to the section Walkthrough and examples
Other notable features include:
- A --private mode: the api_base is user set, caches are isolated from the rest, outgoing connections are censored by overloading sockets, etc.
- Eve the Evaluator, Anna the Answerer and Carl the Combiner are the names given to each LLM in their system prompt; this way you can easily add specific additional instructions to a specific step. There's also Sam the Summarizer for summaries and Raphael the Rephraser to expand your query.
- wdoc keeps track of the hash of each document used in the answer, allowing you to verify each assertion.
- The full usage can be found in the file USAGE.md or via python -m wdoc --help. I work hard to maintain an exhaustive documentation.
- You can use wdoc in other python projects using --import_mode. Take a look at the scripts below.
- Static type checking can be set with WDOC_TYPECHECKING="disabled / warn / crash" (by default: warn). Thanks to beartype it shouldn't even slow down the code! (See the example after this list.)
- The project's TODO list is maintained automatically by MdXLogseqTODOSync.
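As referenced in the type-checking bullet above, here is a minimal example (the query command itself is illustrative; the environment variable and its values come from the list above):

```
export WDOC_TYPECHECKING="crash"   # accepted values: disabled / warn (default) / crash
wdoc --task=query --path=my_file.pdf --query="What does it say about alphago?"
```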
Supported filetypes:
auto: default, guess the filetype for you
url: try many ways to load a webpage, with heuristics to find the best parsed one
youtube: text is then either from the yt subtitles / translation or, even better, from a whisper / deepgram transcription. Note that youtube subtitles are downloaded with the timecode (so you can ask 'when does the author talk about such and such?') but at a lower sampling frequency (instead of one timecode per second, only one per 15s). Youtube chapters are also given as context to the LLM when summarizing, which probably helps it a lot.
pdf: 15 default loaders are implemented, heuristics are used to keep the best one and stop early. Table support via openparse or UnstructuredPDFLoader. Easy to add more.
online_pdf: via URL then treated as a pdf (see above)
anki: any subset of an anki collection db. The alt and title of images can be shown to the LLM, meaning that if you used the ankiOCR addon this information will help contextualize the note for the LLM.
string: the cli prompts you for a text so you can easily paste something, handy for paywalled articles!
txt: .txt, markdown, etc
text: send a text content directly as path
local_html: useful for website dumps
logseq_markdown: thanks to my other project, LogseqMarkdownParser, you can use your Logseq graph
local_audio: supports many file formats, can use either OpenAI's whisper or deepgram. Supports automatically removing silence, etc. Note: audio files that are too large for whisper (usually >25MB) are automatically split into smaller files, transcribed, then combined. Also, audio transcripts are converted to text containing timestamps at regular intervals, making it possible to ask the LLM when something was said.
local_video: extract the audio then treat it as local_audio
online_media: use youtube_dl to try to download videos/audio; if that fails, try to intercept good url candidates using playwright to load the page. Then processed as local_audio (but works with video too).
epub: barely tested because epub is in general a poorly defined format
powerpoint: .ppt, .pptx, .odp, ...
word: .doc, .docx, .odt, ...
json_dict: a text file containing a single json dict.
Recursive types: example configuration files can be found in docs/json_entries_example.json and docs/toml_entries_example.toml.

The prompts used by wdoc can be found in utils/prompts.py.
Walkthrough and examples

To ask a question about a single document: wdoc --task="query" --path="my_file.pdf" --filetype="pdf" --modelname='openai/gpt-4o'. Note that you could have just left --filetype="auto" and it would have worked the same.

wdoc tries to parse args as kwargs, so wdoc query mydocument What's the age of the captain? is parsed as wdoc --task=query --path=mydocument --query "What's the age of the captain?". Likewise for summaries. This does not always work, so use it only after getting comfortable with wdoc.
To query many documents at once: wdoc --task="query" --path="my/other_dir" --pattern="**/*pdf" --filetype="recursive_paths" --recursed_filetype="pdf" --query="My question about those documents". So basically you give as path the path to the dir, as pattern the globbing pattern used to find the files relative to the path, set the filetype to "recursive_paths" so that wdoc knows what arguments to expect, and specify as recursed_filetype "pdf" so that wdoc knows that each found file must be treated as a pdf. You can use the same idea to glob any kind of file supported by wdoc, like markdown, etc. You can even use "auto"! Note that you can either directly ask your question with --query="my question", wait for an interactive prompt to pop up, or just pass the question as *args like so: wdoc [your kwargs] here is my question.
You can also provide a .json file where each line (#comments and empty lines are ignored) will be parsed as a list of arguments. For example, one line could be: {"path": "my/other_dir", "pattern": "**/*pdf", "filetype": "recursive_paths", "recursed_filetype": "pdf"}. This way you can use a single json file to easily specify any number of sources; .toml files are also supported.

wdoc uses the source_tag to decide whether it should continue or crash: if you want to load 10,000 PDFs in one go as I do, it makes sense to continue when a few files fail to load, but not when a whole source_tag is missing.

Use --save_embeds_as=your/saving/path to save the whole index to a file, then simply pass --load_embeds_from=your/saving/path to quickly ask queries about it! For the full list of options, see wdoc --help.
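As an illustration of that workflow (the exact flag combination is a sketch; check wdoc --help for the authoritative usage):

```
# First run: build the index over a directory of PDFs and save it
wdoc --task=query --path="my/other_dir" --pattern="**/*pdf" --filetype="recursive_paths" --recursed_filetype="pdf" --save_embeds_as="saved/my_index" --query="My question about those documents"

# Subsequent runs: reuse the saved index instead of re-embedding everything
wdoc --task=query --load_embeds_from="saved/my_index" --query="Another question about those documents"
```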
To process a list of urls, use --filetype="link_file". Basically, the file designated by --path should contain one url per line (#comments and empty lines are ignored); each url will be parsed by wdoc. I made this so that I can quickly use the "share" button on android to append a url from my browser to a text file; this file is synced via syncthing, and wdoc automatically summarizes the links and adds them to my Logseq. Note that the url is parsed from each line, so formatting is ignored; for example, it works even in a markdown bullet point list.

To keep everything local, here is an example of a fully private setup using ollama:
wdoc --private --llms_api_bases='{"model": "http://localhost:11434", "query_eval_model": "http://localhost:11434"}' --modelname="ollama_chat/gemma:2b" --query_eval_modelname="ollama_chat/gemma:2b" --embed_model="BAAI/bge-m3" my_task
Here is an example of a youtube video summary, produced with the following command:
wdoc --task=summary --path='https://www.youtube.com/watch?v=arj7oStGLkU' --youtube_language="en" --disable_md_printing

Summary of https://www.youtube.com/watch?v=arj7oStGLkU:
- Let me take a deep breath and summarize this TED talk about procrastination:
- [0:00-3:40] Personal experience with procrastination in college:
- Author's pattern with papers: planning to work steadily but actually doing everything last minute
- 90-page senior thesis experience:
- Planned to work steadily over a year
- Actually wrote 90 pages in 72 hours with two all-nighters
- Jokingly implies it was brilliant, then admits it was 'very, very bad'
- [3:40-6:45] Brain comparison between procrastinators and non-procrastinators:
- Both have a Rational Decision-Maker
- Procrastinator's brain also has an Instant Gratification Monkey:
- Lives entirely in present moment
- Only cares about 'easy and fun'
- Works fine for animals but problematic for humans in advanced civilization
- Rational Decision-Maker capabilities:
- Can visualize future
- See big picture
- Make long-term plans
- [6:45-10:55] The procrastinator's system:
- Dark Playground:
- Where leisure activities happen at wrong times
- Characterized by guilt, dread, anxiety, self-hatred
- Panic Monster:
- Only thing monkey fears
- Awakens near deadlines or threats of public embarrassment
- Enables last-minute productivity
- Personal example with TED talk preparation:
- Procrastinated for months
- Only started working when panic set in
- [10:55-13:05] Two types of procrastination:
- Deadline-based procrastination:
- Effects contained due to Panic Monster intervention
- Less harmful long-term
- Non-deadline procrastination:
- More dangerous
- Affects important life areas without deadlines:
- Entrepreneurial pursuits
- Family relationships
- Health
- Personal relationships
- Can cause long-term unhappiness and regrets
- [13:05-14:04] Concluding thoughts:
- Author believes no true non-procrastinators exist
- Presents Life Calendar:
- Shows 90 years in weekly boxes
- Emphasizes limited time available
- Call to action: need to address procrastination 'sometime soon'
- Key audience response moments:
- Multiple instances of '(Laughter)' noted throughout
- Particularly strong response from PhD students relating to procrastination issues
- Received thousands of emails after blog post about procrastination

Tokens used for https://www.youtube.com/watch?v=arj7oStGLkU: '4936' (in: 4307, out: 629, cost: $0.00063)
Total cost of those summaries: 4936 tokens for $0.00063 (estimate was $0.00030)
Total time saved by those summaries: 8.8 minutes
Done summarizing.
Getting started

wdoc currently requires Python 3.11. Make sure your Python version matches or it will not work.

Install the latest release: pip install -U wdoc
Install from the dev branch: pip install git+https://github.com/thiswillbeyourgithub/wdoc@dev
Install from the main branch: pip install git+https://github.com/thiswillbeyourgithub/wdoc@main
Run it without installing: uvx wdoc --help or pipx run wdoc --help
Add pdftotext support with pip install -U wdoc[pdftotext], as well as fasttext support with pip install -U wdoc[fasttext].

Set your API key, for example export OPENAI_API_KEY="***my_key***", then launch a task: wdoc --task=query --path=MYDOC [ARGS]

If the wdoc command is not found or misbehaves, try python -m wdoc. And if everything fails, try with uvx wdoc@latest, or as a last resort clone this repo and try again after cd-ing inside it. Don't hesitate to open an issue.

Shell completion for zsh can be enabled with eval $(cat shell_completions/wdoc_completion.zsh). Completions are also provided for bash and fish. You can generate your own with wdoc -- --completion MYSHELL > my_completion_file.
To ask questions about a document: wdoc query --path="PATH/TO/YOUR/FILE" --filetype="auto". To skip recomputing embeddings on later runs, add --saveas="some/path" to the previous command to save the generated embeddings to a file, and replace it with --loadfrom "some/path" on every subsequent call. The full list of options is available via wdoc --help and in the documentation of wdoc.

FAQ

Who is this for?
wdoc is for power users who want document querying on steroids, and in-depth AI-powered document summaries.

What's RAG?
Why make another RAG system? Can't you use any of the others?
Why is wdoc better than most RAG systems for asking questions about documents?
wdoc is very customizable.

Why can wdoc also produce summaries?
The summary prompts can be found in utils/prompts.py and focus on extracting the arguments/reasoning/thought process of the author, then use markdown indented bullet points to make the result easy to read. It's really good! The prompts dataclass is not frozen, so you can provide your own prompt if you want.

What other tasks are supported by wdoc?

Which LLM providers are supported by wdoc?
wdoc supports virtually any LLM provider thanks to litellm. It even supports local LLMs and local embeddings (see the Walkthrough and examples section).

What do you use wdoc for?
With wdoc I can automatically create awesome markdown summaries that end up straight in my Logseq database as a bunch of TODO blocks. To keep data local, see the --private argument shown in the walkthrough section above.

What's up with the name?
WolfDoc would be too confusing and WinstonDoc sounds like something micro$oft would do. Also, wd and wdoc were free, whereas doctools was already taken. The initial name of the project was DocToolsLLM, a play on words between 'doctor' and 'tool'.

How can I improve the prompt for a specific task without coding?
The LLMs used for the query task are roleplaying as employees working for WDOC-CORP©: either Eve the Evaluator (the LLM that filters the documents for relevance), Anna the Answerer (the LLM that answers the question from a filtered document) or Carl the Combiner (the LLM that combines the answers from the Answerer into one). There's also Sam the Summarizer for summaries and Raphael the Rephraser to expand your query. They all take orders from you if you address them in your prompt.
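For instance, you can slip role-specific instructions straight into your query (the wording below is only an illustration of that idea, not a special syntax):

```
wdoc --task=query --path=my_file.pdf --query="Eve the Evaluator, be strict about relevance. Anna the Answerer, answer only with direct quotes. What does the author say about alphago?"
```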
How can I use wdoc's parser for my own documents?
From the shell: wdoc parse my_file.pdf (this actually replaces the call with wdoc_parse_file my_file.pdf). Add --only_text to get only the text and no metadata. If you're having problems with argument parsing you can try adding the --pipe argument.

From Python:
from wdoc import wdoc
list_of_docs = wdoc.parse_file(path=my_path)

For example, to parse a subset of an anki collection as text only:
wdoc_parse_file --filetype "anki" --anki_profile "Main" --anki_deck "mydeck::subdeck1" --anki_notetype "my_notetype" --anki_template "<header>\n{header}\n</header>\n<body>\n{body}\n</body>\n<personal_notes>\n{more}\n</personal_notes>\n<tags>{tags}</tags>\n{image_ocr_alt}" --anki_tag_filter "a::tag::regex::.*something.*" --only_text
What should I do if my PDFs are encrypted?
qpdf --decrypt input.pdf output.pdf
How can I add my own pdf parser?
Register your parser with wdoc.utils.loaders.pdf_loaders['parser_name']=parser_object, then call wdoc with --pdf_parsers=parser_name. The parser object must accept a path argument in __init__ and have a load method taking no argument but returning a List[Document]. Take a look at the OpenparseDocumentParser class for an example.
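A minimal sketch of such a parser, assuming the registration line above works as written (the Document import relies on wdoc being built on LangChain, and pypdf is only an illustrative backend, not a wdoc requirement):

```python
from typing import List

from langchain_core.documents import Document  # wdoc is built on LangChain
from pypdf import PdfReader  # illustrative backend for this sketch


class MyPdfParser:
    """Follows the interface described above: path in __init__, load() -> List[Document]."""

    def __init__(self, path: str):
        self.path = path

    def load(self) -> List[Document]:
        # Extract the text of every page and wrap it in a single Document.
        reader = PdfReader(self.path)
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        return [Document(page_content=text, metadata={"source": self.path})]


# Register the parser, then call wdoc with --pdf_parsers=my_parser
import wdoc.utils.loaders

wdoc.utils.loaders.pdf_loaders["my_parser"] = MyPdfParser
```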
What should I do if I keep hitting rate limits?
The simplest way is to use the debug argument: it will disable multithreading, multiprocessing and LLM concurrency. A less harsh alternative is to set the environment variable WDOC_LLM_MAX_CONCURRENCY to a lower value.
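For example (the value 1 is only an illustration; any lower value reduces concurrency):

```
export WDOC_LLM_MAX_CONCURRENCY=1
wdoc --task=query --path=my_file.pdf --query="My question"
```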
How can I run the tests?
python -m pytest tests