Give it to me, I am in a hurry!
Note: a list of examples can be found in examples.md
link="https://situational-awareness.ai/wp-content/uploads/2024/06/situationalawareness.pdf"
wdoc --path=$link --task=query --filetype="online_pdf" --query="What does it say about alphago?" --query_retrievers='default_multiquery' --top_k=auto_200_500
- This will:
  1. Parse what's in --path as a link to a PDF to download (otherwise the URL could simply be a webpage; in most cases you can leave --filetype at its 'auto' default, as heuristics are in place to detect the most appropriate parser).
  2. Cut the text into chunks and create embeddings for each chunk.
  3. Take the user query, create embeddings for it ('default') AND ask the default LLM to generate alternative queries and embed those too.
  4. Use those embeddings to search through all chunks of the text and get the 200 most relevant documents.
  5. Pass each of those documents to the smaller LLM (default: anthropic/claude-3-5-haiku-20241022) to tell us if the document seems relevant given the user query.
  6. If more than 90% of the 200 documents are judged relevant, do another search with a higher top_k and repeat until documents start to be irrelevant OR we hit 500 documents.
  7. Each relevant document is then sent to the strong LLM (by default anthropic/claude-3-7-sonnet-20250219) to extract the relevant info and give one answer.
  8. All those "intermediate" answers are 'semantic batched' (meaning we create embeddings, do hierarchical clustering, then create small batches, each containing several intermediate answers) and each batch is combined into a single answer.
  9. Rinse and repeat steps 7+8 until we have only one answer, which is returned to the user.
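As a quick sketch of another query run, the same pipeline works on an ordinary webpage; since 'auto' is the default --filetype, it can simply be omitted and the heuristics pick the parser. The URL and query below are placeholders:

```bash
# Query a webpage: --filetype is omitted, so the default 'auto' detection picks the parser.
# The URL and query are placeholders, replace them with your own.
wdoc --path="https://example.com/some-article" --task=query --query="What is the author's main argument?"
```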
link="https://situational-awareness.ai/wp-content/uploads/2024/06/situationalawareness.pdf"
wdoc --path=$link --task=summarize --filetype="online_pdf"
- This will:
  - Split the text into chunks.
  - Pass each chunk to the strong LLM (by default anthropic/claude-3-7-sonnet-20250219) for a very low-level (i.e. with all details) summary. The format is markdown bullet points for each idea, with logical indentation.
  - When summarizing each new chunk, the LLM has access to the previous chunk for context.
  - All summaries are then concatenated and returned to the user.
For extra large documents like books, this summary can be recursively fed back to wdoc, for example using the argument --summary_n_recursion=2.
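Concretely, a book-length PDF could then be summarized with two recursion passes (the file path below is a placeholder):

```bash
# Recursive summary: the first-pass summary is itself summarized again, twice.
# The path is a placeholder, point it at your own file.
wdoc --path=./some_book.pdf --task=summarize --filetype="pdf" --summary_n_recursion=2
```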
Those two tasks, query and summarize, can be combined with --task=summarize_then_query, which will summarize the document and then give you a prompt at the end so you can ask questions in case you want to clarify things.
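For instance, reusing the PDF link from above:

```bash
# Summarize first, then get an interactive prompt to ask follow-up questions.
link="https://situational-awareness.ai/wp-content/uploads/2024/06/situationalawareness.pdf"
wdoc --path=$link --task=summarize_then_query --filetype="online_pdf"
```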
For more, you can jump to the section Walkthrough and examples.
Supported filetypes:
- auto: default, guess the filetype for you
- url: try many ways to load a webpage, with heuristics to find the best parsed one
- youtube: the text comes either from the youtube subtitles/translation or, even better, from a whisper/deepgram transcription. Note that youtube subtitles are downloaded with timecodes (so you can ask 'when does the author talk about such and such?') but at a lower sampling frequency (instead of one timecode per second, only one per 15s). Youtube chapters are also given as context to the LLM when summarizing, which probably helps it a lot. See the examples after this list.
- pdf: 15 default loaders are implemented; heuristics are used to keep the best one and stop early. Table support via openparse or UnstructuredPDFLoader. Easy to add more.
- online_pdf: fetched via URL then treated as a pdf (see above)
- anki: any subset of an anki collection db. The alt and title attributes of images can be shown to the LLM, meaning that if you used the ankiOCR addon this information will help contextualize the note for the LLM. See the examples after this list.
- string: the CLI prompts you for a text so you can easily paste something, handy for paywalled articles!
- txt: .txt, markdown, etc.
- text: pass the text content directly as the --path argument
- local_html: useful for website dumps
- logseq_markdown: thanks to my other project, LogseqMarkdownParser, you can use your Logseq graph
- local_audio: supports many file formats, can use either OpenAI's whisper or deepgram's Nova-3 model. Supports automatically removing silence, etc. Note: audio files that are too large for whisper (usually >25mb) are automatically split into smaller files, transcribed, then combined. Also, audio transcripts are converted to text containing timestamps at regular intervals, making it possible to ask the LLM when something was said. See the examples after this list.
- local_video: extract the audio then treat it as local_audio
- online_media: use youtube_dl to try to download the video/audio; if that fails, try to intercept good URL candidates using playwright to load the page. Then processed as local_audio (but works with video too).
- epub: barely tested because epub is in general a poorly defined format
- powerpoint: .ppt, .pptx, .odp, ...
- word: .doc, .docx, .odt, ...
- json_dict: a text file containing a single json dict
- Recursive types:
  - youtube playlists: get the link for each video, then process each as youtube
  - recursive_paths: turns a path, a regex pattern and a filetype into all the files found recursively, treated as the specified filetype (for example many PDFs or lots of HTML files, etc.)
  - link_file: turns a text file where each line contains a url into appropriate loader arguments. Supports any link, so for example webpages, links to pdfs and youtube links can be in the same file. Handy for summarizing lots of things! See the examples after this list.
  - json_entries: turns a path to a file where each line is a json dict that contains arguments to use when loading. Example: load several other recursive types. An example can be found in docs/json_entries_example.json.
  - toml_entries: reads a .toml file. An example can be found in docs/toml_entries_example.toml.
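To make the filetypes above more concrete, here are a few invocation sketches. First, summarizing a youtube video (the video URL is a placeholder); thanks to the timecoded transcript you can also use --task=query to ask when something was said:

```bash
# Summarize a youtube video; subtitles or a whisper/deepgram transcript provide the text.
# VIDEO_ID is a placeholder.
wdoc --path="https://www.youtube.com/watch?v=VIDEO_ID" --task=summarize --filetype="youtube"
```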
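Querying an anki deck might look like the sketch below. The profile/deck selection flags shown here (--anki_profile, --anki_deck) are assumptions on my part, so check wdoc --help for the exact names; the profile, deck and query values are placeholders:

```bash
# NOTE: --anki_profile and --anki_deck are assumed flag names, verify with wdoc --help.
# Profile, deck and query values are placeholders.
wdoc --task=query --filetype="anki" --anki_profile="Main" --anki_deck="medicine::cardiology" --query="What are the side effects of beta blockers?"
```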
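A local recording can be summarized directly (the file path is a placeholder); files too large for whisper are split automatically before transcription:

```bash
# Summarize a local audio file; the path is a placeholder.
wdoc --path=./interview.mp3 --task=summarize --filetype="local_audio"
```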
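Finally, a link_file run could look like this (links.txt is a placeholder: a text file with one URL per line, mixing webpages, PDFs and youtube links):

```bash
# Summarize every link listed in links.txt (one URL per line); the file name is a placeholder.
wdoc --path=./links.txt --task=summarize --filetype="link_file"
```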