apify-haystack 0.1.7 (PyPI), 1 maintainer

Apify-Haystack integration

The Apify-Haystack integration allows easy interaction between the Apify platform and Haystack.

Apify is a platform for web scraping, data extraction, and web automation. It provides serverless applications called Actors for tasks such as crawling websites and scraping Facebook, Instagram, or Google results.

Haystack offers an ecosystem of tools for building, managing, and deploying search engines and LLM applications.

Installation

The integration is available on PyPI as the apify-haystack package.

pip install apify-haystack
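To confirm that the installation worked, one option is to look up the installed version from Python. The sketch below uses only the standard library; `installed_version` is a hypothetical helper written for this illustration and is not part of apify-haystack.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if the package is installed, otherwise None.
print(installed_version("apify-haystack"))
```

If this prints `None`, the package was most likely installed into a different Python environment than the one running the script.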

Examples

Crawl a website using Apify's Website Content Crawler and convert it to Haystack Documents

You need an Apify account and an API token to run this example. A free Apify account is enough to get a token.
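Because the run fails without a token, it can help to check for it up front. The following is a minimal sketch, assuming the token is stored in the APIFY_API_TOKEN environment variable; `require_apify_token` is a hypothetical helper, not part of apify-haystack.

```python
import os


def require_apify_token(env_var: str = "APIFY_API_TOKEN") -> str:
    """Return the token stored in `env_var`, or raise a clear error if it is missing."""
    # Treat an unset variable and a blank value the same way.
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"{env_var} is not set; get a token from your Apify account first.")
    return token
```

Calling `require_apify_token()` at the top of a script turns a missing token into an immediate, readable error instead of a failure partway through an Actor call.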

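In the example below, make your token available as the APIFY_API_TOKEN environment variable (for example, in a .env file) and run the script:

from dotenv import load_dotenv
from haystack import Document

from apify_haystack import ApifyDatasetFromActorCall

# Load APIFY_API_TOKEN from a .env file into the environment.
# Alternatively, export APIFY_API_TOKEN in your shell before running the script.
load_dotenv()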

actor_id = "apify/website-content-crawler"
run_input = {
    "maxCrawlPages": 3,  # limit the number of pages to crawl
    "startUrls": [{"url": "https://haystack.deepset.ai/"}],
}


def dataset_mapping_function(dataset_item: dict) -> Document:
    return Document(content=dataset_item.get("text"), meta={"url": dataset_item.get("url")})


actor = ApifyDatasetFromActorCall(
    actor_id=actor_id, run_input=run_input, dataset_mapping_function=dataset_mapping_function
)
print(f"Calling the Apify Actor {actor_id} ... crawling will take some time ...")
print("You can monitor the progress at: https://console.apify.com/actors/runs")

dataset = actor.run().get("documents")

print(f"Loaded {len(dataset)} documents from the Apify Actor {actor_id}:")
for d in dataset:
    print(d)
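The mapping function is the piece most likely to need adjusting for other Actors, so it is worth trying on sample data before starting a crawl. The sketch below mirrors dataset_mapping_function from the example; `SimpleDocument` is a hypothetical stand-in for haystack.Document so that the snippet runs without Haystack installed, and the sample items are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for haystack.Document (assumption, illustration only)."""

    content: Optional[str]
    meta: dict = field(default_factory=dict)


def map_item(dataset_item: dict) -> SimpleDocument:
    # Same shape as dataset_mapping_function: "text" becomes the content, "url" goes into meta.
    return SimpleDocument(content=dataset_item.get("text"), meta={"url": dataset_item.get("url")})


# Made-up items imitating crawler output; the second one has no "text" field.
sample_items = [
    {"url": "https://example.com/a", "text": "Page A"},
    {"url": "https://example.com/b"},
]

documents = [map_item(item) for item in sample_items]
# Drop documents with missing or empty content before passing them on.
non_empty = [doc for doc in documents if doc.content]
print(len(documents), len(non_empty))
```

Because the mapping uses `dict.get`, an item without a "text" field produces a document whose content is `None` rather than raising an error, which is why filtering afterwards can be useful.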

More examples

See the examples directory for more examples. Here are a few of them:

Load a dataset from Apify and convert it to a Haystack Document
Call Website Content Crawler and convert the data into the Haystack Documents
Crawl websites, retrieve text content, and store it in the InMemoryDocumentStore
Retrieval-Augmented Generation (RAG): Extracting text from a website & question answering
Analyze Your Instagram Comments’ Vibe with Apify and Haystack

Support

If you find a bug or run into a problem, please submit an issue on GitHub. For questions, ask on Stack Overflow or in GitHub Discussions, or join our Discord server.

Contributing

Your code contributions are welcome. If you have any ideas for improvements, either submit an issue or create a pull request. For contribution guidelines and the code of conduct, see CONTRIBUTING.md.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
