
notion-md-crawler

A library to recursively retrieve and serialize Notion pages and databases with customization for machine learning applications.

Version: 1.0.1 (latest)
Source: npm
Weekly downloads: 24K (-8.27%)
Maintainers: 1

🌟 Features

  • 🕷️ Crawling Pages and Databases: Recursively traverse Notion's hierarchical structure of pages and databases.
  • 📝 Serialize to Markdown: Convert Notion pages to Markdown for use in machine learning and other applications.
  • 🛠️ Custom Serialization: Adapt the serialization process to fit your specific machine learning needs.
  • ⏳ Async Generator: Yields results page by page, so even huge documents can be processed memory-efficiently.

🛠️ Installation

@notionhq/client must also be installed.

Using npm 📦:

npm install notion-md-crawler @notionhq/client

Using yarn 🧢:

yarn add notion-md-crawler @notionhq/client

Using pnpm 🚀:

pnpm add notion-md-crawler @notionhq/client

🚀 Quick Start

⚠️ Note: Before getting started, create a Notion integration and obtain its token. Details on each function can be found in the API section.

The crawler is built on JavaScript async generators and yields results page by page, so even very large Notion documents can be processed with efficient memory use and handled as each page arrives.

import { Client } from "@notionhq/client";
import { crawler, pageToString } from "notion-md-crawler";

// Initialize the Notion client with your integration token.
const client = new Client({ auth: process.env.NOTION_API_KEY });

const crawl = crawler({ client });

const main = async () => {
  const rootPageId = "****";
  for await (const result of crawl(rootPageId)) {
    if (result.success) {
      const pageText = pageToString(result.page);
      console.log(pageText);
    }
  }
};

main();
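
To illustrate why page-by-page yielding keeps memory use low, here is a minimal, self-contained sketch. `FakePage`, `fakeCrawl`, `countCharacters`, and the in-memory tree are hypothetical stand-ins and not part of notion-md-crawler; the real crawler fetches pages from the Notion API.

```typescript
// Hypothetical stand-in for a crawled page; not the library's Page type.
interface FakePage {
  id: string;
  text: string;
  children: FakePage[];
}

// Depth-first walk that yields one page at a time instead of building an array.
async function* fakeCrawl(root: FakePage): AsyncGenerator<FakePage> {
  yield root;
  for (const child of root.children) {
    yield* fakeCrawl(child);
  }
}

// Consumer that keeps only a running total, so no page is retained after use.
async function countCharacters(root: FakePage): Promise<number> {
  let total = 0;
  for await (const page of fakeCrawl(root)) {
    total += page.text.length;
  }
  return total;
}

const tree: FakePage = {
  id: "root",
  text: "Root",
  children: [
    { id: "a", text: "Alpha", children: [] },
    { id: "b", text: "Beta", children: [] },
  ],
};

countCharacters(tree).then((total) => console.log(total)); // prints 13
```

The consumer never holds more than one page at a time, which is the property the Quick Start loop relies on.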

🌐 API

crawler

Recursively crawls a Notion page. Use dbCrawler instead if the root is a Notion database.

Note: The crawler continues as far as possible even if it fails to retrieve a particular Notion page.

Parameters:

  • options (CrawlerOptions): Crawler options.
  • rootPageId (string): Id of the root page to be crawled.

Returns:

  • AsyncGenerator<CrawlingResult>: Crawling results, including information about failures.
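
Because crawling continues past pages that fail, callers will often want to separate successes from failures. The sketch below assumes a simplified result shape: only the `success` flag and the `page` field appear in this README, while the `failure` field and its contents are hypothetical placeholders, as are `SketchResult`, `partitionResults`, and `sampleResults`.

```typescript
// Simplified stand-in types. `success` and `page` mirror the usage shown in
// Quick Start; the `failure` field is an assumption made for illustration.
type SketchResult =
  | { success: true; page: { id: string } }
  | { success: false; failure: { id: string; reason: string } };

// Drain any async source of results, keeping failures for later inspection.
async function partitionResults(source: AsyncIterable<SketchResult>) {
  const pages: { id: string }[] = [];
  const failures: { id: string; reason: string }[] = [];
  for await (const result of source) {
    if (result.success) pages.push(result.page);
    else failures.push(result.failure);
  }
  return { pages, failures };
}

// Hypothetical source that fails on one page and keeps going.
async function* sampleResults(): AsyncGenerator<SketchResult> {
  yield { success: true, page: { id: "p1" } };
  yield { success: false, failure: { id: "p2", reason: "not found" } };
  yield { success: true, page: { id: "p3" } };
}

partitionResults(sampleResults()).then(({ pages, failures }) =>
  console.log(`${pages.length} ok, ${failures.length} failed`),
);
```

Check the library's exported CrawlingResult type for the actual failure fields before relying on any of these names.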

dbCrawler

Recursively crawls a Notion database. Use crawler instead if the root is a Notion page.

Parameters:

  • options (CrawlerOptions): Crawler options.
  • rootDatabaseId (string): Id of the root database to be crawled.

Returns:

  • AsyncGenerator<CrawlingResult>: Crawling results, including information about failures.

CrawlerOptions

| Option | Description | Type | Default |
| --- | --- | --- | --- |
| client | Instance of Notion Client. Set up an instance of the Client class imported from @notionhq/client. | Notion Client | - |
| serializers? | Used for custom serialization of Block and Property objects. | Object | undefined |
| serializers?.block? | Map of Notion block type and BlockSerializer. | BlockSerializers | undefined |
| serializers?.property? | Map of Notion Property Type and PropertySerializer. | PropertySerializers | undefined |
| metadataBuilder? | Customizes how page metadata is generated. | MetadataBuilder | undefined |
| urlMask? | If specified, the url is masked with the string. | string \| false | false |
| skipPageIds? | List of page Ids to skip crawling (also skips descendant pages). | string[] | undefined |
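
The skipPageIds description says that skipping a page also skips its descendants. A minimal sketch of that pruning behavior on an in-memory tree follows; `TreeNode`, `collectIds`, and `workspace` are hypothetical names used only for illustration, and the library's own traversal may differ in detail.

```typescript
// Hypothetical in-memory page tree; the real crawler reads from the Notion API.
interface TreeNode {
  id: string;
  children: TreeNode[];
}

// Collect page ids depth-first, pruning any subtree rooted at a skipped id.
function collectIds(node: TreeNode, skipPageIds: string[] = []): string[] {
  const skip = new Set(skipPageIds);
  const ids: string[] = [];
  const visit = (n: TreeNode): void => {
    if (skip.has(n.id)) return; // prune this page and everything below it
    ids.push(n.id);
    n.children.forEach((child) => visit(child));
  };
  visit(node);
  return ids;
}

const workspace: TreeNode = {
  id: "root",
  children: [
    { id: "drafts", children: [{ id: "draft-1", children: [] }] },
    { id: "docs", children: [{ id: "guide", children: [] }] },
  ],
};

console.log(collectIds(workspace, ["drafts"]).join(", ")); // root, docs, guide
```

Note that "draft-1" is absent from the output even though only "drafts" was listed.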

BlockSerializers

Map with Notion block type (like "heading_1", "to_do", "code") as key and BlockSerializer as value.

BlockSerializer

A function that takes a Notion block object as its argument and returns the serialized string. Returning false skips serialization of that Notion block.

[Type]

type BlockSerializer = (
  block: NotionBlock,
) => string | false | Promise<string | false>;
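
As a rough illustration of how a map of block serializers and the false return value could fit together, here is a self-contained sketch. `SketchBlock`, `SketchBlockSerializer`, `sketchSerializers`, and `serializeBlocks` are simplified stand-ins invented for this example; the library's NotionBlock type and internal dispatch are richer than this.

```typescript
// Simplified stand-ins; the library's NotionBlock type is richer than this.
interface SketchBlock {
  type: string;
  text: string;
}

type SketchBlockSerializer = (
  block: SketchBlock,
) => string | false | Promise<string | false>;

// Map keyed by block type, in the spirit of BlockSerializers.
const sketchSerializers: Record<string, SketchBlockSerializer> = {
  heading_1: (block) => `# ${block.text}`,
  to_do: (block) => `- [ ] ${block.text}`,
  image: () => false, // returning false drops the block from the output
};

// Serialize blocks in order, skipping false results and unknown types.
async function serializeBlocks(blocks: SketchBlock[]): Promise<string> {
  const lines: string[] = [];
  for (const block of blocks) {
    const serialize = sketchSerializers[block.type];
    if (!serialize) continue;
    const out = await serialize(block); // handles sync and async serializers
    if (out !== false) lines.push(out);
  }
  return lines.join("\n");
}
```

Awaiting every result lets synchronous and Promise-returning serializers share one code path.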

PropertySerializers

Map with Notion Property Type (like "title", "rich_text", "select") as key and PropertySerializer as value.

PropertySerializer

A function that takes the property name and a Notion property object as arguments. Returning false skips serialization of that Notion property.

[Type]

type PropertySerializer = (
  name: string,
  property: NotionProperty,
) => string | false | Promise<string | false>;
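
A property serializer receives the property name along with its value. The sketch below shows one possible shape for such a function using two property kinds whose fields follow the Notion API (`checkbox` and `number`). `SketchProperty`, `SketchPropertySerializer`, `serializeProperty`, and the "[name] value" output format are assumptions made for illustration, not the library's defaults.

```typescript
// Simplified stand-in for a Notion property value; the real type is a larger union.
type SketchProperty =
  | { type: "checkbox"; checkbox: boolean }
  | { type: "number"; number: number | null };

type SketchPropertySerializer = (
  name: string,
  property: SketchProperty,
) => string | false;

// Render a property as "[name] value", or skip it by returning false.
const serializeProperty: SketchPropertySerializer = (name, property) => {
  switch (property.type) {
    case "checkbox":
      return `[${name}] ${property.checkbox ? "yes" : "no"}`;
    case "number":
      // An empty number property carries no information, so skip it.
      return property.number === null ? false : `[${name}] ${property.number}`;
  }
};

console.log(serializeProperty("Done", { type: "checkbox", checkbox: true })); // [Done] yes
```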

MetadataBuilder

Retrieving metadata is sometimes very important, but the information you want to retrieve will vary depending on the context. MetadataBuilder allows you to customize it according to your use case.

[Example]

import { Client } from "@notionhq/client";
import { crawler, type MetadataBuilderParams } from "notion-md-crawler";

const client = new Client({ auth: process.env.NOTION_API_KEY });

// Build a Notion URL from a page id (dashes removed).
const getUrl = (id: string) => `https://www.notion.so/${id.replace(/-/g, "")}`;

// Attach the URL to each page's metadata.
const metadataBuilder = ({ page }: MetadataBuilderParams) => ({
  url: getUrl(page.metadata.id),
});

const crawl = crawler({ client, metadataBuilder });

for await (const result of crawl("notion-page-id")) {
  if (result.success) {
    console.log(result.page.metadata.url); // "https://www.notion.so/********"
  }
}

📊 Use Metadata

Since crawler returns Page objects and each Page object contains metadata, you can use that metadata in machine learning pipelines.
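
One common way to use page metadata in a machine learning pipeline is to attach it to every text chunk before embedding or indexing. The sketch below assumes the page text comes from pageToString(result.page) and the metadata from result.page.metadata; `DocumentChunk`, `toChunks`, and the fixed-size chunking strategy are hypothetical choices for illustration.

```typescript
// Hypothetical record shape for an embedding or retrieval pipeline.
interface DocumentChunk {
  text: string;
  metadata: Record<string, string>;
}

// Split page text into fixed-size chunks that each carry the page metadata.
function toChunks(
  pageText: string,
  metadata: Record<string, string>,
  chunkSize = 500,
): DocumentChunk[] {
  if (chunkSize <= 0) throw new RangeError("chunkSize must be positive");
  const chunks: DocumentChunk[] = [];
  for (let start = 0; start < pageText.length; start += chunkSize) {
    chunks.push({
      text: pageText.slice(start, start + chunkSize),
      metadata: { ...metadata }, // copy so chunks do not share one object
    });
  }
  return chunks;
}

const chunks = toChunks("abcdefghij", { url: "https://www.notion.so/example" }, 4);
console.log(chunks.map((c) => c.text).join("|")); // abcd|efgh|ij
```

Carrying the URL on each chunk makes it possible to trace a retrieved passage back to its source page.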

🛠️ Custom Serialization

notion-md-crawler lets you customize the serialization logic for individual Notion objects to fit the requirements of your machine learning model or any other use case.

Define your custom serializer

You can define your own custom serializer. Utility functions are also available under serializer.utils for convenience.

import { BlockSerializer, crawler, serializer } from "notion-md-crawler";

const customEmbedSerializer: BlockSerializer<"embed"> = (block) => {
  // Nothing to embed without a URL.
  if (!block.embed.url) return "";

  // You can use serializer utility.
  const caption = serializer.utils.fromRichText(block.embed.caption);

  return `<figure>
  <iframe src="${block.embed.url}"></iframe>
  <figcaption>${caption}</figcaption>
</figure>`;
};

const serializers = {
  block: {
    embed: customEmbedSerializer,
  },
};

// `client` is the Notion Client instance created in Quick Start.
const crawl = crawler({ client, serializers });

Skip serialization

Returning false from a serializer skips serialization of that block. This is useful when you want to omit unnecessary information.

const image: BlockSerializer<"image"> = () => false;
const crawl = crawler({ client, serializers: { block: { image } } });

Advanced: Use default serializer in custom serializer

If you want to customize serialization only in specific cases, you can use the default serializer in a custom serializer.

import { BlockSerializer, crawler, serializer } from "notion-md-crawler";

const defaultImageSerializer = serializer.block.defaults.image;

const customImageSerializer: BlockSerializer<"image"> = async (block) => {
  // Utility function to retrieve the link
  const { href } = serializer.utils.fromLink(block.image);

  // Await the default output, since a serializer may return a Promise
  const rendered = await defaultImageSerializer(block);

  // Preserve a skip (false) from the default serializer
  if (rendered === false) return false;

  // If the image is from a specific domain, wrap it in a special div
  if (href.includes("special-domain.com")) {
    return `<div class="special-image">
      ${rendered}
    </div>`;
  }

  // Use the default serializer output for all other images
  return rendered;
};

const serializers = {
  block: {
    image: customImageSerializer,
  },
};

const crawl = crawler({ client, serializers });
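
The pattern of wrapping a default serializer can be generalized into a small helper. The following is a sketch only: `AnySerializer` and `decorateWhen` are hypothetical names that are not exported by notion-md-crawler, and the block type is left generic so the example runs without the library.

```typescript
// Generic serializer shape matching the return type documented for BlockSerializer.
type AnySerializer<B> = (block: B) => string | false | Promise<string | false>;

// Wrap any serializer so its output is decorated when a predicate matches.
function decorateWhen<B>(
  base: AnySerializer<B>,
  matches: (block: B) => boolean,
  decorate: (rendered: string) => string,
): AnySerializer<B> {
  return async (block) => {
    const rendered = await base(block); // works for sync and async serializers
    if (rendered === false) return false; // preserve "skip this block"
    return matches(block) ? decorate(rendered) : rendered;
  };
}

// Usage with a hypothetical image-like block.
const baseImage: AnySerializer<{ url: string }> = (b) => `![](${b.url})`;
const specialImage = decorateWhen(
  baseImage,
  (b) => b.url.includes("special-domain.com"),
  (html) => `<div class="special-image">${html}</div>`,
);
```

Keeping the false check inside the helper means a wrapped serializer never turns a skipped block into the literal text "false".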

🔍 Supported Blocks and Database properties

Blocks

| Block Type | Supported |
| --- | --- |
| Text | ✅ Yes |
| Bookmark | ✅ Yes |
| Bulleted List | ✅ Yes |
| Numbered List | ✅ Yes |
| Heading 1 | ✅ Yes |
| Heading 2 | ✅ Yes |
| Heading 3 | ✅ Yes |
| Quote | ✅ Yes |
| Callout | ✅ Yes |
| Equation (block) | ✅ Yes |
| Equation (inline) | ✅ Yes |
| Todos (checkboxes) | ✅ Yes |
| Table Of Contents | ✅ Yes |
| Divider | ✅ Yes |
| Column | ✅ Yes |
| Column List | ✅ Yes |
| Toggle | ✅ Yes |
| Image | ✅ Yes |
| Embed | ✅ Yes |
| Video | ✅ Yes |
| Figma | ✅ Yes |
| PDF | ✅ Yes |
| Audio | ✅ Yes |
| File | ✅ Yes |
| Link | ✅ Yes |
| Page Link | ✅ Yes |
| External Page Link | ✅ Yes |
| Code (block) | ✅ Yes |
| Code (inline) | ✅ Yes |

Database Properties

| Property Type | Supported |
| --- | --- |
| Checkbox | ✅ Yes |
| Created By | ✅ Yes |
| Created Time | ✅ Yes |
| Date | ✅ Yes |
| Email | ✅ Yes |
| Files | ✅ Yes |
| Formula | ✅ Yes |
| Last Edited By | ✅ Yes |
| Last Edited Time | ✅ Yes |
| Multi Select | ✅ Yes |
| Number | ✅ Yes |
| People | ✅ Yes |
| Phone Number | ✅ Yes |
| Relation | ✅ Yes |
| Rich Text | ✅ Yes |
| Rollup | ✅ Yes |
| Select | ✅ Yes |
| Status | ✅ Yes |
| Title | ✅ Yes |
| Unique Id | ✅ Yes |
| Url | ✅ Yes |
| Verification | ❌ No |

💬 Issues and Feedback

For any issues, feedback, or feature requests, please file an issue on GitHub.

📜 License

MIT

Made with ❤️ by TomPenguin.


Package last updated on 23 Jan 2025
