MapIntel
Introduction
MapIntel is a system for acquiring intelligence from vast collections of text data by representing each document as a
multidimensional vector that captures its semantics. The system is designed to handle complex natural language queries and also
provides question-answering functionality. Additionally, it allows for visual exploration of the corpus. MapIntel uses a
retriever engine that finds the closest neighbors to the query embedding and thereby identifies the most relevant documents. It
also leverages the embeddings by projecting them onto two dimensions while preserving the multidimensional landscape, resulting in
a map where semantically related documents form topical clusters, which we capture using topic modeling. This map aims to provide a
fast overview of the corpus while allowing a more detailed exploration and an interactive information encountering process. MapIntel
can be used to explore many types of corpora.
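The retrieval step described above, finding the documents whose embeddings lie closest to the query embedding, can be illustrated with a minimal sketch. It assumes cosine similarity as the closeness measure and uses tiny hand-written vectors; the function names and the data are hypothetical and are not MapIntel's API, which delegates this step to its retriever engine.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_documents(query_embedding, doc_embeddings, k=2):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_embeddings,
        key=lambda doc_id: cosine_similarity(query_embedding, doc_embeddings[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical 3-dimensional embeddings, chosen only to keep the example readable.
docs = {
    "sports": [0.9, 0.1, 0.0],
    "politics": [0.1, 0.9, 0.1],
    "football": [0.8, 0.2, 0.1],
}
print(nearest_documents([1.0, 0.0, 0.0], docs, k=2))
```

A brute-force scan like this is only practical for small collections; a vector index is the usual choice for large corpora.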
Installation
For user installation, mapintel is currently available on PyPI, and you can install it via pip:
pip install mapintel
Development installation requires cloning the repository and then using PDM to install the
project as well as the main and development dependencies:
git clone https://github.com/NOVA-IMS-Innovation-and-Analytics-Lab/MapIntel.git
cd MapIntel
pdm install
Configuration
MapIntel aims to be a flexible system that can run with any user-provided corpus. To achieve this goal, it standardizes
the data and models, while all services are expected to be deployed on AWS. An example of how to fully set up a MapIntel
instance can be found at MapIntel-News. After deploying the required services, a file named .env should be created at the
root of the project with the environment variables described below.
AWS credentials
The following environment variable should be included in the .env file:
The user should have permissions to interact with the services described below.
Data
An OpenSearch database instance should be deployed in AWS with documents contained in an index called document. Each document is
expected to have the content, date, embedding, embedding2d and topic fields with the following types:
content: text type that contains the main text of the document.
date: long type that represents the ordinal format of a date.
embedding: knn_vector type that represents the embedding vector of the document.
embedding2d: float type that represents the 2D embedding vector of the document.
topic: keyword type that assigns a topic label to each document.
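As an illustration of the schema above, the following sketch builds an index body and one matching document as plain Python dictionaries. The embedding dimension (384), the sample values, and the helper name make_document are assumptions made for the example. The date is converted with Python's date.toordinal(), which is one plausible reading of "ordinal format" and should be checked against your own data pipeline.

```python
from datetime import date

# Assumed embedding size; use the output dimension of your retriever model.
EMBEDDING_DIM = 384

# Index body for the "document" index. OpenSearch requires the k-NN setting
# and a "dimension" parameter for knn_vector fields.
index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "content": {"type": "text"},
            "date": {"type": "long"},
            "embedding": {"type": "knn_vector", "dimension": EMBEDDING_DIM},
            "embedding2d": {"type": "float"},
            "topic": {"type": "keyword"},
        }
    },
}


def make_document(content, day, embedding, embedding2d, topic):
    """Build a document dict with the five expected fields (hypothetical helper)."""
    if len(embedding) != EMBEDDING_DIM:
        raise ValueError("embedding has the wrong dimension")
    if len(embedding2d) != 2:
        raise ValueError("embedding2d must have exactly two values")
    return {
        "content": content,
        "date": day.toordinal(),  # proleptic Gregorian ordinal; 0001-01-01 is day 1
        "embedding": list(embedding),
        "embedding2d": list(embedding2d),
        "topic": topic,
    }


doc = make_document(
    "Example text.", date(2020, 1, 1), [0.0] * EMBEDDING_DIM, [0.5, -1.2], "example"
)
print(doc["date"])
```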
The relevant environment variables are the following:
OPENSEARCH_ENDPOINT: The AWS endpoint of the OpenSearch deployed instance.
OPENSEARCH_PORT: The port of the instance.
OPENSEARCH_USERNAME: The username.
OPENSEARCH_PASSWORD: The password.
Models
MapIntel uses three models trained on the user-provided data. The first is a Haystack retriever model, the second is a model that
reduces the dimensions of the embeddings to 2D, and the third is a generator model used for question-answering. The
corresponding environment variables are the following:
HAYSTACK_RETRIEVER_MODEL: The value of the parameter embedding_model of the Haystack class EmbeddingRetriever.
SAGEMAKER_DIMENSIONALITY_REDUCTIONER_ENDPOINT: The SageMaker endpoint of the deployed dimensionality reduction model.
SAGEMAKER_GENERATOR_MODEL_ENDPOINT: The SageMaker endpoint of the deployed generator.
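A quick check of the environment before starting the application can catch a missing variable early. The sketch below lists the variables named in the Data and Models sections and reports those that are unset or empty. The function name missing_variables is hypothetical, and the AWS credential variables are left out because this section does not name them. The check reads the process environment, so it assumes the contents of .env have already been loaded into it.

```python
import os

# Variable names taken from the Data and Models sections.
REQUIRED_VARIABLES = (
    "OPENSEARCH_ENDPOINT",
    "OPENSEARCH_PORT",
    "OPENSEARCH_USERNAME",
    "OPENSEARCH_PASSWORD",
    "HAYSTACK_RETRIEVER_MODEL",
    "SAGEMAKER_DIMENSIONALITY_REDUCTIONER_ENDPOINT",
    "SAGEMAKER_GENERATOR_MODEL_ENDPOINT",
)


def missing_variables(env=None):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        print("Missing variables:", ", ".join(missing))
    else:
        print("All required variables are set.")
```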
Usage
To run the application use the following command:
mapintel
The server then starts and listens for connections at http://localhost:8080. You can open this URL in a browser to
interact with the MapIntel UI.