
Helper for converting CONLLU files and uploading the corpus to LiRI Corpus Platform (LCP)
Command-line tool for converting CONLLU files and uploading the corpus to LCP
Make sure you have Python 3.11+ with pip installed in your local environment, then run:
pip install lcpcli==0.2.5
Example:
Corpus conversion:
lcpcli -i ~/conll_ext/ -o ~/upload/
Data upload:
lcpcli -c ~/upload/ -k $API_KEY -s $API_SECRET -p my_project --live
Help:
lcpcli --help
lcpcli takes a corpus of CoNLL-U (PLUS) files and imports it to a project created in an LCP instance, such as catchphrase.
Besides the standard token-level CoNLL-U fields (form, lemma, upos, xpos, feats, head, deprel, deps), one can also provide document- and sentence-level annotations using comment lines in the files (see the CoNLL-U Format section).
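As a rough illustration of the token-level fields listed above, here is a minimal sketch that splits one tab-separated CoNLL-U token line into named cells. The function name parse_token_line and the sample line are hypothetical, and this is not lcpcli's own parser.

```python
# Illustrative sketch only: NOT lcpcli's parser.
# The ten standard CoNLL-U columns, in order (ID and MISC surround the
# eight fields named in the text above).
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def parse_token_line(line):
    """Split one tab-separated CoNLL-U token line into a dict of named cells."""
    cells = line.rstrip("\n").split("\t")
    if len(cells) != len(CONLLU_FIELDS):
        raise ValueError(
            f"expected {len(CONLLU_FIELDS)} columns, got {len(cells)}"
        )
    return dict(zip(CONLLU_FIELDS, cells))


# Hypothetical sample line, for illustration.
token = parse_token_line(
    "1\tEveryone\teveryone\tPRON\tNN\tNumber=Sing\t2\tnsubj\t_\tSpaceAfter=Yes"
)
print(token["form"], token["upos"], token["head"])
```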
lcpcli ships with an example one-video "corpus": the video is an excerpt from the CC-BY 3.0 "Big Buck Bunny" video ((c) copyright 2008, Blender Foundation / www.bigbuckbunny.org) and the "transcription" is a sample of the Universal Declaration of Human Rights.
To populate a folder with the example data, use this command:
lcpcli --example /destination/folder/
This will create a subfolder named free_video_corpus in /destination/folder which itself contains two subfolders: input and output. The input subfolder contains four files, which include:
doc.conllu, which carries time alignments at the token level (start and end in the MISC column), the segment level (# start = and # end = comments) and the document level (# newdoc start = and # newdoc end = comments)
a file that accompanies the namedentity token cells of doc.conllu with two attributes, type and form
a file with a view column, where the start and end columns are timestamps, in seconds, relative to the document referenced in the doc_id column
The CoNLL-U format is documented at: https://universaldependencies.org/format.html
The LCP CLI converter will treat all the comments that start with # newdoc KEY = VALUE as document-level attributes. This means that if a CoNLL-U file contains the line # newdoc author = Jane Doe, then in LCP all the sentences from this file will be associated with a document whose meta attribute will contain author: 'Jane Doe'.
All other comment lines following the format # key = value will add an entry to the meta attribute of the segment corresponding to the sentence below that line (i.e. not at the document level).
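The distinction between the two kinds of comment lines can be sketched as follows. This is a hypothetical helper (classify_comment is not part of lcpcli), and the regular expressions are an assumption about how such lines could be recognised; the sketch does not special-case other header comments such as # global.columns.

```python
import re

# Illustrative sketch only: NOT lcpcli's implementation.
# "# newdoc KEY = VALUE" -> document-level attribute
NEWDOC_RE = re.compile(r"^#[ \t]*newdoc[ \t]+([^=]+?)[ \t]*=[ \t]*(.*)$")
# "# key = value"        -> segment-level attribute
KEYVAL_RE = re.compile(r"^#[ \t]*([^=]+?)[ \t]*=[ \t]*(.*)$")


def classify_comment(line):
    """Return (level, key, value) for a comment line, or None if it is not one."""
    line = line.strip()
    match = NEWDOC_RE.match(line)
    if match:
        return ("document", match.group(1), match.group(2).strip())
    match = KEYVAL_RE.match(line)
    if match:
        return ("segment", match.group(1), match.group(2).strip())
    return None


print(classify_comment("# newdoc author = Jane Doe"))
print(classify_comment("# speaker = Anna"))
```

The document-level pattern is tried first because every # newdoc KEY = VALUE line would also match the more general # key = value pattern.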
The key-value pairs in the MISC column of a token line will go in the meta attribute of the corresponding token, with the exception of these key-value combinations:
SpaceAfter=Yes vs. SpaceAfter=No (case-sensitive) controls whether the token will be represented with a trailing space character in the database
start=n.m|end=o.p (case-sensitive) will align tokens, segments (sentences) and documents along a temporal axis, where n.m and o.p should be floating-point values in seconds
See below how to report all the attributes in the template .json file.
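To make the MISC handling described above concrete, here is a minimal sketch that separates the special keys from the pairs that end up in the token's meta attribute. The function parse_misc is hypothetical and not lcpcli's code; treating a missing SpaceAfter as "space follows" is an assumption borrowed from the Universal Dependencies convention.

```python
# Illustrative sketch only: NOT lcpcli's implementation.
def parse_misc(misc):
    """Split a MISC cell into special keys and remaining meta pairs."""
    result = {"meta": {}, "space_after": True, "start": None, "end": None}
    if not misc or misc == "_":  # "_" marks an empty cell in CoNLL-U
        return result
    for pair in misc.split("|"):
        key, sep, value = pair.partition("=")
        if not sep:
            continue  # skip anything that is not KEY=VALUE
        # Keys are compared case-sensitively, as stated in the text.
        if key == "SpaceAfter":
            result["space_after"] = value != "No"
        elif key == "start":
            result["start"] = float(value)  # seconds
        elif key == "end":
            result["end"] = float(value)  # seconds
        else:
            result["meta"][key] = value
    return result


print(parse_misc("SpaceAfter=No|start=1.25|end=1.5|Gloss=hi"))
```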
CoNLL-U Plus is an extension to the CoNLL-U format documented at: https://universaldependencies.org/ext-format.html
If your files start with a comment line of the form # global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC, lcpcli will treat them as CoNLL-U PLUS files and process the columns according to the names you set in that line.
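A minimal sketch of how the # global.columns header can drive column naming, assuming whitespace-separated column names as in the example line above. The helper read_columns and the extra NAMEDENTITY column are hypothetical, used only for illustration; this is not lcpcli's own logic.

```python
# Illustrative sketch only: NOT lcpcli's implementation.
PREFIX = "# global.columns ="


def read_columns(first_line):
    """Return the declared column names, or None for a plain CoNLL-U file."""
    line = first_line.strip()
    if not line.startswith(PREFIX):
        return None
    return line[len(PREFIX):].split()


# Hypothetical header declaring three columns.
columns = read_columns("# global.columns = ID FORM NAMEDENTITY")
row = dict(zip(columns, "1\tEveryone\t_".split("\t")))
print(columns, row)
```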
If your corpus includes media files, your .json template should report it under a mediaSlots key in meta, e.g.:
"meta": {
  "name": "Free Single-Video Corpus",
  "author": "LiRI",
  "date": "2024-06-13",
  "version": 1,
  "corpusDescription": "Single, open-source video with annotated shots and a placeholder text stream from the Universal Declaration of Human Rights annotated with named entities",
  "mediaSlots": {
    "video": {
      "mediaType": "video",
      "isOptional": false
    }
  }
},
Your CoNLL-U file(s) should accordingly report each document's media file's name in a comment, like so:
# newdoc video = bunny.mp4
The .json template should also define a main key named tracks to control what annotations will be represented along the time axis. For example, the following will tell the interface to display separate timeline tracks for the shot, named entity and segment annotations, with the latter being subdivided into as many tracks as there are distinct values for the attribute speaker of the segments:
"tracks": {
  "layers": {
    "Shot": {},
    "NamedEntity": {},
    "Segment": {
      "split": [
        "speaker"
      ]
    }
  }
}
Finally, your output corpus folder should include a subfolder named media in which all the referenced media files have been placed.
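Before uploading, it can help to confirm that every media file named in a # newdoc comment is actually present in the media subfolder. The sketch below is one way to do that; referenced_media and missing_media are hypothetical helpers, not lcpcli functions, and the slot name "video" follows the mediaSlots example above.

```python
import re
from pathlib import Path


# Illustrative sketch only: NOT part of lcpcli.
def referenced_media(conllu_text, slot):
    """Collect file names from '# newdoc <slot> = <file>' comment lines."""
    pattern = re.compile(
        rf"^#[ \t]*newdoc[ \t]+{re.escape(slot)}[ \t]*=[ \t]*(.+?)[ \t]*$",
        re.MULTILINE,
    )
    return set(pattern.findall(conllu_text))


def missing_media(names, media_dir):
    """Return the sorted names that do not exist as files in media_dir."""
    return sorted(n for n in names if not (Path(media_dir) / n).is_file())


sample = "# newdoc video = bunny.mp4\n# text = All human beings\n"
print(referenced_media(sample, "video"))
```

Calling missing_media(referenced_media(text, "video"), "output/media") with your own paths would then list any file still to be copied.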
The values of each attribute (on tokens, segments, documents or at any other level) have a type; the most common types are text, number or categorical. The attributes must be reported in the template .json file, along with their type (you can see an example in the section Convert and Upload).
text vs categorical: while both types correspond to alphanumerical values, categorical is meant for attributes that have a limited number of possible values (typically, fewer than 100 distinct values) of a limited length (as a rule of thumb, each value can have up to 50 characters). There are no such limits on values of attributes of type text. When a user starts typing a constraint on an attribute of type categorical, the DQD editor will offer autocompletion suggestions. The attributes of type text will have their values listed in a dedicated table (lcpcli's conversion step produces corresponding .csv files), so a query that expresses a constraint on an attribute will be slower if that attribute is of type text than if it is of type categorical.
the type labels (with an s at the end) corresponds to a set of labels that users will be able to constrain in DQD using the contain keyword: for example, if an attribute named genre is of type labels, the user could write a constraint like genre contain 'drama' or hobbies !contain 'comedy'. The values of attributes of type labels should be one-line strings, with each value separated by a comma (,) character (as in, e.g., # newdoc genre = drama, romance, coming of age, fiction); as a consequence, no label can contain the character , (comma).
the type dict corresponds to key-value pairs as represented in JSON
the type date requires values to be formatted in a way that can be parsed by PostgreSQL
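The rules of thumb above can be turned into a small pre-flight check. The sketch below is only an illustration: split_labels and looks_categorical are hypothetical helpers, the thresholds mirror the figures quoted in the text, and trimming whitespace around each label is an assumption rather than documented lcpcli behaviour.

```python
# Illustrative sketch only: NOT part of lcpcli.
def split_labels(value):
    """Split a one-line, comma-separated 'labels' value into its labels."""
    return [label.strip() for label in value.split(",") if label.strip()]


def looks_categorical(values, max_distinct=100, max_length=50):
    """Apply the rule of thumb: under 100 distinct values, each up to 50 chars."""
    distinct = set(values)
    return len(distinct) < max_distinct and all(
        len(v) <= max_length for v in distinct
    )


labels = split_labels("drama, romance, coming of age, fiction")
print(labels)
print(looks_categorical(["NOUN", "VERB", "NOUN"]))
```

An attribute that fails looks_categorical is probably better declared as text in the template.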
Create a directory in which you have all your properly formatted CoNLL-U files.
In the same directory, create a template .json file that describes your corpus structure (see above about the attributes key on Document and Segment), for example:
{
  "meta": {
    "name": "Free Single-Video Corpus",
    "author": "LiRI",
    "date": "2024-06-13",
    "version": 1,
    "corpusDescription": "Single, open-source video with annotated shots and a placeholder text stream from the Universal Declaration of Human Rights annotated with named entities",
    "mediaSlots": {
      "video": {
        "mediaType": "video",
        "isOptional": false
      }
    }
  },
  "firstClass": {
    "document": "Document",
    "segment": "Segment",
    "token": "Token"
  },
  "layer": {
    "Token": {
      "abstract": false,
      "layerType": "unit",
      "anchoring": {
        "location": false,
        "stream": true,
        "time": true
      },
      "attributes": {
        "form": {
          "isGlobal": false,
          "type": "text",
          "nullable": true
        },
        "lemma": {
          "isGlobal": false,
          "type": "text",
          "nullable": false
        },
        "upos": {
          "isGlobal": true,
          "type": "categorical",
          "nullable": true
        },
        "xpos": {
          "isGlobal": false,
          "type": "categorical",
          "nullable": true
        },
        "ufeat": {
          "isGlobal": false,
          "type": "dict",
          "nullable": true
        }
      }
    },
    "DepRel": {
      "abstract": true,
      "layerType": "relation",
      "attributes": {
        "udep": {
          "type": "categorical",
          "isGlobal": true,
          "nullable": false
        },
        "source": {
          "name": "dependent",
          "entity": "Token",
          "nullable": false
        },
        "target": {
          "name": "head",
          "entity": "Token",
          "nullable": true
        },
        "left_anchor": {
          "type": "number",
          "nullable": false
        },
        "right_anchor": {
          "type": "number",
          "nullable": false
        }
      }
    },
    "NamedEntity": {
      "abstract": false,
      "layerType": "span",
      "contains": "Token",
      "anchoring": {
        "location": false,
        "stream": true,
        "time": false
      },
      "attributes": {
        "form": {
          "isGlobal": false,
          "type": "text",
          "nullable": false
        },
        "type": {
          "isGlobal": false,
          "type": "categorical",
          "nullable": true
        }
      }
    },
    "Shot": {
      "abstract": false,
      "layerType": "span",
      "anchoring": {
        "location": false,
        "stream": false,
        "time": true
      },
      "attributes": {
        "view": {
          "isGlobal": false,
          "type": "categorical",
          "nullable": false
        }
      }
    },
    "Segment": {
      "abstract": false,
      "layerType": "span",
      "contains": "Token",
      "attributes": {
        "meta": {
          "text": {
            "type": "text"
          },
          "start": {
            "type": "text"
          },
          "end": {
            "type": "text"
          }
        }
      }
    },
    "Document": {
      "abstract": false,
      "contains": "Segment",
      "layerType": "span",
      "attributes": {
        "meta": {
          "audio": {
            "type": "text",
            "isOptional": true
          },
          "video": {
            "type": "text",
            "isOptional": true
          },
          "start": {
            "type": "number"
          },
          "end": {
            "type": "number"
          },
          "name": {
            "type": "text"
          }
        }
      }
    }
  },
  "tracks": {
    "layers": {
      "Shot": {},
      "Segment": {},
      "NamedEntity": {}
    }
  }
}
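A quick sanity check of such a template can catch typos before conversion. The sketch below assumes two consistency rules that seem implied by the example (every layer named under firstClass and under tracks is declared under layer); these rules and the check_template helper are assumptions for illustration, not lcpcli's own validation.

```python
import json


# Illustrative sketch only: NOT lcpcli's validation logic.
def check_template(template):
    """Return a list of human-readable problems found in a template dict."""
    problems = []
    layers = template.get("layer", {})
    for role, name in template.get("firstClass", {}).items():
        if name not in layers:
            problems.append(f"firstClass {role!r} points to undeclared layer {name!r}")
    for name in template.get("tracks", {}).get("layers", {}):
        if name not in layers:
            problems.append(f"track {name!r} is not a declared layer")
    return problems


# A deliberately tiny template, reduced from the example above.
template = json.loads("""
{
  "firstClass": {"document": "Document", "segment": "Segment", "token": "Token"},
  "layer": {"Token": {}, "Segment": {}, "Document": {}},
  "tracks": {"layers": {"Segment": {}}}
}
""")
print(check_template(template))
```

In practice one would load the real file with json.load and run the same check on it.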
If your corpus defines a character-anchored entity type such as named entities, make sure you also include a properly named and formatted CSV file for it in the directory.
Visit an LCP instance (e.g. catchphrase) and create a new project if you don't already have one where your corpus should go.
Retrieve the API key and secret for your project by clicking on the button that says: "Create API Key".
Once you have your API key and secret, you can start converting and uploading your corpus by running the following command:
lcpcli -i $CONLLU_FOLDER -o $OUTPUT_FOLDER -k $API_KEY -s $API_SECRET -p $PROJECT_NAME --live
$CONLLU_FOLDER should point to the folder that contains your CONLLU files
$OUTPUT_FOLDER should point to another folder that will be used to store the converted files to be uploaded
$API_KEY is the key you copied from your project on LCP (still visible when you visit the page)
$API_SECRET is the secret you copied from your project on LCP (only visible upon API Key creation)
$PROJECT_NAME is the name of the project exactly as displayed on LCP; it is case-sensitive, and space characters should be escaped
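If you prefer to launch the conversion and upload from a script, the command can be assembled as an argument list, which sidesteps shell escaping of spaces in the project name. This is a minimal sketch: build_command is a hypothetical helper, and only the flags shown in the command above (-i, -o, -k, -s, -p, --live) are used.

```python
import subprocess  # only needed if you uncomment the run() call below


# Illustrative sketch only: a thin wrapper around the command shown above.
def build_command(conllu_folder, output_folder, api_key, api_secret, project_name):
    """Assemble the lcpcli invocation as a list of arguments."""
    return [
        "lcpcli",
        "-i", conllu_folder,
        "-o", output_folder,
        "-k", api_key,
        "-s", api_secret,
        "-p", project_name,
        "--live",
    ]


# Placeholder values; replace them with your own folders and credentials.
cmd = build_command("conll_ext", "upload", "KEY", "SECRET", "my project")
print(cmd)
# subprocess.run(cmd, check=True)  # uncomment to actually run lcpcli
```

Because the arguments are passed as a list rather than a single shell string, a project name containing spaces needs no extra escaping.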