Midas Loop

1. Introduction

Midas Loop is a web application for improving the annotation quality of Universal Dependencies corpora. For more information on motivation, functionality, and supported workflows, please see our paper.

1.1. Key features

1.1.1. CoNLL-U Import/Export Support

Midas Loop supports import and export of corpora in the CoNLL-U format.

1.1.2. CoNLL-U Editing

Editing of most annotations in the CoNLL-U format is supported.

1.1.3. Active Learning Support

Midas Loop allows NLP models to report probability distributions for several annotation types and uses these distributions to give annotators visual cues that a certain annotation is suspicious. The annotator may then decide whether to keep or replace the annotation, and the model may be trained further on the improved data. Models are completely decoupled from the core system and communicate with it via HTTP, so any model may be used as long as it implements the HTTP protocol that Midas Loop expects.
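As a rough illustration of how a probability distribution can turn into a "suspicious" cue, the sketch below flags an annotation when the model assigns its current label a low probability. The function names, the threshold of 0.5, and the example distribution are all assumptions made for illustration; they are not taken from the Midas Loop code base.

```python
def is_suspicious(current_label, distribution, threshold=0.5):
    """Flag an annotation when the model gives its current label low probability.

    `distribution` maps candidate labels to probabilities. The threshold is an
    illustrative assumption, not a value used by Midas Loop.
    """
    return distribution.get(current_label, 0.0) < threshold


def most_probable_label(distribution):
    """Return the label the model considers most probable."""
    return max(distribution, key=distribution.get)


# Hypothetical XPOS distribution reported by a model for one token.
xpos_probs = {"NN": 0.15, "VB": 0.80, "JJ": 0.05}

print(is_suspicious("NN", xpos_probs))      # the current label "NN" gets low probability
print(most_probable_label(xpos_probs))      # the model's preferred label
```

An annotator would still make the final call; a check like this only decides where to draw attention.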

These model-provided label distributions are also aggregated at the document level to allow annotators to triage documents based on how uncertain a model was about certain annotation types on average throughout the document.
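One way to picture document-level triage is to average a per-token uncertainty score over each document and sort documents by that average. The sketch below uses Shannon entropy as the uncertainty score; that choice, the function names, and the sample data are assumptions for illustration, and Midas Loop may aggregate differently.

```python
import math


def entropy(distribution):
    """Shannon entropy (in bits) of a label -> probability mapping."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


def mean_uncertainty(distributions):
    """Average entropy over all token-level distributions in a document."""
    if not distributions:
        return 0.0
    return sum(entropy(d) for d in distributions) / len(distributions)


# Hypothetical documents, each with one distribution per token.
docs = {
    "doc_a": [{"NN": 1.0}, {"VB": 0.5, "NN": 0.5}],
    "doc_b": [{"NN": 1.0}, {"JJ": 1.0}],
}

# Most uncertain documents first, so they can be reviewed first.
ranked = sorted(docs, key=lambda name: mean_uncertainty(docs[name]), reverse=True)
print(ranked)
```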

1.2. Limitations

The Midas Loop UI supports the following:

Sentence break editing
LEMMA editing
XPOS editing
HEAD editing
DEPREL editing
Active Learning support for HEAD, XPOS, and sentence breaks

The Midas Loop UI does NOT support the following:

FORM/tokenization editing
UPOS editing
FEATS editing
DEPS editing
MISC editing

(Caveat: these are limitations of the Midas Loop UI only. The Midas Loop core system supports editing of all core data types; see /swagger-ui on a running server for API documentation. It is possible to build your own UI or extend the Midas Loop UI to provide some of these additional editing features. Additionally, please open an issue on GitHub if there is a kind of editing you would like to see added to Midas Loop.)
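To make the lists above concrete, the following sketch splits a CoNLL-U token line into its ten standard columns and keeps only the columns the UI can edit. The column order follows the CoNLL-U format; the function names and the sample token line are invented for illustration and are not part of Midas Loop.

```python
# The ten CoNLL-U columns, in the order defined by the format.
CONLLU_COLUMNS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                  "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

# Columns the Midas Loop UI can edit, per the lists above.
UI_EDITABLE = {"LEMMA", "XPOS", "HEAD", "DEPREL"}


def parse_token_line(line):
    """Split one tab-separated CoNLL-U token line into a column -> value dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_COLUMNS):
        raise ValueError(f"expected {len(CONLLU_COLUMNS)} columns, got {len(values)}")
    return dict(zip(CONLLU_COLUMNS, values))


def ui_editable_fields(token):
    """Return only the fields that the UI supports editing."""
    return {col: val for col, val in token.items() if col in UI_EDITABLE}


# Hypothetical token line for illustration.
line = "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_"
token = parse_token_line(line)
print(ui_editable_fields(token))
```

Sentence breaks are also editable in the UI, but they are a property of sentence boundaries rather than a token column, so they do not appear in this sketch.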

1.3. Roadmap

The following are priorities for future work:

More efficient NLP processing
Support for full CoNLL-U editing
Online retraining of NLP models

Please do not hesitate to open an issue on GitHub with feature requests, etc.

2. Operation

2.1. Server Setup

Get an uberjar either by building it or by downloading the latest pre-built one. Note the following top-level commands:

COMMANDS:
run, r               Start the web app and begin listening for requests.
import, i            Read and ingest CoNLL-U files.
export, e            Export all documents in the database as CoNLL-U files.
token, t             Token-related helpers.

Run one of the top-level commands with --help (e.g. java -jar midas-loop.jar import --help) to see more details about that command.

Important

Every command requires that the server not already be running. If you have started the server with the run command, be sure to shut it down before running any of the other commands.

2.2. Configuration

By default, the uberjar will use its copy of the config located at env/prod/resources/config.edn. If you wish to customize this, specify another config using -Dconf=…:

java -Dconf="/path/to/my/config.edn" -jar midas-loop.jar …

Config keys:

:midas-loop.server.xtdb/config

A map with two subkeys: :main-db-dir (required) is a string giving the main database’s path on the filesystem, relative to the current working directory; :http-server-port (optional) is a number giving the port on which to serve XTDB’s internal HTTP interface.

:midas-loop.server.tokens/config

A map with a single required key, :token-db-dir, which specifies the location on the filesystem of the authorization token database.

:dev

Either true or false. If true, do not require any authorization. This should always be false in production.

:nlp-services

A vector of three-key maps. Each map should have a :type (currently always :http), an :anno-type (must be :sentence, :xpos, :upos, or :head), and a url (which must point to a running NLP service).

:nlp-retry-wait-period-ms

Time, in milliseconds, to wait after a failure before attempting to contact an HTTP NLP service again. Defaults to 10000 (10 seconds).

:port

Port used for the main web server.

:cors-patterns

A set of CORS patterns (regular expressions) for adding additional allowed origins, e.g. #{"*.georgetown.edu"}. Localhost and the main origin are always allowed regardless of this item’s value.
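The constraints on the config keys above can be summarized as a small checker. This is only a sketch: it models the EDN config as a Python dict with string keys in place of EDN keywords, assumes the third key of each NLP service map is named url, and uses made-up paths, ports, and URLs. It does not read EDN and is not how Midas Loop itself validates its config.

```python
ALLOWED_ANNO_TYPES = {"sentence", "xpos", "upos", "head"}


def validate_config(config):
    """Return a list of problems found in a config dict (empty if none found)."""
    problems = []

    xtdb = config.get("midas-loop.server.xtdb/config", {})
    if not isinstance(xtdb.get("main-db-dir"), str):
        problems.append("main-db-dir is required and must be a string")
    if "http-server-port" in xtdb and not isinstance(xtdb["http-server-port"], int):
        problems.append("http-server-port must be a number")

    tokens = config.get("midas-loop.server.tokens/config", {})
    if "token-db-dir" not in tokens:
        problems.append("token-db-dir is required")

    if "dev" in config and not isinstance(config["dev"], bool):
        problems.append("dev must be true or false")

    for i, service in enumerate(config.get("nlp-services", [])):
        if service.get("type") != "http":
            problems.append(f"nlp-services[{i}]: type must be http")
        if service.get("anno-type") not in ALLOWED_ANNO_TYPES:
            problems.append(f"nlp-services[{i}]: unsupported anno-type")
        if not service.get("url"):  # key name "url" is an assumption
            problems.append(f"nlp-services[{i}]: url is missing")

    return problems


# Hypothetical config values for illustration only.
example = {
    "midas-loop.server.xtdb/config": {"main-db-dir": "xtdb_data"},
    "midas-loop.server.tokens/config": {"token-db-dir": "token_data"},
    "dev": False,
    "nlp-services": [
        {"type": "http", "anno-type": "xpos", "url": "http://localhost:5555"},
    ],
    "nlp-retry-wait-period-ms": 10000,
    "port": 3000,
}
print(validate_config(example))
```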

2.3. Authorization

Warning

Midas Loop’s authorization scheme is primitive and vulnerable to attack, and is therefore only useful for preventing low-effort unauthorized access. You SHOULD NOT store sensitive data in a Midas Loop system.