1. Introduction

Midas Loop is a web application for taking Universal Dependencies corpora and improving the quality of their annotations. For more information on motivation, functionality, and supported workflows, please see our paper.

1.1. Key features

1.1.1. CoNLL-U Import/Export Support

Midas Loop supports import and export of corpora in the CoNLL-U format.

1.1.2. CoNLL-U Editing

Editing of most annotations in the CoNLL-U format is supported.

1.1.3. Active Learning Support

Midas Loop allows NLP models to report probability distributions on several annotation types and uses these distributions to provide visual cues for annotators that a certain annotation is suspicious. The annotator may then decide whether to keep or replace the annotation, and the model used may be trained further on the improved data. Models are completely decoupled from the core system and communicate with it via HTTP, so any model may be used as long as it obeys the HTTP protocol.

These model-provided label distributions are also aggregated at the document level to allow annotators to triage documents based on how uncertain a model was about certain annotation types on average throughout the document.
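As a rough illustration of document-level triage, the sketch below averages a simple per-token uncertainty score over a document. The scoring rule (one minus the probability of the most likely label) and both function names are assumptions made for this example; the aggregation Midas Loop actually performs may differ.

```python
from statistics import mean


def token_uncertainty(distribution):
    """Return 1 minus the probability of the most likely label.

    This scoring rule is an assumption chosen for illustration.
    """
    return 1.0 - max(distribution.values())


def document_uncertainty(sentences):
    """Average token uncertainty over every token in a document.

    `sentences` is a list of sentences; each sentence is a list of
    per-token label distributions (dicts mapping label -> probability).
    An empty document scores 0.0.
    """
    scores = [token_uncertainty(dist) for sent in sentences for dist in sent]
    return mean(scores) if scores else 0.0
```

Documents with a higher score would be the ones a model was, on average, less sure about, and so are natural candidates for review first.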

1.2. Limitations

Midas Loop supports the following:

  • Sentence break editing

  • LEMMA editing

  • XPOS editing

  • HEAD editing

  • DEPREL editing

  • Active Learning support for HEAD, XPOS, and sentence breaks

Midas Loop does NOT support the following:

  • FORM/tokenization editing

  • UPOS editing

  • FEATS editing

  • DEPS editing

  • MISC editing

(Caveat: these are the limitations of the Midas Loop UI, but the Midas Loop core system actually supports editing of all core data types—​see /swagger-ui on a running server for API documentation. It is possible to build your own UI or extend the Midas Loop UI to provide some of these additional editing features. Additionally, please open an issue on GitHub if there is a kind of editing you would like to see added to Midas Loop.)

1.3. Roadmap

The following are priorities for future work:

  • More efficient NLP processing

  • Support for full CoNLL-U editing

  • Online retraining of NLP models

Please do not hesitate to open an issue on GitHub with feature requests, etc.

2. Operation

2.1. Server Setup

Obtain an uberjar, either by building it (see Production Build) or by downloading the latest pre-built one. The uberjar provides the following top-level commands:

COMMANDS:
run, r               Start the web app and begin listening for requests.
import, i            Read and ingest CoNLL-U files.
export, e            Export all documents in the database as CoNLL-U files.
token, t             Token-related helpers.

Run any top-level command with --help (e.g. java -jar midas-loop.jar import --help) to see more details about that command.

Important
Every command other than run requires that your server not be running. If you have started your server with the run command, be sure to shut it down before running any of the other commands.

2.2. Configuration

By default, the uberjar will use its copy of the config located at env/prod/resources/config.edn. If you wish to customize this, specify another config using -Dconf=…​:

java -Dconf="/path/to/my/config.edn" -jar midas-loop.jar …​

Config keys:

:midas-loop.server.xtdb/config

Should be a map with two subkeys: :main-db-dir (required) has a string specifying the main database’s path on the filesystem relative to the CWD; :http-server-port, if present, should be a number specifying the port on which to serve XTDB’s internal HTTP interface.

:midas-loop.server.tokens/config

Map with a single required key, :token-db-dir, which specifies the location on the filesystem of the authorization token database.

:dev

Either true or false. If true, do not require any authorization. This should always be false in production.

:nlp-services

A vector of three-key maps. Each map should have a :type (currently always :http), an :anno-type (must be :sentence, :xpos, :upos, or :head), and a :url (must point at a running NLP service; see NLP Services).

:nlp-retry-wait-period-ms

Time, in milliseconds, to wait after a failure before attempting to contact an HTTP NLP service again. Defaults to 10000 (10 seconds).

:port

Port used for the main web server.

:cors-patterns

A set of CORS patterns (regular expressions) for adding additional allowed origins, e.g. #{"*.georgetown.edu"}. Localhost and the main origin are always allowed regardless of this item’s value.
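To show how the keys above fit together, here is a minimal sketch that renders a config file as EDN text. Everything concrete in it is a placeholder: the database directories are the default folder names mentioned under Clearing Database Files, port 3000 and the service URL are arbitrary, the :url key name for NLP services is assumed, and the helper functions are hypothetical rather than part of Midas Loop.

```python
ALLOWED_ANNO_TYPES = {"sentence", "xpos", "upos", "head"}


def render_nlp_service(anno_type, url):
    """Render one :nlp-services entry (the :url key name is an assumption)."""
    if anno_type not in ALLOWED_ANNO_TYPES:
        raise ValueError(f"unsupported :anno-type: {anno_type!r}")
    return f'{{:type :http :anno-type :{anno_type} :url "{url}"}}'


def render_config(services, port=3000, dev=False):
    """Return EDN text for a config with the given (anno_type, url) pairs."""
    rendered = " ".join(render_nlp_service(a, u) for a, u in services)
    lines = [
        '{:midas-loop.server.xtdb/config {:main-db-dir "xtdb_data"}',
        ' :midas-loop.server.tokens/config {:token-db-dir "xtdb_token_data"}',
        f" :dev {'true' if dev else 'false'}",
        f" :nlp-services [{rendered}]",
        " :nlp-retry-wait-period-ms 10000",
        f" :port {port}",
        " :cors-patterns #{}}",
    ]
    return "\n".join(lines) + "\n"
```

Printing render_config([("xpos", "http://localhost:5555")]) and saving the result to a file gives something that could be passed via -Dconf, after replacing the placeholder values with real ones.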

2.3. Authorization

Warning
Midas Loop’s authorization scheme is primitive and vulnerable to attack, and is therefore only useful for preventing low-effort unauthorized access. You SHOULD NOT store sensitive data in a Midas Loop system.

2.3.1. Granting

Token-based authorization is used. Each user should have a token made for them, like so:

java -jar midas-loop.jar token add --name "Sam Doe" --email "sd42@gmail.com" --quality "gold"

Give your user their token and instruct them to keep it secret.

If you are using a non-standard configuration using java -Dconf=…, be sure to include it when adding a token.

2.3.2. Listing

You can see all valid tokens with java -jar midas-loop.jar token list.

If you are using a non-standard configuration using java -Dconf=…, be sure to include it when listing tokens.

2.3.3. Revoking

You can revoke a token like so:

java -jar midas-loop.jar token revoke --secret "gold;secret=84EO60tU6lhcBhplbuEEGElECuh1yZod8fTCn6DqkQA"

If you are using a non-standard configuration using java -Dconf=…, be sure to include it when revoking a token.

2.4. Importing

Use the import subcommand and supply it with a directory path. The directory will be recursively searched for files ending in .conllu and each will be loaded into the database. Example invocation:

java -jar midas-loop.jar import dir/with/conllu-files/

If you are using a non-standard configuration using java -Dconf=…​, be sure to include it during import.
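Before importing, it can be useful to preview which files a recursive search would pick up. The following is a small sketch of such a check; find_conllu_files is a hypothetical helper, not part of Midas Loop, and it only approximates the search described above.

```python
from pathlib import Path


def find_conllu_files(root):
    """Return every file under `root` (recursively) whose name ends in .conllu."""
    return sorted(p for p in Path(root).rglob("*.conllu") if p.is_file())
```

For example, find_conllu_files("dir/with/conllu-files/") returns a sorted list of paths, which can be compared against the set of documents you expect to end up in the database.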

2.5. Exporting

Use the export subcommand and provide it with a directory path. A separate .conllu file for each document will be created directly under that directory. Example invocation:

java -jar midas-loop.jar export output/dir/

If you are using a non-standard configuration using java -Dconf=…, be sure to include it during export.

2.6. NLP Services

Midas Loop is able to contact NLP services via HTTP in order to get machine learning model outputs for certain kinds of annotations. NLP services work by waiting to be contacted by the Midas Loop server, which will contact the service when it needs fresh label distributions for a given annotation type.

Specifically, Midas Loop is able to accommodate outputs for sentence splits (i.e., token-level classification of whether a particular token is the beginning of a new sentence) as well as UPOS, XPOS, and HEAD annotations. For each of these annotation types, it is expected that a service will be able to take a sentence as input and provide a list of probability distributions over labels, one distribution per token.

2.6.1. Inputs

The service should be listening for POST requests at /, and can expect that the JSON payload will include the keys conllu and json: the conllu key will have the stringified CoNLL-U representation of the sentence, and the json key will have Midas Loop’s verbose internal representation of the sentence.
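A service might unpack such a payload along the following lines. This is a sketch only: the function names are hypothetical, and the contents of the json value are treated as opaque because its structure is not described here.

```python
import json


def parse_request_body(body):
    """Decode a request body and return (conllu_text, sentence_json)."""
    payload = json.loads(body)
    missing = {"conllu", "json"} - payload.keys()
    if missing:
        raise ValueError(f"payload is missing keys: {sorted(missing)}")
    return payload["conllu"], payload["json"]


def token_rows(conllu_text):
    """Return the CoNLL-U token rows, skipping comments and blank lines.

    Each row is the list of tab-separated columns for one token line.
    """
    return [
        line.split("\t")
        for line in conllu_text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
```

The rows returned by token_rows give the service a quick way to count tokens and read their forms, while the json value carries Midas Loop's internal IDs.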

2.6.2. Outputs

The service should respond with a JSON object in the response body with a single key, probabilities. The value associated with this key should be a list of objects (= Python dicts), one per token and in token order, where each object maps labels to the probabilities the model predicted for the token at that position. The probabilities in each object should sum to 1.

For any input sentence, the number of output label distributions must exactly match the expected number. For UPOS, XPOS, and HEAD, this is the number of tokens that are either normal tokens or ellipsis tokens, and for sentence splits, this is the number of normal tokens. Model outputs will be rejected if they do not contain the expected number of label distributions.
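A service can check its own output against these requirements before responding. The sketch below is one way to do so; validate_probabilities is a hypothetical helper, and the numeric tolerance used for the sum-to-1 check is an assumption, since no tolerance is specified here.

```python
import math


def validate_probabilities(response, expected_count, tolerance=1e-6):
    """Raise ValueError if `response` breaks the output requirements.

    `expected_count` is the number of distributions Midas Loop expects
    for the sentence; `tolerance` is an assumed rounding allowance.
    """
    dists = response.get("probabilities")
    if not isinstance(dists, list):
        raise ValueError('response must contain a "probabilities" list')
    if len(dists) != expected_count:
        raise ValueError(
            f"expected {expected_count} distributions, got {len(dists)}"
        )
    for position, dist in enumerate(dists):
        total = sum(dist.values())
        if not math.isclose(total, 1.0, abs_tol=tolerance):
            raise ValueError(
                f"distribution at position {position} sums to {total}, not 1"
            )
    return True
```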

Label Value Requirements

For UPOS and XPOS, any label is acceptable, but HEAD and sentence splits require careful attention to labels:

  • For HEAD, labels must be the internal IDs for tokens provided in the json input representation, i.e. UUIDs such as 013769d9-dc90-4278-9bc2-5d6a9f96d0fc instead of CoNLL-U IDs like 3 or 11.2. The only exception is the string value "root", used to indicate the root of the sentence.

  • For sentence splits, labels must be either "B" or "O", where "B" indicates the beginning of a new sentence.

Warning
When providing HEAD outputs, be sure that you are using the ID of the token entity in the JSON, and not the ID of the head entity in the JSON.
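The label requirements for sentence splits and HEAD can be sketched as two small helpers. Both function names are hypothetical, and the caller is assumed to have already extracted the token entity IDs from the json input, whose structure is not shown here.

```python
def split_distribution(p_begin):
    """Label distribution for one token, given P(token begins a sentence)."""
    return {"B": p_begin, "O": 1.0 - p_begin}


def head_distribution(scores, token_ids):
    """Turn non-negative head scores into a distribution over valid labels.

    `scores[0]` is the score for "root"; `scores[i]` for i >= 1 is the
    score for the i-th entry of `token_ids` being the head. `token_ids`
    must be Midas Loop token entity IDs (UUID strings), not CoNLL-U IDs.
    """
    labels = ["root"] + list(token_ids)
    if len(scores) != len(labels):
        raise ValueError("need one score for root plus one per token ID")
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return {label: score / total for label, score in zip(labels, scores)}
```

For instance, head_distribution([2, 1, 1], [id_a, id_b]) assigns probability 0.5 to "root" and 0.25 to each of the two token IDs.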

2.6.3. Service Registration

NLP services will not be contacted unless Midas Loop is told about them. See :nlp-services in Configuration.

2.6.4. Example

Consider a sample XPOS tagging service at services/sample_xpos.py. This is a barebones HTTP service implemented using Flask which loads a pretrained English part-of-speech tagger from spaCy and uses it to respond to requests. It listens for POST requests, and when it receives one, it parses the CoNLL-U string, runs the model on the sentence, and recovers the label probabilities from the model's outputs. Note that the model is initialized globally so that it may reside in memory between requests.
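For readers who want to see the overall shape of such a service without installing Flask or spaCy, here is a stand-in built only on the Python standard library. It is not the code in services/sample_xpos.py: the tag set, the port, and the uniform "model" are placeholders chosen for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical tag set; a real service would use its model's labels.
TAGS = ["NN", "VB", "JJ"]


def predict(conllu_text):
    """Placeholder model: a uniform distribution for every tagged token."""
    rows = [
        line.split("\t")
        for line in conllu_text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    # Skip multiword-token ranges such as "1-2"; keep normal and
    # ellipsis tokens (IDs such as "3" or "11.2").
    tokens = [row for row in rows if "-" not in row[0]]
    uniform = {tag: 1.0 / len(TAGS) for tag in TAGS}
    return [dict(uniform) for _ in tokens]


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON payload and answer with {"probabilities": [...]}.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(
            {"probabilities": predict(payload["conllu"])}
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Suppress per-request logging to keep the sketch quiet.
        pass


def make_server(port=5555):
    """Create (but do not start) the service on localhost."""
    return HTTPServer(("127.0.0.1", port), Handler)
```

Calling make_server().serve_forever() would start the service; a real service would replace predict with a call to a model that is loaded once at startup.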

2.7. Running

Once you are satisfied with your configuration, simply run java -Dconf=… -jar midas-loop.jar run. Be sure that any required NLP services are running as well. To stop the server, interrupt it with CTRL+C. Avoid killing the process, as this may corrupt the database.

2.8. Clearing Database Files

All data is stored on-disk: authorization information is by default stored at xtdb_token_data, and all other information is stored at xtdb_data. If you wish to clear either database, you may simply delete the relevant folder—​just make sure that the system is not running before you do so, and next time the system starts, the folder will be regenerated.

3. Changelog

0.0.1

Initial Release

4. Development

Leiningen is used to build code.

4.1. Dev Server

Run lein repl in order to get a dev REPL, then execute (start) in the prompt. (stop) and (restart) are also available in the REPL. This will use the config at env/dev/resources/config.edn.

4.2. Testing

Run lein test. This will use the config at env/test/resources/config.edn.

4.3. Production Build

4.3.1. Build the Client

  1. Clone midas-loop-ui.

  2. Examine and modify the contents of webpack.prod.js, specifically the definitions. You must at least provide a new value for API_ENDPOINT, which should match the URL at which your Midas Loop backend system will be reachable. For example, if you have a machine reachable at my.university.edu, and the Midas Loop backend system is exposed on port 3000, your API_ENDPOINT should be set to my.university.edu:3000/api. You may also wish to customize XPOS_LABELS and DEPREL_LABELS.

  3. Install dependencies: yarn

  4. Compile assets for production deployment: yarn build

  5. Ensure that assets were successfully compiled at dist/

4.3.2. Build the Server

  1. Clone midas-loop.

  2. Move the contents of the dist/ folder you just created into resources/public/. The .js files, etc. should be directly in the resources/public/ folder, not in resources/public/dist/.

  3. Compile an uberjar with lein uberjar. This will produce a standalone JAR ready for distribution and execution via java -jar. Unless overridden, this will use the config at env/prod/resources/config.edn.

  4. Verify that the uberjar was produced successfully by running java -jar target/uberjar/midas-loop.jar. This .jar is the only artefact you will need to deploy.

4.4. Building Docs

Install Asciidoctor, then:

asciidoctor-pdf -o target/book.pdf -b pdf -r asciidoctor-diagram docs/book.adoc
asciidoctor -o target/book.html -b html -r asciidoctor-diagram docs/book.adoc

4.5. Version Bump Checklist

Always do the following:

  • Change version number in project.clj.

  • Change version number in core.clj.

  • Change version number in package.json.

  • Change links in midas-loop.sh.

  • Ensure that Changelog and Introduction are up to date.

  • Compile and push the latest docs.

  • Make a GitHub release with the appropriate version number and with an accompanying uberjar.