1. Introduction
Midas Loop is a web application for taking Universal Dependencies corpora and improving the quality of their annotations. For more information on motivation, functionality, and supported workflows, please see our paper.
1.1. Key features
1.1.1. CoNLL-U Import/Export Support
Midas Loop supports import and export of corpora in the CoNLL-U format.
1.1.2. CoNLL-U Editing
Editing of most annotations in the CoNLL-U format is supported.
1.1.3. Active Learning Support
Midas Loop allows NLP models to report probability distributions on several annotation types and uses these distributions to provide visual cues for annotators that a certain annotation is suspicious. The annotator may then decide whether to keep or replace the annotation, and the model used may be trained further on the improved data. Models are completely decoupled from the core system and communicate with it via HTTP, so any model may be used as long as it obeys the HTTP protocol.
These model-provided label distributions are also aggregated at the document level to allow annotators to triage documents based on how uncertain a model was about certain annotation types on average throughout the document.
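The document-level aggregation described above can be illustrated with a minimal sketch. The source does not say which uncertainty measure Midas Loop uses, so this example assumes mean Shannon entropy per token; the function names `entropy` and `document_uncertainty` are hypothetical and not part of Midas Loop.

```python
import math
from statistics import mean


def entropy(dist):
    """Shannon entropy (in bits) of one label distribution (label -> probability)."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def document_uncertainty(sentences):
    """Average token-level entropy over a document.

    `sentences` is a list of sentences; each sentence is a list of per-token
    label distributions. Mean entropy is an assumption made for illustration.
    """
    token_entropies = [entropy(d) for sent in sentences for d in sent]
    return mean(token_entropies) if token_entropies else 0.0


# Two toy documents with XPOS-style distributions (made-up numbers).
confident_doc = [[{"NN": 0.99, "VB": 0.01}, {"DT": 1.0}]]
uncertain_doc = [[{"NN": 0.5, "VB": 0.5}, {"DT": 0.6, "PRP": 0.4}]]

# Rank documents so the most uncertain one is triaged first.
ranked = sorted(
    {"confident": confident_doc, "uncertain": uncertain_doc}.items(),
    key=lambda kv: document_uncertainty(kv[1]),
    reverse=True,
)
print([name for name, _ in ranked])  # ['uncertain', 'confident']
```

Any other per-token score (for example, one minus the highest probability) could be swapped in for `entropy` without changing the ranking logic.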
1.2. Limitations
Midas Loop supports the following:
- Sentence break editing
- LEMMA editing
- XPOS editing
- HEAD editing
- DEPREL editing
- Active Learning support for HEAD, XPOS, and sentence breaks
Midas Loop does NOT support the following:
- FORM/tokenization editing
- UPOS editing
- FEATS editing
- DEPS editing
- MISC editing
(Caveat: these are the limitations of the Midas Loop UI, but the Midas Loop core system actually supports editing of all core data types. See /swagger-ui on a running server for API documentation. It is possible to build your own UI or extend the Midas Loop UI to provide some of these additional editing features. Additionally, please open an issue on GitHub if there is a kind of editing you would like to see added to Midas Loop.)
1.3. Roadmap
The following are priorities for future work:
- More efficient NLP processing
- Support for full CoNLL-U editing
- Online retraining of NLP models
Please do not hesitate to open an issue on GitHub with feature requests, etc.
2. Operation
2.1. Server Setup
Get an uberjar either by building it or by downloading the latest pre-built one. Note the following top level commands:
COMMANDS:
run, r Start the web app and begin listening for requests.
import, i Read and ingest CoNLL-U files.
export, e Export all documents in the database as CoNLL-U files.
token, t Token-related helpers.
Run one of the top level commands with --help (e.g. java -jar midas-loop.jar import --help) to see more details about each command.
Important: Each command requires that your server not be running. If you are running your server using the run command, be sure you shut it down before running any of the other commands.
2.2. Configuration
By default, the uberjar will use its copy of the config located at env/prod/resources/config.edn. If you wish to customize this, specify another config using -Dconf=…:
java -Dconf="/path/to/my/config.edn" -jar midas-loop.jar …
Config keys:
|
Should be a map with two subkeys: |
|
Map with a single key, |
|
Either |
|
A vector of three-key maps. Each map should have a |
|
Time, in milliseconds, to wait after a failure before attempting to contact an HTTP NLP service again. Defaults to |
|
Port used for the main web server. |
|
A set of CORS patterns (regular expressions) for adding additional allowed origins, e.g. |
2.3. Authorization
Warning: Midas Loop’s authorization scheme is primitive and vulnerable to attack, and is therefore only useful for preventing low-effort unauthorized access. You SHOULD NOT store sensitive data in a Midas Loop system.
2.3.1. Granting
Token-based authorization is used. Each user should have a token made for them, like so:
java -jar midas-loop.jar token add --name "Sam Doe" --email "sd42@gmail.com" --quality "gold"
Give your user their token and instruct them to keep it secret.
If you are using a non-standard configuration using java -Dconf=…, be sure to include it when adding tokens.
2.3.2. Listing
You can see all valid tokens with java -jar midas-loop.jar token list.
If you are using a non-standard configuration using java -Dconf=…, be sure to include it when listing tokens.
2.3.3. Revoking
You can revoke a token like so:
java -jar midas-loop.jar token revoke --secret "gold;secret=84EO60tU6lhcBhplbuEEGElECuh1yZod8fTCn6DqkQA"
If you are using a non-standard configuration using java -Dconf=…, be sure to include it when revoking tokens.
2.4. Importing
Use the import subcommand and supply it with a directory path. The directory will be recursively searched for files ending in .conllu and each will be loaded into the database.
Example invocation:
java -jar midas-loop.jar import dir/with/conllu-files/
If you are using a non-standard configuration using java -Dconf=…, be sure to include it during import.
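To preview which files a directory would contribute before importing, a short script can mirror the recursive search described above. This is a sketch only, not Midas Loop's own import code: the helper name `find_conllu_files` is hypothetical, and matching the .conllu suffix case-sensitively is an assumption.

```python
import tempfile
from pathlib import Path


def find_conllu_files(root):
    """Return every regular file under `root` (searched recursively) ending in .conllu."""
    return sorted(p for p in Path(root).rglob("*.conllu") if p.is_file())


# Demo on a throwaway directory tree so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train" / "news").mkdir(parents=True)
    (Path(tmp) / "train" / "news" / "doc1.conllu").write_text("", encoding="utf-8")
    (Path(tmp) / "dev.conllu").write_text("", encoding="utf-8")
    (Path(tmp) / "notes.txt").write_text("", encoding="utf-8")  # ignored: wrong suffix
    found = [p.relative_to(tmp).as_posix() for p in find_conllu_files(tmp)]
    print(found)  # ['dev.conllu', 'train/news/doc1.conllu']
```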
2.5. Exporting
Use the export subcommand and provide it with a directory path. A separate .conllu file for each document will be created directly under that directory.
Example invocation:
java -jar midas-loop.jar export output/dir/
If you are using a non-standard configuration using java -Dconf=…, be sure to include it during export.
2.6. NLP Services
Midas Loop is able to contact NLP services via HTTP in order to get machine learning model outputs for certain kinds of annotations. NLP services work by waiting to be contacted by the Midas Loop server, which will contact the service when it needs fresh label distributions for a given annotation type.
Specifically, Midas Loop is able to accommodate outputs for sentence splits (i.e., token-level classification of whether a particular token is the beginning of a new sentence) as well as UPOS, XPOS, and HEAD annotations. For each of these annotation types, it is expected that a service will be able to take a sentence as input and provide a list of probability distributions over labels, one distribution per token.
2.6.1. Inputs
The service should be listening for POST requests at /, and can expect that the JSON payload will include the keys conllu and json: the conllu key will have the stringified CoNLL-U representation of the sentence, and the json key will have Midas Loop’s verbose internal representation of the sentence.
2.6.2. Outputs
The service should respond with a JSON object in the response body with a single key, probabilities. The value associated with this key should be a list of objects (= Python dicts) where each object holds key-value pairs expressing labels' probabilities as predicted by the model for the token at the corresponding position. The values in each object should sum to 1.
For any input sentence, the number of output label distributions must exactly match the expected number. For UPOS, XPOS, and HEAD, this is the number of normal tokens or ellipsis tokens, and for sentence splits, this is the number of normal tokens. Model outputs will be rejected if the expected number of label distributions is not met.
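A service can check its own response against these rules before sending it. The sketch below is not Midas Loop's validation code. It rests on several assumptions: "normal tokens or ellipsis tokens" is read as the sum of both counts, multiword token range lines (IDs such as 3-4) are excluded, the sum-to-1 tolerance is arbitrary, and the annotation type names "sentence" and "xpos" are hypothetical.

```python
import math


def count_tokens(conllu):
    """Count (normal, ellipsis) tokens in one CoNLL-U sentence string.

    In CoNLL-U, plain integer IDs are normal tokens, decimal IDs such as
    11.2 are empty (ellipsis) nodes, and ranges such as 3-4 mark multiword tokens.
    """
    normal = ellipsis = 0
    for line in conllu.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and comment lines carry no token
        token_id = line.split("\t", 1)[0]
        if "-" in token_id:
            continue  # multiword token range: excluded (assumption)
        if "." in token_id:
            ellipsis += 1
        else:
            normal += 1
    return normal, ellipsis


def expected_count(conllu, annotation_type):
    """Expected number of distributions; type names are hypothetical."""
    normal, ellipsis = count_tokens(conllu)
    return normal if annotation_type == "sentence" else normal + ellipsis


def is_valid_response(response, conllu, annotation_type, tol=1e-6):
    """True if the count matches and every distribution sums to 1 within `tol`."""
    dists = response["probabilities"]
    if len(dists) != expected_count(conllu, annotation_type):
        return False
    return all(math.isclose(sum(d.values()), 1.0, abs_tol=tol) for d in dists)


SAMPLE = (
    "# text = I do\n"
    "1\tI\tI\tPRON\tPRP\t_\t2\tnsubj\t_\t_\n"
    "2\tdo\tdo\tVERB\tVBP\t_\t0\troot\t_\t_\n"
    "2.1\tso\tso\tADV\tRB\t_\t_\t_\t_\t_\n"
)
splits = {"probabilities": [{"B": 0.9, "O": 0.1}, {"B": 0.2, "O": 0.8}]}
print(count_tokens(SAMPLE))                           # (2, 1)
print(is_valid_response(splits, SAMPLE, "sentence"))  # True
print(is_valid_response(splits, SAMPLE, "xpos"))      # False: 3 distributions expected
```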
Label Value Requirements
For UPOS and XPOS, any label is acceptable, but HEAD and sentence splits require careful attention to labels:
- For HEAD, labels must be the internal IDs for tokens provided in the json input representation, i.e. UUIDs such as 013769d9-dc90-4278-9bc2-5d6a9f96d0fc instead of CoNLL-U IDs like 3 or 11.2. The only exception is the string value "root", used to indicate the root of the sentence.
- For sentence splits, labels must be either "B" or "O", where "B" indicates the beginning of a new sentence.
Warning: Be sure that you are using the ID for the token entity in the JSON, and not the head entity in the JSON, when providing your outputs.
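The label requirements above can be checked on the service side with two small helpers. Both names are hypothetical. The HEAD check only tests that a label looks like a UUID or is "root"; it cannot confirm that the UUID belongs to a token entity in the input sentence, which would require reading the json representation.

```python
import uuid


def split_distributions(start_probs):
    """Turn per-token P(token begins a sentence) into "B"/"O" label distributions."""
    dists = []
    for p in start_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
        dists.append({"B": p, "O": 1.0 - p})
    return dists


def looks_like_head_label(label):
    """True for the string "root" or any string that parses as a UUID."""
    if label == "root":
        return True
    try:
        uuid.UUID(label)
    except ValueError:
        return False  # e.g. CoNLL-U IDs such as "3" or "11.2"
    return True


print(split_distributions([1.0, 0.25]))
print(looks_like_head_label("013769d9-dc90-4278-9bc2-5d6a9f96d0fc"))  # True
print(looks_like_head_label("11.2"))                                  # False
```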
2.6.3. Service Registration
NLP services will not be contacted unless Midas Loop is told about them. See :nlp-services in Configuration.
2.6.4. Example
Consider a sample XPOS tagging service at services/sample_xpos.py.
This is a barebones HTTP service implemented using Flask which loads a pretrained English part of speech tagger from spaCy and uses it to respond to requests.
It listens for a POST request, and when it receives it, uses the model to parse the CoNLL-U string and recover the probabilities from the model’s outputs.
Note that the model is initialized globally so that it may reside in memory in between requests.
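For readers who want to see the overall shape of such a service without Flask or spaCy, the following is a standard-library sketch. It is not the contents of services/sample_xpos.py: the "model" is a placeholder that returns a uniform distribution over a made-up tag set, and the label list, port number, and function names are all assumptions.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical tag set; a real service would load a trained model once here,
# at module level, so that it stays in memory between requests.
LABELS = ["NN", "VB", "DT"]


def predict(conllu):
    """Placeholder model: one uniform distribution per token line."""
    dists = []
    for line in conllu.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        if "-" in line.split("\t", 1)[0]:
            continue  # skip multiword token ranges (assumption)
        dists.append({label: 1.0 / len(LABELS) for label in LABELS})
    return {"probabilities": dists}


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON payload, which carries the conllu and json keys.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(predict(payload["conllu"])).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=5555):
    """Block and answer POST requests; the port is an arbitrary choice."""
    HTTPServer(("localhost", port), Handler).serve_forever()


# Call serve() to start listening. It is not invoked here so that the
# module can be imported and predict() exercised on its own.
```

Replacing `predict` with a call into a real tagger is the only change needed to turn the placeholder into a working service, provided the output keeps one distribution per expected token.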
2.7. Running
Simply run java -Dconf=… -jar midas-loop.jar run once you are satisfied with your configuration. Be sure that any required NLP services are running as well.
To stop the server, interrupt it with CTRL+C. Avoid killing the process, as this may corrupt the database.
2.8. Clearing Database Files
All data is stored on-disk: authorization information is by default stored at xtdb_token_data, and all other information is stored at xtdb_data.
If you wish to clear either database, you may simply delete the relevant folder. Make sure that the system is not running before you do so; the folder will be regenerated the next time the system starts.
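The deletion can also be scripted. The sketch below uses the default folder names given above and assumes they live directly under a base directory that you pass in; the helper name `clear_database` is hypothetical, and the script cannot tell whether the server is running, so that check remains manual.

```python
import shutil
import tempfile
from pathlib import Path

# Default folder names from the text above; adjust if your config differs.
DB_DIRS = {"tokens": "xtdb_token_data", "data": "xtdb_data"}


def clear_database(base_dir, which):
    """Delete one database folder under `base_dir`; return True if it existed."""
    target = Path(base_dir) / DB_DIRS[which]
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False


# Demo against a throwaway directory rather than a real installation.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "xtdb_data").mkdir()
    first = clear_database(tmp, "data")   # folder existed and was removed
    second = clear_database(tmp, "data")  # nothing left to remove
    print(first, second)  # True False
```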
3. Changelog
0.0.1
Initial Release
4. Development
Leiningen is used to build code.
4.1. Dev Server
Run lein repl in order to get a dev REPL, then execute (start) in the prompt. (stop) and (restart) are also available in the REPL. This will use the config at env/dev/resources/config.edn.
4.2. Testing
Run lein test. This will use the config at env/test/resources/config.edn.
4.3. Production Build
4.3.1. Build the Client
- Clone midas-loop-ui.
- Examine and modify the contents of webpack.prod.js, specifically the definitions. You must at least provide a new value for API_ENDPOINT, which should match the URL at which your Midas Loop backend system will be reachable. For example, if you have a machine reachable at my.university.edu, and the Midas Loop backend system is exposed on port 3000, your API_ENDPOINT should be set to my.university.edu:3000/api. You may also wish to customize XPOS_LABELS and DEPREL_LABELS.
- Install dependencies: yarn
- Compile assets for production deployment: yarn build
- Ensure that assets were successfully compiled at dist/
4.3.2. Build the Server
- Clone midas-loop.
- Move the contents of the dist/ folder you just created into resources/public/. The .js files, etc. should be directly in the resources/public/ folder, not in resources/public/dist/.
- Compile an uberjar with lein uberjar. This will produce a standalone JAR ready for distribution and execution via java -jar. Unless overridden, this will use the config at env/prod/resources/config.edn.
- Verify that the uberjar was produced successfully by running java -jar target/uberjar/midas-loop.jar. This .jar is the only artefact you will need to deploy.
4.4. Building Docs
Install Asciidoctor, then:
asciidoctor-pdf -o target/book.pdf -b pdf -r asciidoctor-diagram docs/book.adoc
asciidoctor -o target/book.html -b html -r asciidoctor-diagram docs/book.adoc
4.5. Version Bump Checklist
Always do the following:
- Change version number in project.clj.
- Change version number in core.clj.
- Change version number in package.json.
- Change links in midas-loop.sh.
- Ensure that Changelog and Introduction are up to date.
- Compile and push the latest docs.
- Make a GitHub release with the appropriate version number and with an accompanying uberjar.