ActiveTigger Quickstart
This guide explains how to get started with an annotation project in ActiveTigger.
ActiveTigger is a tool that will be useful if you want to:
- Quickly annotate a text-based dataset in a dedicated collaborative interface
- Train a model on a small set of annotated data to extend it on a larger corpus
This guide explains the basics of all functionalities, but you do not have to use all of them for your project. For example, if you only want to use ActiveTigger for manual annotation, focus on the sections detailing how to set up and export your project data.
ActiveTigger is in beta
ActiveTigger is still in its beta version, and we are working on improving it. If you encounter any issues (or have suggestions), please let us know on GitHub issues.
Note: ActiveTigger currently only works for multiclass/multilabel annotation, not span annotation.
Creating a project
The first step is to create a project. Creating the project is the most important step, since it will define the framework of your annotation process.
First, you need to import raw data: a csv, xlsx or parquet file with your texts separated at the level you wish to annotate (sentences, paragraphs, social media posts, articles...). Each element should be a row. These will be loaded as the elements that you can individually annotate. It is up to you to produce this file.
Give your project a name. Each project name is unique in the application and will allow you to identify your project.
Name and ID can be transformed by the process
Both the project name and the IDs will be transformed to be URL-compatible. This means, for instance, that accented characters will be replaced by their non-accented equivalents, and spaces/underscores will be replaced by a dash. Plan for this if you need to match this data with other sources later in the process.
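For illustration only, here is a rough Python sketch of this kind of transformation; the exact rule applied by ActiveTigger may differ.

```python
import re
import unicodedata

def make_url_compatible(name: str) -> str:
    """Rough illustration of a URL-safe transformation (not ActiveTigger's exact rule)."""
    # Replace accented characters with their non-accented equivalents
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Replace runs of spaces and underscores with a dash
    name = re.sub(r"[ _]+", "-", name.strip())
    return name.lower()

print(make_url_compatible("Élections _ 2024"))  # elections-2024
```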
Specify the name of the column that contains the unique (numerical or textual) IDs for each element (id columns), and the name of the column (or columns) that contains the text (text(s) columns). Specify the language of the texts.
If the file has already been annotated and you want to import these annotations, you can specify the column of existing annotations.
Optionally, if there are context elements that you can display while annotating (for example the author, the date, the publication...), you can specify the relevant columns here.
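As an illustration, here is a minimal pandas sketch of what such an input file could look like. The column names (id, text, label, author, date) are just examples: you declare the relevant columns yourself at project creation.

```python
import pandas as pd

# Hypothetical corpus: one row per element to annotate, with a unique id,
# the text itself, an optional existing annotation and optional context columns.
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "text": [
            "The government announced a new climate bill.",
            "Local team wins the championship.",
            "Markets rallied after the announcement.",
        ],
        "label": ["politics", "not politics", None],  # optional: existing annotations
        "author": ["afp", "reuters", "afp"],          # optional: context column
        "date": ["2024-01-03", "2024-01-04", "2024-01-05"],
    }
)

# Save in one of the accepted formats
df.to_parquet("my_corpus.parquet", index=False)
# df.to_csv("my_corpus.csv", index=False)
```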
The next step is to define both the training dataset that you will annotate and an optional test dataset that can be used for model evaluation.
The training dataset is the most important, since it contains the elements that you will be able to see and annotate. The underlying idea is that computation is limited to this dataset in a first phase.
You need to specify the number of elements you want in each dataset. Those elements will then be picked randomly from the raw data; for the training dataset, elements that have already been annotated (if any) are prioritized.
Using a test set is not mandatory; if you would like to validate your model on a test set, this remains possible at a later stage.
Size of the dataset
For the moment, you cannot add additional elements later. For this, you will need to create a new project. This will change in the future.
Once the project is created, you can start working on it.
Visibility
By default, a project is only visible to the user who created it (and the administrator of the service). You can add users to a project if you want to work collaboratively.
Project page
Click on the name of your project in the left-hand menu to see a summary of your situation.
Every project can have several coding schemes. A scheme is a set of specific labels that you can use to annotate the corpus. Each scheme works as an independent layer of annotation. One is created by default when you create a project.
You can create a new coding scheme or delete an old one in the menu at the top. Creating a new coding scheme means starting from zero, but will not modify previous coding schemes. You can toggle between schemes as you go.
Coding schemes
There are two types of coding schemes: multi-class and multi-label (experimental). Multi-class means one label per element; multi-label means several labels per element. You cannot switch between them, and multi-label schemes are not yet completely implemented in the interface, so we recommend checking whether the features you need are available.
You can also see a summary of all your current annotations (per category), a history of all your actions in the project, and a summary of the parameters you set up while creating your project. You can also delete the project in the Parameters tab once you are finished with it.
Destroy a project
Be aware that deleting a project will delete all the annotations along with the project itself. It will also release space for other projects, so please don't hesitate to clean up once you are done.
Once you have entered the annotation phase, you will have a history of already annotated elements. This is the session history.
Session history
Be aware that you can only see any particular element once during a single session, so if you need to re-annotate elements, you will need to clear the history first.
Explore
The Explore tab gives you an overview of your data. You can filter to see elements with certain keywords or regex patterns and get an overview of your annotations so far. You can click on elements to switch to annotation.
Prepare
Define labels
Before annotating, you need to define your labels.
We recommend keeping your labels simple. If you are aiming to train a model, binary categorizations tend to be easier to handle. For example, if you are annotating newspaper headlines, it is easier to classify each headline as "politics/not politics" than to include all possible subjects as separate categories. You can layer different binary categorizations as different coding schemes, or add labels at a later stage.
Enter the name of each label under "New label" and click the plus sign.
You can also delete or replace labels.
- If you want to delete a label, pick the relevant label under Available labels and click the trash bin. All existing annotations with this label will be deleted.
- If you want to replace a label, pick the relevant label under Available labels, write the label's new name, and click the button next to Replace selected label. All the existing annotations will be converted to the new label.
Merging labels
If you want to merge two labels into one, simply rename one of them with the name of the other.
Define features
The Prepare tab also lets you define the features you want to use for your model.
A feature is a numerical vector representation of each text element. This is necessary to train certain models (especially for active learning) or to compute projections.
By default, we recommend using the sbert feature, which is a pre-trained model that converts your text into a numerical representation. This is a good starting point for most projects.
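To give an intuition of what such a feature looks like, here is a minimal sketch using the sentence-transformers library. ActiveTigger computes this for you, so you do not need to run it yourself, and the model name below is only an example.

```python
from sentence_transformers import SentenceTransformer

# Encode one text element into a numerical vector (an example model, for illustration)
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["The government announced a new climate bill."])
print(embeddings.shape)  # (1, 384): one vector per text element
```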
Write your codebook
Under the Codebook tab, you can also include written instructions on how to distinguish your categories. It is helpful if you work collaboratively.
Annotate
The Annotate tab is where you will spend most of your time.
Selection mode
In the Annotate section, the interface will pick out an element that you can annotate according to your pre-defined labels. Once you have validated an annotation, the interface will pick the next element for you following the selection mode that is configured. The ID of the element is displayed in the URL.
By default, the selection modes "deterministic" and "random" are available:
- Deterministic mode means that ActiveTigger will pick out each element in the order of the database, as it was created when you set up your project.
- Random mode means that ActiveTigger will pick out the next element at random.
Click on Get element if you want to apply a new selection mode after modifying it.
The selection mode refers both to the general rule for getting new elements (e.g. random) and to specific rules, such as regular expression (regex) patterns. You can search for elements containing particular keywords or particular syntax patterns (regex). This could mean fishing out all elements that contain certain symbols, for example. If you are unfamiliar with regex patterns, this generator can be a useful reference.
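For illustration, here are a few patterns of the kind you could use in the filter, shown with Python's re module; the patterns themselves are only examples.

```python
import re

# Illustrative regex patterns for selecting elements
patterns = [
    r"climate",      # elements containing the keyword "climate"
    r"#\w+",         # elements containing a hashtag
    r"\b\d{4}\b",    # elements containing a four-digit number (e.g. a year)
]

text = "New climate targets announced for 2030 #COP28"
for pattern in patterns:
    print(pattern, bool(re.search(pattern, text)))
```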
Keyboard shortcuts
You can use the keyboard shortcuts to annotate faster. The number keys correspond to the labels you have defined. You can move the labels to change the order if needed.
Add comment
You can add a comment to each annotation. This can be useful to explain why you chose a certain label, or to note any particularities about the text.
You can also go back to the previous annotated element.
Active learning
Active learning is a method to accelerate the annotation process and to improve the dataset used for model fine-tuning.
Often, we want to classify imbalanced datasets, i.e. where one category is much less represented in the data than the other. This can mean very lengthy annotation processes, if you go through each element in a random order hoping to stumble upon both of your categories.
Using the already annotated data, ActiveTigger can find the elements that your current model is either most certain or most uncertain about, given your existing coding scheme and annotations. Here is how to set it up:
First, make sure you have at least one feature created under the Prepare tab (by default, we recommend sbert).
Second, you need to train a current prediction model based on the annotations you have made so far. You do this at the bottom of the Annotate tab. The basic parameters can be used for the first model and refined later.
Once the prediction model is trained, you can choose the active and maxprob selection modes when picking elements. That means you can use the predictions of this model to guide your selection.
- Active mode means that Active Tigger will pick the elements on which it is most uncertain (where, based on previous annotations, it could be classified either way)
- Maxprob mode means that Active Tigger will pick the elements on which it is most certain (where, based on previous annotations, the model guesses where to categorize it with the highest levels of confidence).
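To give an intuition of the difference between the two modes, here is a minimal sketch of the underlying idea. It is not ActiveTigger's actual implementation, and the probabilities below are made up.

```python
import numpy as np

# Hypothetical predicted probabilities for the positive class on 5 unannotated elements
probs = np.array([0.51, 0.97, 0.08, 0.45, 0.88])

# "Active" mode idea: pick the element the model is least sure about (closest to 0.5)
uncertainty = 1 - np.abs(probs - 0.5) * 2
print("active pick:", int(np.argmax(uncertainty)))   # element 0 (p = 0.51)

# "Maxprob" mode idea: pick the element the model is most confident about
confidence = np.maximum(probs, 1 - probs)
print("maxprob pick:", int(np.argmax(confidence)))   # element 1 (p = 0.97)
```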
When constructing your training dataset, we recommend starting in random mode in order to create a base of annotations on which to train a prediction model. There is no absolute minimum number: a couple dozen annotations representing both of your labels can serve as a base. By default, the model is retrained only when you request it, but you can also configure it to be retrained every N steps.
If the aim is to train a model, we recommend alternating between active and maxprob modes in order to maximize the number of examples from both of your categories, while prioritizing the uncertain elements.
If displayed, the Prediction button above your available labels indicates the model's predicted label (given previous annotations) and its level of certainty. You can activate or deactivate it in the Display parameters.
Fine-tune your BERT classifier
Active Tigger allows you to train a BERT classifier on your annotated data with two goals: extending your annotations to the complete dataset, or retrieving the classifier for other uses. Basically, it is fine-tuning: the pre-trained base model will be adjusted to your specific data.
This is done on the Train tab. Click on New Model to train a new model.
Name it and pick which BERT base model you would like to use (note that some are language-specific; by default, use ModernBERT for English and CamemBERT for French).
You can adjust the parameters for the model, or leave it at default values.
Leave some time for the training process (you can follow its progress). Depending on the parameters, it will consume more or less computational power, especially GPU, and it can take some time depending on the number of elements. Once the model is available, you can consult it under the Models tab.
GPU load
When available, the process will use the GPU. Since resources are limited, overload can happen, and a process can fail if there is not enough memory. You can follow the current state in the left-hand menu of the screen.
At this point, you only have the model. Now you can decide to apply it to your data: either to the training dataset to see metrics of its performance, or to the whole initial dataset to extend your annotations.
Choose the name of the model under Existing models, click on the Scores tab, and click Predict using train set. It will use the model on the training dataset (so on the elements you haven't annotated yet).
Once the prediction is done, you will see a series of scores that allow you to evaluate the model's performance:
- F1 micro: The harmonic mean of precision and recall, calculated globally without considering category imbalance.
- F1 macro: The harmonic mean of precision and recall calculated per class, treating all categories equally regardless of their size.
- F1 weighted: The harmonic mean of precision and recall calculated per class, weighted by the number of true instances in each category to account for imbalance.
- F1: The harmonic mean of precision and recall (shown for each label).
- Precision: Proportion of correctly predicted positive cases out of all predicted positives.
- Recall: Proportion of correctly predicted positive cases out of all actual positives.
- Accuracy: Proportion of correctly classified elements out of total elements.
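As an illustration, these scores correspond to standard classification metrics such as those in scikit-learn. Here is a minimal sketch with hypothetical labels and predictions (the label names are just examples):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical annotations (true labels) and model predictions
y_true = ["politics", "politics", "not politics", "not politics", "politics"]
y_pred = ["politics", "not politics", "not politics", "not politics", "politics"]

print("accuracy:", accuracy_score(y_true, y_pred))
print("f1 micro:", f1_score(y_true, y_pred, average="micro"))
print("f1 macro:", f1_score(y_true, y_pred, average="macro"))
print("f1 weighted:", f1_score(y_true, y_pred, average="weighted"))
print("precision (politics):", precision_score(y_true, y_pred, pos_label="politics"))
print("recall (politics):", recall_score(y_true, y_pred, pos_label="politics"))
```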
All of these variables tell you useful information about how your model performs, but the way you assess them depends on your research question.
For example, say that you are classifying social media posts according to whether they express support for climate policies or not. A low precision score means many posts labeled as "supportive" are actually irrelevant or against climate change policies (false positives). A low recall means the model misses many supportive posts (false negatives). Improving precision might involve stricter rules for classifying posts as supportive (e.g., requiring multiple positive keywords). However, this could hurt recall, as subtle supportive posts might be overlooked.
The generic F1 score is often the variable of most interest, as it indicates how precision and recall are balanced. The closer the F1 score is to 1, the better the model performs according to the coding scheme you have trained it on.
If you find yourself with low scores, it is a good idea to first consider your coding scheme. Are your categories clear? Several rounds of iterative annotations are often necessary as you refine your approach.
Once you find the model satisfactory, you can apply it to the whole dataset under Compute prediction. This will apply the model to all the elements in the dataset, and you can then export the results.
Test your model
If you have defined or imported a test set, you can also apply the model to it. This is useful to see how the model performs on unseen data. It is considered good practice to validate a model on a dedicated test set.
Export
You can export all of your annotations in csv, xlsx or parquet format.
On the Export tab, select the desired format and click Export training data.
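As an illustration, here is a minimal pandas sketch for reading the exported file back for downstream analysis; the file name and column names are hypothetical and depend on your project and chosen format.

```python
import pandas as pd

# Read the exported annotations back (file and column names are illustrative)
annotations = pd.read_parquet("export_train.parquet")
print(annotations.head())
print(annotations["label"].value_counts())
```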
You can also export the features and models you have trained if you wish to use them elsewhere.
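For example, assuming the exported classifier is a standard Hugging Face checkpoint directory (the path below is hypothetical), you could reload it with the transformers library:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

# Assumption: the exported model is a standard Hugging Face checkpoint directory;
# the path is illustrative.
model_dir = "exported_model/"
classifier = pipeline(
    "text-classification",
    model=AutoModelForSequenceClassification.from_pretrained(model_dir),
    tokenizer=AutoTokenizer.from_pretrained(model_dir),
)
print(classifier("The government announced a new climate bill."))
```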
Users management
You can add users to your project. This is useful if you want to work collaboratively on a project.
Create user
The right to create users is restricted.
Account
You can change your password.