Software characteristics

Roadmap

Calendar

Experimental generative version planned for April.
Stable classical version planned for mid-June (+ Docker).

Next Steps

Multilabel workflow + BERT fine-tuning
Create Python wrapper
Write documentation + tutorial
Optimize visualization for large datasets ⚙
Create an easy/medium/pro mode
- needs definition
Define the monitor panel
- needs definition
Optimize GPU management (prediction)

Enhancements

Build Docker image ⚙
Integrate genAI tools ⚙
Add carbon count
Add new models (ModernBERT, Phi-3, Pleias, ...)
Refactor design (better ergonomics / colors / etc.) ⚙
Better dataset management (expand)
Engage the community on Discord

Possibilities

Assign specific tasks to users

Architecture

This is a collection of technical points/choices for the app

Overall architecture:

Backend: Python/FastAPI
Frontend: React/TypeScript

Backend

config.yaml defines the parameters at server launch
The basic unit is the project, composed of different classes
- Features
- Schemes
- Simplemodels
- Bertmodels
- Users
CPU/GPU-bound computation is managed in separate processes with a queue
State of the service is checked at each request (with a threshold)

Data management

Tabular data is stored as separate Parquet files split into train / test / complete
SQLite database to manage annotations/parameters/users/logs
Projects are loaded into memory to facilitate computation (filter, etc.)
- Unloaded after one day
BERT models are saved in dedicated directories on the filesystem
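
The load/unload behaviour described above can be sketched as a small in-memory cache. This is only an illustration: ProjectCache, LoadedProject and the loader callback are hypothetical names, and the sketch assumes "one day" is counted from the last access, which the notes above do not specify.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

ONE_DAY_SECONDS = 24 * 60 * 60  # unload delay mentioned in the notes


@dataclass
class LoadedProject:
    data: Any           # whatever the loader returns (e.g. in-memory tables)
    last_access: float  # clock value of the most recent access


class ProjectCache:
    """Hypothetical cache: projects are loaded on demand and dropped when idle."""

    def __init__(self, loader: Callable[[str], Any],
                 max_idle: float = ONE_DAY_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._max_idle = max_idle
        self._clock = clock
        self._projects: Dict[str, LoadedProject] = {}

    def get(self, slug: str) -> Any:
        """Return the project data, loading it first if it is not in memory."""
        now = self._clock()
        entry = self._projects.get(slug)
        if entry is None:
            entry = LoadedProject(self._loader(slug), now)
            self._projects[slug] = entry
        entry.last_access = now
        return entry.data

    def unload_idle(self) -> List[str]:
        """Drop projects idle for longer than max_idle; return their slugs."""
        now = self._clock()
        stale = [slug for slug, entry in self._projects.items()
                 if now - entry.last_access > self._max_idle]
        for slug in stale:
            del self._projects[slug]
        return stale

    def loaded(self) -> List[str]:
        """Slugs of the projects currently held in memory."""
        return sorted(self._projects)
```

In such a design, unload_idle() would be called periodically (for instance from a scheduled task), and the next get() for an unloaded project would simply reload it from disk.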

Processes

ProcessPoolExecutor with workers
- https://superfastpython.com/processpoolexecutor-in-python/
Different types of parallel processes: training, predicting
Only one process allowed per user/project
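
A minimal sketch of the "one process per user/project" rule, built on the standard concurrent.futures interface. JobManager and fake_training are hypothetical names used for illustration, not the app's actual classes; the executor is passed in so that the same bookkeeping works with a ProcessPoolExecutor.

```python
from concurrent.futures import Executor, Future
from typing import Any, Callable, Dict, Tuple

JobKey = Tuple[str, str]  # (user, project)
JOB_KINDS = ("training", "predicting")


class JobManager:
    """Hypothetical bookkeeping: at most one running job per (user, project)."""

    def __init__(self, executor: Executor) -> None:
        self._executor = executor
        self._jobs: Dict[JobKey, Tuple[str, Future]] = {}

    def submit(self, user: str, project: str, kind: str,
               fn: Callable[..., Any], *args: Any) -> Future:
        if kind not in JOB_KINDS:
            raise ValueError(f"unknown job kind: {kind}")
        key = (user, project)
        current = self._jobs.get(key)
        # Refuse a new job while the previous one for this key is still running.
        if current is not None and not current[1].done():
            raise RuntimeError(
                f"{user}/{project} already has a {current[0]} job running")
        future = self._executor.submit(fn, *args)
        self._jobs[key] = (kind, future)
        return future


def fake_training(n_steps: int) -> int:
    """Stand-in for a CPU/GPU-bound task (top-level so it can be pickled)."""
    return sum(i * i for i in range(n_steps))
```

With process-based workers, the manager would be created as JobManager(ProcessPoolExecutor(max_workers=2)), inside an if __name__ == "__main__": guard on platforms that spawn worker processes.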

User roles

Role-Based Access Control (RBAC) with 3 roles: root, manager, annotator
Authentication with OAuth2 and a token in the header
- Table of valid tokens
A table of authorization defines the relation users/projects
Different users can modify the same project: no lock
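
The role check can be illustrated with a small lookup against an authorization table. This is a sketch under assumptions that the notes above do not confirm: the roles are treated as ordered (root > manager > annotator), root users are treated as global rather than per project, and all names (Role, AUTHORIZATIONS, ROOT_USERS, is_allowed) are hypothetical.

```python
from enum import IntEnum
from typing import Dict, Optional, Set, Tuple


class Role(IntEnum):
    # Assumption: a higher role includes the rights of the lower ones.
    ANNOTATOR = 1
    MANAGER = 2
    ROOT = 3


# Hypothetical authorization table: (user, project) -> role on that project.
AUTHORIZATIONS: Dict[Tuple[str, str], Role] = {
    ("alice", "demo"): Role.MANAGER,
    ("bob", "demo"): Role.ANNOTATOR,
}

# Assumption: root users are global and not attached to a single project.
ROOT_USERS: Set[str] = {"admin"}


def role_on_project(user: str, project: str) -> Optional[Role]:
    """Return the user's role on the project, or None if there is no relation."""
    if user in ROOT_USERS:
        return Role.ROOT
    return AUTHORIZATIONS.get((user, project))


def is_allowed(user: str, project: str, required: Role) -> bool:
    """True if the user holds at least the required role on the project."""
    role = role_on_project(user, project)
    return role is not None and role >= required
```

In a FastAPI backend, a check like is_allowed would typically run after the token from the request header has been resolved to a user name.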

Select element to annotate

The selection combines different strategies: filters and/or active learning.

Active learning relies on predictions from a model trained on the already annotated data.

Different modes of selection
- deterministic
- random
- maxprob for a label
- max entropy
Selection pipeline
- sample (tagged, untagged, all)
- regex
- probability / entropy
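
The pipeline above can be sketched as one function that filters by sample, then by regex, then ranks the remaining elements according to the mode. The function name, argument names and mode strings are hypothetical; the class probabilities are assumed to come from a model trained on the annotated data, and "deterministic" is assumed to mean "first element in storage order".

```python
import math
import random
import re
from typing import Dict, List, Optional


def entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy (natural log) of a label -> probability mapping."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)


def select_next(texts: Dict[str, str],
                labels: Dict[str, Optional[str]],
                probas: Dict[str, Dict[str, float]],
                mode: str = "deterministic",
                sample: str = "untagged",
                pattern: Optional[str] = None,
                label: Optional[str] = None,
                rng: Optional[random.Random] = None) -> Optional[str]:
    """Return the id of the next element to annotate, or None if none matches."""
    if sample not in ("tagged", "untagged", "all"):
        raise ValueError(f"unknown sample: {sample}")
    # Step 1: sample filter (tagged, untagged, all).
    ids: List[str] = [
        i for i in texts
        if sample == "all"
        or (labels.get(i) is not None) == (sample == "tagged")
    ]
    # Step 2: optional regex filter on the text.
    if pattern is not None:
        ids = [i for i in ids if re.search(pattern, texts[i])]
    if not ids:
        return None
    # Step 3: pick one element according to the selection mode.
    if mode == "deterministic":
        return ids[0]
    if mode == "random":
        return (rng or random).choice(ids)
    if mode == "maxprob":
        if label is None:
            raise ValueError("maxprob requires a label")
        return max(ids, key=lambda i: probas[i][label])
    if mode == "entropy":
        return max(ids, key=lambda i: entropy(probas[i]))
    raise ValueError(f"unknown mode: {mode}")
```

Here maxprob surfaces the element the model is most confident about for a given label, while entropy surfaces the element the model is least sure about, which is the usual active-learning choice.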

Frontend

State management

Each project is described by its general state (not user specific)
- Computed/computing elements
