Software characteristics
Roadmap
Calendar
- Experimental generative version planned for April.
- Stable classical version planned for mid-June (+ Docker).
Next Steps
- Multilabel workflow + BERT fine-tuning
- Create Python wrapper
- Write documentation + tutorial
- Optimize visualization for large datasets ⚙
- Create an easy/medium/pro mode
- needs definition
- Define the monitor panel
- needs definition
- Optimize GPU management (prediction)
Enhancements
- Build Docker image ⚙
- Integrate genAI tools ⚙
- Add carbon count
- Add new models (ModernBERT, Phi-3, Pleias, ...)
- Refactor design (better ergonomics / colors / etc.) ⚙
- Better dataset management (expand)
- Run the community on Discord
Possibilities
- Assign specific tasks to users
Architecture
This is a collection of technical points and design choices for the app.
Overall architecture:
- backend: Python/FastAPI
- frontend: React/TypeScript
Backend
config.yaml
Defines the parameters at server launch.
- The unit is the project, composed of different classes
- Features
- Schemes
- Simplemodels
- Bertmodels
- Users
- CPU/GPU-bound computation is managed in separate processes with a queue
- State of the service is checked at each request (with a threshold)
Data management
- Tabular data is stored as separate Parquet files, divided into train / test / complete
- SQLite database to manage annotations/parameters/users/logs
- Projects are loaded into memory to facilitate computation (filter, etc.)
- Unloaded after one day
- BERT models are saved in dedicated filesystems
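The load/unload behaviour of projects can be sketched as a small cache with a time limit. This is only an illustration: the class `ProjectCache`, the `loader` callback and the `clock` parameter are hypothetical names, and the sketch assumes the one-day delay is counted from the last access, which the notes above do not specify.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

ONE_DAY = 24 * 60 * 60  # seconds


class ProjectCache:
    """Hypothetical in-memory store: projects are dropped after `ttl` seconds idle."""

    def __init__(self, loader: Callable[[str], Any], ttl: float = ONE_DAY,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.loader = loader  # assumed to read Parquet/SQLite data for a project
        self.ttl = ttl
        self.clock = clock    # injectable so the delay can be simulated
        self._loaded: Dict[str, Tuple[Any, float]] = {}  # slug -> (project, last access)

    def get(self, slug: str) -> Any:
        """Return the project, loading it if it is not (or no longer) in memory."""
        self.evict_stale()
        if slug in self._loaded:
            project = self._loaded[slug][0]
        else:
            project = self.loader(slug)
        self._loaded[slug] = (project, self.clock())  # refresh last access time
        return project

    def evict_stale(self) -> None:
        """Unload every project that has been idle for longer than `ttl`."""
        now = self.clock()
        stale = [s for s, (_, last) in self._loaded.items() if now - last > self.ttl]
        for slug in stale:
            del self._loaded[slug]

    def loaded(self) -> List[str]:
        return sorted(self._loaded)
```

Calling `evict_stale()` at the start of `get()` means no background timer is needed; a periodic task would work as well.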
Processes
- ProcessPoolExecutor with workers
- https://superfastpython.com/processpoolexecutor-in-python/
- Different types of parallel processes: training, predicting
- Only one process allowed per user/project
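A minimal sketch of such a queue, built on `concurrent.futures.ProcessPoolExecutor`. The names `JobQueue`, `submit`, `state` and `fake_training` are hypothetical, and the sketch reads "one process per user/project" as one running job per (user, project) pair, which is an assumption.

```python
from concurrent.futures import Executor, Future, ProcessPoolExecutor
from typing import Any, Callable, Dict, Optional, Tuple

JOB_KINDS = ("training", "predicting")


class JobQueue:
    """Hypothetical queue allowing one running job per (user, project) pair."""

    def __init__(self, executor: Optional[Executor] = None, max_workers: int = 2) -> None:
        # A process pool keeps CPU/GPU-bound work out of the server process.
        self.executor = executor or ProcessPoolExecutor(max_workers=max_workers)
        self.running: Dict[Tuple[str, str], Tuple[str, Future]] = {}

    def submit(self, user: str, project: str, kind: str,
               func: Callable[..., Any], *args: Any) -> Future:
        if kind not in JOB_KINDS:
            raise ValueError(f"unknown job kind: {kind}")
        key = (user, project)
        current = self.running.get(key)
        if current is not None and not current[1].done():
            raise RuntimeError(f"{user} already runs a {current[0]} job on {project}")
        future = self.executor.submit(func, *args)
        self.running[key] = (kind, future)
        return future

    def state(self) -> Dict[Tuple[str, str], str]:
        """Jobs still computing, e.g. to report them in the project state."""
        return {key: kind for key, (kind, fut) in self.running.items() if not fut.done()}


def fake_training(n: int) -> int:
    # Stand-in for a real training function. With a process pool, the function
    # must be defined at module level so that it can be pickled.
    return sum(i * i for i in range(n))
```

For example, `JobQueue().submit("alice", "demo", "training", fake_training, 1000)` returns a `Future`; a second call for the same user and project raises `RuntimeError` until the first job is done. When the default process pool is used from a script, the call should sit under an `if __name__ == "__main__":` guard.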
User roles
- Role-Based Access Control (RBAC) with 3 roles: root, manager, annotator
- Authentication with OAuth2 and a token in the header
- Table of valid tokens
- A table of authorization defines the relation users/projects
- Different users can modify the same project: no lock
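The two tables (valid tokens, users/projects authorizations) can be illustrated with plain dictionaries. This is a sketch under assumptions: `AccessControl`, the action names and the `PERMISSIONS` mapping are hypothetical, password checking and the OAuth2 flow are left out, and the root role is handled per project only to keep the example short.

```python
import secrets
from typing import Dict, Set, Tuple

# Hypothetical permission sets; the actual actions per role are not defined above.
PERMISSIONS: Dict[str, Set[str]] = {
    "root": {"annotate", "train_model", "manage_users", "delete_project"},
    "manager": {"annotate", "train_model", "manage_users"},
    "annotator": {"annotate"},
}


class AccessControl:
    def __init__(self) -> None:
        self.tokens: Dict[str, str] = {}            # valid tokens: token -> username
        self.auth: Dict[Tuple[str, str], str] = {}  # (username, project) -> role

    def login(self, username: str) -> str:
        """Issue a token (credential verification is omitted in this sketch)."""
        token = secrets.token_urlsafe(32)
        self.tokens[token] = username
        return token

    def revoke(self, token: str) -> None:
        self.tokens.pop(token, None)

    def grant(self, username: str, project: str, role: str) -> None:
        if role not in PERMISSIONS:
            raise ValueError(f"unknown role: {role}")
        self.auth[(username, project)] = role

    def is_allowed(self, token: str, project: str, action: str) -> bool:
        """Check the token first, then the role the user holds on the project."""
        username = self.tokens.get(token)
        if username is None:
            return False
        role = self.auth.get((username, project))
        return role is not None and action in PERMISSIONS[role]
```

In a FastAPI backend, a check like `is_allowed` would typically run in a dependency that reads the token from the `Authorization` header.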
Select element to annotate
The selection combines different strategies: filters and/or active learning.
Active learning is a prediction with a model trained on already annotated data.
Different modes of selection
- deterministic
- random
- maxprob (highest predicted probability for a given label)
- max entropy
Selection pipeline
- sample (tagged, untagged, all)
- regex
- probability / entropy
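The pipeline above (sample, then regex, then mode) can be sketched as a single function. The function `select_next` and the element layout (`id`, `text`, `label`, `probs`) are hypothetical; `probs` stands for the per-label probabilities predicted by a model trained on the already annotated data. Entropy is the Shannon entropy $H(p) = -\sum_k p_k \log p_k$, which is highest when the model is least certain.

```python
import math
import random
import re
from typing import Any, Dict, List, Optional


def entropy(probs: Dict[str, float]) -> float:
    """Shannon entropy of a predicted label distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)


def select_next(elements: List[Dict[str, Any]], mode: str = "deterministic",
                sample: str = "untagged", pattern: Optional[str] = None,
                label: Optional[str] = None,
                rng: Optional[random.Random] = None) -> Optional[str]:
    """Return the id of the next element to annotate, or None if nothing matches."""
    # 1. Sample: restrict to tagged, untagged or all elements.
    if sample == "tagged":
        pool = [e for e in elements if e["label"] is not None]
    elif sample == "untagged":
        pool = [e for e in elements if e["label"] is None]
    elif sample == "all":
        pool = list(elements)
    else:
        raise ValueError(f"unknown sample: {sample}")
    # 2. Regex filter on the text.
    if pattern is not None:
        pool = [e for e in pool if re.search(pattern, e["text"])]
    if not pool:
        return None
    # 3. Selection mode.
    if mode == "deterministic":
        return pool[0]["id"]  # first element in stored order
    if mode == "random":
        return (rng or random).choice(pool)["id"]
    if mode == "maxprob":
        if label is None:
            raise ValueError("maxprob requires a label")
        return max(pool, key=lambda e: e["probs"][label])["id"]
    if mode == "entropy":
        return max(pool, key=lambda e: entropy(e["probs"]))["id"]
    raise ValueError(f"unknown mode: {mode}")
```

With `mode="maxprob"` the annotator sees the elements most likely to carry a given label, which helps to find rare classes; with `mode="entropy"` the annotator sees the elements the model hesitates on most.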
Frontend
State management
- Each project is described by its general state (not user specific)
- Computed/computing elements