feat: Add Indexation API for tracking sync jobs (!8) · Merge requests · iavia / iavia-document-analysis-service

zakariae yahya requested to merge feature/accounts-folders into develop on Dec. 30, 2025

Summary

This MR implements a complete indexation tracking system that monitors document synchronization from external sources (Google Drive, SharePoint, S3) and integrates with n8n workflows for automated RAG processing.

Key Features

Indexation Job Tracking: Track sync jobs with status, progress, and error handling
Document Tracking: Automatic tracking of indexed documents in the database
Immediate Sync: Trigger file synchronization immediately after wizard completion
n8n Integration: n8n workflows are started via webhooks and report job progress and completion back to the API
RAG Pipeline: Documents are processed, chunked, and stored in Qdrant

Changes

New Models & Schemas

IndexationJob model with status tracking (pending, running, completed, failed)
IndexedDocument model for tracking processed documents
Pydantic schemas for API payloads and responses
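
As a rough illustration of the job model described above, here is a minimal sketch using only the standard library. The four status values come from this MR; every field name (datasource_config_id, total_files, processed_files, error_message) and the progress property are assumptions made for the example, and the real model is presumably an ORM class with Pydantic schemas rather than a dataclass.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class JobStatus(str, Enum):
    """The four job states listed in this MR."""
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class IndexationJob:
    """Hypothetical shape of a sync job; field names are illustrative only."""
    id: int
    datasource_config_id: int  # assumed link to the datasource config
    status: JobStatus = JobStatus.PENDING
    total_files: int = 0
    processed_files: int = 0
    error_message: Optional[str] = None
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    @property
    def progress(self) -> float:
        """Fraction of files processed, 0.0 when the total is unknown."""
        if self.total_files == 0:
            return 0.0
        return self.processed_files / self.total_files
```

Deriving progress from two counters instead of storing a percentage keeps the value consistent when the total file count changes mid-sync.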

New API Endpoints

GET /indexations/stats - Global indexation statistics
GET /indexations/stats/timeline - Daily stats for charts
GET /indexations/jobs - List jobs with filters and pagination
POST /indexations/jobs - Create new indexation job
DELETE /indexations/jobs/{id} - Cancel a job
GET /indexations/documents - List indexed documents
DELETE /indexations/documents/{id} - Delete document from index
POST /indexations/trigger-sync - Manual sync trigger
POST /indexations/webhooks/job-progress - n8n progress updates
POST /indexations/webhooks/job-complete - n8n job completion
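
The two webhook endpoints above imply a small state machine on each job. The sketch below shows one way the handlers could apply an incoming n8n payload. The payload keys (processed_files, total_files, error) and both function names are hypothetical, and the actual handlers in this MR may behave differently.

```python
from typing import Any, Dict

# Statuses after which further progress updates are ignored.
TERMINAL_STATUSES = {"completed", "failed"}


def apply_progress_update(job: Dict[str, Any], payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the job updated from a job-progress payload.

    Payload keys are assumptions; a late update for a finished job is a no-op.
    """
    if job["status"] in TERMINAL_STATUSES:
        return job
    updated = dict(job)
    updated["status"] = "running"
    updated["processed_files"] = int(
        payload.get("processed_files", job.get("processed_files", 0))
    )
    updated["total_files"] = int(
        payload.get("total_files", job.get("total_files", 0))
    )
    return updated


def apply_completion(job: Dict[str, Any], payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the job marked completed, or failed if an error is reported."""
    updated = dict(job)
    error = payload.get("error")
    updated["status"] = "failed" if error else "completed"
    updated["error_message"] = error
    return updated
```

Ignoring progress updates on terminal jobs guards against out-of-order webhook deliveries reopening a job that has already finished.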

Wizard Integration

Automatic sync trigger after wizard activation
Creates indexation jobs for each datasource config
Triggers n8n webhooks to start processing immediately
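
The wizard flow above (one job per datasource config, then an immediate webhook trigger) could look roughly like the following. This is a sketch under stated assumptions: trigger_sync_after_wizard, the config keys (id, type), and the injected trigger_webhook callable are all hypothetical names, and the HTTP call to n8n is left to the caller so that the logic stays testable.

```python
from typing import Any, Callable, Dict, List


def trigger_sync_after_wizard(
    datasource_configs: List[Dict[str, Any]],
    trigger_webhook: Callable[[Dict[str, Any]], None],
) -> List[Dict[str, Any]]:
    """Create one job per datasource config and trigger its n8n webhook.

    trigger_webhook is a caller-supplied function (hypothetical) that should
    raise on failure; a failed trigger marks only that job as failed.
    """
    jobs: List[Dict[str, Any]] = []
    for index, config in enumerate(datasource_configs, start=1):
        job: Dict[str, Any] = {
            "id": index,  # placeholder id; a database would assign this
            "datasource_config_id": config["id"],
            "source_type": config["type"],
            "status": "pending",
            "error_message": None,
        }
        try:
            trigger_webhook(job)
        except Exception as exc:  # keep going so other sources still sync
            job["status"] = "failed"
            job["error_message"] = str(exc)
        jobs.append(job)
    return jobs
```

Successfully triggered jobs stay pending here and would move to running once the first progress webhook arrives.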

n8n Workflow Updates

Google Drive immediate sync workflow with subfolder support
Progress tracking via webhooks
PDF/DOCX/TXT file filtering (no images)
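
The file filter itself lives in the n8n workflow, but the rule is simple enough to restate in Python for clarity. This sketch assumes filtering by file extension, case-insensitively; the workflow may instead filter on MIME type, and the function names are illustrative.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

# Extensions named in this MR; images and everything else are skipped.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt"}


def is_indexable(filename: str) -> bool:
    """True when the file's final extension is PDF, DOCX, or TXT."""
    return PurePosixPath(filename).suffix.lower() in ALLOWED_EXTENSIONS


def filter_indexable(filenames: Iterable[str]) -> List[str]:
    """Keep only indexable files, preserving input order."""
    return [name for name in filenames if is_indexable(name)]
```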

Bug Fixes

Fixed document tracking condition in /documents/process
Added torch multiprocessing spawn mode for ASGI compatibility
Added fallback for batch embedding errors
Added pad_token for gpt2 tokenizer
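
For the batch embedding fallback, one common pattern is to retry item by item when the batch call fails, so a single bad input does not lose the whole batch. The sketch below illustrates that pattern only; it is not necessarily how this MR implements the fallback, and embed_with_fallback and the embed_batch callable are hypothetical names.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def embed_with_fallback(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[Vector]],
) -> List[Optional[Vector]]:
    """Embed all texts in one call, falling back to one call per text.

    embed_batch is a caller-supplied function (hypothetical). If the batch
    call raises, each text is retried alone; texts that still fail yield None.
    """
    try:
        return list(embed_batch(texts))
    except Exception:
        # Broad catch on purpose for this sketch: any batch failure
        # triggers the slower per-item path.
        results: List[Optional[Vector]] = []
        for text in texts:
            try:
                results.append(embed_batch([text])[0])
            except Exception:
                results.append(None)
        return results
```

Returning None for failed items keeps the output aligned with the input, so the caller can decide whether to skip or re-queue those chunks.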
