feat: Add Indexation API for tracking sync jobs
Summary
This MR implements a complete indexation tracking system that monitors document synchronization from external sources (Google Drive, SharePoint, S3) and integrates with n8n workflows for automated RAG processing.
Key Features
- Indexation Job Tracking: Track sync jobs with status, progress, and error handling
- Document Tracking: Automatic tracking of indexed documents in the database
- Immediate Sync: Trigger file synchronization immediately after wizard completion
- n8n Integration: Trigger n8n workflows and receive progress and completion updates via webhooks
- RAG Pipeline: Documents are processed, chunked, and stored in Qdrant
Changes
New Models & Schemas
- IndexationJob model with status tracking (pending, running, completed, failed)
- IndexedDocument model for tracking processed documents
- Pydantic schemas for API payloads and responses
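
For reviewers, the status tracking above can be pictured roughly as follows. This is a minimal sketch and not the code in this MR: it uses stdlib dataclasses as a stand-in for the actual model, and the class name `IndexationJobSketch`, its fields, and the allowed transitions are all assumptions. Only the four status values come from the description.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class JobStatus(str, Enum):
    # The four status values listed in this MR.
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed transition rules; the MR does not spell these out.
ALLOWED_TRANSITIONS = {
    JobStatus.PENDING: {JobStatus.RUNNING, JobStatus.FAILED},
    JobStatus.RUNNING: {JobStatus.COMPLETED, JobStatus.FAILED},
    JobStatus.COMPLETED: set(),
    JobStatus.FAILED: set(),
}


@dataclass
class IndexationJobSketch:
    """Hypothetical stand-in for the IndexationJob model."""

    source: str
    status: JobStatus = JobStatus.PENDING
    error: Optional[str] = None

    def transition(self, new_status: JobStatus) -> None:
        # Reject moves that the assumed state machine does not allow.
        if new_status not in ALLOWED_TRANSITIONS[self.status]:
            raise ValueError(
                f"cannot move from {self.status.value} to {new_status.value}"
            )
        self.status = new_status
```

Treating completed and failed as terminal states is a design assumption here; a retry feature would need an extra transition.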
New API Endpoints
- GET /indexations/stats - Global indexation statistics
- GET /indexations/stats/timeline - Daily stats for charts
- GET /indexations/jobs - List jobs with filters and pagination
- POST /indexations/jobs - Create new indexation job
- DELETE /indexations/jobs/{id} - Cancel a job
- GET /indexations/documents - List indexed documents
- DELETE /indexations/documents/{id} - Delete document from index
- POST /indexations/trigger-sync - Manual sync trigger
- POST /indexations/webhooks/job-progress - n8n progress updates
- POST /indexations/webhooks/job-complete - n8n job completion
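
To illustrate what a job-progress webhook handler might do with an incoming update, here is a small sketch. The payload fields (`processed`, `total`) and the `apply_progress` helper are hypothetical; the actual payload schema is defined by the Pydantic schemas in this MR and may differ.

```python
def apply_progress(job: dict, payload: dict) -> dict:
    """Return a copy of `job` updated from a progress payload.

    `processed` and `total` are assumed field names, not the real schema.
    """
    total = max(int(payload.get("total", job.get("total", 0))), 0)
    processed = max(int(payload.get("processed", 0)), 0)
    if total:
        # Never report more processed documents than the known total.
        processed = min(processed, total)
    percent = round(100 * processed / total, 1) if total else 0.0
    return {
        **job,
        "status": "running",
        "processed": processed,
        "total": total,
        "progress_percent": percent,
    }
```

Clamping the counters keeps a late or duplicated n8n update from pushing progress above 100%.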
Wizard Integration
- Automatic sync trigger after wizard activation
- Creates indexation jobs for each datasource config
- Triggers n8n webhooks to start processing immediately
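
The wizard flow above can be sketched as one job per datasource config, followed by a webhook call. This is an illustration under stated assumptions: `start_immediate_sync`, `create_job`, and `trigger_webhook` are hypothetical names, and the two callables are injected so the sketch does not guess at the real database or HTTP code.

```python
from typing import Any, Callable, Dict, Iterable, List


def start_immediate_sync(
    configs: Iterable[Dict[str, Any]],
    create_job: Callable[[Dict[str, Any]], int],
    trigger_webhook: Callable[[Dict[str, Any], int], None],
) -> List[Dict[str, Any]]:
    """Create one indexation job per datasource config, then trigger n8n.

    Both callables are hypothetical stand-ins for the real persistence
    and webhook code.
    """
    results = []
    for config in configs:
        job_id = create_job(config)
        try:
            trigger_webhook(config, job_id)
            results.append({"job_id": job_id, "triggered": True, "error": None})
        except Exception as exc:
            # One failing datasource should not block the others.
            results.append({"job_id": job_id, "triggered": False, "error": str(exc)})
    return results
```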
n8n Workflow Updates
- Google Drive immediate sync workflow with subfolder support
- Progress tracking via webhooks
- File filtering: only PDF, DOCX, and TXT files are processed (images are skipped)
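
The file filter lives in the n8n workflow, but its rule can be expressed in a few lines of Python for illustration. Matching on the file extension is an assumption; the workflow may filter on MIME type instead, and `is_indexable` is a hypothetical helper.

```python
from pathlib import PurePosixPath

# File types named in this MR; everything else (e.g. images) is skipped.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt"}


def is_indexable(filename: str) -> bool:
    """Return True if the file's extension is one the pipeline processes."""
    return PurePosixPath(filename).suffix.lower() in ALLOWED_EXTENSIONS
```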
Bug Fixes
- Fixed document tracking condition in /documents/process
- Added torch multiprocessing spawn mode for ASGI compatibility
- Added fallback for batch embedding errors
- Added pad_token for gpt2 tokenizer
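
As a rough picture of the batch embedding fallback, the sketch below retries texts one at a time when the batch call fails. It is not the code in this MR: `embed_with_fallback` and the injected `embed_batch` callable are hypothetical, and returning `None` for a text that still fails is an assumed policy.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def embed_with_fallback(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[Vector]],
) -> List[Optional[Vector]]:
    """Embed all texts in one batch; on failure, retry them individually."""
    try:
        return list(embed_batch(texts))
    except Exception:
        results: List[Optional[Vector]] = []
        for text in texts:
            try:
                results.append(embed_batch([text])[0])
            except Exception:
                # Assumed policy: mark the failing text and keep going.
                results.append(None)
        return results
```

The per-item retry costs more calls, but a single bad document no longer discards the embeddings for the rest of the batch.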