## Five-Stage Pipeline

1. **Skill Extraction:** spaCy PhraseMatcher + NER pulls 200+ skills from resume text and persists them to BigQuery.
2. **Gap Analysis:** exact matching plus semantic similarity via Vertex AI embeddings, yielding a 0–100 gap score and a list of transferable skills.
3. **Industry Match:** BQML cosine distance against 8 industry centroid vectors; fit scores are ranked in-database.
4. **Gemini Narrative:** a RAG-grounded, three-sentence career story from Gemini 2.5 Flash — grounded in extracted facts only, to curb hallucination.
5. **90-Day Pathway:** a two-agent CrewAI crew researches real courses (via DuckDuckGo search) and builds a phased reskilling roadmap.
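The Gap Analysis stage combines exact matching with embedding similarity: a JD skill the candidate lacks can still count (at half weight) as *transferable* if a candidate skill sits close to it in embedding space. A minimal pure-Python sketch of that scoring idea — the toy 3-dim vectors and the 0.75 threshold below are illustrative assumptions; the real pipeline uses 768-dim Vertex AI `text-embedding-004` vectors:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def gap_score(candidate_skills, jd_skills, embeddings, sim_threshold=0.75):
    """Exact JD matches count fully; a near-neighbour above the threshold
    counts as a transferable skill at half weight. Returns a 0-100 score."""
    matched, transferable, missing = [], [], []
    cand = set(candidate_skills)
    for jd_skill in jd_skills:
        if jd_skill in cand:
            matched.append(jd_skill)
            continue
        # Best semantic match among candidate skills (stand-in for Vertex AI lookups).
        best_skill, best_sim = max(
            ((c, cosine(embeddings[jd_skill], embeddings[c])) for c in cand),
            key=lambda t: t[1],
        )
        if best_sim >= sim_threshold:
            transferable.append({"jd_skill": jd_skill,
                                 "candidate_skill": best_skill,
                                 "similarity": round(best_sim, 2)})
        else:
            missing.append(jd_skill)
    score = 100.0 * (len(matched) + 0.5 * len(transferable)) / len(jd_skills)
    return {"gap_score": round(score, 1), "matched_skills": matched,
            "transferable_skills": transferable, "missing_skills": missing}
```

With toy vectors where Luigi sits near Airflow, `gap_score(["Python", "Luigi"], ["Python", "Airflow", "Kafka"], emb)` matches Python exactly, flags Luigi→Airflow as transferable, and leaves Kafka missing — the same shape as the `gap` object in the sample response.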
## System Architecture

### GCP Service Map
| Service | Used for | Resource |
|---|---|---|
| BigQuery | Skill storage, profiles, embeddings, JD catalog, industry vectors, drift metrics | reskillio.* |
| BigQuery ML | Cosine distance against 8 industry centroid vectors — scoring in-database | industry_vectors |
| BigQuery Lakehouse | Medallion Bronze/Silver/Gold layers for analytics & auditability | reskillio_bronze/silver/gold.* |
| Vertex AI Embeddings | 768-dim skill vectors for gap analysis + industry matching | text-embedding-004 |
| Vertex AI Gemini | RAG-grounded career narrative generation | gemini-2.5-flash |
| Vertex AI Model Registry | Versioned spaCy skill extractor with F1 gating | reskillio-skill-extractor |
| Vertex AI Pipelines (KFP) | Orchestrated PDF ingestion: load → extract → embed | reskillio-ingestion-pipeline |
| Cloud Storage | Model artifacts, taxonomy JSON, pipeline root | {project}-models |
| Cloud Build | CI/CD retraining on taxonomy.json change in GCS | Pub/Sub trigger |
| Cloud Monitoring | Drift metrics + alert policy (unknown_rate > 20%) | 3 custom metric descriptors |
| Cloud Run | FastAPI API hosting | reskillio-api |
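The BigQuery ML row above scores industry fit in-database via cosine distance to centroid vectors. The equivalent logic, sketched in pure Python with toy 3-dim centroids (the real centroids are 768-dim `text-embedding-004` vectors, and the exact distance-to-score mapping lives in SQL, so the `100 * (1 - distance)` mapping here is an illustrative assumption):

```python
import math

def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity), as BQML computes for vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

def rank_industries(profile_vec, centroids):
    """Rank industry centroids by closeness to a candidate profile vector.
    Returns the same rank/industry/match_score shape as the API response."""
    scored = sorted(
        ((name, cosine_distance(profile_vec, vec)) for name, vec in centroids.items()),
        key=lambda t: t[1],  # smallest distance first
    )
    return [{"rank": i + 1, "industry": name, "match_score": round(100 * (1 - d), 1)}
            for i, (name, d) in enumerate(scored)]
```

A profile vector leaning toward the `data_ai` centroid ranks that industry first, mirroring the `industry_match.scores` array in the sample output below.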
## API Reference

### Demo endpoint — POST /analyze
```bash
# PDF upload
curl -X POST http://localhost:8000/analyze \
  -F "resume=@resume.pdf" \
  -F "target_role=Senior Data Engineer" \
  -F "jd_text=We need Python, BigQuery, Airflow, dbt..." \
  -F "candidate_id=demo-001"

# Plain text (no PDF needed)
curl -X POST http://localhost:8000/analyze \
  -F "resume_text=Experienced data engineer with Python and BigQuery..." \
  -F "target_role=Senior Data Engineer"

# With 90-day pathway (~45s extra)
curl -X POST http://localhost:8000/analyze \
  -F "resume=@resume.pdf" \
  -F "target_role=Senior Data Engineer" \
  -F "include_pathway=true"
```
### All Endpoints

### Sample Output
```jsonc
// POST /analyze response (truncated)
{
  "candidate_id": "demo-001",
  "target_role": "Senior Data Engineer",
  "skill_count": 47,
  "top_skills": [
    { "name": "Python", "category": "technical", "confidence": 0.97 },
    { "name": "BigQuery", "category": "tool", "confidence": 0.95 }
  ],
  "gap": {
    "gap_score": 72.4,
    "matched_skills": ["Python", "SQL", "BigQuery", "Spark", "dbt"],
    "missing_skills": ["Kafka", "Terraform"],
    "transferable_skills": [
      { "jd_skill": "Airflow", "candidate_skill": "Luigi", "similarity": 0.81 }
    ],
    "recommendation": "Strong match. Bridge Kafka and Terraform to close remaining gaps."
  },
  "industry_match": {
    "top_industry": "data_ai",
    "scores": [
      { "rank": 1, "industry": "data_ai", "match_score": 88.3 },
      { "rank": 2, "industry": "cloud_devops", "match_score": 71.1 }
    ]
  },
  "narrative": "A data-first engineer with a strong foundation in Python and BigQuery, your transferable workflow orchestration experience positions you well for this role. Rounding out Kafka and Terraform would make you a top candidate.",
  "stages": {
    "extract": { "success": true, "duration_ms": 342 },
    "gap": { "success": true, "duration_ms": 1820 },
    "industry": { "success": true, "duration_ms": 510 },
    "narrative": { "success": true, "duration_ms": 3102 },
    "pathway": { "success": true, "duration_ms": 0, "error": "skipped" }
  },
  "total_duration_ms": 5892
}
```
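A client usually only needs a few fields from this payload. A minimal sketch that condenses a response dict of the shape above into a one-line verdict — the helper name is ours, not part of the API:

```python
def summarize_analysis(resp: dict) -> str:
    """One-line summary from an /analyze response dict (field names as in the sample)."""
    gap = resp["gap"]
    missing = ", ".join(gap["missing_skills"]) or "none"
    return (f"{resp['target_role']}: gap score {gap['gap_score']}/100, "
            f"best-fit industry '{resp['industry_match']['top_industry']}', "
            f"skills to bridge: {missing}")
```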
## BigQuery Medallion Lakehouse

### Bronze — Raw Ingestion
- raw_resume_ingestion
- raw_jd_ingestion
Append-only. Every source document is preserved as-is.
### Silver — Validated
- candidate_skills
- jd_skill_profiles
- ingestion_log
Deduplicated, validated, enriched via MERGE SQL.
### Gold — Analytics
- match_scores
- industry_rankings
- candidate_readiness
Computed scores. Readiness index = 40% match + 30% industry + 20% confidence + 10% breadth.
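The readiness formula above is a straight weighted sum. A small sketch, assuming all four component scores are already normalized to a 0–100 scale (that normalization is our assumption; the actual Gold-layer SQL may scale inputs differently):

```python
# Weights from the readiness formula: 40% match + 30% industry
# + 20% confidence + 10% breadth.
WEIGHTS = {"match": 0.40, "industry": 0.30, "confidence": 0.20, "breadth": 0.10}

def readiness_index(match: float, industry: float,
                    confidence: float, breadth: float) -> float:
    """Weighted readiness index; all inputs assumed on a 0-100 scale."""
    components = {"match": match, "industry": industry,
                  "confidence": confidence, "breadth": breadth}
    return round(sum(WEIGHTS[k] * v for k, v in components.items()), 1)
```

A candidate at 100 on every component scores 100.0; one with only a perfect match score (and zeros elsewhere) caps out at 40.0, reflecting the 40% weight.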
## Getting Started

### Clone & install

```bash
git clone https://github.com/vipul9811kumar/ReSkillio.git
cd ReSkillio
pip install -r requirements.txt
python -m spacy download en_core_web_lg
```
### Configure GCP

```bash
cp .env.example .env
# Set GCP_PROJECT_ID and GOOGLE_APPLICATION_CREDENTIALS
```
### Bootstrap GCP resources

```bash
python scripts/setup_gcp.py
python scripts/build_industry_vectors.py
```
### Start the API

```bash
uvicorn reskillio.api.main:app --reload --port 8000
# Docs at http://localhost:8000/docs
```
### Run the demo

```bash
curl -X POST http://localhost:8000/analyze \
  -F "resume=@data/raw/sample_resume.pdf" \
  -F "target_role=Senior Data Engineer"
```