OCBC Ignite Programme · 2026

Building with Data

Three technical projects that shaped my journey as a data engineer and analyst — by Anupam P Menon

01 · AWS Serverless Pipeline · Cloud Engineering
02 · Employee Attrition ML · Machine Learning
03 · Pharma Quality Analytics · Data Storytelling
01 · Cloud Engineering · AWS · Python

AWS Serverless Energy Data Pipeline

I designed and deployed a fully serverless, end-to-end data pipeline on AWS to ingest real-time electricity production data from Denmark's national grid operator, Energinet, at 5-minute intervals — processing it through a multi-layer data lake and surfacing insights in Power BI.

What I Built — Step by Step

1. Designed the full pipeline architecture, selecting each AWS service based on cost and scalability — the entire project ran for $4.40 total.

2. Wrote two Python Lambda functions using boto3: an ingestion function that calls the Energinet API every 5 minutes and stores the raw JSON in S3, and an Orchestrator function that triggers the daily Glue ETL jobs (ingestion sketched below).
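
A minimal sketch of the ingestion Lambda, assuming a hypothetical bucket name and dataset endpoint on the public Energi Data Service API — not the exact production code:

```python
# Ingestion Lambda sketch -- bucket name and dataset endpoint are assumptions.
import json
import urllib.request
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "energy-pipeline-raw"  # hypothetical bucket name


def lambda_handler(event, context):
    # Pull the latest 5-minute production records from Energinet
    url = "https://api.energidataservice.dk/dataset/ElectricityProdex5MinRealtime?limit=100"
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = resp.read()

    # Partition the raw layer by date so Glue can crawl it cleanly
    now = datetime.now(timezone.utc)
    key = f"raw/{now:%Y/%m/%d}/energinet_{now:%H%M%S}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=payload, ContentType="application/json")

    return {"statusCode": 200, "body": json.dumps({"stored": key})}
```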

3. Configured an EventBridge cron rule (00 15 * * ? *) for daily scheduling, and built a 3-layer S3 data lake (Raw → Processed → Wrangled) using AWS Glue ETL jobs.
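
Wiring that schedule up takes a couple of boto3 calls; a sketch with placeholder rule name and target ARN (the cron expression fires once a day at 15:00 UTC):

```python
# EventBridge schedule sketch -- rule name and target ARN are placeholders.
import boto3

events = boto3.client("events")

# cron(00 15 * * ? *): minute 00, hour 15 UTC, every day of every month
events.put_rule(
    Name="daily-glue-orchestration",
    ScheduleExpression="cron(00 15 * * ? *)",
    State="ENABLED",
)

events.put_targets(
    Rule="daily-glue-orchestration",
    Targets=[{
        "Id": "orchestrator-lambda",
        "Arn": "arn:aws:lambda:eu-west-1:123456789012:function:orchestrator",  # placeholder
    }],
)
```

EventBridge also needs a resource policy allowing it to invoke the Lambda (via lambda add-permission), omitted here for brevity.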

4. Set up a Glue Crawler and the Glue Data Catalog to auto-register the schema, enabling SQL queries via Amazon Athena, then connected Power BI over ODBC for live dashboards.
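
Once the catalog is populated, the wrangled layer can be queried from code as well as from the Athena console; a sketch with illustrative database, table, and column names:

```python
# Athena query sketch -- database, table, and column names are illustrative.
import boto3

athena = boto3.client("athena")

query = """
    SELECT date_trunc('hour', minutes5utc) AS hour,
           avg(total_production_mw)        AS avg_production_mw
    FROM wrangled_production
    GROUP BY 1
    ORDER BY 1 DESC
    LIMIT 24
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "energy_lake"},  # Glue Data Catalog database
    ResultConfiguration={"OutputLocation": "s3://energy-pipeline-athena-results/"},
)
```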

5. Implemented 3 CloudWatch alarms (Ingestion, Orchestrator, GlueJob) with SNS email alerts — I received the alert emails myself, confirming the monitoring works end-to-end.
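
One of the three alarms, sketched with boto3 (the function name and SNS topic ARN are placeholders):

```python
# CloudWatch alarm sketch -- function name and SNS topic ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="IngestionLambdaErrors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "energinet-ingestion"}],
    Statistic="Sum",
    Period=300,                 # one 5-minute ingestion window
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:pipeline-alerts"],
)
```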

AWS Cost Breakdown (Total: $4.40)

[Bar chart: per-service cost for Lambda, S3, Glue, Athena, CloudWatch, and Other, on a $0–$2.20 axis]

Estimated cost breakdown by AWS service

Full AWS Architecture Diagram — designed by Anupam

$4.40 · Total Cost
10+ · AWS Services
5 min · Data Interval
02 · Machine Learning · Python · Alteryx
Alteryx modeling workflow — built by Anupam

Model Recall Comparison (Attrition Class)

[Bar chart: Recall % and Accuracy % for Logistic Regression, Decision Tree, and Boosted Model, on a 0–80% axis]

Decision Tree recall (54.6%) nearly doubles Logistic Regression's (29.7%)

Predicting Employee Attrition with ML

I built a complete machine learning pipeline to predict employee attrition using a 5,000-record HR dataset with 27 features — from raw data profiling through EDA, feature engineering, model training, and business-driven model selection.

What I Did — Step by Step

1. Loaded and profiled a 5,000-record HR dataset with 27 features in Alteryx — identified and imputed 12 missing MonthlyIncome values (0.24%) to ensure clean training data.
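
The imputation was done in Alteryx; a rough pandas equivalent, with the file name and median strategy assumed for illustration:

```python
# Pandas equivalent of the Alteryx imputation step -- file name is hypothetical.
import pandas as pd

df = pd.read_csv("hr_attrition.csv")

# 12 of 5,000 MonthlyIncome values (0.24%) were missing; fill with the median
df["MonthlyIncome"] = df["MonthlyIncome"].fillna(df["MonthlyIncome"].median())
```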

2. Wrote Python EDA scripts using matplotlib and seaborn to visualise the attrition distribution, income-vs-attrition boxplots, and age-distribution KDE plots — revealing that younger employees (22–30) had the highest attrition risk.
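
A condensed sketch of those EDA scripts (column names such as Attrition, MonthlyIncome, and Age are assumptions about the dataset):

```python
# EDA sketch -- file and column names are assumed.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("hr_attrition.csv")  # hypothetical file name

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Class balance of the target
sns.countplot(data=df, x="Attrition", ax=axes[0])

# Do leavers cluster at lower salaries?
sns.boxplot(data=df, x="Attrition", y="MonthlyIncome", ax=axes[1])

# Age density by class -- surfaces the 22-30 high-risk band
sns.kdeplot(data=df, x="Age", hue="Attrition", common_norm=False, ax=axes[2])

plt.tight_layout()
plt.show()
```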

3. Applied one-hot encoding to 5 categorical variables (Gender, MaritalStatus, Department, JobRole, EducationLevel) and normalised the numeric features before modelling.
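
The same two steps in pandas/scikit-learn, mirroring the Alteryx tools (the numeric column list is an illustrative subset):

```python
# Encoding and normalisation sketch -- numeric column list is illustrative.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("hr_attrition.csv")  # hypothetical file name

categorical = ["Gender", "MaritalStatus", "Department", "JobRole", "EducationLevel"]
df = pd.get_dummies(df, columns=categorical, drop_first=True)

numeric = ["Age", "MonthlyIncome", "YearsAtCompany"]  # illustrative subset
df[numeric] = MinMaxScaler().fit_transform(df[numeric])
```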

4. Used Alteryx's Data Sampling tool to split the data 70/30 (3,500 training / 1,500 test) and built 3 models: Logistic Regression, Decision Tree, and Boosted Model.
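
A Python analogue of that Alteryx workflow, sketched with scikit-learn (the boosted model is approximated with GradientBoostingClassifier, and Attrition is assumed to be encoded 0/1):

```python
# Sketch of the 70/30 split and three-model comparison -- not the Alteryx original.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("hr_attrition_encoded.csv")  # hypothetical pre-processed file
X, y = df.drop(columns=["Attrition"]), df["Attrition"]  # assumes 0/1 target

# 70/30 split: 3,500 training rows, 1,500 test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=5),
    "Boosted Model": GradientBoostingClassifier(),
}

for name, model in models.items():
    preds = model.fit(X_train, y_train).predict(X_test)
    # Recall on the attrition class is what drove the final model choice
    print(f"{name}: accuracy={accuracy_score(y_test, preds):.3f}, "
          f"recall={recall_score(y_test, preds, pos_label=1):.3f}")
```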

5. Selected the Decision Tree as the final model — not because of accuracy (all three scored ~77.8%), but because its attrition recall of 54.6% was nearly double Logistic Regression's 29.7%, and recall is what matters most in HR risk detection.

Attrition Rate by Department (%)

[Bar chart: attrition rate for Sales, HR, R&D, IT, Operations, and Finance, on a 0–24% axis]

Sales had the highest attrition rate at 20.6%

03 · Data Storytelling · Tableau · Interdisciplinary

Pharmaceutical Tablet Quality Analytics

In a cross-disciplinary collaboration between Data Science and Pharmaceutics students, I analysed Maxeo tablet manufacturing data across 32 production batches to identify the root cause of British Pharmacopoeia compliance failures — and communicated findings through a structured data story.

What I Did — Step by Step

1. Collaborated in a cross-disciplinary team to analyse Maxeo tablet data across 32 production batches, targeting British Pharmacopoeia (BP) compliance for Uniformity of Mass.

2. Structured the entire analysis as a 5-act data story (Introduction → Rising Action → Climax → Falling Action → Conclusion) to make complex QC findings accessible to non-technical QA personnel.

3. Built a 4-panel Tableau dashboard showing Avg Weight, Avg Height, Temperature trend, and Humidity trend across all 32 batches — revealing high weight fluctuation (62.1–64.8 mg, mean 63.5 mg).

4. Created scatter plots to test whether drift in temperature (24.5 °C → 25.8 °C) and humidity (62.2% → 59.7%) caused the failures — and ruled out environmental factors as the primary cause.
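
The original check was done with Tableau scatter plots; the same test reproduced in Python for illustration (file and column names are assumptions):

```python
# Environmental-factor check sketch -- file and column names are assumptions.
import pandas as pd

batches = pd.read_csv("maxeo_batches.csv")  # hypothetical 32-row batch summary

# Near-zero correlations here supported ruling out temperature and humidity
print(batches[["AvgWeight", "Temperature", "Humidity"]].corr())
```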

5. Concluded that process-level variability (compression force inconsistency, die fill variation) was the root cause, and recommended tighter compression parameter control and enhanced monitoring.

Simulated Avg Weight by Batch (mg) — 32 Batches

[Line chart: simulated average tablet weight per batch, roughly 61–66 mg across batches 1–32]

High fluctuation indicates the tablet press was struggling to maintain a uniform fill

4-panel Tableau dashboard — 32-batch analysis by Anupam's team

32 · Batches Analysed
62.1–64.8 mg · Weight Range
+1.3 °C · Temp Drift
Process · Root Cause

Three Projects, Three Strengths

Each project reflects a different dimension of how I approach data — from infrastructure to modelling to communication. Together, they represent how I think about building solutions that are end-to-end, business-aware, and collaborative.

AWS Pipeline · End-to-End Thinking: From API ingestion to Power BI — I own the full stack
Attrition ML · Business Mindset: Chose recall over accuracy because HR needs to catch leavers
Pharma Analytics · Collaborative Communication: Turned complex QC data into a 5-act story for QA teams