I Built a BFIU-Compliant AML Detection System in Python (Here's Why the Kaggle Approach Doesn't Work)

Most AML tutorials end with a confusion matrix and a 99% accuracy score. Here's why that doesn't work — and what I built instead.

I've been working in fintech compliance data for a while. The one thing I kept noticing: every "fraud detection project" on GitHub or Kaggle uses the same dataset, the ULB credit card fraud dataset from 2013. It has about 284,000 rows, 30 features (28 of them anonymized PCA components labeled V1-V28, plus Time and Amount), and approximately zero explanatory value for anyone who wants to understand how financial crime actually works.

So I built something different.

The problem with the standard approach

Real transaction monitoring engines don't work like Kaggle competitions. They don't take a CSV, train a model, and output a probability score. They work like this:

1. A rule engine runs first: deterministic, auditable, regulatory-cited rules that generate alerts
2. Those alerts get scored and triaged by risk tier
3. An ML layer reduces false positives among the high-risk alerts
4. The remaining candidates go into a SAR queue for human review

If your "AML project" skips the rule engine and the triage step and goes straight to "I trained an XGBoost model", you've missed the entire operational reality of how compliance teams work.
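To make the four stages concrete, here is a minimal sketch of how they might chain together. Everything in it is an assumption for illustration: the `Alert` class, the function names, the rule weights, and the single HIGH/LOW cutoff are hypothetical and are not taken from the toolkit.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    account_id: str
    rules: list          # names of the rules that fired for this account
    score: float = 0.0   # composite risk score, filled in at stage 2
    tier: str = "LOW"

def run_rules(accounts):
    """Stage 1: deterministic rules. `accounts` maps account id -> set of fired rule names."""
    return [Alert(acc, sorted(rules)) for acc, rules in accounts.items() if rules]

def score_and_triage(alerts, weights, high_cutoff=50):
    """Stage 2: sum the weights of the fired rules and assign a risk tier."""
    for a in alerts:
        a.score = min(100, sum(weights.get(r, 0) for r in a.rules))
        a.tier = "HIGH" if a.score >= high_cutoff else "LOW"
    return alerts

def ml_filter(alerts, is_likely_false_positive):
    """Stage 3: keep high-risk alerts that the model does not consider noise."""
    return [a for a in alerts if a.tier == "HIGH" and not is_likely_false_positive(a)]

def build_sar_queue(alerts):
    """Stage 4: order the survivors for human review, highest score first."""
    return sorted(alerts, key=lambda a: a.score, reverse=True)

# Hypothetical inputs: which rules fired per account, and illustrative weights.
accounts = {"A1": {"STRUCTURING", "VELOCITY"}, "A2": {"LATE_NIGHT"}, "A3": set()}
weights = {"STRUCTURING": 40, "VELOCITY": 25, "LATE_NIGHT": 10}

alerts = score_and_triage(run_rules(accounts), weights)
queue = build_sar_queue(ml_filter(alerts, lambda alert: False))  # stand-in for a real model
```

The point of the shape is that the model only ever sees alerts the rules have already produced, so every item in the queue can be traced back to a named rule.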

Why Bangladesh MFS data is the interesting problem

Bangladesh has 100M+ mobile financial service users across bKash, Nagad, Rocket, and others. The transaction patterns here break generic AML tools constantly.

Example: a standard ROUND_AMOUNT rule fires on transactions that are suspiciously round — like BDT 500, 1000, 5000. But in Bangladesh, those are completely normal: rent, bazar purchases, rickshaw fares, utility bills. A generic rule would generate thousands of false positives per day.

The BFIU (Bangladesh Financial Intelligence Unit) knows this. Their guidelines are calibrated to local behavior. The structuring threshold in Bangladesh is BDT 10,000 — not $10,000. Dormancy patterns, high-value thresholds, velocity limits — all of these are context-specific.

So I built a dataset and rule engine that reflects this reality.

What I built: the detection architecture

The toolkit has 4 layers:

Layer 1: Synthetic MFS data generator

10,000+ transactions with realistic bKash/Nagad patterns — round amounts, merchant payments, P2P transfers, agent cash-outs. I injected 5 typologies with known ground truth: structuring, smurfing, dormant account spikes, late-night clusters, and sudden high-value single transactions.

python generate_data.py --accounts 500 --txns 10000 --inject-typologies
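As a rough idea of what typology injection with ground truth can look like, here is a small standard-library sketch. It is not the contents of `generate_data.py`: the function names, field names, amount distribution, and the 60% share of round amounts are all assumptions chosen for illustration, and only the structuring typology is shown.

```python
import random
from datetime import datetime, timedelta

def generate_transactions(n_accounts=50, n_txns=1000, seed=7):
    """Background traffic: mostly everyday round amounts, all labeled NORMAL."""
    rng = random.Random(seed)
    start = datetime(2024, 1, 1)
    common_amounts = [500, 1000, 2000, 5000]  # assumed typical round amounts
    txns = []
    for i in range(n_txns):
        is_round = rng.random() < 0.6  # assumed share of round-amount payments
        txns.append({
            "txn_id": i,
            "account": f"ACC{rng.randrange(n_accounts):04d}",
            "ts": start + timedelta(minutes=rng.randrange(30 * 24 * 60)),
            "amount": rng.choice(common_amounts) if is_round else rng.randrange(50, 8000),
            "typology": "NORMAL",
        })
    return txns

def inject_structuring(txns, account, day, n=4, seed=7):
    """Append n transfers just under BDT 10,000 on one day, labeled as ground truth."""
    rng = random.Random(seed)
    base = len(txns)
    for k in range(n):
        txns.append({
            "txn_id": base + k,
            "account": account,
            "ts": day + timedelta(hours=9 + 2 * k),
            "amount": rng.randrange(9000, 10000),  # 9,000 to 9,999
            "typology": "STRUCTURING",
        })
    return txns

txns = generate_transactions(n_txns=200)
txns = inject_structuring(txns, "ACC0001", datetime(2024, 1, 15))
```

Because each injected row carries its own `typology` label, later stages can measure precision and recall against known answers instead of guessing.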

Layer 2: BFIU-calibrated rule engine

Six rules, each tied to specific BFIU circulars:

STRUCTURING — ≥3 transactions below BDT 10,000 in a 24-hour window (BFIU Circular 02/2019)
VELOCITY — ≥5 transactions within any 60-minute rolling window
DORMANT_SPIKE — account inactive for 30+ days followed by sudden transaction surge
LATE_NIGHT — transaction between 01:00–04:00 AM (elevated risk period for agent fraud)
ROUND_AMOUNT — amount ≥ BDT 50,000 AND ≥ 5× the sender's own 90-day median (this is the calibrated version — it doesn't fire on normal round amounts)
HIGH_VALUE — single transaction above BDT 20,000
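The two windowed rules above (STRUCTURING and VELOCITY) can be sketched with a simple sliding window over each account's sorted timestamps. This is a literal reading of the thresholds listed here, not the toolkit's code; the function names and the dict-based transaction format are assumptions, and a production STRUCTURING rule would probably also restrict itself to amounts close to the limit.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def max_in_window(timestamps, window):
    """Largest number of timestamps inside any rolling window of the given length (inclusive)."""
    ts = sorted(timestamps)
    best, left = 0, 0
    for right in range(len(ts)):
        while ts[right] - ts[left] > window:
            left += 1
        best = max(best, right - left + 1)
    return best

def windowed_rules(txns, struct_limit=10_000, struct_min=3, velocity_min=5):
    """Return {account: set of fired rule names} for STRUCTURING and VELOCITY."""
    all_ts = defaultdict(list)   # every transaction, for VELOCITY
    sub_ts = defaultdict(list)   # only amounts below the limit, for STRUCTURING
    for t in txns:
        all_ts[t["account"]].append(t["ts"])
        if t["amount"] < struct_limit:
            sub_ts[t["account"]].append(t["ts"])

    fired = defaultdict(set)
    for acc, ts in sub_ts.items():
        if max_in_window(ts, timedelta(hours=24)) >= struct_min:
            fired[acc].add("STRUCTURING")
    for acc, ts in all_ts.items():
        if max_in_window(ts, timedelta(minutes=60)) >= velocity_min:
            fired[acc].add("VELOCITY")
    return dict(fired)

day = datetime(2024, 1, 15)
sample = (
    [{"account": "A", "ts": day + timedelta(hours=h), "amount": 9500} for h in (9, 12, 15)]
    + [{"account": "B", "ts": day + timedelta(minutes=m), "amount": 15000} for m in (0, 10, 20, 30, 40)]
    + [{"account": "C", "ts": day + timedelta(days=d), "amount": 500} for d in (0, 3)]
)
fired = windowed_rules(sample)
```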

Notice the ROUND_AMOUNT rule: it's baseline-relative. A BDT 1,000 round amount from someone who sends BDT 800 regularly doesn't fire. A BDT 50,000 round amount from that same account does. This is what calibration actually means.
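A baseline-relative check like this takes only a few lines. The sketch below follows the rule as stated (at least BDT 50,000 and at least 5× the sender's 90-day median). Two things are my assumptions rather than part of the rule text: the explicit roundness test (`round_to`) and staying silent when the sender has no history in the lookback window.

```python
from datetime import datetime, timedelta
from statistics import median

def round_amount_fires(amount, ts, history, floor=50_000, multiple=5,
                       lookback_days=90, round_to=1000):
    """history: list of (timestamp, amount) pairs for the same sender's earlier transactions."""
    cutoff = ts - timedelta(days=lookback_days)
    recent = [a for (t, a) in history if cutoff <= t < ts]
    if not recent:
        return False  # assumption: without a baseline, this rule does not fire
    is_round = amount % round_to == 0  # assumption: "round" means a multiple of 1,000
    return is_round and amount >= floor and amount >= multiple * median(recent)

now = datetime(2024, 3, 1, 14, 0)
# A sender who regularly moves BDT 800.
small_sender = [(now - timedelta(days=d), 800) for d in range(1, 11)]
# A sender whose normal transfers are already around BDT 20,000.
large_sender = [(now - timedelta(days=d), 20_000) for d in range(1, 11)]

print(round_amount_fires(1_000, now, small_sender))    # below the BDT 50,000 floor
print(round_amount_fires(50_000, now, small_sender))   # far above this sender's baseline
print(round_amount_fires(50_000, now, large_sender))   # only 2.5x this sender's median
```

The third call is the interesting one: the same BDT 50,000 transfer is unremarkable for a sender whose median is already BDT 20,000.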

Layer 3: Composite risk scoring + threshold backtesting

Each rule fires with a weight. The composite score (0–100) maps to LOW / MEDIUM / HIGH / CRITICAL tiers. The backtesting script lets you shift any rule's threshold and see the effect on alert volume and precision before you deploy:

python backtest_thresholds.py --rule STRUCTURING --threshold 2 --compare-baseline

Output shows you exactly how many additional accounts get flagged, what their ground-truth typology rate is, and whether the precision-recall tradeoff is worth it.
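Here is a small sketch of both ideas: a weighted composite score mapped to tiers, and a backtest that compares alert volume, precision, and recall at two thresholds. The weights, tier cutoffs, function names, and sample counts are all hypothetical; this is not what `backtest_thresholds.py` does internally, only the kind of comparison it reports.

```python
def composite_score(fired_rules, weights):
    """Sum the weights of the fired rules, capped at 100."""
    return min(100, sum(weights[r] for r in fired_rules))

def tier(score):
    """Map a 0-100 score to a tier. The cutoffs are illustrative assumptions."""
    if score >= 75:
        return "CRITICAL"
    if score >= 50:
        return "HIGH"
    if score >= 25:
        return "MEDIUM"
    return "LOW"

def backtest(counts, truth, threshold):
    """counts: account -> max sub-limit transactions in 24h; truth: accounts with injected structuring."""
    flagged = {acc for acc, c in counts.items() if c >= threshold}
    true_pos = len(flagged & truth)
    return {
        "threshold": threshold,
        "alerts": len(flagged),
        "precision": true_pos / len(flagged) if flagged else 0.0,
        "recall": true_pos / len(truth) if truth else 0.0,
    }

# Hypothetical weights, one per rule.
weights = {"STRUCTURING": 40, "VELOCITY": 25, "DORMANT_SPIKE": 30,
           "LATE_NIGHT": 10, "ROUND_AMOUNT": 20, "HIGH_VALUE": 15}
score = composite_score({"STRUCTURING", "VELOCITY"}, weights)

# Hypothetical per-account counts and ground truth.
counts = {"A": 4, "B": 3, "C": 2, "D": 2, "E": 1}
truth = {"A", "B", "C"}
baseline = backtest(counts, truth, threshold=3)
candidate = backtest(counts, truth, threshold=2)
```

In this toy data, lowering the threshold from 3 to 2 doubles the alert count and catches the one missed account, at the cost of one false positive. That is the tradeoff you want to see before changing a live rule.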

Layer 4: LightGBM ML layer

The rule engine generates alerts. The ML model is trained on rule-enriched features, not raw transaction features. It learns which rule combinations predict actual typologies and which are noise. On the synthetic data, the false positive rate drops by roughly 40% with no significant change in recall.
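"Rule-enriched features" can be as simple as one flag per rule plus a couple of aggregates. The sketch below shows one possible encoding and a LightGBM fit on top of it. The feature layout, function names, and hyperparameters are assumptions for illustration, not the toolkit's actual feature set. LightGBM is imported lazily, so the feature code runs without it installed.

```python
RULES = ["STRUCTURING", "VELOCITY", "DORMANT_SPIKE",
         "LATE_NIGHT", "ROUND_AMOUNT", "HIGH_VALUE"]

def rule_features(fired_rules, score):
    """One 0/1 flag per rule, then the number of rules fired, then the composite score."""
    flags = [1 if r in fired_rules else 0 for r in RULES]
    return flags + [sum(flags), score]

def train_false_positive_filter(X, y):
    """Fit a classifier on alert-level features. y is 1 for a true typology, 0 for noise.

    Requires `pip install lightgbm`; the hyperparameters are placeholders.
    """
    import numpy as np
    from lightgbm import LGBMClassifier
    model = LGBMClassifier(n_estimators=200, learning_rate=0.05)
    model.fit(np.asarray(X), np.asarray(y))
    return model

# One alert where STRUCTURING and LATE_NIGHT fired, with a composite score of 50.
row = rule_features({"STRUCTURING", "LATE_NIGHT"}, 50)
```

Once a model is trained, `model.predict_proba(...)[:, 1]` gives a per-alert probability that can be used to rank or suppress alerts. The rule flags stay visible as inputs, so an analyst can still see why an alert exists.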

The output

The final output is a SAR candidates CSV — not a model accuracy report. It looks like what a compliance analyst would actually review: account ID, risk score, triggered rules, supporting transaction summary, regulatory reference.
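A minimal sketch of such an export, using only the standard library, is shown below. The column names are my paraphrase of the fields listed above, and the sample rows are made up; the toolkit's actual CSV layout may differ.

```python
import csv
import io

FIELDS = ["account_id", "risk_score", "triggered_rules",
          "txn_summary", "regulatory_reference"]

def write_sar_candidates(candidates, handle):
    """Write candidates to an open text handle, highest risk score first."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for cand in sorted(candidates, key=lambda c: c["risk_score"], reverse=True):
        row = dict(cand)
        row["triggered_rules"] = ";".join(cand["triggered_rules"])  # one cell per account
        writer.writerow(row)

# Made-up candidates for illustration only.
candidates = [
    {"account_id": "ACC0007", "risk_score": 30, "triggered_rules": ["LATE_NIGHT", "ROUND_AMOUNT"],
     "txn_summary": "1 txn, BDT 60,000 at 02:10", "regulatory_reference": ""},
    {"account_id": "ACC0042", "risk_score": 65, "triggered_rules": ["STRUCTURING", "VELOCITY"],
     "txn_summary": "4 txns under BDT 10,000 within 6 hours",
     "regulatory_reference": "BFIU Circular 02/2019"},
]

buffer = io.StringIO()
write_sar_candidates(candidates, buffer)
print(buffer.getvalue())
```

To write to disk instead, open the file with `newline=""` (as the `csv` module documentation recommends) and pass that handle in place of the `StringIO` buffer.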

That's the difference between a data science project and a compliance tool.

Why this architecture transfers globally

I calibrated this for BD MFS data, but the same problem exists everywhere:

UPI in India — round amounts are normal, velocity patterns differ from Western banking
JazzCash/EasyPaisa in Pakistan — similar threshold calibration needed
M-Pesa in Kenya — agent network patterns require dormancy/velocity rules specific to agent behavior

Any MFS or digital payments context requires baseline-relative thresholds, not global fixed values. The toolkit's architecture is the pattern. The BD calibration is one implementation of it.

What's in the free preview

I've put the data generation notebook and a 500-row sample on GitHub, free:

👉 github.com/monsurhabib01/aml-detection-preview

You can run the full EDA, see the transaction patterns, and understand the typology injection logic.

The full toolkit

The complete rule engine, backtesting scripts, ML layer, SAR export, compliance dashboard, regulatory documentation, and test suite are in the paid version.

I've written a full breakdown of what's included — features, comparison table, FAQ — on the product page:

👉 aitipseveryday.com/p/python-aml-toolkit.html — landing page
👉 Get it on Gumroad — $39 one-time

If you're targeting fintech or AML roles, or building compliance tooling for South Asian financial services, this is a more honest portfolio project than submitting the Kaggle credit card dataset one more time.

Questions or feedback? Leave a comment or email monsurhabib01@gmail.com
