🤖 Technical Deep-Dive

ML Architecture & Implementation

SecurePass uses a Random Forest Classifier trained on 3,600 synthetic passwords to classify password strength into three tiers. Here's exactly how it works — from raw text to prediction.

Active model: RandomForest (scikit-learn)

Full Pipeline
⌨️ Raw Password (plain text string) → ⚙️ Feature Extraction (7 numeric features) → 🌲 Random Forest (100 decision trees) → 📊 predict_proba() (confidence per class) → 🏷️ Label + Score (Weak / Medium / Strong)
1 Feature Engineering

The raw password string is converted into a 7-dimensional numeric vector. Each dimension captures a distinct security-relevant property. The model never sees the original characters — only these features.

Feature        Type          Description
length         Integer       Total number of characters
has_upper      Binary 0/1    Contains at least one uppercase letter (A–Z)
has_lower      Binary 0/1    Contains at least one lowercase letter (a–z)
has_digit      Binary 0/1    Contains at least one digit (0–9)
has_symbol     Binary 0/1    Contains at least one non-alphanumeric character
entropy        Float (bits)  Shannon entropy: unpredictability of the character distribution
unique_ratio   Float 0–1     Unique characters divided by total length; penalises repetition
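As a sketch, the whole extraction step fits in one small function. The name extract_features is illustrative; the project's actual helper may be organised differently:

```python
import math

def extract_features(password: str) -> list[float]:
    """Map a raw password to the 7-dimensional feature vector (assumes non-empty input)."""
    n = len(password)
    freq = {}
    for c in password:
        freq[c] = freq.get(c, 0) + 1
    # Shannon entropy in bits over the character frequency distribution (see section 2)
    entropy = -sum((cnt / n) * math.log2(cnt / n) for cnt in freq.values())
    return [
        n,                                              # length
        float(any(c.isupper() for c in password)),      # has_upper
        float(any(c.islower() for c in password)),      # has_lower
        float(any(c.isdigit() for c in password)),      # has_digit
        float(any(not c.isalnum() for c in password)),  # has_symbol
        entropy,                                        # entropy
        len(freq) / n,                                  # unique_ratio
    ]
```

For example, "Abc123!x" yields length 8, all four class flags set, entropy log2(8) = 3.0 bits, and unique_ratio 1.0.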
2 Shannon Entropy

Shannon entropy measures how unpredictable or random a password is. It is calculated using the frequency of each character:

from math import log2

def shannon_entropy(password):
    freq = {}
    for c in password:
        freq[c] = freq.get(c, 0) + 1
    n = len(password)
    return -sum((cnt / n) * log2(cnt / n) for cnt in freq.values())

A password like "aaaaaaa" has entropy of 0 bits: every character is identical, so the distribution is completely predictable. Because this per-password estimate is capped at log2(length), a 12-character password with no repeated characters tops out at log2(12) ≈ 3.58 bits per character. A truly random draw from all ~95 printable ASCII symbols carries up to log2(95) ≈ 6.6 bits per character in the information-theoretic sense, which is what makes brute force computationally infeasible.
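The two extremes make the behaviour concrete (the function is repeated here so the snippet runs standalone):

```python
from math import log2

def shannon_entropy(password):
    freq = {}
    for c in password:
        freq[c] = freq.get(c, 0) + 1
    n = len(password)
    return -sum((cnt / n) * log2(cnt / n) for cnt in freq.values())

low = shannon_entropy("aaaaaaa")        # 0 bits: a single repeated character
high = shannon_entropy("aB3$xQ9!kL2@")  # log2(12) ≈ 3.58 bits: all 12 characters distinct
```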

3 Random Forest Classifier

A Random Forest is an ensemble of 100 independent Decision Trees. Each tree is trained on a random subset of the 3,600 training passwords (bootstrap sampling), and each split considers a random subset of the 7 features. This controlled randomness prevents overfitting and improves generalisation.

At prediction time, every tree independently votes for a class (Weak / Medium / Strong). The final prediction is a majority vote, and predict_proba() returns the fraction of trees that voted for each class — used directly as the confidence score.

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=100,     # 100 decision trees
    max_depth=None,       # trees grow until leaves are pure
    min_samples_split=4,  # a node needs at least 4 samples before it can be split
    random_state=42,      # reproducible results
    n_jobs=-1,            # use all CPU cores for training
)
clf.fit(X_train, y_train)

# At runtime:
label = ["Weak", "Medium", "Strong"][clf.predict(X)[0]]
confidence = clf.predict_proba(X)[0].max()  # fraction of trees in agreement
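To make the voting concrete, here is a toy end-to-end run. The two-feature training data below (length, entropy) is invented for illustration; the real model is fit on the full 7-feature matrix:

```python
from sklearn.ensemble import RandomForestClassifier

# Invented toy data: [length, entropy] pairs with integer class labels
X_train = [[4, 0.5], [5, 1.0], [8, 2.5], [9, 2.8], [14, 3.5], [16, 3.8]]
y_train = [0, 0, 1, 1, 2, 2]  # 0 = Weak, 1 = Medium, 2 = Strong

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# predict_proba returns one probability per class: the fraction of trees voting for it
probs = clf.predict_proba([[15, 3.6]])[0]
label = ["Weak", "Medium", "Strong"][probs.argmax()]
```

Because every tree casts exactly one vote, the three probabilities always sum to 1, and the largest one doubles as the confidence score.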
4 Training Dataset

Since distributing real leaked passwords is unethical, SecurePass uses a synthetic dataset generator (src/train_model.py) that creates 3,600 balanced passwords across three classes:

  • 1,200 Weak samples
  • 1,200 Medium samples
  • 1,200 Strong samples

Labels are assigned using deterministic rules: Weak if length < 6 or entropy < 1.8; Strong if length ≥ 12, ≥ 3 character classes, and entropy ≥ 3.2; Medium otherwise. The dataset is split 80/20 for training and evaluation.
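Those rules translate directly into code. This is a sketch with illustrative names (num_classes counts how many of the four character classes appear); the real generator in src/train_model.py may differ:

```python
def label_password(length: int, num_classes: int, entropy: float) -> int:
    """Deterministic labelling rules: 0 = Weak, 1 = Medium, 2 = Strong."""
    if length < 6 or entropy < 1.8:
        return 0  # Weak
    if length >= 12 and num_classes >= 3 and entropy >= 3.2:
        return 2  # Strong
    return 1  # Medium
```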

5 Model Performance

After training, the model is evaluated on the held-out 20% test set (720 passwords). Typical results:

  • ~97% test accuracy
  • ~97% F1-score
  • 100 decision trees

The model is regenerated each time train_model.py is run, and saved to Data/model.pkl using joblib. The Flask server loads it once at startup for fast, sub-millisecond inference.
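The persistence round trip can be sketched as follows; a toy model and a temporary path stand in for the real training run and Data/model.pkl so the snippet is self-contained:

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the real training run in train_model.py
clf = RandomForestClassifier(n_estimators=10, random_state=42)
clf.fit([[0], [1], [10], [11]], [0, 0, 1, 1])

# train_model.py writes Data/model.pkl; a temp path keeps this sketch self-contained
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(clf, path)

# The Flask server does this once at startup, then reuses the object for every request
model = joblib.load(path)
```

Loading once at import time avoids paying deserialisation cost per request, which is what keeps inference sub-millisecond.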

6 Suggested Improvements
  • High Impact
    Have I Been Pwned API — Check passwords against 10+ billion breached credentials using k-Anonymity (only first 5 chars of the SHA-1 hash are sent over the network).
  • High Impact
    zxcvbn Integration — Dropbox's open-source password estimator detects dictionary words, keyboard patterns (qwerty, 12345), dates, and repetitions that pure entropy misses.
  • Medium
    Real-World Training Data — Train the model on the publicly available RockYou dataset labels (without distributing the passwords) for a far more realistic decision boundary.
  • Medium
    Gradient Boosting (XGBoost) — Replace RandomForest with XGBoost or LightGBM for higher accuracy and built-in feature importance scores.
  • Low
    API Rate Limiting — Add Flask-Limiter to prevent programmatic abuse of the /analyze endpoint (e.g. limit to 60 req/min per IP).
  • Low
    Password Generator — After detecting a weak password, offer to generate a cryptographically secure replacement using secrets.token_urlsafe().
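The k-Anonymity scheme from the first suggestion can be sketched client-side without any network traffic: only the 5-character SHA-1 prefix would ever be sent, and the suffix comparison happens locally. The helper name below is illustrative:

```python
import hashlib

def hibp_hash_parts(password: str) -> tuple[str, str]:
    """Split the uppercase SHA-1 hex digest into the 5-char prefix and 35-char suffix."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]

prefix, suffix = hibp_hash_parts("password")
# A real client would GET https://api.pwnedpasswords.com/range/<prefix> and
# search the response body for <suffix>; the full hash never leaves the machine.
```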