🤖 Technical Deep-Dive

ML Architecture & Implementation

SecurePass uses a Random Forest Classifier trained on 3,600 synthetic passwords to classify password strength into three tiers. Here's exactly how it works — from raw text to prediction.

Active model: RandomForest (scikit-learn)

Full Pipeline
⌨️ Raw Password (plain text string) → ⚙️ Feature Extraction (7 numeric features) → 🌲 Random Forest (100 decision trees) → 📊 predict_proba() (confidence per class) → 🏷️ Label + Score (Weak / Medium / Strong)
1 Feature Engineering

The raw password string is converted into a 7-dimensional numeric vector. Each dimension captures a distinct security-relevant property. The model never sees the original characters — only these features.

Feature        Type          Description
length         Integer       Total number of characters
has_upper      Binary 0/1    Contains at least one uppercase letter (A–Z)
has_lower      Binary 0/1    Contains at least one lowercase letter (a–z)
has_digit      Binary 0/1    Contains at least one digit (0–9)
has_symbol     Binary 0/1    Contains at least one non-alphanumeric character
entropy        Float (bits)  Shannon entropy: unpredictability of the character distribution
unique_ratio   Float 0–1     Unique characters divided by total length; penalises repetition
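As a sketch, the whole extraction step fits in one small function. The name extract_features is illustrative; the project's actual helper may be organised differently:

```python
import math

def extract_features(password: str) -> list[float]:
    """Map a raw password to the 7-dimensional feature vector (assumes non-empty input)."""
    n = len(password)
    freq = {}
    for c in password:
        freq[c] = freq.get(c, 0) + 1
    # Shannon entropy in bits over the character frequency distribution (see section 2)
    entropy = -sum((cnt / n) * math.log2(cnt / n) for cnt in freq.values())
    return [
        n,                                              # length
        float(any(c.isupper() for c in password)),      # has_upper
        float(any(c.islower() for c in password)),      # has_lower
        float(any(c.isdigit() for c in password)),      # has_digit
        float(any(not c.isalnum() for c in password)),  # has_symbol
        entropy,                                        # entropy
        len(freq) / n,                                  # unique_ratio
    ]
```

For example, "Abc123!x" yields length 8, all four class flags set, entropy log2(8) = 3.0 bits, and unique_ratio 1.0.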
2 Shannon Entropy

Shannon entropy measures how unpredictable or random a password is. It is calculated using the frequency of each character:

from math import log2

def shannon_entropy(password):
    freq = {}
    for c in password:
        freq[c] = freq.get(c, 0) + 1
    n = len(password)
    return -sum((cnt / n) * log2(cnt / n) for cnt in freq.values())

A password like "aaaaaaa" has entropy of 0 bits: every character is identical, so the distribution is completely predictable. Because this per-password estimate is capped at log2(length), a 12-character password with no repeated characters tops out at log2(12) ≈ 3.58 bits per character. A truly random draw from all ~95 printable ASCII symbols carries up to log2(95) ≈ 6.6 bits per character in the information-theoretic sense, which is what makes brute force computationally infeasible.
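The two extremes make the behaviour concrete (the function is repeated here so the snippet runs standalone):

```python
from math import log2

def shannon_entropy(password):
    freq = {}
    for c in password:
        freq[c] = freq.get(c, 0) + 1
    n = len(password)
    return -sum((cnt / n) * log2(cnt / n) for cnt in freq.values())

low = shannon_entropy("aaaaaaa")        # 0 bits: a single repeated character
high = shannon_entropy("aB3$xQ9!kL2@")  # log2(12) ≈ 3.58 bits: all 12 characters distinct
```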

3 Random Forest Classifier

A Random Forest is an ensemble of 100 independent Decision Trees. Each tree is trained on a random subset of the 3,600 training passwords (bootstrap sampling), and each split considers a random subset of the 7 features. This controlled randomness prevents overfitting and improves generalisation.

At prediction time, every tree independently votes for a class (Weak / Medium / Strong). The final prediction is a majority vote, and predict_proba() returns the fraction of trees that voted for each class — used directly as the confidence score.

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=100,     # 100 decision trees
    max_depth=None,       # trees grow until leaves are pure
    min_samples_split=4,  # a node needs at least 4 samples before it can be split
    random_state=42,      # reproducible results
    n_jobs=-1,            # use all CPU cores for training
)
clf.fit(X_train, y_train)

# At runtime:
label = ["Weak", "Medium", "Strong"][clf.predict(X)[0]]
confidence = clf.predict_proba(X)[0].max()  # fraction of trees in agreement
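To make the voting concrete, here is a toy end-to-end run. The two-feature training data below (length, entropy) is invented for illustration; the real model is fit on the full 7-feature matrix:

```python
from sklearn.ensemble import RandomForestClassifier

# Invented toy data: [length, entropy] pairs with integer class labels
X_train = [[4, 0.5], [5, 1.0], [8, 2.5], [9, 2.8], [14, 3.5], [16, 3.8]]
y_train = [0, 0, 1, 1, 2, 2]  # 0 = Weak, 1 = Medium, 2 = Strong

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# predict_proba returns one probability per class: the fraction of trees voting for it
probs = clf.predict_proba([[15, 3.6]])[0]
label = ["Weak", "Medium", "Strong"][probs.argmax()]
```

Because every tree casts exactly one vote, the three probabilities always sum to 1, and the largest one doubles as the confidence score.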
4 Training Dataset

Since distributing real leaked passwords is unethical, SecurePass uses a synthetic dataset generator (src/train_model.py) that creates 3,600 balanced passwords across three classes:

  • 1,200 Weak samples
  • 1,200 Medium samples
  • 1,200 Strong samples

Labels are assigned using deterministic rules: Weak if length < 6 or entropy < 1.8; Strong if length ≥ 12, ≥ 3 character classes, and entropy ≥ 3.2; Medium otherwise. The dataset is split 80/20 for training and evaluation.
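Those rules translate directly into code. This is a sketch with illustrative names (num_classes counts how many of the four character classes appear); the real generator in src/train_model.py may differ:

```python
def label_password(length: int, num_classes: int, entropy: float) -> int:
    """Deterministic labelling rules: 0 = Weak, 1 = Medium, 2 = Strong."""
    if length < 6 or entropy < 1.8:
        return 0  # Weak
    if length >= 12 and num_classes >= 3 and entropy >= 3.2:
        return 2  # Strong
    return 1  # Medium
```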

5 Model Performance

After training, the model is evaluated on the held-out 20% test set (720 passwords). Typical results:

  • ~97% test accuracy
  • ~97% F1-score
  • 100 decision trees

The model is regenerated each time train_model.py is run, and saved to Data/model.pkl using joblib. The Flask server loads it once at startup for fast, sub-millisecond inference.
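The persistence round trip can be sketched as follows; a toy model and a temporary path stand in for the real training run and Data/model.pkl so the snippet is self-contained:

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the real training run in train_model.py
clf = RandomForestClassifier(n_estimators=10, random_state=42)
clf.fit([[0], [1], [10], [11]], [0, 0, 1, 1])

# train_model.py writes Data/model.pkl; a temp path keeps this sketch self-contained
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(clf, path)

# The Flask server does this once at startup, then reuses the object for every request
model = joblib.load(path)
```

Loading once at import time avoids paying deserialisation cost per request, which is what keeps inference sub-millisecond.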

6 Suggested Improvements
  • High Impact
    Have I Been Pwned API — Check passwords against 10+ billion breached credentials using k-Anonymity (only first 5 chars of the SHA-1 hash are sent over the network).
  • High Impact
    zxcvbn Integration — Dropbox's open-source password estimator detects dictionary words, keyboard patterns (qwerty, 12345), dates, and repetitions that pure entropy misses.
  • Medium
    Real-World Training Data — Train the model on the publicly available RockYou dataset labels (without distributing the passwords) for a far more realistic decision boundary.
  • Medium
    Gradient Boosting (XGBoost) — Replace RandomForest with XGBoost or LightGBM for higher accuracy and built-in feature importance scores.
  • Low
    API Rate Limiting — Add Flask-Limiter to prevent programmatic abuse of the /analyze endpoint (e.g. limit to 60 req/min per IP).
  • Low
    Password Generator — After detecting a weak password, offer to generate a cryptographically secure replacement using secrets.token_urlsafe().
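The k-Anonymity scheme from the first suggestion can be sketched client-side without any network traffic: only the 5-character SHA-1 prefix would ever be sent, and the suffix comparison happens locally. The helper name below is illustrative:

```python
import hashlib

def hibp_hash_parts(password: str) -> tuple[str, str]:
    """Split the uppercase SHA-1 hex digest into the 5-char prefix and 35-char suffix."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]

prefix, suffix = hibp_hash_parts("password")
# A real client would GET https://api.pwnedpasswords.com/range/<prefix> and
# search the response body for <suffix>; the full hash never leaves the machine.
```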