SecurePass uses a Random Forest Classifier trained on 3,600 synthetic passwords to classify password strength into three tiers. Here's exactly how it works — from raw text to prediction.
Active model: RandomForest (scikit-learn)
The raw password string is converted into a 7-dimensional numeric vector. Each dimension captures a distinct security-relevant property. The model never sees the original characters — only these features.
| Feature | Type | Description |
|---|---|---|
| length | Integer | Total number of characters |
| has_upper | Binary 0/1 | Contains at least one uppercase letter (A–Z) |
| has_lower | Binary 0/1 | Contains at least one lowercase letter (a–z) |
| has_digit | Binary 0/1 | Contains at least one digit (0–9) |
| has_symbol | Binary 0/1 | Contains at least one non-alphanumeric character |
| entropy | Float (bits) | Shannon entropy — measures unpredictability of character distribution |
| unique_ratio | Float 0–1 | Ratio of unique characters to total length — penalises repetition |
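The extraction step above can be sketched as a small helper. The function name `extract_features` and the use of `str.isalnum` to detect symbols are assumptions for illustration, but the seven features match the table:

```python
from math import log2

def shannon_entropy(password):
    # Entropy of the empirical character distribution, in bits.
    freq = {}
    for c in password:
        freq[c] = freq.get(c, 0) + 1
    n = len(password)
    return -sum((cnt / n) * log2(cnt / n) for cnt in freq.values())

def extract_features(password):
    # 7-dimensional vector in the same order as the table above.
    return [
        len(password),                                      # length
        int(any(c.isupper() for c in password)),            # has_upper
        int(any(c.islower() for c in password)),            # has_lower
        int(any(c.isdigit() for c in password)),            # has_digit
        int(any(not c.isalnum() for c in password)),        # has_symbol
        shannon_entropy(password),                          # entropy
        len(set(password)) / len(password),                 # unique_ratio
    ]
```

For example, `extract_features("Aa1!")` yields all four binary flags set, an entropy of exactly 2.0 bits (four distinct characters), and a unique_ratio of 1.0.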
Shannon entropy measures how unpredictable or random a password is. It is calculated using the frequency of each character:
```python
from math import log2

def shannon_entropy(password):
    freq = {}
    for c in password:
        freq[c] = freq.get(c, 0) + 1
    n = len(password)
    return -sum((cnt / n) * log2(cnt / n) for cnt in freq.values())
```

A password like "aaaaaaa" has entropy ≈ 0 bits (completely predictable). Because this formula measures the character distribution within the password itself, its value is capped at log2(length): a 12-character password with no repeated characters scores log2(12) ≈ 3.58 bits, comfortably above the Strong threshold used during labelling.
A Random Forest is an ensemble of 100 independent Decision Trees. Each tree is trained on a random subset of the 3,600 training passwords (bootstrap sampling), and each split considers a random subset of the 7 features. This controlled randomness prevents overfitting and improves generalisation.
At prediction time, every tree independently votes for a class (Weak / Medium / Strong). The final prediction is a majority vote, and predict_proba() returns the fraction of trees that voted for each class — used directly as the confidence score.
```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=100,     # 100 decision trees
    max_depth=None,       # trees grow until leaves are pure
    min_samples_split=4,  # at least 4 samples required to split a node
    random_state=42,      # reproducible results
    n_jobs=-1,            # use all CPU cores for training
)
clf.fit(X_train, y_train)

# At runtime:
label = ["Weak", "Medium", "Strong"][clf.predict(X)[0]]
confidence = clf.predict_proba(X)[0].max()  # fraction of trees in agreement
```

Since distributing real leaked passwords is unethical, SecurePass uses a synthetic dataset generator (src/train_model.py) that creates 3,600 balanced passwords across three classes.
Labels are assigned using deterministic rules: Weak if length < 6 or entropy < 1.8; Strong if length ≥ 12, ≥ 3 character classes, and entropy ≥ 3.2; Medium otherwise. The dataset is split 80/20 for training and evaluation.
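The labelling rules can be sketched as follows. The helper names (`char_classes`, `assign_label`) are illustrative, not taken from the project source; `shannon_entropy` is the function shown earlier:

```python
from math import log2

def shannon_entropy(password):
    freq = {}
    for c in password:
        freq[c] = freq.get(c, 0) + 1
    n = len(password)
    return -sum((cnt / n) * log2(cnt / n) for cnt in freq.values())

def char_classes(password):
    # Number of distinct character classes present (max 4).
    return sum([
        any(c.isupper() for c in password),
        any(c.islower() for c in password),
        any(c.isdigit() for c in password),
        any(not c.isalnum() for c in password),
    ])

def assign_label(password):
    # 0 = Weak, 1 = Medium, 2 = Strong, matching the class indices
    # used by the runtime lookup ["Weak", "Medium", "Strong"].
    e = shannon_entropy(password)
    if len(password) < 6 or e < 1.8:
        return 0
    if len(password) >= 12 and char_classes(password) >= 3 and e >= 3.2:
        return 2
    return 1
```

Under these rules, `"abc"` is Weak (too short), `"password1"` is Medium (long enough and varied, but below the Strong bar), and a 12-character mixed-class string of unique characters is Strong.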
After training, the model is evaluated on the held-out 20% test set (720 passwords).
The model is regenerated each time train_model.py is run, and saved to Data/model.pkl using joblib. The Flask server loads it once at startup for fast, sub-millisecond inference.
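A minimal sketch of this save/load round trip, using toy data in place of the real 7-feature training matrix (the path and model size here are illustrative; the project writes Data/model.pkl from src/train_model.py):

```python
import joblib
from sklearn.ensemble import RandomForestClassifier

# Toy data stands in for the real feature matrix and labels.
clf = RandomForestClassifier(n_estimators=10, random_state=42)
clf.fit([[0], [1], [2], [3]], [0, 0, 1, 1])

joblib.dump(clf, "model.pkl")      # train_model.py writes Data/model.pkl
loaded = joblib.load("model.pkl")  # the Flask server does this once at startup
```

Loading once at startup, rather than per request, is what keeps inference sub-millisecond: the deserialised forest stays in memory for the lifetime of the process.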
Recommended hardening: rate-limit the /analyze endpoint (e.g. limit to 60 req/min per IP), and generate secret keys with secrets.token_urlsafe() rather than hard-coding them.