In this article we look at the SMS spam collection dataset. This dataset contains 5,574 tagged SMS messages in English that have been collected for SMS Spam research, labeled acording being ham (legitimate) or spam. This dataset is tiny for today’s standards, however it is easy to get reasonable results with simple methods. We will apply logistic regression to perform binary classification using scikit-learn and minimal computer power.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
Ordinary linear regression assumes that the response variable is normally distributed. In logistic regression, the response variable describes the probability that the outcome is the positive case. If the response variable is equal to or exceeds a discrimination threshold, the positive class is prediceted; otherwise, the negative class is predicted. The response variable is modeled as a function of a linear combination of the explanatory variables using the logistic function, defined as
\[\sigma(x) = \frac{1}{1 + e^{-x}},\]whose output is always in $[0, 1]$. The inverse of the logistic function is called the logit function.
df = pd.read_csv('SMSSpamCollection', delimiter='\t', header=None)
df.head()
0 | 1 | |
---|---|---|
0 | ham | Go until jurong point, crazy.. Available only ... |
1 | ham | Ok lar... Joking wif u oni... |
2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
3 | ham | U dun say so early hor... U c already then say... |
4 | ham | Nah I don't think he goes to usf, he lives aro... |
counts = df[0].value_counts()
print(f"# ham: {counts['ham']}, # spam: {counts['spam']}")
# ham: 4825, # spam: 747
By default, train_test_split()
assignes 25% of the samples to the training set.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(df[1], df[0], random_state=0)
Next, we create a TfidVectorizer
, which combines CountVectorizer
and TfidTransformer
. The fitting is applied to the training set and applied to both the training and the testing sets.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(X_train_raw)
X_test = vectorizer.transform(X_test_raw)
classifier = LogisticRegression()
_ = classifier.fit(X_train, y_train)
score_train = classifier.score(X_train, y_train)
score_test = classifier.score(X_test, y_test)
print(f"Score on the train set = {score_train:.2%}, on the test set = {score_test:.2%}")
Score on the train set = 97.32%, on the test set = 97.06%
A variety of metrics exist to evaluate the performance of binary classifiers against trusted labels. The most commonly used are:
- accuracy;
- precision;
- F1 measure;
- ROC AUC score.
All those measures depend on the concepts of true positives, true negatives, false positives and false negatives, where positive and negative depend on the classes, and true and false denote whether the predicted class is the same as the true class. The confusion matrix can be used to visualize those values.
from sklearn.metrics import confusion_matrix
A ‘negative’ is when we don’t detect a spam
(that is, ham
), a ‘positive’ is when we detect a spam.
labels = ['ham', 'spam']
cm = confusion_matrix(y_test, y_pred, labels=labels)
pd.DataFrame(cm, columns=labels, index=labels)
ham | spam | |
---|---|---|
ham | 1206 | 2 |
spam | 39 | 146 |
As we see from the confusion matrix, only two ham
messages were wrongly classified as spam
, while we have 39 spam
messages that were not properly classified. Let’s give a look at those messages.
from IPython.display import HTML
y_pred = classifier.predict(X_test)
css = 'background-color: FireBrick; font-family: Lucida Console;'
bad_spam, bad_ham = '', ''
for i, prediction in enumerate(y_pred):
exact = y_test.iloc[i]
if prediction == exact:
continue
if exact == 'ham':
bad_ham += f'<p style="{css}">{X_test_raw.iloc[i]}</p>'
else:
bad_spam += f'<p style="{css}">{X_test_raw.iloc[i]}</p>'
Here we have the two messages that are ham
but were wrongly classified as spam
:
HTML(bad_ham)
U can call now...
Are you free now?can i call now?
And now the 39 spam messages that were not properly classified:
HTML(bad_spam)
Sunshine Quiz Wkly Q! Win a top Sony DVD player if u know which country the Algarve is in? Txt ansr to 82277. £1.50 SP:Tyrone
Hi I'm sue. I am 20 years old and work as a lapdancer. I love sex. Text me live - I'm i my bedroom now. text SUE to 89555. By TextOperator G2 1DA 150ppmsg 18+
As a Registered Subscriber yr draw 4 a £100 gift voucher will b entered on receipt of a correct ans. When are the next olympics. Txt ans to 80062
You have 1 new message. Call 0207-083-6089
Email AlertFrom: Jeri StewartSize: 2KBSubject: Low-cost prescripiton drvgsTo listen to email call 123
Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B
18 days to Euro2004 kickoff! U will be kept informed of all the latest news and results daily. Unsubscribe send GET EURO STOP to 83222.
Sunshine Quiz Wkly Q! Win a top Sony DVD player if u know which country Liverpool played in mid week? Txt ansr to 82277. £1.50 SP:Tyrone
Send a logo 2 ur lover - 2 names joined by a heart. Txt LOVE NAME1 NAME2 MOBNO eg LOVE ADAM EVE 07123456789 to 87077 Yahoo! POBox36504W45WQ TxtNO 4 no ads 150p.
Your next amazing xxx PICSFREE1 video will be sent to you enjoy! If one vid is not enough for 2day text back the keyword PICSFREE1 to get the next video.
I'd like to tell you my deepest darkest fantasies. Call me 09094646631 just 60p/min. To stop texts call 08712460324 (nat rate)
Good Luck! Draw takes place 28th Feb 06. Good Luck! For removal send STOP to 87239 customer services 08708034412
Sunshine Quiz Wkly Q! Win a top Sony DVD player if u know which country the Algarve is in? Txt ansr to 82277. £1.50 SP:Tyrone
More people are dogging in your area now. Call 09090204448 and join like minded guys. Why not arrange 1 yourself. There's 1 this evening. A£1.50 minAPN LS278BB
Dear Voucher holder Have your next meal on us. Use the following link on your pc 2 enjoy a 2 4 1 dining experiencehttp://www.vouch4me.com/etlp/dining.asp
88800 and 89034 are premium phone services call 08718711108
Do you realize that in about 40 years, we'll have thousands of old ladies running around with tattoos?
Missed call alert. These numbers called but left no message. 07008009200
thesmszone.com lets you send free anonymous and masked messages..im sending this message from there..do you see the potential for abuse???
http//tms. widelive.com/index. wml?id=820554ad0a1705572711&first=true¡C C Ringtone¡
WOW! The Boys R Back. TAKE THAT 2007 UK Tour. Win VIP Tickets & pre-book with VIP Club. Txt CLUB to 81303. Trackmarque Ltd info@vipclub4u.
Call Germany for only 1 pence per minute! Call from a fixed line via access number 0844 861 85 85. No prepayment. Direct access!
Oh my god! I've found your number again! I'm so glad, text me back xafter this msgs cst std ntwk chg £1.50
Fantasy Football is back on your TV. Go to Sky Gamestar on Sky Active and play £250k Dream Team. Scoring starts on Saturday, so register now!SKY OPT OUT to 88088
Got what it takes 2 take part in the WRC Rally in Oz? U can with Lucozade Energy! Text RALLY LE to 61200 (25p), see packs or lucozade.co.uk/wrc & itcould be u!
Latest News! Police station toilet stolen, cops have nothing to go on!
100 dating service cal;l 09064012103 box334sk38ch
For sale - arsenal dartboard. Good condition but no doubles or trebles!
Hi ya babe x u 4goten bout me?' scammers getting smart..Though this is a regular vodafone no, if you respond you get further prem rate msg/subscription. Other nos used also. Beware!
Bloomberg -Message center +447797706009 Why wait? Apply for your future http://careers. bloomberg.com
(Bank of Granite issues Strong-Buy) EXPLOSIVE PICK FOR OUR MEMBERS *****UP OVER 300% *********** Nasdaq Symbol CDGT That is a $5.00 per..
Send a logo 2 ur lover - 2 names joined by a heart. Txt LOVE NAME1 NAME2 MOBNO eg LOVE ADAM EVE 07123456789 to 87077 Yahoo! POBox36504W45WQ TxtNO 4 no ads 150p
Not heard from U4 a while. Call me now am here all night with just my knickers on. Make me beg for it like U did last time 01223585236 XX Luv Nikiyu4.net
FreeMsg>FAV XMAS TONES!Reply REAL
Want 2 get laid tonight? Want real Dogging locations sent direct 2 ur mob? Join the UK's largest Dogging Network bt Txting GRAVEL to 69888! Nt. ec2a. 31p.msg@150p
ROMCAPspam Everyone around should be responding well to your presence since you are so warm and outgoing. You are bringing in a real breath of sunshine.
3. You have received your mobile content. Enjoy
Babe: U want me dont u baby! Im nasty and have a thing 4 filthyguys. Fancy a rude time with a sexy bitch. How about we go slo n hard! Txt XXX SLO(4msgs)
Sorry! U can not unsubscribe yet. THE MOB offer package has a min term of 54 weeks> pls resubmit request after expiry. Reply THEMOB HELP 4 more info