Score cut-offs can blow up in your face

Posted to LinkedIn at https://www.linkedin.com/today/post/article/20140609165942-5425117-score-cut-offs-can-blow-up-in-your-face
Risk scores are extremely powerful tools in determining the final disposition of credit applications. Typically scores are used in consumer lending, but they can be used in a commercial environment as well (the SME segment).
Most scores include variables covering application details, bureau variables (including a generic bureau-derived score, e.g. a FICO Score or Vantage Score, or in India the CIBIL TransUnion Score) and internal bank variables if the customer already has a relationship with the bank. In the absence of a specialized application score, the generic bureau score can also be used to grade applications.
Operationally, the scores can be used to give Yes/No decisions on customer applications, though in some scenarios applications scoring near the margin can be referred, or a decision taken on a partial exposure.
Most banks and financial institutions will calibrate their scores using extensive analysis to identify the odds or bad rate at each score band. A business-specific bad definition can be used here, e.g. 2 or 3 missed payments in the next 12 months (i.e. the loan going bad within a fixed period after sanction). This calibration can be done by retrospective analysis of past applications and their performance post sanction. (The assumption being that the patterns of the past will propagate into the future without too much variance, macro or otherwise.) Based on the retrospective analysis, a score cut-off is identified which allows the bank to target a specific bad rate. The score cut-off also forces a rejection rate on incoming applications.
To illustrate the impact of score cut-offs on bad rates, I am going to assume the score has been calibrated to grade incoming applications on a normal distribution with a mean score of 600 and a standard deviation of 50 points. Additionally, the score has been scaled with an anchor at 600 and a PDO (points to double the odds) of 25 points. (There is no fixed rule that a scorecard needs to be centred at the mean/median point; it is done here for illustration only.) Odds at the centre are calibrated at 70 to 1, i.e. roughly 1.4% (1/70) of customers with a score of 600 will go delinquent on their loan.
The graphic and table below give the score distribution of 10,000 applicants and the bad rate by score band.
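The anchor-and-PDO scaling can be sketched in a few lines of Python. The function names are made up for illustration, and the bad rate uses the 1/odds shortcut that this post uses (the exact conversion would be 1/(1 + odds)).

```python
def odds_at(score, anchor=600.0, anchor_odds=70.0, pdo=25.0):
    """Good:bad odds at a score; the odds double every `pdo` points."""
    return anchor_odds * 2.0 ** ((score - anchor) / pdo)


def bad_rate_at(score, anchor=600.0, anchor_odds=70.0, pdo=25.0):
    """Approximate bad rate as 1/odds (shortcut; assumption for illustration)."""
    return min(1.0, 1.0 / odds_at(score, anchor, anchor_odds, pdo))


for s in (550, 575, 600, 625, 650):
    print(s, f"odds={odds_at(s):.1f}", f"bad rate={bad_rate_at(s):.2%}")
```

Every 25 points above the anchor halves the bad rate, and every 25 points below doubles it, which is why the low bands carry most of the risk.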

PIC1

The table below gives the bad rate by score cut-off for the same population –>

Score Cutoff % Bad Rate
No cutoff 2.6%
450+ 2.5%
475+ 2.3%
500+ 1.9%
525+ 1.5%
550+ 1.1%
575+ 0.7%
600+ 0.4%
625+ 0.2%
650+ 0.1%
675+ 0.1%
700+ 0.0%
725+ 0.0%
750+ 0.0%

The table essentially tells us that out of the 10,000-odd customers, the expected bad rate if the bank approves everybody is 2.6%, i.e. with no cutoff we get an approval rate of 100% and a bad rate of 2.6%.

As is evident, there is a trade-off between approval rate and expected bad rate. To reach our target bad rate of 1.5%, the table identifies 525 as a potential score cut-off.
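A rough way to rebuild a cut-off table like the one above is to weight each 25-point band's bad rate by the share of applicants falling in it. This is a sketch under stated assumptions, not the author's workbook: the band bad rate is taken as 1/odds at the band's upper edge and capped at 99% (which matches the interval table later in this post), and the population is treated as a continuous normal curve rather than 10,000 discrete applicants, so figures can differ from the table in the first decimal.

```python
from statistics import NormalDist


def band_rate(upper_edge, anchor=600, anchor_odds=70, pdo=25, cap=0.99):
    """Bad rate of a band, taken as 1/odds at its upper edge (assumption)."""
    odds = anchor_odds * 2 ** ((upper_edge - anchor) / pdo)
    return min(cap, 1.0 / odds)


def cutoff_stats(cutoff=None, mean=600, sd=50, floor=450, top=750, width=25):
    """Return (approval rate, bad rate among approved).

    `cutoff` must sit on the 25-point grid; None means approve everybody.
    """
    dist = NormalDist(mean, sd)
    bads = 0.0
    if cutoff is None:
        approved = 1.0
        bads += dist.cdf(floor) * band_rate(floor)  # open "<450" band
        lo = floor
    else:
        approved = 1.0 - dist.cdf(cutoff)
        lo = cutoff
    while lo < top:
        hi = lo + width
        bads += (dist.cdf(hi) - dist.cdf(lo)) * band_rate(hi)
        lo = hi
    bads += (1.0 - dist.cdf(top)) * band_rate(top + width)  # open "750+" band
    return approved, bads / approved


for c in (None, 500, 525, 550, 575):
    approval, bad = cutoff_stats(c)
    print(c, f"approval={approval:.1%}", f"bad rate={bad:.1%}")
```

Raising the cut-off removes the riskiest bands first, so the bad rate falls quickly while the approval rate falls slowly at the low end of the scale.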

That is, banks can continue to approve applications in score bands which in isolation may be considered high risk but which, pooled with a larger number of customers in higher bands, still keep the overall portfolio bad rate on target. Why would a bank lend to customers in, say, the 550-599 band when it clearly has an elevated bad rate? There can be a multitude of reasons: capturing market share, approval rate pressures, sales targets, you name it. After all, sub-prime customers are the most profitable, as long as we can predict the bad rates and have a pool of good customers to balance them out. Sub-prime customers are theoretically charged a higher interest rate, which is supposed to take care of the extra risk the bank is taking.

So by enforcing a cut-off of 525 on incoming applications (instead of 575), we get an approval rate of approximately 93% (the area under a normal curve with known mean and standard deviation, to the right of the cut-off). That is, approximately 7% of incoming applications will be deemed high risk and rejected, and the approved population will have the target bad rate of 1.5%. Now with a 93% approval rate, both the risk and sales teams are happy! Or are they?
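The area-under-the-curve calculation is a one-liner with Python's standard library. The function name is hypothetical; the mean and standard deviation are the ones assumed in this example.

```python
from statistics import NormalDist


def approval_rate(cutoff, mean=600.0, sd=50.0):
    """Share of applicants scoring at or above the cutoff."""
    return 1.0 - NormalDist(mean, sd).cdf(cutoff)


for c in (525, 575):
    print(c, f"{approval_rate(c):.1%}")
```

A cutoff of 525 sits 1.5 standard deviations below the mean, which is where the roughly 93% comes from; the stricter 575 would approve only about 69%.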

Let the Bad times roll

 

One major weakness of using score cut-offs is the long list of assumptions inherent in the score building and deployment process. Even slight deviations from these assumptions can have a disproportionate impact on the risk exposure of the bank.

One of the most critical assumptions concerns the score distribution of incoming applications. Score cut-offs are calculated by studying past distributions (which need not be normal). In the example being discussed, based on the chart above, a cutoff of 525 gives an approval rate of 93% and a bad rate of 1.5%. If the distribution remains stable, the cutoff gives a predictable, controllable bad rate, and the bank can confidently lend to subprime customers as well, cornering market share and earning a much healthier interest spread while relying on its ‘Million Dollar Statistical Model’.

However, take the scenario of a worsening macro-economic situation (not unlike the one witnessed in 2008), or a new sourcing channel opening up. A distribution shift can happen for any number of reasons, and even slight deviations can have a large impact.

For example, let’s assume the distribution of incoming applications shifts left to a mean of 580 (from 600 previously). With the standard deviation and PDO remaining constant, the figure and table below give the impact on the bad rates at different cut-offs –>

PIC2

The above figure shows the new application distribution as compared to the original.

Assuming the score anchor and PDO remain unchanged, the new application distribution changes what each cutoff delivers. Previously the cutoff of 525 gave a bad rate of 1.5%. With the applicant mean shifted from 600 to 580, the same cutoff of 525 now gives a bad rate of 2.1% (an increase of more than 30% in the bad rate!), and that’s not all: the approval rate has fallen to 86%, i.e. a rejection rate of 14%.

 

Score Cutoff % Bad Rate (New Dist.)
No cutoff 4.7%
450+ 4.2%
475+ 3.6%
500+ 2.9%
525+ 2.1%
550+ 1.3%
575+ 0.8%
600+ 0.5%
625+ 0.3%
650+ 0.1%
675+ 0.1%
700+ 0.0%
725+ 0.0%
750+ 0.0%

The reason is that with the distribution shifted slightly to the left, the % of applicants in the higher score bands goes down. These customers were supposed to drive the portfolio bad rate down, but now the % of customers sourced in the not-so-good score bands shoots up (but hey, we didn’t compromise on the score cutoffs, did we?).

The sales team is now hopping mad, with rejection rates having more than doubled from before, and the risk team is under pressure: even after rejecting so many applications, the bad rates are shooting up!

A cursory look at the re-calculated bad rates on the updated distribution shows that the score cutoff needs to be revised from 525 to 550 to get back to the target bad rate. That brings the approval rate down to 72%!
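The change in mix can be made concrete by asking what share of the approved book sits in the marginal bands just above the cutoff. This is an illustrative sketch: the 525-574 grouping and the function name are my own choices, not something taken from the original analysis.

```python
from statistics import NormalDist


def marginal_share(mean, cutoff=525, split=575, sd=50):
    """Share of approved applicants scoring between `cutoff` and `split`."""
    dist = NormalDist(mean, sd)
    approved = 1.0 - dist.cdf(cutoff)
    marginal = dist.cdf(split) - dist.cdf(cutoff)
    return marginal / approved


old_mix = marginal_share(mean=600)
new_mix = marginal_share(mean=580)
print(f"525-574 share of approvals: old={old_mix:.1%}, new={new_mix:.1%}")
```

Under these assumptions the marginal bands grow from roughly a quarter of approvals to well over a third, even though the cutoff itself never moved.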
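The cutoff revision can be sketched as a simple search: take the bad rate that the 525 cutoff delivered on the old distribution, then walk up the grid on the new distribution until a cutoff does at least as well. The helper names are hypothetical, band bad rates are taken as 1/odds at the upper band edge (an assumption), and the continuous normal curve stands in for the 10,000 applicants.

```python
from statistics import NormalDist


def band_rate(upper_edge, anchor=600, anchor_odds=70, pdo=25, cap=0.99):
    """Bad rate of a band, taken as 1/odds at its upper edge (assumption)."""
    return min(cap, 1.0 / (anchor_odds * 2 ** ((upper_edge - anchor) / pdo)))


def approved_bad_rate(cutoff, mean, sd=50, top=750, width=25):
    """Expected bad rate among applicants scoring at or above `cutoff`."""
    dist = NormalDist(mean, sd)
    bads, lo = 0.0, cutoff
    while lo < top:
        hi = lo + width
        bads += (dist.cdf(hi) - dist.cdf(lo)) * band_rate(hi)
        lo = hi
    bads += (1.0 - dist.cdf(top)) * band_rate(top + width)  # open top band
    return bads / (1.0 - dist.cdf(cutoff))


def lowest_cutoff(target, mean, candidates=range(450, 775, 25)):
    """First cutoff on the grid whose approved bad rate is within target."""
    for c in candidates:
        if approved_bad_rate(c, mean) <= target:
            return c
    return None


target = approved_bad_rate(525, mean=600)   # what 525 used to deliver
new_cutoff = lowest_cutoff(target, mean=580)
new_approval = 1.0 - NormalDist(580, 50).cdf(new_cutoff)
print(new_cutoff, f"approval={new_approval:.1%}")
```

Because cutoffs move in 25-point steps here, the revised cutoff overshoots the target slightly, which is part of why the approval rate drops so sharply.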

This illustrates how a small shift in the incoming population would need the risk team to quickly revise the score cutoff to bring the approval rate down from 93% earlier to 72% now just to maintain the target bad rate.

What this essentially means is that the risk exposure of the bank has suddenly shot up. The subprime customers are no longer priced correctly, because the interest rate calculation did not take this scenario into consideration. Meanwhile the bank continues to source on the new distribution, confident that the score will continue to perform (which it does, just not as assumed).

It may not end here. When macro-economic parameters deteriorate, the worsening credit quality of incoming customers discussed above is one symptom; the other impact is on the credit scores themselves. Scores built or calibrated on ‘good times’ will almost certainly begin to wander when the ‘bad times’ come in. The score odds are not set in stone and do change based on how the industry is performing.

In ‘bad times’, deterioration of the odds at each score interval can be expected, as many banks found out in 2008 (FICO faced some heat for this). However, the basic purpose of the score still holds irrespective, which is to rank order customers from highest risk to lowest risk. Where macroeconomic parameters change individual behaviour, any score needs to be recalibrated to capture the new behaviour. The basic presumption of past behaviour propagating into the future is invalidated here, as behaviour is now changing rapidly.

For our example, where the score was anchored at 600 with odds of 70 to 1 and a PDO of 25, let’s assume the odds deteriorate to 60 to 1 with the PDO unchanged. The new interval bad rate table is as below (capped at 99% for the lowest interval) –>

 

Score Band % New Bad Rate Original Bad Rate
<450 99.0% 91.4%
450-474 53.3% 45.7%
475-499 26.7% 22.9%
500-524 13.3% 11.4%
525-549 6.7% 5.7%
550-574 3.3% 2.9%
575-599 1.7% 1.4%
600-624 0.8% 0.7%
625-649 0.4% 0.4%
650-674 0.2% 0.2%
675-699 0.1% 0.1%
700-724 0.1% 0.0%
725-749 0.0% 0.0%
750+ 0.0% 0.0%
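The interval table above can be approximated with a short function. One assumption is needed to match the printed figures: each band's bad rate is 1/odds evaluated at the band's upper edge (so the 575-599 band uses the odds at 600), capped at 99%. The function name is hypothetical.

```python
def interval_bad_rate(upper_edge, anchor_odds, anchor=600, pdo=25, cap=0.99):
    """Bad rate for the band ending at `upper_edge`, as 1/odds, capped."""
    odds = anchor_odds * 2 ** ((upper_edge - anchor) / pdo)
    return min(cap, 1.0 / odds)


for upper in range(450, 625, 25):
    label = "<450" if upper == 450 else f"{upper - 25}-{upper - 1}"
    new = interval_bad_rate(upper, anchor_odds=60)
    old = interval_bad_rate(upper, anchor_odds=70)
    print(f"{label:>8}  new={new:6.1%}  old={old:6.1%}")
```

Moving the anchor odds from 70 to 60 multiplies every uncapped band rate by 70/60, i.e. about a 17% relative increase across the board.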

The difference may not look very large, but let’s explore what happens when we combine these new odds with our updated application distribution to get the bad rates by cutoff.

Score Cutoff % Bad Rate (New Dist., New Odds) % Bad Rate (Old Dist., Old Odds)
No cutoff 5.4% 2.6%
450+ 4.9% 2.5%
475+ 4.2% 2.3%
500+ 3.3% 1.9%
525+ 2.4% 1.5%
550+ 1.6% 1.1%
575+ 1.0% 0.7%
600+ 0.5% 0.4%
625+ 0.3% 0.2%
650+ 0.2% 0.1%
675+ 0.1% 0.1%
700+ 0.0% 0.0%
725+ 0.0% 0.0%
750+ 0.0% 0.0%

Based on the previous cutoff of 525, after both an odds shift and a population shift, the actual bad rate faced by the bank is 2.4% instead of the expected 1.5%, i.e. the bad rate has suddenly spiked by 60%.

To compensate for this, the score cut-off actually needs to be revised significantly north of 550, with an approval rate even lower than the 72% required when the odds had not shifted.

Both factors together, a population shift along with an odds change, can deliver a double whammy to the risk team of any bank. There are practical problems a risk team will face in convincing the sales head that the approval rate needs to be cut to less than 70% from 93% earlier because of the small matter of the score mean shifting by 20 points (on a score scale which ranges from 400 to 800) and an odds shift from 70 to 1 to 60 to 1.
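As a cross-check on the combined effect, the two shifts can be simulated directly instead of integrating the normal curve. This is a sketch with assumptions: a seeded random draw of scores, band bad rates taken as 1/odds at the upper band edge, and expected (not randomly drawn) bads. The sample size and seed are arbitrary choices of mine.

```python
import math
import random


def band_rate(score, anchor_odds, anchor=600, pdo=25, width=25, cap=0.99):
    """Expected bad rate for the 25-point band that `score` falls in."""
    upper = (math.floor(score / width) + 1) * width
    odds = anchor_odds * 2 ** ((upper - anchor) / pdo)
    return min(cap, 1.0 / odds)


def simulate(mean, anchor_odds, cutoff=525, sd=50, n=200_000, seed=1):
    """Return (approval rate, expected bad rate among approved)."""
    rng = random.Random(seed)
    scores = [rng.gauss(mean, sd) for _ in range(n)]
    approved = [s for s in scores if s >= cutoff]
    bad = sum(band_rate(s, anchor_odds) for s in approved) / len(approved)
    return len(approved) / n, bad


before = simulate(mean=600, anchor_odds=70)
after = simulate(mean=580, anchor_odds=60)
print(f"before: approval={before[0]:.1%}, bad rate={before[1]:.1%}")
print(f"after:  approval={after[0]:.1%}, bad rate={after[1]:.1%}")
```

The simulated figures should land close to the table values; small differences are expected because the table appears to be built from a discrete population of 10,000.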

 

While the scores continue to do their job of ranking customers, reliance on pure cutoffs by banks can be suicidal and invalidate the scorecard needlessly. Like any other tool, a scorecard is only as good as the risk manager behind it. If a risk manager does not have the authority or freedom, as in this example, to cut approval rates from 93% down to 70%, cutoffs will simply not work. In fact quite the opposite: enforcing a score cutoff can be spectacularly counterproductive.

While the illustration discussed above is fairly simplistic and with assumptions which are unlikely to present themselves so neatly in the real world, the scenario discussed has unfortunately replicated itself in many banks and lenders throughout the world.

Raghuram Rajan (ex IMF chief economist and current RBI governor) describes in his book ‘Fault Lines’ a conference where he addressed a group of risk managers (a while before 2008 happened) about tail risk and its possible impact. The talk was not well received by the audience, and afterwards someone pulled him aside and told him that the risk managers who could understand and push what he was saying inside their banks had long since been fired for being Cassandras. The whole concept of tail risk is that while the probability of the event happening is low, when it does happen it wipes out all the profits accumulated over the so-called good times. The idea that the risk lenders take on by exposing themselves to subprime can be fully modelled and priced is inherently faulty; while using scores you may be able to generate handsome profits over years and years, tail risk is actually much higher than what our models estimate.

Data Analysis Toolpak – 2 Sample T Test

We check the significance of the difference in means between 2 samples using a T Test (in Excel).
Dataset can be downloaded at www.learnanalytics.in/blog/wp-content/uploads/2014/02/car_sales.xlsx
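For readers without Excel, the same kind of check can be sketched in Python. This is not the ToolPak's implementation, and I am assuming the unequal-variances (Welch) variant; the two samples below are placeholder numbers, not values from car_sales.xlsx.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and approximate degrees of freedom."""
    va = variance(sample_a) / len(sample_a)  # squared standard error of mean A
    vb = variance(sample_b) / len(sample_b)  # squared standard error of mean B
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (
        va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1)
    )
    return t, df


# Placeholder samples for illustration only.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 3.0, 4.0, 5.0, 6.0]
t_stat, dof = welch_t(group_a, group_b)
print(f"t={t_stat:.3f}, df={dof:.1f}")
```

The t statistic is then compared against the t distribution with the computed degrees of freedom, which is the p-value step that Excel reports for you.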

Data Analysis Tool Pak – Multiple Linear Regression (Excel 2010)

In this segment I demonstrate the use of the Data Analysis ToolPak to build a multiple linear regression model using only Excel 2010, along with basic interpretation of the results.

While the Data Analysis ToolPak does not replace a dedicated statistical tool, it does allow you to quickly run basic tests and analyses in Excel in a user-friendly way. However, its capabilities are limited and it is not recommended as a dedicated tool.

Who ‘does’ Analytics?

 

A weird question really, and not meant in the literal sense: who are the people who work in Analytics or are trying to learn about or enter the Analytics industry? Going through the Google Analytics reports of my YouTube channel, this question struck me as quite relevant, and the data threw up some very interesting insights.

A little background first. Back in early 2012, while in between jobs, I taught an introduction to analytics batch to a varied set of students (based in India and the US). The curriculum covered SAS Programming (taught by my wife) and Basic Statistics and predictive modelling (taught by me), delivered over Webex. The training sessions were recorded using Webex. A few months back, I uploaded the full SAS and analytics trainings on YouTube for general viewing.

Now, almost 100,000 views and 850 subscribers later, at about 300 views a day, the YouTube analytics section throws up some unique insights about the viewers logging in. Firstly, topics like PROC SQL and PROC LOGISTIC are highly esoteric and of no interest to the average YouTuber searching for the Miley Cyrus VMA twerk, i.e. the videos are unlikely to be either suggested by the recommendation engine or searched for except by someone who is actually interested in these topics and is actively searching for or viewing such material. However, the internet being the great enabler it is, the channel has managed to garner some attention (in spite of the dubious audio quality) and generate some regular traffic (8-10K views a month).

So, then who is “doing analytics”?

Country Distribution

Country List

That India tops my list of incoming viewers is no surprise; after all, my networks (social and otherwise), which drive a significant volume of traffic, are primarily India based. However, after India the worldwide Anglosphere (UKUSA) community dominates and in fact “rank-orders” perfectly by relative population size, i.e. all English-speaking countries appear in descending order of population. After the Anglosphere, the list of non-English-speaking countries shows a pattern (France, Germany, Singapore and Brazil), driven by the size of their financial markets and populations. China and Japan do not figure anywhere on the list, even though they would be mature geographies in terms of analytics use and analytics professionals, primarily due to the language divide. (Hypothesized, my opinion, could be wrong.)

Being an analytics professional, I just can’t stop at simply publishing a bar chart of the geographic traffic to my channel; I feel the urge to draw an “insight” from it too. (How many of you can relate to your bosses screaming, “don’t give me numbers dammit, give me an INSIGHT”?) I hate this word, but yeah, what’s the insight? The question I want to ask is: “Which country has the highest concentration of analytics professionals?” The data source being only the traffic to my channel, I claim no reliability for the answer, but here goes –>

  1. Simply relying on the traffic is not enough – large countries have large pools and dominate, hence India and US on the top.
  2. I am going to ignore the bias of English vs Non English, since being a technical topic, majority of analytics professionals around the world will have a working mastery of the English language
  3. Dividing the traffic by the population of the countries will give me an index which will remove the bias of population size and give me an apple to apple comparison (or closest to it) with which I can directly compare and rank order countries based on their “Analytics Concentration”
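The index in step 3 is just views divided by population. A minimal sketch follows; the country labels and numbers are placeholders I made up to show the mechanics, NOT the channel's actual traffic, and the per-million scaling is an assumption.

```python
def concentration_index(views, population_millions):
    """Views per million inhabitants for each country."""
    return {c: views[c] / population_millions[c] for c in views}


# Placeholder figures for illustration only (not real channel data).
views = {"Country A": 4000, "Country B": 900, "Country C": 75}
population_millions = {"Country A": 1200.0, "Country B": 300.0, "Country C": 5.0}

index = concentration_index(views, population_millions)
ranking = sorted(index, key=index.get, reverse=True)
for country in ranking:
    print(country, round(index[country], 1))
```

Note how the smallest country in the placeholder data jumps to the top once raw views are normalised, which is the kind of re-ranking this method is designed to surface.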

And the ranking goes as –>
Country_Index

Surprisingly, the data shows that the tiny city state of Singapore accounts for the highest concentration of analytics! Punching way above its weight with a score of 15 and leader of the pack by far! (Methodology for calculating the index is very simple – views/population).

In the Anglosphere, Australia, the USA and the UK show almost the same analytics penetration after removing the effect of population (countries with very similar HDI, income levels and language). However, Canada surprisingly has a small but significant lead over the other 3 (11 vs. 9). Therefore amongst the Anglo countries, Canada ranks as the most “Analytics” country! Maybe I should also try and find a job there, eh!

India, not surprisingly, has fallen way behind the others after removing the huge population advantage, but still ranks higher than Germany and France. Brazil finishes last in this ranking as well, but again I am sure the language bias probably affects it more than others.

Gender Distribution

I ask all the analytics professionals to stand up from their cubicles, look around and calculate the rough ratio of males to females in the office. The chances are nigh on a certainty that the skew is towards males, but by how much? Let’s look at the data once more –>

Gender

Not surprisingly, the traffic shows a major skew towards males: 77% to 23%, i.e. more than 3 out of 4 analytics professionals (or professionals in the making) are male. That makes for a very boring office culture if we keep adding only more and more male geeks to the profession.

But what I really want to see is whether geography plays a role here. Is there variance in gender participation by country? Remember, India is driving almost 40% of the numbers here, and the less said about gender participation in India, the better.

Gender distribution by top countries as below –>

Country Male Female
India 81% 19%
USA 70% 30%
UK 85% 15%
Canada 74% 26%

YouTube does not give in-depth analysis beyond the top 4 countries for me currently, still – the data shows that the gender divide is global, and very surprisingly the worst in the United Kingdom (and not India)! The UK shows a massive 85-15 divide as compared to the global average of 77-23. US has the best ratio of 70-30 from the countries on the list. So if you are a female analytics professional – the best office climate would probably be in the US and the worst in the UK!

 

Age Distribution

Distribution by age for the top 4 countries –>

Country <18 Years 18-24 Years 25-34 Years 35-44 Years 45-54 Years 55-64 Years 65+ Years
India 0% 10% 51% 23% 10% 4% 1%
USA 1% 5% 24% 32% 26% 11% 2%
UK 0% 9% 34% 30% 19% 8% 0%
Canada 0% 4% 21% 40% 28% 7% 0%

 

The first thing which hits you when you look at this table is the massive youth skew in India as compared to the rest. By far, the average Indian analytics professional (or would-be professional) is much younger than her global counterparts.

A full 61% of Indian viewers are under 35, with a solid 51% sitting in the 25-34 year bracket. Compare that with the US, where 71% are 35 or older! This table probably shows more clearly than any other set of numbers the India story –> young Indian professionals working for their older American bosses out of offices in Bangalore, Gurgaon and Mumbai.

Amongst the other countries, the UK has 57% aged 35 or older and Canada has 75% (compared to India with a mere 38%). Canada and the US have similar age profiles, the UK is comparatively younger than these two, and of course Indians are the babies in the room.
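The under-35 and 35-plus figures quoted above are simple sums over the age table. A small sketch that recomputes them from the table's own percentages (rows may not add to exactly 100 because the published shares are rounded):

```python
AGE_BANDS = ["<18", "18-24", "25-34", "35-44", "45-54", "55-64", "65+"]

# Percentages copied from the age distribution table above.
SHARES = {
    "India": [0, 10, 51, 23, 10, 4, 1],
    "USA": [1, 5, 24, 32, 26, 11, 2],
    "UK": [0, 9, 34, 30, 19, 8, 0],
    "Canada": [0, 4, 21, 40, 28, 7, 0],
}


def split_at_35(shares):
    """Return (% under 35, % aged 35 or older)."""
    cut = AGE_BANDS.index("35-44")
    return sum(shares[:cut]), sum(shares[cut:])


for country, shares in SHARES.items():
    under, over = split_at_35(shares)
    print(f"{country}: under 35 = {under}%, 35+ = {over}%")
```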

Conclusions

 

So, what have we learned? Any analytics presentation is incomplete without a summary, therefore the top conclusions based on our data –>

 

  1. Singapore and Canada are the most analytics heavy countries in the world.
  2. Analytics industry globally is skewed towards male participation, UK has the worst male-female ratio and the US has the best
  3. Indians are the youngest analytics professionals globally whereas the Americans and Canadians are likely to be the oldest.

Please feel free to comment with your views on this article below or on the LinkedIn page, or mail me at info@learnanalytics.in.

Karan Sarao has over 5 years of experience in analytics, including training professionals on tools and techniques. He currently works with TransUnion India in Analytics Business Development, based out of Mumbai, India.

 

Data Analysis ToolPak – Karl Pearson Correlation Matrix

In this video segment, I talk about enabling the Data Analysis ToolPak add-in in Excel. This is a powerful and rarely explored feature of MS Excel which can do a lot. In this series of video demonstrations, I will be exploring these features.

The first of these is creating a correlation matrix in Excel. The file used can be downloaded here –> car_sales.

 

Advanced SAS Programming – Day 12 & 13 (Introduction to PROC SQL)

 

 

 

Day 12 – Introduction to PROC SQL

  • Group By and Having clauses in SQL
  • Using aggregate functions to summarize data in SQL
  • Using nested Where clause in SQL
  • Modifying datasets using Create Table Like, Create Table As, Alter Table and Update clauses

Day 13 – PROC SQL Continued

– Using SQL queries in SAS
– Queries to create a table and select values from a table
– Using SQL conditional statements (Where and Case When clauses)
– Order by clause in SQL to sort datasets

Regression – Linear and Logistic using SAS

Basic introduction to Multiple Linear and Logistic regression using SAS with real-life data sets.

Basic introduction to credit scoring using Logit modeling. Key concepts of binary prediction like lift, KS, ROC curve, Gains charts etc. are explained in this set of videos.

Basic Linear Regression – Part 1

Linear Regression – Part 2

Introduction to Logistic Regression – Part 1

Logistic Regression – Part 2

Logistic Regression – Part 3