Make credit scoring results transparent without compromising security


The most common problem with credit scoring today is that few people can quickly interpret the results a model produces. A complex algorithm examines dozens of dynamic factors, and hardly anyone can see at a glance what they are. This article describes an algorithm that explains, in natural language, the strengths and weaknesses of a business that affect the decision.

Why does a bank need this? To speed up and improve corporate communication and keep an effective dialogue with the customer. Banks have had to roll out hasty, after-the-fact solutions, costing them market share that was taken over by young and daring FinTechs. In the eyes of ordinary people, a rating system is a black box, and that undermines their trust. After being turned down, most customers call the help desk to ask what the bank didn’t like and why the loan terms are so tough.

If the bank discloses the factor space used for rating, fraudsters immediately gain a tool for manipulating the most significant indicators in the rating model. All of this suggests that scoring models need to learn to communicate with the customer directly while still meeting security requirements.

Applying the NLP Pipeline Schema for Scoring

The NLP pipeline is the scheme behind the most capable voice assistants, such as Siri or Alexa. The algorithm can be divided into several key steps.

Step 1

At the first stage, speech is recognized: sounds are translated into symbols, words, and phrases. This step is absent for written text. Among mathematical models, deep neural networks are the ones most often used here.
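As a minimal sketch of this stage, using the open-source SpeechRecognition package and a hypothetical recording called application_call.wav (both the library choice and the file name are our assumptions, not part of the pipeline itself):

```python
# A minimal speech-to-text sketch with the SpeechRecognition package.
# The file name and the recognizer backend are illustrative assumptions.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("application_call.wav") as source:
    audio = recognizer.record(source)  # read the whole file into memory

# Google's free web API is one of several backends; any ASR engine would do.
text = recognizer.recognize_google(audio)
print(text)
```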

Step 2

Then, through stemming and lemmatization, the text document is converted into a more convenient machine-readable form. At this point the system cuts off the suffixes and endings that make speech beautiful but carry no semantic load. As a result, the text becomes as close as possible to a machine-readable form.
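As an illustration, here is one common way to implement this step in Python with the NLTK library; the library choice and the sample loan-related words are our own assumptions:

```python
# Stemming and lemmatization with NLTK: stripping the suffixes and
# endings that carry no semantic load, as described above.
import nltk
from nltk.stem import SnowballStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # dictionary data for the lemmatizer

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

for word in ["repayments", "defaulted", "liabilities"]:
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word))
```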

It is believed that this step depends heavily on the grammatical complexity of a language. That is only partially true: modern NLP engines can handle even very complex languages and extract facts from texts written in them, despite their grammatical complexity. Analyzing a Hungarian or Icelandic text takes only a few milliseconds longer than the same analysis of an English text. The real obstacle is the shortage of libraries for analyzing texts in such languages.

Step 3

The next step transforms the text into tables, using vectorization algorithms such as bag-of-words or word2vec. At this point the text is transferred to the database, and only the semantic constructs remain, not the full grammatical structure. An ontological analysis of the text is performed, transforming it into a set of formal constructs such as objects and subjects with their properties, methods, and modifying characteristics.
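A minimal bag-of-words sketch with scikit-learn shows how text becomes a table; the sample sentences are invented for illustration:

```python
# Turning text into a table: bag-of-words with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "revenue grew while debt remained stable",
    "debt grew faster than revenue",
]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the columns of the table
print(matrix.toarray())                    # one row of counts per document
```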

Step 4

Finally, the context and contextual meaning of the facts stated in the text are determined. This is an interesting step, because it relies on how context-dependent the language and the particular text are: legal texts and other formal genres are much easier to analyze than works of fiction. At this point the text is finally transformed into a table, which is then fed into the scoring model.
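A hedged sketch of that hand-off: the extracted facts become one feature row for the scoring model. Every field name below is hypothetical, not taken from any real model:

```python
# Extracted facts assembled into a single feature row for scoring.
# All field names and values are invented for illustration.
import pandas as pd

extracted_facts = {
    "annual_revenue": 1_200_000,
    "debt_to_equity": 0.8,
    "years_in_business": 6,
    "late_payments_12m": 1,
}

features = pd.DataFrame([extracted_facts])
print(features)
# This row would now be passed to a trained scoring model,
# e.g. model.predict_proba(features) for a hypothetical classifier.
```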

The scoring model then processes the input data, and is tested, trained, and retrained. The important part is that once the scores are produced and a decision is based on them, the most interesting phase begins: all of the steps described above repeat in reverse order (a sketch of this reverse pass follows the list):

  1. Depending on the context, the appropriate dictionary is selected.

  2. Cases, gender, and declension are applied; a sentence with the correct grammatical structure is composed.

  3. If necessary, natural speech is synthesized, interpreting the result obtained by the machine learning methods.
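Here is a template-based sketch of that reverse pass: the model's factor contributions are matched against a context-appropriate dictionary and written out as a grammatical sentence. The factor names, weights, and phrasing are all illustrative assumptions:

```python
# Reverse pipeline sketch: factor contributions -> natural-language
# explanation. Factors, weights, and wording are invented.
CONTRIBUTIONS = {
    "debt_to_equity": -0.9,      # hurt the score
    "years_in_business": +0.4,   # helped the score
}

PHRASES = {  # the "dictionary" chosen for a lending-decision context
    "debt_to_equity": "the ratio of debt to equity",
    "years_in_business": "the company's operating history",
}

def explain(contributions, phrases):
    parts = []
    # Most damaging factors first, so the customer sees them first.
    for factor, weight in sorted(contributions.items(), key=lambda kv: kv[1]):
        verb = "weakened" if weight < 0 else "strengthened"
        parts.append(f"{phrases[factor]} {verb} the application")
    return "In this decision, " + "; ".join(parts) + "."

print(explain(CONTRIBUTIONS, PHRASES))
```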

Thus, the algorithm described above automatically explains, in natural language, which weaknesses or strengths of a company or person influenced a particular decision. It is far better to learn the main reasons behind a rejection than to receive the rejection alone. The explanation can also reveal an error in the customer's data, which can then be corrected quickly, increasing customer loyalty and sales.

Additionally, the use of this technology means that employees don’t have to try to explain how the scoring model works and why it works well.

Fraud protection

The question remains open: how do we eliminate the risk of revealing the factor space, and handle the difficulty of explaining the dependencies between factors?

Here, the great flexibility of modern rating systems comes to the rescue. Real-time learning technologies make it easy to change the role of the factors that influence the final decision, which makes hacking the system pointless: by the time fraudsters have built a business or borrower profile that meets the criteria they learned were important, the external environment, and the scoring models that describe it, will have changed, and all their effort will have been in vain.
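A small sketch of why this works, assuming an online linear model retrained on streaming data (the data generator and the model choice are ours, not the article's): the weight of each factor drifts from day to day, so a factor space probed yesterday is already stale today:

```python
# Online learning sketch: as the data stream drifts, the model's
# factor weights shift, invalidating any previously probed factor space.
# Data generation and model choice are illustrative assumptions.
import numpy as np
from sklearn.linear_model import SGDClassifier  # requires sklearn >= 1.1

rng = np.random.default_rng(0)
model = SGDClassifier(loss="log_loss")

for day in range(3):
    X = rng.normal(size=(500, 4))       # 4 hypothetical scoring factors
    drift = 1.0 - 0.4 * day             # factor 0 gradually loses importance
    y = (drift * X[:, 0] + (1 - drift) * X[:, 1] > 0).astype(int)
    model.partial_fit(X, y, classes=[0, 1])
    print(f"day {day}: factor weights = {model.coef_.round(2)}")
```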

Nonlinear dependencies, and the way a factor's role changes depending on the other factors around it, are harder to explain. So far, a generated text can only state that such relationships exist, not interpret them in natural language. But the technology keeps improving, and anyone who wants to offer customers effective solutions should follow its evolution closely.
