Big Data, Machine Learning, and Bias - The Consumer Protection Implications of Emerging Technologies

As Seen in the July / August 2019 Issue of RMA Journal

Compliance with fair lending regulations is a matter of growing concern as the financial services industry, either directly or with the assistance of third parties, increasingly uses big data and machine learning to distribute credit. While these analytic methods can help lenders identify additional creditworthy consumers and more accurately price credit risk, they also have the potential to amplify bias.

Risk management professionals should understand the potential for bias in model development and usage. Compliance professionals, in particular, should understand the sources of bias in machine learning, ask the right questions about the data and models, and evaluate whether the necessary controls are in place. And they should work with their risk management and model validation colleagues to detect and mitigate fair lending risks in emerging technologies.

Technology Continues to Evolve
Since the introduction of credit scores, lenders have used technology to facilitate access to credit and estimate credit risk. As computing power and available data have increased, so has the use of models, algorithms, and other forms of technology-assisted credit processes. Today, many fintechs and bank lenders are leveraging big data and machine learning to model risks and make credit determinations. Properly identifying and managing the fair lending risks of credit models and large data sources is a complicated process made even more complex by the use of algorithms and the underlying data.

Regulatory Considerations
The Equal Credit Opportunity Act (ECOA), the Fair Housing Act (FHA), and the prohibition on unfair, deceptive, or abusive acts or practices (UDAAP) are three of the most important legal requirements that apply to the lending industry. The fair lending laws, ECOA and FHA, prohibit both the disparate treatment of applicants and any disparate impact based on membership in certain demographic groups or the exercise of consumer rights.

Treatment of these groups is reviewed when determining whether marketing, underwriting, pricing, or servicing decisions have resulted in a disparate impact on a prohibited basis, or whether there is evidence of disparate treatment during the lending process or its outcome. The Dodd-Frank Act expanded the existing prohibition against unfair or deceptive acts or practices to also include abusive practices. “UDAAPs can cause significant financial injury to consumers, erode consumer confidence, and undermine fair competition in the financial marketplace,” the Consumer Financial Protection Bureau noted in CFPB Bulletin 2013-07.

Sources and Types of Bias
Often when people speak of bias, they are referring to prejudice or discrimination. In the lending industry, the word is usually associated with illegal discrimination under federal, state, and local laws and regulations. That type of bias is clearly important in a discussion of consumer protection. But in the context of algorithms, models, machine learning, and even more advanced artificial intelligence, these additional types of bias have the potential to harm borrowers:

Sample bias. Sometimes known as selection bias, sample bias can arise from mistakes in the choice of data or in data aggregation. Development data that does not represent the population on which the algorithm will be used can introduce this type of bias. For example, models built from a data set limited to consumers with full-time employment may not accurately predict outcomes for part-time or gig economy workers.
Association bias. If the training data or algorithms used in the initial machine learning model development reflect or amplify historical societal bias, then association or prejudicial bias may be incorporated into the resulting models. If a development data set only includes employment data for consumers in careers that are typical for their sex, then prejudicial gender bias may be incorporated into the resulting algorithms.
Confirmation bias. Giving greater weight to data, outcomes, or interpretations that support initial hypotheses is known as confirmation bias. The effect of confirmation bias can be seen in social media, where algorithms that analyze a user’s data feed for the purpose of offering additional content result in information flows that mirror existing information, rather than offering counterpoints or opposing views. The same phenomenon occurs in online shopping, where purchase histories are often used to offer additional products.
Automation bias. In machine learning, automation bias can arise when differences exist between the actual goals and the machine’s understanding of those goals and constraints. If the AI fails to consider societal or public policy factors, automation bias can result.
Interaction bias. Skewed learning over time, whether as a result of overrides or user interactions, can lead to interaction bias. A frequently cited example of interaction bias is the chatbot that was shut down because it learned racial slurs in only a day of interacting with humans.

Big Data and Data Bias
Big data is defined as extremely large data sets that can be analyzed to reveal patterns, trends, and associations, especially those relating to human behavior and interactions. Big data may include high velocity, high variability, unstructured and multi-structured data that businesses collect from a variety of sources.

Alternative data is any data from nontraditional data sources. It may include newer data points, such as device or operating system information and social media postings, as well as nontraditional credit bureau data, including utility and rent payment histories. When combined with powerful analytic tools, big data and alternative data can be used to identify previously unrecognized patterns that can be used for decision making. In lending, fintechs are using this data to make decisions about extending credit.

In using any nontraditional data source, it is important to first consider why certain data elements are being selected. There should be a logical relationship between the data points being analyzed and a consumer’s risk, not just a statistically significant correlation. This clear relationship allows for consistency with the historical interpretation of Regulation B’s requirements for credit scoring systems—as well as more recent guidance, such as the New York State Department of Financial Services’ Insurance Circular Letter No. 1 (2019).

Under these rules, there are rarely issues with using traditional factors, such as detailed financial information. However, factors not previously included in lending decisions can raise issues with regulators. Careful analysis is required to ensure these newer variables are not causing disparate impacts for consumers.

More on Machine Learning

In machine learning, computers have the ability to learn without explicit programming. For example, to develop a credit model, the computer would mine large data sets for variables that may be predictive of credit performance and then develop models from that data.

As with traditional model development, available data may be divided into two parts for the initial model build. “Training data” and “development data” are interchangeable terms for the data used in the initial model construction. “Hold-out data” or “validation data” refers to the data that is not used in model creation, but is instead used for testing the model’s efficacy.
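
The split described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name split_train_holdout, the 30 percent hold-out share, the fixed random seed, and the toy application records are all assumptions made for the example.

```python
import random


def split_train_holdout(records, holdout_fraction=0.3, seed=42):
    """Shuffle records and split them into training and hold-out lists.

    holdout_fraction and seed are illustrative defaults, not recommendations.
    """
    if not 0.0 < holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be between 0 and 1")
    shuffled = list(records)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * holdout_fraction))
    holdout = shuffled[:n_holdout]    # never used to fit the model
    training = shuffled[n_holdout:]   # used for the initial model build
    return training, holdout


# Hypothetical application records, used only to exercise the function.
applications = [{"id": i, "defaulted": i % 7 == 0} for i in range(100)]
training, holdout = split_train_holdout(applications)
print(len(training), len(holdout))  # prints: 70 30
```

Fixing the seed makes the split reproducible, which can help when a validator needs to recreate the same training and hold-out sets later.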

Given the large number of variables available for use in classifying the model outcome for a given observation and the potential correlation between possible predictive variables, it may be necessary to use dimensionality reduction. Here, the machine essentially seeks a smaller subset of the original data for use in modeling by removing some of the variables with high levels of mutual information content. It is important to note that, although dimensionality reduction helps remove redundant variables and reduce required data storage and computation time, it may also lead to data loss and inappropriate variable reduction.
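
One simple form of variable reduction can be sketched as follows. The example uses pairwise Pearson correlation as a stand-in for the broader idea of shared information content; the 0.95 threshold, the function names, and the three toy variables are assumptions for illustration only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def prune_correlated(columns, threshold=0.95):
    """Keep a variable only if it is not highly correlated with one already kept.

    threshold is an illustrative cut-off, not a recommended value.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical variables: monthly income is nearly a rescaled copy of annual income.
columns = {
    "annual_income": [40, 55, 70, 85, 100],
    "monthly_income": [3.3, 4.6, 5.8, 7.1, 8.3],
    "months_at_address": [12, 60, 6, 48, 24],
}
print(prune_correlated(columns))  # prints: ['annual_income', 'months_at_address']
```

Which of two redundant variables survives depends on the order in which they are examined. That order dependence is one way an automated reduction step can discard a variable a reviewer would have preferred to keep.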

Once in production, a machine learning environment typically develops ongoing model enhancements. It does this by incorporating performance information on originated credits into the predictive models for model tuning and ongoing redevelopment. The ongoing learning process may result in the selection of different predictive variables or revisions to variable weights and parameter estimates.

Step #1
Evaluate the variables for conceptual soundness. Are they logically related to creditworthiness? Then consider whether including the variables would result in disparate impact from a fair lending perspective. If so, are there alternate model specifications with similar effectiveness in managing credit risk but resulting in less disparate impact?
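
As a rough illustration of comparing alternate specifications, the sketch below computes the ratio of approval rates between a comparison group and a reference group for two hypothetical models. The function names, the decision data, and the idea of using this single ratio as a screening metric are assumptions for the example. It is not a legal test, and a real analysis would also weigh each specification's credit risk performance.

```python
def approval_rate(decisions):
    """Share of approvals in a list of 1 (approve) / 0 (deny) decisions."""
    return sum(decisions) / len(decisions)


def adverse_impact_ratio(comparison_decisions, reference_decisions):
    """Approval rate of the comparison group divided by that of the reference group."""
    reference_rate = approval_rate(reference_decisions)
    if reference_rate == 0:
        raise ValueError("reference group has no approvals")
    return approval_rate(comparison_decisions) / reference_rate


# Hypothetical outcomes from two candidate model specifications.
spec_a = {"reference": [1] * 80 + [0] * 20, "comparison": [1] * 56 + [0] * 44}
spec_b = {"reference": [1] * 78 + [0] * 22, "comparison": [1] * 70 + [0] * 30}

ratio_a = adverse_impact_ratio(spec_a["comparison"], spec_a["reference"])
ratio_b = adverse_impact_ratio(spec_b["comparison"], spec_b["reference"])
print(round(ratio_a, 3), round(ratio_b, 3))  # prints: 0.7 0.897
```

In this made-up example, specification B approves slightly fewer reference-group applicants but produces a ratio much closer to 1.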

For example, some private companies that originate or refinance student loans use the applicant’s college major or alma mater as a factor in determining default risk or pricing. Similarly, several consumer reporting agencies and credit-score developers are evaluating the use of nontraditional data points, such as mobile-device location data, social media posts, social network connections, and club membership, in assessing credit risk. It is unclear whether the relationship between these data points and creditworthiness has been developed and documented. In addition, even if such a relationship exists, there is a clear potential for disparate impact on a prohibited basis.

Step #2
Lenders should take care that the data sets used in algorithm development are appropriate for, and consistent with, the attributes of the population to be evaluated. If the development and input data do not meet these requirements, the resulting models may be biased.

To avoid building bias into models, developers should evaluate the comprehensiveness, accuracy, and bias of development data. For internally developed models, this can be achieved through careful data curation. For third-party models, lenders may initially have to rely on developmental evidence and validation information provided by the vendor. Over time, the lender’s internal model risk management team should assess whether the vended model effectively rank-orders risk and is predictive of default on the population being evaluated.

For example, if the development data includes only a limited number of observations from certain demographic groups, the resulting algorithms may not perform well when predicting those groups’ creditworthiness, even though the use of big data will provide more data elements for each applicant. Developers should incorporate testing of the data set’s completeness in their developmental processes.

One method is to test the demographic and geographic representativeness of the developmental data set. Another would be to test how well a model built on a given data set performs when applied to a different data set. The latter approach is conceptually similar to the established model development practice of dividing available data into training (development) and testing (hold-out) portions.
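
The first method, a representativeness check, might look like the sketch below. It compares each segment's share of the development data with its share of the population the model will score. The segment labels, the five-percentage-point tolerance, and the function names are assumptions for illustration.

```python
from collections import Counter


def segment_shares(labels):
    """Share of records falling in each segment."""
    counts = Counter(labels)
    total = len(labels)
    return {segment: n / total for segment, n in counts.items()}


def representation_gaps(development, target, tolerance=0.05):
    """Segments whose development share differs from the target share by more than tolerance.

    A negative gap means the segment is under-represented in the development data.
    """
    dev = segment_shares(development)
    tgt = segment_shares(target)
    gaps = {}
    for segment in sorted(set(dev) | set(tgt)):
        diff = dev.get(segment, 0.0) - tgt.get(segment, 0.0)
        if abs(diff) > tolerance:
            gaps[segment] = round(diff, 3)
    return gaps


# Hypothetical geographic mix of the development data and the target population.
development = ["urban"] * 70 + ["suburban"] * 26 + ["rural"] * 4
target = ["urban"] * 50 + ["suburban"] * 30 + ["rural"] * 20
print(representation_gaps(development, target))  # prints: {'rural': -0.16, 'urban': 0.2}
```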

Step #3
When preparing data for use, model developers and data scientists should take care not to incorporate their own or societal biases into the data or its treatment in models. Consider, for example, the growing use of deposit account transaction data in credit risk modeling.

Suppose there are two consumers with similar incomes and similar spending amounts, but their type of spending differs. Consumer A spends her entertainment budget at the opera, while consumer B spends his entertainment dollars on other activities. All other things being equal, a modeling approach that identified opera as a good expenditure and other entertainment as a bad expenditure could result in denial of applicants who lack easy access to opera, even if the underlying consumer preference was identical. This group would include many rural, suburban, and lower-income consumers.

Similarly, a credit card company received criticism for account management practices that classified shopping at certain stores as higher risk than shopping at other stores.

Evaluation of credit bureau records and payment histories that discounts or ignores timely payment of rent, utility bills, or accounts at non-traditional lenders could also have a disparate impact on populations that have historically been underserved by bank lenders.

Model Development, Machine Learning, and Bias
Statistically, measurement bias refers to the tendency of an algorithm or model to consistently over- or underestimate the true value of a population parameter. For example, a model that consistently predicts substantially fewer defaults than actually occur in a given population is producing biased estimates. Even if a machine learning model eventually detected and corrected for its measurement bias, financial institutions could incur higher-than-expected losses during the learning period. Similarly, if the measurement bias is associated with a protected characteristic under ECOA or FHA, unwarranted denials or higher pricing during the learning period would harm consumers and create legal and compliance risks.
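
One way to look for this kind of bias is to compare the average predicted default probability with the observed default rate, overall and by segment. The sketch below assumes each record carries a segment label, a predicted probability, and an observed outcome; the segment names and figures are invented for the example.

```python
def calibration_gap(predicted_probs, outcomes):
    """Mean predicted default probability minus the observed default rate.

    A negative value means the model predicted fewer defaults than occurred.
    """
    mean_predicted = sum(predicted_probs) / len(predicted_probs)
    observed_rate = sum(outcomes) / len(outcomes)
    return mean_predicted - observed_rate


def calibration_gap_by_segment(rows):
    """rows are (segment, predicted_probability, defaulted) tuples."""
    by_segment = {}
    for segment, prob, defaulted in rows:
        probs, outcomes = by_segment.setdefault(segment, ([], []))
        probs.append(prob)
        outcomes.append(defaulted)
    return {
        segment: round(calibration_gap(probs, outcomes), 3)
        for segment, (probs, outcomes) in by_segment.items()
    }


# Hypothetical portfolio: both segments default at 10%, but the model
# predicts 5% for segment A and 12% for segment B.
rows = (
    [("A", 0.05, 0)] * 90 + [("A", 0.05, 1)] * 10
    + [("B", 0.12, 0)] * 90 + [("B", 0.12, 1)] * 10
)
print(calibration_gap_by_segment(rows))
```

Here the model underestimates defaults for segment A and overestimates them for segment B. If segment membership were associated with a protected characteristic, the overestimate could translate into unwarranted denials or higher pricing.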

Machine learning is the process by which complex computers create analytical models that are used to make decisions. The computer determines which data elements to use in its predictions (in the case of this discussion, determining whether to extend or deny credit). As the system ingests more data, the models update themselves, selecting which variables are important for the desired outcome. During this process, data or other biases can cause unexpected or undesirable results.

For example, suppose a business credit underwriting system incorporates data from a period of high defaults among oil production companies. If the data includes business addresses but not economic conditions or industry identifiers (such as NAICS codes), the machine may learn that businesses in oil-producing states like Texas, North Dakota, Alaska, California, and New Mexico are high risk, rather than learning that oil production companies are high risk under certain economic conditions.

Model Validation
To avoid excessive model risk, including fair lending risk, companies must ensure they are properly and regularly testing their algorithms, as well as ensuring and documenting the business justifications for using a data set or algorithm. Frequency of model revalidation should be determined by the risk associated with poor model performance.

The interagency supervisory guidance on model risk management notes that periodic model reviews should be conducted at least annually, but more frequently if warranted. In a machine learning environment, where model features can change frequently, significant changes in algorithms should be tested before being put into production. This can be achieved by independent validation by statisticians, either through an internal function or an external party. Appropriate validation practices help guard against bias.

Regulation B specifically requires credit scoring systems to be “empirically derived, demonstrably and statistically sound.” The official staff commentary to Regulation B notes, “The credit scoring system must be revalidated frequently enough to ensure that it continues to meet recognized professional statistical standards for statistical soundness. To ensure that predictive ability is being maintained, the creditor must periodically review the performance of the system.” In addition, the guidance on model risk management sets forth regulatory expectations for independent model validation.

Furthermore, it is important to monitor a model’s performance over time. Shifts in model performance or changes in the applicant population could indicate a model that needs attention. Risk managers should use benchmarking, back-testing, sensitivity analysis, override analysis, and population stability analysis to detect changes that might impair model performance.
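
Population stability analysis is often summarized with a population stability index (PSI), which compares the distribution of scores at development with the distribution among recent applicants. The sketch below is one common formulation; the score bands, the small floor used to avoid dividing by zero, and the sample scores are assumptions for illustration. Thresholds for acting on the result vary by institution and are not shown.

```python
from math import log


def bucket_shares(scores, edges):
    """Share of scores in each band defined by ascending cut points in edges."""
    counts = [0] * (len(edges) + 1)
    for score in scores:
        band = sum(1 for edge in edges if score >= edge)  # index of the band
        counts[band] += 1
    total = len(scores)
    return [count / total for count in counts]


def population_stability_index(expected_scores, actual_scores, edges, floor=1e-6):
    """Sum over bands of (actual - expected) * ln(actual / expected)."""
    expected = bucket_shares(expected_scores, edges)
    actual = bucket_shares(actual_scores, edges)
    psi = 0.0
    for e, a in zip(expected, actual):
        e = max(e, floor)  # floor empty bands so the logarithm is defined
        a = max(a, floor)
        psi += (a - e) * log(a / e)
    return psi


# Hypothetical score distributions at development and in recent applications.
development_scores = [600] * 20 + [650] * 30 + [700] * 30 + [750] * 20
recent_scores = [600] * 35 + [650] * 30 + [700] * 25 + [750] * 10
edges = [620, 680, 720]
psi = population_stability_index(development_scores, recent_scores, edges)
print(round(psi, 4))  # prints: 0.1624
```

A value of zero means the two distributions match across the chosen bands; larger values indicate a larger shift in the applicant population.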

The model risk guidance notes that model validation activities should continue on an ongoing basis to track known model limitations, identify new model limitations, and monitor model performance under a variety of economic and financial conditions. Over time, as business strategies, product mix, or customer bases change, model effectiveness may also change. Ongoing model maintenance and monitoring, along with periodic outcome analysis and revalidation, will determine whether the models are continuing to perform as expected.

Lastly, designing and implementing policies and procedures for the governance of models, throughout their development and use, are essential to maintaining the critical lines of defense within the business. Governance activities should be tailored to the business’s level of model use, control, and risk. Areas of importance include ensuring that security and change-control procedures are adequate and documented, that the roles and responsibilities of staff are clearly defined, and that an inventory of models is maintained.

Overrides
As with traditional lending processes, introducing discretion into any lending decision is always an area of interest for regulators. If there are situations where the decisions of automated processes can be re-decisioned, or overridden by a human, it is prudent to have sufficient documentation that explains in detail why the decision was made to override. Remember, overrides can also introduce interaction bias into machine learning systems.

10 Key Questions for Compliance Risk Management
To help manage these risks, compliance professionals should consider the following questions:

Why did you select this data point or data set?
Has the data set been tested for bias, incompleteness, and inclusion of inappropriate demographic characteristics prior to use?
How does the model treat missing values?
What is the connection between the model’s features, or predictive variables, and creditworthiness or risk? Is the model conceptually sound?
What constraints were placed on the machine’s ability to learn and change the underlying algorithms?
How will you map an adverse action based on machine learning to an adverse action reason that complies with Regulation B?
If model overrides are permitted, are they documented and monitored?
Is the model or algorithm empirically derived and demonstrably and statistically sound, as required by Regulation B?
Has the model been validated, and are changes validated prior to implementation?
Do model development, testing, validation, and usage include consumer considerations so that resulting decisions are fair and transparent?

Conclusion
Done right, big data and machine learning have the potential to benefit consumers by giving them greater access to credit. When using these innovations in credit processes, however, it is important to ensure that decisions are not affecting credit access, terms, or conditions in a prohibited way.

One approach to avoid illegal discrimination while using innovative data and modeling is to ensure that predictive variables are not proxies for membership in the demographic groups protected by ECOA and FHA.
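
A first-pass proxy screen could be sketched as follows. It measures how far apart a candidate variable's average is between members and non-members of a group, scaled by the variable's overall spread. It assumes a test data set in which group membership is known or estimated; the variable names, values, and function name are hypothetical. A large difference is a prompt for closer review, not proof that the variable is a proxy.

```python
from statistics import mean, pstdev


def standardized_mean_difference(values, in_group):
    """Difference in group means divided by the overall standard deviation."""
    group = [v for v, flag in zip(values, in_group) if flag]
    rest = [v for v, flag in zip(values, in_group) if not flag]
    spread = pstdev(values)
    if spread == 0:
        return 0.0  # a constant variable cannot separate the groups
    return (mean(group) - mean(rest)) / spread


# Hypothetical test data: the first four applicants belong to the group of interest.
in_group = [True, True, True, True, False, False, False, False]
miles_to_branch = [12, 15, 14, 13, 3, 4, 2, 5]
months_on_job = [24, 36, 12, 48, 24, 36, 12, 48]

print(round(standardized_mean_difference(miles_to_branch, in_group), 2))  # prints: 1.95
print(round(standardized_mean_difference(months_on_job, in_group), 2))    # prints: 0.0
```

In this invented data, distance to a branch separates the two groups sharply and would merit a closer look, while months on the job does not separate them at all.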

Data scientists, model developers, and compliance professionals should carefully monitor data sets, development processes, and results for evidence of unfair or discriminatory effects.