The power of data: How PayPal leverages machine learning to tackle fraud

What kind of data is needed to detect and prevent fraud, and what are the benefits of machine learning?

Data is the currency of our modern, technology-powered world. It flows from customers to businesses and back again, enriching us all. But in the wrong hands, it can also line the pockets of fraudsters. This is why we need machine learning. Using high-quality data, we can train these complex, intelligent models to help spot suspicious online behavior that human eyes may miss.

Our access to data on more than 350 million consumers and merchants in over 200 markets is therefore a large part of our success in helping merchants detect and prevent fraud. But what kind of data do we need, and how do we use it?

Why machine learning?

Fraud prevention is not only about machine learning. Rules-based systems can also be helpful, by detecting scenarios that occur frequently and enabling us to set up simple “if X then Y” commands. For example: “decline transactions if they come from a specific country.” However, rules deal in absolutes, and can’t handle the complexity of scenarios that characterize today’s user behavior, and the fraud patterns that seek to imitate it.
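To make the "if X then Y" idea concrete, here is a minimal sketch of a single absolute rule. The `Transaction` fields, the `BLOCKED_COUNTRIES` set, and the `rule_decision` function are all hypothetical names chosen for illustration; they do not describe any actual PayPal system.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    amount: float
    country: str  # two-letter code of the originating country


# Hypothetical blocklist, for illustration only ("XX" is a placeholder code).
BLOCKED_COUNTRIES = {"XX"}


def rule_decision(txn: Transaction) -> str:
    """Apply one absolute rule: decline if the country is on the blocklist."""
    if txn.country in BLOCKED_COUNTRIES:
        return "decline"
    return "approve"


small_blocked = rule_decision(Transaction(amount=25.0, country="XX"))
large_allowed = rule_decision(Transaction(amount=9999.0, country="US"))
```

The rule declines every order from the listed country, however small, and approves everything else, however unusual. That is the "absolutes" limitation in practice.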

That’s where machine learning comes in. This subset of artificial intelligence enables us to create algorithms that process huge datasets with multiple variables and find correlations at high speed. Trained on thousands of good and bad transactions, these models can learn to help identify future bad buying behavior independently. This is faster than setting up new rules, the models are easy to retrain with the latest data, and the process involves less manual work, reducing operational costs. Ultimately, it’s a more adaptive, flexible, and effective approach to fraud prevention.
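As a rough illustration of training on labeled good and bad transactions, the sketch below fits a tiny logistic regression with plain gradient descent. The technique, the two features, and the eight toy rows are assumptions made for this example; a production model would use far more data and features, and nothing here reflects the models PayPal actually runs.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(features, labels, epochs=2000, lr=0.5):
    """Fit logistic regression weights with batch gradient descent."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            # Prediction error for this example (predicted probability minus label).
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def fraud_probability(weights, bias, x) -> float:
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Toy rows: [order amount scaled to 0..1, shipping/billing mismatch flag].
X = [[0.1, 0], [0.2, 0], [0.15, 0], [0.3, 0],
     [0.9, 1], [0.8, 1], [0.95, 1], [0.7, 1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = good transaction, 1 = bad transaction

weights, bias = train(X, y)
risky = fraud_probability(weights, bias, [0.85, 1])
benign = fraud_probability(weights, bias, [0.1, 0])
```

Retraining with newer data is just another call to `train`, which is part of why a learned model can be updated more quickly than a hand-maintained rule set.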

What data do we need?

However, exactly how effective these tools are depends to a great extent on the quality and type of data fed into machine learning models. Let’s take a look at some of the main types of fraud and the kind of data needed.

Signup fraud is among the fastest-growing types of fraud. It occurs when scammers use stolen identities or create synthetic ones to open a new financial account, typically a bank or credit card account. They will usually try to max out credit limits before disappearing, leaving the victim to pick up the pieces. Signup fraud is difficult to spot because account creation is the first time a company sees that customer, so there’s no historical data for comparison, and interaction with them is minimal.

PayPal feeds its algorithms device data, third-party information like email address checks and identity scores, session analysis, and data collected during enrolment to help spot scammers. We build hundreds of signals on top of the data to detect, for example, differences between a user’s real and stated location, or difficulty in typing familiar information like first and second names.
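The two example signals mentioned above could be derived along the following lines. This is only a sketch under stated assumptions: the `SignupSession` fields, the idea of measuring typing difficulty as a ratio of corrections to keystrokes, and the 0.3 threshold are all hypothetical, and the country inferred from the IP address is assumed to be supplied by some other service.

```python
from dataclasses import dataclass


@dataclass
class SignupSession:
    stated_country: str    # country the user entered on the form
    ip_country: str        # country inferred from the IP address (assumed given)
    name_keystrokes: int   # total key presses in the name fields
    name_corrections: int  # backspace/delete presses in the name fields


def signup_signals(session: SignupSession,
                   correction_ratio_limit: float = 0.3) -> dict:
    """Derive two boolean signals from raw signup session data."""
    if session.name_keystrokes:
        ratio = session.name_corrections / session.name_keystrokes
    else:
        ratio = 0.0
    return {
        "location_mismatch": session.stated_country != session.ip_country,
        "hesitant_name_typing": ratio > correction_ratio_limit,
    }


ordinary = signup_signals(SignupSession("GB", "GB", 12, 1))
suspicious = signup_signals(SignupSession("GB", "BR", 20, 9))
```

Signals like these would typically be passed to a model as inputs rather than used as decisions on their own.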

Login fraud is another fast-growing fraud type and involves the hijacking of existing customer accounts, usually through stolen log-ins. Depending on the type of account, a fraudster could siphon off funds, carry out fraudulent transactions, or steal and sell personal information. The impact could include increased chargebacks, which can in turn lead to customer churn and brand damage.

PayPal uses machine learning to help assess in real time whether an individual is a legitimate customer. Device, email, IP, phone, transaction, and behavioral user information are key here, as are checks for proxies used to hide an individual’s true location. The signals we build check for high-risk activity like a large number of log-in attempts, or account changes unlike any seen in the past.
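One of the signals named above, a large number of log-in attempts, can be sketched as a sliding-window counter. The `LoginAttemptMonitor` class, its thresholds, and the use of plain numeric timestamps are hypothetical choices for this example, not a description of PayPal's implementation.

```python
from collections import deque


class LoginAttemptMonitor:
    """Flag accounts with too many log-in attempts inside a time window."""

    def __init__(self, max_attempts: int = 5, window_seconds: float = 60.0):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self._attempts = {}  # account id -> deque of attempt timestamps

    def record(self, account_id: str, timestamp: float) -> bool:
        """Record an attempt; return True if the account now looks high-risk."""
        attempts = self._attempts.setdefault(account_id, deque())
        attempts.append(timestamp)
        # Drop attempts that have fallen out of the window.
        while attempts and timestamp - attempts[0] > self.window_seconds:
            attempts.popleft()
        return len(attempts) > self.max_attempts


monitor = LoginAttemptMonitor(max_attempts=3, window_seconds=60.0)
flags = [monitor.record("account-1", t) for t in (0, 10, 20, 30)]
later = monitor.record("account-1", 200)
```

Here the fourth attempt within a minute raises the flag, while a single attempt much later does not, because the earlier attempts have aged out of the window.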

Payment fraud involves the use of card details without the actual cardholder’s knowledge. Previous transaction data can be used for comparison to check for fraud here, alongside device, email, IP, phone, user, and address information. Typical indicators of malicious activity include a shipping-billing address mismatch, unusually large orders, and the use of multiple different cards for purchases shipped to the same address. A combination of these signals can indicate a higher chance of fraud.
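The three indicators listed above, and the idea that a combination matters more than any one of them, can be sketched as follows. The `Order` fields, the thresholds, and the simple count of triggered signals are assumptions made for illustration; a real system would more likely feed such signals into a trained model than add them up.

```python
from dataclasses import dataclass


@dataclass
class Order:
    amount: float
    card_id: str
    shipping_address: str
    billing_address: str


def payment_signals(order: Order, history: list,
                    large_order_threshold: float = 1000.0,
                    max_cards_per_address: int = 2) -> dict:
    """Compare an order with previous orders and return boolean signals."""
    # Distinct cards already used for orders shipped to this address.
    cards_at_address = {o.card_id for o in history
                        if o.shipping_address == order.shipping_address}
    cards_at_address.add(order.card_id)
    return {
        "address_mismatch": order.shipping_address != order.billing_address,
        "large_order": order.amount > large_order_threshold,
        "many_cards_one_address": len(cards_at_address) > max_cards_per_address,
    }


def signal_count(signals: dict) -> int:
    """Number of signals that fired; more signals suggest higher risk."""
    return sum(signals.values())


history = [
    Order(50.0, "card-1", "1 Main St", "1 Main St"),
    Order(60.0, "card-2", "1 Main St", "9 Oak Ave"),
]
risky_order = Order(1500.0, "card-3", "1 Main St", "5 Elm Rd")
routine_order = Order(20.0, "card-1", "2 Pine Rd", "2 Pine Rd")

risky_count = signal_count(payment_signals(risky_order, history))
routine_count = signal_count(payment_signals(routine_order, history))
```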

The power of PayPal

There are no hard and fast rules for how much data machine learning models need to function effectively. But a good rule of thumb is “as much as possible.” The further back in time the data goes, the more useful it can be for this kind of modeling, and the more complex the algorithm, the more data it needs. Understanding good and bad events more accurately during training also requires a comprehensive dataset, including edge cases.

Fortunately, this is where the PayPal 2-Sided Network really comes into its own. The scale of insight we have into transaction data—from both a customer and merchant perspective—is the fuel for our machine learning algorithms. This data ultimately helps to reduce consumer friction, lower fraud losses, and enhance customer trust. More importantly, it’s a system which will continue to adapt even as the fraudsters constantly evolve their tools and tactics, to help keep us one step ahead.
