This forum has been a huge resource for me since March 2015, and in the spirit of the holidays I've decided to share a recent batch of work I've done on Lending Club's historical data. For the past week I've been working to crack open the employment title data to see whether it reliably predicts charge-off rates. My complete findings are in the attached PDF file.
The biggest challenge in this project was dealing with spelling errors and other issues that arise from a free-text field, which make it hard to group the data. For instance, using Excel, it is not easy to simply group all "Presidents" in one category. I had to deal with the people who misspelled a word, included extraneous spaces and special characters, or used compound descriptors like "President & CEO."
Methodology:
1. All loans issued between September 2007 (the earliest issue date) and June 2015 were included in my analysis - more than 643,000 loans. A "charge off" was defined as a loan status of "charged off" or "default," where the date of last payment occurred 0 - 12 months after the loan was issued. If the loan was charged off in month 18, for example, it was not counted as charged off for the purpose of this project. By aligning the data in this way, I was able to remove the effect of loan age for loans issued recently. Why not include all loans through November of 2015 (i.e., all loans with up to twelve months of payment data)? Because I wanted a buffer to account for delinquent loans that might roll into charge-off status. Why choose 12 months of payment data as the cut-off? A: I wanted the largest sample possible; B: fewer than 12 months wouldn't allow for enough seasoning; C: more than 12 months would whittle down the database too significantly; D: a person's employment status is more likely to change as a loan gets older.
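For anyone who wants to reproduce the charge-off definition outside Excel, here is a minimal Python sketch. It is only an illustration under assumptions: the function and argument names are made up (they are not Lending Club field names), and treating a missing last-payment date as month 0 is a guess at a reasonable rule, not something stated above.

```python
from datetime import date

# Statuses counted as a charge-off, per the definition above.
CHARGED_OFF_STATUSES = {"charged off", "default"}


def months_between(start, end):
    """Whole calendar months from start to end (day of month is ignored)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def is_early_charge_off(loan_status, issue_date, last_payment_date, window_months=12):
    """True if the loan went bad and its last payment fell 0-12 months after issue."""
    if loan_status.strip().lower() not in CHARGED_OFF_STATUSES:
        return False
    if last_payment_date is None:
        # ASSUMPTION: no recorded payment is treated as month 0.
        return True
    return 0 <= months_between(issue_date, last_payment_date) <= window_months
```

A loan issued in January 2014 whose last payment came in July 2015 (month 18) returns False here, which matches the month-18 example above.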
2. After I had gathered the data and defined what a "charge off" is, I used a pivot table to determine the most commonly used employment titles. If you're curious, the top three most common were "teacher," "manager," and "owner." I decided to create a short list of employment title categories by taking the top 100 most common titles. The list eventually grew to 124 total categories, since some common categories were not detected by the initial pivot table analysis.
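The pivot-table count in this step has a rough equivalent in Python's standard library. The sketch below assumes the titles are already loaded as a list of strings; the function name and the sample titles are invented for illustration.

```python
from collections import Counter


def most_common_titles(titles, top_n=100):
    """Count employment titles after trimming spaces and lowercasing."""
    cleaned = (t.strip().lower() for t in titles if t and t.strip())
    return Counter(cleaned).most_common(top_n)


# Made-up sample, not real Lending Club rows.
sample = ["Teacher", "teacher ", "Manager", "Owner", "", None, "manager", "teacher"]
print(most_common_titles(sample, top_n=3))
# [('teacher', 3), ('manager', 2), ('owner', 1)]
```

Blank and missing titles are skipped, so they do not show up as a "category" of their own.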
3. The pivot table report showed a clear separation between the low-hanging fruit in the database and everything else. The low-hanging fruit were the people who used a simple description for their employment title and spelled it correctly, such as "teacher." The difficult people used the name of their employer instead of an employment title, or used a compound description such as "President & CEO," or misspelled a word, or included special characters such as "&" or "/".
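Some of the "difficult" titles become easy once extra spaces and special characters are stripped out. Below is a minimal sketch of that clean-up in Python; the function name is hypothetical, and dropping everything except ASCII letters and digits is an assumption that may be too aggressive for some titles.

```python
import re


def normalize_title(raw):
    """Lowercase, turn special characters into spaces, collapse repeated spaces."""
    text = raw.lower()
    # ASSUMPTION: anything that is not a-z or 0-9 is noise ("&", "/", ".", etc.).
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return " ".join(text.split())


print(normalize_title("  President & CEO. "))   # president ceo
print(normalize_title("Vice-President/COO"))     # vice president coo
```

This does nothing for misspellings or employer names; it only makes the correctly spelled titles easier to group.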
4. I quickly encountered a problem: some people could be included in multiple employment categories. Based on my shortlist, "Assistant director systems engineering" was an assistant, a director, an engineer, and a systems engineer all at once. I resolved to allow for multiple employment categories/labels per loan in order to overcome this problem:
Breakdown of loans by number of labels
One label | 335,495 |
Two labels | 71,845 |
Three labels | 3,556 |
Four labels | 98 |
not labeled | 271,000 |
emp title blank | 36,886 |
-------------------- | ----------- |
total | 643,698 |
5. I used keyword searches to label as many loans as possible. For example, "fire fighter," "fire marshall" and "fire chief" were all lumped together. Similarly, "CEO," "COO," "CTO," "CFO," and their non-abbreviated versions were all lumped together into the "C-suite" category.
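The keyword labeling, including multiple labels per loan, can be sketched in Python as follows. The keyword lists are a tiny illustrative subset and partly guessed (the real shortlist had 124 categories), the category names are hypothetical, and whole-word matching is an assumption chosen so that "coo" does not match "cook."

```python
import re
from collections import Counter

# ASSUMPTION: illustrative keyword lists only, not the actual shortlist.
CATEGORY_KEYWORDS = {
    "fire": ["fire fighter", "firefighter", "fire marshall", "fire chief"],
    "c-suite": ["ceo", "coo", "cto", "cfo", "chief executive officer",
                "chief operating officer", "chief technology officer",
                "chief financial officer"],
    "president": ["president"],
    "assistant": ["assistant"],
    "director": ["director"],
    "engineer": ["engineer", "engineering"],
    "systems engineer": ["systems engineer", "systems engineering"],
}


def label_title(title):
    """Return the set of categories whose keywords appear as whole words."""
    text = " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())
    labels = set()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(re.search(r"\b" + re.escape(kw) + r"\b", text) for kw in keywords):
            labels.add(category)
    return labels


def label_count_breakdown(titles):
    """Count how many titles received 0, 1, 2, ... labels."""
    return Counter(len(label_title(t)) for t in titles)


print(sorted(label_title("Assistant director systems engineering")))
# ['assistant', 'director', 'engineer', 'systems engineer']
```

Returning a set per loan keeps the multi-label idea from step 4 intact, and the breakdown function produces the same kind of tally as the table above.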
6. As noted in the table above, over a quarter million loans in my sample remain unlabeled. This represents the really difficult ones - most of them are not employment titles, but employer names, and hence impossible to categorize. Others are indeed employment titles, but they contain severe misspellings or belong in categories that are too uncommon to be statistically significant. There are also probably many loans that could be labeled, but belong to a category that I missed.
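Some of the misspelled titles in the unlabeled pile might be recoverable with fuzzy matching. The sketch below uses `difflib.get_close_matches` from Python's standard library; this is a swapped-in technique, not what was done in Excel. The list of known titles is a made-up subset, and the 0.8 similarity cutoff is an arbitrary starting point that would need tuning.

```python
import difflib

# ASSUMPTION: a small illustrative list; in practice this would be the full shortlist.
KNOWN_TITLES = ["teacher", "manager", "owner", "president", "engineer", "director"]


def closest_known_title(raw, cutoff=0.8):
    """Return the best-matching known title, or None if nothing is close enough."""
    candidate = raw.strip().lower()
    matches = difflib.get_close_matches(candidate, KNOWN_TITLES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(closest_known_title("Teachr"))   # teacher
print(closest_known_title("managr"))   # manager
```

Employer names should mostly fall below the cutoff and stay unlabeled, but a lower cutoff would start producing false matches, so results are worth spot-checking by hand.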
7. I calculated charge-off rates for each category and sorted from lowest-to-highest. Median income was also calculated for each employment label. Complete results for all categories are included in the attached PDF file. Below is a small sampling of the categories for the purpose of including a pretty chart:

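As a rough sketch of the calculation in step 7, the Python below computes a charge-off rate and median income per label and sorts from lowest to highest rate. The dictionary keys ("labels", "charged_off", "annual_income") are hypothetical names, not Lending Club fields, and a loan with several labels is counted once under each of its labels.

```python
from collections import defaultdict
from statistics import median


def summarize_by_label(loans):
    """Per-label loan count, charge-off rate and median income, sorted by rate."""
    counts = defaultdict(int)
    charge_offs = defaultdict(int)
    incomes = defaultdict(list)
    for loan in loans:
        for label in loan["labels"]:      # a loan contributes to every label it has
            counts[label] += 1
            charge_offs[label] += int(loan["charged_off"])
            incomes[label].append(loan["annual_income"])
    rows = [
        {
            "label": label,
            "loans": counts[label],
            "charge_off_rate": charge_offs[label] / counts[label],
            "median_income": median(incomes[label]),
        }
        for label in counts
    ]
    return sorted(rows, key=lambda row: row["charge_off_rate"])
```

Labels with very few loans will produce noisy rates, so a minimum loan count per label is probably worth adding before reading much into the ranking.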
8. I validated the results by splitting the sample into "earlier loans" and "later loans" and recalculating the charge-off rates. If the results are reasonably similar for the two samples, then we can conclude that employment title is predictive of charge-off rates. The cutoff date that I chose was October 2014 - loans issued on or before this date are in the "earlier" sample and loans after this date are in the "later" sample. I chose this cutoff point solely in the interest of creating equally-sized samples - I wanted an even split. Below is the chart that proves my methodology is mostly accurate:

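The earlier/later validation in step 8 can be sketched in Python like this. The dictionary keys are hypothetical, and using October 31, 2014 as the cutoff is an assumption about how "on or before October 2014" maps to a calendar date.

```python
from datetime import date


def charge_off_rates(loans):
    """Charge-off rate per label for one sample of loans."""
    totals, bad = {}, {}
    for loan in loans:
        for label in loan["labels"]:
            totals[label] = totals.get(label, 0) + 1
            bad[label] = bad.get(label, 0) + int(loan["charged_off"])
    return {label: bad[label] / totals[label] for label in totals}


def compare_early_late(loans, cutoff=date(2014, 10, 31)):
    """Return {label: (earlier_rate, later_rate)} for labels present in both samples."""
    # ASSUMPTION: the cutoff means "issued through the end of October 2014".
    early = [loan for loan in loans if loan["issue_date"] <= cutoff]
    late = [loan for loan in loans if loan["issue_date"] > cutoff]
    early_rates = charge_off_rates(early)
    late_rates = charge_off_rates(late)
    shared = sorted(set(early_rates) & set(late_rates))
    return {label: (early_rates[label], late_rates[label]) for label in shared}
```

If the two rates for a label sit close together, that label's result held up across time periods; labels that appear in only one sample are dropped from the comparison.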
Please feel free to dig into the complete data set (attached) and offer feedback. I'm particularly interested in suggestions for how to automate the labeling process - parsing through text data for keywords and misspellings is not easy to do in Excel. But if it can be accomplished, then I'm confident it would be useful for including this data in a regression model.