July 2020 Data Release – Open Source Asset Pricing

Fixes

Previously, quarterly Compustat data was lagged only one month instead of three. The lag is intended to account for the delayed release of accounting data. We also revised Cash and EarnIncrease to be closer to the original papers, as these two predictors were affected by this revision. Incidentally, this seems to be the same bug that Robert Novy-Marx found to affect earlier versions of the Hou, Xue, and Zhang (previously L. Chen and Zhang) 4-factor model. The fix lowers the t-stats for NumEarnIncrease, EarnIncrease, and EarnSupBig by 6.0, 1.6, and 1.5, respectively, but has little effect on our main results.
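To make the effect of the lag concrete, here is a minimal Python sketch. It is not the repo's code: the helper names (`add_months`, `availability_month`) and the convention of dating availability to the first day of a month are assumptions made only for illustration.

```python
from datetime import date

def add_months(d: date, months: int) -> date:
    """Return the first day of the month that is `months` after d's month."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)

def availability_month(fiscal_quarter_end: date, lag_months: int = 3) -> date:
    """Month in which a quarterly figure is assumed to be public (hypothetical helper)."""
    return add_months(fiscal_quarter_end, lag_months)

quarter_end = date(2019, 12, 31)

# One-month lag (the earlier behavior): Q4 2019 numbers are treated as known in January 2020.
one_month = availability_month(quarter_end, lag_months=1)

# Three-month lag (the intended behavior): the same numbers are treated as known in March 2020.
three_months = availability_month(quarter_end, lag_months=3)

print(one_month, three_months)
```

With the shorter lag, a signal can be matched to returns from months before the accounting data was plausibly released, which is the look-ahead the three-month lag is meant to avoid.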
Previously, our method for merging datasets dropped permnos when multiple permnos were assigned to the same ticker. This dropped only 301 permnos out of 27,000 in the full dataset. We are now more careful with the merge to make sure no permnos are dropped. This results in small changes to the t-stats of all predictors: the average change is -0.007, and the standard deviation of the changes is 0.078.
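The kind of loss described above can be reproduced in a few lines. The sketch below is illustrative only: the permnos, tickers, and values are made up, and keying a lookup by ticker is just one assumed way a merge can silently drop permnos, not necessarily what the earlier code did.

```python
# Hypothetical link table: two permnos share the ticker "ABC".
links = [(10001, "ABC"), (10002, "ABC"), (10003, "XYZ")]  # (permno, ticker)

# Hypothetical ticker-level data to be attached to each permno.
ticker_data = {"ABC": 1.5, "XYZ": 2.0}

# Lossy approach: build one permno per ticker, so a later permno overwrites an earlier one.
one_per_ticker = {ticker: permno for permno, ticker in links}
lossy = {permno: ticker_data[ticker] for ticker, permno in one_per_ticker.items()}

# Careful approach: iterate over permnos, so every permno keeps its row.
careful = {permno: ticker_data.get(ticker) for permno, ticker in links}

# Permnos that the lossy merge silently dropped.
dropped = sorted(set(careful) - set(lossy))
print(dropped)
```

Comparing the set of permnos before and after a merge, as in the last step, is a cheap check that nothing was lost.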
We are grateful to Yang Liu (Tsinghua Finance), who carefully examined our data and pointed out the quarterly Compustat error.

Data

208 firm-level characteristics (1GB zipped csv)
- These omit size and price characteristics, which can be downloaded from WRDS.
- Warning: MATLAB has trouble reading very large CSV files (H/T Chris Jones @USC).
Returns for 210 long-short portfolios
- Link is fixed (Jan 27, 2021) thanks to Huichou Huang.
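If a tool struggles to load the full characteristics file at once, one workaround is to stream it row by row. The sketch below uses only Python's standard `csv` module. The column names (`permno`, `yyyymm`, `BM`) and the in-memory sample are assumptions for illustration and may not match the actual file layout.

```python
import csv
import io

def column_mean(lines, column):
    """Stream rows one at a time and average `column`, skipping blank cells."""
    total, count = 0.0, 0
    for row in csv.DictReader(lines):
        value = row[column]
        if value != "":
            total += float(value)
            count += 1
    return total / count if count else float("nan")

# Small in-memory stand-in for the large CSV (hypothetical columns and values).
sample = io.StringIO(
    "permno,yyyymm,BM\n"
    "10001,202001,0.5\n"
    "10002,202001,\n"
    "10003,202001,1.5\n"
)

mean_bm = column_mean(sample, "BM")
print(mean_bm)
```

For the real data, pass an open file handle for the unzipped CSV (for example `open(path, newline="")`, with `path` set by you) in place of `sample`. Only one row is held in memory at a time.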

Additional “Test Asset” Portfolios

Caution: the code behind these portfolios is not in the repo and has not been checked carefully.
Returns for N portfolios for each predictor based on the original papers
- For each predictor, we generate 5 portfolios if our benchmark is quintiles, 10 portfolios if our benchmark is deciles, 2 if binary, etc.
Returns for 10 portfolios for each non-binary predictor formed by decile sorts.
- Portfolio weights based on the original papers (typically equal-weighted), all-stock breakpoints
- Value-weighted returns, all-stock breakpoints
- Value-weighted returns, NYSE breakpoints
- Caution: some predictors are not well-behaved enough to produce deciles
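For readers unfamiliar with the terminology, the following is a rough Python sketch of a quantile sort with value-weighted returns and optional NYSE breakpoints. It is not the code used to build these portfolios (which is not in the repo). The field names (`signal`, `ret`, `mktcap`, `nyse`), the function `sort_portfolios`, and the use of `statistics.quantiles` for breakpoints are all assumptions chosen for illustration.

```python
from bisect import bisect_right
from statistics import quantiles

def sort_portfolios(stocks, n_bins=10, nyse_breakpoints=True):
    """Return {bin: value-weighted return}; bins run from 1 (low signal) to n_bins."""
    # Breakpoints come from NYSE stocks only, or from all stocks.
    base = [s["signal"] for s in stocks if s["nyse"] or not nyse_breakpoints]
    cuts = quantiles(base, n=n_bins)  # n_bins - 1 cut points
    sums = {}
    for s in stocks:
        b = bisect_right(cuts, s["signal"]) + 1  # every stock is assigned, NYSE or not
        wr, w = sums.get(b, (0.0, 0.0))
        sums[b] = (wr + s["mktcap"] * s["ret"], w + s["mktcap"])
    # Bins that receive no stocks are simply absent from the result.
    return {b: wr / w for b, (wr, w) in sorted(sums.items())}

# Made-up cross-section for a single month.
stocks = [
    {"signal": 1.0, "ret": 0.02, "mktcap": 100.0, "nyse": True},
    {"signal": 2.0, "ret": 0.04, "mktcap": 300.0, "nyse": False},
    {"signal": 5.0, "ret": -0.01, "mktcap": 100.0, "nyse": True},
    {"signal": 9.0, "ret": 0.03, "mktcap": 100.0, "nyse": False},
]

# Two bins keep the toy example readable; deciles would use n_bins=10.
halves = sort_portfolios(stocks, n_bins=2, nyse_breakpoints=True)
print(halves)
```

When a signal takes few distinct values, several cut points can coincide and some bins stay empty, which is one way a predictor can fail to produce a full set of deciles.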

Additional Data for Making the Extended Dataset (Not Recommended)

The extended data contains characteristics that were not shown to clearly predict returns in the original papers. Most of these come from Hou, Xue, and Zhang’s (2020) “Replicating Anomalies” paper.
This includes, for example, R&D / Sales. Chan, Lakonishok, and Sougiannis (2001) find that “there is little if any relation between R&D relative to sales and future returns.”
This also includes, for example, analyst coverage. Scherbina (2008) shows that change in analyst coverage predicts returns, based on the idea that a decline in coverage may be caused by bad news. Elgers, Lo, and Pfeiffer (2001) use analyst coverage to show their predictor’s power is related to information frictions. Neither paper examines the predictive power of analyst coverage nor argues that analyst coverage should predict returns.
For readers interested in characteristics that may or may not predict returns, we recommend randomly generating characteristics following Yan and Zheng (2017) or Chordia, Goyal, and Saretto (2020).
But if you must, here are 104 additional firm-level characteristics that may or may not predict returns (0.7 GB)
- This omits the bid-ask spread from TAQ due to license restrictions. TAQ spreads can be produced using code from Andrew's work with Mihail Velikov, which is based on Holden and Jacobsen’s code.
Here are 1,050 additional portfolios
- These are mostly made by altering the rebalancing frequency of other portfolios.