I’ve covered the Red Sox for 20 seasons. This year marked the first time that I’d been given an American League Cy Young vote by the Boston Baseball Writers chapter, a responsibility that forced me to reconsider a seemingly basic but endlessly complicated question: What matters when judging a pitcher’s performance?
First, let’s get this out of the way: It’s not wins and losses. In an era when complete games arrive with the frequency of the Leonids, and when it’s become customary to see dominant performances end in a loss and poor ones result in wins, crediting a win or loss to a pitcher seems misguided.
On the other end of the spectrum, there’s WAR (Wins Above Replacement). It’s a valuable idea in terms of defining a player’s total individual contribution to a team. But there was a roadblock in assessing 2021 AL Cy Young candidates.
Here are the leaders in WAR, as calculated by a few key resources:
Baseball-Reference.com (bWAR): 1. Robbie Ray (6.7); 2. Gerrit Cole (5.7); 3. Lance Lynn (5.4).
FanGraphs (fWAR): 1. Nate Eovaldi (5.6); 2. Cole (5.3); 3. Carlos Rodón (4.9).
Baseball Prospectus (WARP): 1. Cole (4.7); 2. Ray (3.9); 3. Rodón (3.6).
So, that doesn’t exactly offer a ton of clarity. The inconsistency across systems underscores that they value different things. “WAR” reflects choices about what matters and how much weight to give individual components of a performance.
Before diving into the weeds on those systems, it’s also worth noting one problematic element: All three downplay what actually happens in games.
FanGraphs leans on Fielding Independent Pitching (meaning any plate appearances that don’t end in homers, strikeouts, or walks are stripped out). WARP uses Deserved Run Average to account for several variables — among them the expected outcome based on contact (launch angle and exit velocity), the catcher’s framing, and the quality of an opposing hitter — to determine what a pitcher might have been expected to do in a neutral environment. Baseball-Reference.com uses an adjustment meant to account for team defense.
All have a similar goal. Once a ball is in play, there are forces — luck, the ballpark, the quality of defense — that are out of a pitcher’s control. All three systems attempt to limit the influence of those elements.
That approach arguably serves as a better bellwether of a pitcher’s “true” performance and thus serves as a better predictor of his future. But shouldn’t the Cy Young vote reflect what a pitcher did on the field in a given year more than it does what he might have been expected to do in a baseball lab?
The ability to create a pitch shape — a cutter or changeup or other offering with movement — in a specific spot to create a specific kind of contact lies somewhere between artistry and magic. There are times when pitchers work in a fashion that is intended to achieve a very particular outcome with a ball in play. Defense-independent or defense-adjusted pitching metrics — and the WAR metrics that employ them — may undervalue that art in assessing what happened on the field in a given year.
For instance, if a pitcher induces a hard-hit grounder for a double play, he wouldn’t get credit for doing anything meaningful in a FIP-based system, he might receive negative credit in a DRA-based system, and the team adjustment to defense used by bWAR could result in some strange outcomes. (Joe Posnanski had a fantastic look at this phenomenon.)
Credit where it is due
One general manager offered a simple solution when asked what statistics deserved prominence when considering the Cy Young vote: innings and ERA, the bottom-line results in games. (Runs allowed per nine innings might be better than ERA, given that a pitcher is simply absorbing the work of the defense behind and around him, including catcher framing, great plays, and misplays.) Such an approach would yield a fairly obvious answer for the Cy Young race, since Robbie Ray led the AL in innings (193⅓) and ERA (2.84).
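For readers who want the arithmetic behind that distinction, here is a minimal Python sketch of ERA and runs allowed per nine innings. Both are the same calculation; the only difference is whether unearned runs are counted. Innings are tracked as outs recorded so that a total like 193⅓ (580 outs) stays exact. The 61 earned runs are inferred from the innings and ERA cited above (the only whole number that rounds to 2.84); the total-runs figure is a made-up placeholder, not Ray's actual line.

```python
def per_nine(runs: int, outs_recorded: int) -> float:
    """Runs allowed per nine innings (27 outs)."""
    if outs_recorded <= 0:
        raise ValueError("outs_recorded must be positive")
    return 27 * runs / outs_recorded


# 193 1/3 innings = 193 * 3 + 1 outs
outs = 193 * 3 + 1

# ERA counts only earned runs.
era = per_nine(61, outs)

# RA9 counts every run. The 65 here is a hypothetical total for illustration.
ra9 = per_nine(65, outs)

print(f"ERA {era:.2f}, RA9 {ra9:.2f}")
```

The gap between the two numbers is the portion of run prevention that official scoring assigns to errors rather than to the pitcher.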
There’s wisdom in such an approach, and in crediting pitchers for dominant game performances. In an era when so few pitchers go more than five innings, identifying excellent outings of six, seven, or more innings matters.
Yet the Cy Young Award comes without guidance, save for mention of a “Most Outstanding Pitcher” on the trophy. It’s not a “Most Valuable Pitcher” award. Pitchers should get credit, too, for what they can control. Do any of the WAR metrics successfully do that?
Ultimately, FanGraphs’s FIP-based WAR metric seems dated. It focuses on roughly one-third of all plate appearances — those ending in strikeouts, walks, homers, and hit batters — and ignores the other two-thirds in which there’s a ball in play. It makes no difference whether those balls in play are popups, line drives, or ground balls.
Thanks to Statcast, it’s possible to quantify the expected outcome of balls in play (expected batting average, slugging, in what percentage of parks it would be a homer) based on their exit velocity and launch angle. Given that, why employ a system that ignores that data?
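To make the FIP complaint concrete, here is a sketch of the commonly published FIP formula. The weights (13 for homers, 3 for walks and hit batters, 2 for strikeouts) follow the standard public definition; the additive constant is recalculated each season to put FIP on the ERA scale, and the 3.10 default below is an assumed placeholder rather than the 2021 value. The stat lines are invented for illustration.

```python
def fip(hr: int, bb: int, hbp: int, so: int, outs_recorded: int,
        constant: float = 3.10) -> float:
    """Fielding Independent Pitching.

    `constant` is a league- and season-specific scaling term; 3.10 is an
    assumed placeholder, not an official figure.
    """
    if outs_recorded <= 0:
        raise ValueError("outs_recorded must be positive")
    innings = outs_recorded / 3
    return (13 * hr + 3 * (bb + hbp) - 2 * so) / innings + constant


# Two hypothetical pitchers with identical homers, walks, hit batters,
# strikeouts, and innings. One allowed mostly weak grounders, the other
# mostly line drives. FIP has no input for that difference.
ground_ball_pitcher = fip(hr=20, bb=50, hbp=5, so=200, outs_recorded=540)
line_drive_pitcher = fip(hr=20, bb=50, hbp=5, so=200, outs_recorded=540)

print(ground_ball_pitcher == line_drive_pitcher)
```

Nothing about the quality of contact appears among the function's arguments, which is the limitation described above.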
Baseball-Reference.com’s system seems only slightly better, since it looks at the quality of a team’s defense on the whole, rather than defense behind a specific pitcher on specific kinds of balls in play.
In some ways, it makes more sense to look at Statcast’s “expected” statistics — expected weighted on-base average (xwOBA) or expected ERA (xERA) — based on the combination of a pitcher’s strikeout rate, his walk rate, and the specific outcomes expected from the contact he induced. While Statcast doesn’t yet offer its own version of WAR, it’s possible to create some kind of cumulative pitcher value by taking one of those expected statistics and using a pitcher’s workload to identify some sort of “runs prevented” metric.
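One rough way to sketch that kind of cumulative measure: compare a pitcher's xERA with a baseline and scale the difference by innings. This is not a Statcast formula, just a simple illustration. The function name, the choice of baseline, and the numbers in the example are all assumptions.

```python
def expected_runs_prevented(xera: float, outs_recorded: int,
                            baseline_era: float) -> float:
    """Runs prevented relative to a baseline, using xERA and workload.

    `baseline_era` is whatever comparison level you choose (league average,
    or a lower "replacement" level). It is an input, not a fixed constant.
    """
    innings = outs_recorded / 3
    return (baseline_era - xera) * innings / 9


# Hypothetical example: a 3.00 xERA over 180 innings against a 4.30 baseline.
print(round(expected_runs_prevented(3.00, 540, baseline_era=4.30), 1))
```

Because the result scales with innings, a pitcher with a slightly worse xERA but a much larger workload can come out ahead, which is one reason workload matters in this kind of accounting.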
Yet even that is ultimately unsatisfying. Statcast, after all, featured a curious cluster of three White Sox (Lynn, Rodón, and Lucas Giolito) among the four top AL starters in xERA and xwOBA. Why? Chicago’s division was awful. The other four AL Central teams had losing records and four of the worst seven run totals in the AL. (Remember that if Eduardo Rodriguez emerges as a Cy Young candidate next year.)
Quality of competition matters. In that respect, WARP does a better job of comparing pitcher performance than the more widely used WAR systems of Baseball-Reference.com and FanGraphs as well as Statcast. Alternatively, it’s also possible to break down traditional pitcher performance (ERA, strikeout percentage, etc.) by quality of competition (teams that finished above .500 or in the postseason).
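A breakdown by quality of competition can be sketched from a pitcher's game log. The snippet below groups starts by the opponent's final winning percentage and reports innings and ERA for each group. The `Start` record, the field names, and the sample starts are hypothetical; the default cutoff treats a .500 team as a strong opponent, and that threshold is adjustable.

```python
from dataclasses import dataclass


@dataclass
class Start:
    opponent_win_pct: float  # opponent's final winning percentage
    outs_recorded: int
    earned_runs: int


def split_by_opponent(starts, threshold=0.500):
    """Return innings and ERA against strong and weak opponents."""
    totals = {"strong": [0, 0], "weak": [0, 0]}  # [outs, earned runs]
    for s in starts:
        key = "strong" if s.opponent_win_pct >= threshold else "weak"
        totals[key][0] += s.outs_recorded
        totals[key][1] += s.earned_runs
    return {
        key: {
            "innings": outs / 3,
            # ERA is undefined with no outs recorded, so report None.
            "era": 27 * er / outs if outs else None,
        }
        for key, (outs, er) in totals.items()
    }


# Invented game log for illustration.
log = [Start(0.600, 21, 2), Start(0.550, 18, 3), Start(0.400, 27, 0)]
print(split_by_opponent(log))
```

The same grouping works for strikeout percentage or any other counting-based rate, as long as the numerator and denominator are summed before dividing.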
One other variable factored into my Cy Young thinking this year: How did pitchers perform through the first three months of the season, and how did they perform over the last three months?
The significance of that question was twofold this year. First, because of the uniqueness of returning to a 162-game season from last year’s 60-game season, the degree of difficulty of sustaining excellence through a full season was higher than ever.
Pitchers who were able to sustain workloads and elite performance accomplished something that was truly outstanding in a year when ERA by month broke down as follows:
April: 3.99 (starters 4.03)
May: 4.08 (4.02)
June: 4.44 (4.52)
July: 4.39 (4.44)
August: 4.27 (4.38)
September/October: 4.42 (4.64)
Beyond fatigue, there was another curiosity — and concern — in evaluating performance over a full season. The enforcement of the foreign substances ban, which was implemented in late June, created chaos for some pitchers.
For those whose performances dropped significantly down the stretch — or whose spin rates declined initially after the start of the ban and then rebounded (in some cases, to suspicious extents) in the second half — the question loomed: Should they be penalized on the mere possibility that their success had been built on prohibited behavior?
I wrestled with these questions over the last month of the season. I canvassed pitchers, hitters, pitching coaches, managers, coaches, and executives for what they prioritized when thinking about who deserved Cy Young consideration.
Ultimately, I relied on an imprecise formula. I balanced innings pitched (a huge factor for me), ERA, DRA, xwOBA, strikeout rate, and the number of standout outings of six and seven innings or more; I looked at how pitchers did against opponents with records over .500; I did give greater weight to late-season performance than total-season performance; and I tried to account for the dozens of eye-opening conversations I had with people in the game.
In the end, I voted for:
1. Robbie Ray: Fantastic down the stretch, and not only did he lead the AL in innings and ERA, he also threw nearly 30 more innings against opponents with records of .500 or better than anyone else.
2. Gerrit Cole: The leader in DRA and (among qualifying starters) xwOBA, fantastic strikeout rate, but a significant fade down the stretch.
3. Nate Eovaldi: Stayed consistent over the full season in the meat-grinder division and with a giant innings load.
4. Lance Lynn: The leader in xwOBA and ERA, but not enough innings to qualify for the ERA title and a paltry 54 innings against teams that finished at .500 or better.
5. Frankie Montas: Had the workload, dominance over the last five months of the season, and excellence against elite opponents, with a lot of starts against teams that contended to the end of the season.
I respect any disagreements with that order. Beyond highlighting some wonderful performances (not just with those five but also with pitchers such as Rodón, Lance McCullers, and José Berrios), the process of getting to that conclusion highlighted the difficulty of defining and measuring pitcher success.
None of the modern statistical tools is perfect. Instead, they serve as discussion points rather than conclusions.
Alex Speier can be reached at firstname.lastname@example.org. Follow him on Twitter at @alexspeier.