My astronomy research interests are broad but center on understanding how galaxies like our own Milky Way behave, form, and evolve. I focus on using large datasets to generate maps of billions of astronomical objects and the diffuse "stuff" within and between them. I am particularly interested in developing techniques to jointly model observations from separate (but often complementary) datasets such as imaging, spectroscopy, and time series.

Milky Way

Our own Milky Way Galaxy is composed of roughly 200 billion stars. These stars trace Galactic structure and serve as records of the formation history of the Galaxy. Studying them, however, is challenging because we can't tell, e.g., how old they are or how far away they are from us from just images of the sky. This is made even more difficult because the Galaxy is also filled with cosmic dust, which blocks and "reddens" the light from many of these stars. We also know many stars are born together in clusters from clouds of dense molecular gas in a complicated, hierarchical process.

Studying Galactic structure with modern astronomical datasets requires integrating observations of all of these features (stars, dust, gas, etc.) from many different datasets and at many different scales. I am interested in combining these data with theoretical and data-driven models to create sophisticated models of the Milky Way such as 3-D dust maps.

2-D map of local dust
A projected 2-D map of the distribution of dust around the Sun. Credit: Zucker, Speagle et al. (2020).

Stellar Populations

In addition to serving as useful tracers of Galactic structure, stars are also natural laboratories for testing fundamental physics. Modeling their evolution and internal structure requires a combination of nuclear physics, radiative transfer, magnetohydrodynamics, and more. Their evolution can also be sensitive to additional factors such as chemical composition, magnetic dynamos, and interactions with binary companions. Studying them therefore provides a window into a whole suite of physical processes that are important for understanding how stars like our Sun came to be the way they are today.

Over the last decade, imaging, spectroscopic, and time-series data have become available for stellar populations ranging from the tens of thousands to the billions. I am interested in using these data to investigate stellar behavior using methods such as asteroseismology and gyrochronology, test and improve current state-of-the-art theoretical stellar models, and build new, empirical stellar libraries.

Deriving empirical corrections for theoretical stellar populations
Deriving physically-motivated empirical corrections to theoretical stellar populations using a population of stars located in the nearby open cluster NGC 2682 (also known as M67). Credit: Speagle et al. (2021a).

Galaxy Evolution

The story of how galaxies form and evolve is complex and involves many moving pieces. Observations suggest galaxies are formed hierarchically through the merger of many smaller galaxies throughout the course of their lifetimes. Their evolution is an interplay of secular processes, involving ongoing star formation and gas physics, and catastrophic processes, such as mergers with neighboring galaxies and feedback from their central supermassive black holes, that can rapidly "quench" star formation. This leads to an extraordinary diversity of galaxies with varied assembly histories and physical properties.

To understand the details of how galaxies evolve, we need to observe large samples of galaxies across the electromagnetic spectrum in order to constrain their formation and evolutionary histories. Astronomers have had difficulty keeping pace with the increasing quantity and quality of data from large surveys, which now include images taken at many wavelengths, 1-D and 2-D spectroscopy, and spatially-resolved IFU data. These often contain complementary information but are challenging to model. I am interested in developing techniques and tools designed to model these data to learn more about the underlying galaxy and stellar populations.

Evolution of star-forming galaxies across cosmic time
Evolution of star-forming galaxies across cosmic time based on a compilation of 25 studies. Credit: Speagle et al. (2014).

Cosmology and Large-Scale Structure

The growth of large-scale structure and the evolution of the Universe at large is mysterious and complex. We now know that baryonic matter (i.e., "normal stuff") only makes up 5% of the energy budget of the Universe. The remaining 95% is split between "dark matter" (25% of the total) and "dark energy" (70% of the total). While we can infer the presence of dark matter through its gravitational effects, dark energy can only be observed by studying the evolution of the Universe on the largest of scales over long periods of time. Galaxy evolution takes place against the backdrop of the "cosmic web" of large-scale structure whose properties depend on all of these various components.

Improving our understanding of cosmology requires knowing the distances to many of the galaxies that trace these large-scale structures. I am interested in developing quick yet robust probabilistic approaches to derive distances to the billions of galaxies collected in modern surveys along with methods to subsequently incorporate them into cosmological analyses.

Cosmological constraints from HSC
Cosmological constraints from cosmic shear measurements from the HSC-SSP Survey. Credit: Hikage et al. (2019).


My statistics research interests are broadly centered around performing statistical inference over large, diverse datasets. This requires developing frameworks to jointly model separate (but complementary) observations, scalable methods to implement them, and sampling methods to explore the inferred distribution of model parameters.

Data Integration

Modern science has entered the era of "big data", with large datasets commonly available across a host of scientific domains from genomics to economics to astronomy. These datasets often overlap with each other while probing complementary sets of information, and so jointly modeling them often enables us to draw better statistical inferences.
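As a toy illustration of why joint modeling helps, consider two independent Gaussian measurements of the same quantity. The sketch below combines them with inverse-variance weighting, which is the standard result for this idealized case and not a description of any specific pipeline. The function name `combine_gaussian` and the example numbers (two hypothetical distance estimates in parsecs) are assumptions made purely for illustration.

```python
def combine_gaussian(estimates):
    """Combine independent Gaussian estimates of one quantity.

    `estimates` is a list of (mean, sigma) pairs. Returns the mean and
    standard deviation of the combined (product) Gaussian.
    """
    # Each estimate is weighted by its inverse variance.
    weights = [1.0 / sigma**2 for _, sigma in estimates]
    total_weight = sum(weights)
    mean = sum(w * m for w, (m, _) in zip(weights, estimates)) / total_weight
    # The combined variance is the inverse of the summed weights.
    sigma = total_weight ** -0.5
    return mean, sigma


# Hypothetical example: a loose estimate from one dataset and a tighter
# estimate from another, both of the same distance (in parsecs).
joint_mean, joint_sigma = combine_gaussian([(300.0, 30.0), (280.0, 10.0)])
```

The combined uncertainty is always smaller than the smallest individual uncertainty, which is the simplest version of the gain that joint modeling offers.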

While straightforward in theory, this type of "data integration" in astronomy is often difficult since many of these datasets possess widely different characteristics, trace different sub-populations, depend on different physical processes, and often only partially overlap. I am interested in developing methods to deal with these types of data to improve our understanding of the Milky Way and galaxy evolution.

Distances across the Perseus cloud.
Distance and velocity estimates to various regions of the Perseus cloud derived through a combination of observations of stars, dust, and gas. Credit: Zucker et al. (2018).

Scalable Inference

Analyzing large datasets has increasingly become the domain of machine learning methods. However, these methods tend to have difficulty deriving estimates of the uncertainty and reliability of their predictions and can be difficult to interpret. While performing statistical inference with interpretable models can help to address these concerns, many such methods are prohibitively slow and therefore limited to small datasets.

To address these concerns, I am interested in combining elements of machine learning and statistical modeling to develop quick yet robust approaches for "scalable" inference that can be applied to large datasets. I am especially keen on incorporating fundamental physics (e.g., Doppler shifts), geometry (e.g., symmetries), and observational effects (e.g., noise, censoring) into this process.

Workflow combining kNN with SED fitting.
A schematic showing a hybrid approach that combines statistical modeling and machine learning to derive distances to galaxies. Credit: Speagle et al. (2019).

Sampling Methods

Much of science involves using data to test, constrain, and/or rule out various models that represent our current understanding of how we think things work. These models are often complex, involving many parameters and requiring lots of computational effort to generate. The constraints on these parameters given our data (and possibly prior beliefs) are often unknown (and can sometimes be multi-modal), requiring the use of numerical techniques to estimate them.

One class of techniques for estimating these constraints relies on generating random samples from the distribution via computationally tractable numerical simulation. These are known as "Monte Carlo" approaches, and include methods such as Markov Chain Monte Carlo that are widely used throughout the sciences. I am interested in developing efficient sampling strategies that can be applied to multi-modal distributions to estimate parameters and perform model comparison.
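To make the idea of sampling concrete, here is a minimal sketch of a random-walk Metropolis sampler, one of the simplest Markov Chain Monte Carlo methods, applied to a toy bimodal distribution. It is not the Nested Sampling algorithm and not how any particular package works. The target density, the function names, and the tuning values (`step_size`, `n_steps`, `seed`) are all assumptions chosen for illustration.

```python
import math
import random


def log_density(x):
    """Unnormalized log-density of a toy bimodal target: an equal-weight
    mixture of two unit-variance Gaussians centered at -3 and +3."""
    a = -0.5 * (x + 3.0) ** 2
    b = -0.5 * (x - 3.0) ** 2
    m = max(a, b)  # log-sum-exp trick for numerical stability
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def metropolis(log_p, x0, n_steps, step_size, seed=0):
    """Random-walk Metropolis sampler for a 1-D target density."""
    rng = random.Random(seed)
    x, lp = x0, log_p(x0)
    samples = []
    for _ in range(n_steps):
        # Propose a Gaussian step around the current position.
        proposal = x + rng.gauss(0.0, step_size)
        lp_new = log_p(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            x, lp = proposal, lp_new
        samples.append(x)
    return samples


samples = metropolis(log_density, x0=0.0, n_steps=20000, step_size=3.0)
# Fraction of samples in the right-hand mode; for this symmetric target
# it should approach one half as the chain grows.
frac_right = sum(s > 0 for s in samples) / len(samples)
```

The step size matters here: a proposal much narrower than the gap between the two modes would leave the chain stuck in one of them for long stretches, which is the kind of difficulty that motivates sampling strategies designed for multi-modal distributions.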

An animation of dynesty in action.
An animated demonstration of the Nested Sampling code "dynesty" (Speagle 2020). Credit: dynesty documentation.