The Northwestern Computational Research Day provides opportunities for University faculty, researchers, graduate students, and postdocs to discuss successful practices and challenges in research computing.
Reproducibility in Computational Research: Code, Data, Statistics, and Implementation
Victoria Stodden, Associate Professor of Information Sciences, University of Illinois at Urbana-Champaign
Reproducibility in the computational sciences can be interpreted in increasingly expansive ways. The narrowest interpretation is re-execution (or even just preservation) of the computational steps that led to the published scientific claims, using the same input data and parameter settings. The most expansive interpretation might be a completely independent implementation of an experiment designed to test the same scientific claim as a previously published work. In this talk I will start with the narrow interpretation, discuss steps taken by the community to enable reproducibility, and present a roadmap for achieving this goal. I will then propose steps to achieve the more expansive interpretation of reproducibility, including how we can compare and extend computational experiments to enhance scientific knowledge. I will present recent infrastructure research on experiment pipeline Abstractions for Improving Machine learning (AIM) to enable the comparison and extension of computational research.
About Victoria Stodden: Victoria Stodden joined the School of Information Sciences as an associate professor in Fall 2014. She is a leading figure in the area of reproducibility in computational science, exploring how we can better ensure the reliability and usefulness of scientific results in the face of increasingly sophisticated computational approaches to research. Her work addresses a wide range of topics, including standards of openness for data and code sharing, legal and policy barriers to disseminating reproducible research, robustness in replicated findings, cyberinfrastructure to enable reproducibility, and scientific publishing practices. Stodden co-chairs the NSF Advisory Committee for CyberInfrastructure and is a member of the NSF Directorate for Computer and Information Science and Engineering (CISE) Advisory Committee. She also serves on the National Academies Committee on Responsible Science: Ensuring the Integrity of the Research Process. Previously an assistant professor of statistics at Columbia University, Stodden taught courses in data science, reproducible research, and statistical theory and was affiliated with the Institute for Data Sciences and Engineering. She co-edited two books released in 2014—Privacy, Big Data, and the Public Good: Frameworks for Engagement published by Cambridge University Press and Implementing Reproducible Research published by Taylor & Francis. Stodden earned both her PhD in statistics and her law degree from Stanford University. She also holds a master’s degree in economics from the University of British Columbia and a bachelor’s degree in economics from the University of Ottawa. View faculty profile.
At Home in a Storm of Stars: Observing, Simulating, and Pondering the Milky Way Galaxy
Shane Larson, Associate Director of CIERA (Center for Interdisciplinary Exploration and Research in Astrophysics) and WCAS Research Associate Professor of Physics and Astronomy
The Milky Way galaxy is an elaborate storm of stars that is 10 billion years old and built of 400 billion stars. Historically, our ideas about the nature and structure of the Cosmos have grown out of our attempts to understand the Milky Way. Four hundred years ago, we didn’t even know what the Milky Way was, until the invention of the telescope revealed it was composed of stars. Less than 100 years ago, we didn’t know whether the Universe was just the Milky Way, or whether the Milky Way was simply a mote in a much vaster cosmic void. That question was once again resolved by telescopes, together with new theoretical ideas from Einstein’s general relativity. Today, we know a great deal more about the galaxy, but are still woefully limited by our inability to probe it. Large-scale surveys, with both ground- and space-based telescopes, are attempting to make unprecedented maps of the Milky Way. At the end of the 2020s, the gravitational wave observatory LISA will add to those maps a survey of the stellar graveyard of the Milky Way. Here at Northwestern we are engaged in a vast array of different computational and observational work to better understand the galaxy in which we live. In this talk we’ll examine some of those projects, talk about what we hope to learn about the Milky Way, and speculate about what mysteries are still out of our reach.
About Shane Larson: Shane Larson is a research associate professor of physics at Northwestern University, where he is the Associate Director of CIERA (Center for Interdisciplinary Exploration and Research in Astrophysics). He works in the field of gravitational wave astrophysics, specializing in studies of compact stars, binaries, and the galaxy. He works in gravitational wave astronomy with both the ground-based LIGO project and the future space-based detector LISA. Shane grew up in eastern Oregon and was an undergraduate at Oregon State University, where he received his B.S. in Physics in 1991. He received a Ph.D. in theoretical physics (1999) from Montana State University. He is an award-winning teacher and a Fellow of the American Physical Society. He currently lives in the Chicago area with his wife, daughter, and cats. He contributes regularly to a public science blog at writescience.wordpress.com, and tweets with the handle @sciencejedi. View faculty website.
Learning Probabilistic Models for Graph Partitioning and Community Detection
Aravindan Vijayaraghavan, Assistant Professor, Electrical Engineering and Computer Science; Robert R. McCormick School of Engineering and Applied Science
The Stochastic Block Model, or the Planted Partition Model, is the most widely used probabilistic model for community detection and clustering graphs in various fields, including machine learning, statistics, and social sciences. Many existing algorithms (e.g. spectral algorithms) successfully learn the communities or clusters when the data is drawn exactly according to the model. However, many of these guarantees do not hold in the presence of modeling errors, or when there is overlap between the different communities. In this talk, I will address the following question: can we design robust, efficient algorithms for learning probabilistic models for community detection that work in the presence of adversarial modeling errors? I will describe different computationally efficient algorithms that provably recover communities or clusters (up to small recovery error). These algorithmic results will work for probabilistic models that are more general than the stochastic block model, or when there are different kinds of modeling errors or noise.
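As background, sampling a graph from a two-block stochastic block model (planted partition) takes only a few lines: nodes in the same block connect with probability p_in, nodes in different blocks with probability p_out. This is a minimal illustrative sketch of the model itself, not the talk's recovery algorithms; all names and parameter values below are hypothetical.

```python
# Toy sampler for a two-community stochastic block model (SBM).
import random

def sample_sbm(block_sizes, p_in, p_out, seed=0):
    """Return (block labels, set of undirected edges) for a planted partition."""
    rng = random.Random(seed)
    # Assign each node a block label, e.g. [0, 0, ..., 1, 1, ...]
    labels = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    n = len(labels)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                edges.add((i, j))
    return labels, edges

labels, edges = sample_sbm([50, 50], p_in=0.5, p_out=0.05)
within = sum(1 for i, j in edges if labels[i] == labels[j])
across = len(edges) - within
# With p_in much larger than p_out, within-block edges dominate,
# which is the signal that spectral and other algorithms exploit.
print(within > across)  # prints True
```

When p_in and p_out approach each other, or when the sampled graph deviates adversarially from this generative process, the community structure becomes much harder to recover, which is the regime the talk addresses.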
Learn more about Aravindan Vijayaraghavan.
Mechanistic Modeling of the (Bio)Conversion of (Bio)Macromolecules
Linda Broadbelt, Professor, Chemical and Biological Engineering; Robert R. McCormick School of Engineering and Applied Science
Fast pyrolysis, a potential strategy for the production of transportation fuels from biomass, involves a complex network of competing reactions, which result in the formation of bio-oil, non-condensable gaseous species, and solid char. Bio-oil is a mixture of anhydro sugars, furan derivatives, and oxygenated aromatic and low molecular weight (LMW) compounds. Previously, the successful modeling of fast pyrolysis reactors for biomass conversion was hampered by lumped kinetic models, which fail to predict the bio-oil composition. Hence, a fundamental understanding of the chemistry and kinetics of biomass pyrolysis is important to evaluate the effects of process parameters like temperature, residence time, and pressure on the composition of bio-oil. In this talk, a mechanistic model that was recently developed to characterize the primary products of fast pyrolysis of cellulose is described. The kinetic model of pyrolysis of pure cellulose was then extended to describe cellulose decomposition in the presence of sodium salts. To quantify the effect of sodium, a density functional theory study of glucose dehydration, an important class of decomposition reactions of a cellulose-derived intermediate, was carried out. The theoretical results reveal alterations in the reaction rate coefficients when sodium is present and a change in the relative rates of different reactions. These kinetic parameters were used in the kinetic model to describe Na-mediated pathways, capturing trends in the experimental product distributions as the salt loading was increased based on classic catalytic cycles. In contrast to pyrolysis, conversion of macromolecules such as cellulose in nature takes place at ambient temperature, aided by enzymes. Mechanistic details of the action of these enzymes will also be discussed and contrasted with high-temperature pyrolysis pathways.
We have also developed a computational discovery platform for identifying and analyzing novel biochemical pathways to target chemicals. Automated network generation that defines and implements the chemistry of what we have coined “generalized enzyme functions” based on knowledge compiled in existing biochemical databases is employed. The output is a set of compounds and the pathways connecting them, both known and novel. To identify the most promising of the thousands of different pathways generated, we link the automated network generation algorithms with pathway evaluation tools. The simplest screening metrics to rank pathways are pathway length and number of known reactions. More sophisticated screening tools include thermodynamic feasibility and potential of known enzymes for carrying out novel reactions. Our method for automated generation of pathways creates novel compounds and pathways that have not been reported in biochemical or chemical databases. Thus, our method goes beyond a survey of existing compounds and reactions and provides an alternative to the conventional approaches practiced to develop novel biochemical processes that harness the power of enzymes as catalysts.
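The automated network-generation idea, applying operator-encoded "generalized enzyme functions" to a growing pool of compounds and recording the pathway that reaches each product, can be sketched loosely as a breadth-first expansion. Everything below is a toy illustration under stated assumptions (integer "compounds", made-up operators), not the group's actual discovery platform.

```python
# Loose sketch of automated reaction-network generation: apply generic
# reaction operators breadth-first, collecting each new compound together
# with the shortest operator sequence (pathway) that produced it.
from collections import deque

def generate_network(seeds, operators, max_depth=2):
    """Expand a reaction network; return {compound: pathway of operator names}."""
    pathways = {c: [] for c in seeds}
    frontier = deque((c, 0) for c in seeds)
    while frontier:
        compound, depth = frontier.popleft()
        if depth == max_depth:
            continue  # depth limit keeps the combinatorial explosion in check
        for name, op in operators:
            product = op(compound)
            if product not in pathways:  # record only the first (shortest) route
                pathways[product] = pathways[compound] + [name]
                frontier.append((product, depth + 1))
    return pathways

# Hypothetical operators on integer "compounds": dehydration halves,
# oxidation adds one. Real operators would transform molecular structures.
ops = [("dehydrate", lambda c: c // 2), ("oxidize", lambda c: c + 1)]
net = generate_network([8], ops, max_depth=2)
print(sorted(net))  # prints [2, 4, 5, 8, 9, 10]
```

In the real platform, the resulting pathways would then be ranked by screening metrics such as pathway length, number of known reactions, and thermodynamic feasibility, as described above.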
Learn more about Linda Broadbelt.
Analysis Animation: A New Paradigm for Exploring Population Omics Data–Lake Room
Denise Scholtens, Professor of Preventive Medicine (Biostatistics) and Neurological Surgery; Chief, Division of Biostatistics
Integration of genetics and metabolomics data demands careful accounting of complex dependencies, particularly when modeling familial omics data, for example, to study fetal programming of related maternal-offspring phenotypes. Efforts to find ‘genetically determined metabotypes’ using classic genome-wide association study (GWAS) approaches have proven useful for characterizing complex disease, but conclusions are often limited to a disjointed series of variant-metabolite associations. Our research group is adapting Bayesian network models to integrate metabotypes with maternal-fetal genetic dependencies and metabolic profile correlations. Using data from the multiethnic Hyperglycemia and Adverse Pregnancy Outcome (HAPO) Study, we demonstrate that strategic specification of ordered dependencies, pre-filtering of candidate metabotypes, spinglass clustering of metabolites, and conditional linear Gaussian methods clarify fetal programming of newborn adiposity related to maternal glycemia. Exploration of network growth over a range of penalty parameters, coupled with interactive plotting and external validation using publicly available results, facilitates interpretation of network edges. These methods are broadly applicable to integration of diverse omics data for related individuals.
Learn more about Denise Scholtens.
Balloon-borne Observations of the Birth of Stars and Planets in Magnetized Galactic Clouds
Giles Novak, Professor, Physics and Astronomy; Judd A. and Marjorie Weinberg College of Arts and Sciences
The Galactic magnetic fields that permeate the coldest interstellar clouds where new stars and planets are born are believed to control many aspects of the stellar birth process. These fields are notoriously difficult to observe, but advanced detectors for submillimeter astronomy combined with the availability of long-duration balloons flown over Antarctica are now making such observations possible on a large scale. Separating (a) the tiny astrophysical signals that carry information on cosmic magnetic fields from (b) the large thermal signals produced by the optics and residual atmosphere is a job for cluster computing. I will review this problem and other challenges, discuss recent results from our NASA-funded program of Antarctic ballooning, and summarize future plans.
Learn more about Giles Novak.
Leveraging computational processes and neuroimaging data to understand the developing human brain
Elizabeth Norton, Assistant Professor, Communication Sciences and Disorders; School of Communication
The human brain is impressively complex, and only recently have we started to understand the ways in which differences in brain structure and function can lead to meaningful behavior differences in individuals. An even greater challenge for our field to tackle is to understand how the brain changes over development. Neuroimaging studies using functional and structural MRI, EEG, and other technologies now let us peer into the minds of adults and children with disorders such as dyslexia or autism, and even into preschoolers and infants who might develop these disorders. This presentation will discuss the computational tools that we use in our lab as cognitive neuroscientists to better understand these brain processes, such as software that allows us to track fibers and segment meaningful brain areas from MRI data, and to analyze and integrate EEG data from two people during interaction. We will discuss how brain measures and computational approaches might be used in a practical way to understand and predict common developmental disorders.
Learn more about Elizabeth Norton.
Using data to understand the social and systemic drivers of HIV
Michelle Birkett, Assistant Professor, Medical Social Sciences; Feinberg School of Medicine
Understanding the drivers of health disparities within populations is extremely complex – particularly within stigmatized populations, such as racial or sexual and gender minorities. Health disparities have been suggested to occur because of intersecting individual, relational, and environmental processes caused by stigma, but little is known about the exact pathways. This talk will present how utilizing large and diverse datasets, as well as a systems-perspective, allows researchers to understand the social and contextual drivers of these disparities.
Learn more about Michelle Birkett.
Good, Fast, Cheap: Applying the Iron Triangle to Big Data Governance
Justin Starren, Professor, Preventive Medicine, Health and Biomedical Informatics Division; Feinberg School of Medicine
The presentation builds upon the experience of the Northwestern Medicine Enterprise Data Warehouse (EDW), a mature, decade-old repository containing over 6.7 million patients and data from 142 discrete source systems. It will discuss various governance models for large mixed-use data repositories and the relative advantages and risks of each. Our experience is that the “politics,” such as policies governing access and ownership, are much more difficult and time consuming than the technical “plumbing” of the data handling. We will discuss how these issues are magnified in multi-institution repositories, and some strategies for addressing these challenges. The presentation will introduce the concept of the Iron Triangle and show how it can be expanded into the Iron Box to address the unique challenges of big data-centric solutions. It will present a graphical version of the Iron Box, explain how the tool can be applied to these tradeoffs and used to facilitate organizational strategy discussions, and include a hands-on exercise to help participants learn to apply it. The presentation will conclude with case studies demonstrating how a shared-model EDW can facilitate advanced analytics and a Learning Health System.
Learn more about Justin Starren.
Vision Science in Visualization
Christie Nothelfer, PhD candidate in the Brain, Behavior & Cognition program in the Department of Psychology
Data are ubiquitous. Data visualizations are powerful ways to explore data, and a persuasive way to communicate the patterns within. How do we design effective visualizations? One route is to incorporate research on human vision, designing visualizations that leverage the strengths and weaknesses of the human perceptual system. While the visualization and vision science research communities have historically maintained minimal contact, they have much to teach each other, and collaborative research questions can advance both fields. For example, should we use multiple visual features, like colors and shapes, to distinguish groups of data in a scatterplot, even if it adds more visual complexity? How many sections of a line chart can we perceive and remember? How quickly can we extract relationships between data points, such as whether a bar graph contains more increases or decreases in value? I will present research at this exciting intersection of visualization and vision science, and demonstrate how both research communities can benefit from cross-disciplinary work.
Learn more about Christie Nothelfer.
Building Big Data to Measure Legal Bias
Kat Albrecht, PhD candidate, Department of Sociology; Judd A. and Marjorie Weinberg College of Arts and Sciences
Crime data can be extremely difficult to use when you are trying to measure actual crime. One reason for this is that the court system arbitrates the crime categories that are used by researchers, who often proceed without a clear understanding of the separation between the original criminal event and the court-derived data. Compounding this measurement issue are human sources of bias within the justice system, most notably judges, as the academic literature emphasizes. While it is true that judges wield a great deal of discretion, over 95% of federal cases are actually decided via plea bargaining, giving prosecutors widespread influence over most plea deals. This project focuses on creating victim-offender sentencing dyads from Florida homicide data (N=43,459) that include prosecutor/defense attorney information in order to probe this potential source of bias. The project relies heavily on computational methods, including: transforming PDF records into digital data, scraping massive amounts of newspaper data to develop condensed victim profiles, fuzzy matching to construct a large cohesive dataset, and tests of APIs designed to predict race/ethnicity from images. In addition to substantive findings of racial bias among prosecutors, the benchmarking and functionality of these computational processes are discussed.
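The record-linkage step described above can be illustrated with a minimal fuzzy-matching sketch using Python's standard-library difflib. This is a hedged toy example, not the project's actual pipeline; the function names, the similarity measure, and the 0.85 threshold are all hypothetical choices for illustration.

```python
# Toy fuzzy matching: link court-record names to newspaper-derived names
# by pairing each record with its most similar candidate above a threshold.
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Similarity score in [0, 1] between two normalized name strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def link_records(court_names, news_names, threshold=0.85):
    """Pair each court name with its best newspaper match above the threshold."""
    links = []
    for c in court_names:
        best = max(news_names, key=lambda n: name_similarity(c, n))
        if name_similarity(c, best) >= threshold:
            links.append((c, best))
    return links

# Small variations in spelling and punctuation still link correctly.
print(link_records(["John Q. Smith"], ["Jon Q Smith", "Jane Doe"]))
```

At the project's scale, pairwise comparison of every record would be too slow, so real fuzzy-matching pipelines typically add blocking (comparing only candidates that share, say, a county or a surname initial) before scoring.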
Statistical Signals Underlying Repeated Attempts That Lead to Success
Yian Yin, PhD candidate, Department of Industrial Engineering and Management Sciences; Robert R. McCormick School of Engineering and Applied Sciences
Success is often preceded by repeated attempts, but little is known about the statistical signatures underlying repeated attempts that eventually lead to success. Two important questions regarding failures have remained open: First, are there quantitative patterns governing the dynamics underlying repeated attempts that failed? Second, since repeated attempts lead to only two outcomes, success and non-success, are there early statistical signatures behind their dynamical patterns that distinguish the two cases? Here we explore simple models of repeated failures by assuming a finite number of past attempts to learn from. We find that there exist different regimes with fundamentally different characteristics as we tune the memory length, ranging from a random phase, where each attempt has a similar time cost and performance converges, to a cumulative phase, where time cost decreases as a power law of the number of past attempts and performance improves continuously, consistent with the universal learning curve investigated by a vast literature across multiple levels and disciplines. To validate our theory, we leverage millions of records from three datasets, ranging from scientists applying for research grants, to entrepreneurs working on startup ventures, to terrorist organizations launching armed attacks. All three systems support the predicted coexistence of different phases. Our findings uncover new kinds of statistical indicators within failures that can predict the onset of future success, which not only have practical implications for science but also improve our understanding of the predictive patterns underlying complex interconnected systems.
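The power-law decline in time cost described above can be sketched with a small, self-contained example: if the n-th attempt costs t_n = t_1 * n**(-gamma), the exponent gamma is recoverable as the negated slope of a least-squares fit in log-log space. The data below are synthetic and the value gamma = 0.5 is illustrative, not taken from the talk's datasets.

```python
# Hedged sketch of the power-law learning curve: recover the exponent
# gamma from per-attempt time costs via log-log least-squares regression.
import math

def fit_power_law_exponent(costs):
    """Estimate gamma, assuming cost of attempt n is proportional to n**(-gamma)."""
    xs = [math.log(n) for n in range(1, len(costs) + 1)]
    ys = [math.log(c) for c in costs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return -slope  # gamma is the negated log-log slope

# Synthetic attempt sequence with gamma = 0.5: costs fall as 10 / sqrt(n).
costs = [10.0 * n ** -0.5 for n in range(1, 21)]
print(round(fit_power_law_exponent(costs), 3))  # prints 0.5
```

On real attempt histories the fit is noisy, so distinguishing the cumulative phase (clear negative slope) from the random phase (slope near zero) early in a sequence is exactly the kind of statistical signal the abstract describes.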