Symbolic Regression

A practical guide for industry professionals and students.


Detailed description of genetic programming for symbolic regression

Variants of genetic programming methods; evolutionary operators: selection, crossover, mutation; rules of thumb for setting hyperparameters; diversity; premature convergence; ...

Describes advanced techniques in detail

Knowledge integration and semi-analytical models, differential equations, uncertainty quantification, parameter identification, ...

Model validation and selection

Visualization techniques for validation: intersection plots, partial dependence plots...; cross-validation; model selection criteria: AIC, BIC, minimum description length (MDL); pruning; variable relevance

Step-by-step examples

Includes many examples from a wide range of applications in science and engineering with results and guidelines for choosing hyperparameter values.

Book Blurb

Symbolic regression (SR) is one of the most powerful machine learning techniques for producing transparent models. It searches the space of mathematical expressions for a model that represents the relationship between the predictors and the dependent variable, without requiring prior assumptions about the model structure. Currently, the most prevalent learning algorithms for SR are based on genetic programming (GP), an evolutionary algorithm inspired by the well-known principles of natural selection. This book is an in-depth guide to GP for SR, discussing its advanced techniques as well as examples of applications in science and engineering.

The basic idea of GP is to evolve a population of solution candidates in an iterative, generational manner by repeatedly applying selection, crossover, mutation, and replacement, so that the model structure, coefficients, and input variables are searched simultaneously. Explainability and interpretability are key elements for integrating humans into the loop of learning in AI. Increasing the capacity of data scientists to understand internal algorithmic processes and the resulting models therefore benefits the learning process as a whole.
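The generational loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the book: the function set, the tuple-based tree encoding, the tournament size, the mutation rate, the size limit, and all names (`random_tree`, `evolve`, and so on) are hypothetical choices made for this example.

```python
import math
import operator
import random

# Assumed function set for illustration: binary arithmetic only.
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def random_tree(rng, depth):
    """Grow a random expression over the variable "x" and numeric constants."""
    if depth == 0 or rng.random() < 0.3:
        return "x" if rng.random() < 0.5 else round(rng.uniform(-2, 2), 2)
    op = rng.choice(sorted(FUNCTIONS))
    return (op, random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def evaluate(tree, x):
    """Evaluate a tree: "x", a constant, or (op, left, right)."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return FUNCTIONS[op](evaluate(left, x), evaluate(right, x))
    return x if tree == "x" else tree

def size(tree):
    """Number of nodes in the tree."""
    if isinstance(tree, tuple):
        return 1 + size(tree[1]) + size(tree[2])
    return 1

def mse(tree, data):
    """Mean squared error on (x, y) pairs; non-finite results count as inf."""
    try:
        err = sum((evaluate(tree, x) - y) ** 2 for x, y in data) / len(data)
    except OverflowError:
        return math.inf
    return err if math.isfinite(err) else math.inf

def paths(tree, path=()):
    """Yield the index path to every node of the tree."""
    yield path
    if isinstance(tree, tuple):
        yield from paths(tree[1], path + (1,))
        yield from paths(tree[2], path + (2,))

def get_at(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def replace_at(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    parts = list(tree)
    parts[path[0]] = replace_at(tree[path[0]], path[1:], new)
    return tuple(parts)

def crossover(rng, a, b):
    """Replace a random subtree of a with a random subtree of b."""
    donor = get_at(b, rng.choice(list(paths(b))))
    return replace_at(a, rng.choice(list(paths(a))), donor)

def mutate(rng, tree):
    """Replace a random subtree with a freshly grown one."""
    return replace_at(tree, rng.choice(list(paths(tree))), random_tree(rng, 2))

def tournament(rng, fitness, k=3):
    """Index of the best of k randomly drawn individuals (lower is better)."""
    return min(rng.sample(range(len(fitness)), k), key=lambda i: fitness[i])

def evolve(data, pop_size=100, generations=20, seed=0, max_size=40):
    rng = random.Random(seed)
    population = [random_tree(rng, 3) for _ in range(pop_size)]
    history = []  # best error at the start of each generation
    for _ in range(generations):
        fitness = [mse(t, data) for t in population]
        best = min(range(pop_size), key=lambda i: fitness[i])
        history.append(fitness[best])
        children = [population[best]]  # elitism: the best model survives
        while len(children) < pop_size:
            a = population[tournament(rng, fitness)]   # selection
            b = population[tournament(rng, fitness)]
            child = crossover(rng, a, b)               # crossover
            if rng.random() < 0.2:
                child = mutate(rng, child)             # mutation
            # Simple size limit against bloat: fall back to the parent.
            children.append(child if size(child) <= max_size else a)
        population = children                          # replacement
    return min(population, key=lambda t: mse(t, data)), history

# Usage: try to recover y = x^2 + x from 21 sample points.
data = [(i / 10, (i / 10) ** 2 + i / 10) for i in range(-10, 11)]
best, history = evolve(data)
print(best, mse(best, data))
```

Because the best individual is copied unchanged into the next generation, the recorded best error can never get worse from one generation to the next. Constants here are only changed by mutation; a practical system would typically also tune them numerically.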

This book is a practical guide for industry professionals and students across a range of disciplines, particularly data science, engineering, and applied mathematics. Focused on state-of-the-art SR methods and providing ready-to-use recipes, it is especially appealing to those working with empirical or semi-analytical models in science and engineering.

What's the content of the book?

  1. Introduction
  2. Basics of Supervised Learning
    1. Introduction
    2. Regression
    3. Classification
    4. Time Series Prediction
    5. Model Selection
    6. Cross-validation
    7. Further Reading
  3. Basics of Symbolic Regression
    1. Example: Identification of a Polynomial
    2. Example: Discovery of Laws of Physics from Data
    3. Example: Approximation of the Gamma Function
    4. Extending Symbolic Regression to Classification
    5. Further Reading
  4. Evolutionary Computation and Genetic Programming
    1. General Concepts
    2. Population Initialization
    3. Fitness Calculation
    4. Parent Selection
    5. Bloat and Introns
    6. Crossover and Mutation
    7. Power of the Hypothesis Space
    8. GP Dynamics
    9. Algorithmic Extensions
    10. Conclusions
    11. Further Reading
  5. Model Validation, Inspection, Simplification, and Selection
    1. Model Validation
    2. Model Selection
    3. Model Simplification
    4. Example: Boston Housing
    5. Conclusions
    6. Further Reading
  6. Advanced Techniques
    1. Integration of Knowledge
    2. Optimization of Coefficients
    3. Prediction Intervals
    4. Modeling System Dynamics
    5. Non-numeric Data
    6. Non-evolutionary Symbolic Regression
  7. Examples and Applications
    1. Yacht Hydrodynamics
    2. Industrial Chemical Processes
    3. Interatomic Potentials
    4. Friction
    5. Lithium-ion Batteries
    6. Biomedical Problems
    7. Function Approximation
    8. Atmospheric CO2 Concentration
    9. Flow Stress
    10. Dynamics of Simple Mechanical Systems
    11. Conclusions
  8. Conclusion
    1. Unique Selling Points of Symbolic Regression
    2. Limitations and Caveats
  9. Appendix
    1. Benchmarks
    2. Open-source Software for Genetic Programming
    3. Commercial Software for Genetic Programming

Datasets and Resources

Several datasets have been used for examples in the book. Downloads are available on the datasets page.

A list of recent software implementations and other resources can be found on the software page.


Gabriel Kronberger

Professor for Business Intelligence and Data Engineering

His research is focused on symbolic regression algorithms and their application in science and engineering. From 2018 until the end of 2022, he headed the Josef Ressel Center for Symbolic Regression in which he developed symbolic regression methods for semi-analytical modelling to improve interpretability, trustworthiness, and extrapolation capabilities.

Bogdan Burlacu

Professor for Data Analytics and Machine Learning

He has been an active researcher in the symbolic regression community for over a decade and his main area of expertise is the development of new symbolic regression algorithms and software.

Michael Kommenda

Consultant for Artificial Intelligence Solutions

He has authored several papers on symbolic regression and genetic programming and has organized the workshop on symbolic regression at the Genetic and Evolutionary Computation Conference (GECCO).

Stephan Winkler

Professor for Bioinformatics, Machine Learning, and Evolutionary Algorithms

He is head of the Department of Medical and Bioinformatics as well as the Bioinformatics Research Group. For more than 20 years, he has been an active researcher in genetic programming and symbolic regression. Stephan Winkler has published numerous articles and books on data science and bioinformatics, and is a member of the organizing team of the Genetic Programming in Theory and Practice Workshop (GPTP).

Michael Affenzeller

Professor for Heuristic Optimization and Machine Learning

He has published several papers, journal articles and books dealing with theoretical and practical aspects of evolutionary computation, genetic algorithms, and meta-heuristics in general. In 2001 he received his PhD in engineering sciences and in 2004 he received his habilitation in applied systems engineering, both from the Johannes Kepler University of Linz, Austria. He is head of the research group HEAL.