Book of Abstracts
The book of abstracts is available in PDF format here.
- Nico Beerenwinkel, ETH Zürich, Switzerland
Title: "Analyzing molecular tumor profiles for precision oncology"
Abstract: Molecular profiling of tumor biopsies plays an increasingly important role not only in cancer research, but also in the clinical management of cancer patients. Multi-omics approaches hold the promise of improving diagnostics, prognostics, and personalized treatment. To deliver on this promise of precision oncology, appropriate bioinformatics methods for managing, integrating and analyzing large and complex data are necessary. I will discuss some of these computational challenges in the context of the molecular tumor board and the specific bioinformatics support that it requires, from the primary analysis of raw molecular profiling data to the automatic generation of a clinical report and its delivery to decision-making clinical oncologists.
- James Berger, Duke University, USA
Title: "Gaussian process emulation of computer models with massive output"
Abstract: Often computer models yield massive output; e.g., a weather model will yield the predicted temperature over a huge grid of points in space and time. Emulation of a computer model is the process of finding an approximation to the computer model that is much faster to run than the computer model itself (which can often take hours or days for a single run). Many successful emulation approaches are statistical in nature, but these have only rarely attempted to deal with massive computer model output; some approaches that have been tried include utilization of multivariate emulators, modeling of the output (e.g., through some basis representation, including PCA), and construction of parallel emulators at each grid point, with the methodology typically based on use of Gaussian processes to construct the approximations. These approaches will be reviewed, with the startling computational simplicity with which the last approach can be implemented being highlighted and its remarkable success being illustrated and explained; in particular the surprising fact that one can ignore spatial structure in the massive output is explained. All results will be illustrated with a computer model of volcanic pyroclastic flow, the goal being the prediction of hazard probabilities near active volcanoes.
- Andreas Christmann, University of Bayreuth, Germany
Title: "Kernel-based methods in machine learning"
Abstract: Machine learning and big data analysis play an important role in current research in statistics, computer science, computational biology, engineering, and in many other research areas. Although there exist a huge number of different approaches in machine learning, almost all of them have two goals in common: universal consistency and applicability in real life situations where the knowledge on the data generating process and on the data quality is limited or non-existent. In the first part of the talk, I will give a short overview of some currently used machine learning approaches, with special emphasis on kernel based methods. In the second part, some recent results on universal consistency, learning rates, statistical robustness, and stability of kernel based methods will be given. Some results will also be given for the bootstrap approximations of such kernel based methods.
- Ines Couso, University of Oviedo, Spain
Title: "Maximum likelihood estimation from coarse data: what do we maximise?"
Abstract: The term "coarse data" encompasses different types of incomplete data where the (partial) information about the outcomes of a random experiment can be expressed in terms of subsets of the sample space. Maximum likelihood estimation (MLE) is a point-valued estimation procedure widely used in different areas of Statistics and Data Analysis. It searches for the parameter/vector of parameters that maximises the probability of occurrence of the dataset, and is asymptotically optimal under some conditions. We present and compare several extensions of this procedure to the case of coarse data, independently proposed by different authors during the last decades. We highlight the importance of modelling the so-called "coarsening process" that transforms the true outcome of the experiment into an incomplete observation and show that it can be modelled by means of a family of conditional probability distributions. We show some specific areas of statistics where such a process is very often overlooked, and provide examples illustrating how ignoring this coarsening process may produce misleading estimations. We discuss the conditions under which it can be safely ignored.
- Iryna Gurevych, TU Darmstadt, Germany
Title: "Disentangling the thoughts: Latest news in computational argumentation"
Abstract: In this talk, I will present a series of papers on argument mining (co-)authored by the UKP Lab in Darmstadt. The papers have appeared in NAACL, TACL and related venues in 2018. In the first part, I will talk about large-scale argument search, classification and reasoning. In the second part, the focus will be on mitigating high annotation costs for argument annotation. Specifically, we tackle small-data scenarios for novel argument tasks, less-resourced languages or web-scale argument analysis tasks such as detecting fallacies. The talk presents the results of ongoing projects in Computational Argumentation at the Technische Universität Darmstadt: Argumentation Analysis for the Web (ArguAna), Decision Support by Means of Automatically Extracting Natural Language Arguments from Big Data (ArgumenText).
- Barbara Hammer, Bielefeld University, Germany
Title: "Transfer learning and learning with concept drift"
Abstract: One of the main assumptions of classical machine learning is that data are generated by a stationary concept. This, however, is violated in practical applications e.g. in the context of life long learning, for the task of system personalisation, or whenever sensor degradation or non-stationary environments cause a fundamental change of the observed signals. Within the talk, we will give an overview of recent developments in the field of learning with concept drift, and we will address two particular challenges in more detail: (1) How to cope with a fundamental change of the data representation which is caused e.g. by a misplacement or exchange of sensors? (2) How to deal with drifting concepts which change either rapidly or smoothly over time, e.g. caused by a non-stationary environment? We will present novel intuitive distance-based classification approaches which can tackle such settings by means of suitable metric learning and brain-inspired adaptive memory concepts, respectively, and we will demonstrate their performance in different application domains ranging from computer vision to the control of prostheses.
- Johannes Hartig, DIPF, Frankfurt, Germany
Title: "Analysis of data from educational achievement tests with generalized linear mixed models"
Abstract: Generalized linear mixed models (GLMM) allow modeling clustered non-normal data with both fixed and random effects. GLMMs include specific models that are traditionally taught as separate topics in educational science and psychology. For instance, hierarchical linear models (HLMs) and many models from item response theory (IRT) are special cases of GLMMs. Thus, GLMMs provide a flexible framework to combine separate lines of research and address new research questions. The talk will illustrate the potential of GLMMs for educational research with empirical applications in the analysis of data from educational assessments. This data has a hierarchical structure with item responses nested in students nested in classrooms and / or schools. Research questions can focus on variables on all levels. Examples include effects of item level predictors, effects of the item position within an assessment, and effects of school level predictors. Challenges regarding the analysis of data sets from international large scale assessments will be discussed.
- Luc de Raedt, KU Leuven, Belgium
Title: "Can we automate data science?"
Abstract: Inspired by recent successes towards automating highly complex jobs like automatic programming and scientific experimentation, I want to automate the task of the data scientist when developing intelligent systems. In this talk, I shall introduce some of the challenges involved and some possible approaches and tools for automating data science.
More specifically, I shall discuss how automated data wrangling approaches can be used for pre-processing and how both predictive and descriptive models can in principle be combined to automatically complete spreadsheets and relational databases. Special attention will be given towards the induction of constraints in spreadsheets and in an operations research context.
- Roberto Rocci, University of Rome Tor Vergata, Italy
Title: "Finite mixtures for simultaneous clustering and reduction of matrix value observations"
Abstract: Finite mixture models are often used to classify two-way (units by variables) data. However, two issues arise: model complexity and recovering the true clustering structure. Indeed, a huge number of variables and/or occasions implies a large number of model parameters, while the existence of noise dimensions could mask the true cluster structure. The approach adopted in the present work is to reduce the number of model parameters by identifying a sub-space containing the information needed to classify the observations nesting a PCA-like reparameterization into the classification model. This also helps in identifying and discarding noise dimensions that could coincide with some observed variables. The aforementioned problems become more serious in the case of three-way (units by variables by occasions) data where the observations are matrices rather than vectors. Our approach is extended to this case by nesting a three-way PCA-like reparameterization (named Tucker2) into the classification model. This allows us to reduce the number of parameters identifying noise dimensions for the variables and/or the occasions. The effectiveness of the proposals is assessed through a simulation study and an application to real data.
- Advances in Recursive Partitioning and Related Methods
Organization: Claudio Conversano
Recursive Partitioning Methods (RPM) are a computer-intensive data-mining tool originally designed for analyzing vast databases characterized by mixed (numerical and categorical) variables and nonlinear relationships. Searching for unknown patterns, predicting the distribution of homogeneous sub-groups of outcomes and identifying factors that distinguish particular sub-groups are among the main features of RPM that have made these methods increasingly popular in Data Science applications. The track is focused on specific issues related to the use of RPM, such as: model-based RPM, stability and visualization of RPM outcomes, boosted RPM and use of RPM in semi-supervised clustering.
- Algorithm Selection/Configuration and Machine Learning
In computer science, certain problem classes (such as integer optimization, SAT, classification, clustering, etc.) can normally be tackled by a large variety of algorithms, the performance of which may differ depending on the concrete problem instance at hand. Algorithm selection and configuration (ASC) seeks to support and partly automate the selection of an algorithm that is most suitable for a given problem instance, to set the parameters of that algorithm in the most appropriate manner, and perhaps to combine several algorithms into a complex solution. This session is meant to explore the intersection of ASC and machine learning (ML), and invites contributions that combine ASC with ML in either direction. That is, we welcome work that uses ML techniques to improve ASC as well as ASC techniques to improve ML, or even the combination of both. This includes but is not limited to work on automated machine learning (Auto-ML). We appreciate both theoretical and practical contributions.
- Applications in Digital Humanities
Organization: Michaela Geierhos
In this special session, we would like to discuss possible applications in digital humanities that go beyond theoretical questions. What do Digital Humanities offer for practice? Which tools and projects exist; what should be considered when dealing with data analysis focused on an interdisciplinary research question? The presentation of projects and software tools, their development history and the challenges faced helps to figure out new aspects of the practical side of digital humanities.
- Big Data and Complex Network Analytics
Organization: Martin Atzmüller
This special session invites contributions in the field of Big Data and Complex Network Analytics, with a special emphasis on data analysis methods for handling large and complex graph data such as knowledge graphs, as well as computational methods for analyzing complex network structures. The session aims to present and promote novel ideas from the data analytics as well as the complex networks community in order to stimulate exchange between these communities.
- Bioinformatics and Biostatistics
Organization: Dominik Heider
This special session invites contributions from all aspects of biostatistics and bioinformatics, with a special emphasis on machine learning and statistical learning for biomedical problems, ranging from biotechnological applications to medical diagnostics and prognostics. We further encourage work describing novel methods for preprocessing of biomedical data, e.g., feature selection methods.
- Comparison and Benchmarking of Cluster Analysis Methods
Organization: Christian Hennig
There is an already huge and still growing variety of cluster analysis methods. The situation cries out for systematic comparison and benchmarking of such methods using simulations and benchmark datasets. Some such comparisons exist, but they are mostly rather patchy and unsystematic. Benchmarking cluster analysis methods is substantially more difficult than benchmarking supervised classification. In cluster analysis, usually there isn't just a single true grouping; rather, different aims of clustering may lead to different clusterings on the same dataset that could be optimal according to different criteria. Benchmarking cluster analysis needs to take this into account. The track encourages researchers to share their work, experiences and thoughts on the systematic evaluation of clustering methods.
- Computational Social Science
Organization: Henning Wachsmuth
Computational social science investigates research questions from the social sciences through empirical data analyses, drawing on methods from natural language processing and information retrieval to machine learning and data mining. Input data includes social media text, social network structures, online activities, socio-cultural key indicators, and time series. The focus is on insights into social phenomena and dynamics rather than the technologies behind them, raising a particular need for output interpretation and visualization. The aim of the session is to present approaches related to computational social science as well as to discuss the benefits and limitations of data analysis of societal developments.
- Consumer Preferences and Marketing Analytics
This track invites methodological, theoretical or empirical papers which aim to contribute to the general understanding of consumer preferences. The focus is laid on tools/methods like conjoint analysis or discrete choice analysis in a marketing context that aim at informing and improving management decisions. However, we also encourage work that deals with further quantitative techniques to extract preference information from consumer data.
- Data Analysis Models in Economics and Business
Organization: Jozef Pociecha
This session is proposed for those who are dealing with applications of data analysis and classification models, machine learning procedures, multivariate time series and other multivariate methods in various areas of economic and business research. We welcome examples of new approaches to such empirical investigations, useful both for analysts and practitioners.
- Data Analysis in Finance
Organization: Krzysztof Jajuga
Finance is one of the most frequently explored areas where data analysis methods, including classification methods, are successfully applied. There is a plethora of different approaches to the analysis of financial data, including stochastic approaches and purely descriptive approaches. The proposed session covers a wide array of possible topics, including: Big Data in Finance, Time Series Analysis of Financial Market Data, Text Mining of Financial Databases, Financial Risk Analysis Methods, Analysis of Digital Assets.
- Data Science for Mental Health
Organization: Fionn Murtagh, Mohsen Farid
Mental health and mental well-being are the main themes of this session. Also relevant are poor health resulting from dementia and Alzheimer's disease, and the consequences and repercussions of the state of mental health. Developments in methodology will be relevant, especially new developments relating to Big Data analytics. Data Science has, and will have, much to offer for health and well-being, for psychoanalysis, for mental capital in the social sciences, and for cognitive science and the neurosciences.
- Dimension Reduction and Visualisation for Classification
Organization: Niel le Roux
Visualization of multidimensional data requires some form of data reduction technique. These low-dimensional visualizations are useful in supervised and unsupervised learning situations. They are invaluable tools for revealing cluster structure, selecting an appropriate model, and quantifying separation/overlap of predefined groups. In general, more than a single visualization is needed for a particular data set. Attention is focused on recently developed methods like computationally efficient procedures for finding optimal hyperplane separators; the δ-machine, which uses dissimilarity measures for classification; correlation-based distance measures; and component analysis techniques similar to canonical correlation analysis for quantifying neural reliability. In addition, the visualization of incomplete data, categorical and continuous, is considered using generalized orthogonal Procrustes analysis and correspondence analysis related biplots, as well as summarizing multivariate binary data using latent variables.
- European Association for Data Science (EuADS) Symposium on Data Science Education
Organization: Rolf Biehler, Reinhold Decker, Peter Flach, Berthold Lausen, Carsten Schulte
Data is the lifeblood of our society and digitalisation is at the centre of recent industrial developments and government policies. Accordingly, data science includes the extraction, preparation, exploration, transformation, storage and sharing of structured and unstructured data and aims at turning data into knowledge and value. We experience that big data, data science and data science education are emerging and important fields.
This session initiates a first exchange on experiences, best practices and research on teaching and learning data science. We expect contributions from tertiary education but also from emerging experimental projects or courses at secondary level. We are thus open to new perspectives, ideas, experiences and pedagogical foundations. The shared focus of these perspectives should be the fact that in many places, new study programs are emerging.
We want to discuss these programs and provide a forum to exchange experiences in teaching and learning data science, both at schools and universities, with their many implications, e.g. aims and goals, content, methods, best practices, and curriculum design. The forum also addresses the question of how to use data science methods to evaluate the success of such programs, but it is open to all (quantitative and qualitative) methods of educational research into teaching and learning practice. What does the data say about best practices, learning issues like misconceptions, recruitment and retention? Who are the students that enter a data science program? What do employers expect as the skill set of data scientists?
The session covers: transition between school and university, recruiting and maintaining students, outreach programs, linking school and university (e.g. to foster curriculum developments at school), datacy (statistical literacy, data science literacy), developing skills, examples for good data projects as part of the curriculum, the role of the concept of programming, pedagogies for learning in data science, the role of tools for learning, balancing math and computer science.
- Interpretable Machine Learning
Organization: Eneldo Loza Mencia, Johannes Fürnkranz
The interest in machine learning has greatly increased recently, in great measure due to impressive advances in some fields achieved by complex methods like deep learning. Many of these advances are about to make their way into society soon. This has renewed the demand for models that are not only accurate but also interpretable. Interpretable models allow the practitioner to obtain important insights about the task at hand. Moreover, ensuring model interpretability is of central importance in application domains where trust in the system is essential for its acceptance and where malfunctioning may result in legal liability.
In this special session, we therefore solicit contributions that (i) discuss and evaluate the interpretability of various types of machine learning models, (ii) use representations which are understandable by experts and even non-experts, (iii) introduce models that can be inspected, verified, and possibly also modified by non-expert users, (iv) offer explanations or visualizations of their decisions, (v) develop methods for interpretable learning in complex domains like multi-target learning or structured output prediction.
- Machine Learning and Optimization
Organization: Kevin Tierney
Optimization techniques are widely used in the area of operations research (OR) to provide decision support in industry on a wide range of computationally difficult problems. In general, these techniques require a domain or OR expert to provide problem-specific heuristics to ensure good performance of the approaches either in terms of solution quality or run time (or both). A goal of combining machine learning (ML) and optimization techniques is to reduce the dependency of these methods on an expert through the use of ML approaches. ML can be used, for example, to determine branching during a search, estimate costs, or influence a solution procedure in other ways. In this session, we are soliciting contributions for new (meta-)heuristics and exact methods that utilize learning techniques (broadly interpreted) in some way as part of an optimization procedure. The session covers methods as well as applications.
- Machine Learning for Dynamic Systems
Organization: Volker Lohweg, Oliver Niggemann
Dynamic systems such as production systems face various situations where Machine Learning is decisive. By applying Machine Learning solutions, production systems will become more adaptable, more resource efficient and more user friendly. But dynamic systems have specific requirements, e.g. they have to deal with state-based behavior, high data dimensions and the integration of a-priori knowledge such as physical laws or plant structures. This track invites theoretical, methodological and experimental papers on frontiers of Machine Learning for dynamic systems, e.g. for application scenarios such as preventive maintenance, optimization, soft sensing and information fusion.
- Mining Streaming and Time-Evolving Data
Organization: Barbara Hammer, Georg Krempl, Jerzy Stefanowski
Recent years have seen a steep increase in the availability of data, which is often generated sequentially and by non-stationary processes. Mining such streaming and time-evolving data often requires to consider the ordering and temporal context of instances, which has been the subject of different lines of research: In data stream mining, the focus of analysis is mostly on the recently observed instances, and instances have a limited life cycle due to adaptation and forgetting mechanisms. In time series analysis, on the other hand, the internal structure of data points taken over time is considered. While the focus of this special session is on the problem of data stream mining, its aim is to present novel ideas from both lines of research that stimulate exchange between their communities.
- Multimodal Data and Cross-modal Relations: Analytics and Search
Organization: Ralph Ewerth
Multimodal information is omnipresent, for example, in the web: web pages contain multimodal information, other multimodal information sources comprise videos, online news, educational material, scientific talks and papers, etc. Besides multimodal web search, there are other application domains that deal with multimodal data, e.g.: human computer interaction, digital humanities (graphic novels, audiovisual data), or sports analytics. Overall, from a computational perspective it is difficult to adequately make use of multimodal information and cross-modal relations. One reason is that the automatic understanding and interpretation of textual, visual or audio sources itself is difficult – and it is even more difficult to model and understand the interplay of different modalities. While, for instance, the visual/verbal divide has been investigated in the communications sciences for years, it is in its infancy from an information retrieval perspective. But if we want to build powerful search engines, analytics tools, and recommender systems, we need to research and develop algorithms that are able to understand relations and the interplay of multimodal information – which is a very challenging task.
- Recent Developments in Longitudinal Data Analysis in Psychology
Organization: Casper Albers
With the recent advancement of technologies such as smartphones and other smart devices, more and more often psychological research is conducted based on intensive longitudinal data. In these data, individuals fill in items about their experiences and affect in daily life. Such data have higher ecological validity than data from retrospective studies. These rich data are then analyzed to study the dynamics of, for example, emotion and psychopathology. One object of study is temporal dynamics. When studying temporal dynamics, the focus is not on detecting a gross underlying trend, as is often the case in developmental research, but rather on the intricate temporal dependence of and between variables such as emotions, or how variables within an individual influence each other or themselves over time. Although these studies are rising in popularity, techniques to study the dynamics are lagging behind. Because psychological time series are quite often of a different nature than, e.g., economic time series, models from other fields of science cannot be imported without adaptation. The aim of this special session is to bring together scientists working on the methodology behind psychological longitudinal data and to discuss state-of-the-art methods to analyse such data.
- Statistical Aspects of Machine Learning Methods
Organization: Florian Dumpert
We invite contributions which are concerned with statistical aspects of machine learning methods from a theoretical point of view like, e.g., consistency, theoretical learning rates, (asymptotic) distributions of predictors, theoretical confidence bands or bounds for misclassification rates, robustness properties, theoretical results given by uncertainty modelling, statistical properties of regularization etc.
- Statistical Learning with Imprecision
Organization: Thomas Augustin, Sebastien Destercke
In this session, we are interested in learning situations where imprecision, or indecision, plays a key role in the data analysis process. This may be due to partial data whose missingness process is atypical, to the need to provide robust conclusions in case of partially specified probabilities, etc. We welcome contributions of theoretical, methodological and applied nature in which imprecision is processed explicitly. This imprecision may occur in the data themselves, in the model or in the predictions produced by the model.
- Statistical Visualization for Data Science
Organization: Koji Kurihara, Adalbert Wilhelm
Visualising data and information is a key component of data science and an integral part of merging domain expertise with data-driven knowledge. It is a primary goal of data visualisation to provide a communication channel both within the data science team and also between data analysts, domain experts and a more general interested audience. Data visualisation also aims at stimulating users to further engage with the data and to generate further analysis approaches. In this regard, visualisation is not a goal in itself, but a comprehensive linkage tool within the data analysis cycle. There are in particular three interfaces to be considered: linking the visualisation to the data structures, in particular in the realm of Big Data with its characteristic features of large volume, high velocity and variety; linking the visualisation to the analysis method; and linking the visualisation to the domain knowledge. In this regard, visualisation can be considered the key interface between all stakeholders in the data analysis process. In this session we would like to bring together practitioners and researchers from computer science, statistics and application domains to discuss recent trends in this field.
- Time Series Analysis and Online Algorithms
Organization: Wolfgang Konen
With the increase of data streams in economics and industry, the data science of time series becomes more and more important. In this special session we would like to deal with new developments in time series analysis, with a special focus on online algorithms, that is algorithms which have the ability to adapt while the time series are being processed. Tasks, topics and application areas include, but are not limited to: Time series classification, prediction, motif discovery, anomaly detection, predictive maintenance, online algorithms for the preceding tasks, multivariate time series, internet of things.
- Web Science
Organization: Axel Ngonga
Since its creation as a means of accessing distributed information, the Web has evolved into arguably one of the most important sources of knowledge. It is hence used across a multitude of applications for a plethora of different goals. However, the sheer volume and velocity of changes on the Web make it difficult to analyze when relying on classical algorithms. We invite submissions on all data-driven aspects of the Web, including (but not limited to) the extraction, integration, fusion, storage, querying and ranking of data, information and knowledge on the Web or at Web scale. Related applications and systems are also welcome.