Machine learning, the science that studies the design of algorithms that can learn, is advancing rapidly and becoming widespread in critical care medicine, given the large amounts of data routinely collected in intensive care units. Typical tasks are concept learning, function learning or “predictive modeling”, clustering and finding predictive patterns. These tasks are learned from available data, observed through experience or instruction.
The goal of this Data Science masterclass is to teach doctors and other health care professionals basic concepts and skills, and to give them tools for working more effectively with data. Moreover, an increasing number of papers in the literature describe AI/machine learning algorithms and prediction models, so clinicians and other healthcare providers must know the key concepts of Data Science to interpret results correctly.
The Data Science masterclass was a very interactive and practical course where participants had the opportunity to discover insights about large, rich and complex data sets, to find new ways to answer clinical questions using large datasets of electronic health records, to cooperate with specialists from different fields and to learn more about the potential of medical data, machine learning and predictive modelling to provide new insights and improve patient care.
To start familiarizing yourself with Clinical Data Science for Critical Care you need:
- a laptop
- to install R and RStudio
- to have or to sign up for a Google Docs account (optional)
- to download and install spreadsheet software
Moreover, you need to understand how files and folders (directories) are named on your computer. Unlike your usual habit of pointing and clicking to open something, in R you don’t work through a graphical user interface (GUI): you write instructions/scripts in the R terminal.
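As a minimal sketch of what working with files and folders looks like in the R terminal, the lines below use only base R functions. The folder name "data" and the file name "patients.csv" are hypothetical and chosen only for illustration.

```r
# Where is R currently looking for files? (the "working directory")
getwd()

# Build a path from a folder name and a file name.
# "data" and "patients.csv" are hypothetical names used for illustration.
csv_path <- file.path("data", "patients.csv")
csv_path

# Check whether the file exists before trying to read it.
file.exists(csv_path)

# List the files in the current folder.
list.files()
```

Building paths with file.path() rather than typing them by hand keeps the same script usable on Windows, macOS and Linux.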
What is R?
R is a free, cross-platform (UNIX platforms, Windows and macOS) software environment for statistical computing and graphics, well suited to data analysis. R is not based on a graphical user interface (GUI) but on scripts, so the learning curve might be steeper than with other software. However, working with scripts forces you to have a deeper understanding of what you are doing.
3 good reasons to use R:
- You can do anything in R
- Science should be reproducible
- You have a vast support network
People think R is hard because it has no graphical user interface (GUI): you have to describe the tasks you want the computer to complete in text, using the R language.
Building data pipelines is a core component of data science. A data pipeline is a set of actions that extracts data (or directly analytics and visualisation) from various sources and produces an output (tables, plots, manuscripts, presentations) by means of an R script.
After obtaining data from electronic health record databases, web servers, logs or online open-source repositories, you have your data in a spreadsheet. You write instructions/scripts in the R language and obtain an output: a table, a plot or an entire manuscript. You can change your data, or add new data, run the script again and instantly regenerate the output.
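The pipeline idea can be sketched in a few lines of base R. Everything in this example is an assumption made for illustration: the file names, the column names and the four made-up patients. In practice step 0 would be replaced by your own spreadsheet export.

```r
# Step 0 (illustration only): create a small made-up spreadsheet export.
input_file <- file.path(tempdir(), "icu_stays.csv")
writeLines(c(
  "patient_id,unit,length_of_stay_days",
  "1,medical,3",
  "2,surgical,7",
  "3,medical,5",
  "4,surgical,2"
), input_file)

# Step 1: read the data.
stays <- read.csv(input_file)

# Step 2: analyse the data, here the mean length of stay per unit.
summary_table <- aggregate(length_of_stay_days ~ unit, data = stays, FUN = mean)

# Step 3: produce the output, here a table saved as a new CSV file.
output_file <- file.path(tempdir(), "los_by_unit.csv")
write.csv(summary_table, output_file, row.names = FALSE)

summary_table
```

If a new patient is added to the input file, running the same script again regenerates the table without any manual recalculation.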
Data preparation is the combination of data cleaning and data modelling. To be described, plotted and tested, data must be tidy, following the rule that “Each column is a variable. Each row is an observation.” Data preparation includes renaming variables, extracting numbers and strings, parsing dates, converting columns to rows, and handling missing and duplicate values.
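The following sketch walks through these preparation steps with base R only. The table is invented for illustration (names, dates and values are not real data), and other tools, such as dedicated R packages, could be used for the same steps.

```r
# A small made-up, untidy table (all names and values are invented).
raw <- data.frame(
  `Patient ID` = c(1, 2, 2, 3),
  `Adm. date`  = c("2021-03-01", "2021-03-04", "2021-03-04", "2021-03-09"),
  Weight       = c("82 kg", "67 kg", "67 kg", NA),
  hr_day1      = c(88, 102, 102, 95),
  hr_day2      = c(80, 97, 97, NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# 1. Rename variables to short, consistent names.
names(raw) <- c("patient_id", "admission_date", "weight", "hr_day1", "hr_day2")

# 2. Extract numbers from strings ("82 kg" becomes 82).
raw$weight_kg <- as.numeric(sub(" kg", "", raw$weight, fixed = TRUE))
raw$weight <- NULL

# 3. Parse dates stored as text into real dates.
raw$admission_date <- as.Date(raw$admission_date, format = "%Y-%m-%d")

# 4. Remove duplicate rows (patient 2 was entered twice).
clean <- raw[!duplicated(raw), ]

# 5. Columns to rows: one row per patient per day.
long <- reshape(
  clean,
  direction = "long",
  varying   = c("hr_day1", "hr_day2"),
  v.names   = "heart_rate",
  timevar   = "day",
  times     = 1:2,
  idvar     = "patient_id"
)

# 6. Count missing values in each column.
colSums(is.na(long))
```

After step 5 each row is one observation (one patient on one day) and each column is one variable, which is the tidy layout described above.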
Types of data: not all data are equal. Aim for consistency in every column and never record more than one type in a column. The main types are integers, decimals, strings, datetimes, booleans and factors. Try to think like a computer.
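As a brief illustration, this is how the types listed above look in base R. The variable names and values are made up for the example.

```r
age        <- 64L              # integer
lactate    <- 2.7              # decimal ("numeric" in R)
diagnosis  <- "septic shock"   # string ("character" in R)
admitted   <- as.POSIXct("2021-03-01 14:30:00", tz = "UTC")  # datetime
ventilated <- TRUE             # boolean ("logical" in R)
unit       <- factor("medical", levels = c("medical", "surgical"))  # factor

# Ask R which type it has stored.
class(age)
class(lactate)
class(admitted)

# Mixing types in one column turns every value into text,
# so the numbers can no longer be averaged or plotted directly.
mixed <- c(2.7, "not measured", 3.1)
class(mixed)
```

This is why a column such as lactate should contain only numbers, with missing measurements left empty rather than described in words.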
Complex ideas must be communicated with clarity, precision and efficiency. This means storytelling, decluttering, avoiding misleading charts and pie chart horrors, scaling up and using colours rationally.
Visualisation is a fundamentally human activity. A good visualisation will show you things that you did not expect, or raise new questions about the data. A good visualisation might also indicate that you’re asking the wrong question or that you need to collect different data.
Models are complementary tools to visualisation. Once you have made your questions sufficiently precise, you can use a model to answer them. Machine learning algorithms are divided into three categories:
- supervised: model training, focused on predictive tasks (e.g. risk of death, readmission, length of stay, early deterioration, …);
- unsupervised: discovery of latent structure/subclasses in a dataset, useful to define subgroups and phenotypes;
- reinforcement learning: virtual agents take actions in an environment so as to maximize some notion of cumulative reward. This is the most immature branch of machine learning.
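The first two categories can be sketched with base R on simulated data. The patients, the variables and the relationship between them are invented for illustration, and logistic regression and k-means are only two of many possible algorithms. Reinforcement learning is not shown here.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated (made-up) data: 200 patients, two measurements and an outcome.
n       <- 200
age     <- rnorm(n, mean = 65, sd = 12)
lactate <- rexp(n, rate = 0.5)
# Assumed relationship between measurements and risk, for illustration only.
risk    <- plogis(-6 + 0.05 * age + 0.6 * lactate)
died    <- rbinom(n, size = 1, prob = risk)
patients <- data.frame(age, lactate, died)

# Supervised: logistic regression trained on a known outcome (died).
model <- glm(died ~ age + lactate, data = patients, family = binomial)
predicted_risk <- predict(model, type = "response")
summary(predicted_risk)

# Unsupervised: k-means looks for subgroups without using the outcome.
clusters <- kmeans(scale(patients[, c("age", "lactate")]), centers = 2)
table(clusters$cluster)
```

The supervised model needs the outcome column to learn from, while the clustering step sees only the measurements and proposes groups that still have to be interpreted clinically.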
The last step of data science is communication, a critical part because it doesn’t matter how well your models and visualisation have led you to understand the data unless you can also communicate your results to others.
Tips in case of error messages
To conclude, here is my infographic from the masterclass in Data Science, summarising the key concepts. Follow me on Twitter: Scquizzato Tommaso @tscquizzato.