The World Health Organization provides a way to classify human diseases: the ICD10 Classification. About 16,000 diseases are defined and organized. This page provides a few way to visualize this information. A clean data frame providing the information is available here.
library(tidyverse)
library(rmarkdown) # You need this library to run this template.
library(epuRate) # Install with devtools: install_github("holtzy/epuRate", force=TRUE)
library(DT) # To display tables
# Treemap
library(treemap) # For the static version
library(d3treeR) # For the interactive version
# Trees
library(networkD3) # For radial network
library(collapsibleTree) # Tree collapsible
#Read the file
ICD=read.table("WHO_disease_classification.txt.gz", sep="\t", header=T, quote = "")
# I take only the level1. It has a parent node number = 0
level1 = ICD %>% filter(parent_id==0) %>% select(-selectable)
colnames(level1)=c("coding_L1", "meaning_L1", "node_L1", "parent_L1")
# Merge with the level2
level2 = ICD %>% filter(parent_id>0 & parent_id<=22) %>% select(-selectable)
colnames(level2)=c("coding_L2", "meaning_L2", "node_L2", "parent_L2")
all = merge(level1, level2, by.x="node_L1", by.y="parent_L2", all.x=T)
# Merge with the level3
level3 = ICD %>% filter(parent_id>22 & parent_id<=285) %>% select(-selectable)
colnames(level3)=c("coding_L3", "meaning_L3", "node_L3", "parent_L3")
all = merge(all, level3, by.x="node_L2", by.y="parent_L3", all.x=T)
# Merge with the level4
maxlevel4 = ICD$node_id[which(ICD$parent_id %in% level3$node_L3) ] %>% max()
level4 = ICD %>% filter(parent_id>285 & parent_id<=maxlevel4) %>% select(-selectable)
colnames(level4)=c("coding_L4", "meaning_L4", "node_L4", "parent_L4")
all = merge(all, level4, by.x="node_L3", by.y="parent_L4", all=T)
# Merge with the level5
# Do understand why some level5 have a node value very low. Example: node=4342 / code=I7000
level5 = ICD %>% mutate(nchar=nchar(as.character(coding))) %>% filter(nchar==5) %>% select(-selectable, -nchar)
colnames(level5)=c("coding_L5", "meaning_L5", "node_L5", "parent_L5")
all = merge(all, level5, by.x="node_L4", by.y="parent_L5", all.x=T)
# Just a small bug to remove for a few codes that have a X in their name
all = all[which(!is.na(all$node_L1)) , ]
# By hand, I create a table that describes the 22 major categories
mainCat=data.frame(
node=seq(1,22),
short=c("Infectious-Parasitic","Neoplasms","Blood-Immune","Nutritional","Mental","Nervous","Eye","Ear","Circulatory","Respiratory","Digestive","Skin","musculoskeletal", "Genitourinary","Childbirth","Perinatal","unclassified","Malformation","Injury","External-Causes","Factor-influencing","Special"),
long=c("Certain infectious and parasitic diseases","Neoplasms" ,"Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism" ,"Endocrine, nutritional and metabolic diseases" ,"Mental and behavioural disorders" ,"Diseases of the nervous system" ,"Diseases of the eye and adnexa" ,"Diseases of the ear and mastoid process" ,"Diseases of the circulatory system" ,"Diseases of the respiratory system" ,"Diseases of the digestive system" ,"Diseases of the skin and subcutaneous tissue" ,"Diseases of the musculoskeletal system and connective tissue" ,"Diseases of the genitourinary system" ,"Pregnancy, childbirth and the puerperium" ,"Certain conditions originating in the perinatal period" ,"Congenital malformations, deformations and chromosomal abnormalities" ,"Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified" ,"Injury, poisoning and certain other consequences of external causes" ,"External causes of morbidity and mortality" ,"Factors influencing health status and contact with health services" ,"Codes for special purposes")
)
# And I add these explicit names of highest groups (Level1)
all = merge(all, mainCat, by.x="node_L1", by.y="node", all.x=T)
# Order the table
all=all %>% arrange(node_L1, node_L2, node_L3, node_L4, node_L5)
# CHange name
ICD=all
# Save in 2 formats (R ans csv)
save(ICD, file="WHO_disease_classification_clean.R")
z <- gzfile("WHO_disease_classification_clean.csv.gz")
write.csv(ICD, z)
# Keep 3 levels
ICD_10 = ICD %>% select(short, meaning_L2, meaning_L3) %>% unique() %>% droplevels()
collapsibleTree(ICD_10, c("short", "meaning_L2", "meaning_L3"), width="2000px", height="800px", fontSize = 12, zoomable = FALSE)
tmp = ICD %>%
mutate(Chapt=gsub("Chapter ", "", coding_L1)) %>%
mutate(meaning_L1=gsub("Chapter","",meaning_L1)) %>%
mutate(code_L2=gsub("Block ","",coding_L2)) %>%
mutate(meaning_L2=gsub(".*? (.+)", "\\1", meaning_L2)) %>%
mutate(meaning_L3=gsub(".*? (.+)", "\\1", meaning_L3)) %>%
mutate(meaning_L4=gsub(".*? (.+)", "\\1", meaning_L4)) %>%
mutate(short_name=short) %>%
select(Chapt, short_name, code_L2, meaning_L2, coding_L3, meaning_L3, coding_L4, meaning_L4)
datatable(tmp, filter = 'top', rownames = FALSE )
# reformat data
ICD_10 = ICD %>%
select(short, meaning_L2, meaning_L3) %>%
unique() %>%
droplevels() %>%
mutate(value=1)
# basic treemap
static = treemap(ICD_10, index=c("short", "meaning_L2"), vSize="value", type="index", palette = "RdYlGn", draw=FALSE)
interactive=d3tree2( static, rootname = "The 22 groups of the ICD10 classification")
interactive
The ICD10 classification is divided in 22 major chapters. Here is the number of diseases repertoried in each of them.
ICD %>% mutate(meaning_L1=gsub("Chapter","",meaning_L1)) %>% group_by(meaning_L1) %>% count() %>% arrange(n) %>% ungroup() %>% mutate(meaning_L1=factor(meaning_L1,meaning_L1)) %>%
ggplot(aes(x=meaning_L1, y=n)) +
geom_segment( aes(x=meaning_L1, xend=meaning_L1, y=n, yend=0), color="skyblue", alpha=0.7) +
geom_point(size=4, color="orange") +
coord_flip() +
theme_light() +
theme( panel.grid.major.y = element_blank()) +
ylab("Number of diseases repertoried in the group") +
xlab("")
The ICD10 Classification has been downloaded on the WHO website as a text file in September 2017 (link).
The first step of this work was to modify the format to get something more confortable to work with. The resulting clean file is available here in case you would like to conduct further analysis.
There are 22 main categories, which are subdivided in 4 levels of grouping. The file giving the meaning of the ICD codes has been found here the 24/08/2017.
The following diagram tries to describe how the classification is organized. It shows the 5 levels of classification leading to the disease Cholera.
NODE 1: Infectious disease (Level 1) nodes= 1 - 22
-- NODE 23: A00-A09 Intestinal infectious diseases (Level 2) nodes= 23 - 285
-- NODE286: A00 Cholera (Level 3) nodes= 286 - 19154
-- NODE287 A00.0 Cholera due to Vibrio cholerae 01, biovar cholerae (Level 4)
-- NODE288 A00.1 Cholera due to Vibrio cholerae 01, biovar eltor (Level 4)
-- NODE2881 something eventually, but not always, what makes the classification even more complicated. (Level 5) always 5 characters in code
A work by Yan Holtz
Yan.holtz.data@gmail.com