Data to Diagnose
Please enter the 9 graded attributes of the breast Fine Needle Aspiration (FNA) to determine whether the breast mass is malignant or benign.
Tumor Diagnosis: -
Probability: -%

The algorithm provides the diagnosis with the following performance, assessed on 210 test cases:

  • Classification_error = 1.90 % (+/- 1.85)
  • 95% Confidence Interval of the Classification_error: [ 0.0006 , 0.0375 ], i.e. 0.06 % to 3.75 %
  • Accuracy_score = 98.10 %
  • Recall_score = 100.00 %
  • Precision_score = 94.37 %
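
The page does not say how the +/- margin and the 95% confidence interval were computed. As a minimal sketch, assuming the usual normal-approximation (Wald) interval for a proportion and assuming 4 misclassified cases out of 210 (the count implied by 1.90 %), the reported figures can be reproduced as follows. The function name is hypothetical.

```python
import math

def error_confidence_interval(n_errors, n_samples, z=1.96):
    """Normal-approximation (Wald) interval for a classification error rate.

    z=1.96 corresponds to a 95% confidence level.
    """
    error = n_errors / n_samples
    half_width = z * math.sqrt(error * (1.0 - error) / n_samples)
    low = max(0.0, error - half_width)
    high = min(1.0, error + half_width)
    return error, half_width, (low, high)

# Assumption: 4 errors out of 210 test cases, inferred from the reported 1.90 %.
error, half_width, (low, high) = error_confidence_interval(4, 210)
print(f"Classification_error = {error:.2%} (+/- {half_width:.2%})")
print(f"95% CI = [{low:.4f}, {high:.4f}]")
```

With so few errors the normal approximation is only rough near zero, which is why the lower bound sits so close to 0.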

2D Visualization through Principal Components Analysis (PCA)

This representation is the projection of the 9 normalized attributes of the 210 test cases and of the case to diagnose onto the first two principal components.
PCA thus reduces the data from 9 dimensions to 2. For this view, the algorithm used in the diagnosis step is refitted on the 2D-reduced training data.
The data visualization function then plots the case to be diagnosed in a 2D graph, together with the 210 labeled test cases that the algorithm has not seen. It also plots the algorithm's decision boundary over the test data.
The algorithm classifies the 2D data with the following performance, assessed on the 210 test cases:
  • 2D Classification_error = 2.86 % (+/- 2.25)
  • 95% Confidence Interval of the 2D Classification_error: [ 0.0060 , 0.0511 ], i.e. 0.60 % to 5.11 %
  • Accuracy_score = 97.14 %
  • Recall_score = 98.51 %
  • Precision_score = 92.96 %
The algorithm's decision boundary divides the plane into two areas, shown in two different colors:
  • The gray area corresponds to benign cases.
  • The blue area corresponds to malignant cases.
CAVEATS: Please note that the algorithm performs better on the original data than on the reduced data. If the 2D graphical representation gives a different result from the DIAGNOSIS computed on the original data, please rely on the DIAGNOSIS.
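
The 9D-to-2D projection described in this section can be sketched as follows. This covers only the projection step, not the classifier or the boundary plot. It is an illustration under several assumptions: "normalized" is taken to mean z-score standardization, the principal components are found by plain power iteration rather than by whatever library the application actually uses, the grades are synthetic, and all function names are hypothetical.

```python
import math
import random

def standardize(rows):
    """Scale each column to zero mean and unit variance (assumed normalization)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def covariance(rows):
    """Covariance matrix of rows that are already centered."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)] for i in range(d)]

def principal_components(cov, k=2, iterations=500):
    """Top-k eigenvectors of cov by power iteration with Gram-Schmidt steps."""
    d = len(cov)
    components = []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(d)]  # fixed, non-symmetric start vector
        for _ in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            for c in components:  # remove the directions already found
                dot = sum(a * b for a, b in zip(w, c))
                w = [a - dot * b for a, b in zip(w, c)]
            norm = math.sqrt(sum(a * a for a in w))
            v = [a / norm for a in w]
        components.append(v)
    return components

def project(rows, components):
    """Coordinates of each row on the given components."""
    return [[sum(a * b for a, b in zip(r, c)) for c in components] for r in rows]

# Synthetic stand-in for graded FNA attributes (NOT the real dataset):
# 60 low-grade rows followed by 30 high-grade rows, 9 grades each.
rng = random.Random(0)
benign = [[rng.randint(1, 4) for _ in range(9)] for _ in range(60)]
malignant = [[rng.randint(5, 10) for _ in range(9)] for _ in range(30)]

scaled = standardize(benign + malignant)
components = principal_components(covariance(scaled), k=2)
points_2d = project(scaled, components)
print(points_2d[0], points_2d[-1])
```

In the application itself, the scaling and the components would be learned from the training data and then applied unchanged to the test cases and to the case to diagnose.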

Summary and Caveats

This application is intended for pathologists who grade the following 9 features of a “Fine Needle Aspiration” in accordance with the Wisconsin dataset's grading scale(1) to determine whether a breast mass is benign or malignant:

  • Clump Thickness
  • Uniformity of Cell Size
  • Uniformity of Cell Shape
  • Marginal Adhesion
  • Single Epithelial Cell Size
  • Bare Nuclei
  • Bland Chromatin
  • Normal Nucleoli
  • Mitoses

After rigorous tuning and training cycles of 8 models on 489 training cases randomly selected from the available dataset, the Random Forest algorithm turned out to be the best model. It correctly classifies 98.1% of the 210 test cases and identifies 100% of the cancerous tumors among them.

In other words, the algorithm has a classification error of 1.9% (+/-1.85), which is the ratio of incorrect predictions to the total number of predictions made. Moreover, the algorithm does not miss any cancerous tumor in the test set, so it is 100% sensitive to malignancy (recall score is 100%).

These performance results were obtained on 210 test cases (143 benign tumors / 67 malignant tumors) that the algorithm never saw during training.
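
To make the relationship between these scores concrete, here is a small sketch that recomputes them from confusion-matrix counts. The counts are not given on this page. They are inferred from the reported figures: a recall of 100% on 67 malignant cases gives 67 true positives and 0 false negatives, and a precision of 94.37% then implies 4 false positives, leaving 139 true negatives out of 143 benign cases. The function name is hypothetical.

```python
def classification_metrics(tp, fn, fp, tn):
    """Standard binary classification scores, with malignant as the positive class."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "classification_error": (fp + fn) / total,
        "recall": tp / (tp + fn),      # share of malignant cases that are detected
        "precision": tp / (tp + fp),   # share of malignant predictions that are right
    }

# Inferred counts (assumption): 67 TP, 0 FN, 4 FP, 139 TN.
metrics = classification_metrics(tp=67, fn=0, fp=4, tn=139)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Under these counts, every error is a benign mass flagged as malignant, which is the less harmful kind of mistake for a screening aid.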

The test data is a sample that was randomly selected and removed from the available data, so that it is not used during model selection or configuration.

Lastly, please note that tuning and training were performed with cross-validation (StratifiedKFold).
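
For readers unfamiliar with stratified cross-validation, the idea is that each fold keeps roughly the same benign/malignant ratio as the whole training set. The sketch below is a standard-library stand-in for that idea, not the library routine used by the application. The number of folds (5) is an assumption. The 315 benign / 174 malignant training labels are also an assumption: they come from the commonly cited 458 / 241 class split of the original dataset, which is not stated on this page, minus the 143 / 67 test cases.

```python
import random
from collections import defaultdict

def stratified_k_fold(labels, n_splits=5, seed=0):
    """Yield (train_indices, validation_indices) with class ratios preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets a near-equal share.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)

    for k in range(n_splits):
        validation = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, validation

# Assumed training labels: 315 benign and 174 malignant (489 cases in total).
labels = ["benign"] * 315 + ["malignant"] * 174
folds = list(stratified_k_fold(labels, n_splits=5, seed=0))
for train, validation in folds:
    n_malignant = sum(labels[i] == "malignant" for i in validation)
    print(len(train), len(validation), n_malignant)
```

Each candidate model would be trained on the `train` indices and scored on the `validation` indices of every fold, while the 210 test cases stay untouched until the final evaluation.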

1 Wolberg, W.H., & Mangasarian, O.L. (1990). Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proceedings of the National Academy of Sciences, 87, 9193-9196.
Dr. William H. Wolberg (physician), University of Wisconsin Hospitals, Madison, Wisconsin, USA (1992-07-15). Breast Cancer Wisconsin (Original) Data Set.


In the early 1990s, Professors William H. Wolberg and Olvi L. Mangasarian at the University of Wisconsin published a dataset of nearly 700 breast mass samples.
These masses had been sampled by fine needle aspiration (FNA).
Nine cytological characteristics of the breast FNAs were graded on a scale of 1 to 10, with 1 being the closest to benign and 10 the most anaplastic.
This data was then published in the University of California Irvine’s Machine Learning Repository as public domain.
I am grateful for access to this data, as it provided my algorithm with training and testing data.
The data was also very appropriate for the classification task.

I would also like to acknowledge Brittany Wenger for her contribution in this domain.
In 2012, based on the Wisconsin dataset, Brittany Wenger provided a service built on a neural network algorithm.

I thank Axel Tessier for his great contribution to the web user interface of my application and for making a secure server available for it.

Finally, I would like to thank my family for their continuous support throughout this project.


The contents of the Site, such as text, graphics, images, figures and other material contained on the Site, are for informational purposes only. The content is not intended to be a substitute for professional medical advice, diagnosis, or treatment. Always seek the advice of your physician or other qualified health provider with any questions you may have regarding a medical condition. Never disregard professional medical advice or delay in seeking it because of something you have read on the Site. Reliance on any information provided by the Site is solely at your own risk.