Welcome

To deploy the app on command line or to create a URL using Heroku, access our GitHub repository here

About

These tools aim to provide a reproducible and consistent data visualisation platform where experimental and computational researchers can use random forest and statistical analysis to find the best materials for specific applications. Random forest is a supervised machine learning model that can be used to perform regression tasks. The model learns to map data (features or descriptors) by constructing a multitude of decision trees to an output (target variable) in the training phase of the model. Hyperparameter tuning is also used to optimise and tune the algorithm. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees that predicts the target variable more accurately than a single decision tree.

Figure 1: Screenshot of tool in use

Data File Requirements

The data file to upload must meet the following requirements:
1. A .xlsx, .csv or a .txt data file must be uploaded. Please note that large .xlsx files take time to process so .csv or .txt files are preferred. For .txt files the application only accepts comma text separators.
2. The uploaded datasheet must have the structure name or identifier on its first column (shown in table 1).
3. The uploaded datasheet must be completely populated (no blank cells). Blank cells can be replaced to ‘0’ using the ‘Replace All’ function in Excel. No symbols should be used in the spreadsheet.
4. Data must be tidy data. If required, data must be transposed so that there is a single column stating the variables simulated with a column for the desired animation frame containing discrete numerical values (e.g. pressures). Example data files before e.g. AAML_Oxygen_Raw_Data.csv and after transposition AAML_Oxygen_Data.csv can be found here

Table 1: Example File Upload

DDEC code	Family	Pressure	Grav. Uptake (mol/kg)	Vol. Uptake (cm3/cm3)	Qst (kJ/mol)	…
AFIXAO	ZIF	1	0.3379	8.4985	15.6505	…
ABUWOJ	ZnO	1	0.5687	14.7654	16.5644	…
AVAQIX	None	1	0.8956	26.2387	18.2041	…
HOWPUF	None	1	0.3486	7.5929	19.7446	…
HOWQAM	None	5	0.5020	11.4214	15.7436	…
HOWQEQ	None	5	0	17.5332	16.8026	…
HOXKUB	None	5	0.4167	16.0261	20.7236	…
…	…	…	…	…	…	…

The tab of the app will fade when the app is computing a user input. It will return to its original color once it has completed the user's input (Figure 2). In addition to the faded tab, the dashboard browser tab will show “Updating…” when the tool is updating the data. Wait for this to return to ‘Dash’ before using the tool. This is when the file upload is complete.

Figure 2: Faded tab when app is computing

Functions found on 2D Plots

Download plots

As shown in the picture, click on the camera icon and get the plot in PNG format.

Figure 3: Download plot as PNG

Pan mode

If the plot’s drag mode is set to ‘Pan’, click and drag on the plot to pan and double-click to reset the pan.

Figure 4: Pan feature

Zoom modes

If the plot’s mode is set to ‘Zoom’, click and drag on the plot to zoom-in and double click to zoom-out completely. i.e. auto scale of both axis. The user can also zoom in and out by clicking on the + and – buttons. Axes labels will automatically optimize as you zoom in.

Figure 5: Auto scale of both axis (Zoom)

Reset axes

One can also drag the x and y axis in a horizontal and vertical motion respectively to move along the length of the axis. When the user drags the middle of the axis, a double headed arrow will appear. This allows the user to adjust the range of the axis i.e. both the lower and upper bound (Figure 6). Clicking ‘Reset axes’ will reset the axes.

Figure 6: Reset axis double headed arrow

When the user drags the top of the axis or the bottom of the axis, a single headed arrow will appear. This allows the user to adjust only the upper limit of the axis i.e. the upper bound or the lower limit of the axis i.e. the lower bound respectively. (Figure 7). Clicking 'Reset axes' will reset the axes.

Figure 7: Reset axis single headed arrow

Hover options

One of these two buttons is selected at all times. Clicking ‘Show closest data on hover’ will display the data for just one point under the cursor. Clicking ‘Compare data on hover’ will show you the data for all points with the same x-value.

Figure 8: Hover options

Preparing Data for RF

Selecting data for RF

Users can select which variables they would like as features and select their target variable in the 'Selecting Data for RF' tab. Non-numerical values are removed from the user's data frame and a table of selected features is populated.

Figure 9: Selecting data for RF

Feature correlation

In the 'Feature Correlation' tab, a heatmap of the coefficient of determination of features and the target variable is also produced, allowing users to see correlations between the variables used.

Hyperparameter tuning

Hyperparameter tuning is a way one can optimise their random forest model by changing settings in their algorithm to optimise its performance. These hyperparameters are set before training the model. Scikit-learn uses a set of default hyperparameters for all models but these may not be optimal. In the apps hyperparameter tuning, 5-Fold Cross- Validation is used to obtain optimal hyperparameters from a random hyperparameter grid. Random search will randomly test n number of iterations, where n is a number inputted by the user, to find the best hyperparameters. These hyperparameters are then used to create a second grid search where all possible hyperparameter settings in this grid are tested. The optimal hyperparameters from this grid search are then used as hyperparameters in the user's random forest model. This part of the tool requires the most time. When computing, the apps background will become faded and a loading cursor will appear. The process of your hyperparameter tuning can be trailed on the terminal as the user runs the app.

Once completed, optimal hyperparameters with improvement in accuracy compared to the base model (default hyperparameters) will be computed. These hyperparameters will automatically be used in the final random forest model.

Figure 9: Hyperparameter tuning

Output plots

Once the app has computed the optimal hyperparameters, a parity plot using these hyperparameters will appear in the Results from 'Hyperparameter tuning' tab. A bar plot of feature importance and cumulative feature importance will also appear. This is not the final output of the model but useful to look at to analyse model performance metrics, overfitting, and feature engineering.

Results from Hyperparameter tuning

Figure 10: Results from Hyperparameter tuning tab

Random Forest Output

In the 'Random Forest Output' tab, the user can determine a new test size to compute the final random forest result. This will take a few seconds to compute. A feature importance bar plot will also appear which can be useful to see which features were considered most important in the model and which features can be removed to improve run time. In the subtab 'Error Plots', an error distribution of the models observed and predicted values is available as well as a scatter plot illustrating model error in percentage terms and observed values.

Figure 11: Random Forest Output

Error Plot

Figure 12: Error plot distribution

Figure 13: Error and observed values scatter plot

Data tables

The user's inputs across the app will provide the output of the data tables. The user can download the feature and target variable correlations, performance metrics, random forest data and feature importance data from the 'Random Forest Output' tab.

Figure 14: Downloadable data tables

Contributing

For changes, please open an issue first to discuss what you would like to change. You can also contact the AAML research group to discuss further contributions and collaborations.

Contact Us

Email: Mythili Sutharson, Nakul Rampal, Rocio Bueno Perez, David Fairen Jimenez
Website: http://aam.ceb.cam.ac.uk/
Address:
Cambridge University,
Philippa Fawcett Dr,
Cambridge
CB3 0AS

License

This project is licensed under the MIT License - see the LICENSE.md file for details

Acknowledgments

AAML Research Group for developing this dashboard for the MOF community. Click here to read more about our work
Dash - the python framework used to build this web application