Modeling Critical Micelle Concentration of Anionic Surfactants: QSPR Analysis within MAPS

Introduction

Surfactants are amphiphilic molecules with a polar (more generally, hydrophilic) head and a lipophilic (more generally, hydrophobic) tail. They are typically used to reduce the surface tension of a liquid (liquid/gas interface) or the interfacial tension between two liquids (liquid/liquid interface) or between a liquid and a solid (liquid/solid interface). They are widely used in industry as detergents, wetting agents and dispersants.
 

Figure 1: Representation of the evolution of the surface tension and of the position of the surfactants in the systems when [surfactant] increases.
In these systems, the surfactants usually align at the interface, presenting their hydrophilic head to one phase (usually liquid water) and their hydrophobic tail to the other (for example, oil); see Figure 1. This arrangement reduces the interfacial tension between the two phases. As the concentration increases, however, the interface eventually becomes fully covered with surfactant molecules and no room is left for new ones. From that point on, additional molecules form micelles in one of the two phases. The concentration at which this happens is called the Critical Micelle Concentration (CMC). Modeling and predicting this concentration is essential for controlling the tension of an interface.
 
QSPR analysis set up

In this study, the MAPS QSAR analysis tool [1] was used to model and predict the CMC of sulfate and sulfonate anionic surfactants. We used a set of 60 molecules with known experimental CMC values [2-5]. We sketched the molecules with the MAPS building tools and optimized their geometries with the conjugate gradient method, using the UFF force field [6]. An initial selection of more than 180 Dragon6 descriptors [7,8] was calculated, drawn from the following families: Constitutional indices, Topological indices, Functional group counts, Atom-centered fragments, CATS 2D descriptors, 2D Atom Pairs and Molecular properties.
 

Figure 2: Correlation based predictor selection interface in MAPS.

After removing zero-variance descriptors, more than 150 descriptors remained. We used the MAPS QSPR “Predictor subset selection” tool (see Figure 2) to remove highly correlated descriptors (correlation threshold = 0.85), that is, descriptors that have similar effects on the modeled property for the current molecular set. This step reduced the number of descriptors to 46.
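The filtering step can be illustrated with a minimal sketch. It assumes a greedy rule (keep a descriptor only if its absolute Pearson correlation with every descriptor already kept is at most the threshold); the algorithm actually used by the MAPS tool is not described here and may differ. The descriptor names and values in `demo` are invented for illustration and are not Dragon output.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def filter_descriptors(table, threshold=0.85):
    """table maps descriptor name -> list of values (one per molecule).

    Returns the names kept after (1) dropping zero-variance descriptors and
    (2) a greedy pass, in insertion order, that drops any descriptor whose
    |r| with an already kept descriptor exceeds the threshold.
    """
    # Step 1: a descriptor with a single distinct value carries no information.
    varying = {k: v for k, v in table.items() if len(set(v)) > 1}
    # Step 2: greedy correlation filter (an assumed rule, see the text above).
    kept = []
    for name, values in varying.items():
        if all(abs(pearson(values, varying[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical descriptor table for five molecules.
demo = {
    "nC": [8, 10, 12, 14, 16],
    "MW": [232.3, 260.4, 288.4, 316.5, 344.5],  # nearly collinear with nC
    "nS": [1, 1, 1, 1, 1],                      # zero variance
    "nBranch": [0, 2, 0, 1, 0],
}
kept = filter_descriptors(demo)
print(kept)
```

With a greedy rule of this kind the result depends on the order in which descriptors are visited, so a real tool may rank them first (for example by correlation with the property).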
 
To reduce the number of descriptors further, we used Partial Least Squares (PLS) regression. The best-performing (in the RMSE sense) and shortest models were selected with a Genetic Algorithm (GA) based search, which looks for models with the lowest RMSE and as few descriptors as possible. From these runs we selected the descriptors of the best model with fewer than 10 descriptors. We also used the “Descriptors statistics” tool to review how frequently each descriptor was used across the models, and added the most frequently used ones to the selection, giving a group of 15 meaningful descriptors for the final model regression.
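The idea of a GA search over descriptor subsets can be sketched as follows. This is a generic illustration, not the MAPS implementation: the selection, crossover and mutation operators and all parameter values are assumptions, and `toy_fitness` is a stand-in for what would really be the RMSE of a PLS model built on the selected descriptors plus a penalty on the number of descriptors.

```python
import random

def ga_select(n_desc, fitness, pop_size=30, generations=40, p_mut=0.1, seed=0):
    """Minimize fitness(mask) over boolean masks of length n_desc (n_desc >= 2)."""
    rng = random.Random(seed)
    # Random initial masks, plus the full set so the result is never worse
    # than using every descriptor.
    pop = [tuple(rng.random() < 0.5 for _ in range(n_desc))
           for _ in range(pop_size - 1)]
    pop.append((True,) * n_desc)
    for _ in range(generations):
        pop.sort(key=fitness)
        parents = pop[: pop_size // 2]       # truncation selection
        children = [pop[0]]                  # elitism: carry over the best mask
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_desc)   # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation: each bit is inverted with probability p_mut.
            child = tuple(bit != (rng.random() < p_mut) for bit in child)
            children.append(child)
        pop = children
    return min(pop, key=fitness)

# Placeholder fitness: descriptors 0 and 3 are "useful"; every selected
# descriptor costs a small penalty, mimicking "low RMSE, short model".
USEFUL = {0, 3}

def toy_fitness(mask):
    missing = sum(1 for i in USEFUL if not mask[i])
    return 1.0 * missing + 0.05 * sum(mask)

best = ga_select(8, toy_fitness)
print(best, toy_fitness(best))
```

Counting how often each descriptor appears in the best masks over several runs would give a simple analogue of the utilization frequencies mentioned above.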
 
For the final regression we divided the full set of molecules into training, test and validation sets. The idea is to regress the model on the training set and to test it on the test set; the predictivity of the model is then validated on the validation set, a set of molecules that are not used for model building and are treated as blind. To obtain the best possible division, we selected the molecules at random, subject to two constraints:

  • The molecules in each set should cover the range of experimental CMC values to model.
  • The molecules in each set should cover the different functionalities found in the full set.

For the second constraint, we used the MAPS clustering tool to separate the molecules into groups that are similar with respect to the set of descriptors used. We then used that division to make sure that every part of the cluster tree was equally represented (see Figure 3).
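One way to satisfy both constraints at once is sketched below, under stated assumptions: cluster labels are already available, the 3:1:1 training/test/validation pattern is hypothetical (the split ratio is not given in this note), and the molecule names and values are invented. Within each cluster the molecules are sorted by log(CMC) and dealt out along a repeating pattern with a random starting offset, so that each set samples every cluster and spreads over the property range.

```python
import random
from collections import defaultdict

def stratified_split(molecules,
                     pattern=("train", "train", "test", "train", "validation"),
                     seed=0):
    """molecules: list of (name, cluster_id, log_cmc) tuples.

    Returns a dict mapping each label in the pattern to a list of names.
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for mol in molecules:
        by_cluster[mol[1]].append(mol)

    sets = {label: [] for label in set(pattern)}
    for cluster_id in sorted(by_cluster):
        # Sorting by the property spreads each set over the CMC range.
        members = sorted(by_cluster[cluster_id], key=lambda m: m[2])
        offset = rng.randrange(len(pattern))  # the random part of the selection
        for i, mol in enumerate(members):
            sets[pattern[(i + offset) % len(pattern)]].append(mol[0])
    return sets

# Hypothetical data: 3 clusters of 10 molecules each.
molecules = [(f"mol{c}_{i}", c, -3.0 + 0.2 * i + 0.05 * c)
             for c in range(3) for i in range(10)]
split = stratified_split(molecules)
print({label: len(names) for label, names in split.items()})
```

Because every cluster here holds a multiple of the pattern length, each cluster contributes the same share to each set; with uneven cluster sizes the shares are only approximately equal.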
 
Once the full set was divided, we used the GA approach with training set / test set validation to regress the best possible model.
 

Figure 3: Visualization of the cluster tree of the 60 molecules. Each bullet represents a molecule and each cluster is shown in a different color.

Description of QSPR results 

The regressed model contains only two descriptors. Figure 4 shows the correspondence between the computed and experimental CMC for the different sets. Two points appear as outliers; the remaining points are in very good agreement with the experimental values. The statistical analyses below are therefore performed without these two points.
 

Figure 4: Representation of the QSPR log(CMC) as a function of the experimental one for the Validation (blue triangle), Test (orange circle) and Training (green square) sets. For comparison, y=x reference is plotted.

Table 1 shows R2 and RMSE for the total, training, test and validation sets. No notable difference in accuracy is found between the sets: the RMSE remains between 0.2 and 0.25 and R2 between 0.92 and 0.94. In particular, the validation set is in very good agreement with experiment, since it is the set with the best accuracy. The model described here therefore appears able to predict the CMC of sulfate and sulfonate anionic surfactants well.
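For reference, the two statistics can be computed as in the sketch below. It assumes that R2 is the coefficient of determination (1 - SS_res / SS_tot); the note does not say which definition MAPS reports, and a squared Pearson correlation would give slightly different numbers. The log(CMC) values are invented for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between experimental and predicted values."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical experimental and predicted log(CMC) values for one set.
log_cmc_exp = [-1.2, -1.9, -2.4, -3.1, -3.8]
log_cmc_pred = [-1.4, -1.8, -2.6, -3.0, -3.5]
print(rmse(log_cmc_exp, log_cmc_pred), r_squared(log_cmc_exp, log_cmc_pred))
```

Applying the same two functions separately to the training, test and validation sets gives the kind of per-set comparison summarized in Table 1.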
 

Table 1: Value of R2 and RMSE for the Total, Training, Test and Validation sets (once the two outlier points have been removed)

The model can be expressed as:

log(CMC) = 1.90476 - 0.0276327 * S3K - 0.15023 * Sv

where S3K (Dragon "Topological indices" group) is the 3rd-order path Kier alpha-modified shape index, which is defined in Ref. [8] and is related to the shape of the molecule (it decreases for more branched molecules and larger chains). The second descriptor, Sv (Dragon "Constitutional indices" group), is the sum of atomic van der Waals volumes (scaled on the carbon atom).
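Applying the model to a new molecule is a one-line evaluation once the two descriptors are known. The sketch below uses the coefficients given above; the descriptor values in the example are hypothetical (not computed with Dragon), and converting log(CMC) back to a concentration assumes a base-10 logarithm, which is not stated explicitly in this note.

```python
def predict_log_cmc(s3k, sv):
    """log(CMC) from the two-descriptor model given in the text."""
    return 1.90476 - 0.0276327 * s3k - 0.15023 * sv

def predict_cmc(s3k, sv):
    """CMC in the units of the training data, assuming log means log10."""
    return 10.0 ** predict_log_cmc(s3k, sv)

# Hypothetical descriptor values for a single surfactant.
example_log_cmc = predict_log_cmc(10.0, 30.0)
print(round(example_log_cmc, 4), predict_cmc(10.0, 30.0))
```

Both coefficients are negative, so within this model a larger S3K or a larger Sv always lowers the predicted CMC.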
 
Conclusion

In this study we used the MAPS QSPR tools to model the Critical Micelle Concentration (CMC) of various sulfate and sulfonate anionic surfactants. The MAPS QSPR tool is based on Dragon descriptors. We built a regression model using only two descriptors (the 3rd-order path Kier alpha-modified shape index, S3K, and the sum of atomic van der Waals volumes, Sv) that gave an R2 of about 0.93 and an RMSE of 0.24. In addition, the RMSE on the validation set was about 0.20, which supports the predictive ability of the model.
 
MAPS QSPR based on Dragon descriptors therefore appears to be an intuitive and efficient tool for analyzing data, creating QSPR/QSAR models and using them to predict desired properties.
 
References:

  1. http://scienomics.com/
  2. Katritzky, A. R.; Pacureanu, L.; Dobchev, D.; Karelson, M. J. Chem. Inf. Model. 2007.
  3. Huibers, P. D. T.; Lobanov, V. S.; Katritzky, A. R.; Shah, O. D.; Karelson, M. J. Colloid Interface Sci. 1997, 113-120.
  4. Li, X.; Zhang, G.; Dong, J.; Zhou, X.; Yan, X.; Luo, M. J. Mol. Struct. (Theochem) 2004, 710, 119-126.
  5. Miyazawa, H.; Igawa, K.; Kondo, Y.; Yoshino, N. J. Fluorine Chem. 2003, 189-196.
  6. Rappe, A. K.; Casewit, C. J.; Colwell, K. S.; Goddard, W. A.; Skiff, W. M. J. Am. Chem. Soc. 1992, 10024-10035.
  7. http://www.talete.mi.it/index.htm
  8. Todeschini, R.; Consonni, V. Molecular Descriptors for Chemoinformatics, 2nd Ed., Vol. 41 in Methods and Principles in Medicinal Chemistry, Wiley-VCH, 2009.