## Modeling Critical Micelle Concentration of Anionic Surfactants: QSPR Analysis within MAPS

*Introduction*

Surfactants are amphiphilic molecules with a polar or, more generally, hydrophilic head and a lipophilic or, more generally, hydrophobic tail. They are typically used to reduce the surface tension of a liquid (liquid/gas interface) or the interfacial tension between two different liquids (liquid/liquid interface) or between a liquid and a solid (liquid/solid interface). They are widely applied in industry as detergents, wetting agents or dispersants.

*QSPR analysis set up*

In the following study, the MAPS QSAR analysis tool [1] was used to model and predict the critical micelle concentration (CMC) of sulfate and sulfonate anionic surfactants. We used a set of 60 molecules for which the experimental CMC was known [2-5]. We sketched the molecules with the MAPS building tools and optimized their geometry with the conjugate gradient approach, using the UFF force field [6]. An initial selection of more than 180 Dragon6 descriptors [7,8] was calculated, including descriptors from the following families: *Constitutional indices*, *Topological indices*, *Functional group count*, *Atom-centered Fragment*, *CATS 2D descriptors*, *2D Atom Pairs* and *Molecular properties*.

After removing zero-variance descriptors, more than 150 descriptors remained. We used the MAPS QSPR “Predictor subset selection” tool (see Figure 2) to remove highly correlated descriptors (correlation threshold = 0.85), that is, descriptors that have similar effects on the modeled property for the current molecular set. This tool reduced the number of descriptors to 46.
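The two filtering steps described above can be illustrated outside MAPS with a minimal NumPy sketch. It is not the MAPS “Predictor subset selection” implementation: the function name `filter_descriptors` and the greedy keep-the-first-seen rule are assumptions made for illustration, and only the 0.85 threshold comes from the text.

```python
import numpy as np

def filter_descriptors(X, names, corr_threshold=0.85):
    """Drop zero-variance columns, then greedily drop correlated ones.

    Hypothetical helper: X is a (molecules x descriptors) array and
    names holds the matching descriptor labels.
    """
    X = np.asarray(X, dtype=float)

    # Step 1: remove descriptors that are constant over the molecule set.
    varying = X.std(axis=0) > 0
    X = X[:, varying]
    names = [n for n, keep in zip(names, varying) if keep]

    # Step 2: absolute Pearson correlation between every pair of descriptors.
    corr = np.abs(np.corrcoef(X, rowvar=False))

    kept = []
    for j in range(X.shape[1]):
        # Keep column j only if it is not too correlated with a kept column.
        if all(corr[j, k] <= corr_threshold for k in kept):
            kept.append(j)
    return X[:, kept], [names[j] for j in kept]
```

Which member of a correlated pair survives depends on column order in this sketch; a production tool may use a different tie-breaking rule.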

To reduce the number of descriptors further, we used Partial Least Squares (PLS) regression. The best-performing (in the RMSE sense) and shortest models were selected with a Genetic Algorithm (GA) based search, which looks for models with the lowest RMSE and as few descriptors as possible. From these simulations we selected the descriptors of the best model with fewer than 10 descriptors. We also used the “Descriptors statistics” tool to review the relative utilization frequency of the different descriptors, and added the most frequently used ones to the selection to obtain a group of 15 meaningful descriptors for the final model regression.
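The idea of a GA search that trades RMSE against model size can be sketched as follows. This is an illustration under stated assumptions, not the MAPS algorithm: ordinary least squares replaces PLS to keep the example dependency-free, and the fitness form (RMSE plus a per-descriptor penalty), the population settings and all function names are hypothetical.

```python
import numpy as np

def rmse_of_subset(X, y, mask):
    """RMSE of a least-squares fit using only the descriptors in mask."""
    if not mask.any():
        return np.inf
    A = np.column_stack([np.ones(len(y)), X[:, mask]])  # intercept + descriptors
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))

def ga_select(X, y, pop_size=30, generations=40, size_penalty=0.01,
              mutation_rate=0.05, seed=0):
    """Search descriptor subsets balancing low RMSE against model size."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = X.shape[1]

    def fitness(mask):
        # Assumed fitness: fit error plus a penalty per selected descriptor.
        return rmse_of_subset(X, y, mask) + size_penalty * mask.sum()

    pop = rng.random((pop_size, n)) < 0.5           # random initial subsets
    history = []
    for _ in range(generations):
        scores = np.array([fitness(m) for m in pop])
        order = np.argsort(scores)
        pop, scores = pop[order], scores[order]
        history.append(scores[0])
        parents = pop[: pop_size // 2]              # truncation selection
        children = [pop[0].copy()]                  # elitism: keep the best
        while len(children) < pop_size:
            p1, p2 = parents[rng.integers(len(parents), size=2)]
            cross = rng.random(n) < 0.5             # uniform crossover
            child = np.where(cross, p1, p2)
            child ^= rng.random(n) < mutation_rate  # bit-flip mutation
            children.append(child)
        pop = np.array(children)
    scores = np.array([fitness(m) for m in pop])
    history.append(scores.min())
    return pop[np.argmin(scores)], history
```

Running the search with several seeds and counting how often each descriptor appears in the returned masks gives a rough analogue of the utilization frequency mentioned above.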

For the final regression we divided the total set of molecules into training, test and validation sets. The main idea is to regress the model on the training set and to test it on the test set. Finally, the predictivity of the model is validated using the validation set, a set of molecules that were not used for model building and are considered blind. To get the best possible division, we selected the molecules randomly but under two constraints:

- The molecules in each set should cover the range of experimental CMC values to model.
- The molecules in each set should cover the different functionalities of the molecules of the total set.

For the latter constraint, we used the MAPS clustering tool to separate the molecules into groups that are similar with respect to the set of descriptors used. We then used that division to make sure that every part of the cluster tree was equally represented (see Figure 3).
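A constrained random split of this kind can be sketched as below, assuming cluster labels are already available from some clustering step. The 60/20/20 fractions, the `min_coverage` acceptance rule and the function name are hypothetical choices for illustration; the text does not state the actual set sizes or how coverage was checked in MAPS.

```python
import numpy as np

def constrained_split(y, clusters, fractions=(0.6, 0.2, 0.2),
                      min_coverage=0.8, max_tries=1000, seed=0):
    """Randomly assign molecules to training/test/validation sets (0/1/2).

    Each cluster is split separately, so every set samples every cluster
    that is large enough. A draw is accepted only if each set spans at
    least `min_coverage` of the overall property range.
    """
    y = np.asarray(y, dtype=float)
    clusters = np.asarray(clusters)
    rng = np.random.default_rng(seed)
    full_range = y.max() - y.min()
    for _ in range(max_tries):
        assign = np.empty(len(y), dtype=int)
        for c in np.unique(clusters):
            members = rng.permutation(np.flatnonzero(clusters == c))
            # Cut points inside this cluster following the requested fractions.
            cuts = np.round(np.cumsum(fractions)[:-1] * len(members)).astype(int)
            for set_id, part in enumerate(np.split(members, cuts)):
                assign[part] = set_id
        spans = [np.ptp(y[assign == s]) if np.any(assign == s) else 0.0
                 for s in range(len(fractions))]
        if all(span >= min_coverage * full_range for span in spans):
            return assign
    raise RuntimeError("no split satisfied the coverage constraint")
```

Splitting inside each cluster addresses the functionality constraint, while the accept/reject loop addresses the CMC range constraint.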

Once the total set was divided, we used the GA approach with training set / test set validation to regress the best possible model.

*Description of QSPR results*

The regressed model contains only two descriptors. Figure 4 shows the correspondence between the computed and experimental CMC for the different sets. Although two points appear as outliers, the remaining points are in very good agreement with the experimental values. The statistical analyses that follow are therefore performed without these two points.

Table 1 shows R^{2} and RMSE for the total, training, test and validation sets. No difference in accuracy can be found between the different sets: the RMSE remains between 0.2 and 0.25 and R^{2} is between 0.92 and 0.94. More specifically, the results for the validation set are in very good agreement with the experimental values, since this set has the best accuracy. The model described here therefore appears able to predict the CMC of sulfate or sulfonate anionic surfactants well.
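For reference, the two statistics can be computed from experimental and predicted values as in the short sketch below. It assumes R^{2} is the coefficient of determination (1 - SS_res/SS_tot); the text does not say which definition MAPS reports, and the function name is hypothetical.

```python
import numpy as np

def r2_and_rmse(y_exp, y_pred):
    """Return (R^2, RMSE) for experimental vs. predicted values."""
    y_exp = np.asarray(y_exp, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_exp - y_pred
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    ss_res = float(np.sum(residuals ** 2))               # unexplained variation
    ss_tot = float(np.sum((y_exp - y_exp.mean()) ** 2))  # total variation
    return 1.0 - ss_res / ss_tot, rmse
```

Applying the function separately to the training, test and validation subsets gives one row of statistics per set.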

The model can be expressed as:

log(CMC) = 1.90476 - 0.0276327 * S3K - 0.15023 * Sv

Where S3K (Dragon "Topological indices" group) is the 3^{rd} order path Kier alpha–modified shape index, which is defined in Ref. [8] and is related to the shape of the molecule (decreasing for more branched molecules and larger chains). The second descriptor, Sv (Dragon "Constitutional indices" group), is the sum of atomic van der Waals volumes (scaled on the carbon atom).
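Once the two descriptor values are known for a molecule, the model reduces to a one-line evaluation, sketched below. The function name is hypothetical and the inputs in the usage lines are arbitrary numbers, not descriptor values of a real surfactant. The base of the logarithm and the CMC units are not stated in the text, so the sketch returns log(CMC) without converting it.

```python
def predict_log_cmc(s3k, sv):
    """log(CMC) from the two-descriptor model quoted in the text.

    s3k: Kier alpha-modified shape index (Dragon S3K)
    sv:  sum of atomic van der Waals volumes (Dragon Sv)
    """
    return 1.90476 - 0.0276327 * s3k - 0.15023 * sv

# Arbitrary illustrative inputs: raising either descriptor lowers log(CMC),
# because both regression coefficients are negative.
low = predict_log_cmc(10.0, 20.0)
high = predict_log_cmc(12.0, 22.0)
```

Both coefficients being negative means that a larger Sv, and hence a bulkier molecule, yields a lower predicted CMC.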

*Conclusion*

In this study we used MAPS QSPR tools to model the Critical Micelle Concentration (CMC) of different sulfate and sulfonate anionic surfactants. The MAPS QSPR tool is based on Dragon descriptors. We created a regression model using only two descriptors (the 3^{rd} order path Kier alpha–modified shape index, S3K, and the sum of atomic van der Waals volumes, Sv) that gave an R^{2} of about 0.93 and an RMSE of 0.24. Additionally, the RMSE on the validation set was about 0.20, which supports the predictive ability of the model.

MAPS QSPR, based on Dragon descriptors, therefore appears to be an intuitive and efficient tool to analyze data, create QSPR/QSAR models and use them to predict desired properties.

*References:*

1. http://scienomics.com/
2. Katritzky, A. R.; Pacureanu, L.; Dobchev, D.; Karelson, M. J. Chem. Inf. Model. 2007.
3. Huibers, P. D. T.; Lobanov, V. S.; Katritzky, A. R.; Shah, O. D.; Karelson, M. J. Colloid Interface Sci. 1997, 113-120.
4. Li, X.; Zhang, G.; Dong, J.; Zhou, X.; Yan, X.; Luo, M. J. Mol. Struct. (Theochem) 2004, 710, 119-126.
5. Miyazawa, H.; Igawa, K.; Kondo, Y.; Yoshino, N. J. Fluorine Chem. 2003, 189-196.
6. Rappe, A. K.; Casewit, C. J.; Colwell, K. S.; Goddard, W. A.; Skiff, W. M. J. Am. Chem. Soc. 1992, 10024-10035.
7. http://www.talete.mi.it/index.htm
8. Todeschini, R.; Consonni, V. Molecular Descriptors for Chemoinformatics, 2nd Ed., Vol. 41 in Methods and Principles in Medicinal Chemistry, Wiley-VCH, 2009.