The Simplified Molecular Input Line Entry System (SMILES) is an open chemical notation that represents the structure of a molecule as an ASCII string. Developed in the late 1980s by David Weininger at Daylight Chemical Information Systems and now maintained as a public specification through the OpenSMILES project, SMILES uses letters for atoms, digits for ring closures, and a small set of symbols for bond types and stereochemistry. SMILES is compact, human-readable for small molecules, and a de facto standard input format for cheminformatics software. A SMILES string encodes a molecule's atoms, bonds, charges, and (optionally) stereochemistry, so it converts to and from a connection-table structure without losing that information, although it carries no 2D or 3D coordinates. It can also be canonicalised so that, within a given software tool, the same molecule always produces the same string.
What a SMILES looks like
| Substance | SMILES |
|---|---|
| Water | O |
| Methane | C |
| Ethanol | CCO |
| Acetic acid | CC(=O)O |
| Benzene | c1ccccc1 |
| Caffeine | CN1C=NC2=C1C(=O)N(C(=O)N2C)C |
| Sodium chloride | [Na+].[Cl-] |
| Sulphuric acid | OS(=O)(=O)O |
| Sodium hydroxide | [Na+].[OH-] |
The notation is compact. Methane is one letter. Water is one letter. Ethanol is three letters. The complexity grows with the molecule, but strings stay human-readable for structures of up to roughly 30 to 50 atoms.
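The one-letter strings work because hydrogens on unbracketed atoms are implied rather than written. The sketch below follows the implicit-hydrogen rule described in the OpenSMILES specification for aliphatic organic-subset atoms: fill up to the lowest normal valence that accommodates the explicit bonds. The function name `implicit_hydrogens` is made up for this illustration, and aromatic atoms are handled differently and are not covered here.

```python
# Normal valences for the organic subset, as listed in the OpenSMILES specification.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,),
    "P": (3, 5), "S": (2, 4, 6),
    "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_hydrogens(symbol: str, bond_order_sum: int) -> int:
    """Hydrogens implied on an unbracketed aliphatic atom.

    bond_order_sum is the sum of the orders of the bonds written in the
    SMILES for this atom (single = 1, double = 2, triple = 3).
    """
    for valence in NORMAL_VALENCES[symbol]:
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    return 0  # bond orders already exceed the highest normal valence


# Methane "C": no explicit bonds, so four hydrogens are implied.
print(implicit_hydrogens("C", 0))
# Ethanol "CCO": CH3, CH2 and OH give six hydrogens in total.
print(sum(implicit_hydrogens(s, b) for s, b in [("C", 1), ("C", 2), ("O", 1)]))
```

Under the same rule, the sulphur in OS(=O)(=O)O has a bond-order sum of six and therefore carries no implied hydrogens.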
SMILES syntax in brief
| Element | Syntax |
|---|---|
| Atoms | Element symbols for the organic subset (B, C, N, O, P, S, F, Cl, Br, I); lowercase for aromatic atoms (c, n, o, s); brackets for everything else ([Na+], [OH-], [Fe+3]) |
| Bonds | Single bond is implicit; = for double, # for triple, : for aromatic (also implicit in lowercase atoms) |
| Branches | Parentheses: CC(C)C is isobutane (a methyl branch on the second carbon) |
| Ring closures | Matching digits: C1CCCCC1 is cyclohexane (open ring at first 1, close at second 1) |
| Disconnections | Period: [Na+].[Cl-] is sodium chloride as separate ions |
| Stereochemistry | @ or @@ for tetrahedral configuration; / and \ for cis/trans on double bonds |
| Charges | Inside brackets: [NH4+], [OH-], [Cu+2] |
| Isotopes | Mass number prefix in brackets: [13C], [2H] |
The full grammar is more nuanced (aromaticity rules and explicit hydrogen counts, for example), but the core syntax is the eight elements above. Most cheminformatics software auto-generates SMILES from a structure, so day-to-day users rarely write SMILES by hand.
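To make the syntax table concrete, here is a minimal tokenizer sketch that labels each piece of a SMILES string with the row of the table it belongs to. It is a simplification for illustration only: the token names and the `tokenize` function are invented here, bracket atoms are kept as opaque tokens (so charges, isotopes, and @/@@ marks stay inside them), and less common symbols such as the `*` wildcard are not handled.

```python
import re

# Token patterns for a simplified SMILES subset. Order matters: bracket atoms
# and two-letter halogens must be tried before single letters.
TOKEN_RE = re.compile(r"""
    (?P<bracket_atom>\[[^\]]+\])          # [Na+], [13C], [OH-]
  | (?P<organic_atom>Cl|Br|[BCNOSPFI])    # organic-subset atoms
  | (?P<aromatic_atom>[bcnosp])           # lowercase aromatic atoms
  | (?P<bond>[=#:/\\-])                   # explicit bonds and cis/trans marks
  | (?P<branch>[()])                      # branch open / close
  | (?P<ring_closure>%\d{2}|\d)           # ring-closure labels
  | (?P<disconnection>\.)                 # separate components
""", re.VERBOSE)


def tokenize(smiles: str) -> list[tuple[str, str]]:
    """Split a SMILES string into (kind, text) tokens; raise on unknown input."""
    tokens = []
    pos = 0
    while pos < len(smiles):
        match = TOKEN_RE.match(smiles, pos)
        if match is None:
            raise ValueError(f"unexpected character {smiles[pos]!r} at index {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


for kind, text in tokenize("CC(=O)O"):
    print(f"{text:>3}  {kind}")
```

A tokenizer like this only checks the alphabet, not the chemistry: it will accept strings with unbalanced parentheses or unmatched ring-closure digits, which a real parser would reject.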
Canonical SMILES vs arbitrary SMILES
A given molecule has many valid SMILES strings. Acetic acid can be written as CC(=O)O, OC(=O)C, O=C(O)C, or OC(C)=O: all valid, all the same molecule. For database use, a canonicalisation algorithm produces a unique canonical SMILES per molecule. Canonical SMILES plays a role similar to standard InChI: a deterministic single representation per substance.
Different software packages use different canonicalisation algorithms and produce different canonical strings, so canonical SMILES is software-specific. PubChem’s canonical SMILES is not necessarily the same as RDKit’s canonical SMILES. This is the main reason InChIKey is preferred over canonical SMILES for cross-database interchange: the InChI algorithm is a single specification with one reference implementation, so every tool that uses it produces the same result.
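The idea can be illustrated with a toy canonicaliser. The sketch below is not the algorithm used by Daylight, RDKit, or PubChem: it parses only acyclic SMILES built from one-letter atoms, branches, and -, =, # bonds, then tries every atom ordering and keeps the smallest relabelled graph. That brute-force search is factorial in the number of atoms and is only workable for very small molecules. The function names are invented for this example.

```python
from itertools import permutations

BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def parse_simple_smiles(smiles):
    """Parse an acyclic SMILES with one-letter atoms, branches and -, =, # bonds.

    Returns (atoms, bonds): a list of element symbols and a list of
    (atom_index, atom_index, bond_order) tuples.
    """
    atoms, bonds = [], []
    stack = []        # atoms to return to when a branch closes
    previous = None   # index of the atom the next atom will bond to
    order = 1         # pending bond order (a single bond is implicit)
    for char in smiles:
        if char in BOND_ORDER:
            order = BOND_ORDER[char]
        elif char == "(":
            stack.append(previous)
        elif char == ")":
            previous = stack.pop()
        elif char in "CNOSPFI":
            atoms.append(char)
            current = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, current, order))
            previous, order = current, 1
        else:
            raise ValueError(f"unsupported character {char!r}")
    return atoms, bonds


def canonical_key(smiles):
    """Toy canonical form: the smallest relabelled graph over all atom orderings."""
    atoms, bonds = parse_simple_smiles(smiles)
    best = None
    for perm in permutations(range(len(atoms))):
        # perm[i] is the new index assigned to original atom i
        labels = [None] * len(atoms)
        for old, new in enumerate(perm):
            labels[new] = atoms[old]
        edges = sorted(
            (min(perm[a], perm[b]), max(perm[a], perm[b]), o) for a, b, o in bonds
        )
        candidate = (tuple(labels), tuple(edges))
        if best is None or candidate < best:
            best = candidate
    return best


variants = ["CC(=O)O", "OC(=O)C", "O=C(O)C", "OC(C)=O"]
print(len({canonical_key(s) for s in variants}))  # number of distinct keys
```

All four acetic acid strings collapse to a single key, while ethanol (CCO) and dimethyl ether (COC) get different keys. Choosing the largest candidate instead of the smallest would give an equally valid but different canonical form, which is the same reason canonical SMILES from two toolkits need not match.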
When SMILES is the right notation
SMILES is the right notation for:
- Computational chemistry input. Quantum chemistry software, molecular dynamics, machine-learning models, and most cheminformatics tools accept SMILES as input.
- Compact human-readable representation of small molecules in technical documents, R&D notebooks, and patents.
- Substructure search queries in chemistry databases. SMARTS (an extension of SMILES) is the dominant query language for “find me all molecules containing this fragment.”
- AI-friendly chemistry content. Like InChI, SMILES gives an AI engine a deterministic way to identify the molecule. SMILES is more compact and arguably more readable for small molecules.
SMILES is the wrong notation for:
- Mixtures and undefined substances. SMILES represents a single defined molecule, not a mixture or a polymer with variable composition.
- Customs and commercial documents. No customs authority asks for SMILES on a commercial invoice. CAS number and product name are the standard.
- Stereochemistry that is unknown. SMILES can either specify stereochemistry or omit it; it cannot represent “we know this is one of two stereoisomers but do not know which.”
- Large macromolecules. A SMILES for a large protein or polymer becomes unreadably long.
SMILES vs InChI vs IUPAC name
The three open structural notations differ in purpose:
| Identifier | Strength | Weakness |
|---|---|---|
| SMILES | Compact, human-readable for small molecules, fast computational input | Multiple valid forms per molecule unless canonicalised; software-specific canonicalisation |
| InChI | Single canonical algorithm; InChIKey is hyperlink-friendly; cross-database stable | Long, less human-readable, harder to write by hand |
| IUPAC name | Systematic and pronounceable; common in scientific literature | Long for complex molecules; multiple valid IUPAC names possible; not machine-readable without parsing |
For chemistry content tuned for AI and search engines, listing the CAS number, InChIKey, and SMILES together on a product page is the most thorough approach. Different AI engines and different downstream users prefer different identifiers.
How Chinese factories produce SMILES for export documentation
Most Chinese factories generate SMILES from internal product master data using commercial or free chemistry software:
- The molecular structure is drawn or imported into a chemistry editor (ChemDraw, MarvinSketch, or BIOVIA Draw) or an open-source toolkit such as RDKit.
- Canonical SMILES is generated from the structure using the editor’s canonicalisation function.
- The SMILES is added to the product master data and propagated to the SDS, the product page, and any data exchange with downstream buyers.
For factories producing pharmaceutical intermediates, fine chemicals, or specialty chemicals where downstream chemistry matters, SMILES is part of the standard product data. For bulk industrial chemical factories, SMILES is rarely included unless specifically requested.
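The propagation step in the workflow above can be pictured as one master record feeding several outputs. The sketch below is purely hypothetical: the `ProductRecord` class, its field names, and the SDS line layout are assumptions made for illustration, not a description of any particular factory's system. Recording which software produced the SMILES is a design choice that follows from canonical forms being software-specific.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ProductRecord:
    """Hypothetical master-data record; field names are illustrative only."""
    product_name: str
    cas_number: str
    smiles: str         # canonical SMILES exported from the chemistry editor
    smiles_source: str  # software that produced it, since canonical forms differ


def sds_identifier_lines(record: ProductRecord) -> list[str]:
    """Lines for the identification section of an SDS (layout is an assumption)."""
    return [
        f"Product name: {record.product_name}",
        f"CAS number: {record.cas_number}",
        f"SMILES ({record.smiles_source}): {record.smiles}",
    ]


def product_page_payload(record: ProductRecord) -> dict:
    """Dictionary that a product page or buyer data feed could serialise to JSON."""
    return asdict(record)


acetic_acid = ProductRecord("Acetic acid", "64-19-7", "CC(=O)O", "example editor")
print("\n".join(sds_identifier_lines(acetic_acid)))
print(product_page_payload(acetic_acid))
```

Because both outputs are derived from the same frozen record, the SDS and the product page cannot drift apart on the SMILES string.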
Common SMILES mistakes
Three patterns recur:
- Inconsistent canonical SMILES across databases. Two databases listing the same chemical with different canonical SMILES strings are using different canonicalisation algorithms. Confirm by comparing InChIKey instead.
- Aromatic notation inconsistency. Benzene can be written c1ccccc1 (aromatic form, lowercase) or C1=CC=CC=C1 (Kekulé form, uppercase). Different software prefers different forms; both are valid, but only one may round-trip cleanly through a given pipeline.
- Salt and counterion handling. A salt like sodium acetate can be written [Na+].CC(=O)[O-] (separate ions) or CC(=O)O[Na] (covalent representation). The first is correct for ionic species in solution; the second can mislead a reader into expecting a covalent compound.
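The last two patterns can be caught with simple string checks before a SMILES enters a database. The sketch below is a rough lint pass under stated assumptions, not a validator: the function names are invented, the `IONIC_METALS` set is a small illustrative sample, atom classes inside brackets are ignored, and aromatic atoms written inside brackets (such as [nH]) are not detected.

```python
import re

BRACKET = re.compile(r"\[([^\]]+)\]")
# Illustrative, not exhaustive: metals that normally occur as ions in salts.
IONIC_METALS = {"Li", "Na", "K", "Mg", "Ca"}


def bracket_charge(body: str) -> int:
    """Formal charge of a bracket-atom body such as 'Na+', 'O-' or 'Fe+3'."""
    match = re.search(r"(\++|-+)(\d*)$", body)
    if match is None:
        return 0
    signs, digits = match.groups()
    magnitude = int(digits) if digits else len(signs)
    return magnitude if signs[0] == "+" else -magnitude


def component_charges(smiles: str) -> list[int]:
    """Net formal charge of each dot-separated component."""
    return [
        sum(bracket_charge(body) for body in BRACKET.findall(part))
        for part in smiles.split(".")
    ]


def covalent_metal_atoms(smiles: str) -> list[str]:
    """Uncharged bracket metals, which suggest a salt drawn covalently."""
    flagged = []
    for body in BRACKET.findall(smiles):
        symbol = re.match(r"\d*([A-Z][a-z]?)", body)
        if symbol and symbol.group(1) in IONIC_METALS and bracket_charge(body) == 0:
            flagged.append(symbol.group(1))
    return flagged


def uses_lowercase_aromatic(smiles: str) -> bool:
    """True if unbracketed lowercase (aromatic) atoms appear in the string."""
    outside = BRACKET.sub("", smiles).replace("Cl", "").replace("Br", "")
    return re.search(r"[bcnosp]", outside) is not None


print(component_charges("[Na+].CC(=O)[O-]"))
print(covalent_metal_atoms("CC(=O)O[Na]"))
print(uses_lowercase_aromatic("C1=CC=CC=C1"))
```

A pipeline could use these checks to normalise on one convention, for example rejecting covalently drawn salts and recording whether incoming strings use the aromatic or the Kekulé form.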
Related terms
InChI is the IUPAC-developed open identifier with a deterministic InChIKey hash. IUPAC Name is the systematic chemical name. CAS Number is the proprietary registry identifier. EC Number is the EU regulatory identifier. SMILES sits alongside these four as a structural notation; together, the five cover most chemical-identification use cases.