Chemistry ID

SMILES

Simplified Molecular Input Line Entry System

An open chemical notation that represents the structure of a molecule as an ASCII string. Developed in the late 1980s by David Weininger and now maintained as a public specification, SMILES uses letters for atoms, digits for ring closures, and a small set of symbols for bonds and stereochemistry. SMILES is human-readable, compact, and the de facto input format for cheminformatics software. A SMILES string converts losslessly to and from a chemical structure.

Updated May 2, 2026

The Simplified Molecular Input Line Entry System (SMILES) is an open chemical notation that represents the structure of a molecule as an ASCII string. Developed in the late 1980s by David Weininger at Daylight Chemical Information Systems and now maintained as a public specification through the OpenSMILES project, SMILES uses letters for atoms, digits for ring closures, and a small set of symbols for bond types and stereochemistry. SMILES is human-readable for small molecules, compact, and the de facto input format for cheminformatics software. A SMILES string converts losslessly to and from a chemical structure, and can be canonicalised so that the same molecule always produces the same string.

What a SMILES looks like

SubstanceSMILES
WaterO
MethaneC
EthanolCCO
Acetic acidCC(=O)O
Benzenec1ccccc1
CaffeineCN1C=NC2=C1C(=O)N(C(=O)N2C)C
Sodium chloride[Na+].[Cl-]
Sulphuric acidOS(=O)(=O)O
Sodium hydroxide[Na+].[OH-]

The notation is compact. Methane is one letter. Water is one letter. Ethanol is three letters. The complexity grows with the molecule but stays human-readable up to ~30-50 atom structures.

SMILES syntax in brief

ElementSyntax
AtomsCapital letters for organic atoms (C, N, O, S, P, F, Cl, Br, I); lowercase for aromatic atoms (c, n, o); brackets for everything else ([Na+], [OH-], [Fe+3])
BondsSingle bond is implicit; = for double, # for triple, : for aromatic (also implicit in lowercase atoms)
BranchesParentheses: CC(C)C is isobutane (a methyl branch on the second carbon)
Ring closuresMatching digits: C1CCCCC1 is cyclohexane (open ring at first 1, close at second 1)
DisconnectionsPeriod: [Na+].[Cl-] is sodium chloride as separate ions
Stereochemistry@ or @@ for tetrahedral configuration; / and \ for cis/trans on double bonds
ChargesInside brackets: [NH4+], [OH-], [Cu+2]
IsotopesMass number prefix in brackets: [13C], [2H]

The full grammar is more nuanced (aromatic rings, charge layers, hydrogen counts, isotopes), but the core syntax is the seven elements above. Most cheminformatics software auto-generates SMILES from a structure, so day-to-day users rarely write SMILES by hand.

Canonical SMILES vs arbitrary SMILES

A given molecule has many valid SMILES strings. Acetic acid can be written as CC(=O)O, OC(=O)C, O=C(O)C, or OC(C)=O, all valid, all the same molecule. For database use, a canonicalisation algorithm produces a unique canonical SMILES per molecule. Canonical SMILES is to SMILES what InChIKey is to InChI: a deterministic single representation per substance.

Different software canonicalisation algorithms produce different canonical strings, so canonical SMILES is software-specific. PubChem’s canonical SMILES is not the same as RDKit’s canonical SMILES. This is the main reason InChIKey is preferred over canonical SMILES for cross-database interchange, the InChI algorithm is a single specification that everyone implements identically.

When SMILES is the right notation

SMILES is the right notation for:

  1. Computational chemistry input. Quantum chemistry software, molecular dynamics, machine-learning models, and most cheminformatics tools accept SMILES as input.
  2. Compact human-readable representation of small molecules in technical documents, R&D notebooks, and patents.
  3. Substructure search queries in chemistry databases. SMARTS (an extension of SMILES) is the dominant query language for “find me all molecules containing this fragment.”
  4. AI-friendly chemistry content, like InChI, SMILES gives an AI engine a deterministic way to identify the molecule. SMILES is more compact and arguably more readable for small molecules.

SMILES is the wrong notation for:

  1. Mixtures and undefined substances. SMILES represents a single defined molecule, not a mixture or a polymer with variable composition.
  2. Customs and commercial documents. No customs authority asks for SMILES on a commercial invoice. CAS number and product name are the standard.
  3. Stereochemistry that is unknown. SMILES can either specify stereochemistry or omit it; it cannot represent “we know this is one of two stereoisomers but do not know which.”
  4. Large macromolecules. A SMILES for a large protein or polymer becomes unreadably long.

SMILES vs InChI vs IUPAC name

The three open structural notations differ in purpose:

IdentifierStrengthWeakness
SMILESCompact, human-readable for small molecules, fast computational inputMultiple valid forms per molecule unless canonicalised; software-specific canonicalisation
InChISingle canonical algorithm; InChIKey is hyperlink-friendly; cross-database stableLong, less human-readable, harder to write by hand
IUPAC nameSystematic and pronounceable; common in scientific literatureLong for complex molecules; multiple valid IUPAC names possible; not machine-readable without parsing

For chemistry content tuned for AI and search engines, including all three identifiers (CAS, InChIKey, SMILES) on a product page is the most thorough approach. Different AI engines and different downstream users prefer different identifiers.

How Chinese factories produce SMILES for export documentation

Most Chinese factories generate SMILES from internal product master data using free software:

  1. The molecular structure is drawn or imported into a chemistry editor (ChemDraw, MarvinSketch, BIOVIA Draw, or open-source tools like ChemAxon Marvin or RDKit-Python).
  2. Canonical SMILES is generated from the structure using the editor’s canonicalisation function.
  3. The SMILES is added to the product master data and propagated to the SDS, the product page, and any data exchange with downstream buyers.

For factories producing pharmaceutical intermediates, fine chemicals, or specialty chemicals where downstream chemistry matters, SMILES is part of the standard product data. For bulk industrial chemical factories, SMILES is rarely included unless specifically requested.

Common SMILES mistakes

Three patterns recur:

  1. Inconsistent canonical SMILES across databases. Two databases listing the same chemical with different canonical SMILES strings are using different canonicalisation algorithms. Confirm by comparing InChIKey instead.
  2. Aromatic notation inconsistency. Benzene can be written c1ccccc1 (Kekule lower-case) or C1=CC=CC=C1 (Kekule upper-case). Different software prefers different forms; both are valid but only one will round-trip cleanly through a given pipeline.
  3. Salt and counterion handling. A salt like sodium acetate can be [Na+].CC(=O)[O-] (separate ions) or CC(=O)O[Na] (covalent representation). The first is correct for ionic species in solution; the second can mislead a reader into expecting a covalent compound.

InChI is the IUPAC-developed open identifier with deterministic InChIKey hash. IUPAC Name is the systematic chemical name. CAS Number is the proprietary registry identifier. EC Number is the EU regulatory identifier. SMILES sits alongside these as a structural notation; the four together cover most chemical-identification use cases.

Reference: https://opensmiles.org/

Related

Other terms you'll see on the same shipment

Need this on your next shipment?

We handle the documentation chain.

Every chemical we ship from Shanghai or Qingdao goes out with the COA, MSDS, DG declaration, and inspection certificate the destination port will ask for. Send us your spec and we will quote it with the paperwork already mapped.

Request a Quote