Parsing and Classifying Ingredients

Posted on: 2 Nov 2025

The hardest part of compiling the vegandex database so far has been parsing and classifying ingredients. The list of "primary" ingredients is only about 4,000 entries right now, but around 140,000 distinct ingredient strings show up in ingredient lists. The vast majority of these come either from parsing errors when extracting ingredients from the ingredient list or from different ways of specifying or spelling the same ingredient.

Parsing Ingredients

Parsing the ingredient list seems simple at first: split the list on commas and you have your list of ingredients. That works for very simple ingredient lists, but then you get something like:

["NON-GMO SUGAR","NON- GMO FLOUR (WHEAT, MALTED BARLEY)","NON-GMO ORGANIC RICE MILK (FILTERED WATER, BROWN RICE [PARTIALLY MILLED], EXPELLER-PRESSED CANOLA OIL AND/OR SAFFLOWER OIL AND/OR SUNFLOWER OIL, TRICALCIUM PHOSPHATE, SEA SALT, VITAMIN A PALMITATE, VITAMIN D2, VITAMIN B12)","NON-GMO EXPELLER-PRESSED CANOLA OIL","WATER","SALT","NON-GMO BAKING POWDER (MONOCALCIUM PHOSPHATE, SODIUM BICARBONATE [BAKING SODA], NON-GMO CORN STARCH)","BAKING SODA (SODIUM BICARBONATE)","NON-GMO VANILLA AND CRUMBS (NON-GMO GLUTEN FREE FLOUR (RICE FLOUR, WHOLE GRAIN BROWN RICE FLOUR, WHOLE SORGHUM FLOUR, TAPIOCA STARCH, POTATO STARCH, CELLULOSE, XANTHAN GUM, VITAMIN AND MINERAL BLEND [CALCIUM CARBONATE, NIACINAMIDE (VITAMIN B3), REDUCED IRON, THIAMIN HYDROCHLORIDE (VITAMIN B1), RIBOFLAVIN (VITAMIN B2)], NON-GMO BROWN SUGAR, NON-GMO VEGAN BUTTER (NATURAL OIL BLEND (PALM FRUIT, CANOLA AND OLIVE OIL), WATER, CONTAINS 2% OR LESS OF: SALT, NATURAL FLAVOR (DERIVED FROM CORN, NO MSG, NO ALCOHOL, NO GLUTEN, GMO FREE), SUNFLOWER LECITHIN-AN EMULSIFIER, LACTIC ACID (NON-DAIRY DERIVED FROM SUGAR BEETS), ANNATTO EXTRACT-COLOR), NON-GMO VANILLA, CINNAMON)."]

This is actually one of the easier ingredient lists to parse. The algorithm I'm currently using looks like:

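import re

# KEYWORDS and L_KEYWORDS are collections of strings defined elsewhere in
# the module (not shown here).
def split_ingredients(text):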
    """
    Split up a JSON/text string of ingredients. Is a generator.

    Examples (results collected with list() and shown with surrounding
    whitespace stripped; assumes 'vegan' is in KEYWORDS):

        split_ingredients("a,b") -> ['a','b']
        split_ingredients("a,(b,c)") -> ['a','b','c']
        split_ingredients("a,b (c,d)") -> ['a','c','d']
        split_ingredients("a (b)") -> ['a']
        split_ingredients("a :b") -> ['b']
        split_ingredients("a (c:b)") -> ['a']
        split_ingredients("a (vegan)") -> ['a (vegan)']
        split_ingredients("(a, b (vegan))") -> ['a', 'b (vegan)']
    """
    if text == '':
        return

    rex = re.compile("[:(),]")

    # position of the outermost unclosed left parenthesis
    pos = 0
    # current nesting depth (number of unclosed left parentheses)
    lbr = 0
    for match in rex.finditer(text):
        i = match.start()
        c = text[i]
        if c == ':' and lbr == 0:
            yield from split_ingredients(text[i+1:])
            return
        elif c == ',' and lbr == 0:
            # This is the main point of this algorithm, if we come to a comma
            # we split the ingredients to the left and right of it recursively.
            yield from split_ingredients(text[:i])
            yield from split_ingredients(text[i+1:])
            return
        elif c == '(':
            if lbr == 0:
                pos = i
            lbr += 1
            continue
        elif c == ')':
            lbr -= 1
            if lbr < 0:
                # Unmatched right parenthesis: drop it and re-parse.
                yield from split_ingredients(text[:i] + text[i+1:])
                return
            if lbr == 0:
                if len(text[:pos].strip()) == 0:
                    # We reached the end of a parenthesized group with
                    # nothing before it, so unwrap it.
                    yield from split_ingredients(text[pos+1:i] + text[i+1:])
                    return
                keyword_pos = -1
                for keyword in KEYWORDS:
                    tmp = text[pos+1:i].find(keyword)
                    if tmp > -1:
                        if keyword_pos == -1:
                            keyword_pos = pos+1+tmp
                        else:
                            keyword_pos = min(keyword_pos,pos+1+tmp)
                if '(' not in text[pos+1:i] and ')' not in text[pos+1:i] and keyword_pos >= 0:
                    # a (vegan), b
                    #         ^
                    # FIXME: this is kind of hacky
                    if i == len(text) - 1:
                        yield text[:i+1]
                        return
                    # We reached the end of the last right parentheses and
                    # there is a keyword, i.e. we want to keep the parentheses.
                    # Therefore we continue and wait until the next comma.
                    continue
                elif ',' in text[pos+1:i]:
                    # We reached the end of the last right parentheses and
                    # there is no keyword or sub-parentheses *and* there is a
                    # comma. Here, we want to ignore what's before the
                    # parentheses and just split the ingredients inside the
                    # parentheses and the following text.
                    yield from split_ingredients(text[pos+1:i] + text[i+1:])
                    return
                else:
                    # We reached the end of the last right parentheses and
                    # there is no keyword or sub-parentheses and there is *no*
                    # comma. Here, we want to ignore what's inside the
                    # parentheses.
                    if any(keyword in text[:pos] for keyword in L_KEYWORDS):
                        continue
                    yield from split_ingredients(text[:pos] + text[i+1:])
                    return
    # If we got here and there is an unmatched parenthesis, assume they
    # all close here
    if lbr > 0:
        yield from split_ingredients(text + ')')
        return
    yield text
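
The core idea in the function above, tracking parenthesis depth and only splitting on commas at depth zero, is easier to see in isolation. The following is a much smaller sketch of just that idea, not a replacement for the function above: `split_top_level` and the sample `label` are made-up names for illustration, and it ignores keywords, colons, and square brackets entirely.

```python
def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts = []
    depth = 0   # number of currently unclosed '('
    start = 0   # index where the current ingredient begins
    for i, ch in enumerate(text):
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth = max(depth - 1, 0)  # tolerate a stray ')'
        elif ch == ',' and depth == 0:
            parts.append(text[start:i].strip())
            start = i + 1
    parts.append(text[start:].strip())
    # Drop empty pieces caused by doubled or trailing commas.
    return [p for p in parts if p]


# Hypothetical label, shortened from the kind of list shown earlier.
label = "SUGAR, FLOUR (WHEAT, MALTED BARLEY), SALT"

naive = [p.strip() for p in label.split(",")]
print(naive)                   # ['SUGAR', 'FLOUR (WHEAT', 'MALTED BARLEY)', 'SALT']
print(split_top_level(label))  # ['SUGAR', 'FLOUR (WHEAT, MALTED BARLEY)', 'SALT']
```

The naive split cuts the parenthesized sub-list in half, while the depth-aware version keeps it attached to its parent ingredient. Deciding what to do with that sub-list afterwards is where the real function spends most of its branches.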

Normalizing Ingredients

One way I've tried to deal with the many spellings of the same ingredient is a Python script that looks for obvious matches. For example, it strips the word "organic" from the beginning of an ingredient name, since whether something is organic does not affect its vegan status. This alone let me classify many ingredients. However, I've had to get more clever for trickier things like food dyes. Right now this is the regular expression I use to normalize food colorings:

color_regex = re.compile(r'^(?:(?:artificial )?color(?:ing|s)? )?(?:[(]?f[. ]*[dc][. ]*&[ ]*c[ .]*)?(?: lake)? ?(?:colors?)?[( ]*(red|green|blue|yellow)[ ]?(?:[ ]?lake)?[ ]?(?:#|no[.]?|n[.]?)?[ ]?([0-9]+)(?:[ ]?lake)?([ ]?\(e[0-9]*\))?[)]?(?: dye)?')
name = color_regex.sub(r"\g<1> \g<2>",name)

This normalizes things like the following:

color_regex.sub(r"\g<1> \g<2>","artificial color (red lake 40)") == "red 40"
color_regex.sub(r"\g<1> \g<2>","coloring (red #40 lake)") == "red 40"
color_regex.sub(r"\g<1> \g<2>","red n.40") == "red 40"
color_regex.sub(r"\g<1> \g<2>","f.d. & c blue n.2") == "blue 2"
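
The simpler prefix-stripping step mentioned at the start of this section could be sketched roughly like this. It is an illustration rather than the actual script: `NEUTRAL_PREFIXES`, `prefix_regex`, and `strip_neutral_prefixes` are hypothetical names, the inclusion of "non-gmo" is my assumption, and lowercasing the name first is also an assumption.

```python
import re

# Hypothetical list of leading qualifiers that do not change vegan status.
NEUTRAL_PREFIXES = ("organic", "non-gmo")

# Match one or more of the qualifiers at the start, each followed by whitespace.
prefix_regex = re.compile(
    r"^(?:(?:" + "|".join(re.escape(p) for p in NEUTRAL_PREFIXES) + r")\s+)+"
)


def strip_neutral_prefixes(name):
    """Lowercase the name and remove leading qualifiers such as 'organic'."""
    name = name.strip().lower()
    return prefix_regex.sub("", name)


print(strip_neutral_prefixes("Organic Cane Sugar"))         # cane sugar
print(strip_neutral_prefixes("NON-GMO ORGANIC RICE MILK"))  # rice milk
print(strip_neutral_prefixes("inorganic salt"))             # inorganic salt
```

Requiring whitespace after each qualifier keeps the pattern from eating into words that merely start with the same letters.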

There are also a lot of misspelled ingredients. See below for all the different ways MSG is spelled. Right now, these are mostly handled manually. I thought about computing the Levenshtein distance between unknown ingredients and known ingredients, but I can think of too many edge cases off the top of my head where this would go wrong. For example, beefsteak tomato, beefsteak plant, and beefsteak share most of their characters but are three different things.
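
To make that concern concrete, here is a minimal sketch of what distance-based matching might look like. It is not part of the current pipeline: `levenshtein`, `closest_known`, the `known` list, and the `max_distance` threshold are all illustrative names and values.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest_known(name, known, max_distance=2):
    """Return the nearest known ingredient, or None if it is too far away."""
    best = min(known, key=lambda k: levenshtein(name, k))
    return best if levenshtein(name, best) <= max_distance else None


# Hypothetical set of already-classified ingredients.
known = ["monosodium glutamate", "beefsteak"]

print(closest_known("monosodium glumate", known))               # monosodium glutamate
print(closest_known("beefsteak plant", known))                  # None
print(closest_known("beefsteak plant", known, max_distance=6))  # beefsteak
```

A tight threshold catches simple typos but misses longer variants, and a threshold loose enough to absorb a few extra words also merges "beefsteak plant" into "beefsteak".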

Ways to spell/misspell MSG

contains monosodium glutamate
monosodiumglutamate
e 621
e621
621
msg
pure msg
monosodium glumate
msg ( a natural flavor enhancer derived from corn or beets)
msg(a natural flavor enhancer derived from corn or beets)
monosodium l-glutamate
monosodium glutamate as a flavor enhancer
monosodium glutamiate e-621
e621 monosodium glutamate
monosodium glutumate
monosodium gultamate
monosodium glutatmate
monosodium glutamte
monosodium glutmate
monosodium glutamiate
monosodium gluatamate
monosodiium glutamate
monosodium glutenmate
monosodium glutamate flavor enhancer
monosodium glytamate
monosodium glutammate
monosodium glutemate
monosodium lutamate
monosodio glutamate
flavour enhanced monosodium glutamate
monosodium glut
monosodium l -glutamate
monosodium glutamata
monosodium glutanate
monosodium glutamamte
monosodium gutamate
monosodium gglutamate
flavor enhancer monosodium glutamate
flovors monosodium glutamate
monosodium glutamate as flavor enhancer
monosodium glutamate e621
monosodium glutameta
monosodium l -dlutamate
monosodic glutamate
monosodium glumate as flavor enhancer
monosodium glutamine
monosodium glutamute
monosodum glutamate
monosodium l glutamate
monosodium glutmamate
monosodium glutiamate
monosodium glutamate as permitted flavour enhancer
monosodium glumate as permitted flavour enhancer
monosodium l-glutaminate
monosodium glutamake
monosodi-um glutamate
monosodium l-glutamate e621
monosodium glutimate
monosodium monosodium glutamate
monosodium gluamate
monosodium glutamante
monosodium glutamate imsgi
monosodium gluta-mate
l-monosodium glutamate
monosodium glutaomate
natural flavors monosodium glutamate
monosdium glutamate
mono sodium glutamate
monosoodium glutamate
sodium glutamate
monosodium glutamate as flavour enhancer
momosodium glutamate
marinated with monosodium glutamate
marinade with monosodium glutamate
mono-sodium glutamate
monosoidum glutamate
disodium glutamate
monsodium glutamate
monododium glutamate
mono- sodium glutamate
monosidum glutamate
maonosodium glutamate
monosoidium glutamate
monoso-dium glutamate
monosodium glutamate (a natural flavor enhancer derived from corn or beets)
flavor enhancer sodium glutamate
monossodium glutamate
modium glutamate
monososodium glutamate
nonosodium glutamate