Parsing and Classifying Ingredients

A short post about the tricks used to parse and classify ingredients.

Posted on: 2 Nov 2025

The hardest part of compiling the vegandex database so far has been parsing and classifying ingredients. The list of "primary" ingredients currently has only about 4,000 entries, but roughly 140,000 distinct ingredient strings show up in ingredient lists. The vast majority of these come from either parsing errors when extracting ingredients from an ingredient list or different ways of specifying or spelling the same ingredient.

Parsing Ingredients

Parsing the ingredient list seems simple at first: split the list on commas and you have your list of ingredients. That works for very simple ingredient lists, but then you get something like:
["NON-GMO SUGAR","NON- GMO FLOUR (WHEAT, MALTED BARLEY)","NON-GMO ORGANIC RICE MILK (FILTERED WATER, BROWN RICE [PARTIALLY MILLED], EXPELLER-PRESSED CANOLA OIL AND/OR SAFFLOWER OIL AND/OR SUNFLOWER OIL, TRICALCIUM PHOSPHATE, SEA SALT, VITAMIN A PALMITATE, VITAMIN D2, VITAMIN B12)","NON-GMO EXPELLER-PRESSED CANOLA OIL","WATER","SALT","NON-GMO BAKING POWDER (MONOCALCIUM PHOSPHATE, SODIUM BICARBONATE [BAKING SODA], NON-GMO CORN STARCH)","BAKING SODA (SODIUM BICARBONATE)","NON-GMO VANILLA AND CRUMBS (NON-GMO GLUTEN FREE FLOUR (RICE FLOUR, WHOLE GRAIN BROWN RICE FLOUR, WHOLE SORGHUM FLOUR, TAPIOCA STARCH, POTATO STARCH, CELLULOSE, XANTHAN GUM, VITAMIN AND MINERAL BLEND [CALCIUM CARBONATE, NIACINAMIDE (VITAMIN B3), REDUCED IRON, THIAMIN HYDROCHLORIDE (VITAMIN B1), RIBOFLAVIN (VITAMIN B2)], NON-GMO BROWN SUGAR, NON-GMO VEGAN BUTTER (NATURAL OIL BLEND (PALM FRUIT, CANOLA AND OLIVE OIL), WATER, CONTAINS 2% OR LESS OF: SALT, NATURAL FLAVOR (DERIVED FROM CORN, NO MSG, NO ALCOHOL, NO GLUTEN, GMO FREE), SUNFLOWER LECITHIN-AN EMULSIFIER, LACTIC ACID (NON-DAIRY DERIVED FROM SUGAR BEETS), ANNATTO EXTRACT-COLOR), NON-GMO VANILLA, CINNAMON)."]
This is actually one of the easier ingredient lists to parse. The algorithm I'm currently using looks like:
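import re

# KEYWORDS and L_KEYWORDS are module-level collections of strings defined
# elsewhere in the project (not shown here). Judging from the code below, a
# KEYWORDS match inside a parenthetical, e.g. "(vegan)", means the
# parenthetical is kept; an L_KEYWORDS match before it has the same effect.
def split_ingredients(text):
    """
    Split up a JSON/text string of ingredients. This is a generator.

    Examples (shown as lists, with surrounding whitespace stripped):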

        split_ingredients("a,b") -> ['a','b']
        split_ingredients("a,(b,c)") -> ['a','b','c']
        split_ingredients("a,b (c,d)") -> ['a','c','d']
        split_ingredients("a (b)") -> ['a']
        split_ingredients("a :b") -> ['b']
        split_ingredients("a (c:b)") -> ['a']
        split_ingredients("a (vegan)") -> ['a (vegan)']
        split_ingredients("(a, b (vegan))") -> ['a', 'b (vegan)']
    """
    if text == '':
        return

    rex = re.compile("[:(),]")

    # position of the outermost unclosed left parenthesis
    pos = 0
    # current nesting depth, i.e. number of unclosed left parentheses
    lbr = 0
    for match in rex.finditer(text):
        i = match.span()[0]
        c = text[i]
        if c == ':' and lbr == 0:
            yield from split_ingredients(text[i+1:])
            return
        elif c == ',' and lbr == 0:
            # This is the main point of this algorithm, if we come to a comma
            # we split the ingredients to the left and right of it recursively.
            yield from split_ingredients(text[:i])
            yield from split_ingredients(text[i+1:])
            return
        elif c == '(':
            if lbr == 0:
                pos = i
            lbr += 1
            continue
        elif c == ')':
            lbr -= 1
            if lbr < 0:
                # Unmatched right parenthesis: drop it and re-parse.
                yield from split_ingredients(text[:i] + text[i+1:])
                return
            if lbr == 0:
                if len(text[:pos].strip()) == 0:
                    # We reached the end of a parenthesized group and there
                    # is nothing before it
                    yield from split_ingredients(text[pos+1:i] + text[i+1:])
                    return
                keyword_pos = -1
                for keyword in KEYWORDS:
                    tmp = text[pos+1:i].find(keyword)
                    if tmp > -1:
                        if keyword_pos == -1:
                            keyword_pos = pos+1+tmp
                        else:
                            keyword_pos = min(keyword_pos,pos+1+tmp)
                if '(' not in text[pos+1:i] and ')' not in text[pos+1:i] and keyword_pos >= 0:
                    # a (vegan), b
                    #         ^
                    # FIXME: this is kind of hacky
                    if i == len(text) - 1:
                        yield text[:i+1]
                        return
                    # We reached the end of the last right parentheses and
                    # there is a keyword, i.e. we want to keep the parentheses.
                    # Therefore we continue and wait until the next comma.
                    continue
                elif ',' in text[pos+1:i]:
                    # We reached the end of the last right parentheses and
                    # there is no keyword or sub-parentheses *and* there is a
                    # comma. Here, we want to ignore what's before the
                    # parentheses and just split the ingredients inside the
                    # parentheses and the following text.
                    yield from split_ingredients(text[pos+1:i] + text[i+1:])
                    return
                else:
                    # We reached the end of the last right parentheses and
                    # there is no keyword or sub-parentheses and there is *no*
                    # comma. Here, we want to ignore what's inside the
                    # parentheses.
                    if any(keyword in text[:pos] for keyword in L_KEYWORDS):
                        continue
                    yield from split_ingredients(text[:pos] + text[i+1:])
                    return
    # If we got here and there is an unmatched parenthesis, assume they
    # all close here
    if lbr > 0:
        yield from split_ingredients(text + ')')
        return
    yield text
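
The core idea in the function above, treating a comma as a separator only when it sits outside every parenthesis, is easier to see in isolation. The sketch below is a simplified illustration rather than the full algorithm: `split_top_level` is a hypothetical helper that keeps each parenthesized sub-list attached to its parent and applies none of the keyword or colon rules.

```python
def split_top_level(text):
    """Split on commas that are not nested inside parentheses.

    Hypothetical helper for illustration only; unlike split_ingredients it
    keeps each parenthesized sub-list attached to its parent ingredient.
    """
    parts = []
    depth = 0   # number of currently unclosed left parentheses
    start = 0   # index where the current ingredient begins
    for i, c in enumerate(text):
        if c == '(':
            depth += 1
        elif c == ')':
            # Never go below zero, so a stray ')' cannot hide later commas.
            depth = max(depth - 1, 0)
        elif c == ',' and depth == 0:
            parts.append(text[start:i].strip())
            start = i + 1
    parts.append(text[start:].strip())
    # Drop empty pieces produced by empty input or doubled commas.
    return [p for p in parts if p]


example = "SUGAR, FLOUR (WHEAT, MALTED BARLEY), SALT"

# A plain split cuts the flour's sub-list in half.
print(example.split(","))
# The depth-aware split keeps it together.
print(split_top_level(example))
```

The plain split returns four pieces, two of them fragments of the flour entry, while the depth-aware version returns three. The real function goes further by recursing into the parentheses and deciding whether to keep the parent, the children, or both.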

Normalizing Ingredients

One way I've dealt with the spelling and phrasing variants is a Python script that looks for obvious matches. For example, it strips the word "organic" from the beginning of an ingredient name, since whether something is organic does not affect its vegan status. This helped me classify many ingredients. However, I've had to get more clever for trickier things like food dyes. Right now this is the regular expression I use to normalize food colorings:

color_regex = re.compile(r'^(?:(?:artificial )?color(?:ing|s)? )?(?:[(]?f[. ]*[dc][. ]*&[ ]*c[ .]*)?(?: lake)? ?(?:colors?)?[( ]*(red|green|blue|yellow)[ ]?(?:[ ]?lake)?[ ]?(?:#|no[.]?|n[.]?)?[ ]?([0-9]+)(?:[ ]?lake)?([ ]?\(e[0-9]*\))?[)]?(?: dye)?')
name = color_regex.sub(r"\g<1> \g<2>",name)
This normalizes things like the following:
color_regex.sub(r"\g<1> \g<2>","artificial color (red lake 40)") == "red 40"
color_regex.sub(r"\g<1> \g<2>","coloring (red #40 lake)") == "red 40"
color_regex.sub(r"\g<1> \g<2>","red n.40") == "red 40"
color_regex.sub(r"\g<1> \g<2>","f.d. & c blue n.2") == "blue 2"
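
For comparison, the simpler prefix-stripping step mentioned at the start of this section might look something like the sketch below. The `PREFIXES` tuple and the `strip_prefixes` helper are hypothetical names; only "organic" comes from the text above, and the lowercasing and whitespace collapsing are assumptions about how names are cleaned up beforehand.

```python
import re

# Hypothetical list of leading qualifiers that do not change what the
# ingredient is. Only "organic" is taken from the post.
PREFIXES = ("organic",)

# Matches one or more of the qualifiers at the start of the name,
# each followed by whitespace.
prefix_regex = re.compile(
    r'^(?:(?:' + '|'.join(re.escape(p) for p in PREFIXES) + r')\s+)+'
)


def strip_prefixes(name):
    """Lowercase a name, collapse whitespace, and drop leading qualifiers."""
    name = ' '.join(name.lower().split())
    return prefix_regex.sub('', name)


print(strip_prefixes("Organic  Cane Sugar"))    # cane sugar
print(strip_prefixes("organic organic flour"))  # flour
print(strip_prefixes("inorganic salt"))         # inorganic salt
```

Anchoring the pattern at the start of the string and requiring whitespace after each qualifier keeps words such as "inorganic" from being touched.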

There are also a lot of misspelled ingredients. See below for all the different ways MSG is spelled. Right now, these are mostly handled manually. I thought about using an algorithm that calculates the Levenshtein distance between unknown ingredients and known ingredients, but I can think of too many edge cases off the top of my head where this wouldn't work. For example, beefsteak tomato, beefsteak plant, and beefsteak are three different ingredients whose names differ by only a word.
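
To make the concern concrete, here is a small sketch of the edit-distance idea using a textbook dynamic-programming `levenshtein` function (an illustration, not code from vegandex). It compares one typo and one abbreviation of MSG against the canonical name, and two of the beefsteak names against each other.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # prev[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match for free
            ))
        prev = curr
    return prev[-1]


pairs = [
    ("monosodium glumate", "monosodium glutamate"),  # typo, same ingredient
    ("msg", "monosodium glutamate"),                 # abbreviation, same ingredient
    ("beefsteak", "beefsteak plant"),                # different ingredients
]
for a, b in pairs:
    print(f"{a!r} vs {b!r}: {levenshtein(a, b)}")
```

The typo is 2 edits from the canonical name, but the abbreviation is 17 edits away even though it is the same ingredient, and "beefsteak" is only 6 edits from "beefsteak plant" even though the two are different things. No single distance threshold separates these cases cleanly.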

Ways to spell/misspell MSG

  1. contains monosodium glutamate
  2. monosodiumglutamate
  3. e 621
  4. e621
  5. 621
  6. msg
  7. pure msg
  8. monosodium glumate
  9. msg ( a natural flavor enhancer derived from corn or beets)
  10. msg(a natural flavor enhancer derived from corn or beets)
  11. monosodium l-glutamate
  12. monosodium glutamate as a flavor enhancer
  13. monosodium glutamiate e-621
  14. e621 monosodium glutamate
  15. monosodium glutumate
  16. monosodium gultamate
  17. monosodium glutatmate
  18. monosodium glutamte
  19. monosodium glutmate
  20. monosodium glutamiate
  21. monosodium gluatamate
  22. monosodiium glutamate
  23. monosodium glutenmate
  24. monosodium glutamate flavor enhancer
  25. monosodium glytamate
  26. monosodium glutammate
  27. monosodium glutemate
  28. monosodium lutamate
  29. monosodio glutamate
  30. flavour enhanced monosodium glutamate
  31. monosodium glut
  32. monosodium l -glutamate
  33. monosodium glutamata
  34. monosodium glutanate
  35. monosodium glutamamte
  36. monosodium gutamate
  37. monosodium gglutamate
  38. flavor enhancer monosodium glutamate
  39. flovors monosodium glutamate
  40. monosodium glutamate as flavor enhancer
  41. monosodium glutamate e621
  42. monosodium glutameta
  43. monosodium l -dlutamate
  44. monosodic glutamate
  45. monosodium glumate as flavor enhancer
  46. monosodium glutamine
  47. monosodium glutamute
  48. monosodum glutamate
  49. monosodium l glutamate
  50. monosodium glutmamate
  51. monosodium glutiamate
  52. monosodium glutamate as permitted flavour enhancer
  53. monosodium glumate as permitted flavour enhancer
  54. monosodium l-glutaminate
  55. monosodium glutamake
  56. monosodi-um glutamate
  57. monosodium l-glutamate e621
  58. monosodium glutimate
  59. monosodium monosodium glutamate
  60. monosodium gluamate
  61. monosodium glutamante
  62. monosodium glutamate imsgi
  63. monosodium gluta-mate
  64. l-monosodium glutamate
  65. monosodium glutaomate
  66. natural flavors monosodium glutamate
  67. monosdium glutamate
  68. mono sodium glutamate
  69. monosoodium glutamate
  70. sodium glutamate
  71. monosodium glutamate as flavour enhancer
  72. momosodium glutamate
  73. marinated with monosodium glutamate
  74. marinade with monosodium glutamate
  75. mono-sodium glutamate
  76. monosoidum glutamate
  77. disodium glutamate
  78. monsodium glutamate
  79. monododium glutamate
  80. mono- sodium glutamate
  81. monosidum glutamate
  82. maonosodium glutamate
  83. monosoidium glutamate
  84. monoso-dium glutamate
  85. monosodium glutamate (a natural flavor enhancer derived from corn or beets)
  86. flavor enhancer sodium glutamate
  87. monossodium glutamate
  88. modium glutamate
  89. monososodium glutamate
  90. nonosodium glutamate
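
Handling these manually amounts to maintaining a lookup table from each observed spelling to one canonical name. A minimal sketch of that idea follows; the `ALIASES` dictionary and the `canonical_name` function are hypothetical, and only a handful of entries from the list above are included.

```python
# Hypothetical alias table: observed spelling -> canonical ingredient name.
# The keys are copied from the list above; a real table would hold all of them.
ALIASES = {
    "e621": "monosodium glutamate",
    "e 621": "monosodium glutamate",
    "msg": "monosodium glutamate",
    "monosodiumglutamate": "monosodium glutamate",
    "monosodium glumate": "monosodium glutamate",
    "mono-sodium glutamate": "monosodium glutamate",
}


def canonical_name(name):
    """Return the canonical name for a known spelling, else the input
    lowercased with whitespace collapsed."""
    key = ' '.join(name.lower().split())
    return ALIASES.get(key, key)


print(canonical_name("MSG"))                  # monosodium glutamate
print(canonical_name("Monosodium  Glumate"))  # monosodium glutamate
print(canonical_name("sea salt"))             # sea salt
```

An exact-match table never merges two different ingredients by accident, which is the risk with fuzzy matching, at the cost of adding each new spelling by hand.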