The hardest part of compiling the vegandex database so far has been parsing and classifying ingredients. The list of "primary" ingredients is currently only about 4,000 entries, but roughly 140,000 distinct ingredient strings show up in ingredient lists. The vast majority of these come from either parsing errors when extracting ingredients from an ingredient list or different ways of specifying or spelling the same ingredient.
Parsing Ingredients
Parsing the ingredient list seems simple at first: split the list on commas and you have your list of ingredients. That works for very simple ingredient lists, but then you get something like:

[
  "NON-GMO SUGAR",
  "NON- GMO FLOUR (WHEAT, MALTED BARLEY)",
  "NON-GMO ORGANIC RICE MILK (FILTERED WATER, BROWN RICE [PARTIALLY MILLED], EXPELLER-PRESSED CANOLA OIL AND/OR SAFFLOWER OIL AND/OR SUNFLOWER OIL, TRICALCIUM PHOSPHATE, SEA SALT, VITAMIN A PALMITATE, VITAMIN D2, VITAMIN B12)",
  "NON-GMO EXPELLER-PRESSED CANOLA OIL",
  "WATER",
  "SALT",
  "NON-GMO BAKING POWDER (MONOCALCIUM PHOSPHATE, SODIUM BICARBONATE [BAKING SODA], NON-GMO CORN STARCH)",
  "BAKING SODA (SODIUM BICARBONATE)",
  "NON-GMO VANILLA AND CRUMBS (NON-GMO GLUTEN FREE FLOUR (RICE FLOUR, WHOLE GRAIN BROWN RICE FLOUR, WHOLE SORGHUM FLOUR, TAPIOCA STARCH, POTATO STARCH, CELLULOSE, XANTHAN GUM, VITAMIN AND MINERAL BLEND [CALCIUM CARBONATE, NIACINAMIDE (VITAMIN B3), REDUCED IRON, THIAMIN HYDROCHLORIDE (VITAMIN B1), RIBOFLAVIN (VITAMIN B2)], NON-GMO BROWN SUGAR, NON-GMO VEGAN BUTTER (NATURAL OIL BLEND (PALM FRUIT, CANOLA AND OLIVE OIL), WATER, CONTAINS 2% OR LESS OF: SALT, NATURAL FLAVOR (DERIVED FROM CORN, NO MSG, NO ALCOHOL, NO GLUTEN, GMO FREE), SUNFLOWER LECITHIN-AN EMULSIFIER, LACTIC ACID (NON-DAIRY DERIVED FROM SUGAR BEETS), ANNATTO EXTRACT-COLOR), NON-GMO VANILLA, CINNAMON)."
]

This is actually one of the easier ingredient lists to parse. The algorithm I'm currently using looks like:
import re

# KEYWORDS and L_KEYWORDS are module-level collections of strings defined
# elsewhere in the project; they are not shown in this post.

def split_ingredients(text):
    """
    Split up a JSON/text string of ingredients. Is a generator.

    Examples:
        split_ingredients("a,b") -> ['a','b']
        split_ingredients("a,(b,c)") -> ['a','b','c']
        split_ingredients("a,b (c,d)") -> ['a','c','d']
        split_ingredients("a (b)") -> ['a']
        split_ingredients("a :b") -> ['b']
        split_ingredients("a (c:b)") -> ['b']
        split_ingredients("a (vegan)") -> ['a (vegan)']
        split_ingredients("(a, b (vegan))") -> ['a', 'b (vegan)']
    """
    if text == '':
        return

    rex = re.compile("[:(),]")

    # position of the leftmost left parenthesis
    pos = 0
    # number of left parentheses
    lbr = 0
    for match in rex.finditer(text):
        i = match.span()[0]
        c = text[i]
        if c == ':' and lbr == 0:
            yield from split_ingredients(text[i+1:])
            return
        elif c == ',' and lbr == 0:
            # This is the main point of this algorithm, if we come to a comma
            # we split the ingredients to the left and right of it recursively.
            yield from split_ingredients(text[:i])
            yield from split_ingredients(text[i+1:])
            return
        elif c == '(':
            if lbr == 0:
                pos = i
            lbr += 1
            continue
        elif c == ')':
            lbr -= 1
            if lbr < 0:
                # There was an error.
                yield from split_ingredients(text[:i] + text[i+1:])
                return
            if lbr == 0:
                if len(text[:pos].strip()) == 0:
                    # We come to the end of a parentheses and there is nothing
                    # before it
                    yield from split_ingredients(text[pos+1:i] + text[i+1:])
                    return
                keyword_pos = -1
                for keyword in KEYWORDS:
                    tmp = text[pos+1:i].find(keyword)
                    if tmp > -1:
                        if keyword_pos == -1:
                            keyword_pos = pos+1+tmp
                        else:
                            keyword_pos = min(keyword_pos, pos+1+tmp)
                if '(' not in text[pos+1:i] and ')' not in text[pos+1:i] and keyword_pos >= 0:
                    # a (vegan), b
                    #         ^
                    # FIXME: this is kind of hacky
                    if i == len(text) - 1:
                        yield text[:i+1]
                        return
                    # We reached the end of the last right parentheses and
                    # there is a keyword, i.e. we want to keep the parentheses.
                    # Therefore we continue and wait until the next comma.
                    continue
                elif ',' in text[pos+1:i]:
                    # We reached the end of the last right parentheses and
                    # there is no keyword or sub-parentheses *and* there is a
                    # comma. Here, we want to ignore what's before the
                    # parentheses and just split the ingredients inside the
                    # parentheses and the following text.
                    yield from split_ingredients(text[pos+1:i] + text[i+1:])
                    return
                else:
                    # We reached the end of the last right parentheses and
                    # there is no keyword or sub-parentheses and there is *no*
                    # comma. Here, we want to ignore what's inside the
                    # parentheses.
                    if any(keyword in text[:pos] for keyword in L_KEYWORDS):
                        continue
                    yield from split_ingredients(text[:pos] + text[i+1:])
                    return
    # If we got here and there is an unmatched parenthesis, assume they
    # all close here
    if lbr > 0:
        yield from split_ingredients(text + ')')
        return
    yield text
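To see why a plain comma split breaks down on lists like the one above, here is a minimal sketch comparing it with a split that only cuts on commas outside of brackets. This is not the algorithm above, just an illustration of the underlying problem; the names naive_split and top_level_split are made up for this example.

```python
def naive_split(text):
    """Split on every comma, ignoring nesting (the 'simple at first' approach)."""
    return [part.strip() for part in text.split(",")]


def top_level_split(text):
    """Split only on commas that are not nested inside () or [].

    Illustrative helper, not part of the vegandex code.
    """
    parts = []
    depth = 0   # how many brackets are currently open
    start = 0   # index where the current ingredient begins
    for i, ch in enumerate(text):
        if ch in "([":
            depth += 1
        elif ch in ")]":
            # Tolerate stray closing brackets instead of going negative.
            depth = max(depth - 1, 0)
        elif ch == "," and depth == 0:
            parts.append(text[start:i].strip())
            start = i + 1
    parts.append(text[start:].strip())
    return [p for p in parts if p]


example = "NON-GMO SUGAR, NON-GMO FLOUR (WHEAT, MALTED BARLEY), WATER"
print(naive_split(example))
# ['NON-GMO SUGAR', 'NON-GMO FLOUR (WHEAT', 'MALTED BARLEY)', 'WATER']
print(top_level_split(example))
# ['NON-GMO SUGAR', 'NON-GMO FLOUR (WHEAT, MALTED BARLEY)', 'WATER']
```

The naive version cuts "FLOUR (WHEAT, MALTED BARLEY)" in half, which is one source of the bogus ingredient strings mentioned earlier. The depth-aware version keeps each top-level ingredient whole, but it still leaves open what to do with the parenthesized part, which is what the recursive function above tries to decide.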
Normalizing Ingredients
One of the ways I've tried to deal with these spelling variations is by writing a Python script to look for obvious matches. For example, it strips the word "organic" from the beginning of an ingredient name, since whether something is organic does not affect its vegan status. This helps a lot, and helped me to classify many ingredients. However, I've had to get more clever for trickier things like food dyes. Right now this is the regular expression I use to normalize food colorings:
color_regex = re.compile(r'^(?:(?:artificial )?color(?:ing|s)? )?(?:[(]?f[. ]*[dc][. ]*&[ ]*c[ .]*)?(?: lake)? ?(?:colors?)?[( ]*(red|green|blue|yellow)[ ]?(?:[ ]?lake)?[ ]?(?:#|no[.]?|n[.]?)?[ ]?([0-9]+)(?:[ ]?lake)?([ ]?\(e[0-9]*\))?[)]?(?: dye)?')
name = color_regex.sub(r"\g<1> \g<2>",name)
This normalizes things like the following:
color_regex.sub(r"\g<1> \g<2>","artificial color (red lake 40)") == "red 40"
color_regex.sub(r"\g<1> \g<2>","coloring (red #40 lake)") == "red 40"
color_regex.sub(r"\g<1> \g<2>","red n.40") == "red 40"
color_regex.sub(r"\g<1> \g<2>","f.d. & c blue n.2") == "blue 2"
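The two normalization steps described above (dropping a leading "organic" and collapsing food-dye spellings) could be combined roughly as follows. This is a sketch under assumptions: the function name normalize_ingredient, the organic_prefix pattern, and the lowercasing step are hypothetical and not taken from the actual script; only color_regex is copied from the post.

```python
import re

# Copied verbatim from the post.
color_regex = re.compile(r'^(?:(?:artificial )?color(?:ing|s)? )?(?:[(]?f[. ]*[dc][. ]*&[ ]*c[ .]*)?(?: lake)? ?(?:colors?)?[( ]*(red|green|blue|yellow)[ ]?(?:[ ]?lake)?[ ]?(?:#|no[.]?|n[.]?)?[ ]?([0-9]+)(?:[ ]?lake)?([ ]?\(e[0-9]*\))?[)]?(?: dye)?')

# Hypothetical pattern: only handles the word "organic" at the very start.
organic_prefix = re.compile(r'^organic\s+')


def normalize_ingredient(name):
    """Lowercase, strip a leading "organic", then collapse food-dye spellings.

    Assumes the input is a single, already-split ingredient string.
    """
    name = name.strip().lower()
    name = organic_prefix.sub('', name)
    # Keep only the color (group 1) and the number (group 2).
    name = color_regex.sub(r'\g<1> \g<2>', name)
    return name


print(normalize_ingredient("Organic Cane Sugar"))              # cane sugar
print(normalize_ingredient("artificial color (red lake 40)"))  # red 40
print(normalize_ingredient("f.d. & c blue n.2"))               # blue 2
```

Names that the color regex does not match pass through unchanged, so the steps can be chained without special-casing.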
There are also a lot of misspelled ingredients. See below for all the different ways MSG is spelled. Right now, these are mostly handled manually. I thought about using an algorithm that calculates the Levenshtein distance between unknown and known ingredients, but I can think of too many edge cases off the top of my head where this wouldn't work. For example, beefsteak tomato, beefsteak plant, and beefsteak share most of their characters but are three different things: a tomato, an herb, and meat.
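For reference, the Levenshtein idea mentioned above could be sketched like this. It is a plain dynamic-programming implementation written for illustration, not something the project uses. The misspelling "monosodium glutimate" is a made-up example.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    # previous[j] holds the distance between the prefix of a seen so far
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


pairs = [
    ("monosodium glutamate", "monosodium glutimate"),  # made-up misspelling
    ("beefsteak", "beefsteak plant"),                  # different foods
    ("beefsteak tomato", "beefsteak plant"),           # different foods
]
for known, unknown in pairs:
    print(known, "|", unknown, "->", levenshtein(known, unknown))
```

The distance only measures how similar two strings look. Any cutoff used to auto-merge names would therefore still need manual review for cases like the beefsteak ones.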
Ways to spell/misspell MSG