U.S. Department of Health and Human Services


Listeria monocytogenes Risk Assessment: Appendix 12: Cluster Analysis For Grouping of Food Categories


FDA/Center for Food Safety and Applied Nutrition
USDA/Food Safety and Inspection Service
September 2003



The results of the uncertainty analysis of the risk assessment were summarized by a cluster analysis of food categories. The similarity between categories was evaluated for the predicted number of cases of listeriosis expressed as the risk per serving and per annum. Cluster analysis is a descriptive statistical technique by which a set of objects is partitioned or classified into subsets according to some measure of similarity between objects (1). Typically, this partitioning is defined to generate hierarchical subsets of the objects to be classified. A single level of disjoint partitioning, without any sub-partitioning of the objects within the primary clusters, is a special case of the more general objective of obtaining a hierarchical classification.

Using a cluster analysis to summarize the results of the L. monocytogenes risk assessment conveys the implications of the uncertainty analysis for the rankings of food categories. In some sense this is more informative than statistical point null hypothesis tests of differences in the location of the distribution of ranks across food categories (e.g., as provided by the Kruskal-Wallis test or the sign test). Testing for differences in location (e.g., the median) of the uncertainty distributions of risk rankings, according to either risk per serving or cases per annum, does not consider whether the differences obtained are meaningful on a practical level.

Although the possibility exists that the elicitation and specification of the variability and uncertainty of the model could result in two or more pairings of food categories with identical distributions for either risk per serving or expected cases per annum, this is very unlikely, and small differences in the location of rank distributions are expected. In this event, a statistical analysis of the simulation output that uses point null hypothesis tests to define differences between food categories is likely to categorize all such (small) differences as significant (i.e., provided that the output of the simulation is sufficiently large). While composite rather than point null hypotheses could be used to define practical or meaningful differences between the risk rankings of different food categories (e.g., by equivalence testing methods), these methods are not readily available. Consequently, a cluster analysis approach was adopted as an alternative.

Central to any cluster analysis is the specification of a definition of similarity, or conversely dissimilarity, between the objects to be classified (1). With respect to a cluster analysis of risk ranking of the food categories, the "objects" to be classified are the uncertainty distributions (of risk per serving and expected cases per annum), and thus a classification requires a definition of the "distance" or dissimilarity between any two such distributions. The measure of similarity adopted here for the cluster analysis was defined by the degree to which any two uncertainty distributions overlap. If, for two food categories, the uncertainty distributions of their risk rankings were identical, then the distributions would overlap maximally and it would be reasonable to infer that the two food categories should be judged to be similar in risk ranking. Conversely, if the risk rank distributions of two food categories did not overlap at all, then it would be reasonable to infer that they are very dissimilar foods in regard to risk ranking.

Based on this intuitive notion of distance between two distributions the following measure of dissimilarity was used:

distance(A, B) = Pr(rank(A) > rank(B))

where A and B denote any two food categories, and rank() denotes their rank distributions (according to either risk per serving or expected number of cases per annum). Thus, if the rank of food category A is higher than that of food category B with a high probability of belief (i.e., according to their uncertainty distributions), then A and B would be considered sufficiently dissimilar to belong in different clusters. A level of 90% probability of belief that the rank of one food category was higher than another was chosen as the cut-off value for classifying any two distributions as dissimilar. That is to say, any two food categories A and B were considered to belong to different risk categories (or clusters) if:

distance(A, B) > 0.90
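
As an illustration of the measure and cut-off rule above, the probability can be estimated as the fraction of uncertainty iterations in which one category outranks the other. The following is a minimal Python sketch, not the code used in the assessment. The function names `prob_ranks_higher` and `is_dissimilar`, the toy rank samples, and the convention that a larger rank value means higher risk are all assumptions made for illustration.

```python
def prob_ranks_higher(ranks_a, ranks_b):
    """Estimate Pr(rank(A) > rank(B)) from paired uncertainty iterations.

    ranks_a[i] and ranks_b[i] are the ranks of categories A and B in the
    same iteration i. Assumption: a larger rank value means higher risk.
    """
    if len(ranks_a) != len(ranks_b) or not ranks_a:
        raise ValueError("rank samples must be non-empty and paired")
    wins = sum(1 for a, b in zip(ranks_a, ranks_b) if a > b)
    return wins / len(ranks_a)


def is_dissimilar(ranks_a, ranks_b, cutoff=0.90):
    """Apply the cut-off rule: dissimilar if distance(A, B) > cutoff."""
    return prob_ranks_higher(ranks_a, ranks_b) > cutoff


# Hypothetical ranks for two categories over 10 iterations (illustrative only;
# the assessment itself used 4,000 uncertainty iterations).
ranks_a = [23, 22, 23, 23, 21, 23, 22, 23, 23, 20]
ranks_b = [20, 21, 19, 22, 22, 18, 20, 21, 19, 19]

distance_ab = prob_ranks_higher(ranks_a, ranks_b)
dissimilar_ab = is_dissimilar(ranks_a, ranks_b)
```

In this toy data A outranks B in 9 of 10 iterations, so the estimated distance equals the cut-off and the pair is not classified as dissimilar, because the rule uses a strict inequality.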

Obviously, both the definition of distance used and what constitutes a "significant" distance under that definition are subjective. With respect to the latter, this is not intrinsically different from the specification of significance levels in frequentist hypothesis testing. A level of 0.05 is common by convention, but it is a subjective choice nonetheless, and other significance levels can be, and often have been, advocated. With respect to the former, the chosen measure of distance is not the only one that could be used. It is also a pseudo-distance measure because it does not satisfy all the properties of a proper distance measure; specifically, it is not a symmetric function of its arguments. However, more sophisticated information-theoretic measures of the distance between two distributions, such as the Kullback-Leibler divergence, are computationally difficult and also do not satisfy all of the properties of a distance measure per se (i.e., they are quasi- or pseudo-distances).
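
The lack of symmetry can be checked against two entries of Table A12-1. The short Python sketch below copies the Deli Meats (DM) versus Pâté and Meat Spreads (P) entries from that table; the dictionary layout and the helper name `is_symmetric` are assumptions for illustration only.

```python
# Two entries copied from Table A12-1 (risk per serving), as fractions.
# Keys are (row category, column category).
distance = {
    ("DM", "P"): 0.658,  # Pr(rank(Deli Meats) > rank(Pate and Meat Spreads))
    ("P", "DM"): 0.342,  # Pr(rank(Pate and Meat Spreads) > rank(Deli Meats))
}


def is_symmetric(d, a, b, tol=1e-9):
    """Return True if d(a, b) equals d(b, a) within a tolerance."""
    return abs(d[(a, b)] - d[(b, a)]) <= tol


symmetric = is_symmetric(distance, "DM", "P")

# When ranks never tie, the two directions are complementary probabilities,
# so they sum to 1 rather than being equal.
total = distance[("DM", "P")] + distance[("P", "DM")]
```

A true distance would give the same value in both directions. Here the two directions differ and sum to 1, which is what is expected when the two categories are never tied in rank.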

Given the chosen definition of distance between two distributions and the cut-off probability value for significant distance, all food categories were compared in a pairwise fashion. Based on these comparisons, the food categories were partitioned into disjoint subsets of similar risk (either by risk per serving or cases per annum) by defining clusters in the ordering of food categories from highest median rank to lowest median rank. Specifically, the food categories were ranked according to their median rank, and the first cluster was taken to be the largest set of ordered food categories (starting from the first) within which no pairwise comparison showed a significant distance between the respective uncertainty distributions. This process was repeated with the remaining food categories until each food category was assigned to one (and only one) cluster. If no other food category was similar to a given food category, based on the definition, then that single food category formed a cluster of one.
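
The partitioning procedure described above can be sketched in Python as follows. This is one reading of the text, not the assessment's own implementation: the function name `cluster_by_order` is hypothetical, and it is assumed that each candidate is compared with every current cluster member in the (higher-ordered, lower-ordered) direction only. The data are the first six categories of Table A12-2 (cases per annum), entered as fractions.

```python
def cluster_by_order(ordered, distance, cutoff=0.90):
    """Greedy partition of categories listed from highest to lowest rank.

    distance[(a, b)] is Pr(rank(a) > rank(b)) for a ordered before b.
    A category joins the current cluster only if no member is dissimilar
    to it (distance > cutoff); otherwise it starts a new cluster.
    """
    clusters = []
    current = []
    for cat in ordered:
        if all(distance[(member, cat)] <= cutoff for member in current):
            current.append(cat)
        else:
            clusters.append(current)
            current = [cat]
    if current:
        clusters.append(current)
    return clusters


# Upper-triangle entries for the first six rows of Table A12-2.
ordered = ["DM", "PM", "HFD", "FNR", "SUC", "P"]
distance = {
    ("DM", "PM"): 0.919, ("DM", "HFD"): 0.985, ("DM", "FNR"): 0.996,
    ("DM", "SUC"): 0.998, ("DM", "P"): 1.000,
    ("PM", "HFD"): 0.603, ("PM", "FNR"): 0.751, ("PM", "SUC"): 0.838,
    ("PM", "P"): 0.960,
    ("HFD", "FNR"): 0.697, ("HFD", "SUC"): 0.804, ("HFD", "P"): 0.953,
    ("FNR", "SUC"): 0.702, ("FNR", "P"): 0.920,
    ("SUC", "P"): 0.606,
}

clusters = cluster_by_order(ordered, distance, cutoff=0.90)
```

On this excerpt the sketch places Deli Meats alone, groups PM, HFD, FNR and SUC together, and starts a new cluster at P, which agrees with the first clusters listed for risk per annum in Table A12-3. With all twenty-three categories, the cluster that begins at P would go on to include further categories.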

The results of the calculations of dissimilarity (or distance) between the twenty-three food categories are shown in Tables A12-1 and A12-2, based on the simulation output of the uncertainty distributions of mean risk per serving and expected number of cases per annum, respectively (n = 4,000 uncertainty samples or iterations). Based on these calculations, the results of clustering the food categories according to either per serving risk or cases per annum are shown in Table A12-3. The sensitivity of the results to different specifications of the cut-off value for belief that one food category ranks higher than another, and is therefore dissimilar, is shown in Table A12-4. A level of 90% probability was chosen here as a reasonable summarization in order to obtain a relatively small number of clusters. At the 90% cut-off value there is a high degree of belief that, based on the uncertainty distributions, the foods in one cluster are of appreciably higher risk than the foods in any lower ranked cluster. While there are differences in risk rankings of food categories within any given cluster, we are not "confident at a 90% level" that the differences are practically significant given all the attendant uncertainties that have been incorporated into the assessment.


 

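Table A12-1. Probabilities¹ (over uncertainty) that food categories rank higher (or lower) than other food categories based on the mean risk per serving.
Category | DM | FNR | P | UM | SS | CR | HFD | SUC | PM | FSC | FR | PF | RS | F | DFS | SSC | SRC | V | DS | IC | PC | CD | HC
DM | 0.0% | 50.6% | 65.8% | 84.9% | 82.8% | 94.7% | 98.5% | 95.1% | 97.4% | 100.0% | 100.0% | 99.1% | 100.0% | 95.8% | 99.6% | 99.9% | 99.9% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0%
FNR | 49.5% | 0.0% | 71.8% | 86.5% | 84.7% | 96.3% | 98.3% | 96.8% | 97.8% | 99.9% | 100.0% | 99.1% | 100.0% | 96.1% | 99.8% | 100.0% | 99.9% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0%
P | 34.2% | 28.2% | 0.0% | 77.4% | 76.6% | 90.1% | 96.0% | 91.0% | 96.0% | 99.9% | 100.0% | 98.7% | 100.0% | 93.1% | 99.2% | 100.0% | 99.9% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0%
UM | 15.1% | 13.6% | 22.6% | 0.0% | 49.9% | 56.0% | 68.9% | 69.2% | 80.1% | 92.0% | 94.9% | 91.0% | 96.5% | 84.4% | 92.2% | 97.0% | 95.3% | 98.4% | 98.3% | 100.0% | 99.9% | 99.5% | 99.8%
SS | 17.3% | 15.3% | 23.5% | 50.2% | 0.0% | 52.8% | 66.6% | 69.1% | 81.7% | 95.1% | 99.8% | 92.0% | 99.5% | 84.8% | 93.6% | 98.8% | 97.0% | 100.0% | 99.3% | 100.0% | 100.0% | 100.0% | 100.0%
CR | 5.4% | 3.7% | 10.0% | 44.0% | 47.2% | 0.0% | 71.0% | 68.0% | 84.6% | 97.6% | 99.8% | 93.3% | 99.7% | 83.5% | 94.4% | 99.4% | 97.3% | 100.0% | 99.6% | 100.0% | 100.0% | 100.0% | 100.0%
HFD | 1.5% | 1.8% | 4.0% | 31.2% | 33.5% | 29.0% | 0.0% | 57.9% | 74.9% | 94.6% | 98.8% | 87.9% | 99.3% | 79.5% | 91.1% | 98.3% | 95.6% | 99.9% | 99.1% | 100.0% | 100.0% | 99.9% | 100.0%
SUC | 4.9% | 3.2% | 9.0% | 30.8% | 30.9% | 32.0% | 42.1% | 0.0% | 57.2% | 78.1% | 85.2% | 78.5% | 87.3% | 73.4% | 81.6% | 87.9% | 84.8% | 90.7% | 91.7% | 97.2% | 97.0% | 96.7% | 98.3%
PM | 2.6% | 2.2% | 4.0% | 19.9% | 18.4% | 15.5% | 25.1% | 42.8% | 0.0% | 80.0% | 93.2% | 78.3% | 96.6% | 75.0% | 84.1% | 95.8% | 88.9% | 99.2% | 97.0% | 99.8% | 100.0% | 99.4% | 100.0%
FSC | 0.0% | 0.2% | 0.1% | 8.1% | 5.0% | 2.5% | 5.4% | 21.9% | 20.0% | 0.0% | 66.6% | 63.1% | 78.2% | 61.8% | 68.4% | 83.2% | 75.1% | 88.6% | 89.2% | 98.0% | 98.4% | 95.7% | 98.5%
FR | 0.0% | 0.0% | 0.0% | 5.2% | 0.2% | 0.3% | 1.2% | 14.9% | 6.8% | 33.4% | 0.0% | 57.4% | 77.1% | 58.0% | 62.8% | 82.3% | 70.2% | 86.5% | 86.6% | 99.2% | 99.5% | 96.0% | 99.0%
PF | 1.0% | 0.9% | 1.4% | 9.1% | 8.0% | 6.7% | 12.1% | 21.5% | 21.7% | 36.9% | 42.6% | 0.0% | 50.6% | 48.4% | 50.7% | 57.5% | 57.6% | 62.6% | 69.8% | 84.5% | 83.7% | 83.3% | 91.1%
RS | 0.0% | 0.0% | 0.0% | 3.6% | 0.5% | 0.4% | 0.7% | 12.7% | 3.4% | 21.8% | 23.0% | 49.5% | 0.0% | 50.3% | 51.8% | 65.9% | 60.2% | 73.1% | 78.0% | 96.8% | 97.8% | 91.7% | 96.7%
F | 4.2% | 3.9% | 6.9% | 15.6% | 15.2% | 16.6% | 20.5% | 26.6% | 25.0% | 38.2% | 42.0% | 51.6% | 49.7% | 0.0% | 50.5% | 57.2% | 57.8% | 60.9% | 69.5% | 84.5% | 84.3% | 83.2% | 91.3%
DFS | 0.4% | 0.2% | 0.8% | 7.8% | 6.4% | 5.6% | 8.9% | 18.4% | 15.9% | 31.6% | 37.2% | 49.3% | 48.2% | 49.5% | 0.0% | 58.1% | 58.3% | 64.4% | 71.8% | 88.8% | 89.1% | 85.8% | 92.7%
SSC | 0.1% | 0.0% | 0.0% | 3.0% | 1.2% | 0.6% | 1.7% | 12.1% | 4.2% | 16.9% | 17.7% | 42.5% | 34.2% | 42.9% | 42.0% | 0.0% | 50.5% | 58.5% | 69.0% | 89.7% | 90.4% | 84.7% | 92.3%
SRC | 0.1% | 0.1% | 0.1% | 4.7% | 3.0% | 2.8% | 4.5% | 15.2% | 11.2% | 24.9% | 29.8% | 42.4% | 39.8% | 42.2% | 41.7% | 49.5% | 0.0% | 55.3% | 63.1% | 80.7% | 80.9% | 79.3% | 88.5%
V | 0.0% | 0.0% | 0.0% | 1.6% | 0.0% | 0.0% | 0.1% | 9.3% | 0.9% | 11.4% | 13.5% | 37.4% | 26.9% | 39.1% | 35.7% | 41.5% | 44.7% | 0.0% | 63.0% | 86.1% | 85.8% | 81.1% | 91.8%
DS | 0.0% | 0.0% | 0.1% | 1.8% | 0.7% | 0.5% | 0.9% | 8.3% | 3.1% | 10.9% | 13.4% | 30.2% | 22.0% | 30.6% | 28.2% | 31.0% | 36.9% | 37.0% | 0.0% | 72.7% | 72.3% | 71.2% | 85.1%
IC | 0.0% | 0.0% | 0.0% | 0.1% | 0.0% | 0.0% | 0.1% | 2.9% | 0.2% | 2.0% | 0.8% | 15.5% | 3.3% | 15.5% | 11.3% | 10.4% | 19.3% | 14.0% | 27.4% | 0.0% | 50.9% | 53.0% | 70.5%
PC | 0.0% | 0.0% | 0.0% | 0.2% | 0.0% | 0.0% | 0.0% | 3.1% | 0.0% | 1.6% | 0.5% | 16.3% | 2.2% | 15.7% | 11.0% | 9.6% | 19.1% | 14.2% | 27.8% | 49.1% | 0.0% | 51.9% | 69.4%
CD | 0.0% | 0.0% | 0.0% | 0.6% | 0.1% | 0.1% | 0.2% | 3.3% | 0.6% | 4.3% | 4.0% | 16.7% | 8.3% | 16.8% | 14.2% | 15.3% | 20.8% | 18.9% | 28.8% | 47.1% | 48.2% | 0.0% | 65.9%
HC | 0.0% | 0.0% | 0.0% | 0.2% | 0.0% | 0.0% | 0.0% | 1.7% | 0.1% | 1.5% | 1.1% | 8.9% | 3.3% | 8.7% | 7.3% | 7.7% | 11.5% | 8.2% | 14.9% | 29.5% | 30.7% | 34.1% | 0.0%
¹ Probabilities are defined as Prob(rank(A) > rank(B)) where A is the food category identified in the row labels and B is the food category identified in the column labels (based on 4,000 uncertainty iterations of the model).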
LEGEND
DM = Deli Meats
FNR = Frankfurters (not reheated)
P = Pâté and Meat Spreads
UM = Unpasteurized Fluid Milk
SS = Smoked Seafood
CR = Cooked Ready-To-Eat Crustaceans
HFD = High Fat and Other Dairy Products
SUC = Soft Unripened Cheese
PM = Pasteurized Fluid Milk
FSC = Fresh Soft Cheese
FR = Frankfurters (reheated)
PF = Preserved Fish
RS = Raw Seafood
F = Fruits
DFS = Dry/Semi-Dry Fermented Sausages
SSC = Semi-soft Cheese
SRC = Soft Ripened Cheese
V = Vegetables
DS = Deli-type Salads
IC = Ice Cream and Frozen Dairy Products
PC = Processed Cheese
CD = Cultured Milk Products
HC = Hard Cheese

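Table A12-2. Probabilities¹ (over uncertainty) that food categories rank higher (or lower) than other food categories based on the number of cases per annum.
Category | DM | PM | HFD | FNR | SUC | P | CR | UM | SS | F | FR | V | DFS | FSC | SSC | SRC | DS | RS | PF | IC | PC | CD | HC
DM | 0.0% | 91.9% | 98.5% | 99.6% | 99.8% | 100.0% | 100.0% | 99.8% | 99.6% | 92.4% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0%
PM | 8.1% | 0.0% | 60.3% | 75.1% | 83.8% | 96.0% | 98.0% | 93.8% | 94.5% | 77.9% | 100.0% | 99.2% | 99.1% | 100.0% | 99.9% | 99.6% | 99.5% | 100.0% | 99.9% | 100.0% | 100.0% | 100.0% | 100.0%
HFD | 1.5% | 39.7% | 0.0% | 69.7% | 80.4% | 95.3% | 97.9% | 93.1% | 94.0% | 75.4% | 99.8% | 99.1% | 99.0% | 100.0% | 99.8% | 99.8% | 99.4% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 100.0%
FNR | 0.4% | 24.9% | 30.3% | 0.0% | 70.2% | 92.0% | 95.8% | 87.2% | 91.0% | 72.5% | 99.3% | 98.5% | 98.0% | 100.0% | 99.7% | 99.6% | 98.9% | 100.0% | 99.8% | 100.0% | 100.0% | 100.0% | 100.0%
SUC | 0.2% | 16.2% | 19.7% | 29.9% | 0.0% | 60.6% | 64.9% | 60.8% | 69.8% | 59.5% | 83.8% | 78.8% | 85.8% | 91.1% | 90.3% | 88.4% | 88.2% | 92.8% | 92.9% | 95.6% | 95.7% | 96.0% | 97.7%
P | 0.1% | 4.0% | 4.7% | 8.0% | 39.4% | 0.0% | 57.2% | 54.0% | 69.8% | 59.6% | 94.0% | 82.2% | 90.1% | 100.0% | 98.2% | 93.6% | 93.2% | 100.0% | 98.9% | 99.7% | 100.0% | 99.6% | 100.0%
CR | 0.0% | 2.0% | 2.2% | 4.2% | 35.1% | 42.8% | 0.0% | 48.7% | 65.8% | 57.0% | 91.8% | 79.3% | 89.2% | 99.9% | 98.0% | 92.2% | 92.5% | 100.0% | 98.5% | 99.6% | 100.0% | 99.2% | 99.9%
UM | 0.3% | 6.2% | 6.9% | 12.8% | 39.2% | 46.0% | 51.3% | 0.0% | 61.3% | 56.5% | 80.5% | 76.2% | 85.7% | 96.6% | 93.9% | 89.3% | 89.4% | 97.7% | 95.3% | 98.4% | 98.7% | 97.4% | 99.0%
SS | 0.4% | 5.5% | 6.0% | 9.0% | 30.2% | 30.2% | 34.3% | 38.7% | 0.0% | 54.0% | 72.7% | 71.0% | 82.2% | 98.7% | 94.7% | 87.0% | 88.8% | 99.5% | 95.4% | 99.5% | 99.8% | 97.6% | 99.3%
F | 7.6% | 22.1% | 24.7% | 27.6% | 40.5% | 40.5% | 43.0% | 43.5% | 46.0% | 0.0% | 54.7% | 57.6% | 69.7% | 75.9% | 74.6% | 74.5% | 76.8% | 82.3% | 79.9% | 89.7% | 89.4% | 89.1% | 94.2%
FR | 0.0% | 0.1% | 0.2% | 0.7% | 16.2% | 6.0% | 8.2% | 19.5% | 27.3% | 45.4% | 0.0% | 58.5% | 75.2% | 89.8% | 88.9% | 78.4% | 82.4% | 98.4% | 88.4% | 98.7% | 99.3% | 95.7% | 98.5%
V | 0.0% | 0.9% | 0.9% | 1.6% | 21.3% | 17.8% | 20.7% | 23.8% | 29.0% | 42.5% | 41.6% | 0.0% | 66.4% | 78.6% | 77.6% | 72.9% | 76.7% | 89.5% | 81.2% | 92.8% | 94.7% | 91.0% | 96.4%
DFS | 0.0% | 1.0% | 1.0% | 2.0% | 14.2% | 9.9% | 10.8% | 14.4% | 17.9% | 30.3% | 24.9% | 33.7% | 0.0% | 59.2% | 58.0% | 57.9% | 58.9% | 67.6% | 66.2% | 78.2% | 79.6% | 79.7% | 88.3%
FSC | 0.0% | 0.0% | 0.0% | 0.0% | 8.9% | 0.0% | 0.1% | 3.4% | 1.3% | 24.1% | 10.2% | 21.5% | 40.8% | 0.0% | 50.1% | 50.1% | 51.4% | 67.1% | 60.1% | 76.2% | 78.7% | 77.1% | 87.2%
SSC | 0.0% | 0.2% | 0.2% | 0.3% | 9.7% | 1.8% | 2.0% | 6.2% | 5.3% | 25.4% | 11.1% | 22.4% | 42.0% | 49.9% | 0.0% | 50.1% | 53.1% | 66.7% | 60.4% | 77.5% | 78.5% | 76.8% | 86.8%
SRC | 0.0% | 0.4% | 0.2% | 0.4% | 11.6% | 6.4% | 7.8% | 10.7% | 13.1% | 25.6% | 21.6% | 27.1% | 42.1% | 49.9% | 49.9% | 0.0% | 50.6% | 58.6% | 60.1% | 69.0% | 70.5% | 72.6% | 81.3%
DS | 0.0% | 0.5% | 0.6% | 1.1% | 11.8% | 6.8% | 7.5% | 10.6% | 11.2% | 23.2% | 17.6% | 23.3% | 41.1% | 48.6% | 47.0% | 49.4% | 0.0% | 53.3% | 58.8% | 71.9% | 72.8% | 74.5% | 86.3%
RS | 0.0% | 0.0% | 0.0% | 0.0% | 7.2% | 0.0% | 0.0% | 2.3% | 0.5% | 17.7% | 1.6% | 10.5% | 32.5% | 33.0% | 33.3% | 41.5% | 46.7% | 0.0% | 52.6% | 69.8% | 72.0% | 72.8% | 84.9%
PF | 0.0% | 0.1% | 0.0% | 0.2% | 7.1% | 1.1% | 1.5% | 4.7% | 4.6% | 20.1% | 11.7% | 18.8% | 33.8% | 39.9% | 39.6% | 39.9% | 41.2% | 47.5% | 0.0% | 59.0% | 60.7% | 62.0% | 71.4%
IC | 0.0% | 0.0% | 0.0% | 0.1% | 4.4% | 0.3% | 0.4% | 1.7% | 0.5% | 10.3% | 1.3% | 7.2% | 21.8% | 23.8% | 22.5% | 31.0% | 28.2% | 30.2% | 41.1% | 0.0% | 53.0% | 59.8% | 74.9%
PC | 0.0% | 0.0% | 0.0% | 0.0% | 4.3% | 0.0% | 0.0% | 1.3% | 0.2% | 10.6% | 0.7% | 5.3% | 20.4% | 21.3% | 21.5% | 29.5% | 27.3% | 28.0% | 39.3% | 47.0% | 0.0% | 57.0% | 72.6%
CD | 0.0% | 0.0% | 0.0% | 0.0% | 4.0% | 0.4% | 0.9% | 2.7% | 2.4% | 10.9% | 4.3% | 9.0% | 20.3% | 22.9% | 23.2% | 27.4% | 25.6% | 27.2% | 38.1% | 40.2% | 43.0% | 0.0% | 62.0%
HC | 0.0% | 0.0% | 0.0% | 0.0% | 2.3% | 0.0% | 0.2% | 1.1% | 0.7% | 5.8% | 1.6% | 3.6% | 11.7% | 12.9% | 13.3% | 18.7% | 13.8% | 15.1% | 28.6% | 25.1% | 27.5% | 38.0% | 0.0%
¹ Probabilities are defined as Prob(rank(A) > rank(B)) where A is the food category identified in the row labels and B is the food category identified in the column labels (based on 4,000 uncertainty iterations of the model).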
LEGEND
DM = Deli Meats
PM = Pasteurized Fluid Milk
HFD = High Fat and Other Dairy Products
FNR = Frankfurters (not reheated)
SUC = Soft Unripened Cheese
P = Pâté and Meat Spreads
CR = Cooked Ready-To-Eat Crustaceans
UM = Unpasteurized Fluid Milk
SS = Smoked Seafood
F = Fruits
FR = Frankfurters (reheated)
V = Vegetables
DFS = Dry/Semi-Dry Fermented Sausages
FSC = Fresh Soft Cheese
SSC = Semi-soft Cheese
SRC = Soft Ripened Cheese
DS = Deli-type Salads
RS = Raw Seafood
PF = Preserved Fish
IC = Ice Cream and Frozen Dairy Products
PC = Processed Cheese
CD = Cultured Milk Products
HC = Hard Cheese

Table A12-3. Clustering of Similar Food Categories Based on the Uncertainty Distribution of Relative Risk Ranking on Per Serving and Per Annum Basis.
Cluster | Risk per Serving | Risk per Annum
Cluster 1 | Deli Meats; Frankfurters, not reheated; Pâté and Meat Spreads; Unpasteurized Fluid Milk; Smoked Seafood | Deli Meats
Cluster 2 | Cooked RTE Crustaceans; High Fat and Other Dairy Products; Pasteurized Fluid Milk; Soft Unripened Cheese | High Fat and Other Dairy Products; Frankfurters, not reheated; Pasteurized Fluid Milk; Soft Unripened Cheese
Cluster 3 | Deli-type Salads; Dry/Semi-dry Fermented Sausages; Fresh Soft Cheese; Frankfurters, reheated; Fruits; Preserved Fish; Raw Seafood; Semi-soft Cheese; Soft Ripened Cheese; Vegetables | Cooked RTE Crustaceans; Fruits; Pâté and Meat Spreads; Unpasteurized Fluid Milk; Smoked Seafood
Cluster 4 | Cultured Milk Products; Ice Cream and Frozen Dairy Products; Processed Cheese; Hard Cheese | Deli-type Salads; Dry/Semi-dry Fermented Sausages; Frankfurters, reheated; Fresh Soft Cheese; Semi-soft Cheese; Soft Ripened Cheese; Vegetables
Cluster 5 | Not Applicable | Cultured Milk Products; Hard Cheese; Ice Cream and Frozen Dairy Products; Preserved Fish; Processed Cheese; Raw Seafood

 

 

Table A12-4. Sensitivity of clustering procedure to the cut-off probability used to define similar versus dissimilar food categories.
Measure for ranking | Cut-off probability (distance) for defining any two categories as dissimilar | Total # of pairwise comparisons for which food categories are not judged dissimilar¹ | # of distinct disjoint clusters² of similarly ranked food categories
Risk per serving | 0.95 | 139 | 4
Risk per serving | 0.90 | 116 | 4
Risk per serving | 0.75 | 61 | 7
Cases per annum | 0.95 | 149 | 4
Cases per annum | 0.90 | 124 | 5
Cases per annum | 0.75 | 69 | 7
¹ There are a total of 276 pairwise comparisons of 23 food types; two food categories were considered dissimilar if Pr(rank(A) > rank(B)) > the cut-off probability value, where A is the food with higher mean rank and B is the food with lower mean rank.
² A cluster is defined here as a collection of food categories for which Pr(rank(A) > rank(B)) < cut-off probability value for any pair (A,B) in the cluster; each food is assigned to only one cluster and therefore clusters are disjoint.

(1) Jain, A.K., Murty, M.N. and Flynn, P.J. (1999). Data clustering: a review. ACM Computing Surveys 31(3), pp. 264-323.