acm para salud

Upload: mauriciorenteria

Post on 03-Jun-2018


  • 8/12/2019 Acm Para Salud

    1/45

    Documentos de Trabajo 5, 2002

    The Use of Correspondence Analysis in the Exploration of Health Survey Data

    Michael Greenacre

    Universidad Pompeu Fabra

    Abstract

    This working paper gives a comprehensive explanation of the multivariate technique called correspondence analysis, applied in the context of a large survey of a nation's state of health, in this case the Spanish National Health Survey. It is first shown how correspondence analysis can be used to interpret a simple cross-tabulation by visualizing the table in the form of a map of points representing the rows and columns of the table. Combinations of variables can also be interpreted by coding the data in the appropriate way. The technique can also be used to deduce optimal scale values for the levels of a categorical variable, thus giving quantitative meaning to the categories. Multiple correspon[...]

    Resumen (translated from the Spanish)

    This document develops [...] of a [...] analysis technique [...] called correspondence analysis [...] data from a national survey [...] case, the National [...] Survey (ENS). First it is shown [...] correspondence [...] can be [...] a contingency table [...] in the form of a plot of points [...] rows and columns of the table [...] interpreted different com[...] variables, coding the data [...]. In addition, this technique [...] obtain optimal values of [...]


    The Foundation's decision to publish this working paper does not imply any responsibility for its content, or for the inclusion in it of supplementary documents or information provided by the authors. The analyses, opinions, and findings of this paper represent the views of its authors; they are not necessarily those of the BBVA Foundation.

    No part of this publication including cover design may be reproduced or transmitted and/or published in print, by photocopying, on microfilm or in any form or by any means without the written consent of the copyright holder at the address below; the same applies to whole or partial adaptations.

    The Documentos de Trabajo series, as well as information on other publications of the BBVA Foundation, can be consulted at: http://www.fbbva.es

    Editorial Department of the BBVA Foundation (Departamento Editorial de la Fundación BBVA)

    Director: Paz Pérez-Bilbao

    Editing and style coordinator: Mercedes Bravo

    Published by: Fundación BBVA, Plaza de San Nicolás 4, 48005 Bilbao


    CONTENTS

    1. Introduction
    2. Correspondence analysis
    3. A simple illustration
    4. Other applications to cross-tabulations
    5. Using correspondence analysis to develop scales
    6. Exploring missing data
    7. Visualizing trends
    8. Conclusions
    Appendix: Correspondence analysis theory
    Bibliography
    About the author


    1. Introduction

    The National Health Survey (Encuesta Nacional [...]ted every three years in Spain is an example of a la[...] survey designed to provide a snapshot of the nation[...] particular moment in time. In the 1997 survey of ad[...] the subject of this working paper, there are 46 basic [...] which consist of possible multiple responses, pushin[...] number of questions effectively to 83. Added to this [...] questions which are conditional on the responses to [...] giving an additional upper limit of 27 questions. Ea[...] respondents interviewed thus provide between 83 an[...] information, so that the complete data file comprise[...] 640,000 numbers.

    The usual way to summarize these data is to c[...] response and present these in the form of bar or line[...]cation Indicadores de Salud (Regidor and Gutiérrez-F[...]ple, is a collection of tables where the 1997 data are [...] previous years, and only in a few cases are some gr[...] given of the results as an aid to interpretation.

    A second level of analysis is to explore relati[...]rent questions in the survey. There are various ways [...] more complicated and more ambitious than others. O[...]ple, postulate some functional relationship between [...] number of visits to the doctor and age. Since both v[...] simple numerical scale, the solution is fairly straigh[...] inspecting the scatterplot of these two variables, one [...] regression model relating expected number of visits w[...] comes to relating health status, which is a multicateg[...] five possible responses, and the intake of medicines, [...] 17 [...]


    CA is a method aimed specifically at quantifying ca[...] assigning numerical scale values to the response cat[...] variables, with certain optimal properties. These sca[...] shown to have interesting geometric properties and [...] called maps of the relationships between variables.

    After introducing the method, we will give a [...] using a cross-tabulation computed from the 1997 hea[...] applications will be given using more complex cros[...]

    We also show how CA can be used to develo[...]size the responses to several questions which have a[...] This is of great use to the modeller, who can replace [...] variables by a single scale, which can then be used [...]ses, such as regression analysis, which require interv[...]

    Several other issues are dealt with; for examp[...] patterns of missing data and how to explore trends b[...] from different years.

    m i c h a e l g r e e n a c r e


    2. Correspondence analysis

    Although the theory is fully explained in sever[...]graphy), we present a practical introduction in the c[...] survey data analyzed in this working paper, as well [...] summary in an Appendix.

    In its simplest form, correspondence analysis [...] two-way cross-tabulation, such as the one in Table 1[...]rizes the distribution of perceived health status catego[...] groups. The ultimate aim of CA is to produce a m[...] each row and each column is represented by a poin[...] is quite similar to principal components analysis (PC[...] variance of the table is defined and then this total is [...]mally along so-called principal axes. For mapping [...] usually hoped that a large percentage of total varian[...] by the first two principal axes, thereby allowing the [...] in two dimensions.

    CA contains three basic concepts: that of a po[...]nal space, a weight (or mass) assigned to each poin[...] distance function between the points, called the [...] these three concepts are defined, the method tries to [...] dimensionality of the points by projecting them onto [...] a two-dimensional plane as mentioned above. This s[...] fits the points by weighted least-squares, where each [...] its respective mass, and measurement of distance be[...] subspace is in terms of chi-square distance.

    Let us look at each of these three concepts in[...]fined equivalently for rows or columns, we shall ex[...] the rows of Table 1, with the understanding that the [...]


    expressed them in the more familiar form of percen[...] profile vectors which are the multidimensional poin[...] will attempt to show us these points representing th[...] groups in this case, where each age group is describ[...] five coordinates, its distribution across the health sta[...]

    Each row profile point will then be weighted [...] which is the frequency of the row category divided [...] For example, since age group 16-24 has 1,223 respo[...]tal of 6,371, then this row point is weighted by the [...] .192. The row masses add up to 1, and are nothing [...]ginal proportions of the table.
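As a small check of the arithmetic above, the following sketch computes the mass and the row profile of the 16-24 age group. It uses only figures printed in this copy: the four legible counts of that row in Table 1, its stated total of 1,223 and the grand total of 6,371. The variable names are illustrative and not taken from the paper.

```python
# Row mass and row profile for age group 16-24 (figures from Table 1 and the text).
# Only four of the five health categories are legible in this copy of Table 1,
# so the profile below covers those four and does not sum to 100%.
row_counts = {"Very Good": 243, "Good": 789, "Regular": 167, "Bad": 18}
row_total = 1223      # respondents aged 16-24, as stated in the text
grand_total = 6371    # all respondents, as stated in the text

# Mass: frequency of the row category divided by the grand total.
mass = row_total / grand_total

# Profile: row counts divided by the row total, shown here as percentages.
profile_pct = {k: 100 * v / row_total for k, v in row_counts.items()}

print(f"mass of 16-24: {mass:.3f}")
for category, pct in profile_pct.items():
    print(f"{category:>9}: {pct:.1f}%")
```

The printed percentages can be compared with the first row of Table 2, and the mass with the value .192 quoted in the text.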

    Finally we measure distance between row poi[...]


    TABLE 1: Cross-tabulation of age groups by perceived health status

    AGE GROUP   Very Good    Good   Regular    Bad
    16-24             243     789       167     18
    25-34             220     809       164     35
    35-44             147     658       181     41
    45-54              90     469       236     50
    55-64              53     414       306    106
    65-74              44     267       284     98
    75+                20     136       157     66

    SUM               817   3,542     1,495    414

    TABLE 2: Row percentages calculated from Table 1

    AGE GROUP   Very Good    Good   Regular    Bad
    16-24            19.9    64.5      13.7    1.5
    25-34            17.8    65.6      13.3    2.8
    35-44            14.2    63.6      17.5    4.0
    45-54            10.5    54.5      27.4    5.8
    55-64             5.8    45.5      33.7   11.7
    65-74             6.2    37.4      39.8   13.7
    75+               5.1    34.3      39.6   16.7

    AVERAGE          12.8    55.6      23.5    6.5


    \[ \text{physical distance} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots} \]

    but the chi-square distance is a distance which weig[...] term as follows:

    \[ \text{weighted distance} = \sqrt{(x_1 - y_1)^2 / v_1 + (x_2 - y_2)^2 / v_2 + \cdots} \]

    In PCA this type of distance function is alrea[...] v_j is equal to the variance of the j-th variable (as a [...]trix of numerical measurements. Specifically, the ch[...]tween row points weights each term inversely by the [...] column marginal proportion c_j:

    \[ \text{chi-square distance} = \sqrt{(x_1 - y_1)^2 / c_1 + (x_2 - y_2)^2 / c_2 + \cdots} \]

    where in our example (Table 1) c_1 = 817/6,371 = .128, c_2 = 3,542/6,371 = .556, and so on. The idea here is just like in PCA, i[...] compensates for the different variances in the colum[...] matrix we say it is variance standardizing. Diffe[...] first column of Table 2 will tend to be smaller, since [...] smaller (they actually vary from 5.1 to 19.9, i.e. 14.8 [...] whereas differences in the second column will be g[...] they are larger percentages (they vary from 34.3 to [...]centage points). Dividing by the column margin effe[...] these inherent differences.

    The total variance in correspondence analysis [...] so-called inertia, which is simply the usual Pearson [...]culated on the cross-tabulation, divided by the total [...] this inertia which measures the degree of difference [...] groups that we are trying to represent optimally in t[...]
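The chi-square distance and the inertia can be illustrated with a short sketch. It assumes only the four columns of Table 1 that are legible in this copy (the fifth health category is cut off), so the numbers it prints are not the paper's values; the aim is to show the computation and to check that the mass-weighted sum of squared chi-square distances to the average profile equals the Pearson chi-square statistic divided by the grand total. Function and variable names are illustrative.

```python
# Sketch: chi-square distances and inertia for the four legible columns of Table 1.
# The fifth column is cut off in this copy, so the values produced here
# are NOT the paper's values; only the computation is illustrated.
from math import sqrt

table = [  # rows: age groups 16-24 ... 75+; columns: Very Good, Good, Regular, Bad
    [243, 789, 167, 18],
    [220, 809, 164, 35],
    [147, 658, 181, 41],
    [90, 469, 236, 50],
    [53, 414, 306, 106],
    [44, 267, 284, 98],
    [20, 136, 157, 66],
]

n = sum(sum(row) for row in table)                 # grand total of this sub-table
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

masses = [rt / n for rt in row_totals]             # row masses
c = [ct / n for ct in col_totals]                  # column marginal proportions c_j
profiles = [[x / rt for x in row] for row, rt in zip(table, row_totals)]


def chi_square_distance(x, y, c):
    """Distance between profiles x and y, each squared term divided by c_j."""
    return sqrt(sum((xj - yj) ** 2 / cj for xj, yj, cj in zip(x, y, c)))


# Inertia as the mass-weighted sum of squared chi-square distances
# from each row profile to the average profile (which equals c).
inertia_from_distances = sum(
    m * chi_square_distance(p, c, c) ** 2 for m, p in zip(masses, profiles)
)

# Inertia as Pearson's chi-square statistic divided by the grand total.
chi2 = sum(
    (table[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(len(table))
    for j in range(len(c))
)
inertia_from_chi2 = chi2 / n

print("distance 16-24 vs 75+:", chi_square_distance(profiles[0], profiles[-1], c))
print("inertia (two ways):", inertia_from_distances, inertia_from_chi2)
```

Dividing each squared difference by c_j is what gives rare categories the same chance to contribute as common ones, which is the standardizing effect described above.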

    As we have said, the map, usually two-dime[...] by weighted least-squares (more specific details of t[...] involved are given in the Appendix). In practice, wh[...] the row profile points are projected onto the best-fit[...] coordinates of these points are called principal coo[...] are the coordinates with respect to the principal a[...]

    t h e u s e o f c o r r e s p o n d e


    In addition, we have points on the map [...] as well. There are two ways of representing t[...] rows. The easier of the two to understand, th[...]nerally used, is the asymmetric map shown in Fig[...] row profiles are depicted as described above, [...] but the column points are depicted by proje[...] onto the same space.

    A unit profile vector is a vector of zero[...]ple, the unit profile vector [ 1 0 0 0 0 ] repres[...] in the space of the row profiles. The practica[...]tric map is that the column points are spread [...] row points (see Figure 1 as an example). The [...] map is the symmetric map, in which both row po[...] are represented in principal coordinates. As s[...]pendix, there is a simple scaling factor differ[...] standard coordinates, which lends some theo[...] symmetric map. One should remember, thoug[...] really involves the projections of two sets of [...] row profiles in one space and column profile[...]tation of these maps is explained more fully [...]tual examples.


    3. A simple illustration

    As a first illustration of how CA operates, w[...] which cross-tabulates age with perceived hea[...]

    We needed to define age groups, which [...] 35-44, etc., but this choice hardly affects our [...] point out later. The frequencies in Table 1 are [...]se, because of the different marginal frequenc[...] groups, so it is usual to calculate row percent[...] the groups, as shown in Table 2.

    The rows of Table 2 are the profiles of [...] health status categories. CA visualizes these [...]picts the distance between each group and al[...] status categories should be scaled in order to [...] optimally. There are two ways to report this m[...] shown before in Figure 1 and the symmetric map ([...]

    The only difference is that in Figure 1 we [...] bunch of points within the health status categori[...] whereas in Figure 2 the two sets of points are m[...] since the spread of both sets of points is the sam[...] vertically. Notice from definitions (A.2) and (A.[...] only difference between principal and standard c[...] along each principal axis. Figure 2 is generally t[...] because it simply looks better, but Figure 1 is pe[...] because the row and column points occupy the s[...] thus easier to interpret jointly. In Figures 1 and 2 [...] up from right to left, with a slight arch formed b[...] compared to the extremes. In Figure 1 the health [...] which can be considered fictitious age groups w[...] category; for example, the point very good (muy bu[...] a percentage of 100% in this category and 0% in [...]


    FIGURE 1: Asymmetric CA map of Table 1

    FIGURE 2: Symmetric CA map of Table 1

    [Maps not reproduced. Points shown: the seven age groups and the health status categories; second principal axis: 0.0021 (1.5%).]


    importance only. The essential information in the original [...] the horizontal spread of the points, and the percentage of i[...] first axis actually puts a figure on the quality of the displa[...] 97.3%. Figure 2 tells the same story, showing the unimpo[...] dimension, with the health categories now scaled identica[...] along both axes.

    What can we conclude from this graphical display? [...] right-to-left spread of the age groups, we see that there is [...] from age group 1 to age group 2, then a larger step to age[...] larger one to age group 4, then the biggest step of all to ag[...] smaller steps to group 6 and then group 7. The ordering o[...] categories along this dimension agrees with the inherent o[...]good (muy bueno) to very bad (muy malo), and their [...] give scale values which can be interpreted; for example, t[...] difference between bad (malo) and very bad (muy [...] difference between, say, good (bueno) and regular, [...] distinguishing the responses between different age groups[...]

    The health scale values (first principal coordinate[...] standardized but can be linearly transformed to any oth[...] scale; for example, we could transform them to have en[...] and 100, with 0 representing very bad and 100 very [...]

    Original scale: 0.767 0.755 0.439 0[...]
    New scale: 0 1 27.6 81[...]

    This is a quite different scale from what one would [...]tances between the categories were equal, in which case th[...] be 0, 25, 50, 75 and 100. The category regular is not in[...] scale, but very much towards the bad end of the scale, a[...]tions of respondents. Or, putting it another way, it is clearl[...] negative direction to admit one's health is regular as op[...]
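The linear transformation to a 0-100 scale can be sketched as follows. The CA scale values themselves are cut off in this copy, so the example does not use them; it only reproduces the equal-interval benchmark of 0, 25, 50, 75 and 100 mentioned above. The function name `rescale` is an illustrative choice.

```python
def rescale(values, new_min=0.0, new_max=100.0):
    """Linearly map values so that min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; the scale has no spread")
    factor = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * factor for v in values]


# Five equally spaced categories give the equal-interval scale used as a benchmark.
print(rescale([1, 2, 3, 4, 5]))   # [0.0, 25.0, 50.0, 75.0, 100.0]
```

Because the map is linear, the relative distances between categories are unchanged; only the end points are fixed at 0 and 100.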

    Using the above scale values one can establish[...] all those in the age groups:

    16-24   75.97
    25-34   74.69
    35-44   70.63
    45-54   62.25
    [...]


    Since we have the exact ages of each responde[...] more detailed plot by calculating and plotting the ave[...] (Figure 4).


    FIGURE 3: Plot of health status index (first dimension of CA) against age group

    FIGURE 4: Plot of health status index against age (up to 86)

    [Plots not reproduced. Vertical axis: index, with ticks from 40 to 80; horizontal axis: age group (Figure 3) and age (Figure 4).]


    There are some interesting patterns, such [...] health in the years just preceding 30, 40, 50[...] slight recovery in the years after.

    Because of the high sample size in this [...] data at least one level further by splitting the [...] another variable. Sex is the most obvious o[...] cross-tabulation of the seven age groups split [...] females, with the corresponding health categ[...]

    The symmetric map in Figure 5 shows th[...] themselves as unhealthier than their male coun[...] points are always to the left of the male points [...] group, so that females of 65-74, for example, a[...] than males of 75+.


    TABLE 3: Age group and sex interactively cross-tabulated with health s[...]

    AGE GROUP   Very Good    Good   Regular    Bad

    MALES
    16-24             145     402        84      5
    25-34             112     414        74     13
    35-44              80     331        82     24
    45-54              54     231       102     22
    55-64              30     219       119     53
    65-74              18     125       110     35
    75+                 9      67        65     25

    FEMALES
    16-24              98     387        83     13
    25-34             108     395        90     22
    35-44              67     327        99     17
    45-54              36     238       134     28
    55-64              23     195       187     53
    65-74              26     142       174     63
    75+                11      69        92     41

    SUM               817   3,542     1,495    414


    FIGURE 5: Correspondence analysis of Table 3, symmetric map

    [Map not reproduced. Points shown: male (m) and female (f) age groups and the health status categories; principal inertias: 0.1417 (94.5%) and 0.0039 (2.6%).]


    4. Other applications to cross-tabulations

    We present several other examples of how C[...] relationships between several variables. Que[...] respondents if they have had to reduce their no[...] because of some pain or other symptom. For [...] there follows a list of 18 possible symptoms, [...] other category. In Table 4 we have tabulate[...] five health status categories for each of the re[...]

    Notice that the table is not a contingency tab[...] since multiple responses are possible to the [...]

    TABLE 4: Ailments tabulated by perceived health

    AILMENT                 Very Good   Good   Regular   Bad
    a. Bones, joints                5     64       132   104
    b. Nerves, depression           0     13        24    39
    c. Throat, cough               12     77        62    25
    d. Headache                     2     47        41    30
    e. Cuts, injuries               8     21        13     8
    f. Earache                      0      4         7     4
    g. Diarrhea                     3      6         5     7
    h. Allergies                    0      5         8     6
    i. Kidneys, urinary             0      6        12     7
    j. Stomach                      2     13        18    13
    k. Fever                        3     20        17     6
    l. Teeth                        2      5         4     2
    m. Fainting                     2     10        21    21
    n. Chest                        0      1        10    18
    o. Ankles                       1      1        13    15
    p. Suffocation                  0      5        27    22
    q. Fatigue                      1      9        35    26
    r. Others                       5     29        46    20


    Figure 6 shows the symmetric map of this table[...] five health status categories spread along the first prin[...] positions similar to those in the previous analyses. The [...] scaled from left to right in accordance with the associ[...] chest problems, ankles, suffocation, respiratory p[...] psychiatric problems on the bad left side, and teeth [...] and fever on the good right side. The second axis [...] this analysis, and is determined mostly by the status c[...] and the three symptoms in the upper part of the map: [...] and teeth. This indicates a subgroup of people who [...] but who also tend to report higher than average very [...] tending to have one of these afflictions which is just a[...] Notice the position of diarrhea, which is associated [...] people: some who view their health at the very good [...] others at the opposite very bad end, but with fewer [...] with regular health.


    FIGURE 6: Correspondence analysis of Table 4, symmetric map

    [Map not reproduced. Points shown: the ailments of Table 4 and the health status categories; second principal axis: 0.0209 (12.6%).]


    The next example concerns smoking, and here we [...] question 19 about smoking habits with health status (Tab[...]

    The differences between smoking groups with r[...] categories are not large; this fact can be deduced from [...] inertias along the principal axes in Figure 7. The smal[...] exist, however, show that those who smoke have a slig[...] view of their health. As an attempt to explain this find[...] the relationship between smoking and age given in Ta[...] Clearly there is a strong tendency for younger people [...]


    TABLE 5: Smoking categories by perceived health

    SMOKING CATEGORY    Very Good    Good   Regular    Bad
    Smoke daily               288   1,309       398    102
    Smoke, not daily           31      92        36      7
    Used to smoke             107     519       234     74
    Never smoked              391   1,622       831    228

    SUM                       817   3,542     1,499    411

    FIGURE 7: Correspondence analysis of Table 5

    [Map not reproduced. Points shown: the smoking categories and the health status categories; second principal axis: 0.0013 (8.5%).]


    finding in Figure 6 can be attributed to the fact that th[...] non-smoking groups have an older profile with worse [...] This leads us to consider the smoking categories withi[...] (Table 6), giving a more detailed explanation of the rel[...] (Figure 8). [...]


    TABLE 6: Age groups and smoking habits interactively cross-tabulated with health[...] by numbers 1 to 7, and smoking group indicated by + (smokes daily), S[...] daily), N (doesn't smoke, but did), [...] (never smoked); e.g., 2[...] = age gro[...]

    SMOKING CATEGORY   Very Good    Good   Regular    Bad
    1+                        63     282        71     10
    1S                        13      32         8      2
    1N                         5      44        10      2
    1                        161     429        78      4
    2+                        95     431        90     20
    2S                        11      26         5      2
    2N                        30     101        18      3
    2                         84     249        51     10
    3+                        88     285        80     24
    3S                         2      17         4      1
    3N                        21     118        27      4
    3                         36     236        70     12
    4+                        26     165        76     13
    4S                         2       6         9      0
    4N                        19      77        36     10
    4                         43     221       115     27
    5+                        10     100        46     17
    5S                         2       3         5      2
    5N                        15      81        50     20
    5                         26     229       204     67
    6+                         4      28        24     16
    6S                         0       6         3      0
    6N                        10      61        58     16
    6                         29      70       199     64
    7+                         2      14         9      2
    7S                         1       1         1      0
    7N                         7      36        34     19
    7                         10      84       112     44

    SUM                      815   3,532     1,493    411


    FIGURE 8: Correspondence analysis of Table 6, symmetric map

    [Map not reproduced. Points shown: the age-by-smoking groups of Table 6 and the health status categories; principal inertias: 0.1433 (89.1%) and 0.0108 (6.7%).]


    5. Using correspondence analysis to develop scales

    We have already seen an example in Sectio[...]mal scaling; the assignment of scale values t[...] optimal properties. We obtained values for th[...] which lead to maximum separation, or discri[...] groups. In general, we can use CA to obtain [...] of categorical variables which form a substan[...]

    For example, question 8a of the health s[...] which of 17 different types of medicines they [...]vious two weeks (of the original 18 types, w[...] which only apply to women). More than half [...]ken any medicines, so we excluded them fro[...]tion differs from the previous ones, because w[...] relation between medicine consumption and [...] age or smoking. Here we are trying to reduce [...] of variables in much the same way as factor [...] common factors which capture the relationsh[...] by explaining a maximum amount of variab[...]ical to principal component analysis, apart fro[...]bles are categorical in nature, so the missing [...] given to the categories.

    Multiple correspondence analysis (MCA[...] homogeneity analysis (HA) solves this pro[...] category scale values which lead to scores fo[...] maximally correlated with each respondent's [...]this, let us suppose that we make the ad hoc deci[...] values 1 to each medicine taken and 0 5 to [...]


    additional column of this notional matrix. Th[...] correlation between the respondent scores an[...] and summarize in some way how well the sc[...] MCA this is done by calculating the average [...] the score vector and the 16 scales. The objec[...] out which scale values lead to a maximum v[...] correlation, so that in this sense the scores m[...] the 16 scales. Once this factor has been ide[...] another set of scale values and associated sco[...] scores already identified, which again maxim[...] correlation, and so on.

    The basic numerical results of the MCA [...] dimensions (i.e., factors) are given in Table 7[...]

    In this table the squared correlations are [...] measures and the average squared correlatio[...] Another way of thinking about the table is th[...] coefficients of determination (R²) giving the va[...] explained by each dimension (factor). Since [...] uncorrelated, these R² can be added up row-w[...] variances by subsets of factors. The dimensio[...]


    TABLE 7: Eigenvalues and discrimination measures for each dimension

                            D1
    Eigenvalue              .1031

    Throat, cough           .183
    Pain, fever             .127
    Vitamins, minerals      .001
    Laxatives               .025
    Antibiotics             .044
    Tranquillisers...       .144
    Anti-allergy            .003
    Diarrhea                .001
    Rheumatism              .084
    Heart                   .277
    Blood pressure          .311
    Digestive remedies      .071
    [...]


    TABLE 7 (continued): Eigenvalues and discrimination measures for each d[...]

    Medicine               Response   dim[...]
    Throat, cough          yes 1      .74
                           no 2       .25
    Pain, fever            yes 1      .5
                           no 2       .25
    Vitamins, minerals     yes 1      .1
                           no 2       .0
    Laxatives              yes 1      1.0
                           no 2       .0
    Antibiotics            yes 1      .7
                           no 2       .06
    Tranquillisers...      yes 1      .95
                           no 2       .1
    Anti-allergy           yes 1      .2
                           no 2       .0
    Diarrhea               yes 1      .27
                           no 2       .0
    Rheumatism             yes 1      1.0
                           no 2       .0
    Heart                  yes 1      1.6
                           no 2       .1
    Blood pressure         yes 1      1.1
                           no 2       .2
    Digestive remedies     yes 1      .84
                           no 2       .0
    Antidepressants        yes 1      1.2
                           no 2       .0
    Slimming               yes 1      .14
                           no 2       .0
    Control cholesterol    yes 1      1.8
    [...]


    descending order of eigenvalue, which is th[...] correlation, the quantity which is maximized [...] set of scale values given in the second part o[...] scale values are given, one for each yes an[...] types of medicine, for each dimension.

    The first factor is a dimension which gr[...] medicines, in order of explained variance: m[...] for the heart, for lowering cholesterol and t[...]betes as well as tranquillisers and sleeping pi[...] that medicines for minor ailments such as thr[...] and fever, and antibiotics, have their signs of [...] In other words, people who have been taking [...] chronic health complaints are usually not tak[...] less serious, transient, ailments.

    The second factor groups mainly the fo[...] tranquillisers and sleeping pills and antidepre[...] psychiatric dimension. Although not so we[...] we also note high scale values for diarrhea a[...]

    The first two dimensions can be plotted [...]ure 9). This gives an interesting view of the i[...] the medicines, with the grouping at bottom r[...]nic diseases, at the top for psychiatric and di[...] left for more common ailments of a transient [...]

    As a complementary analysis to the ma[...] perform a hierarchical cluster analysis of the [...] Figure 10 shows the cluster tree, based on co[...] the Jaccard index to measure similarity betw[...] see the same clusters as in Figure 9.
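The Jaccard index mentioned above can be sketched for two yes/no variables as follows. The six answers per medicine type are hypothetical illustration data, not survey values, and the function name is an illustrative choice.

```python
def jaccard_index(a, b):
    """Jaccard similarity of two 0/1 vectors: joint presences / cases where either is 1."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 0.0


# Hypothetical yes (1) / no (0) answers of six respondents for two medicine types.
heart = [1, 1, 0, 0, 1, 0]
blood_pressure = [1, 0, 0, 0, 1, 1]

print(jaccard_index(heart, blood_pressure))   # 2 joint presences out of 4 -> 0.5
```

The index ignores respondents who take neither medicine, which is a reasonable property when most answers are no; a clustering routine would typically work with 1 minus this value as the dissimilarity.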

    In the optimal scaling, we can continue [...]yond the second. For example, the third facto[...]cines for flu, throat, pains and fever, by them[...] respondents who have had a bacterial or vira[...] two weeks, but are not taking any other medi[...]

    There is one final issue to resolve in th[...]troversial one. If we retain two dimensions, [...]


    Originally MCA was defined as the cor[...]


    FIGURE 9: Multiple correspondence analysis, showing optimal scale values in two dimensions of yes responses to medic[...]

    [Map not reproduced. Points shown: the yes categories of the medicine types.]


    sample size), and contains only zeros and ones, with [...] for each respondent his or her categories of respons[...] inertias of this matrix are exactly the eigenvalues of [...] of evaluating variance explained would be to expres[...] a percentage of the total. The total inertia of an indi[...] been shown to be equal to a constant: (J − Q)/Q, [...] number of categories of response, and Q = the nu[...] in this example (32 − 16)/16 = 1. So the eigenvalue[...]


    FIGURE 10: Hierarchical clustering tree of medicine types

    [Dendrogram (complete linkage, rescaled distance) not reproduced. Leaves in plotted order: heart, blood press., diabetes, rheumatism, cholesterol, diabetes, tranquill., antidepress., digestive, laxatives, throat/cough, pain/fever, antibiotics, vitamins, allergy, diarrhea, slimming.]


    A second way of defining MCA is to perform [...]cator matrix, usually denoted by Z, but on the so-ca[...] ZᵀZ. This is the super-matrix of all two-way cross-t[...] variables. It is well known that this CA leads to the [...]dinates as before, but with principal inertias equal to [...] those for the indicator matrix, so that the percentage[...] calculated on the squared eigenvalues. We calculate [...] eigenvalues (there are 16 in total) to be 0.06609, so [...] of inertia explained by the first three dimensions are [...] 0.1031²/0.06609, 0.0815²/0.06609, 0.0745²/0.06609[...] 16.1, 10.1 and 8.4% respectively. These look more o[...] before, but they are actually still too low. This is ex[...] (1989), who pointed out that neither of these ways o[...] percentages of inertia have the simple two-variable [...] Section 3 as a special case.
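The two sets of percentages can be recomputed from the figures quoted in the text. This is a minimal sketch that takes the three eigenvalues, J = 32, Q = 16 and the total 0.06609 as given; the variable names are illustrative.

```python
# Eigenvalues of the indicator-matrix analysis quoted in the text (first three of 16).
eigenvalues = [0.1031, 0.0815, 0.0745]
J, Q = 32, 16                      # response categories and questions

# Indicator matrix: the total inertia is the constant (J - Q) / Q.
total_indicator = (J - Q) / Q
pct_indicator = [100 * ev / total_indicator for ev in eigenvalues]

# Burt matrix: principal inertias are the squared eigenvalues; their total
# over all 16 dimensions is quoted in the text as 0.06609.
total_burt = 0.06609
pct_burt = [100 * ev ** 2 / total_burt for ev in eigenvalues]

print([round(p, 2) for p in pct_indicator])
print([round(p, 1) for p in pct_burt])       # 16.1, 10.1 and 8.4, as in the text
```

Squaring the eigenvalues stretches the gaps between them, which is why the Burt percentages look more favourable than the indicator ones even though both describe the same solution.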

    A more realistic alternative, which agrees wit[...] of simple CA, is to ascertain how well a solution is [...] the two-way association pattern of the variables. Gr[...] explains how a simple calculation using the eigenva[...] alternative measure. First, we adjust the total inertia [...] obtain the average inertia of all the two-way tables b[...] the 16 variables:

    \[ \text{average inertia} = \frac{Q}{Q-1}\left(\text{inertia of } B - \frac{J-Q}{Q^2}\right) \]

    that is:

    \[ \frac{16}{15}\left(0.06609 - \frac{16}{16^2}\right) = 0.003830 \]

    (The difference between the previous total of 0.0660[...] 0.003830 is that part of the Burt matrix which we a[...] explain at all, and which creates the problem in the [...]tion.) Second, we adjust the eigenvalues themselves [...]

    \[ \left(\frac{Q}{Q-1}\right)^2 \left(\text{eigenvalue} - \frac{1}{Q}\right)^2 \]


    for example:

    \[ \left(\frac{16}{15}\right)^2 (0.1031 - 0.0625)^2 = 0.001875 \]

    and express this as a percentage of the average inert[...] 49.0%. In the case of the second and third inertias, [...] percentages of 10.7 and 5.0% respectively. These pe[...] are more realistic reflections of the variance explain[...] justification than the usual approaches.

    We can thus conclude that the two-dimension[...] explains at least 59.7% of the total inertia in the 16 [...]
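The adjustment can be traced numerically with the values quoted in the text. The sketch below assumes Q = 16, J = 32, a Burt total of 0.06609 and the first two eigenvalues 0.1031 and 0.0815; because these inputs are rounded, the last digit of the average inertia may differ slightly from the printed 0.003830. Variable names are illustrative.

```python
Q, J = 16, 32
inertia_burt = 0.06609                 # total inertia of the Burt matrix B (from the text)
eigenvalues = [0.1031, 0.0815]         # first two eigenvalues (from the text)

# Step 1: average inertia of the two-way tables between different variables.
average_inertia = Q / (Q - 1) * (inertia_burt - (J - Q) / Q ** 2)

# Step 2: adjusted principal inertias and their share of the average inertia.
adjusted = [(Q / (Q - 1)) ** 2 * (ev - 1 / Q) ** 2 for ev in eigenvalues]
percentages = [100 * a / average_inertia for a in adjusted]

print(round(average_inertia, 6))
print([round(a, 6) for a in adjusted])
print([round(p, 1) for p in percentages])   # 49.0 and 10.7
print(round(sum(percentages), 1))           # 59.7
```

The subtraction of 1/Q removes the part of each eigenvalue that comes from the diagonal blocks of the Burt matrix, which carry no information about association between different questions.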


    6. Exploring missing data

    CA is frequently used to explore patterns of missin[...] and to answer questions such as: is there a specific [...] tending to not answer questions? Or is non-response [...] between variables, i.e. can we say that certain group[...] have non-responses simultaneously? A way to answ[...] would be to set up a data matrix of binary informat[...] respondent we simply code whether the respondent [...] using a one for a missing response and a zero for an [...] whatever that may be. We would code the data this [...] interested more in the occurrence of a non-response [...] but if we wished to treat these two possibilities equ[...] coding in MCA and introduce two columns for each [...] variable for non-response and a dummy variable for [...] N respondents and Q questions under investigation [...] either be of order N × Q or N × 2Q. The CA of th[...] an idea of which questions have non-responses by t[...] also which respondents are associated with which n[...]
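The two codings described above can be sketched as follows. The four respondents and three questions are hypothetical illustration data, with `None` standing for a non-response; the variable names are not from the paper.

```python
# Hypothetical answers of four respondents to three questions; None marks a non-response.
responses = [
    ["yes", "no", None],
    [None, "no", "yes"],
    ["no", None, None],
    ["yes", "yes", "no"],
]

# N x Q coding: 1 for a missing response, 0 for any substantive answer.
missing = [[1 if answer is None else 0 for answer in row] for row in responses]

# N x 2Q coding: a pair of dummy columns per question,
# (non-response, response), so both outcomes are treated equally.
doubled = [
    [flag for answer in row for flag in ((1, 0) if answer is None else (0, 1))]
    for row in responses
]

for row in missing:
    print(row)
for row in doubled:
    print(row)
```

Either matrix could then be passed to a CA routine; the N × Q form emphasizes where non-responses occur, while the N × 2Q form gives response and non-response equal standing.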

    In this particular survey, the level of non-resp[...] such questions can not be investigated: but there is [...] Income which raises an interesting issue. There [...] responses to the question on income (denoted by I[...] CA to investigate the relationship between this ques[...]gories of response and non-response, and other biog[...] which are answered by almost all the respondents. I[...] cross-tabulated with the following variables: sex, m[...] schooling, work situation, breadwinner or not, and w[...] head of family. Although these are separate cross-tab[...]


    especially interested in the position of the income no[…] category (I?).

    Figure 11 shows the resulting map. The income[…] I1 to I6 in the map, lie in their expected order, with th[…] the right and the highest income on the left (notice tha[…] the sign of all the coordinates on the first axis so that […] the right; this makes no difference to the CA results). […]
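The remark that reversing the sign of an axis makes no difference to the CA results can be illustrated with a small sketch. The table below is hypothetical and is not taken from the survey. The coordinates are computed as in the Appendix, and all variable names are illustrative.

```python
import numpy as np

# Hypothetical 3 x 3 table of counts, used only for illustration.
table = np.array([[20.0, 10.0, 5.0],
                  [8.0, 12.0, 9.0],
                  [4.0, 9.0, 18.0]])

P = table / table.sum()                 # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)     # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

F = (U / np.sqrt(r)[:, None]) * sv      # row principal coordinates
G = (Vt.T / np.sqrt(c)[:, None]) * sv   # column principal coordinates

# Reverse the first axis for rows and columns together.
flip = np.array([-1.0, 1.0, 1.0])
F_flipped, G_flipped = F * flip, G * flip


def pairwise_sq_dist(coords):
    """Squared Euclidean distances between all pairs of points."""
    diff = coords[:, None, :] - coords[None, :, :]
    return (diff ** 2).sum(axis=2)


# Inter-point distances and row-column scalar products are unchanged.
same_row_distances = np.allclose(pairwise_sq_dist(F), pairwise_sq_dist(F_flipped))
same_col_distances = np.allclose(pairwise_sq_dist(G), pairwise_sq_dist(G_flipped))
same_scalar_products = np.allclose(F @ G.T, F_flipped @ G_flipped.T)
print(same_row_distances, same_col_distances, same_scalar_products)
```

The sign of a singular vector pair is arbitrary, so a flip applied to rows and columns together only mirrors the map.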


    TABLE 8: Income categories and non-response cross-tabulated with biographical v[…]
    Income groups and missing category (I?)

                                 I1     I2     I3     I4     I5     I6
    Male                        177    644    711    454    262     18
    Female                      300    684    728    433    250    156

    Bachelor                    104    294    410    305    204    123
    Married                     207    844    941    543    292    208
    Separated                    21     24     23      7      4      2
    Divorced                      8      5     10      9      3
    Widowed                     137    161     55     22      9

    Illiterate                   59     66     26      2      1
    Read & write                 35     86     37      7      5
    School                      383   1174   1376    878    506    341

    Working                      49    304    559    466    298    219
    Retired                     143    394    228     60     27     22
    Pensioner                    92     95     24      8      4      1
    Unemployed-A                 84    140    153     71     35     11
    Unemployed-B                  9     19     23      8      9      8
    Student                       9     64    129    116     71     42
    Self-employed                86    303    318    150     65     34
    Other                         3      8      5      8      3

    Head of household: yes      323    711    658    373    184    125
    Head of household: no       150    609    774    514    324    216

    Working                      43    233    552    424    270     19
    Retired                      63    287    179     72     48      2
    Pensioner                    17     24     17      8      1
    Unemployed-A                 24     58     24      9      4
    Self-employed                 1      2      1      0      0
    Other                         0      3      0      1      1


    side, just below response 4 (150,000-200,000 pts./mon[…] the first axis. This is an informal estimate of the positi[…] respect to the other income groups. But it should be re[…] is an average position of the non-respondents, not a sp[…] and there is likely to be a high spread of incomes with[…] formal way of estimating the income in this group of n[…] would be to set up a model at the individual responde[…] group related to biographical variables, then estimate t[…] for each non-respondent.


    FIGURE 11: Correspondence analysis of Table 8; the job status in italics refers to that of the head of household (last part of Table 8)

    [Map not reproduced. Labelled points: the biographical categories of Table 8 and the income categories I1 to I6 and I?. Recovered axis label: 0.0104 (10.7%).]


    7. Visual

    THE usual way to display trends is in the fo[…] horizontal axis depicting the time line and th[…] variable which is being observed over time. […] Regidor and Gutiérrez-Fisac (1999), the number o[…] in Spain is plotted over the years 1989 to 19[…] which this figure is based, Table 5.1.2 on pag[…] reported cases for each autonomous region in […] year, 19 regions in all. To visualize and comp[…] difficult since we would have to make 19 dif[…] to compare them amongst one another and with […] Figure 5.1.1. CA can be used to interpret the […] autonomous regions. The symmetric map of […] Gutiérrez-Fisac (1999) is given in Figure 12. […] the centre of the display corresponds to the tr[…] or average row profile. Thus a complete tren[…] and the points representing the autonomous r[…] region deviates from this overall pattern, wit[…] the interpretation of these deviations.

    First, notice the trajectory traced out by […] A circle is traced out from years 1989 to 199[…]wards the centre of the map (1994 to 1996) a[…] position near 1993 and 1994. The most outly[…] those that show the greatest deviation from th[…] initial years has more than average incidence[…]cia, Aragon and then the group formed by Ce[…] Melilla in 1992, and the Canary Islands in 19[…] such as the Balearic Islands and Extremadura […] from the average trend.
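The idea that the centre of the map corresponds to the average row profile, with each region placed according to its deviation from it, can be sketched numerically. The counts below are hypothetical (3 regions by 4 years) and are not the measles data. The chi-square distance is the one described in the Appendix.

```python
import numpy as np

# Hypothetical counts: 3 regions (rows) x 4 years (columns); not data from the paper.
cases = np.array([
    [120.0, 80.0, 40.0, 10.0],
    [60.0, 60.0, 50.0, 30.0],
    [20.0, 40.0, 90.0, 50.0],
])

# Each region's trend expressed as proportions of its own total.
row_profiles = cases / cases.sum(axis=1, keepdims=True)

# Overall trend across all regions: the average row profile (centre of the map).
average_profile = cases.sum(axis=0) / cases.sum()

# Row masses: each region's share of the grand total.
masses = cases.sum(axis=1) / cases.sum()

# Squared chi-square distance of each region's profile from the average profile.
chi2_dist_sq = ((row_profiles - average_profile) ** 2 / average_profile).sum(axis=1)

# The mass-weighted average of these squared distances is the total inertia.
total_inertia = float((masses * chi2_dist_sq).sum())
print(chi2_dist_sq, total_inertia)
```

Regions with the largest `chi2_dist_sq` are those whose trend deviates most from the overall one, which is what an outlying position in the map reflects.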


    FIGURE 12: Correspondence analysis of measles trend data

    [Map not reproduced. Labelled points: the autonomous regions (Andalusia, Aragon, Asturias, Balearic Islands, Canary Islands, Castilla-La Mancha, Castille and Leon, Catalonia, Valencia, Extremadura, Madrid, Murcia, Navarre, Basque Country, La Rioja, Ceuta, Melilla) and the years 1989 to 1997. Recovered axis label: 0.1006 (23.8%).]


    8. Conclus

    IN this working paper we have tried to give a compr[…] of how correspondence analysis can assist in deciphe[…] information contained in a national health survey. Fro[…] tabulation to a multiway table and a set of intercorrel[…] variables, correspondence analysis provides a medium […] patterns in the data and suggesting hypotheses. It also […] quantification of categorical data, which can assist wi[…] building process. Optimal scales can be defined whic[…] maximum percentage of variation and condense the d[…] time, and these scales can be used in other analyses w[…] interval scales. The method also allows investigation […] which is a categorical item of information, and provi[…] method for plotting trend data as a movement betwee[…] multidimensional space.


    Appen

    Corresanalysi

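    1. Let N be the I × J table with grand total n and let P = (1/n)N be the correspondence matrix, with grand total equal to 1.

    2. Let r and c be the vectors of row and column sums of P respectively, and D_r and D_c the diagonal matrices with r and c on the diagonal.

    3. Compute the singular value decomposition of the standardized matrix with general element $(p_{ij} - r_i c_j)/\sqrt{r_i c_j}$:

    $$D_r^{-1/2} (P - r c^T) D_c^{-1/2} = U D_\alpha V^T \qquad (A.1)$$

    where the singular values $\alpha_k$ on the diagonal of $D_\alpha$ are in descending order.

    4. Compute the standard coordinates X and Y:

    $$X = D_r^{-1/2} U, \qquad Y = D_c^{-1/2} V \qquad (A.2)$$

    and principal coordinates F and G:

    $$F = X D_\alpha, \qquad G = Y D_\alpha \qquad (A.3)$$

    Notice the following: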

    The results of CA are in the form of a map […] rows and columns with respect to a sele[…] corresponding to pairs of columns of th[…] usually the first two columns for the f[…] choice between principal and standard coo[…]

    The total variance, called inertia, is equal to th[…] matrix decomposed in (A.1):

    $$\sum_i \sum_j \frac{(p_{ij} - r_i c_j)^2}{r_i c_j} \qquad (A.4)$$


    The squared singular values $\alpha_1^2, \alpha_2^2, \ldots$, called […] decompose the inertia into parts attributable to […] principal axes, just as in PCA the total varian[…] along principal axes.

    The most popular type of map, called the symm[…] two columns of F for the row coordinates and […] columns of G for the column coordinates, tha[…] coordinates as given by (A.3).

    An alternative scaling, which has a more coherent […] interpretation, but less aesthetic appearance, is […] for example, rows in principal coordinates F […] standard coordinates Y in (A.2) (or vice versa[…] between a row-principal or column-principal a[…] governed by whether the original table is con[…] rows or a set of columns, respectively, when e[…] percentage form.

    The positions of the rows and the columns in a m[…] points, called profiles, from their true positions […] space onto a best-fitting lower-dimensional sp[…] profile is the corresponding row or column of […] its respective total in the case of a contingen[…] is a conditional frequency distribution. Each p[…] a mass equal to the value of the correspondin[…] margin, $r_i$ or $c_j$. The space of the profiles is st[…]ted Euclidean distance function called the […] the optimal map is obtained by fitting a lower[…] which fits the profiles by weighted least-squa[…]

    Equivalent forms of (A.4) which show the use of […] chi-square distance are:

    $$\sum_i r_i \sum_j \frac{(p_{ij}/r_i - c_j)^2}{c_j} = \sum_j c_j \sum_i \frac{(p_{ij}/c_j - r_i)^2}{r_i}$$

    Thus the inertia is a weighted average squared […] the profile vectors (e.g., $p_{ij}/r_i$, $j = 1, \ldots, J$, for […]


    An equivalent definition of CA is as a pair[…] problems, one for the rows and one for the[…] square symmetric matrix of chi-square […] between the row profiles, with each poi[…]tive row mass. Applying classical scalin[…] coordinate analysis) to this distance ma[…] masses into account, leads to the row pr[…]

    We can write the SVD in (A.1) in terms of […] the following equivalent form, for the (i, j) […]

    $$p_{ij} = r_i c_j \Bigl(1 + \sum_k \alpha_k x_{ik} y_{jk}\Bigr)$$

    which shows that CA can be considered […] chapter 6 by van der Heijden, Mooijaart […] and Blasius, 1994). For any particular s[…] dimensions where the first two terms of […]tained, the residual elements have been […] least squares.
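Steps 1 to 4 and the properties noted above can be put together in a minimal NumPy sketch. The function name `correspondence_analysis`, the dictionary keys, and the input table are hypothetical, and the symbol `alpha` for the singular values is an assumption about notation. The sketch follows (A.1) to (A.4) as reconstructed here and is not the author's own software.

```python
import numpy as np


def correspondence_analysis(N):
    """Basic CA of a two-way table, following steps 1 to 4 of the Appendix."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                          # step 1: correspondence matrix
    r = P.sum(axis=1)                        # step 2: row masses
    c = P.sum(axis=0)                        #         column masses
    # Step 3: SVD of the standardized residuals, as in (A.1).
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, alpha, Vt = np.linalg.svd(S, full_matrices=False)
    # Step 4: standard coordinates (A.2) and principal coordinates (A.3).
    X = U / np.sqrt(r)[:, None]
    Y = Vt.T / np.sqrt(c)[:, None]
    F = X * alpha
    G = Y * alpha
    return {"P": P, "r": r, "c": c, "alpha": alpha, "X": X, "Y": Y,
            "F": F, "G": G, "total_inertia": float((S ** 2).sum())}


# Hypothetical 3 x 4 table of counts, for illustration only.
res = correspondence_analysis([[30, 12, 8, 5],
                               [10, 25, 15, 9],
                               [6, 11, 22, 20]])

# The squared singular values decompose the total inertia of (A.4).
inertia_check = np.isclose(res["total_inertia"], (res["alpha"] ** 2).sum())

# Reconstitution formula: p_ij = r_i c_j (1 + sum_k alpha_k x_ik y_jk).
reconstituted = np.outer(res["r"], res["c"]) * (1 + (res["X"] * res["alpha"]) @ res["Y"].T)
reconstitution_check = np.allclose(reconstituted, res["P"])

# Share of inertia on each principal axis.
percentages = 100 * res["alpha"] ** 2 / res["total_inertia"]
print(inertia_check, reconstitution_check, np.round(percentages, 1))
```

For a symmetric map one would plot the first two columns of `F` and `G` together. For a row-principal map one would plot `F` with `Y` instead.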


    Bibliog

    BENZÉCRI, J.-P. (1973): Analyse des Données; tome I: Analyse d[…] Classification, Paris, Dunod.

    BLASIUS, J. and M. J. GREENACRE (1998): Visualization of Cat[…] Academic Press.

    GIFI, A. (1990): Nonlinear Multivariate Analysis, Chichester, W[…]

    GREENACRE, M. J. (1984): Theory and Applications of Correspo[…] Academic Press.

    GREENACRE, M. J. (1989): The Carroll-Green-Schaffer scaling in corr[…] and empirical appraisal, Journal of Marketing Research, 26[…]

    GREENACRE, M. J. (1993): Correspondence Analysis in Practice, London, Acad[…]

    GREENACRE, M. J. and J. BLASIUS (1994): Correspondence Analysis in the Soc[…] Press.

    LEBART, L., A. MORINEAU and K. WARWICK (1984): Multivari[…] Analysis, Chichester, Wiley.

    NISHISATO, S. (1980): Analysis of Categorical Data: Dual Scalin[…] Toronto, University of Toronto Press.

    REGIDOR, E. and J. L. GUTIÉRREZ-FISAC (1999): Indicadores d[…] España del Programa Regional Europeo Salud para Todos, Ma[…] Consumo.

    VAN DER HEIJDEN, P., A. MOOIJAART and Y. TAKANE (1994): […] and contingency table models, in M. Greenacre an[…] Analysis in the Social Sciences, chapter 6, pp. 79-111, Lo[…]


    ABOUT THE AUTHOR

    MICHAEL GREENACRE, formerly Professor of Statistics at the Univer[…] South Africa, is presently a permanent foreign professor in the Department of Economics and Business at Pompeu Fabra University in Barcelona. His doctoral studies were at the University of Paris where he s[…] under Jean-Paul Benzécri, the originator of correspondence ana[…] Greenacre has written two books on correspondence analysis and […] two other books on the topic, as well as giving short courses to audie[…] marketing researchers, economists and environmental scientists in s[…] countries, notably the USA, UK, Germany, Finland, Spain, Italy, […]land, South Africa and Norway.

