Saturday Plan: Department of Linguistics

Data Collection Summary
Data Preparation
Data Analysis
Project Results and Interpretation

Preparing Data for Analysis:

To prepare your data for analysis in Excel, replicate the final Powerpoint display for each participant in a separate Excel sheet. In your Excel sheet, create a grid with the same dimensions as your Powerpoint grid. Enter text codes corresponding to each stimulus item in the same location on the Excel grid as they appear on the Powerpoint grid. Empty cells in the Powerpoint grid should correspond to blank cells in the Excel grid.

Process your data by running the macro (process_macro.txt) separately for each participant. You will first need to edit the macro for your grid size (16 x 16 in the example) and your stimulus set size (20 x 20 in the example). You may need to Run... Reset between each run of the macro to avoid an indexing error. Please note that the macro is missing a loop somewhere, so it assigns incorrect group numbers to groups with the following shape: ¬. You will need to correct those by hand (or fix my macro!).

The macro returns a copy of the stimulus grid with the text identifiers replaced by group codes (1 to M, where M is the number of groups produced), a list of the stimulus items in each group (1 to M), and an unsorted item x item similarity matrix. Items that were put in the same group have a similarity of 1 and items that were put in different groups have a similarity of 0. Sort the rows and columns of the item x item similarity matrix for each participant, so that the similarity matrices can be compared across participants.
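The similarity matrix described above can be sketched in a few lines of Python. This is only an illustration of the 0/1 definition, not a port of the Excel macro: the function name, the item labels, and the choice to score each item as similar to itself (a diagonal of 1) are assumptions made for the example.

```python
def similarity_matrix(groups):
    """Build a sorted item x item 0/1 similarity matrix for one participant.

    `groups` maps a group code (1 to M) to the list of item labels in that group.
    """
    # Look up the group code for every item.
    group_of = {item: code for code, items in groups.items() for item in items}
    # Sort the labels so rows and columns line up across participants.
    items = sorted(group_of)
    # 1 if two items share a group, 0 otherwise (diagonal is 1 by assumption).
    matrix = [[1 if group_of[a] == group_of[b] else 0 for b in items] for a in items]
    return items, matrix


# Hypothetical participant who sorted five items into two groups.
items, matrix = similarity_matrix({1: ["t03", "t01"], 2: ["t02", "t05", "t04"]})
for label, row in zip(items, matrix):
    print(label, row)
```

Because the labels are sorted inside the function, matrices built this way for different participants share the same row and column order, provided every participant classified the same set of items.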

You may also consider processing your data using a macro by Midam Kim (freeclassification_macro ver. 1.3.ppt) that runs directly in Powerpoint. This macro has the advantage of eliminating the step of copying the Powerpoint grid into Excel. Note that you must use the sample grid (grid.ppt) to use this macro. The most recent version of Midam's macro (freeclassification_macro ver. 2.4.ppt) should be flexible enough to work with most experiments, without modification of the code. Midam has also put together instructions for version 2.4.

Descriptive Statistics:

Calculate the mean, median, and range of the number of groups produced across all of the participants in your experiment. Calculate the mean, median, and range of the number of items in each group across all of the participants in your study. These descriptive statistics will give you a baseline for thinking about the classification strategy that your participants used to perform the task.
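As a rough sketch, these descriptive statistics can be computed with Python's standard statistics module. The group sizes below are made up for illustration, and the data layout (one list of group sizes per participant) is an assumption, not the output format of either macro.

```python
from statistics import mean, median


def describe(values):
    """Return the mean, median, and range (min, max) of a list of numbers."""
    return {"mean": mean(values), "median": median(values),
            "range": (min(values), max(values))}


# Hypothetical data: for each participant, the number of items in each group.
group_sizes = [
    [5, 5, 5],      # participant 1 made three groups
    [8, 7],         # participant 2 made two groups
    [3, 4, 4, 4],   # participant 3 made four groups
]

# Number of groups produced by each participant.
groups_per_participant = [len(sizes) for sizes in group_sizes]
# Number of items per group, pooled across all participants.
items_per_group = [size for sizes in group_sizes for size in sizes]

n_groups_stats = describe(groups_per_participant)
group_size_stats = describe(items_per_group)
print(n_groups_stats)
print(group_size_stats)
```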

If your stimuli can be objectively classified, you can also analyze the accuracy of your participants' classification judgments. Percent correct pairings is defined as the number of correct pairings divided by the total number of possible correct pairings. The total number of possible correct pairings is the sum of n(n-1)/2 across the experimenter-defined groups, where n is the number of items in each group. You can think of this number as "hits." Percent error is defined as the number of incorrect pairings divided by the total number of possible incorrect pairings. The total number of possible incorrect pairings is the sum of n x m across every pair of different experimenter-defined groups, where n and m are the numbers of items in the two groups. You can think of this number as "false alarms." Given that percent correct pairings and percent error are not corrected for the number of groups produced (and you will therefore observe both more errors and more correct pairings as the number of groups decreases), the values are better interpreted together as a kind of "d-prime": percent correct pairings minus percent error.

Example:

Three experimenter-defined groups of five items each:
Total possible pairings: n(n-1)/2 = 15(14)/2 = 105
Total number of possible correct pairings: 5(4)/2 + 5(4)/2 + 5(4)/2 = 30
Total number of possible errors (one term per pair of different groups): 5(5) + 5(5) + 5(5) = 75
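The same counts can be checked with a short Python sketch. The function name, the item labels, and the hypothetical participant (who merged two of the three experimenter-defined groups) are assumptions for illustration. The sketch also assumes at least two experimenter-defined groups, with at least one of them containing two or more items, so that neither denominator is zero.

```python
from itertools import combinations


def pairing_accuracy(true_group, response_group):
    """Score one participant's groups against the experimenter-defined groups.

    Both arguments map each item label to a group label.
    """
    hits = possible_hits = errors = possible_errors = 0
    for a, b in combinations(sorted(true_group), 2):
        together = response_group[a] == response_group[b]
        if true_group[a] == true_group[b]:
            possible_hits += 1      # pair belongs together
            hits += together        # ...and the participant grouped it together
        else:
            possible_errors += 1    # pair does not belong together
            errors += together      # ...but the participant grouped it together
    percent_correct = 100.0 * hits / possible_hits
    percent_error = 100.0 * errors / possible_errors
    return {
        "possible_correct": possible_hits,
        "possible_errors": possible_errors,
        "percent_correct": percent_correct,
        "percent_error": percent_error,
        "difference": percent_correct - percent_error,
    }


# Three experimenter-defined groups (A, B, C) of five items each.
true_group = {f"{g}{i}": g for g in "ABC" for i in range(1, 6)}
# Hypothetical participant: merged A and B into one group, kept C separate.
response_group = {item: (1 if g in "AB" else 2) for item, g in true_group.items()}

result = pairing_accuracy(true_group, response_group)
print(result["possible_correct"], result["possible_errors"])  # 30 75
print(result["percent_correct"], round(result["percent_error"], 1))
```

In this made-up case the participant gets every correct pairing but also makes 25 of the 75 possible errors, which is the pattern to expect when a participant produces fewer groups than the experimenter defined.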

Clustering Analysis:

Clustering analyses produce a tree (or dendrogram) visualization of the perceptual similarity of the free classification data. To conduct a clustering analysis, first sum all of your participants' item x item similarity matrices to produce a single item x item similarity matrix in which similarity is defined as 0 to N (where N is the number of participants in your study).
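Summing the matrices is a cell-by-cell addition. A minimal Python sketch follows, assuming that every participant's matrix has already been sorted into the same item order and that each item is scored as similar to itself; the three small matrices are invented for the example.

```python
def sum_matrices(matrices):
    """Add item x item matrices cell by cell; all must share one item order."""
    size = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) for j in range(size)]
            for i in range(size)]


# Hypothetical 0/1 matrices from three participants for three items.
p1 = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
p2 = [[1, 0, 0], [0, 1, 1], [0, 1, 1]]
p3 = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]

summed = sum_matrices([p1, p2, p3])
for row in summed:
    print(row)
```

Each off-diagonal cell of the result is the number of participants (0 to N) who put that pair of items in the same group.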

Hierarchical Clustering (HCS)

Using the R script (hclust.txt) and your input file (see example: hclust_in.txt), conduct a hierarchical clustering analysis of your similarity data. Note that you must first convert your similarity data to dissimilarity data by subtracting every cell from N (the number of participants in your study). One of the parameters in the hierarchical clustering model is the clustering method. Different methods produce different results, so you may want to explore different methods to obtain the best visual representation of your data. For example, the Ward and Complete methods are examples of "compact" clustering methods and tend to produce many small clusters that are later joined together. The Single method is an example of a "chaining" clustering method and tends to add single items to existing clusters. In HCS, all inter-cluster distances are equal and all intra-cluster distances are shorter than all inter-cluster distances. These relationships are typically empirically false for real data, so an alternative to HCS is the Additive Similarity Tree analysis.
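The conversion from similarity to dissimilarity is a single subtraction per cell. A minimal Python sketch, using an invented summed matrix for N = 3 participants (the function name is an assumption, and this is not part of hclust.txt):

```python
def to_dissimilarity(summed, n_participants):
    """Convert summed similarities (0 to N) to dissimilarities (N down to 0)."""
    return [[n_participants - cell for cell in row] for row in summed]


# Hypothetical summed similarity matrix from three participants.
summed = [[3, 2, 0],
          [2, 3, 1],
          [0, 1, 3]]

dissimilarity = to_dissimilarity(summed, n_participants=3)
for row in dissimilarity:
    print(row)
```

If each item was scored as similar to itself in the individual matrices, the diagonal of the summed matrix is N, so the diagonal of the dissimilarity matrix comes out as 0.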

Additive Similarity Tree Analysis (AST)

In AST analyses, the branches of each node do not have to be the same length and items added later to the tree can be closer to objects in another cluster than to objects in their own cluster. Use addtree.exe (downloadable from http://www.columbia.edu/~jec34/) and your input file (see example: addtree.txt) (http://www.linguistics.northwestern.edu/documents/clopper-materials/addtree.txt) to conduct an additive similarity tree analysis of your data. Note that the input and output filenames requested by addtree.exe are limited to 8 characters + extension (.txt or .out).

Interpretation of Clustering Solutions

Perceptual distance is represented in the clustering solutions by the lengths of the branches needed to connect any two objects. Cluster divisions are interpreted by the experimenter (i.e., there is no alpha value).

MDS Analysis:

Multidimensional scaling analyses produce a spatial representation of the perceptual similarity of free classification data in one or more dimensions.

Kruskal's Non-Metric MDS

Using the R script (mds.txt) and your input file (see example: hclust_in.txt), conduct a non-metric MDS analysis of your similarity data. Metric MDS analyses assume ratio data, whereas the only constraint on non-metric MDS analyses is that the data be ordinal. Thus, the non-metric MDS analysis is more conservative and appropriate for a wider range of data types. As with the clustering analyses, you must again convert your similarity data to dissimilarity data by subtracting every cell from N (the number of participants in your study). The MDS analysis returns two kinds of information: stress (or overall badness-of-fit of the model) and points (values along each dimension).

ALSCAL

Another non-metric MDS model is ALSCAL. You can use ALSCAL as part of the IBM SPSS Statistics Subscription at https://www.ibm.com/products/spss-statistics. The ALSCAL model has also been implemented in Praat and SPSS (alscal.sps).

Interpretation of MDS Solutions

To select the number of dimensions to interpret, first make a scree plot to compare the stress level at each number of dimensions that you included in your analysis. You should look for an elbow in the scree plot. That is, you want to choose a dimensionality that substantially reduces stress relative to the next lower dimensionality, but that is not substantially worse than the next higher dimensionality. Second, plot the points that were returned to determine whether or not all of the dimensions in your solution are interpretable. In the non-metric MDS conducted in R and in ALSCAL analyses, the resulting similarity space can be rotated and reflected, so interpretation of the space may require rotation to find the most interpretable dimensions. The resulting dimensions should be interpreted as the perceptual dimensions of similarity that were most relevant for the listeners in making their classification judgments. The values along each dimension could be regressed against acoustic measures or other data to confirm your interpretations. As a general rule, the number of items in your analysis should be greater than 4 times the number of dimensions. For example, if you have 20 items, you should have fewer than 5 dimensions in your solution.
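Two of the checks above are simple arithmetic and can be sketched in Python. The stress values below are invented for illustration, and the function names are assumptions. The sketch only tabulates how much stress drops with each added dimension; choosing the elbow is still a judgment call.

```python
def max_dimensions(n_items):
    """Largest number of dimensions d such that n_items > 4 * d."""
    return (n_items - 1) // 4


def stress_drops(stress):
    """Reduction in stress gained by moving from d - 1 to d dimensions."""
    return {d: stress[d - 1] - stress[d] for d in sorted(stress) if d - 1 in stress}


# Hypothetical stress values for solutions with 1 to 4 dimensions.
stress = {1: 0.40, 2: 0.18, 3: 0.15, 4: 0.14}

drops = stress_drops(stress)
for d, drop in drops.items():
    print(f"{d - 1} -> {d} dimensions: stress falls by {drop:.2f}")

print(max_dimensions(20))  # 4, i.e. fewer than 5 dimensions for 20 items
```

With these made-up values, the large drop from one to two dimensions followed by small drops afterward would suggest an elbow at two dimensions.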

Comparing Across Solutions

Given that most MDS models produce solutions that are invariant with respect to rotation, reflection, and scale, it is not possible to directly compare two or more MDS solutions. To compare the solutions across different participant groups or experimental conditions, you need to conduct an individual differences scaling (INDSCAL) analysis. This model will return a similarity space with your stimulus items as well as weights for each of your participants. The weights can be thought of as perceptually shrinking or stretching each dimension as a result of attention paid to the dimension. INDSCAL solutions cannot be rotated and must be interpreted with the dimensions that are returned. The INDSCAL model has been implemented in Praat and SPSS (indscal72.sps). For an intuitive example of how the INDSCAL model can be used to compare participant groups, see Viken, R. J., Treat, T. A., Nosofsky, R. M., McFall, R. M., & Palmeri, T. J. (2002). Modeling individual differences in perceptual and attentional processes related to bulimic symptoms. Journal of Abnormal Psychology, 111, 598-609.