SPSS Statistics is a software package used for interactive, or batched, statistical analysis.
We just released a full course on the freeCodeCamp.org YouTube channel that will teach you how to use the popular statistical application SPSS from IBM.
This course provides a concise overview of how you can use SPSS to explore and analyze your data for actionable insights.
Barton Poulson developed this course. Barton is a university professor and data scientist.
Here are the sections in this course:
- Versions, Editions, & Modules
- Taking a Look
- Sample Data
- Graphboard Templates
- Bar Charts
- Labels & Definitions
- Entering Data
- Importing Data
- Hierarchical Clustering
- Factor Analysis
- Next Steps
Watch the full course on the freeCodeCamp.org YouTube channel (2-hour watch).
You are about to learn about the popular statistical application SPSS from IBM.
This course from university professor Barton Poulson will show you how you can use SPSS to explore your data for actionable insights.
Welcome to SPSS: An Introduction.
I’m Barton Poulson.
And in this course, we’re going to look at the statistical program SPSS and some of its basic functionality, and give you an idea of what it can do and how well it might work in your own data work.
Now, SPSS, the name deserves a little bit of explanation.
Once upon a time, it stood for Statistical Package for the Social Sciences.
Now it’s just SPSS.
But that’s its origin.
One important thing to know is how popular SPSS is.
Here’s a chart that comes from the excellent website r4stats.com.
And what it shows is the number of scholarly articles published in 2015 using various statistical packages and languages.
And what we can see here, right at the top, is SPSS Statistics.
SPSS is number one, by far in terms of scholarly research.
Also, you can look at jobs.
Here’s another chart that is also from r4stats.com.
And what this shows is analytics job listings in 2015 on Indeed.com, one major source of tech jobs.
SPSS is on the list, but this time, you see, it’s actually a lot lower: it’s number six.
And so there is a difference here between academic publishing and employment in analytics.
Really, what this tells you is something about the audience for SPSS: the primary audience of SPSS is academic researchers, especially in the social sciences, but also in other fields like business.
Now, there are some reasons that SPSS is popular in these fields.
Number one, it’s user friendly.
It has a point-and-click interface, which allows you to assemble code really quickly.
You can save that code as what’s called a syntax file.
And then you can reuse it, you can adapt it, and you can share it with others.
Also, SPSS is really well adapted for data from experiments, where you’re comparing means via t-tests and analysis of variance, with several important options like effect sizes and power analysis built in.
And so those are some of the reasons for SPSS’s popularity, especially within academic research.
In sum, we can say a few things.
Number one, SPSS, despite being developed about 40 years ago, is still popular.
It’s got an easy to use interface.
And it’s easy to save and reuse the syntax, giving you a code basis for the work that you do within SPSS.
The first thing we need to talk about in SPSS: An Introduction is setting up and getting ready to do the work.
To do that, however, we need to take a minute and talk about versions, editions, and modules, which all refer to different kinds of things in SPSS.
The choices can really seem like an overwhelming plethora of possibilities ahead of you.
And it’s nice to break it down a little bit.
So the things we’re going to talk about are versions, which are the release updates, like the old version 1 and version 2, and editions, which vary according to what’s included in a particular purchase.
And modules are extra functions that you can get to add on to the abilities of SPSS.
We’ll start by talking about versions.
Version 1 came out in 1968.
And at that point, it was called Statistical Package for the Social Sciences.
SPSS version 24 came out in 2016.
And now it’s called IBM SPSS Statistics, and they act like SPSS doesn’t stand for anything.
Now for this course, I’m using version 22 on a Macintosh computer.
Fortunately, there haven’t been any extraordinarily major changes between 22 and 24.
And everything I’m going to show you in this course will work just fine in almost any other version of SPSS.
Now, it is possible that you’ve heard of something called PASW at some point: SPSS was briefly called Predictive Analytics Software during a trademark dispute after SPSS got bought by IBM.
It only lasted for a year or so, and it got resolved.
The important thing to know is that no matter what version you’re using, the files generally are highly compatible between versions.
And so code that you created in version 16 is probably readable in version 24.
There are some backwards compatibility issues for advanced functions like automatic modeling and so on.
But most of it is consistent all the way through.
Now we also need to talk about editions of SPSS.
And there are a few major choices here.
There’s the Base edition, the Standard edition, the Professional edition, and the Premium edition.
And they differ by price.
And they differ by the functions that are included with each edition.
So for example, in Base, you get basic statistics, you get linear regression, and you get clustering and factor analysis.
On the other hand, Standard adds on to that logistic regression, generalized linear models, and survival analysis.
It also adds drag and drop interactive tables.
The Professional edition adds to that data preparation, forecasting, decision trees, and imputation methods.
And then finally, the top-of-the-line Premium edition of SPSS adds bootstrapping, complex sampling, exact tests, and structural equation modeling.
And so each one adds on a number of other functions.
Now, this is the product pricing as of August of 2016.
And you see, for instance, that SPSS starts in the base at $1,170 per year per person.
So it’s an annual license.
And it goes all the way up to nearly $8,000 per user per year.
And so it gets really expensive.
However, I want to say this, don’t panic.
There are other ways aside from having to like, you know, sell your house to get SPSS.
Number one, there is a free trial.
And you can download SPSS and you can try it for 14 days.
And during that time, the best way to do this is see if you can make a business case and get somebody else to buy it for you.
There is also academic pricing: student pricing for SPSS starts at $35 for six months.
It’s not the super duper version, but it is absolutely sufficient for doing the majority of academic research.
Now, we also need to talk about modules.
And these are the components that add extra functionality to SPSS.
And they’re the things that differentiate the different editions primarily.
The available modules include Advanced Statistics, Bootstrapping, Categories, Complex Samples, Conjoint, Custom Tables, Data Preparation, Decision Trees, Direct Marketing, Exact Tests, Forecasting, Missing Values, Neural Networks, and Regression.
So that’s 14 additional modules.
And this sounds like a lot, but if you compare it to the 9,000 packages that are available for R, there is a difference there.
The other major difference is that these modules cost money, so you need to work that into your budget.
On the other hand, there are also free plugins that make it possible to use code in R, Python, Java, and the Microsoft .NET framework within SPSS.
So there are abilities that you can add depending on what you need.
In sum, we can say this.
SPSS has a long history as far as statistical software goes.
There are several variations and additions that you can make to it by adding extra modules.
On the other hand, it can be very pricey, so it’s something to consider when you’re doing the cost-benefit analysis of SPSS.
The next step in SPSS: An Introduction and setting up is simply taking a look at SPSS and seeing what the program’s like.
And the easiest way to do that is to just open it up.
When you first open SPSS, you’ll get this introductory splash screen.
That gives you an opportunity to open up some files, recent files and learn more about various things that you can get from SPSS.
If you want to, you can click on this box, “Don’t show this dialog in the future,” and then you won’t have to deal with it again; you can also just press Cancel.
And that brings you to the data window in SPSS, which has a lot in common with a spreadsheet.
It has these rows and columns, where you have one row per case and one column per variable.
But there are some very important differences between SPSS and a spreadsheet.
To demonstrate this, I want to open up a data set that I’ve used recently.
And then when this opens up, you see that it does resemble a spreadsheet: we have the variable names across the top, we have row numbers down the side, and we have data in the middle.
Now one important difference between SPSS’s data window and a spreadsheet is this.
You have a Data View, but you also have something called a variable view.
And it’s the same data set but if we click on it, we see it in a different way.
Each of the variables has metadata associated with it.
So for instance, for age, it tells you the type of the variable.
Now these are mostly numeric; there’s a string variable.
But you can see there are a lot of choices here: numeric, comma, dot, and so on.
You also can specify the width of the variable and the number of decimal places.
And then a really important thing that makes SPSS different from most other programs is the use of labels.
This column right here shows variable labels.
And the idea is we have a short one word variable name over here on the left.
And if you used a very old version of SPSS, they were limited to eight characters.
And you sometimes ended up with very cryptic names; you don’t have quite the same restrictions anymore.
But what’s common is to give a short name to the variable.
And then to give it a label, that is more descriptive.
In addition, you can have value labels.
So let’s come here to marital and click on this.
And this is a way of telling SPSS that in that column, zero means unmarried and one means married.
Obviously, you can make them whatever you want.
And when you come back here, you have the option of seeing them.
So I’m going to come right up here.
And I’m going to click on this one, too, which will show the value labels, and you see how they’ve appeared.
Now, I can have them go away.
Or, for the variables, if I just hover over one, then I see the longer name.
Going back to Variable View, you can also specify values for missing values, you can give the width of the column and the alignment, and then you can specify the level of measurement.
Now SPSS uses three values: scale, which is an interval or ratio level measured variable; ordinal, which is ranked data; and nominal, which is categories.
You also have the option of specifying whether something is an input variable, a target variable, or both.
And there are certain functions that use those.
But most of the time, that’s not a big deal.
And you see that in this demonstration data set, those haven’t been changed at all.
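As a rough illustration of the idea behind variable labels and value labels, the Python sketch below keeps the stored codes separate from the text shown to the reader. The names `variable_labels`, `value_labels`, and `show_labels`, and the label text, are hypothetical; this is only a way to picture the idea, not how SPSS stores its metadata.

```python
# Hypothetical metadata, loosely modeled on what Variable View holds.
variable_labels = {"marital": "Marital status", "age": "Age in years"}
value_labels = {"marital": {0: "Unmarried", 1: "Married"}}


def show_labels(variable, codes):
    """Return the display text for each stored code, or the code itself if unlabeled."""
    labels = value_labels.get(variable, {})
    return [labels.get(code, code) for code in codes]


stored = [0, 1, 1, 0]                       # what is actually in the data column
displayed = show_labels("marital", stored)  # what you see with value labels turned on
print(variable_labels["marital"], displayed)
```

The data itself never changes; turning value labels on or off only changes which of the two lists is shown.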
So the first window in SPSS is this data window.
But there’s more to SPSS than that.
So for instance, let’s make a very quick graph; I’m just going to make a simple chart here.
I’ll come and make a histogram of age.
And hit OK.
And so you see, I have a graphical user interface with drag and drop menus that allows me to assemble my commands this way, and I hit OK.
And then what we get is another window that opens up; it’s super tiny up here.
So I’m going to make it much bigger.
And this is the output window.
So it’s a separate window, the data is in one window.
And when you do an analysis, you get a separate output window, you can actually have multiple output windows.
And what this one does is it has the graphs or any statistical analyses we do.
It also has a table of contents over here that you can collapse things or you can expand them.
And an important thing is I’ve got it set so it shows the code that SPSS generates behind the scenes to create this analysis.
And the nice thing about that is you can actually use that code and you can manipulate it directly.
This code is called syntax in SPSS.
Now, by default, SPSS opens up only a data window and an output window.
But you can get a syntax window as well.
In fact, let me do that: I’m going to come up to File, New, and Syntax.
And this is a very blank window.
But it’s one that you can type in.
Or you can also use the drop down menus to put a command in there.
So I’m going to come back here to the recent command.
And I did a histogram.
And I could press OK again, but now what I’m going to do is press Paste.
And what that’s going to do is get the code for that chart.
And it’s going to put it right here.
In fact, this is the part that we use.
And if I select that, I can hit Run, I can also do Command or Ctrl R, it runs the selection, and you’ll see we get the output window again.
And it’s done the exact same thing a second time.
But this time it did it from a window where I’m able to have the text.
Now a lot of people are uncomfortable with syntax, and they like the drag and drop menus.
But a really important thing about this is it allows you to save your analysis, so you can repeat it again without having to go through all the menus.
You can simply paste the syntax from the dialogues into a syntax file, and then you can repeat it as many times as you want.
It’s also really easy to modify things when you do it that way.
And syntax files are just plain text files.
They’re saved with a .sps extension, but they read just like plain text files.
Now, these are the most important elements of the SPSS environment, the data window with both the data and the Variable View, the output windows and the syntax windows that allow you to save the command.
And this is what gives SPSS both some of its flexibility and its power.
And as you become more comfortable moving back and forth between these various windows, and seeing what you’re able to do, both with the drag and drop menus and by typing text, you will discover there’s a great amount of flexibility and power in SPSS to allow you to do the analyses you need to do and get the insight you want from your data.
We’ll continue our introduction and discussion of setting up in SPSS by taking a look at the sample data that comes as part of the SPSS application.
The really nice thing about that is it allows you to get started now, start working with things, and see how SPSS works.
The hard part, however, is that it’s totally hidden.
And so you need to know where to look in order to use the sample data.
Now, if you’re on a Macintosh like I am, then it’s going to be in your Applications folder under IBM, SPSS, Statistics, 22 or whatever version you’re using, then Samples, and then English.
In Windows, it’s a little bit different.
It’s going to be C:, Program Files, IBM, SPSS, Statistics, 22 or whatever version you have, Samples, and then English.
So you have to navigate to that manually in order to be able to find those.
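To make the folder navigation concrete, here is a small Python sketch that assembles the two locations described above. The exact folder layout (separate `IBM`, `SPSS`, `Statistics`, and version folders) is an assumption based on the spoken description, and `sample_dir` is a hypothetical helper, so check the result against your own installation.

```python
from pathlib import PurePosixPath, PureWindowsPath


def sample_dir(platform, version="22"):
    """Build the assumed sample-data folder for a platform and SPSS version."""
    # Assumed layout: IBM / SPSS / Statistics / <version> / Samples / English
    parts = ("IBM", "SPSS", "Statistics", version, "Samples", "English")
    if platform == "mac":
        return PurePosixPath("/Applications", *parts)
    if platform == "windows":
        return PureWindowsPath("C:/Program Files", *parts)
    raise ValueError(f"unknown platform: {platform}")


mac_dir = sample_dir("mac")                    # version 22, as in this course
win_dir = sample_dir("windows", version="24")  # swap in whatever version you have
print(mac_dir)
print(win_dir)
```

Only the version number changes between releases in this sketch, which matches the advice to adjust that one number.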
But when you do, you’ll see a bunch of files there.
Now, there are a few kinds in particular that are important.
There are the .sav files.
These are data files in the proprietary SPSS format; usually, they can only be opened up in SPSS.
There are also .sps files.
And these are SPSS syntax files.
They’re text files with the commands that can run a number of analyses and graphs and other functions in SPSS.
Now, you can try it in SPSS on your own computer by opening up a file called demo.sav.
But let me show you how it works.
When you navigate to the folder with the SPSS sample files in it, again, it’s several hidden layers down.
These are the files that you’ll find: these are the .sav data files.
And these are the .sps syntax files.
Now there are other things in there; there’s something called a CSA plan.
That’s an analysis plan.
There’s an XML file, and there’s a few other things in there.
But the only ones that we’re going to deal with are the .sav files, and possibly the .sps files.
Let’s scroll down here until we find demo.sav.
Now, please note, there’s a lot of other demo files around that.
So you want this one in particular, demo.sav, because that’s the SPSS data file.
I’m going to double click on that.
And SPSS opens up the file.
Now you can set SPSS so it has only one data file open at a time, or you can have multiples; I’m going to close this empty file right here.
But here is our demo file.
And this allows us to start working with a lot of the analyses and see how they work.
In fact, I’ll be using this file all the way through this entire course, because it allows you to do a number of analyses that require specific kinds of data.
And this has it all set up.
So I’ll show you a very quick one: I’m going to come up to Analyze, and to Explore.
And I will get level of education and put that in.
And so I have a long list of variables that I can work with.
These are all the same variables; I’ll just hit OK.
And that opens up my output window again; it opens up microscopically here in the top corner.
So I’m going to make it bigger.
And now I’m able to start working with my sample data.
And that allows me to get some hands on experience to see how the functions work in SPSS and to try some of the options and see how they affect things.
Our next step in SPSS: An Introduction is to look at basic graphics, because those are always a good first step in analysis.
And the easiest way to do that in SPSS is with something called Graphboard templates.
Really, you can just think of these as graphs made easy.
The idea here is that if you set the levels of measurement in SPSS, then SPSS can suggest graphs that would be appropriate for those variables.
Now in terms of level of measurement, remember SPSS uses three: number one is nominal, for different categories.
Number two is ordinal, for ranks, and number three is scale, which is for interval or ratio level measurements.
And then when you’re in the Graphboard templates, you have two basic choices: you have Basic graphs.
And those are where you first choose the variables that you want to graph.
And then SPSS will show you suggested graphs, and you can see what you want to do with them.
There’s also an option for Detailed, and this is where you choose the graph style first, and then you choose the variables that go into it.
Now, these aren’t exclusive; you can bounce back and forth between the two tabs, and it’ll be easiest to see how it works if we just go to SPSS.
If you’re logged into datalab.cc, then you should be able to download the exercise files from the same page that this video is on; open up this file, SPSS01, underscore 3, underscore 1, underscore Graphboard, the .sps syntax file.
And let’s see what it looks like.
The syntax file that you’ve opened looks kind of complicated.
But this is really because I want to have a written record of the same things that we’re going to do with the drag and drop menus and the Graphboard.
We do need to open a data set.
And as I mentioned before, depending on whether you’re on a Macintosh or on a Windows computer, the path to the data set is a little bit different.
And it also depends on the version you’re using; I’m using 22.
And so if you’re using something else, change that number right there; most of it should be the same.
And you can run this command and open up the data set and activate it.
Now I’ve already done that; I’ll show you.
There’s my data set right there, demo.sav.
And we can come down here to Variable View.
And see the levels of measurement that SPSS has assigned to these.
Most of them are scale, we have a few that are ordinal, and we have only one variable in this data set that’s truly coded as nominal.
And that’s gender, which is actually a string variable in this case.
I’ll go back to the syntax.
Now, I have some rather complicated syntax here.
But what you’ll see is that when we use the menus, it’s actually pretty simple.
The first thing we’re going to do is make a chart of age.
So I’m going to come up here to Graphs, to the Graphboard Template Chooser.
And when I come to that, you see I’m in this tab of Basic graphs, and this is where I choose a variable.
I’m going to choose age right here.
And it recommends three different kinds of charts: a dot plot, a histogram, and a histogram with a normal distribution.
We’ll take the very first one that’s available, dot plot, and hit OK.
It puts it in the output window, which I have to maximize.
And there it is: it’s a dot plot, and it looks a lot like a histogram of age in years.
So it goes down to 18 years, and it looks like it goes up to about 77 or 78.
And it’s an easy way to get a feel of the distribution that we’re dealing with.
Again, the command in text, in syntax, is complicated.
But the graphical interface makes this very easy to do.
Go back to the syntax for a moment.
If you were to paste the syntax for that command, this is what you would see right here.
And this way of saving it, you can modify it manually if you want.
Now we’ll do a histogram of age with a superimposed normal distribution.
Again, I’ll come up to Graphs, Graphboard Template Chooser.
And this time, all I have to do is come over to the right, click histogram with normal distribution, and hit OK.
Expand the output window.
And it’s really simple.
Now both of those charts that I showed you were with age, which is a ratio level, or scale, variable in SPSS terminology.
We can also do this with categorical variables: I’ll use gender and make a bar chart, so I come back up to Graphs and hit Graphboard Template Chooser.
And when I come down to gender, you’ll see that the recommended charts change, because this time it knows it’s a categorical variable.
Now, if I had GPS data, I could put that in here; I can do a bunch of different things, but I’m just gonna do a bar chart because that’s the easiest to deal with.
I’ll hit OK.
Make the output window bigger.
There’s my bar chart and you see that in this particular data set, we have an almost exactly equal number of men and women or data on them.
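The way the Basic tab narrows its suggestions by level of measurement can be pictured with a small lookup, sketched here in Python. The mapping is a deliberate simplification that lists only the charts mentioned in this walkthrough; `SUGGESTED`, `levels`, and `suggest` are hypothetical names, and the real chooser offers more options than this.

```python
# Assumed, simplified mapping from level of measurement to suggested charts.
SUGGESTED = {
    "scale": ["dot plot", "histogram", "histogram with normal distribution"],
    "nominal": ["bar chart"],
}

# Levels of measurement as set in Variable View for two of the demo variables.
levels = {"age": "scale", "gender": "nominal"}


def suggest(variable):
    """Return the chart types suggested for a variable, based only on its level."""
    return SUGGESTED.get(levels[variable], [])


print(suggest("age"))
print(suggest("gender"))
```

This is why setting the level of measurement correctly matters: the suggestions depend on the level, not on the values in the column.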
Now, those are the basic charts where you choose the variable first and SPSS recommends particular graphs.
You can also do detailed charts.
These are ones where you choose the style of chart first and then you fill in the variable.
I’m going to do this again for a dot plot of income and then show you that it’s really easy to modify it; I’ll come up to Graphs, to Graphboard Template Chooser.
This time, I’ll go to the Detailed tab and click on that.
And I’m going to make a dot plot.
So I’m going to scroll through this; you see we have a lot of choices.
Choose dot plot.
And that’s going to ask what I want to make a dot plot of.
I’m going to click on this, and I’m going to scroll to income.
See, the one that I want is right here, household income in thousands.
I can click OK.
Then expand the output window.
And here’s my chart.
It’s really basic charting; you see that most of the people are at the low end, especially because this is hundreds of thousands of dollars.
So that’s going to be a million dollars right there.
But I want to show you an interesting thing about this.
If we double click on the chart, that opens up the edit window, and the Graphboard editor has some special options.
For one thing, I can change the number of decimal places here: I just click on the decimals, come to Format, and change the minimum number of decimals to zero.
But the more interesting one is, if I click on the dots themselves, they’re done as points and the modifier is to pile them; there are a few other modifiers that can be useful.
One is to dodge them.
And what that does is it puts them in the middle, expanding out either way; it might be a little harder to make comparisons from one level to another.
But it’s an interesting kind of chart, I can click on it again.
And we can do what’s called jitter with a normal distribution, and that takes points with the same value and it kind of randomly spreads them out, up and down.
And again, you can see that we’ve got a whole lot there at the bottom.
One other choice is jitter uniform, which makes them stay within certain boundaries.
But it’s hard to tell really how much things are spread out there at the bottom.
So I actually prefer pile or I think dodge is interesting in this case.
And so that’s one way of using Graphboard to both set it up and then to manually modify it by double clicking on the chart; I can close this because I’m done with that.
And you see I have the modified version right there.
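The four point modifiers differ only in how they place points that share the same value. The Python sketch below is one plausible interpretation of pile, dodge, and the two kinds of jitter; it is not the Graphboard editor's actual algorithm, and the function names, the `sd` and `width` parameters, and the income values are all made up for illustration.

```python
import random
from collections import Counter, defaultdict


def pile(values):
    """Stack repeated values upward from the baseline: offsets 0, 1, 2, ..."""
    seen = defaultdict(int)
    placed = []
    for v in values:
        placed.append((v, seen[v]))
        seen[v] += 1
    return placed


def dodge(values):
    """Center each stack on the middle so it expands out in both directions."""
    totals = Counter(values)
    seen = defaultdict(int)
    placed = []
    for v in values:
        placed.append((v, seen[v] - (totals[v] - 1) / 2))
        seen[v] += 1
    return placed


def jitter_normal(values, sd=1.0, seed=0):
    """Spread points by a random, normally distributed offset (unbounded)."""
    rng = random.Random(seed)
    return [(v, rng.gauss(0, sd)) for v in values]


def jitter_uniform(values, width=1.0, seed=0):
    """Spread points by a random offset that stays within +/- width."""
    rng = random.Random(seed)
    return [(v, rng.uniform(-width, width)) for v in values]


incomes = [20, 20, 20, 35]  # hypothetical household incomes in thousands
print(pile(incomes))
print(dodge(incomes))
```

With pile and dodge, the height of a stack still tells you how many cases share a value, which is one reason they can be easier to read than jitter when many points sit at the low end.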
Now, we can get a lot more complicated.
So for instance, I can make a scatterplot of age and income with colors for point density.
There’s a lot of options, and you can explore them.
This time, I’m going to do it a little bit differently: I’m just going to select this command.
And again, the way I got these was by setting them up in the menus and then simply hitting Paste, and it put the syntax into this syntax file, so I could save it and run it later.
So I’m going to show you how that works.
I’ve got the command here that I created using the graph board template chooser.
And I’ll simply come up and select Run selection.
And I maximize that window.
And there you can see I actually have what’s called a hex scatterplot.
And it’s showing a few different things.
And it’s a really neat way.
So you have a lot of options on the way you display things in the Graphboard Template Chooser.
And while the code is complicated, the interaction with the menus is really simple.
You can be creative, and you can get different views on your data and try to get more insight as you’re doing your analysis.
The next step in our introduction to SPSS and basic graphics is bar charts.
And we like bar charts for a very simple reason.
They are simple, and simple is good; or more specifically, bar charts are the most basic graphic for the most basic data: just frequencies for a simple category.
It’s also a very basic command in SPSS.
Now, we actually have a few options on different kinds of bar charts.
One, we can make a simple bar chart, so a single variable, simply showing the category frequencies in that variable.
Two, we can do a grouped bar chart, where we break it down by some other variable.
And then three, we can do multiple variables and show the bars simultaneously.
But let’s try this in SPSS, it’s really easy to do, just open up this SPSS syntax file, and we’ll give it a whirl.
Once you’ve got the file open, you’ll need to open the demo data set that we’ve used before.
This is the command for Mac if you’re running 22, and this is the command for Windows if you’re running 22; just change the version number if you need to.
Once you have the file open, we’re going to make some bar graphs.
Now I’m going to do it by coming up here to what are called the Legacy Dialogs.
These are specialized, one-graph-only dialogs that come from earlier versions of SPSS.
And truthfully, I usually use these because I find them so quick and easy to deal with.
What we’re going to do is make a bar chart for levels of education in our sample.
So I’m gonna hit bar, we’re going to do a simple bar chart.
And we’ll do groups of cases.
And all I need to do is hit level of education, put it into the category axis, and hit OK.
And I make the output bigger.
There it is, an absolute piece of cake.
And it’s also very, very simple syntax.
You see this syntax right here; it really could be one line.
And just as a point of comparison, here’s the same chart produced with the Chart Builder.
But you see, we have this really complicated, overwhelming code; the legacy dialog produces it in an extremely simple way.
So that’s a simple bar chart, piece of cake.
Now let’s do a clustered bar chart for groups of cases; we’ll look at levels of education by gender.
To do that, we come back up to Graphs, Legacy Dialogs, to Bar.
And now we’re going to do a clustered chart: level of education clustered by gender.
I hit Define, get level of education, that’s sort of our outcome variable, put that under Category Axis, and then define clusters by gender; I put that right there.
I’ll hit OK.
And make it bigger.
And this time, it uses nicer colors.
But you have the five levels of education broken down where women are in blue, and men are in green.
But it’s really easy to see here, the relationship between the two variables.
And in this particular data set, it really looks like there’s no substantial difference between the men and women.
Now, I will say I believe this is an artificial data set.
So we wouldn’t expect a lot of differences.
But this is a nice way to compare them.
By the way, I’ll come up, and you’ll see that the code for this is really simple.
All it does is it adds “by gender.”
So again, a very short command, I’m going to go back to the syntax.
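What a clustered bar chart for groups of cases draws is a count for every combination of the two variables. A minimal Python sketch of that tally follows; the cases are invented for illustration and are not values from the demo data set.

```python
from collections import Counter

# Hypothetical cases as (level of education, gender) pairs.
cases = [
    ("High school", "f"),
    ("High school", "m"),
    ("College", "f"),
    ("College", "f"),
    ("College", "m"),
]

# One count per (education, gender) combination: one bar per combination.
counts = Counter(cases)

# Regroup so each education level (the category axis) holds its cluster of bars.
clusters = {}
for (education, gender), n in counts.items():
    clusters.setdefault(education, {})[gender] = n

for education, bars in clusters.items():
    print(education, bars)
```

A simple bar chart is the same tally with the second variable left out, which is consistent with the clustered command only needing one extra piece.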
And we’re going to do one more here.
And that is for multiple variables.
So this is a situation in which it can be confusing if you have a lot of categories within each variable.
What I’m doing here is I’m going to get the means of variables, or the numbers of ones.
If you have an indicator variable, where 0 is for no and 1 is for yes, this is a really nice way of comparing the frequencies of each one of them across.
I’ll show you how that works.
We’ll go up to graphs, we’ll come back over to bar.
And we’re going to do a simple one.
But this time we’re doing separate variables, and I hit Define.
Then I’m gonna come down here in this data set, which again I believe is fictional, and which asked a lot of people about various things that they might do.
It asked them about wireless service.
And we’re going to come down to whether they own a fax machine, because this is old data.
And it’s asking about old technology, like pagers; I’ve never had a pager.
But I simply select all those variables and put them in here.
And as long as they’re all on the same scale, it’s going to do the mean of each one.
And on a 0/1 variable, the mean is the proportion of ones.
Hit OK.
And there we have it, it’s a way of looking at the distribution of multiple variables simultaneously.
It’s a very information dense display.
And especially when you’re the analyst exploring your data, this can be a really quick and easy way of getting a feel for your data, which can then direct your further analyses.
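The point that the mean of a 0/1 indicator equals the proportion of ones can be checked with a few lines of Python. The answers below are invented for illustration and are not the values in the demo data set.

```python
from statistics import mean

# Hypothetical 0/1 answers (1 = yes, 0 = no) for three indicator variables.
answers = {
    "wireless": [1, 0, 0, 1, 0],
    "fax": [0, 0, 1, 0, 0],
    "pager": [1, 1, 0, 1, 0],
}

# The bar height for each variable is its mean...
bar_heights = {name: mean(values) for name, values in answers.items()}

# ...which is the same number as the share of cases that answered yes.
proportions = {name: values.count(1) / len(values) for name, values in answers.items()}

for name in answers:
    print(name, bar_heights[name], proportions[name])
```

Because every bar is a proportion between 0 and 1, the variables share a scale and can sit side by side in one chart.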
As we continue to look at basic graphics in SPSS, a really common one is histograms.
And this is a graphic for data that is quantitative or scaled or measured, or interval or ratio level, those really all are referring to basically the same thing.
And in any of them, you’re going to want to make a histogram to see what the variable is like.
Now, I mentioned that SPSS prefers the term scale for these variables.
And that’s what shows up in the data definitions.
And I like to think of it as the scales of justice.
But why are we making a histogram? The point is to see what you have, to see what the data is like.
And there are a few things in particular that you’re going to be looking for.
Number one, you’re going to be looking for the shape of the distribution: is it unimodal, bimodal, skewed left, skewed right? Are there gaps in the data that suggest that maybe you have some important mechanism operating? Or are there outliers that you would need to take into consideration before you do your analysis?
Is your data symmetrical? There are a lot of different things that you could look for.
And some of these are going to have a lot of influence on your analyses.
So it’s important to take a look at the data, and a histogram will give you a great impression of a quantitative or scale variable.
We’ll try it in SPSS, simply open up this syntax file, and we’ll see how it works.
When you’re in SPSS, most of this is really just to open up the data set. It’s the same one we’ve used in the others, this demo data set.
And here’s the code for Mac, adjust the version number if you need to, and here’s the code for Windows.
But once you have the data set open, you can use the commands and it’s really, really simple.
All you need to do is come up to Graphs, and we’ll go to Legacy Dialogs.
And we’ll come down here to the bottom to histogram.
And we’re going to make a basic histogram of age.
So I click that, and I come to age, it’s our first variable.
And I simply click this to move it over and hit OK.
Make the output window bigger.
And there’s our histogram.
And from this, we can see that our distribution is unimodal. We can see it’s pretty close to normal; it’s slightly skewed on the high end, but not very much.
And this is going to be a really good variable for most of our analyses, because it meets most of the assumptions of the kinds of procedures that we might want to use.
Now if I want to make things slightly more complicated, because you see that the command for this is extremely simple, we can make a small modification I’ll show you here, we can superimpose a normal distribution.
And all I have to do for that is come back to Graphs, Legacy Dialogs, and to Histogram.
And I just check this box right here, Display normal curve.
And what that’s going to do is going to create the same distribution, we’re just going to put on top of it, a line of a bell curve, a normal distribution that has the same mean and standard deviation.
And here you can see, we’re pretty close to normal.
And this is a nice way of confirming that.
And again, the code for it is really simple.
All it does is add the word NORMAL in this syntax, and that gives us everything we need.
One of the reasons I really like the legacy dialogs in SPSS is that they’re so concise and so simple, and they get you what you need.
So you can get a grip on your data and move ahead.
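The idea of laying a normal curve over a histogram can also be sketched outside SPSS. The following is a minimal Python illustration, not SPSS syntax: the ages are made-up values, and the starting point and bin width are assumptions chosen for the example. It counts the cases in each bin and compares each count with what a normal distribution with the same mean and standard deviation would predict.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical ages, invented for illustration (not taken from the demo data set).
ages = [23, 27, 31, 34, 35, 38, 39, 41, 42, 44, 45, 47, 49, 52, 55, 58, 63, 71]

def histogram_with_normal(values, start, width, n_bins):
    """Return (bin_start, observed_count, expected_count) for each bin.

    The expected count comes from a normal distribution that has the
    same mean and standard deviation as the data.
    """
    normal = NormalDist(mean(values), stdev(values))
    rows = []
    for i in range(n_bins):
        lo = start + i * width
        hi = lo + width
        observed = sum(1 for v in values if lo <= v < hi)
        expected = len(values) * (normal.cdf(hi) - normal.cdf(lo))
        rows.append((lo, observed, expected))
    return rows

rows = histogram_with_normal(ages, start=20, width=10, n_bins=6)
for lo, observed, expected in rows:
    # One '#' per observed case, followed by the count the normal curve predicts.
    print(f"{lo:>3}-{lo + 9:<3} {'#' * observed:<8} expected {expected:.1f}")
```

When the observed and expected counts track each other closely in every bin, the variable is close to normal, which is the same judgment the superimposed curve supports visually.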
As we continue “SPSS: An Introduction” with basic graphics, we should look at scatterplots, a very common method of looking at associations or, as I like to think of it, a way of assessing togetherness in data.
In other words, you want to see what goes with what or, more specifically, what variable goes with what other variable.
So scatterplots are a great way of visualizing the association between two quantitative variables.
When you make a scatterplot, there are some things you should look for.
And in case you’re wondering what they are.
They include, for instance, whether the association between the two variables is linear, because a lot of the procedures that are common, assume that you can draw a straight line through the data, you want to check the spread of the data, especially whether the spread changes as you go from left to right, on a scatterplot.
That’s called heterogeneity of variance and it can cause problems with certain procedures.
You want to look for outliers, either univariate, that is, a score that’s unusual on a single variable by itself,
or, what’s even more significant in this case, bivariate, where you have an unusual combination of scores.
And then finally, you want to try to get some idea of the correlation, or the strength of the association, between the two variables.
A scatterplot will allow you to do all of those.
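The strength of a linear association that you judge by eye in a scatterplot is usually summarized with a correlation coefficient. Here is a minimal Python sketch of the Pearson correlation, offered as an illustration rather than as what SPSS runs internally; the age and income pairs are invented for the example.

```python
from math import sqrt

# Hypothetical (age, income in thousands) values, invented for illustration.
ages = [25, 32, 38, 41, 47, 53, 60]
incomes = [28, 35, 52, 48, 75, 90, 110]

def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = pearson_r(ages, incomes)
print(f"r = {r:.3f}")
```

A value near 1 or -1 means the points fall close to a straight line; a single number like this cannot reveal nonlinearity, changing spread, or outliers, which is why the plot still matters.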
Now in SPSS, there are three general kinds of scatter plots that you can do.
Number one, is a simple scatter.
It’s a bivariate X-Y chart, easy to do.
Number two is a matrix scatterplot, where you actually have several variables there simultaneously.
And it’s a good way of looking at complex associations between collections of variables.
And number three, SPSS is able to do a 3d scatterplot.
But I’ll have some words to say about that a little bit later.
But let’s try this and see how scatterplots work in SPSS, at least very basically.
So just open up this syntax file, and we can see how it works.
When you open up the syntax file, we have the same situation where you can load the data; we’ll use demo.sav.
And you can use this command if you’re on a Mac using version 22.
And this command on Windows version 22.
But we’re just going to make a couple of scatter plots and it’s a really basic, easy command.
The first thing we’re going to do is make a scatterplot of age and income. Let’s come up to Graphs, Legacy Dialogs, and down to Scatter. I want to use a simple scatter, that’s just a basic bivariate X-Y chart, so I’ll hit Define.
And all I need to do here is pick my variables for the x axis across the bottom and the y axis up the side. We’re going to pick age for the x axis and put it right there.
And household income for the y axis.
And the idea is, maybe there’s an association between household income and how old the person is.
That’s all I need to do, except click OK.
And when I get that, I get this basic scatterplot.
So I have age in years across the bottom, and I have household income in thousands up the side.
And you can see, of course, that most of the people are near the bottom.
That’s because most people make less than $200,000 a year, while this graph goes up to 1.2 million.
We have a marker that’s a large, empty circle in black, and you can change the markers.
And there are things you can do to clean up the chart.
But it’s also easy to tell that the people who, for instance, make a lot of money are generally older.
So we can see in this data, there is some kind of association between age and income.
But let’s try to get a more nuanced one by looking at several variables simultaneously with a scatterplot matrix.
Come back up to graphs and legacy dialogues.
And down to scatter.
This time, however, I’m going to pick matrix scatter, click define.
And now all I need to do is pick the variables I want to include, I don’t have to specify X or Y, because they’re all going to serve as both x and y in different parts of the matrix.
I’m going to pick a few here, I’m going to get household income, I’ll move it over, I will get age and move that over.
I’ll get address, years at current address, and move that over. I’ll get reside, which is the number of people residing in the house, and move that.
And then finally, I’ll get level of education.
There’s nothing especially meaningful about these, they’re just ones that I thought would be easy to look at.
Now, as a general recommendation, if you do have one variable that is an outcome variable, you might want to put that one in first, that puts it in the first column in the first row.
And it makes it easier to find it when you’re looking at your analyses.
But I’ve got my five variables in there, and I just come and press OK.
It takes a moment.
And then it comes up.
And this is the scatterplot matrix.
And so you have all five variables listed on the side, you have all five variables listed across the bottom.
So each one functions as both an X and a Y, you have empty boxes down the diagonal, because that would be each variable with itself.
And the correlation is always one.
Now, there are things you can do to clean this up: you can change the marker from a big black circle to something that’s smaller and easier to see, and you can put regression lines through.
But it’s easy to see that there are some really important patterns.
So for instance, with age in years and years at current address right here, obviously, there’s a limit: you can’t live someplace longer than you’ve been alive.
That’s why we have nothing in the top left of that.
But you do see some associations and some cutoffs that go through.
Now, this one’s really dense. In a lot of situations, it’s going to be a lot easier to see the patterns that are there, especially if you change the markers and put in regression lines.
But this gives a good idea of what you can do with a scatter plot matrix.
Now, let’s go back one more time to the legacy dialogues and to scatter, because you saw that there were other options there.
There’s a dot plot that’s like a histogram.
And there’s an overlay scatter, which I don’t want to deal with.
And then there’s a 3d scatter.
And you might look at that and think, oh, cool, it’s interactive, it’s 3D, it’s a great thing.
I’m actually not even going to do it.
Because every time I’ve done a 3D diagram, I’ve found it’s impossible to read it clearly.
It’s very hard to manipulate in SPSS.
And it ends up being really a bad experience.
And it’s much easier to look at the association between variables using a scatterplot matrix.
That’s why I recommend that you avoid the 3d completely, even though it’s available here.
But avoid it completely.
And use the bivariate scatterplots and the scatterplot matrices as a way of looking at the associations between variables in your data.
Once you’ve done the basic graphics for your data and seen what you’re dealing with, it’s a good idea to move on to basic statistics.
And in SPSS, the most basic version of this is frequencies.
I like to think of it as putting things into buckets, and then simply counting what’s in the buckets.
So the idea is when you have a limited number of categories in your data, then you should just count how often each category occurs.
It’s a first step that can reveal some significant insight.
But wait, I just want to mention that the frequencies command in SPSS can do so much more than that.
And I’m going to show you how it works.
For example, it can do charts.
It can do bar charts and pie charts and histograms and normal distributions.
And it can do a lot of statistics beyond frequencies: it can do quartiles, percentiles, mean, median, mode, standard deviation, variance, skewness, kurtosis, and so on.
In fact, because of this, I like to think of frequencies as SPSS’s version of the competent man character in literature and movies, who can do everything well.
You know, somebody like Leonardo da Vinci, or Iron Man, who seems to be able to do everything, or, you know, Marie Curie right here, because she won two Nobel Prizes, and what have the rest of us done?
But anyhow, back to statistics.
Let’s take a look at frequencies and let’s try it in SPSS.
Just open up this syntax file, and we’ll see the things that it’s able to do for you.
As always, we need to begin by opening a data set; we’ll use demo.sav.
And you can use this command in Mac or this command in Windows to do that.
Once you have the data set open, it’s a very simple thing to get the frequencies.
Now I have the syntax saved here.
But really, it’s more as a record of what I’ve done, because I use the drop down menus to create these commands.
So I’m going to come up to frequencies and I’m going to get the frequencies for gender and job satisfaction.
To do that, I come to Analyze, to Descriptive Statistics.
And then the first option there is frequencies.
And what I’m going to get is gender.
Just right here, I’ll just double click to move it over.
Then I’ll get job satisfaction; I’ll double click and move that over.
Now, what’s important is these are two different kinds of variables.
Gender is a categorical variable, nominal.
And job satisfaction here is a scaled variable.
And so normally, you don’t do the same kinds of things for these.
But frequencies is very flexible.
So I’m just going to hit OK, and we’ll see the default output for frequencies.
The first thing that it shows us is how many valid observations there are, so how many of our 6400 cases have data on these variables? The answer is all of them.
There’s no missing data here.
And then it comes down.
And it gives us frequency tables where it lists every value, or a possible score on the variable, and then says how often each one occurs.
So for gender, we have 3179 female respondents, that’s 49.7%.
And the percent and the valid percent would be different if we had missing data.
But we don’t, so we can ignore that.
And then the cumulative simply adds up to 100.
And then job satisfaction.
This is a scaled variable, which has 1, 2, 3, 4, and 5 as answers.
And here you can see how many people gave each of the answers: 17% highly dissatisfied, 21.8% neutral, and 19.1% highly satisfied.
And that’s a quick look at the frequencies that we’re dealing with.
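The columns of a frequency table (count, percent, valid percent, cumulative percent) are simple to compute by hand. The sketch below is a Python illustration of that bookkeeping, with invented responses and `None` used as a stand-in for missing data; it is not the SPSS FREQUENCIES command itself.

```python
from collections import Counter

# Hypothetical responses; None marks a missing answer.
responses = ["Female", "Male", "Female", None, "Male", "Female", "Male", "Male"]

def frequency_table(values):
    """Rows of (value, count, percent, valid percent, cumulative percent)."""
    total = len(values)
    valid = [v for v in values if v is not None]
    counts = Counter(valid)
    rows = []
    cumulative = 0.0
    for value in sorted(counts):
        count = counts[value]
        percent = 100 * count / total             # out of all cases
        valid_percent = 100 * count / len(valid)  # out of non-missing cases
        cumulative += valid_percent
        rows.append((value, count, percent, valid_percent, cumulative))
    return rows

table = frequency_table(responses)
for value, count, pct, valid_pct, cum in table:
    print(f"{value:<8}{count:>4}{pct:>8.1f}{valid_pct:>8.1f}{cum:>8.1f}")
```

With one missing answer in the list, percent and valid percent differ; with no missing data, as in the course example, the two columns would be identical.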
It’s a nice way also to check if your variables are coded well.
But what we can do is more than that, we can also turn off the tables.
And we can do bar charts using the frequencies command.
So I’m going to keep those same two variables, gender and job satisfaction.
But this time, I’m just going to make bar charts.
I’ll go back to my recent commands and frequencies.
And what I’m going to do is I’m going to click this, it’s going to give me a little error message because I haven’t changed the other thing.
First, I’m going to come to charts right here.
I tell it to make bar charts, obviously, you can make pie charts and histograms as well.
I’ll click Continue.
And then click OK.
And now the same general command frequencies is not producing tables, but it’s producing charts.
And here you can see that we are very closely matched in terms of the number of male and female respondents.
And here you can see job satisfaction sort of peaks at neutral and somewhat satisfied.
So that’s a really nice thing.
You don’t even have to use the bar chart command, you can do it right here.
You can also get more kinds of statistics in there.
So for instance, on this one, I’m going to keep the tables off and I’m going to ask for a few extra things.
In fact, let me just come back to this one.
We’re going to Analyze, Descriptive Statistics, and Frequencies.
And this time, I’m going to do age, reside, and jobsat.
So I’m going to remove my one categorical variable here.
I’ll just reset that. I’ll do age, and the other two: reside and jobsat.
And then I think that’s this one right here.
Then we’ll come down to job satisfaction.
And we’ll move that over.
So I have three variables, but they’re all scaled variables.
What I’m going to do here is first I’m going to come to statistics.
And I have a really impressive range of things I can get: I can get the mean, I can get the median, and the mode. If you want the mode, I think this is the only place to get it in SPSS. I can get quartile values.
Now it doesn’t do the minimum and the maximum, you have to select those separately down here.
But you can also get cut points.
Now, a cut point is an interesting one.
The quartiles are cut points; they split the data into four equal-sized groups with the same number of people in each.
Sometimes you want something other than that.
So for instance, I know that if you’re doing propensity scores, it’s not uncommon to use five equal groups quintiles.
And also, there are situations in which you want not the most extreme scores, but ones near the most extreme.
And so I’m going to put in, for instance, the 2.5 percentile
and the 97.5 percentile, because those frame the middle 95% of the data. I can also get the standard deviation and the variance, or anything else I want right here.
I want skewness and kurtosis.
I’m going to hit Continue. Then I’m going to come back to this one and turn off the frequency tables, because otherwise, with a lot of different possible values here, I’d have a lot going on.
I’ll hit charts.
And this time, I’m going to ask for histograms.
And we’ll put a normal curve on top of each histogram.
And click OK.
And so here’s what we get.
It starts with the statistical output. Here are the three variables I selected; it gives us the mean, the standard deviation, the variance, skewness and the standard error of skewness, and kurtosis.
We have the minimum and maximum scores.
And then the percentiles.
Now it’s a funny list here, because I’ve got three things intermingled. I have the quartiles; that’s something I asked for.
So we have the 25th percentile, the 50th percentile, and the 75th percentile.
Those are the quartile values.
I had the minimum and maximum up here.
So those are the 0% and 100% quartiles, but I also asked for quintiles.
And so that puts cut points at 20, 40, 60, and 80%.
And then finally, I manually entered the 2.5 percentile and the 97.5 percentile.
And so they’re all put there together, but it’s really easy to see the changes in the distribution.
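The three kinds of cut points mixed together in that table can be reproduced with a short Python sketch. This is an illustration under stated assumptions: the scores are invented, and Python’s default quantile method may not match SPSS’s percentile definition exactly on real data.

```python
from statistics import quantiles

# Hypothetical scores 1..200, invented so the cut points are easy to read.
scores = list(range(1, 201))

quartile_cuts = quantiles(scores, n=4)   # 25th, 50th, 75th percentiles
quintile_cuts = quantiles(scores, n=5)   # 20th, 40th, 60th, 80th percentiles

# Splitting into 40 groups gives cut points every 2.5%, so the first and
# last cut points are the 2.5th and 97.5th percentiles.
fortieths = quantiles(scores, n=40)
middle_95 = (fortieths[0], fortieths[-1])

print("quartiles:", quartile_cuts)
print("quintiles:", quintile_cuts)
print("middle 95% runs from", middle_95[0], "to", middle_95[1])
```

Asking for n equal groups always returns n - 1 cut points, which is why quartiles give three values and quintiles give four.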
Beneath that we have the histograms. Each variable has its own histogram, along with a normal distribution with the same mean and standard deviation laid on top. Age is pretty close to normal.
Here’s the current address variable, however, which you can see is really skewed, because most people haven’t lived there that long.
And then finally, job satisfaction is a little flatter than we would expect if it were normally distributed.
The point of this is that I’m able to do a tremendous amount of statistical and graphical work using a single command the frequencies function in SPSS, one of the most versatile commands you’ll ever use.
In our previous movie, we looked at the power of the frequencies command, but for basic statistics, another very common choice is descriptives within SPSS.
The nice thing about descriptives is it allows you to achieve maximum density.
That is how to get a lot of numbers on a lot of variables in just a little space.
That’s what descriptives is really good for.
On the other hand, there is a restriction, it works only with numerical variables.
But that’s a lot of the data that you might have.
And if you have that, it can give you things like the mean, the sum, the standard deviation, the standard error, the variance, the minimum and maximum, the range, the skewness, and kurtosis.
Now, I say that, but guess what: in case you don’t remember, frequencies does more, but that’s okay.
There’s certain things that the descriptives command does well, here’s what it does well, first, it gives you a very concise compact tabular output.
So it’s really easy to see a bunch of information in a small space.
Second, it’s a really quick way to find obvious errors in coding in your data.
Finally, you can get proportions for indicator variables, that is, 0/1 variables, and I’ll show you how that works.
Also, we have a bonus feature here in descriptives.
Descriptives is the home of SPSS’s top-secret, hidden, one-step z-score procedure.
I’ve seen people knock themselves out trying to get z-scores by getting standard deviations and means. You don’t have to do any of that; you click one button and you’re done.
But let’s try it in SPSS and I’ll show you how it works.
Just open up this syntax file, and we’ll see what you can do with descriptives.
We’ll begin, as always, by opening the dataset. We’ll be using demo.sav; here’s the path on a Macintosh running version 22.
And the path on Windows, also running version 22.
This is my first command, and it looks really long.
But that’s because I have a lot of variables in it.
All we need to do is come up to analyze, to descriptive statistics, and descriptives.
We click on that.
Now one of the things it does is it only shows you the variables that it can analyze.
So gender, which was a string variable, meaning it had just text, is not in there.
But what I can do is just select all of them with Command-A or Control-A, and then move everything over.
And then I’m just going to do the default analysis.
I’ll just hit OK.
And here’s our output.
We have a whole bunch of variables, and it tells us first the number of observations, which is 6400 almost all the way down.
This question about internet is missing some data, but that appears to be the only one.
We have the minimum value and the maximum value.
By the way, this is where I talk about quick and easy data checking.
If you have a variable that’s only supposed to go from one to five or zero to one, if you have a 17, you know something’s wrong.
And so simply checking the outer boundaries is a fast way of seeing if there are any really obvious errors. We also have the mean and the standard deviation, the two things you generally need, the first two moments of a distribution.
And so that’s a lot of information.
And it’s in a very concise format.
That’s a wonderful thing.
If we go back to the syntax, I do want to mention this one thing about indicator variables; I said it earlier, and it’s this.
If you have an indicator variable, that is, a binary or dichotomous variable that has only two possible values, and if that variable is coded as zero and one, then you can in fact get the mean of it, and it tells you something: it tells you the proportion of observations that have ones.
And this works best if you use the standard programmer format, where zero equals false or no
and one equals true or yes.
And strangely, in this particular data set, that’s true for most of the variables, but not the last one or two in demo.sav.
And I have no idea why they switch that.
But it’s something that you want to check in the coding before you go ahead and do it.
So if I go back to the output, you can see, for instance, that most of these, wireless service down through owns fax machine, are all 0/1 variables, where zero is no
and one is yes.
The mean right here tells us that 99% of the people own TVs, 96% own VCRs, because this is a long time ago, and that 25% had paging services.
And I like this one: where’s the internet on this list? 27% have internet, because this was apparently generated in, like, you know, 1990-who-knows-what. Anyhow, those are meaningful data points; the mean tells you the proportion of ones, or yeses.
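The claim that the mean of a 0/1 variable equals the proportion of ones is easy to check directly. This is a minimal Python sketch with an invented indicator variable; the variable name and values are hypothetical.

```python
from statistics import mean

# Hypothetical 0/1 indicator: 1 = yes (has a pager), 0 = no.
owns_pager = [0, 1, 0, 0, 1, 0, 0, 0]

proportion_yes = mean(owns_pager)                        # mean of the 0/1 codes
share_by_count = owns_pager.count(1) / len(owns_pager)   # ones divided by cases

print(f"{proportion_yes:.0%} said yes")
```

The two calculations agree because adding up zeros and ones is the same as counting the ones. If the coding were reversed, or used 1 and 2, the mean would no longer read as a proportion of yeses, which is why checking the coding first matters.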
I’ll go back to the syntax here.
And then let’s take a quick look at the Z scores.
Now any reasonable person would think that Z score is a transformation of the data and therefore it would be under the transform menu.
But you know it’s not there.
Instead, it’s hidden as an option in descriptives.
So let’s go back to descriptives, and let’s use age and income, so I’m going to reset this.
I’m going to pick age.
And I’m going to pick household income.
And I’m going to get both of these z scores because a lot of procedures work a lot better if you have z scores.
All you have to do is this.
Click Save standardized values as variables.
And if I hit OK.
What it’s done here is it gives me the descriptives because I actually still ran the descriptives command for those two variables.
But more significantly, let’s take a look at the data set.
When I come to the data set, if I scroll to the end here, I have two variables that were not there previously, Zage and Zincome, and they have lots of decimal places because you need those for z-scores.
Now, I’m refreshed.
Now under normal circumstances, you would want to save this into the data.
I’m not going to do that because this is one of SPSS’s built-in default data sets.
But I do want to show you that we can do one other thing here.
Let’s go back and get descriptives for those z scores.
So I’m going to come to Analyze, Descriptives.
I’m going to reset this and come down to see our two new variables.
I’ll do a little shift-click to select both of them and pop them over here, then I’ll hit OK.
And as you would expect, a z score has a mean of zero, and a standard deviation of one.
And we didn’t have to do it manually, we didn’t have to remember any values, we didn’t have to round things off; it did it exactly for us.
And so that is what the descriptives command does, it makes a very concise tabular output.
And it also allows you to save standardized or z scores for use in certain procedures.
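What the one-step z-score option does can be written out by hand in a few lines. The following Python sketch is an illustration with invented ages; it assumes the sample standard deviation (n - 1 denominator) is the one used for standardizing.

```python
from statistics import mean, stdev

# Hypothetical ages, invented for illustration.
ages = [22, 35, 41, 48, 54, 63]

def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample SD."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - m) / s for v in values]

z_age = z_scores(ages)
print([round(z, 3) for z in z_age])
# The standardized values have a mean of 0 and a standard deviation of 1,
# apart from tiny floating-point error.
print(round(mean(z_age), 10), round(stdev(z_age), 10))
```

Running descriptives on the new values, as in the walkthrough, is the same check as the final line here.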
For a final look in SPSS at basic statistics, we’ll look at the Explore command.
I like to think of this as a way to get a lot closer, to get a little macro view on your subject, and see what’s there in detail.
Now, the Explore command is going to give you a bunch of statistics. It can give you the mean and the confidence interval for the mean,
and the trimmed mean, as well as the variance, the standard deviation, the interquartile range, the minimum and maximum, the range, skewness,
kurtosis, a collection of M-estimators, which are special robust ways of measuring the center of a distribution,
percentiles, which we’ve seen before, and lists of outliers. It can also give you a collection of plots.
It’s the one place in SPSS that you can get a stem and leaf plot.
Now, traditionally, those are things that are drawn by hand.
So it’s kind of cute to see a computer do them.
You can also get box plots, and you can get histograms.
And you can get a set of normality plots, such as a QQ plot or a detrended QQ plot.
And the neat thing after that is you can break all of these analyses down by groups.
So let’s try it in SPSS and see how it works.
Just open up this syntax file.
And we’ll run through the various procedures in Explore and see how it can add to your own analysis.
As always, we’ll begin by opening the demo.sav data set.
Here’s the command for a Mac, here’s the command for Windows.
Now, again, I’m saving this as syntax that makes it repeatable, and it means that you can download it and try running it on your own.
But I created all this by using the menu commands.
Let’s start by doing a default explore analysis for a couple of variables.
I’ll come up to analyze, to descriptives, and then we’ll come over here to explore.
And what we’re going to do is age and income category.
And again, this is kind of interesting, because these are different kinds of variables.
Age is a scale variable.
And income category in this case is an ordinal variable.
I’m just going to leave all the defaults as they are and hit OK.
And here’s what we get from this.
First, we find out whether there were any missing cases; there weren’t in this situation.
And then we get a collection of descriptive statistics, first for age, then for income category: we have the mean with the standard error, the confidence intervals, the 5% trimmed mean, median, variance, standard deviation, minimum, maximum, range, interquartile range, skewness, and kurtosis, along with their standard errors.
So there’s a lot of information there.
And if we scroll down, we find the same kinds of information for income category in thousands.
Now remember, some of this you wouldn’t normally want to use, because income category in this case is not a scaled variable.
And a lot of these things, like minimum, maximum, and trimmed mean, work best with a scale variable.
But SPSS is able to kind of run it on everything.
So interpret with caution.
Then we come down and look: we have a stem-and-leaf plot, where this is age, which in our sample is two-digit numbers.
And so this means 18.
And each of these leaves, each of these numbers over here, is a leaf that represents 10 cases.
Remember, we have 6400 cases, so we have about 640 numbers right here.
And you can see, for instance, that the 30s appear really common, especially the late 30s.
And that we go up to somebody in their late 70s.
And so that’s an easy way to see what’s going on.
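The structure of a stem-and-leaf plot is easy to see when you build a small one by hand. The Python sketch below is an illustration with invented two-digit ages; unlike the SPSS output described above, each leaf here stands for a single case rather than ten.

```python
from collections import defaultdict

# Hypothetical two-digit ages, invented for illustration.
ages = [18, 19, 23, 25, 25, 31, 34, 36, 37, 38, 39, 42, 47, 53, 68, 77]

def stem_and_leaf(values):
    """Group two-digit values by tens digit (stem); list the ones digits (leaves)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(stems.items())}

plot = stem_and_leaf(ages)
for stem, leaves in plot.items():
    print(f"{stem} | {leaves}")
```

Reading a row is just gluing the stem to each leaf: stem 1 with leaves 8 and 9 means the ages 18 and 19, and the longest row shows where the data pile up.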
Simultaneously, we get a boxplot.
And the nice thing about this is that you can tell really quickly whether there are outliers; there are none on age, not in this particular data set.
We do the same thing with income category.
And the stem and leaf plot looks funny, but that’s because there’s only a few possible values one or two or three or four.
And it’s drawing it so it looks a little weird.
But we can come down and get the boxplot as well and see there’s no outliers, at least on this kind of variable.
Again, not normally something you would do with a rank order variable.
But it’s possible here.
Now the neat thing is there are additional statistics.
I’ll do the same two variables.
But I’m going to go check off a lot of options that I have right here.
So let’s go back to that dialog, I’ll go to explore.
What I’m going to do is I’m going to say, just give me the statistics right now.
And I’ll come up here, and I’ll make some selections.
One thing: although 95% confidence intervals are by far the most common, I have seen significant situations where people use 80% confidence intervals, so you can change it if you want.
Then I can get all of the M-estimators.
It’s a whole collection, I can get a list of outliers and a list of percentile values.
I hit Continue.
And I click OK.
And now we have the same table we had before.
That’s the descriptives up there on top, then we have the M-estimators.
And these are four different robust measures of center.
Again, all of them are trying to give us something equivalent to the mean.
And you see in this case, for Huber’s M-estimator, Tukey’s biweight, Hampel’s M-estimator, and Andrews’ wave, the numbers are all pretty similar.
I mean, it goes from a low of 41.18 to a high of 41.524; they’re all really close.
And each of these has specific parameters that go into them, you can’t adjust them in the dialog box.
But let me just return to the syntax for one second.
You see here, these are the parameters for each of the M-estimators; you could change them here if you wanted to.
I’ll go back to the output.
Then we have percentiles: 5, 10, 25, up to 95.
And then it gives us the case numbers for the highest and lowest five cases on each variable.
And so this is a really nice way of seeing a multi dimensional picture of our data.
Now, in terms of pictures, there are even better ways to do this, with more graphs.
So let me go back to the syntax for a second.
And you see that we can get some additional plots, I’m going to use age and income category again.
But I’m going to change what it gives us.
So first off, I’m going to say give me just the plots; we’re not going to get any statistics. I’m coming to the Plots menu, and I say, well, we have a stem-and-leaf by default, so let’s get a histogram.
Let’s also get normality plots.
That’s a way of assessing how closely your data match a normal distribution.
I’ll hit Continue.
Now I have a histogram for age, and the stem-and-leaf plot.
But this one here is new.
It’s a normal Q-Q, or quantile-quantile, plot of age in years.
And if it were normally distributed, all of these circles would fall exactly on this line.
You see, it’s really close, but it does deviate at each end.
And then a detrended one takes that line and sort of flattens it out, and it’s much easier to see where the changes are.
Now I know it looks really big in this case, but this variable is in fact pretty close to a normal distribution.
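The logic behind the normal Q-Q plot and its detrended version can be sketched in a few lines of Python. This is an illustration, not SPSS’s exact procedure: the ages are invented, and the plotting position (rank - 0.5) / n is one common convention assumed here, so SPSS’s points may differ slightly.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical ages, invented for illustration.
ages = [22, 29, 33, 36, 38, 41, 43, 46, 50, 55, 64]

def normal_qq_points(values):
    """Pair each sorted observation with the normal quantile expected at its rank."""
    ordered = sorted(values)
    n = len(ordered)
    normal = NormalDist(mean(ordered), stdev(ordered))
    points = []
    for rank, observed in enumerate(ordered, start=1):
        # Assumed plotting position; other conventions exist.
        expected = normal.inv_cdf((rank - 0.5) / n)
        points.append((observed, expected))
    return points

points = normal_qq_points(ages)
# Detrended version: how far each observation sits from its expected value.
detrended = [(obs, obs - exp) for obs, exp in points]
for (obs, exp), (_, dev) in zip(points, detrended):
    print(f"observed {obs:>3}  expected {exp:6.1f}  deviation {dev:+5.1f}")
```

If the data were exactly normal, observed and expected would match and every deviation would be zero, which is the flat line the detrended plot is compared against.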
Then we have our boxplot.
And then we do the same thing for income, we start with a histogram, our stem and leaf plot.
And our normal QQ plot, again, a little weird, because there’s only four possible values in this data set.
But they all fall pretty well on the line.
And there’s our detrended plot.
And then finally, the boxplot that we saw before.
Now there’s one more thing we can do with the Explore command.
And that is we can take some of these analyses and break them down by groups.
So if we go back to the syntax, we’ll see I’m going to do income and break it down by gender.
Let’s go back to the menu here.
And I’m going to reset this.
And we’re going to take income, and put that into our dependent or outcome variable list or the thing that we’re pretending to predict.
And then we’ll take gender, scroll down a little bit, there’s gender, and put it into the factor list.
Sometimes people call it the independent variable,
if it’s an experimentally manipulated variable, or the predictor variable.
I’m going to come up here and I’m actually going to skip the statistics and get plots only.
I don’t want a stem-and-leaf, but I will get a histogram and get the normality plots.
And now, because I’m breaking it down by groups, I can check the spread versus level with Levene’s test.
The idea here is that the data should be spread out approximately the same amount for each of the groups so we can compare them using some uniform statistics.
I’m going to do what’s called a power estimation here, click Continue.
And then okay.
And now what we get, again, is a list of the number of cases that have complete data, and all of them do, with no missing data. Then we have a test of normality.
And what we see here, based on both of these tests, is that the data for neither group is normal.
That’s okay, because we knew that income was strongly positively skewed.
As for homogeneity of variance, whether the two groups have about the same variance or spread, you know, there is some difference, but it is not statistically significant.
And so it appears to be the same for the men and the women, which is good in this particular data set.
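To make the homogeneity-of-variance check concrete, here is a minimal Python sketch of one common form of Levene’s statistic, the version based on absolute deviations from each group’s mean. It is an illustration only, not what SPSS runs internally, and the two small groups are made-up numbers rather than values from the course data. A statistic near zero means the groups are spread out about equally.

```python
from statistics import mean

def levene_w(*groups):
    """Levene's W statistic using absolute deviations from each group's mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every value from its own group's mean.
    deviations = []
    for g in groups:
        center = mean(g)
        deviations.append([abs(x - center) for x in g])
    group_means = [mean(d) for d in deviations]
    grand_mean = mean(x for d in deviations for x in d)
    # Spread of the group-level mean deviations around the grand mean.
    between = sum(len(d) * (m - grand_mean) ** 2
                  for d, m in zip(deviations, group_means))
    # Spread of the individual deviations around their own group mean.
    within = sum((x - m) ** 2
                 for d, m in zip(deviations, group_means) for x in d)
    return (n_total - k) / (k - 1) * between / within

# Hypothetical values for two groups (not from the course data set).
women = [1, 2, 3]
men = [0, 2, 4]
print(round(levene_w(women, men), 3))
```

In practice the statistic is compared against an F distribution to get the significance value that SPSS reports; that step is left out here.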
And then we can come down and see the histograms first for women.
And you see, it’s got a really strong skewness there.
And the same thing, again, for men: really strongly skewed.
Then we get the normal Q-Q, or quantile-quantile, plots.
And again, if the data were normally distributed, all of these points would fall right on this line, but they’re strongly skewed.
And so we have this really big bend in the data.
The same is true for men.
And here are the detrended plots, where the points should all lie flat on that line too, and you get this swoosh mark instead.
So it just confirms that we’re not dealing with normally distributed data.
The other thing we have is this big collection of outliers in the box plots.
I’m going to do one thing: I’m going to double-click on this.
And then I’m going to come right up to here.
And this will turn off the data labels so we can get rid of the ID numbers.
And you can see that we have a lot of outliers for both the men and the women, and there are no really obvious differences between the two groups.
And the spread-versus-level plot is something that you can use if you have multiple levels; it can help you select a kind of power transformation: a square root, a reciprocal, a square, something like that.
But that’s a more complicated topic and something for another day.
And besides, it appears that we have relatively homogeneous variance in the two groups.
So we’d be good to go ahead and do our other analyses.
So those are some of the options in Explore.
And that’s where we’ll end our discussion of basic statistics.
We can see how they can be used to check how well your data meet the assumptions of the procedures that you use, and then, really, how well you can make inferences from your sample to other groups.
When you’re working in SPSS, and you’re accessing data, one of the most important things you can do is to create labels and definitions for your data.
I like to think of this as the statistical version of Alice in Wonderland, with the caterpillar asking her to explain herself.
You need to explain yourself or, more specifically, when it comes to your data, you need to tell SPSS what your data mean.
Now, that is the data description, and I see two kinds of information that you tell SPSS about your data.
The first one I’m going to call semiotics, which comes from the study of meaning.
This is where you tell SPSS what the variable names are, the data types, the variable labels, the value labels, the missing values, the level of measurement, and the role that each variable plays.
Contrasted with that, there are other elements that I’ll call aesthetics.
And that addresses variable width, decimal places, column width, and alignment.
And these are all settings within the data window of SPSS.
One of the most important though, at least for human consumption is going to be the variable and value labels.
And so I’m going to take a little time and talk about those.
With the variable names, that is, the short names that you have there at the top of the column, there are some important rules.
So the rules for variable names.
Number one, the names must be unique.
No two variables can have the same name, that shouldn’t be too surprising.
It’s an identifier.
Rule number two, the names must start with a letter.
I put an asterisk there because you can start with an at sign (@), a pound sign (#), or a dollar sign ($), but you don’t want to, because those are generally reserved for special functions within SPSS.
Rule number three: names can use letters, upper or lowercase, they can use numbers, and they can use the period, underscore, at sign, pound sign, and dollar sign.
On the other hand, don’t end with a period, because that can cause confusion with the command terminator.
And don’t end with an underscore because that’s used for automatic variable names when SPSS is doing computations.
Rule number four names cannot include spaces.
And rule number five: names can be no longer than 64 bytes.
In most text encoding systems that’s 64 characters, but if you’re using a Unicode system that might be only 32 characters.
And the last rule, rule number six, is that the names cannot be any of these words: ALL, AND, BY, EQ, GE, GT, LE, LT, NE, NOT, OR, TO, or WITH, because those are all reserved keywords within SPSS, so don’t create that confusion.
So those are the short names that go at the top of a variable.
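The six naming rules can be collected into a small checker. This is a hedged Python sketch, not an SPSS tool: the function name `name_problems` is made up for illustration, it only accepts plain ASCII letters, it treats names as case-insensitive when checking uniqueness, and it treats the advice about not starting with @, #, or $ and not ending with a period or underscore as if it were a hard rule.

```python
import re

# Reserved keywords that cannot be used as variable names.
RESERVED = {"ALL", "AND", "BY", "EQ", "GE", "GT", "LE", "LT",
            "NE", "NOT", "OR", "TO", "WITH"}

# First character: a letter. Later characters: letters, digits, . _ @ # $
# (ASCII only here, which is a simplification for this sketch).
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9._@#$]*")

def name_problems(name, existing=()):
    """Return a list of rule violations for a proposed variable name."""
    problems = []
    if name.upper() in {other.upper() for other in existing}:
        problems.append("not unique")
    if not NAME_PATTERN.fullmatch(name):
        problems.append("must start with a letter and use only letters, digits, . _ @ # $")
    if name.endswith((".", "_")):
        problems.append("should not end with a period or underscore")
    if len(name.encode("utf-8")) > 64:
        problems.append("longer than 64 bytes")
    if name.upper() in RESERVED:
        problems.append("reserved keyword")
    return problems

for candidate in ["q01", "first name", "with", "score_"]:
    print(candidate, name_problems(candidate, existing=["male"]))
```

An empty list means the name passes every check in this sketch; spaces are caught by the character pattern rather than by a separate rule.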
On the other hand, the label that you associate with that you can give it a more descriptive name.
Those are the variable labels.
And so there are a few rules for those.
Rule number one, they can be no longer than 256 bytes.
That actually means it could be really long, although you don’t usually want to do that, because some procedures will display as few as 40 bytes, or 40 characters, and you really want to be able to read what it is.
So you want to keep it short.
But you can go longer if you need to.
Rule number two, the labels must be enclosed in quotes, although I’ll tell you they need to be straight quotes, the vertical ones, and not the curly quotes, or SPSS chokes on those.
Rule number three, labels can include any character, including spaces, which is something that you can’t have in the variable name, but you can put it here.
So that allows you to put labels that sort of float on top of the variable names.
And those can show up in the variable lists, they can show up in the charts in the output that you create.
Another really important one is value labels.
So you may have a variable called gender and you may put zeros and ones.
But do you remember what those zeros and ones are?
And so I’m going to show you some ways of dealing with that.
The most important thing is to put value labels on there.
So here are the rules for value labels.
Rule number one, they must be less than 121 bytes.
So that actually is really long, you generally want to keep your labels pretty short.
Rule number two, like the variable labels, the value labels must be enclosed in quotes, and they need to be the straight quotes and not curly quotes.
Rule number three labels can include any character including spaces, that’s good.
This is an interesting one.
And rule number four, the value labels do not need to be unique, that is more than one value can have the same label.
So you might have the numbers one through nine.
And it could be that 7, 8, and 9 all say the same thing.
But underneath they have different codes, and there are situations where you might want to do that.
But mostly, I want to show you how this works in SPSS.
So just open up this syntax file.
And this one’s going to be a little different, because we’re actually not going to use a data file. I’ll refer to one, but I mostly just want to show you the syntax.
This syntax file shows how to write variable labels and value labels.
Now, you don’t necessarily have to put them all broken down in lines, I do it because it makes them a lot more readable, it’s a lot easier to see what’s going on.
The first thing is the command, VARIABLE LABELS.
Because it’s an SPSS command, it’s written in all capitals.
And then what you do is you write the short name of the variable.
And then you have at least one space, and then you have straight quotes around the long label.
So here, for instance, I’ve got var01, which would be the first variable.
And then this is its label written out.
And you don’t need to have anything after it; you don’t need any commas or question marks or semicolons or anything.
You just go to the next one.
Now I put it into another line, because that makes it easy to follow.
And I run them all through here.
I’m going to make one important recommendation.
If you have a dichotomous variable, or binary one, that has only two possible values, and gender might fit into that category, let me recommend this: that you code it as zeros and ones.
A lot of people use ones and twos, but that gets confusing. Code it as zeros and ones, and name the variable after whatever the one is.
Now, when it comes to male and female, I generally give one to whichever group I think is going to have the higher score on my main outcome variable, so it’ll switch around.
But if, for some reason, I think that men are going to have a higher score on an outcome variable, then I will call it male, and the label will be “R is male,” where R stands for respondent.
On the other hand, if I think women are going to have a higher score, then I will call the variable female and the label will be “R is female.”
I would obviously only use one of those two.
Now here are some other examples.
I tend to give generic names such as var, or really just q for question, like q01 and q02, and I use the leading zeros so they sort properly in the dialog boxes.
And when you’re done listing all of your variable names and the variable labels in quotes, just end with a period.
It doesn’t have to have a space before it; that’s left over from earlier versions of SPSS.
It’s a habit I have.
So you can run this at any time and it will assign these labels to the variables and then they’ll show up in the data file which is nice.
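Since the syntax file itself is only shown on screen, here is a hedged Python sketch that builds a VARIABLE LABELS command with the layout just described: the command name, then one variable name and one straight-quoted label per line, then a closing period. The helper name `variable_labels_syntax` and the label texts are made up for illustration, and the 256-byte check reflects my reading of the length limit.

```python
def variable_labels_syntax(labels):
    """Build a VARIABLE LABELS command from a {name: label} mapping."""
    lines = ["VARIABLE LABELS"]
    for name, label in labels.items():
        # Straight or curly double quotes inside a label would break the quoting.
        if any(mark in label for mark in ('"', "\u201c", "\u201d")):
            raise ValueError(f"label for {name} contains a quotation mark")
        if len(label.encode("utf-8")) > 256:
            raise ValueError(f"label for {name} is longer than 256 bytes")
        lines.append(f'  {name} "{label}"')
    lines.append("  .")  # the period ends the command
    return "\n".join(lines)

# Hypothetical variables and labels.
print(variable_labels_syntax({
    "var01": "First survey question",
    "male": "R is male",
    "q01": "Question 1",
}))
```

The printed text can be pasted into a syntax window. Putting each variable on its own line is only for readability, as noted above.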
Next are the value labels.
And what you have here first is the command, VALUE LABELS, which is written in all caps.
And then you give a list of variables to which the values apply.
And you can list them out separately: var01, var02.
Here, I’ve got a var3 without a leading zero.
And then if they’re all next to each other, if they are adjacent, you can actually specify ranges: var3, TO in capitals, var10, so that’ll do 3, 4, 5, 6, 7, 8, 9, 10.
And then you just go to the next line, and you give the first value, that is zero, and then I give zero equals No, and one equals Yes.
When you’re done giving the values, you need to put a slash, so it knows you’re done with the values for that variable, and then you can go on to the next variable.
I said, for instance, if I gave one on a gender variable to men, I would call it male.
And so zero, which would mean no, they’re not male, would be female, and one, which means yes, they are, or true, would be male, and then I do a slash.
On the other hand, if you coded it the other way, then you just call it female, and zero, which means no or false, means they’re not female, they’re male, and one means they are female.
Obviously, use just one of these. I do the slash.
And then I could have a rating variable, say; a lot of people call these Likert scales, but it’s just a rating scale.
And I could do rate01 TO rate10.
And I can specify every value.
So this is a five point scale from strongly disagree to strongly agree, finished with a slash, or maybe you have a different kind of scale.
Here at the end, I have scale01 through scale02.
That’s an 11-point scale, but I only mark the two ends, the zero and the 10.
So zero is never or not at all, and 10 is always or completely.
And then, to let SPSS know that I’m done specifying value labels, I end with a period.
So this is actually a single sentence.
And it’s a way of telling it, how you want the numbers to appear, both in the data window, and in any output that you get.
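As a rough illustration of that single sentence, the following Python sketch assembles a VALUE LABELS command: each block is a variable list followed by value and label pairs, blocks are separated by a slash, and a period closes the whole command. The helper name `value_labels_syntax` is hypothetical, and the exact layout of the course’s own syntax file may differ.

```python
def value_labels_syntax(blocks):
    """Build a VALUE LABELS command from (variable_list, {value: label}) pairs."""
    parts = []
    for variables, labels in blocks:
        rows = [f"  {variables}"]
        rows += [f'    {value} "{label}"' for value, label in labels.items()]
        parts.append("\n".join(rows))
    # A slash separates one variable list (and its labels) from the next;
    # the final period ends the command.
    return "VALUE LABELS\n" + "\n  /\n".join(parts) + "\n  ."

print(value_labels_syntax([
    ("var01 var02", {0: "No", 1: "Yes"}),
    ("male", {0: "Female", 1: "Male"}),
    ("scale01 TO scale02", {0: "Never", 10: "Always"}),
]))
```

Because several values can share one label, a mapping such as `{7: "Other", 8: "Other", 9: "Other"}` would also be acceptable input here.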
Finally, I’ll mention something about missing values, because it can also be easier to specify these in syntax. The command is MISSING VALUES.
And you just give the names of the variables, and you can use TO in the same way.
And then in parentheses, you put the number that is assigned to missing values.
99 is common.
So I’ve got that there.
And then you can do a slash if you’re going to use different codes after that. I could do male through female.
And here I say 2 THRU HI.
And really what that means is anything other than a zero or a one is missing.
So if I accidentally type in a seven, you know it’s missing.
And then here I specify several different values; I can put 7, 8, 9.
So if any of those show up, those would be considered missing values. Do what you want.
The nice thing is it will exclude them automatically from analyses.
But it will include them in frequencies when you’re getting that output. Finish with a period.
And then you just run these like you do any other command.
And it’s going to do a lot to clarify your data and make it easier to follow your analyses and reconstitute your work in the future.
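To show what declaring missing values buys you, here is a small Python sketch of the same idea: a single code such as 99, a list of codes such as 7, 8, 9, or an open-ended range standing in for “2 through high” is flagged as missing and left out of a calculation. The function `is_missing` and the data are invented for illustration; SPSS handles this internally once the codes are declared.

```python
from statistics import mean

def is_missing(value, discrete=(), low=None):
    """True if value is a declared missing code or at/above an open-ended bound."""
    if value in discrete:
        return True
    return low is not None and value >= low

# Hypothetical data: var01 uses 99 as its missing code,
# and male treats anything from 2 upward as missing.
var01 = [3, 5, 99, 4]
male = [0, 1, 7, 1]

valid_var01 = [v for v in var01 if not is_missing(v, discrete=(99,))]
valid_male = [v for v in male if not is_missing(v, low=2)]

print(mean(valid_var01))  # average over the three valid answers only
print(mean(valid_male))   # share of valid cases coded 1
```

Without the exclusion, the 99 would be averaged in as if it were a real answer, which is exactly the problem the missing-value declaration prevents.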
When you’re working in SPSS, and you’re trying to access data, you may get the idea of entering data.
Well, let me tell you my thoughts.
If you want to enter data in SPSS, I just see it as an exercise in frustration.
It’s a pain to do it manually.
I’d say maybe if you’re entering 10 or 12 numbers, you know, basically nothing, something that’s often referred to as a toy data set,
maybe you could do that.
Now, it’s also possible to copy and paste data, but I’m gonna say “sort of,” because it doesn’t work really well. I’ll show you that.
It’s much, much easier to just import the data from a CSV file or txt file.
And I’ll show you how to do that in the next section.
But in terms of entering data, let me show you how it works in SPSS. We’ll just open up a blank document.
And we’ll try it.
So here’s a blank data window in SPSS, I can come right here and I can enter a number.
And, you know, unfortunately, if I press Tab, it actually goes down, which is unusual behavior.
And you see it gives it an automatic variable name, VAR00001.
Well, if I want to move sideways, I actually need to use the right arrow key.
So I’ll go this way, 2, 3, and so on.
And then I can hit Return, and it goes down.
I’ll come back to here and I’ll go 4, 5, 6. I’ll hit Tab and it comes back to the beginning.
So it’s not the most intuitive behavior plus you see it gives it these generic names.
That’s because you can’t enter the variable name directly in this window.
Instead, what you have to do is go to Variable View. I’ll get there by just double-clicking on the variable name.
Here we go.
And you can enter the variable name and you can change other things you want to do.
It works, but it’s a pain.
I’m going to come back here to Data View.
Now, I mentioned you can copy and paste data, sort of.
So let me show you how this works.
I’m actually going to go to a Google Sheet that has nothing in it at the moment.
And here, I’m going to enter a few values; I have a few different kinds. I’ll do 56, 43.
And I’ll enter a letter, J, Return.
Okay, so there’s some data: I’ve got two-digit numbers, and I have letters, which will be string variables in SPSS. I’m going to copy those.
And we’ll see how well they paste over in SPSS.
So I’m going to go back there.
I’ll come over here to the side.
And I will paste those in.
And you see that the values came in and showed up with decimal places, and I can get rid of that.
But it’s really weird with the string variable, with the letters, and so you can copy it, sort of.
Notice also that I can’t copy in variable names; I still have to enter those in manually. You can deal with those when you import.
But really, this is a demonstration that SPSS is not a good environment for putting stuff in manually.
Use a spreadsheet: use Google Sheets, use Numbers, use Excel, anything. Enter it there and then import it.
I’ll show you that in the next section.
And you’ll see that it’s a much, much easier process.
The last thing I want to say in SPSS about accessing data is about importing data.
And you know, compared to entering it manually, it just makes me feel like this and I resorted to cheesy clipart to show how happy I am.
Because no doubt about it.
Importing is absolutely the best way to go if you want to get data into SPSS.
Now the nice thing is SPSS can open text files, it can open CSV, or comma-separated values, files, and even .xlsx files, that’s Excel, as long as they’re formatted right.
Now, what do I mean by formatted right? There’s a term from Hadley Wickham in the R developer community, tidy data, and it’s referring to something very specific.
It says that your file should have only one sheet.
So that’s one worksheet even though Excel can take more than that, that each column should be exactly equal to one variable.
And that each row should be equal to one case.
And an important thing is no funny stuff in your Excel sheet.
Because Excel is very flexible.
And when I refer to funny stuff, I’m talking about things like macros, and formulas and graphs and formatting and comments or merged cells, or headers taking up their own rows or duplicating row numbers.
You don’t want any of that, basically, you want to treat it like a CSV file.
And if you do that, then you find you can import it very easily into SPSS.
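A quick way to see whether a file is laid out this way is to check it before importing. The sketch below is a hedged example using Python’s standard `csv` module: it assumes one header row, and it flags blank or duplicate column names, rows with the wrong number of fields, and cells that look like spreadsheet formulas. The function name `tidy_problems` and the sample rows are made up for illustration.

```python
import csv
import io

def tidy_problems(text, delimiter=","):
    """List simple layout problems in delimited text that has one header row."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return ["file is empty"]
    header, body = rows[0], rows[1:]
    problems = []
    if any(not name.strip() for name in header):
        problems.append("blank column name")
    if len(set(header)) != len(header):
        problems.append("duplicate column name")
    for number, row in enumerate(body, start=2):
        if len(row) != len(header):
            # One row per case means every row has one field per variable.
            problems.append(f"row {number} has {len(row)} fields, expected {len(header)}")
        elif any(cell.startswith("=") for cell in row):
            problems.append(f"row {number} contains a formula")
    return problems

# Hypothetical contents: one column per variable, one row per case.
good = "Month,Mozart,Beethoven,Bach\n2004-01,59,61,43\n"
bad = "Month,Mozart,Beethoven,Bach\n2004-01,59,61\n"
print(tidy_problems(good))
print(tidy_problems(bad))
```

This only covers problems that survive in a text export; things like merged cells or charts have to be removed in the spreadsheet itself.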
And in fact, let me show you how this works.
We’re going to try this in SPSS, but I want you to do two things.
First, I want you to download the course files.
And that will include a zipped folder by this name that ends with data sets, and that’s going to have three files inside it.
I’ll show you those in just a second.
And then you can also open up this syntax file that will work with them.
But let’s go to see what’s inside the folder and explain a little bit what’s going to happen here.
The folder that I’ve asked you to download contains three different files.
Now, I have both the folder here, and I have the three files saved separately next to it, but normally they would be inside it.
But for the syntax to work properly, you want them sitting separately on the desktop.
All three of them contain the same data.
It says MBB, which stands for Mozart, Beethoven, and Bach, because this is Google Trends data about the popularity of searches for each of these three composers’ names since 2004.
This first one is in CSV or comma separated value format.
The second one is a plain text file, and it’s tab separated.
And the third one is an XLSX, so it’s an Excel sheet.
And you can see it’s the same numbers, but they appear a little bit differently when I do the quick view here on my Macintosh.
What we’re going to do then is open up the syntax file.
And we’re going to see what we need to do to import each of these.
I’ve saved this syntax, but the fact is, it’s easier to do this stuff through the menus.
Now I give some information here about using the file path.
In each of these syntax commands, I have to specify the file location.
Now, this is the format.
If you’re on a Macintosh like I am, of course, you’ll want to change Bart to be the name of your home directory.
If you’re on a Windows computer, you’re going to need to change it to something a little more like this, or possibly, depending on the version of your operating system, use backslashes instead.
Anyhow, I’m going to show you how to import each of these.
And I’ve got the duplicate information here in the script, in case you want to run it that way.
But it’s actually really easy to do it from the menus.
So here’s what I’m going to do, I’m going to come up to my data window, I’ll just click over to that.
And my data window is empty right now.
I’m going to go to File, Open, and Data. You do that if you’re opening an existing SPSS file, or if you’re importing something in a different format.
Now here, I’m on the desktop, and you can see my folder there.
But you can’t see the three data files I have next to it, because right now it’s only going to display files that are in the .sav format.
That’s the SPSS proprietary data format.
I’m going to click on that, and come way down here.
And we’ll start with the text file the txt version.
I’m going to hit that and now you can see that it’s there.
I’ll select that file, and I’ll click Open.
So now I have the SPSS text import wizard.
And we can scroll through most of this pretty quickly.
It asks if the file matches a predefined format, something that I would have saved somewhere else. It doesn’t.
It asks if the variables are delimited. Yes, they’re delimited, by tabs in this case.
Are the variable names included at the top of the file? You see how they show up here as the first row.
I click Yes.
And now it excludes those because it knows that those need to be in the header of the data file.
Each line represents a case, and I want all of the cases. You could sample from them if you had a very large data set; that would allow you to do exploratory analyses more quickly than you could otherwise.
And then it asks what delimiters appear.
Now, by default, a text file like the one that I have uses tabs, and it knows that. It asks about text qualifiers, and I don’t have text qualifiers in here.
So I just hit continue, don’t have to change anything.
Now I have dates here at the beginning, and they are year dash month.
Now, SPSS can handle dates, however, it doesn’t like the fact that I’m using year and month without the day associated with it.
Consequently, I’m gonna leave it just as a string variable as a text variable.
And it still works properly in any analyses I want to do.
So that’s fine.
I’m just gonna hit continue, I’m not changing anything here.
It asked if I like to save the file format for future use.
That’s the thing I was referring to in the first dialog here.
And if I want to paste the syntax, I could do that.
But I’ve already got it pasted, so I’m just gonna hit Done.
And there it is: it’s opened it up, and it’s formatted properly.
If we go to Variable View, you can see it’s got a string variable, it’s got three numeric variables, it has the proper number of digits, it has the proper number of decimal places, that is, none, and it recognizes them as nominal, which actually is not the case.
So I actually need to come here and change that to a scale variable,
because the data that you get from Google Trends is sort of a zero-to-100 percentage in terms of the relative popularity of search terms. I change that to scale.
And otherwise, I’m good to go.
Now, let’s do the same thing, but with a CSV file.
To do that, I’m just gonna get rid of this data file, I’ll just open up a new one.
There we go.
I’ll come back up to File, Open, Data.
This time, I need to tell it I’m looking for a CSV, but if you remember, that’s actually under text.
So I click here.
Except this time, instead of selecting the .txt file, I’ll select the .csv file.
And what you find is that the procedure is almost identical.
There’s only one super tiny change here.
I hit Continue.
I tell it the variable names are at the top.
It is delimited.
And it knows that each line is a case, so I just hit Continue on all this.
Here’s the one difference.
When I did the text file tab was automatically selected.
Now that I’m doing a CSV, which means comma separated values, comma is automatically selected.
I hit Continue.
It does the same thing with month; we’re going to leave it as a string. I hit Continue, and I can hit Done.
And you see, it looks exactly the same.
I do have the same issue though, that these three numbers which go from zero to 100, are coded as nominal, I need to change them manually to scale.
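The point that the two text formats differ only in their delimiter can be shown outside SPSS as well. This is a hedged Python sketch using the standard `csv` module: the column names and numbers are invented stand-ins for the composer data, the first row supplies the variable names, the month stays a string, and the three scores are read as numbers.

```python
import csv
import io

# Hypothetical rows in the layout described above: a year-month string
# followed by three search-popularity scores from 0 to 100.
csv_text = "Month,Mozart,Beethoven,Bach\n2004-01,59,61,43\n2004-02,62,60,45\n"
tab_text = csv_text.replace(",", "\t")

def read_cases(text, delimiter):
    """Read delimited text whose first row holds the variable names."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    cases = []
    for row in reader:
        case = {"Month": row["Month"]}      # keep year-month as a string
        for name in reader.fieldnames[1:]:
            case[name] = int(row[name])     # numeric, scale-type scores
        cases.append(case)
    return cases

from_csv = read_cases(csv_text, ",")
from_tab = read_cases(tab_text, "\t")
print(from_csv == from_tab)
```

Both calls give the same list of cases, which mirrors what the import wizard does when it preselects comma for one file and tab for the other.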
Now we’ll do the third one, an Excel file.
Now in a lot of programs, you get very stern warnings about importing Excel files.
And there’s good reasons for that.
Because Excel files are very flexible, and people can put a lot of stuff in there, again, comments and changed column widths and merged cells, which make it easy to use Excel just for displaying information.
But if you’re importing it, you don’t want any of that.
Fortunately, I have it set up as tidy data already.
Columns are the same as variables, rows are the same as cases, and there’s nothing else in there.
And so what I can do in this case is come to File, Open, we’ll go to data again.
And this time I come down to this one, it actually has Excel file as a format.
There it is, I’ll hit open.
And you’ll see that the dialog is different in this case.
It says Opening Excel Data Source instead of the Text Import Wizard.
It says to read the variable names from the first row of data, and that’s checked by default. It knows how many rows of data I have.
And it’s got this thing about maximum width, I don’t need to worry about that I just hit OK.
And that was that.
Here’s the data from Excel. It’s the same data, and I still need to change these three measures manually.
You could save this information in syntax if you’re going to be doing it many times over.
But that is sufficient for the need.
And so it turns out that importing information into SPSS is really easy.
And it’s massively more efficient and easier to do than entering it directly.
You do it in a spreadsheet.
And especially if you do it on Google Sheets, if you’re entering stuff manually, you can collaborate on it.
And then you save it as a CSV file, and you pop it in there.
And then you can get straight to your analysis.
And that is the point of all this work anyhow.
And now in SPSS: An Introduction, we get to the part that maybe you were waiting for, and that’s analyzing data.
However, I’m going to give only a very small overview of analyzing data, because we have an entire separate course here for data analysis, and also data visualization in SPSS.
And I recommend that you check those out.
But as a taste of what’s available, we’ll talk about a procedure that’s of interest to a lot of people in applied settings.
And that’s hierarchical clustering.
Now, the idea here is that you’re trying to find clusters, you’re trying to find the clusters in your data.
More specifically, what you’re trying to see is whether similar cases cluster together in some way that you can use to make inferences about them.
The trick, however, is that similarity depends on your criteria.
And there’s a few decisions that you have to make when you’re doing a cluster analysis of any kind.
So for instance, you have to decide whether you’re going to do a hierarchical cluster analysis, which goes from one group to as many groups as you have cases, or whether you’re going to use a set K or set number of clusters.
You also have to decide on the measure of distance that you’re going to use. Euclidean distance, which is sort of like measuring the as-the-crow-flies distance between cases, is very common, as is squared Euclidean distance, which is what SPSS uses.
There’s also the question of whether you want to start with everything together and split it up in a divisive procedure, or start with everything separate and put it together in an agglomerative procedure.
Some programs are divisive by default, but by default SPSS does agglomerative. You basically end up with the same general findings anyhow, so it’s really not a huge difference.
So we’re going to do a cluster analysis, but we’re going to try to keep it simple.
We’re going to use some of the most basic methods for doing this. We’ll use Euclidean distance, or squared Euclidean distance.
In this case, we’ll use hierarchical clustering where we don’t have to choose the number of groups ahead of time.
And we’re going to use an agglomerative procedure where it starts with every case separate and then gradually puts them together.
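The starting move of an agglomerative procedure can be sketched in a few lines of Python: compute the squared Euclidean distance between every pair of cases and join the closest pair first. The three cases and their two measurements below are made up for illustration; they are not rows from the course data.

```python
from itertools import combinations

def squared_euclidean(a, b):
    """Sum of squared differences across all variables."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def closest_pair(cases):
    """Return the two case labels with the smallest squared Euclidean distance."""
    return min(
        combinations(cases, 2),
        key=lambda pair: squared_euclidean(cases[pair[0]], cases[pair[1]]),
    )

# Hypothetical cases measured on two variables.
cars = {"A": (21.0, 6), "B": (21.4, 6), "C": (10.4, 8)}
print(closest_pair(cars))
```

Note that variables on larger scales dominate a raw distance like this one, so in real work you may want to standardize the variables first.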
We’ll try this in SPSS, but I need you to do something first.
There is a folder that you can download from the course files that ends with data, and in it there’s one file. It’s cars.sav, where .sav is the proprietary SPSS data format.
And in addition to that, there is the SPSS syntax file and you’ll want both of those for this demonstration.
If you save the data file to your desktop, it looks like this.
You can just double click on it and it will open up in SPSS, you also have the option of using syntax to do that.
It depends on your operating system.
This is for a Macintosh right here.
And this is for a Windows computer, though you may need to use backslashes instead, depending on your version of Windows.
I’m just gonna go back and double click on this, to open it up in SPSS.
And there’s my data set.
What this data set is, is a slight variation on a data set called mtcars.
That’s in the default datasets package in R.
It contains road test data on a number of cars from 1974, from the magazine Motor Trend.
And what we’re going to do is we’re going to look at this information, we’re going to see whether the cars clustered together in some important way.
I’ll go to the data view here.
And you can see we have the Mazda RX4, the Hornet Sportabout, the Mercedes 450SE, the Lincoln Continental, and so on, cars that were all available in the early 70s.
And we have information about miles per gallon.
We have the cylinders, we have the displacement in cubic inches, horsepower, weight in tons, and the time in seconds for the standing quarter mile.
Whether it’s an automatic or manual transmission, the number of gears in the transmission.
And the number of carburetors, or probably carburetor barrels. Here, I’m going to turn on the labels.
Only one variable changes here.
By the way, one of the things I did is I formatted this for SPSS by adding labels and changing some of the decimals, which makes it a little easier to work with in the program.
But let’s go to the syntax file right now.
Once we have the data open, we want to do a default hierarchical clustering.
Now this is the code to produce it right here.
But I’m going to do it with the drop down menus to show you that it’s really not hard to do.
All we need to do is come up to Analyze.
And then we come down to Classify.
Now I have to admit, off the top of my head, I cannot remember if every version of SPSS has this particular menu. Most will; I hope yours does, so you can follow along.
It’s this one, Hierarchical Cluster.
I’m going to click on that.
And what I’m going to do here is I’m going to take the car name variable, which really just says what the cars are.
And I’m going to use that to label cases, because that’s going to mean something to me.
And I’m going to take all of my other variables (I’ll just Shift-click here) and put them over here.
And at this moment, I’m going to change nothing else. You’ll see it’s going to cluster cases, which is what we want, and it’s going to give us both some statistics and some plots. That’s fine.
I’m going to hit OK.
And we’re going to get a result identical to my first syntax command.
I see it’s done, so I’ll make the output window bigger here.
And here’s what we have.
First off, it tells us how many cases there were and there were 32.
And they all had complete data, which is nice.
Then SPSS gives us something kind of unusual, called an agglomeration schedule.
And it really specifies, at what point in the procedure did two cases get put into the same cluster.
I personally don’t have much use for this, except I know that when there’s a big jump in the coefficients, as there is here from 3 to 26, there’s a very distinct category change, as there is from 662 to 1125, and so on.
Most of the time, though, I would just completely ignore this one.
And this is called an icicle plot.
And it’s just sort of the same information about when various cases got dropped in with everything else.
It’s kind of pretty to look at, I find it kind of meaningless.
And so truthfully, the default output for SPSS’s hierarchical clustering, to me, is not very helpful.
In fact, it’s so unhelpful, I’m just going to delete it all.
And I’m going to do this over again. I’ll come back up to my recent menu items and go to this analysis again, and I’m going to make a couple of changes.
I don’t want the agglomeration schedule, that doesn’t really help me.
And for plots, I’m going to get rid of the icicle plot.
And then I want to get a dendrogram instead.
Dendro comes from the Greek word for tree.
So it’s a graph of the branches.
And this is usually the most important thing you can get out of a hierarchical cluster analysis.
I’ll hit OK.
And now what we have is a chart here that lists all the cases, the cars, on the side, and it shows how they group together.
So we see, for instance, these first four cars, the Mazda RX4 and the RX4 Wagon, and the Mercedes 280s, etc., are very similar to one another; they all go here together. We see some others as we come down here.
So for instance, the Cadillac Fleetwood, the Lincoln Continental, and the Chrysler Imperial, which are all gargantuan American cars with big V8s, all go there together.
And then we see down here at the bottom that this one, the Maserati Bora, is all by itself for a very long time.
This is where cases are individual here on the left, and they gradually get put together.
And you see how they come together in each of these branches.
That’s why it’s called a dendrogram.
And so this is a really nice way of seeing how similar your cases are.
And if you have more pixels to display, you can see the entire graph at once; I’ve got a low resolution right here.
And you can see that maybe it makes sense to split this off into, say, four groups. It looks like we’ve got a distinct group right here, right there, right there, and right there.
And so I can do something else with this.
I’m going to come back to the menu here.
And what I’m going to do is I’m going to save group membership.
Now, I’ve done a hierarchical analysis.
So I didn’t have to specify the number of groups.
But now that I’ve looked at the chart, four seems like a good number.
So I’m going to come here and say, give me the group membership for each case, breaking it down into four clusters.
I’ll hit Continue.
And then I’m going to ask for it to not give me any plots.
I hit OK.
And this time, we’re not going to get any output except to say that it did the work.
Let’s just look at that.
Here it says it processed them.
The place where we’re going to see the difference is in the data file, so I’m going to move over to the data file.
This button, by the way, will get me over to the data.
And now you can see I have a new variable that got added here, for four clusters, and you can see that each of the cars is listed in one of these four clusters.
And what you can do then is you can then take these cluster memberships, and you can compare them on the other variables.
Again, remember, the clustering here is only as valid as the data that we give it.
So it’s only comparing these cars on a small number of variables.
And it’s using that to decide what goes with what.
It’s here, for instance, that you see the Maserati Bora is in a category all by itself.
And this is a neat way of looking at the similarity between items.
You can do it with people if you’re doing market research; you can do it with companies if you’re doing some sort of segmentation.
And it allows you to see which groups have important similarities for your purposes, and which groups you need to treat differently from one another.
That’s the goal of hierarchical clustering analysis.
And you find it’s a very easy thing to do in SPSS.
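To make the agglomeration schedule and the saved cluster membership concrete, here is a minimal sketch in Python rather than SPSS. It assumes single linkage and plain Euclidean distance purely for brevity (the transcript does not say which settings SPSS used), and the five toy points are made up; they are not the cars data.

```python
from itertools import combinations

def distance(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def agglomerate(points):
    """Merge the two closest clusters until one remains.

    Returns the merge schedule as (members_a, members_b, distance) tuples,
    which plays the role of the agglomeration schedule.
    """
    clusters = [[i] for i in range(len(points))]
    schedule = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Single linkage (an assumption): distance between closest members.
            d = min(distance(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        schedule.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return schedule

def cut(n_cases, schedule, k):
    """Replay merges until k clusters remain; return one label per case."""
    labels = list(range(n_cases))
    for members_a, members_b, _ in schedule[:n_cases - k]:
        target = labels[members_a[0]]
        for i in members_b:
            labels[i] = target
    renumber = {}
    return [renumber.setdefault(lab, len(renumber) + 1) for lab in labels]

# Hypothetical toy cases: two tight pairs and one loner.
points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.9), (9.0, 1.0)]
schedule = agglomerate(points)
merge_distances = [d for _, _, d in schedule]
labels = cut(len(points), schedule, k=3)
print(labels)  # [1, 1, 2, 2, 3]
```

The merge distances jump sharply after the second merge, which is the same kind of jump in the coefficients described for the agglomeration schedule, and the labels correspond to the saved cluster membership variable.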
Another important procedure in SPSS when you’re analyzing data is something called a factor analysis.
Now, I like to think of it as looking at your data and trying to find shadows.
In this picture, what you have are shadows, those are the black figures that you see, it takes a moment to figure out that you’re looking down and there actually are people, but kind of sticking straight out.
And so in this photo, we are going from a sort of a three dimensional origin, that’s the person itself to a two dimensional variation with the shadow.
What’s interesting about that is you maintain most of the useful data, you can tell that there are people that they’re walking, you can probably even tell some things about how tall they are, what they’re wearing, and so on.
What you’ve done is you’ve made things more efficient.
Now in the data world, that’s called dimensionality reduction, where each variable is a dimension.
And too many variables can actually be really problematic.
You’re trying to boil things down a little bit.
And you can think about the saying “less is more,” or less equals more.
More specifically, less noise and fewer unhelpful variables in your data set equal more meaning, because that’s what you’re trying to do: you’re trying to extract meaning.
Now, when it comes to factor analysis and related techniques, I have one very important piece of advice and that is to be practical.
At all points, you want to remember what your goal is. So what is the goal? Well, the goal of factor analysis, I’ll tell you what it’s not.
It’s not an exercise in analytical purity.
You’re not there to show that you know how to go through all the steps in the approved format.
Really, you’re working with your data because you’re trying to get some understanding.
So the goal of a procedure like factor analysis is useful insight. Try to follow the rules, and do what you can to make sure you don’t make any obvious mistakes.
But remember, you’re not bound by the mathematics, you’re bound by what the data tells you about the people.
Another way of looking at that is to use factor analysis, or really any other procedure, for its heuristic value.
That is, it suggests possibilities to you as you analyze the data and try to get insight into people.
Now, that’s sort of a philosophical discussion. Let me show you how this actually works in SPSS. You’re going to need to download from the course files a folder that says data here at the end.
And from it, the cars.sav data set; this is the one that we used in hierarchical clustering as well.
And then you want to open up the SPSS syntax file that goes with this particular section.
Now, the easiest way to open the data set is simply to double click on it and you’ll be ready to go.
I do have some syntax you can use if you saved it to your desktop.
I’ve got it open already.
So let’s take a quick look at the data set.
We have a collection of cars listed down the side and attributes like mpg and so on and gears in the transmission and carburetors.
Now I will have to make a very important confession here.
This is a very, very small data set for factor analysis.
It only has nine variables other than the identifier, and it only has 32 cases. Really, you would want to have at least several hundred cases,
and let’s say several dozen variables, before you can do this really reliably.
But this example works.
And it actually is really easy to see what’s happening and how to interpret the results.
The first thing we’re going to do, if you look at the syntax, is we’re going to do a default factor analysis.
And it’s actually a misnomer, because it’s not a factor analysis.
It’s principal components analysis, but it’s in the factor analysis command within SPSS.
So let’s come up here to analyze.
And down to dimension reduction.
Remember, I said that’s what this is called. Pick Factor; it’s our only choice there.
And what we need to do is choose the variables that we’re going to use to see what we can compress and what goes into what. We don’t need the name of the car; that’s just an identifier.
We can take the rest of these however, and we can put them under variables.
Now we’ve got a lot of options here, I’m not going to do any of them, I’m just going to hit OK for right now.
And make the output window bigger.
And here’s what we get from the default analysis: we get a text output of the commands that were generated by the drop-down menus, and we get something called communalities.
Each variable brings with it one unit of standardized variance.
That’s based on how spread out the scores are.
And if you standardize them, then you have a variance and the standard deviation of one for each.
And the extraction tells us how much of that variance is really able to get reconstituted through the process that we’re doing.
An important one right here is the total variance explained, because what this has done is it has created components. Remember, I said this is actually a principal components analysis here, which has profoundly different philosophical underpinnings from factor analysis. The difference has to do with which came first, the factors or the observed variables.
And truthfully, most people treat them as relatively interchangeable.
And if you’re using them for heuristic value, it’s not going to be a big difference.
But what we have here are two components: we have one with 5.472 units of variance, that’s 61% of the original variance of the nine variables, and then another one with 2.341.
I’m getting those numbers from right here.
And you can see it held on to these two, which collectively add up to about 87% of the variance.
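The percentages quoted here follow directly from the idea that each standardized variable contributes one unit of variance. A small Python sketch of that arithmetic, using only the two values read out above (the variable names are just for illustration):

```python
# Nine standardized variables contribute one unit of variance each.
n_variables = 9

# Variance captured by the two retained components, as read out in the transcript.
component_variance = [5.472, 2.341]

# Share of the total variance explained by each component, in percent.
percent_each = [100 * v / n_variables for v in component_variance]
percent_total = sum(percent_each)

for number, pct in enumerate(percent_each, start=1):
    print(f"Component {number}: {pct:.1f}% of the variance")
print(f"Together: {percent_total:.1f}%")
```

This gives roughly 61% and 26%, or about 87% together, matching the figures in the total variance explained table.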
Now the component matrix shows the relationship between the original variables and the two components.
These are like correlation coefficients, you can see that mpg is strongly negatively associated with the first component and really not associated with the second.
But number of carburetors has a pretty strong association with each.
And so that’s a way to start to look at it.
But it’s going to be a lot easier if we do certain modifications to this.
In fact, I’m going to just delete this output right here.
And we’re going to start over. I’m going to make a few changes. Let’s go through each of these options.
First we go to the descriptives.
And I don’t really feel like I need the initial solution.
So I’m going to unselect that and hit Continue.
Next is extraction, which is the actual algorithm that SPSS uses to work through the relationships in the multidimensional space.
You’ll see the choice right here is principal components.
That’s why I said this is really a principal components analysis. You’ve got a lot of options here.
Now, in many situations, maximum likelihood would be a very good answer. I’m going to choose principal axis factoring simply because it’s the classical version of factor analysis.
I don’t need to see the unrotated factor solution, but I do want to see something called a scree plot.
And that is a graph that helps show me how many factors I should keep. I’m going to come down here and change the maximum iterations for convergence, which has to do with the math that’s done; I’m going to change it to 50.
Then I’m going to come to rotation. What you get here is a multidimensional space.
And sometimes it’s a little easier if you rotate the axes; that can increase interpretability.
Now, there are a lot of different methods.
Varimax is a method that maintains orthogonal relationships; that keeps all of your axes perpendicular to each other.
There are situations where that’s really good.
But truthfully, for exploratory purposes, which is what we’re doing, I like to use what’s called an oblique rotation, which allows your factors to be correlated with each other; they don’t have to be totally perpendicular.
I’m going to use Direct Oblimin.
And Promax is another really good choice.
But it usually is for larger data sets, and I’ve got a tiny one here.
Now, here, I can get a rotated solution, I don’t think I really need that.
But I do want to see the loading plot.
And I’m going to increase the maximum number of iterations to 50.
I’ll hit Continue.
We’ll come down to scores.
And you can save the factor loadings as scores.
And there might be situations where you want to do that.
But because I’m using factor analysis for its heuristic value, as a way of suggesting which variables go with others, I’m actually not going to do that.
So I’m going to hit cancel.
Then finally, options.
This is where you get to talk about excluding cases, I have a complete data set, so I don’t need to worry about that.
But for the coefficient display format, I’m going to sort it, and then I’m actually going to have it completely suppress small coefficients.
Now I’ve done this one before.
So I happen to know which value to use: .6, which under normal circumstances is really high.
But given my very small data set, this seems like a reasonable choice.
And it makes the solution very, very clear when we look at it.
So I’m going to hit Continue.
And then there, I’m going to hit OK.
I’ve got my output here.
And the first part is pretty similar, except it doesn’t start with a unit variance for each of these.
That’s because I’m not doing principal components anymore.
I’m doing principal axis factoring.
And so the math behind it’s a little bit different.
But we don’t need to dwell on that one.
In total variance explained, you see that we still have two factors.
And the first one accounts for a lot of the variance, the second one accounts for a fair amount also, and these are very close to what we had with the principal components.
The scree plot is a very simple line plot.
This suggests how many factors we might want to keep.
Now there are several different rules you can use for interpreting this.
One is to keep anything that’s above a value of one, because one is what it would be if a factor explained simply one unit of variance, and that’s what each variable brought with it; you want factors that explain more than that.
And you see we have two that do a lot more than one and these others are sort of straggling down.
The other rule is to look for a bend in the line, and you do see a strong bend right here.
So three is where the bend is, and we’re justified in staying with two. There are other methods that get more involved, checking the slope of this line and finding things that are above that slope.
You can do those another time.
This is a quick demonstration.
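The first rule, keeping anything above a value of one, is easy to sketch in Python. Only the first two values below come from this walkthrough (5.472 and 2.341 from the earlier principal components output); the remaining seven are hypothetical placeholders, chosen only so that the nine values sum to nine units of variance, because the transcript never reads them out.

```python
# First two values are from the transcript; the other seven are HYPOTHETICAL
# placeholders that make the nine eigenvalues sum to nine units of variance.
eigenvalues = [5.472, 2.341, 0.45, 0.27, 0.19, 0.12, 0.08, 0.05, 0.027]

# Rule of thumb from the text: keep factors that explain more than the
# single unit of variance that one standardized variable brings with it.
keep = [number for number, value in enumerate(eigenvalues, start=1) if value > 1.0]

print("Factors to keep:", keep)
```

With these values the rule keeps factors one and two, which agrees with the reading of the scree plot; with real output you would paste in the full column of eigenvalues from the total variance explained table.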
Now what we get next are three matrices, we get a factor matrix, a pattern matrix, and a structure matrix.
They’re all associated with each other.
And I’ve got a little note here in the syntax that explains them.
I’ll come down here.
The factor matrix is the association of each variable with each factor, and it’s analogous to r, the correlation coefficient.
That’s the one that we’re going to be focusing on.
The structure matrix tells us how much each variable is predicted by the factors, because the idea here is that the factors come first and the variables come second, using what are called the unique and common contributions.
So a factor might contribute something on its own compared to the other factors, or it might contribute together with them.
And then the pattern matrix is an indication of each factor’s unique contribution to a variable’s variance.
Those can both be important in different situations, and they can help you interpret things.
But for right now, I’m just going to focus on this first one.
There we go, the factor matrix.
So let me go back to where I was.
And when we come up to the factor matrix, what you see here is that because I suppressed values with an absolute value of less than .6, we have this totally clear separation.
Factor one is strongly associated with the number of cylinders in the car.
So more cylinders means higher on factor one, and then displacement is very high, and mpg is negative but very strong.
And then we have weight in tons, very strong, and horsepower.
This is really the big factor: cars that are really big are going to score heavily on factor one.
Factor two is composed of the number of gears, so more gears, and the quarter mile time, where the less time it takes to get through the quarter mile, that is, the faster the car is, the higher it is on the factor.
That’s because it’s negative here. Then there’s automatic or manual; you have to know that zero is automatic and one is manual.
So these are manual transmission cars, and those with more carburetors.
This is really the fast factor.
And that’s where sports cars are going to go. This one has the Cadillacs and the Lincolns, and this one has the Ferraris and the Lotuses and so on.
And that makes perfect sense, it’s really easy to see why that would be the way it is.
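The effect of sorting the coefficients and suppressing the small ones can be imitated in a few lines of Python. The loadings below are hypothetical numbers, invented only to mirror the signs and groupings described above; the variable names are shorthand and the ordering rule (largest absolute loading first) is a simplification, not a copy of SPSS’s own sorting.

```python
# HYPOTHETICAL loadings (factor 1, factor 2); signs follow the description
# in the text, but the values themselves are invented for illustration.
loadings = {
    "disp": (0.95, -0.10),
    "cyl": (0.92, -0.20),
    "mpg": (-0.90, 0.10),
    "wt": (0.88, -0.15),
    "hp": (0.80, 0.45),
    "gear": (-0.30, 0.85),
    "qsec": (-0.35, -0.80),
    "am": (-0.40, 0.75),
    "carb": (0.45, 0.70),
}

def suppress_small(loadings, threshold=0.6):
    """Sort by largest absolute loading and blank out values below threshold."""
    rows = sorted(loadings.items(),
                  key=lambda item: max(abs(v) for v in item[1]),
                  reverse=True)
    return [(name, [v if abs(v) >= threshold else None for v in values])
            for name, values in rows]

shown = suppress_small(loadings)
for name, cells in shown:
    line = "".join(f"{v:8.2f}" if v is not None else " " * 8 for v in cells)
    print(f"{name:<5}{line}")
```

With a cutoff of .6, every variable shows up under exactly one factor, which is what makes the separation between the “big” factor and the “fast” factor so easy to read.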
And then if you come down here, this plot is also really helpful, it’s got the two factors, we have factor one across the bottom, that’s our big factor.
And you can see that weight goes on that one, displacement goes on that one, cylinders too, and then we have number of gears and miles per gallon, obviously, on the low end.
Factor two is the fast factor.
More carburetors, more horsepower, more cylinders, and lower quarter mile times.
And that makes a lot of sense.
And so this lets us know that we could boil down our data to really just these two factors, sort of how big is the car, and how fast is it.
And that can give us a much more concise image of our data, and allows us to extract more meaning and that is the overall purpose of a procedure like factor analysis or principal components in SPSS.
For a final look at SPSS and analyzing data, at least in this brief overview course, let’s take a look at one of the most useful procedures around: regression.
Now, you might think of regression as sort of the statistical version of The Three Musketeers where it’s all for one.
I say that because all for one is actually all variables for predicting one outcome.
Put another way, regression uses many different variables many predictor variables to predict scores on one outcome variable.
This makes it really useful in a huge range of circumstances, especially because there’s something for everyone with regression.
There are many different versions of it and many adaptations of regression that make it truly flexible and powerful when analyzing data, and make it a go-to tool for almost any analytical purpose you might have.
We’ll try a simple version of this in SPSS.
First, make sure you’ve downloaded this data folder from the course files.
We’ll use the cars.sav data set that we’ve used in our two previous examples, along with this syntax file.
And when you get to this syntax file, it begins as usual
with the code for loading the data set from the desktop. Truthfully, it’s easier to just double-click on the file cars.sav and have it open up directly in SPSS.
That’s what I’ve done here.
And you can see it’s the same data set with about 32 rows of data, a bunch of cars from 1974, and several variables.
What we’re going to try to predict in this one is miles per gallon, based on things like the number of cylinders, the displacement, horsepower, weight, quarter-mile time, transmission kind, gears, and carburetors.
Alright, so that should be pretty easy.
What we’re going to do is go to analyze, and come down to regression.
And we’ll use this second option here linear.
That’s just basic linear regression.
Now we need to put under dependent the outcome variable, the thing we’re trying to predict. This kind of bugs me, because independent and dependent really should be reserved for manipulated experiments.
But we still know what they mean: our outcome variable, the thing that we’re trying to predict, goes here under dependent.
So that’s mpg.
Now we can take everything else except car name, that’s just a label, we’ll take all the rest of these and we’ll put them under our independent or the variables that we are using to predict the outcome.
Now, I want to do the totally default, no extra steps version first.
So I’ve put the variables in their respective place.
And I’ll just hit OK.
And now we get our output.
And it tells us first, the code that was used to produce this analysis, that it used all of these variables simultaneously to predict a single outcome, which is listed down here.
And they were entered at once.
The model summary tells us that we have a multiple correlation of these predictor variables with our outcome variable of .931, which is really high.
If you square that to get the proportion of variance explained it’s 86.7%.
Even the adjusted R squared, because we have a small sample is still 82%.
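The model summary numbers can be reproduced by hand. The sketch below, in Python, squares the multiple correlation and then applies the standard adjusted R squared formula; it assumes 32 cases and the eight predictors listed earlier (cylinders, displacement, horsepower, weight, quarter-mile time, transmission, gears, carburetors).

```python
def adjusted_r_squared(r_squared, n_cases, n_predictors):
    """Standard adjustment of R squared for sample size and model size."""
    return 1 - (1 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)

multiple_r = 0.931              # multiple correlation from the model summary
r_squared = multiple_r ** 2     # proportion of variance explained

# Assumption: 32 cars and the eight predictors listed in the walkthrough.
adjusted = adjusted_r_squared(r_squared, n_cases=32, n_predictors=8)

print(f"R squared: {r_squared:.3f}")
print(f"Adjusted R squared: {adjusted:.3f}")
```

The adjustment matters here because the sample is small relative to the number of predictors; with several hundred cases the two values would be nearly identical.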
It’s huge.
We get a significance test right here, and we are not surprised to see that the significance is .000.
It’s not zero all the way through, but it’s highly significant.
And then we get the individual regression coefficients.
So what we’re looking for here are significance levels that are under .05, and interestingly, only one of them in this collection is under .05, and that’s weight in tons.
None of the others are even close.
That doesn’t mean that none of the others matter.
It simply means that when you take all of the variables together at the same time, when they are taken as a whole, really only one of them deviates significantly from zero to become a predictor.
That’s weight.
Now, there are a lot of other ways of doing regression.
And SPSS gives you a lot of choices.
I’m going to come up here, back to analyze, down to regression.
Now I will mention, there’s a really interesting one here called automatic linear modeling.
This is an SPSS function that came in a few versions ago. It does a lot of automatic data prep, and it does a lot of combining and splitting up of the variables.
On the other hand, it’s really kind of difficult to explain how it all works
and then to interpret it properly.
I’m going to save that for another course where I specifically talk about analyzing data.
For now I’m going to go back to linear.
And we’re going to make a few choices; we’re going to take some of the options that SPSS makes available.
Now the first one I’m going to do, at the risk of doing something very controversial, is I’m actually going to go from simultaneous entry to stepwise regression.
This is controversial because some people in the literature have called it positively diabolical
in its risk of a Type I, or false positive, error.
And there’s some good evidence for that.
On the other hand, in modern machine learning, stepwise procedures have been put to very fruitful use.
And so it’s not totally unacceptable to try, especially when we’re doing sort of an exploratory project like this. You certainly wouldn’t want to use it for rigorous model building, but it’s a nice way to get some insight into the data pretty quickly.
I’ll come to statistics, and I’m going to add a few things. I’m going to get confidence intervals for the coefficients; those are nice to have. We have the overall model fit.
And I’m going to get the R squared change, because a stepwise model goes through several different steps, adding variables.
And we want to see if each variable adds something that is statistically significant to the overall model.
We could get a lot more information here, but I’ll leave it there for now.
Under plots, we can get a ton of different plots, but I’m actually just going to come down here and choose the standardized residual plots: a histogram and a normal probability plot.
Now there are other options as well.
I could save about 15 different kinds of scores to the data set.
I can save unstandardized predicted values, I can save studentized deleted residuals, and so on and so forth.
There are lots of things I could do here, and there are situations in which I might want to do those.
But for right now, I’m going to skip them because I’m simply trying to build a model without necessarily saving all of the steps in between. Options really just talks about the criteria used in the stepwise procedure. I’m going to leave it at the default right now, but you could change it if you wanted to.
And then style is a new thing that has to do with the formatting of the table.
I’m going to leave that one alone for right now, because we’re going to have exactly what we need.
Now I’ve created this already, and I’ve saved it in the syntax, I’m just going to hit OK.
And you’ll see that we get a different kind of output right now.
I’ll zoom in on this.
Now what we have is some code that’s a little bit longer, because this has to go through the variables one at a time,
find the predictor variable that is most strongly associated with the outcome, put it in the model, get partial correlations, and go through step after step.
What we find here is that although we had eight predictors originally, only two of them were statistically significant when put into the model.
They were weight, and number of cylinders.
Again, what we’re trying to predict is gas mileage mpg.
If you come down here, you can see that they were both statistically significant, and the adjusted R squared for just weight is 74.5%.
And when you add on number of cylinders, it goes up.
Not a huge amount, but it goes up almost 8%.
The analysis of variance table lets us know that both of these models with just one variable and with two predictor variables, they’re both statistically significant.
Here are the individual coefficients along with their confidence intervals over here on the right side.
Now, because we’ve gone through a stepwise procedure, it’s not surprising that all of these are statistically significant, because that was the criterion used for including them.
Here we have a list of excluded variables along with their collinearity statistics.
And this has to do with how much each of these variables is correlated with the others.
So for instance, number of carburetors is highly collinear, or easily predicted by the other variables that we could have included in the model.
Now we come down to the residuals, I’m going to look specifically at the chart.
In an ideal world, your residuals are normally distributed, which means they’re just as likely to be high as they are low, and they’re symmetrical.
And we see here that they’re not horribly, pathologically far from normal.
So this is probably a good model for this data set.
And here’s a normal P-P, or probability-probability, plot of this same data.
And if it were perfectly normal, all the dots would be on the diagonal line; they’re close.
These are the 32 individual observations and how far off they are. They’re close enough.
And so this lets us know that our model is predicting really well and it appears to be not biased in one direction or another.
So this is one method of developing a model.
Again, this stepwise procedure is best for exploratory analyses, it’s not something you would use for confirming a finding.
But as a quick way of sifting through a large collection of potential variables, this is a nice way to do it.
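The step-by-step logic can be sketched outside SPSS. The Python below is a simplified forward selection, not SPSS’s stepwise procedure: it adds whichever predictor raises R squared the most and stops when the gain falls under a fixed threshold (`min_gain`, a hypothetical stand-in for the significance-based criterion the walkthrough relies on). The data are made up: `y` depends on `x1` and `x2`, and `x3` is noise.

```python
def solve(a, b):
    """Solve the linear system a·x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def r_squared(columns, y):
    """R squared of a least squares fit of y on the columns plus an intercept."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * yi for row, yi in zip(design, y)) for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def forward_select(predictors, y, min_gain=0.02):
    """Add the best remaining predictor until the R squared gain is too small."""
    chosen, best_r2 = [], 0.0
    remaining = dict(predictors)
    while remaining:
        trial = {name: r_squared([predictors[c] for c in chosen] + [col], y)
                 for name, col in remaining.items()}
        name = max(trial, key=trial.get)
        if trial[name] - best_r2 < min_gain:
            break
        chosen.append(name)
        best_r2 = trial[name]
        del remaining[name]
    return chosen, best_r2

# Hypothetical toy data: y is built from x1 and x2 plus a little noise.
predictors = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [2, 1, 4, 3, 1, 2, 4, 3],
    "x3": [5, 3, 6, 2, 7, 1, 4, 8],
}
noise = [0.1, -0.2, 0.05, 0.1, -0.1, 0.15, -0.05, -0.05]
y = [3 * a + 2 * b + e for a, b, e in zip(predictors["x1"], predictors["x2"], noise)]

chosen, final_r2 = forward_select(predictors, y)
print(chosen, round(final_r2, 4))
```

On this toy data the procedure picks `x1`, then `x2`, and leaves `x3` out, which mirrors how the walkthrough ends up with two predictors and a list of excluded variables.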
It lets us know, for instance, that in this particular data set mpg is predicted primarily by weight, which completely makes sense for a car, and by number of cylinders, which is associated with having a large and thirsty engine.
So the general idea of multiple regression, again, is to use many variables to predict a single outcome. SPSS gives a lot of options for that; we’ve looked at the default and at one variation on it, but there’s a lot more that you can explore and that we will cover in another course on statistical analysis in SPSS.
But for now, I encourage you to take some time and look at some of these options and see the kind of insight that they can give you on your own data, and see what options you can use to get useful insight into your own analyses.
I want to thank you for joining me in SPSS: An Introduction.
And we’ll conclude by giving you some next steps, things that you can do next, because, you know, once you get through this, it can be a little confusing and feel like things are going everywhere.
And it may not be totally clear where you should go.
Well, here at datalab.cc, we’ve got a few opportunities for you.
First, of course, is more SPSS, we have additional courses on data preparation, on data visualization, on statistical analysis, and other topics that you can use to expand what you’ve learned in this introductory course, and work on your own data.
Now, if you’ve liked what you’ve learned with SPSS, you may want to try branching out into some other languages.
The statistical programming language R and the general purpose programming language Python are very common, powerful tools in the data science community and analytics in general.
They’re a great way to expand both the things that you can do with your analyses and your employment opportunities, and so I strongly encourage you to take a look at the courses on R and Python at datalab.
Next, we have specific courses on data visualization, one of the most important things you can do in getting to understand your data.
SPSS can work well in those as well as other programs.
And then I’m going to mention one final thing here.
SPSS is a wonderful program, but it still has a fair amount of bugs and it can also be very expensive.
Fortunately, some really interesting work recently in the open source community has developed a program called JASP.
It’s actually pronounced “jasp,” and it’s sort of an open source version of SPSS. It runs very differently, I find it very easy to use, and it makes your work reproducible.
It makes it easy to share.
It’s got some tremendous advantages, and we have courses on JASP here at datalab. I suggest you check those out and see how well that program is able to fulfill some of your computing needs.
That being said, there are some things missing.
What’s missing exactly? Well, SPSS doesn’t have a really strong and active user and developer community the same way that languages like R and Python do.
But if you’re creative, you can get around that. Academic conferences, meaning specifically topical academic conferences like biology or management or the social sciences,
often have very dedicated SPSS users and teachers, and may sponsor specific hands-on workshops for learning more about SPSS and how you can use it within your particular domain.
But no matter what you do, I’m going to encourage you to simply get started, go exploring and see what you can do with SPSS in your own data work.
Thanks so much for joining me, and happy computing.