Although there is much better software to use for Data Analysis than Maple, here are some things to know.
In many cases, dealing with data means dealing with variables of different types. Because of this, matrices or arrays are not the best storage types.
A DataFrame is a common type of data structure, found across many languages, that is designed for use with data. A data frame is a rectangular data structure in which each column can have a different type.
Consider the following list of people:
First Name | Last Name | Gender | Age |
---|---|---|---|
Homer | Simpson | Male | 35 |
Marge | Simpson | Female | 34 |
Bart | Simpson | Male | 10 |
Lisa | Simpson | Female | 8 |
Margaret | Simpson | Female | 1 |
The age variable should be a number (or perhaps an integer) and the others should be strings.
We can create a DataFrame in the following manner.
firsts:=<"Homer","Marge","Bart","Lisa","Margaret">
lasts:=<"Simpson","Simpson","Simpson","Simpson","Simpson">
genders:=<"Male","Female","Male","Female","Female">
ages:=<35,34,10,8,1>
and then
DF := DataFrame(<firsts| lasts | genders | ages>);
which produces a structure that looks like:
$$
\left[ \begin{array}{ccccc}
& 1 & 2 &3 & 4 \newline
1 & “Homer” & “Simpson” & “Male” & 35 \newline
2 & Marge & Simpson & Female & 34 \newline
3 & Bart & Simpson & Male & 10 \newline
4 & Lisa & Simpson & Female & 8 \newline
5 & Margaret & Simpson & Female & 1 \newline
\end{array}
\right]$$
where the numbers on the top and left sides are the column and row numbers, respectively. This is nice; however, since the columns represent variables, it would be better to include the variable names in the data frame. In addition, the datatypes can be specified.
DF := DataFrame(<firsts| lasts| genders| ages>, columns = <FirstName, LastName, Gender, Age>, datatypes = [string, string, string, integer])
The result is:
$$
\left[ \begin{array}{ccccc}
& FirstName & LastName & Gender & Age \newline
1 & “Homer” & “Simpson” & “Male” & 35 \newline
2 & Marge & Simpson & Female & 34 \newline
3 & Bart & Simpson & Male & 10 \newline
4 & Lisa & Simpson & Female & 8 \newline
5 & Margaret & Simpson & Female & 1 \newline
\end{array}
\right]$$
If you want to label the rows as well (instead of numbering them), that can be done with the rows option to DataFrame.
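As a minimal sketch of the rows option, the following rebuilds the same data with labeled rows. The name DF2 and the row labels are illustrative assumptions, not part of the original example, and the columns are given here as a list.

```maple
# Same columns as in the text; DF2 and the row labels are illustrative assumptions.
firsts := <"Homer", "Marge", "Bart", "Lisa", "Margaret">:
lasts := <"Simpson", "Simpson", "Simpson", "Simpson", "Simpson">:
genders := <"Male", "Female", "Male", "Female", "Female">:
ages := <35, 34, 10, 8, 1>:
DF2 := DataFrame(<firsts | lasts | genders | ages>,
                 columns = [FirstName, LastName, Gender, Age],
                 rows = [Homer, Marge, Bart, Lisa, Maggie]);
# Look up an entry by row label and column name.
DF2[Lisa, Age];
```

With labeled rows, an entry can be requested by name on both axes instead of by position.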
Accessing part of a data frame is similar to accessing part of an array. We can access the entry in the 2nd row, 4th column by
DF[2,4]
or
DF[2,Age]
We can select the entire row by
DF[2,..]
And an entire column:
DF[..,FirstName]
Both the row and the column selections return a DataSeries.
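A short sketch of working with the returned DataSeries follows. It assumes DF is the data frame with named columns built above; row2 and firstNames are hypothetical names, and the label-based lookup assumes that a row DataSeries carries the column names as its labels.

```maple
# Assumes DF is the data frame with named columns built above.
# row2 and firstNames are hypothetical names.
row2 := DF[2, ..];                 # DataSeries holding the second row
firstNames := DF[.., FirstName];   # DataSeries holding the FirstName column
row2[Age];                         # entry selected by its label
firstNames[3];                     # entry selected by its position
```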
Typically, data come from other files or databases. Here we will import data from a CSV file. If you have data in an Excel file, you can typically export it from Excel to CSV.
Download iris.csv and save it to a directory on your computer. Next, select the working directory by clicking the working directory at the bottom of the Maple screen. The working directory below is “/Users/pstaab/code/github.io/sym-comp/notes”
You will see a directory chooser pop up. Select the directory that contains the file that you downloaded above.
Importing the data is performed by
data:=Import("iris.csv")
which (as of Maple 2016) will create a DataFrame automatically. If you do not end up with a DataFrame, you can build one with
df:=DataFrame(data,columns=<SepalLength,SepalWidth,PetalLength,PetalWidth,Species>)
which also names the columns.
with statement and DataFrames
If we type
with(data);
then we can use the columns of the data frame data without the data[..,name] format. For example,
SepalWidth
will return a DataSeries with just the SepalWidth variable.
Load the Statistics package using with(Statistics): to give easier access to many standard statistics features.
The Mean and StandardDeviation commands will find the mean and standard deviation. For example,
Mean(SepalLength)
returns 5.843 and
StandardDeviation(PetalWidth)
returns 0.7622.
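For reference, here is what these two commands are expected to compute for a column with values $x_1, \dots, x_n$. This assumes StandardDeviation uses the sample convention (dividing by $n-1$), which is an assumption here rather than something stated in these notes:

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2}
$$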
The ScatterPlot command in Maple creates a scatter plot of data from DataFrames. Although this can be done using the pointplot command of the plots package, it’s not simple. Plot SepalLength (vertical axis) versus SepalWidth (horizontal axis) by entering
ScatterPlot(SepalWidth, SepalLength,symbol=solidcircle)
which results in
Since there are three different species of iris here, we can color code the different species with
ScatterPlot(SepalWidth, SepalLength, symbol=solidcircle, colorscheme = ["valuesplit", Species])
and the result is
If we look at the scatter plot above, there are red points in the lower part of the plot. These are the setosa species, and perhaps we are interested in studying them.
We can pull out only these data using:
setosa := data[Species=~ "setosa"];
This matches only the rows of the data where the Species is setosa.
Just to make sure this looks right, plot it:
setosaPlot:=ScatterPlot(setosa[..,2],setosa[..,1],symbol=solidcircle)
where the 2nd and 1st columns are plotted. (I had trouble pulling out the data using the column names.) The result is:
The subsetting can be more complex. If we are looking for all points whose petal measurements lie within a “distance” of 1 from the point with width 1 and length 2, consider:
data[(PetalWidth-~1)*~(PetalWidth-~1)+~(PetalLength-~2)*~(PetalLength-~2)<~1]
where each standard operator needs to be followed by a tilde ~ so that it is applied elementwise.
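Written out, with $w_i$ and $\ell_i$ as hypothetical symbols for the PetalWidth and PetalLength values in row $i$, the intended condition keeps the rows for which

$$
(w_i - 1)^2 + (\ell_i - 2)^2 < 1
$$

that is, rows whose squared distance from the point $(1, 2)$ is less than 1. Since the threshold is 1, this is the same as requiring the distance itself to be less than 1.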
To make the following a bit easier, let’s swap the main data out and put the setosa data in.
unwith(data)
with(setosa)
We can find the best fit line (regression line) using the following:
Fit(b*t+a, convert(SepalWidth, Vector), convert(SepalLength, Vector), t)
where (unfortunately) we first need to convert the DataSeries in the columns to Vectors. The result is
$$-7.68666666666667+5.36285714285714t$$
and Fit is quite nice in that nearly any function can be put in. If we plot the line:
line:=plot((#),t=2..5)
where (#) is the equation label of the result above and then
plots[display](line,setosaPlot)
returns:
and visually this looks good.
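If you would rather not rely on equation labels, one possible alternative is to assign the result of Fit to a name and plot that. This sketch reuses only the commands shown above; the name fitted is a hypothetical choice, and it assumes with(Statistics), with(setosa), and setosaPlot are already in place.

```maple
# Assumes with(Statistics), with(setosa), and setosaPlot from the steps above.
# fitted is a hypothetical name for the expression returned by Fit.
fitted := Fit(b*t + a, convert(SepalWidth, Vector), convert(SepalLength, Vector), t);
line := plot(fitted, t = 2 .. 5);
plots[display](line, setosaPlot);
```

Naming the fitted expression means the worksheet can be re-executed without the plot depending on a particular label number.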