Documente Academic
Documente Profesional
Documente Cultură
Quickstart
If you prefer to learn by diving in and getting your feet wet, then here are some cut-andpasteable examples to play with.
First, lets import stuff and get some data to work with:
In[1]:importnumpyasnp
In[2]:frompatsyimportdmatrices,dmatrix,demo_data
In[3]:data=demo_data("a","b","x1","x2","y","zcolumn")
View Code
demo_data()
In[4]:data
Out[4]:
{'a':['a1','a1','a2','a2','a1','a1','a2','a2'],
'b':['b1','b2','b1','b2','b1','b2','b1','b2'],
'x1':array([1.76405235,0.40015721,0.97873798,2.2408932,1.86755799,
0.97727788,0.95008842,0.15135721]),
'x2':array([0.10321885,0.4105985,0.14404357,1.45427351,0.76103773,
0.12167502,0.44386323,0.33367433]),
'y':array([1.49407907,0.20515826,0.3130677,0.85409574,2.55298982,
0.6536186,0.8644362,0.74216502]),
'zcolumn':array([2.26975462,1.45436567,0.04575852,0.18718385,1.53277921,
1.46935877,0.15494743,0.37816252])}
View Code
Of course Patsy doesnt much care what sort of object you store your data in, so long as
it can be indexed like a Python dictionary, data[varname] . You may prefer to store your
data in a pandas DataFrame, or a numpy record array... whatever makes you happy.
Now, lets generate design matrices suitable for regressing
onto
x1
and
x2
In[5]:dmatrices("y~x1+x2",data)
Out[5]:
(DesignMatrixwithshape(8,1)
y
1.49408
0.20516
0.31307
0.85410
2.55299
0.65362
0.86444
0.74217
Terms:
'y'(column0),
DesignMatrixwithshape(8,3)
Interceptx1x2
11.764050.10322
10.400160.41060
10.978740.14404
12.240891.45427
11.867560.76104
10.977280.12168
10.950090.44386
10.151360.33367
Terms:
'Intercept'(column0)
'x1'(column1)
'x2'(column2))
View Code
The return value is a Python tuple containing two DesignMatrix objects, the first
representing the left-hand side of our formula, and the second representing the righthand side. Notice that an intercept term was automatically added to the right-hand
side. These are just ordinary numpy arrays with some extra metadata and a fancy
__repr__ method attached, so we can pass them directly to a regression function like
np.linalg.lstsq() :
In[6]:outcome,predictors=dmatrices("y~x1+x2",data)
In[7]:betas=np.linalg.lstsq(predictors,outcome)[0].ravel()
In[8]:forname,betainzip(predictors.design_info.column_names,betas):
...:print("%s:%s"%(name,beta))
...:
Intercept:0.579662344123
x1:0.0885991903554
x2:1.76479205551
View Code
Of course the resulting numbers arent very interesting, since this is just random data.
y~
values, use
dmatrix()
and leave
In[9]:dmatrix("x1+x2",data)
Out[9]:
DesignMatrixwithshape(8,3)
Interceptx1x2
11.764050.10322
10.400160.41060
10.978740.14404
12.240891.45427
11.867560.76104
10.977280.12168
10.950090.44386
10.151360.33367
Terms:
'Intercept'(column0)
'x1'(column1)
'x2'(column2)
View Code
Well use dmatrix for the rest of the examples, since seeing the outcome matrix over
and over would get boring. This matrixs metadata is stored in an extra attribute called
.design_info
, which is a
DesignInfo
In[10]:d=dmatrix("x1+x2",data)
In[11]:d.design_info.<TAB>
d.design_info.builderd.design_info.slice
d.design_info.column_name_indexesd.design_info.term_name_slices
d.design_info.column_namesd.design_info.term_names
d.design_info.described.design_info.term_slices
d.design_info.linear_constraintd.design_info.terms
View Code
Usually the intercept is useful, but if we dont want it we can get rid of it:
In[12]:dmatrix("x1+x21",data)
Out[12]:
DesignMatrixwithshape(8,2)
x1x2
1.764050.10322
0.400160.41060
0.978740.14404
2.240891.45427
1.867560.76104
0.977280.12168
0.950090.44386
0.151360.33367
Terms:
'x1'(column0)
'x2'(column1)
View Code
In[13]:dmatrix("x1+np.log(x2+10)",data)
Out[13]:
DesignMatrixwithshape(8,3)
Interceptx1np.log(x2+10)
11.764052.29221
10.400162.34282
10.978742.31689
12.240892.43836
11.867562.37593
10.977282.31468
10.950092.34601
10.151362.33541
Terms:
'Intercept'(column0)
'x1'(column1)
'np.log(x2+10)'(column2)
View Code
Notice that
np.log
np.log
importnumpyasnp
dmatrix()
. For example:
dmatrix()
dmatrix()
was called
In[14]:new_x2=data["x2"]*100
In[15]:dmatrix("new_x2")
Out[15]:
DesignMatrixwithshape(8,2)
Interceptnew_x2
110.32189
141.05985
114.40436
1145.42735
176.10377
112.16750
144.38632
133.36743
Terms:
'Intercept'(column0)
'new_x2'(column1)
View Code
Patsy has some transformation functions built in, that are automatically accessible to
your code:
In[16]:dmatrix("center(x1)+standardize(x2)",data)
Out[16]:
DesignMatrixwithshape(8,3)
Interceptcenter(x1)standardize(x2)
10.879951.21701
10.483950.07791
10.094630.66885
11.356792.23584
10.983450.69899
11.861380.71844
10.065980.00417
11.035460.24845
Terms:
'Intercept'(column0)
'center(x1)'(column1)
'standardize(x2)'(column2)
View Code
See
patsy.builtins
also define your own transformation functions in the ordinary Python way:
In[17]:defdouble(x):
....:return2*x
....:
In[18]:dmatrix("x1+double(x1)",data)
Out[18]:
DesignMatrixwithshape(8,3)
Interceptx1double(x1)
11.764053.52810
10.400160.80031
10.978741.95748
12.240894.48179
11.867563.73512
10.977281.95456
10.950091.90018
10.151360.30271
Terms:
'Intercept'(column0)
'x1'(column1)
'double(x1)'(column2)
View Code
This flexibility does create problems in one case, though because we interpret
whatever you write in-between the + signs as Python code, you do in fact have to
write valid Python code. And this can be tricky if your variable names have funny
characters in them, like whitespace or punctuation. Fortunately, patsy has a builtin
transformation called
Q()
In[19]:weird_data=demo_data("weirdcolumn!","x1")
#Thisisanerror...
In[20]:dmatrix("weirdcolumn!+x1",weird_data)
[...]
PatsyError:errortokenizinginput(maybeanunclosedstring?)
weirdcolumn!+x1
^
#...butthisworks:
In[21]:dmatrix("Q('weirdcolumn!')+x1",weird_data)
Out[21]:
DesignMatrixwithshape(5,3)
InterceptQ('weirdcolumn!')x1
11.764050.97728
10.400160.95009
10.978740.15136
12.240890.10322
11.867560.41060
Terms:
'Intercept'(column0)
"Q('weirdcolumn!')"(column1)
'x1'(column2)
View Code
Q()
In[22]:dmatrix("double(Q('weirdcolumn!'))+x1",weird_data)
Out[22]:
DesignMatrixwithshape(5,3)
Interceptdouble(Q('weirdcolumn!'))x1
13.528100.97728
10.800310.95009
11.957480.15136
14.481790.10322
13.735120.41060
Terms:
'Intercept'(column0)
"double(Q('weirdcolumn!'))"(column1)
'x1'(column2)
View Code
Arithmetic transformations are also possible, but youll need to protect them by
wrapping them in I() , so that Patsy knows that you really do want + to mean
addition:
In[23]:dmatrix("I(x1+x2)",data)#compareto"x1+x2"
Out[23]:
DesignMatrixwithshape(8,2)
InterceptI(x1+x2)
11.66083
10.81076
11.12278
13.69517
12.62860
10.85560
11.39395
10.18232
Terms:
'Intercept'(column0)
'I(x1+x2)'(column1)
View Code
Note that while Patsy goes to considerable efforts to take in data represented using
different Python data types and convert them into a standard representation, all this
work happens after any transformations you perform as part of your formula. So, for
example, if your data is in the form of numpy arrays, + will perform element-wise
addition, but if it is in standard Python lists, it will perform concatentation:
In[24]:dmatrix("I(x1+x2)",{"x1":np.array([1,2,3]),"x2":np.array([4,5,6])})
Out[24]:
DesignMatrixwithshape(3,2)
InterceptI(x1+x2)
15
17
19
Terms:
'Intercept'(column0)
'I(x1+x2)'(column1)
In[25]:dmatrix("I(x1+x2)",{"x1":[1,2,3],"x2":[4,5,6]})
Out[25]:
DesignMatrixwithshape(6,2)
InterceptI(x1+x2)
11
12
13
14
15
16
Terms:
'Intercept'(column0)
'I(x1+x2)'(column1)
View Code
Patsy becomes particularly useful when you have categorical data. If you use a
predictor that has a categorical type (e.g. strings or bools), it will be automatically
coded. Patsy automatically chooses an appropriate way to code categorical data to
avoid producing a redundant, overdetermined model.
If there is just one categorical variable alone, the default is to dummy code it:
In[26]:dmatrix("0+a",data)
Out[26]:
DesignMatrixwithshape(8,2)
a[a1]a[a2]
10
10
01
01
10
10
01
01
Terms:
'a'(columns0:2)
View Code
But if you did that and put the intercept back in, youd get a redundant model. So if the
intercept is present, Patsy uses a reduced-rank contrast code (treatment coding by
default):
In[27]:dmatrix("a",data)
Out[27]:
DesignMatrixwithshape(8,2)
Intercepta[T.a2]
10
10
11
11
10
10
11
11
Terms:
'Intercept'(column0)
'a'(column1)
View Code
The
T.
notation is there to remind you that these columns are treatment coded.
Interactions are also easy they represent the cartesian product of all the factors
involved. Heres a dummy coding of each combination of values taken by a and b :
In[28]:dmatrix("0+a:b",data)
Out[28]:
DesignMatrixwithshape(8,4)
a[a1]:b[b1]a[a2]:b[b1]a[a1]:b[b2]a[a2]:b[b2]
1000
0010
0100
0001
1000
0010
0100
0001
Terms:
'a:b'(columns0:4)
View Code
But interactions also know how to use contrast coding to avoid redundancy. If you have
both main effects and interactions in a model, then Patsy goes from lower-order effects
to higher-order effects, adding in just enough columns to produce a well-defined model.
The result is that each set of columns measures the additional contribution of this effect
just what you want for a traditional ANOVA:
In[29]:dmatrix("a+b+a:b",data)
Out[29]:
DesignMatrixwithshape(8,4)
Intercepta[T.a2]b[T.b2]a[T.a2]:b[T.b2]
1000
1010
1100
1111
1000
1010
1100
1111
Terms:
'Intercept'(column0)
'a'(column1)
'b'(column2)
'a:b'(column3)
View Code
In[30]:dmatrix("a*b",data)
Out[30]:
DesignMatrixwithshape(8,4)
Intercepta[T.a2]b[T.b2]a[T.a2]:b[T.b2]
1000
1010
1100
1111
1000
1010
1100
1111
Terms:
'Intercept'(column0)
'a'(column1)
'b'(column2)
'a:b'(column3)
View Code
Of course you can use other coding schemes too (or even define your own). Heres
orthogonalpolynomialcoding
In[31]:dmatrix("C(c,Poly)",{"c":["c1","c1","c2","c2","c3","c3"]})
Out[31]:
DesignMatrixwithshape(6,3)
InterceptC(c,Poly).LinearC(c,Poly).Quadratic
10.707110.40825
10.707110.40825
10.000000.81650
10.000000.81650
10.707110.40825
10.707110.40825
Terms:
'Intercept'(column0)
'C(c,Poly)'(columns1:3)
View Code
You can even write interactions between categorical and numerical variables. Here we
fit two different slope coefficients for
x1
a1
a2
group:
In[32]:dmatrix("a:x1",data)
Out[32]:
DesignMatrixwithshape(8,3)
Intercepta[a1]:x1a[a2]:x1
11.764050.00000
10.400160.00000
10.000000.97874
10.000002.24089
11.867560.00000
10.977280.00000
10.000000.95009
10.000000.15136
Terms:
'Intercept'(column0)
'a:x1'(columns1:3)
View Code
The same redundancy avoidance code works here, so if youd rather have treatmentcoded slopes (one slope for the
a1
and
a2
a1
#comparetothedifferencebetween"0+a"and"1+a"
In[33]:dmatrix("x1+a:x1",data)
Out[33]:
DesignMatrixwithshape(8,3)
Interceptx1a[T.a2]:x1
11.764050.00000
10.400160.00000
10.978740.97874
12.240892.24089
11.867560.00000
10.977280.00000
10.950090.95009
10.151360.15136
Terms:
'Intercept'(column0)
'x1'(column1)
'a:x1'(column2)
View Code
In[34]:dmatrix("C(a,Poly):center(x1)",data)
Out[34]:
DesignMatrixwithshape(8,3)
InterceptC(a,Poly).Constant:center(x1)C(a,Poly).Linear:center(x1)
10.879950.62222
10.483950.34220
10.094630.06691
11.356790.95939
10.983450.69541
11.861381.31620
10.065980.04666
11.035460.73218
Terms:
'Intercept'(column0)
'C(a,Poly):center(x1)'(columns1:3)
View Code