Pandas
helloworld
len(df)
df.head(), df.tail()
df['column'], df.column
s + value
s1 + s2
s.notnull(), s.isnull()
s.sort_values(ascending=False)
df.sort_values(by='col')
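A small sketch of the Series operations above. The data and variable names are made up for illustration; the point is that `s1 + s2` aligns on the index rather than on position.

```python
import pandas as pd

# Hypothetical sample data with partially overlapping indexes.
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20], index=['b', 'c'])

shifted = s1 + 100        # the scalar is applied to every element
total = s1 + s2           # aligned on index; 'a' has no partner, so it is NaN
present = total.notnull() # False where the sum is missing

# A Series sorts by its own values; a DataFrame needs by=<column>.
descending = s1.sort_values(ascending=False)
df = pd.DataFrame({'col': [3, 1, 2]})
by_col = df.sort_values(by='col')
```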
df[ df.col == value ]
df[ (v1 < df.c) & (df.c < v2) ]
df[ ~df['type'].isin(['actor', 'actress']) ] # a plain "not in" doesn't work; negate isin() with ~
df = df[ df['title'].str.startswith('Hamlet') ]
df[ ['col1', 'col2'] ]
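The filtering snippets above, run against a tiny made-up frame. The `cast` data here is hypothetical and only borrows the column names used in these notes.

```python
import pandas as pd

# Hypothetical sample data; column names mirror the ones used in the notes.
cast = pd.DataFrame({
    'title': ['Hamlet', 'Hamlet II', 'Macbeth', 'Hamlet'],
    'year': [1948, 2008, 1971, 1996],
    'type': ['actor', 'actress', 'actor', 'director'],
})

# Equality mask: rows whose title is exactly 'Hamlet'.
exact = cast[cast.title == 'Hamlet']

# Range mask: each comparison gets its own parentheses,
# because & binds tighter than < and ==.
in_range = cast[(1940 < cast.year) & (cast.year < 2000)]

# "Not in": negate isin() with ~.
not_acting = cast[~cast['type'].isin(['actor', 'actress'])]

# Prefix match through the .str accessor.
hamlets = cast[cast['title'].str.startswith('Hamlet')]

# A list of names inside the brackets selects several columns.
subset = cast[['title', 'year']]
```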
s.value_counts().sort_index() # dropna=True by default
# it's like df.groupby('col').size(), or s.groupby(s).size()
s.str.len()
s.str.contains()
s.str.startswith()
s = s.unique()
df = df.sort_values(by=['year', 'name'])
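A sketch of the counting, string, and sorting calls above on made-up data. The frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample data with one missing name.
df = pd.DataFrame({
    'name': ['Hamlet', 'Macbeth', 'Hamlet', None],
    'year': [1996, 1971, 1948, 2000],
})
s = df['name']

counts = s.value_counts().sort_index()  # the missing value is dropped (dropna=True)
same = s.groupby(s).size()              # same counts via groupby
lengths = s.str.len()                   # NaN where the name is missing
has_mac = s.str.contains('Mac', na=False)  # na=False treats missing as no match
distinct = s.unique()                   # an array, not a Series

# Sorting by several columns is a DataFrame operation.
ordered = df.sort_values(by=['year', 'name'])
```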
df.plot()
s.plot(kind='bar')
titles.year.value_counts().sort_index().plot()
%time df[ df.title == 'Hamlet' ]
c = cast.set_index(['title']).sort_index()
c.loc['Hamlet']
c = cast.set_index(['title', 'year']).sort_index()
c.loc['Hamlet'].loc[1972] # year is an int here, so no quotes
c.loc[('Hamlet', 1972)]
c.reset_index()
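The index lookups above, sketched on a made-up `cast` frame. The rows are hypothetical, and `year` is assumed to be an integer column.

```python
import pandas as pd

# Hypothetical sample data; (title, year) pairs may repeat, as in a cast list.
cast = pd.DataFrame({
    'title': ['Hamlet', 'Hamlet', 'Macbeth', 'Hamlet'],
    'year': [1948, 1996, 1971, 1996],
    'name': ['Laurence Olivier', 'Kenneth Branagh', 'Jon Finch', 'Kate Winslet'],
})

# Move both columns into a sorted two-level index.
c = cast.set_index(['title', 'year']).sort_index()

by_title = c.loc['Hamlet']          # every 'Hamlet' row, now indexed by year
by_both = c.loc[('Hamlet', 1996)]   # a tuple addresses both levels at once
chained = c.loc['Hamlet'].loc[1996] # same rows, one level at a time

# reset_index() turns the index levels back into ordinary columns.
flat = c.reset_index()
```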
df.groupby('col')
df.groupby(['col1', 'col2'])
cols = ['col1', 'col2']; df.sort_values(by=cols)[cols] # to preview what's getting into the groups
.size(), .min(), .max(), .mean(), .agg(['min', 'max'])
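A minimal groupby sketch for the lines above. The frame and the value column `n` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    'col1': ['a', 'a', 'b', 'b', 'b'],
    'col2': ['x', 'y', 'x', 'x', 'y'],
    'n': [1, 2, 3, 4, 5],
})

# Preview the key columns in group order before aggregating.
cols = ['col1', 'col2']
preview = df.sort_values(by=cols)[cols]

sizes = df.groupby('col1').size()                    # rows per group
means = df.groupby(cols)['n'].mean()                 # keyed by (col1, col2)
stats = df.groupby('col1')['n'].agg(['min', 'max'])  # one column per aggregate
```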
???:
df.unstack()
df.stack()
df.fillna()
df.where()??
https://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html
Exercises-3.ipynb: t = titles; t[t.title == 'Hamlet'].groupby( t.year // 10 * 10 ).size().sort_index().plot(kind='bar')
Exercises-4.ipynb: cast.groupby(['year', 'type']).size().unstack('type').fillna(0).plot(kind='area')
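The two exercise one-liners, minus the plotting, on made-up data. This is only a sketch of the decade bucketing and the unstack/fillna reshaping; the rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample data.
cast = pd.DataFrame({
    'year': [1948, 1948, 1964, 1996, 1996, 1996],
    'type': ['actor', 'actress', 'actor', 'actor', 'actress', 'actress'],
})

# Integer division buckets years into decades: 1948 -> 1940.
decade = cast.year // 10 * 10
per_decade = cast.groupby(decade).size().sort_index()

# Count rows per (year, type), then pivot 'type' into columns.
counts = cast.groupby(['year', 'type']).size()
wide = counts.unstack('type').fillna(0)  # missing combinations become 0
```

Each column of `wide` is one series in the area plot, which is why the missing combinations need to be filled with 0 first.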
len(df)         series + value          df[df.c == value]
df.head()       series + series2        df[(df.c >= value) & (df.d < value)]
df.tail()       series.notnull()        df[(df.c < value) | (df.d != value)]
df.COLUMN       series.isnull()         df.sort_values('column')
df['COLUMN']    series.sort_values()    df.sort_values(['column1', 'column2'])

s.str.len()           s.value_counts()
s.str.contains()      s.sort_index()    df[['column1', 'column2']]
s.str.startswith()    s.plot(...)       df.plot(x='a', y='b', kind='bar')

df.set_index('a').sort_index()           df.loc['value']
df.set_index(['a', 'b']).sort_index()    df.loc[('v','u')]

df.groupby('column')                     .size() .mean() .min() .max()
df.groupby(['column1', 'column2'])       .agg(['min', 'max'])

df.unstack()        s.dt.year         df.merge(df2, how='outer', ...)
df.stack()          s.dt.month        df.rename(columns={'a': 'y', 'b': 'z'})
df.fillna(value)    s.dt.day          pd.concat([df1, df2])
s.fillna(value)     s.dt.dayofweek
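A short sketch of the last rows of the cheat sheet: the `.dt` accessor, `merge`, `rename`, and `concat`. The frames and dates are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data.
release = pd.DataFrame({
    'title': ['Hamlet', 'Macbeth'],
    'date': pd.to_datetime(['1996-12-25', '1971-10-13']),
})
titles = pd.DataFrame({'title': ['Hamlet', 'Othello'], 'year': [1996, 1995]})

# .dt exposes datetime parts of a datetime Series.
years = release['date'].dt.year
weekdays = release['date'].dt.dayofweek  # Monday is 0

# An outer merge keeps titles that appear in either frame.
merged = titles.merge(release, how='outer', on='title')

# rename() maps old column names to new ones and returns a new frame.
renamed = merged.rename(columns={'date': 'released'})

# concat() stacks frames on top of each other.
stacked = pd.concat([titles, titles], ignore_index=True)
```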
- q: How to create a series, a data frame? — a: https://pandas.pydata.org/pandas-docs/stable/10min.html#object-creation
- q: How to create a column based on other columns? — a: df.assign( col = df.col1 * df.col2 )
- q: df['new_col'] = s vs df.assign( new_col = s ) — a: Both align s on df's index. The former modifies df in place; the latter returns a copy, so it can be inlined in a method chain.
- q: How to get columns? How to get index? How to get values? — a: https://pandas.pydata.org/pandas-docs/stable/10min.html#viewing-data
- q: How to rename a column in pandas? Inplace? — a: df.rename( columns={'oldName1': 'newName1', 'oldName2': 'newName2'} ), which also accepts inplace=True; or replace all labels with df.set_axis(['a', 'b', 'c', 'd', 'e'], axis='columns'), which returns a new object (its inplace argument was removed in pandas 2.0). Assigning df.columns = ['a', 'b', 'c'] is fine too. See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.set_axis.html#pandas.Series.set_axis
- q: s.str.contains() vs substr in a_str — a:
- q: DataFrame.merge() vs DataFrame.join() — a: Use .merge(). .join() does the same job with other defaults: it joins on the index with how='left', while .merge() joins on the shared columns with how='inner'.
- q: How to sort data frame? By multiple columns? — a: df.sort_values(by='col'), df.sort_values(by=['col1', 'col2'])
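A few of the questions above, tried out on made-up frames. This is a sketch under assumed data, not a full answer to any of them; all frame and column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]})

# assign() returns a new frame and leaves df untouched, so it chains well.
with_product = df.assign(col=df.col1 * df.col2)

# Item assignment changes df itself.
df['new_col'] = pd.Series([7, 8, 9])

# str.contains() tests each element of a Series for a substring.
# The `in` operator works on a single string; on a Series it checks
# the index labels, not the values.
s = pd.Series(['Hamlet', 'Macbeth'])
mask = s.str.contains('Ham')
in_string = 'Ham' in 'Hamlet'
in_series = 'Hamlet' in s

# merge() defaults to an inner join on shared columns;
# join() defaults to a left join on the index.
left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'y': [3, 4]})
merged = left.merge(right)
joined = left.set_index('key').join(right.set_index('key'))
```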
Optimized pandas data access methods: .at, .iat, .loc, .iloc (.ix was removed in pandas 1.0)
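A quick sketch of how the access methods differ, using a made-up frame: `.loc` and `.at` take labels, `.iloc` and `.iat` take integer positions, and `.at`/`.iat` only fetch a single scalar.

```python
import pandas as pd

# Hypothetical sample data with string row labels.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=['x', 'y', 'z'])

by_label = df.loc['y', 'b']      # label-based lookup
by_position = df.iloc[1, 1]      # position-based lookup
scalar_label = df.at['y', 'b']   # single value by label
scalar_position = df.iat[1, 1]   # single value by position

rows_by_label = df.loc['x':'y']  # label slices include both endpoints
rows_by_position = df.iloc[0:2]  # positional slices exclude the stop
```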
Using format strings: https://pandas.pydata.org/pandas-docs/stable/style.html#Finer-Control:-Display-Values