Feature columns can be viewed as a bridge between the raw data and an Estimator. In the Iris DNNClassifier example, for instance, the feature columns are created with tf.feature_column.numeric_column. Estimators such as a DNN can only compute on numeric values, but real-world features are not always numeric.
Feature columns are created with the functions in the tf.feature_column module.
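As a minimal sketch of that bridge (the four Iris feature names below are assumed from the standard Iris example), the columns are handed to the Estimator through its feature_columns argument:
import tensorflow as tf

# Hypothetical sketch: one numeric column per Iris feature.
feature_columns = [
    tf.feature_column.numeric_column(key=k)
    for k in ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth']]

# The Estimator consumes the columns via `feature_columns`; hidden_units and
# n_classes follow the usual Iris setup.
classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[10, 10],
    n_classes=3)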
Numeric column
In the Iris classifier, this is a call to numeric_column:
# Defaults to a tf.float32 scalar.
numeric_feature_column = tf.feature_column.numeric_column(key="SepalLength")
Setting the dtype:
# Represent a tf.float64 scalar.
numeric_feature_column = tf.feature_column.numeric_column(key="SepalLength",
                                                          dtype=tf.float64)
Changing the shape:
# Represent a 10-element vector in which each cell contains a tf.float32.
vector_feature_column = tf.feature_column.numeric_column(key="Bowling",
                                                         shape=10)
# Represent a 10x5 matrix in which each cell contains a tf.float32.
matrix_feature_column = tf.feature_column.numeric_column(key="MyMatrix",
                                                         shape=[10, 5])
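As a hypothetical sketch (the batch size and random data below are only illustrative), the input_fn must return a 'Bowling' feature whose per-example shape matches shape=10:
import numpy as np
import tensorflow as tf

# Hypothetical input_fn: each example carries a 10-element 'Bowling' vector,
# matching the shape=10 declared on vector_feature_column above.
def input_fn():
    features = {'Bowling': np.random.rand(32, 10).astype(np.float32)}
    labels = np.random.rand(32).astype(np.float32)
    return tf.data.Dataset.from_tensor_slices((features, labels)).batch(8)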
Bucketized column
Sometimes you may not want to feed a raw numeric value to the model directly, but rather split it into several categories. For example, the year a house was built can be bucketized. With boundaries at 1960, 1980 and 2000, a single year becomes a 4-element one-hot vector: a year before 1960 is represented as [1, 0, 0, 0], a year in [1960, 1980) as [0, 1, 0, 0], and so on. The model can then learn four separate weights for this feature instead of just one.
# First, convert the raw input to a numeric column.
numeric_feature_column = tf.feature_column.numeric_column("Year")
# Then, bucketize the numeric column on the years 1960, 1980, and 2000.
bucketized_feature_column = tf.feature_column.bucketized_column(
    source_column=numeric_feature_column,
    boundaries=[1960, 1980, 2000])
Categorical identity column
Categorical identity columns can be seen as a special case of bucketized columns: each bucket holds a single, unique integer. Note that this is a one-hot encoding, not a binary numerical encoding.
# Create categorical output for an integer feature named "my_feature_b".
# The values of my_feature_b must be >= 0 and < num_buckets.
identity_feature_column = tf.feature_column.categorical_column_with_identity(
    key='my_feature_b',
    num_buckets=4)  # Values [0, 4)

# In order for the preceding call to work, the input_fn() must return
# a dictionary containing 'my_feature_b' as a key. Furthermore, the values
# assigned to 'my_feature_b' must belong to the set [0, 4).
def input_fn():
    ...
    return ({'my_feature_a': [7, 9, 5, 2], 'my_feature_b': [3, 1, 2, 2]},
            [Label_values])
Categorical vocabulary column
Strings cannot be fed to a model directly; they must first be mapped to numeric or categorical values. A categorical vocabulary column represents each string as a one-hot vector based on a vocabulary:
vocabulary_feature_column = tf.feature_column.categorical_column_with_vocabulary_list(
    key="a feature returned by input_fn()",
    vocabulary_list=["kitchenware", "electronics", "sports"])
When the vocabulary is large, listing it inline becomes awkward; it can instead be loaded from a file:
vocabulary_feature_column = tf.feature_column.categorical_column_with_vocabulary_file(
    key="a feature returned by input_fn()",
    vocabulary_file="product_class.txt",
    vocabulary_size=3)
where product_class.txt contains one vocabulary element per line:
kitchenware
electronics
sports
Hashed column
When a categorical feature has a very large number of possible values, enumerating them all in a vocabulary is impractical. A hashed column instead hashes each raw value into a fixed number of buckets:
hashed_feature_column = tf.feature_column.categorical_column_with_hash_bucket(
    key="some_feature",
    hash_bucket_size=100)  # The number of hash buckets
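Conceptually (a rough sketch of the idea only, not TensorFlow's actual implementation), each raw value is hashed and reduced modulo the number of buckets:
# Rough illustration of the idea behind hashed columns: every raw value is
# mapped to one of hash_bucket_size bins, so distinct values may collide.
def pseudo_hash_bucket(raw_feature, hash_bucket_size=100):
    return hash(raw_feature) % hash_bucket_size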
Crossed column
A crossed column combines several features into a single feature, letting the model learn a separate weight for each combination of values. For example, land prices in a city depend on location, but latitude and longitude are of little use when each is correlated with price on its own; it is the combination of the two that pins down a location.
def make_dataset(latitude, longitude, labels):
    assert latitude.shape == longitude.shape == labels.shape

    features = {'latitude': latitude.flatten(),
                'longitude': longitude.flatten()}
    labels = labels.flatten()

    return tf.data.Dataset.from_tensor_slices((features, labels))

# Bucketize the latitude and longitude using the `edges`.
latitude_bucket_fc = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('latitude'),
    list(atlanta.latitude.edges))

longitude_bucket_fc = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('longitude'),
    list(atlanta.longitude.edges))

# Cross the bucketized columns, using 5000 hash bins.
crossed_lat_lon_fc = tf.feature_column.crossed_column(
    [latitude_bucket_fc, longitude_bucket_fc], 5000)

fc = [
    latitude_bucket_fc,
    longitude_bucket_fc,
    crossed_lat_lon_fc]

# Build and train the Estimator.
est = tf.estimator.LinearRegressor(fc, ...)
Indicator and embedding columns
Indicator columns and embedding columns do not work on features directly; instead, they take categorical columns as input.
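As a minimal sketch (the key "product_class" and the embedding dimension of 3 are illustrative choices), both kinds of column wrap the same categorical column:
import tensorflow as tf

categorical_column = tf.feature_column.categorical_column_with_vocabulary_list(
    key="product_class",  # illustrative feature name
    vocabulary_list=["kitchenware", "electronics", "sports"])

# Indicator column: represents each category as a one-hot / multi-hot vector.
indicator_column = tf.feature_column.indicator_column(categorical_column)

# Embedding column: maps each category to a dense, lower-dimensional vector
# whose values are learned during training.
embedding_column = tf.feature_column.embedding_column(
    categorical_column=categorical_column,
    dimension=3)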