Fewer features in the test data than in the training data?

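Suppose we are predicting store sales, and my training data has two sets of features:

  • One about the store sales by date (the "Store" field is not unique)
  • One about the store types (the "Store" field is unique here)

So the matrices look like this: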

+-------+-----------+------------+---------+-----------+------+-------+--------------+
| Store | DayOfWeek |    Date    |  Sales  | Customers | Open | Promo | StateHoliday |
+-------+-----------+------------+---------+-----------+------+-------+--------------+
|   1   |     5     | 2015-07-31 |  5263.0 |   555.0   |  1   |   1   |      0       |
|   2   |     5     | 2015-07-31 |  6064.0 |   625.0   |  1   |   1   |      0       |
|   3   |     5     | 2015-07-31 |  8314.0 |   821.0   |  1   |   1   |      0       |
|   4   |     5     | 2015-07-31 | 13995.0 |   1498.0  |  1   |   1   |      0       |
|   5   |     5     | 2015-07-31 |  4822.0 |   559.0   |  1   |   1   |      0       |
|   6   |     5     | 2015-07-31 |  5651.0 |   589.0   |  1   |   1   |      0       |
|   7   |     5     | 2015-07-31 | 15344.0 |   1414.0  |  1   |   1   |      0       |
|   8   |     5     | 2015-07-31 |  8492.0 |   833.0   |  1   |   1   |      0       |
|   9   |     5     | 2015-07-31 |  8565.0 |   687.0   |  1   |   1   |      0       |
|   10  |     5     | 2015-07-31 |  7185.0 |   681.0   |  1   |   1   |      0       |
+-------+-----------+------------+---------+-----------+------+-------+--------------+
[986159 rows x 8 columns]

+-------+-----------+------------+---------------------+
| Store | StoreType | Assortment | CompetitionDistance |
+-------+-----------+------------+---------------------+
|   1   |     c     |     a      |         1270        |
|   2   |     a     |     a      |         570         |
|   3   |     a     |     a      |        14130        |
|   4   |     c     |     c      |         620         |
|   5   |     a     |     a      |        29910        |
|   6   |     a     |     a      |         310         |
|   7   |     a     |     c      |        24000        |
|   8   |     a     |     a      |         7520        |
|   9   |     a     |     c      |         2030        |
|   10  |     a     |     a      |         3160        |
+-------+-----------+------------+---------------------+
[1115 rows x 4 columns]

The second matrix describes the store type, the assortment group of items each store sells, and the distance to the nearest competitor store.

But in my test data, I only have the information in the first matrix, without the Customers and Sales fields. The aim is to predict the Sales field given:

  • Store
  • DayOfWeek
  • Date
  • Open (whether the store is open)
  • Promo (whether the store is running a promotion)
  • StateHoliday (whether the day is a state holiday)

I can easily train a model on the bulleted fields above to predict Sales, but how can I make use of the second matrix in my training data, which I would not get in the test data?

Is it logical to assume that the second matrix about the store types is static and that I can simply join it to the test data?
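If the store table really is static, the join asked about here could look roughly like the sketch below. It assumes pandas; the store rows are copied from the second table, while the `test` frame and the names `store_info` and `test_joined` are made up for illustration. Whether the store attributes stay constant over the test period is something only the data provider can confirm.

```python
import pandas as pd

# Store-level table: one row per Store (values copied from the second table).
store_info = pd.DataFrame({
    "Store": [1, 2, 3],
    "StoreType": ["c", "a", "a"],
    "Assortment": ["a", "a", "a"],
    "CompetitionDistance": [1270, 570, 14130],
})

# Hypothetical test rows: sales-table columns without Sales and Customers.
test = pd.DataFrame({
    "Store": [1, 3, 3, 4],
    "DayOfWeek": [5, 5, 6, 5],
    "Open": [1, 1, 1, 1],
    "Promo": [1, 1, 0, 1],
})

# A left join keeps every test row; validate="m:1" raises an error
# if Store turns out not to be unique in store_info.
test_joined = test.merge(store_info, on="Store", how="left", validate="m:1")

# Stores missing from store_info (Store 4 here) get NaN in the joined columns.
unmatched = int(test_joined["StoreType"].isna().sum())
print(test_joined)
print("test rows without store info:", unmatched)
```

Applying the same join to the training rows gives the model identical columns at training and prediction time.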

What happens if there are gaps in the feature set of my test data, for example if some rows in the test data have no Promo value?
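One common way to handle such gaps is to impute them. The sketch below assumes pandas and NumPy and uses made-up toy frames; it fills a missing Promo with the most frequent training value and keeps a flag recording which rows were filled. The column name `Promo_missing` is an illustrative choice, and mode imputation is only one option among several.

```python
import numpy as np
import pandas as pd

# Toy data: Promo is complete in training but has a gap in the test rows.
train = pd.DataFrame({"Promo": [1, 1, 0, 1, 0, 1]})
test = pd.DataFrame({"Store": [1, 2, 3], "Promo": [1.0, np.nan, 0.0]})

# Learn the fill value from the training data only.
promo_mode = train["Promo"].mode().iloc[0]

# Record which rows were missing before filling, so a model can use that signal.
test["Promo_missing"] = test["Promo"].isna().astype(int)
test["Promo"] = test["Promo"].fillna(promo_mode).astype(int)
print(test)
```

Some models, such as certain gradient-boosted tree implementations, accept missing values directly, in which case the fill step may be unnecessary.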

Best answer

Use the extra features for unsupervised learning. You might enjoy Vladimir Vapnik's take on this in the context of SVMs, which he calls learning using privileged information (LUPI): "Learning with Intelligent Teacher: Similarity Control and Knowledge Transfer".
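One possible reading of the first suggestion, sketched under several assumptions: summarize a training-only column (Customers) per store, cluster the stores on that summary, and attach the cluster label to test rows through the Store key. This is not Vapnik's method, only a simple illustration. It assumes pandas and scikit-learn; the toy values, the choice of two clusters, and the names `MeanCustomers` and `StoreCluster` are made up.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy training rows; Customers is available here but not at test time.
train = pd.DataFrame({
    "Store":     [1, 1, 2, 2, 3, 3, 4, 4],
    "Customers": [555, 560, 625, 600, 1498, 1450, 1414, 1500],
})

# Summarize the training-only column per store.
per_store = (
    train.groupby("Store", as_index=False)["Customers"].mean()
         .rename(columns={"Customers": "MeanCustomers"})
)

# Cluster stores on that summary; n_clusters=2 is arbitrary for the toy data.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
per_store["StoreCluster"] = km.fit_predict(per_store[["MeanCustomers"]])

# The label depends only on Store, so it can be joined onto test rows.
test = pd.DataFrame({"Store": [1, 3, 4], "DayOfWeek": [5, 5, 6]})
test = test.merge(per_store[["Store", "StoreCluster"]], on="Store", how="left")
print(test)
```

The cluster numbers themselves are arbitrary, so they are better treated as a categorical feature than as an ordered quantity.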