Estonian Land Board data

Available via www.maaamet.ee. API can be accessed programmatically using html requests. One can use “rvest” package. Please have a look at “R/maaamet.R” script in rstats-tartu/datasets repo. Returns html, but that ok too. Html results need little wrangling to get table.

Data that can be downloaded are number of transactions and price summaries per property size, and type for different time periods. Price info is given for data splits with more than 5 transactions.

Load dataset

##  Check if file is alredy downloaded 
if(!file.exists("data/transactions_residential_apartments.csv")){
  url <- "https://raw.githubusercontent.com/rstats-tartu/datasets/master/transactions_residential_apartments.csv"
  
  ## Check if data folder is present
  if(!dir.exists("data")){
    dir.create("data")
  }
  ## Download file to data folder
  download.file(url, "data/transactions_residential_apartments.csv")
}

Import

library(tidyverse)

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

library(stringr)
# Please try broom from GitHub as CRAN version may give obscure warning
# with augment
# devtools::install_github("dgrtwo/broom")
library(broom)
library(viridis)

## Loading required package: viridisLite

apartments <- read_csv("data/transactions_residential_apartments.csv")

## Parsed with column specification:
## cols(
##   year = col_integer(),
##   month = col_character(),
##   county = col_character(),
##   area = col_character(),
##   transactions = col_integer(),
##   area_total = col_double(),
##   area_mean = col_double(),
##   price_total = col_integer(),
##   price_min = col_double(),
##   price_max = col_double(),
##   price_unit_area_min = col_double(),
##   price_unit_area_max = col_double(),
##   price_unit_area_median = col_double(),
##   price_unit_area_mean = col_double(),
##   price_unit_area_sd = col_double(),
##   title = col_character(),
##   subtitle = col_character()
## )

## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)

## Warning: 31 parsing failures.
## row # A tibble: 5 x 5 col     row         col               expected actual expected   <int>       <chr>                  <chr>  <chr> actual 1 11495 price_total no trailing characters     e3 file 2 11504 price_total no trailing characters     e3 row 3 11512 price_total no trailing characters     e3 col 4 11513 price_total no trailing characters     e3 expected 5 11573 price_total no trailing characters     e3 actual # ... with 1 more variables: file <chr>
## ... ................. ... ................................................. ........ ................................................. ...... ................................................. .... ................................................. ... ................................................. ... ................................................. ........ ................................................. ...... .......................................
## See problems(...) for more details.

apartments <- apartments %>% 
  mutate(date = ymd(str_c(year, month, 1, sep = "-"))) %>% 
  select(date, everything())
harju <- apartments %>% filter(str_detect(county, "Harju"))
harju

## # A tibble: 890 x 18
##          date  year month        county      area transactions area_total
##        <date> <int> <chr>         <chr>     <chr>        <int>      <dbl>
##  1 2005-01-01  2005   Jan Harju maakond  10-29,99           65     1418.8
##  2 2005-01-01  2005   Jan Harju maakond  30-40,99          155     5431.5
##  3 2005-01-01  2005   Jan Harju maakond  41-54,99          253    12043.8
##  4 2005-01-01  2005   Jan Harju maakond  55-69,99          230    14515.9
##  5 2005-01-01  2005   Jan Harju maakond 70-249,99          118    10905.0
##  6 2005-02-01  2005   Feb Harju maakond  10-29,99           63     1340.2
##  7 2005-02-01  2005   Feb Harju maakond  30-40,99          175     6125.2
##  8 2005-02-01  2005   Feb Harju maakond  41-54,99          257    12233.8
##  9 2005-02-01  2005   Feb Harju maakond  55-69,99          249    15659.0
## 10 2005-02-01  2005   Feb Harju maakond 70-249,99          174    15926.1
## # ... with 880 more rows, and 11 more variables: area_mean <dbl>,
## #   price_total <int>, price_min <dbl>, price_max <dbl>,
## #   price_unit_area_min <dbl>, price_unit_area_max <dbl>,
## #   price_unit_area_median <dbl>, price_unit_area_mean <dbl>,
## #   price_unit_area_sd <dbl>, title <chr>, subtitle <chr>

Strategy

Start with single unit and identify interesting pattern
Summarise pattern with model
Apply model to all units
Look for units that don’t fit pattern
Summarise with single model

First glimpse

Plot number of transactions per year for Harju county and add smooth line.

p <- ggplot(harju, aes(factor(year), transactions, group = area, color = area)) +
  geom_point() +
  geom_smooth(method = 'loess') +
  facet_wrap(~county, scales = "free_y") +
  labs(
    y = "Transactions",
    x = "Year"
  ) +
  scale_color_viridis(discrete = TRUE)
p

Mean price per unit area:

## Add date 
p <- harju %>%
  ggplot(aes(date, price_unit_area_mean, group = area, color = area)) +
  geom_line() +
  geom_vline(xintercept = ymd("2008-09-15"), linetype = 2) +
  labs(title = "Transactions with residential apartments",
       subtitle = "Harju county",
       x = "Date",
       y = bquote(Mean~price~(eur/m^2)),
       caption = str_c("Source: Estonian Land Board, transactions database.\nDashed line, the collapse of the investment bank\nLehman Brothers on Sep 15, 2008.")) +
  scale_color_viridis(discrete = TRUE, name = bquote(Apartment~size~m^2))
p

Adjust mean price to changes in consumer index:

download.file("https://raw.githubusercontent.com/rstats-tartu/datasets/master/consumer_index.csv",
              "data/consumer_index.csv")
consumer_index <- read_csv("data/consumer_index.csv")

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   X1 = col_integer(),
##   year = col_integer(),
##   Jan = col_double(),
##   Feb = col_double(),
##   Mar = col_double(),
##   Apr = col_double(),
##   May = col_double(),
##   Jun = col_double(),
##   Jul = col_double(),
##   Aug = col_double(),
##   Sep = col_double(),
##   Oct = col_double(),
##   Nov = col_double(),
##   Dec = col_double()
## )

consumer_index

## # A tibble: 22 x 14
##       X1  year   Jan   Feb   Mar   Apr   May    Jun    Jul    Aug    Sep
##    <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1     1  1996 61.86 63.84 64.90 66.02 66.35  66.74  66.94  66.28  66.55
##  2     2  1997 68.59 69.05 69.58 70.63 71.95  72.42  72.68  73.01  73.34
##  3     3  1998 76.50 77.23 77.89 78.22 78.48  78.68  79.14  79.01  78.88
##  4     4  1999 79.93 80.13 80.46 80.73 80.99  80.92  80.99  80.86  80.86
##  5     5  2000 82.37 82.44 83.03 83.10 83.23  83.50  84.29  84.35  84.68
##  6     6  2001 87.06 87.32 87.72 88.31 88.84  89.04  89.23  89.30  89.50
##  7     7  2002 90.75 91.28 91.54 92.33 92.53  92.46  92.14  91.74  91.87
##  8     8  2003 93.06 93.32 93.52 93.39 93.19  92.86  92.99  92.99  93.26
##  9     9  2004 93.65 93.85 94.18 94.77 96.62  96.95  96.69  96.62  96.82
## 10    10  2005 97.54 98.14 98.73 99.19 99.39 100.05 100.45 100.71 101.57
## # ... with 12 more rows, and 3 more variables: Oct <dbl>, Nov <dbl>,
## #   Dec <dbl>

## check if 2005 is 100%
consumer_index[10, 3:14] %>% rowMeans()

## [1] 100

divide_by_100 <- function(x) {
  x/100
  }
consumer_index <- consumer_index %>% 
  select(-X1) %>% 
  gather(key = month, value = consumerindex, -year) %>% 
  mutate_at("consumerindex", divide_by_100)
consumer_index

## # A tibble: 264 x 3
##     year month consumerindex
##    <int> <chr>         <dbl>
##  1  1996   Jan        0.6186
##  2  1997   Jan        0.6859
##  3  1998   Jan        0.7650
##  4  1999   Jan        0.7993
##  5  2000   Jan        0.8237
##  6  2001   Jan        0.8706
##  7  2002   Jan        0.9075
##  8  2003   Jan        0.9306
##  9  2004   Jan        0.9365
## 10  2005   Jan        0.9754
## # ... with 254 more rows

Join apartments and consumer index data:

apartments <- left_join(apartments, consumer_index, by = c("year", "month"))

Select harju county data:

harju_ci <- filter(apartments, str_detect(county, "Harju"))
harju_ci

## # A tibble: 890 x 19
##          date  year month        county      area transactions area_total
##        <date> <int> <chr>         <chr>     <chr>        <int>      <dbl>
##  1 2005-01-01  2005   Jan Harju maakond  10-29,99           65     1418.8
##  2 2005-01-01  2005   Jan Harju maakond  30-40,99          155     5431.5
##  3 2005-01-01  2005   Jan Harju maakond  41-54,99          253    12043.8
##  4 2005-01-01  2005   Jan Harju maakond  55-69,99          230    14515.9
##  5 2005-01-01  2005   Jan Harju maakond 70-249,99          118    10905.0
##  6 2005-02-01  2005   Feb Harju maakond  10-29,99           63     1340.2
##  7 2005-02-01  2005   Feb Harju maakond  30-40,99          175     6125.2
##  8 2005-02-01  2005   Feb Harju maakond  41-54,99          257    12233.8
##  9 2005-02-01  2005   Feb Harju maakond  55-69,99          249    15659.0
## 10 2005-02-01  2005   Feb Harju maakond 70-249,99          174    15926.1
## # ... with 880 more rows, and 12 more variables: area_mean <dbl>,
## #   price_total <int>, price_min <dbl>, price_max <dbl>,
## #   price_unit_area_min <dbl>, price_unit_area_max <dbl>,
## #   price_unit_area_median <dbl>, price_unit_area_mean <dbl>,
## #   price_unit_area_sd <dbl>, title <chr>, subtitle <chr>,
## #   consumerindex <dbl>

harju_ci <- harju_ci %>% 
  mutate(pua_mean_ci = price_unit_area_mean / consumerindex)

Price per m^2 after consumer index adjustment:

harju_ci %>%
  ggplot(aes(date, pua_mean_ci, group = area, color = area)) +
  geom_line() +
  geom_vline(xintercept = ymd("2008-09-15"), linetype = 2) +
  labs(title = "Transactions with residential apartments",
       subtitle = "Harju county",
       x = "Date",
       y = "Consumer index corrected mean price\nrelative to 2005 (eur/m^2)",
       caption = str_c("Source: Estonian Land Board, transactions database.\nDashed line, the collapse of the investment bank\nLehman Brothers on Sep 15, 2008.")) +
  scale_color_viridis(discrete = TRUE, name = bquote(Apartment~size~m^2))

Seasonal pattern

Seasonal pattern in number of transactions? Mean price eur/m^2 per month:

## Note that we are rearranging x axis using R builtin vector of month names
p <- harju %>%
  ggplot(aes(factor(month, levels = month.abb), transactions, group = year)) +
  geom_line(alpha = 0.3) +
  geom_vline(xintercept = ymd("2008-09-15"), linetype = 2) +
  facet_wrap(~ area, scales = "free_y") +
  labs(title = "Seasonal trend in number of transactions\nwith residential apartments",
       subtitle = "Harju county",
       x = "Month",
       y = "Transactions",
       caption = "Source: Estonian Land Board, transactions database.") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
p

Add trendline per month. Fit linear model per month for each apartment size class. augment() function from “broom” library adds fitted values and residuals to the original data. We can use this to overlay model fit to our data.

predicted_transa <- harju %>% 
  lm(transactions ~ month + area, data = .) %>%
  augment() # broom
predicted_transa %>% 
  arrange(desc(.cooksd)) %>% 
  dplyr::select(.cooksd, everything()) %>% 
  head

##      .cooksd transactions month     area  .fitted  .se.fit   .resid
## 1 0.02159508          530   Nov 41-54,99 258.5006 8.840215 271.4994
## 2 0.01811853          512   Oct 41-54,99 256.5092 8.613145 255.4908
## 3 0.01785884          461   Nov 55-69,99 214.1018 8.840215 246.8982
## 4 0.01776184          457   Dec 55-69,99 210.7732 8.840215 246.2268
## 5 0.01653771          492   Aug 41-54,99 247.9092 8.613145 244.0908
## 6 0.01528834          443   Sep 55-69,99 208.3103 8.613145 234.6897
##         .hat   .sigma .std.resid
## 1 0.01878010 63.87491   4.248853
## 2 0.01782772 63.95253   3.996386
## 3 0.01878010 63.99134   3.863856
## 4 0.01878010 63.99436   3.853348
## 5 0.01782772 64.00444   3.818068
## 6 0.01782772 64.04544   3.671015

Overlay fitted values to previous plot.

p + geom_line(data = predicted_transa, aes(month, .fitted, group = 1), 
              color = "red",
              size = 2) +
  labs(caption = "Red line: model fit of seasonal effect to number of transactions.\nSource: Estonian Land Board, transactions database.")

The number of transactions increases during period from March to May?

Do we have similar effect to average price per m^2?

predicted_pua <- harju %>% 
  lm(price_unit_area_mean ~ month + area, data = .) %>%
  augment()
predicted_pua %>% head

##   price_unit_area_mean month      area  .fitted .se.fit    .resid
## 1               766.69   Jan  10-29,99 1167.756 36.6406 -401.0660
## 2               738.23   Jan  30-40,99 1143.193 36.6406 -404.9631
## 3               719.91   Jan  41-54,99 1159.304 36.6406 -439.3937
## 4               699.59   Jan  55-69,99 1166.595 36.6406 -467.0046
## 5               886.06   Jan 70-249,99 1347.656 36.6406 -461.5959
## 6               705.96   Feb  10-29,99 1175.696 36.6406 -469.7363
##         .hat   .sigma     .cooksd .std.resid
## 1 0.01782772 274.2346 0.002467191  -1.474712
## 2 0.01782772 274.2279 0.002515370  -1.489042
## 3 0.01782772 274.1661 0.002961275  -1.615643
## 4 0.01782772 274.1129 0.003345132  -1.717167
## 5 0.01782772 274.1236 0.003268097  -1.697280
## 6 0.01782772 274.1074 0.003384381  -1.727212

harju %>%
  ggplot(aes(factor(month, levels = month.abb), price_unit_area_mean, group = year)) +
  geom_line(alpha = 0.3) +
  geom_vline(xintercept = ymd("2008-09-15"), linetype = 2) +
  facet_wrap(~ area, scales = "free_y") +
  labs(title = bquote(Seasonal~trend~price~per~m^2~of~residential~apartments),
       subtitle = "Harju county",
       x = "Month",
       y = bquote(Mean~price~per~unit~area~(eur/m^2)),
       caption = "Source: Estonian Land Board, transactions database.") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) +
  geom_line(data = predicted_pua, aes(month, .fitted, group = 1), 
              color = "red", 
              size = 2)

Residuals? Add also year to model to visualise residuals.

predicted_trans2 <- harju %>% 
  lm(transactions ~ month + year + area, data = .) %>% 
  augment
predicted_trans2 %>% 
  head()

##   transactions month year      area   .fitted  .se.fit     .resid
## 1           65   Jan 2005  10-29,99  88.58743 9.129342 -23.587427
## 2          155   Jan 2005  30-40,99 153.91327 9.129342   1.086731
## 3          253   Jan 2005  41-54,99 245.34585 9.129342   7.654146
## 4          230   Jan 2005  55-69,99 200.94698 9.129342  29.053023
## 5          118   Jan 2005 70-249,99 175.78967 9.129342 -57.789674
## 6           63   Feb 2005  10-29,99  99.50743 9.129342 -36.507427
##         .hat   .sigma      .cooksd  .std.resid
## 1 0.02134301 62.52077 1.867599e-04 -0.38155192
## 2 0.02134301 62.52597 3.964297e-07  0.01757903
## 3 0.02134301 62.52544 1.966600e-05  0.12381403
## 4 0.02134301 62.51808 2.833382e-04  0.46996380
## 5 0.02134301 62.49468 1.121045e-03 -0.93480995
## 6 0.02134301 62.51350 4.473885e-04 -0.59054678

predicted_trans2 %>% 
  ggplot(aes(factor(month, levels = month.abb), .resid, group = year)) + 
  geom_hline(yintercept = 0, colour = "white", size = 3) +
  geom_line() + 
  facet_wrap(~ area)

For some apartement size classes, few years may have large influence on average number of transactions. We can look at the large cook’s distance.

predicted_trans2 %>% 
  arrange(desc(.cooksd)) %>% 
  dplyr::select(.cooksd, everything()) %>% 
  head()

##      .cooksd transactions month year     area  .fitted  .se.fit   .resid
## 1 0.02066502          530   Nov 2005 41-54,99 285.0741 9.243226 244.9259
## 2 0.01732572          512   Oct 2005 41-54,99 284.8125 9.129342 227.1875
## 3 0.01672219          461   Nov 2005 55-69,99 240.6752 9.243226 220.3248
## 4 0.01662042          457   Dec 2005 55-69,99 237.3466 9.243226 219.6534
## 5 0.01563057          492   Aug 2005 41-54,99 276.2125 9.129342 215.7875
## 6 0.01429830          443   Sep 2005 55-69,99 236.6136 9.129342 206.3864
##         .hat   .sigma .std.resid
## 1 0.02187882 61.96100   3.963025
## 2 0.02134301 62.04045   3.675001
## 3 0.02187882 62.06920   3.564966
## 4 0.02187882 62.07199   3.554102
## 5 0.02134301 62.08812   3.490594
## 6 0.02134301 62.12556   3.338521

Same for mean price per m^2:

predicted_pua2 <- harju %>% 
  lm(price_unit_area_mean ~ month + year + area, data = .) %>% 
  augment 
predicted_pua2 %>% 
  ggplot(aes(factor(month, levels = month.abb), .resid, group = year, color = factor(year))) + 
  geom_hline(yintercept = 0, colour = "white", size = 3) +
  geom_line() + 
  facet_wrap(~ area) +
  scale_color_viridis(discrete = TRUE) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

Seems ok!

predicted_pua2 %>%
  arrange(desc(.cooksd)) %>% 
  dplyr::select(.cooksd, everything(.)) %>% 
  head()

##       .cooksd price_unit_area_mean month year     area  .fitted  .se.fit
## 1 0.009230944              1725.98   Apr 2007 10-29,99 1043.973 34.24826
## 2 0.008394703               421.50   Jul 2009 10-29,99 1093.030 33.20911
## 3 0.007883079              1650.16   Jun 2007 30-40,99 1019.909 34.24826
## 4 0.006989624              1665.78   Aug 2007 10-29,99 1072.318 34.24826
## 5 0.006803107              1604.90   Apr 2007 30-40,99 1019.410 34.24826
## 6 0.006686472              1615.97   Apr 2007 41-54,99 1035.521 34.24826
##      .resid       .hat   .sigma .std.resid
## 1  682.0070 0.01950485 244.2558   2.808659
## 2 -671.5304 0.01833919 244.2911  -2.763871
## 3  630.2512 0.01950485 244.4184   2.595517
## 4  593.4616 0.01950485 244.5261   2.444009
## 5  585.4899 0.01950485 244.5485   2.411179
## 6  580.4493 0.01950485 244.5626   2.390421

Robust lm fit to number of transactions. Play with weights: price_min, price_unit_area_mean etc..

# install.packages("MASS")
library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

predicted_transa_robust <- harju %>% 
  rlm(transactions ~ month + area, data = ., weights = price_min) %>% 
  augment

Add robust fit to plot:

p + geom_line(data = predicted_transa, aes(month, .fitted, group = 1), 
              color = "red",
              size = 1) + 
  geom_line(data = predicted_transa_robust, aes(month, .fitted, group = 1), 
            color = "blue",
            size = 1) +
  labs(caption = "Red line: model fit of seasonal effect to number of transactions.\nBlue line: robust model fit of seasonal effect to number of transactions, weighted for minimal price. \nSource: Estonian Land Board, transactions database.")

## Whole dataset: all counties

Plot whole dataset:

apartments %>% 
  ggplot(aes(date, transactions, group = area, color = area)) +
  geom_line() +
  facet_wrap(~ county, scales = "free_y")

apartments %>% 
  mutate(price_unit_area_mean = price_unit_area_mean / consumerindex) %>% 
  ggplot(aes(date, price_unit_area_mean, group = area, color = area)) +
  geom_line() +
  facet_wrap(~ county, scales = "free_y")

## Warning: Removed 8 rows containing missing values (geom_path).

Fit multiple models

Fit model for each county. Goodness of model fit is evaluated by its ability to predict response values. Response values can be predicted using the same data that was used to train model. In this case we are dealing with in-sample analysis, which is ok when all we want to know is model coeficients (eg. intercept and slope) to describe present data. In cases when we want to predict using some unseen future data, such models are usually overfitted, overly optimistic and don’t perform very well. To manage overfitting, we split dataset into training set for model fitting and test set for model performance assessment.

Resampling can be performed very well using base R function sample(). Let’s say we want to use approx. 70% of rows from “mtcars” table to be used in model training and leave remaining 30% for model testing. Basically, all we need to do is to generate to non-overlapping row indexes.

set.seed(123)
index <- sample(1:nrow(mtcars), floor(0.7 * nrow(mtcars)), replace = FALSE)
index

##  [1] 10 25 13 26 27  2 14 23 29 11 22 32 24 31 28 16  4  1  5 30 19  8

## ~70% data goes to training set
train <- mtcars[index, ]
## remaining ~30% will be used for testing model performance
test <- mtcars[-index, ]

Index based resampling is also implemented in tidyverse. We can use resample_partition() function from “modelr” library to perform. Here we generate data splits and move them to separate columns.

library(modelr)

## 
## Attaching package: 'modelr'

## The following object is masked from 'package:broom':
## 
##     bootstrap

## Generate data splits and move into separate columns
trans <- apartments %>%
  group_by(county) %>%
  nest %>% 
  mutate(parts = map(data, ~resample_partition(.x, c(test = 0.3, train = 0.7))),
         test = map(parts, "test"),
         train = map(parts, "train")) %>% 
  dplyr::select(-parts)

Fit model to each data split. Let’s start with number of transactions.

## Fit model
trans <- trans %>%
  mutate(mod_train = map(train, ~lm(transactions ~ month + area, data = .x)))
## Print data with new model column
trans %>% 
  dplyr::select(mod_train, everything())

## # A tibble: 15 x 5
##    mod_train             county                data           test
##       <list>              <chr>              <list>         <list>
##  1  <S3: lm>      Harju maakond <tibble [890 x 18]> <S3: resample>
##  2  <S3: lm>       Hiiu maakond <tibble [400 x 18]> <S3: resample>
##  3  <S3: lm>   Ida-Viru maakond <tibble [890 x 18]> <S3: resample>
##  4  <S3: lm>     Jõgeva maakond <tibble [798 x 18]> <S3: resample>
##  5  <S3: lm>      Järva maakond <tibble [824 x 18]> <S3: resample>
##  6  <S3: lm>      Lääne maakond <tibble [832 x 18]> <S3: resample>
##  7  <S3: lm> Lääne-Viru maakond <tibble [884 x 18]> <S3: resample>
##  8  <S3: lm>      Põlva maakond <tibble [738 x 18]> <S3: resample>
##  9  <S3: lm>      Pärnu maakond <tibble [889 x 18]> <S3: resample>
## 10  <S3: lm>      Rapla maakond <tibble [797 x 18]> <S3: resample>
## 11  <S3: lm>      Saare maakond <tibble [786 x 18]> <S3: resample>
## 12  <S3: lm>      Tartu maakond <tibble [890 x 18]> <S3: resample>
## 13  <S3: lm>      Valga maakond <tibble [847 x 18]> <S3: resample>
## 14  <S3: lm>   Viljandi maakond <tibble [875 x 18]> <S3: resample>
## 15  <S3: lm>       Võru maakond <tibble [832 x 18]> <S3: resample>
## # ... with 1 more variables: train <list>

coef() and summary() return vector and S3 object respectively. Model coeficients and summary statistics can be returned as a data frame using tidy() function from “broom” library:

## model coeficients
trans %>% 
  mutate(mod_tidy = map(mod_train, tidy)) %>% 
  dplyr::select(county, mod_tidy) %>% 
  unnest %>% 
  head()

## # A tibble: 6 x 6
##          county        term   estimate std.error  statistic      p.value
##           <chr>       <chr>      <dbl>     <dbl>      <dbl>        <dbl>
## 1 Harju maakond (Intercept)  90.720597  10.95902  8.2781703 7.956235e-16
## 2 Harju maakond    monthAug   8.425303  12.99468  0.6483655 5.169934e-01
## 3 Harju maakond    monthDec   7.407290  12.87110  0.5754980 5.651675e-01
## 4 Harju maakond    monthFeb -19.712457  13.36976 -1.4744058 1.408899e-01
## 5 Harju maakond    monthJan -34.256393  12.75323 -2.6860961 7.426733e-03
## 6 Harju maakond    monthJul -11.766739  13.05168 -0.9015499 3.676530e-01

Pseudo out-of-sample model performance, augment() adds predictions to original data frame (.fitted). We subtract fitted values from the original values and calculate root-mean-squared deviation (rmsd). Smaller rmsd indicates better fit.

trans <- trans %>%
  mutate(mod_pred = map2(mod_train, test, ~augment(.x, newdata = .y)),
         mod_pred = map(mod_pred, ~mutate(.x, .resid = transactions - .fitted)),
         mod_rmsd = map_dbl(mod_pred, ~sqrt(mean(.x$`.resid`^2)))) 
trans %>% 
  dplyr::select(county, mod_rmsd) %>% 
  arrange(desc(mod_rmsd))

## # A tibble: 15 x 2
##                county   mod_rmsd
##                 <chr>      <dbl>
##  1      Harju maakond 66.4419493
##  2      Tartu maakond 16.5692863
##  3   Ida-Viru maakond  9.7488850
##  4      Pärnu maakond  9.2678223
##  5 Lääne-Viru maakond  4.8961802
##  6   Viljandi maakond  3.5521556
##  7      Järva maakond  3.1063724
##  8      Rapla maakond  2.9415115
##  9      Valga maakond  2.9113005
## 10      Lääne maakond  2.7434744
## 11      Saare maakond  2.5359518
## 12     Jõgeva maakond  2.4825984
## 13       Võru maakond  2.3033747
## 14      Põlva maakond  1.9872345
## 15       Hiiu maakond  0.7937306

We can see that Harju county displays worst fit and Tartu is next. These two counties show also majority of transactions.

trans_pred <- trans %>% 
  dplyr::select(county, mod_pred) %>% 
  unnest
trans_pred %>% 
  dplyr::select(.fitted, transactions, everything()) %>% 
  head()

## # A tibble: 6 x 22
##    .fitted transactions        county       date  year month     area
##      <dbl>        <int>         <chr>     <date> <int> <chr>    <chr>
## 1 136.5889          175 Harju maakond 2005-02-01  2005   Feb 30-40,99
## 2 221.3443          257 Harju maakond 2005-02-01  2005   Feb 41-54,99
## 3 244.3975          350 Harju maakond 2005-03-01  2005   Mar 41-54,99
## 4  90.7206          101 Harju maakond 2005-04-01  2005   Apr 10-29,99
## 5 241.0568          360 Harju maakond 2005-04-01  2005   Apr 41-54,99
## 6 199.9944          329 Harju maakond 2005-04-01  2005   Apr 55-69,99
## # ... with 15 more variables: area_total <dbl>, area_mean <dbl>,
## #   price_total <int>, price_min <dbl>, price_max <dbl>,
## #   price_unit_area_min <dbl>, price_unit_area_max <dbl>,
## #   price_unit_area_median <dbl>, price_unit_area_mean <dbl>,
## #   price_unit_area_sd <dbl>, title <chr>, subtitle <chr>,
## #   consumerindex <dbl>, .se.fit <dbl>, .resid <dbl>

Modelling

Taavi Päll