---
title: "Tidycells Package"
author: "Indranil Gayen"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    fig_width: 4.5
    fig_height: 3
vignette: >
  %\VignetteIndexEntry{Tidycells Package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
<style>
body {
text-align: justify}
</style>
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r setup, include = FALSE}
library(tidycells)
```
```{r, echo=FALSE}
tce <- asNamespace("tidycells")
```
<img src="ext/logo.png" align="right" width="100"/>
# **tidycells**
#### _Read Tabular Data from Diverse Sources and Easily Make Them Tidy_
## TL;DR
Given a `file_name` that is the path to a file containing table(s), run `read_cells()` in the R console to see whether support is present for the file type. If support is present, just run
```{r, eval=FALSE}
read_cells(file_name)
```
**Note:** just start with a small file.
## Inspiration
A spreadsheet is really flexible for storing and sharing data. Many organizations use excel files as their fundamental medium of data exchange (particularly in the government sector). These data are generally not in tidy format (thanks to the ultra-flexibility of spreadsheets!).
This is a great time killer for R data scientists adopting [tidyverse](https://www.tidyverse.org/) techniques. A lot of time is spent bringing the data into a consumable or usable format ([tidy format](https://r4ds.had.co.nz/tidy-data.html)).
For example, you may check out the [Database of Indian Economy](https://dbie.rbi.org.in/) (a public data dissemination platform from the [_Reserve Bank of India_](https://www.rbi.org.in/)). This platform is a great source of Indian economy data. However, since its target audience is presumably the general public rather than data scientists, the data structure is not suitable for ready consumption by _tidyverse_ functions. And this is not the only one: there are plenty of great data sources that provide data in an _"eye-relaxing"_ format rather than a tidy format.
Thanks to great packages like [tidyr](https://github.com/tidyverse/tidyr), [tidyxl](https://github.com/nacnudus/tidyxl) and [unpivotr](https://github.com/nacnudus/unpivotr) (and many other sweet packages mentioned in the _references_), users can spend less time reshaping the data and more time on analysis rather than data wrangling.
These packages require user interaction and decisions to transform and reshape the raw data (which is still easier than doing _copy-paste and transpose_ in excel) into tidy format.
However, consider situations like these: you are given a pile of excel files (or spreadsheets in any other format) which are broadly similar but were generated by various human hands over a long period (obviously the format and structure change from file to file). Or you want to build an automated forecast mechanism based on a spreadsheet which updates periodically (possibly received through some automated channel like web scraping, FTP transfer etc.). Or you simply want something like _"I'm Feeling Lucky!"_
## Introduction
The package provides utilities to read cells from complex tabular data and, via heuristic detection, perform 'structural assignment' of those cells to a columnar or tidy format.
The read functionality can read (in a unified manner) structured, partially structured or unstructured tabular data (usually spreadsheets aimed at public data dissemination and common human understanding) from various types of documents. The tabular information is read as cells.
The 'structure assignment' functionality offers both supervised and unsupervised ways of assigning cell data to a columnar/tidy format. Multiple disconnected blocks of tables in a single sheet are also handled appropriately.
These tools are suitable for unattended conversion of (maybe a pile of) messy tables (like government data) into a consumable format (usable for further analysis and data wrangling).
## Getting Started
### 1. Reading Data
Assuming you have a document (in csv, xls, xlsx, docx, doc, html or pdf, _well, not a scanned one_) which contains a table, you can read the file directly into a tidy format (let's refer to this as the composed format) using `read_cells`, as in the following example:
```{r}
fold <- system.file("extdata", "messy", package = "tidycells", mustWork = TRUE)
# the extension is intentionally wrong here;
# the file name is the actual identifier of the file type
fcsv <- list.files(fold, pattern = "^csv.", full.names = TRUE)[1]
```
Let's see what the file actually looks like
```{r, eval=FALSE}
utils::read.csv(fcsv)
```
```{r, echo=FALSE}
knitr::kable(utils::read.csv(fcsv), align = "l", longtable = FALSE)
```
Now we can simply run
```{r, eval=FALSE}
read_cells(fcsv)
```
```{r, echo=FALSE}
knitr::kable(read_cells(fcsv))
```
Not a big deal for csv, though. The result also looks similar to
```{r, eval=FALSE}
read.csv(fcsv) %>% tidyr::gather(measure, val_of_measure, -Kid.Name)
```
```{r, echo=FALSE}
knitr::kable(read.csv(fcsv) %>% tidyr::gather(measure, val_of_measure, -Kid.Name), align = "l", longtable = FALSE)
```
So what is this package doing? It looks trivial, right? But we'll look at some more complicated examples in the following sections which may well feel non-trivial. So let's continue.
Note that `read_cells` is neutral to file extension, which means that if an .xls file is somehow saved as a .csv file, it will still correctly detect the underlying file as xls and use the appropriate function to read the tabular information contained in it.
However, support for each file type apart from csv relies on optional R packages. If you simply type `read_cells()` in the console, it will print which file types are supported and, for those currently unavailable, which package is potentially required. If any package is required, install it and restart the R session for the change to take effect.
Here is a sample output:
```{r, eval=FALSE}
read_cells()
```
```{r, out.width = "550px", echo=FALSE, dpi=300}
knitr::include_graphics("ext/read_cells_out.png")
```
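If the printout reports missing readers, installing the corresponding suggested packages from CRAN is usually all that is needed. A minimal sketch (the package names below are examples of tidycells' optional dependencies mentioned in this vignette; install only the ones `read_cells()` actually reports as missing):

```r
# install optional readers reported as missing by read_cells()
# (example set; adjust to what the printout asks for)
install.packages(c("tidyxl", "readxl", "docxtractr", "XML"))
# then restart the R session so tidycells detects them
```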
Well, this is the _"I'm feeling lucky"_ kind of thing mentioned earlier. If the final result is not what you are looking for, you may have to do a bit of extra work.
The function `read_cells` has a few layers or stages of processing. You can choose the stage at which to stop processing and get intermediate results. You may then manipulate the intermediate result, tune it and finally resume processing (or do the rest outside `read_cells` completely!).
Notice that `read_cells` can be broken into pieces like
```{r}
d1 <- read_cells(fcsv, at_level = "va_classify") %>%
read_cells()
d2 <- read_cells(fcsv, at_level = 2) %>%
read_cells(at_level = 4) %>%
read_cells()
identical(d1, d2)
```
To understand what you can do with intermediate results, let's go through the other functionalities of the package and the steps followed by `read_cells`.
Here is the flowchart of `read_cells` (_which will make more sense once you go through the following sections_):
```{r, out.width = "356px", echo=FALSE, fig.align='center', dpi=300}
knitr::include_graphics("ext/read_cells.svg")
```
### 2. Making Data _tidy_
#### 2.1 as_cell_df
If you have any R object `x`, you can call `as_cell_df` to convert it into the `cell-df` format (the format for storing cell-by-cell information; see the documentation of _as_cell_df_ or _cell_df-class_ for more details).
```{r}
x <- iris %>% head()
cd <- x %>% as_cell_df(take_col_names = TRUE)
```
Each stored cell has an address specified by row and column, a value (converted to a string) and a possible data type. This is how the `cell_df` structure is defined. This is the first level of operations performed by `read_cells`.
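To inspect this representation directly, you can coerce the object to a plain data frame (this sketch assumes a `cell_df` can be coerced with `as.data.frame`, as tibble-like objects generally can; the exact column names you see may differ):

```r
# peek at the first few stored cells:
# row/column address, data type and value (as a string)
head(as.data.frame(cd))
```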
You can even see how the data looks by issuing the plot command.
```{r, eval=TRUE, out.width = "356px"}
plot(cd, adaptive_txt_size = FALSE, txt_size = 2.5)
```
#### 2.2 value_attribute_classify
The next task is to assign each cell to one of **value**, **attribute** (of those values) or **empty** (irrelevant information). This is referred to in the package as value/attribute classification (VA classification). There are 4 main alternative ways to do this, apart from manually editing by hand. These are documented in the `value_attribute_classify` function.
For now, we'll use the most useful classifier, `numeric_values_classifier`. It treats cells which can be converted to a number as values and the rest as attributes. If a cell stores a number as a string (_remember the famous warning you may have seen in Excel about "Number Stored as Text"_), it will still treat it as a value.
Let's take an example
```{r, eval=FALSE}
dummy_dat <- tibble::tibble(name = c("Mr. X", "Mr. Y"), sal = c("1,000","12,000"), age = c(35, 60))
dummy_dat
```
```{r, echo=FALSE}
dummy_dat <- tibble::tibble(name = c("Mr. X", "Mr. Y"), sal = c("1,000","12,000"), age = c(35, 60))
knitr::kable(dummy_dat)
```
```{r}
dummy_dat_cells <- dummy_dat %>%
as_cell_df(take_col_names = TRUE)
```
We can either do this
```{r}
va_cells <- numeric_values_classifier(dummy_dat_cells)
```
Or this
```{r}
va_cells <- value_attribute_classify(dummy_dat_cells,
classifier = numeric_values_classifier())
```
Let's plot it
```{r, fig.width=3, fig.height=2, eval=TRUE, out.width = "256px"}
plot(va_cells, adaptive_txt_size = FALSE, txt_size = 3)
```
Once VA classification is done, if you plot the cells you'll notice the color scheme now reflects the VA classification (given by `Type`) rather than the cell types.
```{r, eval=TRUE, out.width = "356px"}
# let's resume with iris example
cd <- numeric_values_classifier(cd)
plot(cd, adaptive_txt_size = FALSE, txt_size = 2.5)
```
```{r, echo=FALSE}
cd <- numeric_values_classifier(cd)
```
#### 2.3 analyze_cells
After this stage, the `cell-df` is ready for analysis (to understand data blocks and the orientation of attributes with respect to data or values). This is done via the following command.
```{r}
ca <- analyze_cells(cd)
```
This takes a while (the time taken depends largely on the complexity of the underlying tabular structure). Once it's done, you can plot the cell analysis and see how the heuristic detection has performed. Currently, only an interactive way of modifying a cell analysis is available (covered in the subsequent section: _Interactive Modules_).
```{r, eval=TRUE, out.width = "356px"}
plot(ca)
```
#### 2.4 compose_cells
With a cell analysis at hand, the next step is to compose it into a columnar-format (composed) data.frame. This is easily done as
```{r, eval=FALSE}
dcomp <- compose_cells(ca)
# discarding other columns for printing (you can check them!)
head(dcomp[1:7])
```
```{r, echo=FALSE}
dcomp <- compose_cells(ca)
knitr::kable(head(dcomp[1:7]))
```
```{r, eval= FALSE}
# let's check the same for va_cells (earlier dummy example)
va_cells %>%
analyze_cells() %>%
compose_cells(discard_raw_cols = TRUE)
```
```{r, echo=FALSE}
va_cells %>%
analyze_cells() %>%
compose_cells(discard_raw_cols = TRUE) %>%
knitr::kable()
```
After this, you may like to see how the final data is connected to, or rendered from, the actual table. To do this you may run
```{r, fig.width=3, fig.height=2}
cell_composition_traceback(ca, trace_row = 1)
cell_composition_traceback(ca, trace_row = 10)
```
There is an optional interactive version of it, which is explained in the subsequent section _Interactive Modules_.
#### 2.5 collate_columns
The data should be consumable after the composition stage in most cases. However, in the presence of extreme variations in the underlying data structure, another step may be required, which can be called "collation of columns". If the data is cooperative enough that it can be inferred from the data itself which column should be mapped to which, this stage is useful. It performs approximate string matching (enhanced if you have the optional [`stringdist`](https://CRAN.R-project.org/package=stringdist) package) to find inter-column mappings among the various data blocks (if the data has multiple small tables). If there is only one table (a single data block), this stage does almost nothing except rename the columns. For the `iris` and `dummy_dat` examples this is somewhat meaningless; it will become clear once you move on to the next section (_where **"A more complicated example"** is discussed_).
The function `collate_columns` is designed to carry out this operation. It can be done as easily as
```{r, eval=FALSE}
dcc <- collate_columns(dcomp)
head(dcc)
```
```{r, echo=FALSE}
dcc <- collate_columns(dcomp)
head(dcc) %>% knitr::kable()
```
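The approximate string matching idea behind this stage can be illustrated with base R's `adist` (edit distance). This is only an illustration of the concept, not necessarily how `collate_columns` matches columns internally:

```r
# a small edit distance suggests two column headers
# refer to the same attribute across data blocks
utils::adist("Kid Name", "kid_name", ignore.case = TRUE)  # distance 1 (space vs underscore)
# a large distance suggests unrelated headers
utils::adist("Kid Name", "school", ignore.case = TRUE)
```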
Note: You may omit this stage by choosing `read_cells(..., at_level = "compose")`.
That's all. This is the process flow followed by `read_cells`. Summing it all up:
```{r, eval=FALSE}
# below should match with
# read_cells(fcsv) %>% dplyr::select(-table_tag)
# * which (the column table_tag) is added to identify
# * potential multiple tables.
fcsv %>%
read.csv() %>%
as_cell_df(take_col_names = TRUE) %>%
numeric_values_classifier() %>%
analyze_cells() %>%
compose_cells() %>%
collate_columns()
```
```{r, echo=FALSE}
fcsv %>%
read.csv() %>%
as_cell_df(take_col_names = TRUE) %>%
numeric_values_classifier() %>%
analyze_cells() %>%
compose_cells() %>%
collate_columns() %>%
knitr::kable()
```
A half-done item can be manipulated further, and afterwards we can give it back to `read_cells` to finish processing (or simply do it ourselves). Here is a sample manipulation example.
```{r}
rc_part <- read_cells(fcsv, at_level = 2)
# it is a list with a `read_cells_stage` attribute
# which indicates the last processed stage in read_cells
str(rc_part)
# sample_based_classifier is another VA classifier;
# for details see the corresponding documentation
rc_part[[1]] <- rc_part[[1]] %>%
numeric_values_classifier() %>%
sample_based_classifier(empty_sample = "6")
# below should be similar to
# rc_part[[1]] %>%
# analyze_cells() %>%
# compose_cells(discard_raw_cols = TRUE)
rc_part %>% read_cells(from_level = 3)
```
**A more complicated example**
Let's take a quick look at another example dataset, given in
```{r, eval=FALSE}
system.file("extdata", "marks.xlsx", package = "tidycells", mustWork = TRUE)
```
The data looks like this (in excel):
```{r, out.width = "451px", echo=FALSE, eval=TRUE, dpi=300}
knitr::include_graphics("ext/marks.png")
```
Let's try our luck on this data. First, read it at once:
```{r, eval=FALSE}
system.file("extdata", "marks.xlsx", package = "tidycells", mustWork = TRUE) %>%
read_cells()
```
```{r, echo=FALSE}
if(tce$is_available("tidyxl")){
system.file("extdata", "marks.xlsx", package = "tidycells", mustWork = TRUE) %>%
read_cells() %>%
knitr::kable()
}
```
Let's understand, step by step, the procedures followed by `read_cells`.
```{r, eval=FALSE}
# if you have tidyxl installed
d <- system.file("extdata", "marks.xlsx", package = "tidycells", mustWork = TRUE) %>%
read_cells(at_level = "make_cells") %>%
.[[1]]
```
```{r}
# or you may do
d <- system.file("extdata", "marks_cells.rds", package = "tidycells", mustWork = TRUE) %>%
readRDS()
d <- numeric_values_classifier(d)
da <- analyze_cells(d)
```
Then you need to run `compose_cells` with the additional argument `print_attribute_overview = TRUE`
```{r, echo=TRUE, eval=FALSE}
dc <- compose_cells(da, print_attribute_overview = TRUE)
```
```{r, out.width = "451px", echo=FALSE, eval=TRUE, dpi=300}
knitr::include_graphics("ext/compose_cells_cli1.png")
dc <- compose_cells(da)
```
```{r}
# a bit tricky and tedious unless you set print_attribute_overview = TRUE in the line above
dcfine <- dc %>%
dplyr::mutate(name = dplyr::case_when(
data_block == 1 ~ major_row_left_2_1,
data_block == 2 ~ major_col_bottom_1_1,
data_block == 3 ~ major_row_left_1_1
),
sex = dplyr::case_when(
data_block == 1 ~ major_row_left_1_1,
data_block == 2 ~ major_col_bottom_2_1,
data_block == 3 ~ minor_row_right_1_1
),
school = dplyr::case_when(
data_block == 1 ~ minor_col_top_1_1,
data_block == 2 ~ minor_corner_topLeft_1_1,
data_block == 3 ~ minor_col_top_1_1
)) %>%
dplyr::select(school,sex, name, value)
```
`head(dcfine)` looks like
```{r, echo=FALSE}
knitr::kable(head(dcfine), align = c(rep("l", 3), "c"))
```
This is still not great, right? You had to manually pick some weird column names (when it was evident from the data which columns should be aligned with which).
The `collate_columns` function does exactly this for you. So instead of manually picking column names after `compose_cells`, you can simply run
```{r, eval=FALSE}
collate_columns(dc) %>%
head()
```
```{r, echo=FALSE}
knitr::kable(head(collate_columns(dc)), align = c(rep("l", 5), "c"))
```
Looks like a staged example! Yes, you are right: this is not always perfect (the same is true for `analyze_cells`). However, if the data itself helps demystify the underlying column structure (as it does here), this will be useful.
Once again, these functions (`read_cells` combining all functionalities, `analyze_cells`, `collate_columns`) are here to ease your pain in data wrangling and reading from various sources. They may not be a foolproof solution for all types of tabular data. It is always recommended to perform these tasks manually whenever the expected results do not come out.
### 3. Support for Multiple Formats
Coming to support for multiple formats (assuming you have all the required packages as specified by `read_cells`), you can run the following set of commands to test the multi-format support of `read_cells`.
```{r}
dm <- tibble::tibble(fn = list.files(fold, full.names = TRUE))
dm$fn %>% basename()
```
```{r, echo=FALSE, include=FALSE}
# filter based on file-type
dm <- dm %>%
dplyr::mutate(original = dm$fn %>%
purrr::map_chr(~ basename(.x) %>%
stringr::str_split("\\.") %>%
purrr::map_chr(1)))
# making sure csv always remains in 1st position
dm <- dm %>% dplyr::filter(original=="csv") %>%
dplyr::bind_rows(dm %>% dplyr::filter(original!="csv"))
if(!(tce$is_available("readxl") | tce$is_available("xlsx"))){
dm <- dm %>% dplyr::filter(original!="xls")
}
if(!(tce$is_available("tidyxl"))){
dm <- dm %>% dplyr::filter(original!="xlsx")
}
if(!(tce$is_available("docxtractr"))){
dm <- dm %>% dplyr::filter(original!="docx" & original!="doc")
}
if(!(tce$is_available("tabulizer"))){
dm <- dm %>% dplyr::filter(original!="pdf")
}
if(!(tce$is_available("XML"))){
dm <- dm %>% dplyr::filter(original!="html")
}
```
```{r}
dcomps <- dm$fn %>% purrr::map(read_cells)
dcomps_sel <- dcomps %>%
purrr::map(~ .x %>%
dplyr::select(value, collated_1, collated_2))
# all of them are the same (intentionally made so, though the file types are totally different)
dcomps_sel %>%
purrr::map_lgl(~identical(.x, dcomps_sel[[1]])) %>%
all()
# check one sample
dcomps_sel[[1]]
```
Each file has a wrong extension; the original extension is the file-name part before the extension (intentionally created for testing this feature). For example, docx.csv is actually a docx file. You can copy one to a location where you have write access, rename it appropriately and open it to verify.
### 4. Interactive Modules
The package also provides shiny-based interactive visualisations for understanding how the heuristic is functioning, as well as editing capabilities for objects (like `cell-df` or `cell-analysis`).
The [shiny](https://shiny.rstudio.com/) package is required for these functions. Most of the features are self-explanatory and guided.
Check out the interactive documentation of any of the functions listed below. All of them are available as [RStudio Addins](https://rstudio.github.io/rstudioaddins/).
Here are screenshots of the interactive widgets.
1. Plot tune (part of all modules)
2. `visual_crop()` for data crop and deletion of sections
```{r, out.width = "516px", echo=FALSE, dpi=300}
knitr::include_graphics("ext/v12.png")
```
3. `visual_va_classify()` for interactive VA classification
4. `visual_data_block_inspection()` shows how the heuristic has performed the analysis after `analyze_cells`
```{r, out.width = "516px", echo=FALSE, dpi=300}
knitr::include_graphics("ext/v34.png")
```
5. `visual_orientation_modification()` for modifying the heuristic-based results
6. `visual_traceback()` for observing how the original data is composed to form the final output (`compose_cells` is called internally)
```{r, out.width = "516px", echo=FALSE, dpi=300}
knitr::include_graphics("ext/v56.png")
```
For each of these modules, a dynamic plot option is available via [plotly](https://github.com/ropensci/plotly). If you have that package installed, the corresponding tab will be activated.
Since all of these modules are entirely optional, the dependencies are kept at the 'Suggests' level of **tidycells** only.