It allows to extract the individual speeches of each legislator in a document and obtain a data.frame.
speech_build(
file,
add.error.sir = NULL,
rm.error.leg = NULL,
compiler = FALSE,
quality = FALSE,
param = list(char = 6500, drop.page = 2)
)
list or character vector specifying the path or URL to a PDF file. It can be one or more files.
character vector. It allows to specify different ways in which
the term that orders the speeches could be miswritten: sir. By default it is NULL
.
character vector. It allows to add legislator's names
to be eliminated. By default it is NULL
.
By default, "PRESIDENTE", "SECRETARIO", "SUBSECRETARIO", and "MINISTRO" are eliminated.
logical. When the checking of the process of conversion from pdf to data frame
is completed, it is necessary to compile the data frame. To compile implies to unite all the
speeches of each of the legislators for each document. As it is an operation
that must be carried out after making corrections, it is necessary to opt for it.
By default it is FALSE
.
logical. If TRUE
, two quality indicators are added about
the process, according to the quality of the document.
index_1: Proportion of the text recovered according to the original document
(param = list(char = 6500, drop.page = 2)
) that must have the document.
index_2: Proportion of the final text as a function of the recovered text. It is the proportion of the document in which there are only interventions by legislators.
list of length 2 with magnitudes for arguments "character for page" and "drop page non evaluate" respectively. The default values are the median characters of 8500 documents that make up the speech datasets.
data.frame class puy
with the following variables:
legislator
: name of the legislators
speech
: speeches by legislators
date
: session date
id
: name file
legislature
: legislature id (period of government)
sex
: sex
chamber
: chamber to which the document belongs.
It can be: Chamber of Representatives, Senate, General Assembly or Permanent Commission.
If quality is TRUE, the following are added:
index_1
: index_1
index_2
: index_2
This function converts PDF documents to data.frame. The conversion is
made by seeking interventions of legislators from the word "SENOR". As the
quality of PDF files is not always the best it is recommended to verify that
no legislator is omitted in the data.frame construction process. To make
corrections of the word "SENOR" is that the argument add.error.sir
should be used. The function has a long list of different ways in which
the word "SENOR" may be written in a document, but not all possible future
problems are covered. When the PDF document is a scan that was treated with
an OCR, it should be checked with greater caution to ensure that the operation
was performed correctly.
# \donttest{
# url <- speech::speech_url(chamber = "C", from = "17-09-2019", to = "17-09-2019")
# out <- speech_build(file = url)
# out <- speech_build(file = url, compiler = FALSE,
# quality = TRUE,
# add.error.sir = c("SEf'IOR"),
# rm.error.leg = c("PRtSIDENTE", "SUB", "PRfSlENTE"),
# param = list(char = 6000, drop.page = 3))
# out <- list.files(pattern = "*.pdf") %>% speech_build()
# out <- list.files(pattern = "*.pdf") %>%
# speech_build(., compiler = TRUE, param = list(char = 4500, drop.page = 3))
# }