This function is designed to make it easy to perform some data integrity checks, specifically checking for values that are impossible or unrealistic. These values can then be replaced by another value, or the offending cases can be deleted from the dataframe.

checkDataIntegrity(x,
                   dat,
                   newValue = NA,
                   removeCases = FALSE,
                   validValueSuffix = "_validValue",
                   newValueSuffix = "_newValue",
                   totalVarName = "numberOfInvalidValues",
                   append = TRUE,
                   replace = TRUE,
                   silent = FALSE,
                   rmarkdownOutput=FALSE,
                   callingSelf = FALSE)

Arguments

x

This can be either a vector or a list. If it is a vector, it should have two elements, the first one being a regular expression matching one or more variables in the dataframe specified in dat, and second one being the condition the matching variables have to satisfy. If it is a list, it should be a list of such vectors. The conditions should start with a Comparison operator followed by a value (e.g. "<30" or ">=0).

dat

The dataframe containing the variables of which we should check the integrity.

newValue

The new value to be assigned to cases not satisfying the specified conditions.

removeCases

Whether to delete cases that do not satisfy the criterion from the dataframe (if FALSE, they're not deleted, but the offending value is replaced by newValue).

validValueSuffix

Suffix to append to variable names when creating variable names for new variables that contain TRUE and FALSE to specify for each original variable whether its value satisfied the specified criterion.

newValueSuffix

If replace is FALSE, original values are not replaced, but instead new variables are created where the offending values have been replaced. This suffix is appended to each original variable name to create the new variable name.

totalVarName

This is the name of a variable that contains, for each case, the total number of invalid values among all variables checked.

append

Whether to append the columns to the dataframe, or only return the new columns.

replace

Whether to replace the offending values with the value specified in newValue or whether to create new columns (see newValueSuffix).

silent

Whether to display the log, or only set it as attribute of the returned dataframe.

rmarkdownOutput

Whether to format the log so that it's ready to be included in RMarkdown reports.

callingSelf

For internal use; whether the function calls itself.

Value

The dataframe with the corrections, and the log stored in attribute checkDataIntegrity_log.

Examples

### Default behavior: return dataframe with ### offending values replaced by NA checkDataIntegrity(c('mpg', '<30'), mtcars);
#> Matching cases to criterion '<30' for all variables matching regular expression 'mpg'. #> In 4 rows, for variable 'mpg', replacing values that do not satisfy criterion 'mpg<30' with 'NA'.
#> mpg cyl disp hp drat wt qsec vs am gear carb #> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 #> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 #> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 #> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 #> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 #> Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 #> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 #> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 #> Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 #> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 #> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 #> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 #> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 #> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 #> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 #> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 #> Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 #> Fiat 128 NA 4 78.7 66 4.08 2.200 19.47 1 1 4 1 #> Honda Civic NA 4 75.7 52 4.93 1.615 18.52 1 1 4 2 #> Toyota Corolla NA 4 71.1 65 4.22 1.835 19.90 1 1 4 1 #> Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 #> Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 #> AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 #> Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 #> Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 #> Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 #> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 #> Lotus Europa NA 4 95.1 113 3.77 1.513 16.90 1 1 5 2 #> Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 #> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 #> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 #> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 #> mpg_validValue numberOfInvalidValues #> Mazda RX4 TRUE 0 #> Mazda RX4 Wag TRUE 0 #> Datsun 710 TRUE 0 #> Hornet 4 Drive TRUE 0 #> Hornet Sportabout TRUE 0 #> Valiant TRUE 0 #> Duster 360 TRUE 0 #> Merc 240D TRUE 0 #> Merc 230 TRUE 0 #> Merc 280 TRUE 0 #> Merc 280C TRUE 0 #> Merc 450SE TRUE 0 #> Merc 450SL TRUE 0 #> Merc 450SLC TRUE 0 #> Cadillac Fleetwood TRUE 0 #> Lincoln Continental TRUE 0 #> Chrysler Imperial TRUE 0 #> Fiat 128 FALSE 1 #> Honda Civic FALSE 1 #> Toyota Corolla FALSE 1 #> Toyota Corona TRUE 0 #> Dodge Challenger TRUE 0 #> AMC Javelin TRUE 0 #> Camaro Z28 TRUE 0 #> Pontiac Firebird TRUE 0 #> Fiat X1-9 TRUE 0 #> Porsche 914-2 TRUE 0 #> Lotus Europa FALSE 1 #> Ford Pantera L TRUE 0 #> Ferrari Dino TRUE 0 #> Maserati Bora TRUE 0 #> Volvo 142E TRUE 0
### Check two conditions, and instead of returning the ### dataframe with the results appended, only return the ### columns indicating which cases 'pass', what the new ### values would be, and how many invalid values were ### found for each case (to easily remove cases that ### provided many invalid values) checkDataIntegrity(list(c('mpg', '<30'), c('gear', '<5')), mtcars, append=FALSE);
#> Matching cases to criterion '<30' for all variables matching regular expression 'mpg'. #> In 4 rows, for variable 'mpg', replacing values that do not satisfy criterion 'mpg<30' with 'NA'. #> Matching cases to criterion '<5' for all variables matching regular expression 'gear'. #> In 5 rows, for variable 'gear', replacing values that do not satisfy criterion 'gear<5' with 'NA'.
#> mpg cyl disp hp drat wt qsec vs am gear carb #> Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 #> Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 #> Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 #> Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 #> Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 #> Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 #> Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 #> Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 #> Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 #> Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 #> Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 #> Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 #> Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 #> Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 #> Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 #> Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 #> Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 #> Fiat 128 NA 4 78.7 66 4.08 2.200 19.47 1 1 4 1 #> Honda Civic NA 4 75.7 52 4.93 1.615 18.52 1 1 4 2 #> Toyota Corolla NA 4 71.1 65 4.22 1.835 19.90 1 1 4 1 #> Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 #> Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 #> AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 #> Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 #> Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 #> Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 #> Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 NA 2 #> Lotus Europa NA 4 95.1 113 3.77 1.513 16.90 1 1 NA 2 #> Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 NA 4 #> Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 NA 6 #> Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 NA 8 #> Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 #> mpg_validValue numberOfInvalidValues gear_validValue #> Mazda RX4 TRUE 0 TRUE #> Mazda RX4 Wag TRUE 0 TRUE #> Datsun 710 TRUE 0 TRUE #> Hornet 4 Drive TRUE 0 TRUE #> Hornet Sportabout TRUE 0 TRUE #> Valiant TRUE 0 TRUE #> Duster 360 TRUE 0 TRUE #> Merc 240D TRUE 0 TRUE #> Merc 230 TRUE 0 TRUE #> Merc 280 TRUE 0 TRUE #> Merc 280C TRUE 0 TRUE #> Merc 450SE TRUE 0 TRUE #> Merc 450SL TRUE 0 TRUE #> Merc 450SLC TRUE 0 TRUE #> Cadillac Fleetwood TRUE 0 TRUE #> Lincoln Continental TRUE 0 TRUE #> Chrysler Imperial TRUE 0 TRUE #> Fiat 128 FALSE 1 TRUE #> Honda Civic FALSE 1 TRUE #> Toyota Corolla FALSE 1 TRUE #> Toyota Corona TRUE 0 TRUE #> Dodge Challenger TRUE 0 TRUE #> AMC Javelin TRUE 0 TRUE #> Camaro Z28 TRUE 0 TRUE #> Pontiac Firebird TRUE 0 TRUE #> Fiat X1-9 TRUE 0 TRUE #> Porsche 914-2 TRUE 1 FALSE #> Lotus Europa FALSE 2 FALSE #> Ford Pantera L TRUE 1 FALSE #> Ferrari Dino TRUE 1 FALSE #> Maserati Bora TRUE 1 FALSE #> Volvo 142E TRUE 0 TRUE