Studying Crime and Place with the Crime Open Database

The study of spatial and temporal crime patterns is important both for academic understanding of crime-generating processes and for policies aimed at reducing crime. However, studying crime and place is often made more difficult by restrictions on access to appropriate crime data. This means understanding of many spatio-temporal crime patterns is limited to data from a single geographic setting, and there are few attempts at replication. This article introduces the Crime Open Database (CODE), a database of 16 million offenses from 10 of the largest United States cities over 11 years and more than 60 offense types. Open crime data were obtained from each city, having been published in multiple incompatible formats. The data were processed to harmonize geographic co-ordinates, dates and times, offense categories and location types, and to add census and other geographic identifiers. The resulting database allows the wider study of spatio-temporal patterns of crime across multiple US cities, enabling greater understanding of variations in the relationships between crime and place across different settings, as well as facilitating replication of research.


Introduction
Research on crime and place has become a substantial field within the broader interdisciplinary study of crime and its effects. Numerous studies have demonstrated that crime is concentrated in a few places while no crime at all occurs in most others (Johnson, 2010) and that patterns of land use can generate crime hotspots (Wilcox & Eck, 2011). These findings have implications for policy and practice, be it in decisions about where to deploy police officers (Ratcliffe, Taniguchi, Groff, & Wood, 2011) or the management of potentially criminogenic land uses (Eck & Wartell, 1999). Temporal crime patterns, and particularly interactions between spatial and temporal patterns, are also important (Tompson & Coupe, 2018). For example, many types of crime exhibit seasonal patterns (Andresen & Malleson, 2013), while a daily 'wave' of street crime hits cities in the late afternoon (Wheeler & Haberman, 2018).
It is important to research individual crime types, rather than considering crime as an undifferentiated whole (Cornish & Smith, 2012). Many spatial and temporal patterns differ across crime types, with (for example) residential burglary typically peaking in the afternoon while commercial burglaries most often happen overnight (Butler, 1994).
Also important is the choice of units of analysis. Although there is no single best analytical scale (Hipp, 2007), recent scholars have suggested that in spatial analysis "smaller is better" (Oberwittler & Wikström, 2009) because many environmental influences on crime operate at small spatial scales (Steenbeek & Weisburd, 2016). The same is true in temporal analysis, since some of the largest fluctuations in crime happen over short timescales such as hours and days (Felson & Poulsen, 2003).
Conducting analysis at these important micro scales requires micro-level data. For example, knowing on which street a crime occurred is more useful than only knowing the crime occurred in a certain city. Similarly, knowing the date and time of an offense is more valuable than knowing only the month or year. This makes many sources of official crime data, such as the Federal Bureau of Investigation (FBI) Uniform Crime Reports (UCR), of only limited value in studying crime and place.
Many researchers have obtained micro-scale crime data by forming partnerships with a single local police agency, typically that covering the city in which the research team is based. The historically secretive nature of many police organizations and the importance of maintaining privacy for crime victims mean such relationships often require substantial investment of time to establish trust between researchers and police leaders. When police do share data, it is typically on condition that it be stored securely and not shared outside the original research team (Kane, 2007). Single-city studies based on confidential data have generated valuable results on crime and place, but have at least two limitations. Firstly, the importance of environmental characteristics in generating spatio-temporal crime patterns calls into question the generalizability of studies conducted in a single setting. For example, Los Angeles is 54% larger in area than New York City but has less than half the population, creating a very different urban fabric. Social context differs, too: the median income of San Francisco, for example, is three times that of Detroit (US Census Bureau, 2018).
Secondly, studies using confidential data can be difficult to replicate, since other research teams typically cannot access the data or other materials. Replication is important both to identify errors (e.g. of method) in existing studies and to understand the extent and nature of outcome variation in different environments (Lösel, 2017; Weisburd & Taxman, 2000). However, in practice very few criminological studies are subject to replication (McNeeley & Warner, 2015).
Some phenomena in crime and place have been studied across multiple settings, such as the elevated short-term 'near repeat' risk of burglary victimization experienced by households surrounding a previous offense location (Grove, Farrell, Farrington, & Johnson, 2012). Nevertheless, the extensive data-access negotiations that usually precede such research mean that results from different cities can often be used only after several years. This may be too late for policy makers, who must typically make decisions over shorter time scales.
Given these limitations, a source of crime data that allows both simultaneous study of multiple geographic settings and sharing of data with other researchers would potentially benefit research into crime and place. In recent years, police agencies have begun to release crime data as 'open data', typically as part of efforts to increase transparency (Caplan, Rosenblat, & Boyd, 2015). Open data is information that is freely available for anyone to use and share with others (Open Knowledge International, 2018). Open crime data are popular with both citizens and journalists (Stoneman, 2015), and also provide a potential source of information for a multi-city crime dataset.
Since they are released for non-research purposes, open crime data are not necessarily immediately useful for criminological studies across cities. For example, crime types are typically published using categories that are bespoke to a particular city or state. More prosaically, elements such as dates or geographic co-ordinates are often provided in different, incompatible formats. As such, further processing is necessary to maximize the usefulness of these datasets. The remainder of this article introduces the Crime Open Database (CODE), a dataset combining harmonized open crime data from large cities in the United States that can be used for multi-city studies of crime and place.

Data Sources
The municipal websites of the 50 largest cities in the United States by population were checked for a source of open crime data for inclusion in CODE. A city was included in the database if data were available:
1. for at least four consecutive years (to allow longitudinal analysis),
2. at the offense or incident level (rather than, for example, aggregated offense counts),
3. including either geographic co-ordinates or offense addresses (to allow analysis at micro scales),
4. including the date and (optionally) the time at which the offense occurred,
5. with sufficient information about the offense type to map to a harmonized set of crime categories.
The United States was chosen because its laws (e.g. on personal data) and well-established open-data movement (Ubaldi, 2013) mean US cities may be more likely than cities in other countries to release open data with sufficient detail. For example, although the UK Home Office releases a national open crime dataset at www.police.uk, it lacks detail in that it does not specify the date on which an offense occurred and uses very broad offense types.
Ten cities met these criteria and are shown in Figure 1. Links to the original data sources and brief explanations of why other cities could not be included are provided on the project website. It is hoped that more cities will be included in future as they begin to release suitable open crime data.

Data Processing
Data from each city were processed to allow their use in multi-site research. Dates were converted to a consistent format and local co-ordinate reference systems were converted to latitude/longitude pairs using the WGS 84 co-ordinate system.
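As a rough illustration of the date-harmonization step described above, the following Python sketch parses timestamps written in several formats and returns them in one consistent format. It is not the processing code used for the database (which was written in R); the list of input formats, the output format and the function name `harmonize_datetime` are assumptions made for this example.

```python
from datetime import datetime

# Hypothetical input formats: the text does not list the formats
# actually published by each city, so these are illustrative only.
CANDIDATE_FORMATS = (
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%dT%H:%M:%S",
    "%m/%d/%Y %I:%M:%S %p",
    "%m/%d/%Y %H:%M",
    "%m/%d/%Y",
)

# Assumed target format for the harmonized field.
OUTPUT_FORMAT = "%Y-%m-%d %H:%M"


def harmonize_datetime(raw: str) -> str:
    """Return `raw` re-expressed in OUTPUT_FORMAT.

    Each candidate format is tried in turn; the first one that parses
    the whole string is used. Date-only values default to midnight.
    """
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # this format does not match, try the next one
        return parsed.strftime(OUTPUT_FORMAT)
    raise ValueError(f"unrecognized date format: {raw!r}")
```

Raising an error for unrecognized values, rather than silently returning a missing value, makes unexpected formats visible during processing. Re-projecting local co-ordinates to WGS 84 is not shown because it would require a dedicated geospatial library.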
Geocoded offense locations were included in the open data published by all but three cities. For Fort Worth, Kansas City and Louisville, addresses were geocoded using the US Census Geocoder accessed via the MapChi package in R (Welgus, 2018), with residual ungeocoded locations resolved using the commercial Geocod.io geocoding API. To ensure the privacy of crime victims, cities releasing open crime data typically introduce small inaccuracies into the addresses or co-ordinates reported for each crime. This process differs between cities, but CODE offense locations can be considered to be accurate to the nearest hundred block on a particular street.[1] This spatial inaccuracy prevents analysis at the individual address level using CODE data. However, many studies do not require address-level analysis, particularly since the inaccuracy of locations reported by victims means apparent address-level data accuracy is sometimes spurious (Ratcliffe, 2002). Each CODE offense record is supplemented with the state, county, census tract, block group and census block for the geocoded location.
To produce a harmonized typology of offense types, offense categories in each city were manually mapped to the 52 offense types included in the FBI National Incident-Based Reporting System (NIBRS) using the offense definitions published in Federal Bureau of Investigation (2018). Since the NIBRS categories do not distinguish between commercial and personal robbery, or between residential and non-residential burglary, these offenses were further distinguished where city crime categories allowed. The final categorization is shown in Table 1.
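The manual mapping described above is in effect a lookup table from each city's own categories to a harmonized type. The Python sketch below shows one way such a table could be applied. The city names, local category strings and sub-type labels are all invented for illustration, and the real lookup tables are those published on the project website.

```python
from typing import Iterable, NamedTuple, Optional


class HarmonizedOffense(NamedTuple):
    nibrs_type: str         # offense type used for harmonization
    subtype: Optional[str]  # extra distinction where the city data allow


# Hypothetical lookup table keyed by (city, local category), both lowercased.
LOOKUP = {
    ("example city a", "burglary - residence"):
        HarmonizedOffense("Burglary/Breaking & Entering", "residential"),
    ("example city a", "burglary - business"):
        HarmonizedOffense("Burglary/Breaking & Entering", "non-residential"),
    ("example city b", "robbery of person"):
        HarmonizedOffense("Robbery", "personal"),
    ("example city b", "agg assault"):
        HarmonizedOffense("Aggravated Assault", None),
}


def harmonize_offense(city: str, local_category: str) -> Optional[HarmonizedOffense]:
    """Look up the harmonized type for a city-specific category.

    Returns None when the category has not been mapped.
    """
    key = (city.strip().lower(), local_category.strip().lower())
    return LOOKUP.get(key)


def unmapped_categories(records: Iterable[tuple]) -> list:
    """List the distinct (city, category) pairs that have no mapping yet."""
    return sorted(
        {(city, category) for city, category in records
         if harmonize_offense(city, category) is None}
    )
```

Listing unmapped categories separately is useful because a manual mapping must be extended whenever a city introduces a new local category.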
The inclusion of offense types varies between cities. While some offenses (such as aggravated assault) are recorded in every CODE city, others (such as gambling offenses) are included only by some. Cities may exclude offense types for several reasons, perhaps because an activity is not criminal in a particular state or to preserve victim privacy (particularly for sex offenses). While these exclusions limit the analysis possible using CODE data, this is an inevitable consequence of using open data. Nevertheless, multi-site studies remain possible for almost all crime types, albeit using fewer cities for some offenses.
[1] Depending on the configuration of each city, a hundred block (i.e. the length of a street between buildings with numbers 100 apart, e.g. between 101 and 201 Main Street) may not always be the same as a physical block.
It can be useful to disaggregate crimes according to the type of place in which they occur. For example, spatio-temporal patterns of assaults committed in bars may vary substantially from those in private homes. To facilitate such analysis, CODE includes fields for harmonized location category and location type (Table 2) for the five cities (Chicago, Fort Worth, Los Angeles, Louisville and New York) that include location descriptions or categories in their open crime data.
The project website at www.osf.io/zyaqn contains technical documentation for the database, lookup tables for offense and location types, summaries of quality-assurance tests undertaken on the data and links to the R code used to process it.

Data
- Crime Open Database (CODE) deposited at Open Science Framework - URL: https://www.osf.io/zyaqn
- Crime Open Database - Data files - URL: https://www.osf.io/zyaqn/files
- Temporal coverage: 2007-2017
After being processed into a consistent format, data from each city were filtered to allow analysis across cities. Incidents occurring before 1 January 2007 (or the first complete year for which each city has published data; see Table 3), located outside the city boundary, or which were non-criminal in nature (e.g. traffic collisions) were excluded. Comparing the final dataset to the data published by each city demonstrates the importance of this processing and filtering. In Tucson, for example, 45% of records in the original data were for non-criminal matters such as traffic incidents. Meanwhile, some cities' data included a small number of historical offenses that occurred decades ago but have only been recently recorded. The resulting dataset includes a single record for each criminal offense recorded by police in CODE cities.
CODE data are released under a Creative Commons (CC) Attribution 4.0 International license in compressed comma-separated values files. Two files are provided for each year. The first contains only those variables that have been harmonized across cities, including geographic co-ordinates, census identifiers, dates and offense types. These 'core' files are likely to be sufficient for almost all research purposes. If researchers wish to make use of unharmonized fields in the data published by specific cities, these can be found (together with the core fields) in the second, 'extended' data files.
CODE data can also be accessed using the crimedata package for the R statistical programming language, which can be downloaded from the Comprehensive R Archive Network (CRAN) at https://cran.r-project.org/package=crimedata
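The three exclusion rules described in this section (date, city boundary and non-criminal incidents) can be summarized in a short Python sketch. The field names `offense_date`, `in_city_boundary` and `is_criminal` are hypothetical, and the boundary test is assumed to have been computed beforehand. The start date is taken here as the later of 1 January 2007 and the start of the city's first complete year, which is one reading of the rule rather than a documented implementation.

```python
from datetime import date
from typing import Optional

# Earliest date retained for any city, as stated in the text.
DEFAULT_START = date(2007, 1, 1)


def keep_record(record: dict, first_complete_year: Optional[int] = None) -> bool:
    """Return True if a record passes all three filters.

    `record` is assumed to carry three hypothetical fields:
      - offense_date (datetime.date)
      - in_city_boundary (bool, precomputed from the offense location)
      - is_criminal (bool, False for e.g. traffic collisions)
    """
    start = DEFAULT_START
    if first_complete_year is not None:
        # Cities whose published data begin later start at their
        # first complete year instead of 2007.
        start = max(DEFAULT_START, date(first_complete_year, 1, 1))
    return (
        record["offense_date"] >= start
        and record["in_city_boundary"]
        and record["is_criminal"]
    )
```

For example, `[r for r in records if keep_record(r, first_complete_year=2010)]` would retain only criminal offenses inside the city boundary from 2010 onwards.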

Research Potential
CODE contains records for more than 16 million offenses across 11 years and more than 60 offense types. The size, detail and spatial variety of this dataset open up new opportunities for research into crime and place. CODE is likely to be useful in three ways. Firstly, the large sample size available in this dataset will allow more-detailed study of some aspects of crime and place. For example, some environmental characteristics may be relatively rare in a single city, potentially leading to high uncertainty about relationships between those characteristics and crime that can be reduced by including data from multiple cities. Secondly, using data from different cities allows understanding of a wider range of potentially criminogenic environments that may not be present in every city. For example, inland cities typically do not have beaches, while many cities are not adjacent to military bases, making it impossible for single-city studies to explore the influences of the full range of settings that might be encountered elsewhere. Thirdly, it is possible to study differences in the effects of similar environments on crime in different cities. For example, is the relationship between high schools and crime in the surrounding area the same in different cities? CODE is likely to be particularly useful in improving the generalizability of crime-and-place research.
While CODE offers new opportunities for studying crime and place, care must be taken to understand its limitations. The database is primarily intended for the study of spatio-temporal patterns of crime at micro scales, and in particular how those patterns differ across geographic settings. It is not likely to be useful for research that uses the city as the unit of analysis, such as studies analyzing associations between crimes and various city-level social policies. Official sources of crime data such as the UCR may be more appropriate in such cases. Because laws and practices vary between cities, CODE is also unlikely to be useful for studies that attempt to study 'crime' as an undifferentiated whole. While there are overlaps between spatio-temporal patterns for different types of crime (Newton, 2015), combining for analysis such crimes as (for example) welfare fraud and driving under the influence is unlikely to be productive.
Since CODE is based on police crime records, it inherits the limitations of those data. The principal limitation is the "dark figure" of crime (Grünhut, 1951, p. 149), the difference between the number of crimes that occur and the number that are reported to police (for a discussion, see Maguire & McVie, 2017). The proportion of crimes appearing in police data varies by type: for example, almost all vehicle thefts are reported while most assaults without injury are not (Gove, Hughes, & Geerken, 1985; Tarling & Morris, 2010). A particular problem is those crimes (such as drugs possession) that usually only become known to police if officers catch the offender in the act. For these "intangible" offenses (Chappell & Walsh, 1974, p. 494), police records reflect patterns of police activity more than the underlying distribution of offenses.
Although the dark figure is an acknowledged problem in criminological research, official figures are often the only available source of micro-scale spatio-temporal crime data (Brantingham & Brantingham, 1975). Attempts have been made to use other sources of data to capture unreported offenses (Solymosi & Bowers, 2018), but these have their own limitations. In many cases, police records may be the best available source of crime data.
Open crime data have further limitations. While a researcher with an ongoing relationship with a particular police agency may be able to request access to particular data fields, researchers using open data (particularly from multiple cities) are typically limited to whatever information the agency has decided to release. In addition, not having a direct relationship with those who generate and process crime data may make it harder for open-data researchers to understand any specific issues relating to data from a particular city.
Balancing the impact of these limitations against the benefits of CODE data outlined above is a decision best taken in consideration of the individual needs of particular research studies. While all data sources have limitations, the emerging crime-and-place research that has used open data (Solymosi, Ashby, Cohen, & Sidebottom, 2017; Tompson, Johnson, Ashby, Perkins, & Edwards, 2014) suggests open crime datasets can complement existing alternatives. This is likely to be increasingly the case as more cities begin to release open crime data, particularly if that allows for comparison of places in different countries.

Figure 1 Cities included in the Crime Open Database

Table 1 Crime-type data available for each CODE city

Table 3 Years of data available for each CODE city