HRI API Specification¶
The HRI consists of two separate APIs: the Management API and Apache Kafka.
Management API Specification¶
The Management API is defined using the OpenAPI 3.0 specification: management.yml. You can open the file directly or use a program such as IntelliJ or Swagger UI to view it.
HRI Tenants & Elasticsearch Indices¶
HRI has been designed with a multi-tenant cloud architecture. The API mainly contains methods for managing tenants, such as creating, getting, and deleting them. Each of these calls takes a tenantId. The suffix -batches is appended to the tenantId to form the name of the Elasticsearch index where all of that tenant's batch metadata is stored.
A Get call without a tenantId will return a list of all tenants. The Get call, when given a tenantId, will return information on the Elastic index of a specific tenant.
Below is a table of the fields returned by this call:
Field | Description |
---|---|
health | health of the Elastic cluster |
status | status of the index, can be open or closed |
index | the name of the index, which will be the tenantId with -batches appended to it |
uuid | universally unique identifier |
pri | number of primary shards |
rep | number of replicas |
docs.count | number of batch documents stored in the index |
docs.deleted | number of batch documents deleted from the index |
store.size | store size taken by primary and replica shards |
pri.store.size | store size taken only by primary shards |
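The index naming rule described above can be sketched in a few lines of Python. The helper names `index_name_for_tenant` and `tenant_id_from_index` are hypothetical and are not part of the HRI API; the sketch assumes only the -batches suffix stated in this section.

```python
BATCHES_SUFFIX = "-batches"  # suffix stated in this specification


def index_name_for_tenant(tenant_id: str) -> str:
    """Return the Elasticsearch index name for a tenant (hypothetical helper)."""
    if not tenant_id:
        raise ValueError("tenant_id must not be empty")
    return tenant_id + BATCHES_SUFFIX


def tenant_id_from_index(index: str) -> str:
    """Recover the tenantId from the `index` field returned by the Get call."""
    if not index.endswith(BATCHES_SUFFIX):
        raise ValueError(f"not a batches index: {index!r}")
    return index[: -len(BATCHES_SUFFIX)]
```

This can be handy when mapping the index field of a Get response back to the tenant it belongs to.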
Batches¶
The API contains methods for managing batches like creating, getting, and updating. Below is a table of the fields:
Field | Description |
---|---|
id | auto generated unique ID |
name | name of the batch, provided by the Data Integrator |
integratorId | unique ID of the Data Integrator that created this batch |
topic | Kafka topic that contains the data, provided by the Data Integrator |
dataType | the type of data, provided by the Data Integrator |
status | status of the batch: [ started, sendCompleted, completed, terminated, failed ] |
startDate | the date and time the batch was started |
endDate | the date and time the batch was completed, terminated, or failed |
expectedRecordCount | the number of records in the batch, provided by the Data Integrator when calling ‘sendComplete’ |
recordCount (deprecated) | the number of records in the batch, provided by the Data Integrator when calling ‘sendComplete’. Replaced by expectedRecordCount and deprecated in v2.0.0 |
actualRecordCount | the number of records received, calculated by validation processing |
invalidThreshold | the number of invalid records allowed in this batch before the batch fails validation, provided by the Data Integrator; defaults to -1 (infinite) |
invalidRecordCount | the number of invalid records, calculated by validation processing |
metadata | custom JSON value, optional |
Only the name, topic, and dataType fields are required when creating a batch.
The invalidThreshold field is used by validation processing: when this many invalid records are encountered, the HRI determines that the entire batch has failed validation. The number of invalid records actually encountered is reported in the invalidRecordCount field.
The expectedRecordCount field is provided by the Data Integrator when calling the ‘sendComplete’ endpoint, and is therefore not always present. The recordCount field is identical and provides backward compatibility with older versions of HRI. recordCount is deprecated as of release v2.0.0 and will be removed in a later release.
The metadata field is optional and allows the Data Integrator to include any additional information about the batch that Data Consumers might request. This information will be included in all notification messages.
All other fields are generated by the API.
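As an illustration of the required and optional fields, the following Python sketch builds a JSON body for creating a batch. The function name `build_create_batch_body` and the example values are hypothetical; the exact request shape is defined in management.yml, so treat this as a sketch under the assumption that the body is a flat JSON object using the field names from the table above.

```python
import json
from typing import Any, Dict, Optional


def build_create_batch_body(
    name: str,
    topic: str,
    data_type: str,
    invalid_threshold: Optional[int] = None,
    metadata: Optional[Dict[str, Any]] = None,
) -> str:
    """Build a JSON body for a create-batch request (hypothetical helper)."""
    # name, topic, and dataType are the only required fields.
    for field, value in (("name", name), ("topic", topic), ("dataType", data_type)):
        if not value:
            raise ValueError(f"{field} is required when creating a batch")

    body: Dict[str, Any] = {"name": name, "topic": topic, "dataType": data_type}
    # Optional fields are only sent when the caller supplies them, so the
    # documented default for invalidThreshold (-1) is left to the API.
    if invalid_threshold is not None:
        body["invalidThreshold"] = invalid_threshold
    if metadata is not None:
        body["metadata"] = metadata
    return json.dumps(body)
```

Fields such as id, status, and startDate are deliberately absent because they are generated by the API.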
Streams¶
The API also contains methods for managing Kafka topics like creating, getting, and deleting. Below is a table of the fields:
Field | Description |
---|---|
id | stream ID, consisting of a data integrator and optional qualifier, delimited by ‘.’ |
numPartitions | the number of partitions on the topic |
retentionMs | length of time in milliseconds before log segments are automatically discarded from a partition |
retentionBytes | optional maximum size in bytes that a partition can grow before discarding log segments |
cleanupPolicy | optional retention policy on old log segments |
segmentMs | optional time in milliseconds after which Kafka will force the log to roll even if the segment file isn’t full |
segmentBytes | optional log segment file size in bytes |
segmentIndexBytes | optional size in bytes of the index that maps offsets to file positions |
Only the numPartitions and retentionMs fields are required when creating a stream. The rest of the topic configurations (retentionBytes, cleanupPolicy, segmentMs, segmentBytes, and segmentIndexBytes) are optional. Below is a table of the default values and acceptable ranges for these optional fields:
Field | Default value | Acceptable values/ranges |
---|---|---|
retentionBytes | 1073741824 | [10485760..1073741824] |
cleanupPolicy | delete | [ delete, compact ] |
segmentMs | nil | [300000..2592000000] |
segmentBytes | 536870912 | [10485760..536870912] |
segmentIndexBytes | nil | [102400..104857600] |
If the cleanupPolicy field is set to compact, time-based deletion is disabled and the value set for the retentionMs field is ignored.
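A client might check a stream request locally before sending it. The Python sketch below applies the defaults and ranges from the table above. The function name `resolve_stream_config` is hypothetical, `None` stands in for the nil defaults, and the ranges are assumed to be inclusive at both ends; the Management API remains the authority on validation.

```python
from typing import Any, Dict

# Defaults and ranges copied from the table above; None stands in for "nil".
OPTIONAL_DEFAULTS: Dict[str, Any] = {
    "retentionBytes": 1073741824,
    "cleanupPolicy": "delete",
    "segmentMs": None,
    "segmentBytes": 536870912,
    "segmentIndexBytes": None,
}
NUMERIC_RANGES = {
    "retentionBytes": (10485760, 1073741824),
    "segmentMs": (300000, 2592000000),
    "segmentBytes": (10485760, 536870912),
    "segmentIndexBytes": (102400, 104857600),
}
CLEANUP_POLICIES = ("delete", "compact")


def resolve_stream_config(requested: Dict[str, Any]) -> Dict[str, Any]:
    """Fill in defaults and range-check a stream request (hypothetical helper)."""
    for required in ("numPartitions", "retentionMs"):
        if required not in requested:
            raise ValueError(f"{required} is required when creating a stream")

    # Caller-supplied values override the documented defaults.
    config = {**OPTIONAL_DEFAULTS, **requested}

    if config["cleanupPolicy"] not in CLEANUP_POLICIES:
        raise ValueError(f"cleanupPolicy must be one of {CLEANUP_POLICIES}")
    for field, (low, high) in NUMERIC_RANGES.items():
        value = config[field]
        # Unset (nil) optional values are skipped; ranges assumed inclusive.
        if value is not None and not low <= value <= high:
            raise ValueError(f"{field}={value} is outside [{low}..{high}]")
    return config
```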
Apache Kafka¶
Apache Kafka has its own API, and clients are available for most languages. If using IBM Event Streams, see its documentation for details on connection parameters. Below are the requirements on the records written to and read from Kafka.
Health Input Data - FHIR Model¶
HRI does not impose any requirements on the format of the content of the Health (data) records written to Kafka, although Alvearie has selected FHIR as the preferred data model for all Health Data. See their FHIR implementation guide for more details. Data Integrators and Data Consumers must work together to agree on the specifics of the input data such as format and frequency.
HRI-Specific Requirements¶
The HRI does have the following requirements and recommendations:
- Batch ID Header - every record must have a header entry with the batch ID that uses the key batchId. Data Integrators may include any additional header values, which will get passed downstream to consumers.
- Zstd Compression - use zstd compression when writing to Kafka by setting the compression.type producer configuration. Event Streams throttles network usage and limits Kafka messages to 1 MB. Using compression will help prevent an Event Streams bottleneck.
- 1 MB Message Limit - Event Streams limits messages to 1 MB. The Kafka producer offers no way to directly limit the message size after compression is applied. The max.request.size producer configuration is applied before compression. The batch.size producer configuration can be set to limit the batching of records, but it can also affect performance. We recommend doing performance testing to determine appropriate values based on your data. For records over 1 MB compressed, there are two strategies:
    - External References - for records that have large binary attachments like images or PDFs, you may provide a reference to the resource in the message, rather than the (large) resource itself. For example, you could put a COS Object URL, or some other external data store URL, and key into the message.
    - Splitting up Records - records can be split into smaller parts, sent through the HRI, and re-assembled by downstream consumers.
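The header and compression requirements above can be sketched without committing to a particular Kafka client. The function names `producer_config` and `record_headers` are hypothetical; `bootstrap.servers` and `compression.type` are standard Kafka producer configuration names. Rejecting an extra header that reuses the batchId key is a choice made for this sketch, not a documented HRI rule. Connection parameters such as those needed for Event Streams are omitted.

```python
from typing import Dict, List, Optional, Tuple

BATCH_ID_HEADER = "batchId"  # header key required by the HRI


def producer_config(bootstrap_servers: str) -> Dict[str, str]:
    """Minimal producer settings with zstd compression (hypothetical helper)."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "compression.type": "zstd",
    }


def record_headers(
    batch_id: str, extra: Optional[Dict[str, str]] = None
) -> List[Tuple[str, bytes]]:
    """Build record headers: the required batchId plus optional extras."""
    if not batch_id:
        raise ValueError("every record must carry a batchId header")
    headers = [(BATCH_ID_HEADER, batch_id.encode("utf-8"))]
    for key, value in (extra or {}).items():
        # Sketch-level safeguard: do not let an extra header shadow batchId.
        if key == BATCH_ID_HEADER:
            raise ValueError("batchId is reserved for the batch ID")
        headers.append((key, value.encode("utf-8")))
    return headers
```

Most Kafka clients accept a configuration mapping when constructing a producer and a list of key/value headers on each record; consult your client's documentation for the exact form.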
Notification Messages¶
The notification messages are JSON-encoded batches. They match the schema returned by the Management API described above, which is also defined here: batchNotification.json.
Invalid Record Notifications¶
When validation encounters an invalid record, an invalid record notification is written to the *.invalid topic. It contains a failure message, the batchId, and a pointer to the original record. Below is a table of the fields, and the JSON schema is defined here: invalidRecord.json.
Field | Description |
---|---|
batchId | ID of the batch that the original record belongs to |
failure | the description of why the original record was invalid |
topic | the topic of the original record |
partition | the partition of the original record |
offset | the offset of the original record |
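A consumer of the *.invalid topic might decode a notification as in the Python sketch below. The class and function names are hypothetical. The sketch assumes the message value is a JSON object with the five fields from the table at the top level, and that partition and offset are integers; invalidRecord.json is the authoritative schema.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class InvalidRecordNotification:
    batch_id: str
    failure: str
    topic: str
    partition: int
    offset: int


def parse_invalid_record(raw: bytes) -> InvalidRecordNotification:
    """Decode an invalid record notification (hypothetical helper)."""
    doc = json.loads(raw)
    return InvalidRecordNotification(
        batch_id=doc["batchId"],
        failure=doc["failure"],
        topic=doc["topic"],
        # topic, partition, and offset together point at the original record.
        partition=int(doc["partition"]),
        offset=int(doc["offset"]),
    )
```

The topic, partition, and offset values can then be used to locate the original record in Kafka for inspection.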