Harvesting & Importing
Note
All of these commands should be run within the catalog or cron
Docker container
Script Summary
- pc-import-folio: Used to harvest and import FOLIO MARC data into the biblio collection of Solr.
- pc-import-hlm: Used to harvest and import EBSCO MARC data into the biblio collection of Solr from the FTP location EBSCO has given us access to. The records contain the HLM dataset that is missing from FOLIO's database.
- pc-import-authority: Used to harvest and import MARC data from Backstage into the authority collection in Solr from the FTP location provided by Backstage.
- cron-reserves.sh: Used to refresh the course reserves data in the reserves index on a nightly schedule.
- pc-full-import: Wrapper script for the pc-import-folio and pc-import-hlm scripts to perform a full import of data.
HLM Data
One of our sources is HLM (Holdings Link Management): files from EBSCO's FTP server describing electronic resources our library has access to that are not in FOLIO. To automate the retrieval and import of these files we have a wrapper script, pc-import-hlm.
Since EBSCO occasionally sends us full sets of data, we want to be able to exclude all of their previous sets. To do this we have an ignore file, /mnt/shared/oai/ignore_patterns.txt; any file whose name contains one of the substrings listed there will be ignored. Also, we will only harvest files that match the patterns *.m*c and *.zip. *.m*c files will be imported, with the exception of -del*.m*c (case insensitive) or .delete files, which will instead be tagged as deletion files.
If you simply want to see what files are on the FTP server currently, you can
run our helper script, pc-list-hlm-remote, to list all the files. If you
want to download a specific file, run pc-get-hlm-remote [NAME OF FILE].
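For example, to list the remote files and then fetch one of them (the file name shown here is purely hypothetical; substitute a real name from the listing):

# List everything currently on EBSCO's FTP server
pc-list-hlm-remote
# Download a single file by name
pc-get-hlm-remote msu_full_2024.zip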
Full Data Harvests
This section describes the steps needed to re-harvest all the data from each source.
FOLIO
- Notify your hosting provider (EBSCO in our case) prior to your harvest attempt so that they can allocate appropriate additional resources and advise you on the window in which you should start the harvest.
- Ensure that the OAI settings on the FOLIO tenant are what you want them to be for this particular harvest. For example, if you wish to include storage and inventory records (i.e. the records without a MARC source) then you will need to modify the "Record Source" field in the OAI Settings in FOLIO.
- Clear out the contents of the harvest_folio directory before the next cron job runs. Assuming you want to preserve the last harvest for the time being, you can simply move those directories somewhere else and rename them. Below is an example, but certainly not the only option. The only goal is that the harvest_folio directory has no files in it; it can still contain the log and processed directories as long as they are empty (they technically can have files in them, you just will not want them to, since those files would get mixed in with your new harvest).

STACK_NAME=catalog-preview
cd /mnt/shared/oai/${STACK_NAME}/harvest_folio/

# Option 1: Preserving the last harvest set
mv processed processed_old
mv log log_old
mv last_state.txt last_state.txt.old
mv harvest.log harvest.log.old

# Option 2: Not preserving the last harvest set (if you have
# other copies already such as in -beta or -prod and you are
# doing the harvest in -preview)
sudo find . -type f -delete

- Monitor progress after it starts via the cron job in the monitoring app or in the log file on the container or volume (/mnt/logs/harvests/).
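For example, to follow the FOLIO harvest log as it runs (this is the same log file referenced in the FOLIO import section below):

tail -f /mnt/logs/harvests/folio_latest.log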
Syncing the harvest files to other environments
If you have done the harvest for one environment and you want to sync the newly harvested set over to the other environments you can follow these steps:
1. Disable the FOLIO cron for the source and target environments.
2. Create a faster compression script (if not done already; steps 2 through 5 are sketched after this list).
3. Give the script execute permissions (if not done already).
4. Archive the target environment's harvest_folio directory.
5. Clear out the harvest_folio directory and sync over the source environment's files.
6. Run the pc-full-import script on the target environment.
7. Verify that you are happy with the counts in the new environment.
8. Repeat steps 1 through 7 with any other environment you want to sync to.
9. Once you are done syncing to all environments, re-enable the source environment's cron. There is no need to do this for the target environment(s) since the full import script will have done that.
10. You will eventually want to clean up the archives to save disk space, but keep them around for a while until you are confident in the data in the new set.
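Steps 2 through 5 might look something like the following. This is a minimal sketch, not the exact scripts we use: it assumes pigz is installed on the host for faster compression, and that the stack names and paths match the examples used elsewhere on this page.

# Hypothetical compression helper so tar can gzip using all CPU cores (steps 2-3)
printf '#!/bin/bash\nexec pigz "$@"\n' | sudo tee /usr/local/bin/fast-gzip
sudo chmod +x /usr/local/bin/fast-gzip

# Archive the target environment's harvest_folio directory (step 4)
SOURCE_STACK=catalog-preview
TARGET_STACK=catalog-beta
cd /mnt/shared/oai/${TARGET_STACK}/
sudo tar -I /usr/local/bin/fast-gzip -cf harvest_folio_$(date -I).tar.gz harvest_folio/

# Clear out the target directory and sync over the source environment's files (step 5)
sudo find harvest_folio/ -type f -delete
sudo rsync -av /mnt/shared/oai/${SOURCE_STACK}/harvest_folio/ harvest_folio/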
HLM
This can be done with the script's --full --harvest flags in a one-off run (see the example after the list below), but if you prefer to have it run via the cron job with its normal flags, here are the steps needed to prepare the environment.
- Remove all files from the /mnt/shared/hlm/[STACK_NAME]/harvest_hlm/ directory. You can also just move them somewhere else if you want to preserve a copy of them.
- Monitor progress after it starts via the cron job in the monitoring app or in the log file on the container or volume (/mnt/logs/harvests/).
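For reference, the one-off run mentioned above would look something like this (run inside the catalog or cron container, ideally in a screen; confirm the exact flag behavior with --help):

/usr/local/bin/pc-import-hlm --full --harvest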
Backstage (Authority records)
This can be done with the script's --full --harvest flags in a one-off run (see the example after the list below), but if you prefer to have it run via the cron job with its normal flags, here are the steps needed to prepare the environment.
- Remove all files from the /mnt/shared/authority/$STACK_NAME/harvest_authority/ directory. You can also just move them somewhere else if you want to preserve a copy of them.
- Monitor progress after it starts via the cron job in the monitoring app or in the log file on the container or volume (/mnt/logs/harvests/).
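Likewise for the authority data, a one-off run would look something like this (confirm the exact flag behavior with --help):

/usr/local/bin/pc-import-authority --full --harvest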
Full Data Imports
biblio Index
Helper script for full import
There is now a helper script that runs all of the below steps in the proper order to do a full re-import of data into the biblio index. See the full documentation for pc-full-import and the simple example below.
sudo screen
# Prompting for confirmation
pc-full-import catalog-prod --debug 2>&1 | tee /mnt/shared/logs/catalog-prod-import_$(date -I).log
# Bypassing user confirmation and notifying pubcat on completion
pc-full-import catalog-prod --email LIB.DL.pubcat@msu.edu --yes --debug 2>&1 | tee /mnt/shared/logs/catalog-prod-import_$(date -I).log
Should you choose to do the steps manually, this section describes the process needed to run a full re-import of the data, since that is frequently required to update the Solr index with new field updates. If other tasks are required (such as full or incremental harvests), refer to the --help flag on the appropriate script.
Full imports for the biblio collection can be done:
- directly in the cron container for prod/beta/preview,
- in the catalog container for dev environments,
- using the biblio collection alias in the build container, to avoid serving incomplete collections in prod (see How to Use the Collection Aliases to Rebuild and Swap below).
Importing FOLIO records using the cron container
Connect to one of the catalog server nodes and move the following files up a level out of the processed directory in the shared storage. This will allow them to be picked up by the next cron job and re-started automatically should the container get stopped due to deployments. Progress can be monitored by checking the number of files remaining in the directory and the log file in /mnt/logs/harvests/folio_latest.log.
# Can be done inside or outside the container
mv /mnt/shared/oai/[STACK_NAME]/harvest_folio/processed/* /mnt/shared/oai/[STACK_NAME]/harvest_folio/
Importing FOLIO records in dev environments
This will import the test records. In the catalog container:
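The exact command is not reproduced here; as a rough sketch, it is the same wrapper script used elsewhere on this page run against the test records, with flags along these lines (verify with pc-import-folio --help before running):

# Inside the catalog container (flags are an assumption based on the prod examples below)
/usr/local/bin/pc-import-folio --batch-import --verbose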
Importing HLM records using the cron container
Assuming HLM records also need to be updated in the biblio index, you will need to copy those files from the shared directory into the container prior to starting the script. Then start a screen session, connect to the container again, and run the command to import the files.
Note that, unlike the previous command, this one will be terminated and not be recoverable if the container stops due to a deploy. Progress can be monitored by checking the remaining files in the /usr/local/harvest/hlm/ directory and by re-attaching to the screen (using screen -r) to see if the command has completed.
# Done inside the catalog_cron container
cp /mnt/shared/hlm/[STACK_NAME]/current/* /usr/local/vufind/local/harvest/hlm/
# You will want to kick off this command in a screen session,
# since it can take many hours to run
/usr/local/bin/pc-import-hlm -i -v
How to Use the Collection Aliases to Rebuild and Swap
As mentioned in the Solr documentation, biblio uses aliases to direct VuFind to the Solr collection that has the "live" biblio data to be used for searching: biblio1 or biblio2.
This means we will occasionally need to swap them, namely when we rebuild the index, such as when we are adding new data fields or doing a VuFind version upgrade (which typically adds new data fields).
- Start the manual task "Deploy VuFind Build Env" in GitLab. It will update the catalog_build container. This is not done automatically so that other updates to the main branch can be deployed while a full import is running.
- Identify which collection each alias is currently pointing to (i.e. is biblio pointing to biblio1 or biblio2) and confirm that the other collection is what biblio-build is pointing to. To get the list of aliases, from a container:
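# This is the same alias-listing command repeated in the swap step below
curl -s "http://solr:8983/solr/admin/collections?action=LISTALIASES" | grep biblio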
- Rebuild the index on biblio-build using the catalog_build container. This container has access to everything the catalog_cron containers do, but does not run cron jobs, since rebuilds do not happen at regular or frequent intervals. In fact, all this container does is sleep! It is recommended to run these commands in a screen.
Warning
If you run a deploy pipeline while this is running, you will not want to run the manual job that deploys the updates to the build container (since not all the import scripts are configured to resume where they left off yet).
# On Host
screen
docker exec -it $(docker ps -q -f name=catalog-prod-catalog_build) bash
# Inside container
rm local/harvest/folio/processed/*
cp /mnt/shared/oai/${STACK_NAME}/harvest_folio/processed/* local/harvest/folio/
/usr/local/bin/pc-import-folio --verbose --reset-solr --collection biblio-build --batch-import | tee /mnt/shared/logs/folio_import_${STACK_NAME}_$(date -I).log
[Ctrl-a d]
# On Host
screen
docker exec -it $(docker ps -q -f name=catalog-prod-catalog_build) bash
# Inside container
cp /mnt/shared/hlm/${STACK_NAME}/current/* local/harvest/hlm/
/usr/local/bin/pc-import-hlm --import --verbose | tee /mnt/shared/logs/hlm_import_${STACK_NAME}_$(date -I).log
[Ctrl-a d]
- Verify the counts are what you expect on the biblio-build collection using the following command:
curl 'http://solr:8983/solr/admin/metrics?nodes=solr1:8983_solr,solr2:8983_solr,solr3:8983_solr&prefix=SEARCHER.searcher.numDocs,SEARCHER.searcher.deletedDocs&wt=json'
- Build the spellchecking indices
Building these indices is only necessary for a full import.
curl 'http://solr1:8983/solr/biblio-build/select?q=*:*&spellcheck=true&spellcheck.build=true' &
curl 'http://solr2:8983/solr/biblio-build/select?q=*:*&spellcheck=true&spellcheck.build=true' &
curl 'http://solr3:8983/solr/biblio-build/select?q=*:*&spellcheck=true&spellcheck.build=true' &
wait
curl 'http://solr1:8983/solr/biblio-build/select?q=*:*&spellcheck.dictionary=basicSpell&spellcheck=true&spellcheck.build=true' &
curl 'http://solr2:8983/solr/biblio-build/select?q=*:*&spellcheck.dictionary=basicSpell&spellcheck=true&spellcheck.build=true' &
curl 'http://solr3:8983/solr/biblio-build/select?q=*:*&spellcheck.dictionary=basicSpell&spellcheck=true&spellcheck.build=true' &
wait
Afterward, /var/solr/data/biblioN/spellShingle and /var/solr/data/biblioN/spellchecker should have a significant size in the solr container (replace biblioN with the collection that biblio-build points to).
- Your Solr instance will likely require more memory than it typically needs in order to do the collection alias swap. Be sure to increase SOLR_JAVA_MEM and deploy the stack with the additional memory as required to ensure no downtime during this step. Currently, 6G (which we use in prod) is enough for the swap. Alternatively (for beta and preview), you can let it crash after these commands and restart the pipeline to help Solr Cloud fix itself.
# Open the solr-cloud compose file for your environment
vim docker-compose.solr-cloud.yml
# Modify the memory line to:
SOLR_JAVA_MEM: -Xms8192m -Xmx8192m
# Now on the host, run the deploy helper script
sudo pc-deploy [ENV_NAME] solr-cloud
- Once you are confident in the new data, you are ready to do the swap!
Warning
Be sure to swap the name and collection in the command examples below to match your current alias state.
# Command to check the aliases (repeated from above)
curl -s "http://solr:8983/solr/admin/collections?action=LISTALIASES" | grep biblio
# This EXAMPLE sets biblio-build to biblio2, and biblio to biblio1
# biblio-build => biblio2
# biblio => biblio1
curl 'http://solr:8983/solr/admin/collections?action=CREATEALIAS&name=biblio-build&collections=biblio2'
curl 'http://solr:8983/solr/admin/collections?action=CREATEALIAS&name=biblio&collections=biblio1'
# This EXAMPLE sets biblio-build to biblio1, and biblio to biblio2
# biblio-build => biblio1
# biblio => biblio2
curl 'http://solr:8983/solr/admin/collections?action=CREATEALIAS&name=biblio-build&collections=biblio1'
curl 'http://solr:8983/solr/admin/collections?action=CREATEALIAS&name=biblio&collections=biblio2'
- If needed, back-date the timestamp in your last_harvest.txt file to re-harvest some of the OAI changes made since you started the import.
- Clear out the collection that biblio-build is pointing to, to avoid having two large indexes stored for a long period of time (only do this after you are confident in the new index's data; see the sketch after the alpha browse command below).
- If SOLR_JAVA_MEM was increased, lower it to its previous amount.
- Kick off a manual alpha browse re-index if you don't want it to be outdated until the next scheduled run.
# Run this on all of the hosts
docker exec -it \
$(docker ps -q -f name=${STACK_NAME}-solr_cron) \
/alpha-browse.sh -v -f
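As referenced above, one way to clear out the documents in the old collection is a Solr delete-by-query. This is a sketch, not necessarily the method we normally use; it targets the biblio-build alias so it hits whichever collection that alias currently points to, so double-check the aliases with the LISTALIASES command before running it.

# Delete all documents from the collection biblio-build points to, then commit
curl 'http://solr:8983/solr/biblio-build/update?commit=true' -H 'Content-Type: text/xml' --data-binary '<delete><query>*:*</query></delete>'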
authority Index
Similar to the process for HLM records, copy the files from the shared directory into the container's import location prior to starting the script. Then start a screen session, connect to the container again, and run the command to import the files.
Progress can be monitored by checking the remaining files in the /usr/local/harvest/authority directory and by re-attaching to the screen (using screen -r) to see if the command has completed.
mv /mnt/shared/authority/$STACK_NAME/harvest_authority/processed/*.[0-9][0-9][0-9].xml /mnt/shared/authority/$STACK_NAME/harvest_authority
# You can either run the command manually in a screen, or let the cron pick it up
/usr/local/bin/pc-import-authority -i -B
reserves Index
The course reserves data is refreshed by a cron job on a nightly basis, so you likely will not need to run this manually if you can just wait for the regular run. But if needed, here is the command to run it off-schedule.
# Done inside the container (ideally within a screen since it will
# take hours to run)
php /usr/local/vufind/util/index_reserves.php
Alternatively, you can modify the cron entry (or add a temporary additional cron entry) in the cron container so that the cron-reserves.sh command runs at an earlier time. The benefit of this is that it would save logs to /mnt/logs/vufind/reserves_latest.log and track them in the Monitoring site.
Adding generated call numbers
This should be done after each full import (FOLIO + HLM) when data was reset,
in the solr_solr container:
Partial call numbers are added to callnumber-label for records that didn't
have any when the call_numbers.csv file was generated.
Note that the call numbers in /mnt/shared/call-numbers/call_numbers.csv are
meant for beta/preview/prod. There is another file at
/mnt/shared/call-numbers/test_call_numbers.csv that can be used for testing
in dev.
Ignoring certain HLM files
If your EBSCO FTP server is set up in a way where it contains all of the sets
ever generated for you, then you'll likely want a way to have the pc-import-hlm
script ignore the past sets assuming you get new full sets periodically. This can
be done by adding a new substring pattern to ignore to the top level of the hlm
directory in the shared storage (/mnt/shared/hlm/ignore_patterns.txt). This
file is used automatically and created on the cron containers startup if it
doesn't exist. You can override the file path by using the -p|--ignore-file
flag.
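For example, to skip any remote files from an old full set (the substring here is made up; use whatever fragment appears in the file names you want ignored):

# Any remote file whose name contains this substring will be skipped
echo 'Full_2023' >> /mnt/shared/hlm/ignore_patterns.txt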
Using VuFind Utilities
The preferred method is to use the wrapper script included with this repository.
The pc-import-folio script can run either or both of the harvest and import of data from FOLIO into VuFind. Use the --help flag to get information on how to run that script.
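For example, to see the available options:

/usr/local/bin/pc-import-folio --help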
But should you choose to run the commands included directly with VuFind, below is documentation on how to do that.
Harvesting from FOLIO
cd /usr/local/vufind/harvest
php harvest_oai.php
## This step is optional, it will combine the xml files into a single file
## to improve the speed of the next import step.
find *.xml | xargs xml_grep --wrap collection --cond "marc:record" > combined.xml
mkdir unmerged
mv *oai*.xml unmerged/