# [EPIC] Prototype workflow for generating and accessioning speech-to-text extraction
Some admonishments/guidelines for the investigations below, based on the 2024-09-10 planning meeting:
## model choice/configuration
We are assuming that we will use Whisper, or some variant (e.g. WhisperX), because it provided what we felt was the best combination of output quality and performance from the tools that were evaluated earlier in 2024 (@edsu may be able to link to that analysis for context?). If we determine that we want to go with a completely different model, we need to write up our reasoning for approval.
We want our solution to provide access to Whisper's tuning parameters so that we can tweak them as needed, so completely blackbox solutions that run Whisper with no access to configuration aren't acceptable.
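To make the requirement concrete, here is a minimal sketch of the kind of configuration access we mean, using the open-source `openai-whisper` Python package. The option values and model name are illustrative, not a tuned configuration.

```python
def whisper_options(language=None, temperature=0.0, beam_size=5,
                    condition_on_previous_text=True):
    """Collect the Whisper decoding knobs we want exposed.
    The defaults here are illustrative, not a vetted configuration."""
    opts = {
        "temperature": temperature,
        "beam_size": beam_size,
        "condition_on_previous_text": condition_on_previous_text,
    }
    if language:  # omit to let Whisper auto-detect the language
        opts["language"] = language
    return opts

def transcribe(media_path, model_name="large-v3", **overrides):
    """Run the open-source whisper package with explicit options.
    Requires `pip install openai-whisper`; the model downloads on first use."""
    import whisper  # lazy import so the option helper works without the package
    model = whisper.load_model(model_name)
    return model.transcribe(media_path, **whisper_options(**overrides))
```

A hosted service would be acceptable only if it exposes an equivalent set of knobs.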
## terminology
After some discussion, we settled on the term "speech to text" to encompass text extraction from speech in audio, whether or not video is present. (There was confusion/lack of consensus about whether "caption" applies to audio-only content, and "caption" is also used for still-image descriptions; meanwhile, "transcript" doesn't quite encompass what captions do for video.)
So e.g. `speechToTextWF` as the workflow name, `speech_to_text` as a snake-case variable name, "speech to text" or "speech-to-text generation" as a human-readable term, etc.
## infrastructure provisioning
We would like to avoid (or at least minimize as much as possible) vendor lock-in. We're highly likely to go with AWS to start, since we have more departmental expertise there, but GCP isn't out of the question. The cloud vendor has to be an org with which Stanford has a business agreement, and which is available through Cardinal Cloud, so that might rule out anything other than Amazon and Google? But as much as possible, we should use building blocks that have analogs in multiple major cloud vendors.
Related, but somewhat standalone point: ultimately, we should define and deploy the cloud infrastructure using Terraform. It's meant to be platform agnostic, and the department already uses it. Also, all of our permanent prod/stage/qa cloud infrastructure is deployed assuming that Terraform is the source of truth, so things that were created manually (e.g. using the AWS web console or one-off `aws` CLI commands) will cause confusion in the future. It's totally fine to experiment with building blocks by manually spinning them up that way, but once the experiment is done, those should be torn down and defined formally in Terraform.
## model usage
⚠️ It is unacceptable for our data to be used to train the models of other orgs. This rules out, for example, OpenAI's hosted Whisper service. This is a SUL-wide rule, at the moment.
## Terraform questions, workflow service wiring questions
- (obviated if we trigger processing by making a REST call to a speech-to-text API) How does ECS listen to a bucket, i.e. how do we trigger an ECS task to run when new files are dropped in? (How do we do something like what the abbyy watcher does for OCR?)
- How does the SDR Workflow Server know when the Whisper output is complete?
- How do we configure how many concurrent ECS tasks can run at a time, to save $$$?
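For the bucket-watching question, one common AWS pattern (a sketch, not a decision): S3 emits an `ObjectCreated` event notification that invokes a small Lambda, which starts a one-off ECS (Fargate) task for the new file. All cluster/task/container names below are hypothetical placeholders.

```python
# Hypothetical Lambda that reacts to S3 ObjectCreated notifications by
# launching a speech-to-text ECS task for each new object.

def task_overrides(bucket, key, container_name="speech-to-text"):
    """Build the RunTask container override telling the task which object
    to process (pure function, easy to unit test)."""
    return {
        "containerOverrides": [{
            "name": container_name,
            "environment": [
                {"name": "INPUT_BUCKET", "value": bucket},
                {"name": "INPUT_KEY", "value": key},
            ],
        }]
    }

def handler(event, context):
    """Lambda entry point for the S3 event notification."""
    import boto3  # lazy import: the helper above stays usable without AWS deps
    ecs = boto3.client("ecs")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        ecs.run_task(
            cluster="speech-to-text",          # hypothetical cluster name
            taskDefinition="speechToTextWF",   # hypothetical task definition
            launchType="FARGATE",
            # NOTE: a real Fargate RunTask also needs networkConfiguration
            # (subnets, security groups); omitted here for brevity.
            overrides=task_overrides(bucket, key),
        )
```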
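For the completion question, one possibility is for the worker itself to emit a "done" message (e.g. to an SQS queue) that the workflow server, or a small poller in front of it, consumes to advance the workflow. The message schema and queue here are assumptions for illustration, not an agreed interface.

```python
import json

def completion_message(druid, output_keys, status="success"):
    """Payload the worker emits when Whisper output is finished.
    Field names are illustrative, not an agreed schema."""
    return json.dumps({"druid": druid, "status": status, "outputs": output_keys})

def notify_done(queue_url, druid, output_keys):
    """Send the completion message; AWS is only touched when actually sending."""
    import boto3  # lazy import
    sqs = boto3.client("sqs")
    sqs.send_message(QueueUrl=queue_url,
                     MessageBody=completion_message(druid, output_keys))
```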
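For the concurrency question: if the workers run as a queue-fed ECS service (rather than one-off RunTask calls), Application Auto Scaling can bound the service's `DesiredCount`, which caps concurrent tasks and lets idle capacity scale to zero. A sketch under that assumption; the cluster/service names are placeholders.

```python
def scaling_target(cluster, service, max_tasks, min_tasks=0):
    """Request parameters capping how many copies of the ECS service's
    task may run concurrently (bounds on DesiredCount)."""
    return {
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "MinCapacity": min_tasks,  # 0 lets the service scale to nothing when idle
        "MaxCapacity": max_tasks,
    }

def cap_concurrency(cluster, service, max_tasks):
    import boto3  # lazy import
    boto3.client("application-autoscaling").register_scalable_target(
        **scaling_target(cluster, service, max_tasks))
```

If we instead dispatch one-off tasks, the cap would have to live in the dispatcher (e.g. only start a task when fewer than N are running).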
## Other miscellaneous questions and areas to explore
- Tooling for cost data (using an AWS tag? Aaron says pretty much all AWS components are taggable, and we've used that before to see what's costing money)
- Tooling for introspecting on what work is waiting or underway
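On the cost-tag idea: Cost Explorer's `GetCostAndUsage` can group spend by a cost-allocation tag, provided the tag has been activated in the billing console. A sketch; the tag key `project` is a placeholder.

```python
def cost_by_tag_request(tag_key, start, end):
    """Parameters for Cost Explorer's GetCostAndUsage, grouping spend by a
    cost-allocation tag (the tag must be activated for billing first)."""
    return {
        "TimePeriod": {"Start": start, "End": end},  # dates as "YYYY-MM-DD"
        "Granularity": "MONTHLY",
        "Metrics": ["UnblendedCost"],
        "GroupBy": [{"Type": "TAG", "Key": tag_key}],
    }

def cost_by_tag(tag_key, start, end):
    import boto3  # lazy import
    ce = boto3.client("ce")
    return ce.get_cost_and_usage(**cost_by_tag_request(tag_key, start, end))
```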
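On introspection: if work arrives via an SQS queue, "waiting" and "underway" map roughly onto queue depth plus running ECS tasks. A sketch assuming that queue-fed design; queue and cluster names would be ours to define.

```python
def queue_depth_attrs():
    """SQS attribute names distinguishing waiting vs in-flight messages."""
    return ["ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible"]

def work_status(queue_url, cluster):
    """Snapshot of waiting (queued) and underway (running-task) work."""
    import boto3  # lazy import
    sqs = boto3.client("sqs")
    attrs = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=queue_depth_attrs())["Attributes"]
    running = boto3.client("ecs").list_tasks(
        cluster=cluster, desiredStatus="RUNNING")["taskArns"]
    return {"waiting": int(attrs["ApproximateNumberOfMessages"]),
            "in_flight": int(attrs["ApproximateNumberOfMessagesNotVisible"]),
            "running_tasks": len(running)}
```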
## todo

- speechToTextWF
- common-accessioning#1341